Deb Carver (Dean of UO Libraries), John Conery (Professor, Computer and Information Science), Lynn Stearney (Director of Grants, UO Foundation), Sean Sharp (Research and Instructional Technology, Campus Information Services), and I are participating in the ARL/DLF E-Science Institute.
The following is a list of readings compiled by the institute staff, and organized by topic. It’s pretty comprehensive, and may be helpful if you’re interested in gaining some background in these topics.
Readings Organized by Topic Area
Chicken or the egg? Did e-Science cause the so-called data deluge or is e-Science a response to this phenomenon? The early e-Science funding initiatives in the U.K. at the beginning of the last decade targeted projects where data infrastructure was integral to managing the massive flow of data from digital instrumentation. This focus anticipated the ever-expanding use of digital instrumentation across research domains and, indeed, within society. The following articles take stock of the data deluge phenomenon. Regardless of whether e-Science has caused or has been a response to this phenomenon, a basic understanding of the massive production of digital data helps focus on the variety of issues related to managing research data.
- The data deluge and Data, data everywhere – The February 25, 2010 issue of the Economist included an article about the general deluge of data in society (the former link) and a technology pull-out section on overall characteristics of data (the latter link). These are particularly informative articles from the popular press.
- The end of theory: the data deluge makes scientific method obsolete – This June 23, 2008 story from Wired Magazine by Chris Anderson reflects a Google perspective on making discoveries through huge amounts of data. In other words, with enough data and the right algorithm you don’t need a scientific model. He suggests a scientific method based on computationally derived patterns from massive data collections that doesn’t require models to test. What makes this method work are petabytes of data.
- The coming data deluge – This short opinion piece from IEEE Spectrum within Technology introduces a number of words appearing in our language because of the data deluge. For example, the author makes reference to data scientists.
- Data – The February 11, 2011 issue of Science was dedicated to the challenges and opportunities arising from the data deluge in research. This is an excellent compendium presenting perspectives on “the increasingly huge influx of research data” from a variety of scientific fields.
The following trilogy provides a solid introduction to current thinking around data-driven science (or more generally, data-driven research). The first title is an anthology describing the emergence of data-driven science. The chapter by Jim Gray on e-Science: A Transformed Scientific Method, which was reproduced from a presentation in January 2007, serves as the framework for the other authors who provide examples of data-driven science in various disciplines. The second title is from the U.S. Interagency Working Group on Digital Data, representing key U.S. agencies involved in scientific research. Working from a set of data principles that they developed, this report outlines a strategic vision around scientific data for U.S. federal agencies. The third title is a report to the European Commission from the High Level Expert Group on Scientific Data. This report provides a useful public statement about the value of scientific data to society and espouses a vision for data in 2030.
- The Fourth Paradigm: Data-Intensive Scientific Discovery – This collection of essays from Microsoft Research is a tribute to Jim Gray and his ideas about data-driven science.
- Harnessing the Power of Digital Data for Science and Society – This document includes a set of principles upon which federal scientific agencies should manage the data they produce. There is an excellent appendix on the roles for organizations and individuals.
- Riding the Wave: How Europe can gain from the rising tide of scientific data – Released in October 2010, this report establishes a strong case for European developments in research data infrastructure over the next several years. The second chapter of this report uses a variety of scenarios that express the value proposition for investing in data infrastructure. The third chapter describes challenges that have to be overcome in building new data infrastructure (which they interchangeably call “scientific e-infrastructure”). The fourth and fifth chapters present a vision for 2030 and a call for action, respectively. This 38-page publication is an excellent follow-up to The Fourth Paradigm: Data-Intensive Scientific Discovery, which was released in 2009.
- Science Magazine, Special Online Collection: Dealing with Data (Feb 11, 2011) – Issue devoted to challenges with scientific research data, introducing many key ideas in different scientific disciplines.
The life cycle management of information is fundamental to understanding digital curation, for it is the stewardship and management of digital objects across the life cycle that determines the activities of digital curation. Similarly, the essence of data curation is defined by the context of the research life cycle (see the class glossary for a definition of the research life cycle). The management of research data spans the research life cycle, consisting of the many activities related to the design, production, manipulation, analysis and preservation of the data itself and its supporting metadata. The stewardship of research data ensures that responsibilities for all data and metadata activities across the life cycle are assigned, understood and carried out. It is the combination of the activities of research data management and the responsibilities of data stewardship over the research life cycle that embodies data curation. The following articles introduce data curation and its supporting concepts. Beginning with an article by Anna Gold, an overview of data curation is provided that traces the evolution of the concept and its current state of development.
The Data Life Cycle
- JISC Research Lifecycle diagram – JISC, which historically stood for the Joint Information Systems Committee in the UK but which is now simply known as JISC, employs a life cycle diagram to describe the support their organization provides to researchers across the stages of the research life cycle. This brief, succinct representation of the life cycle shows two interrelated cycles making up an overall research life cycle. One cycle consists of the stages associated with knowledge management and scholarly communications, while the other cycle has stages making up the research process.
- Curation Lifecycle Model – The UK Digital Curation Centre provides an online representation of a life cycle model depicting stages in curating and preserving data from a digital records management perspective.
- The data life cycle is mentioned in some of the above readings, including pages 8 and 9 of Harnessing the Power. The entry for data life cycle in the class glossary also links to an article describing characteristics of the research life cycle model.
- e-Science and the Life Cycle of Research – by Charles Humphrey, June, 2008. Brief introduction to the research life cycle (also linked from the glossary of key terms and concepts).
Research Data Management
Research Libraries, Data and e-Science
Many research libraries have been involved over the past decade and a half in developing digital collections, in producing digital content through digitization projects and in preserving digital content through institutional repositories. More recently and in conjunction with the emergence of data-driven science, the inclusion of research data in digital collections has become a focus of many libraries. Some of the following readings explore the retooling that libraries face to incorporate research data into their digital collections. Other readings provide case studies about how some libraries are addressing e-Science and research data. Some of this work can be done within an institution and several of the case studies present local approaches to building research data collections and providing e-Science data services. However, the support for e-Science and research data will increasingly require cross-institutional collaboration among libraries. A typical e-Science project tends to consist of a large research team where the researchers are from different universities, come from a variety of disciplines and are located in institutions from around the globe. Examples of this include the teams of physicists working with the Large Hadron Collider and the international teams of scientists conducting research under the banner of the International Polar Year. They work together through shared technology that generates massive volumes of data and supports its storage and processing through a distributed high-speed network. No single research library has the capacity to respond to such large-scale projects, which challenges libraries to find new ways to collaborate around e-Science research data. The infrastructural requirements alone to ingest, manage, preserve and provide access to large-scale research data are an impetus for libraries to collaborate.
- Retooling Libraries for the Data Challenge – In this concise article, Dorothea Salo reviews pertinent characteristics of research data, digital libraries and institutional repositories in proposing ways in which libraries can address the data challenge.
- Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century – This is a National Science Board 2005 report.
- Agenda for Developing E-Science in Research Libraries – This November 2007 report contains recommendations about e-Science to the Scholarly Communication Steering Committee, the Public Policies Affecting Research Libraries Steering Committee, and the Research, Teaching, and Learning Steering Committee.
- To Stand the Test of Time: Long-term Stewardship of Digital Data Sets in Science and Engineering – This 2006 ARL report contains the results of an NSF-funded workshop to compose an agenda for research data infrastructure in science and engineering.
- Skilling Up to Do Data: Whose Role, Responsibility, Career? – This 2009 IJDC article by Graham Pryor and Martin Donnelly looks at data curation roles and skills in the UK and proposes a framework for skills development in data management.
- Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop – This report “summarizes a 2009 National Research Council workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.”
- The Shape of the Scientific Article in the Developing Cyberinfrastructure – This report by Cliff Lynch discusses how “E-science represents a significant change, or extension, to the conduct and practice of science. This article speculates about how the character of the scientific article is likely to change to support these changes in scholarly work. In addition to changes to the nature of scientific literature that facilitate the documentation and communication of e-science, it’s also important to recognize that active engagement of scientists with their literature has been, and continues to be, itself an integral and essential part of scholarly practice; in the cyberinfrastructure environment, the nature of engagement with, and use of, the scientific literature is becoming more complex and diverse, and taking on novel dimensions.”
- E-Science and Data Support Services: A Study of ARL Member Institutions – This 2010 ARL report by Soehner, Steeves & Ward reviews the different approaches libraries are taking toward e-Science and data support services. Six institutional case studies are also provided.
- Data Sharing, Small Science, and Institutional Repositories (post-print) – This 2010 article by Cragin, Palmer, Carlson and Witt in Philosophical Transactions of the Royal Society A contains results of the Data Curation Profiles research project done by UIUC and Purdue on how faculty view and practice data sharing.
- Librarian Roles in Institutional Repository Data Set Collecting: Outcomes of a Research Library Task Force (access requires subscription) – This 2011 article by Newton, Miller and Bracke in Collection Management describes the Purdue Libraries task force charged with building faculty-produced collections for a data repository prototype. This project developed an inventory and characterized the resources and skills required of the libraries and its data-collecting librarians. The roles and activities of librarians identified during the project were explored.
- Determining Data Information Literacy Needs: A Study of Students and Research Faculty (access requires subscription) – This 2011 article by Carlson, Fosmire, Miller and Sapp-Nelson in portal: Libraries and the Academy describes how “researchers increasingly need to integrate the disposition, management, and curation of their data into their current workflows. However, it is not yet clear to what extent faculty and students are sufficiently prepared to take on these responsibilities. This paper articulates the need for a data information literacy program (DIL) to prepare students to engage in such an e-research environment. Assessments of faculty interviews and student performance in a geoinformatics course provide complementary sources of information, which are then filtered through the perspective of ACRL’s information literacy competency standards to produce a draft set of outcomes for a data information literacy program.”
- Data Curation Program Development in U.S. Universities: The Georgia Institute of Technology Example – This 2011 article by Walters in The International Journal of Digital Curation presents GT’s data curation program development. The main characteristic is a program devoid of top-level mandates and incentives, but rich with independent, “bottom-up” action. The paper addresses program antecedents and context, inter-institutional partnerships that advance the library’s curation program, library organizational developments, partnerships with campus research communities, and a proposed model for curation program development.
- Data Services for the Sciences: A Needs Assessment – This 2010 article by Westra in Ariadne describes scientific research data management as “a fluid and evolving endeavour, reflective of the high rate of change in the information technology landscape, increasing levels of multi-disciplinary research, complex data structures and linkages, advances in data visualisation and analysis, and new tools capable of generating or capturing massive amounts of data. These factors create a complex and challenging environment for managing data, and one in which libraries can have a significant positive role supporting e-science. A needs assessment can help to characterise scientists’ research methods and data management practices, highlighting gaps and barriers, and thereby improve the odds for libraries to plan appropriately and effectively implement services in the local setting.”
- The Cornell University Library (CUL) Data Working Group (DaWG) report – This 2008 report contains five recommendations from the Data Working Group detailing how the Cornell University Library could engage in data curation. Included within these recommendations is a set of services that could be provided to researchers and local infrastructure and policies needed to sustain these services.
- Responding to the Call to Curate: Digital Curation in Practice at Penn State University Libraries (pre-print) – This 2011 article by Hswe, Furlough and Giarlo in The International Journal of Digital Curation presents how Pennsylvania State University Libraries established a Content Stewardship program for the university, describing the planning and staffing needed for its implementation. They specifically address the challenges of starting and sustaining a stewardship services program.