Conferences/training

Deb Carver (Dean of UO Libraries), John Conery (Professor, Computer and Information Science), Lynn Stearney (Director of Grants, UO Foundation), Sean Sharp (Research and Instructional Technology, Campus Information Services), and I are participating in the ARL/DLF E-Science Institute.

The following is a list of readings compiled by the institute staff, and organized by topic. It’s pretty comprehensive, and may be helpful if you’re interested in gaining some background in these topics.

————————————————-

Readings Organized by Topic Area

July, 2011

Data deluge

Chicken or the egg?  Did e-Science cause the so-called data deluge or is e-Science a response to this phenomenon?  The early e-Science funding initiatives in the U.K. at the beginning of the last decade targeted projects where data infrastructure was integral to managing the massive flow of data from digital instrumentation.  This focus anticipated the ever-expanding use of digital instrumentation across research domains and, indeed, within society.  The following articles take stock of the data deluge phenomenon.  Regardless of whether e-Science has caused or has been a response to this phenomenon, a basic understanding of the massive production of digital data helps focus on the variety of issues related to managing research data.

  • The data deluge and Data, data everywhere – The February 25, 2010 issue of the Economist included an article about the general deluge of data in society (the former link) and a technology pull-out section on overall characteristics of data (the latter link.)  These are particularly informative articles from the popular press.
  • The end of theory: the data deluge makes scientific method obsolete  – This June 23, 2008 story from Wired Magazine by Chris Anderson reflects a Google-perspective about making discoveries through huge amounts of data.  In other words, with enough data and the right algorithm you don’t need a scientific model.  He suggests a scientific method based on computationally derived patterns from massive data collections that doesn’t require models to test.  What makes this method work are petabytes of data.
  • The coming data deluge – This short opinion piece from IEEE Spectrum within Technology introduces a number of words appearing in our language because of the data deluge.  For example, the author makes reference to data scientists.
  • Data – The February 11, 2011 issue of Science was dedicated to the challenges and opportunities arising from the data deluge in research.  This is an excellent compendium presenting perspectives on “the increasingly huge influx of research data” from a variety of scientific fields.

Data-driven science

The following trilogy provides a solid introduction to current thinking around data-driven science (or more generally, data-driven research).  The first title is an anthology describing the emergence of data-driven science.  The chapter by Jim Gray on e-Science: A Transformed Scientific Method, which was reproduced from a presentation in January 2007, serves as the framework for the other authors who provide examples of data-driven science in various disciplines.  The second title is from the U.S. Interagency Working Group on Digital Data, representing key U.S. agencies involved in scientific research.  Working from a set of data principles that they developed, this report outlines a strategic vision around scientific data for U.S. federal agencies.  The third title is a report to the European Commission from the High Level Expert Group on Scientific Data.  This report provides a useful public statement about the value of scientific data to society and espouses a vision for data in 2030.

  • The Fourth Paradigm: Data-Intensive Scientific Discovery  – This collection of essays from Microsoft Research is a tribute to Jim Gray and his ideas about data-driven science.
  • Harnessing the Power of Digital Data for Science and Society - This document includes a set of principles upon which federal scientific agencies should manage the data they produce.  There is an excellent appendix on the roles for organizations and individuals.
  • Riding the Wave: How Europe can gain from the rising tide of scientific data – Released in October 2010, this report establishes a strong case for European developments in research data infrastructure over the next several years.  The second chapter of this report uses a variety of scenarios that express the value proposition for investing in data infrastructure.  The third chapter describes challenges that have to be overcome in building new data infrastructure (which they interchangeably call “scientific e-infrastructure.”)  The fourth and fifth chapters present a vision for 2030 and a call for action, respectively.  This 38-page publication is an excellent follow up to The Fourth Paradigm: Data-Intensive Scientific Discovery, which was released in 2009.
  • Science Magazine, Special Online Collection: Dealing with Data (Feb 11, 2011) – Issue devoted to challenges with scientific research data, introducing many key ideas in different scientific disciplines.

Data Curation

The life cycle management of information is fundamental to understanding digital curation, for it is the stewardship and management of digital objects across the life cycle that determines the activities of digital curation.  Similarly, the essence of data curation is defined by the context of the research life cycle (see the class glossary for a definition of the research life cycle.)  The management of research data spans the research life cycle, consisting of the many activities related to the design, production, manipulation, analysis and preservation of the data itself and its supporting metadata.  The stewardship of research data ensures that responsibilities for all data and metadata activities across the life cycle are assigned, understood and carried out.  It is the combination of the activities of research data management and the responsibilities of data stewardship over the research life cycle that embodies data curation.  The following articles introduce data curation and its supporting concepts.  The list begins with an article by Anna Gold that provides an overview of data curation, tracing the evolution of the concept and its current state of development.

Data Curation

The Data Life Cycle

  • JISC Research Lifecycle diagram – JISC, which historically stood for the Joint Information Systems Committee in the UK but  which is now simply known as JISC, employs a life cycle diagram to describe the support their organization provides to researchers across the stages of the research life cycle.  This brief, succinct representation of the life cycle shows two interrelated cycles making up an overall research life cycle.  One cycle consists of the stages associated with knowledge management and scholarly communications, while the other cycle has stages making up the research process.
  • Curation Lifecycle Model – The UK Digital Curation Centre provides an online representation of a life cycle model depicting stages in curating and preserving data from a digital records management perspective.
  • The data life cycle is mentioned in some of the above readings, including pages 8 and 9 of Harnessing the Power.  The entry for data life cycle in the class glossary also links to an article describing characteristics of the research life cycle model.
  • e-Science and the Life Cycle of Research – by Charles Humphrey, June, 2008. Brief introduction to the research life cycle (also linked from the glossary of key terms and concepts).

Research Data Management

Data Stewardship

Research Libraries, Data and e-Science

Many research libraries have been involved over the past decade and a half in developing digital collections, in producing digital content through digitization projects and in preserving digital content through institutional repositories.  More recently and in conjunction with the emergence of data-driven science, the inclusion of research data in digital collections has become a focus of many libraries.  Some of the following readings explore the retooling that libraries face to incorporate research data into their digital collections.  Other readings provide case studies about how some libraries are addressing e-Science and research data.  Some of this work can be done within an institution and several of the case studies present local approaches to building research data collections and providing e-Science data services.  However, the support for e-Science and research data will increasingly require cross-institutional collaboration among libraries.  A typical e-Science project tends to consist of a large research team where the researchers are from different universities, come from a variety of disciplines and are located in institutions from around the globe. Examples of this include the teams of physicists working with the Large Hadron Collider and the international teams of scientists conducting research under the banner of the International Polar Year.  They work together through shared technology that generates massive volumes of data and supports its storage and processing through a distributed high-speed network.  No single research library has the capacity to respond to such large-scale projects, which challenges libraries to find new ways to collaborate around e-Science research data.  The infrastructural requirements alone to ingest, manage, preserve and provide access to large-scale research data are an impetus for libraries to collaborate.

Retooling

  • Retooling Libraries for the Data Challenge  – In this concise article, Dorothea Salo reviews pertinent characteristics of research data, digital libraries and institutional repositories in proposing ways in which libraries can address the data challenge.
  • Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century – This is a National Science Board 2005 report.
  • Agenda for Developing E-Science in Research Libraries – This November 2007 report contains recommendations about e-Science to the Scholarly Communication Steering Committee, the Public Policies Affecting Research Libraries Steering Committee, and the Research, Teaching, and Learning Steering Committee.
  • To Stand the Test of Time: Long-term Stewardship of Digital Data Sets in Science and Engineering – This 2006 ARL report contains the results of an NSF-funded workshop to compose an agenda for research data infrastructure in science and engineering.
  • Skilling Up to Do Data: Whose Role, Responsibility, Career? – This 2009 IJDC article by Graham Pryor and Martin Donnelly looks at data curation roles and skills in the UK and proposes a framework for skills development in data management.
  • Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop  - This report “summarizes a 2009 National Research Council workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.”
  • The Shape of the Scientific Article in the Developing Cyberinfrastructure – This report by Cliff Lynch discusses how “E-science represents a significant change, or extension, to the conduct and practice of science.  This article speculates about how the character of the scientific article is likely to change to support these changes in scholarly work. In addition to changes to the nature of scientific literature that facilitate the documentation and communication of e-science, it’s also important to recognize that active engagement of scientists with their literature has been, and continues to be, itself an integral and essential part of scholarly practice; in the cyberinfrastructure environment, the nature of engagement with, and use of, the scientific literature is becoming more complex and diverse, and taking on novel dimensions.”

Case Studies

  • E-Science and Data Support Services:  A Study of ARL Member Institutions – This 2010 ARL report by Soehner, Steeves & Ward reviews the different approaches libraries are taking toward e-Science and data support services.  Six institutional case studies are also provided.
  • Data Sharing, Small Science, and Institutional Repositories (post-print) – This 2010 article by Cragin, Palmer, Carlson and Witt in Philosophical Transactions of the Royal Society A contains results of the Data Curation Profiles research project done by UIUC and Purdue on how faculty view and practice data sharing.
  • Librarian Roles in Institutional Repository Data Set Collecting: Outcomes of a Research Library Task Force (access requires subscription) – This 2011 article by Newton, Miller and Bracke in Collection Management describes the Purdue Libraries task force charged with building faculty-produced collections for a data repository prototype.  This project developed an inventory and characterized the resources and skills required of the libraries and its data-collecting librarians. The roles and activities of librarians identified during the project were explored.
  • Determining Data Information Literacy Needs: A Study of Students and Research Faculty (access requires subscription) – This 2011 article by Carlson, Fosmire, Miller and Sapp-Nelson in portal: Libraries and the Academy describes how “researchers increasingly need to integrate the disposition, management, and curation of their data into their current workflows. However, it is not yet clear to what extent faculty and students are sufficiently prepared to take on these responsibilities. This paper articulates the need for a data information literacy program (DIL) to prepare students to engage in such an e-research environment. Assessments of faculty interviews and student performance in a geoinformatics course provide complementary sources of information, which are then filtered through the perspective of ACRL’s information literacy competency standards to produce a draft set of outcomes for a data information literacy program.”
  • Data Curation Program Development in U.S. Universities:  The Georgia Institute of Technology Example – This 2011 article by Walters in The International Journal of Digital Curation presents GT’s data curation program development. The main characteristic is a program devoid of top-level mandates and incentives, but rich with independent, “bottom-up” action. The paper addresses program antecedents and context, inter-institutional partnerships that advance the library’s curation program, library organizational developments, partnerships with campus research communities, and a proposed model for curation program development.
  • Data Services for the Sciences: A Needs Assessment – This 2010 article by Westra in Ariadne describes scientific research data management as “a fluid and evolving endeavour, reflective of the high rate of change in the information technology landscape, increasing levels of multi-disciplinary research, complex data structures and linkages, advances in data visualisation and analysis, and new tools capable of generating or capturing massive amounts of data.  These factors create a complex and challenging environment for managing data, and one in which libraries can have a significant positive role supporting e-science. A needs assessment can help to characterise scientists’ research methods and data management practices, highlighting gaps and barriers, and thereby improve the odds for libraries to plan appropriately and effectively implement services in the local setting.”
  • The Cornell University Library (CUL) Data Working Group (DaWG) report – This 2008 report contains five recommendations from the Data Working Group detailing how the Cornell University Library could engage in data curation.  Included within these recommendations is a set of services that could be provided to researchers and local infrastructure and policies needed to sustain these services.
  • Responding to the Call to Curate:  Digital Curation in Practice at Penn State University Libraries (pre-print) – This 2011 article by Hswe, Furlough and Giarlo in The International Journal of Digital Curation presents how Pennsylvania State University Libraries established a Content Stewardship program for the university, describing the planning and staffing needed for its implementation.  They specifically address the challenges of starting and sustaining a stewardship services program.

I attended the Science Commons Symposium – Pacific Northwest, on the Microsoft campus in Redmond, WA on Feb. 20, 2010.

My notes are below. More detailed notes were compiled by Brian Glanz of the Open Science Foundation, and other links, notes and comments are on the Science Commons site.

Videos synchronized with slides are available. The videos will play in any browser, but you will need Silverlight and Windows or OSX for them to play properly and view the video and slides together. Session 1 Session 2 Session 3 Session 4

Videos (no slides) may also be downloaded as .wmv here: Session 1 Session 2 Session 3 Session 4
———–
Lee Dirks gave an overview of Microsoft External Research, which is research done in collaboration with external parties, such as the Science Commons, education and scholarly communication groups, and others. One product of these partnerships is the MS Office add-in that embeds a Creative Commons license into PowerPoint, Excel, or Word documents.
The Word add-in for ontology recognition was done by Phil Bourne, Lynn Fink, and John Wilbanks. The ontology travels with the Word doc, and it includes an ontology browser, term recognition and disambiguation. There is also the Creative Commons Add-in for Office Word 2007, a “plug-in to Office 2007 which connects to the Creative Commons webservice in order to generate and embed XML and bitmap representations of Creative Commons licenses.”

Also mentioned was the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Kristin Tolle, Tony Hey, and Stewart Tansley. The four themes mirror the emphasis of MS External Research (scholarly communication, health & well-being, infrastructure, and earth & environment). There are 26 short technical papers in the four sections; 45 of the 70 authors are not from Microsoft. Data-intensive discovery is the fourth paradigm; the other three are the empirical, analytical, and simulation paradigms.

Cameron Neylon: science in the open: why do we need it? How do we do it?
Cameron works on structural biology, see Wikimedia commons; small lab work and big iron facilities mixture of work. The question to ask (and answer): why do the research? Why do we pay for it? Cures; prestige (national and personal); excitement (galaxies, how biology works) and curiosity; fun. It’s a privilege, not a right. So how to maximize on the public investment (not just monetary, but cures, etc.)?
1. Open access. Widest community has greatest access to the results.
2. Formal publication, may be overkill (don’t need a sledgehammer to take down a snow man).

The Web makes making things public easy; an example was the solubility data he and Jean-Claude generated, which was the top hit in a Google search. Broadcasting is easy; sharing effectively is much harder. You have to make the choice to put up the signage, and put it into a form(at) that is discoverable.

Interoperability is the key to making this possible: legal, technical, and process interoperability. Systems need to work with the existing processes and the people. Capture the pieces of the research, then add the tools later that help to tell the story. Map the process onto agreed vocabularies. Machines do structure and need structure, humans tell stories.
How to get the structured data out? Generate an RDF from drop-down lists, it knows the inputs, etc.
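As a rough illustration of the drop-down idea, the sketch below turns a set of controlled-vocabulary selections into RDF triples in N-Triples syntax. The namespace http://example.org/ and the field names (solvent, temperature_C) are hypothetical placeholders, not anything shown in the talk; this is a minimal sketch under those assumptions, not the tooling Cameron described.

```python
from urllib.parse import quote

# Hypothetical namespace; a real system would use an agreed vocabulary instead.
BASE = "http://example.org/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def selections_to_ntriples(experiment_id, selections):
    """Turn drop-down selections (field -> chosen value) into N-Triples lines."""
    subject = f"<{BASE}experiment/{quote(str(experiment_id))}>"
    lines = []
    for field, value in sorted(selections.items()):
        predicate = f"<{BASE}vocab/{quote(field)}>"
        if isinstance(value, (int, float)):
            # Numeric inputs become typed literals.
            obj = f'"{value}"^^<{XSD_DOUBLE}>'
        else:
            # Drop-down choices become IRIs, so every record uses the same term.
            obj = f"<{BASE}term/{quote(value)}>"
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


for line in selections_to_ntriples("exp1", {"solvent": "methanol", "temperature_C": 25}):
    print(line)
```

Because the values come from fixed lists rather than free text, the machine-readable record falls out of the data entry step itself, and the human-readable story can be layered on afterwards.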

Tools to deal with the scale of data are needed. Genbank can’t keep up with the amount of data that is being submitted. Human scientist capacity is static, doesn’t scale. The only thing that does, he believes, is the Web. The web scales by distribution effects. Governments, research groups, etc. do not effectively scale. Therefore the scientist must be connected, and sharing.

Open content builds the network; see the Panton Principles http://pantonprinciples.org for a checklist of principles for sharing data/open science; clarity and adherence to these points will facilitate sharing.

Jean-Claude Bradley: Using Free Hosted Web 2.0 for Open Notebook Science (ONS).
The case for ONS: 6 points and how ONS addresses these issues.

How bad is the current system? Look for the solubility of EGCG, using Beilstein, etc. The peer-reviewed number is 521.7 g/L (a misprint), Sigma Aldrich says you can create a solution of 5 g/L, and another source says 2.3 g/L. The chain of provenance is difficult to follow (and absent for the chem company info).

Another example: NaH oxidation controversy, a claim of something that most chemists would say is impossible. People tried to replicate the experiment, and reported the NMR and other info on what they did, with a 15% yield. This opened up the investigation to greater discussion, testing, etc. JC-B’s students did an experiment and got no conversion. Then blogosphere reported that another interim component might interfere with the reaction. In the published realm the article was simply retracted without explanation.

There are various logos for ONS, to convey immediate access, delayed access, etc. ONS makes assumptions explicit, thereby maintaining the integrity of the data provenance, moving away from an environment of trust to one of proof. The source shouldn’t matter if you can see the evidence (Google/Web vs. a publisher).

Tools. Creating a log of the experiment is critical for reporting the time of each step, and log entries are done manually. Discussion, conclusion, etc. can be included. Raw data can be made public, images, videos (which make it easier to show exactly what and how it was done, sometimes better than written notes). Calculations and other functions are explicit in a Google spreadsheet. Wiki shows revision history. JC-B likes the ability to make comments and interact with students via the wiki, and he receives notifications when something is done.

He uses JSpecView and JCAMP-DX to interact with the NMR data through a Web interface and zoom in to look at imperfections. He also uploads spectra to ChemSpider, which asks if you want them to be open. If they are open, they can be used in a game (Spectral Game) for education.

The game uses crowdsourcing to allow people to contribute solubility data for compounds in organic solvents. Judges could interact with students via the wiki; the most responsible scientists were the ones who received awards. Other teachers used this in their own labs.

For search/browsing, there is a wiki table of contents (but what about semantic relationships?). Can also use an API that enables queries of the spreadsheet content (Google Viz). Longer-term improvements in automation of the process: use bots to interact with data; use measurement of solubility data of JCAMP-DX via an API/Web service.

Finding the data: how to find it if you don’t know about the spreadsheet API? Begin at Wikipedia > lab notebook > raw data > Google spreadsheet with calculations.

Employing an ONS approach to research may impact the prospects for publication. Some publishers will consider ONS a preprint, some will accept the paper, even in one case where they wrote it on a wiki that was publicly available. The author can cite the ONS page in the paper. In one case, the experiment was published in JoVE, another in Nature Precedings. There is the risk that repurposing content in multiple formats and media will dilute article metrics since people could be citing other repurposed content you created.

Librarians and Science 2.0: The Wayback Machine doesn’t archive wiki pages very well. ONSPreserver will go through a spreadsheet with Windows scheduler to back up the data. You can also publish the Google spreadsheet as an .xls, which preserves the calculations.

Other options for archiving the open notebook: ONSPreserverLite, ONSArchive, and Lulu.com for data disks. In Bradley’s case at Drexel, he worked with the library to publish things in a DSpace repository as a zip archive (although the spectra can’t be viewed there) and as a book via Lulu.com. The entire record can be exported into a local archive (snapshot on that particular day).
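To make the "snapshot on that particular day" idea concrete, here is a minimal Python sketch of a dated backup with a checksum. It is not how ONSPreserver works (I don't know its internals); the function name snapshot and the file naming scheme are my own assumptions, and it presumes the notebook data has already been exported to a local file.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def snapshot(source, archive_dir, today=None):
    """Copy `source` into `archive_dir` under a dated name and record its SHA-256.

    Hypothetical helper for illustration; `today` can be passed in for testing.
    """
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    stamp = (today or date.today()).isoformat()
    target = archive_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps

    # Store a checksum next to the copy so later corruption can be detected.
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    checksum_file = target.with_name(target.name + ".sha256")
    checksum_file.write_text(f"{digest}  {target.name}\n")
    return target, digest
```

Run from a scheduler once a day, something like this would leave one dated copy per day, each with a checksum that a repository could verify on ingest.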

Intellectual Property protection is problematic, though you can submit for a patent within a year of making it public (in the US; this is not applicable outside of US).

Antony Williams – ChemSpider: Collecting and Curating the World’s Chemistry with the Community. [presentation, and blog]

ChemSpider started as a hobby project, to connect chemistry on the Web, and integrate chemical structure data, structure-based hub, etc.
Antony believes open data is here now, though it might not be called that. ChemSpider was released in March 2007, with 10.5 million structures. In June 2007, they started looking at how to clean up the data, using a curation layer and a deposition interface to add data. It was acquired by Royal Society of Chemistry and is now hosted by them. Disambiguation is a key to improving the search; names, nomenclature, structures, patents, publications.

PubChem is a good database, but doesn’t do validation. The curation is the time-consuming part. It took ChemSpider staff 3 days to clean up the records/links for just one compound, Vancomycin (emails to authors, etc.). They are going through compounds on Wikipedia.

Alternatives to ACS/CAS for IDs for unique compounds: use a standard InChI, which makes it possible to search on the skeleton alone or more in depth on the stereochemistry. They are using crowdsourcing to identify and tag errors, and citizens as data sources, for their own subsets of compounds.
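The skeleton-versus-stereochemistry distinction follows from how an InChI is built: it is a series of layers separated by "/", with the stereochemistry carried in the layers prefixed b, t, m and s. The sketch below is my own simplified illustration, not ChemSpider's search code. It assumes simple standard InChIs (no isotopic layers, which can repeat prefixes), and the two example strings are intended to be the standard InChIs for L- and D-alanine; check them against an authoritative source before relying on them.

```python
STEREO_PREFIXES = {"b", "t", "m", "s"}  # double-bond, tetrahedral, and stereo-type layers


def split_inchi(inchi):
    """Split an InChI into (version, formula, [(prefix, body), ...])."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    version, formula, rest = parts[0], parts[1], parts[2:]
    layers = [(part[0], part[1:]) for part in rest if part]
    return version, formula, layers


def skeleton_key(inchi):
    """Formula plus the non-stereo layers: equal for stereoisomers of one skeleton."""
    _version, formula, layers = split_inchi(inchi)
    kept = tuple(layer for layer in layers if layer[0] not in STEREO_PREFIXES)
    return formula, kept


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(l_alanine == d_alanine)                              # full match: stereo included
print(skeleton_key(l_alanine) == skeleton_key(d_alanine))  # skeleton-only match
```

Matching on the full string gives the in-depth stereochemical search; matching on the reduced key gives the looser skeleton search.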

Semantic markup: project Prospect; to search for terms in text, label, and bring them out: entity extraction, etc., dependent on good dictionaries. *** see ChemMantis and CJOC. Species are linked to in Wikipedia. He also mentioned ChemMobi (iPhone app).

ChemSpider is not open source, though they use open source components (JMOL…), and it is not an open access database; they don’t assume copyright (rights remain with the depositor). Their focus is to be a community resource, but they don’t intend to make everything free. They’re integrating RSC content.

Peter Murray-Rust – Open Data and How to Achieve It.
Scattered notes from the presentation; please view the video. Peter believes that electronic theses and dissertations are critical and that libraries are missing the boat on capturing and providing access to them. Theses are useful because they often address what didn’t work in a research project, while peer-reviewed articles tend to only talk about successes.
FOIA in UK is the What do they know.org site. Peter and others created the “Is your data open” resource to see if the data published in the journal is open, title by title (or publisher by publisher). Example: Journal of Chemical Informatics. If open, it’s signified by a button to indicate that it’s open data.

CrystalEye: crawls publishers for their crystal structures, except those that are behind firewalls (Wiley, Elsevier…) Provides the structure, with validated ingestion.
Chem4Word: Open source add-in for Word 2010 that highlights the term and has a chemistry navigator pane, with an editor for the structure and validation against the name that was typed into the paper.

OREChem: sponsored by MS; Penn State, Indiana University, Cornell (OAI-ORE); University of Cambridge; and Southampton; to harvest data and run software against it.

EmMa does the embargo management against releasing different components in a system prior to being openly accessible.

Peter likes Creative Commons “By” license.

Heather Joseph – Is Open Access the New Normal?
SPARC uses the Budapest Open Access Initiative definition for open access. 4755 open access titles right now.
DRIVER: networking European scientific repositories; Driver II is next.
COAR: Confederation of Open Access Repositories.
She included a nice summary from the OECD about maximizing the impact of federally funded research.

The Alliance for Taxpayer Access was convened by SPARC. 4 principles of Taxpayer Access. NIH public access policy, Section 217 of US Consolidated Appropriations Act of 2008. Office of Science and Technology Policy (OSTP) held the meeting that Jim R. attended. This asked how to do open access, not whether or not it should be done.
Opposition has evolved other arguments that are more global and directed toward IR.

Stephen Friend – Setting Expectations: Need for Distributed Tasks and Evolving Disease Models.
Sage Bionetworks (Merck-funded nonprofit, not tied back to Merck) based at the Fred Hutch; and the Sage Commons.

Some disciplines have already gone through these transitions in the relationships between clinicians and scientists.

Three points: emerging way to look at disease models; change how clinicians and biologists work together; and role of patients as drivers of efforts.
Models of disease are not absolute, but dynamic; migration from symptoms/cellular pathological to a molecular/personalized basis of disease. This will fundamentally change the science and practice of medicine. Response signatures are modeled and in use. What clouds the picture now are those that think that DNA, or RNA, or proteomics, is the only path to researching and resolving disease models. Friend believes that integrative genomic models will instead derive clearer pictures.

We have little idea of the underlying causes of most human disease, and there is an explosion of biological genomic and clinical information, at the possible rate of 1 PB of data per day. Our current models are poor, but appropriate representations can be powerful. Rosetta Integrative Genomics Experiment is an example. He doesn’t think that systems biology will yield the answers in the next 30 years. Co-expression, causality, and Bayesian networks. Preliminary probabilistic models – Rosetta/Schadt approach for identifying causality for obesity. Other examples: cancer, all using Causal Bionetwork to identify weak spots in biological systems lacking redundancy, in a top-down research approach. Takes massive compute structure, and will be complemented by semantic web.

Change clinicians from archivists, and from papers being cited, to ideas and models being created, but Friend recognizes this will not be a trivial change, and realizes there are significant privacy and intellectual property issues. The 1st Inaugural Sage Congress will be held April 23-24, 2010 in San Francisco.

Project example: Non-responders cancer project. Focus on those that do not benefit from the drug trial, not to just save money, but because they could be on a different drug and not wasting their time. The researchers will not be working through physicians, but going directly to the patients for permission.

Peter Binfield: PLoS ONE and article-level metrics – a case study in the Open Access publication of scholarly journals. Slides

PLoS (Public Library of Science) is 6 years old. The concept of the journal hasn’t changed much since the 1660s. Four functions: registration of the primacy of your work, certification of your work (seal of authority), dissemination, and archiving. Other things are also part of what a journal does: filter for “quality” and filter for topic (scope). Many of these functions could be done via the Web, except for peer review, which has risks of bias, is often subjective, etc.

Here’s how the process might typically work:
Submit the article, it gets reviewed, rejected; submit to the next journal, etc. down the tiers of journals until it gets accepted. The process has high opportunity costs, costs to editors, etc. How did these accelerate and improve science? Who benefited?

What is the answer? Binfield says “PLoS ONE, of course!” Open access, online only with no size, topic or scope limitations. Publication fee business model. Peer review asks the “right” questions: is it publishable? is the science sound? plus 7 other basic criteria (not “is this a major advance?” etc.). Growth of PLoS is unparalleled in history; it’s now the largest journal in the world, publishing .5% of all that is in PubMed. 50,000 authors, 1,000 academic editors; paradigm shift from the journal to the article.

Article-level metrics are provided in every journal: putting research in context. How could impact be measured? citations, web hits, bookmarking, comments, community ratings, expert ratings, blog coverage, etc. All things that are hard to “game”; these metrics were implemented in September 2009. They hope other journals use these metrics and maybe standards evolve from this. Sharing detailed research data is associated with increased citation rate. Scopus reference landing page gives first 20 citing page links, WoS doesn’t have such a page for non-subscribers.

Some other sites mentioned: Postgenomic aggregates science blogs and then does data analysis
Manyeyes is a crowdsourcing option for “shared visualization and discovery”
Frontiers series of journals provide some neat metrics.
ACM shows data per author rather than by article.

Submit a note (or a correction) is a great feature in the PLoS titles. Highlight the text after logging in as a commenter, then submit your comment, etc.

Other sources mentioned: Nature blogs, Blogline, Researchblogging.org, CiteULike, Connotea

John Wilbanks – Keynote

The number of Creative Commons licensed objects has passed the point where it can be counted. The goal is to spark generative science, as used by Jonathan Zittrain: “capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences”. Generative is not the same as powerful, because something could be very powerful (radio telescope) but difficult to use and costly, so it might not be as generative as something less powerful but more widely used or extensible. For more, see the video.

37th Annual MCN Conference, “Museum Information, Museum Efficiency: Doing More with Less!”, Portland, OR Nov. 11 – 14, 2009.

This conference by MCN came to my attention through a posting on the LITA listserv. The museum perspective on data and information management was interesting; I found the conference useful and plan to follow up on several presentations. Participants were from a full spectrum of museums, from art to anthropology, from the Smithsonian to the Shelburne Museum in Vermont.

Below are some notes on the workshops and presentations I attended. Some of the presentations are online and linked to from the program schedules for each day. Some are also on SlideShare tagged with “mcn2009” or in the MCN 2009 event section of SlideShare.

Cloud Computing | Keynote | Case Studies I | Museum Data Exchange | More with Less Roundtable | Case Studies II | Strategery | Case Studies III | Semantic Web II

Cloud Computing Workshop (presentation)
by Moad, Stein, and Davidow.

The workshop started out with a helpful introduction to terminology, concerns, and implementations of cloud computing. Gartner’s Hype Cycle for 2009 showed the timeline for cloud computing expectations vs. time to mainstream adoption, and Gartner says cloud is the #1 strategic technology area for 2010. The presenters differentiated between cloud applications (Zoho, Gmail, Google Docs) and utility computing (“elastic IT-related capabilities”: renting CPU and/or file space, e.g., Rackspace, Windows Azure, Amazon Web Services). In 2008, 21% of companies were piloting software as a service.

Top concerns of those looking into cloud as an option: total cost, security, not able to find the applications they need.

  • Pros: fast deployment; lower cost/no capital expense; reduced IT maintenance; elastic and unlimited scalability; reliability in service may not be 99.999 %, but as good or better than their own services.
  • Cons: information security; long term offline storage (where is it?); potential vendor lock-in (e.g., Google apps); some bandwidth bottlenecks at your connection end; latency; lack of control of downtime

Presenters provided a quick summary of the items to consider in making choices about cloud computing: what kind of security requirements fit your data; how granular is your information; where are the performance bottlenecks; and what is your IT staff like.

Jungle Disk was mentioned as one example of a way to backup office files to Rackspace or Amazon S3. Upload of large files can be relatively slow, depending on your connection to the larger pipes, but you can do scheduled backups. It retrieves files gracefully with a drag and drop interface.

Amazon Web services (AWS) has simple storage (S3), “Elastic Compute Cloud” (EC2). Transfers between S3 and EC2 are relatively cheap; the transfer to and from AWS can add a lot to the cost of using AWS. With S3 you pay for what you use; with EC2, you pay for a block, regardless of how much or little of it you use.
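The two billing shapes described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and every rate is a caller-supplied placeholder, not an actual AWS price.

```python
def usage_based_cost(gb_stored, rate_per_gb):
    """S3-style billing: cost scales with what is actually stored."""
    return gb_stored * rate_per_gb


def block_based_cost(blocks_reserved, rate_per_block):
    """EC2-style billing: cost depends on the blocks reserved, not on utilization."""
    return blocks_reserved * rate_per_block


def monthly_estimate(gb_stored, gb_transferred, blocks_reserved,
                     storage_rate, transfer_rate, block_rate):
    """Sum storage, transfer in/out of AWS, and compute.

    All rates are assumptions supplied by the caller.
    """
    return (usage_based_cost(gb_stored, storage_rate)
            + gb_transferred * transfer_rate
            + block_based_cost(blocks_reserved, block_rate))
```

Plugging in your own vendor quotes for the three rates makes it easy to see how quickly the transfer term can dominate the total.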

Rather than moving everything to AWS, the Indianapolis Museum of Art plans to continue to use the cloud for web apps, but not for storage. They have 16 TB stored onsite and 16 TB offsite. They benchmarked growth rate (~14 TB in 4 years) and looked at storage area network (SAN) vs. S3 and concluded that co-location was more cost effective than S3 by approx. $70 K, but there are other costs to co-location, such as server maintenance and admin.
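The growth benchmark lends itself to a quick back-of-the-envelope projection. The sketch below assumes growth stays linear at the benchmarked rate, which is my simplification and not something the presenters claimed; the three-year horizon is also just an illustrative choice.

```python
def annual_growth_tb(growth_tb, years):
    """Average growth per year over the benchmark period."""
    return growth_tb / years


def projected_storage_tb(current_tb, growth_tb_per_year, years_ahead):
    """Linear projection: current holdings plus steady yearly growth (an assumption)."""
    return current_tb + growth_tb_per_year * years_ahead


rate = annual_growth_tb(14, 4)                # ~14 TB in 4 years -> 3.5 TB/year
needed = projected_storage_tb(16, rate, 3)    # 16 TB onsite today, 3 years out
print(rate, needed)                           # 3.5 26.5
```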

There are a couple of Firefox addons that make it easier to work with AWS: Elastic Fox and S3fox, and the EC2 console in AWS.

Moving Drupal to the Cloud: A step-by-step guide and reference document for hosting a Drupal web site on Amazon Web Services:
http://www.imamuseum.org/cc-mcn09.pdf

Fedora for Digital Asset Management
This was an exploration of using Fedora on AWS for preservation work. They spent ~$35 K with a vendor to do Ruby on Rails for Fedora, on AWS. This saved them a lot of money vs. a hosted ISP (from $1200/mo down to $900/mo now). They use Subversion for version control of code, and the vendor handed off everything once the instance was set up. They required documentation on a wiki, and an “Amazon Machine Image” (AMI) that they could then check out of Subversion and run on AWS. They reviewed some technical “gotchas” that are worth looking at in the slides if you want to pursue something like this.

Presenters also talked about the ArtBabble site (neat implementation of AWS for video hosting), and Rightscale.com.

Keynote address by Karen Donoghue.

Most of Donoghue’s talk focused on examples of interface design she has worked on, such as for handheld devices, incorporating a browser into a cell phone. She also talked about content distribution platforms, such as RSS but with more interaction capabilities. She likes to use translucency (content layers display partially through those above them), via CSS. Affordances are important in design, where the representation conveys the action (e.g., button graphic -> push/click). Other concepts she covered: presence [indicates whether or not a person is connected]; personalized didactics [museum visitor uses a hand-held device to read a barcode at an exhibit item which translates the text into their language].

Case Studies Session I
Using Open Source Software in an Era of Tight Budgets
Robert Schimelpfenig, Maria Schenk. WSU used WAMP and phpMyAdmin to work with metadata for ContentDM that are compatible with Dublin Core. Case study summary:
Schimelpfenig and Schenk, Using Open Source Software in an Era of Tight Budgets: A Case Study of WSU Vancouver Library’s Digital Archives Project (PDF)

Crazy Quilts to Patchwork Technology
Narda McKeen LaClair. One of the neat ideas that Narda talked about was the use of Flickr to highlight donor-funded projects throughout the life of the project. Example: adopt an animal for preservation. This information is also linked to from their Facebook page.

Using Server and Storage Virtualization to Build an Economical and Scalable Infrastructure
Sarah Winmill. Sarah reviewed their project to set up 60 TB of storage, using VMware. They followed business continuity planning from the start to avoid service breaks. For secondary storage off site, they located servers at another organization that is also using Hitachi, and provided space for that group’s off-site storage at their facility. This enabled both organizations to reduce costs of their secondary storage while maintaining a good baseline of support. A case study is on the vendor’s web site. Here’s another summary of the project.

The Flickr Commons Experience: The View from the Oregon State University Archives
Tiah Edmunson-Morton. OSU is the only university archives currently on Flickr Commons. They joined in Feb. 2009. Commons is more formal than the normal Flickr account, but only costs $25. Photos are scanned into ContentDM, then added to Flickr Commons with batch uploads of the images. Unfortunately, it requires manual upload of the metadata into Flickr, but they hope to move to auto-updating. Updates of ~50 images twice/month keep the manual aspects manageable. They also use the indicommons blog for Flickr.

Freedom to Experiment! The Luce Foundation Center as Testing Grounds for Innovation
Georgina Goodlander. The take-home lesson from the Luce Foundation presentation was to encourage and try out innovative low-investment ideas, (without a lot of marketing support), and then move forward with the results.

Fill the gap: The Center has a storage area of approximately 3000 objects that are also on display. If an object will be on loan (or out for repair, etc.) for more than 12 months, a replacement is chosen, and online viewers can submit their suggestions for what should be moved into the open spot via their Flickr page.

Ghosts of a Chance was an interactive text-based scavenger hunt where the participant(s) use a cell phone with texting. It became a 2009 Webby Awards Honoree. It enabled interaction with the museum’s resources, and could be configured for individuals or groups of more than 10 people. It can still be accessed as a downloadable module.

The Field Museum Gone Google (case study summary)
Drew Ruginis. Drew summarized their move from Microsoft Outlook Exchange to Google Apps for email and calendaring. Some of the issues mentioned were incomplete integration across apps (e.g., Reader, Picasa), lack of password synchronization with Active Directory, and lack of granularity in the control panel. Positives were cost savings and fewer outages.

Museum Data Exchange: Making hay with harvestable data


Goodwin, Museum Data Exchange: Introduction (PDF)
Rubinstein, Museum Data Exchange: COBOAT (PDF)
Oberoi, Museum Data Exchange: Learning How to Share (PDF)
Waibel, Museum Data Exchange: Making Hay with Harvestable Data (PDF)

Full report on this Mellon-funded project was released February, 2010.
This presentation summarized a project by nine museums to create tools using COBOAT (http://www.oclc.org/research/activities/coboat/default.htm) to extract CDWA Lite XML records out of Collections Management Systems, publish records via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), then analyze data, and examine issues such as standards compliance and interoperability. Implementation was generally easy; the politics were more difficult. Collaborative development required the developers to host and use the same technology, which they were not familiar with. The data aggregation collected more than 850,000 records, and is a source for more research opportunities to look at consistency of implementation of standards, etc. across museums. Other tools and resources mentioned that are worth a look: Omeka, The Cultural Objects Name Authority (CONA). Also see the blog, hangingtogether.org
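For readers who have not seen OAI-PMH in action, here is a rough Python sketch of the harvesting side. It is not the project's code: the base URL and sample response are made up, and I use the `oai_dc` prefix that every OAI-PMH repository must support, since the prefix a repository assigns to CDWA Lite is a local choice. The `ListRecords` verb, the `resumptionToken` paging argument, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix is left out.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:obj-1</identifier></header></record>
    <record><header><identifier>oai:example.org:obj-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A harvester would fetch each URL in turn and keep following the token until the response no longer contains one.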

Doing More With Less: A community-software-based technology roundtable.

CollectionSpace. With sustainable funding, CollectionSpace could be an open source alternative to commercial products for museums, possibly with natural history collections too. Maturity date is 2010. There are opportunities for partnering. See the project site for webinars, the community design workshop results (http://www.collectionspace.org/sites/default/files/Community%20Design%20Workshop%20Report.pdf), how to join the project, etc.

ConservationSpace is for in-house art conservation. Standalone, as a module in CollectionSpace, or interoperable with other collection management systems. Expected 2011.

Museum Scholarly Infrastructure. In development. To improve ability of scholars to collaborate with each other in study of collections. Shared technology infrastructure: Sakai (collab support and social media); ConservationSpace; CollectionSpace. Extensible infrastructure to cut costs. British Museum as tech lead; partners are National Gallery (Raphael), Courtauld, RKD & Mauritshuis, etc. Maybe by summer of 2011.

FluidEngage
Should partner well with CollectionSpace. Common authoring environment for museum content; banners, kiosks, wall tags; usability best practices; accessibility compliant to manage risks for museums. Small device support (smart phones). Many partners, advisers, contributors. Version 0.1 is available now. Being built in a modular fashion. Pull and push info (web presence, plus user comments, tagging). Hope to release 0.3 in January 2010, more user interface improvements. 0.5 by April. Focused on user experience and usability.

Decapod
Long tail digitization project, for collections too small or too fragile or rare to leave the institution for digitization. Initial target is paper, but in principle anything 2D. One-button free publication to Internet Archive. Optimized for scholarship, books, boxes, collections. Multilingual OCR. About 300 pages/hour. Target: anyone, 1-page instructions, error checking, etc. Target cost is $1100 (except for laptop). Self-correcting (de-warping, de-skewing).

Case Studies II
Pathfinder, a New GUI for the Art Institute of Chicago.
They use 7 wall-mounted information kiosks to provide interaction and way-finding. Kiosks were made with Slate Roof Studio and AvantLogic, using a Troll Touch Overlay on 30-inch Apple Cinema displays. They incorporated “light boxes”, popup images, and “show path” and “show accessible path”. Nice!

Art Babble: Play Art Loud!
Indianapolis Museum of Art. This is a niche site intended to offer something better than the typical YouTube content and an option for institutions to collaborate in providing video about art, which can be hard to find elsewhere. They provide full text transcriptions of the content, and try to make instructional videos exciting and fun. The amount of time a user stays on the site reflects the typical length of most videos.

MOA-CAT: UBC Museum of Anthropology Collection Access Terminals.
They replaced data books in the visible storage galleries with touch-screen interactive information systems, and have all 37,000 objects in the system. Launch date was January 2010. Visitors can view where an object is from via links to Google Earth.

Presenting Musical Compositions as Works of Art.
Wellesley College. The woodworking for the construction of an overhead speaker system for the exhibit was what was most interesting to me. Unfortunately their web site has no information or pictures that I can find.

Strategery: The Realities of Strategic Planning
Edson, Iannacone, Honeysett. I’ll provide more in-depth notes on this session, though the slides for one part of this session are also online. This was one of the more interesting and entertaining presentations–and extremely applicable to museums and libraries. The focus was on digital strategy and “strategy-creation processes, successes and failures, and the relationship between technology platforms and organizational readiness.”

For the Getty (Nik Honeysett, no slides online).
They did a strategic review to redefine academic computing as a networked environment.

8 principles:

  1. Success: define it, give outcomes that indicate it.
  2. There’s no place like home (their web site): use platforms to bring people to the website.
  3. Create more than you see (iceberg): a solid infrastructure, process, and workflows are needed to implement mobile technology
  4. Courage: have the courage to start again when a project fails or falls apart.
  5. Creativity: to make the complicated simple
  6. Give people permission to fail
  7. Improve with use–the essence of Web 2.0 is that it works better as more people use it (e.g., social media)
  8. Do more with less: do less with less, but do the stuff that matters more. Segal’s law: “A man with a watch knows what time it is. A man with two watches is never sure.”

They used a set of internal guidelines to inform the reorganization, developing and using a process and infrastructure that is flexible, scalable, and sustainable. A common thread is that people know who’s in charge, who to go to for help, and the process being employed.

For the Smithsonian (Edson, Iannacone): some slides here: Smithsonian Web and New Media Strategy — Drivers, Process, and Execution
The assumption is that strategy has four components, in this order:

  1. Pain, Fear, or Opportunity
  2. Some kind of process
  3. Some kind of assertion
  4. Some kind of work

The Smithsonian used an internal blog and wiki as part of the process, to:

  • Build a shared vocabulary
  • Keep focus on mobile, UX, other subtle things
  • Celebrate internal experts
  • Practice skills to be used later on (“you get what you practice”)

They wanted a process that was public, transparent, and fast. Advantages (copied from a slide):

  • Faster than traditional committee-driven process (Time is the enemy)
  • Increase size of brain trust (Joy’s Law)
  • Improve the odds for change
  • Improve odds for execution (public promises not easily forgotten)
  • Outside champions more likely to support “commons” goals than status-quo insiders
  • Walking the Talk vis-à-vis crowdsourcing and innovation model
  • “You get what you practice”

In the Smithsonian process, the process was the wiki; meeting agendas, notes, etc. were all on the wiki. “The main intent of the workshops is to move relevant information to the wiki where it can be openly evaluated, sifted, weighed, and considered by all.” Participant comments in meetings were also added to the wiki. This increased transparency, accountability, and speed. Action items and themes were highlighted after the workshops, to promote further synthesis and thought. Presenters also addressed the risks of using such a process. See the SI Web and New Media Strategy wiki.

The Change Model (slide 74)
(Borrowed from software and social entrepreneurship)

  1. Think big, start small, move fast
  2. Focus on doing things that matter (via Tim O’Reilly)
  3. Cultural institutions exist to do work in the culture
  4. Drive change through building A Sense of Urgency (John P. Kotter)

A few points from Carmen Iannacone’s part of this session (slides not online):
What are hallmarks of disruptive success? innovation, flexibility, creativity, productivity, organizational discipline (better, smoother function), morale, achievement. The “tools” we have are: human capital, technology, goodwill, and luck.

He believes in aligning the infrastructure and operations staff with a maturity model (Gartner’s infrastructure and operations (I&O) maturity model: survival -> awareness -> ... -> business partnerships; see this example for more details), embracing a process framework, and MBWA (management by walking around). See also Good Projects Gone Bad for more on process maturity.

Other tips: follow the money in technology; assess your capital planning and investment control (CPIC) process; use zero-based budgeting; use earned value management (EVM) techniques if you can; use operational forensics to address and avoid issues with IT roll-outs and changes; and prioritize the IT project list. Here are some of the things on their priority list:

  • cloud computing, server virtualization, storage abstraction, mobile computing, ubiquitous wireless access, immersive user experiences, and sourcing & licensing.

Case Studies III

Digital Media Served up Using XML and the Google Maps API
Evelyn Lindberg, Laura Robinson. Based loosely on the concepts behind the Map of Knowledge project (http://www.nytimes.com/2009/03/16/science/16visuals.html?_r=1) they employed the Google Maps API to do a mashup with newspaper coverage.

Print quality downloads

by Sarah Winmill. They put the catalog online to facilitate download of print-quality images from a collection of ~150,000 images, using a proxy to download the files from a digital asset management (DAM) system. Images are cropped, which has been contentious. They’re now using a crowdsourcing approach to get feedback on the cropping. The total cost of this project was equivalent to a mini-exhibit; 6 month timeline, with 3 months of it in heavy codework.

Introducing imaging quality control practices into sustainable workflows.
The vendor, Image Science Associates http://www.imagescienceassociates.com/ , seems to have a good combination of hardware and software for providing a sustainable approach to ensuring that images are consistently high quality, whether from scans or photos.

Enhancing the usability of in-gallery media through data visualization
Paco Link. The topic was touch screen displays, something we don’t often deal with in the library but which I think could greatly improve way-finding and reference services for our users. They used cluster/contour maps of screen touches, generated from logs, to visualize how patrons were using the touch screen displays. The prerequisite is to include the logging capacity from the start in the design of the screen systems.
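As a rough idea of what sits underneath a cluster or contour map, the Python sketch below bins logged touch coordinates into a grid and counts hits per cell. The log format (plain x, y pixel pairs), the cell size, and the sample data are all my assumptions; the presenters did not describe their log schema or tooling.

```python
from collections import Counter


def bin_touches(touches, cell_px):
    """Count touches per grid cell; cell (col, row) covers cell_px x cell_px pixels."""
    counts = Counter()
    for x, y in touches:
        counts[(x // cell_px, y // cell_px)] += 1
    return counts


# Hypothetical logged touch coordinates in pixels (x, y).
touch_log = [(12, 40), (18, 44), (310, 200), (15, 47), (305, 210)]

grid = bin_touches(touch_log, cell_px=50)
hottest_cell, hits = grid.most_common(1)[0]
```

The per-cell counts can then be handed to any plotting tool to draw the heat or contour map; none of this is possible unless the kiosk logs each touch from day one.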

Hi Definition History
Tamara Georgick, Gabe Kean. Another presentation on kiosks, that looked at the development process for including from 1 to 3 panels, and employed a content management system for a lower-cost approach.

Semantic Web II
Koven Smith, Don Undeen. This project used Semantic MediaWiki for quickly creating a system in which conservators could record their treatment notes, and add semantic relationships between terms. This was easier than the old system of written forms, and facilitated search, web output, etc. Since it’s built on the wiki framework and incorporates annotations, they didn’t need to spend a lot of time on setting up a platform. The focus is on terminology and nomenclature; each annotation in the wiki creates an RDF triple. “Queryable” data appears in a fact box at the bottom of the page, which can be clicked on to view inverse relationships. Other components of the system: CAMEO, a web services piece written in C#, the CIDOC conceptual reference model, and the Halo graphical annotation extension.
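To show how a wiki annotation turns into a triple, here is a small Python sketch. Semantic MediaWiki's inline syntax is `[[Property::Value]]`, and the page carrying the annotation acts as the subject. The page title, property names, and sample note below are invented for illustration, and the regular expression is a simplification of what the real parser handles.

```python
import re

# Matches inline annotations of the form [[Property::Value]] or
# [[Property::Value|display text]]; a simplified pattern, not SMW's parser.
ANNOTATION = re.compile(r"\[\[([^:\[\]|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title, wikitext):
    """Return (subject, predicate, object) tuples, using the page as subject."""
    return [(page_title, prop.strip(), value.strip())
            for prop, value in ANNOTATION.findall(wikitext)]


# Hypothetical treatment note with two hypothetical properties.
note = "Cleaned with [[Uses solvent::Ethanol]] after [[Has condition::Flaking paint]]."
triples = extract_triples("Treatment 42", note)
```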
