I attended the Science Commons Symposium – Pacific Northwest, on the Microsoft campus in Redmond, WA on Feb. 20, 2010.
My notes are below. More detailed notes were compiled by Brian Glanz of the Open Science Foundation, and other links, notes, and comments are on the Science Commons site.
Videos synchronized with slides are available. The videos will play in any browser, but viewing the video and slides together requires Silverlight on Windows or OS X. Session 1 Session 2 Session 3 Session 4
Videos (no slides) may also be downloaded as .wmv here: Session 1 Session 2 Session 3 Session 4
Lee Dirks gave an overview of Microsoft External Research, which is research in collaboration with external parties such as Science Commons, spanning education & scholarly communication and other groups. The partnerships include an MS Office add-in for embedding Creative Commons licenses into PowerPoint, Excel, or Word.
The Word add-in for ontology recognition was done by Phil Bourne, Lynn Fink, and John Wilbanks. The ontology travels with the Word doc, and the add-in includes an ontology browser, term recognition, and disambiguation. There is also the Creative Commons Add-in for Office Word 2007, a “plug-in to Office 2007 which connects to the Creative Commons webservice in order to generate and embed XML and bitmap representations of Creative Commons licenses.”
Also mentioned was the book The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Kristin Tolle, Tony Hey, and Stewart Tansley. The four themes mirror the emphases of MS External Research (scholarly communication, health & well-being, infrastructure, and earth & environment). There are 26 short technical papers in the four sections; 45 of the 70 authors are not from Microsoft. Data-intensive discovery is the fourth paradigm; the other three are the empirical, analytical, and simulation paradigms.
Cameron Neylon: science in the open: why do we need it? How do we do it?
Cameron works on structural biology (see Wikimedia Commons); his work is a mixture of small-lab experiments and big-iron facilities. The question to ask (and answer): why do the research? Why do we pay for it? Cures; prestige (national and personal); excitement (galaxies, how biology works) and curiosity; fun. It’s a privilege, not a right. So how do we maximize the return on the public investment (not just monetary, but cures, etc.)?
1. Open access. Widest community has greatest access to the results.
2. Formal publication may be overkill (you don’t need a sledgehammer to take down a snowman).
The Web makes making things public easy; his example was the solubility data he and Jean-Claude Bradley generated, which became the top hit in a Google search. Broadcasting is easy; sharing effectively is much harder. You have to make the choice to put up the signage, and put the data into a form(at) that is discoverable.
Interoperability is the key to making this possible: legal, technical, and process interoperability. Systems need to work with the existing processes and the people. Capture the pieces of the research, then add the tools later that help to tell the story. Map the process onto agreed vocabularies. Machines do structure and need structure, humans tell stories.
How to get the structured data out? Generate RDF from drop-down lists; because the inputs come from controlled lists, the system knows what they are.
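As a sketch of what that might look like, here is a minimal example that turns controlled-vocabulary form fields into Turtle-serialized RDF triples. The `ons:` namespace, property names, and experiment ID are hypothetical placeholders, not the actual vocabulary Neylon's group used:

```python
def experiment_to_turtle(exp_id, fields):
    """Serialize drop-down form selections as RDF triples in Turtle.

    Because each value comes from a controlled list, every field can be
    mapped directly onto a predicate in an agreed vocabulary.
    (The ons: namespace and property names are hypothetical.)
    """
    prefix = "@prefix ons: <http://example.org/ons#> .\n"
    subject = "ons:" + exp_id
    lines = ['{} ons:{} "{}" .'.format(subject, key, value)
             for key, value in fields.items()]
    return prefix + "\n".join(lines)

print(experiment_to_turtle("exp42", {"solute": "EGCG", "solvent": "water"}))
```

A real system would use an RDF library such as rdflib and resolvable URIs, but the point stands: structured inputs make machine-readable output nearly free.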
Tools to deal with the scale of data are needed. GenBank can’t keep up with the amount of data being submitted. Human scientist capacity is static; it doesn’t scale. The only thing that does, he believes, is the Web, which scales by distribution effects. Governments, research groups, etc. do not effectively scale. Therefore the scientist must be connected, and sharing.
Open content builds the network; see the Panton Principles http://pantonprinciples.org for a checklist of principles for sharing data/open science; clarity and adherence to these points will facilitate sharing.
Jean-Claude Bradley: Using Free Hosted Web 2.0 for Open Notebook Science (ONS).
The case for ONS: 6 points and how ONS addresses these issues.
How bad is the current system? Look for the solubility of EGCG, using Beilstein, etc. The peer-reviewed number is 521.7 g/L (a misprint), Sigma-Aldrich says you can create a solution of 5 g/L, and another source cites 2.3 g/L. The chain of provenance is difficult to follow (and absent for the chemical company’s figure).
Another example: the NaH oxidation controversy, a claim of something most chemists would say is impossible. People tried to replicate the experiment and reported the NMR and other details of what they did, with a 15% yield. This opened up the investigation to greater discussion, testing, etc. JC-B’s students ran the experiment and got no conversion. The blogosphere then reported that another interim component might interfere with the reaction. In the published realm, the article was simply retracted without explanation.
There are various logos for ONS to convey immediate access, delayed access, etc. ONS makes assumptions explicit, thereby maintaining the integrity of the data provenance and moving away from an environment of trust to one of proof. The source shouldn’t matter if you can see the evidence (Google/Web vs. a publisher).
Tools. Creating a log of the experiment is critical for reporting the time of each step; log entries are made manually. Discussion, conclusions, etc. can be included. Raw data can be made public: images and videos (which make it easier to show exactly what was done and how, sometimes better than written notes). Calculations and other functions are explicit in a Google spreadsheet. The wiki shows revision history. JC-B likes the ability to make comments and interact with students via the wiki, and he receives notifications when something is done.
He uses JSpecView and JCAMP-DX to provide a Web interface for interacting with the NMR data and zooming in to look at imperfections. He also uploads spectra to ChemSpider, which asks if you want them to be open. If a spectrum is open, it can be used in an educational game (the Spectral Game).
A related crowdsourcing challenge lets people contribute solubility data for compounds in organic solvents. Judges could interact with students via the wiki, and awards went to the most responsible scientists. Other teachers used this in their own labs.
For search/browsing, there is a wiki table of contents (but what about semantic relationships?). You can also use an API that enables queries of the spreadsheet content (Google Viz). Longer-term improvements would automate the process: use bots to interact with data, and take solubility measurements from JCAMP-DX via an API/Web service.
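To illustrate the kind of query the spreadsheet API allows, here is a sketch that builds a Google Visualization (gviz) query URL for a published sheet. The sheet key is a placeholder, and the URL shape is the current `gviz/tq` endpoint form, which may differ from what existed in 2010:

```python
from urllib.parse import urlencode

def gviz_query_url(sheet_key, query):
    """Build a Google Visualization API query URL for a published sheet.

    `query` uses the gviz SQL-like query language, e.g.
    "select A, B where B > 2". The sheet key here is hypothetical.
    """
    base = "https://docs.google.com/spreadsheets/d/{}/gviz/tq".format(sheet_key)
    params = urlencode({"tqx": "out:csv", "tq": query})
    return base + "?" + params

# Ask for solute/solubility columns where solubility exceeds a threshold.
print(gviz_query_url("SHEET_KEY_PLACEHOLDER", "select A, B where B > 2"))
```

Fetching that URL returns matching rows as CSV, which is what makes bot-driven reuse of open notebook data practical.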
Finding the data: how do you find it if you don’t know about the spreadsheet API? Begin at Wikipedia > lab notebook > raw data > Google spreadsheet with calculations.
Employing an ONS approach to research may impact the prospects for publication. Some publishers will consider ONS a preprint; some will accept the paper, even in one case where it was written on a publicly available wiki. The author can cite the ONS page in the paper. In one case, the experiment was published in JoVE, another in Nature Precedings. There is the risk that repurposing content in multiple formats and media will dilute article metrics, since people could be citing other repurposed content you created.
Librarians and Science 2.0: The Wayback Machine doesn’t archive wiki pages very well. ONSPreserver will go through a spreadsheet on a Windows Scheduler job to back up the data. You can also publish the Google spreadsheet as an .xls, which preserves the calculations.
Other options for archiving the open notebook: ONSPreserverLite, ONSArchive, and Lulu.com for data disks. In Bradley’s case at Drexel, he worked with the library to publish things in a DSpace repository as a zip archive (except that the spectra can’t be viewed there), and as a book via Lulu.com. The entire record can be exported into a local archive (a snapshot of that particular day).
Intellectual property protection is problematic, though you can file for a patent within a year of making the work public (a US grace period; this is not applicable outside the US).
Antony Williams: ChemSpider started as a hobby project to connect chemistry on the Web and integrate chemical structure data, serving as a structure-based hub.
Antony believes open data is here now, though it might not be called that. ChemSpider was released in March 2007 with 10.5 million structures. In June 2007, they started looking at how to clean up the data, using a curation layer and a deposition interface to add data. It was acquired by the Royal Society of Chemistry and is now hosted by them. Disambiguation is key to improving search: names, nomenclature, structures, patents, publications.
PubChem is a good database, but it doesn’t do validation. Curation is the time-consuming part; it took ChemSpider staff 3 days to clean up the records/links for just one compound, vancomycin (emails to authors, etc.). They are going through the compounds on Wikipedia.
Alternatives to ACS/CAS IDs for unique compounds: use a standard InChI, with the ability to search on the skeleton and, in more depth, on the stereochemistry. They are using crowdsourcing to identify and tag errors, and citizens as data sources for their own subsets of compounds.
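Skeleton-vs-stereochemistry searching works because standard InChI strings keep stereochemistry in dedicated layers (`/b`, `/t`, `/m`, `/s`); stripping those layers leaves an identifier for the bare skeleton. A minimal sketch (the alanine InChIs are the standard published ones):

```python
def skeleton_inchi(inchi):
    """Drop the stereo layers (/b, /t, /m, /s) from a standard InChI,
    so two stereoisomers compare equal on formula and connectivity alone."""
    layers = inchi.split("/")
    kept = [layer for layer in layers
            if layer[:1] not in ("b", "t", "m", "s")]
    return "/".join(kept)

l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# L- and D-alanine differ only in stereo layers, so their skeletons match.
print(skeleton_inchi(l_ala) == skeleton_inchi(d_ala))  # True
```

ChemSpider itself relies on the canonical InChI software for identifier generation; this sketch only shows why the layered format makes skeleton-level matching cheap.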
Semantic markup: Project Prospect searches for terms in text, labels them, and brings them out — entity extraction and the like, dependent on good dictionaries; see ChemMantis and CJOC. Species are linked to Wikipedia. He also mentioned ChemMobi (an iPhone app).
ChemSpider is not open source, though it uses open source components (JMOL…); it is not an open access database, and they don’t assume copyright (rights remain with the depositor). Their focus is to be a community resource, but they don’t intend to make everything free. They’re integrating RSC content.
Peter Murray-Rust: Open Data and How to Achieve It.
Scattered notes from the presentation; please view the video. Peter believes that electronic theses and dissertations are critical and that libraries are missing the boat on capturing and providing access to them. Theses are useful because they often address what didn’t work in a research project, while peer-reviewed articles tend to talk only about successes.
FOIA in the UK is served by the WhatDoTheyKnow.org site. Peter and others created the “Is your data open” resource to check whether the data published in a journal is open, title by title (or publisher by publisher). Example: Journal of Chemical Informatics. If open, a button signifies that it’s open data.
CrystalEye: crawls publishers for their crystal structures, except those behind firewalls (Wiley, Elsevier…). It provides the structure, with validated ingestion.
Chem4Word: an open source add-in for Word 2010 that highlights chemical terms and provides a chemistry navigator pane, with an editor for the structure and validation against the name typed into the paper.
OREChem: sponsored by Microsoft; partners include Penn State, Indiana University, Cornell (OAI-ORE), the University of Cambridge, and Southampton. The goal is to harvest data and run software against it.
EmMa handles embargo management, preventing the release of different components in a system before they are meant to be openly accessible.
Peter likes the Creative Commons “BY” (attribution) license.
Heather Joseph – Is Open Access the New Normal?
SPARC uses the Budapest Open Access Initiative definition of open access. There are 4,755 open access titles right now.
DRIVER: networking European scientific repositories; Driver II is next.
COAR: Confederation of Open Access Repositories.
She included a nice summary from the OECD about maximizing the impact of federally funded research.
The Alliance for Taxpayer Access was convened by SPARC, with 4 principles of taxpayer access. The NIH public access policy is Section 217 of the US Consolidated Appropriations Act of 2008. The Office of Science and Technology Policy (OSTP) held the meeting that Jim R. attended; it asked how to do open access, not whether it should be done.
Opposition has evolved other arguments that are more global and directed toward institutional repositories.
Stephen Friend – Setting Expectations: Need for Distributed Tasks and Evolving Disease Models.
Sage Bionetworks (a Merck-funded nonprofit, not tied back to Merck) is based at the Fred Hutch; there is also the Sage Commons.
Some disciplines have already gone through these transitions in the relationships between clinicians and scientists.
Three points: emerging way to look at disease models; change how clinicians and biologist work together; and role of patients as drivers of efforts.
Models of disease are not absolute, but dynamic; there is a migration from a symptomatic/cellular-pathological basis of disease to a molecular/personalized one. This will fundamentally change the science and practice of medicine. Response signatures are modeled and in use. What clouds the picture now are those who think that DNA, or RNA, or proteomics is the only path to researching and resolving disease models. Friend believes that integrative genomic models will instead derive clearer pictures.
We have little idea of the underlying causes of most human disease, and there is an explosion of biological, genomic, and clinical information, at a possible rate of 1 PB of data per day. Our current models are poor, but appropriate representations can be powerful; the Rosetta Integrative Genomics Experiment is an example. He doesn’t think that systems biology will yield the answers in the next 30 years. Co-expression, causality, and Bayesian networks. Preliminary probabilistic models: the Rosetta/Schadt approach for identifying causality in obesity. Other examples: cancer, all using causal bionetworks to identify weak spots in biological systems lacking redundancy, in a top-down research approach. This takes massive compute infrastructure and will be complemented by the semantic web.
Change clinicians from archivists, and shift credit from papers being cited to ideas and models being created; Friend recognizes this will not be a trivial change, and realizes there are significant privacy and intellectual property issues. The inaugural Sage Congress will be held April 23-24, 2010 in San Francisco.
Project example: the non-responders cancer project. It focuses on those who do not benefit from a drug trial, not just to save money, but because they could be on a different drug and not wasting their time. The researchers will not be working through physicians, but going directly to the patients for permission.
Peter Binfield: PLoS ONE and article-level metrics – a case study in the Open Access publication of scholarly journals. Slides
PLoS (Public Library of Science) is 6 years old. The concept of the journal hasn’t changed much since the 1660s. Four functions: registration of the primacy of your work, certification of your work (a seal of authority), dissemination, and archiving. Journals also filter for “quality” and filter for topic (scope). Many of these functions could be done via the Web, except for peer review, which has risks of bias, is often subjective, etc.
Here’s how the process typically works:
Submit the article; it gets reviewed and rejected; submit to the next journal, and so on down the tiers of journals until it gets accepted. The process has high opportunity costs, costs to editors, etc. How does this accelerate and improve science? Who benefits?
What is the answer? Binfield says “PLoS ONE, of course!” Open access, online only, with no size, topic, or scope limitations. Publication-fee business model. Peer review asks the “right” questions: is it publishable? is the science sound? plus 7 other basic criteria (not “is this a major advance?” etc.). The growth of PLoS is unparalleled in history; it’s now the largest journal in the world, publishing 0.5% of all that is in PubMed. 50,000 authors, 1,000 academic editors; a paradigm shift from the journal to the article.
Article-level metrics are provided in every journal, putting research in context. How could impact be measured? Citations, web hits, bookmarking, comments, community ratings, expert ratings, blog coverage, etc., all things that are hard to “game”. These metrics were implemented in September 2009. They hope other journals adopt them and that standards may evolve from this. Sharing detailed research data is associated with an increased citation rate. A Scopus reference landing page gives the first 20 citing page links; WoS doesn’t have such a page for non-subscribers.
Some other sites mentioned: Postgenomic aggregates science blogs and then does data analysis.
Many Eyes is a crowdsourcing option for “shared visualization and discovery”.
Frontiers series of journals provide some neat metrics.
ACM shows data per author rather than by article.
Submit a note (or a correction) is a great feature in the PLoS titles: highlight the text after logging in as a commenter, then submit your comment, etc.
Other sources mentioned: Nature Blogs, Bloglines, ResearchBlogging.org, CiteULike, Connotea
John Wilbanks – Keynote
The number of Creative Commons-licensed objects has passed the point where it can be counted. The goal is to spark generative science, in Jonathan Zittrain’s sense: the “capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences”. Generative is not the same as powerful: something could be very powerful (a radio telescope) but difficult to use and costly, so it might not be as generative as something less powerful but more widely used or extensible. For more, see the video.