Posted by Rory on November 14th, 2011 @ 12:25 pm
The Encyclopedia of Research: an exciting concept
The Encyclopedia of Original Research (EOR) is being proposed by Daniel Mietchen as a GitHub open repository of scientific articles that evolve along with the topic they cover through a series of reviews. You can read more about this concept/project, and the philosophy behind it, here and here. And Daniel has just written a post called How would you fund research: An Open Science Perspective, where he explores among other things the relationship between EOR and various open funding models, including the SciFund Challenge.
Daniel and I have had a couple of stimulating email exchanges about EOR. Our exchanges covered various issues, such as how to incentivize people to contribute to and use EOR, and how EOR will distinguish itself from services in adjacent areas like Mendeley. I thought it would be interesting to build on those exchanges with a post focusing on a particular issue EOR will have to face, namely the interface between ‘evolving publications’ in EOR and the data that lies behind the publications.
How to deal with data used in the research leading to publications and reviews?
Will the data used in the original publication be presented along with the publication? If so, will this be optional, or a condition of having the publication included in EOR? If the data is presented, will a certain format be required and therefore necessarily supported? These issues will also arise in relation to reviews, because in some cases the reviewer may want to present their own data.
Data presentation: mandatory or optional?
Behind these simple questions lie a couple of difficult decisions or tradeoffs for EOR. A key objective of EOR is to encourage ongoing discussion of publications. This objective will be significantly enhanced by making the data used in the research that led to publications available to readers and reviewers. But, as we all know and many have pointed out, getting people to provide their data is difficult for a variety of reasons. Probably foremost among them is that the data is often (a) scattered among a variety of things like a paper lab notebook, spreadsheets, wikis, etc., and (b) in multiple formats, so that making it available in a single place or repository is not simple or straightforward. So, the first decision or tradeoff is between requiring data to be provided, thereby enhancing the quality of the project, and making that optional, thereby encouraging more people to submit publications and review them because it will be easier to do so without having to present the data.
The difficulty of developing an interface for data presentation
My understanding is that EOR is planning to make it possible to present data related to publications, and probably to reviews. So, EOR will face a secondary, technical issue: how to make that possible for contributors? That is, in what format(s) will it be possible to upload data, and what kind of interface will be presented to contributors to enable them to upload it? Again there is a tradeoff: a simple interface is easy to use, which should stimulate greater interest in contributing, but if the interface is too simple it won’t be able to accommodate the diverse kinds of data that people will need to include.
This isn’t just EOR’s problem
The issues EOR will face in determining how to interface with data used in research are of course not just EOR’s problem, and they are not issues that EOR will be able to ‘solve’ on its own. They reflect the broader fact that the interface between data and publication is (a) crucial, but (b) generally ignored by both data collection/sharing apps like electronic notebooks and publications-focussed apps and services.
As I discussed in last week’s post, data is currently collected in such diverse formats and structures — from paper notebooks to spreadsheets to wikis to blogs to electronic lab notebooks to note taking apps like Evernote to databases — that it is not yet possible to develop publication practices that facilitate anything close to convenient and comprehensive replication and verification of results by including data and code along with the publication.
Towards better presentation of data in EOR, and generally in connection with publications
For things to improve, in my view, there needs to be movement from both ends: from the ‘publications aggregators’ (both existing publishers and innovators like EOR) and from the data collection end, i.e. the tools, apps and services that scientists use to collect and organize their data (‘data aggregators’).
What publications aggregators can do
For publications aggregators, the first step is to acknowledge the importance of making data available along with publications in a useful and usable form. EOR, it appears, has taken this first step. The second step is to develop structures that facilitate easy inclusion of data alongside publications, in preparation for the day when data aggregators have made it easier for scientists to export their data from the data aggregation tools. Again, EOR seems to be interested in developing along these lines, and I would say that an opportunity exists for EOR to take the lead in this area.
What data aggregators can do
At their end, data aggregators need to make it far easier to export data from their tools and services. It is difficult and in many cases practically impossible to export data from most electronic lab notebooks, and even generic tools like Evernote do not support export very well. Google is better, and is serving as a model. Google has a Data Liberation Front team, whose goal is to make it easier to move data in and out of Google products. Earlier this year Google introduced Google Takeout, a service which lets you take data from multiple Google products at once. Only a few products, including Buzz, are included now, but the plan is to expand Google Takeout to other products going forward.
An offer to collaborate with EOR
In December Axiope will be releasing a new version of our electronic lab notebook and sample management system eCAT. With this new version, for the first time, export (other than exporting the XML, which is already possible) will be specifically supported. With the new version it will be possible to export from both the Notebook and the Inventory (i.e. sample management) sides of eCAT.
eCAT Notebook records will be exportable to ODF (Open Document Format, the format used by OpenOffice). We have chosen that format for two reasons. The first is that eCAT is platform-agnostic; it runs on Windows, OS X and Linux, and ODF documents can be opened on all three. The second is that ODF supports retention of links and embedded images, so after export the formatting will be retained.
It will be possible to export eCAT Inventory records to CSV. We have chosen to start with CSV because in our experience scientists like to see sample data in spreadsheets.
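To illustrate why CSV is a convenient starting point, here is a minimal Python sketch of how a CSV file of sample records could be read on the receiving end, for example by a service that wants to display the data next to a publication. The column names and sample values are invented for illustration; they are not eCAT's actual export layout, and the function name is hypothetical.

```python
import csv
import io

# Hypothetical sample of an inventory export. The column names are
# assumptions for illustration, not eCAT's actual export schema.
SAMPLE_EXPORT = """sample_id,name,location,quantity
S-001,Plasmid A,Freezer 2,3
S-002,Buffer B,Shelf 4,10
"""


def load_inventory(csv_text):
    """Parse CSV text into a list of dicts, one per sample record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # CSV values arrive as strings; convert the numeric column.
        row["quantity"] = int(row["quantity"])
        records.append(row)
    return records


records = load_inventory(SAMPLE_EXPORT)
for record in records:
    print(record["sample_id"], record["name"], record["quantity"])
```

Because CSV is plain text with one record per line, the same file opens directly in a spreadsheet and can also be read by a few lines of code like the above, which is part of what makes it a reasonable first export format.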
We see this initial export capability as a modest but important starting point for making data from eCAT ‘portable’. We plan a series of future releases with additional kinds of export capabilities. This means that over the same period that EOR is taking shape, export from eCAT will be developing. To us, this seems like a great opportunity to explore better publication/data interfaces from both the data aggregation end and the publications aggregation end. So, here is an open invitation to Daniel and others involved with EOR:
If you keep us advised on the data formats and methods for including data you think EOR will be needing, we will do our best to ensure that it’s possible to export data from eCAT in those formats and using those methods.