Informative guide to electronic lab notebooks from the University of Utah

Posted by Rory on November 28th, 2011 @ 6:13 pm

I recently came across a useful guide to electronic lab notebooks on the University of Utah website (http://campusguides.lib.utah.edu/content.php?pid=126157&sid=2131670).

Two things interested me about the guide.

Categorization of electronic lab notebooks

The first is the breakdown of electronic lab notebooks into three categories (each entry is further categorized as being focused on biology, chemistry, general use, etc.):

  1. Vendor ELNs (36 entries, which came from Atrium Research)
  2. Open source ELNs (7 entries)
  3. General note taking and management (14 entries)

This strikes me as a sensible breakdown which accurately captures the three different kinds of electronic lab notebooks scientists are likely to come across.

ELNs have two layers

The second is the characterization of ELNs as having two layers. The first is a Calculations and Data Manipulation layer where researchers access and work with data.

The second is an ‘IP protection or people layer’ where researchers collaborate around the research data.

I found this characterization to be quite intriguing. First you engage with the data, then you engage with people (collaborators, your PI, etc.) about the data. That’s quite a neat way of thinking about ELNs, and it’s simply put and presented.

Thanks and well done to the University of Utah for making such a useful, and accessible, overview of electronic lab notebooks available!

 

The Encyclopedia of Original Research and the data/publication problem

Posted by Rory on November 14th, 2011 @ 12:25 pm

The Encyclopedia of Original Research: an exciting concept

The Encyclopedia of Original Research (EOR) is being proposed by Daniel Mietchen as a GitHub open repository of scientific articles that evolve along with the topic they cover through a series of reviews. You can read more about this concept/project, and the philosophy behind it, here and here.  And Daniel has just written a post called How would you fund research:  An Open Science Perspective, where he explores among other things the relationship between EOR and various open funding models, including the SciFund Challenge.

Daniel and I have had a couple of stimulating email exchanges about EOR.  Our exchanges covered various issues, such as how to incentivize people to contribute to and use EOR, and how EOR will distinguish itself from services in adjacent areas like Mendeley. I thought it would be interesting to build on those exchanges with a post focusing on a particular issue EOR will have to face, namely the interface between ‘evolving publications’ in EOR and the data that lies behind the publications.

How to deal with data used in the research leading to publications and reviews?

Will the data used in the original publication be presented along with the publication? If so, will this be optional, or a condition of having the publication included in EOR?  If the data is presented, will a certain format be required and therefore necessarily supported?  These issues will also arise in relation to reviews, because in some cases the reviewer may want to present their own data.

Data presentation: mandatory or optional?

Behind these simple questions lie a couple of difficult decisions or tradeoffs for EOR. A key objective of EOR is to encourage ongoing discussion of publications. This objective will be significantly enhanced by making the data used in the research that led to publications available for readers and reviewers. But, as we all know and many have pointed out, getting people to provide their data is difficult for a variety of reasons. Probably foremost among them is that the data is often (a) scattered among a variety of things like a paper lab notebook, spreadsheets, wikis, etc., and (b) in multiple formats, so that making it available in a single place or repository is not simple or straightforward. So, the first decision or tradeoff is between requiring data to be provided, thereby enhancing the quality of the project, and making that optional, thereby encouraging more people to submit publications and review them because it will be easier to do that without having to present the data.

The difficulty of developing an interface for data presentation

My understanding is that EOR is planning to make it possible to present data related to publications, and probably to reviews.  So, EOR will face a secondary, technical issue: how to make that possible for contributors?  I.e. in what format(s) will it be possible to upload data, and what kind of interface will be presented to contributors to enable them to upload data?  Again there is a tradeoff:  a simple interface is easy to use, which should stimulate greater interest in contributing, but if the interface is too simple it won’t be able to accommodate the diverse kinds of data that people will need to include.

This isn’t just EOR’s problem

The issues EOR will face in determining how to interface with data used in research are of course not just EOR’s problem, and they are not issues that EOR will be able to ‘solve’ on its own. They reflect the broader fact that the interface between data and publication is (a) crucial, but (b) generally ignored by both data collection/sharing apps like electronic notebooks, and publications-focussed apps and services.

As I discussed in last week’s post, data is currently collected in such diverse formats and structures — from paper notebooks to spreadsheets to wikis to blogs to electronic lab notebooks to note taking apps like Evernote to databases — that it is not yet possible to develop publication practices that facilitate anything close to convenient and comprehensive replication and verification of results by including data and code along with the publication.

Towards better presentation of data in EOR, and generally in connection with publications

For things to improve, in my view, there needs to be movement from both ends: from the ‘publications aggregators’ (both existing publishers and innovators like EOR) and from the ‘data aggregators’, i.e. the tools, apps and services that scientists use to collect and organize their data.

What publications aggregators can do

For publications aggregators, the first step is to acknowledge the importance of making data available along with publications in a useful and usable form.  EOR, it appears, has taken this first step.  The second step is to develop structures that facilitate easy inclusion of data alongside publications, in preparation for the day when data aggregators have made it easier for scientists to export their data from the data aggregation tools.  Again, EOR seems to be interested in developing along these lines, and I would say that an opportunity exists for EOR to take the lead in this area.

What data aggregators can do

At their end, data aggregators need to make it far easier to export data from their tools and services. It is difficult and in many cases practically impossible to export data from most electronic lab notebooks, and even generic tools like Evernote do not support export very well.  Google is better, and is serving as a model.  Google has a Data Liberation Front team, whose goal is to make it easier to move data in and out of Google products.  Earlier this year Google introduced Google Takeout, a service which lets you take data from multiple Google products at once. Only a few products, including Buzz, are included now, but the plan is to expand Google Takeout to other products going forward.

An offer to collaborate with EOR

In December Axiope will be releasing a new version of our electronic lab notebook and sample management system eCAT.  With this new version, for the first time, export (other than exporting the XML, which is already possible) will be specifically supported.  With the new version it will be possible to export from both the Notebook and the Inventory (i.e. sample management) sides of eCAT.

eCAT Notebook records will be exportable to ODF (Open Document Format, the format used by Open Office). We have chosen that format for two reasons. The first is that eCAT is platform-agnostic; it runs on Windows, OS X and Linux, and ODF documents can be opened on all three. The second is that ODF supports retention of links and embedded images, so after export the formatting will be retained.

It will be possible to export eCAT Inventory records to CSV.  We have chosen to start with CSV because in our experience scientists like to see sample data in spreadsheets.
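To give a rough sense of why CSV is a convenient starting point for anyone receiving the data, here is a minimal Python sketch. It assumes a hypothetical export; the column names (`sample_id`, `name`, `location`, `quantity`) and the helper functions are invented for illustration and are not eCAT’s actual export schema or API.

```python
import csv
import io

# Hypothetical inventory export. The column names are made up for this
# example and do not describe eCAT's real export format.
SAMPLE_EXPORT = """sample_id,name,location,quantity
S-001,Yeast strain A,Freezer 1,12
S-002,Yeast strain B,Freezer 1,3
S-003,Buffer stock,Shelf 4,7
"""


def summarize_inventory(csv_text):
    """Return the column names and the number of data rows in a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {"columns": reader.fieldnames, "row_count": len(rows)}


def find_samples(csv_text, column, term):
    """Return rows whose value in `column` contains `term`, ignoring case."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if term.lower() in row[column].lower()]


if __name__ == "__main__":
    print(summarize_inventory(SAMPLE_EXPORT))
    for row in find_samples(SAMPLE_EXPORT, "name", "yeast"):
        print(row["sample_id"], row["name"])
```

Because the export is plain text with a header row, a reader of a publication could inspect or search the sample data with a few lines of code or simply open it in a spreadsheet.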

We see this initial  export capability as a modest but important starting point for making data from eCAT ‘portable’. We plan a series of future releases with additional kinds of export capabilities.  This means that over the same period that EOR is taking shape, export from eCAT will be developing.  To us, this seems like a great opportunity to explore better publication/data interfaces from both the data aggregation end and the publications aggregation end.  So, here is an open invitation to Daniel and others involved with EOR:

If you keep us advised on the data formats and methods for including data you think EOR will be needing, we will do our best to ensure that it’s possible to export data from eCAT in those formats and using those methods.

Reproducibility of data and collaboration: Response to Victoria Stodden with two examples

Posted by Rory on November 9th, 2011 @ 11:24 am

The importance of reproducibility

In a post yesterday, “Disrupt science, but not how you’d think”, Victoria Stodden writes, “I am not necessarily in favor of greater openness during the process of scientific collaboration. I am however necessarily in favor of openness in communication of the discoveries at the time of publication.” To enable this, she goes on to argue, “we need to establish the routine inclusion of the data and code with publication, such that the results can be conveniently replicated and verified.”

Two examples of making data available using currently available tools

This seems to me to be a very important point, and one that few would dispute.  But how to make reproducibility happen is not so obvious.  As the following two examples make clear, even when the will is there, the tools do not yet exist to make reproducibility convenient and therefore widespread.

Example 1:  Append an entire paper lab notebook to the publication

Gregory I. Lang and David Botstein recently published a paper (A Test of the Coordinated Expression Hypothesis for the Origin and Maintenance of the GAL Cluster in Yeast) to which an entire 101-page lab book containing all the notes, strain construction, methods and raw data that went into producing the paper was appended as supplementary data. I was so struck by this that I wrote a post about it, pointing out that the authors would have (a) saved themselves a lot of time and (b) made it easier for themselves and others to make use of the data they generated if they had recorded their data electronically.

So how does the inclusion of Greg Lang’s paper notebook stand up to the reproducibility test? It gets top marks for openness; all the experiments, results, materials used, observations, questions asked, thoughts, etc., the ‘actual scholarship’ in the words of Victoria Stodden’s thesis advisor, are there for all to see. But, and it is a big but, notwithstanding the openness this scholarship is actually pretty inaccessible in practical terms. It’s impossible to search on key terms, there is no linking, internally or with external sources of information, and all the benefits of electronic recording are absent. It would take a huge effort to plough through the notebook, understand how it all fits together, and pick out relevant bits, and the magnitude of this effort surely acts as a major barrier to anyone trying to reproduce the research.

Example 2: Document and publish online everything that happens in the lab as it happens

The Roberts lab at the University of Washington focuses on characterizing physiological responses of marine organisms to environmental change.   The lab has adopted a highly innovative way of organizing and communicating its research.  The lab uses a wiki as the home for its research activities and results.  Protocols, the lab calendar, and image and data repositories are housed in the wiki. That in itself is not particularly novel.  What is more interesting is that the wiki is also home to each individual lab member’s online lab notebook.

And this organizational platform is just the starting point for the more radical part of the lab’s innovation, which is the way it presents and communicates its research, in real time. The lab uses Facebook, Tumblr, YouTube, and Flickr to post developments related to the lab’s research and activities.

The lab members also use their Twitter account, @genefish, to push out an auto-feed of all or virtually all notebook entries, blog posts, calendar modifications, photo uploads, etc. @genefish currently has more than 13,000 tweets! The core of @genefish is the tweets about the research being documented in each individual’s lab notebook. Each notebook + the tweets = a step by step account of each lab member’s research as it takes place, and since all the notebooks are included in the lab’s wiki, the lab notebooks + the tweets = a complete record of the collective research in the lab as it takes place.

Like Greg Lang, the Roberts lab gets high marks for openness. Indeed they have extended the openness to include (a) collaboration between lab members, and (b) open publication of the research as it happens. It’s hard to imagine a more thorough, or well-organized, example of open science than this. And in terms of adoption of available online resources (Twitter, Facebook, etc.), the Roberts lab is on the bleeding edge.

What about reproducibility?  Everything (absolutely everything, it seems!) is there for others to see and access electronically, and there is a record of the process as well as of the results.  These are big plusses in terms of facilitating reproducibility.  There are, however, also some negatives, relating to information overload. There is so much information there that it could be difficult for someone wanting to reproduce the research to zero in on the important and relevant bits.  Another aspect of this is the use of so many different platforms to capture and communicate information — the wiki, Twitter, Facebook, etc.  Although these are all electronic, searching for a term or a key component of the research would probably be even more difficult here than with the paper lab book.

Even with the will, the way is not there yet . . . but it’s coming

Victoria Stodden concludes her post by saying   “. . . It is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code . . .”  The two examples discussed above demonstrate that currently available tools are nowhere close to being convenient enough or sufficiently fit for purpose to support the development of the kinds of publication practices Victoria Stodden would like to see.   On a brighter note, the examples also reflect the widespread interest in making reproducibility possible, and the wide range of experimentation going on.