I am attending a neuro meeting at the fantastic Janelia Farm facilities to see how experts in electrophysiology, computer science and other fields decide on a common format for expressing recordings of neuronal activity and the surrounding experimental metadata.
The NWB mandate lays out a clear mission, a timeline and concrete steps:
- August 2014: Project Start.
- Phase 1: Identify use cases and evaluation criteria.
- Phase 2: Select/assemble most promising approaches and develop data format and test it.
- Phase 3: Test and fine-tune it.
- July 2015: Project ends.
Now, brace for impact. Here’s a small list of common e-phys file formats that were created by different labs:
For a more nuanced view of some of the main data formats, please have a look at the considerations for developing a standard for storing e-phys data in HDF5 and the NWB data and file formats summary.
Wouldn’t it be a massive win to choose a single data format and not fall into the traditional academic mantra that states: “different formats are good for different things”? Or even worse, create yet another competing standard?
How many of those formats are actually used in research publications? Which is the one seeing the most adoption in academic literature so far?
Why shouldn’t we just choose the top N formats with the most mindshare, for the greater good (reproducibility, data sharing, interoperability)?
Let’s see if we can find a fix for this e-phys Babel.
Several labs describe and present their custom e-phys formats. Most of them overlap substantially in attributes, structure and features. Although the specifications vary, the labs seem to revolve around HDF5, a hierarchical file format that stores all the attributes of the experiments, from images to timeseries, in varying degrees of complexity.
Interestingly, although this is an event-processing problem at its core, there are very few mentions of industry and open source event-processing frameworks:
Software developers in the room recommend exposing a strongly typed API that deals with the raw data attributes via an intermediate representation, instead of having to change the HDF5 container at every specification change or experimental novelty. This idea resonates quite well with the NEO format approach. An additional problem that arises with internal representations is keeping track of provenance, since encapsulation might hide the processing details that one needs in order to follow an experiment step by step.
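To make the intermediate representation idea a bit more concrete, here is a minimal sketch in plain Python. It is my own illustration, not the NEO or NWB API: the `Recording` class, its fields and the step names are all hypothetical. The point is only that a typed object can carry a provenance trail along with the data, so that encapsulation does not hide the processing steps.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Recording:
    """Hypothetical typed view of one recorded channel (not a NEO/NWB class)."""
    channel: str
    sampling_rate_hz: float
    samples: Tuple[float, ...]
    provenance: Tuple[str, ...] = ()  # ordered names of the steps applied so far

    def apply(self, step_name: str, func: Callable[[float], float]) -> "Recording":
        """Return a new Recording with func applied to every sample.

        The original object is left untouched and step_name is appended
        to the provenance trail of the returned copy.
        """
        new_samples = tuple(func(s) for s in self.samples)
        return replace(self, samples=new_samples,
                       provenance=self.provenance + (step_name,))


# Made-up sample values, purely for illustration.
raw = Recording(channel="ch0", sampling_rate_hz=30000.0, samples=(0.5, -1.0, 2.0))
scaled = raw.apply("scale_to_microvolts", lambda v: v * 1000.0)
rectified = scaled.apply("rectify", abs)

print(rectified.samples)
print(rectified.provenance)
```

Because every step returns a new immutable object, the raw recording stays available and the trail on `rectified` records which operations produced it, in order. How such a trail would be persisted inside an HDF5 container is a separate question that this sketch does not address.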
It seems to me that e-phys recording can be approached as a large scale logging problem, therefore:
- Using a framework that aggregates events at scale is crucial to guarantee a smooth and fast data analysis experience. That includes slicing by recording session or by any other criteria that the data (neuro)scientist decides on.
- Leaving the internal (intermediate) representation of data in point 1 untouched is the most convenient approach, especially since HDF5 does not play well with modern parallel frameworks.
- Exporting data from point 1 as HDF5 for sharing seems reasonable (to me, at least), given that it is the most popular container within this science niche.
- Writing importers/exporters (serializers) from Thunder to HDF5 seems like an interesting hackathon challenge. Adopting KWIK, already used by many, as a particular specification could be interesting w.r.t. interoperability.
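As a toy illustration of the logging view in the first point, here is a minimal sketch using only the Python standard library. It treats each spike as a log record and slices the stream by recording session. The `SpikeEvent` record, its fields and the sample values are assumptions of mine, and a real setup would delegate this to an event-processing framework rather than to in-memory dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SpikeEvent(NamedTuple):
    """Hypothetical log record for one detected spike."""
    session: str
    channel: int
    time_s: float


def group_by_session(events: Iterable[SpikeEvent]) -> Dict[str, List[SpikeEvent]]:
    """Slice the event stream by recording session, sorted by time within each."""
    grouped: Dict[str, List[SpikeEvent]] = defaultdict(list)
    for ev in events:
        grouped[ev.session].append(ev)
    for session_events in grouped.values():
        session_events.sort(key=lambda e: e.time_s)
    return dict(grouped)


def spike_counts(events: Iterable[SpikeEvent]) -> Dict[Tuple[str, int], int]:
    """Aggregate the stream into a spike count per (session, channel) pair."""
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for ev in events:
        counts[(ev.session, ev.channel)] += 1
    return dict(counts)


# Made-up events arriving out of order, as they might in a log.
events = [
    SpikeEvent("s1", 0, 0.20),
    SpikeEvent("s2", 1, 0.05),
    SpikeEvent("s1", 0, 0.10),
    SpikeEvent("s1", 1, 0.15),
]

by_session = group_by_session(events)
counts = spike_counts(events)

print([e.time_s for e in by_session["s1"]])
print(counts[("s1", 0)])
```

The grouping key is just a function of the record, so slicing by channel, by time window or by any other criterion is the same pattern with a different key. An HDF5 exporter would then only need to walk the grouped result, which keeps the container format out of the analysis path.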