INCF and the quest for global data sharing in neuroscience

Disclaimer: These are my opinions only and not my employer’s, etc, etc…

Today it has been two months since I joined the International Neuroinformatics Coordinating Facility (INCF), located on the Karolinska Institutet campus in Stockholm. Coincidentally, I happened to land the job in the midst of a neuroinformatics conference:

INCF Neuroinformatics congress 2013

Before that, I spent almost 3 years in another field of (data) science: genomics. I think I’m extremely lucky to be involved in these two different cutting-edge computational life sciences disciplines, so rich and diverse at the science level and yet pretty similar in their infrastructure needs: more storage, more processing and more standards.

Also today, I got to answer a series of seemingly routine questions prior to attending a workshop (EUDAT). While I was writing, I realized that I was drawing a nice portrait of today’s data science wins and struggles, be it in genomics, neuroscience or any other data-hungry science I might encounter during my career.

Suddenly I couldn’t resist sharing my braindump. I hope you enjoy the read :)

Who are your major scientific user communities, and how do they use the technology in concrete applications?

According to the “About” section of the official INCF website:

INCF develops and maintains databases and computational infrastructure for neuroscientists. Software tools and standards for the international neuroinformatics community are being developed through the INCF Programs (…)

So INCF’s purpose is to be the glia between several neuroscience research groups around the world, a point Nature also made recently when covering the upcoming Human Brain Project.

Dataspace, DAI and NeuroLex are outcomes of INCF initiatives that have been supporting the community since 2005.

INCF mug

Other interesting community projects that assist neuroinformatics scientists in running and keeping track of experiments are Sumatra, BrainImagingPipelines and a W3C-PROV framework for data provenance. Make sure you open browser tabs on those links; they are definitely worth checking out!

While those tools are pretty much domain-specific, they bear some resemblance to their counterparts in genomics pipelines, for instance.
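
For a taste of what such provenance tracking captures, here is a minimal sketch of a W3C PROV-style record for a single analysis step, written as plain Python dictionaries in the PROV-JSON spirit. The entity, activity and agent names are invented for the example; tools like Sumatra record this kind of information for you automatically.

```python
# Minimal sketch of a W3C PROV-style provenance record for one analysis step.
# Names (ex:bold_run1, ex:motion_correction, ...) are invented for illustration;
# a real tool or PROV library would generate and serialize this for you.
import json
from datetime import datetime, timezone

provenance = {
    "entity": {
        "ex:bold_run1.nii.gz": {"prov:label": "raw BOLD scan, subject 01, run 1"},
        "ex:bold_run1_mc.nii.gz": {"prov:label": "motion-corrected BOLD scan"},
    },
    "activity": {
        "ex:motion_correction": {
            "prov:startTime": datetime.now(timezone.utc).isoformat(),
            "ex:command": "mcflirt -in bold_run1.nii.gz",  # example command line
        },
    },
    "agent": {
        "ex:researcher": {"prov:type": "prov:Person"},
    },
    # Relations: the corrected scan was generated by the activity, which
    # used the raw scan and was associated with the researcher.
    "used": {"_:u1": {"prov:activity": "ex:motion_correction",
                      "prov:entity": "ex:bold_run1.nii.gz"}},
    "wasGeneratedBy": {"_:g1": {"prov:entity": "ex:bold_run1_mc.nii.gz",
                                "prov:activity": "ex:motion_correction"}},
    "wasAssociatedWith": {"_:a1": {"prov:activity": "ex:motion_correction",
                                   "prov:agent": "ex:researcher"}},
}

print(json.dumps(provenance, indent=2))
```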

What do you think are the most important interoperability aspects of workflows in scientific data infrastructures?

In today’s academic HPC facilities there are no standards for “computation units”. Most software is installed and maintained manually in non-portable ways. Without canonical software installation procedures, the ideal of moving computation to where the data lives will be increasingly difficult to attain.

It’s striking to see such common architectural needs across research fields, and yet so little funding incentive to ensure standard packaging of software and coherent data schemas.

A notable example of a data standard in genomics, as Francesco put it, is the SAM/BAM file format. The specification is barely 15 pages long, straight to the point and enough to attract people into using it as the de facto standard for sequence alignments. Even if it varies slightly between versions and allows custom tags, it seems it is here to stay.
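
To give an idea of how approachable the format is, here is a rough sketch that splits a single SAM alignment line into its eleven mandatory tab-separated columns plus the optional TAG:TYPE:VALUE fields. The record is made up, and real code would normally go through a library such as pysam.

```python
# Rough sketch: parse one SAM alignment line into its 11 mandatory fields
# plus optional TAG:TYPE:VALUE fields. The record below is invented for
# illustration; production code would normally use a library such as pysam.
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
             "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, fields[:11]))
    # Everything after the 11th column is an optional tag, e.g. NM:i:0
    record["TAGS"] = dict(
        (tag, (typ, value))
        for tag, typ, value in (f.split(":", 2) for f in fields[11:])
    )
    return record

example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
print(parse_sam_line(example))
```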

Similarly, the OpenFMRI project sets an example worth following for the neuroscience community. With a single figure (see below), they capture the essence of structuring brain scans in a filesystem hierarchy:

OpenFMRI standard scheme
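
To make that hierarchy tangible, below is a hedged sketch that lays down a minimal OpenFMRI-style skeleton (dataset, subject folders, anatomy and BOLD run subfolders). The folder names are simplified from the figure above, not an authoritative spec.

```python
# Hedged sketch: create a minimal OpenFMRI-style directory skeleton for one
# dataset. Folder names follow the spirit of the figure above (dataset /
# subject / anatomy + BOLD runs) and are simplified, not an official spec.
from pathlib import Path

def make_openfmri_skeleton(root, n_subjects=2, n_runs=2):
    root = Path(root)
    for sub in range(1, n_subjects + 1):
        sub_dir = root / f"sub{sub:03d}"
        (sub_dir / "anatomy").mkdir(parents=True, exist_ok=True)
        for run in range(1, n_runs + 1):
            (sub_dir / "BOLD" / f"task001_run{run:03d}").mkdir(
                parents=True, exist_ok=True)
    return sorted(p.relative_to(root) for p in root.rglob("*"))

for path in make_openfmri_skeleton("ds000"):
    print(path)
```
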
So, important interoperability aspects would be, among others:

  1. To give incentives to portable software packaging efforts.
  2. To publish data and papers early, even before official acceptance.

The first requires skilled people willing and able to dedicate effort to paving the way towards solid, repeatable software installations.

An example of realizing in time how important it is to package software comes from the computer security world, with efforts like BackTrack and Kali Linux. Both are essentially the same pentesting Linux distribution, except that, as a result of a change in policy regarding software maintenance, Kali can now be ported to a flurry of different hardware platforms, which wasn’t so straightforward with BackTrack.

The second requires a paradigm shift when it comes to data sharing and publication in the life sciences.

Where do you see major barriers with respect to workflow support near the data storage?

More effort should go into creating an interoperable infrastructure via software and data encapsulation.

Willingness to steer some academic computation efforts towards HPC research computing, as opposed to throwing more manual maintenance work at it. The final aim is to be able to run experiments in different HPC facilities without changing default configurations significantly.

More specifically, dedicating some research effort to deploying and testing solutions like Docker and OpenStack, as opposed to sticking with siloed HPC installations.
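
As a simplified illustration of what that could look like, the sketch below runs a single pipeline step inside a Docker container with the data directory mounted in, so the same invocation could in principle be replayed at any facility that runs Docker. The image name and its command are placeholders, not a real published container.

```python
# Simplified sketch: run one pipeline step inside a Docker container with the
# data directory mounted in, so the step is reproducible across facilities.
# "example/aligner:1.0" and its "align" command are placeholders, not a real image.
import subprocess
from pathlib import Path

def run_step(data_dir, image="example/aligner:1.0"):
    data_dir = Path(data_dir).resolve()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data",          # mount the data into the container
        image,
        "align", "--input", "/data/raw", "--output", "/data/aligned",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_step("./experiment01")
```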

What are the most urgent needs of the scientific communities with respect to data processing?

To put it simply: prioritize standardization and incentivize incremental science as opposed to “novel” approaches that do not contribute significantly to the field (wheel reinvention).

Jokes aside, this happens all the time with genomics data formats, where new formats with major overlap in features keep appearing.

Let’s hope neuroscience learns from this and gets a better grip on community consensus over the years.