• HDF5 and Spark do not play well with each other…

    … so what if we just use HDF5 for sharing (import/export) and Spark for the rest? That’s what Jeremy, Cyrille and I wondered while sitting in the Janelia Farm labs… well, actually in its bar with our laptops, but now you are in context ;)

    Please check the nbviewer-formatted notebook and our ongoing tests.
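    A minimal sketch of the division of labor we have in mind, assuming PySpark plus h5py and a hypothetical /ephys/traces dataset layout (the notebook above has the real workflow):

    import h5py
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="ephys-hdf5-bridge")

    # Import: read the shared HDF5 container once, on the driver...
    with h5py.File("recording.h5", "r") as f:
        traces = f["/ephys/traces"][:]    # one row per channel (assumed layout)

    # ...and let Spark handle the rest of the analysis in parallel.
    rdd = sc.parallelize(traces)
    means = rdd.map(lambda trace: float(np.mean(trace))).collect()

    # Export: write results back to HDF5 for sharing with non-Spark tools.
    with h5py.File("results.h5", "w") as f:
        f.create_dataset("/ephys/trace_means", data=means)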

  • Intro

    I am attending a neuro meeting at the fantastic Janelia Farm facilities to see how experts in electrophysiology and computer science, among others, agree on a common format to express recordings of neuronal activity and the surrounding experimental metadata.

    The mandate and outline of NWB set out a clear mission, timeline and concrete steps:

    1. August 2014: Project Start.
    2. Phase 1: Identify use cases and evaluation criteria.
    3. Phase 2: Select/assemble the most promising approaches and develop a data format.
    4. Phase 3: Test and fine-tune it.
    5. July 2015: Project ends.

    This post would not have been possible without the collaborative editing and many tweets that happened during the event.

    E-phys formats

    Now, brace for impact. Here’s a small list of common e-phys file formats that were created by different labs:

    KWIK, NIX, MEF, Svoboda lab, LBNL BRAIN, StorageBIT, ARF, NSDF, WFDB, epHDF, MTSF, NeXus, NDF, Brainliner, NeuroHDF, NEO, Neuroshare, Ovation, Neuralynx, EEGBase

    For a more nuanced view of some of the main data formats, please have a look at the considerations for developing a standard for storing e-phys data in HDF5 and the NWB data and file formats summary.

    Wouldn’t it be a massive win to choose a single data format and not fall into the traditional academic mantra that states: “different formats are good for different things”? Or, even worse, create yet another competing standard?

    How many of those formats are actually used in research publications? Which is the one seeing the most adoption in academic literature so far?

    Why shouldn’t we just choose the top N formats that have the most mindshare, for the greater good (reproducibility, data sharing, interoperability)?

    Is HDF5 really the best format to adopt given the difficulties encountered with modern parallel processing frameworks such as Spark or the neuro-oriented Thunder?

    Let’s see if we can find a fix for this e-phys Babel.

    The talkathon

    Several labs describe and present their custom e-phys formats. Most of them overlap substantially in attributes, structure and features. Specifications vary, but the labs seem to revolve around HDF5, a hierarchical file format that stores all the attributes of an experiment, from images to timeseries, in varying degrees of complexity.

    It is interesting to see how, for what is at its core an event-processing problem, there are very few mentions of industry and open-source event-processing frameworks.

    The exception is Jeremy Freeman, who demos a very interesting combination of Spark, Thunder and Lightning, but warns us that the community buy-in into HDF5 complicates things.

    Software developers in the room recommend exposing a strongly typed API that deals with the raw data attributes via an intermediate representation, instead of having to change the HDF5 container at every specification change or experimental novelty. This idea resonates quite well with the NEO format approach. An additional problem that arises with internal representations is keeping track of provenance, since encapsulation might hide processing details that are needed to follow an experiment step by step.
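    As a rough illustration of that idea, here is a minimal sketch of a typed intermediate representation in Python; all names and the container layout are hypothetical, and only the (de)serializers know about HDF5:

    from collections import namedtuple

    import h5py
    import numpy as np

    # The intermediate representation: a plain, typed record that analysis
    # code can rely on, independent of how the container is laid out.
    Recording = namedtuple("Recording", ["session_id", "sampling_rate_hz", "traces"])

    def from_hdf5(path):
        """Deserialize an HDF5 container into the intermediate representation."""
        with h5py.File(path, "r") as f:
            return Recording(
                session_id=f.attrs["session_id"],
                sampling_rate_hz=float(f.attrs["sampling_rate"]),
                traces=np.array(f["/traces"]),
            )

    def to_hdf5(recording, path):
        """Serialize the intermediate representation back to HDF5 for sharing."""
        with h5py.File(path, "w") as f:
            f.attrs["session_id"] = recording.session_id
            f.attrs["sampling_rate"] = recording.sampling_rate_hz
            f.create_dataset("/traces", data=recording.traces)

    When the container specification changes, only from_hdf5/to_hdf5 need updating; analysis code keeps talking to Recording. The provenance caveat above still applies: whatever the serializers hide should be logged somewhere.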

    Personal conclusions

    It seems to me that e-phys recording can be approached as a large-scale logging problem, therefore:

    1. Using a framework that aggregates events at scale is crucial to guarantee a smooth and fast data analysis experience, including slicing by recording session or any other criteria the (neuro)scientist decides on.
    2. Leaving the internal (intermediate) representation of data in point 1 untouched is the most convenient approach, especially since HDF5 does not play well with modern parallel frameworks.
    3. Exporting data from point 1 as HDF5 for sharing, given that it is the most popular container within this science niche, seems reasonable (to me, at least).
    4. Writing importers/exporters (serializers) from Thunder to HDF5 seems like an interesting hackathon challenge. Adopting KWIK, already used by many, as a particular specification could be interesting with regard to interoperability.

    Comments? Suggestions?

  • Software carpentry and learning

    I am going through the 11th iteration of the Software Carpentry course for instructors, and I am quite happy about the way Greg Wilson conducts his bi-weekly calls and focuses on evidence-based learning techniques.

    At first, my expectations for this course were simply to learn how to teach specific Software Carpentry lessons, that is, traditional master classes on software engineering, tools and accompanying cookbooks.

    During the first session I quickly realized that his approach goes far beyond that.

    Greg is attacking the very roots of the learning process in order to provide a solid teaching base, ultimately offering robust, research-based, (scientific) software literacy to the world.

    How motivation works

    Going through chapter 3 of the How Learning Works ebook, from Software Carpentry’s recommended reading, there’s an illustrative account from a teacher about her first-day-of-class speech:

    So I told my students on the first day of class, “This is a very difficult course. You will need to work harder than you have ever worked in a course and still a third of you will not pass”

    I heard that kind of speech many times myself, and somewhat learned to ignore it during my university days. Perhaps unsurprisingly, those words were followed by unintended consequences:

    But to my surprise, they slacked off even more than in previous semesters (…) their test performance was the worst it had been for many semesters.

    Using several research papers as a foundation to understand such situations, How Learning Works concludes that:

    Limited chances of passing may fuel preexisting negative perceptions about the course, compromise her students’ expectations for success, and undermine their motivation to do the work necessary to succeed.

    Again, I’ve seen this happen time after time in different academic contexts over the years.

    This week’s assignment

    So in this week’s course session, Greg asked us to describe a personal experience during our education where we saw a similar situation.

    Sometime during my high school years, Iñaki, my physics teacher, said something like:

    If you don’t put more effort on physics, I think you should consider fixing motorbikes on FP instead.

    FP (Formación Profesional) is the Spanish educational route for those who want to skip university and pursue a more applied professional degree. FP is often mocked by Spanish society (some teachers too) and regarded as a “lower paid” and “less honorable” means of earning a living than more traditional academic degrees.

    This is a myth, since there are many FP professionals who make a very decent living out of roles such as DBA (in Spanish). Germans know that very well and hire accordingly.

    I believe Iñaki wanted me to succeed in physics and meant well, targeting my self-esteem as a way to push me harder. Looking back, I see that it was a bad move that effectively de-motivated me. Although I did pass, I did not enjoy the subject as I should have, didn’t learn it as thoroughly and, therefore, didn’t earn higher scores.

    I tend to obsess over topics I like. Curiosity keeps me awake at night; it’s like an unconscious autopilot that drives me towards higher understanding. As I discover and dig deeper into subjects I want to learn more about, I completely lose track of time. In my experience, frictionless, smooth learning almost invariably results in well-crystallized knowledge and high scores.

    Later on, while working on my computational biology master’s degree in Sweden, I re-discovered (bio)physics while diving into the incredible world of ion channels and biomedicine, in a very different learning environment.

    I loved it.

  • I am at the INCF neuroinformatics congress in Leiden and I am co-organizing an in-congress hackathon. An Etherpad is used to announce and coordinate tasks.

    Participants

    The following participants walked into the room during the three days it was open and introduced themselves:

    Shreejoy Tripathy, University of British Columbia
    Richard Gerkin, Arizona State University
    Zameer Razack, Bournemouth University
    Tristan Glatard, McGill University, CBRAIN
    Joris Slob, Leiden University
    Barry Wark, Physion LLC
    Stephen Larson, OpenWorm
    Jeff Muller, EPFL-Blue Brain Project
    Roman Valls Guimera, INCF
    

    At first we started with very few participants, but there was a stream of people over the days, coming and going; some exchanged knowledge, others were just curious about the concept of a hackathon. Local researchers such as Joris Slob, who works with ontologies and neuroimaging, were highly welcome: growing the neuroinformatics hackathon clique is always a good idea.

    A special mention goes to Barry Wark, Physion’s CEO, who kindly sponsored this hackathon and briefly introduced us to Ovation, a great scientific data management system that talks to many commercial cloud backends. Thanks Barry!

    Hands on work

    A basic hackathon principle is learning by doing. An early example of that started during the first minutes of the event with the collaborative integration efforts of Barry Wark and Christian from G-node. After a brief discussion, they both started coding Java bindings for NIX, an HDF5-based neuroscience exchange format. They used SWIG for those bindings, so potentially other programming languages could talk to NIX as well.

    This is also a very clear, mutually beneficial, hands-on example of collaboration between industry (Ovation) and research institutions (G-node).

    Another interesting initiative that surfaced during the event was a possible integration between the CBRAIN workflow system and the INCF NIDASH neuroinformatics data model and reference implementation. Tristan Glatard and I went through NIDASH’s abundant IPython notebooks and documentation and, after understanding the framework, proposed a quickstart for developers, which is coming soon.

    The specific outcome would be an “export provenance information” functionality to use when publishing or sharing neuroinformatics research.
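    To give a taste of what that could look like, here is a hedged sketch using the Python prov package; the entity and activity names are made up for illustration:

    from prov.model import ProvDocument

    # Hypothetical provenance trail for a published analysis.
    doc = ProvDocument()
    doc.add_namespace("ex", "http://example.org/neuro#")

    raw = doc.entity("ex:raw-recording")
    counts = doc.entity("ex:spike-counts")
    sorting = doc.activity("ex:spike-sorting-run")

    doc.used(sorting, raw)                # the analysis consumed the raw data
    doc.wasGeneratedBy(counts, sorting)   # and produced the derived dataset

    # Ship this document alongside the publication so every step can be traced.
    print(doc.serialize())                # PROV-JSON by default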

    Meanwhile, Stephen Larson from OpenWorm got two INCF GSoC slots in the 2014 edition. His particular mission was to improve packaging for the PyOpenWorm alpha0.5 version produced by the INCF OpenWorm GSoC student project, getting it ready to merge into master.

    Stephen got to hear about the sorry state of packaging in Python, but he took advantage of the hackathon time to fix and publish the GSoC outcomes for easier public consumption.
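    For the curious, publishing such a package mostly boils down to a setup.py along these lines; this is a sketch, and the metadata and dependency shown are illustrative rather than PyOpenWorm’s actual values:

    # Minimal setup.py sketch for pushing a research package to PyPI.
    from setuptools import setup, find_packages

    setup(
        name="PyOpenWorm",
        version="0.5a0",                      # alpha versioning per PEP 440
        packages=find_packages(),
        install_requires=["rdflib"],          # illustrative dependency
        author="OpenWorm contributors",
        description="Data access layer for the OpenWorm project",
    )

    Followed by a python setup.py sdist and an upload to PyPI.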

    Those in neurophysiology will love to hear that the NeuroElectro API documentation was improved by Rick Gerkin together with Shreejoy Tripathy. It is interesting to see how much electrophysiology data can be extracted from the literature, and all the better if it can be queried via an API!
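    A hedged sketch of querying it from Python; the endpoint path and the Tastypie-style response layout are from memory, so treat them as assumptions:

    import requests

    # Fetch a handful of neuron types from the NeuroElectro REST API.
    resp = requests.get("http://neuroelectro.org/api/1/n/", params={"limit": 5})
    resp.raise_for_status()

    for neuron in resp.json()["objects"]:   # Tastypie-style "objects" list (assumed)
        print(neuron["name"])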

    On my side, I simplified the nipype (a bummer, since it had been fixed already) and pymvpa2 installation processes, and revived interest in OpenTracker, a BitTorrent tracker that could potentially be used as a more scalable data-sharing backend for *-stars scientific Q&A sites.

    If BitTorrent has had so much success in file sharing, why shouldn’t the same happen with scientific datasets? Let’s lower the barrier for this to happen!

    DIY hackathon

    As Stephen Larson from OpenWorm put it very well, anticipating the upcoming CRCNS hackathon:

    • Define the time and goals you have in mind for the hackathon in advance, and write them up.
    • Target your participants in advance and ask them to provide at least one-liners on who they are and what they do, and help you collaboratively make an ‘ideas list’. Good platforms for this are wikis (maybe a GitHub wiki), a Google Doc that is world-editable, or, as Roman used recently, Etherpad.
    • Lately I’ve been seeing folks use real-time chat more. Consider opening up an IRC chat room; I’ve also seen people liking HipChat, Slack or Google Hangouts (chat) for this.
    • When the time comes, lead by example during the session by being there the whole time and driving the process. Maybe open up an on-air Google Hangout if there is an in-person component, and have people watch you hacking while interacting via chat or other collaborative media.
    • During the event, try to collect a list of things that “got done”. This will be the meat you will use to demonstrate the effectiveness of the session to others who ask you to justify its existence :)

    Along those lines, I’m looking at you, INCF Belgian node, who will hold such an event very soon.

    Another important hint for such events is, in my opinion, to minimize the number of “speakers” and the time allocated to them. Doing should be prioritized over delivering 30-minute presentations, effectively moving away from traditional scientific congress formats.

    Conclusions

    Hackathons (or codefests) are meant to get things done or polished, and to spark new ideas and interesting collaborations.

    This all happened with a relatively small number of participants, but outcomes usually grow (up to a point) the more engaged participants show up and work. See BrainHack as an excellent example of this.

    INCF realized that hackathons should be treated as dedicated events, free from other (highly interesting) congress distractions, and will continue to support them via its Hackathon Series program.

    A chained stream of science hackathons, such as the #mozsprint celebrated a few months ago, helps push tools, refinements and integrations forward. That is, standardized and interoperable neuroinformatics research, in line with INCF’s core mission.

    More neuro-task proposals can be found over at NeuroStars. Pick yours for the next hackathon ;)

  • A dashboard with pretty plots and numbers about your organization… It shouldn’t be that difficult, right?

    Luckily, ElasticSearch + Logstash + Kibana turns out to be a nice stack for solving this problem. Here’s my result and how I solved some integration hiccups. The resulting beta dashboard, as it stands now, looks like this:

    DataSpace beta operations panel

    Many, many other blog posts have addressed this topic, but I would like to share a couple of tweaks I came up with while working on it.

    Logstash and iRODS custom log parsing

    Broadly speaking, I found the following issues when setting up the dashboard:

    1. Index design: Logstash’s default index naming (logstash-YYYYMMDD, one index per day) brought down a 30GB RAM machine.
    2. The year missing from log timestamps, which I recover from the log filenames instead (sketched below).
    3. GeoIP coordinates not being parsed by Kibana’s bettermap (also sketched below).
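    Since the original gist is not embedded here, the following is a hedged reconstruction of the relevant Logstash filter pieces for points 2 and 3; the grok patterns and field names are illustrative, not the exact production config:

    filter {
      grok {
        # iRODS log lines lack the year; grab month/day/time from the message.
        # A real pattern would also extract fields such as clientip (assumed below).
        match => [ "message", "%{MONTH:month} %{MONTHDAY:day} %{TIME:time} %{GREEDYDATA:msg}" ]
      }
      grok {
        # Recover the year from the log filename, e.g. rodsLog.2014.02.03.
        match => [ "path", "rodsLog\.%{YEAR:year}\." ]
      }
      mutate {
        add_field => [ "timestamp", "%{year} %{month} %{day} %{time}" ]
      }
      date {
        match => [ "timestamp", "yyyy MMM dd HH:mm:ss" ]
      }
      geoip {
        source => "clientip"
        # Kibana's bettermap wants a GeoJSON-style [lon, lat] float array.
        add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
        add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
      }
      mutate {
        convert => [ "[geoip][coordinates]", "float" ]
      }
    }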

    ElasticSearch tweaks

    As mentioned, the default “daily” indexing scheme in Logstash did not work for my purposes: when monitoring ElasticSearch with the head plugin, the status went red while ingesting events from logs. Thanks to @zackarytong’s and others’ feedback, I managed to address the issue.
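    The gist of the fix, if memory serves, was coarser index rotation: fewer, larger indexes mean far fewer open shards for the same retention. On the Logstash output side that boils down to something like this sketch (hostname and values are illustrative):

    output {
      elasticsearch {
        host => "localhost"
        # Monthly rotation instead of the default daily logstash-%{+YYYY.MM.dd}.
        index => "logstash-%{+YYYY.MM}"
      }
    }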

    Then, Kibana could not connect to the ElasticSearch backend when large time ranges were defined. After some Chrome developer tools work and cURLing, I reported the issue: too many indexes were being URL-encoded into the request line, which required setting the following ElasticSearch directive:

    http.max_initial_line_length: 64kb

    In the future, I might consider further indexing optimizations by using the ElasticSearch index curator tool to optimize and cut down index usage, but for now I want to keep all data accessible from the dashboard.

    Kibana

    The presentation side of this hipster dashboard stack also had some quirks. My original idea of showing three separate Kibana bettermaps, one per AWS region, had to wait until another issue was addressed very recently. Both the Chrome developer console and AngularJS Batarang were very useful for finding issues in Kibana and its interactions with the ElasticSearch backend.

    Speaking of Amazon and its regions: while AWS has exposed a great amount of functionality through its developer APIs, at the time of writing there are missing API endpoints, such as billing information across all regions (only available in us-east-1) and fetching the remaining credits from your EDU Amazon grant, should you have one. The only viable option left today is scraping it:
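    In spirit, the scraper does something like the sketch below. The real sign-in flow is more involved (CSRF tokens, JavaScript), and every URL, form field and selector here is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    # Placeholder sign-in; the real AWS portal flow needs more than this.
    session.post("https://portal.aws.amazon.com/billing/signin", data={
        "email": "ops@example.org",
        "password": "secret",
    })

    page = session.get("https://portal.aws.amazon.com/billing/home#/credits")
    soup = BeautifulSoup(page.text, "html.parser")

    # Placeholder selector for the remaining-credits figure.
    balance = soup.find("span", class_="credit-balance").get_text()
    print(balance)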

    If one wants to keep track of AWS expenses in ElasticSearch, some specific index type mapping changes are needed on the ElasticSearch side:

    curl -XPUT 'http://ids-panel.incf.net:9200/aws-2014.02/credits/_mapping' -d '{
      "credits": {
        "properties": {
          "aws_credit_balance": {
            "type": "integer"
          }
        }
      }
    }'
    

    In a larger-scale business setting, Netflix’s Ice could definitely be a good fit.

    Conclusions and future improvements

    It has been a fun project to collect data about my organization’s services and expose it as clearly as possible and in real time; it feels like something I would like to do for a living as a consultant some day. New insights coming from the dashboard have already allowed us to downscale some resources, ultimately saving money.

    The feedback from INCF people has been invaluable for rethinking how some data is presented and what it all means. Visualization is hard to get right: bring in third parties and users, and consider their feedback.

    My next iteration on this project is to get finer detail on which activities users are performing (data being shared, copied, downloaded). This could be achieved with some custom iRODS microservices for ElasticSearch, or by evaluating other EU-funded projects on the topic, such as Cheshire3.

    When it comes to who can access the dashboard, there’s a recent blog post on multi-faceted authentication, a.k.a. showing different views of the dashboard to different audiences. I’ve already tried Kibana’s authentication proxy, which supports OAuth among other auth systems, but there are a few rough edges left to polish.

    On the Logstash backend, it might be worth grepping the iRODS codebase for log stanzas to assess which log events are worth parsing and to extract good semantic tokens from them. Luckily, ElasticSearch being backed by Lucene’s full-text engine helps a lot in avoiding this tedious task: Kibana/ElasticSearch search and filtering are excellent.

    Last but not least, some remaining issues leading to total world domination include:

    1. Instrumenting all your organization’s Python code with Logbook, sinking log records to a Redis exchange (see the sketch after this list).
    2. Easily adding/including other types of panels in Kibana, perhaps allowing better or more explicit integration of D3, mpld3 or BokehJS with Kibana.
    3. Getting unique counts for records in ElasticSearch (i.e., counting unique numbers of IPs, users, etc…), which is on the roadmap under aggregations, so it is coming soon :) (also sketched below).
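    For point 1, a hedged sketch of shipping Logbook records to Redis; the RedisHandler parameters are from memory, so treat them as assumptions:

    from logbook import Logger
    from logbook.queues import RedisHandler

    # Push log records onto a Redis list that Logstash can consume from.
    handler = RedisHandler(host="localhost", port=6379, key="logstash")
    log = Logger("dataspace")

    with handler.applicationbound():
        log.info("dataset shared by user roman")

    And for point 3, once aggregations land, a unique-IP count should look something like this (based on the proposed cardinality aggregation, so the syntax may still change):

    curl -XGET 'http://ids-panel.incf.net:9200/logstash-2014.02/_search' -d '{
      "size": 0,
      "aggs": {
        "unique_ips": {
          "cardinality": { "field": "clientip" }
        }
      }
    }'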