- One day of Python fundamentals with a best-practices twist (TDD).
- A second day with a stronger focus on biological data analysis.
Organizing the workshop
Subject: Any interest in putting together a workshop for Stockholm this summer?
After getting the green light from WWCRC, my current employer, it did not take long to bring Oxana Sachenkova onto the team and start planning the logistics, lessons and official PhD-level university credits, and to raise some money to support the event.
That is how the idea to do Carpentry with Software and Data came to fruition by the end of November 2015:
After innumerable emails, talks and commits, the event was in the works. The national Swedish bioinformatics communities BILS and WABI also supported us, and we would like to thank both organisations for their financial support.
Day 1: Software Carpentry
For day one, Olav ran some interactive Python console sessions showing what basic Python data structures and control mechanisms look like.
Following up, Radovan prepared excellent TDD lessons, inspired by three sources:
- The already mentioned Python Koans.
- Some ideas borrowed from Biopython's comprehensive test suite.
- A late addition from an upcoming SWC TDD lesson, released just a few days before our workshop.
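As a rough illustration of the test-first style those lessons teach, here is a minimal sketch. It is not taken from the workshop material: the `gc_content` function and its test are hypothetical examples made up for this post. The test is written first, and the function is then filled in until the test passes.

```python
# Step 1: write the test first. It describes the behaviour we want
# before any implementation exists.
def test_gc_content():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("atgc") == 0.5  # lower case input is accepted
    try:
        gc_content("")
    except ValueError:
        pass
    else:
        raise AssertionError("an empty sequence should raise ValueError")


# Step 2: write just enough code to make the test pass.
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Step 3: run the test; a test runner such as pytest would normally do this.
test_gc_content()
print("all tests passed")
```

Running the file before step 2 exists fails with a `NameError`, which is the "red" phase of the cycle; adding the function turns it "green".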
While the infamous installation problem is still an issue, students managed to follow the lessons; the typical Python installation problems were mostly solved by a proper installation of Miniconda.
The SWC installation tests mostly distracted students, since packages not used in the workshop were flagged as uninstalled/failed (e.g. EasyMercurial). In general, I perceived that students were overwhelmed by too much information in the default SWC guides and stopped following and reading the instructions early on.
We need more TL;DRs in software and data carpentry, perhaps starting with the workshop template.
Day 2: BioData, Jupyter Notebooks, Pandas and Machine Learning
The morning is dedicated to introducing students to pandas DataFrame operations using Ethan White's python-ecology dataset. Due to time constraints, the merging and concatenation of dataframes is not covered, only pointed out in the lesson. Now the students have enough knowledge to follow up on Oxana's gene expression dataset:
It comes with exercises for those students who want to earn Swedish university credits. After some glitches with Python 2 vs Python 3 Jupyter notebooks, students get to know how to analyze data from the FANTOM5 consortium.
After getting some expression heatmaps and good insight from Oxana, Ahmed KachKach, currently interning at Spotify AB's machine learning division, delights the audience with a detailed analysis of a toy dataset on breast cancer, using an extremely well-documented introduction to machine learning notebook.
In order to explain PCA graphically to students, Ahmed uses an excellent web visualization to illustrate how variable decomposition/projection works in PCA.
Right after that machine learning introduction, I show how one can run reproducible (and interactive!) notebooks via the mybinder.org service by exploring a small scikit-allel dataset. Furthermore, more visualization techniques are shown via my current explorations of HivePlots as an alternative way of visualizing structural genomic variations in cancer samples.
On top of that, I had a talk prepared about structural variations processed with bcbio, but in the interest of time, I saved it for another event :)
Last but not least, Mikael Huss goes through a fantastic notebook showing some gene expression prediction techniques and clever feature engineering from his current efforts at WABI.
Thoughts and comments
Planned ahead of time, this workshop was a sustained effort to bring instructors and people together, and I am glad it worked.
A surprising early realization of this workshop is how high the demand for such courses can be: only a few minutes after announcing the event, around 40 people had shown interest and signed up. The retention changed over time due to cancellations, but we managed to run the workshop with 35+ participants.
Regarding attendance, thanks to link shorteners in our announcement emails and on Twitter, we could track the "funnel" of students, from those who showed interest all the way down to those who were committed to actually showing up and completing the courses.
Actual feedback from students
In our post-assessment polls we got an average rating of 8 out of 10 on "General satisfaction with the workshop". Here are some selected comments:
I learned a lot of things these two days and the workshop really made me more motivated to use pandas next day in the office. :D
Overall it was a very nicely arranged and well prepared workshop. My only suggestion would be to simplify a bit the exercises for day 2 (perhaps by introducing some intermediary steps between two problems). Thanks for arranging such a nice workshop.
great event, I'll recommend it further, should be on regular (annual?) basis.
And also some things to look out for in the future:
I enjoyed particularly the first day, particularly the list of challenges/exercises that looked quite overwhelming at first but turned out to be manageable. Also very much appreciated: collection of ideas and questions on etherpad, post-its to request help. It would have been even better with little stricter time management.
such as clearly stating the (minimum) requirements to attend day 2 (intermediate/advanced):
My Python knowlegde was not high enough to follow. exercise.py was nice to learn Python, but didn't help to learn testing process (I was stuck with the exercises). Other exercise had too difficult instructions. The python introduction was at a very basic level but the tasks were at intermediate or above level. This needs adjustment.
or again, not putting too much material in one day, no matter how exciting it sounds at first while preparing the lessons:
The first day was great (11/10). Intro to Python was too basic for me, but I understand it was necessary for some participants of the workshop. Intro to Git, test-driven development etc. was very well performed and I learned a lot. Second day was pretty good (7/10), but too hurried. I feel that there were too many things squeezed into the schedule. The visualization lab had a good premise, but also suffered from too little time.
HDF5 and Spark do not play well with each other...
... so what if we just use HDF5 for sharing (import/export) and Spark for the rest? That's what Jeremy, Cyrille and I wondered while sitting in the Janelia Farm labs... well, actually in its bar with our laptops, but now you are in context ;)
I am attending a neuro meeting at the fantastic Janelia Farm facilities to see how experts in electrophysiology and computer science, among other fields, decide on a common format to express recordings of neuronal activity and the surrounding experimental metadata.
The mandate and outline of NWB have a clear mission, timeline and particular steps:
- August 2014: Project Start.
- Phase 1: Identify use cases and evaluation criteria.
- Phase 2: Select/assemble most promising approaches and develop data format and test it.
- Phase 3: Test and fine-tune it.
- July 2015: Project ends.
Now, brace for impact. Here's a small list of common e-phys file formats that were created by different labs:
For a more nuanced view of some of the main data formats, please have a look at the considerations for developing a standard for storing e-phys data in HDF5 and the NWB data and file formats summary.
Wouldn't it be a massive win to choose a single data format and not fall into the traditional academic mantra that states: "different formats are good for different things"? Or even worse, create yet another competing standard?
How many of those formats are actually used in research publications? Which is the one seeing the most adoption in academic literature so far?
Why shouldn't we just choose the top N with the most mindshare for the greater good (reproducibility, data sharing, interoperability)?
Let's see if we can find a fix for this e-phys Babel.
Several labs describe and present their custom e-phys formats. Most of them overlap considerably in attributes, structure and features. With varying specifications, the labs seem to revolve around HDF5, a hierarchical file format that stores all the attributes of the experiments, from images to timeseries, in varying degrees of complexity.
It is interesting to see how, this being an event-processing problem at its core, there are very few mentions of industry and open source event processing frameworks:
Software developers in the room recommend exposing a strongly typed API that deals with the raw data attributes via an intermediate representation, instead of having to change the HDF5 container every time the specification changes or an experiment introduces something new. This idea resonates quite well with the NEO format approach. An additional problem that arises with internal representations is keeping track of provenance, since encapsulation might hide processing details that are needed to follow an experiment step by step.
It seems to me that e-phys recording can be approached as a large scale logging problem, therefore:
- Using a framework that aggregates events at scale is crucial to guarantee a smooth and fast data analysis experience. That includes slicing by data recording sessions or any other criteria that the data (neuro)scientist decides.
- Leaving the internal (intermediate) representation of the data in that framework untouched is the most convenient approach, especially since HDF5 does not play well with modern parallel frameworks.
- Exporting data from that framework as HDF5 for sharing seems reasonable (to me, at least), given that it is the most popular container within this science niche.
- Writing importers/exporters (serializers) from Thunder to HDF5 seems like an interesting hackathon challenge. Adopting KWIK, already used by many, as a particular specification could be interesting for interoperability.
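To make the points above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `RecordingEvent` type, its fields and the `select` and `to_columns` helpers are hypothetical and do not come from NWB, KWIK, Thunder or any of the formats discussed at the meeting. It only shows the shape of the idea: a typed intermediate representation, slicing by session or any other criterion, and a separate export step where an HDF5 writer would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class RecordingEvent:
    """One typed event in the (hypothetical) intermediate representation."""
    session_id: str      # recording session the event belongs to
    channel: int         # electrode/channel index
    timestamp_s: float   # seconds since the start of the session
    value_uv: float      # recorded amplitude in microvolts


def select(events: Iterable[RecordingEvent],
           predicate: Callable[[RecordingEvent], bool]) -> List[RecordingEvent]:
    """Slice the event stream by any criterion the scientist chooses."""
    return [event for event in events if predicate(event)]


def to_columns(events: Iterable[RecordingEvent]) -> Dict[str, list]:
    """Flatten events into time-ordered columns, ready for an export step.

    A real exporter would write these columns into an HDF5 file; that
    step is deliberately left out of this sketch.
    """
    ordered = sorted(events, key=lambda event: event.timestamp_s)
    return {
        "session_id": [event.session_id for event in ordered],
        "channel": [event.channel for event in ordered],
        "timestamp_s": [event.timestamp_s for event in ordered],
        "value_uv": [event.value_uv for event in ordered],
    }


# Made-up events from two recording sessions.
events = [
    RecordingEvent("session-a", 0, 0.020, -41.5),
    RecordingEvent("session-b", 3, 0.005, -38.0),
    RecordingEvent("session-a", 1, 0.010, -55.2),
]

# Slice by recording session, then prepare the slice for sharing.
session_a = select(events, lambda event: event.session_id == "session-a")
export_table = to_columns(session_a)
print(export_table["timestamp_s"])
```

The internal events are never modified: slicing and exporting both produce new objects, so the container used for sharing can change without touching the analysis side.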
Software carpentry and learning
I am going through the 11th iteration of software carpentry for instructors and I am quite happy about the way Greg Wilson conducts his bi-weekly calls and focuses on evidence-based learning techniques.
At first my expectations from this course were those of simply learning how to teach specific software carpentry lessons, that is, traditional master classes on software engineering, tools and accompanying cookbooks.
During the first session I quickly realized that his approach goes far beyond that.
Greg is attacking the very roots of the learning process in order to provide a solid teaching base, ultimately offering robust, research-based, (scientific) software literacy to the world.
How motivation works
So I told my students on the first day of class, "This is a very difficult course. You will need to work harder than you have ever worked in a course and still a third of you will not pass"
I heard words like these many times myself and somewhat learned to ignore them during my university days. Perhaps unsurprisingly, those words were followed by unintended consequences:
But to my surprise, they slacked off even more than in previous semesters (...) their test performance was the worst it had been for many semesters.
Using several research papers as a foundation to understand those situations, How Learning Works concludes that:
Limited chances of passing may fuel preexisting negative perceptions about the course, compromise her students' expectations for success, and undermine their motivation to do the work necessary to succeed.
Again, I've seen this happen time after time in different academic contexts over the years.
This week's assignment
So in this week's course session, Greg asked us to describe a personal experience during our education where we saw a similar situation.
Sometime during my high school years, Iñaki, my physics teacher said something like:
If you don't put more effort on physics, I think you should consider fixing motorbikes on FP instead.
FP (Formación Profesional) is the Spanish educational route for those who want to skip university and pursue a more applied professional degree. FP is often mocked by Spanish society (some teachers too) and regarded as a "lower paid" and "less honorable" way to earn a living than going for more traditional academic degrees.
I believe Iñaki wanted me to succeed in physics and meant well, targeting my self-esteem as a way to push me harder. Looking back, I see that it was a bad move that effectively de-motivated me. Although I did pass, I did not enjoy the subject as I should have, didn't learn it as thoroughly and, therefore, didn't earn higher scores.
I tend to obsess over topics I like. Curiosity keeps me awake at night; it's like an unconscious autopilot that drives me towards higher understanding. As I discover and dig deeper into subjects I want to learn more about, I completely lose track of time. In my experience, frictionless, smooth learning almost invariably results in well-crystallized knowledge and high scores.
Later on, while doing my computational biology master's degree in Sweden, I re-discovered (bio)physics while diving into the incredible world of ion channels and biomedicine in a very different learning environment.
I loved it.
The following participants walked into the room during the 3 days that it was open and introduced themselves:
- Shreejoy Tripathy, University of British Columbia
- Richard Gerkin, Arizona State University
- Zameer Razack, Bournemouth University
- Tristan Glatard, McGill University, CBRAIN
- Joris Slob, Leiden University
- Barry Wark, Physion LLC
- Stephen Larson, OpenWorm
- Jeff Muller, EPFL-Blue Brain Project
- Roman Valls Guimera, INCF
We started with very few participants, but there was a steady stream of people coming and going over the days. Some exchanged knowledge, others were just curious about the concept of a hackathon. Local researchers such as Joris Slob, who works with ontologies and neuroimaging, were also very welcome: growing the neuroinformatics hackathon clique is always a good idea.
A special mention goes to Barry Wark, Physion CEO, who kindly sponsored this hackathon and briefly introduced us to Ovation, a great scientific management system that talks with many commercial cloud backends. Thanks Barry!
Hands on work
A basic principle of a hackathon is learning by doing. An early example of that started happening during the first minutes of the event via the collaborative integration efforts of Barry Wark and Christian from G-node. After a brief discussion, they both started coding Java bindings for NIX, an HDF5-based neuroscience exchange format. They used SWIG for those bindings, so other programming languages could potentially talk to NIX as well.
This is also a very clear, mutually beneficial, hands-on example of collaboration between industry (Ovation) and research institutions (G-node).
Another interesting initiative that surfaced during the event was a possible integration between the CBRAIN workflow systems and the INCF NIDASH neuroinformatics data model and reference implementation. Tristan Glatard and I went through the abundant IPython notebooks and documentation of NIDASH and, after understanding the framework, proposed a quickstart for developers, which is coming soon.
The specific outcome of this would be functionality for exporting provenance information when publishing or sharing neuroinformatics research.
Meanwhile, Stephen Larson from OpenWorm got 2 INCF GSoC slots in the 2014 edition. His particular mission was to improve the packaging of the PyOpenWorm alpha0.5 version produced by the INCF OpenWorm GSoC student project and to get it ready for merging into master.
Stephen got to hear about the sorry state of packaging in Python, but he took advantage of the hackathon time to fix and publish the GSoC outcomes for easier public consumption.
Those in neurophysiology will love to hear that NeuroElectro API documentation was improved by Rick Gerkin together with Shreejoy Tripathy. It is interesting to see how much electrophysiology data can be extracted from literature, all the better if it can be queried via an API!
On my side, I simplified the nipype (bummer here, since it was fixed already) and pymvpa2 installation processes and revived interest in OpenTracker, a BitTorrent tracker that could potentially be used as a more scalable data sharing backend for *-stars scientific Q&A sites.
If BitTorrent had so much success in filesharing, why shouldn't the same happen with sharing scientific datasets? Let's lower the barrier for this to happen!
As Stephen Larson from OpenWorm put it very well, anticipating the upcoming CRCNS hackathon:
- Define the time and goals of the hackathon in advance that you have in mind and write them up.
- Target your participants in advance and ask them to provide at least one-liners on who they are and what they do, and help you collaboratively make an 'ideas list'. Good platforms for this are wikis (maybe a GitHub wiki), a Google Doc that is world editable, or as Roman used recently, Etherpad.
- Lately I've been seeing folks use real-time chat more. Consider opening up an IRC chat room, or I've seen people liking HipChat, Slack, or Google Hangout (chat) for this
- When the time comes, lead by example during the session by being there the whole time and driving the process. Maybe open up an on air google hangout if there is an in-person component and have people watch you hacking while interacting via chat / other collaborative media
- During, try to collect a list of things that "got done". This will be the meat you will use to demonstrate the effectiveness of the session to others who ask you to justify its existence :)
Another important hint for such events is, in my opinion, to minimize the number of "speakers" and the time allocated to them. Doing should be prioritized over delivering a 30-minute presentation, effectively moving away from traditional scientific congress formats.
Hackathons (or codefests) are meant to get things done, polished or spark new ideas and interesting collaborations.
This all happened with a relatively small number of participants, but outcomes usually grow (up to a point) as more engaged participants show up and work. See BrainHack as an excellent example of this.
INCF realized that hackathons should be considered dedicated events, free from other (highly interesting) congress distractions, and will continue to support them via the Hackathon Series program.
A chained stream of science hackathons, such as the #mozsprint held a few months ago, helps push tools, refinements and integrations forward. That means standardized and interoperable neuroinformatics research, in line with INCF's core mission.
More neuro-task proposals can be found over at NeuroStars. Pick yours for the next hackathon ;)