Software and Data Carpentry Workshop 2015

Organizing the workshop

On 14 April 2015, three Software Carpentry instructors from Stockholm (Radovan Bast, Olav Vahtras and I) got an email from Greg Wilson:

Subject: Any interest in putting together a workshop for Stockholm this summer?

After getting the green light from WWCRC, my current employer, it did not take long to bring Oxana Sachenkova into the team and start planning the logistics, the lessons and official PhD-level university credits, and to raise some money to support the event.

Since we had recently received our SWC training and had some experience with a past scientific Python workshop and a previous edition of Software Carpentry, we decided to go for a two-day workshop.

That is how the idea of doing Carpentry with Software and Data came to fruition at the end of November 2015:

https://pythonkurs.github.io/2015-11-30-swc_data/

After innumerable emails, talks and commits, the event was in the works. The national Swedish bioinformatics communities BILS and WABI also supported us, and we would like to thank both organisations for their financial support.

Day 1: Software Carpentry

For day one, Olav ran some interactive Python console sessions showing what basic Python data structures and control mechanisms look like.

Following up, Radovan prepared excellent TDD lessons, inspired by three sources:

  1. Python Koans.

  2. Some ideas borrowed from Biopython’s comprehensive test suite.

  3. A late addition from an upcoming SWC TDD lesson, released just a few days before our workshop.
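
To give a flavour of the test-first style of those lessons, here is a minimal sketch using only the standard library and assuming Python 3. The function `gc_content` and its tests are hypothetical illustrations; they are not taken from Radovan's material, Python Koans or the Biopython test suite. The idea is simply that the assertions are written first, and the function is then grown until they pass.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical example)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


class TestGCContent(unittest.TestCase):
    # In a TDD session these tests come first and fail until gc_content works.

    def test_all_gc(self):
        self.assertEqual(gc_content("GGCC"), 1.0)

    def test_mixed_case(self):
        self.assertAlmostEqual(gc_content("atGC"), 0.5)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGCContent)
    return unittest.TextTestRunner(verbosity=1).run(suite)


if __name__ == "__main__":
    run_tests()
```

A file like this is also the kind of thing a continuous integration service can run on every push, which is where the Git and GitHub part of the day ties in.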

Teaching basic Git, GitHub, Travis CI and Coveralls in such a short time was challenging for the instructors, but it was very well received by the students.

While the infamous installation problem is still an issue, students managed to follow the lessons; the typical Python installation issues were mostly solved by a proper installation of Miniconda.

The SWC installation tests mostly distracted students, since packages not used in the workshop were flagged as uninstalled/failed (e.g. EasyMercurial). In general, I perceived that students were overwhelmed by too much information in the default SWC guides and stopped following and reading the instructions early on.

We need more TL;DRs in Software and Data Carpentry, perhaps starting with the workshop template.
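
As one possible TL;DR for the installation step, a check script could test only the packages a given workshop actually uses. The sketch below is an assumption about what such a script might look like, not the official SWC installation test; the `REQUIRED` list and the function names are hypothetical and would need to be replaced with the real requirements.

```python
import importlib.util
import sys

# Hypothetical list: only what this particular workshop needs.
REQUIRED = ["pandas", "matplotlib", "notebook"]


def missing_packages(names):
    """Return the top-level package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names):
    """Build a one-line, human-readable summary of the check."""
    missing = missing_packages(names)
    if not missing:
        return "All good: {} packages found.".format(len(names))
    return "Missing: {}. Try installing them with conda.".format(", ".join(missing))


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print(report(REQUIRED))
```

Keeping the output to one or two lines means nothing unrelated to the workshop can show up as a failure.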

Day 2: BioData, Jupyter Notebooks, Pandas and Machine Learning

The morning is dedicated to briefing students on pandas DataFrame operations with Ethan White’s python-ecology dataset. Due to time constraints, the merging and concatenation of dataframes is not covered, only pointed out in the lesson. Now the students have enough knowledge to follow up on Oxana’s Gene Expression dataset:

https://plot.ly/ipython-notebooks/bioinformatics/

It comes with exercises for those students willing to earn Swedish university credits. After some glitches with Python 2 vs Python 3 Jupyter notebooks, students get to know how to analyze data from the FANTOM5 consortium.
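
For readers curious about the merging and concatenation that the morning session only pointed to, here is a minimal sketch with pandas. The three tiny tables are made up for illustration (loosely in the spirit of a surveys/species split) and are not the actual workshop data; the variable names are likewise hypothetical.

```python
import pandas as pd

# Made-up miniature tables, not the real workshop dataset.
surveys_2001 = pd.DataFrame(
    {"record_id": [1, 2], "species_id": ["DM", "NL"], "weight": [40.0, 180.0]}
)
surveys_2002 = pd.DataFrame(
    {"record_id": [3], "species_id": ["DM"], "weight": [44.0]}
)
species = pd.DataFrame(
    {"species_id": ["DM", "NL"], "genus": ["Dipodomys", "Neotoma"]}
)

# Concatenation: stack tables that share the same columns on top of each other.
surveys = pd.concat([surveys_2001, surveys_2002], ignore_index=True)

# Merging: attach the genus to every survey row via the shared key column.
annotated = pd.merge(surveys, species, on="species_id", how="left")

# A typical follow-up question: mean weight per genus.
mean_weight = annotated.groupby("genus")["weight"].mean()
print(mean_weight)
```

Using `how="left"` keeps every survey row even if its species is missing from the lookup table, which is usually the safer default when annotating measurements.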

After getting some expression heatmaps and good insight from Oxana, Ahmed KachKach, currently interning at Spotify AB’s machine learning division, delights the audience with a detailed analysis of a toy dataset on breast cancer, using an extremely well-documented introductory machine learning notebook.

In order to explain PCA graphically to students, Ahmed uses an excellent web visualization to illustrate how variable decomposition/projection works in PCA.
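
The decomposition and projection steps behind PCA can also be sketched numerically. The snippet below is not Ahmed's notebook; it is a minimal illustration with NumPy on made-up two-dimensional data, where the second variable is roughly twice the first, so that a single axis captures almost all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data in which the two variables are strongly correlated.
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])

# 1. Centre each variable on zero.
centred = data - data.mean(axis=0)

# 2. Decompose: the rows of vt are the principal axes (unit vectors).
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

# 3. Fraction of the total variance carried by each principal component.
explained = singular_values**2 / np.sum(singular_values**2)

# 4. Project every sample onto the first principal axis (2-D -> 1-D).
projected = centred @ vt[0]

print("first axis:", vt[0])
print("explained variance ratio:", explained)
```

The first axis should point roughly along the direction (1, 2), up to sign, which is the line the points are scattered around.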

Right after that machine learning introduction, I show how one can run reproducible (and interactive!) notebooks via the mybinder.org service by exploring a small scikit-allel dataset. Furthermore, more visualization techniques are shown via my current explorations of HivePlots as an alternative way of visualizing structural genomic variations in cancer samples.

On top of that, I had a talk prepared about structural variations processed with bcbio, but in the interest of time, I saved it for another event :)

Last but not least, Mikael Huss goes through a fantastic notebook showing some gene expression prediction techniques and clever feature engineering from his current efforts at WABI.

Thoughts and comments

Planned ahead of time, this workshop was a sustained effort to bring instructors and people together, and I am glad it worked.

A surprising early realization from this workshop is how high the demand for such courses can be: only a few minutes after announcing the event, around 40 people had shown interest and signed up. The numbers changed over time due to cancellations, but we managed to run the workshop with 35+ participants.

Regarding attendance, thanks to link shorteners in our announcement emails and on Twitter, we could track the “funnel” of students, from those who showed interest all the way down to those who were committed to actually showing up and completing the courses.

Actual feedback from students

In our post-assessment polls we got an average rating of 8 out of 10 on “General satisfaction with the workshop”. Here are some selected comments:

I learned a lot of things these two days and the workshop really made me more motivated to use pandas next day in the office. :D

Overall it was a very nicely arranged and well prepared workshop. My only suggestion would be to simplify a bit the exercises for day 2 (perhaps by introducing some intermediary steps between two problems). Thanks for arranging such a nice workshop.

great event, I’ll recommend it further, should be on regular (annual?) basis.

And also some things to look out for in the future:

I enjoyed particularly the first day, particularly the list of challenges/exercises that looked quite overwhelming at first but turned out to be manageable. Also very much appreciated: collection of ideas and questions on etherpad, post-its to request help. It would have been even better with little stricter time management.

such as clearly stating the (minimum) requirements to attend day 2 (intermediate/advanced):

My Python knowledge was not high enough to follow. exercise.py was nice to learn Python, but didn’t help to learn testing process (I was stuck with the exercises). Other exercise had too difficult instructions. The python introduction was at a very basic level but the tasks were at intermediate or above level. This needs adjustment.

or again, not putting too much material in one day, no matter how exciting it sounds at first while preparing the lessons:

The first day was great (11/10). Intro to Python was too basic for me, but I understand it was necessary for some participants of the workshop. Intro to Git, test-driven development etc. was very well performed and I learned a lot. Second day was pretty good (7/10), but too hurried. I feel that there were too many things squeezed into the schedule. The visualization lab had a good premise, but also suffered from too little time.