The case for local, dockerized bioinformatics
Ever since I started using docker on my local computer (a MacBook Air 11” at the time of writing), I have run into issues with the boot2docker+VirtualBox combo. Even after installing VirtualBox (plus guest additions) and Docker Toolbox via Brew Casks, most of the problems stemmed from volume sharing and UID/GID mappings between the host and docker containers.
Then, to my relief, I requested access to the private Docker for Mac beta program, which uses a lightweight hypervisor and base image (hyperkit+alpine) to run containers on OSX, conveniently hiding the installation woes. This setup worked quite well, and while docker on OSX does not yet support GPU passthrough (for those interested in things like Tensorflow and Keras), Docker for Mac is a really convenient local docker setup.
Frustratingly, my local docker setup was always accompanied by tests on our local HPC cluster, which has limited docker support, and on a small AWS instance. Almost invariably I resumed my development efforts on the HPC/AWS setups instead, because, you know, beta.
In contrast, a large share of my colleagues do use OSX as a workstation with a mosh shell to their respective HPC clusters, where they launch test runs. Why don’t we use the local CPUs a bit more for testing?
That was the state of the art with bcbio+docker+osx: not using it locally…
bcbio-nextgen and CWL
As peers in the bioinformatics community have noticed, workflow and pipeline engines are migrating to the Common Workflow Language (CWL) as their underlying workflow representation, including but not limited to Arvados, Galaxy and bcbio-nextgen.
In order to have a minimal development environment while migrating bcbio-nextgen internal logic to CWL, Brad Chapman wrapped a small demo that can launch bcbio_vm, CWL and Toil (SLURM support under Toil is a WIP right now).
So please go ahead and:
- Install Docker for Mac.
- Install Miniconda if you don’t have it already.
conda install bcbio-nextgen-vm -c bioconda.
tar xvfz test_bcbio_cwl.tar.gz&&
chmod +x run_cwltool.sh&&
Those will download a ~2GB bcbio docker image and then run a sample bioinformatics workflow under Docker for Mac on your computer.
Thanks to Robin Andeer for being one of the first brave souls to test this out and please feel free to report back your experiences running this experimental setup in the comments section below!
GSoC 2016 Post-midterm ideas
Anton Khodak has passed his midterm evaluation for argparse2cwl, having successfully:
- Surveyed the python argument parsing landscape.
- Covered the argparse API with equivalent CWL terms.
- Generated CWL files out of an example bioinformatics package, CNVkit.
- Written tests and documentation for the above points.
- Engaged with the community through gitter, the mailing list and his project blog, while getting ideas on how to improve and conduct the project in a way that is helpful to the community.
He will be tackling the next stage of this GSoC 2016 on CWL. This is a summary of the brainstorming and ideas that Anton is going to pick and undertake during the next 6 weeks.
The idea would be to have a commandline tool that:
- Queries/downloads an arbitrary pypi package specified by the user (e.g. ./pypi2cwl cnvkit).
- Checks for argparse presence in its scripts/code.
- Runs argparse2cwl against it and generates the associated CWL output files.
- Generates a pull request for review to add the templates to the CWL community repository.
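As a rough illustration of the "checks for argparse presence" step, here is a minimal sketch that inspects Python source with the standard library ast module. The function name uses_argparse is hypothetical and not part of pypi2cwl or argparse2cwl; a real tool would also have to download and unpack the package and walk over all of its files.

```python
import ast


def uses_argparse(source: str) -> bool:
    """Return True if the given Python source code imports argparse."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Matches "import argparse" and "import argparse as ap"
        if isinstance(node, ast.Import):
            if any(alias.name == "argparse" for alias in node.names):
                return True
        # Matches "from argparse import ArgumentParser"
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and node.module == "argparse":
                return True
    return False


print(uses_argparse("import argparse\nparser = argparse.ArgumentParser()\n"))  # True
print(uses_argparse("import sys\nprint(sys.argv)\n"))  # False
```

Parsing the syntax tree instead of grepping for the word "argparse" avoids false positives from comments and string literals.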
Building on top of the CNVkit experience that Anton has, this would be very valuable to automate the wrapping process further and potentially explore the conversion of python packages in bulk.
A desired outcome for this sub-project would be to have an output similar to that of http://pythonwheels.com, where the overall coverage and bugs can be spotted quickly.
As with any conversion tool that has a “2” in the middle, isn’t it worth flipping it over? That’s what cwl2argparse would do ;)
In a future where everyone wraps their tools with CWL (ha!), one would like to be able to:
- Write a cwltool definition.
- Run cwl2argparse against your_newly_wrapped_tool.cwl.
- Generate argparse code to import into your Python program.
Since we have already generated CNVkit’s CWL, the expected output of, for instance, cwl2argparse tools/cnvkit-batch.cwl would be an argparse definition that could be imported and used in a Python program, e.g.:
P_batch = AP_subparsers.add_parser('batch', help=_cmd_batch.__doc__)
P_batch.add_argument('bam_files', nargs='*',
        help="Mapped sequence reads (.bam)")
P_batch.add_argument('-y', '--male-reference', action='store_true',
        help="""Use or assume a male reference (i.e. female samples will have +1
                log-CNR of chrX; otherwise male samples would have -1 chrX).""")
P_batch.add_argument('-c', '--count-reads', action='store_true',
        help="""Get read depths by counting read midpoints within each bin.
                (An alternative algorithm).""")
P_batch.add_argument("--drop-low-coverage", action='store_true',
        help="""Drop very-low-coverage bins before segmentation to avoid
                false-positive deletions in poor-quality tumor samples.""")
(...)
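To make the CWL-to-argparse direction more concrete, here is a minimal sketch and not the actual cwl2argparse design. It assumes the CWL inputs have already been loaded into a plain Python dict (real CWL tool descriptions are YAML files) and only looks at the type, doc and inputBinding.prefix fields. The function cwl_inputs_to_argparse, the PY_TYPES table and the cnvkit_like_inputs sample are all made up for illustration.

```python
# Hypothetical mapping from a few CWL scalar types to Python type names.
PY_TYPES = {"int": "int", "float": "float", "string": "str", "File": "str"}


def cwl_inputs_to_argparse(inputs):
    """Render argparse source code from a simplified CWL-style inputs mapping."""
    lines = ["import argparse", "", "parser = argparse.ArgumentParser()"]
    for name, spec in inputs.items():
        cwl_type = spec["type"].rstrip("?")  # "boolean?" means an optional boolean
        prefix = spec.get("inputBinding", {}).get("prefix")
        if cwl_type == "boolean" and not prefix:
            prefix = "--" + name.replace("_", "-")  # store_true needs a flag
        args = [repr(prefix or name)]  # a prefix makes a flag, otherwise positional
        if cwl_type == "boolean":
            args.append("action='store_true'")
        elif cwl_type.endswith("[]"):
            args.append("nargs='*'")
        else:
            args.append("type=" + PY_TYPES.get(cwl_type, "str"))
        if "doc" in spec:
            args.append("help=" + repr(spec["doc"]))
        lines.append("parser.add_argument(" + ", ".join(args) + ")")
    return "\n".join(lines)


# Hand-written stand-in for part of a CNVkit batch tool description (an assumption).
cnvkit_like_inputs = {
    "bam_files": {"type": "File[]", "doc": "Mapped sequence reads (.bam)"},
    "male_reference": {
        "type": "boolean?",
        "inputBinding": {"prefix": "--male-reference"},
        "doc": "Use or assume a male reference",
    },
    "count_reads": {
        "type": "boolean?",
        "inputBinding": {"prefix": "--count-reads"},
        "doc": "Get read depths by counting read midpoints within each bin",
    },
}

print(cwl_inputs_to_argparse(cnvkit_like_inputs))
```

The generated text is ordinary Python, so it can be written to a module and imported, which is the workflow described in the list above.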
Much like what John Chilton does for Bash via cwl2script, Anton can try to do the same for Python’s argparse, so that a developer only has to update their tool definition to generate argparse code for newly introduced flags.
Adding in new python argparsers such as click, arg[N] or optparse
According to Anton’s first blogpost, there are a few other argument parsers that could use some CWL automation. The next on the list to implement, judging by the number of packages that use it, would be Python’s core argument array (arg, arg, …). On the other hand, given its free-form nature, it can be challenging to extract semantics from it and map them to CWL as with the other parsers. For instance, flags, arguments and types are all well specified in click or argparse.
Or perhaps click would be more interesting to wrap given its ramping up in usage by the community? Worth exploring in any case.
Discarded idea (for now): Fuzzing support for argparse2cwl: argparse2cwl-fuzz
Last and least, after some deliberation, argparse2cwl-fuzz could be a bit too far away from this project scope and resources: writing a stub for bioinformatics fuzzing.
As pointed out in the original proposal:
“(…) explore and optimize the parameter/flag space of i.e, bioinformatic aligners with Teaser.”
Instead of going all the way and creating a benchmarking/validation suite with file format fuzzing, the idea was to just provide some support in argparse2cwl for generating different commandline flag combinations, somewhat like biopsy and Teaser do.
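To give an idea of what "generating different commandline flag combinations" could mean in its simplest form, here is a small sketch that enumerates every on/off combination of a set of boolean flags. It is not how biopsy or Teaser work internally. flag_combinations is a hypothetical helper, and the cnvkit.py batch command line is only used as an example.

```python
from itertools import product


def flag_combinations(flags):
    """Yield every on/off combination of the given boolean flags."""
    for mask in product([False, True], repeat=len(flags)):
        yield [flag for flag, enabled in zip(flags, mask) if enabled]


flags = ["--male-reference", "--count-reads", "--drop-low-coverage"]
for combo in flag_combinations(flags):
    # Each combination could be appended to a base command and executed.
    print(["cnvkit.py", "batch"] + combo)
```

With n boolean flags this produces 2^n command lines, so anything beyond a handful of flags would need sampling rather than exhaustive enumeration.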
Could be tackled in a future GSoC project maybe?
Should electronics be repairable?
The question seems to ask for a resounding “of course” in my mind, but today I woke up to the following video by Louis Rossmann about the right to repair bill and how legislators are trying to counter trade secret lobbyists. If you care about your right to repair your electronics, either by yourself or by a professional, take some time to hear what Louis has to say as a talented independent electronics repairman:
Rossmann mentions how stupid it would be to not be able to repair your own electronics and share how you did so with the rest of the world, as he does with Apple products. That thought resonated with me, especially when I recently decided to look into my faulty drone as a hobby repair.
So if you don’t care much about that bill and you are a tech geek like me, take a look at how I fixed my ~$300 drone by fixing a software glitch in one of its motors’ microcontrollers. Then please think again about how that bill would affect us all, and raise your voice.
One motor not responding
A couple of years ago I was showing my drone to a friend when it started to wobble mid-air and crashed. I could not get it to fly again, even though nothing seemed wrong with the motors or anything else. When re-connecting the battery, all propellers twitched (initialization sequence) except one:
How annoying is that? Everything seems to work except one (upon visual inspection) undamaged motor? Why?
Connecting to the drone (via telnet) and seeing the logs:
0.970751 NULL 6 909390645 BLC call for motor 1
1.970768 NULL 6 909390645 BLC motor 1 flash & start FAILED
2.030722 NULL 6 909390645 BLC call for motor 2
2.142945 NULL 6 909390645 BLC motor 2 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.200720 NULL 6 909390645 BLC call for motor 3
2.312925 NULL 6 909390645 BLC motor 3 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.370718 NULL 6 909390645 BLC call for motor 4
2.482919 NULL 6 909390645 BLC motor 4 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.510740 NULL 6 0 BLC motor 1 dead
2.510942 NULL 6 0 BLC reflash required, perform off/on cycle
(...)
1.005742 NULL 6 909260344 BLC call for motor 1
1.026300 NULL 6 -1096575148 BLC start flash
1.816389 NULL 6 -1096575148 BLC flash done
1.816557 NULL 6 -1 BLC verify
1.835135 NULL 6 -1 BLC verify FAILED - page 0
(...)
!!! Emergency landing from /home/aferran/[...]/version/[...]/Soft/Build/../../Soft/Toy//Os/elinux/Control/motors.c:1593. Reason is Motors have not been initialized correctly
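If you would rather not eyeball such logs, a few lines of Python can summarise which motors reported trouble. This is only a sketch based on the handful of message shapes visible in the log above; the firmware's full log format is not documented here, so the regular expression and the motor_status helper are assumptions.

```python
import re

# Matches lines such as "... BLC motor 1 dead" (pattern guessed from the log above).
MOTOR_EVENT = re.compile(r"BLC motor (\d+) (.*)$")

LOG = """\
1.970768 NULL 6 909390645 BLC motor 1 flash & start FAILED
2.142945 NULL 6 909390645 BLC motor 2 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.312925 NULL 6 909390645 BLC motor 3 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.482919 NULL 6 909390645 BLC motor 4 soft version 1.43, hard version 3.0, supplier 1.1, lot number 11/10, FVT1 17/11/10
2.510740 NULL 6 0 BLC motor 1 dead
"""


def motor_status(log_text):
    """Map each motor number to "ok" or "failed" based on its BLC messages."""
    status = {}
    for line in log_text.splitlines():
        match = MOTOR_EVENT.search(line)
        if not match:
            continue
        motor, message = int(match.group(1)), match.group(2)
        if "FAILED" in message or "dead" in message:
            status[motor] = "failed"  # a failure always wins
        else:
            status.setdefault(motor, "ok")  # never overwrite a failure
    return status


print(motor_status(LOG))  # {1: 'failed', 2: 'ok', 3: 'ok', 4: 'ok'}
```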
I performed the off/on cycle by reconnecting the battery several times, no luck. Reflash required… but how? There are no instructions for doing that reflash, just the power off/on cycle, which clearly does not work in my case.
At this point, all the manufacturer tells you is to buy a completely new motor, worth almost $50 plus shipping:
But I insist, nothing seems wrong with the motor itself, nor with the few DMC3021LSD MOSFETs around the motor board, so it clearly seems like a software issue… with the Atmega8A microcontroller present in each of the 4 motor boards. The microcontroller datasheet states that it should retain its data for 20 years and withstand 10,000 flash (100,000 EEPROM) write/erase cycles. I definitely did not use it for that long, nor that many times, so it got me curious: what if I could just fix it myself while I still have the right to?
Not bothering opening my drone
So at this point, many people would either buy that motor or throw that dead toy onto a pile of e-waste.
First, before reaching for the screwdriver, let’s learn a bit about how that gadget is built without opening it, via FCCID.io. There’s a great block diagram which details all its ins and outs:
There are also external and internal photos showing what the different boards and components look like.
See those MISO/MOSI/SCK and RESET test points in the motor pinout? Those can be used to communicate with our confused AVR microcontroller.
Time to wire up a couple of motors to a Raspberry Pi
Using a Raspberry Pi’s GPIO pins as a microcontroller programmer, with AVRdude running on it (just a plain apt-get install avrdude away on a recent Raspbian), we can read and write the contents of the faulty motor board:
The pin assignments in AVRdude must match your physical wiring. There are tons of diagrams available online on how the pins are laid out in the different Raspberry Pi versions, so pick your favorite RPi GPIO pins and tell AVRdude accordingly via its configuration file:
programmer
  id    = "motor_1";
  desc  = "Use the Linux sysfs interface to bitbang GPIO lines";
  type  = "linuxgpio";
  reset = 25;
  sck   = 11;
  mosi  = 10;
  miso  = 9;
;
programmer
  id    = "motor_2";
  desc  = "Use the Linux sysfs interface to bitbang GPIO lines";
  type  = "linuxgpio";
  reset = 14;
  sck   = 4;
  mosi  = 3;
  miso  = 2;
;
If all goes well and it’s properly wired, you should get this from avrdude:
$ avrdude -p m8 -C /etc/avrdude.conf -c motor_2 -v
(…)
avrdude: AVR device initialized and ready to accept instructions
Reading ################################################## 100% 0.00s
avrdude: Device signature = 0x1e9307 (probably m8)
avrdude: safemode: hfuse reads as DC
avrdude: safemode: hfuse reads as DC
avrdude: safemode: Fuses OK (E:FF, H:DC, L:E4)
avrdude done. Thank you.
Then it’s just a matter of running avrdude to read the contents of the flash, eeprom, fuse and lock bits:
# avrdude -p m8 -C /etc/avrdude.conf -c motor_2 -U flash:r:flash.hex:i -U eeprom:r:eeprom.hex:i
# avrdude -p m8 -C /etc/avrdude.conf -c motor_2 -U lock:r:lock.hex:i -U hfuse:r:hfuse.hex:i -U lfuse:r:lfuse.hex:i
How can we tell the dead from the living? Using UNIX diff.
Diffing the motor’s bits
Why did I connect two motors, that is one healthy and one “dead”, instead of just the faulty one?
As I learned from bioinformatics, comparing NORMAL vs TUMOUR tissue can reveal useful insights about biology. After this really stretched yet handy analogy, which I should probably be embarrassed about, let’s see what I found:
$ diff -u motor1/eeprom.hex motor2/eeprom.hex
--- motor2/eeprom.hex 2016-06-06 08:47:26.815120557 +0000
+++ motor1/eeprom.hex 2016-06-06 09:35:17.664976258 +0000
@@ -1,4 +1,4 @@
-:20000000AC8A0001018A0A110BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0F
+:20000000FF030001010B0A120B0AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB6
 :20002000FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE0
 :20004000FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFC0
 :20006000FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA0
diff -u motor1/lock.hex motor2/lock.hex
--- motor1/lock.hex 2016-06-06 10:00:38.532028704 +0000
+++ motor2/lock.hex 2016-06-06 09:09:24.738241220 +0000
@@ -1,2 +1,2 @@
-:0100000003FC
+:010000002FD0
 :00000001FF
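A line-based diff only says that a whole Intel HEX record changed. To see which bytes differ, the records can be decoded first. The sketch below assumes the standard Intel HEX record layout (byte count, 16-bit address, record type, data, checksum); parse_ihex_record and diff_records are hypothetical helper names. The two records are the first EEPROM lines from the diff above, written with string repetition to keep the padding readable.

```python
def parse_ihex_record(line):
    """Decode one Intel HEX record into (address, record_type, data)."""
    if not line.startswith(":"):
        raise ValueError("Intel HEX records start with ':'")
    raw = bytes.fromhex(line[1:].strip())
    # All bytes of a valid record, checksum included, sum to 0 modulo 256.
    if sum(raw) % 256 != 0:
        raise ValueError("checksum mismatch")
    length = raw[0]
    address = int.from_bytes(raw[1:3], "big")
    record_type = raw[3]
    data = raw[4:4 + length]
    return address, record_type, data


def diff_records(line_a, line_b):
    """List (address, byte_a, byte_b) for every differing data byte."""
    addr_a, _, data_a = parse_ihex_record(line_a)
    addr_b, _, data_b = parse_ihex_record(line_b)
    if addr_a != addr_b or len(data_a) != len(data_b):
        raise ValueError("records do not cover the same address range")
    return [(addr_a + i, a, b)
            for i, (a, b) in enumerate(zip(data_a, data_b)) if a != b]


# First EEPROM record of each file, as shown in the diff ("-" line, then "+" line).
minus_record = ":20000000" + "AC8A0001018A0A110B" + "FF" * 23 + "0F"
plus_record = ":20000000" + "FF030001010B0A120B0A" + "FF" * 22 + "B6"

for address, a, b in diff_records(minus_record, plus_record):
    print(f"EEPROM[0x{address:02X}]: 0x{a:02X} vs 0x{b:02X}")
```

On these two records the differing offsets are 0, 1, 5, 7 and 9; everything else in the first 32 bytes is identical.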
So the EEPROM holds information that I have no time to reverse engineer now (motor coil timing calibration? total flight hours?… no clue).
On the other hand, the lock bits got me interested. As Alexander and Boris say in their AVR workshops: “when in doubt, look at the datasheet!”.
So the datasheet states the following about lock bits near table 86 on page 215:
The ATmega8 provides six Lock Bits which can be left unprogrammed (“1”) or can be programmed (“0”) to obtain the additional features listed in Table 86. The Lock Bits can only be erased to “1” with the Chip Erase command.
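To read the two lock bytes from the diff (0x03 in motor1/lock.hex, 0x2F in motor2/lock.hex) in those terms, here is a small sketch. It assumes the usual ATmega8 lock byte layout, with LB1 in bit 0 up to BLB12 in bit 5. That bit order is my reading of the lock bit table and should be checked against the datasheet. programmed_lock_bits is a hypothetical helper.

```python
# Assumed bit order of the ATmega8 lock byte, bit 0 first (verify with the datasheet).
LOCK_BIT_NAMES = ["LB1", "LB2", "BLB01", "BLB02", "BLB11", "BLB12"]


def programmed_lock_bits(lock_byte):
    """Return the names of the lock bits that are programmed, i.e. read as 0."""
    return [name for bit, name in enumerate(LOCK_BIT_NAMES)
            if not (lock_byte >> bit) & 1]


for value in (0x03, 0x2F):
    print(f"0x{value:02X}: programmed = {programmed_lock_bits(value)}")
```

Under that assumption, 0x03 has all four boot lock bits (BLB01, BLB02, BLB11, BLB12) programmed, while 0x2F only has BLB11 programmed.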
Alright, so what if we just erase the chip with avrdude’s -e option?
avrdude -p m8 -C /etc/avrdude.conf -c motor_1 -e
And then write the flash contents back:
avrdude -p m8 -C /etc/avrdude.conf -c motor_1 -U flash:w:flash.hex:i
To be fair, those locks might be enforced when the firmware detects that there’s a serious mechanical issue with the rotor, which defaults to cutout/shutdown if there’s something wrong with it, preventing worse damage involving burned MOSFETs, destroyed motor coils, etc…
But since there are no further specs nor documentation from the manufacturer on this topic other than "Motors have not been initialized correctly"… how would I know, even if I wanted to?
In any case, that’s it, I just saved the environment and $50 by unlocking incorrectly software-locked hardware!
I have left out a few details on how I debugged this issue, and saved for later some follow-up reverse engineering work that Hugo Perquin did on his blog. About reverse engineering, I might present some work at the first Radare2 conference.
But anyways, I hope to have raised some awareness about the right to repair while entertaining some nerds like me ;)
Organizing the workshop
Subject: Any interest in putting together a workshop for Stockholm this summer?
After getting the green light from WWCRC, my current employer, it did not take long to bring Oxana Sachenkova onto the team and start planning the logistics, lessons and official PhD-level university credits, and raising some money to support the event.
- One day with python fundamentals with a best practices twist (TDD).
- The second day with a more biological data analysis focus.
That is how the idea to do Carpentry with Software and Data came to fruition by the end of November 2015:
After innumerable emails, talks and commits, the event was on the forge. The Swedish national bioinformatics communities BILS and WABI also supported us. We would like to thank both organisations for their financial support.
Day 1: Software Carpentry
For day one, Olav ran some interactive python console sessions showing what basic Python data structures and control mechanisms look like.
Following up, Radovan prepared excellent TDD lessons, inspired by three sources:
- The already mentioned Python Koans.
- Some ideas borrowed from BioPython’s comprehensive testsuite.
- A late addition from an upcoming SWC TDD lesson, released just a few days before our workshop.
While the infamous installation problem is still an issue, students managed to follow the lessons despite hitting the typical python installation issues, which were mostly solved by a proper installation of Miniconda.
The SWC installation tests mostly distracted students, since packages not being used in the workshop were flagged as uninstalled/failed (e.g. EasyMercurial). In general I perceived that students were getting overwhelmed by too much information from the SWC default guides and stopped following and reading the instructions early.
We need more TL;DR’s in software and data carpentry. Perhaps starting by the workshop template.
Day 2: BioData, Jupyter Notebooks, Pandas and Machine Learning
The morning is dedicated to briefing students on Pandas dataframe operations with Ethan White’s python-ecology dataset. Due to time constraints, the merging and concatenation of dataframes is not covered, only pointed out in the lesson. Now the students have enough knowledge to follow up on Oxana’s Gene Expression dataset:
It comes with exercises for those students willing to earn Swedish university credits. After some glitches with Python 2 vs Python 3 Jupyter notebooks, students get to know how to analyze data from the FANTOM5 consortium.
After getting some expression heatmaps and good insight from Oxana, Ahmed KachKach, currently interning at Spotify AB’s machine learning division, delights the audience with a detailed analysis of a toy dataset on breast cancer by using an extremely well documented introduction to machine learning notebook.
In order to explain PCA graphically to students, Ahmed uses an excellent web visualization to illustrate how variable decomposition/projection works in PCA.
Right after that machine learning introduction, I show how one can enact reproducible (and interactive!) notebooks via the mybinder.org service by exploring a small scikit-allel dataset. Furthermore, more visualization techniques are shown via my current explorations of HivePlots as an alternative way of visualizing structural genomic variations in cancer samples.
On top of that, I had a talk prepared about structural variations processed with bcbio, but in the interest of time, I saved it for another event :)
Last but not least, Mikael Huss goes through a fantastic notebook showing some gene expression prediction techniques and clever feature engineering from his current efforts at WABI.
Thoughts and comments
Planned ahead of time, this workshop was a sustained effort to bring instructors and people together, and I am glad it worked.
A surprising early realization from this workshop is how high the demand for these courses can be: only a few minutes after announcing the event, we got around 40 individuals interested and signing up. The retention changed over time due to cancellations, but we managed to run the workshop with 35+ participants.
Regarding attendance, thanks to link shorteners in our announcement emails and on twitter, we could track the “funnel” of students that showed interest all the way down to those that were committed to actually show up and complete the courses.
Actual feedback from students
In our post-assessment polls we got an average rating of 8 out of 10 on “General satisfaction with the workshop”. Here are some selected comments:
I learned a lot of things these two days and the workshop really made me more motivated to use pandas next day in the office. :D
Overall it was a very nicely arranged and well prepared workshop. My only suggestion would be to simplify a bit the exercises for day 2 (perhaps by introducing some intermediary steps between two problems). Thanks for arranging such a nice workshop.
great event, I’ll recommend it further, should be on regular (annual?) basis.
And also some things to look after in the future:
I enjoyed particularly the first day, particularly the list of challenges/exercises that looked quite overwhelming at first but turned out to be manageable. Also very much appreciated: collection of ideas and questions on etherpad, post-its to request help. It would have been even better with little stricter time management.
such as clearly stating the (minimum) requirements to attend day 2 (intermediate/advanced):
My Python knowlegde was not high enough to follow. exercise.py was nice to learn Python, but didn’t help to learn testing process (I was stuck with the exercises). Other exercise had too difficult instructions. The python introduction was at a very basic level but the tasks were at intermediate or above level. This needs adjustment.
or again, not putting too much material in one day, no matter how exciting it sounds at first while preparing the lessons:
The first day was great (11/10). Intro to Python was too basic for me, but I understand it was necessary for some participants of the workshop. Intro to Git, test-driven development etc. was very well performed and I learned a lot. Second day was pretty good (7/10), but too hurried. I feel that there were too many things squeezed into the schedule. The visualization lab had a good premise, but also suffered from too little time.
HDF5 and Spark do not play well with each other…
… so what if we just use HDF5 for sharing (import/export) and Spark for the rest? That’s what Jeremy, Cyrille and I wondered while sitting in Janelia Farm labs… well, actually in its bar with our laptops, but now you are in context ;)