It has been a while since GCC2011 took place in Lunteren, the Netherlands. My visit gave me some more valuable insight into what I like to call the Metasploit of computational biology, if such an analogy can be drawn between computer security and biology.
A few words about Galaxy
With a 15+ core team and a very active contributor base, Galaxy is trying hard to provide a fix for the biomedical Babel in which life scientists work nowadays.
From its modest origin as a single Perl script, which later morphed into a Python web framework, Galaxy has evolved rapidly. In short, Galaxy can be thought of as the glue code that wraps a considerable number of bioinformatics programs and unifies them behind a more consistent web interface.
But there’s much more under the hood: cluster job management, data conversion, dataset access controls, security and web services, to name a few components and features.
“Everything is possible in Galaxy. As long as you can run it on the command line, you can incorporate it into Galaxy.”
– Hans-Rudolf Hotz, Friedrich Miescher Institute for Biomedical Research
But not everything shines in the galaxy: the inclusion of NGS tools hogged its main site at some point. This only proves the point that single sites like Galaxy main, which handles 130,000 cluster jobs per month and 1 TiB of uploads per week, face sustainability issues in the era of big datasets we’re living in. Besides reasonable cluster quotas, interesting scaling strategies are therefore being tested on real research projects. Federation and cloud computing are the next steps on this particular quest to the bio-universe.
One interesting realization at the conference was that labs are not the only ones rolling their own Galaxy instances; a big sequencing industry player showed some interest in it too:
“Galaxy is an attractive workflow engine candidate”
– Kirt Haden, Illumina Inc
Common concerns
Here are some frequently asked questions that colleagues have put to me and that also came up at the conference. I would like to keep them here for future reference; feedback and further questions are very welcome via the comments… and Galaxy’s own support options.
Our compute cluster cannot be used due to IT policy restrictions. What should I do?
You might want to run it as a single user behind an SSH tunnel, as Martin Dahlö describes. Obviously, this option has a major problem in non-developer scenarios: it does not scale. Make sure you explain the big picture and aim to reach consensus with your IT department. Again, some basic key insights and common sense might help here.
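As a rough illustration of the single-user SSH tunnel idea, here is a minimal Python sketch that only builds the `ssh` command for a local port forward. The host name, user name and the helper `build_tunnel_command` are made up for this example, and port 8080 is assumed to be where your Galaxy instance listens; adjust all of them to your own setup.

```python
import shlex


def build_tunnel_command(login_host, user, remote_port=8080, local_port=8080):
    """Return the argv for an SSH local port forward to a remote Galaxy.

    Hypothetical helper: host, user and ports are assumptions for this sketch.
    """
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",  # local -> remote forward
        f"{user}@{login_host}",
    ]


if __name__ == "__main__":
    cmd = build_tunnel_command("cluster.example.org", "alice")
    # Print the command instead of running it, so nothing connects by accident.
    print(shlex.join(cmd))
    # To actually open the tunnel: subprocess.run(cmd, check=True)
```

With the tunnel open, the remote Galaxy would be reachable in your browser at `http://localhost:8080`, for you only, which is exactly why this approach does not scale.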
Are software versions kept in the history when running a workflow?
During the conference, Kanwei Li came up with a patch to keep track of the software versions by appending a
How’s Galaxy’s sample tracking moving forward?
There are currently two big sample tracking systems in the Galaxy sphere: the one already available on main and some nextgen patches by Brad Chapman. Try them out or, better yet, join the upcoming BOSC2011 Codefest and improve or merge the current systems.
What is the API status?
In short: it is growing as needed.
Shorter: @web.expose_api decorator.
Better explained: Read up by slide 26.
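To give a feeling for what a decorator such as `@web.expose_api` does conceptually, here is a self-contained sketch. It is not Galaxy’s actual implementation: the `expose_api` function, the `HistoriesController` class and its data are hypothetical stand-ins, written under the assumption that the decorator marks a controller method as callable through the API and serializes its return value as JSON.

```python
import functools
import json


def expose_api(func):
    """Hypothetical stand-in for an API-exposing decorator (not Galaxy code).

    Marks the wrapped method as exposed and JSON-encodes whatever it returns.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            payload = func(*args, **kwargs)
        except Exception as exc:
            # Report failures as JSON too, instead of leaking a traceback.
            payload = {"error": str(exc)}
        return json.dumps(payload, sort_keys=True)

    wrapper.exposed = True  # a dispatcher could check this flag before routing
    return wrapper


class HistoriesController:
    """Toy controller with made-up data, for illustration only."""

    def __init__(self):
        self._histories = {"1": {"id": "1", "name": "RNA-seq run"}}

    @expose_api
    def show(self, history_id):
        return self._histories[history_id]


if __name__ == "__main__":
    controller = HistoriesController()
    print(controller.show("1"))
```

The appeal of this design is that adding a new API call means writing a plain method that returns a dictionary, which fits the “growing as needed” status described above.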
My cluster uses a custom job scheduler. Will Galaxy work with it?
If your batch system supports DRMAA, there’s a better chance of getting it rolling. Check out the recent progress with SLURM, for instance.
How does Galaxy split datasets for embarrassingly parallel tools?
There’s a ticket for that.
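While that ticket is open, the general idea can be sketched outside Galaxy. The following is only an illustration of the split, process, merge pattern and makes no claim about how Galaxy does or will do it. It assumes FASTQ-like input with four lines per record, and the function names (`split_records`, `count_bases`, `run_split`) are invented for this example.

```python
from itertools import islice


def split_records(lines, records_per_chunk, lines_per_record=4):
    """Yield lists of lines, each holding at most `records_per_chunk` records."""
    it = iter(lines)
    size = records_per_chunk * lines_per_record
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_bases(chunk, lines_per_record=4):
    """Toy 'tool': count bases in one chunk (sequence is line 2 of each record)."""
    return sum(len(seq.strip()) for seq in chunk[1::lines_per_record])


def run_split(lines, records_per_chunk):
    """Split, process each chunk independently, then merge the partial results."""
    return sum(count_bases(chunk) for chunk in split_records(lines, records_per_chunk))


if __name__ == "__main__":
    fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "AC", "+", "II", "@r3", "A", "+", "I"]
    print(run_split(fastq, records_per_chunk=2))
```

Because each chunk is processed without looking at the others, the loop in `run_split` could be swapped for a process pool or for one cluster job per chunk; the merge step is the only part that depends on the tool.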
Next stop: BOSC2011 and ISMB
I have covered only a few of the topics presented at the conference, but you can take the time to explore it further through both the videos and the slides. Obviously, the Galaxy evolution does not end here: there’s still more to come in a few days in Vienna.