Bad clouds, good clouds

This is an on-demand blog post. None of the actors are real institutions or people; any resemblance to real life is pure coincidence ;)

So there comes a day when an organization realizes that it really needs a solid cloud platform as an official infrastructure offering. Admit it, we all have some idle cycles we could make better use of.

A bad cloud

Then someone who owns some computing resources frantically types some commands into a console and voilà, a beta cloud service is born. A fictional dialog between user and cloud provider follows:

- This is great! I want to have an account on your new service, where can I get it?
- Well, you have to come to our offices. We will scan your passport, hold a one-hour meeting and then give you an account.
- Ermm, ok, I just want to use that service…

In the meeting, there’s an introduction on how to use the service. Many people did not prepare their machines before the session, and they get stuck on the overcomplicated client installation instructions, which involve installing pre-compiled binaries and config files from a .tar.bz (ABI issues galore!). Next, you are given a password via SMS, “abc123”, which you cannot change yourself (nor are you encouraged to): you have to explicitly ask the admins to change it for you.

Dirty secret: if they are not forced to change it, nobody ever does.

After editing some cloud template text files, your first instance is up and running. Time to clone your CloudBioLinux copy and get some bioinformatics software installed in it… Unfortunately, it does not take long to discover that the base distribution is more than 3 releases old. The user emails the imaginary beta cloud support and says:

- Hi cloud-support! Is it possible to have the newest Ubuntu release as an image?
- No, it’s not at the moment.
- Emm, ok, I tried to apt-get dist-upgrade but it just runs out of space, can I get more space in the VM to do that upgrade myself then? One cannot do much with 4GB of disk these days, you know.
- No, it is not possible, you can use the 1TB NFS-mounted scratch space instead.
- Why isn’t it possible? Anyway, I see no straightforward way to use that scratch space as an extension of the OS while the VM is running, and I cannot access the filesystem offline and move, say, /usr away without doing some hackish stuff involving squashfs, ramdisks, etc… This is actually giving me more headaches than it is worth.
- I’m sorry, we cannot bundle another distribution for you.

So what can we learn from that experience? What can a bad cloud do to become a better cloud?

  1. Distributing a readily installable and tested client CLI package for the most popular platforms, instead of a tarball of precompiled binaries, would have cut that one-hour meeting down to nil. Documentation should never be a substitute or a shortcut for a tested, directly installable package.
  2. Passwords, even when distributed via SMS, should adhere to basic good password policies at all times, even in beta services. Go for two-factor authentication if you fancy it.
  3. All services should be auto-provisioned. Asking sysadmins to perform routine operations like changing passwords should be off the table.
  4. Dimensioning a cloud (disk, memory, network interfaces) is not an easy task if the users have wildly different needs, but it should at least be possible to easily increase VM image space, within reasonable limits. Other resources such as RAM, network interfaces, DNS records and mountpoints should be directly accessible to the user and auto-provisioned.
  5. Automatic creation of new cloud images from vanilla OSes should be in place before launching the beta.
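
Points 2 and 3 could be backed by something like the following minimal Python sketch. It assumes a hypothetical provisioning step that hands out a random temporary password and flags it for rotation at first login. The names `generate_temporary_password`, `issue_account` and `MIN_LENGTH`, as well as the 16-character minimum, are illustrative assumptions and not part of any real service.

```python
import secrets
import string

# Hypothetical policy values, chosen for illustration only.
ALPHABET = string.ascii_letters + string.digits
MIN_LENGTH = 16


def generate_temporary_password(length: int = MIN_LENGTH) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < MIN_LENGTH:
        raise ValueError(f"password must be at least {MIN_LENGTH} characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def issue_account(username: str) -> dict:
    """Create an account record whose password must be rotated by the user."""
    return {
        "username": username,
        "password": generate_temporary_password(),
        # The user changes it themselves; no sysadmin in the loop.
        "must_change_on_first_login": True,
    }
```

The point of the flag is that rotation is enforced by the service itself rather than left to the goodwill of users or the availability of admins.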

One year passes. Some more console typing and a move to new hardware should give the service a new face, ready for a second try.

The same issues arise: instead of automating the deployment of the whole cloud to other machines, it has just been moved to the new hardware. There is no evidence of any automation work since last year.

A better cloud

Automated testing is about software, and since clouds are software, why not automate bits and pieces of the deployment until it becomes fully automatic? It is easier said than done, though. It takes a great deal of patience to:

  1. Build and test a cloud component.
  2. Automate its deployment, testing it elsewhere.
  3. Take the whole cloud stack down, recreating it again from scratch.
  4. Automate basic user-side (stress)-testing: create instance, record a DNS change, attach new volumes, destroy instance, etc…
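
The user-side test in step 4 could be sketched as follows. This is only an illustration: `FakeCloud` is a hypothetical in-memory stand-in for whatever API a real cloud exposes, and its method names are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a real cloud API (hypothetical interface)."""
    instances: dict = field(default_factory=dict)
    dns: dict = field(default_factory=dict)
    _next_id: int = 0

    def create_instance(self, image: str) -> str:
        self._next_id += 1
        instance_id = f"i-{self._next_id}"
        self.instances[instance_id] = {"image": image, "volumes": []}
        return instance_id

    def set_dns_record(self, name: str, instance_id: str) -> None:
        self.dns[name] = instance_id

    def attach_volume(self, instance_id: str, size_gb: int) -> None:
        self.instances[instance_id]["volumes"].append(size_gb)

    def destroy_instance(self, instance_id: str) -> None:
        del self.instances[instance_id]
        # Drop any DNS records that pointed at the destroyed instance.
        self.dns = {n: i for n, i in self.dns.items() if i != instance_id}


def smoke_test(cloud, rounds: int = 1) -> list:
    """Run the user-side lifecycle `rounds` times; return the steps that passed."""
    passed = []
    for _ in range(rounds):
        instance_id = cloud.create_instance(image="ubuntu-latest")
        assert instance_id in cloud.instances
        passed.append("create")

        cloud.set_dns_record("test.example.org", instance_id)
        assert cloud.dns["test.example.org"] == instance_id
        passed.append("dns")

        cloud.attach_volume(instance_id, size_gb=10)
        assert cloud.instances[instance_id]["volumes"] == [10]
        passed.append("volume")

        cloud.destroy_instance(instance_id)
        assert instance_id not in cloud.instances
        assert "test.example.org" not in cloud.dns
        passed.append("destroy")
    return passed
```

Raising `rounds` turns the same lifecycle into a crude stress test. Pointing `smoke_test` at a real client would require an adapter exposing the same four methods.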

Automation and testing are hard; it takes time to get them right without overfitting to your immediate environment. But look, those guys over there seem to have gotten it right:

- So you only need my public SSH key? That’s all? No meetings nor passport, fingerprints, blood samples or photos?
- That’s exactly right, just log in as root and break as much as you want in your own cloud; we can wipe your whole stuff out in less than 20 minutes. We’ll of course be gathering metrics from outside, just in case we detect something bad coming out of your instance(s). We don’t want to get in your way.
- Nice! What about having the latest Ubuntu release…
- We just provisioned it as we speak (true story).
- I’m launching some Hadoop jobs right now. It took me a few minutes to provision the nodes. Thank you guys, you’re awesome!

Automated Python education via unit testing and Travis-CI

Sometimes education can be a daunting process. It is quite obvious from the student side, we all have gone through exercises, corrections, learning what we did wrong on some of them, fixing and learning from those errors, rinse and repeat. That’s how it generally works.

On the teacher’s side, correcting assignments is easy and unbiased unless the number of students is considerably large. At one of the sessions of our now official KTH course “DD3436 Scientific Programming in Python for Computational Biology” I was given the task to hold a session on software testing and continuous integration in Python… for around 50 students.

The “module system”: The good, the bad and the ugly

Dealing with software package management can be a daunting task, even for experienced sysadmins. From the long forgotten graft, going through the modern and insanely tweakable portage to the (allegedly) multiplatform pkgsrc or the very promising xbps, several have tried to build a packaging system that is easy to use, community-driven, simple, good at dependency handling, optimal, reliable, generic and portable.

In my experience on both sides of the iron, as a sysadmin and as a developer, none of them works as one would like.

But first, let’s explore what several HPC centers have adopted as a solution and why… and most importantly, how to fix it eventually.

Galaxy on UPPMAX, simplified

This post is intended to be shortened over time, eventually becoming an automated procedure… a wiki-post from dahlo’s magic until upstream patches settle down. All commands are issued on the cluster, unless otherwise stated.

Please report any issues via comments !

  1. First, follow my earlier post on how to set up your own Python virtual environment on UPPMAX.
  2. Once you have a prompt similar to: (devel) hostname ~$, you can continue; otherwise, go back to step 1.
  3. pip install drmaa Mercurial PyYAML
  4. Add the following env variables to your .bashrc:
    export DRMAA_LIBRARY_PATH=/bubo/sw/apps/build/slurm-drmaa/lib/libdrmaa.so
    export DRMAA_PATH=$DRMAA_LIBRARY_PATH
    
  5. Create a file ~/.slurm_drmaa.conf with the contents:
    job_categories: {
          default: "-A <your project_account> -p devel"
    }
    
  6. hg clone http://bitbucket.org/brainstorm/galaxy-central
  7. Edit universe_wsgi.ini from the provided sample so that it contains:
    admin_users = <your_admin_user>@example.com
    enable_api = True
    start_job_runners = drmaa
    default_cluster_job_runner = drmaa://-A <your project_account> -p devel
    
  8. On your local machine: ssh -f <your_user>@<uppmax> -L 8080:localhost:8080 -N
  9. On your local machine: Fire up your browser and connect to http://localhost:8080
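
As a small step towards the automated procedure mentioned above, steps 4 and 5 could be checked and generated with a Python sketch like the one below. The function names `render_slurm_drmaa_conf` and `check_drmaa_environment` are hypothetical helpers invented for this example; only the environment variable names and the file contents come from the steps above.

```python
import os
from pathlib import Path


def render_slurm_drmaa_conf(project_account: str, partition: str = "devel") -> str:
    """Return the contents of ~/.slurm_drmaa.conf for the given project account."""
    return (
        "job_categories: {\n"
        f'      default: "-A {project_account} -p {partition}"\n'
        "}\n"
    )


def check_drmaa_environment(environ=os.environ) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    library = environ.get("DRMAA_LIBRARY_PATH")
    if not library:
        problems.append("DRMAA_LIBRARY_PATH is not set")
    elif not Path(library).is_file():
        problems.append(f"DRMAA library not found at {library}")
    if environ.get("DRMAA_PATH") != library:
        problems.append("DRMAA_PATH differs from DRMAA_LIBRARY_PATH")
    return problems


if __name__ == "__main__":
    for problem in check_drmaa_environment():
        print("WARNING:", problem)
    # Print the config instead of writing it, so nothing is overwritten by accident.
    print(render_slurm_drmaa_conf("<your project_account>"))
```

Running it before starting Galaxy gives an early warning when the DRMAA library path from step 4 is missing or misspelled.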

As a beta tester you may expect some issues when running Galaxy this way. Firstly, keep in mind that it will not perform as fast as a production-quality setup; it’s just a developer instance. Furthermore, the node you’re on might have time limit restrictions, meaning that your instance will be killed in 30 minutes if you don’t reserve a slot beforehand, as Martin recommended in the section “Run galaxy on a node”.

Galaxy community conference 2011

It has been a while since GCC2011 took place in Lunteren, in the Netherlands. As a result of my visit, I gained some more valuable insight into what I like to call the Metasploit of computational biology, if such an analogy can be made between computer security and biology.

A few words about Galaxy

With a 15+ core team and a very active contributor base, Galaxy is trying hard to provide a fix for the biomedical Babel in which life scientists work nowadays.

From its modest origin as a single Perl script, later on morphing into a Python web framework, Galaxy evolved rapidly. In short, Galaxy can be thought of as the glue code that wraps a considerable number of bioinformatics programs and unifies them under a more consistent web interface.

But there’s much more under the hood: cluster job management, data conversion, dataset access controls, security, web services, etc… to name a few components and features.

“Everything is possible in Galaxy. As long as you can run it on the command line, you can incorporate it into Galaxy.”
– Hans-Rudolf Hotz, Friedrich Miescher Institute for Biomedical Research

But not everything shines in the galaxy, since NGS tool inclusion hogged its main site at some point. This only proves the point that single sites like Galaxy main, handling 130,000 cluster jobs per month and 1TiB of uploads per week, face sustainability issues in the big-dataset era we’re living in. As a result, besides imposing reasonable cluster quotas, interesting scaling strategies are being tested on real research projects. Federation and cloud computing are therefore the next steps on this particular quest to the bio-universe.

One interesting realization at the conference is that not only are labs rolling their own Galaxy instances; there was a big sequencing industry player showing some interest in it too:

“Galaxy is an attractive workflow engine candidate”
– Kirt Haden, Illumina Inc
