Hadoop and friends on academic clusters

I’ve just pushed a set of trivial modules system scripts that will hopefully ease your deployment of Cloudera Distribution for Hadoop 3 Beta 4 on your university cluster… partly, at least. This sad “partly” made me think about the current state of things in IT and HPC.

Over time I’ve learnt that several unexpected issues crop up when deploying Hadoop on clusters that you don’t own. They are mainly related to software management policies, non-root access (which is unfriendly to automated deployment), quotas and queueing or “batch” systems.

Setting aside most of these “fixable” issues, it becomes apparent that the juiciest problem for a sysadmin trying to get the most out of Hadoop-related tools is the batch system. Be it SGE, SLURM or some other exotic, non-DRMAA-compliant batch system implementation, you’ll have to deal with annoying integration quirks at some point, guaranteed.

Making it all work can be challenging, to say the least… but the question is: does it have to be that hard?

Almost four years ago I first approached those problems by building a modest 13-node development Rocks cluster out of recycled machines. Despite its low profile, I could deploy Hadoop RPMs and easily scale by adding more nodes, quickly reinstalling the whole cluster with any new software changes when needed.

    <a href="http://blogs.nopcode.org/brainstorm/wp-content/uploads/2011/03/cluster_rear.jpg"><img src="http://blogs.nopcode.org/brainstorm/wp-content/uploads/2011/03/cluster_rear-150x150.jpg" alt="13-node poor mans cluster, rear messy cables" title="13-node poor mans cluster, rear messy cables" width="150" height="150" class="alignright size-thumbnail wp-image-413" /></a>

13-node poor man’s cluster, front view

But still, building your own cluster doesn’t feel quite “easy” or cost-effective either, right? If you are willing to pay a few cents, you can get started on your own with Amazon: easy, but slow when uploading big datasets, and only an option if you are legally allowed to upload real datasets at all. With some more patience, and for free, you can even use the Eucalyptus Community Cloud.

Sane critical thinking and skepticism are always the best approach, but this is no longer about cloud hype or IT fears; it is about pragmatism and getting things done.

Full root access, reasonable quotas, automatic deployment and plain, community-oriented, package-based software management (as opposed to $PATH and $LD_LIBRARY_PATH hacks) allow for proper reproducible and shareable research.
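For anyone who hasn’t had to live with them, here is a rough sketch of what those $PATH and $LD_LIBRARY_PATH hacks boil down to when you install software without root. It is only an illustration in Python, not the scripts I pushed: the helper names and the `~/opt/hadoop` prefix are made up for the example, and it assumes the usual `bin/` and `lib/` layout under the install prefix.

```python
import os


def prepend_path(env, var, directory):
    """Return a copy of env with directory placed first in the path-style variable var."""
    updated = dict(env)
    current = updated.get(var, "")
    # Drop empty entries and any earlier copy of directory so repeated loads stay clean.
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    updated[var] = os.pathsep.join([directory] + parts)
    return updated


def load_user_install(env, prefix):
    """Mimic what a modules-style script does for a non-root install under prefix."""
    env = prepend_path(env, "PATH", os.path.join(prefix, "bin"))
    env = prepend_path(env, "LD_LIBRARY_PATH", os.path.join(prefix, "lib"))
    return env


if __name__ == "__main__":
    # Hypothetical user-space install location; adjust to your own cluster.
    prefix = os.path.expanduser("~/opt/hadoop")
    env = load_user_install(os.environ, prefix)
    print(env["PATH"])
    print(env["LD_LIBRARY_PATH"])
```

Every user ends up carrying some variant of this around for every tool and every version, which is exactly the kind of per-account state that makes a setup hard to reproduce or share compared to a plain package install.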

Can’t wait to see big university private clouds catch up with their commercial counterparts. I just hope it does not take too long to get such convenient tools into users’ hands :)