• A dashboard with pretty plots and numbers about your organization… It shouldn’t be that difficult, right?

    Luckily, the ElasticSearch+Logstash+Kibana stack solves this problem nicely. Here’s my result and how I worked around some integration hiccups. The resulting beta dashboard, as it stands now, looks like this:

    DataSpace beta operations panel

    Many many many other blog posts have addressed this issue, but I would like to share a couple of tweaks I came up with while working on it.

    Logstash and iRODS custom log parsing

    Broadly speaking, I found the following issues when setting up the dashboard:

    1. Index design: Logstash’s default index scheme (logstash-YYYYMMDD) brought down a machine with 30GB of RAM (line 63).
    2. Log timestamps missing the year, which had to be recovered from the log filenames (lines 31 to 42 below).
    3. GeoIP coordinates not being parsed by Kibana’s bettermap (lines 44 to 58). A sketch of fixes 2 and 3 follows this list.
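
    The actual fixes for items 2 and 3 are a handful of grok/date/mutate lines in the Logstash filter config referenced above; purely as an illustration, here is the same logic in Python (the rodsLog.YYYY.MM.DD filename pattern is an assumption):

    # Illustration only: the real fix lives in the Logstash filter config.
    # Assumption: iRODS log files are named something like "rodsLog.2014.01.15".
    import re
    from datetime import datetime

    def full_timestamp(log_path, log_line):
        """Recover the missing year from the filename and build a complete timestamp."""
        year = re.search(r"rodsLog\.(\d{4})\.", log_path).group(1)
        # In-log timestamps look like "Feb  3 10:15:42", with no year.
        stamp = " ".join(log_line.split()[:3])
        return datetime.strptime("%s %s" % (year, stamp), "%Y %b %d %H:%M:%S")

    def bettermap_coordinates(geoip):
        """Kibana's bettermap expects a [longitude, latitude] array of floats (GeoJSON order)."""
        return [float(geoip["longitude"]), float(geoip["latitude"])]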

    ElasticSearch tweaks

    As mentioned, the default “daily” indexing scheme in Logstash did not work for my purposes. When monitoring ElasticSearch with the head plugin, the status went red while ingesting events from the logs. Thanks to feedback from @zackarytong and others, I managed to address the issue:

    Then, Kibana could not connect to the ElasticSearch backend when large time ranges were selected. After some poking with the Chrome developer tools and curl, I reported the issue: with so many indices URL-encoded into the request line, it grew past ElasticSearch’s default limit, which required setting the following directive:

    http.max_initial_line_length: 64kb

    In the future, I might consider further indexing optimizations by using the ElasticSearch index curator tool to optimize and cut down index usage, but for now I want to keep all data accessible from the dashboard.
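
    For reference, here is a hedged sketch of the kind of housekeeping curator would automate for me: force-merging (optimizing) old monthly indices down to a single segment so they stay searchable but use fewer resources. The index pattern and the “leave the current month untouched” policy are my own assumptions:

    # Hedged sketch: optimize old monthly logstash indices via the REST API.
    # The host is the panel's own ElasticSearch; the retention policy is made up.
    import requests

    ES = "http://ids-panel.incf.net:9200"

    def optimize_old_indices(current=("2014.02",)):
        # GET /_aliases lists every index known to the cluster
        for index in requests.get(ES + "/_aliases").json():
            if index.startswith("logstash-") and not index.endswith(current):
                requests.post(ES + "/" + index + "/_optimize",
                              params={"max_num_segments": 1})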

    Kibana

    The presentation side of this hipster dashboard stack also had some quirks. My original idea of showing three separate Kibana bettermaps, one per AWS region, had to wait until another issue was addressed, which happened only very recently. Both the Chrome developer console and AngularJS Batarang were very useful for finding issues in Kibana and its interactions with the ElasticSearch backend.

    Speaking of Amazon and its regions: while AWS exposes a great amount of functionality through its developer APIs, at the time of writing there are missing API endpoints, such as billing information across all regions (only available in us-east-1) and fetching the remaining credits from your EDU Amazon grant, should you have one. The only viable option left today is scraping it:
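
    A minimal sketch of that scraping, with heavy caveats: the URL and the page markup below are placeholders, and you need an already-authenticated console session (cookies) to fetch the page at all:

    # Placeholder URL and markup: adapt to whatever the billing console actually serves.
    import re
    import requests

    def fetch_credit_balance(session_cookies):
        page = requests.get("https://aws-billing-console.example/credits",
                            cookies=session_cookies).text
        # Grab the first dollar amount following "Credit balance" (assumed markup)
        match = re.search(r"Credit balance[^$]*\$([\d,.]+)", page)
        return float(match.group(1).replace(",", ""))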

    If one wants to keep track of AWS expenses in ElasticSearch, some specific index type mapping changes are needed on the ElasticSearch side:

    curl -XPUT 'http://ids-panel.incf.net:9200/aws-2014.02/credits/_mapping' -d '{
      "credits": {
        "properties": {
          "aws_credit_balance": {
            "type": "integer"
          }
        }
      }
    }'
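
    With that mapping in place, the scraped balance can be indexed into the monthly aws-YYYY.MM index so Kibana can plot it over time. A small sketch (the @timestamp field and document layout are my own choice):

    # Sketch: index the scraped balance, matching the "credits" type mapped above.
    import datetime
    import json
    import requests

    def index_credit_balance(balance):
        now = datetime.datetime.utcnow()
        doc = {"@timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
               "aws_credit_balance": int(balance)}
        url = "http://ids-panel.incf.net:9200/aws-%s/credits/" % now.strftime("%Y.%m")
        requests.post(url, data=json.dumps(doc))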
    

    In a larger-scale or business setting, Netflix’s Ice could definitely be a good fit.

    Conclusions and future improvements

    It has been a fun project: collecting data about your organization’s services and exposing it as clearly as possible, in real time. It feels like something I would like to do for a living as a consultant some day. New insights coming from the dashboard have already allowed us to decide to downscale some resources, ultimately saving money.

    The feedback from INCF people has been invaluable for rethinking how some data is presented and what it all means. Visualization is hard to get right: always bring in third parties and users, ask for their opinions and consider their feedback.

    My next iteration on this project is to get finer detail on which activities users are performing (data being shared, copied, downloaded). This could be achieved with some custom iRODS microservices for ElasticSearch, or by evaluating other EU-funded projects in this space such as Cheshire3.

    When it comes to who can access the dashboard, there’s a recent blog post on multi-faceted authentication, a.k.a. showing different views of the dashboard to different audiences. I’ve already tried Kibana’s authentication proxy, which supports OAuth among other auth systems, but there are a few rough edges left to polish.

    On the Logstash backend, it might be worth grepping the iRODS codebase for log stanzas, to identify the important log events worth parsing and to extract good semantic tokens from them. Luckily, ElasticSearch being backed by Lucene’s full-text engine helps a lot in avoiding that tedious task: Kibana/ElasticSearch search and filtering are excellent.

    Last but not least, some remaining issues leading to total world domination include:

    1. Instrumenting all your organization’s Python code with Logbook, sinking the log records to a Redis exchange (see the sketch after this list).
    2. Making it easier to add other types of panels to Kibana, perhaps allowing better or more explicit integration of D3, mpld3 or BokehJS with Kibana.
    3. Getting unique counts of records in ElasticSearch (i.e., counting unique IPs, users, etc…), which is on the roadmap under aggregations, so it is coming soon :)
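
    For the first item, here is a minimal sketch of what a Logbook-to-Redis sink could look like; the Redis key name and the JSON field names are assumptions, the idea being that Logstash’s redis input would then consume the list:

    # Minimal sketch: a Logbook handler pushing records onto a Redis list.
    import json
    import logbook
    import redis

    class RedisListHandler(logbook.Handler):
        def __init__(self, host="localhost", key="logstash", **kwargs):
            super(RedisListHandler, self).__init__(**kwargs)
            self.client = redis.StrictRedis(host=host)
            self.key = key

        def emit(self, record):
            self.client.rpush(self.key, json.dumps({
                "@timestamp": record.time.isoformat() + "Z",
                "message": record.message,
                "level": record.level_name,
                "channel": record.channel,
            }))

    # Usage:
    # with RedisListHandler().applicationbound():
    #     logbook.warn("disk almost full on {}", "ids-eu-1")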
  • iRODS logo + OwnCloud logo

    Context and iDROP

    This is an open, on-demand blog post and a draft software specification.

    When I started at INCF, among other duties I inherited the supervision of an ambitious data sharing project called DataSpace. After going through its system architecture and Python helper tools, I think the system follows good design principles.

    From the operational and usability sides, though, there is still plenty of room for improvement.

    One of the most commonly mentioned drawbacks of the system has to do with how the data is presented to researchers on the web. At the moment of writing this, the iDROP web interface is the canonical way for end users to access DataSpace. The web client integrates well with the underlying iRODS infrastructure. One can perform <a href="http://www.youtube.com/watch?v=YhciVQCZuBY">common data operations and manage metadata attributes</a>, as any command-line user would do with the underlying iRODS i-commands.

    One can even publish a file as a (non-shortened, ugly) public link to share data with researchers:

    <a href="https://ids-eu.incf.net/idrop-web/home/link?irodsURI=irods%3A%2F%2Fids-eu-1.incf.net%3A1247%2Fincf%2Fhome%2Fbraincode">https://ids-eu.incf.net/idrop-web/home/link?irodsURI=irods%3A%2F%2Fids-eu-1.incf.net%3A1247%2Fincf%2Fhome%2Fbraincode</a>

    OwnCloud development at INCF

    I also took charge of leading an ongoing contract with OwnCloud Inc. to completion: the first prototype of an iRODS-OwnCloud integration. After some weeks of testing, debugging, running into legacy PHP limitations, and plenty of help from Chris Smith and a proficient OwnCloud core developer, a working proof of concept was put together with PRODS, a PHP-based iRODS client.

    Today it works on both OwnCloud community and enterprise editions.

    Despite being a proof of concept with some serious performance issues, at least two scientific facilities have reported that they are already testing it. Even if that is good news, it needs more work to become a robust solution. In the following lines I will describe, as specifically as possible, what is needed to make it so.

    But first, please have a look at the following GitHub pull request for some context on the current issues that need fixing:

    https://github.com/owncloud/core/pull/6004

    OwnCloud+iRODS, what’s left to do?

    As seen in the pull request above (please read it), there is a significant performance penalty in operating the way the OwnCloud extension does today. Given research facilities A and B, and an OwnCloud instance C:

    1. Host A uploads a file to C.
    2. PRODS “streams” that file to host B, while still writing it to C’s disk anyway.
    3. On completion, host C deletes the uploaded file from its disk.

    A possible performance improvement would be to skip the disk altogether, or to keep only N chunks on disk for reliability, discarding the ones already sent. While we are talking about chunks, a set of tests should be written and some benchmarks run to determine empirically which block size fits OwnCloud transfers best. It is quite a critical core setting, so it should be tuned accordingly.
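
    The extension itself is PHP, but the buffering idea can be sketched out in Python: relay the upload in fixed-size chunks, keep only the last N chunks around in case a retransmission is needed, and drop the rest instead of spooling the whole file to disk first. The chunk size and window length below are placeholders to be benchmarked:

    from collections import deque

    CHUNK_SIZE = 4 * 1024 * 1024   # candidate block size, to be benchmarked
    MAX_BUFFERED = 8               # chunks kept around for retries

    def relay(source, send_chunk):
        """source: file-like stream from host A; send_chunk: callable pushing data to iRODS (host B)."""
        window = deque()
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:
                break
            window.append(chunk)
            if len(window) > MAX_BUFFERED:
                window.popleft()       # oldest chunk assumed delivered, free it
            send_chunk(chunk)          # e.g. a PRODS write on the iRODS connection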

    While fixing this issue would give us some speedup on the iRODS side, it would not take advantage of iRODS’s underlying Reliable Blast UDP protocol, since all transfers are channeled over HTTP instead of plain UDP.

    An improvement that would definitely boost performance would be to wrap the shell i-commands in well-scaffolded PHP functions. If the i-commands binaries were installed on the system, OwnCloud would detect and use them instead of falling back to a PRODS-based “compatibility mode”. From a user perspective, “compat mode” should be announced as a warning in the web interface, hinting at the i-commands and how sysadmins could install the corresponding RPM or DEB packages.
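
    The real implementation would live in OwnCloud’s PHP code; this Python sketch only illustrates the “use the i-commands when present, otherwise fall back to compat mode” logic. iput is the standard iRODS upload command, while prods_upload() is a hypothetical stand-in for the current PRODS code path:

    import shutil
    import subprocess

    def upload(local_path, irods_path, prods_upload):
        if shutil.which("iput"):                         # are the i-commands installed?
            subprocess.check_call(["iput", "-f", local_path, irods_path])
            return "native"
        prods_upload(local_path, irods_path)             # PRODS-based compatibility mode
        return "compat"                                  # surface a warning in the web UI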

    Since metadata is a big upside of iRODS, it should definitely be exposed in OwnCloud as well. A possible way to bridge the two would be to extend an existing OwnCloud metadata plugin.

    Last but not least, no software engineering project seems to be 100% safe from UTF-8 issues, and PRODS does not seem to be free of them either:

    {"app":"PHP","message":"SimpleXMLElement::__construct(): Entity: line 18: parser error : Input is not proper UTF-8, indicate encoding !\nBytes: 0x92 0x73 0x2D 0x61 at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    {"app":"PHP","message":"SimpleXMLElement::__construct():  at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    {"app":"PHP","message":"SimpleXMLElement::__construct(): ^ at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    (...)

    So a fix for that is in order too. The ultimate goal is to have a well-rounded pull request that OwnCloud developers can merge without concerns, making those features available to several scientific communities in need of user-friendly, high-performance data management.

    I hope I outlined what needs to be done from a software engineering perspective. Other issues would surely pop up while developing and extending OwnCloud, and those deviations should be accounted for when planning the roadmap.

    Funding? Developers?

    OwnCloud seems to be a good candidate to embrace a data management framework heavily used in academia: iRODS. Recently an OwnCloud REST API has become available, and bindings for modern cloud storage systems are being worked on. This is a good chance to bring iRODS and other scientific data management systems together with OwnCloud, but we need: developers, developers, developers!

    In the next few days, INCF will be applying for EUDAT funding. We are looking for talented OwnCloud/PHP developers, so please feel free to leave a comment below if you are interested in this initiative, reach me on Twitter (@braincode), or just follow up on the aforementioned pull request!

  • Disclaimer: these are my opinions only and not my employer’s, etc, etc...

    Today marks two months since I joined the International Neuroinformatics Coordinating Facility (INCF), located on the Karolinska Institutet campus in Stockholm. Coincidentally, I happened to land the job in the midst of a neuroinformatics conference:

    INCF Neuroinformatics congress 2013

    Before that, I spent almost 3 years in another field of (data) science: genomics. I think I’m extremely lucky to be involved in these two different cutting-edge computational life science disciplines, so rich and diverse at the science level and yet quite similar in infrastructure needs: more storage, more processing and more standards.

    Also today I got to answer a series of seemingly routine questions prior to attending a workshop (EUDAT). While I was writing, I realized that I was drawing a nice portrait of today’s data science wins and struggles, be it genomics, neuroscience or other data-hungry sciences I might encounter during my career.

    Suddenly I couldn’t resist sharing my braindumpings; I hope you enjoy the read :)

    Who are your major scientific user communities, and how do they use the technology in concrete applications?

    According to the “about” page of the official INCF website:

    INCF develops and maintains databases and computational infrastructure for neuroscientists. Software tools and standards
    for the international neuroinformatics community are being developed through the INCF Programs (…)

    So INCF’s purpose is to be the glia between several neuroscience research groups around the world. Or as Nature put it recently about the upcoming human brain project:

    DataSpace, DAI and NeuroLex are outcomes of INCF initiatives that have supported the community since 2005.

    INCF mug

    Other interesting community projects that assist scientists in running and keeping track of experiments in the neuroinformatics community are Sumatra, BrainImagingPipelines and a W3C-PROV framework for data provenance. Make sure you open browser tabs on those links; they are definitely worth checking out!

    While those tools are pretty much domain-specific, they bear some resemblance to their counterparts in genomics pipelines, for instance.

    What do you think are the most important interoperability aspects of workflows in scientific data infrastructures?

    In today’s academic HPC facilities there are no standards for “computation units”. Most software is installed and maintained manually, in non-portable ways. Without canonical software installation procedures, the ideal of moving computation to where the data is will become increasingly difficult to attain.

    It’s striking to see such common architectural needs across research fields alongside the lack of funding incentives to ensure standard packaging of software and coherent data schemes.

    A notable example of a data standard in genomics, as put by Francesco, is the SAM/BAM file format. Its specification is barely 15 pages long, straight to the point and enough to attract people into using it as the de facto standard for sequence alignments. Even if it varies slightly between versions and has custom tracks, it seems it is here to stay.

    Similarly, the OpenFMRI project is set to be an example worth following for the neuroscience community. With a single figure (see below), they capture the essence of structuring brain scans in a filesystem hierarchy:

    OpenFMRI standard scheme

    So, important interoperability aspects would be, among others:

    1. Giving incentives to portable software packaging efforts.
    2. Early data and paper publication, even before official acceptance.

    The first requires skilled people willing and able to dedicate efforts to pave the way towards solid and repeatable software installations.

    An example of realizing in time how important it is to package software properly comes from the computer security world, with efforts like BackTrack and Kali Linux. Both are essentially the same pentesting Linux distribution, except that, as a result of a change in policy regarding software maintenance, Kali can now be ported to a flurry of different hardware, which wasn’t so straightforward with BackTrack.

    The second requires a paradigm shift when it comes to data sharing and publication in life sciences:

    Where do you see major barriers with respect to workflow support near the data storage?

    More efforts into creating an interoperable infrastructure via software and data encapsulation.

    Willingness to move some academic computation efforts towards HPC research computing, as opposed to throwing more manual maintenance work at it. The final aim is to be able to run experiments in different HPC facilities without changing default configurations significantly.

    More specifically, dedicating some research effort to deploying and testing solutions like Docker and OpenStack, as opposed to sticking with siloed HPC installations.

    What are the most urgent needs of the scientific communities with respect to data processing?

    To put it simply, prioritize standardization and incentivize incremental science as opposed to “novel” approaches that do not contribute to the field significantly (wheel reinvention).

    Jokes aside, this happens all the time in genomics data formats with major overlap in features.

    Let’s hope neuroscience learns from this and gets a better grip on community consensus over the years.

    Here I am on my second day of the BOSC hackathon, polishing work from yesterday, but also seeing interesting new projects taking off. These are my notes from the second day. See also my notes from the first day.

    Pencils down for the coding

    So today we try to wrap up SLURM support in ipython-cluster-helper. We realized that generalizing job managers is hard. Even though at the basic level they all do the same thing, namely submit jobs and manage hardware resources, the different flavors exist for a reason.

    Extra arguments or “native specifications” that do not fit the normal job-scheduler blueprint must be passed along nicely, and that final-mile effort takes some time to nail down.

    Furthermore, a generalized DRMAA patch for upstream IPython parallel would require more than two days to whip up, so we instead move on to optimizing what we have on two different fronts:

    1. Getting old SLURM versions to work with ipython-cluster-helper without job arrays, in an efficient way (see the sketch after this list).
    2. Automating the deployment of SLURM server(s) with a configuration management tool: Saltstack.
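
    For the first point, here is a hedged sketch of the workaround: SLURM < 2.6 has no --array support, so instead of a job array we request a single allocation with N tasks and let srun fan out the worker command. "ipengine" stands in for whatever ipython-cluster-helper actually launches, and the partition/walltime values are placeholders:

    import subprocess
    import tempfile

    def submit_engines_without_array(n_engines, profile_dir,
                                     partition="core", walltime="01:00:00"):
        # Build a batch script that asks for n_engines tasks in one allocation
        lines = ["#!/bin/bash",
                 "#SBATCH --job-name=ipython-engines",
                 "#SBATCH --partition=%s" % partition,
                 "#SBATCH --ntasks=%d" % n_engines,
                 "#SBATCH --time=%s" % walltime,
                 "srun ipengine --profile-dir=%s" % profile_dir,
                 ""]
        with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as handle:
            handle.write("\n".join(lines))
            path = handle.name
        # sbatch replies with "Submitted batch job <jobid>"
        return subprocess.check_output(["sbatch", path]).strip()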

    Other projects

    Per Unneberg manages to set up a proof of concept for a metrics client that reports runtime statistics and system information from different running processes to a web service. The idea stems from bioplanet’s Genome Comparison & Analytic Testing (GCAT). On that site, several pipelines are compared from the accuracy perspective, but nothing is shown about performance; questions such as:

    • How long did it take to run such a pipeline from beginning to end?
    • Which hardware resources, such as CPUs and memory, were you using?
    • Which organism(s), and at what sequencing depth, were you running the pipeline against?

    Some interesting talk around biolite, a data provenance system for bioinformatics, arises as a side result of this work. In fact, the bcbio-nextgen pipeline includes preliminary support for such a system.

    Guillermo unearths a cool project he had wanted to revive for a while: pytravis, a Python API to interact with our favourite continuous integration system at SciLifeLab.

    Meanwhile, the folks over at CloudBioLinux come up with nice automation and deployment scripts using Puppet/Chef that should eventually ease the pain for genomics centers trying to tame their reference genomes.

    Those are just a few of the many initiatives going on in this hackathon, which ends today; tomorrow BOSC starts. If you want to know more, don’t miss the official CodeFest 2013 wiki; there’s a nice wrap-up of the many parallel projects there.

  • I’m at the 4th Bioinformatics Open Source Conference in a warm and sunny Berlin.

    After everyone found the venue, a preliminary brainstorming session helped us organize the tasks across several workgroups: BioRuby and Illumina BaseSpace, Visualization, CloudBioLinux/CloudMan, Ontologies and metadata, Data Handling, Biopython, etc…

    Our contribution

    Valentine, Guillermo and I sat down in front of Rory Kirchner and Brad Chapman to whip up SLURM support in their ipython-cluster-helper module. That would help SciLifeLab move from the old bcbio-nextgen pipeline to the new IPython-backed version, with all the neat parallelization tricks needed to run up to 1500 human WGS samples.

    The motivation behind our specific task is to:

    1. Implement basic SLURM support by understanding the existing classes in ipython-cluster-helper, which already support the SGE, LSF, Torque and Condor schedulers.
    2. Learning from that, introduce the DRMAA connector, generalizing the scheduler-specific classes.
    3. Ultimately, port that generalization into IPython so that Python scientific computations can be executed efficiently across different clusters around the world.

    That was the idea. What really happened, as with any software journey, is that planning the trip differs somewhat from actually walking it:

    1. We realized that array jobs are not supported on SLURM <2.6.x, and implemented a workaround using srun.

    <blockquote class="twitter-tweet" width="550" lang="en">
    @UPPMAX, can you please update to #SLURM >= 2.6.x so that we can run job arrays and @RoryKirchner's ipython-cluster-helper? #kthxbye
    — Roman Valls (@braincode) July 17, 2013
    </blockquote>

    2. We realized that, since DRMAA does not generate job templates that can be sent via the command line, it might be wiser to put that support directly into IPython.
    3. Guillermo got his hands dirty installing SLURM on a couple of Vagrant machines, so that we don’t have to wait in long queues on our compute cluster.

    Other stuff happening outside our coding bubble

    During the day, the original draft ideas outlined on the board changed as participants got to talk to each other. If anything, that would be the highlight and the common modus operandi of most hackathons I’ve been involved in: self-organized groups turning vague questions such as “what are you up to?” into useful working code and collaborations.

    During a very quick walk around the room, I discovered a variant analysis pipeline based on Ruffus used by the Victorian Life Sciences Computation Initiative at the University of Melbourne. It is meant to play well with Enis Afgan’s integration, or CloudBioLinux flavor, for the Australian national cloud infrastructure.

    From provenance standardization to workflow systems, and a prototype by Per Unneberg to collect runtime metrics, the day gave a good grasp of the exciting road ahead for genomics and computational biology.

    Definitely looking forward to some more action tomorrow morning :)