• Software carpentry and learning

    I am going through the 11th iteration of the Software Carpentry instructor training, and I am quite happy with the way Greg Wilson conducts his bi-weekly calls and focuses on evidence-based learning techniques.

    At first I expected this course to simply teach me how to deliver specific Software Carpentry lessons, that is, traditional master classes on software engineering, tools and accompanying cookbooks.

    During the first session I quickly realized that his approach goes far beyond that.

    Greg is attacking the very roots of the learning process in order to provide a solid teaching base, ultimately offering robust, research-based (scientific) software literacy to the world.

    How motivation works

    Going through chapter 3 of the How Learning Works ebook from Software Carpentry’s recommended reading, there is an illustrative account from a teacher about her first-day class speech:

    So I told my students on the first day of class, “This is a very difficult course. You will need to work harder than you have ever worked in a course and still a third of you will not pass”

    I heard that kind of speech plenty of times myself, and somewhat learned to ignore it during my university days. Perhaps unsurprisingly, those words were followed by unintended consequences:

    But to my surprise, they slacked off even more than in previous semesters (…) their test performance was the worst it had been for many semesters.

    Using several research papers as a foundation to understand such situations, How Learning Works concludes that:

    Limited chances of passing may fuel preexisting negative perceptions about the course, compromise her students’ expectations for success, and undermine their motivation to do the work necessary to succeed.

    Again, I’ve seen this happen time after time in different academic contexts over the years.

    This week’s assignment

    So in this week’s course session, Greg asked us to describe a personal experience from our own education where we saw a similar situation.

    Sometime during my high school years, my physics teacher Iñaki said something like:

    If you don’t put more effort into physics, I think you should consider fixing motorbikes in FP instead.

    FP (Formación Profesional) is the Spanish educational route for those who want to skip university and pursue a more applied professional degree. FP is often mocked by Spanish society (some teachers included) and regarded as a “lower paid” and “less honorable” means of earning a living than the more traditional academic degrees.

    This is a myth: many FP professionals make a very decent living in roles such as DBA. Germans know that very well and hire accordingly.

    I believe Iñaki wanted me to succeed in physics and meant well, targeting my self-esteem as a way to push me harder. Looking back, I see that it was a bad move that effectively de-motivated me. Although I did pass, I did not enjoy the subject as I should have, didn’t learn it as thoroughly and, therefore, didn’t earn higher scores.

    I tend to obsess over topics I like. Curiosity keeps me awake at night; it’s like an unconscious autopilot that drives me towards higher understanding. As I discover and dig deeper into subjects I want to learn more about, I completely lose track of time. In my experience, frictionless, smooth learning almost invariably results in well-crystallized knowledge and high scores.

    Later on, while pursuing my computational biology master’s degree in Sweden, I re-discovered (bio)physics while diving into the incredible world of ion channels and biomedicine, in a very different learning environment.

    I loved it.

  • I am at the INCF Neuroinformatics Congress in Leiden, co-organizing an in-congress hackathon. An Etherpad was used to announce and coordinate tasks.

    Participants

    The following participants walked into the room during the three days it was open and introduced themselves:

    Shreejoy Tripathy, University of British Columbia
    Richard Gerkin, Arizona State University
    Zameer Razack, Bournemouth University
    Tristan Glatard, McGill University, CBRAIN
    Joris Slob, Leiden University
    Barry Wark, Physion LLC
    Stephen Larson, OpenWorm
    Jeff Muller, EPFL-Blue Brain Project
    Roman Valls Guimera, INCF
    

    At first we started with very few participants, but there was a stream of people over the days, coming and going: some exchanging knowledge, others just curious about the concept of a hackathon. Local researchers such as Joris Slob, who works with ontologies and neuroimaging, were also highly welcome; growing the neuroinformatics hackathon clique is always a good idea.

    A special mention goes to Barry Wark, Physion’s CEO, who kindly sponsored this hackathon and briefly introduced us to Ovation, a great scientific data management system that talks to many commercial cloud backends. Thanks Barry!

    Hands on work

    A basic hackathon principle is learning by doing. An early example started taking shape during the first minutes of the event via the collaborative integration efforts of Barry Wark and Christian from G-Node. After a brief discussion, they both started coding Java bindings for NIX, an HDF5-based neuroscience exchange format. They used SWIG for those bindings, so other programming languages could potentially talk to NIX too.

    This is also a very clear, mutually beneficial, hands-on example of collaboration between industry (Ovation) and research institutions (G-Node).

    Another interesting initiative that surfaced during the event was a possible integration between the CBRAIN workflow system and the INCF NIDASH neuroinformatics data model and reference implementation. Tristan Glatard and I went through NIDASH’s abundant IPython notebooks and documentation and, after understanding the framework, proposed a quickstart for developers, which is coming soon.

    The specific outcome of this would be a provenance-information export feature for use when publishing or sharing neuroinformatics research.

    Meanwhile, Stephen Larson from OpenWorm got two INCF GSoC slots in the 2014 edition. His particular mission was to improve packaging for the PyOpenWorm alpha0.5 version produced by the INCF OpenWorm GSoC student project, getting it ready to merge into master.

    Stephen got to hear about the sorry state of packaging in Python, but he took advantage of the hackathon time to fix and publish the GSoC outcomes for easier public consumption.

    Those in neurophysiology will love to hear that the NeuroElectro API documentation was improved by Rick Gerkin together with Shreejoy Tripathy. It is interesting to see how much electrophysiology data can be extracted from the literature, and all the better if it can be queried via an API!

    On my side, I simplified the nipype (a bummer, since it turned out to be fixed already) and pymvpa2 installation processes, and revived interest in OpenTracker, a BitTorrent tracker that could potentially be used as a more scalable data-sharing backend for *-stars scientific Q&A sites.

    If BitTorrent has had so much success in file sharing, why shouldn’t the same happen when sharing scientific datasets? Let’s lower the barrier for this to happen!

    DIY hackathon

    As Stephen Larson from OpenWorm put it very well while anticipating the upcoming CRCNS hackathon:

    • Define the time and goals of the hackathon you have in mind in advance, and write them up.
    • Target your participants in advance and ask them to provide at least one-liners on who they are and what they do, and help you collaboratively make an ‘ideas list’. Good platforms for this are wikis (maybe a GitHub wiki), a Google Doc that is world editable, or as Roman used recently, Etherpad.
    • Lately I’ve been seeing folks use real-time chat more. Consider opening up an IRC chat room; I’ve also seen people liking HipChat, Slack, or Google Hangouts (chat) for this.
    • When the time comes, lead by example during the session by being there the whole time and driving the process. Maybe open up an on-air Google Hangout if there is an in-person component, and have people watch you hacking while interacting via chat or other collaborative media.
    • During the event, try to collect a list of things that “got done”. This will be the meat you will use to demonstrate the effectiveness of the session to those who ask you to justify its existence :)

    Along those lines: I’m looking at you, INCF Belgian node, who will hold such an event very soon.

    Another important hint for such events is, in my opinion, to minimize the number of “speakers” and the time allocated to them. Doing should be prioritized over delivering a 30-minute presentation, effectively moving away from traditional scientific congress formats.

    Conclusions

    Hackathons (or codefests) are meant to get things done or polished, and to spark new ideas and interesting collaborations.

    This all happened with a relatively small number of participants, but outcomes usually grow (up to a certain point) the more engaged participants show up and work. See BrainHack as an excellent example of this.

    INCF realized that hackathons should be treated as dedicated events, free from other (highly interesting) congress distractions, and will continue to support them via its Hackathon Series program.

    A chained stream of science hackathons, such as the #mozsprint held a few months ago, helps push tools, refinements and integrations forward. That is, standardized and interoperable neuroinformatics research, in line with INCF’s core mission.

    More neuro-task proposals can be found over at NeuroStars. Pick yours for the next hackathon ;)

  • A dashboard with pretty plots and numbers about your organization… It shouldn’t be that difficult, right?

    Luckily, ElasticSearch + Logstash + Kibana is a nice stack for solving this problem. Here’s my result and how I solved some integration hiccups. The resulting beta dashboard, as it stands now, looks like this:

    DataSpace beta operations panel

    Many, many other blog posts have addressed this topic, but I would like to share a couple of tweaks I came up with while working on it.

    Logstash and iRODS custom log parsing

    Broadly speaking, I found the following issues when setting up the dashboard:

    1. Index design: Logstash’s default index scheme (logstash-YYYYMMDD) brought down a 30GB RAM machine (line 63).
    2. Missing year in log timestamps: the year is absent from the log content, so it has to be recovered from the log filenames (lines 31 to 42 below).
    3. GeoIP coordinates not being parsed by Kibana’s bettermap (lines 44 to 58).
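To illustrate issue 2, here is a minimal Python sketch (not the actual Logstash filter) of recovering the missing year from the log filename; the `rodsLog.YYYY.MM.DD` naming pattern and the helper name are assumptions for illustration:

```python
import re
from datetime import datetime

def timestamp_with_year(filename, log_time):
    """Rebuild a full timestamp for a log line whose year is missing.

    Assumes a (hypothetical) filename pattern like 'rodsLog.2014.02.15'
    and in-log timestamps like 'Feb 15 10:32:01'.
    """
    match = re.search(r"\.(\d{4})\.", filename)
    if not match:
        raise ValueError("no year found in filename: %s" % filename)
    # Prepend the year taken from the filename, then parse the whole thing.
    return datetime.strptime("%s %s" % (match.group(1), log_time),
                             "%Y %b %d %H:%M:%S")
```

In Logstash itself the same trick boils down to a grok pattern on the file path plus mutate/date filters.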

    ElasticSearch tweaks

    As mentioned, the default “daily” indexing scheme in Logstash did not work for my purposes. When monitoring ElasticSearch with the head plugin, the status went red while ingesting events from logs. Thanks to feedback from @zackarytong and others, I managed to address the issue.

    Then Kibana could not connect to the ElasticSearch backend when large time ranges were defined. After some digging with the Chrome developer tools and cURL, I reported the issue: too many indexes were being URL-encoded, which required setting the following ElasticSearch directive:

    http.max_initial_line_length: 64kb

    In the future, I might consider further indexing optimizations, using the ElasticSearch index curator tool to optimize and cut down index usage, but for now I want to keep all data accessible from the dashboard.

    Kibana

    The presentation side of this dashboard hipster stack also had some quirks. My original idea of showing three separate Kibana bettermaps, one per AWS region, had to wait until another issue was addressed very recently. Both the Chrome developer console and AngularJS Batarang were very useful for finding issues in Kibana and its interactions with the ElasticSearch backend.

    Speaking of Amazon and its regions: while AWS exposes a great amount of functionality through its developer APIs, at the time of writing there are missing API endpoints, such as billing information across all regions (only available in us-east-1) and fetching the remaining credits from your EDU Amazon grant, should you have one. The only viable option left today is scraping it.

    If one wants to keep track of AWS expenses in ElasticSearch, some index type mapping changes are needed on the ElasticSearch side:

    curl -XPUT 'http://ids-panel.incf.net:9200/aws-2014.02/credits/_mapping' -d '{
      "credits": {
        "properties": {
          "aws_credit_balance": {
            "type": "integer"
          }
        }
      }
    }'
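The same mapping body can also be generated programmatically. The following Python sketch (the helper name is mine, for illustration) builds the JSON payload that the curl call above sends:

```python
import json

def credit_balance_mapping(type_name="credits"):
    # Explicit integer mapping so numeric stats can be run on the field.
    return {
        type_name: {
            "properties": {
                "aws_credit_balance": {"type": "integer"}
            }
        }
    }

payload = json.dumps(credit_balance_mapping())
# This payload would then be PUT to /aws-2014.02/credits/_mapping,
# e.g. with python-requests or urllib.
```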
    

    In a larger-scale business setting, Netflix’s Ice could definitely be a good fit.

    Conclusions and future improvements

    It has been a fun project: collecting data about your organization’s services and exposing it as clearly as possible, in real time. It feels like something I would like to do for a living as a consultant some day. New insights coming from the dashboard have already allowed us to decide to downscale resources, ultimately saving money.

    The feedback from INCF people has been invaluable for rethinking how some data is presented and what it all means; always bring in third parties and ask their opinions. Visualization is hard to get right, so bring in users and consider their feedback.

    My next iteration on this project is to have finer detail on which activities users are performing (data being shared, copied, downloaded). This could be achieved with some custom iRODS microservices for ElasticSearch, or by evaluating other EU-funded projects in the area, such as Cheshire3.

    When it comes to who can access the dashboard, there is a recent blog post on multi-faceted authentication, a.k.a. showing different views of the dashboard to different audiences. I have already tried Kibana’s authentication proxy, which supports OAuth among other auth systems, but there are a few rough edges left to polish.

    On the Logstash backend, it might be worth grepping the iRODS codebase for log stanzas to assess which important log events are worth parsing and to extract good semantic tokens out of them. Luckily, ElasticSearch is backed by Lucene’s full-text engine, which helps a lot in avoiding this tedious task. Kibana/ElasticSearch search and filtering are excellent.

    Last but not least, some remaining issues leading to total world domination include:

    1. Instrumenting all your organization’s Python code with Logbook, sinking to a Redis exchange.
    2. Easily adding or including other types of panels in Kibana, perhaps allowing better or more explicit integration possibilities for D3, mpld3 or BokehJS.
    3. Getting unique counts for records in ElasticSearch (e.g., counting unique IPs, users, etc.), which is on the roadmap under aggregations, so it is coming soon :)
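For point 3, once aggregations land, a unique count should be expressible with a query body along these lines. This is only a sketch based on the announced roadmap; the cardinality aggregation name and the `clientip` field are assumptions:

```python
import json

# Hypothetical request body: count distinct client IPs without returning hits.
unique_ips_query = {
    "size": 0,
    "aggs": {
        "unique_ips": {
            "cardinality": {"field": "clientip"}
        }
    }
}

print(json.dumps(unique_ips_query, indent=2))
```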
  • iRODS logo + OwnCloud logo

    Context and iDROP

    This is an open, on-demand blog post and drafty software specification.

    When I started at INCF, among other duties I inherited the supervision of an ambitious data sharing project called DataSpace. After going through its system architecture and Python helper tools, it seems that the system follows good design principles.

    Still, on the operational and usability sides there is plenty of room for improvement.

    One of the most commonly mentioned drawbacks of the system has to do with how the data is presented to researchers on the web. At the time of writing, the iDROP web interface is the canonical way for end users to access DataSpace. The web client integrates well with the underlying iRODS infrastructure. One can perform <a href=http://www.youtube.com/watch?v=YhciVQCZuBY>common data operations plus manage its metadata attributes</a>, as any command-line user would do with the underlying iRODS i-commands.

    One can even publish a file as a (non-shortened, ugly) public link to share data with researchers:

    <a href=https://ids-eu.incf.net/idrop-web/home/link?irodsURI=irods%3A%2F%2Fids-eu-1.incf.net%3A1247%2Fincf%2Fhome%2Fbraincode>

    https://ids-eu.incf.net/idrop-web/home/link?irodsURI=irods%3A%2F%2Fids-eu-1.incf.net%3A1247%2Fincf%2Fhome%2Fbraincode</a>

    OwnCloud development at INCF

    I also took charge of leading an ongoing contract with OwnCloud Inc. to completion: the first prototype of an iRODS-OwnCloud integration. After some weeks of testing, debugging, discovering legacy PHP limitations, and plenty of help from Chris Smith and a proficient OwnCloud core developer, a working proof of concept was put together with PRODS, a PHP-based iRODS client.

    Today it works on both OwnCloud community and enterprise editions.

    Despite being a proof of concept with some serious performance issues, at least two scientific facilities have reported that they are already testing it. Even if this is good news, it needs more work to become a robust solution. In the next lines I will describe what is needed to make it so, trying to be as specific as possible.

    But first, please have a look at the following GitHub pull request for some context on the current issues that need fixing:

    https://github.com/owncloud/core/pull/6004

    OwnCloud+iRODS, what’s left to do?

    As seen in the previous pull request (please, read it), operating the way the OwnCloud extension does today incurs a significant performance penalty. Given research facilities A and B and an OwnCloud instance C:

    1. Host A uploads a file to C.
    2. PRODS “streams” that file to host B, while still writing it to disk anyway.
    3. On completion, host C deletes the uploaded file from disk.

    A possible performance improvement would be to skip the disk altogether, or to keep just N chunks on disk for reliability, discarding the chunks already sent. While we are talking about chunks, a set of tests should be written and some benchmarks run to determine empirically which block size fits OwnCloud transfers best. It is quite a critical core setting, so it should be tuned accordingly.
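The “keep N chunks” idea can be sketched in a few lines of Python (the real extension is PHP; function names, chunk size and window are assumptions to be settled by the benchmarks mentioned above):

```python
def relay(src, dst, chunk_size=4 * 1024 * 1024, window=2):
    """Forward src to dst chunk by chunk, buffering at most `window` chunks.

    Instead of spooling the whole upload to disk, each chunk is forwarded as
    soon as it arrives; only the last `window` chunks are retained so a failed
    send could in principle be retried, and older ones are discarded.
    """
    pending = []
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)        # forward immediately, no full spool
        pending.append(chunk)   # keep a small retry window...
        if len(pending) > window:
            pending.pop(0)      # ...and drop chunks already safely sent
    dst.flush()
```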

    While fixing this issue would give us some iRODS speedup, it would not take advantage of iRODS’s underlying Reliable Blast UDP protocol, since all transfers are channeled over HTTP instead of plain UDP.

    An improvement that would definitely boost performance would be to wrap the shell i-commands in well-scaffolded PHP functions. If the i-commands binaries were installed on the system, OwnCloud would detect and use them instead of the PRODS-based “compatibility mode”. From a user perspective, “compat mode” should be announced as a warning in the web interface, hinting the user about the i-commands and how sysadmins could install the corresponding RPM or DEB packages.
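The detection logic could look roughly like this Python sketch (the actual extension would do this in PHP; `prods_put` is a hypothetical stand-in for the PRODS-based transfer):

```python
import shutil
import subprocess

def transfer_mode(which=shutil.which):
    """Return 'native' when the iRODS i-commands are installed, else 'compat'."""
    return "native" if which("iput") else "compat"

def put_file(local_path, irods_path):
    if transfer_mode() == "native":
        # Native path: hand the transfer to iput (-f overwrites existing files).
        subprocess.check_call(["iput", "-f", local_path, irods_path])
    else:
        # PRODS-based "compatibility mode"; the web UI should warn the user and
        # hint that sysadmins can install the i-commands RPM/DEB packages.
        prods_put(local_path, irods_path)  # hypothetical PRODS wrapper
```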

    Since metadata is a big upside of iRODS, it should definitely be exposed in OwnCloud as well. A possible way to meet both ends would be to extend an existing OwnCloud metadata plugin.

    Last but not least, no software engineering project seems to be 100% safe from UTF-8 issues, and PRODS does not seem to be immune either:

    {"app":"PHP","message":"SimpleXMLElement::__construct(): Entity: line 18: parser error : Input is not proper UTF-8, indicate encoding !\nBytes: 0x92 0x73 0x2D 0x61 at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    {"app":"PHP","message":"SimpleXMLElement::__construct(): at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    {"app":"PHP","message":"SimpleXMLElement::__construct(): ^ at /var/www/owncloud/apps/files_external/3rdparty/irodsphp/prods/src/RODSMessage.class.php#140","level":2,"time":"2013-12-05T14:46:25+00:00"}
    (...)

    So a fix for that is in order too. The ultimate goal is to have a well-rounded pull request that OwnCloud developers can merge without concerns, making these features available to the several scientific communities in need of user-friendly, high-performance data management.

    I hope I have outlined what needs to be done from a software engineering perspective. Other issues will surely pop up while developing and extending OwnCloud, and those deviations should be accounted for when planning the roadmap.

    Funding? Developers?

    OwnCloud seems to be a good candidate to embrace iRODS, a heavily used data management framework in academia. Recently an OwnCloud REST API has become available, and bindings for modern cloud storage systems are being worked on. This is a good chance to bring iRODS and other scientific data management systems together with OwnCloud, but we need: developers, developers, developers!

    In the next few days, INCF will be applying for EUDAT funding. We are looking for talented OwnCloud/PHP developers, so please feel free to leave a comment below if you are interested in this initiative, reach me on Twitter at @braincode, or just follow up on the aforementioned pull request!

  • Disclaimer: these are my opinions only, not my employer’s.

    Today marks two months since I joined the International Neuroinformatics Coordinating Facility (INCF), located on the Karolinska Institutet campus in Stockholm. Coincidentally, I happened to land the job in the midst of a neuroinformatics conference:

    INCF Neuroinformatics congress 2013

    Before that, I spent almost three years in another field of (data) science: genomics. I think I am extremely lucky to be involved in these two different cutting-edge computational life science disciplines, so rich and diverse at the science level and yet pretty similar in infrastructure needs: more storage, more processing and more standards.

    Also today, I got to answer a series of seemingly routine questions prior to attending a workshop (EUDAT). While I was writing, I realized that I was drawing a nice portrait of today’s data science wins and struggles, be it in genomics, neuroscience or other data-hungry sciences I might encounter during my career.

    Suddenly I couldn’t resist sharing my braindumpings; I hope you enjoy the read :)

    Who are your major scientific user communities, and how do they use the technology in concrete applications?

    According to the “about” section of the official INCF website:

    INCF develops and maintains databases and computational infrastructure for neuroscientists. Software tools and standards
    for the international neuroinformatics community are being developed through the INCF Programs (…)

    So INCF’s purpose is to be the glia between several neuroscience research groups around the world. Or as Nature put it recently about the upcoming Human Brain Project:

    DataSpace, DAI and NeuroLex are outcomes of INCF initiatives, supporting the community since 2005.

    INCF mug

    Other interesting community projects that help scientists run and keep track of experiments in the neuroinformatics community are Sumatra, BrainImagingPipelines and a W3C-PROV framework for data provenance. Make sure you open browser tabs on those links; they are definitely worth checking!

    While those tools are pretty much domain-specific, they bear some resemblance to counterparts from, for instance, genomics pipelines.

    What do you think are the most important interoperability aspects of workflows in scientific data infrastructures?

    In today’s academic HPC facilities there are no standards for “computation units”. Most of the software is installed and maintained manually, in non-portable ways. Without canonical software installation procedures, the ideal of moving computation to where the data is will be increasingly difficult to attain.

    It’s striking to see these common architectural needs across research fields, and the lack of funding incentives to ensure standard packaging of software and coherent data schemes.

    A notable example of a data standard in genomics, as put by Francesco, is the SAM/BAM file format. It is barely 15 pages long, straight to the point, and enough to attract people into using it as the de-facto standard for sequence alignment. Even if it varies slightly between versions and has custom tracks, it seems to be here to stay.

    Similarly, the OpenFMRI project sets an example worth following for the neuroscience community. With a single figure (see below), it captures the essence of structuring brain scans in a filesystem hierarchy:

    OpenFMRI standard scheme

    So, important interoperability aspects would include, among others:

    1. Giving incentives to portable software packaging efforts.
    2. Early data and paper publication, even before official acceptance.

    The first requires skilled people willing and able to dedicate effort to paving the way towards solid and repeatable software installations.

    An example of realizing in time how important it is to package software comes from the computer security world, with efforts like BackTrack and Kali Linux. Both are essentially the same pentesting Linux distribution, except that, as a result of a change in software maintenance policy, Kali can now be ported to a flurry of different hardware, which was not so straightforward with BackTrack.

    The second requires a paradigm shift when it comes to data sharing and publication in the life sciences.

    Where do you see major barriers with respect to workflow support near the data storage?

    More effort should go into creating an interoperable infrastructure via software and data encapsulation.

    There should also be willingness to steer some academic computation efforts towards HPC research computing, as opposed to throwing more manual maintenance work at it. The final aim is to run experiments in different HPC facilities without changing default configurations significantly.

    More specifically, dedicate some research effort to deploying and testing solutions like Docker and OpenStack, as opposed to sticking with siloed HPC installations.

    What are the most urgent needs of the scientific communities with respect to data processing?

    To put it simply: prioritize standardization and incentivize incremental science, as opposed to “novel” approaches that do not contribute significantly to the field (wheel reinvention).

    Jokes aside, this happens all the time with genomics data formats that have major overlaps in features.

    Let’s hope neuroscience learns from this and gets a better grip on community consensus over the years.