Context and iDROP
This is an open, on-demand blog post and drafty software specification.
When I started at INCF, among other duties I inherited the supervision of an ambitious data sharing project called DataSpace. After going through its system architecture and Python helper tools, it seems to me that the system follows good design principles.
From the operational and usability sides, however, there is plenty of room for improvement.
One of the most commonly mentioned drawbacks of the system has to do with how the data is presented to researchers on the web. At the time of writing, an iDROP web interface is the canonical way for end users to access DataSpace. The web client integrates well with the underlying iRODS infrastructure. One can perform <a href=http://www.youtube.com/watch?v=YhciVQCZuBY>common data operations and manage metadata attributes</a>, as any command-line user would do with the underlying iRODS i-commands.
One can even publish a file as a (non-shortened, ugly) public link to share data with other researchers:
OwnCloud development at INCF
I also took charge of leading an ongoing contract with OwnCloud Inc to completion on the first prototype of an iRODS-OwnCloud integration. After some weeks of testing, debugging, running into legacy PHP limitations, and plenty of help from Chris Smith and a proficient OwnCloud core developer, a working proof of concept was put together with PRODS, a PHP-based iRODS client.
Today it works on both OwnCloud community and enterprise editions.
Although it is a proof of concept with some serious performance issues, at least two scientific facilities have reported that they are already testing it. That is good news, but it needs more work to become a robust solution. In the following lines I describe what is needed to get there, as specifically as possible.
But first, please have a look at the following GitHub pull request for some context on the current issues that need fixing:
OwnCloud+iRODS, what’s left to do?
As seen in the previous pull request (please read it), the OwnCloud extension incurs a significant performance penalty in the way it operates today. Given research facilities A and B and an OwnCloud instance C:
- Host A uploads a file to C.
- PRODS “streams” that file, while still writing on disk anyway, to host B.
- On completion, host C deletes the uploaded file from disk.
A possible performance improvement would be to skip the disk altogether, or to keep just N chunks on disk for reliability and discard those already sent. Speaking of chunks, a set of tests should be written and some benchmarks run to determine empirically which block size works best for OwnCloud transfers. It is a critical core setting, so it should be tuned accordingly.
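The chunked relay and the block-size benchmark can be sketched roughly as follows. This is a minimal illustration in Python rather than the extension's PHP, and it is not how PRODS or OwnCloud work today: the names `relay`, `benchmark` and `keep_chunks` are hypothetical, and in-memory streams stand in for the real upload and the real iRODS connection.

```python
import io
import time
from collections import deque


def relay(src, dst, chunk_size, keep_chunks=4):
    """Copy src to dst in fixed-size chunks, retaining only the last few.

    `keep_chunks` (hypothetical setting) bounds how many already-sent chunks
    are kept for a possible retransmission; older ones are dropped.
    """
    recent = deque(maxlen=keep_chunks)  # bounded retry window
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        recent.append(chunk)  # the oldest chunk falls out once the window is full
        total += len(chunk)
    return total, len(recent)


def benchmark(payload, chunk_sizes):
    """Time an in-memory relay of `payload` for each candidate chunk size."""
    timings = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        relay(io.BytesIO(payload), io.BytesIO(), size)
        timings[size] = time.perf_counter() - start
    return timings
```

Calling `benchmark(data, [64 * 1024, 1024 * 1024])` returns a dictionary of elapsed seconds per chunk size. A real benchmark would have to run against an actual network link and iRODS server, since in-memory numbers say little about the best size in production.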
While fixing this issue would give us some iRODS speedup, it would not take advantage of iRODS’s underlying Reliable Blast UDP protocol, since all transfers are channeled through HTTP instead of plain UDP.
An improvement that would definitely boost performance would be to wrap the shell i-commands in well-scaffolded PHP functions. If the i-commands binaries were installed on the system, OwnCloud would detect them and use them instead of a PRODS-based “compatibility mode”. From a user perspective, the “compat mode” should be announced as a warning in the web interface, hinting at the i-commands and at how sysadmins could install the corresponding RPM or DEB packages.
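The detect-and-fall-back idea could look something like the sketch below. It is written in Python for brevity, whereas the actual extension would do the equivalent in PHP. The function names and the backend labels `"icommands"` and `"prods-compat"` are made up for illustration; only the `iput` command itself comes from the i-commands suite.

```python
import shutil
import subprocess


def pick_backend(which=shutil.which):
    """Return "icommands" if the iput binary is on PATH, else "prods-compat"."""
    return "icommands" if which("iput") else "prods-compat"


def build_iput(local_path, irods_path):
    """Build the argument list for uploading one file with iput."""
    return ["iput", local_path, irods_path]


def upload(local_path, irods_path, which=shutil.which):
    """Upload through iput when available and report which backend applies.

    In compat mode nothing is executed here: the caller is expected to show
    the warning in the web interface and hand the transfer over to PRODS.
    """
    backend = pick_backend(which)
    if backend == "icommands":
        # An argument list (no shell) avoids quoting problems with file names.
        subprocess.run(build_iput(local_path, irods_path), check=True)
    return backend
```

Passing `which` as a parameter keeps the detection logic testable on machines without the i-commands installed.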
Since metadata is a big upside of iRODS, it should definitely be exposed in OwnCloud as well. One possible way to achieve this would be to extend an existing OwnCloud metadata plugin.
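For context, iRODS stores metadata as attribute-value-unit triples, which the `imeta` i-command can attach to a data object. A plugin would essentially have to map its form fields onto calls like the ones built below. This is only a sketch in Python, with hypothetical helper names and an example path that does not refer to any real zone.

```python
def imeta_add(data_object, attribute, value, unit=None):
    """Build an `imeta add -d` argument list for one attribute-value(-unit) triple."""
    cmd = ["imeta", "add", "-d", data_object, attribute, value]
    if unit:
        cmd.append(unit)  # the unit is optional in iRODS metadata
    return cmd


def imeta_ls(data_object):
    """Build an `imeta ls -d` argument list to list a data object's metadata."""
    return ["imeta", "ls", "-d", data_object]


# Hypothetical example: tag a scan with its voxel size in millimetres.
example = imeta_add("/tempZone/home/alice/scan.nii", "voxel_size", "1.5", "mm")
```

The resulting lists can be passed to a process runner without going through a shell, in the same way as any other wrapped i-command.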
Last but not least, no software engineering project seems to be 100% safe from UTF-8 issues, and PRODS does not seem to be free of them either:
So a fix for that is in order too. The ultimate goal is to have a well-rounded pull request that OwnCloud developers could merge without concerns, making those features available to several scientific communities in need of user-friendly, high-performance data management.
I hope I have outlined what needs to be done from a software engineering perspective. Other issues will surely pop up while developing and extending OwnCloud, and those deviations should be accounted for when planning the roadmap.
OwnCloud seems to be a good candidate to integrate iRODS, a data management framework heavily used in academia. An OwnCloud REST API has recently been made available, and bindings for modern cloud storage systems are being worked on. This is a good chance to bring iRODS and other scientific data management systems together with OwnCloud, but we need: developers, developers, developers!
In the next few days, INCF will be applying for EUDAT funding. We are looking for talented OwnCloud/PHP developers. Please feel free to leave a comment below if you are interested in this initiative, reach me on Twitter at @braincode, or just follow up on the aforementioned pull request!