2012: Lots of work to be done
January 4, 2012
Since the world is going to be ending on the 21st of December anyway, we’d better use our remaining time efficiently!
ApacheCon NA 2011
I spent the fall months preparing for the 2011 edition of ApacheCon NA, held in Vancouver, BC in the first week of November. As always, Apache and their sponsors set up and executed an incredible week full of interesting and informative applications of Apache software. I was particularly interested in the Big Data and Cloud tracks, and they didn’t disappoint. I gave my own presentation on the application of Mahout (and Hadoop) to biomedical data analysis, and I believe it went pretty well.
The content of my talk broadly introduced the work that I’ll be doing over the next few years. Mahout is going to play a central role, and in particular with the latest 1.0 release of Hadoop, I am more excited each day at the prospect of making distributed computing an integral part of my bioimaging research.
My girlfriend attended ApacheCon with me, and following the conclusion of the conference we toured around Vancouver. It is a beautiful city with quite a lot to offer. We managed to see only a few blocks and check out a few restaurants during our brief visit, but we also explored quite a bit of Stanley Park, including the Vancouver Aquarium and a spontaneous 5K race hosted by the local running store (see my running page for the results).
Qatar Collaboration
Upon returning from Vancouver, nearly all of my time remaining in 2011 was spent either on my Intermediate Statistics course (which I passed!) or working on a proposal for a massive joint venture between my lab and that of Dr. Majd Sakr at the CMU@Qatar campus. It’s a huge undertaking and will likely form the basis of my Ph.D. thesis. The proposal came out to a few dozen pages when all was said and done, and it was submitted in early December. We won’t hear back about the results until late January / early February, but we are keeping our fingers crossed nonetheless. Mahout will play a huge role in the project: effectively it will be the analytics engine of the entire framework.
As a little bit of supporting documentation, immediately following the submission of the proposal we also ran a few initial experiments and collated them into a technical report, which was submitted in mid-December to CMU@Qatar. The technical report is listed as CMU-CS-QTR-110 but has not been posted on the main website yet, so I’ve posted it on my own site if you are interested. Effectively, we implemented a very small subset of the overall functionality proposed in the larger submission as a sort of proof-of-concept; in this case, the code for computing the frequency of a time series.
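To give a rough idea of what "computing the frequency of a time series" can mean, here is a minimal single-machine sketch in plain Python. It is not the code from the technical report (which ran on Hadoop and Mahout); it simply assumes that "frequency" means the dominant frequency found by a naive discrete Fourier transform, and the function name `dominant_frequency` and its `sample_rate` parameter are made up for illustration.

```python
import cmath
import math


def dominant_frequency(series, sample_rate=1.0):
    """Return the frequency whose DFT bin has the largest magnitude.

    Assumption for illustration: "frequency" means the strongest periodic
    component. The mean is removed first so the zero-frequency (DC) term
    does not dominate. The result is in cycles per unit of time, given
    `sample_rate` samples per unit of time.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(series) / n
    centered = [x - mean for x in series]

    best_k, best_mag = 0, 0.0
    # For a real-valued signal only bins 1..n//2 carry distinct information.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


# A sine wave with 5 cycles across 64 samples, sampled at 64 samples per second.
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(dominant_frequency(signal, sample_rate=64))
```

This naive transform costs O(n^2) per series, which is fine for a toy example; the appeal of a distributed version is that every pixel's series can be processed independently.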
Eventually we’d like to make this a bit more mature, in particular making use of the higher-order autoregressive models mentioned in my ApacheCon talk. But again, this was just a proof-of-concept to show that it could be done in a distributed environment, and with the specific software we proposed. The program itself isn’t even all that long: I’ve posted it on GitHub for anyone who is interested.
This will comprise the bulk of my focus in the coming semester, and will directly lead to my next section.
Core Mahout Development
With Mahout nearly at a 0.6 release, I finally submitted a fix for issue 524, which should resolve both the Path error that was coming up (thanks to Grant for that fix) and the bizarre clustering results that were showing up. There are still several other issues. My focus early on will be the numerous housekeeping tasks involved in bringing the spectral clustering algorithms out of “experimental” status and into a much more proven state (like Mahout’s K-Means implementation). This will involve things like:
- Creating a more universal input format, such as the one used by the technical report code above.
- Creating a more universal output format, which was the main culprit for issue 524.
- Creating an in-memory version of spectral k-means to save on the expensive eigen-decomposition process.
- Experimenting with using the stochastic SVD in place of the current Lanczos solver for the eigen-decomposition.
- Fixing the Eigencuts algorithm outright. It does not work as advertised, though I have not yet had a chance to diagnose exactly why, and hence have not yet created a JIRA ticket for it.
Once these items have been attended to, I will begin adding new functionality to Mahout as per our collaboration with CMU@Qatar. This will involve a lot of work on Hadoop as well, particularly in terms of its input and output formats. It would be GREAT if we could find a way to have Hadoop natively read and split video and image formats. For our technical report we ran a Python script which converted each video to an NxM matrix, where each of the N rows is a pixel and column m holds that pixel’s value at time m (this is also the input format I’d ultimately like the spectral clustering algorithms to be able to accept, if not native images / videos as well). The Python scripts worked well, but for larger videos–those which necessitate distributed analysis–this is obviously not ideal.
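For the curious, the pixel-by-time layout described above looks roughly like the following. This is only a sketch, not the script from the technical report: it assumes the video has already been decoded into a list of equally sized grayscale frames (each frame a list of rows), and the name `frames_to_matrix` is hypothetical.

```python
def frames_to_matrix(frames):
    """Flatten a video into an N x M matrix.

    `frames` is a list of M frames, each a list of rows of grayscale values.
    The result has one row per pixel (N = height * width, in row-major
    order) and one column per frame, so row i is pixel i's time series.
    """
    if not frames:
        return []
    height, width = len(frames[0]), len(frames[0][0])
    matrix = [[] for _ in range(height * width)]
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same dimensions")
        for r, row in enumerate(frame):
            for c, value in enumerate(row):
                # Append this frame's value to the pixel's time series.
                matrix[r * width + c].append(value)
    return matrix


# Three 2x2 frames become a 4x3 matrix (4 pixels, 3 time points).
video = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
for pixel_series in frames_to_matrix(video):
    print(pixel_series)
```

The catch is that this holds the whole video in memory on one machine, which is exactly the limitation that makes a native, splittable Hadoop input format attractive.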
Spring 2012
Obviously the housekeeping tasks will come first; I’d like to thank Dan Brickley in particular for reaching out, for being a solid source of feedback, and for keeping me on task with this. The aforementioned items, plus some additional documentation on the Mahout website, are in order for the spectral algorithms.
While I’m not taking any full-time courses this semester per se–done intentionally to try to make some room for these very items and for pushing out some papers–I will be a teaching assistant this semester. I will have my usual requirements for Journal Club (1 hour per week) and Seminar (1 hour per week), and I may be traveling to Qatar to give a lecture or two for their cloud computing course. But there won’t be another Machine Learning or Intermediate Statistics demanding 30-40 hours weekly.
Here’s to a very productive 2012!


