Mahout committer, Ph.D. thesis
March 11, 2011
It’s been a while, and I apologize for that: the obligations of a first-year Ph.D. student are not trivial. However, I am only a week away from the conclusion of my final rotation, which means a transition to full-time thesis advisorship is imminent. Furthermore, my core classes will be completed at the conclusion of this semester, leaving only specializations and electives over the next few years. Also, I will be working at Google over the summer on distributed computing and predictive advertisement modeling, which is in line with both my thesis and Mahout work.
Here’s what’s on my immediate radar for Mahout:
- Update Mahout’s Getting Started wiki pages. There are two ways of doing this: quick fix and big overhaul. There are a ton of links that criss-cross all over the place, so it’s hard to get a sense of the overall structure. As I’m updating the Quickstart pages, I’d also like to completely reorganize all the information that is linked from there into a more coherent structure. However, this won’t be trivial, and I’ll certainly need to run a lot of these changes by the rest of the community, if only because I’m no expert.
- Investigate Spectral K-means logic bugs. To really make this happen, I’ve been working on implementing a Python version of spectral k-means in order to compare the results (for simpler debugging). The simple fact is, the clustering results may still not be perfect; however, the K-Means visualizer packaged with Mahout gives some pretty bizarre output, so this needs to be looked into further.
- Spectral clustering needs synthetic example data. There needs to be example data that users can simply plug into both spectral K-means and Eigencuts in order to observe how they run. This issue, however, is largely dependent on the following one.
- Algorithms should be able to convert raw data to affinities. Right now the spectral clustering algorithms accept only affinity data. Raw data should be accepted as well, with tunable parameters for how the affinities are computed. Furthermore, this needs to work for both Cartesian (text) and image data.
- Output format is needed. I purposely held off on this as the previous summer drew to a close since there was significant discussion going on in the community regarding the standardization of input/output formats. However, we haven’t quite reached a consensus yet, and these spectral clustering algorithms need a better output format than simply raw System.out.println() statements.
- Eigencuts needs to be able to programmatically determine rank of eigen-decomposition. The stopgap measure I implemented a few months ago was to make the rank a user-selectable parameter (defaulting to 3). However, there are a lot of heuristics out there for observing the decay in eigenvalues and choosing a relatively optimal cut-off based on the dimensionality of the data, which should be simple to implement.
- Bring DistributedRowMatrix into compliance with Hadoop 0.20.2. This is on hold until Hadoop releases a better API, as there is no way in 0.20.2 to do map-side joins in an efficient fashion (as was done in 0.18).
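As a rough illustration of the affinity and rank items above, here is a minimal Python sketch. It is not Mahout code. The Gaussian kernel, the sigma parameter, the largest-gap rule, and both function names are assumptions chosen for illustration; other kernels and cut-off heuristics would be equally reasonable.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Build a dense, symmetric affinity matrix from raw Cartesian points.

    Assumption: a Gaussian (RBF) kernel with a single tunable bandwidth.
    The diagonal is left at zero, i.e. no self-affinity.
    """
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between points i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            affinity[i][j] = weight
            affinity[j][i] = weight
    return affinity


def rank_by_eigengap(eigenvalues, max_rank=None):
    """Choose a rank from the decay of the eigenvalues.

    Assumption: the rank is the position of the largest drop between
    consecutive eigenvalues once they are sorted in decreasing order.
    """
    vals = sorted(eigenvalues, reverse=True)
    if len(vals) < 2:
        return len(vals)
    limit = len(vals) - 1
    if max_rank is not None:
        limit = max(1, min(max_rank, limit))
    gaps = [vals[k] - vals[k + 1] for k in range(limit)]
    # Index k of the largest gap means "keep the first k + 1 eigenvalues".
    return gaps.index(max(gaps)) + 1


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    for row in gaussian_affinity(pts, sigma=1.0):
        print(row)
    print(rank_by_eigengap([1.0, 0.98, 0.95, 0.3, 0.2]))
```

The rank function takes the eigenvalues as a plain list, so it is independent of whichever solver produced them. A smaller sigma makes the affinities fall off faster with distance, which is the kind of tunable parameter the item above has in mind.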
More distant, but equally important:
- Bring time series support into Mahout. One of the potential GSoC students for this coming summer brought up this issue, and I think it’s a great one, certainly one I’m interested in supporting and even helping. I haven’t fully investigated the existing HMM implementation, but from what I understand it’s not entirely complete or parallelized.
- Help standardize input/output formats. This is big for me, given the heterogeneity of the data I’m working with, and which I’m sure many others who could potentially use Mahout are also working with. In particular, I’d like to make it much simpler for multimedia data–images and videos in particular–to be assimilated and read into Mahout vectors and matrices.
- Generalize HMMs to Autoregressive models. This is somewhat of my grand plan: the more work I do in laying the groundwork for my thesis, the more it’s becoming abundantly clear that time series data will play a very large role in whatever my thesis turns out to be, and autoregressive models will be very important in analyzing that data.
- Run benchmarking tests of spectral algorithms against other parallel implementations. We already have collaborators in Doha, Qatar who have granted us use of a cluster of machines. Furthermore, there is Amazon EC2, in addition to a pair of my own machines with some ridiculously overpowered specifications (which can be virtualized into a small cluster).
- Publish a paper on the results of the previous item. This is, of course, dependent on the results of the benchmarking, and may require some performance tweaks, and might even involve waiting for the Hadoop API to mature a little. Still, this would be a very interesting study.
For my thesis, I want to involve the use of distributed machine learning in order to analyze biological datasets that are so massive they are otherwise intractable to analyze. Such an endeavor will require most–if not all–of the above items to be completed, plus many more that I cannot envision yet. As such, even though my current activity within the Mahout community is relatively minimal, should this thesis idea pan out I’m going to fully earn my committer status.
We’ll see what happens from here.




