2012: Lots of work to be done

Since the world is going to be ending on the 21st of December anyway, we’d better use our remaining time efficiently!

ApacheCon NA 2011

I spent the fall months preparing for this year’s rendition of ApacheCon NA, held in Vancouver, BC in the first week of November. As always, Apache and their sponsors set up and executed an incredible week full of very interesting and informative applications of Apache software. I was particularly interested in the Big Data and Cloud tracks, and they didn’t disappoint. I gave my own presentation on the application of Mahout (and Hadoop) to biomedical data analysis, and I believe it went pretty well.

The content of my talk broadly introduced the work that I’ll be doing over the next few years. Mahout is going to play a central role, and in particular with the latest 1.0 release of Hadoop, I am more excited each day at the prospect of making distributed computing an integral part of my bioimaging research.

My girlfriend attended ApacheCon with me, and following the conclusion of the conference we toured around Vancouver. It is a beautiful city with quite a lot to offer. We managed only to see a few blocks and check out a few restaurants during our brief visit, but we also explored quite a bit of Stanley Island, including the Vancouver Aquarium and a spontaneous 5K race hosted by the local running store (see my running page for the results).

Qatar Collaboration

Upon returning from Vancouver, nearly all of my time remaining in 2011 was spent either on my Intermediate Statistics course (which I passed!) or working on a proposal for a massive joint venture between my lab and that of Dr. Majd Sakr at the CMU@Qatar campus. It’s a huge undertaking and will likely form the basis of my Ph.D. thesis; the proposal came out to a few dozen pages when all was said and done and the proposal was submitted in early December. We won’t hear back about the results until late January / early February, but we are keeping our fingers crossed nonetheless. Mahout will play a huge role in the project: effectively it will be the analytics engine of the entire framework.

As a little bit of supporting documentation, immediately following the submission of the proposal we also ran a few initial experiments and collated them into a technical report, which was submitted in mid-December to CMU@Qatar. The technical report is listed as CMU-CS-QTR-110 but has not been posted on the main website yet, so I’ve posted it on my own if you are interested. Effectively, we implemented a very small subset of the overall functionality proposed in the larger submission as a sort of proof-of-concept; in this case, the code for computing frequency of a time series.

Eventually we’d like to make this a bit more mature, in particular making use of the higher-order autoregressive models mentioned in my ApacheCon talk. But again, this was just a proof-of-concept to show that it could be done in a distributed environment, and with the specific software we proposed. The program itself isn’t even all that long: I’ve posted it on github for anyone that is interested.

This will comprise the bulk of my focus in the coming semester, and will directly lead to my next section.

Core Mahout Development

With Mahout nearly at a 0.6 release, I finally submitted a fix for issue 524, which should fix both the Path error that was coming up (thanks to Grant for that fix) and the bizarre clustering that was showing up. There are still several other issues: my focus early on will be attending to the numerous housekeeping tasks revolving around bringing the spectral clustering algorithms out of “experimental” status and into a much more proven state (like Mahout’s K-Means implementation). This will involve things like:

  1. Creating a more universal input format, such as the one used by the technical report code above.
  2. Creating a more universal output format, which was the main culprit for issue 524.
  3. Creating an in-memory version of spectral k-means to save on the expensive eigen-decomposition process.
  4. Experimenting with using the stochastic SVD in place of the current lanczos solver for the eigen-decomposition.
  5. Fixing the Eigencuts algorithm outright. It does not work as advertised, though I (until now) have not had a chance to exactly diagnose why and hence have not yet created a JIRA ticket for it.

Once these items have been attended to, I will begin adding new functionality to Mahout as per our collaboration with CMU@Qatar. This will involve a lot of work on Hadoop as well, particularly in terms of its input and output formats. In particular, it would be GREAT if we could find a way to have Hadoop natively read and split video and image formats, as for our technical report we ran a Python script which converted each video to a NxM matrix, where each row is a pixel and each column is its value at time m (this is also the input format I’d ultimately like the spectral clustering algorithms to be able to accept, if not native images / videos as well). The Python scripts worked well, but for larger videos–those which necessitate distributed analysis–this is obviously not ideal.

Spring 2012

Obviously the housekeeping tasks will come first; I’d like to name Dan Brickley in particular for reaching out and being a solid source of feedback and for keeping me on task with this. The aforementioned items plus some additional documentation on the Mahout website is in order for the spectral algorithms.

While I’m not taking any full-time courses this semester per se–done intentionally to try and make some room for these very items and for pushing out some papers–I will be a teaching assistant this semester. I will have my usual requirements for Journal Club (1 hour per week) and Seminar (1 hour per week), and I may be traveling to Qatar to give a lecture or two for their cloud computing course. But there won’t be another Machine Learning or Intermediate Statistics demanding 30-40 hours weekly.

Here’s to a very productive 2012!

PhD progress, ApacheCon 2011

There’s been a lengthy pause in my updates here, but not for lack of progress on the ground. Here’s what I’ve been up to since my last post in March:

  1. Finishing up core courses in the spring, and presenting at my first conference. The first year of PhD work is remarkably unproductive in terms of research, given the huge amount of time that needs to be devoted to coursework.
  2. Summer internship at Google. While extremely relevant to the type of work I wanted to do in my PhD research, my focus was on my Google work, and consequently research took the back burner. Coupled with the fact that I traveled over the summer approximately 42% of my total days off (weekends, holidays, etc), this left little time for research-related progress.
  3. Research deadlines + coursework upon my return to the fall semester, in addition to ApacheCon 2011 preparation. I’ll expound on this further below.

I am somewhat envious of other individuals in the Mahout community whose work is, for the most part, tangential to what Mahout seeks to accomplish. My field of study–computational biology–while uniformly tangential to computer science in general, any specific part of computer science (except for, perhaps, linear algebra) is vanishingly unrelated to computational biology.

This presents a problem for introducing new computational tools: since the biology is what we’re trying elucidate, we don’t  spend a whole lot of time creating new software. It suffices for the most part, if a certain biological problem can be solved with a mixture of gaussians, to simply find an existing GMM software package and throw it at the problem. Anything beyond a few proof-of-concept scripts require either a bunch of undergraduate slaves or a dedicated on-staff software engineer (there are a few).

So this is the dilemma I am currently in. Now, this is a purely philosophical dilemma: the reality of the situation is that not only is my background in computer science, and my research track in bioimaging, but my advisor’s background is computer vision, and he spent many years at Intel. So he’s incredibly enthusiastic about creating new computational tools, and in particular fleshing out the distributed computing story to help us with our current research goals.

The practical dilemma I am in, then, is that of prior commitments: namely, coursework and publishing deadlines.

  • Coursework this semester consists of a particularly brutal class in Intermediate Statistics. The professor is phenomenal and I’m learning a lot, but it requires much more of my time (approximately 30-40 hours/week, homework + lectures) than I had originally anticipated.
  • Research deadlines from our collaborators have also kept me away from the Mahout story. Without going into too much detail, the collaborators are looking for an input-output solution, and my advisor and I–while chomping at the bit to give Mahout its chance–have to do the implementation in stages to get the input-output requirements satisfied first. We recognize the benefits of “first mover”, aka publishing really quickly even if its edges aren’t smooth yet, so our implementation right now is sequential. Once it has reached publication, the Mahout story will come next.

To be unequivocally clear: Mahout is going to be a part of my research thesis. My advisor and I have submitted several grant proposals that specifically reference Mahout and Hadoop as lynchpins in our grand 2-3 year design. It’s simply a question of time: being able to mitigate both the philosophical issues of distributed computing in computational biology, and the practical issues of coursework and research deadlines.

Stay tuned!

Mahout committer, Ph.D. thesis

Been awhile, and I apologize for that: the obligations of a first-year Ph.D. student are not trivial. However, I am only a week away from the conclusion of my final rotation, which means a transition to full-time thesis advisorship is imminent. Furthermore, my core classes will be completed at the conclusion of this semester, leaving only specializations and electives over the next few years. Also, I will be working at Google over the summer on distributed computing and predictive advertisement modeling, which is in line with both my thesis and Mahout work.

Here’s what’s on my immediate radar for Mahout:

  • Update Mahout’s Getting Started wiki pages. There are two ways of doing this: quick fix and big overhaul. There are a ton of links that criss-cross all over the place, so it’s hard to get a sense of the overall structure. As I’m updating the Quickstart pages, I’d also like to completely reorganize all the information that is linked from there into a more coherent structure. However, this won’t be trivial, and I’ll certainly need to run a lot of these changes by the rest of the community, if only because I’m no expert.
  • Investigate Spectral K-means logic bugs. To really make this happen, I’ve been working on implementing a Python version of spectral k-means in order to compare the results (for simpler debugging). The simple fact is, the clustering results may still not be perfect; however, the K-Means visualizer packaged with Mahout gives some pretty bizarre output, so this needs to be looked into further.
  • Spectral clustering needs synthetic example data. There needs to be example data that users can simply plug into both spectral K-means and Eigencuts in order to observe how they run. This issue, however, is largely dependent on the following one.
  • Algorithms should be able to convert raw data to affinities. Right now the spectral clustering algorithms accept only affinity data. Raw data should be acceptable, with tunable parameters for how the affinities are computed. Furthermore, this needs to be able to accept both cartesian (text) and image data.
  • Output format is needed. I purposely held off on this as the previous summer drew to a close since there was significant discussion going on in the community regarding the standardization of input/output formats. However, we haven’t quite reached a consensus yet, and these spectral clustering algorithms need a better output format than simply raw System.out.println() statements.
  • Eigencuts needs to be able to programmatically determine rank of eigen-decomposition. The stopgap measure I implemented a few months ago was to make the rank a user-selectable parameter (defaulting to 3). However, there are a lot of heuristics out there for observing the decay in eigenvalues and choosing a relatively optimal cut-off based on the dimensionality of the data, which should be simple to implement.
  • Bring DistributedRowMatrix into compliance with Hadoop 0.20.2. This is on hold until Hadoop releases a better API, as there is no way in 0.20.2 to do map-side joins in an efficient fashion (as was done in 0.18).

More distant, but equally important:

  • Bring time series support into Mahout. One of the potential GSoC students for this coming summer brought up this issue, and I think it’s a great one, certainly one I’m interested in supporting and even helping. I haven’t fully investigated the existing HMM implementation, but from what I understand it’s not entirely complete or parallelized.
  • Help standardize input/output formats. This is big for me, given the heterogeneity of the data I’m working with, and which I’m sure many others who could potentially use Mahout are also working with. In particular, I’d like to make it much simpler for multimedia data–images and videos in particular–to be assimilated and read into Mahout vectors and matrices.
  • Generalize HMMs to Autoregressive models. This is somewhat of my grand plan: the more work I do in laying the groundwork for my thesis, the more it’s becoming abundantly clear that time series data will play a very large role in whatever my thesis turns out to be, and autoregressive models will be very important in analyzing that data.
  • Run benchmarking tests of spectral algorithms against other parallel implementations. We already have collaborators in Doha, Qatar that have granted us use of a cluster of machines. Furthermore, there is Amazon EC2, in addition to a pair of my own machines with some ridiculously overpowered specifications (can be virtualized for a small cluster).
  • Publish a paper on the results of the previous item. This is, of course, dependent on the results of the benchmarking, and may require some performance tweaks, and might even involve waiting for the Hadoop API to mature a little. Still, this would be a very interesting study.

For my thesis, I want to involve the use of distributed machine learning in order to analyze biological datasets that are so massive they are otherwise intractable to analyze. Such an endeavor will require most–if not all–of the above items to be completed, plus many more that I cannot envision yet. As such, even though my current activity within the Mahout community is relatively minimal, should this thesis idea pan out I’m going to fully earn my committer status.

We’ll see what happens from here.

DistributedRowMatrix overhaul

I’ve spent the past few days of my holiday break working on bringing Mahout’s DistributedRowMatrix into compliance with Hadoop 0.20.2 (where it currently implements 0.18). This comes with an interesting caveat, given that – for whatever reason – version 0.20 does not have the CompositeInputFormat that allows for joins to be performed on the data. This little bit was vital for conducting matrix multiplication: since the operation is not commutative ( AB != BA ), it is absolutely essential to know, within the Mappers/Reducers, which vector/row came from which matrix.

By performing a join, two corresponding rows – one from each matrix – would appear in the Mapper with the same key. The Mapper would essentially perform an identity function: map the rows with the same keys to the same Combiner/Reducer. However, in the next step (be it Combine or Reduce), we would indeed have two rows, but we’d completely without the ability to determine which row came from which matrix, and therefore unable to properly multiply them (particularly if the dimensions of the two matrices are equal).

From what I can tell, there are two possible solutions here:

  1. Modify the DistributedRowMatrix to have its corresponding rows carry the path to its SequenceFiles. This approach would effectively move the task of slapping a label on each individual row to the very creation of the matrix itself. However, the primary issue here is that a DistributedRowMatrix is not explicitly created when its constructor is called; rather, it is created whenever a user-defined job is executed that requires it. Thus, its structure is subject to the implementation of the job that creates it, so every instance that exists where a DistributedRowMatrix is created would have to be modified (and future instance would have to follow this instantiation rule) for this fix to work.
  2. Guarantee the ordering the two rows in the Combiner/Reducer. Given the serious issue with #1 above, this seems the most plausible. However, this option hits hard on my lack of experience and knowledge of the inner workings of Hadoop. To be able to guarantee the ordering of the two rows that show up in the Combiner/Reducer is something I don’t know how to do, or even where to start. I’ve looked at the implementations for Partitioner, RecordReader, and various InputFileFormats and none of them really seem to be what I’m looking for.

Given these possibilities, it seems that #2 is the way to go, but I’m really not sure how. If anyone has any insights as to how the ordering of the elements in the Combiner/Reducer’s Iterable object is assigned, please let me know! Or if you have any other ideas as to how matrix multiplication could be performed in a map-reduce setting.

Follow this issue on Apache JIRA here.

Mahout and Graduate School

It’s been a few months and I’ve settled into my new role as a Ph.D. student (for the most part). I’ve become exponentially more busy, what with a full course load and rotations as well at the CMU-UPitt Joint Computational Biology program. However, I am committed to staying active in and contributing to the Mahout community, as their work is orthogonal to my own. Plus they’re a really awesome community.

My current rotation advisor is also the author of the Eigencuts algorithm I implemented over the summer (which has now been incorporated into what will soon be the 0.4 release of Mahout), so I am definitely still actively working on patching up the implementation and continuing to contribute. Here are my current endeavors with Mahout:

  • Eigencuts. Still a lot of work to be done here. The biggest issue is coming up with a way of deciding what rank eigen-decomposition to perform for the Lanczos solver. The eigenvalues are how we decide whether that eigenvalue and its associated eigenvector are useful for clustering, but unfortunately there currently isn’t any way to test without performing the full decomposition to some arbitrary rank. If the arbitrary rank is too large, we waste a significant amount of computation; if it’s too small, we get a poor clustering. The current fix is to allow the user to set the rank manually, but this is redundant to an extent as they already set the β0 parameter which is effectively the threshold by which we determine whether or not to use an eigenvalue/eigenvector. Needless to say, an efficient approximation would be worthwhile (perhaps some way of using the characteristic polynomial?).
  • Affinities. Currently, Eigencuts requires the user to submit input that is already in graphical format, and it’s a pretty arbitrary format as well. I’m working on an addition that will allow the user to submit raw image data as input, and Mahout will then automatically parse and convert it to an affinity matrix. The only issues here are technical ones: what image formats to support, what dependencies we’ll incur as a result, and if we want to be able to support other non-image data formats (the algorithm’s author isn’t crazy about this idea, since spectral clustering really isn’t designed for that, even though it can – with some iffy heuristics – be done).
  • Examples. I really, really need to get this done: working examples of Eigencuts in action. I’ve outlined a framework which utilizes a quick-and-dirty PHP front-end for folks to upload images, then feed them to Eigencuts, then have PHP parse the output and display the partitioned image, but it’s not completed yet (obviously).
  • Output. This is somewhat out of my hands, but Eigencuts currently does not have an output format. I did this deliberately, as that very topic was being debated at the time among the Mahout community. There is a strong desire for standardizing Mahout’s input/output formats, but as far as I know, no consensus has been reached as of yet.
  • DistributedRowMatrix. Eigencuts depends on these distributed implementations of sparse matrices in order to perform its calculations; as such, I volunteered to bring its API up to Hadoop 0.20.2 (in particular, eliminate the dependency on JobConf, which has been deprecated) Unfortunately, there are a few operations – most importantly, matrix-matrix multiplication – that are difficult to port over as the 0.20.2 release is missing some of the classes that was in the previous release. In looking through Hadoop’s repository, those crucial classes have been committed but for some reason weren’t in the official release. To add insult to injury, they also have dependencies that are not in 0.20.2, so as of right now I’m trying to figure out a workaround to hold DRM over until the next Hadoop release.

These are, effectively, the points I’m working on whenever I can. Graduate school does come first, and if my midterm performance in my Machine Learning class is any indication I’m going to have to keep spending a lot of time on schoolwork (though if anyone would like to lend their assistance in understanding the coursework, that would be welcomed :) ). Stay tuned for updates.

GSoC 2010 Wrap-up

The completion date for GSoC 2010 has come and gone, and while bug fixes and tweaks are still incoming on the Eigencuts project, the timeline for this summer is over. In short: that’s all, folks.

Here is a basic summary of points regarding the final product:

  • There are effectively two spectral algorithms packaged together: spectral k-means, and the Eigencuts algorithm. Spectral k-means is effectively the first half of Eigencuts with K-means tacked onto the end, so there was no more work involved in making the algorithms distinct.
  • Documentation is available on the Mahout wiki; there are still a few blank spots that will be filled in over the coming weeks.
  • Spectral k-means works as advertised; Eigencuts is still under review and refactoring, as a few unfortunate design flaws that were not revealed until final testing have proven problematic in terms of making the algorithm work. This is actively being worked on.
  • Output for both algorithms is still sketchy. There was never a decision made regarding the format of the final output; currently, they simply display each point and its cluster assignment in STDOUT, which is obviously not a very elegant solution. With the ongoing discussions throughout the Mahout community involving the standardization of the clustering input/output and overall API, however, this would likely have been changed regardless.
  • Spectral K-means does not yet fully allow for all the K-means parameters to be used as well; right now, it sets several defaults internally.
  • Publications are pending and forthcoming…

Thanks again to the Mahout community for allowing me this awesome, awesome opportunity. Given how intrinsically related this work is to my Ph.D. work over the next 5-? years, I very much hope to remain a part of this community for the foreseeable future, working on Eigencuts and the Mahout package in general in any way I can.

Sprint 4: Map/Reduce over two SequenceFiles

This is the final roadblock, and unfortunately it doesn’t appear to be a trivial problem to solve.

In my final map/reduce task, I have two matrices: the affinity matrix A, and the “cut” matrix S. The affinity matrix is a model for the underlying graph structure of the data; as mentioned here in previous entries, it shows how each data point relates to the others, and is how we decide where to partition the data into clusters. The cut matrix S is essentially our final phase – it represents the end results of all our calculations, and tells us where we need to induce “cuts” in the graph to partition the data points into distinct clusters.

Essentially: any non-zero value of S at some arbitrary point i, j represents a needed cut at the same i, j in A.

In order to make this happen within the context of Mahout, I need access to both matrices simultaneously. As the default method for going through matrices in Mahout is the implementation of a map/reduce task (as these matrices could easily have dimensions into the order of millions), I need a way for mapping over both these matrices simultaneously.

Unfortunately, there does not seem to be a readily available method for doing this within the current Hadoop API.

The first possibility I explored was simply mapping over one of the two matrices, and then using the key of each Mapper (corresponding to a row in the matrix) as the key to the corresponding matrix, and reading the SequenceFile for the second matrix manually. However, it has been pointed out to me by Mahout devs that this method is not only crude (which I already knew), but also orders of magnitude slower. It would not scale well with the data.

The second possibility that I had already been looking into (and which the Mahout devs pointed me towards) was to look at how they implemented MatrixMultiplicationJob within the DistributedRowMatrix class – after all, this is the multiplication of two matrices, so obviously they would have to be able to coordinate the rows and columns of two matrices simultaneously.

Here is a snippet of the code:

public class MatrixMultiplicationJob extends AbstractJob {

  private static final String OUT_CARD = "output.vector.cardinality";

  public static JobConf createMatrixMultiplyJobConf(Path aPath, Path bPath, Path outPath, int outCardinality) {
    JobConf conf = new JobConf(MatrixMultiplicationJob.class);
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
          "inner", SequenceFileInputFormat.class, aPath, bPath));
    conf.setInt(OUT_CARD, outCardinality);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, outPath);
    conf.setMapperClass(MatrixMultiplyMapper.class);
    conf.setCombinerClass(MatrixMultiplicationReducer.class);
    conf.setReducerClass(MatrixMultiplicationReducer.class);
    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(VectorWritable.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(VectorWritable.class);
    return conf;
  }
    ...
      public static class MatrixMultiplyMapper extends MapReduceBase
      implements Mapper<IntWritable,TupleWritable,IntWritable,VectorWritable> {
    ...
    }
  }

See the problem? It uses the older, now-deprecated Hadoop API that relies on the hadoop.mapred.* package, rather than the newer hadoop.mapreduce.* package. While that is not necessarily the problem, the essential classes involved in this implementation, namely CompositeInputFormat and TupleWritable, are not available in the newer package. The Job.setInputFormatClass() method does not accept the class hierarchy that CompositeInputFormat belongs to, and TupleWritable is simply nowhere to be found in the new package.

These two methods are the only ones I am aware of for accessing two separate SequenceFiles that share the same dimensions within the same Mapper. I have already emailed both the Mahout and Hadoop lists in regards to this issue; if you have any ideas or brainstorms on how to solve this problem, I would love to hear them. As it stands, this is the final major hurdle in completing my summer project; all that remains is finishing up unit tests and final debugging once this is in place.

Sprint 4: Closing in

We’re still a few weeks out from the official conclusion of GSoC, but the end of this project is arguably even closer (provided I stick to the remaining schedule). There are only a handful of concrete tasks remaining.

Document, document, document

Probably the most important remaining task – I need to start transferring data from this blog over to the Mahout Wiki. This includes stuff like the theoretical background of the algorithm, how it is implemented in Mahout, and how to run example data sets and interpret the results. I should be starting this immediately, though I also have a week (the final sprint) dedicated specifically to this task.

Unit tests

I have hardly any; this has been one glaring weakness in my development strategy. I need to set up unit tests that can be run to test each component of Eigencuts, of which there are many. I have several map/reduce tasks that need to have independent tests for each one. This task will likely take up most of the remaining time.

Coding

There are still a few modules that are incomplete in the overall algorithm. I’ll briefly list them:

  1. The SensitivityReducer needs to be refactored. Currently, it outputs a series of VectorWritables, ideal for instantiating in a DistributedRowMatrix. However, in the final job that actually performs the clustering, both this output and the original affinity matrix need to be simultaneously accessible. Since the “cut” matrix (the output of the SensitivityReducer) is n-by-n but has at most n elements, it can easily be represented as a single vector. Unfortunately, its elements are complex, so its representation will likely have to be that of an ArrayWritable.
  2. The Drivers for the respective algorithms need to be refactored to handle the new job for multiplying a diagonal matrix (represented as a DenseVector) with another matrix. This replaces the current implementation of two DistributedRowMatrix‘s being multiplied together as D.times(A.times(D)).
  3. The AffinityCuts job needs to be finished. Specifically, the runJob() method needs to write the “cut” vector to the DistributedCache, the setup() method needs to read it back properly, and the map() method needs to be filled out in its entirety.

That’s it!

Sprint 3: Three last M/R tasks

I mentioned in my previous post the fact that I had three more M/R tasks (at most) to implement in my current algorithm. In attempting to implement these, I have run into a problem of data representation: how I get the needed inputs to the M/R jobs, and how do I represent the outputs? Let’s look at these one at a time.

Computing sensitivities

The algorithm will compute an arbitrary number of eigenvector/eigenvalue pairs (as a very, very loose guess, perhaps anywhere from 5-50, depending on the dimensionality and quantity of the data). For each eigenvector/eigenvalue pair, we compute the sensitivity of this eigenvector to changes in the edge weights of the original markov transition matrix. This computation has an explicit form:

lambda = eigenvalue ; b = constant ; u = eigenvector ; u_i, u_j = individual eigenvector elements ; d_i, d_j = individual diagonal matrix elements

It’s not pretty. This has to be computed for all (i, j) pairs in every eigenvector. Thus, for each eigenvector of length n, we compute n(n – 1) different sensitivity values, one for every edge in the underlying graph; effectively, this builds a separate matrix for all the sensitivities. And again, I emphasize that this is just for one single eigenvector.

So here’s the problem: there are four inputs to this sensitivity computation (eigenvectors, eigenvalues, diagonal matrix, b), with an apparent output of a list of matrices. Map/reduce tasks take key/value pairs as input, and return key/value pairs as output. How do we shoehorn the above inputs/outputs into the map/reduce framework?

Non-maximal suppression

This step takes place right after the previous one, where we now have our list of matrices of sensitivities for each eigenvector. This computation can work on each matrix individually, so we will only concern ourselves with a single matrix for the sake of simplicity. For every (i, j) pair not on the diagonal, we want to compare the value in this cell to all other values in the same row and the same column, essentially forming a + in the graph. If any of those other values are more negative than the current one, then we want to “suppress” the current one (or effectively set it to MAX_INT, since we won’t use it again after this).

Thus, in a matrix of dimensions n x n, we are effectively looking for the n smallest elements, none of which share the same row or the same column. We want to repeat this process for all the matrices of sensitivities.

Again, we have a problem of input/output type mapping: our input is a list of matrices, and our output also appears to be a list of matrices (albeit very sparse ones, so this could probably be optimized as long as the i and j coordinates are maintained, as they will be needed in the last step). How again to map these to key/value pairs needed for map/reduce jobs?

Affinity cuts

This is the heart and soul of the Eigencuts algorithm, and also the final step. For all the sensitivities that survived the previous step (i.e. were not suppressed), and of those sensitivities, for all those whose value is less than a predefined threshold τ / δ (largely set by the user), we want to “cut” the edge in the graph specified by the (i, j) coordinates of those sensitivities that fit the aforementioned criteria. By “cut”, this means we want to set the value of the affinity matrix at point (i, j) to 0, which signifies that the vertices i and j in the underlying graph no longer share a connection.

For this particular case, the input consists of the list of sparse matrices from the previous step, as well as the affinity matrix. The output, finally, is just the affinity matrix (or a modified version of it). If the list of matrices could be concatenated together vertically, then each row could be used in independent map tasks, as there will only be a single value in any given row that has not been suppressed. The key would be the row number i, and the non-suppressed value would carry its column number j; the only tricky part would be that, after the first matrix, the key i will take on values larger than the number of rows in a matrix, so it would need to be normalized to still refer to an actual row in the affinity matrix.

But along the lines of the affinity matrix, this is the problem. Since it is constant throughout all the operations, it might be possible for each mapper to construct its own affinity matrix and modify it as necessary (since consistency of the affinity matrix isn’t an issue; no operation depends on another. If the same value is zeroed out twice, who cares – it’s still zeroed out).

Those are the current tasks at hand. If any of this needs clarifying, please let me know. I appreciate any insights; thank you!

Sprint 3: Quick Update

I’m still eyeball-deep in sprint 3 (and rapidly running short on time), but I wanted to make an update about my latest developments.

First and foremost, I have finally quashed the issue of bad affinity matrices. I did, however, discover a slight bug in the sprint 1-2 algorithm code itself: I forgot I had set the convergence parameter for the k-means portion of the algorithm to an extremely relaxed value, and so when I finally had good data to feed it, the clusters were still very loose. Reducing the parameter from 0.5 to 0.01 fixed it right up.

I have also committed the latest version of the data generation scripts to my Google SVN repository. I made one very important change to how affinity matrices are created: the graph is now fully-connected, and the user can no longer specify a discrete “neighborhood”. The reason for this can be illustrated with an example: suppose you have two arbitrary points, A and B, in two-dimensional space somewhere. There are other points as well, but we’re only paying attention to these for now. Within A’s k-neighborhood, B is a member, as it is among the k closest points to A. However, when calculating B’s neighborhood, there are k points surrounding it that are closer than A is. Thus, we have an asymmetry: B is in A’s k-neighborhood, but A is not in B’s, so their affinities will not be symmetric. The way the algorithm is currently implemented, symmetry is assumed, so this is a deal-breaker.

As I mentioned, the solution I implemented was to eliminate k-neighborhoods entirely and connect the entire affinity matrix – no zeroed-out values.

Here are the results from the new data. First, here are three synthetic clusters:

Three clusters, generated from Gaussian centroids. (click for larger image)

Each cluster contains 150 points. When fed into the algorithm in the form of a symmetric (and full) affinity matrix, the spectral k-means algorithm of sprints 1-2 clustered all point with 100% accuracy – all three clusters were fully recovered. For reference, here are the three eigenvectors generated by the algorithm:

Three eigenvectors of the normalized Laplacian (click for larger image)

As you can see, the piecewise constancy we were hoping for in previous attempts is decidedly more present here. There is obviously some deviation – quite a bit of variance is present in each “component” of an eigenvector, though some more than others – but this did not seem to be a problem when performing k-means clustering on the eigenvector components, as every point was correctly clustered.

Needless to say, the algorithm works (and quite well!), but it is highly dependent on the data being properly formatted and, possibly even more important, symmetric: the affinity from point A to point B must be identical to the affinity from point B to point A, whether 0 (e.g. outside a k-neighborhood) or nonzero.

As for sprint 3 progress, I’ve identified about 4-5 separate map/reduce tasks that will need to be utilized. The first two are exactly identical to the map/reduce tasks employed in sprints 1-2, since Eigencuts uses the exact same foundation as spectral k-means.

  1. Read input. Already implemented in spectral k-means: read the input and create a DistributedRowMatrix of the affinities.
  2. Create D. Already implemented in spectral k-means: sum the rows of the affinity matrix to create the diagonal matrix D.
  3. Compute sensitivities. Loop through the eigenvectors of the matrix L, computing the sensitivity Si,jk of each.
  4. Non-maximal suppression. Suppress non-maximal sensitivities by “marking” them for the next step.
  5. Cut affinities. Set values of the affinity matrix A to 0 (not before adding the affinity to the diagonal, though!) if their computed sensitivity falls beneath a threshold.

I’m still working out the details – I may be able to combine some of these into fewer M/R tasks, but each step of the algorithm does depend on the outcome of the previous one, so it may be difficult. We shall see.

Follow

Get every new post delivered to your Inbox.