Mahout committer, Ph.D. thesis

It’s been a while, and I apologize for that: the obligations of a first-year Ph.D. student are not trivial. However, I am only a week away from the conclusion of my final rotation, which means a transition to full-time work with my thesis advisor is imminent. Furthermore, my core classes will be completed at the end of this semester, leaving only specializations and electives over the next few years. I will also be working at Google over the summer on distributed computing and predictive advertisement modeling, which is in line with both my thesis and my Mahout work.

Here’s what’s on my immediate radar for Mahout:

  • Update Mahout’s Getting Started wiki pages. There are two ways of doing this: quick fix and big overhaul. There are a ton of links that criss-cross all over the place, so it’s hard to get a sense of the overall structure. As I’m updating the Quickstart pages, I’d also like to completely reorganize all the information that is linked from there into a more coherent structure. However, this won’t be trivial, and I’ll certainly need to run a lot of these changes by the rest of the community, if only because I’m no expert.
  • Investigate Spectral K-means logic bugs. To really make this happen, I’ve been working on a Python implementation of spectral K-means to compare results against (for simpler debugging). It may simply be that the clustering results aren’t perfect yet; however, the K-Means visualizer packaged with Mahout gives some pretty bizarre output, so this needs to be looked into further.
  • Spectral clustering needs synthetic example data. There needs to be example data that users can simply plug into both spectral K-means and Eigencuts in order to observe how they run. This issue, however, is largely dependent on the following one.
  • Algorithms should be able to convert raw data to affinities. Right now the spectral clustering algorithms accept only affinity data. Raw data should be acceptable, with tunable parameters for how the affinities are computed. Furthermore, this needs to be able to accept both cartesian (text) and image data.
  • Output format is needed. I purposely held off on this as the previous summer drew to a close since there was significant discussion going on in the community regarding the standardization of input/output formats. However, we haven’t quite reached a consensus yet, and these spectral clustering algorithms need a better output format than simply raw System.out.println() statements.
  • Eigencuts needs to be able to programmatically determine the rank of its eigen-decomposition. The stopgap measure I implemented a few months ago was to make the rank a user-selectable parameter (defaulting to 3). However, there are plenty of heuristics for observing the decay in the eigenvalues and choosing a near-optimal cut-off based on the dimensionality of the data, and these should be simple to implement (see the sketch after this list).
  • Bring DistributedRowMatrix into compliance with Hadoop 0.20.2. This is on hold until Hadoop releases a better API, as there is no way in 0.20.2 to do map-side joins in an efficient fashion (as was done in 0.18).
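To give a flavor of the kind of heuristic I have in mind for that rank-determination item, here is a minimal Python sketch of the common “eigengap” rule – pick the rank at the largest drop in the sorted eigenvalue spectrum. This is only an illustration, not anything already in Mahout; the function name and max_rank parameter are made up for this sketch.

import numpy as np

def choose_rank(eigenvalues, max_rank=None):
    # Sort eigenvalue magnitudes in decreasing order and look at how they decay.
    vals = np.sort(np.abs(np.asarray(eigenvalues, dtype=float)))[::-1]
    if max_rank is not None:
        vals = vals[:max_rank]
    # The "eigengap" heuristic: cut where the drop between consecutive
    # eigenvalues is largest.
    gaps = vals[:-1] - vals[1:]
    return int(np.argmax(gaps)) + 1

# A spectrum that decays sharply after the third eigenvalue -> rank 3
print(choose_rank([1.0, 0.97, 0.95, 0.40, 0.38, 0.35]))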

More distant, but equally important:

  • Bring time series support into Mahout. One of the potential GSoC students for this coming summer brought up this issue, and I think it’s a great one – certainly one I’m interested in supporting and even helping with. I haven’t fully investigated the existing HMM implementation, but from what I understand it’s not entirely complete or parallelized.
  • Help standardize input/output formats. This is big for me, given the heterogeneity of the data I’m working with – data that I’m sure many potential Mahout users are dealing with as well. In particular, I’d like to make it much simpler for multimedia data – especially images and videos – to be assimilated and read into Mahout vectors and matrices.
  • Generalize HMMs to Autoregressive models. This is somewhat of my grand plan: the more work I do in laying the groundwork for my thesis, the more it’s becoming abundantly clear that time series data will play a very large role in whatever my thesis turns out to be, and autoregressive models will be very important in analyzing that data.
  • Run benchmarking tests of spectral algorithms against other parallel implementations. We already have collaborators in Doha, Qatar that have granted us use of a cluster of machines. Furthermore, there is Amazon EC2, in addition to a pair of my own machines with some ridiculously overpowered specifications (can be virtualized for a small cluster).
  • Publish a paper on the results of the previous item. This is, of course, dependent on the results of the benchmarking, and may require some performance tweaks, and might even involve waiting for the Hadoop API to mature a little. Still, this would be a very interesting study.

For my thesis, I want to use distributed machine learning to analyze biological datasets so massive that they are otherwise intractable. Such an endeavor will require most – if not all – of the above items to be completed, plus many more that I cannot envision yet. As such, even though my current activity within the Mahout community is relatively minimal, should this thesis idea pan out I’m going to fully earn my committer status.

We’ll see what happens from here.

GSoC 2010 Wrap-up

The completion date for GSoC 2010 has come and gone, and while bug fixes and tweaks are still incoming on the Eigencuts project, the timeline for this summer is over. In short: that’s all, folks.

Here is a basic summary of points regarding the final product:

  • There are effectively two spectral algorithms packaged together: spectral K-means and the Eigencuts algorithm. Spectral K-means is essentially the first half of Eigencuts with K-means tacked onto the end, so almost no extra work was involved in keeping the algorithms distinct.
  • Documentation is available on the Mahout wiki; there are still a few blank spots that will be filled in over the coming weeks.
  • Spectral k-means works as advertised; Eigencuts is still under review and refactoring, as a few unfortunate design flaws that were not revealed until final testing have kept the algorithm from working correctly. This is being actively worked on.
  • Output for both algorithms is still sketchy. No decision was ever made regarding the format of the final output; currently, both simply print each point and its cluster assignment to STDOUT, which is obviously not a very elegant solution. Given the ongoing discussions throughout the Mahout community on standardizing the clustering input/output and overall API, however, this would likely have changed regardless.
  • Spectral K-means does not yet expose all of the underlying K-means parameters; right now, it sets several defaults internally.
  • Publications are pending and forthcoming…

Thanks again to the Mahout community for allowing me this awesome, awesome opportunity. Given how intrinsically related this work is to my Ph.D. work over the next 5-? years, I very much hope to remain a part of this community for the foreseeable future, working on Eigencuts and the Mahout package in general in any way I can.

Sprint 3: Quick Update

I’m still eyeball-deep in sprint 3 (and rapidly running short on time), but I wanted to make an update about my latest developments.

First and foremost, I have finally quashed the issue of bad affinity matrices. I did, however, discover a slight bug in the sprint 1-2 algorithm code itself: I forgot I had set the convergence parameter for the k-means portion of the algorithm to an extremely relaxed value, and so when I finally had good data to feed it, the clusters were still very loose. Reducing the parameter from 0.5 to 0.01 fixed it right up.

I have also committed the latest version of the data generation scripts to my Google SVN repository. I made one very important change to how affinity matrices are created: the graph is now fully connected, and the user can no longer specify a discrete “neighborhood”. The reason can be illustrated with an example: suppose you have two arbitrary points, A and B, somewhere in two-dimensional space. There are other points as well, but we’re only paying attention to these two for now. B is a member of A’s k-neighborhood, as it is among the k closest points to A. However, when calculating B’s neighborhood, suppose there are k other points that are closer to B than A is. Then we have an asymmetry: B is in A’s k-neighborhood, but A is not in B’s, so their affinities will not be symmetric. The way the algorithm is currently implemented, symmetry is assumed, so this is a deal-breaker.

As I mentioned, the solution I implemented was to eliminate k-neighborhoods entirely and make the affinity matrix fully connected – no zeroed-out values. A sketch of this construction follows.
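For reference, here is a minimal sketch of the fully-connected construction I’m describing – not the actual committed script, and it assumes a fixed Gaussian σ rather than the median-based scaling the real scripts use – just to show that symmetry now falls out by construction:

import numpy as np

def full_affinity(points, sigma=1.0):
    # Pairwise squared Euclidean distances between all points.
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    # Fully-connected Gaussian affinities; no k-neighborhood is applied,
    # so A[i, j] == A[j, i] by construction.
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # no self-affinities
    return A

A = full_affinity(np.random.normal(size=(10, 2)))
assert np.allclose(A, A.T)  # symmetric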

Here are the results from the new data. First, here are three synthetic clusters:

[Figure: Three clusters, generated from Gaussian centroids.]

Each cluster contains 150 points. When fed into the algorithm in the form of a symmetric (and full) affinity matrix, the spectral k-means algorithm of sprints 1-2 clustered every point with 100% accuracy – all three clusters were fully recovered. For reference, here are the three eigenvectors generated by the algorithm:

[Figure: Three eigenvectors of the normalized Laplacian.]

As you can see, the piecewise constancy we were hoping for in previous attempts is decidedly more present here. There is obviously some deviation – quite a bit of variance within each “component” of an eigenvector, more in some than in others – but this did not seem to be a problem when performing k-means clustering on the eigenvector components, as every point was correctly clustered.

Needless to say, the algorithm works (and quite well!), but it is highly dependent on the data being properly formatted and, possibly even more important, symmetric: the affinity from point A to point B must be identical to the affinity from point B to point A, whether 0 (e.g. outside a k-neighborhood) or nonzero.

As for sprint 3 progress, I’ve identified 4-5 separate map/reduce tasks that will be needed. The first two are identical to the map/reduce tasks employed in sprints 1-2, since Eigencuts builds on the same foundation as spectral k-means.

  1. Read input. Already implemented in spectral k-means: read the input and create a DistributedRowMatrix of the affinities.
  2. Create D. Already implemented in spectral k-means: sum the rows of the affinity matrix to create the diagonal matrix D.
  3. Compute sensitivities. Loop through the eigenvectors of the matrix L, computing the sensitivity S^k_{i,j} of each.
  4. Non-maximal suppression. Suppress non-maximal sensitivities by “marking” them for the next step.
  5. Cut affinities. Set values of the affinity matrix A to 0 (not before adding the affinity to the diagonal, though!) if their computed sensitivity falls beneath a threshold.

I’m still working out the details – I may be able to combine some of these into fewer M/R tasks, but each step of the algorithm does depend on the outcome of the previous one, so it may be difficult. We shall see.

Sprint 3: Introduction to Eigencuts

“Old News”

First things first: I have a few other updates. Namely, I upgraded my synthetic testing data scripts and patched a bug that was creating incorrect indices in the testing data. I’ve incorporated these scripts into my Google SVN repository, so please feel free to run them (in the “util” folder). The upgrade includes the ability to create “radial” clusters – essentially concentric circles of uniformly-distributed two-dimensional points – an interesting test case that standard k-means struggles with but which spectral clustering, in theory, handles very well.

Unfortunately, where my eigenvectors are supposed to be piecewise-constant, they look like this instead:

[Figure: Each point corresponds to a single dimension of an eigenvector.]

Not quite what they’re supposed to look like. The first 150 points actually don’t look too bad; the three eigenvectors are relatively constant within a specific range of values and overall behavior. By elements 151-300 and 301-450, however, most of the patterns are gone, which would explain the lackluster clustering results I’ve been getting back. For instance, in one of my random tests (the one which generated the above eigenvectors), 115 points were assigned to one cluster, 286 to another, and 49 to a third. Here’s what the original data in this test looked like:

[Figure: Each cluster consists of 150 uniformly-distributed points.]

The clusters are very well-separated (by design), with only a small amount of random Gaussian noise within each cluster. This should have been a veritable cakewalk for the clustering algorithm, but as the results show, it wasn’t. My guess is that the problem lies somewhere in the creation of the eigenvectors themselves; as I mentioned in a previous post, I am not performing any normalizing operations on the eigenvector matrix. Still, the lack of normalization shouldn’t be creating this level of confusion on such well-behaved data.

I would like to continue running tests by adjusting the parameters of the data and how it is created; in particular, the number K of neighbors by which relative affinities are calculated (I have been using K = 8; a larger value of K, corresponding to a more connected underlying graph, may alleviate the clustering confusion). I will post more on this later.

EigenCuts

In my first post, I went into detail about the theory behind “standard” spectral clustering, along with a few pointers to where the EigenCuts algorithm would pick up in a new direction. For this particular spectral clustering derivation, I’m picking up with the matrices L (not the Laplacian, but it has similar properties, so I’ve been calling it a “normalized Laplacian”) and M (the Markov transition matrix). The original affinity matrix, A, and the diagonal matrix, D, are still within scope, but their purposes are for calculating other intermediate values, so the initial steps involving these matrices will be brief.

1: Form the affinity matrix, A. This is done exactly as before, via whatever scaling factor and neighborhood assessment the user wishes (we use Euclidean distance within an 8-neighborhood, scaled by 2σ², where σ is the standard deviation of the neighborhood affinities).

2: Create diagonal matrix D of rows of A, and form L. This is done by summing the affinities of each row i of A and placing the result in D_{i,i}. Now, compute L = D^{-1/2} A D^{-1/2}.

3: Perform a decomposition of L into eigenvectors U and eigenvalues Λ. From here, EigenCuts diverges from k-means spectral clustering.

4: For each eigenvector which exceeds a minimal half-life, compute its half-life sensitivity to edge weight perturbations for every edge. First, the user supplies a threshold β_0 representing the minimal half-life of interest. For each eigenvector whose half-life β_k > εβ_0, we compute its half-life sensitivity. The half-life of each eigenvector is computed as follows:

β_k = -log(2) / log(λ_k)

This half-life represents how the “eigenflow” – orthogonal to probability flow within a Markov system – decays, or approaches a stationary distribution, over time. An eigenflow’s sensitivity is defined by how much it changes if the edge weights in the graph are altered; this is very similar to the min-cut, max-flow problem in that we are attempting to locate bottlenecks. Long half-lives are associated with weakly-coupled clusters, and intuitively are more likely to have bottlenecks. Strongly-coupled clusters have many possible routes from one edge to another, and thus no bottleneck really exists, so the half-life is extremely short (corresponding to very fast decay to the stationary distribution). Without going into the derivation, this is the formula for computing the half-life sensitivity for eigenvector k for each edge weight i,j in the graph:

S^k_{i,j} = d log(β_k + β_0) / dα_{i,j}

where α_{i,j} is the amount of perturbation added to edge i,j. There is an analytical form for this derivative, but it’s exceptionally long and would be very difficult to reproduce on WordPress’ WYSIWYG editor :P I will delve into its specifics in a later post.
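To make the half-life formula concrete, here is a tiny standalone sketch (mine, not Mahout code) that computes β_k for a handful of hypothetical eigenvalues and keeps only the slowly-decaying eigenflows; the β_0 = 10 threshold is purely illustrative:

import numpy as np

def half_lives(eigenvalues):
    # beta_k = -log(2) / log(lambda_k); valid for eigenvalues in (0, 1).
    lam = np.asarray(eigenvalues, dtype=float)
    return -np.log(2.0) / np.log(lam)

lam = np.array([0.999, 0.99, 0.9, 0.5])
beta = half_lives(lam)       # roughly [692.8, 69.0, 6.6, 1.0]
beta0 = 10.0                 # illustrative minimal half-life threshold
print(beta[beta > beta0])    # only the slowly-decaying eigenflows survive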

5: Perform non-maximal suppression on the computed sensitivities. I’m not 100% clear on this step yet, but the idea is to suppress the computed sensitivity S^k_{i,j} if there is a strictly more negative value S^k_{m,j} for some vertex v_m in the neighborhood of v_j, or S^k_{i,n} for some vertex v_n in the neighborhood of v_i. It’s the concept of “non-maximal suppression” that is foreign to me, but the analytical steps involved seem straightforward enough. Essentially, a sensitivity is “marked”, or suppressed, if it is not the maximal (in this case, most negative) value within its local neighborhood, and suppressed sensitivities are not eligible to be used in the next step.

6: Cut the edges in A for which the sensitivity exceeds some threshold. Within the affinity matrix A, set those edges i,j to 0 whose computed sensitivity S^k_{i,j} is greater than a predefined threshold of τ/δ and was not suppressed in the previous step. δ is another scaling factor, this time defined as the median of the values in the diagonal matrix D. τ is a thresholding value that is currently chosen by inspection, but Dr. Chennubhotla is looking into this himself right now and will have updates very soon. Provided the threshold is met, we cut the edge, i.e. set those affinities to 0 within A. Very importantly, however, the values to be cut first need to be moved to the diagonal. The progression of operations is as follows (a small sketch of this bookkeeping follows the updates):

a_{i,i} ← a_{i,i} + a_{i,j}
a_{j,j} ← a_{j,j} + a_{i,j}
a_{i,j} ← 0
a_{j,i} ← 0
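Here is a minimal numpy sketch of exactly that bookkeeping, assuming a dense symmetric affinity matrix (the real implementation operates on a DistributedRowMatrix, so this is only illustrative):

import numpy as np

def cut_edge(A, i, j):
    # Move the affinity onto both diagonal entries, then zero the symmetric
    # off-diagonal entries.
    w = A[i, j]
    A[i, i] += w
    A[j, j] += w
    A[i, j] = 0.0
    A[j, i] = 0.0
    return A

A = np.array([[0.0, 2.0], [2.0, 0.0]])
print(cut_edge(A, 0, 1))   # [[2. 0.] [0. 2.]]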

7: If any new edges have been cut, return to step 2; otherwise, terminate.

Those are the basic steps of the EigenCuts spectral clustering algorithm. Currently I am working to finish a testing framework for it, but I have already started converting the existing k-means spectral clustering into EigenCuts, branching from step 3 above (as steps 1-3 are already implemented) into steps 4-7. I am still trying to determine where map/reduce tasks will be needed, as well as which data types will be required.

Also, I’d still like to figure out why the k-means spectral clustering isn’t correctly assigning labels. But we’ll pick up on all this later.

Sprint 1: Wrap-up

I have to admit, this has been extremely educational. Even though the first sprint encompassed my entire “buffer” second sprint, it was well worth it. I still have other wrap-up work to do, but for all practical purposes, sprint 1 is complete.

My final question for this sprint concerns separating out the k-means clustering assignments. Here is the final snippet of code from my driver:


	Path initialclusters = RandomSeedGenerator.buildRandom(Wt.getRowPath(),
			new Path(output, Cluster.INITIAL_CLUSTERS_DIR), clusters);
	String measureClass = "org.apache.mahout.common.distance.EuclideanDistanceMeasure";
	KMeansDriver.runJob(Wt.getRowPath(), initialclusters, output,
			measureClass, 0.5, 10, 1, true);

	// Read through the cluster assignments
	Path clusteredPointsPath = new Path(output, "clusteredPoints");
	FileSystem fs = FileSystem.get(conf);
	SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(clusteredPointsPath, "part-m-00000"), conf);
	// The key is the clusterId
	IntWritable clusterId = new IntWritable(0);
	// The value is the weighted vector
	WeightedVectorWritable value = new WeightedVectorWritable();
	while (reader.next(clusterId, value)) {
	    	// TODO: Assign the clusterIds to the original data
	    	System.out.println(clusterId.get());
	    	clusterId = new IntWritable(0);
	    	value = new WeightedVectorWritable();
	}
	reader.close();

	return 0;

I’ve run this a few times on the same synthetic data set (generated using the Python script in the previous post), and I’m getting results that are difficult for me to interpret. In short, I can’t tell whether the algorithm is working correctly or not: it runs to completion (which is good), but I’m not sure the results are accurate.

Here are the clustering assignments for the first several points after two identical runs:


432 443
444 447
444 447
432 443
432 432
442 432
432 443
442 432
444 432
432 443

This can be difficult to interpret at first glance. As you can see from the code, the driver outputs the cluster assignments of all 450 points after each run. I’m not sure what order they’re in (issue #1), since the cluster assignment outputs correspond directly to components of the eigenvectors, but I’m assuming the assignments are always output in the same order. Thus, the first cluster assignment from the first run’s output (first column) refers to the same data point as the first cluster assignment from the second run (second column).

Since each run initializes things differently, the cluster assignments themselves will be different, but relatively speaking they should be similar, if not identical. From reading the output (using k = 3 and 150 points per cluster), you can get a sense of which cluster assignments parallel each other between runs. However, you can also see where they differ: in lines 5 and 9 of the above output, the two assignments differ in their relative clusterings. I can chalk this up to the random initialization of clusters via buildRandom(), and to the fact that I did not normalize the eigenvector components to unit length, but I’m still somewhat unnerved since the clusters were purposely generated to be well-separated and tightly coupled (issue #2). This is something I would like to investigate further, particularly regarding the form of the input, as the creation of the affinity matrix via the formula mentioned in previous posts likely has something to do with this as well.
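As an aside, one quick way to quantify how much two runs agree (ignoring the arbitrary label values) is an adjusted Rand score; here it is applied to the ten assignments listed above, using scikit-learn purely for illustration:

from sklearn.metrics import adjusted_rand_score

run1 = [432, 444, 444, 432, 432, 442, 432, 442, 444, 432]  # first column above
run2 = [443, 447, 447, 443, 432, 432, 443, 432, 432, 443]  # second column above
# 1.0 would mean the two runs partition the points identically despite
# different label values; lower scores quantify the disagreement.
print(adjusted_rand_score(run1, run2))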

[Figure: Three leading eigenvectors for the affinity matrix. As you can see, piecewise constancy is not exactly present, indicating a poor clustering. Either the data is not as well-separated as I’d wanted, or the algorithm itself is flawed.]

Finally (issue #3), this code requires a testing framework, which is not in place yet. That is on the docket for sprint 3 – in particular, adopting test-driven development and writing the unit tests first.

To recap:

  • Issue 1: How do I effectively line up the cluster assignments given to the components of the eigenvectors with the original data points?
  • Issue 2: How can we verify that the data is being clustered in a way that accurately reflects the geometry of the data?
  • Issue 3: Need a unit testing framework.

Now for the next step: sprint 3.

Sprint 3 Overview

I’ve updated the schedule page with revamped overviews of the overall timeline for the sprints. This was necessary due to the virtual nonexistence of sprint 0 (my master’s thesis conclusion, final exams, graduation, and moving back to ATL effectively wiped that sprint out in its entirety). The previous schedule had enough slack in it (and was built with this in mind, in fact) to absorb the shift of sprint 0 into sprint 1, and the execution of the previous sprint 1 as sprint 2. So while technically behind my previous schedule, for all intents and purposes development is right on track.

Provided, of course, that this new schedule is adhered to.

Here’s the decomposition of sprint 3 into weeks:

  • Week 1 (June 28 – July 4): Familiarization with EigenCuts theory beyond basic spectral clustering (Markov transition matrix, observing and perturbing eigenflows). Write up the initial testing framework, and finish any sprint 1-2 loose ends as needed.
  • Week 2 (July 5 – 11): Code EigenCuts. Work with Dr. Chennubhotla on any opaque points in the theory.
  • Week 3 (July 12 – 18): Finish up EigenCuts. Ensure all tests are passing. GSoC midterm evaluations.

Ideally, sprints 1-2 will have provided me with enough of a foundation in the Mahout architecture for issues along those lines to be secondary to the theory. Or so the plan goes; I suspect Isabel and the rest of the Mahout committers will be fielding just as many questions from me as always :)

Sprint 4, as a brief preview, revolves primarily around procuring a research-grade dataset, along the lines of the Reuters or Wikipedia data sets used for K-means. Writing documentation and instructions, and completing the testing framework, are among its main goals. Thus, sprint 3 really is going to be the meat of this project.

So without further loquacity, I take my leave to read up more thoroughly on the EigenCuts theory and to begin implementing a unit testing framework for the algorithm. I’ll have my latest patch on the JIRA page shortly, along with the same code in my Google SVN repository.

Sprint 1: So very close…

I’m down to the last few lines of my Driver file, now being stymied by another Exception keeping my first sprint deliverable from being complete. The culprit? A java.lang.IndexOutOfBoundsException, if you can believe that.

In order to adequately test the latest code I’ve written – which is 98% complete for the first sprint, minus this error – I have written a few Python scripts to generate synthetic data using the formulas prescribed in my two primary sources for spectral clustering [1][2]. One script, given a number of clusters and a number of points per cluster, generates the raw data.

import sys
import csv
import numpy

if len(sys.argv) < 4:
    quit('$> python generaterawdata.py <clusters> <points> <output>\n' + \
        '\tclusters\tNumber of distinct clusters in the data\n' + \
        '\tpoints\t\tNumber of points per distinct cluster\n' + \
        '\toutput\t\tName of the output file containing synthetic data\n')

numclusters = int(sys.argv[1])
numpoints = int(sys.argv[2])
output = csv.writer(open(sys.argv[3], "w"))

# each cluster is going to be roughly normally distributed in two
# dimensions. each centroid needs to be unique
for i in range(0, numclusters):
    # generate the centroid and distribution parameters
    mu_x, mu_y = numpy.random.randint(-10, 10, 2)
    sigma = 0.5

    # now, generate the data points
    for j in range(0, numpoints):
        x, y = numpy.random.normal([mu_x, mu_y], sigma, 2)
        output.writerow([x, y])

This creates a file of raw data points in (x, y) format, one point per line. The second script then processes this data into an affinity graph (along the lines of what’s under discussion for later sprints) according to whatever assumptions the user wishes to make about how the data points relate to each other. For this test data, I followed my two primary sources exactly: the affinity was calculated within an 8-point neighborhood based on Euclidean distance (meaning all points outside the 8 nearest were given an affinity of 0 with respect to the current point). The affinities were then output in the graph format specified in the last entry: (x, y, weight)

import csv
import sys
import math
import numpy

def distances(current, others, knn):
    lengths = [(math.pow((current[0] - point[0]), 2) + \
                math.pow((current[1] - point[1]), 2)) \
                for point in others]
    clone = lengths[:]
    clone.sort()

    # zero out all the elements greater than the knn
    # implements an knn setup
    for i in range(knn, len(clone)):
        lengths[lengths.index(clone[i])] = 0

    return lengths

def median(elements):
    nonzero = []
    for i in elements:
        if i > 0:
            nonzero.append(i)
    return numpy.median(nonzero)

if len(sys.argv) < 4:
    quit('$> python createaffinity.py <rawdata> <neighborhood> <affinity>\n' + \
        '\trawdata\t\tPath to the raw data input file\n' + \
        '\tneighborhood\tNumber to use for KNN\n' + \
        '\taffinity\tOutput file of the affinity matrix\n')

rawpoints = [[float(row[0]), float(row[1])] \
            for row in csv.reader(open(sys.argv[1]))]
neighborhood = int(sys.argv[2])
output = csv.writer(open(sys.argv[3], "w"))

# need to do O(n^2) comparison of all points to all other points, and
# mirror these comparisons such that: 2->1 = 1->2
for i in range(0, len(rawpoints)):
    magnitudes = [rawpoints[j] for j in range(i + 1, len(rawpoints))]
    lengths = distances(rawpoints[i], magnitudes, neighborhood)
    med = median(lengths)
    for j in range(0, len(lengths)):
        affinity = math.exp(-lengths[j] / math.pow((3 * med), 2))
        if affinity == 1.0: affinity = 0
        # lengths[j] describes rawpoints[i + 1 + j], so offset the column index
        col = i + 1 + j
        output.writerow([i, col, affinity])
        output.writerow([col, i, affinity])

This has worked extremely well for simulating live data. Of course, the data are still trivial in that they are very tightly-bound clusters, most likely well-separated; this is by design. According to [2], the test data used to really elucidate spectral clustering’s advantage over standard k-means is when the clusters form concentric circles, which throws k-means for a (proverbial) loop.

Currently, my Driver is able to execute every task necessary to input, organize, and decompose the data for k-means clustering. Here are the last few lines of the driver file, in which the eigen-decomposition is performed, and k-means clustering initiated on the matrix of eigenvectors:

		// perform eigen-decomposition
		DistributedLanczosSolver solver = new DistributedLanczosSolver();
		Path lanczosSeqFiles = new Path(output, "eigenvectors-" + (System.nanoTime() & 0xFF));
		solver.runJob(conf, L.getRowPath(), new Path(input.getParent(),
				"lanczos-" + (System.nanoTime() & 0xFF)), L.numRows(),
				L.numCols(), true, (clusters + 1), eigenVectors, eigenValues,
				lanczosSeqFiles.toString());

		// generate random initial clusters
		Path initialclusters = RandomSeedGenerator.buildRandom(lanczosSeqFiles,
				new Path(output, Cluster.INITIAL_CLUSTERS_DIR), clusters);
		// FIXME: IndexOutOfBoundsException at this spot, at line 106
		// in RandomSeedGenerator; cause is an off-by-1 indexing (both
		// ArrayLists seem to be of length k-1)

		String measureClass = "org.apache.mahout.common.distance.EuclideanDistanceMeasure";
		KMeansDriver.runJob(lanczosSeqFiles, initialclusters, output,
				measureClass, 0.5, 10, 1, true);

As you can see from the comments, when attempting to generate random initial cluster centroids, there is an IndexOutOfBoundsException thrown by java.util.ArrayList from within the buildRandom() method (line 10 above): unfortunately, it’s looping through a pair of arrays that should have length clusters, but are actually of length clusters - 1.

I’m unsure how to address this issue, given I am using the outputs from the eigen-decomposition directly. Here’s the start of the full exception report (where k = 3 clusters):

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:106)
	at org.apache.mahout.clustering.eigencuts.EigencutsDriver.run(EigencutsDriver.java:117)

References:
[1] – EigenCuts: Half-Lives of EigenFlows for Spectral Clustering
[2] – The Elements of Statistical Learning, 2nd Ed.

Sprint 1: K-means spectral clustering

For the past few weeks I’ve been performing the usual tasks of getting my development environment set up, interfacing with the Apache Mahout repository, learning more about Maven’s idiosyncrasies (gotta love missing environment variables), and building Mahout from scratch. I’ve also been working on the sprint 1 deliverable: a working implementation of k-means spectral clustering, a vanilla version of the final EigenCuts algorithm I’m poised to deliver by the end of the summer.

K-means spectral clustering is much simpler than the full EigenCuts algorithm in that its computations effectively cease after performing eigen-decompositions on the normalized Laplacian matrix. Again, I’ll delve into the specifics of EigenCuts in a later post, but what is outlined in the previous post is, essentially, all this algorithm requires. I’ll run through the steps here, along with some choice Python code that I’ve been working on as a prototype before implementing it in the Mahout framework.

But first, I need to define some terms that I’ll be using.

  • R: Raw data matrix. This is the K × n-dimensional data the user is interested in clustering.
  • S: Similarity matrix, a K × K transformation of R showing how “related” each pair of points is. This “relation” function can be anything, from pixel intensity to radial Euclidean distance.
  • A: Adjacency/Affinity matrix, also K × K, this time a transformation of S by applying a k-nearest neighbor filter to build a representation of the graph (or, for a fully-connected graph, A = S). This matrix is critical to our calculations.
  • D: Diagonal degree matrix, also K × K, formed by summing the degree of each vertex and placing it on the diagonal.
  • L: Normalized and symmetric Laplacian matrix, formed from A and D (see step 4 below). This is the matrix on which we will perform eigen-decompositions.
  • U: Matrix of eigenvectors of L. We’ll perform k-means clustering on their components.

Step 1: Process raw data

This is a somewhat tricky process by itself. The data can consist of some arbitrary K points in some arbitrary n-dimensional space. Each pair of data points in this abstract matrix R needs to be reduced to a pairwise similarity score via a “relation” function; in most cases, this function takes the form

exp{ -(x_i - x_j)^2 / c }

where each x is a raw data point, and c is some scaling factor that can be constant or relative to the data itself (such as the median). This process constructs the similarity matrix S.

Here is a Python implementation of the construction of S, where the scaling factor is c = 2σ² with σ = p·g: p is a constant of 1.5 used in the EigenCuts paper, and g is the raw median of the Euclidean distances of all neighboring data points.

import math
import numpy

S = []  # similarity matrix; `points` is the list of raw data points
for i in range(0, len(points)):
        row = []
        for j in range(0, len(points)):
                # scaled pairwise comparison
                if i == j:
                        row.append(0)
                else:
                        d = 0
                        diffs = []
                        for k in range(0, len(points[j])):
                                diff = ((points[j][k] - points[i][k]) ** 2)
                                diffs.append(diff)
                                d += diff
                        row.append(math.exp((-d) / \
                                (((numpy.median(diffs) * 1.5) ** 2) * 2)))
        S.append(row)

Step 2: Construct adjacency matrix

This step is a simple transformation of the similarity matrix we calculated in the previous step, and the specifics depend on the needs of the user. If the user requires a fully-connected graph, the adjacency matrix A will be exactly the same as the similarity matrix S. Within the EigenCuts paper, the choice was made to use an 8-neighborhood: take the 8 closest points out of all the neighboring points.

Given our current relation function, taking the 8 nearest points will equate to taking the 8 points with the largest values in the current row of S. Here is a Python implementation:

A = []  # adjacency/affinity matrix; `knn` is the neighborhood size
if knn < len(points):
        for i in range(0, len(S)):
                A.append([])
                row = S[i][:]
                row.sort(reverse=True)
                row = row[:knn]
                for j in range(0, len(S[i])):
                        if S[i][j] in row:
                                A[i].append(S[i][j])
                        else:
                                A[i].append(0)
else:
        A = S

Step 3: Construct diagonal degree matrix

This step is probably the most straightforward: simply sum the degrees of each vertex and place the value along the diagonal. Again, a Python implementation:

D = []  # diagonal degree matrix
for j in range(0, len(S)):
        D.append([])
        dj = 0
        index = 0
        for i in range(0, len(S[j])):
                D[j].append(0)
                if i == j:
                        index = i
                dj += A[j][i]
        D[j][index] = dj

Step 4: Construct normalized Laplacian

Here’s where it all starts coming together. We now want to build the normalized Laplacian matrix out of the matrices D and A that we previously built. This is accomplished via the formula from the previous post:

L = D^{-1/2} A D^{-1/2}

This is still pretty simple matrix algebra; the Python package numpy can handle this very well.
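For example, here is a minimal numpy sketch of this step (my own illustration, assuming A from step 2 is available as a dense array):

import numpy as np

def normalized_laplacian(A):
    # L = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A.
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                       # vertex degrees (row sums)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    return D_inv_sqrt.dot(A).dot(D_inv_sqrt)

L = normalized_laplacian([[0, 1, 1], [1, 0, 1], [1, 1, 0]])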

Step 5: Perform eigen-decomposition

Another straightforward computation: using the matrix L, we want to decompose it into its eigenvectors and eigenvalues. We’ll use the eigenvectors (or the M largest ones, again depending on the user’s need) to build a matrix U of column eigenvectors of L.
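Continuing the same sketch (again just an illustration – Mahout’s DistributedLanczosSolver plays this role in the actual implementation, and M = 2 here is an arbitrary example):

import numpy as np

# L: the matrix from step 4 (a small symmetric example here)
L = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
M = 2                                           # number of leading eigenvectors to keep
U = eigenvectors[:, -M:]                        # K x M matrix of the M largest eigenvectors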

Step 6: Perform k-means clustering

Using the matrix U of eigenvectors, we perform k-means clustering on its rows, essentially creating a new data set of M dimensions: the nth component of each eigenvector acts as a coordinate of a single data point which, by proxy, represents the nth data point in R, S, or A. Essentially, we have created a “conglomerate” data set of eigenvector components. We perform k-means clustering on this new K × M data set, and whichever cluster the nth “conglomerate” point is assigned to, the corresponding nth data point in the original set receives the same cluster assignment.
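A quick sketch of this step, using scikit-learn’s KMeans purely for brevity (the Mahout implementation goes through KMeansDriver, of course; the random U and k = 3 below are stand-ins):

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the K x M matrix of eigenvector columns from step 5;
# each row is one "conglomerate" point representing an original datum.
U = np.random.normal(size=(450, 3))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(U)
# labels[n] is the cluster assignment of the n-th original data point in R.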

Step 7: ???

Take what has been written in Python and finish translating it on the Mahout framework! :)
