KMeans – PlinyCompute

Suppose that a user wishes to use PlinyCompute (PC) to build a high-performance library implementation of the k-means algorithm. The complete listing of this application can be found in the GitHub repository TestKMeans. Once the programmer has defined the basic type over which the clustering is to be performed (such as the DataPoint class), the next step would likely be to define a simple class that allows the averaging of vectors:

class Avg : public Object {
	public:
	long cnt = 1;	// number of vectors summed into data
	Handle <Vector <double>> data = nullptr;	// running sum of the vectors
	Avg &operator + (Avg &addMe) {/* add addMe into this */}
};
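
The body of operator + is left as a comment in the listing above. As a rough illustration of what it might do, the following sketch sums the vectors component-wise and accumulates the counts. It is not PC code: AvgSketch and its mean method are hypothetical names, and std::vector stands in for Handle <Vector <double>> so that the sketch compiles without PlinyCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Avg class (plain C++, no PC types).
struct AvgSketch {
    long cnt = 1;              // number of vectors summed into data
    std::vector<double> data;  // running component-wise sum

    // Add addMe into this object and return this object.
    AvgSketch &operator+(const AvgSketch &addMe) {
        assert(data.size() == addMe.data.size());  // dimensions must match
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] += addMe.data[i];
        }
        cnt += addMe.cnt;
        return *this;
    }

    // Divide the running sum by the count to recover the mean vector.
    std::vector<double> mean() const {
        std::vector<double> result(data);
        for (double &x : result) {
            x /= static_cast<double>(cnt);
        }
        return result;
    }
};
```

Keeping a sum and a count, rather than a mean, is what makes the addition order-independent, which matters when partial results are combined in an aggregation.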

The programmer might next add a method to the DataPoint class that converts the DataPoint object to an Avg object:

Avg DataPoint :: fromMe () {
	Avg returnVal;
	returnVal.data = data;
	return returnVal;
}

The programmer would also add a method to the DataPoint class that accepts a set of centroids, computes the Euclidean distance to each, and returns the identifier of the closest one:

	long DataPoint :: getClose (Vector <Vector <double>> &centroids) {...}
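
The body of getClose is elided above. One plausible shape for it, written as a free function over std::vector rather than PC's Vector, is sketched below. The name closestCentroid is hypothetical, and returning -1 for an empty centroid set is an assumption made only for this sketch. It compares squared distances, which selects the same centroid as the Euclidean distance while avoiding the square root.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Return the index of the centroid nearest to point, or -1 if there are none.
long closestCentroid(const std::vector<double> &point,
                     const std::vector<std::vector<double>> &centroids) {
    long best = -1;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        assert(centroids[c].size() == point.size());  // dimensions must match
        double dist = 0.0;  // squared Euclidean distance to centroid c
        for (std::size_t i = 0; i < point.size(); ++i) {
            double diff = point[i] - centroids[c][i];
            dist += diff * diff;
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = static_cast<long>(c);
        }
    }
    return best;
}
```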

Next, a programmer using PC would define an AggregateComp class using PC’s lambda calculus, since, after all, the k-means algorithm is essentially an aggregation:

class GetNewCentroids : public AggregateComp <Centroid, long, Avg, DataPoint> {
	public:
		Vector <Vector <double>> centroids;
		
		Lambda <long> getKeyProjection (Handle <DataPoint> aggMe) override {
			return makeLambda (aggMe, [&] (Handle <DataPoint> &aggMe)
			{return aggMe->getClose (centroids);});
		}
		Lambda <Avg> getValueProjection (Handle <DataPoint> aggMe) override {
			return makeLambdaFromMethod (aggMe, fromMe);
		}
};

The declaration AggregateComp <Centroid, long, Avg, DataPoint> means that this computation aggregates DataPoint objects. For each data point, it extracts a key of type long and a value of type Avg; values that share a key are then aggregated into objects of type Centroid. To process each data point, the aggregation uses the lambdas constructed by getKeyProjection and getValueProjection. Here, for example, getKeyProjection builds a lambda that simply invokes the native C++ lambda given in the code, which returns the identifier of the centroid closest to the data point. To build a computation using this aggregation class, a programmer would need to specify the Centroid class (the result of this aggregation):

class Centroid : public Object {
	long centroidId;
	Avg data;
	public:
	long &getKey () {return centroidId;}
	Avg &getVal () {return data;}
};
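
To make the key and value semantics of the aggregation more concrete, here is a minimal single-machine sketch of the same pattern in plain C++. It does not use PC and says nothing about how PC executes an AggregateComp. The function aggregateSketch and its parameters are hypothetical; it assumes only that values can be combined with operator +, as Avg objects can.

```cpp
#include <map>
#include <vector>

// Group inputs by keyProj(item) and combine the projected values with +.
// Key and Val are given explicitly; the remaining types are deduced.
template <typename Key, typename Val, typename In, typename KeyFn, typename ValFn>
std::map<Key, Val> aggregateSketch(const std::vector<In> &inputs,
                                   KeyFn keyProj, ValFn valProj) {
    std::map<Key, Val> groups;
    for (const In &item : inputs) {
        Key key = keyProj(item);  // plays the role of getKeyProjection
        Val val = valProj(item);  // plays the role of getValueProjection
        auto it = groups.find(key);
        if (it == groups.end()) {
            groups.emplace(key, val);         // first value seen for this key
        } else {
            it->second = it->second + val;    // fold into the running value
        }
    }
    return groups;
}
```

In the k-means setting, the key projection would return the identifier of the closest centroid and the value projection would return the Avg built from the data point, so each entry of the resulting map corresponds to one Centroid.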

The programmer would then build up a computation from these pieces:

	Handle <Computation> myReader =
		makeObject <ObjectReader <DataPoint>> ("myDB", "mySet");
	// typed handle, so that the centroids member is accessible
	Handle <GetNewCentroids> myAgg = makeObject <GetNewCentroids> ();
	myAgg->centroids = ... // initialize the model
	myAgg->setInput (myReader);
	Handle <Computation> myWriter =
		makeObject <Writer <Centroid>> ("myDB", "myOutSet");
	myWriter->setInput (myAgg);
	pcClient.executeComputations (myWriter);

After execution, the set of updated centroids is stored in myDB.myOutSet. Performing this computation in a loop, where the centroids are repeatedly updated until convergence, completes the implementation.
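
The outer loop is not shown in this listing. As a hedged, single-machine sketch of its structure, the code below repeats an assign-and-average pass until no centroid coordinate moves by more than a tolerance. Everything here is an assumption for illustration: updateCentroids stands in for one execution of the PC computation, runKMeans, tol and maxIters are hypothetical names, and leaving an empty cluster's centroid unchanged is only one of several reasonable policies.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// One pass: assign each point to its nearest centroid, then average each group.
std::vector<Point> updateCentroids(const std::vector<Point> &points,
                                   const std::vector<Point> &centroids) {
    if (centroids.empty()) return centroids;
    std::vector<Point> sums(centroids.size(), Point(centroids[0].size(), 0.0));
    std::vector<long> counts(centroids.size(), 0);
    for (const Point &p : points) {
        std::size_t best = 0;
        double bestDist = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            double dist = 0.0;  // squared Euclidean distance
            for (std::size_t i = 0; i < p.size(); ++i) {
                double diff = p[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        for (std::size_t i = 0; i < p.size(); ++i) sums[best][i] += p[i];
        ++counts[best];
    }
    std::vector<Point> updated(centroids);  // empty clusters keep their centroid
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        if (counts[c] == 0) continue;
        for (std::size_t i = 0; i < sums[c].size(); ++i) {
            updated[c][i] = sums[c][i] / static_cast<double>(counts[c]);
        }
    }
    return updated;
}

// Repeat the pass until the largest coordinate shift is at most tol,
// or until maxIters passes have run.
std::vector<Point> runKMeans(const std::vector<Point> &points,
                             std::vector<Point> centroids,
                             double tol, int maxIters) {
    for (int iter = 0; iter < maxIters; ++iter) {
        std::vector<Point> next = updateCentroids(points, centroids);
        double maxShift = 0.0;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            for (std::size_t i = 0; i < centroids[c].size(); ++i) {
                maxShift = std::max(maxShift,
                                    std::fabs(next[c][i] - centroids[c][i]));
            }
        }
        centroids = next;
        if (maxShift <= tol) break;  // converged
    }
    return centroids;
}
```

In a PC program, each iteration would instead re-run the reader, aggregation and writer computation with the centroids from the previous iteration as the model.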