Latent Dirichlet Allocation

Here we demonstrate an implementation of Gibbs sampler for Latent Dirichlet Allocation (LDA). LDA is a classic topic model, and has been widely used in text processing. While LDA implementations are common, we choose a particularly challenging form of LDA learning: a word-based, non-collapsed Gibbs sampler [1]. The LDA implementation is non-collapsed because it does not integrate out the word-probability-per-topic and topic-probability-per-document random variables. In general, collapsed implementations that do integrate out these values are more common, but such collapsed implementations cannot be made ergodic in a distributed setting (where ergodicity implies theoretical correctness in some sense).

Our implementation is word based because the fundamental data objects it operates over are (docID, wordID, count) triples. This generally results in a more challenging implementation from the platform’s point-of-view because it requires a many-to-one join between words and the topic-probability-per-document vectors. This PC LDA implementation uses the GSL library to perform all necessary random sampling (non-collapsed LDA requires sampling from both Multinomial and Dirichlet distributions).

The full PC LDA implementation requires fifteen different Computation Objects, as shown in the figure below. The complete source code can be found on our github repository TestLDA.

Pliny Compute LDA’s Computation Objects and input-output dependencies. Computations connected by dash lines will only run once, during initialization. Computations connected by solid lines will run iteratively.

Each iteration requires the following computations:

Simple selections:

DocWordTopicAssignmentIdentity assigns words, documents, and topics.
DocTopicProbSelection samples the topic probability for a document.

Multi Joins:

DocWordTopicJoin joins documents, document-topic probability, and word-topic probability.

Multi-selection computations

DocAssignmentMultiSelection extracts document assignment.
TopicAssignmentMultiSelection extracts topic assignment.
TopicWordProbMultiSelection generates the entries for the topic probability for each word.

Aggregation computations

DocTopicAggregate aggregates document assignments.
TopicWordAggregate aggregates topic assignment.
WordTopicAggregate aggregates topic word probabilities.

References

Zhuhua Cai, Zekai J Gao, Shangyu Luo, Luis L Perez, Zografoula Vagena, and Christopher Jermaine. 2014. A comparison of platforms for implementing and running very large scale machine learning algorithms. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 1371–1382.