Here we demonstrate an implementation of Gibbs sampler for Latent Dirichlet Allocation (LDA). LDA is a classic topic model, and has been widely used in text processing. While LDA implementations are common, we choose a particularly challenging form of LDA learning: a word-based, non-collapsed Gibbs sampler . The LDA implementation is non-collapsed because it does not integrate out the word-probability-per-topic and topic-probability-per-document random variables. In general, collapsed implementations that do integrate out these values are more common, but such collapsed implementations cannot be made ergodic in a distributed setting (where ergodicity implies theoretical correctness in some sense).
Our implementation is word based because the fundamental data objects it operates over are (docID, wordID, count) triples. This generally results in a more challenging implementation from the platform’s point-of-view because it requires a many-to-one join between words and the topic-probability-per-document vectors. This PC LDA implementation uses the GSL library to perform all necessary random sampling (non-collapsed LDA requires sampling from both Multinomial and Dirichlet distributions).
The full PC LDA implementation requires fifteen different Computation Objects, as shown in the figure below. The complete source code can be found on our github repository TestLDA.
Each iteration requires the following computations:
- Simple selections:
- Multi Joins:
- Multi-selection computations
- Aggregation computations
DocWordTopicJoin joins documents, document-topic probability, and word-topic probability.
DocAssignmentMultiSelection extracts document assignment.
TopicAssignmentMultiSelection extracts topic assignment.
TopicWordProbMultiSelection generates the entries for the topic probability for each word.
- Zhuhua Cai, Zekai J Gao, Shangyu Luo, Luis L Perez, Zografoula Vagena, and Christopher Jermaine. 2014. A comparison of platforms for implementing and running very large scale machine learning algorithms. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 1371–1382.