Latent Dirichlet Allocation (LDA) using the Gibbs sampling technique is a framework for analyzing
the hidden (latent) topic structure of large-scale datasets such as collections of text documents.

Input to the LDA Algorithm:

LDA is used for parameter estimation and inference, as shown below.
a) Parameter Estimation from Scratch:
> lda -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dfile <string>

b) Parameter Estimation from a previously estimated model:
> lda -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]
c) Inference for new data:
> lda -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>

Parameters ([] indicates an optional argument):

-est – Estimate from scratch
-estc – Continue estimation from a previously estimated model
-inf – Inference for new data
-alpha – Value of alpha (hyperparameter)
-beta – Value of beta (hyperparameter)
-ntopics – Number of topics
-niters – # of Gibbs sampling iterations
-savestep – Step (in Gibbs sampling iterations) at which the LDA model is saved
-twords – # of most likely words to be printed for each topic
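
As an illustration of the from-scratch usage above, the following Python sketch assembles an "lda -est" argument list from the documented flags. The helper name build_est_command, the file name docs.dat, and the option values are hypothetical; the sketch only assumes an executable named lda that accepts the flags listed above.

```python
def build_est_command(dfile, alpha=None, beta=None, ntopics=None,
                      niters=None, savestep=None, twords=None):
    """Build the argument list for 'lda -est' (estimation from scratch)."""
    argv = ["lda", "-est"]
    # Optional flags are emitted only when a value was supplied.
    optional = [("-alpha", alpha), ("-beta", beta), ("-ntopics", ntopics),
                ("-niters", niters), ("-savestep", savestep),
                ("-twords", twords)]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    # -dfile is the only required argument in this mode.
    argv += ["-dfile", dfile]
    return argv


# Hypothetical values, for illustration only.
command = build_est_command("docs.dat", ntopics=50, niters=1000, twords=10)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run if the program is to be launched from Python.
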
Outputs of Latent Dirichlet Allocation

The following files are the outputs of LDA.

1) <model_name>.others -> contains some parameters of the LDA model
2) <model_name>.phi -> topic-word distribution (rows -> topics, cols -> words in the vocabulary)
0.112849 0.001117 0.883799 0.001117 0.001117
0.001143 0.561143 0.046857 0.389714 0.001143
0.164444 0.045926 0.001481 0.075556 0.712593
3) <model_name>.theta -> document-topic distribution
(rows -> documents, cols -> topics)
0.008621 0.008621 0.008621 0.008621 0.008621 0.008621 …….
4) <model_name>.tassign -> contains the topic assignment of each word, as <word_i>:<topic of word_i> pairs
0:10 1:95 2:5 2:57 3:95 3:69 3:4 4:98
0:28 1:96 2:85 2:7 3:14 3:28 3:13 4:8
5) <model_name>.twords -> contains the most likely words of each topic
Topic 0th:
acquisit 0.883799
abil 0.112849
absenc 0.001117
agreem 0.001117
ail 0.001117
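
The sample outputs above can be read back with a few lines of Python. The sketch below assumes the layouts shown in the samples (whitespace-separated probabilities per row, and word:topic pairs separated by spaces); the function names and the vocabulary order are hypothetical, the latter inferred only from the fact that it reproduces the .twords sample from the first .phi row.

```python
def parse_matrix(text):
    """Parse rows of whitespace-separated floats (.phi / .theta layout)."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def parse_tassign_line(line):
    """Parse one line of 'word:topic' pairs into (word_id, topic_id) tuples."""
    return [tuple(int(part) for part in token.split(":"))
            for token in line.split()]


def top_words(phi_row, vocab, n):
    """Return the n most probable (word, probability) pairs of one topic."""
    # sorted() is stable, so equally probable words keep their id order.
    order = sorted(range(len(phi_row)), key=lambda w: -phi_row[w])
    return [(vocab[w], phi_row[w]) for w in order[:n]]


PHI_SAMPLE = """0.112849 0.001117 0.883799 0.001117 0.001117
0.001143 0.561143 0.046857 0.389714 0.001143
0.164444 0.045926 0.001481 0.075556 0.712593"""

phi = parse_matrix(PHI_SAMPLE)
# Assumed word ids 0..4; this ordering is an inference, not given in the text.
vocab = ["abil", "absenc", "acquisit", "agreem", "ail"]
pairs = parse_tassign_line("0:10 1:95 2:5 2:57 3:95 3:69 3:4 4:98")

for word, prob in top_words(phi[0], vocab, 5):
    print(word, prob)
```

Each row of the .phi sample sums to 1 up to rounding, which is a quick sanity check when loading such a file.
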
Important Parameters and Variables:

M - # of Documents
V - vocabulary size
K - number of topics
alpha, beta - LDA hyper parameters
z – Matrix containing topic assignments for words
nw – Matrix containing # of instances of word i assigned to topic j [Size is V x K]
nd – Matrix containing # of words in document i assigned to topic j [Size is M x K]
nwsum – total # of words assigned to topic j [Size is K]
ndsum – total # of words in document i [Size is M]
theta – Matrix having document-topic distributions [Size is M x K]
phi – topic-word distributions [Size K x V]
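
To show how the variables listed above fit together, here is a minimal Python sketch of collapsed Gibbs sampling for LDA. It is not taken from the tool itself: the update rule and the theta/phi estimates are the standard textbook formulas, and the function names, the toy corpus, and the hyperparameter values are assumptions made for illustration.

```python
import random


def count_assignments(docs, z, V, K):
    """Build nw (V x K), nd (M x K), nwsum (K), ndsum (M) from assignments z."""
    nw = [[0] * K for _ in range(V)]
    nd = [[0] * K for _ in range(len(docs))]
    nwsum = [0] * K
    ndsum = [len(doc) for doc in docs]
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[m][n]
            nw[w][k] += 1
            nd[m][k] += 1
            nwsum[k] += 1
    return nw, nd, nwsum, ndsum


def resample_topic(m, n, docs, z, nw, nd, nwsum, alpha, beta, rng):
    """One collapsed Gibbs update for the n-th word of document m."""
    w, old = docs[m][n], z[m][n]
    V, K = len(nw), len(nwsum)
    # Remove the word's current assignment from the counts.
    nw[w][old] -= 1
    nd[m][old] -= 1
    nwsum[old] -= 1
    # Unnormalized probability of each topic given all other assignments.
    # The per-document denominator is identical for every topic, so it is
    # omitted and ndsum is left unchanged.
    weights = [(nw[w][k] + beta) / (nwsum[k] + V * beta) * (nd[m][k] + alpha)
               for k in range(K)]
    new = rng.choices(range(K), weights=weights)[0]
    # Add the word back under its newly sampled topic.
    nw[w][new] += 1
    nd[m][new] += 1
    nwsum[new] += 1
    z[m][n] = new
    return new


def estimate_theta(nd, ndsum, alpha):
    """Document-topic distributions, size M x K."""
    K = len(nd[0])
    return [[(nd[m][k] + alpha) / (ndsum[m] + K * alpha) for k in range(K)]
            for m in range(len(nd))]


def estimate_phi(nw, nwsum, beta):
    """Topic-word distributions, size K x V."""
    V, K = len(nw), len(nwsum)
    return [[(nw[w][k] + beta) / (nwsum[k] + V * beta) for w in range(V)]
            for k in range(K)]


# Toy corpus (hypothetical): M = 2 documents, V = 3 words, K = 2 topics.
docs = [[0, 0, 1], [1, 2, 2, 2]]
z = [[0, 0, 0], [1, 1, 1, 1]]
V, K, alpha, beta = 3, 2, 0.5, 0.1
nw, nd, nwsum, ndsum = count_assignments(docs, z, V, K)

rng = random.Random(0)
for m, doc in enumerate(docs):  # one full Gibbs sweep over all words
    for n in range(len(doc)):
        resample_topic(m, n, docs, z, nw, nd, nwsum, alpha, beta, rng)

theta = estimate_theta(nd, ndsum, alpha)
phi = estimate_phi(nw, nwsum, beta)
```

Keeping nw, nd and nwsum updated incrementally means each update costs O(K) rather than a full recount; theta and phi only need to be computed when a model is saved.
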

