{smcl}
{* 16aug2016}{...}
{cmd:help ldagibbs}{right: ({browse "http://www.stata-journal.com/":SJxx-x: dm00xx})}
{hline}

{title:Title}

{p2colset 5 20 22 2}
{p2col :{hi:ldagibbs} {hline 2} Document Clustering using Latent Dirichlet Allocation}{p_end}
{p2colreset}{...}


{title:Syntax}

{p 8 17 2}
{cmd:ldagibbs} {varname} [{cmd:,} {it:options}]

{synoptset 30 tabbed}{...}
{synopthdr}
{synoptline}

{syntab:Gibbs Sampler Options}
{synopt :{opth t:opics(it:integer)}} number of topics {p_end}
{synopt :{opth burn:in_iter(it:integer)}} number of burn-in iterations {p_end}
{synopt :{opth a:lpha(it:real)}} prior for the topic probability distribution {p_end}
{synopt :{opth b:eta(it:real)}} prior for the word probability distribution {p_end}
{synopt :{opth sam:ples(it:integer)}} number of samples to take {p_end}
{synopt :{opth sampli:ng_iter(it:integer)}} number of iterations between samples {p_end}
{synopt :{opth se:ed(it:integer)}} random seed {p_end}
{synopt :{opt like:lihood}}  calculate and report likelihood {p_end}

{syntab:Text Cleaning Options}
{synopt :{opth  mi:n_char(it:integer)}} minimum number of characters in a word {p_end}
{synopt :{opth  stop:words(strings:string)}}  list of stopwords {p_end}

{syntab:Output Options}
{synopt :{opth  na:me_new_var(strings:string)}} stub for the names of the new variables {p_end}
{synopt :{opt norm:alize}} normalize topic assignment counts to probabilities {p_end}
{synopt :{opt ma:t_save}} save word probability matrix{p_end}
{synopt :{opth pa:th(strings:string)}} path in which to save word probability matrix  {p_end}
{synoptline}

{pstd}See {help ldagibbs##Options:{it:Options}} for details on specifying options.

 
{title:Description} 

{p 4 4 2} {cmd:ldagibbs} implements a Gibbs sampling algorithm for
Latent Dirichlet Allocation (LDA).

{p 4 4 2} The {cmd:ldagibbs} command generates one new variable per topic.
For each document, these variables contain the raw topic assignment counts or,
if {cmd:normalize} is specified, the probability that the document belongs to
each topic.

{marker Options}{...}


{title:Options}


{dlgtab:Gibbs Sampler Options}

{phang} {cmd:topics(}{it:integer}{cmd:)} specifies the number of topics that LDA should create. The default is {cmd:topics(}10{cmd:)}.

{phang} {cmd:burnin_iter(}{it:integer}{cmd:)} specifies how many iterations the Gibbs Sampler should run as a burn-in period. The default is {cmd:burnin_iter(}500{cmd:)}. 

{phang} {cmd:alpha(}{it:real}{cmd:)} sets the prior for the topic probability distribution. The value should be between 0 and 1.
A common heuristic is to set alpha to 50/T, where T is the number of topics; this gives a value below 1 only if T>50. The default is {cmd:alpha(}0.25{cmd:)}.

{phang} {cmd:beta(}{it:real}{cmd:)}  sets the prior for the word probability distribution. The value for {cmd:beta} should be between 0 and 1. The default value is {cmd:beta(}0.1{cmd:)}.

{phang} {cmd:samples(}{it:integer}{cmd:)} specifies how many samples the algorithm should collect after the burn-in period. To obtain robust results, at least 10 samples should be taken. The default is {cmd:samples(}10{cmd:)}.

{phang} {cmd:sampling_iter(}{it:integer}{cmd:)} specifies how many iterations the Gibbs Sampler should discard between consecutive samples.
Running additional iterations between samples reduces the correlation between them. The default is {cmd:sampling_iter(}50{cmd:)}.

{phang} {cmd:seed(}{it:integer}{cmd:)} sets the seed for the random-number generator so that results are reproducible. The default is {cmd:seed(}0{cmd:)}.

{phang} {cmd:likelihood} specifies that the Gibbs Sampler should calculate and report the log-likelihood of the LDA model every 50 iterations. This option makes it possible to monitor the convergence of the Gibbs Sampler but slows down the sampling process.


{dlgtab:Text Cleaning Options}

{phang} {cmd:min_char(}{it:integer}{cmd:)} removes short words from the texts. Words with fewer characters than {cmd:min_char(}{it:integer}{cmd:)} are excluded from the sampling algorithm. The default is {cmd:min_char(}0{cmd:)}.

{phang} {cmd:stopwords(}{it:string}{cmd:)} specifies a list of words to exclude from the Gibbs Sampler, separated by spaces, for example {cmd:stopwords(}"I you she he"{cmd:)}.
Highly frequent words such as "I" and "you" are usually removed from the text because they do not help with the classification of the documents. Predefined stopword lists for different languages are available online.


{dlgtab:Output Options}

{phang} {cmd:name_new_var(}{it:string}{cmd:)} specifies the stub used to name the output variables created by {cmd:ldagibbs}. These variables contain the topic assignments for each document.
The user should ensure that the resulting names do not clash with existing variables in the dataset. The default is {cmd:name_new_var(}"topic_prob"{cmd:)}, so that the new variables are named topic_prob1-topic_probT, where T is the number of topics.

{phang} {cmd:normalize} specifies that {cmd:ldagibbs} should normalize the raw topic assignment counts to probabilities. By default, the raw counts are returned.
Most users will want to specify this option.

{phang} {cmd:mat_save} specifies that the word probability matrix should be saved. This matrix contains the word probabilities for each topic and therefore identifies the most frequent words in each topic. By default, the matrix is not saved.

{phang} {cmd:path(}{it:string}{cmd:)} sets the path where the word probability matrix is saved.


{title:Remarks}

{pstd} To run {cmd:ldagibbs}, the user must specify the variable containing the text strings to be classified. The options adjust the behaviour of the sampler.

{title:How to interpret the output}

{p 4 4 2} {cmd:ldagibbs} generates T new variables, where T is the number of topics. These variables describe the topic assignments of each document.
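
{p 4 4 2} As a minimal sketch, assume a text variable named {cmd:title} (as in the Examples section) and the default settings, so that the new variables are named topic_prob1-topic_prob10. The output could then be inspected as shown here. If {cmd:normalize} produces per-document probabilities, as its description suggests, each document's values should sum to approximately 1. The variable name {cmd:check_total} is purely illustrative.

{p 8 12 2}{cmd:. ldagibbs title, normalize}{p_end}
{p 8 12 2}{cmd:. summarize topic_prob1-topic_prob10}{p_end}
{p 8 12 2}{cmd:. egen double check_total = rowtotal(topic_prob1-topic_prob10)}{p_end}
{p 8 12 2}{cmd:. summarize check_total}{p_end}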

{p 4 4 2} To save the word probability vectors, specify the {cmd:mat_save} option. A Mata matrix file named "word_prob" is then stored in the location given by {cmd:path(}{it:string}{cmd:)}.

{p 4 4 2} The file contains the word probabilities for each of the T topics. The {cmd:ldagibbs} package also provides the {cmd:wprobimport} command, which imports the stored word probability data into Stata. The syntax for {cmd:wprobimport} is

{p 8 17 2} {cmd:wprobimport using} {it:filename}{p_end}


{title:Examples}

To run Latent Dirichlet Allocation:

{p 4 8 2}{cmd:. ldagibbs title, topics(20) alpha(0.20) beta(0.05) seed(5) burnin_iter(750) samples(3) sampling_iter(100) likelihood min_char(3) name_new_var("topic_probability") normalize stopwords("I you she he")}{p_end}
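
To apply the 50/T heuristic for alpha (a sketch assuming T=100 topics, so that 50/100 = 0.5; the value is computed by hand because it is not documented whether {cmd:alpha()} accepts an expression):

{p 4 8 2}{cmd:. ldagibbs title, topics(100) alpha(0.5) normalize}{p_end}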

To import the word probability matrix saved by a previous run with {cmd:mat_save}:

{p 4 8 2}{cmd:. wprobimport using "word_prob"}{p_end}
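
To save the word probability matrix to a chosen location and then import it (a sketch only: the directory "C:/lda_output" is hypothetical, and the assumption that the file ends up at that path followed by "word_prob" should be checked against the file actually written):

{p 4 8 2}{cmd:. ldagibbs title, topics(20) normalize mat_save path("C:/lda_output")}{p_end}

{p 4 8 2}{cmd:. wprobimport using "C:/lda_output/word_prob"}{p_end}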



{title:Authors}

{pstd}Carlo Schwarz{p_end}
{pstd}University of Warwick{p_end}
{pstd}United Kingdom{p_end}
{pstd}{browse "www.carloschwarz.eu"}{p_end}
{pstd}c.r.schwarz@warwick.ac.uk{p_end}



{title:Also see}  

{p 4 14 2}
Article:  {it:Stata Journal}, volume x, number x: {browse "http://www.stata-journal.com/":dm00xx}

