Topic Modelling: Latent Dirichlet Allocation, an introduction


Imagine trying to find meaning in a news article you have never seen before. Topic Modelling is a type of statistical modelling used to discover the abstract topics that occur in a corpus of documents. It provides methods for searching, organising and summarising large volumes of data, and is typically used in cases such as recommending shopping items or clustering emails or sets of images into categories.

In this blog, I will cover one of the topic modelling techniques, Latent Dirichlet Allocation (LDA for short), applied to a BBC News dataset, along with the intuition behind it and code using Scikit-learn.


Latent Dirichlet Allocation

Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.

Introduction from the original paper by David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003) [Link]
Fig 0: Overall view of LDA + Topic Modelling
Intuition

Let’s assume you have a list of 1000 documents and each document contains around 1000 words. If we take all of the words from all 1000 documents and map them to every document in the list, we could find a pattern where some documents map to a similar set of words, and we could then form clusters from those similar words. But this approach is computationally expensive, since you would have to iterate roughly 1000×1000 times to find the pattern.

Now, with the above scenario, let us introduce a hidden layer between the documents and the words they contain. Assume we have 3 topics that could possibly be created from the document list. These 3 topics are latent, i.e. not observed: we have no idea of the existence of any such topics in the documents beforehand.

Fig 1 : Document – Topic – Word Mapping using LDA

Let us now map each of the words in the document to a particular topic. With this approach we map documents to topics and topics to words. This reduces the work to 1000×3 (documents to topics) + 3×1000 (topics to words) from the initial 1000×1000 iterations. Each topic being represented is in fact a probability distribution over words, as shown in the figure.

Latent Dirichlet Allocation uses the above intuition to determine clusters out of the collection of words in a document. The description above is a trimmed-down version of the full model.

Mathematical Model

I will briefly cover the mathematical model of LDA as originally provided in the paper.

Fig 2: Latent Dirichlet Allocation Model – Plate Notation – Reference Link

Assumptions:

  • We assume the data to be a list of documents M1, M2, M3, …, Mn
  • Every document is a collection of words without the stop-words.
  • Relationship between two words is not considered in the model.
  • We want K topics from the list of words in a document.

Parameters:

  • M denotes the number of documents
  • N is the number of words in a given document (document i has N_{i} words)
  • α is the parameter of the Dirichlet prior on the per-document topic distributions
  • β is the parameter of the Dirichlet prior on the per-topic word distribution
  • θ_{i} is the topic distribution for document i
  • φ_{k} is the word distribution for topic k
  • z_{ij} is the topic for the j-th word in document i
  • w_{ij} is the specific word.

Using the above model, our goal is to infer the probability distribution of the words in each hidden topic. In the plate notation, the variable W is greyed out, which means it is the only observed variable; the others are hidden. As proposed in the original paper, a sparse Dirichlet prior can be used to model the topic–word distribution, following the intuition that the probability distribution over words in a topic is skewed, so that only a small set of words have high probability.
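
To make this concrete, the joint distribution implied by the plate notation can be written in terms of the parameters listed above (this is the standard statement of the LDA model, using the same symbols):

P(W, Z, θ, φ | α, β) = ∏_{k=1}^{K} P(φ_{k} | β) · ∏_{i=1}^{M} P(θ_{i} | α) · ∏_{j=1}^{N_{i}} P(z_{ij} | θ_{i}) · P(w_{ij} | φ_{z_{ij}})

In words: draw a word distribution φ_{k} for every topic, a topic distribution θ_{i} for every document, then for every word position pick a topic z_{ij} from θ_{i} and a word w_{ij} from the chosen topic's distribution φ_{z_{ij}}.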


CODE WALKTHROUGH

All the code is available in a Jupyter notebook. In this section, I will briefly touch on the main points of the code and the steps followed to implement LDA on the BBC News dataset.

Step 1: Understanding the Data and its distribution

First of all, we load the BBC dataset and analyse its contents. The dataset is a JSON file, which we load directly. The distribution of the data looks as below:

Fig 3: Distribution of news data across the categories provided by the dataset

The BBC dataset originally contains news from 5 different categories, and fig 3 above shows how the documents are distributed across them. Note: we will not be using the category information in our workflow any further; instead, we will use it to cross-validate the topic representation produced by the model.
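
A minimal sketch of this loading step is below; the file name bbc_news.json and the column names category and news are assumptions about how the JSON is laid out and may differ in your copy of the dataset.

```python
import pandas as pd

# Load the BBC News dataset from JSON (file and column names are assumed here).
df = pd.read_json("bbc_news.json")

# Distribution of documents across the provided categories (used later only
# to cross-validate the topics the model produces).
print(df["category"].value_counts())
```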

Now let us look at one sample row of the news and try to identify the possible entities across the document:

Fig 4: One sample row showing entities from the news dataset

From Fig 4, we can see that within a single document there are many words that don't hold any meaningful context. So when we aim at generating topics, it is better to remove the non-meaningful words and keep only the relevant ones. This will help the model predict better.

STEP 2: Transformation

The next step is to transform the dataset by removing stop words and any other unnecessary words. Once that step is completed, we need to create a sparse matrix to fit into the model. For this I have used CountVectorizer with max_df set to 0.95 (ignore words that appear in more than 95% of the documents in the corpus) and min_df set to 1 (a word must appear in at least 1 document). This converts the words into vectors so that they are machine parseable.

Fig 5: Converting word to vectors using Count-Vectorizer

Once you have applied the CountVectorizer, you fit and transform your data frame / dataset. This in turn returns a sparse matrix of shape <total number of documents> x <total words in corpus>. In fig 5, we can see there are 2225 documents across 9297 total words in the corpus.
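
A minimal sketch of this step, assuming the article text sits in a news column of the data frame loaded earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

# max_df=0.95 drops words that appear in more than 95% of the documents,
# min_df=1 keeps any word that appears in at least one document,
# stop_words="english" removes common English stop words.
vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words="english")

# Returns a sparse matrix of shape (number of documents, words in the corpus).
dtm = vectorizer.fit_transform(df["news"])
print(dtm.shape)  # e.g. (2225, 9297)
```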

Once you have the vectorised data, you are ready to fit it into the model.

STEP 3: LDA model fitting

First initialise the LDA model and then fit your vectorised data:

Fig 7: LDA model

In this section, we have chosen to cluster the words in the documents into 5 topics. This number is based on the original dataset, which came with 5 labelled categories. When we define our LDA model, we pass the parameter n_components equal to 5 and keep the rest of the parameters at their defaults. Note: you can try different learning_decay rates and choose the best output; for the purpose of this document, I went with the default params.

Next we fit the vectors generated in step 2 into the model. Once the model is fit successfully, LDA has learned the dataset and generated the topics.
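
A minimal sketch of the model definition and fit; random_state is added here only to make runs reproducible, everything else stays at the scikit-learn defaults as described above.

```python
from sklearn.decomposition import LatentDirichletAllocation

# 5 topics, matching the number of labelled categories in the BBC dataset.
lda = LatentDirichletAllocation(n_components=5, random_state=42)

# Learn the topics from the document-term matrix built in step 2.
lda.fit(dtm)
```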

Step 4: Generating the Topics

Now that we have a trained model, we will transform our dataset and display the topic outputs.

Fig 8: Generating the topics with word distribution

Based on the model, fig 8 shows the list of topics along with the top 15 most relevant words per topic. If you look closely, we can group the generated topics into high-level categories, purely by observing the words in each topic (a code sketch for printing these top words follows the note after the list):

  • Topic 0 -> relates to tv or music (Entertainment)
  • Topic 1 -> relates to Politics or election related news
  • Topic 2 -> relates to business or company 
  • Topic 3 -> relates to Technology
  • Topic 4 -> relates to some sort of game or play (Sports)

Note: LDA only generates a list of latent topics; it does not relate them to the actual categories. It is up to the user to assign meaning to the topics, as I have done above.
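
A sketch of how the top 15 words per topic shown in fig 8 can be printed from the fitted model (on older scikit-learn releases get_feature_names_out is called get_feature_names):

```python
# Map column indices of the document-term matrix back to actual words.
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    # Indices of the 15 highest-weighted words for this topic, descending.
    top_words = [feature_names[i] for i in topic.argsort()[-15:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```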

Step 5: Distribution of Topics on each of the documents

In fig 9 below, we can see the list of documents and the distribution of topic weights across each document. This is critical for understanding how a document is a mixture of topics and their underlying words.

Fig 9: Distribution of topics across the documents

As you can see, the first document is spread across 4 topics, with topic 1 dominant at 54% of the weight. Upon close and careful reading of the document, you will find that its content is mostly about politics.
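
A sketch of how such a document–topic table can be produced with transform (the column labels are my own):

```python
import pandas as pd

# Each row is a document; each column is the weight of one topic in it
# (rows sum to 1).
doc_topic = lda.transform(dtm)
doc_topic_df = pd.DataFrame(
    doc_topic,
    columns=[f"Topic {i}" for i in range(lda.n_components)],
)

# Topic mixture of the first document, e.g. dominated by one topic at ~0.54.
print(doc_topic_df.head(1).round(2))
```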

Step 6: Visualizing LDA

pyLDAvis is a Python-based visualisation library aimed at visualising LDA models and topic modelling datasets. It is by far one of the most intuitive visualisation libraries for topic modelling available.

I have passed our trained LDA model into this library to generate the visualisation shown in fig 10 below. For each generated topic selected on the left, the chart shows the weight of each word within that topic (in red) against the word's overall frequency across the corpus.
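
A sketch of how the visualisation can be generated; note that on newer pyLDAvis releases the scikit-learn helper module is named pyLDAvis.lda_model instead of pyLDAvis.sklearn.

```python
import pyLDAvis
import pyLDAvis.sklearn  # pyLDAvis.lda_model on newer pyLDAvis versions

# Build the interactive panel from the trained model, the document-term matrix
# and the fitted vectorizer, then save it as a standalone HTML page.
panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
pyLDAvis.save_html(panel, "lda_visualisation.html")
```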

