Imagine trying to make sense of a news article you have never seen before. **Topic Modelling** is a type of **statistical modelling** used to discover the abstract topics that occur in a corpus of documents. Topic modelling provides methods for searching, organising and summarising large volumes of data. It is commonly used when you are trying to recommend shopping items, cluster emails, or group sets of images into categories.

In this blog, I will cover one of the topic modelling techniques, **Latent Dirichlet Allocation** or **LDA** *(in short)*, on a BBC News dataset, along with the intuition behind it and code using Scikit-learn.

**Latent Dirichlet Allocation**

Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.

Introduction from the original paper by David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003) [Link]

##### Intuition

Let’s assume you have a list of 1000 documents, each containing around 1000 words. If we take all of the words from all 1000 documents and map them to every document in the list, we could find a pattern where some documents map to similar sets of words, and hence generate clusters from those similar words. But this approach is computationally expensive, since you would have to iterate 1000×1000 times to find the pattern.

Now, in the above scenario, let us introduce a hidden layer between the documents and the words they contain. Assume there are 3 topics that could possibly be derived from the document list. These 3 topics are latent, i.e. we have no idea of the existence of any such topics in the documents beforehand.

Let us now map each word in a document to a particular topic. With this approach we map words to random topics and documents to each of the topics. This reduces the iterations to 1000×3 *(words to topics)* + 3×1000 *(topics to documents)* from the initial 1000×1000. Each topic is represented as a probability distribution over words, as shown in the fig.

**Latent Dirichlet Allocation** uses the above intuition to find clusters in a collection of words across documents. The description above is a trimmed-down version of the full model.

##### Mathematical Model

I will try to briefly cover the mathematical model of the LDA that was originally provided in the paper.

**Assumptions:**

- We assume the data to be a list of documents **M1, M2, M3, … Mn**.
- Every document is a collection of **words**, with the stop-words removed.
- The relationship between two words is not considered in the model (a bag-of-words assumption).
- We want **K** topics from the words in the documents.

**Parameters:**

- *M* denotes the number of documents.
- *N* is the number of words in a given document (document *i* has *N_i* words).
- *α* is the parameter of the Dirichlet prior on the per-document topic distributions.
- *β* is the parameter of the Dirichlet prior on the per-topic word distribution.
- *θ_i* is the **topic distribution** for document *i*.
- *φ_k* is the word distribution for topic *k*.
- *z_ij* is the **topic** for the *j*-th word in document *i*.
- *w_ij* is the specific word.

Using the above model, our goal is to infer the probability distribution of the words in the hidden topics. In the plate diagram, the variable *w* is greyed out, which means it is the only observed variable; the others are hidden. As proposed in the original paper, a sparse Dirichlet prior can be used to model the topic-word distribution, following the intuition that the probability distribution over words in a topic is skewed, so that only a small set of words have high probability.
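To make the generative story concrete, the joint distribution for a single document of *N* words, written with the symbols defined above (this is the per-document joint from the original paper):

```latex
% Joint distribution of the topic mixture \theta, topic assignments z
% and words w for one document, given the priors \alpha and \beta:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

Inference then amounts to reversing this process: given only the observed words **w**, estimate the posterior over the hidden variables *θ* and **z**.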

##### CODE WALK

All the code is available in a Jupyter notebook. In this section, I will briefly touch on the main points of the code and the steps followed to implement LDA on a BBC News dataset.

**Step 1: Understanding the Data and its distribution**

First of all, we load the BBC dataset and analyse its contents. Our dataset is a JSON file, which we load and inspect. The distribution of the data looks as below:
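The loading step itself is straightforward. As a minimal sketch (the file path and the `category`/`text` field names are assumptions about the JSON export, not the actual notebook code), counting documents per category looks like:

```python
import json
from collections import Counter

# In the notebook the records come from the JSON file, e.g.:
# with open("bbc_news.json") as f:   # hypothetical path
#     articles = json.load(f)

# A few inline records in the assumed shape, for illustration:
articles = [
    {"category": "politics", "text": "The election campaign began today."},
    {"category": "sport", "text": "The match ended with a late winner."},
    {"category": "politics", "text": "Parliament debated the new bill."},
]

# Distribution of documents per category (what fig 3 plots)
distribution = Counter(a["category"] for a in articles)
print(distribution)
```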

The BBC dataset contains news from 5 different categories. Fig 3 above shows the distribution of these documents. *Note: we will not use the category information in our workflow any further. Instead, we will use it to cross-validate the topic representation from the model.*

Now let us look at one sample row of the news and try to identify the possible entities across the document:

From Fig 4, we can see that within a single document there are many words that don’t carry any meaningful context. So when we aim to generate topics, it is better to remove the non-meaningful words and keep only the relevant ones. This will help the model predict better.

**STEP 2: Transformation**

The next step is to transform the dataset by removing stop words and other unnecessary words. Once that step is completed, we need to create a sparse matrix to fit into the model. For this I have used CountVectorizer with max_df set to 0.95 *(remove words that appear in more than 95% of the documents)* and min_df set to 1 *(a word must appear in at least 1 document)*. This converts the words into vectors so they are machine-parseable.

Once you have applied CountVectorizer, you fit and transform your data frame / dataset. This in turn returns a sparse matrix of shape <total number of documents> × <total words in corpus>. In fig 5, we can see there are 2225 documents across 9297 total words in the corpus.
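This step can be sketched end to end on a toy corpus (the real notebook runs the same calls on the 2225 BBC documents):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the election campaign began in london today",
    "the band released a new album this week",
    "the company reported strong quarterly profits",
]

# Same settings as described above: drop English stop words, ignore terms
# appearing in more than 95% of documents, keep terms seen in at least one.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=1)
dtm = vectorizer.fit_transform(docs)

# Sparse matrix of shape <number of documents> x <vocabulary size>;
# on the BBC corpus this comes out as (2225, 9297).
print(dtm.shape)
```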

Once you have a vectorised dataframe, you are ready to fit this data into the model.

###### STEP 3: LDA model fitting

First initialise the LDA model and then fit your vectorised dataframe.

In this section, we have chosen to cluster the words in the documents into 5 categories. This number is based on the original dataset, which came with 5 labelled categories. When we define our LDA model, we pass the parameter n_components=5 and keep the rest of the parameters at their defaults. *Note: you can try different learning_decay rates and choose the best output. For the purpose of this document, I went with the default params.*

Next we fit the vector generated in step 2 into the model. Once the model is fit successfully, LDA has now learned the dataset and generated the topics.
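Steps 2 and 3 together look like the sketch below on a toy corpus (here n_components=3 because the toy corpus is tiny; on the BBC data the call is identical with n_components=5):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the election campaign and the vote in parliament",
    "the band released a new album and a tour",
    "the striker scored a goal in the final match",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# random_state pins the otherwise stochastic variational inference
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(dtm)

# components_ holds the per-topic word weights: shape (n_topics, vocab_size)
print(lda.components_.shape)
```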

###### Step 4: Generating the Topics

Now that we have a fitted model, we will transform our dataset and display the topic outputs.

Based on the model, fig 8 shows the list of topics along with the top 15 most relevant words per topic. If you look closely, we can categorise the generated topics into high-level categories. *This is purely based on observing the words in each topic.*

- Topic 0 -> relates to tv or music (Entertainment)
- Topic 1 -> relates to Politics or election related news
- Topic 2 -> relates to business or company
- Topic 3 -> relates to Technology
- Topic 4 -> relates to some sort of game or play (Sports)

*Note: LDA only generates numbered topics as distributions over words. It does not produce actual category labels; it is up to the user to assign meaning to the topics, as I have done above.*

###### Step 5: Distribution of Topics on each of the documents

In fig 9 below, we can see the list of documents and the distribution of topic weights across each document. This is critical for understanding how a document is a mixture of topics and their underlying words.

As you can see, the first document is spread across 4 topics, with topic 1 dominant at 54% of the weight. On a close and careful reading of the document, you would find that its content is mostly about politics.
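The per-document topic weights come from transform(), which returns one probability row per document; a sketch on the same toy setup:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "election vote parliament minister campaign",
    "album band tour single concert",
    "goal match striker league final",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

# Each row is one document's topic mixture and sums to 1
doc_topic = lda.transform(dtm)
dominant = doc_topic.argmax(axis=1)  # most heavily weighted topic per doc
print(doc_topic.round(2))
print("dominant topic per document:", dominant)
```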

###### Step 6: Visualizing LDA

PyLDAvis is a Python-based visualisation library for exploring LDA models and topic-modelling datasets. It is by far one of the most intuitive visualisation libraries available for topic modelling.

I have applied our trained LDA model to this library to generate the visualisation in fig 10 below. For each of the generated topics on the left, the chart shows the weight of each word within the topic *(in red)* against the word’s frequency across the whole corpus.

**Reference Links:**

- https://user.eng.umd.edu/~smiran/LDA.pdf
- http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
- https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
- https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158