At Trifork Machine Learning we help companies to innovate by exploring how Machine Learning can help them in their business. Many of our clients have large amounts of textual data that we want to make sense of. This blog post showcases how we can apply Topic Modelling to automatically extract valuable information from real data.

Whenever we are dealing with unstructured textual data, we are often confronted with a huge amount of it. Imagine being a law firm, a publishing agency or a historian; the sheer number of documents created and collected makes them very difficult to navigate. If we are lucky, there’s a great archival system in place, but often all we have to work with is a collection of text documents. We could use a little help here to get an idea of the topics of our documents, without having to read them first. For this, we can use a technique called Topic Modelling. I have used Trifork blog posts to showcase how to use Topic Modelling to get value from a large collection of free-form text. In this blog post I will explain the experiment, which you can find in this Jupyter notebook.

What is Topic Modelling?

Imagine we have a large collection of documents that we want to understand without going through the arduous process of reading each and every one of them. Since we don’t have any prior information that tells us the content or subject of each document, we have what is called in Machine Learning an unsupervised problem.

Topic Modelling is a great option for this task, since it is an unsupervised technique that aims to find latent topics or themes in a collection of documents. We refer to the topics as latent because they are unknown to us, hence the need for an algorithm to uncover these latent topics.

The data

For this experiment we use posts from the Trifork blog as our collection of documents. These posts are manually assigned a category at posting time and, while we will not make use of these categories to discover the topics, we can use them to validate the outcome of our Topic Modelling. The data is pre-processed using a common pipeline for NLP tasks: tokenization, removing stop words, lemmatization and removing punctuation (you can see this pipeline in a Jupyter notebook in the repository).
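The four pre-processing steps can be sketched in a few lines of plain Python. This is only an illustration, not the code from the notebook: the stop-word list and the lemma table below are tiny hypothetical stand-ins for what an NLP library would normally provide.

```python
import re

# Tiny illustrative stop-word list and lemma table (both hypothetical);
# a real pipeline would take these from an NLP library.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}
LEMMAS = {"models": "model", "topics": "topic", "documents": "document", "running": "run"}


def preprocess(text):
    """Tokenize, drop punctuation and stop words, then lemmatize."""
    # Lowercase and keep alphabetic runs only, which also removes punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Map each token to its lemma when the table knows one.
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("The models are running, and the topics of documents emerge!"))
# ['model', 'run', 'topic', 'document', 'emerge']
```

The output of this step, one list of cleaned tokens per document, is what the rest of the experiment works with.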

Topic Modelling with Non-Negative Matrix Factorization

Let’s try to understand how Topic Modelling discovers latent topics in textual data. There are several algorithms for this task: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and Non-Negative Matrix Factorization (NMF). These algorithms are not limited to the Topic Modelling domain; for instance, they are also used in recommender systems. In this experiment I will use NMF since it is a popular approach.

For Topic Modelling, the algorithms mentioned above make two main assumptions:

  1. A document is a mixture of topics.
  2. A topic is described by several words.

The essence of NMF is to find k topics in a collection of documents and to determine the degree to which each document and each individual word belongs to each topic. The input to the algorithm is a document-by-word matrix indicating whether a word is present in a document or not (this is called a Bag of Words matrix).
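To make the input concrete, here is a minimal sketch that builds such a presence/absence matrix from tokenized documents. The helper name `bag_of_words` and the three toy documents are made up for illustration.

```python
def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word j occurs in document i."""
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        present = set(doc)
        matrix.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, matrix


# Toy tokenized documents (hypothetical).
docs = [["spring", "java", "bean"], ["docker", "container"], ["java", "docker"]]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)  # ['bean', 'container', 'docker', 'java', 'spring']
for row in matrix:
    print(row)
# [1, 0, 0, 1, 1]
# [0, 1, 1, 0, 0]
# [0, 0, 1, 1, 0]
```

In practice the cells often hold word counts or TF-IDF weights rather than 0/1 flags; the only requirement NMF places on the matrix is that every entry is non-negative.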

Using the concept of matrix factorization from linear algebra, the algorithm outputs two matrices:

  1. Matrix W contains scores that indicate how much each document belongs to a topic.
  2. Matrix H contains scores that indicate how much each word belongs to each topic.

Using these scores, we can determine the top-n topics to which each document belongs and hopefully give a human interpretable definition to each topic based on their top-n words. Nicolas Gillis gives a nice explanation of the mathematics behind this algorithm if you’re keen to learn more [1].

For the experiment I used the scikit-learn library in Python, which includes an implementation of NMF. Once you have trained an NMF model, you can get the two output matrices and explore the scores. In the image below you can see the vector with the scores for each topic for a given blog post (document). This blog post chiefly belongs to topic 11 (index 11) with a score of 0.6366 and to a smaller extent to topics 1 and 2. Note that these are not probabilities, so they do not need to add up to 1.

Snippet of the vector representing a blog post.
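As a rough sketch of what training looks like with scikit-learn, the snippet below fits an NMF model on a toy corpus and reads out W and H. The corpus, the choice of TF-IDF weights for the input matrix, and the parameter values (`k = 2`, `init="nndsvd"`, `max_iter=500`) are assumptions for illustration and not necessarily what the notebook uses.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for the pre-processed blog posts.
corpus = [
    "spring java bean configuration",
    "java spring framework tutorial",
    "docker container image deployment",
    "container docker kubernetes cluster",
    "spring framework java annotation",
    "kubernetes cluster deployment docker",
]

# Non-negative document-term matrix; using TF-IDF weights is an assumption here.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# The number of topics has to be given up front; 2 is picked by hand for this toy data.
k = 2
model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # shape (n_documents, k): document-topic scores
H = model.components_       # shape (k, n_words): topic-word scores

print(W[0])           # topic scores for the first document
print(W[0].argmax())  # index of its strongest topic
```

Each row of `W` is a vector like the one in the image: one non-negative score per topic for a single document.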

When using NMF, you have to specify the number of topics up front (for the above example I chose 15). However, this number is often unknown to us because we don’t know the content of the documents well enough to know the topics. This brings us to the two main challenges of Topic Modelling:

  1. Selecting a good number of topics.
  2. Interpreting the result. The resulting topics are represented by a set of words and not by a name or anything that gives them meaning. Therefore, it is up to us to interpret the set of words and identify the meaning behind the topic, if any.

Challenge 1: Selecting the number of topics with Topic Coherence

To select a good number of topics we need a quantitative way to evaluate whether the topics are meaningful. The Topic Coherence-Word2Vec (TC-W2V) metric measures the coherence between the words assigned to a topic, i.e. how semantically close the words that describe a topic are. We can train a Word2Vec [2] model on our collection of documents that will organise the words in an n-dimensional space where semantically similar words are close to each other. The TC-W2V for a topic is the average similarity between all pairs of the top-n words describing the topic (we define similarity to be 1 when the distance between the words in the n-dimensional space is 0). We then train an NMF model for different values of the number of topics (k) and for each we calculate the average TC-W2V across all topics. The k with the highest average TC-W2V is used to train a final NMF model. In this case, k=15 yields the highest average value, as shown in the graph. The paper on Topic Coherence [3] also provides several interesting heuristics to help with the analysis and interpretation of the retrieved topics.
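The selection procedure can be sketched as follows. To keep the example self-contained, a small dictionary of made-up 2-dimensional vectors stands in for a trained Word2Vec model, cosine similarity is assumed as the similarity measure, and the candidate topic descriptors are written by hand instead of coming from trained NMF models.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def topic_coherence(top_words, vectors):
    """TC-W2V of one topic: mean similarity over all pairs of its top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine_similarity(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def mean_coherence(topics, vectors):
    """Average TC-W2V across all topics of one model."""
    return sum(topic_coherence(words, vectors) for words in topics) / len(topics)


# Hypothetical word vectors; a trained Word2Vec model would supply these.
vectors = {
    "java": (1.0, 0.0),
    "spring": (0.8, 0.6),
    "docker": (0.0, 1.0),
    "container": (0.6, 0.8),
}

# Hypothetical top words per topic for two candidate values of k.
candidates = {
    1: [["java", "spring", "docker", "container"]],
    2: [["java", "spring"], ["docker", "container"]],
}

scores = {k: mean_coherence(topics, vectors) for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)  # 2
```

With real data, each entry of `candidates` would hold the top-n words of an NMF model trained with that k.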

Challenge 2: Interpreting the topics with visualizations

We have trained a model with an optimal number of topics without having to manually read the documents. The next step is to interpret the topic scores from the two output matrices, H and W. In our case, we will describe each topic by its top-10 words. Below I show the words for some of the topics:

Pandas DataFrame showing the words per topic.
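Extracting the top words of each topic boils down to sorting every row of H. Below is a minimal sketch using NumPy; the helper `top_words`, the vocabulary and the scores in `H` are invented for illustration.

```python
import numpy as np


def top_words(H, vocabulary, n=10):
    """Return the n highest-scoring words for every topic (row) of H."""
    descriptors = []
    for topic_scores in H:
        # Column indices sorted from highest to lowest score.
        best = np.argsort(topic_scores)[::-1][:n]
        descriptors.append([vocabulary[i] for i in best])
    return descriptors


# Hypothetical vocabulary and topic-word scores (2 topics, 6 words).
vocabulary = ["axon", "bean", "container", "docker", "java", "spring"]
H = np.array([
    [0.0, 0.4, 0.0, 0.1, 0.9, 0.7],
    [0.1, 0.0, 0.8, 0.9, 0.2, 0.0],
])

print(top_words(H, vocabulary, n=3))
# [['java', 'spring', 'bean'], ['docker', 'container', 'java']]
```

Naming the topics (here perhaps "Java and Spring" and "containers") is still a manual step.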

From this we can see that the model has found, amongst others, topics about Java, Spring tutorials and the Axon Framework, which match the manually assigned categories in the top menu bar of Trifork’s blog, as seen below.

Top menu bar of Trifork’s blog with the predefined categories.

To get a deeper understanding of the topics that have been identified we use pyLDAvis. This tool allows you to interact with the topic distribution, showing the words assigned to each topic and the frequency of their usage in the topic and across the corpus. Below you can see the interactive visualisation for the NMF model that I have trained (note that the topic indices in the visualisation do not correspond to the ones shown before). We can see how topic 1 encompasses topics 3, 4, 5 and 6: topic 1 is about Spring and other frameworks (Axon, Elasticsearch) commonly used with Spring. It also intersects with topics 2 and 7, which are business-related topics, probably about conferences and similar events where these technologies are discussed. There are also smaller topics farther away, whose documents likely aren’t related to the other topics as much. As an example, if I were interested in reading about containers I would start with documents assigned to topic 9 in the visualisation (topic 8 in the topics above since they were numbered differently).

You can click on the topics (the circles) in the interactive image below.

Finally, we can also manually explore the documents that are assigned to each topic by using matrix W and checking which topic has the highest score for each document. You can find this process in the Jupyter notebook.
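A minimal sketch of that lookup, assuming the document-topic scores are available as a NumPy array with one row per document (matrix W in the notation used earlier). The helper `documents_by_topic`, the titles and the scores are hypothetical.

```python
from collections import defaultdict

import numpy as np


def documents_by_topic(W, titles):
    """Group document titles by the topic with the highest score in their row of W."""
    dominant = W.argmax(axis=1)
    groups = defaultdict(list)
    for title, topic in zip(titles, dominant):
        groups[int(topic)].append(title)
    return dict(groups)


# Hypothetical titles and document-topic scores (3 documents, 2 topics).
titles = ["Spring Boot tips", "Docker basics", "Java streams"]
W = np.array([
    [0.64, 0.05],
    [0.02, 0.51],
    [0.30, 0.10],
])

print(documents_by_topic(W, titles))
# {0: ['Spring Boot tips', 'Java streams'], 1: ['Docker basics']}
```

Since a document is a mixture of topics, taking only the single highest score is a simplification; keeping the top few topics per document is an equally reasonable choice.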

Conclusion

After these steps we have a much clearer picture of the content covered in Trifork’s blog posts, with our effort expended on automation instead of reading through each and every document. If we wanted to read about the Axon framework, we would retrieve the documents assigned to that topic. Moreover, on top of this relatively simple example, we can now develop functionality to automatically assign labels to new blog posts and help authors categorise their work. As you can see, Topic Modelling is a powerful way to navigate and gain insight into the nature of a text corpus, enabling other processes such as summarisation, text analytics and predictive tasks.

[1] Gillis, N. (2014). The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257).

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

[3] O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications.

Trifork
Read what else Trifork is doing with Machine Learning on our website:
https://trifork.com/machine-learning/