Explore your own text collection with a topic model – without prior knowledge.

The text mining technique topic modeling has become a popular procedure for clustering documents into semantic groups. This application introduces a user-friendly workflow that leads from raw text data to an interactive visualization of the topic model. All you need is a text corpus and a little time.

Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.

1 Preprocessing

The corpus is tokenized first. This splits a text into individual words, so-called tokens. Token frequencies are typical units of analysis when working with text corpora. It may come as a surprise that reducing a book to a list of token frequencies retains useful information, but practice has shown this to be the case.
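Tokenization can be sketched in a few lines of Python; the regular expression below (letters only, lowercased) is an illustrative choice, not the rule the application itself uses.

```python
import re

def tokenize(text):
    """Split a text into lowercase word tokens (letters only, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

tokens = tokenize("Effi Briest ist ein Roman von Theodor Fontane.")
# → ['effi', 'briest', 'ist', 'ein', 'roman', 'von', 'theodor', 'fontane']
```

Punctuation and numbers are discarded here; real tokenizers differ mainly in how they treat hyphens, apostrophes, and abbreviations.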

One assumption that topic models make is the bag-of-words assumption: the order of the words in a document does not matter.
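Under the bag-of-words assumption, a document is reduced to its token frequencies, which can be sketched with the standard library:

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a token list to type frequencies; word order is discarded."""
    return Counter(tokens)

bow = bag_of_words(["to", "be", "or", "not", "to", "be"])
# → Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```

Reordering the tokens yields the same bag, which is exactly why the model is blind to word order.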

You can select any plain text files – markup will be stripped. Check out TextGrid for an extensive collection of German texts.

The frequency distribution of words in a text corpus follows Zipf’s law, which implies that a few types occur very frequently, and many types occur very rarely. In topic modeling, we are only interested in words in the middle frequency range; the most common words are usually semantically empty function words, and the rarest words are so specific that they are of no use to the model.

You can either set a threshold for the most common words to remove:

or select an external list of words to be removed (which is recommended):
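Both cleaning options above can be sketched in a few lines; the function below (a hypothetical helper, not the application's own code) drops the n most frequent types and any words on an external stopword list.

```python
from collections import Counter

def clean(tokens, most_common=0, stopwords=frozenset()):
    """Remove the n most frequent types and/or words from an external list."""
    freq = Counter(tokens)
    mfw = {word for word, _ in freq.most_common(most_common)}
    return [t for t in tokens if t not in mfw and t not in stopwords]

tokens = ["the", "cat", "and", "the", "dog", "and", "the", "fox"]
clean(tokens, most_common=1)           # → ['cat', 'and', 'dog', 'and', 'fox']
clean(tokens, stopwords={"the", "and"})  # → ['cat', 'dog', 'fox']
```

An external stopword list is the safer choice: a fixed frequency threshold can remove content words in a small or thematically narrow corpus.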

2 Modeling

A parameter is any characteristic that helps define or classify a particular system – here, the topic model. You will have to adjust two model parameters: the number of topics, i.e. how many semantic clusters should be formed, and the number of iterations, i.e. how long the model should learn from the data.

Latent Dirichlet allocation, a generative probabilistic topic model, is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.

The ideal number of topics depends on what you are looking for in the model. The default value gives a broad overview of your text collection’s contents:

The number of sampling iterations should be a trade-off between the time taken to complete sampling and the quality of the model:
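The two parameters can be illustrated with scikit-learn's LDA implementation – a sketch only, since the application may use a different backend, and scikit-learn fits the model with variational inference rather than Gibbs sampling (its `max_iter` plays the role of the iteration count).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus with two obvious themes (grammar vs. astronomy)
documents = [
    "grammar syntax verb noun language",
    "verb noun grammar sentence language",
    "planet orbit star gravity telescope",
    "telescope star planet orbit moon",
]

# Build the document-term matrix of raw token counts
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)

# The two model parameters: number of topics, number of iterations
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
doc_topics = lda.fit_transform(dtm)  # shape: (n_documents, n_topics)
```

Each row of `doc_topics` is a probability distribution over the two topics, so every document can be read as a mixture of themes rather than a flat word list.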

3 Visualizing

When using topic models to explore text collections, one is typically interested in examining texts in terms of their constituent topics – instead of pure word frequencies. Because the number of topics is so much smaller than the number of unique vocabulary elements (say, 10 versus 10,000), a range of data visualization methods become available.

You will be able to navigate through topics and documents, get similar topics and documents displayed, read excerpts from the original texts, and inspect the document-topic distributions in a heatmap.
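A document-topic heatmap of the kind described above can be rendered with matplotlib; the matrix and labels below are made-up placeholders standing in for a fitted model's output.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical document-topic distributions: rows are documents, columns topics
doc_topics = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

fig, ax = plt.subplots()
im = ax.imshow(doc_topics, cmap="Greens")
ax.set_xticks(range(3))
ax.set_xticklabels(["Topic 1", "Topic 2", "Topic 3"])
ax.set_yticks(range(3))
ax.set_yticklabels(["Doc A", "Doc B", "Doc C"])
fig.colorbar(im, ax=ax, label="topic proportion")
fig.savefig("doc_topic_heatmap.png")
```

Dark cells mark the topics a document is dominated by, so related documents become visible at a glance.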

Topic models are high-level statistical tools. A user must scrutinize numerical distributions to understand and explore their results; the raw output of the model is not enough to create an easily explored corpus.