Natural Language Processing in the Social Sector — Supplemental Materials

Topic Modeling

Given a collection of texts, you may expect some texts to resemble each other more than they resemble the rest, based on the words that tend to occur together. For example, the words common across US news articles differ from those common across international financial news.

Topic modeling discovers these patterns. One of the most popular approaches, Latent Dirichlet Allocation (LDA), groups documents and terms into unlabeled topics, which the user then interprets. The visualization below displays several aspects of an LDA model. We’ll walk through them one by one. To start, select the five-topic model:

The five topics in our model are represented as circles on the left. The size and numbering of the circles indicate how common each topic is in our dataset. (The topic labeled 1 is the most common, while the topic labeled 5 is the least common.) The closer two circles are to each other, the more closely related the topics are. For example, topics 2 and 5 overlap. This overlap indicates that they are similar (i.e., the topics share many of the same words). Topics 1 and 4, however, are very far apart; they are dissimilar. Overall, the image on the left provides a picture of how common the topics are and how they are related to each other.
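To make the two visual cues above concrete, here is a minimal sketch of how circle size and circle distance could be computed. It assumes that topic prevalence is the share of all tokens attributed to a topic and that topic similarity is measured with the Jensen-Shannon divergence between topic word distributions, which is a common choice in LDAvis-style tools; we have not confirmed that this visualization uses exactly these settings. The function names and the tiny distributions are hypothetical.

```python
import math

def topic_shares(theta, doc_lengths):
    """Share of all tokens attributed to each topic.

    theta[d][k] is the estimated proportion of document d that belongs to
    topic k; doc_lengths[d] is the number of tokens in document d.
    """
    total = sum(doc_lengths)
    num_topics = len(theta[0])
    return [
        sum(row[k] * n for row, n in zip(theta, doc_lengths)) / total
        for k in range(num_topics)
    ]

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = no shared words."""
    words = set(p) | set(q)
    mid = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl(a):
        # Kullback-Leibler divergence of a from the midpoint distribution.
        return sum(a[w] * math.log2(a[w] / mid[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical word distributions for three topics (illustration only).
topic_a = {"child": 0.5, "care": 0.3, "school": 0.2}
topic_b = {"child": 0.4, "care": 0.3, "foster": 0.3}
topic_c = {"tax": 0.6, "revenue": 0.4}

print(jensen_shannon(topic_a, topic_b))  # small: the topics share many words
print(jensen_shannon(topic_a, topic_c))  # large: the topics share no words
print(topic_shares([[0.9, 0.1], [0.2, 0.8]], [100, 300]))
```

A visualization would then project the pairwise divergences onto two dimensions so that similar topics land close together.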

Click on the circle for topic 1. You will notice that the bar chart on the right changes. From top to bottom, the words listed here are the most common words in topic 1. The dark orange bar indicates how many times a given word appears in topic 1; the pale orange bar indicates how many times a given word appears overall. For example, you can see that the word “child” appears over 10,000 times in topic 1, and more than 20,000 times overall. If you hover over a word on the right-hand side of the visualization, the circles on the left-hand side will change in size to reflect how commonly that word appears in each topic. If you hover over the word “child,” the circles for topics 1 and 3 on the left grow until they are nearly equal in size and dwarf all the other topics. This is because most of the time when the word “child” appears in our dataset, it is in either topic 1 or topic 3.
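The hover behavior described above amounts to asking how a word's occurrences are split across topics. Here is a minimal sketch, assuming we already have a per-topic count for each word. The counts below are hypothetical, chosen only to mimic the pattern described for “child”; in a real model these would be estimated frequencies produced by LDA, not numbers typed in by hand.

```python
def term_share_by_topic(term_topic_counts, term):
    """Fraction of a term's occurrences that fall in each topic."""
    counts = term_topic_counts[term]
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Hypothetical per-topic counts for one word (illustration only).
term_topic_counts = {
    "child": {1: 10500, 2: 300, 3: 9400, 4: 200, 5: 600},
}

shares = term_share_by_topic(term_topic_counts, "child")
for topic, share in sorted(shares.items()):
    print(f"topic {topic}: {share:.1%}")
```

With these made-up counts, topics 1 and 3 together account for the large majority of the word's occurrences, which is why their circles would dominate when hovering.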

Looking at the most common terms for each topic can begin to give us a sense of what that topic is about. Very common terms, though, show up in multiple topics, as we have seen with the word “child.” Such terms help us understand what makes topics 1 and 3 different from the other topics in the model, but we also need to understand what differentiates these two topics from each other. For this, we can look at the words that are more exclusive (i.e., unique) to a given topic.

Select the circle for topic 1 and slide the bar on the right labeled λ all the way to 0. Notice that the numbers on the x-axis of the bar chart get much smaller; we are now looking at terms that are less common overall. Also notice that the bars are entirely, or almost entirely, filled with the darker topic-specific color. This means that every time, or nearly every time, a given term appears, it is in topic 1. Hover over the word “predator” and you will see the topic 1 circle on the left grow, and all the other circles disappear.

The best way to get an idea of what a given topic is about is to read through this list of terms at various settings of λ. Setting λ = 1 ranks terms by how common they are within the topic; λ = 0 ranks them by how exclusive they are to the topic; values in between blend the two. One user study (reported by Sievert and Shirley in the paper introducing this visualization) found that λ near 0.6 gave the most interpretable list of words for a given topic. Play around with different settings and see what helps you best understand these topics and this model.
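The λ slider corresponds to the “relevance” score defined by Sievert and Shirley: relevance(w, k | λ) = λ · log p(w | k) + (1 − λ) · log(p(w | k) / p(w)), where p(w | k) is the probability of word w within topic k and p(w) is its probability in the whole corpus. The sketch below ranks terms by this score. The function name and the probabilities are hypothetical, picked only to show how the ranking flips between λ = 1 and λ = 0.

```python
import math

def rank_terms(p_word_given_topic, p_word, lam, top_n=5):
    """Rank a topic's terms by relevance for a given lambda in [0, 1].

    Assumes every probability is strictly positive so the logs are defined.
    """
    def relevance(word):
        p_wk = p_word_given_topic[word]
        return lam * math.log(p_wk) + (1 - lam) * math.log(p_wk / p_word[word])

    return sorted(p_word_given_topic, key=relevance, reverse=True)[:top_n]

# Hypothetical probabilities (illustration only).
topic_1 = {"child": 0.08, "year": 0.05, "predator": 0.004}
overall = {"child": 0.03, "year": 0.04, "predator": 0.0008}

print(rank_terms(topic_1, overall, lam=1.0))  # ['child', 'year', 'predator']
print(rank_terms(topic_1, overall, lam=0.0))  # ['predator', 'child', 'year']
print(rank_terms(topic_1, overall, lam=0.6))
```

At λ = 1 the frequent word leads; at λ = 0 the rare but exclusive word leads, which mirrors what the slider does in the bar chart.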

For more information about this visualization, check out the paper written by Carson Sievert and Kenneth E. Shirley, who designed it.

Based on these results, we generated the following labels for the topics in each model. It is important to note that labeling topics is subjective; you may have a better label for one or more of the topics below. This is one of the reasons that topic modeling is most useful when performing an exploratory analysis; it can be helpful to have multiple subject matter experts review and label the generated topics.

Three Topic Model:

  1. School Governance
  2. Campus Construction
  3. Taxes

Five Topic Model:

  1. Licensing
  2. Taxes
  3. Child Welfare/Assistance
  4. Campus Construction
  5. School Governance/Pensions

Ten Topic Model:

  1. Child Welfare/Assistance
  2. Campus Construction
  3. Licensing
  4. Taxes
  5. Foster Care
  6. School Governance
  7. Budgeting
  8. Gambling Licensing/Revenue
  9. Pensions
  10. Corporate Governance

Once we had generated labels for all of the topics, it was interesting to see how each document was categorized across the three, five, and ten topic models. We can visualize this with a Sankey diagram, which is designed to show the “flow” between different categories.

Looking at this diagram, we unsurprisingly see that almost all bills labeled “Pensions” in the ten topic model were labeled “School Governance/Pensions” in the five topic model. Other topics, such as “Budgeting,” draw from a wider range of sources. It is not unusual for LDA to generate a “miscellaneous” topic like this one, which is far less cohesive than others. This is because every document must be categorized in some way, and some just don’t fit neatly into one of a few categories. Contrast the “Budgeting” topic with the “Campus Construction” topic. While small, “Campus Construction” appears consistently in all three models. This indicates that it is a distinct, well-defined topic.
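The flows in a Sankey diagram like this one are simply counts of documents that share a pair of labels across two models. Here is a minimal sketch, assuming each document is assigned to its single most probable topic in each model. The helper names and the four example documents are hypothetical; only the topic labels come from the lists above.

```python
from collections import Counter

def dominant_topic(doc_topic_probs, labels):
    """Label of the most probable topic for one document."""
    best = max(range(len(doc_topic_probs)), key=doc_topic_probs.__getitem__)
    return labels[best]

def count_flows(source_labels, target_labels):
    """Count documents for each (source label, target label) pair."""
    return Counter(zip(source_labels, target_labels))

# Hypothetical assignments for four documents (illustration only).
five_topic = ["School Governance/Pensions", "School Governance/Pensions",
              "Taxes", "Licensing"]
ten_topic = ["Pensions", "School Governance", "Budgeting", "Budgeting"]

flows = count_flows(five_topic, ten_topic)
for (source, target), n in flows.most_common():
    print(f"{source} -> {target}: {n}")

print(dominant_topic([0.1, 0.7, 0.2], ["Licensing", "Taxes", "Budgeting"]))
```

Each (source, target, count) triple is one band in the diagram; a target fed by many different sources, as “Budgeting” is in this toy example, is the signature of a catch-all topic.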

With LDA and some easy-to-generate visualizations, you can readily explore prevalent topics in thousands of documents — without having to read all of them!

Sample code
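
The sketch below is not the code used to produce the models discussed above. It is a minimal, standard-library illustration of how LDA can be fit with a collapsed Gibbs sampler, so that the quantities behind the visualization (word probabilities per topic and topic proportions per document) are easier to picture. The function name, the hyperparameter values, and the four toy documents are all assumptions made for illustration; for real work an established library implementation would normally be a better choice.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # all tokens in topic k
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given every other token's assignment.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of P(word | topic) and P(topic | document).
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta

# Hypothetical toy corpus (illustration only).
docs = [
    "child welfare foster care child".split(),
    "foster child care assistance".split(),
    "tax revenue income tax".split(),
    "income tax credit revenue".split(),
]
phi, theta = fit_lda(docs, num_topics=2)
for k, dist in enumerate(phi):
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

The topics come back unlabeled, as numbered word distributions; reading the top words and choosing a label is still up to the analyst.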