Chapter 2 Topic Modeling
Topic models are generative probabilistic models that represent each topic as a probability distribution over words and each document as a probability distribution over topics.
Topic models such as Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM) treat the documents in a corpus as “bags of words” and identify groups of words that tend to co-occur. Those groups are the topics, formally conceptualized as probability distributions over the vocabulary. LDA and STM are generative models of word counts: they model a process that generates each document as a mixture of topics, where each topic is itself a distribution over words. Think of a document as the product of an algorithm that selects each word in two stages: 1) sample a topic from the document’s topic distribution, then 2) sample a word from that topic’s word distribution. The estimation task is to infer the parameters of those distributions from the observed word counts. In a way, topic models do the opposite of what you might expect. They do not estimate the probability that document x is about topic y. Rather, they estimate the contribution of every one of the Y topics to document x.
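The two-stage word-selection process described above can be sketched directly. The following is a minimal, illustrative simulation; the vocabulary, topic distributions, and topic proportions are all made up for demonstration and are not estimates from any model:

```r
# Two made-up topics, each a probability distribution over a tiny vocabulary
vocab <- c("room", "bed", "staff", "friendly", "breakfast", "clean")
topics <- rbind(
  service = c(.05, .05, .40, .35, .10, .05),
  rooms   = c(.35, .25, .05, .05, .05, .25)
)

# This document's topic proportions (its prevalence): mostly "service"
theta <- c(service = .7, rooms = .3)

set.seed(42)
generate_word <- function() {
  k <- sample(rownames(topics), 1, prob = theta)  # 1) sample a topic
  sample(vocab, 1, prob = topics[k, ])            # 2) sample a word from it
}
replicate(10, generate_word())
```

Running the simulation many times produces a bag of words dominated by service vocabulary, which is exactly the direction topic modeling runs in reverse: from observed words back to `theta` and `topics`.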
This leads to two frameworks for thinking about topics. Prevalence is the proportion of a document generated by each topic. Content is the probability distribution of words within a topic. LDA and STM differ chiefly in how they handle these two quantities: STM can model prevalence and content as functions of document-level covariates, while LDA cannot. LDA is implemented in the topicmodels package and STM in the stm package. Whichever you use, start with the bag of words that you created in Chapter 1. This chapter continues from there and follows the ideas from Nagelkerke (2020), Meaney (2022), and the stm package vignette.
library(tidyverse)
library(tidymodels)
library(topicmodels)
library(tidytext)
library(stm)
library(scales)
library(glue)
library(httr2)
library(jsonlite)
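The three outputs below appear to come from calling `glimpse()` on the prepared review data, the unigram tibble, and the bigram tibble from Chapter 1. A sketch of the loading step; the file names here are hypothetical stand-ins, not taken from the source, so substitute whatever you saved at the end of Chapter 1:

```r
# Hypothetical file names -- use the objects you saved in Chapter 1
hotel_reviews  <- readRDS("hotel_reviews.rds")
hotel_unigrams <- readRDS("hotel_unigrams.rds")
hotel_bigrams  <- readRDS("hotel_bigrams.rds")

glimpse(hotel_reviews)
glimpse(hotel_unigrams)
glimpse(hotel_bigrams)
```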
## Rows: 1,448
## Columns: 9
## $ review_id <int> 14478, 24627, 25306, 10904, 21306, 605, 14923, 2264, 99…
## $ hotel <ord> Savoy, Ridgemount, Corinthia, Savoy, Other, Rembrandt, …
## $ rating <fct> 5, 5, 5, 5, 5, 1-2, 1-2, 1-2, 5, 4, 4, 1-2, 5, 5, 5, 5,…
## $ review <chr> "Love Love Love The Savoy If you are looking for a luxe…
## $ reviewer_loc <fct> United Kingdom, Other, Other, Other, Other, United King…
## $ raw_chrcnt <int> 401, 406, 1297, 213, 1057, 395, 1764, 527, 1060, 395, 1…
## $ raw_wordcnt <int> 79, 68, 242, 33, 192, 84, 348, 96, 197, 74, 202, 236, 1…
## $ prepped_review <chr> "love love love savoy luxe stuffy feel millionaire hang…
## $ prepped_wrdcnt <int> 23, 28, 90, 16, 66, 19, 106, 41, 79, 30, 58, 85, 44, 26…
## Rows: 69,129
## Columns: 2
## $ review_id <int> 14478, 14478, 14478, 14478, 14478, 14478, 14478, 14478, 1447…
## $ word <chr> "love", "love", "love", "savoy", "luxe", "stuffy", "feel", "…
## Rows: 21,354
## Columns: 2
## $ review_id <int> 14478, 14478, 14478, 14478, 14478, 14478, 24627, 24627, 2462…
## $ bigram <chr> "love love", "love love", "classy decor", "decor wonderful",…
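Before fitting either model, the tidy bag of words has to be cast into the matrix format each package expects: topicmodels wants a `DocumentTermMatrix`, while stm accepts a sparse term-count matrix. A sketch, assuming the unigram tibble above is named `hotel_unigrams` (an assumed name) and choosing `k = 6` topics purely for illustration:

```r
# Count word occurrences per review
word_counts <- hotel_unigrams |>
  count(review_id, word)

# topicmodels expects a DocumentTermMatrix
reviews_dtm <- word_counts |>
  cast_dtm(document = review_id, term = word, value = n)

# stm accepts a sparse document-term matrix
reviews_sparse <- word_counts |>
  cast_sparse(review_id, word, n)

# Fit each model; k/K is a modeling choice, not a given
lda_fit <- LDA(reviews_dtm, k = 6, control = list(seed = 42))
stm_fit <- stm(reviews_sparse, K = 6, verbose = FALSE)
```

Both `cast_dtm()` and `cast_sparse()` come from tidytext, which keeps the workflow in the same tidy pipeline used to build the bag of words in Chapter 1.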