Section 4 of my Battle of the Bands text mining project compares recurring topics in songs by the writers from Queen, Rush, and AC/DC. Topic models identify dominant themes in text and are useful tools for summarizing large corpora. The topic model weights are useful inputs into a cluster analysis that finds similar songs. Fan of Freddie Mercury’s The Fairy Feller’s Master-Stroke? Try Neil Peart’s Anthem - their lyrics clustered together!

Background

Topic modeling searches for patterns of co-occurring words in a corpus of documents. Most topic models represent documents as the output of a probabilistic data-generating mechanism, and the model optimizes the distribution parameters according to some criterion. The data-generating mechanism is a pair of probability distributions.

The first probability distribution defines topics as weighted probabilities of terms from the vocabulary vector. The second probability distribution defines documents as weighted probabilities of topics. You might interpret the topic weights as the probability that document x is about topic y, as the degree of association between document x and topic y, or as how much topic y contributed to document x.
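
To make the pair of distributions concrete, here is a toy sketch (my own example: 3 topics, a 5-word vocabulary, and 2 documents). Each row of beta is a topic, i.e., a probability distribution over the vocabulary; each row of gamma is a document, i.e., a probability distribution over topics.

vocab <- c("love", "road", "fire", "dream", "night")

beta <- rbind(
  topic_1 = c(.50, .10, .10, .20, .10),
  topic_2 = c(.05, .40, .35, .05, .15),
  topic_3 = c(.10, .10, .05, .45, .30)
)
colnames(beta) <- vocab

gamma <- rbind(doc_1 = c(.70, .20, .10), doc_2 = c(.05, .15, .80))

# Every row of both matrices sums to 1.
rowSums(beta)
rowSums(gamma)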

STM differs from other topic models in that you can fit the model with controlling variables so that a) the first probability distribution (sometimes called the beta matrix) is a function of the controlling variables (a topical content model) and/or b) the second probability distribution (sometimes called the gamma matrix) is a function of the controlling variables (a topical prevalence model). If you think how a topic is discussed depends on the metadata features, control for them in a topical content model. E.g., middle-school children will discuss a topic differently from doctoral candidates. If you think what topics are expressed depends on the metadata features, control for them in a topical prevalence model. E.g., a survey with an open-text comment accompanying a Likert rating will probably focus on different topics for each rating level.
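
In stm syntax these are the prevalence and content arguments. A minimal sketch of the two calls, using the poliblog5k example data shipped with the stm package (rating is a document-level covariate there; K and max.em.its are my choices, kept small just to make the sketch quick):

library(stm)

# Topical prevalence: *which* topics are expressed depends on the covariate.
fit_prev <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                prevalence = ~ rating, data = poliblog5k.meta,
                max.em.its = 10, verbose = FALSE)

# Topical content: *how* a topic is discussed depends on the covariate.
fit_cont <- stm(poliblog5k.docs, poliblog5k.voc, K = 5,
                content = ~ rating, data = poliblog5k.meta,
                max.em.its = 10, verbose = FALSE)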

A topical prevalence model is my approach here: I assume there is overlap among the song writers, but that some topics are more strongly associated with some writers than others. I also wonder whether I can use the topic weights to identify similar songs.

My lyrics data frame consists of 466 songs and 16 columns. After the first eight descriptive attribute columns are eight columns of features engineered in the text complexity section.

glimpse(lyrics)
## Rows: 466
## Columns: 16
## $ song_id            <int> 118582, 308581, 308216, 307643, 307531, 73849, 3075~
## $ band               <chr> "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC~
## $ album              <chr> "Dirty Deeds Done Dirt Cheap", "Stiff Upper Lip", "~
## $ song               <chr> "Ain't No Fun (Waiting 'Round to Be a Millionaire)"~
## $ writer             <fct> AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC~
## $ released           <int> 1976, 2000, 2008, 1990, 1975, 1980, 1985, 1977, 198~
## $ song_url           <chr> "https://genius.com/Ac-dc-aint-no-fun-waiting-round~
## $ lyrics             <chr> "The following is a true story Only the names have ~
## $ n_lines            <int> 78, 49, 37, 88, 42, 55, 36, 47, 40, 39, 47, 42, 46,~
## $ n_words            <int> 509, 218, 196, 326, 239, 262, 223, 219, 187, 168, 1~
## $ words_per_line     <dbl> 6.697368, 4.448980, 5.297297, 4.724638, 5.690476, 5~
## $ syllables_per_word <dbl> 1.176817, 1.036697, 1.260204, 1.128834, 1.150628, 1~
## $ flesch             <dbl> 100.47843, 114.61470, 94.84498, 106.54011, 103.7160~
## $ flesch_kincaid     <dbl> 0.90841769, -1.62187044, 1.34635411, -0.42714591, 0~
## $ dale_chall         <dbl> 42.02125, 56.13663, 53.55915, 58.40871, 57.68863, 5~
## $ ttr                <dbl> 0.3339921, 0.3240741, 0.4081633, 0.1907692, 0.27196~

N-Grams

A good first approach to analyzing lyrical content is to construct unigram and bigram counts. The tidytext package provides an easy framework for tokenizing text into unigrams, pulling out connector words and other common “stop” words, and then taking summary statistics (usually just counts). Tidytext tokenizes into bigrams too, but bigrams are less insightful when one of the words is a stop word. One option is to tokenize into unigrams, remove stop words, reassemble into a stop-free document, then tokenize into bigrams.

lyrics_tidy <- lyrics %>%
  
  # remove special chars
  mutate(lyrics_tidy = str_remove_all(lyrics, "[[:punct:]]")) %>%
  
  # create unigrams
  unnest_tokens(output = "word", input = lyrics_tidy) %>%
  
  # no misspellings here, so I'm skipping this step
  # left_join(
  #   fuzzyjoin::misspellings %>% distinct(misspelling, .keep_all = TRUE),
  #   by = c("word" = "misspelling")
  # ) %>%
  # mutate(word = coalesce(correct, word)) %>%
  # select(-correct) %>%

  # lemmatize words 
  mutate(word = textstem::lemmatize_words(word, dictionary = lexicon::hash_lemmas)) %>%
  
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  
  # reconstruct the lyrics 
  nest(word_list = word) %>%
  mutate(lyrics_tidy = map_chr(word_list, ~ unlist(.) %>% paste(., collapse = " "))) %>%
  select(-word_list)

lyrics_tidy has a new column, also named lyrics_tidy, with stop words removed and words lemmatized. Here is what Freddie Mercury’s Bohemian Rhapsody looks like before and after this processing.

lyrics_tidy %>%
  filter(song == "Bohemian Rhapsody") %>%
  pivot_longer(cols = c(lyrics, lyrics_tidy)) %>%
  select(song, value) %>%
  head() %>%
  flextable::as_grouped_data("song") %>% 
  flextable::flextable() %>%
  flextable::autofit() %>%
  flextable::theme_vanilla() %>%
  flextable::bg(i = ~ !is.na(song), bg = "gray80") %>%
  flextable::border(i = c(2), border.bottom = officer::fp_border()) %>%
  flextable::border(j = 2, border.right = officer::fp_border())

Unigrams

Unnesting tokens into unigrams with counts gives the following “top 5” words for each writer.

tidy_top5_plot <- function(x, title_text) {
  x %>%
    group_by(band, writer) %>%
    count(token) %>%
    mutate(token_pct = n / sum(n)) %>%
    slice_max(order_by = token_pct, n = 5, with_ties = FALSE) %>%
    mutate(token = reorder_within(token, token_pct, writer)) %>%
    ggplot(aes(x = token, y = token_pct, fill = band)) +
    geom_col() +
    scale_fill_manual(values = band_palette) +
    scale_x_reordered() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    facet_wrap(facets = vars(writer), scales = "free_y") +
    coord_flip() +
    theme_light() +
    theme(legend.position = "none") +
    labs(x = NULL, y = "frequency", title = title_text)
}

# This is a first attempt at a word count. A better one comes later.
lyrics_word <- lyrics_tidy %>%
  unnest_tokens(output = "token", input = lyrics_tidy, token = "words")

lyrics_word %>% tidy_top5_plot("Top 5 Words (a first look).")

“I’m” is a top-5 word for every writer except Neil Peart. Peart rarely wrote explicitly about himself. There are some gibberish words like “ooh” that don’t add much insight. I’ll remove them and try again.

custom_stop_words <- data.frame(token = c("ooh", "whoa"))

# This is the final word count
lyrics_word_2 <- lyrics_word %>%
  anti_join(custom_stop_words, by = "token")

lyrics_word_2 %>% tidy_top5_plot("Top 5 Words (improved).")

“Love” is another prevalent word, appearing in the top-5 for every writer except Roger Taylor and Queen. AC/DC was distinctive in its references to rock ’n roll. Peart had relatively low word frequencies for his top-5, suggesting a wider range of words and themes in his songs.

Bigrams

Bigrams are easier to interpret, but variation in phrasing can mask shared ideas. Here are the bigrams pulled from the distilled lyrics_tidy column.

lyrics_bigram <- lyrics_tidy %>%
  unnest_tokens(output = "token", input = lyrics_tidy, token = "ngrams", n = 2) %>%
  filter(!is.na(token))

lyrics_bigram %>% tidy_top5_plot("Top 5 Bigrams")

AC/DC is all about rock ’n roll. Brian May’s songs It’s Late, Dancer, and Sweet Lady are evident in his list. The same is true for Deacon, Queen, and Neil Peart. Songs with repeating phrases appear in the bigrams - not very interesting.

Modeling

I will fit a structural topic model (STM) following the procedure from the stm vignette.

Process and Prepare

Process the data first. Some of this overlaps with the processing I’ve just completed.

processed <- stm::textProcessor(
  lyrics_tidy$lyrics_tidy,
  metadata = lyrics_tidy,
  stem = FALSE,
  customstopwords = custom_stop_words$token
)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Creating Output...

textProcessor() produces a list object with three main components, inspected in the sketch below:

  • vocab, the vocabulary vector. The vector has 4,844 words.
  • documents, a list of matrices, one per document. Each matrix has two rows of integers: the first row holds indices into the vocabulary vector; the second holds the associated word counts. This is a concise representation of a document-term matrix. The processing step sometimes removes documents that are empty after the chaff is removed, but I still have all 466 rows in my documents list.
  • meta, a metadata data frame with one row per document (466 rows) containing all the song features I’ve collected in prior sections.
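
A quick structural check of these components (a sketch run against the processed object created above):

length(processed$vocab)          # 4,844 terms
length(processed$documents)      # all 466 songs survived processing
processed$documents[[1]][, 1:5]  # row 1: vocab indices; row 2: word counts
nrow(processed$meta)             # 466 rows of song metadata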

Next, “prepare” the corpus by removing infrequently used words. You can leave all words in, but it improves performance to cull words with such low frequencies that they are unlikely to contribute to topics. The following diagnostic plot helps.

stm::plotRemoved(
  processed$documents, 
  lower.thresh = seq(1, length(processed$documents), by = 10)
)

If you remove words appearing in fewer than half the songs, all songs will be empty. 1% is a conservative threshold: only words appearing in at least 1% of songs in the corpus will be kept. prepDocuments() removes words with frequencies below the defined threshold, then updates the vocabulary, documents, and metadata.

prepared <- stm::prepDocuments(
  processed$documents,
  processed$vocab,
  processed$meta,
  lower.thresh = length(processed$documents) * .01
)
## Removing 3910 of 4844 terms (5910 of 20876 tokens) due to frequency 
## Your corpus now has 466 documents, 934 terms and 14966 tokens.

I didn’t lose any songs - I’m still at 466 songs. The vocabulary vector shrank from 4,844 words to 934 words. Here are the first 100, just to get a sense.

prepared$vocab[1:100]
##   [1] "aah"       "ache"      "act"       "action"    "adventure" "afraid"   
##   [7] "afternoon" "age"       "ago"       "agree"     "ahead"     "ahh"      
##  [13] "aim"       "aint"      "air"       "alive"     "alright"   "ambition" 
##  [19] "ancient"   "angel"     "anger"     "angry"     "animal"    "answer"   
##  [25] "anymore"   "applause"  "arm"       "ass"       "attack"    "attention"
##  [31] "babe"      "baby"      "bad"       "balance"   "ball"      "band"     
##  [37] "bang"      "bar"       "bare"      "battle"    "beam"      "bear"     
##  [43] "beast"     "beat"      "beatin"    "beautiful" "beauty"    "bed"      
##  [49] "bedroom"   "begin"     "belief"    "bell"      "belong"    "bend"     
##  [55] "beneath"   "bill"      "bind"      "bird"      "bitch"     "bite"     
##  [61] "bitter"    "black"     "blame"     "blast"     "bleed"     "blind"    
##  [67] "blood"     "bloody"    "blow"      "blue"      "body"      "bodys"    
##  [73] "boil"      "bomb"      "bone"      "book"      "booze"     "boss"     
##  [79] "bottle"    "bottom"    "bout"      "bow"       "boy"       "brain"    
##  [85] "brand"     "break"     "breath"    "breathe"   "bridge"    "bright"   
##  [91] "bring"     "brother"   "build"     "bullet"    "burn"      "burnin"   
##  [97] "business"  "busy"      "buy"       "cadillac"

Fit the Model

The stm package allows you to either specify the number of topics (K) or let it choose an optimal number by setting K = 0. I’ll let stm choose. The resulting probability distribution of topic words will be a K x 934 matrix, sometimes called the beta matrix. The probability distribution of song topics will be a 466 x K matrix, sometimes called the gamma matrix (theta in the stm package). I verify these dimensions in a quick sketch after the fit below.

I expect lyrics to be correlated with the writer, so I will fit a prevalence model with writer as a covariate.

set.seed(1234)

fit_prevalence <- stm::stm(
  documents = prepared$documents,
  vocab = prepared$vocab,
  K = 0,
  prevalence = ~ writer,
  data = prepared$meta,
  init.type = "Spectral",
  verbose = FALSE
)
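
Before reading the summary, a quick sketch confirms the dimensions described above. The fitted stm object stores the gamma matrix directly as theta, and the beta matrix on the log scale in beta$logbeta, a list with one matrix per content-covariate level (a single matrix here, since this is a prevalence-only model):

dim(fit_prevalence$theta)    # 466 x K: the gamma matrix (song topics)

beta_matrix <- exp(fit_prevalence$beta$logbeta[[1]])
dim(beta_matrix)             # K x 934: the beta matrix (topic words)

# Each topic's word probabilities sum to 1.
range(rowSums(beta_matrix))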

summary(fit_prevalence)
## A topic model with 45 topics, 466 documents and a 934 word dictionary.
## Topic 1 Top Words:
##       Highest Prob: learn, feel, miss, star, adventure, live, boy 
##       FREX: miss, adventure, learn, star, promise, cage, feelin 
##       Lift: adventure, miss, reel, cage, promise, feelin, trade 
##       Score: adventure, miss, learn, cage, star, bare, promise 
## Topic 2 Top Words:
##       Highest Prob: world, hes, half, boy, afraid, lose, hold 
##       FREX: half, hes, world, afraid, begin, queen, boy 
##       Lift: afraid, half, hes, crack, noble, world, weapon 
##       Score: afraid, hes, half, world, weapon, queen, crack 
## Topic 3 Top Words:
##       Highest Prob: day, bear, heaven, wait, life, stand, word 
##       FREX: day, bear, single, key, cloud, heaven, instinct 
##       Lift: agree, instinct, stormy, cloud, page, single, key 
##       Score: agree, day, heaven, instinct, bear, shore, beautiful 
## Topic 4 Top Words:
##       Highest Prob: bring, bad, heaven, measure, choose, life, faith 
##       FREX: bring, measure, choose, heaven, faith, belief, tick 
##       Lift: belief, measure, tick, market, bring, choose, faith 
##       Score: measure, belief, heaven, bring, choose, faith, bad 
## Topic 5 Top Words:
##       Highest Prob: money, life, real, honey, taste, waste, live 
##       FREX: money, real, taste, bitch, honey, waste, devil 
##       Lift: bitch, money, real, cash, relation, taste, mighty 
##       Score: money, bitch, real, honey, waste, taste, devil 
## Topic 6 Top Words:
##       Highest Prob: blood, hot, white, feel, business, bad, ive 
##       FREX: blood, hot, white, business, animal, guy, disguise 
##       Lift: blood, white, hot, business, animal, disguise, haha 
##       Score: blood, hot, white, business, animal, dog, disguise 
## Topic 7 Top Words:
##       Highest Prob: body, dancer, talk, dance, chance, baby, wheel 
##       FREX: body, dancer, dance, fate, beatin, talk, chance 
##       Lift: body, dancer, beatin, prayer, mmm, twist, iron 
##       Score: body, dancer, talk, beatin, dance, fate, prayer 
## Topic 8 Top Words:
##       Highest Prob: fight, happen, wild, inside, child, land, tie 
##       FREX: happen, fight, wild, tie, child, reputation, card 
##       Lift: boil, happen, card, reputation, tie, mouth, fight 
##       Score: happen, wild, boil, child, fight, tie, reputation 
## Topic 9 Top Words:
##       Highest Prob: tough, light, demand, eye, music, time, home 
##       FREX: demand, tough, music, form, flow, contact, magic 
##       Lift: contact, spiral, form, demand, gift, tough, flow 
##       Score: contact, demand, tough, magic, form, flow, music 
## Topic 10 Top Words:
##       Highest Prob: talk, danger, red, cut, car, stranger, wheel 
##       FREX: danger, talk, stranger, car, cut, red, difference 
##       Lift: danger, stranger, difference, talk, car, shoulder, tip 
##       Score: danger, talk, stranger, difference, red, cut, car 
## Topic 11 Top Words:
##       Highest Prob: dead, shake, leg, rise, power, action, day 
##       FREX: dead, leg, rise, action, power, shake, wake 
##       Lift: dead, leg, action, bodys, pump, wake, heartbeat 
##       Score: dead, leg, shake, action, rise, power, wake 
## Topic 12 Top Words:
##       Highest Prob: deep, breath, soul, heart, eye, tear, reach 
##       FREX: deep, breath, soul, reach, theyll, mountain, tear 
##       Lift: deep, breath, flower, theyll, sigh, misty, flood 
##       Score: deep, breath, reach, theyll, soul, mountain, emotion 
## Topic 13 Top Words:
##       Highest Prob: dream, play, game, fly, vision, world, hes 
##       FREX: dream, vision, play, game, mission, answer, wing 
##       Lift: field, vision, dream, soar, mission, madness, mysterious 
##       Score: dream, vision, field, play, hes, mission, game 
## Topic 14 Top Words:
##       Highest Prob: fool, live, late, god, break, free, hard 
##       FREX: fool, late, god, rule, free, live, hard 
##       Lift: fool, lesson, rule, late, kid, sorrow, company 
##       Score: fool, late, god, kid, lesson, rule, hard 
## Topic 15 Top Words:
##       Highest Prob: light, city, sun, gold, ahh, lose, everyday 
##       FREX: gold, ahh, everyday, mystic, city, sun, freeze 
##       Lift: gold, ahh, mystic, everyday, distant, cast, unknown 
##       Score: gold, ahh, city, everyday, mystic, sun, glory 
## Topic 16 Top Words:
##       Highest Prob: love, friend, life, hungry, home, youre, wanna 
##       FREX: love, friend, hungry, home, heel, til, care 
##       Lift: heel, love, hungry, friend, til, size, letter 
##       Score: love, heel, friend, hungry, wanna, home, letter 
## Topic 17 Top Words:
##       Highest Prob: hell, thunder, highway, bell, easy, aint, leave 
##       FREX: hell, highway, bell, thunder, easy, wont, home 
##       Lift: highway, bell, hell, thunder, pour, easy, temperature 
##       Score: highway, hell, bell, thunder, easy, nobodys, aint 
## Topic 18 Top Words:
##       Highest Prob: drive, blue, fire, hole, red, road, load 
##       FREX: blue, drive, hole, load, road, fire, red 
##       Lift: hole, blue, load, drive, cadillac, weapon, strike 
##       Score: hole, blue, drive, fire, load, edge, red 
## Topic 19 Top Words:
##       Highest Prob: fire, house, flame, finger, demon, burn, burnin 
##       FREX: house, fire, finger, flame, burnin, demon, wire 
##       Lift: house, burnin, fire, finger, demon, bedroom, flame 
##       Score: house, fire, demon, flame, burnin, finger, wire 
## Topic 20 Top Words:
##       Highest Prob: shake, brain, girl, edge, sit, heart, razor 
##       FREX: brain, shake, edge, sit, razor, invisible, girl 
##       Lift: invisible, razor, brain, edge, sit, shake, joy 
##       Score: invisible, shake, brain, edge, razor, sit, girl 
## Topic 21 Top Words:
##       Highest Prob: dog, bone, givin, war, fortune, eat, son 
##       FREX: dog, bone, givin, fortune, eat, itch, war 
##       Lift: itch, dog, bone, givin, fortune, eat, son 
##       Score: dog, bone, givin, itch, war, fortune, son 
## Topic 22 Top Words:
##       Highest Prob: night, light, knife, fly, fight, change, dark 
##       FREX: knife, night, isnt, light, pretend, color, ship 
##       Lift: knife, isnt, night, thin, beast, color, pretend 
##       Score: knife, night, isnt, pretend, goodbye, ship, thin 
## Topic 23 Top Words:
##       Highest Prob: shoot, bite, lip, gun, bullet, dust, thrill 
##       FREX: lip, bullet, bite, thrill, dust, shoot, gun 
##       Lift: lip, thrill, bullet, shoot, doctor, dust, plenty 
##       Score: lip, shoot, bullet, bite, thrill, dust, doctor 
## Topic 24 Top Words:
##       Highest Prob: hear, world, listen, day, feel, people, cry 
##       FREX: hear, listen, mercury, universe, earth, fear, wise 
##       Lift: mercury, belong, universe, wise, hear, listen, judge 
##       Score: mercury, wise, universe, hear, listen, king, fear 
## Topic 25 Top Words:
##       Highest Prob: feel, pain, stick, pleasure, swallow, neck, spit 
##       FREX: pleasure, neck, pain, swallow, spit, stick, close 
##       Lift: neck, pleasure, spit, swallow, rainy, pain, sunshine 
##       Score: neck, swallow, pleasure, stick, spit, pain, rainy 
## Topic 26 Top Words:
##       Highest Prob: aint, gonna, fun, round, machine, drink, radio 
##       FREX: fun, machine, radio, rockin, aint, noise, drink 
##       Lift: noise, fun, rollin, rockin, radio, fat, machine 
##       Score: fun, noise, radio, machine, aint, rockin, drink 
## Topic 27 Top Words:
##       Highest Prob: baby, feel, city, move, power, safe, stick 
##       FREX: city, safe, move, oil, baby, cover, stick 
##       Lift: oil, safe, cover, movin, spine, city, move 
##       Score: oil, safe, baby, city, cover, stick, move 
## Topic 28 Top Words:
##       Highest Prob: hand, heart, save, cold, paper, touch, cut 
##       FREX: save, paper, hand, cold, fist, heart, lord 
##       Lift: paper, save, fist, gentle, cold, hand, lord 
##       Score: paper, save, cold, fist, heart, hand, gentle 
## Topic 29 Top Words:
##       Highest Prob: time, leave, war, peace, gotta, whiskey, lie 
##       FREX: peace, war, leave, time, whiskey, hurry, double 
##       Lift: peace, hurry, whiskey, war, double, leave, ticket 
##       Score: peace, war, time, leave, whiskey, hurry, double 
## Topic 30 Top Words:
##       Highest Prob: hey, god, magic, ride, leave, feel, prize 
##       FREX: hey, magic, god, prize, wild, attack, ride 
##       Lift: prize, hey, attack, magic, kingdom, god, fence 
##       Score: hey, prize, magic, god, ride, wild, wind 
## Topic 31 Top Words:
##       Highest Prob: lose, love, funny, head, bomb, speed, wheel 
##       FREX: funny, bomb, lose, speed, pay, remain, count 
##       Lift: remain, funny, bomb, count, speed, forgive, pay 
##       Score: remain, funny, bomb, speed, waste, lose, wheel 
## Topic 32 Top Words:
##       Highest Prob: rock, roll, gonna, hard, play, night, trust 
##       FREX: rock, roll, trust, guitar, hard, top, loud 
##       Lift: rock, roll, guitar, drum, jive, whip, cmon 
##       Score: rock, roll, gonna, hard, guitar, trust, cmon 
## Topic 33 Top Words:
##       Highest Prob: youre, walk, screw, kick, aint, headlong, wanna 
##       FREX: youre, screw, walk, headlong, kick, kiss, wall 
##       Lift: screw, youre, headlong, kiss, kick, pause, walk 
##       Score: screw, youre, headlong, kick, walk, kiss, aint 
## Topic 34 Top Words:
##       Highest Prob: shes, ball, play, rhythm, girl, stuff, woman 
##       FREX: shes, ball, rhythm, stuff, shed, play, weve 
##       Lift: shes, ball, rhythm, shed, stuff, tune, loser 
##       Score: shes, ball, rhythm, stuff, shed, girl, gimme 
## Topic 35 Top Words:
##       Highest Prob: woman, black, didnt, ice, dare, kick, evil 
##       FREX: ice, woman, didnt, dare, black, tooth, evil 
##       Lift: shuffle, ice, dare, didnt, march, woman, black 
##       Score: woman, shuffle, ice, black, dare, didnt, evil 
## Topic 36 Top Words:
##       Highest Prob: ride, catch, gonna, girl, spell, play, black 
##       FREX: catch, ride, spell, playin, girl, black, tale 
##       Lift: spell, ride, playin, catch, tale, society, empty 
##       Score: spell, ride, catch, playin, girl, gonna, tale 
## Topic 37 Top Words:
##       Highest Prob: stop, feel, gimme, gonna, time, road, day 
##       FREX: stop, gimme, road, party, shadow, huh, travel 
##       Lift: stop, gimme, huh, hundred, road, travel, sugar 
##       Score: stop, gimme, road, gonna, huh, party, shadow 
## Topic 38 Top Words:
##       Highest Prob: boy, bad, lie, mirror, hero, time, survive 
##       FREX: hero, mirror, survive, wouldnt, straight, sink, paint 
##       Lift: straight, wouldnt, survive, hero, sink, mirror, paint 
##       Score: straight, paint, mirror, hero, wouldnt, nobodys, survive 
## Topic 39 Top Words:
##       Highest Prob: baby, gonna, hey, suicide, blow, youre, gotta 
##       FREX: suicide, baby, worth, window, blow, pass, gonna 
##       Lift: suicide, worth, grab, window, join, dragon, stir 
##       Score: suicide, baby, worth, hey, gonna, window, pass 
## Topic 40 Top Words:
##       Highest Prob: ill, time, ready, stand, line, hold, hard 
##       FREX: ill, line, stand, ready, mist, time, sunday 
##       Lift: sunday, mist, ill, damn, line, front, midnight 
##       Score: sunday, ill, ready, mist, time, line, paint 
## Topic 41 Top Words:
##       Highest Prob: sweet, call, miracle, lady, world, mother, child 
##       FREX: sweet, miracle, call, sail, lady, mother, precious 
##       Lift: sweet, miracle, sail, sister, hum, weak, call 
##       Score: sweet, miracle, call, lady, sail, child, mother 
## Topic 42 Top Words:
##       Highest Prob: sky, fly, system, round, moment, burn, grind 
##       FREX: system, sky, fee, trap, spark, grind, moment 
##       Lift: system, trap, fee, engine, spark, grind, blast 
##       Score: system, trap, fee, sky, spark, moment, grind 
## Topic 43 Top Words:
##       Highest Prob: wind, touch, time, blow, wave, light, carry 
##       FREX: wind, test, wave, echo, shore, touch, carry 
##       Lift: test, gently, shore, echo, crash, west, wind 
##       Score: test, wind, shore, touch, echo, wave, endless 
## Topic 44 Top Words:
##       Highest Prob: theyre, dirty, crazy, track, run, cheap, train 
##       FREX: theyre, crazy, track, dirty, cheap, train, pick 
##       Lift: theyre, cheap, track, crazy, dirty, train, ten 
##       Score: theyre, dirty, cheap, crazy, track, train, gotta 
## Topic 45 Top Words:
##       Highest Prob: shoot, love, warn, life, honey, bite, bleed 
##       FREX: shoot, warn, love, bleed, jump, trip, bite 
##       Lift: warn, shoot, bleed, jump, trip, ecstasy, pump 
##       Score: warn, shoot, love, bleed, ecstasy, honey, trip

Interpretation

stm() produced a model with 45 topics. The model summary printed above (you can also print it with stm::labelTopics(fit_prevalence)) shows the top words based on four metrics: highest probability, FREX, lift, and score.

  • Highest Probability weights words by their frequency within the topic.
  • FREX weights words by both their frequency within the topic and how exclusive they are to the topic.
  • Lift divides a word’s weight in the topic by its frequency in other topics, giving higher weight to words that appear less frequently elsewhere (illustrated in the toy sketch below).
  • Score divides the log frequency of the word in the topic by the log frequency of the word in other topics.
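
A toy sketch (my own numbers: two topics, a three-word vocabulary) shows why lift promotes exclusive words. “love” is frequent in both topics, so its lift is modest everywhere; “cadillac” is rare overall but concentrated in topic 2, so its lift there is large.

beta <- rbind(
  topic_1 = c(love = .60, road = .35, cadillac = .05),
  topic_2 = c(love = .55, road = .05, cadillac = .40)
)

# Proxy for each word's overall corpus frequency.
word_freq <- colMeans(beta)

# Divide each topic's word weights by the words' overall frequencies.
sweep(beta, 2, word_freq, "/")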

Let’s look at Topic 2:

stm::labelTopics(fit_prevalence, topics = 2)
## Topic 2 Top Words:
##       Highest Prob: world, hes, half, boy, afraid, lose, hold 
##       FREX: half, hes, world, afraid, begin, queen, boy 
##       Lift: afraid, half, hes, crack, noble, world, weapon 
##       Score: afraid, hes, half, world, weapon, queen, crack

findThoughts() shows the documents that mapped most strongly to the topic. The top 3 mappings are below.

topic_thoughts <- stm::findThoughts(
  fit_prevalence, 
  n = 3, 
  texts = processed$meta$lyrics, 
  topics = 2,
  meta = processed$meta
)

processed$meta[pluck(topic_thoughts$index, 1), ] %>% select(writer, song, song_url)
##         writer           song                                      song_url
## 375 Neil Peart Half the World https://genius.com/Rush-half-the-world-lyrics
## 400 Neil Peart  New World Man  https://genius.com/Rush-new-world-man-lyrics
## 450 Neil Peart     The Weapon     https://genius.com/Rush-the-weapon-lyrics

Model Exploration

For reporting purposes, you might want to summarize each topic with a title. That’s not practical here with 45 topics. Instead, I’ll use the top-5 Highest Probability words.

topic_lbl <- stm::labelTopics(fit_prevalence, n = 5) %>% 
  pluck("prob") %>%
  as.data.frame() %>%
  mutate(
    topic_num = row_number(),
    topic_id = paste0("topic_", topic_num)
  ) %>%
  rowwise() %>%
  mutate(
    topic = paste(c(V1, V2, V3, V4, V5), collapse = ", "),
    topic_long = paste0(topic_num, ": ", topic)
  ) %>%
  select(topic_id, topic, topic_long)

topic_df <- fit_prevalence$theta %>% 
  as.data.frame() 
colnames(topic_df) <- topic_lbl$topic_id

lyrics_tidy_2 <- processed$meta %>%
  bind_cols(topic_df)

Data frame lyrics_tidy_2 has the 45 topic weights attached as columns with names topic_1, topic_2, …, topic_45. The topic descriptors are in data frame topic_lbl which I can use as a lookup after pivoting lyrics_tidy_2.

lyrics_tidy_2_long <- lyrics_tidy_2 %>% 
  pivot_longer(
    cols = starts_with("topic_"), 
    names_to = "topic_id", 
    values_to = "topic_weight"
  ) %>%
  inner_join(topic_lbl, by = "topic_id") %>%
  mutate(
    topic = factor(topic, levels = topic_lbl$topic),
    topic_long = factor(topic_long, levels = topic_lbl$topic_long)
  ) 

Each song is a mix of topics, so for each song the topic weights sum to 1.

Topics usually make a negligible contribution to most songs, and a substantial contribution to a few songs. Viewed from the other side, most songs are composed primarily of one or two topics.
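
Both claims are easy to sanity-check with a quick sketch against the data frames built above:

# Topic weights sum to 1 within each song.
lyrics_tidy_2 %>%
  select(starts_with("topic_")) %>%
  rowSums() %>%
  summary()

# Most songs have only one or two topics contributing at least 10%.
lyrics_tidy_2_long %>%
  group_by(song_id) %>%
  summarize(big_topics = sum(topic_weight >= .10), .groups = "drop") %>%
  count(big_topics)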

p <- lyrics_tidy_2_long %>%
  ggplot(aes(x = fct_rev(topic_long), y = topic_weight, color = writer,
             text = glue("song: {song} <br>",
                         "topic: {topic} <br>",
                         "weight: {scales::percent(topic_weight, accuracy = 1)}"))) +
  geom_point(alpha = .8) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_color_manual(values = writer_palette) +
  coord_flip() +
  theme_light() +
  labs(
    title = "Topics usually make up <10% of a song",
    subtitle = "Song Weights for each topic.",
    x = NULL, y = "Topic Weight")
ggplotly(p, tooltip = "text")

Here are the top topics for We Will Rock You.

lyrics_tidy_2_long %>%
  filter(song == "We Will Rock You") %>%
  arrange(desc(topic_weight)) %>%
  mutate(topic_weight = topic_weight * 100) %>%
  head() %>%
  select(topic_long, topic_weight) %>%
  flextable::flextable() %>%
  flextable::colformat_double(digits = 0, suffix = "%") %>%
  flextable::autofit() %>%
  flextable::set_caption("We Will Rock You topics")

A heat map of topic importance shows a higher topic concentration for Deacon, Taylor, and Queen, but that is likely due to their smaller corpus sizes. Neil Peart weighed heavily on topic 15 (light, city, sun, gold, ahh) while the other writers hardly used it at all. AC/DC relied on topic 32 (rock, roll, gonna, hard, play), and so did Roger Taylor.

topic_importance <- lyrics_tidy_2_long %>%
  group_by(writer, topic_long) %>%
  summarize(
    .groups = "drop",
    songs = n(),
    important = sum(topic_weight >= .25),
    pct_import = important / songs)

topic_importance %>%
  ggplot(aes(x = writer, y = topic_long, label = pct_import)) +
  geom_tile(aes(fill = pct_import)) +
  scale_fill_gradient(low = "#FFFFFF", high = "goldenrod") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 8)) +
  theme_light() +
  theme(legend.position = "top") +
  labs(title = "Topic Importance by Writer", fill = NULL, y = NULL, x = NULL)

Which writers were most similar? Topic weights for AC/DC were highly correlated with Roger Taylor’s, but negatively correlated with the other writers’, especially Neil Peart’s. Peart was not strongly correlated with anyone. The most similar writers were Brian May and Freddie Mercury. John Deacon sat in between, correlated less strongly with both Mercury and May.

topic_importance %>% 
  select(writer, topic_long, pct_import) %>%
  pivot_wider(names_from = writer, values_from = pct_import) %>%
  column_to_rownames(var = "topic_long") %>%
  cor() %>%
  corrplot::corrplot(type = "upper")

Similar Songs

A cluster analysis can use the topics to find similar songs. I’ll follow the steps from my unsupervised learning handbook chapter on K-medoids cluster analysis.

The first step is to calculate the distances between songs. I use the Gower distance.

lyrics_gower <- lyrics_tidy_2 %>% 
  select(-c(song_id:lyrics, lyrics_tidy)) %>%
  cluster::daisy(metric = "gower")

Let’s see the most similar and dissimilar pairs of songs according to their Gower distance.

x <- as.matrix(lyrics_gower)
bind_rows(
  lyrics_tidy_2[which(x == min(x[x != 0]), arr.ind = TRUE)[1, ], ],
  lyrics_tidy_2[which(x == max(x[x != 0]), arr.ind = TRUE)[1, ], ]
) %>%
  as.data.frame() %>%
  select(-c(lyrics_tidy, topic_1:topic_45)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(cols = -song) %>%
  pivot_wider(names_from = song) %>%
  flextable::flextable() %>%
  flextable::add_header_row(values = c("", "Similar", "Dissimilar"), colwidths = c(1, 2, 2)) %>%
  flextable::border(j = c(1, 3), border.right = officer::fp_border("gray80"), part = "all")

The K-means algorithm randomly assigns all observations to one of K clusters, then iteratively calculates the cluster centroids and reassigns observations to their nearest centroid. Centroids are the sets of mean values of each feature (hence the name “K-means”). The iterations continue until either the centroids stabilize or the iterations reach a set maximum (typically 50). The result is K clusters with the minimum total intra-cluster variation. K-medoids (PAM) follows the same outline, except the cluster centers are actual observations (medoids), which is why it can work directly from a distance matrix like the Gower distances above.
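
Here is a minimal sketch of that assign-then-recompute loop on toy data (base R, my own example), just to make the iteration concrete:

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 toy observations, 2 features
k <- 2
centroids <- x[sample(nrow(x), k), ]

for (i in 1:50) {
  # Distance from each centroid (rows 1:k) to each observation.
  d <- as.matrix(dist(rbind(centroids, x)))[1:k, -(1:k)]
  # Assign each observation to its nearest centroid.
  assignment <- apply(d, 2, which.min)
  # Recompute each centroid as the feature means of its cluster.
  new_centroids <- t(sapply(1:k, function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))
  if (all(abs(new_centroids - centroids) < 1e-8)) break  # converged
  centroids <- new_centroids
}

table(assignment)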

What value should K take? Construct a silhouette plot.

set.seed(1234)

pam_mdl <- data.frame(k = 2:100) %>%
  mutate(
    mdl = map(k, ~ cluster::pam(lyrics_gower, k = .x)),
    sil = map_dbl(mdl, ~ .$silinfo$avg.width)
  )

pam_mdl %>%
  ggplot(aes(x = k, y = sil)) +
  geom_point(size = 2) +
  geom_line() +
  geom_vline(aes(xintercept = 47), linetype = 2, size = 1, color = "goldenrod") +
  scale_x_continuous(breaks = seq(0, 100, by = 5)) +
  theme_light() +
  labs(title = "Silhouette plot max occurs at K = 47 clusters.", 
       subtitle = "K-Medoids within-cluster average silhouette width at candidate values of K.", 
       y = "")

Attach the results to the original table for visualization and summary statistics.

pam_mdl_final <- pam_mdl %>% filter(k == 47) %>% pluck("mdl", 1)

lyrics_tidy_3 <- lyrics_tidy_2 %>% 
  mutate(cluster = as.factor(pam_mdl_final$clustering))

If each cluster were an album, which songs would be on it? Use the table filters below to browse the clusters.

lyrics_tidy_3 %>%
  select(cluster, band, writer, song) %>%
  arrange(cluster, band, writer, song) %>%
  DT::datatable(
    filter = "top",
    options = list(pageLength = 20)
  )

Save Work

Save the lyrics with their topic weights and cluster assignments, plus the fitted model and the long-format topic data, for subsequent steps.

saveRDS(lyrics_tidy_3, "./4_lyrics.Rds")
saveRDS(fit_prevalence, "./4_model.Rds")
saveRDS(lyrics_tidy_2_long, "./4_lyrics_long.Rds")