Section 4 of my Battle of the Bands text mining project compares recurring topics in songs by the writers from Queen, Rush, and AC/DC. Topic models identify dominant themes in text and are useful tools for summarizing large corpora. The topic model weights also make useful inputs to a cluster analysis that finds similar songs. Fan of Freddie Mercury’s The Fairy Feller’s Master-Stroke? Try Neil Peart’s Anthem - their lyrics clustered together!
Topic modeling searches for patterns of co-occurring words in a corpus of documents. Most topic models represent documents as the output of a probabilistic data-generating mechanism, and the model optimizes the distribution parameters according to a fitting criterion. The data-generating mechanism is a pair of probability distributions.
The first probability distribution defines topics as weighted probabilities of terms from the vocabulary vector. The second probability distribution defines documents as weighted probabilities of topics. You might interpret the topic weights as the probability that document x is about topic y, as the degree of association between document x and topic y, or as how much topic y contributed to document x.
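To make the two-distribution mechanism concrete, here is a minimal generative sketch in R (toy sizes and invented terms; a real model estimates beta and gamma from the corpus rather than drawing them at random):

# beta: one row per topic, one column per term; each row is a probability vector
set.seed(42)
vocab <- c("rock", "roll", "love", "night", "fire")
n_topics <- 3
beta <- matrix(runif(n_topics * length(vocab)), nrow = n_topics)
beta <- beta / rowSums(beta)  # normalize each topic to sum to 1
# gamma: one document's weights over the three topics
gamma <- c(.7, .2, .1)
# generate one word of the document: draw a topic, then draw a term from it
topic <- sample(n_topics, 1, prob = gamma)
sample(vocab, 1, prob = beta[topic, ])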
STM differs from other topic models in that you can fit the model with controlling variables so that a) the first probability distribution (sometimes called the beta matrix) is a function of the controlling variables (a topical content model) and/or b) the second probability distribution (sometimes called the gamma matrix) is a function of the controlling variables (a topical prevalence model). If you think how a topic is discussed depends on the metadata features, control for them in a topical content model. E.g., middle-school children will discuss a topic differently from doctoral candidates. If you think what topics are expressed depends on the metadata features, control for them in a topical prevalence model. E.g., a survey with an open-text comment accompanying a Likert rating will probably focus on different topics at each rating level.
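The two options map directly to arguments of stm::stm(). A schematic, commented call (argument names follow the stm documentation; this exact call is illustrative and not run here - the real fit appears later in this section):

# prevalence = ~ writer  ->  *what* topics appear varies with writer (gamma)
# content    = ~ writer  ->  *how* topics are worded varies with writer (beta)
# fit <- stm::stm(documents, vocab, K,
#                 prevalence = ~ writer,
#                 content = ~ writer,
#                 data = meta)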
A topical prevalence model is my approach here: I assume there is overlap among the song writers, but that some topics are more strongly associated with some writers than others. I also wonder whether I can use topic models to identify similar songs.
My lyrics data frame consists of 466 songs and 16 columns: eight descriptive attribute columns followed by eight columns of features engineered in the text complexity section.
glimpse(lyrics)
## Rows: 466
## Columns: 16
## $ song_id <int> 118582, 308581, 308216, 307643, 307531, 73849, 3075~
## $ band <chr> "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC~
## $ album <chr> "Dirty Deeds Done Dirt Cheap", "Stiff Upper Lip", "~
## $ song <chr> "Ain't No Fun (Waiting 'Round to Be a Millionaire)"~
## $ writer <fct> AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC~
## $ released <int> 1976, 2000, 2008, 1990, 1975, 1980, 1985, 1977, 198~
## $ song_url <chr> "https://genius.com/Ac-dc-aint-no-fun-waiting-round~
## $ lyrics <chr> "The following is a true story Only the names have ~
## $ n_lines <int> 78, 49, 37, 88, 42, 55, 36, 47, 40, 39, 47, 42, 46,~
## $ n_words <int> 509, 218, 196, 326, 239, 262, 223, 219, 187, 168, 1~
## $ words_per_line <dbl> 6.697368, 4.448980, 5.297297, 4.724638, 5.690476, 5~
## $ syllables_per_word <dbl> 1.176817, 1.036697, 1.260204, 1.128834, 1.150628, 1~
## $ flesch <dbl> 100.47843, 114.61470, 94.84498, 106.54011, 103.7160~
## $ flesch_kincaid <dbl> 0.90841769, -1.62187044, 1.34635411, -0.42714591, 0~
## $ dale_chall <dbl> 42.02125, 56.13663, 53.55915, 58.40871, 57.68863, 5~
## $ ttr <dbl> 0.3339921, 0.3240741, 0.4081633, 0.1907692, 0.27196~
A good first approach to analyzing lyrical content is to construct unigram and bigram counts. The tidytext package provides an easy framework for tokenizing text into unigrams, pulling out connector words and other common “stop” words, and then taking summary statistics (usually just counts). tidytext tokenizes into bigrams too, but bigrams are less insightful if one of the words is a stop word. One option is to tokenize into unigrams, remove stop words, reassemble into a stop-free document, then tokenize into bigrams.
lyrics_tidy <- lyrics %>%
  # remove special chars
  mutate(lyrics_tidy = str_remove_all(lyrics, "[[:punct:]]")) %>%
  # create unigrams
  unnest_tokens(output = "word", input = lyrics_tidy) %>%
  # no misspellings here, so I'm skipping this step
  # left_join(
  #   fuzzyjoin::misspellings %>% distinct(misspelling, .keep_all = TRUE),
  #   by = c("word" = "misspelling")
  # ) %>%
  # mutate(word = coalesce(correct, word)) %>%
  # select(-correct) %>%
  # lemmatize words
  mutate(word = textstem::lemmatize_words(word, dictionary = lexicon::hash_lemmas)) %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  # reconstruct the lyrics
  nest(word_list = word) %>%
  mutate(lyrics_tidy = map_chr(word_list, ~ unlist(.) %>% paste(., collapse = " "))) %>%
  select(-word_list)
lyrics_tidy has a new column, lyrics_tidy, with stop words removed and words lemmatized. Here is what Freddie Mercury’s Bohemian Rhapsody looks like before and after this processing.
lyrics_tidy %>%
  filter(song == "Bohemian Rhapsody") %>%
  pivot_longer(cols = c(lyrics, lyrics_tidy)) %>%
  select(song, value) %>%
  head() %>%
  flextable::as_grouped_data("song") %>%
  flextable::flextable() %>%
  flextable::autofit() %>%
  flextable::theme_vanilla() %>%
  flextable::bg(i = ~ !is.na(song), bg = "gray80") %>%
  flextable::border(i = c(2), border.bottom = officer::fp_border()) %>%
  flextable::border(j = 2, border.right = officer::fp_border())
song | value |
Bohemian Rhapsody | |
Is this the real life? Is this just fantasy? Caught in a landslide, no escape from reality Open your eyes, look up to the skies and see I'm just a poor boy, I need no sympathy Because I'm easy come, easy go, little high, little low Any way the wind blows doesn't really matter to me, to me Mama, just killed a man Put a gun against his head, pulled my trigger, now he's dead Mama, life had just begun But now I've gone and thrown it all away Mama, ooh, didn't mean to make you cry If I'm not back again this time tomorrow Carry on, carry on as if nothing really matters Too late, my time has come Sends shivers down my spine, body's aching all the time Goodbye, everybody, I've got to go Gotta leave you all behind and face the truth Mama, ooh ( Any way the wind blows) I don't wanna die I sometimes wish I'd never been born at all I see a little silhouetto of a man Scaramouche, Scaramouche, will you do the Fandango? Thunderbolt and lightning, very, very frightening me ( Galileo) Galileo, ( Galileo) Galileo, Galileo Figaro magnifico But I'm just a poor boy, nobody loves me He's just a poor boy from a poor family Spare him his life from this monstrosity Easy come, easy go, will you let me go? Bismillah! No, we will not let you go( Let him go) Bismillah! We will not let you go( Let him go) Bismillah! We will not let you go( Let me go) Will not let you go( Let me go) Will not let you go( Never, never, never, never let me go) Ah No, no, no, no, no, no, no ( Oh, mamma mia, mamma mia) Mamma mia, let me go Beelzebub has a devil put aside for me, for me, for me! So you think you can stone me and spit in my eye? So you think you can love me and leave me to die? Oh, baby, can't do this to me, baby! Just gotta get out, just gotta get right outta here ( Ooh) ( Ooh, yeah, ooh, yeah) Nothing really matters, anyone can see Nothing really matters Nothing really matters to me Any way the wind blows 656 | |
real life fantasy catch landslide escape reality eye sky im poor boy sympathy im easy easy low wind blow doesnt matter mama kill gun head pull trigger hes dead mama life begin ive throw mama ooh didnt cry im time tomorrow carry carry matter late time send shiver spine bodys ache time goodbye ive gotta leave truth mama ooh wind blow wanna die id bear silhouetto scaramouche scaramouche fandango thunderbolt lightning frighten galileo galileo galileo galileo galileo figaro magnifico im poor boy love hes poor boy poor family spare life monstrosity easy easy bismillah bismillah bismillah ah mamma mia mamma mia mamma mia beelzebub devil stone spit eye love leave die baby baby gotta gotta outta ooh ooh ooh matter matter matter wind blow 656 |
Unnesting tokens into unigrams with counts gives the following “top 5” words for each writer.
tidy_top5_plot <- function(x, title_text) {
  x %>%
    group_by(band, writer) %>%
    count(token) %>%
    mutate(token_pct = n / sum(n)) %>%
    slice_max(order_by = token_pct, n = 5, with_ties = FALSE) %>%
    mutate(token = reorder_within(token, token_pct, writer)) %>%
    ggplot(aes(x = token, y = token_pct, fill = band)) +
    geom_col() +
    # the aesthetic is fill, so use the fill scale (was scale_color_manual)
    scale_fill_manual(values = band_palette) +
    scale_x_reordered() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    facet_wrap(facets = vars(writer), scales = "free_y") +
    coord_flip() +
    theme_light() +
    theme(legend.position = "none") +
    labs(x = NULL, y = "frequency", title = title_text)
}
# This is a first attempt at a word count. A better one comes later.
lyrics_word <- lyrics_tidy %>%
  unnest_tokens(output = "token", input = lyrics_tidy, token = "words")

lyrics_word %>% tidy_top5_plot("Top 5 Words (a first look).")
“I’m” is a top-5 word for every writer except Neil Peart. Peart rarely wrote explicitly about himself. There are some gibberish words like “ooh” that don’t add much insight. I’ll remove them and try again.
custom_stop_words <- data.frame(token = c("ooh", "whoa"))

# This is the final word count
lyrics_word_2 <- lyrics_word %>%
  anti_join(custom_stop_words, by = "token")

lyrics_word_2 %>% tidy_top5_plot("Top 5 Words (improved).")
“Love” is another prevalent word, appearing in the top-5 for everyone except Roger Taylor and Queen. AC/DC was distinctive in its references to rock ’n roll. Peart had a relatively low word frequency for his top-5, suggesting a wider range of words and themes across his songs.
Bigrams are easier to interpret, but they can also mask ideas that vary in phrasing. Here are the bigrams pulled from the distilled lyrics_tidy column.
lyrics_bigram <- lyrics_tidy %>%
  unnest_tokens(output = "token", input = lyrics_tidy, token = "ngrams", n = 2) %>%
  filter(!is.na(token))

lyrics_bigram %>% tidy_top5_plot("Top 5 Bigrams")
AC/DC is all about rock ’n roll. Brian May’s songs It’s Late, Dancer, and Sweet Lady are evident in his list. The same is true for Deacon, Queen, and Neil Peart. Songs with repeating phrases appear in the bigrams - not very interesting.
I will fit a structural topic model (STM) following the procedure from the stm vignette.
Process the data first. Some of this overlaps with the processing I’ve just completed.
processed <- stm::textProcessor(
  lyrics_tidy$lyrics_tidy,
  metadata = lyrics_tidy,
  stem = FALSE,
  customstopwords = custom_stop_words$token
)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
textProcessor() produces a list object with three main components (see the inspection sketch after this list):

- vocab: named vocabulary vector. The vector has 4,844 words.
- documents: list of matrices, one per document. Each matrix has 2 rows of integers. The first row is indices from the vocabulary vector; the second is their associated word counts. This is a concise representation of a document term matrix. The processing step sometimes removes a few documents that are empty after removing chaff. However, I still have all 466 rows in my documents list.
- meta: metadata data frame. There is one row per document (466 rows) containing all the song features I’ve collected in prior sections.
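A quick way to see the documents structure is to inspect the first song (a sketch, assuming the processed object created above; output omitted):

# Row 1: indices into processed$vocab; row 2: counts of those terms in the song.
processed$documents[[1]][, 1:5]
# Map the first five vocabulary indices back to their terms.
processed$vocab[processed$documents[[1]][1, 1:5]]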
Next, “prepare” the corpus by removing infrequently used words. You can leave all words in, but it improves performance to cull words with such low frequencies that they are unlikely to contribute to topics. The following diagnostic plot helps.

stm::plotRemoved(
  processed$documents,
  lower.thresh = seq(1, length(processed$documents), by = 10)
)
If you remove words appearing in fewer than half the songs, all songs will be empty. 1% is a conservative threshold: only words appearing in at least 1% of songs in the corpus will be eligible for topics. prepDocuments() removes words with frequencies below the defined threshold, then updates the vocabulary, documents, and metadata.
prepared <- stm::prepDocuments(
  processed$documents,
  processed$vocab,
  processed$meta,
  lower.thresh = length(processed$documents) * .01
)
## Removing 3910 of 4844 terms (5910 of 20876 tokens) due to frequency
## Your corpus now has 466 documents, 934 terms and 14966 tokens.
I didn’t lose any songs - I’m still at 466 songs. The vocabulary vector shrank from 4,844 words to 934 words. Here are the first 100, just to get a sense.
prepared$vocab[1:100]
## [1] "aah" "ache" "act" "action" "adventure" "afraid"
## [7] "afternoon" "age" "ago" "agree" "ahead" "ahh"
## [13] "aim" "aint" "air" "alive" "alright" "ambition"
## [19] "ancient" "angel" "anger" "angry" "animal" "answer"
## [25] "anymore" "applause" "arm" "ass" "attack" "attention"
## [31] "babe" "baby" "bad" "balance" "ball" "band"
## [37] "bang" "bar" "bare" "battle" "beam" "bear"
## [43] "beast" "beat" "beatin" "beautiful" "beauty" "bed"
## [49] "bedroom" "begin" "belief" "bell" "belong" "bend"
## [55] "beneath" "bill" "bind" "bird" "bitch" "bite"
## [61] "bitter" "black" "blame" "blast" "bleed" "blind"
## [67] "blood" "bloody" "blow" "blue" "body" "bodys"
## [73] "boil" "bomb" "bone" "book" "booze" "boss"
## [79] "bottle" "bottom" "bout" "bow" "boy" "brain"
## [85] "brand" "break" "breath" "breathe" "bridge" "bright"
## [91] "bring" "brother" "build" "bullet" "burn" "burnin"
## [97] "business" "busy" "buy" "cadillac"
The stm package allows you to either specify the number of topics (K) to identify, or let it choose an optimal number by setting parameter K = 0. I’ll let stm choose. The resulting probability distribution of topic words will then be a K x 934 matrix, sometimes called the beta matrix. The probability distribution of song topics will be a 466 x K matrix, sometimes called the gamma matrix (theta in the stm package).
I expect lyrics to be correlated with the writer, so I will fit a prevalence model with writer as a covariate.
set.seed(1234)
fit_prevalence <- stm::stm(
  documents = prepared$documents,
  vocab = prepared$vocab,
  K = 0,
  prevalence = ~ writer,
  data = prepared$meta,
  init.type = "Spectral",
  verbose = FALSE
)
summary(fit_prevalence)
## A topic model with 45 topics, 466 documents and a 934 word dictionary.
## Topic 1 Top Words:
## Highest Prob: learn, feel, miss, star, adventure, live, boy
## FREX: miss, adventure, learn, star, promise, cage, feelin
## Lift: adventure, miss, reel, cage, promise, feelin, trade
## Score: adventure, miss, learn, cage, star, bare, promise
## Topic 2 Top Words:
## Highest Prob: world, hes, half, boy, afraid, lose, hold
## FREX: half, hes, world, afraid, begin, queen, boy
## Lift: afraid, half, hes, crack, noble, world, weapon
## Score: afraid, hes, half, world, weapon, queen, crack
## Topic 3 Top Words:
## Highest Prob: day, bear, heaven, wait, life, stand, word
## FREX: day, bear, single, key, cloud, heaven, instinct
## Lift: agree, instinct, stormy, cloud, page, single, key
## Score: agree, day, heaven, instinct, bear, shore, beautiful
## Topic 4 Top Words:
## Highest Prob: bring, bad, heaven, measure, choose, life, faith
## FREX: bring, measure, choose, heaven, faith, belief, tick
## Lift: belief, measure, tick, market, bring, choose, faith
## Score: measure, belief, heaven, bring, choose, faith, bad
## Topic 5 Top Words:
## Highest Prob: money, life, real, honey, taste, waste, live
## FREX: money, real, taste, bitch, honey, waste, devil
## Lift: bitch, money, real, cash, relation, taste, mighty
## Score: money, bitch, real, honey, waste, taste, devil
## Topic 6 Top Words:
## Highest Prob: blood, hot, white, feel, business, bad, ive
## FREX: blood, hot, white, business, animal, guy, disguise
## Lift: blood, white, hot, business, animal, disguise, haha
## Score: blood, hot, white, business, animal, dog, disguise
## Topic 7 Top Words:
## Highest Prob: body, dancer, talk, dance, chance, baby, wheel
## FREX: body, dancer, dance, fate, beatin, talk, chance
## Lift: body, dancer, beatin, prayer, mmm, twist, iron
## Score: body, dancer, talk, beatin, dance, fate, prayer
## Topic 8 Top Words:
## Highest Prob: fight, happen, wild, inside, child, land, tie
## FREX: happen, fight, wild, tie, child, reputation, card
## Lift: boil, happen, card, reputation, tie, mouth, fight
## Score: happen, wild, boil, child, fight, tie, reputation
## Topic 9 Top Words:
## Highest Prob: tough, light, demand, eye, music, time, home
## FREX: demand, tough, music, form, flow, contact, magic
## Lift: contact, spiral, form, demand, gift, tough, flow
## Score: contact, demand, tough, magic, form, flow, music
## Topic 10 Top Words:
## Highest Prob: talk, danger, red, cut, car, stranger, wheel
## FREX: danger, talk, stranger, car, cut, red, difference
## Lift: danger, stranger, difference, talk, car, shoulder, tip
## Score: danger, talk, stranger, difference, red, cut, car
## Topic 11 Top Words:
## Highest Prob: dead, shake, leg, rise, power, action, day
## FREX: dead, leg, rise, action, power, shake, wake
## Lift: dead, leg, action, bodys, pump, wake, heartbeat
## Score: dead, leg, shake, action, rise, power, wake
## Topic 12 Top Words:
## Highest Prob: deep, breath, soul, heart, eye, tear, reach
## FREX: deep, breath, soul, reach, theyll, mountain, tear
## Lift: deep, breath, flower, theyll, sigh, misty, flood
## Score: deep, breath, reach, theyll, soul, mountain, emotion
## Topic 13 Top Words:
## Highest Prob: dream, play, game, fly, vision, world, hes
## FREX: dream, vision, play, game, mission, answer, wing
## Lift: field, vision, dream, soar, mission, madness, mysterious
## Score: dream, vision, field, play, hes, mission, game
## Topic 14 Top Words:
## Highest Prob: fool, live, late, god, break, free, hard
## FREX: fool, late, god, rule, free, live, hard
## Lift: fool, lesson, rule, late, kid, sorrow, company
## Score: fool, late, god, kid, lesson, rule, hard
## Topic 15 Top Words:
## Highest Prob: light, city, sun, gold, ahh, lose, everyday
## FREX: gold, ahh, everyday, mystic, city, sun, freeze
## Lift: gold, ahh, mystic, everyday, distant, cast, unknown
## Score: gold, ahh, city, everyday, mystic, sun, glory
## Topic 16 Top Words:
## Highest Prob: love, friend, life, hungry, home, youre, wanna
## FREX: love, friend, hungry, home, heel, til, care
## Lift: heel, love, hungry, friend, til, size, letter
## Score: love, heel, friend, hungry, wanna, home, letter
## Topic 17 Top Words:
## Highest Prob: hell, thunder, highway, bell, easy, aint, leave
## FREX: hell, highway, bell, thunder, easy, wont, home
## Lift: highway, bell, hell, thunder, pour, easy, temperature
## Score: highway, hell, bell, thunder, easy, nobodys, aint
## Topic 18 Top Words:
## Highest Prob: drive, blue, fire, hole, red, road, load
## FREX: blue, drive, hole, load, road, fire, red
## Lift: hole, blue, load, drive, cadillac, weapon, strike
## Score: hole, blue, drive, fire, load, edge, red
## Topic 19 Top Words:
## Highest Prob: fire, house, flame, finger, demon, burn, burnin
## FREX: house, fire, finger, flame, burnin, demon, wire
## Lift: house, burnin, fire, finger, demon, bedroom, flame
## Score: house, fire, demon, flame, burnin, finger, wire
## Topic 20 Top Words:
## Highest Prob: shake, brain, girl, edge, sit, heart, razor
## FREX: brain, shake, edge, sit, razor, invisible, girl
## Lift: invisible, razor, brain, edge, sit, shake, joy
## Score: invisible, shake, brain, edge, razor, sit, girl
## Topic 21 Top Words:
## Highest Prob: dog, bone, givin, war, fortune, eat, son
## FREX: dog, bone, givin, fortune, eat, itch, war
## Lift: itch, dog, bone, givin, fortune, eat, son
## Score: dog, bone, givin, itch, war, fortune, son
## Topic 22 Top Words:
## Highest Prob: night, light, knife, fly, fight, change, dark
## FREX: knife, night, isnt, light, pretend, color, ship
## Lift: knife, isnt, night, thin, beast, color, pretend
## Score: knife, night, isnt, pretend, goodbye, ship, thin
## Topic 23 Top Words:
## Highest Prob: shoot, bite, lip, gun, bullet, dust, thrill
## FREX: lip, bullet, bite, thrill, dust, shoot, gun
## Lift: lip, thrill, bullet, shoot, doctor, dust, plenty
## Score: lip, shoot, bullet, bite, thrill, dust, doctor
## Topic 24 Top Words:
## Highest Prob: hear, world, listen, day, feel, people, cry
## FREX: hear, listen, mercury, universe, earth, fear, wise
## Lift: mercury, belong, universe, wise, hear, listen, judge
## Score: mercury, wise, universe, hear, listen, king, fear
## Topic 25 Top Words:
## Highest Prob: feel, pain, stick, pleasure, swallow, neck, spit
## FREX: pleasure, neck, pain, swallow, spit, stick, close
## Lift: neck, pleasure, spit, swallow, rainy, pain, sunshine
## Score: neck, swallow, pleasure, stick, spit, pain, rainy
## Topic 26 Top Words:
## Highest Prob: aint, gonna, fun, round, machine, drink, radio
## FREX: fun, machine, radio, rockin, aint, noise, drink
## Lift: noise, fun, rollin, rockin, radio, fat, machine
## Score: fun, noise, radio, machine, aint, rockin, drink
## Topic 27 Top Words:
## Highest Prob: baby, feel, city, move, power, safe, stick
## FREX: city, safe, move, oil, baby, cover, stick
## Lift: oil, safe, cover, movin, spine, city, move
## Score: oil, safe, baby, city, cover, stick, move
## Topic 28 Top Words:
## Highest Prob: hand, heart, save, cold, paper, touch, cut
## FREX: save, paper, hand, cold, fist, heart, lord
## Lift: paper, save, fist, gentle, cold, hand, lord
## Score: paper, save, cold, fist, heart, hand, gentle
## Topic 29 Top Words:
## Highest Prob: time, leave, war, peace, gotta, whiskey, lie
## FREX: peace, war, leave, time, whiskey, hurry, double
## Lift: peace, hurry, whiskey, war, double, leave, ticket
## Score: peace, war, time, leave, whiskey, hurry, double
## Topic 30 Top Words:
## Highest Prob: hey, god, magic, ride, leave, feel, prize
## FREX: hey, magic, god, prize, wild, attack, ride
## Lift: prize, hey, attack, magic, kingdom, god, fence
## Score: hey, prize, magic, god, ride, wild, wind
## Topic 31 Top Words:
## Highest Prob: lose, love, funny, head, bomb, speed, wheel
## FREX: funny, bomb, lose, speed, pay, remain, count
## Lift: remain, funny, bomb, count, speed, forgive, pay
## Score: remain, funny, bomb, speed, waste, lose, wheel
## Topic 32 Top Words:
## Highest Prob: rock, roll, gonna, hard, play, night, trust
## FREX: rock, roll, trust, guitar, hard, top, loud
## Lift: rock, roll, guitar, drum, jive, whip, cmon
## Score: rock, roll, gonna, hard, guitar, trust, cmon
## Topic 33 Top Words:
## Highest Prob: youre, walk, screw, kick, aint, headlong, wanna
## FREX: youre, screw, walk, headlong, kick, kiss, wall
## Lift: screw, youre, headlong, kiss, kick, pause, walk
## Score: screw, youre, headlong, kick, walk, kiss, aint
## Topic 34 Top Words:
## Highest Prob: shes, ball, play, rhythm, girl, stuff, woman
## FREX: shes, ball, rhythm, stuff, shed, play, weve
## Lift: shes, ball, rhythm, shed, stuff, tune, loser
## Score: shes, ball, rhythm, stuff, shed, girl, gimme
## Topic 35 Top Words:
## Highest Prob: woman, black, didnt, ice, dare, kick, evil
## FREX: ice, woman, didnt, dare, black, tooth, evil
## Lift: shuffle, ice, dare, didnt, march, woman, black
## Score: woman, shuffle, ice, black, dare, didnt, evil
## Topic 36 Top Words:
## Highest Prob: ride, catch, gonna, girl, spell, play, black
## FREX: catch, ride, spell, playin, girl, black, tale
## Lift: spell, ride, playin, catch, tale, society, empty
## Score: spell, ride, catch, playin, girl, gonna, tale
## Topic 37 Top Words:
## Highest Prob: stop, feel, gimme, gonna, time, road, day
## FREX: stop, gimme, road, party, shadow, huh, travel
## Lift: stop, gimme, huh, hundred, road, travel, sugar
## Score: stop, gimme, road, gonna, huh, party, shadow
## Topic 38 Top Words:
## Highest Prob: boy, bad, lie, mirror, hero, time, survive
## FREX: hero, mirror, survive, wouldnt, straight, sink, paint
## Lift: straight, wouldnt, survive, hero, sink, mirror, paint
## Score: straight, paint, mirror, hero, wouldnt, nobodys, survive
## Topic 39 Top Words:
## Highest Prob: baby, gonna, hey, suicide, blow, youre, gotta
## FREX: suicide, baby, worth, window, blow, pass, gonna
## Lift: suicide, worth, grab, window, join, dragon, stir
## Score: suicide, baby, worth, hey, gonna, window, pass
## Topic 40 Top Words:
## Highest Prob: ill, time, ready, stand, line, hold, hard
## FREX: ill, line, stand, ready, mist, time, sunday
## Lift: sunday, mist, ill, damn, line, front, midnight
## Score: sunday, ill, ready, mist, time, line, paint
## Topic 41 Top Words:
## Highest Prob: sweet, call, miracle, lady, world, mother, child
## FREX: sweet, miracle, call, sail, lady, mother, precious
## Lift: sweet, miracle, sail, sister, hum, weak, call
## Score: sweet, miracle, call, lady, sail, child, mother
## Topic 42 Top Words:
## Highest Prob: sky, fly, system, round, moment, burn, grind
## FREX: system, sky, fee, trap, spark, grind, moment
## Lift: system, trap, fee, engine, spark, grind, blast
## Score: system, trap, fee, sky, spark, moment, grind
## Topic 43 Top Words:
## Highest Prob: wind, touch, time, blow, wave, light, carry
## FREX: wind, test, wave, echo, shore, touch, carry
## Lift: test, gently, shore, echo, crash, west, wind
## Score: test, wind, shore, touch, echo, wave, endless
## Topic 44 Top Words:
## Highest Prob: theyre, dirty, crazy, track, run, cheap, train
## FREX: theyre, crazy, track, dirty, cheap, train, pick
## Lift: theyre, cheap, track, crazy, dirty, train, ten
## Score: theyre, dirty, cheap, crazy, track, train, gotta
## Topic 45 Top Words:
## Highest Prob: shoot, love, warn, life, honey, bite, bleed
## FREX: shoot, warn, love, bleed, jump, trip, bite
## Lift: warn, shoot, bleed, jump, trip, ecstasy, pump
## Score: warn, shoot, love, bleed, ecstasy, honey, trip
stm() produced a model with 45 topics. The model summary printed above (you can also print it with stm::labelTopics(fit_prevalence)) shows the top words for each topic based on four metrics: highest probability, FREX, lift, and score.
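As a quick check that the fitted object matches the matrix dimensions described earlier, you can inspect it directly (a sketch; stm stores the topic-term distribution as log probabilities in beta$logbeta, one matrix per content-covariate level):

dim(fit_prevalence$beta$logbeta[[1]])  # K x 934: topic-term distribution (the beta matrix)
dim(fit_prevalence$theta)              # 466 x K: song-topic weights (gamma; stm calls it theta)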
Let’s look at Topic 2:
stm::labelTopics(fit_prevalence, topics = 2)
## Topic 2 Top Words:
## Highest Prob: world, hes, half, boy, afraid, lose, hold
## FREX: half, hes, world, afraid, begin, queen, boy
## Lift: afraid, half, hes, crack, noble, world, weapon
## Score: afraid, hes, half, world, weapon, queen, crack
findThoughts() shows the documents that mapped most strongly to the topic. The top 3 songs are below.
topic_thoughts <- stm::findThoughts(
  fit_prevalence,
  n = 3,
  texts = processed$meta$lyrics,
  topics = 2,
  meta = processed$meta
)

processed$meta[pluck(topic_thoughts$index, 1), ] %>% select(writer, song, song_url)
## writer song song_url
## 375 Neil Peart Half the World https://genius.com/Rush-half-the-world-lyrics
## 400 Neil Peart New World Man https://genius.com/Rush-new-world-man-lyrics
## 450 Neil Peart The Weapon https://genius.com/Rush-the-weapon-lyrics
For reporting purposes, you might want to sum up each topic with a title. That’s not going to be practical here with 45 topics. Instead, I’ll label each topic with its top-5 highest-probability words.
topic_lbl <- stm::labelTopics(fit_prevalence, n = 5) %>%
  pluck("prob") %>%
  as.data.frame() %>%
  mutate(
    topic_num = row_number(),
    topic_id = paste0("topic_", topic_num)
  ) %>%
  rowwise() %>%
  mutate(
    topic = paste(c(V1, V2, V3, V4, V5), collapse = ", "),
    topic_long = paste0(topic_num, ": ", topic)
  ) %>%
  select(topic_id, topic, topic_long)

topic_df <- fit_prevalence$theta %>%
  as.data.frame()
colnames(topic_df) <- topic_lbl$topic_id

lyrics_tidy_2 <- processed$meta %>%
  bind_cols(topic_df)
Data frame lyrics_tidy_2 has the 45 topic weights attached as columns named topic_1, topic_2, …, topic_45. The topic descriptors are in data frame topic_lbl, which I can use as a lookup after pivoting lyrics_tidy_2.
lyrics_tidy_2_long <- lyrics_tidy_2 %>%
  pivot_longer(
    cols = starts_with("topic_"),
    names_to = "topic_id",
    values_to = "topic_weight"
  ) %>%
  inner_join(topic_lbl, by = "topic_id") %>%
  mutate(
    topic = factor(topic, levels = topic_lbl$topic),
    topic_long = factor(topic_long, levels = topic_lbl$topic_long)
  )
Each song is a mix of topics, so for each song the topic weights sum to 1.
Topics usually make a negligible contribution to most songs, and a substantial contribution to a few songs. Viewed from the other side, most songs are composed primarily of one or two topics.
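Before plotting, a quick sanity check of that claim (a sketch against the long data frame built above):

# Every song's 45 topic weights should total 1 (within rounding error).
lyrics_tidy_2_long %>%
  group_by(song_id) %>%
  summarize(total = sum(topic_weight), .groups = "drop") %>%
  summarize(min_total = min(total), max_total = max(total))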
p <- lyrics_tidy_2_long %>%
  ggplot(aes(x = fct_rev(topic_long), y = topic_weight, color = writer,
             text = glue("song: {song} <br>",
                         "topic: {topic} <br>",
                         "weight: {scales::percent(topic_weight, accuracy = 1)}"))) +
  geom_point(alpha = .8) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_color_manual(values = writer_palette) +
  coord_flip() +
  theme_light() +
  labs(
    title = "Topics usually make up <10% of a song",
    subtitle = "Song Weights for each topic.",
    x = NULL, y = "Topic Weight")

ggplotly(p, tooltip = "text")
Here are the top topics for We Will Rock You.
lyrics_tidy_2_long %>%
  filter(song == "We Will Rock You") %>%
  arrange(desc(topic_weight)) %>%
  mutate(topic_weight = topic_weight * 100) %>%
  head() %>%
  select(topic_long, topic_weight) %>%
  flextable::flextable() %>%
  flextable::colformat_double(digits = 0, suffix = "%") %>%
  flextable::autofit() %>%
  flextable::set_caption("We Will Rock You topics")
topic_long | topic_weight |
32: rock, roll, gonna, hard, play | 53% |
33: youre, walk, screw, kick, aint | 11% |
2: world, hes, half, boy, afraid | 6% |
6: blood, hot, white, feel, business | 6% |
29: time, leave, war, peace, gotta | 6% |
26: aint, gonna, fun, round, machine | 4% |
A heat map of topic importance shows a higher topic concentration for Deacon, Taylor, and Queen, but that is likely due to their small corpus sizes. Neil Peart weighed heavily on topic 15 (light, city, sun, gold, ahh) while the other writers hardly used it at all. AC/DC relied on topic 32 (rock, roll, gonna, hard, play), and so did Roger Taylor.
topic_importance <- lyrics_tidy_2_long %>%
  group_by(writer, topic_long) %>%
  summarize(
    .groups = "drop",
    songs = n(),
    important = sum(topic_weight >= .25),
    pct_import = important / songs)
topic_importance %>%
  ggplot(aes(x = writer, y = topic_long, label = pct_import)) +
  geom_tile(aes(fill = pct_import)) +
  scale_fill_gradient(low = "#FFFFFF", high = "goldenrod") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 8)) +
  theme_light() +
  theme(legend.position = "top") +
  labs(title = "Topic Importance by Writer", fill = NULL, y = NULL, x = NULL)
Which writers were most similar? Topic weights for AC/DC were highly correlated with Roger Taylor’s, but negatively correlated with the other writers’, especially Neil Peart’s. Peart was not strongly correlated with anyone. The most similar writers were Brian May and Freddie Mercury. John Deacon sat in between, correlated less strongly with both Mercury and May.
topic_importance %>%
  select(writer, topic_long, pct_import) %>%
  pivot_wider(names_from = writer, values_from = pct_import) %>%
  column_to_rownames(var = "topic_long") %>%
  cor() %>%
  corrplot::corrplot(type = "upper")
A cluster analysis can use the topics to find similar songs. I’ll follow the steps from my unsupervised learning handbook chapter on K-medoids cluster analysis.
The first step is to calculate the distances between songs. I use the Gower distance.
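As a reminder of what the Gower metric does, here is a toy sketch (not part of the pipeline): each numeric variable is scaled by its range, and the distance between two rows is the average of the scaled absolute differences.

# Two numeric columns with different ranges.
toy <- data.frame(a = c(0, 5, 10), b = c(1, 1, 3))
cluster::daisy(toy, metric = "gower")
# e.g., d(row 1, row 2) = mean(|0 - 5| / 10, |1 - 1| / 2) = 0.25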
lyrics_gower <- lyrics_tidy_2 %>%
  select(-c(song_id:lyrics, lyrics_tidy)) %>%
  cluster::daisy(metric = "gower")
Let’s see the most similar and dissimilar pairs of songs according to their Gower distance.
x <- as.matrix(lyrics_gower)

bind_rows(
  lyrics_tidy_2[which(x == min(x[x != 0]), arr.ind = TRUE)[1, ], ],
  lyrics_tidy_2[which(x == max(x[x != 0]), arr.ind = TRUE)[1, ], ]
) %>%
  as.data.frame() %>%
  select(-c(lyrics_tidy, topic_1:topic_39)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(cols = -song) %>%
  pivot_wider(names_from = song) %>%
  flextable::flextable() %>%
  flextable::add_header_row(values = c("", "Similar", "Dissimilar"), colwidths = c(1, 2, 2)) %>%
  flextable::border(j = c(1, 3), border.right = officer::fp_border("gray80"), part = "all")
Similar | Dissimilar | |||
name | Ride the Wild Wind | A Kind of Magic | Mustapha | Dear Friends |
song_id | 310066 | 309616 | 309019 | 308899 |
band | Queen | Queen | Queen | Queen |
album | Innuendo | A Kind of Magic | Jazz | Sheer Heart Attack |
writer | Roger Taylor | Roger Taylor | Freddie Mercury | Brian May |
released | 1991 | 1986 | 1978 | 1974 |
song_url | https://genius.com/Queen-ride-the-wild-wind-lyrics | https://genius.com/Queen-a-kind-of-magic-lyrics | https://genius.com/Queen-mustapha-lyrics | https://genius.com/Queen-dear-friends-lyrics |
lyrics | Ride the wild wind ( Push the envelope don't sit on the fence) Hey, hey, hey, hey Ride the wild wind ( Live life on the razor's edge) Hey, hey, hey Gonna ride the whirlwind It ain't dangerous enough for me Get your head down baby (yeah) We're gonna ride tonight Your angel eyes Are shining bright I wanna take your hand Lead you from this place Gonna leave it all behind Check out of this rat race Ride the wild wind (hey hey hey) Ride the wild wind (hey hey hey) Gonna ride the wild wind It ain't dangerous enough for me Tie your hair back baby We're gonna ride tonight (yeah) We got freaks to the left We got jerks to the right Sometimes I get so low I just have to ride Let me take your hand Let me be your guide Ooh, ride the wild wind ( Don't sit on the fence) Hey, hey, hey Ride the wild wind ( Live life on the razor's edge) ( Hey, hey, hey) Gonna ride the whirlwind It ain't dangerous enough for me Yeah, ride the wild wind Hey, hey, hey, hey Ride the wild wind Hey, hey, hey, hey (ha ha ha) Gonna ride the whirlwind It ain't dangerous enough for me Ride the wild wind Ride the wild wind ( Hey, hey, hey) The wild wind The wild wind 3 | It's a kind of magic It's a kind of magic A kind of magic One dream, one soul One prize, one goal One golden glance of what should be ( It's a kind of magic) One shaft of light that shows the way No mortal man can win this day ( It's a kind of magic) The bell that rings inside your mind Is challenging the doors of time ( It's a kind of magic) The waiting seems eternity The day will dawn of sanity Ooh, ooh, ooh, ooh Is this a kind of magic?( It's a kind of magic) There can be only one This rage that lasts a thousand years Will soon be done This flame that burns inside of me I'm hearing secret harmonies ( It's a kind of magic) The bell that rings inside your mind Is challenging the doors of time ( It's a kind of magic) ( It's a kind of magic) This rage that lasts a thousand years Will soon be, will soon be Will soon be done! This is (this is) a kind (a kind) Of magic (yeah) There can be only one, one, one, one This rage that lasts a thousand years Will soon be done (done!) ( Magic), it's a kind of magic It's a kind of magic Magic, magic, magic, magic! ( Magic), ha-ha-ha-ha, it's magic Ha-ha Yeah, yeah Whoo It's a kind of magic! 17 | Ibrahim, Ibrahim Ibrahim Allah, Allah Allah, Allah we'll pray for you, hey! Mustapha, Mustapha Mustapha Ibrahim Mustapha, Mustapha Mustapha Ibrahim Mustapha Ibrahim, Mustapha Ibrahim Allah, Allah, Allah We'll pray for you Mustapha Ibrahim, al havra kris vanin Allah, Allah, Allah we'll pray for you Mustapha, hey! Mustapha Mustapha Ibrahim Mustapha Ibrahim, hey! ( Ooh, ah!) Allah- I, Allah- I, Allah- I Ibra- Ibra- Ibrahim, yeah! Ibrahim, Ibrahim, Ibrahim Allah Allah Allah- I hey! Mustapha, Mustapha Allah- I na stolei Mustapha, Mustapha Achtar es na sholei Mustapha, Mustapha Mochamut dei ya low eshelei Mustapha, Mustapha Ai ai ai ai ahelei Mustapha, mustapha Ist avil ahiln avil ahiln adhim Mustapha, salaam Aleikum! Mustapha Ibrahim, Mustapha Ibrahim Allah, Allah, Allah We'll pray for you Mustapha Ibrahim, achbar ish navin Allah, Allah, Allah we'll pray for you Mustapha, mustapha Mustapha Ibrahim Mustapha Ibrahim, hey! Allah- I, Allah- I, Allah- I Ibra- Ibra- Ibrahim, yeah! Ibrahim, Ibrahim, Ibrahim Allah Allah Allah- I hey! Mustapha, Mustapha Mustapha, Mustapha ( Hey!) Mustapha, Mustapha Mustapha, Mustapha Mustapha, mustapha Vontap ist ahiln avil ahiln adhim Mustapha, aleikum Salaam, hey! 
7 | So dear friends your love is gone Only tears to dwell upon I dare not say as the wind must blow So a love is lost, a love is won Go to sleep and dream again Soon your hopes will rise and then From all this gloom life can start anew And there'll be no crying soon 1 |
n_lines | 54 | 53 | 67 | 9 |
n_words | 227 | 233 | 179 | 58 |
words_per_line | 4.72916666666667 | 5.17777777777778 | 2.67164179104478 | 6.44444444444444 |
syllables_per_word | 1.15418502202643 | 1.19742489270386 | 2.07262569832402 | 1.10344827586207 |
flesch | 104.390842969897 | 100.277409632809 | 28.7791495038773 | 106.942164750958 |
flesch_kincaid | -0.126241740088105 | 0.558947067238915 | 9.90892353873093 | -0.0559770114942513 |
dale_chall | 51.111324339207 | 52.6805522174535 | -12.1451088134745 | 53.0016091954023 |
ttr | 0.323008849557522 | 0.267241379310345 | 0.219101123595506 | 0.807017543859649 |
topic_40 | 0.00107618769653324 | 0.00388861253144969 | 0.00705770950175043 | 0.0167258942922743 |
topic_41 | 1.05499542063853e-05 | 3.1959865111622e-05 | 0.000255210656501508 | 0.0186648877666116 |
topic_42 | 4.8253383753524e-05 | 0.000110745199651069 | 0.000783473251671529 | 0.00706149338073792 |
topic_43 | 5.77980927941548e-05 | 0.000175619954754743 | 0.00030332946815038 | 0.0827262212686349 |
topic_44 | 0.000968727068583611 | 0.00211648082141354 | 0.00332224498306624 | 0.000549309165348524 |
topic_45 | 1.27120253986297e-05 | 2.89224230673e-05 | 0.000366246189243187 | 0.00223559199545014 |
The K-means algorithm randomly assigns all observations to one of K clusters, then iteratively calculates the cluster centroids and reassigns observations to their nearest centroid. Centroids are the set of mean values for each feature (hence the name “K-means”). The iterations continue until either the centroids stabilize or the iterations reach a set maximum (typically 50). The result is K clusters with the minimum total intra-cluster variation. K-medoids (PAM) works the same way, except the cluster centers are actual observations (medoids), which lets it work directly from a precomputed distance matrix like the Gower distances above.
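To make the centroid/medoid distinction concrete, a one-feature toy sketch: the centroid is a mean and need not be an observation, while the medoid is the observation with the smallest total distance to the others.

v <- c(1, 2, 3, 10)
mean(v)  # centroid coordinate: 4, not an actual observation
# medoid: the observation minimizing summed absolute distance to all points
v[which.min(sapply(v, function(m) sum(abs(v - m))))]  # returns 2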
What value should K take? Construct a silhouette plot.
set.seed(1234)
pam_mdl <- data.frame(k = 2:100) %>%
  mutate(
    mdl = map(k, ~ pam(lyrics_gower, k = .)),
    sil = map_dbl(mdl, ~ .$silinfo$avg.width)
  )
pam_mdl %>%
  ggplot(aes(x = k, y = sil)) +
  geom_point(size = 2) +
  geom_line() +
  geom_vline(aes(xintercept = 47), linetype = 2, size = 1, color = "goldenrod") +
  scale_x_continuous(breaks = seq(0, 100, by = 5)) +
  theme_light() +
  labs(title = "Silhouette plot max occurs at K = 47 clusters.",
       subtitle = "K-Medoids within-cluster average silhouette width at candidate values of K.",
       y = "")
Attach the results to the original table for visualization and summary statistics.
pam_mdl_final <- pam_mdl %>% filter(k == 47) %>% pluck("mdl", 1)

lyrics_tidy_3 <- lyrics_tidy_2 %>%
  mutate(cluster = as.factor(pam_mdl_final$clustering))
If each cluster were an album, which songs would be on them? Use the filter below to change the albums.
lyrics_tidy_3 %>%
  select(cluster, band, writer, song) %>%
  arrange(cluster, band, writer, song) %>%
  DT::datatable(
    filter = "top",
    options = list(pageLength = 20)
  )
Save the lyrics with topic weights and cluster assignments for subsequent steps.
saveRDS(lyrics_tidy_3, "./4_lyrics.Rds")
saveRDS(fit_prevalence, "./4_model.Rds")
saveRDS(lyrics_tidy_2_long, "./4_lyrics_long.Rds")