Section 3 of my Battle of the Bands text mining project compares the lyrical complexity of songs by the writers in Queen, Rush, and AC/DC. Text complexity measures are commonly used to assess age-appropriate reading material in schools and to refine advertising content. They work less well for poetry and song lyrics, where delineated sentences are often absent, but they are still effective for comparative analysis. Neil Peart emerges as the clear winner here, primarily due to his tendency to use long words. Brian May earns an honorable mention for textual richness.

Background

A recent HBR article featured a text analysis of employee performance evaluations in which the researchers measured text complexity. Text complexity measures are used in education to evaluate a book’s required reading level. In the HBR article, the researchers used complexity as a measure of the rigor and thoughtfulness that went into each performance evaluation. The concept may not transfer perfectly to song lyrics. Longer thoughts and more precise language may not make for a better or more artistic song. Still, complexity is at least suggestive of how eruditely the writer approached the subject of the song.

There are some online resources and R packages that are helpful. Celine Van den Rul wrote a tutorial in Towards Data Science showing how to use the quanteda package to measure readability in terms of sentence length and syllables per word, and richness in terms of distinct words. An article at the Illinois State Board of Education discusses text complexity more broadly, including qualitative, quantitative, and “reader and task” dimensions of complexity. The qualitative and “reader and task” dimensions are manual coding exercises that I’m not qualified (or inclined) to tackle yet, but the quantitative measures are within my grasp. The Flesch-Kincaid test (available through quanteda’s companion package quanteda.textstats) analyzes word and sentence length (longer words and sentences are more complex). The Dale-Chall readability formula measures the proportion of less-familiar words. The article also discusses the Lexile framework and the Coh-Metrix report, but these appear to be proprietary and difficult to reproduce. They may be an option for a more in-depth analysis.

Each section below calculates a complexity measure. My expectation is for the Rush songs to be the most complex because Neil Peart wrote such detailed lyrics. Queen’s songs, especially the earlier ones written by Freddie Mercury, were very creative (see The Fairy Feller’s Master Stroke), so I expect Mercury will come a close second. I’m not a huge fan of AC/DC, but my impression is that the songs will be simple, mostly about sex and possibly electricity.

Measuring Complexity with Quanteda

Quanteda and its companion package quanteda.textstats calculate everything I want in a few function calls. Following Van den Rul’s tutorial, I’ll create a quanteda corpus, but first, I have one modification to make to the lyrics. The complexity measures assume your text is organized into sentences, not lines of lyrics. I’ll assume that lines are approximately sentences and collapse the lines with period separators.

lyrics_1 <- lyrics_lines %>% 
  group_by(song_id) %>%
  mutate(lyrics_sentences = paste(lyrics, collapse = ". ")) %>%
  select(-c(line_no, lyrics)) %>%
  unique()
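
As a quick check on what the collapse produces, here is a minimal base-R sketch on a made-up two-song data frame. The `toy_lines` object and its contents are hypothetical and only mimic the shape of `lyrics_lines`; the real pipeline is the dplyr code above.

```r
# Hypothetical toy data: one row per lyric line, shaped like lyrics_lines
toy_lines <- data.frame(
  song_id = c(1, 1, 1, 2, 2),
  line_no = c(1, 2, 3, 1, 2),
  lyrics  = c("First line", "Second line", "Third line",
              "Another song", "Ends here"),
  stringsAsFactors = FALSE
)

# Collapse each song's lines into one string, treating each line as a sentence
toy_sentences <- vapply(
  split(toy_lines$lyrics, toy_lines$song_id),
  paste, character(1), collapse = ". "
)

toy_sentences[["1"]]
```

Note that the separator only goes between lines, so the final line of each song has no closing period.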

Now I can create the corpus.

lyrics_corpus <- lyrics_1 %>% 
  quanteda::corpus(text_field = "lyrics_sentences", docid_field = "song_id")

Van den Rul explains that the most popular ways to evaluate text complexity are readability and richness measures. Readability measures use sentence length and word length as indicators of complexity. Richness measures use vocabulary diversity as an indicator of complexity. Package quanteda.textstats supports both. The help file for textstat_readability() lists a myriad of readability measures. I’ll use five.

readability <- textstat_readability(
  lyrics_corpus, 
  measure = c(
    "meanSentenceLength",
    "meanWordSyllables", 
    "Flesch", 
    "Flesch.Kincaid", 
    "Dale.Chall"
  )
)

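textstat_lexdiv() has several richness measures. Van den Rul discusses just TTR, so I’ll stick with it. The richness measures operate on a tokens object or a document-feature matrix rather than on the raw corpus, so I need to tokenize the corpus first (despite its name, lyrics_dtm below holds the tokens object).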

lyrics_dtm <- lyrics_corpus %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    split_hyphens = TRUE,
    verbose = TRUE
  )
## Creating a tokens object from a corpus input...
##  ...starting tokenization
##  ...78816 to 75951
##  ...preserving social media tags (#, @)
##  ...segmenting into words
##  ...8,295 unique types
##  ...removing separators, punctuation, symbols, numbers, URLs
##  ...complete, elapsed time: 0.25 seconds.
## Finished constructing tokens from 466 documents.
richness <- textstat_lexdiv(lyrics_dtm, measure = "TTR")

I’ll attach these measures to my lyrics dataset for examination, then dive in to figure out what they mean!

lyrics_2 <- lyrics %>%
  inner_join(readability %>% mutate(song_id = as.integer(document)), by = "song_id") %>%
  inner_join(richness %>% mutate(song_id = as.integer(document)), by = "song_id") %>%
  select(-starts_with("document")) %>%
  rename(words_per_line = meanSentenceLength, syllables_per_word = meanWordSyllables) %>%
  janitor::clean_names("snake")

glimpse(lyrics_2)
## Rows: 466
## Columns: 16
## $ song_id            <int> 118582, 308581, 308216, 307643, 307531, 73849, 3075~
## $ band               <chr> "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC", "AC/DC~
## $ album              <chr> "Dirty Deeds Done Dirt Cheap", "Stiff Upper Lip", "~
## $ song               <chr> "Ain't No Fun (Waiting 'Round to Be a Millionaire)"~
## $ writer             <fct> AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC/DC, AC~
## $ released           <int> 1976, 2000, 2008, 1990, 1975, 1980, 1985, 1977, 198~
## $ song_url           <chr> "https://genius.com/Ac-dc-aint-no-fun-waiting-round~
## $ lyrics             <chr> "The following is a true story Only the names have ~
## $ n_lines            <int> 78, 49, 37, 88, 42, 55, 36, 47, 40, 39, 47, 42, 46,~
## $ n_words            <int> 509, 218, 196, 326, 239, 262, 223, 219, 187, 168, 1~
## $ words_per_line     <dbl> 6.697368, 4.448980, 5.297297, 4.724638, 5.690476, 5~
## $ syllables_per_word <dbl> 1.176817, 1.036697, 1.260204, 1.128834, 1.150628, 1~
## $ flesch             <dbl> 100.47843, 114.61470, 94.84498, 106.54011, 103.7160~
## $ flesch_kincaid     <dbl> 0.90841769, -1.62187044, 1.34635411, -0.42714591, 0~
## $ dale_chall         <dbl> 42.02125, 56.13663, 53.55915, 58.40871, 57.68863, 5~
## $ ttr                <dbl> 0.3339921, 0.3240741, 0.4081633, 0.1907692, 0.27196~
lyrics_3 <- lyrics_2 %>%
  arrange(band, writer) %>%
  mutate(writer = fct_inorder(writer))

summary_stats <- lyrics_3 %>% 
  group_by(band, writer) %>%
  summarize(.groups = "drop",
    songs = n(),
    asl_mean = mean(words_per_line), 
    asl_median = median(words_per_line),
    spw_mean = mean(syllables_per_word), 
    spw_median = median(syllables_per_word),
    flesch_mean = mean(flesch),
    flesch_median = median(flesch),
    flesch_kincaid_mean = mean(flesch_kincaid),
    flesch_kincaid_median = median(flesch_kincaid),
    dale_chall_mean = mean(dale_chall),
    dale_chall_median = median(dale_chall),
    ttr_mean = mean(ttr),
    ttr_median = median(ttr),
  )

Visualization

I will present the complexity measures with a summary table and box plots to capture the distribution. Since the presentation is consistent, I have functions to cut down on repeated code.

plot_gt <- function(metric, title_text) {
  summary_stats %>%
    select(band, writer, starts_with(metric)) %>%
    flextable::flextable() %>%
    flextable::colformat_double(digits = 2) %>%
    flextable::autofit() %>%
    flextable::set_caption(title_text)  
}

plot_metric <- function(metric, title_text) {
  p <- lyrics_3 %>%
    ggplot(aes(x = writer, y = !!ensym(metric), color = band,
               text = glue("<b>{song}</b> <br><br>",
                           "Band: {band} <br>",
                           "Album: {album} <br>",
                           "Lyrics: {writer} <br>",
                           "Lines: {n_lines} <br>",
                           "Words: {n_words} <br>",
                           "Words per Line: {scales::number(words_per_line, accuracy = .1)} <br>",
                           "Syllables per Word: {scales::number(syllables_per_word, accuracy = .1)} <br>",
                           "Flesch Reading Ease: {scales::number(flesch, accuracy = .1)} <br>",
                           "Flesch-Kincaid Grade: {scales::number(flesch_kincaid, accuracy = .1)} <br>",
                           "Dale-Chall Readability: {scales::number(dale_chall, accuracy = .1)} <br>",
                           "Type-Token Ratio (TTR): {scales::number(ttr, accuracy = .1)} <br>"))) +
    geom_boxplot(show.legend = FALSE) +
    geom_jitter(width = 0.2, alpha = .6) +
    scale_color_manual(values = band_palette) +
    labs(x = NULL, y = NULL, title = title_text) +
    theme_light() +
    theme(panel.grid.major.x = element_blank())
  
  ggplotly(p, tooltip = "text")
}

plot_album <- function(metric, title_text) {
  lyrics_3 %>% 
    ggplot(aes(x = released, y = !!ensym(metric), color = band)) +
    geom_point() +
    geom_smooth(method = "lm", formula = "y ~ x") +
    scale_color_manual(values = band_palette) +
    facet_wrap(vars(writer)) +
    theme_light() +
    theme(legend.position = "none") +
    labs(x = NULL, y = NULL, title = title_text)
}

Average Sentence Length

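Average sentence length (ASL) is just words per sentence, or in this case, words per line. I already calculated n_words and n_lines, and ASL is approximately their ratio. Quanteda does its own sentence and word counting, so the two can differ for some songs (compare the fourth song in the glimpse() output above).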
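
To make the words-per-line idea concrete, here is a minimal base-R sketch on three made-up lines. The `toy_lines` vector is hypothetical, and splitting on whitespace is a simplification of quanteda's tokenizer.

```r
# Hypothetical lyric lines (made up for illustration)
toy_lines <- c("the sun goes down on the city",
               "and the lights come on",
               "oh yeah")

# Count words per line by splitting on whitespace
words_per_line <- lengths(strsplit(toy_lines, "\\s+"))

# ASL is total words divided by number of lines
asl <- sum(words_per_line) / length(toy_lines)
asl
```

A single short interjection line pulls the average down noticeably, which is why songs with lots of one-word shouts end up with very low ASL.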

Neil Peart had the longest average line length at 5.59 words per line. It’s amusing to look at some of the outliers. AC/DC’s Can I Sit Next to You Girl averaged 9.5 words per line, T.N.T. just 2.3 (Oi!). The distributions are fairly symmetric, and the outliers mostly offset each other. However, I’m going to rely on the median going forward. Using the median, Neil Peart still had the longest ASL at 5.60 words per line. John Deacon (5.33) and Brian May (5.31) were a distant second. When Queen collaborated on song lyrics, they produced the least complex songs as measured by words per line (4.74).

plot_gt(c("asl_mean", "asl_median"), title_text = "Words per Line")
plot_metric("words_per_line", title_text = "Words per Line")

All of the writers had flat trends.

plot_album("words_per_line", "ASL Trend")

Average Word Syllables

I don’t know how quanteda calculates syllables per word (SPW), especially since it processes whole documents of text at a time rather than a document-term matrix. But it does, so let’s just go to the results.
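
For intuition only, here is a crude vowel-group heuristic in base R. It is explicitly not quanteda's algorithm, just a rough stand-in that shows what a syllables-per-word figure represents. The function name `count_syllables` and the word list are my own inventions for this sketch.

```r
# Crude heuristic (NOT quanteda's method): one syllable per run of vowels
count_syllables <- function(words) {
  w <- tolower(words)
  lengths(regmatches(w, gregexpr("[aeiouy]+", w)))
}

# Hypothetical word list
toy_words <- c("mustapha", "rock", "electricity")

syllables <- count_syllables(toy_words)
syllables

# Syllables per word for the toy vector
mean(syllables)
```

The heuristic miscounts silent vowels and diphthongs, so treat it as a feel for the measure rather than a replacement for the packaged calculation.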

No surprise, Neil Peart had the highest median SPW (1.30). It’s striking how tightly packed the median values were for the other writers. They ranged from 1.18 (AC/DC) to 1.23 (Freddie Mercury and Queen).

plot_gt(c("spw_mean", "spw_median"), title_text = "Syllables per Word")

Freddie Mercury’s Mustapha is an interesting outlier. It averaged 2.07 syllables per word because it repeats the word “Mustapha” throughout.

plot_metric("syllables_per_word", title_text = "Syllables per Word")

Syllables per word did not change much over time, although earlier songs for Freddie Mercury and Neil Peart had more variation.

plot_album("syllables_per_word", "SPW Trend")

Flesch

Flesch’s Reading Ease score (Flesch) combines the first two measures. Flesch scores usually range from 0 to 100, although from the formula below you can see how any number under 206.835, including negative numbers, is possible. In general, the lower the score, the more difficult the text is to read.

\[\mathrm{Flesch} = 206.835 - (1.015 \cdot \mathrm{ASL}) - \left(84.6 \cdot \mathrm{SPW}\right)\] where ASL again is average sentence length (words per line), and SPW is average syllables per word. Flesch’s Reading Ease scores decrease with longer sentences and longer words. Scores between 60-80 are generally understood by 12-15 year olds (see WebFX). For complexity, I’m looking for smaller Flesch values.

Neil Peart’s lyrics were the least easy to read (median 91.2). AC/DC and Brian May wrote the easiest to read lyrics (Brian May 100.6, AC/DC 101.2).

plot_gt(c("flesch_mean", "flesch_median"), title_text = "Flesch Reading Ease Score")

Equating lines of text with sentences lowers the measured average sentence length. That’s why these Flesch scores are ridiculously high. The measure is still valuable as a comparison of the relative complexity of songs.

Flesch weights syllables per word more highly than words per sentence. That is evident in Freddie Mercury’s outlier, Mustapha. Mustapha had one of the lowest ASL scores (2.67 words per line) and the highest SPW score (2.07 syllables per word). The combined effect was the lowest Flesch score (28.8). The same was true for Neil Peart’s Chemistry. It had an ASL of 2.85 and SPW of 1.86, resulting in a low Flesch score of 46.2.
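
The two outliers can be checked directly against the formula. This is a minimal sketch using the rounded ASL and SPW values quoted above, so the results land near, not exactly on, the reported scores. The function name `flesch_score` is my own.

```r
# Flesch Reading Ease from average sentence length and syllables per word
flesch_score <- function(asl, spw) {
  206.835 - 1.015 * asl - 84.6 * spw
}

mustapha  <- flesch_score(asl = 2.67, spw = 2.07)  # about 29.0
chemistry <- flesch_score(asl = 2.85, spw = 1.86)  # about 46.6

# Relative weighting: effect of one extra word per line vs. 0.1 extra syllables per word
effect_word     <- flesch_score(5, 1.2) - flesch_score(6, 1.2)   # 1.015 points
effect_syllable <- flesch_score(5, 1.2) - flesch_score(5, 1.3)   # 8.46 points
c(mustapha, chemistry, effect_word, effect_syllable)
```

A tenth of a syllable per word moves the score about eight times as much as a whole extra word per line, which is why the short-lined but polysyllabic songs score so low.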

plot_metric("flesch", title_text = "Flesch Reading Ease Score")

Flesch scores did not change much for any of the writers. No surprise since ASL and SPW did not have a strong time component.

plot_album("flesch", "Flesch Trend")

Flesch-Kincaid

The Flesch-Kincaid Grade Level is the school grade attainment needed to comprehend the text. For reference, most news publications communicate at a seventh grade level. Like Flesch’s Reading Ease, this metric is a straightforward combination of ASL and SPW.

\[\mathrm{Flesch-Kincaid} = 0.39 \cdot \mathrm{ASL} + 11.8 \cdot \mathrm{SPW} - 15.59\]

Like Flesch, Flesch-Kincaid weights syllables per word highly. Neil Peart’s lyrics had by far the highest median grade level (grade 2.06). The second highest was Roger Taylor (grade 1.00). AC/DC and Brian May had remarkably low median grades (AC/DC 0.54, May 0.55).

Obviously, equating lines of lyrics with sentences results in absurdly low grade level estimates. Second graders reading Neil Peart?
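
A small sketch shows how much of the low grade comes from the short "sentences". It plugs Neil Peart's median ASL and SPW into the formula, then swaps in a hypothetical 15-word sentence length (my assumption, chosen only for contrast). The function name `fk_grade` is my own.

```r
# Flesch-Kincaid Grade Level from average sentence length and syllables per word
fk_grade <- function(asl, spw) {
  0.39 * asl + 11.8 * spw - 15.59
}

# Peart's medians, lines treated as sentences
grade_lines <- fk_grade(asl = 5.60, spw = 1.30)   # about 1.9

# Same words, hypothetical 15-word sentences
grade_prose <- fk_grade(asl = 15, spw = 1.30)     # about 5.6

c(grade_lines, grade_prose)
```

The first value is not identical to the reported median of 2.06 because the median of per-song scores is not the same as the score at the median inputs.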

plot_gt(c("flesch_kincaid_mean", "flesch_kincaid_median"), title_text = "Flesch-Kincaid Grade Level")
plot_metric("flesch_kincaid", title_text = "Flesch-Kincaid Grade Level")

Flesch-Kincaid scores did not change much for any of the writers. Again, no surprise.

plot_album("flesch_kincaid", "Flesch-Kincaid Trend")

Dale-Chall

In their 1948 article, A Formula for Predicting Readability, Edgar Dale and Jeanne Chall compiled a list of 763 words that 80% of fourth-graders were familiar with.[1] A 1995 article extended the list to 3,000 words. I found the list here. The Dale-Chall readability score decreases with the proportion of non-familiar (PNF) words and average words per sentence.

\[\mathrm{Dale-Chall} = 64 - (0.95 \cdot 100 \cdot \mathrm{PNF}) - (0.69 \cdot \mathrm{ASL})\] where PNF is the proportion of words that are not in the list of 3,000 familiar words, and ASL is the average sentence length (in words).
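
Here is a minimal sketch of the formula as written above, applied to made-up text. The seven-word `toy_familiar` vector is a hypothetical stand-in, not the real Dale-Chall list, and the function name `dale_chall_score` is my own.

```r
# Dale-Chall score as written above; `familiar` is whatever word list you supply
dale_chall_score <- function(words, n_sentences, familiar) {
  pnf <- mean(!(tolower(words) %in% familiar))  # proportion of non-familiar words
  asl <- length(words) / n_sentences            # average sentence length in words
  64 - 0.95 * 100 * pnf - 0.69 * asl
}

# Hypothetical stand-in for the 3,000-word list (NOT the real Dale-Chall list)
toy_familiar <- c("the", "dog", "ran", "to", "a", "big", "house")

# Ten made-up words in two "sentences"; only "edifice" is off the list
toy_words <- c("the", "dog", "ran", "to", "a", "big", "edifice",
               "the", "dog", "ran")

toy_score <- dale_chall_score(toy_words, n_sentences = 2, familiar = toy_familiar)
toy_score
```

With one unfamiliar word in ten, PNF costs 9.5 points while the five-word sentences cost only 3.45, so vocabulary dominates the score for short lyric lines.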

The lowest median Dale-Chall score was for Queen (47.3). The highest was for John Deacon (53.5). Surprisingly, Freddie Mercury was in the middle (49.5).

plot_gt(c("dale_chall_mean", "dale_chall_median"), title_text = "Dale-Chall Readability")
plot_metric("dale_chall", title_text = "Dale-Chall Readability")

Freddie Mercury’s Mustapha is an interesting outlier again. It has words in Arabic, Hebrew, Farsi, and Avestan, none of which are in the list of 3,000 familiar words. That’s not really fair. I could pull that song out of the data set, but taking the median value reduces sensitivity to outliers like this. Incidentally, Brian May wrote a couple of songs featuring other languages, Las Palabras de Amor (Spanish) and Teo Torriatte (Japanese), but they only include a few lines of non-English.

Dale-Chall readability was flat to rising for all of the writers.

plot_album("dale_chall", "Dale-Chall Trend")

TTR

The Type-Token Ratio (TTR) is the number of types (count of distinct words) divided by the number of tokens (count of all words). It measures how varied (rich) the text is.

\[\mathrm{TTR} = \mathrm{Unique} / \mathrm{Total}\]
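
A minimal base-R sketch of the ratio on made-up tokens. The `toy_tokens` vector is hypothetical, and it assumes the words are already lowercased and stripped of punctuation, which only approximates quanteda's tokenization.

```r
# Hypothetical tokens from a made-up verse
toy_tokens <- c("back", "in", "the", "city", "back", "in", "the", "night")

# Types are distinct words; tokens are all words
ttr <- length(unique(toy_tokens)) / length(toy_tokens)

# Repeating the verse adds tokens but no new types, so TTR falls
ttr_repeated <- length(unique(rep(toy_tokens, 2))) / length(rep(toy_tokens, 2))

c(ttr, ttr_repeated)
```

Repetition, such as a chorus, lowers TTR, and very short songs tend to score high simply because they have little room to repeat themselves.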

Brian May had the highest median TTR (0.45). Queen was close behind with 0.43. Freddie Mercury and Roger Taylor had the lowest text richness (0.37).

plot_gt(c("ttr_mean", "ttr_median"), title_text = "Type-Token Ratio (TTR)")

The least textually rich song was AC/DC’s If You Dare (0.28). The richest was Queen’s Bijou (0.84), but it was only 31 words. The richest song that was several lines long was Neil Peart’s Jacob’s Ladder (0.76).

plot_metric("ttr", title_text = "Type-Token Ratio (TTR)")

AC/DC and Freddie Mercury had flat TTR scores over time, but the other writers had falling scores.

plot_album("ttr", "TTR Trend")

Summary

Let’s take stock. Neil Peart’s songs were relatively lyric-heavy in terms of words per line, but Brian May and John Deacon were not far behind. When Queen collaborated, they used few words per line. Winner: Neil Peart.

Neil Peart had by far the highest median syllables per word. Whereas the medians for all other writers fell within a very narrow band around 1.20, Neil Peart had a whopping 1.30. Peart had a large vocabulary, and that is evident from this measure. Winner: Neil Peart.

Flesch’s Reading Ease Score combines words per line and syllables per word into a single score. It weights syllables more than words, so Neil Peart dominated with the lowest score (lower means more complex). Neil Peart’s median Flesch score was 91.2 while everyone else ranged from 97.6 to 101.2. Winner: Neil Peart.

Flesch-Kincaid’s Grade Level score is similar to Flesch, except complex text results in higher grades. Peart’s proclivity for polysyllabic words effectuated a transcendent median Flesch-Kincaid score of 2.06. Winner: Neil Peart.

Dale-Chall factors in the proportion of unfamiliar words in the song (lower score means more complex). Surprisingly, Queen had the most complex Dale-Chall median score (47.3). Winner: Queen.

Brian May edged out Queen and John Deacon for the most textually rich lyrics measured by TTR. May scored a median TTR of 0.45. Winner: Brian May.

lyrics_3 %>% 
  select(writer, words_per_line:ttr) %>%
  gtsummary::tbl_summary(
    by = "writer",
    statistic = list(c(words_per_line, syllables_per_word, flesch, 
                       flesch_kincaid, dale_chall, ttr) ~ "{median}"))
| Characteristic | AC/DC, N = 175¹ | Brian May, N = 39¹ | Freddie Mercury, N = 50¹ | John Deacon, N = 16¹ | Roger Taylor, N = 23¹ | Queen, N = 15¹ | Neil Peart, N = 148¹ |
|---|---|---|---|---|---|---|---|
| words_per_line | 5.10 | 5.31 | 5.06 | 5.33 | 4.98 | 4.74 | 5.60 |
| syllables_per_word | 1.18 | 1.19 | 1.23 | 1.22 | 1.22 | 1.23 | 1.30 |
| flesch | 101 | 101 | 98 | 99 | 99 | 98 | 91 |
| flesch_kincaid | 0.54 | 0.55 | 0.95 | 0.83 | 1.00 | 0.84 | 2.06 |
| dale_chall | 49.7 | 52.4 | 49.5 | 53.5 | 51.1 | 47.3 | 49.2 |
| ttr | 0.40 | 0.45 | 0.37 | 0.42 | 0.37 | 0.43 | 0.40 |

¹ Median



It is interesting to note that the trends of most measures were flat or moving mildly toward less-complex songs. This is somewhat at odds with my intuition. I expected Freddie Mercury’s songs to become dramatically less complex over time, and Neil Peart’s to become gradually more complex. Neither was true.

Save Work

Save the lyrics with complexity stats for subsequent steps.

saveRDS(lyrics_3, "./3_lyrics.Rds")

  1. See this Wikipedia article.