1.4 Lemmatize

Stemming and lemmatizing both reduce word variations such as “staying”, “stayed”, and “stay” to a common base form. Stemming strips endings by rule, so the resulting stem is often not a word itself. For example, the Porter stemmer turns “staying” into “stai”. Lemmatizing instead maps each word to its dictionary form (the lemma), which gives the more natural “stay”.
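
To make the contrast concrete, here is a minimal sketch that runs both approaches on the three example words. It assumes the SnowballC package is installed (it is not used elsewhere in this section) and picks the Porter stemmer as one representative stemmer; the `words` vector is made up for illustration.

```r
library(textstem)  # provides lemmatize_words()

# Illustrative input: the three variations discussed above
words <- c("staying", "stayed", "stay")

# Rule-based suffix stripping with the Porter stemmer (assumes SnowballC is installed)
stems <- SnowballC::wordStem(words, language = "porter")

# Dictionary lookup using textstem's default lemma table
lemmas <- lemmatize_words(words)

# Side-by-side comparison of the original word, its stem, and its lemma
data.frame(word = words, stem = stems, lemma = lemmas)
```

Comparing the `stem` and `lemma` columns shows where the two methods disagree, which is usually on words ending in “y” or “e” and on irregular forms.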

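# Assumes dplyr and textstem are installed and dplyr is attached.
# Replace each token with its lemma.
token_2 <- token_1 %>% mutate(word = textstem::lemmatize_words(word))

# List the words that changed, most frequent first.
tibble(before = token_1$word, after = token_2$word) %>%
  filter(before != after) %>%
  count(before, after, sort = TRUE)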
## # A tibble: 2,711 × 3
##    before after     n
##    <chr>  <chr> <int>
##  1 was    be     4156
##  2 is     be     2291
##  3 were   be     1522
##  4 had    have   1224
##  5 are    be      916
##  6 an     a       586
##  7 stayed stay    550
##  8 rooms  room    528
##  9 well   good    440
## 10 more   much    341
## # ℹ 2,701 more rows
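
The table above shows that the default dictionary also rewrites irregular forms, such as “well” to “good” and “more” to “much”, which may or may not suit a given analysis. The sketch below is one possible way to keep such words unchanged: it passes a trimmed copy of the default dictionary through the `dictionary` argument of `lemmatize_words()`. The exclusion list `keep_as_is` and the sample `words` vector are hypothetical, chosen only for illustration.

```r
library(textstem)  # provides lemmatize_words()

# Hypothetical list of words that should not be lemmatized
keep_as_is <- c("well", "more")

# Copy the default dictionary (first column: token, second column: lemma)
# to a plain data frame and drop the rows for the excluded words.
lemma_dict <- as.data.frame(lexicon::hash_lemmas)
custom_dict <- lemma_dict[!lemma_dict[[1]] %in% keep_as_is, ]

# Illustrative input taken from the "before" column above
words <- c("stayed", "rooms", "well", "more")

# Words missing from the dictionary are returned unchanged
result <- lemmatize_words(words, dictionary = custom_dict)
result
```

The same `dictionary = custom_dict` argument could be added to the `mutate()` call that creates `token_2`.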