1.3 Spell-check

Run a spell-check to regularize the data. A correction can occasionally be wrong, but there is probably more to gain than lose. Only a very small fraction of the tokens (about 0.1%) turn out to be misspellings.

# Some misspellings map to several possible corrections, so keep only the first one listed.
spell_check <- fuzzyjoin::misspellings %>% distinct(misspelling, .keep_all = TRUE)

token_1 <-
  token_0 %>%
  left_join(spell_check, by = join_by(word == misspelling)) %>%
  mutate(word = coalesce(correct, word)) %>%
  select(-correct)
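
The de-duplication step matters because a lookup table with repeated keys makes `left_join()` return more rows than it was given, and the comparison further down relies on `token_0` and `token_1` lining up row for row. The following is a minimal sketch on made-up data: `toy_tokens`, `toy_lookup`, and their contents are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy tokens (not from the original data).
toy_tokens <- tibble(word = c("didnt", "helpfull", "room", "abandonned"))

# Hypothetical lookup in which one misspelling has two listed corrections.
toy_lookup <- tibble(
  misspelling = c("didnt", "helpfull", "abandonned", "abandonned"),
  correct     = c("didn't", "helpful", "abandoned", "abandon")
)

# Keep the first correction listed for each misspelling.
toy_lookup_unique <- toy_lookup %>% distinct(misspelling, .keep_all = TRUE)

toy_fixed <-
  toy_tokens %>%
  left_join(toy_lookup_unique, by = join_by(word == misspelling)) %>%
  mutate(word = coalesce(correct, word)) %>%  # fall back to the original word
  select(-correct)

# Without distinct(), the duplicated key would add an extra row.
toy_unsafe <- left_join(toy_tokens, toy_lookup, by = join_by(word == misspelling))

# One output row per input row, so before/after can be compared by position.
stopifnot(nrow(toy_fixed) == nrow(toy_tokens))
stopifnot(nrow(toy_unsafe) == nrow(toy_tokens) + 1)
```

Words with no match in the lookup (here `room`) come back with `NA` in `correct`, which is why `coalesce()` falls back to the original spelling.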

# Only about 0.1% of words were misspelled and corrected.
mean(token_0$word != token_1$word)

## [1] 0.000964401

# Examples.
tibble(before = token_0$word, after = token_1$word) %>%
  filter(before != after) %>%
  count(before, after, sort = TRUE)

## # A tibble: 85 × 3
##    before        after              n
##    <chr>         <chr>          <int>
##  1 didnt         didn't            35
##  2 wasnt         wasn't            16
##  3 definately    definitely         8
##  4 helpfull      helpful            8
##  5 accomodating  accommodating      4
##  6 definetly     definitely         4
##  7 upto          up to              4
##  8 accomodation  accommodation      3
##  9 accomodations accommodations     3
## 10 altho         although           3
## # ℹ 75 more rows