1.3 Spell-check
Run a spell-check to regularize the tokens. A lookup-based correction can occasionally land on the wrong word, but there is probably more to gain than to lose. Only a very small fraction of tokens (about 0.1%) matched a known misspelling.
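# A misspelling can have several candidate corrections, so keep only the first one listed.
# One row per misspelling also stops the join from duplicating tokens.
spell_check <-
  fuzzyjoin::misspellings %>%
  distinct(misspelling, .keep_all = TRUE)

# Replace each token with its correction when one exists; otherwise keep the token.
# join_by() requires dplyr >= 1.1.0.
token_1 <-
  token_0 %>%
  left_join(spell_check, by = join_by(word == misspelling)) %>%
  mutate(word = coalesce(correct, word)) %>%
  select(-correct)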
# Only about 0.1% of words were changed by the spell-check.
mean(token_0$word != token_1$word)
## [1] 0.000964401
# Examples.
tibble(before = token_0$word, after = token_1$word) %>%
  filter(before != after) %>%
  count(before, after, sort = TRUE)
## # A tibble: 85 × 3
##    before        after              n
##    <chr>         <chr>          <int>
##  1 didnt         didn't            35
##  2 wasnt         wasn't            16
##  3 definately    definitely         8
##  4 helpfull      helpful            8
##  5 accomodating  accommodating      4
##  6 definetly     definitely         4
##  7 upto          up to              4
##  8 accomodation  accommodation      3
##  9 accomodations accommodations     3
## 10 altho         although           3
## # ℹ 75 more rows
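To make the lookup-and-replace pattern in this section easier to follow, here is a minimal sketch on toy data. The objects `toy_tokens`, `toy_lookup`, and `toy_fixed` are hypothetical and invented for illustration; they are not part of the original analysis, and the duplicate correction for "helpfull" is made up to show why the lookup is deduplicated first.

```r
library(dplyr)

# Hypothetical toy tokens (not from the original data).
toy_tokens <- tibble(word = c("definately", "great", "helpfull", "stay"))

# Hypothetical lookup with two candidate corrections for "helpfull".
toy_lookup <-
  tibble(
    misspelling = c("definately", "helpfull", "helpfull"),
    correct     = c("definitely", "helpful", "helpfully")
  ) %>%
  distinct(misspelling, .keep_all = TRUE)  # keep the first correction per misspelling

# Same pattern as the section's code: join, prefer the correction, drop the helper column.
toy_fixed <-
  toy_tokens %>%
  left_join(toy_lookup, by = join_by(word == misspelling)) %>%
  mutate(word = coalesce(correct, word)) %>%
  select(-correct)

toy_fixed$word
```

Because the lookup has one row per misspelling, `toy_fixed` has the same number of rows as `toy_tokens`, which is what makes an element-wise before/after comparison valid. Without the `distinct()` step, the join would return two rows for "helpfull" and the two vectors would no longer line up.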