1.7 Bigrams
If you intend to present bigrams, don’t simply tokenize the raw text, or the prepped text with stop words already removed, into bigrams. Tokenizing the raw text leaves stop words in the bigrams, and tokenizing text from which stop words have been removed pairs words that were not actually adjacent in the original. Instead, tokenize into bigrams, split each bigram into its two words, and filter out rows where one or both words is a stop word.
# Reassemble token_2 into text and re-tokenize so you get the spelling corrections.
bigram_0 <-
  token_2 %>%
  summarize(.by = review_id, reconstructed = paste(word, collapse = " ")) %>%
  unnest_tokens("bigram", reconstructed, token = "ngrams", n = 2)
# Remove bigrams where one or both words are stop words.
bigram <-
  bigram_0 %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  anti_join(stop, by = join_by(word1 == word)) %>%
  anti_join(stop, by = join_by(word2 == word)) %>%
  mutate(bigram = paste(word1, word2)) %>%
  select(review_id, bigram)
# Example: show the first review next to the bigrams extracted from it.
bind_cols(
  hotel_2 %>% filter(review_id == hotel_2[1, ]$review_id) %>% select(review),
  bigram %>% filter(review_id == hotel_2[1, ]$review_id) %>%
    summarize(bigrams = paste(bigram, collapse = "\n"))
) %>%
  flextable::flextable() %>%
  flextable::autofit() %>%
  flextable::width(j = 1, width = 4.5, unit = "in") %>%
  flextable::width(j = 2, width = 1.5, unit = "in") %>%
  flextable::valign(valign = "top")
review | bigrams |
---|---|
Love Love Love The Savoy If you are looking for a luxe hotel that isn't stuffy this is the place for you You feel like a millionaire even if you're not one and it's a great place to hang out if you fancy keeping off the streets of London for a while Classy decor wonderful food and cocktails and the staff are wonderful it is an absolute allrounder and if I could afford it I would probably live there | love love |
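
To see why the order of operations matters, here is a minimal sketch on a single made-up sentence. The `toy` data frame and the three-word `toy_stop` list are hypothetical and exist only for illustration; they are not the chapter's `hotel_2` or `stop` objects. The sketch compares removing stop words before forming bigrams with forming bigrams first and filtering afterwards.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-row data set and a tiny stop word list (illustration only).
toy <- tibble(review_id = 1L, review = "the hotel staff were wonderful at the bar")
toy_stop <- tibble(word = c("the", "were", "at"))

# Approach A (problematic): drop stop words first, then form bigrams.
# Words that were never adjacent end up paired.
wrong <- toy %>%
  unnest_tokens(word, review) %>%
  anti_join(toy_stop, by = "word") %>%
  summarize(.by = review_id, text = paste(word, collapse = " ")) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Approach B: form bigrams first, then drop any bigram containing a stop word.
right <- toy %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% toy_stop$word, !word2 %in% toy_stop$word) %>%
  unite(bigram, word1, word2, sep = " ")

wrong$bigram
# "hotel staff" "staff wonderful" "wonderful bar"
right$bigram
# "hotel staff"
```

Approach A produces "staff wonderful" and "wonderful bar", neither of which appears in the sentence. Approach B keeps only "hotel staff", the one pair of non-stop words that really was adjacent.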