1.1 Scrub

The data needs cleaning, and I’ll follow some of the techniques used by Nagelkerke (2020). One issue is tags like <e9> and Unicode characters like <U+0440>. One way to get rid of the Unicode characters is to convert them to ASCII byte tags with iconv() and then remove those tags with str_remove_all(). E.g., iconv() converts <U+0093> to <93>, which you can then remove with the regex "\\<[[:alnum:]]+\\>".1 There are also some reviews in other languages that I’ll just drop, and some hotel names are quite long, so I’ll abbreviate them.
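Here is a minimal sketch of that two-step approach on a made-up string. The exact byte tags iconv() emits depend on your locale and the string’s encoding (a UTF-8 locale may yield two tags like <c2><93> instead of one), so treat the <93> below as illustrative.

library(tidyverse)  # stringr, dplyr, and forcats are used throughout

x <- "Lovely stay \u0093 great view"                  # hypothetical review text
x <- iconv(x, from = "", to = "ASCII", sub = "byte")  # e.g., "Lovely stay <93> great view"
str_remove_all(x, "\\<[[:alnum:]]+\\>")               # tags removed; str_squish() cleans leftover spaces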

hotel_1 <- hotel_0 %>%
  mutate(
    # Create ASCII bytes
    review = iconv(review, from = "", to = "ASCII", sub = "byte"),
    # Remove <..>
    review = str_remove_all(review, "\\<[[:alnum:]]+\\>"),
    # Remove <U+....>
    review = str_remove_all(review, "\\<U\\+[[:alnum:]]{4}\\>"),
    # Keep only letters, numbers, apostrophes, and whitespace.
    review = str_remove_all(review, "[^[:alnum:]\\s']"),
    review = str_squish(review),
    # Shorten some of the hotel names.
    hotel = str_remove_all(
      hotel, 
      "( - .*)|(, .*)|( Hotel)|( London)|(The )|( at .*)|( Hyde .*)|( Knights.*)"
    ), 
    hotel = factor(hotel, ordered = TRUE),
    # Reducing number of hotels for modeling simplicity.
    hotel = fct_lump_prop(hotel, prop = .05),
    # Bin common reviewer locations.
    reviewer_loc = factor(case_when(
      str_detect(reviewer_loc, "(London)|(United Kingdom)|(UK)") ~ "United Kingdom",
      str_detect(reviewer_loc, "(New York)|(California)") ~ "United States",
      TRUE ~ "Other"
    )),
    # Low ratings are so rare, lump the bottom two.
    rating = fct_collapse(as.character(rating), `1-2` = c("1", "2")),
    # Interesting metadata
    raw_chrcnt = str_length(review)
  ) %>%
  # Exclude reviews written in a foreign language. One heuristic to handle this 
  # is to look for words common in other languages that do not also occur in English.
  filter(
    !str_detect(review, "( das )|( der )|( und )|( en )"), # German, Dutch
    !str_detect(review, "( et )|( de )|( le )|( les )"),   # French
    !str_detect(review, "( di )|( e )|( la )"),            # Italian
    !str_detect(review, "( un )|( y )"),                   # Spanish
    raw_chrcnt > 0
  ) 
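As a quick sanity check on the language and empty-review filters, comparing row counts shows how many reviews they discarded (hotel_0 is the raw data loaded earlier):

# Number of reviews dropped by the filters above.
nrow(hotel_0) - nrow(hotel_1)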

That might be enough. Let’s explore the data. We have nine hotels (plus a lumped Other level). Reviewers are binned into three locations. About 87% of reviews rate the property a 4 or 5. The shortest review is a single character, but the longest runs to over 7,000.

skimr::skim(hotel_1)
Table 1.1: Data summary

Name                    hotel_1
Number of rows          1448
Number of columns       6
_______________________
Column type frequency:
  character             1
  factor                3
  numeric               2
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min   max  empty  n_unique  whitespace
review                 0              1    1  7157      0      1448           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
hotel                  0              1  TRUE           10  Sav: 307, Mon: 237, Rem: 176, Cor: 156
rating                 0              1  FALSE           4  5: 959, 4: 299, 3: 115, 1-2: 75
reviewer_loc           0              1  FALSE           3  Oth: 830, Uni: 549, Uni: 69

Variable type: numeric

skim_variable  n_missing  complete_rate      mean       sd  p0     p25    p50    p75   p100  hist
review_id              0              1  13972.61  7879.17  34  7063.5  14472  20657  27295  ▇▇▇▇▇
raw_chrcnt             0              1    712.05   653.40   1   308.0    521    870   7157  ▇▁▁▁▁
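The ratings skew is easy to verify from the factor counts above (a sketch; the arithmetic just restates the skim output):

hotel_1 %>%
  count(rating) %>%
  mutate(prop = n / sum(n))
# 4s and 5s together: (299 + 959) / 1448, roughly 87% of reviews.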

Nagelkerke (2020) recommends removing punctuation to focus on the entire text rather than the sentences within, and also suggests removing very short reviews (3 characters or fewer) for anything other than sentiment analysis. The pipeline above already strips most punctuation (everything except apostrophes), and I’m going to keep the short reviews for now, even though some of those extremely short reviews are gibberish.
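To eyeball those gibberish short reviews, or to apply Nagelkerke’s cutoff later, something like this works (a sketch; the 3-character threshold comes from the recommendation above):

# Inspect the very short reviews before deciding to keep them.
hotel_1 %>%
  filter(raw_chrcnt <= 3) %>%
  select(hotel, rating, review)

# Nagelkerke's alternative for non-sentiment tasks: drop them.
# hotel_1 <- hotel_1 %>% filter(raw_chrcnt > 3)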

References

Nagelkerke, Jurriaan. 2020. “NLP with R Part 0: Preparing Review Data for NLP and Predictive Modeling.” Medium, November. https://medium.com/cmotions/nlp-with-r-part-0-preparing-review-data-for-nlp-and-predictive-modeling-c1f2907d8312.

  1. For more help with regular expressions, see RStudio’s cheat sheets.↩︎