1.1 Scrub
The data needs to be cleaned. I’ll follow some of the techniques used by Nagelkerke (2020). One issue is tags like <e9> and Unicode markers like <U+0440> embedded in the review text. One way to get rid of non-ASCII characters is to convert them to ASCII byte tags with iconv() and then remove those tags with str_remove_all(). E.g., iconv() converts <U+0093> to <93>, which you can remove with the regex "\\<[[:alnum:]]+\\>".1 There are also some reviews in other languages that I’ll just drop. And some hotel names are pretty long, so I’ll abbreviate them.
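As a quick illustration of that two-step trick, here is a toy example on a made-up string (not from the review data); note it specifies from = "UTF-8" explicitly rather than the native encoding:

```r
library(stringr)

x <- "caf\u00e9 was great"
# iconv() with sub = "byte" replaces each non-ASCII byte
# with a hex tag like <c3><a9>.
x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
# Then strip the tags.
str_remove_all(x, "\\<[[:alnum:]]+\\>")
#> [1] "caf was great"
```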
```r
hotel_1 <- hotel_0 %>%
  mutate(
    # Create ASCII bytes
    review = iconv(review, from = "", to = "ASCII", sub = "byte"),
    # Remove <..>
    review = str_remove_all(review, "\\<[[:alnum:]]+\\>"),
    # Remove <U+....>
    review = str_remove_all(review, "\\<U\\+[[:alnum:]]{4}\\>"),
    # Only keep letters, numbers, apostrophes, and whitespace.
    review = str_remove_all(review, "[^[:alnum:][\\s][\\']]"),
    review = str_squish(review),
    # Shorten some of the hotel names.
    hotel = str_remove_all(
      hotel,
      "( - .*)|(, .*)|( Hotel)|( London)|(The )|( at .*)|( Hyde .*)|( Knights.*)"
    ),
    hotel = factor(hotel, ordered = TRUE),
    # Reduce the number of hotels for modeling simplicity.
    hotel = fct_lump_prop(hotel, prop = .05),
    # Bin common locations.
    reviewer_loc = factor(case_when(
      str_detect(reviewer_loc, "(London)|(United Kingdom)|(UK)") ~ "United Kingdom",
      str_detect(reviewer_loc, "(New York)|(California)") ~ "United States",
      TRUE ~ "Other"
    )),
    # Low ratings are so rare, lump the bottom two.
    rating = fct_collapse(as.character(rating), `1-2` = c("1", "2")),
    # Interesting metadata
    raw_chrcnt = str_length(review)
  ) %>%
  # Exclude reviews written in a foreign language. One heuristic to handle this
  # is to look for words common in other languages that do not also occur in English.
  filter(
    !str_detect(review, "( das )|( der )|( und )|( en )"),  # German
    !str_detect(review, "( et )|( de )|( le )|( les )"),    # French
    !str_detect(review, "( di )|( e )|( la )"),             # Italian
    !str_detect(review, "( un )|( y )"),                    # Spanish
    raw_chrcnt > 0
  )
```
That might be enough. Let’s explore the data. We have 9 hotels. Reviewers are binned into 3 locations. Nearly 90% of reviews rate the property a 4 or 5. Some reviews are as short as 1 character, but they can get quite long.
| Name | hotel_1 |
|---|---|
| Number of rows | 1448 |
| Number of columns | 6 |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| numeric | 2 |
| Group variables | None |
Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| review | 0 | 1 | 1 | 7157 | 0 | 1448 | 0 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| hotel | 0 | 1 | TRUE | 10 | Sav: 307, Mon: 237, Rem: 176, Cor: 156 |
| rating | 0 | 1 | FALSE | 4 | 5: 959, 4: 299, 3: 115, 1-2: 75 |
| reviewer_loc | 0 | 1 | FALSE | 3 | Oth: 830, Uni: 549, Uni: 69 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| review_id | 0 | 1 | 13972.61 | 7879.17 | 34 | 7063.5 | 14472 | 20657 | 27295 | ▇▇▇▇▇ |
| raw_chrcnt | 0 | 1 | 712.05 | 653.40 | 1 | 308.0 | 521 | 870 | 7157 | ▇▁▁▁▁ |
Nagelkerke (2020) recommends removing punctuation to focus on the text as a whole rather than the sentences within it. Nagelkerke also suggests removing very short reviews (<= 3 characters) for anything other than sentiment analysis. I’m going to keep punctuation and short reviews for now, even though some of those extremely short reviews are gibberish.
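For reference, if you did want to follow those suggestions, a minimal sketch might look like the following (hotel_alt is a hypothetical name; this is not applied in my pipeline):

```r
# Alternative scrub following Nagelkerke's suggestions:
# strip punctuation, then drop very short reviews.
hotel_alt <- hotel_1 %>%
  mutate(review = str_remove_all(review, "[[:punct:]]")) %>%
  filter(str_length(review) > 3)
```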
References
1. More help with regex on RStudio’s cheat sheets.↩︎