1.1 Scrub
The data needs to be cleaned. I’ll follow some of the techniques used by Nagelkerke (2020). One issue is tags like <e9> and Unicode markers like <U+0440> embedded in the review text. One way to get rid of non-ASCII characters is to convert them to ASCII byte tags with iconv() and then remove those tags with str_remove_all(). E.g., iconv() converts <U+0093> to <93>, which you can remove with the regex "\\<[[:alnum:]]+\\>".1 There are also some reviews in other languages that I’ll just drop. And some hotel names are pretty long, so I’ll abbreviate them.
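As a quick illustration of that two-step trick, here is a toy example on a made-up string (not from the review data); note it specifies from = "UTF-8" explicitly rather than the native encoding:

```r
library(stringr)

x <- "caf\u00e9 was great"
# iconv() with sub = "byte" replaces each non-ASCII byte
# with a hex tag like <c3><a9>.
x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
# Then strip the tags.
str_remove_all(x, "\\<[[:alnum:]]+\\>")
#> [1] "caf was great"
```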
```r
hotel_1 <- hotel_0 %>%
  mutate(
    # Create ASCII bytes
    review = iconv(review, from = "", to = "ASCII", sub = "byte"),
    # Remove <..>
    review = str_remove_all(review, "\\<[[:alnum:]]+\\>"),
    # Remove <U+....>
    review = str_remove_all(review, "\\<U\\+[[:alnum:]]{4}\\>"),
    # Only keep letters, numbers, apostrophes, and whitespace.
    review = str_remove_all(review, "[^[:alnum:][\\s][\\']]"),
    review = str_squish(review),
    # Shorten some of the hotel names.
    hotel = str_remove_all(
      hotel,
      "( - .*)|(, .*)|( Hotel)|( London)|(The )|( at .*)|( Hyde .*)|( Knights.*)"
    ),
    hotel = factor(hotel, ordered = TRUE),
    # Reduce the number of hotels for modeling simplicity.
    hotel = fct_lump_prop(hotel, prop = .05),
    # Bin common locations.
    reviewer_loc = factor(case_when(
      str_detect(reviewer_loc, "(London)|(United Kingdom)|(UK)") ~ "United Kingdom",
      str_detect(reviewer_loc, "(New York)|(California)") ~ "United States",
      TRUE ~ "Other"
    )),
    # Low ratings are so rare, lump the bottom two.
    rating = fct_collapse(as.character(rating), `1-2` = c("1", "2")),
    # Interesting metadata
    raw_chrcnt = str_length(review)
  ) %>%
  # Exclude reviews written in a foreign language. One heuristic to handle this
  # is to look for words common in other languages that do not also occur in English.
  filter(
    !str_detect(review, "( das )|( der )|( und )|( en )"),  # German
    !str_detect(review, "( et )|( de )|( le )|( les )"),    # French
    !str_detect(review, "( di )|( e )|( la )"),             # Italian
    !str_detect(review, "( un )|( y )"),                    # Spanish
    raw_chrcnt > 0
  )
```
That might be enough. Let’s explore the data. We have 9 hotels. Reviewers are binned into 3 locations. Nearly 90% of reviews rate the property a 4 or 5. Some reviews are as short as 1 character, but they can get quite long.
| Name | hotel_1 |
|---|---|
| Number of rows | 1448 |
| Number of columns | 6 |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| numeric | 2 |
| Group variables | None |
Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| review | 0 | 1 | 1 | 7157 | 0 | 1448 | 0 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| hotel | 0 | 1 | TRUE | 10 | Sav: 307, Mon: 237, Rem: 176, Cor: 156 |
| rating | 0 | 1 | FALSE | 4 | 5: 959, 4: 299, 3: 115, 1-2: 75 |
| reviewer_loc | 0 | 1 | FALSE | 3 | Oth: 830, Uni: 549, Uni: 69 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| review_id | 0 | 1 | 13972.61 | 7879.17 | 34 | 7063.5 | 14472 | 20657 | 27295 | ▇▇▇▇▇ |
| raw_chrcnt | 0 | 1 | 712.05 | 653.40 | 1 | 308.0 | 521 | 870 | 7157 | ▇▁▁▁▁ |
Nagelkerke (2020) recommends removing punctuation to focus on the text as a whole rather than the sentences within it. Nagelkerke also suggests removing very short reviews (<= 3 characters) for anything other than sentiment analysis. I’m going to keep punctuation and short reviews for now, even though some of those extremely short reviews are gibberish.
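For reference, if you did want to follow those suggestions, a minimal sketch might look like the following (hotel_alt is a hypothetical name; this is not applied in my pipeline):

```r
# Alternative scrub following Nagelkerke's suggestions:
# strip punctuation, then drop very short reviews.
hotel_alt <- hotel_1 %>%
  mutate(review = str_remove_all(review, "[[:punct:]]")) %>%
  filter(str_length(review) > 3)
```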
References
1. More help with regex on RStudio’s cheat sheets.↩︎