Chapter 1 Data Preparation
This section covers how to prepare a corpus for text analysis. I’ll work with a data set of customer reviews of London-based hotels hosted on data.world. The raw data contains about 27K reviews of the ten most- and ten least-expensive hotels in London. The CSV file is available online; I saved it to my input directory. To speed up the analysis steps, I’ll work with a random sample of 1,700 reviews (roughly 6% of the total) rather than the full set.
library(tidyverse)

set.seed(12345)  # make the random sample reproducible

hotel_0 <-
  read_csv(
    "input/london_hotel_reviews.csv",
    col_types = "cicccc",
    col_names = c("hotel", "rating", "title", "review", "reviewer_loc", "review_dt"),
    skip = 1  # skip the file's own header row, since col_names supplies the names
  ) %>%
  mutate(review_id = row_number()) %>%
  select(review_id, everything(), -c(title, review_dt)) %>%
  slice_sample(n = 1700)

glimpse(hotel_0)
## Rows: 1,700
## Columns: 5
## $ review_id <int> 14478, 24627, 17104, 25306, 10904, 21306, 605, 14923, 226…
## $ hotel <chr> "The Savoy", "Ridgemount Hotel", "Apex London Wall Hotel"…
## $ rating <int> 5, 5, 4, 5, 5, 5, 1, 1, 2, 5, 4, 4, 2, 5, 5, 5, 5, 5, 5, …
## $ review <chr> "Love Love Love The Savoy. If you are looking for a luxe …
## $ reviewer_loc <chr> "Southend-on-Sea, United Kingdom", "Canberra, Australia",…
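The chunk above samples a fixed number of rows with n = 1700. slice_sample() also accepts a prop argument, which is handy when the sample should track the size of the data. Here is a minimal sketch of that alternative, assuming a toy stand-in table (toy_reviews is a hypothetical name and made-up data, not the hotel file):

```r
library(dplyr)

# Hypothetical stand-in for the reviews table: 200 rows of made-up ratings
toy_reviews <- tibble(
  review_id = 1:200,
  rating    = rep(1:5, times = 40)
)

# Sample 10% of the rows; the seed makes the draw reproducible
set.seed(12345)
toy_sample <- slice_sample(toy_reviews, prop = 0.10)

# Re-running with the same seed returns the same rows
set.seed(12345)
toy_sample_again <- slice_sample(toy_reviews, prop = 0.10)

nrow(toy_sample)
identical(toy_sample, toy_sample_again)
```

Sampling by proportion keeps the code unchanged if the source file grows or shrinks, whereas a fixed n keeps run time predictable. Either way, calling set.seed() immediately before the draw is what makes the subset repeatable.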