Chapter 1 Data Preparation

This section covers how to prepare a corpus for text analysis. I’ll work with the customer reviews of London-based hotels data set hosted on data.world. The data set contains about 27K reviews of the ten most- and ten least-expensive hotels in London. The CSV file is available online; I saved a copy to my input directory. To speed up the analysis steps, I’ll use a random sample of 1,700 reviews (about 6% of the total).

library(tidyverse)
library(tidytext)
library(janitor)
library(scales)
library(glue)

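# Fix the random seed so the sample drawn below is reproducible
set.seed(12345)

hotel_0 <- 
  read_csv(
    "input/london_hotel_reviews.csv", 
    # One letter per column: c = character, i = integer
    col_types = "cicccc",
    # Replace the file's header row (skipped below) with shorter names
    col_names = c("hotel", "rating", "title", "review", "reviewer_loc", "review_dt"),
    skip = 1
  ) %>%
  # Number the rows before sampling so each review keeps its original position as an id
  mutate(review_id = row_number()) %>%
  # Put the id first and drop title and review_dt
  select(review_id, everything(), -c(title, review_dt)) %>%
  # Keep a random sample of 1,700 reviews
  slice_sample(n = 1700)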

glimpse(hotel_0)

## Rows: 1,700
## Columns: 5
## $ review_id    <int> 14478, 24627, 17104, 25306, 10904, 21306, 605, 14923, 226…
## $ hotel        <chr> "The Savoy", "Ridgemount Hotel", "Apex London Wall Hotel"…
## $ rating       <int> 5, 5, 4, 5, 5, 5, 1, 1, 2, 5, 4, 4, 2, 5, 5, 5, 5, 5, 5, …
## $ review       <chr> "Love Love Love The Savoy. If you are looking for a luxe …
## $ reviewer_loc <chr> "Southend-on-Sea, United Kingdom", "Canberra, Australia",…
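
If you would rather sample a fixed share of the rows than a fixed count, slice_sample() also accepts a prop argument in place of n. The sketch below uses a small made-up table, reviews_toy, as a stand-in for the real file; the table and its values are hypothetical and only there to make the example self-contained.

```r
library(dplyr)

set.seed(12345)

# Hypothetical stand-in for the review table (not the real data)
reviews_toy <- tibble(
  review_id = 1:200,
  rating = sample(1:5, size = 200, replace = TRUE)
)

# Draw 10% of the rows at random, without replacement
reviews_sample <- reviews_toy %>%
  slice_sample(prop = 0.10)

nrow(reviews_sample)
```

With prop, the sample size scales with the input, so the same line keeps working if the source file grows. With n, the sample size stays fixed, which is convenient when later steps assume a known row count.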