3.3 Statistical Test

You can fit an ordinal logistic regression model that predicts the rating from the review sentiment. Which sentiment measure performs better, tidytext or sentimentr? Start with an intercept-only model as a baseline and as a quick review of ordinal logistic regression.

fit_intercept <- ordinal::clm(rating ~ 1, data = hotel_2)

summary(fit_intercept)
## formula: rating ~ 1
## data:    hotel_2
## 
##  link  threshold nobs  logLik    AIC      niter max.grad cond.H 
##  logit flexible  23611 -20916.27 41838.53 6(0)  2.17e-11 1.0e+01
## 
## Threshold coefficients:
##       Estimate Std. Error z value
## 1-2|3 -3.03502    0.03111  -97.56
## 3|4   -2.15562    0.02134 -101.03
## 4|5   -0.80833    0.01409  -57.35

In the cumulative link model fit by ordinal::clm, logit(P(rating <= j)) = theta_j - beta * x, so in an intercept-only model each threshold coefficient theta_j is simply the log-odds of the rating being at or below that level versus above it. Below, 10.4% of ratings were <=3 and 89.6% were >3, a log-odds of log(.1038 / .8962) = -2.156, matching the 3|4 line in the regression summary.

hotel_2 %>% tabyl(rating) %>% mutate(cum = cumsum(percent), `1-cum` = 1 - cum)
##  rating     n    percent        cum     1-cum
##     1-2  1083 0.04586845 0.04586845 0.9541315
##       3  1368 0.05793910 0.10380755 0.8961925
##       4  4827 0.20443861 0.30824616 0.6917538
##       5 16333 0.69175384 1.00000000 0.0000000
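As a quick check, the 3|4 threshold can be recovered from the observed cumulative proportion (a minimal sketch, reusing the hotel_2 data frame and fit_intercept model from above):

# Proportion of reviews rated 3 or lower (rating is a factor with levels
# "1-2", "3", "4", "5").
p_le_3 <- mean(hotel_2$rating %in% c("1-2", "3"))

# Log-odds of being at or below 3; should be close to the fitted 3|4 threshold.
log(p_le_3 / (1 - p_le_3))
coef(fit_intercept)["3|4"]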

Now fit the bing and sentimentr models. The bing model fits better: it has the higher log-likelihood and, equivalently, the lower deviance and AIC.

fit_bing <- ordinal::clm(rating ~ bing_polarity, data = hotel_2)

fit_sentimentr <- ordinal::clm(rating ~ sentimentr_polarity, data = hotel_2)

anova(fit_bing, fit_sentimentr, fit_intercept)
## Likelihood ratio tests of cumulative link models:
##  
##                formula:                     link: threshold:
## fit_intercept  rating ~ 1                   logit flexible  
## fit_bing       rating ~ bing_polarity       logit flexible  
## fit_sentimentr rating ~ sentimentr_polarity logit flexible  
## 
##                no.par   AIC logLik LR.stat df Pr(>Chisq)    
## fit_intercept       3 41839 -20916                          
## fit_bing            4 36771 -18382 5069.10  1  < 2.2e-16 ***
## fit_sentimentr      4 37701 -18847 -929.79  0               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
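Note that the likelihood ratio test only applies to nested models. The bing and sentimentr models are not nested in each other, which is why the last line of the anova table shows a negative LR statistic on 0 degrees of freedom; a direct AIC comparison is the more appropriate way to rank them (a minimal sketch, reusing the fitted models from above):

# Lower AIC indicates the better-fitting model.
AIC(fit_intercept, fit_bing, fit_sentimentr)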

How about predictive performance? The two models perform about the same: each predicts the correct rating for roughly 69% of reviews, and neither ever predicts a rating of 3.

bing_conf <- 
  fit_bing %>% 
  augment(type = "class") %>% 
  conf_mat(truth = rating, estimate = .fitted)

sentimentr_conf <- 
  fit_sentimentr %>% 
  augment(type = "class") %>% 
  conf_mat(truth = rating, estimate = .fitted)

bind_rows(
  bing = as_tibble(bing_conf$table),
  sentimentr = as_tibble(sentimentr_conf$table), 
  .id = "lexicon"
) %>%
  pivot_wider(names_from = Truth, values_from = n) %>%
  flextable::flextable() %>%
  flextable::merge_v(j = 1) %>%
  flextable::valign(j = 1, valign = "top")

Confusion matrices (rows = predicted rating, columns = true rating):

lexicon     Prediction   1-2     3      4       5
bing        1-2          231    54     31      16
            3              0     0      0       0
            4            587   467    567     816
            5            265   847  4,229  15,501
sentimentr  1-2          122    22      7       9
            3              0     0      0       0
            4            587   426    381     494
            5            374   920  4,439  15,830
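As a sanity check on the table above, overall accuracy is just the sum of the confusion matrix diagonal divided by the total number of reviews (a minimal sketch, using the bing_conf object created above):

# Correct predictions (diagonal) over all predictions; about 0.69 for bing.
sum(diag(bing_conf$table)) / sum(bing_conf$table)

summary() on a conf_mat object computes a fuller set of metrics; combine the two lexicons for a side-by-side comparison.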

bind_rows(
  bing = summary(bing_conf),
  sentimentr = summary(sentimentr_conf),
  .id = "lexicon"
) %>%
  pivot_wider(names_from = lexicon, values_from = .estimate)
## # A tibble: 13 × 4
##    .metric              .estimator   bing sentimentr
##    <chr>                <chr>       <dbl>      <dbl>
##  1 accuracy             multiclass  0.690     0.692 
##  2 kap                  multiclass  0.158     0.123 
##  3 sens                 macro       0.320     0.290 
##  4 spec                 macro       0.791     0.783 
##  5 ppv                  macro      NA        NA     
##  6 npv                  macro       0.851     0.863 
##  7 mcc                  multiclass  0.184     0.158 
##  8 j_index              macro       0.110     0.0728
##  9 bal_accuracy         macro       0.555     0.536 
## 10 detection_prevalence macro       0.25      0.25  
## 11 precision            macro       0.557     0.566 
## 12 recall               macro       0.320     0.290 
## 13 f_meas               macro       0.439     0.382