3.3 Statistical Test
You can fit an ordinal logistic regression model that predicts the rating from the review sentiment. Which performs better, the tidytext (bing) polarity score or the sentimentr score? Start with an intercept-only model as a baseline and a quick review of ordinal logistic regression.
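The baseline summarized below can be fit with `ordinal::clm()`; a minimal sketch, assuming the same call pattern used for the lexicon models further down:

fit_intercept <- ordinal::clm(rating ~ 1, data = hotel_2)
summary(fit_intercept)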
## formula: rating ~ 1
## data: hotel_2
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 23611 -20916.27 41838.53 6(0) 2.17e-11 1.0e+01
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1-2|3 -3.03502 0.03111 -97.56
## 3|4 -2.15562 0.02134 -101.03
## 4|5 -0.80833 0.01409 -57.35
The threshold coefficients in the summary table are the log-odds of the outcome falling at or below a given level versus above it. In the table below, 10.4% of ratings were <= 3 and 89.6% were > 3, for a log-odds of log(.1038/.8962) = -2.1557, corresponding to the 3|4 line in the regression summary.
## rating n percent cum 1-cum
## 1-2 1083 0.04586845 0.04586845 0.9541315
## 3 1368 0.05793910 0.10380755 0.8961925
## 4 4827 0.20443861 0.30824616 0.6917538
## 5 16333 0.69175384 1.00000000 0.0000000
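A minimal sketch of how this proportion table can be reproduced with dplyr, assuming `hotel_2$rating` is a factor with the four levels shown; the `log_odds` column is added here only to show the link to the threshold coefficients:

library(dplyr)

hotel_2 %>%
  count(rating) %>%
  mutate(
    percent = n / sum(n),
    cum = cumsum(percent),        # P(rating <= level)
    `1-cum` = 1 - cum,            # P(rating > level)
    log_odds = log(cum / `1-cum`) # threshold log-odds (Inf for the top level)
  )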
Now fit the bing and sentimentr models. The bing model fits better, with the higher log-likelihood and lower AIC.
fit_bing <- ordinal::clm(rating ~ bing_polarity, data = hotel_2)
fit_sentimentr <- ordinal::clm(rating ~ sentimentr_polarity, data = hotel_2)
anova(fit_bing, fit_sentimentr, fit_intercept)
## Likelihood ratio tests of cumulative link models:
##
## formula: link: threshold:
## fit_intercept rating ~ 1 logit flexible
## fit_bing rating ~ bing_polarity logit flexible
## fit_sentimentr rating ~ sentimentr_polarity logit flexible
##
## no.par AIC logLik LR.stat df Pr(>Chisq)
## fit_intercept 3 41839 -20916
## fit_bing 4 36771 -18382 5069.10 1 < 2.2e-16 ***
## fit_sentimentr 4 37701 -18847 -929.79 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
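Note that fit_bing and fit_sentimentr have the same number of parameters, so the likelihood ratio line comparing them (negative LR statistic, zero degrees of freedom) is not a meaningful nested test. A direct AIC comparison, sketched below, ranks the models more plainly.

# Lower AIC is better; values match the anova() table above
AIC(fit_intercept, fit_bing, fit_sentimentr)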
How about predictive performance? Overall accuracy is nearly identical, although bing does somewhat better on the class-balanced metrics (kappa, MCC, balanced accuracy).
bing_conf <-
  fit_bing %>%
  augment(type = "class") %>%
  conf_mat(truth = rating, estimate = .fitted)

sentimentr_conf <-
  fit_sentimentr %>%
  augment(type = "class") %>%
  conf_mat(truth = rating, estimate = .fitted)
bind_rows(
  bing = as_tibble(bing_conf$table),
  sentimentr = as_tibble(sentimentr_conf$table),
  .id = "lexicon"
) %>%
  pivot_wider(names_from = Truth, values_from = n) %>%
  flextable::flextable() %>%
  flextable::merge_v(j = 1) %>%
  flextable::valign(j = 1, valign = "top")
lexicon | Prediction | 1-2 | 3 | 4 | 5 |
---|---|---|---|---|---|
bing | 1-2 | 231 | 54 | 31 | 16 |
bing | 3 | 0 | 0 | 0 | 0 |
bing | 4 | 587 | 467 | 567 | 816 |
bing | 5 | 265 | 847 | 4,229 | 15,501 |
sentimentr | 1-2 | 122 | 22 | 7 | 9 |
sentimentr | 3 | 0 | 0 | 0 | 0 |
sentimentr | 4 | 587 | 426 | 381 | 494 |
sentimentr | 5 | 374 | 920 | 4,439 | 15,830 |
bind_rows(
  bing = summary(bing_conf),
  sentimentr = summary(sentimentr_conf),
  .id = "lexicon"
) %>%
  pivot_wider(names_from = lexicon, values_from = .estimate)
## # A tibble: 13 × 4
## .metric .estimator bing sentimentr
## <chr> <chr> <dbl> <dbl>
## 1 accuracy multiclass 0.690 0.692
## 2 kap multiclass 0.158 0.123
## 3 sens macro 0.320 0.290
## 4 spec macro 0.791 0.783
## 5 ppv macro NA NA
## 6 npv macro 0.851 0.863
## 7 mcc multiclass 0.184 0.158
## 8 j_index macro 0.110 0.0728
## 9 bal_accuracy macro 0.555 0.536
## 10 detection_prevalence macro 0.25 0.25
## 11 precision macro 0.557 0.566
## 12 recall macro 0.320 0.290
## 13 f_meas macro 0.439 0.382
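As a sanity check, the accuracy values can be recovered from the diagonals of the confusion matrices. The NA for macro ppv arises because neither model ever predicts the 3 class, leaving class-3 precision undefined. A sketch:

# Proportion of reviews on the diagonal of each confusion matrix
sum(diag(bing_conf$table)) / sum(bing_conf$table)              # ~0.690
sum(diag(sentimentr_conf$table)) / sum(sentimentr_conf$table)  # ~0.692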