1 Frequentist Statistics
1.1 Central Tendency and Dispersion
Suppose you have a data set with \(j = 1, \ldots, p\) variables. Each variable has a distribution that can be characterized by its mean and variance, and considering the variables together lets you see how they covary. For the matrix algebra that follows, assume the data are organized in rows, so \(X_j\) is a row vector of \(n\) observations and \(X_{ij}\) refers to observation \(i\) of variable \(j\).
The mean of row vector \(X_j\) is \(\bar{x}_j = \frac{1}{n} \sum_{i = 1}^n X_{ij}\). \(\bar{x}_j\) estimates the population mean, \(\mu_j = E(X_j)\). The collection of means forms a column vector.
\[\boldsymbol{\bar{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}\]
The variance of row vector \(X_j\) is the average squared difference from the mean, \(s_j^2 = \frac{1}{n-1} \sum_{i=1}^n (X_{ij} - \bar{x}_j)^2\). \(s_j^2\) estimates the population variance, \(\sigma_j^2 = E(X_j - \mu_j)^2\). Again, the collection is represented as a column vector,
\[\boldsymbol{s}^2 = \begin{pmatrix} s_1^2 \\ s_2^2 \\ \vdots \\ s_p^2 \end{pmatrix}\]
The square root of \(s_j^2\) is called the standard deviation, \(s_j\). The concept of variance extends to pairs of variables, \(j\) and \(k\). The covariance of \(X_j\) and \(X_k\) is the average product of their differences from their respective means, \(s_{jk} = \frac{1}{n-1} \sum_{i=1}^n (X_{ij} - \bar{x}_j) (X_{ik} - \bar{x}_k)\). \(s_{jk}\) estimates the population covariance, \(\sigma_{jk} = E\{ (X_j - \mu_j) (X_k - \mu_k)\}\). The sign of \(s_{jk}\) tells you how the variables relate: it is positive if one variable tends to be above its mean when the other is, and zero if the value of one variable tells you nothing about the other. It can be shown that \(s_{jk}\) is equivalently expressed as
\[ s_{jk} = \frac{1}{n-1} \left[ \sum_{i=1}^n X_{ij}X_{ik} - \frac{\sum_{i = 1}^n X_{ij} \sum_{i = 1}^n X_{ik}}{n} \right] \]
This is how it is actually calculated. The first term is the dot product, \(X_j \cdot X_k\). The second term is the product of the sums divided by \(n\). Matrix algebra generalizes this across all \(p\) variables to form the variance-covariance matrix.
\[ \begin{align} S &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x}) (X_i - \bar{x})' \\ &= \frac{1}{n-1} \left[ \sum_{i=1}^n X_i X_i^{'} - \frac{\left(\sum_{i = 1}^n X_i\right) \left(\sum_{i = 1}^n X_i\right)'}{n} \right] \end{align} \]
\(S\) estimates the population variance-covariance matrix, \(\boldsymbol{\Sigma}\). Divide each covariance by the product of the corresponding standard deviations to get the correlation, \(r_{jk} = \frac{s_{jk}}{s_j s_k}\). \(r_{jk}\) estimates the population correlation, \(\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}\).
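Here is a minimal R sketch of these estimates with made-up data. Note that R's cov() and cor() expect variables in columns, the transpose of the row layout used in the matrix algebra above; the manual calculation at the end applies the computational form of \(S\) and reproduces cov().

# Minimal sketch with made-up data: p = 3 variables in columns, n = 100 rows.
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 3), ncol = 3)

x_bar <- colMeans(X)  # vector of sample means
S <- cov(X)           # variance-covariance matrix (n - 1 denominator)
R <- cor(X)           # correlation matrix, r_jk = s_jk / (s_j * s_k)

# The computational form of S gives the same answer.
S_alt <- (t(X) %*% X - n * outer(x_bar, x_bar)) / (n - 1)
all.equal(S, S_alt, check.attributes = FALSE)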
1.2 Hypothesis Testing
P-values express how surprising a summary measure is given the null hypothesis (H0). Suppose you hypothesize that IQs have increased from their intended mean of \(\mu\) = 100, \(\sigma\) = 15. H0 is \(\mu_0\) = 100 with alternative H1 that \(\mu\) > 100. Suppose also that you are right: \(\mu\) is actually 106. The presumed and actual distributions of IQs would look like this:
According to the Central Limit Theorem (CLT), repeated samples of size n from a large population will yield \(\bar{x}\) values that approach a normal distribution centered at \(\mu\) with a standard deviation equal to the standard error, \(SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{30}}\) = 2.7 for samples of size n = 30. Under H0, repeated sample means would be distributed \(\sim N(100, 2.7^2).\)
Suppose you take a random sample of n = 30 IQs from the population and measure \(\bar{x}\) = 104, SD = 15.2. How surprising is this if H0 is true? I.e., what is the probability of observing an \(\bar{x}\) of 104 if \(\mu\) is 100 and \(\sigma\) is 15? The situation looks like this:
The \(\alpha\) region starts at qnorm(.05, 100, 2.7, lower.tail = FALSE) = 104.5. \(\bar{x}\) would have had to be at least that large to reject H0. The probability of measuring \(\bar{x} \ge 104\) is derived from the z score, \(z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{104 - 100}{2.7}\) = 1.5. The probability of measuring a z score that high given H0 is \(P(z \ge 1.5)\) = 0.072, meaning 7.2% of the area under the null distribution is to the right of 104. Therefore, do not reject H0 at the \(\alpha\) = .05 level of significance.
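These numbers are easy to reproduce in R. A small sketch (using the unrounded standard error, which is why the z score comes out just under 1.5):

mu_0 <- 100; sigma <- 15; n <- 30; x_bar <- 104
se <- sigma / sqrt(n)                                 # 2.74

qnorm(.05, mean = mu_0, sd = se, lower.tail = FALSE)  # alpha cutoff, ~104.5
z <- (x_bar - mu_0) / se                              # ~1.46
pnorm(z, lower.tail = FALSE)                          # p-value, ~0.072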
1.3 Type I and II Errors
Either H0 (\(\mu = 100\)) or H1 (\(\mu > 100\)) is correct. You must choose to either reject or not reject H0. That means there are four possible states at the end of your analysis. If your summary measure is extreme enough for you to declare a “positive” result and reject H0, you are either correct (true positive) or incorrect (false positive). False positives are called Type I errors. Alternatively, if it is not extreme enough, you are either correct (true negative) or incorrect (false negative). False negatives are called Type II errors.
The probabilities of landing in these four states depend on your chosen significance level, \(\alpha\), and on the statistical power of the study, 1 - \(\beta\).
| | H0 True | H0 False |
|---|---|---|
| Positive test, reject H0. | False Positive (Type I Error), Probability = \(\alpha\) | True Positive (Good Call!), Probability = \(1 - \beta\) |
| Negative test, do not reject H0. | True Negative (Good Call!), Probability = \(1 - \alpha\) | False Negative (Type II Error), Probability = \(\beta\) |
\(\alpha\) is the expected Type I error rate - extreme summary measures occurring by chance when there is no effect to measure. \(\beta\) is the expected Type II error rate - summary measures that by chance were not extreme enough to reject H0 even though there is an actual effect.
If the population mean really was \(\mu\) = 100, any sample mean greater than 104.5 would mistakenly reject H0 at the \(\alpha\) = .05 significance level, a Type I error. However, the population mean is actually \(\mu\) = 106, and our measured \(\bar{x}\) = 104 only achieved a p-value of 0.072, so we failed to reject H0, a Type II error.
Had the sample size been larger, the study would have had the power to reject H0. Here’s the plot with n = 50. The null and actual distributions are tighter, and the \(\alpha\) region starts at 103.5.
The area under the actual distribution to the left of the \(\alpha\) cutoff is labeled \(\beta\). It’s the Type II error zone. You don’t usually see much discussion of \(\beta\) in reports because \(\beta\) is based on the unknown actual population distribution. \(\beta\) is relevant at the design stage: it informs how large your sample needs to be in order to reject H0 given some expected \(\bar{x}\). Interestingly, you can see how there can be such a thing as too much power. If n is large enough, you can reject H0 with even a trivial effect size.
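A sketch of that calculation for n = 50, assuming the same \(\sigma\) = 15 and an actual mean of 106:

# With n = 50 the alpha cutoff drops to ~103.5, the actual distribution clears
# it more often, so beta shrinks and power rises above .80.
n <- 50
se <- 15 / sqrt(n)                                 # 2.12
cutoff <- qnorm(.05, 100, se, lower.tail = FALSE)  # ~103.5
(beta <- pnorm(cutoff, mean = 106, sd = se))       # ~0.12, Type II error rate
1 - beta                                           # ~0.88, power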
1.4 Statistical Power
The ability to detect a difference when it exists (the true positive) is called the power of the test. It’s measured by the area outside of \(\beta\). Changing n from 30 to 50 reduced the area in the \(\beta\) region, increasing the power of the test.
Statistical power is an increasing function of sample size, effect size, and significance level. The positive association with significance level means there is a trade-off between Type I and Type II error rates. A small \(\alpha\) sets a high bar for rejecting H0, but you run the risk of failing to appreciate a real difference. On the other hand, a large \(\alpha\) sets a low bar for rejecting H0, but you run the risk of mistaking a random difference as real.
The 1 - \(\beta\) statistical power threshold is usually set at .80, similar to the \(\alpha\) = .05 level of significance threshold convention. Given a real effect, a study with a statistical power of .80 will only find a positive test result 80% of the time. You may think more power is better, but beware that with a large enough sample size, even trivial effect sizes may yield a positive test result. You need to consider both sides of this coin.
A power analysis determines the sample size required to detect a threshold effect size given an \(\alpha\) level of significance. A power analysis expresses the relationship among four components. If you know any three, it tells you the fourth. The components are n, \(\alpha\), 1 - \(\beta\), and Cohen’s d = \((\bar{x} - \mu_0)/\sigma\).
Suppose you set power at .80, significance level at .05, and n = 30. What effect size will this design detect?
(pwr <- pwr::pwr.t.test(
  n = 30,
  sig.level = .05,
  power = 0.80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 30
d = 0.464949
sig.level = 0.05
power = 0.8
alternative = greater
An effect size of d = 0.465 will fall in the \(\alpha\) = .05 region with probability 1 - \(\beta\) = .80 when the sample size is n = 30. Multiply d = 0.465 by \(\sigma\) = 15 to convert it to IQ units: about 7.0 points. More likely, you will use the power test to determine the required sample size. Suppose you set \(1 - \beta\) = .8 and \(\alpha\) = .05, and want to detect an effect size of \(5 / 15\).
(pwr <- pwr::pwr.t.test(
  d = 5 / 15,
  sig.level = .05,
  power = 0.80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 57.02048
d = 0.3333333
sig.level = 0.05
power = 0.8
alternative = greater
You need a larger sample, n = 58. You can run the power calculation for a range of sample sizes to see the relationship between n and the detectable effect size. Note: the y-axis multiplies Cohen’s d by \(\sigma\) to express the effect size in original units.
The dashed lines show a sample size of 30 is required to detect an effect size of 7 at a .05 significance level with 80% probability.
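Here is a sketch of how a curve like that can be generated: solve pwr::pwr.t.test() for d over a grid of sample sizes and convert to IQ points (the plotting code itself is omitted).

library(pwr)
library(purrr)

# Detectable effect size (in IQ points) by sample size, holding power = .80
# and alpha = .05 fixed.
n_grid <- seq(10, 100, by = 5)
d_grid <- map_dbl(n_grid, \(n) {
  pwr.t.test(n = n, sig.level = .05, power = .80,
             type = "one.sample", alternative = "greater")$d
})
data.frame(n = n_grid, detectable_iq_pts = d_grid * 15)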
1.5 Power Analyses
One-Sample Prop Test (greater)
How large must a sample be to detect a proportion meaningfully greater than 50% (null hypothesis H0: \(\pi_0 = .50\))? You might define meaningfully as at least 10 percentage points.
The effect size for a proportion power calculation uses the arcsine transformation. 10 percentage points equates to an effect size of about .20.
\[ h = 2*\arcsin(\sqrt{p1})-2*\arcsin(\sqrt{p2}) \tag{1.1}\]
(h <- pwr::ES.h(p1 = .60, p2 = .50))
[1] 0.2013579
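The same value comes straight from Equation 1.1:

2 * asin(sqrt(.60)) - 2 * asin(sqrt(.50))
## [1] 0.2013579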
Use the power calculation to find n for an effect size of .20 with 80% power and a .05 level of significance.
(pwr_calc <- pwr::pwr.p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 152.4863
sig.level = 0.05
power = 0.8
alternative = greater
If you repeatedly draw samples of size 153 from a population with proportion \(\pi\) = .60, expect to reject H0 80% of the time. The critical value for a one-sided test is 87 (a proportion of 0.57), meaning only measured proportions of at least 0.57 reject H0.
n <- ceiling(pwr_calc$n)

(crit_val <- qbinom(p = .05, size = n, prob = .50, lower.tail = FALSE))
## [1] 87

(crit_prop <- crit_val / n)
## [1] 0.5686275
In a plot of the theoretical distributions for samples of size 153 from populations centered at \(\pi_0\) = .50 and \(\pi\) = .60, 20% of the \(\pi\) distribution is to the left of the critical value.
<- "H0: p = .50"
H0_str <- .50
H0 <- "Actual: pi = .60"
act_str <- .60
pop <- c("gray30", "darkgoldenrod")
pal names(pal) <- c(H0_str, act_str)
<-
sampling_dist tibble(
grp = factor(c(rep(H0_str, n), rep(act_str, n)), levels = c(H0_str, act_str)),
cnt = rep(seq(1, n, 1), 2),
prob = c(rep(H0, n), rep(pop, n)),
prop = cnt / n,
dens = pmap_dbl(list(cnt, prob), \(x, prob) dbinom(x, size = n, prob))
)
|>
sampling_dist ggplot() +
geom_line(aes(x = prop, y = dens, color = grp, fill = grp)) +
geom_area(aes(x = prop, fill = grp,
y = if_else(cnt >= crit_val & grp == H0_str, dens, NA_real_)),
alpha = .2) +
geom_area(aes(x = prop, fill = grp,
y = if_else(cnt <= crit_val & grp == act_str, dens, NA_real_)),
alpha = .2) +
annotate("text", x = .55, y = .005, label = "beta", parse = TRUE, size = 4.5,
color = pal[act_str]) +
annotate("text", x = .58, y = .005, label = "alpha", parse = TRUE, size = 4.5,
color = pal[H0_str]) +
geom_vline(xintercept = H0, color = pal[H0_str], linetype = 2, linewidth = 1) +
geom_vline(xintercept = pop, color = pal[act_str], linetype = 2, linewidth = 1) +
scale_fill_manual(values = pal) +
scale_color_manual(values = pal) +
scale_x_continuous(breaks = c(H0, pop, crit_prop), labels = comma_format(.01)) +
coord_cartesian(xlim = c(.3, .7)) +
guides(fill = "none") +
theme(axis.text.x = element_text(angle = 90, vjust = .5)) +
labs(
x = "Proportion", y = "Density", color = NULL,
title = "Power Analysis for One Proportion, One-sided",
subtitle = str_wrap(glue(
"Effect size h = .2 (10 pct pts), 80% power, Sample size n = {comma(n)}. ",
"Proportions > {comma(crit_prop, .01)} reject {H0_str} at alpha = .05."
95)
), )
One-Sample Prop Test (two-sided)
How does this change for a two-sided test of the same effect size?
(pwr_calc <- pwr::pwr.p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 193.5839
sig.level = 0.05
power = 0.8
alternative = two.sided
You need more people for a two-sided test. n increased from 153 to 194.
n <- ceiling(pwr_calc$n)

(crit_val_lwr <- qbinom(p = .025, size = n, prob = .50, lower.tail = TRUE))
## [1] 83

(crit_prop_lwr <- crit_val_lwr / n)
## [1] 0.4278351

(crit_val_upr <- qbinom(p = .025, size = n, prob = .50, lower.tail = FALSE))
## [1] 111

(crit_prop_upr <- crit_val_upr / n)
## [1] 0.5721649
Here are the theoretical distributions.
~20% of the area under the “Actual: pi = .40” distribution is to the right of 0.43, and ~20% of the area under the “Actual: pi = .60” distribution is to the left of 0.57.
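The pop_lwr and pop_upr objects used in the check below are not defined in the code shown on this page; presumably they hold the two alternative proportions from the plot. Defining them explicitly (an assumption based on the text above):

pop_lwr <- .40  # assumed: "Actual: pi = .40"
pop_upr <- .60  # assumed: "Actual: pi = .60"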
pbinom(q = crit_val_lwr, size = n, prob = pop_lwr, lower.tail = FALSE)
## [1] 0.1932935
pbinom(q = crit_val_upr, size = n, prob = pop_upr, lower.tail = TRUE)
## [1] 0.2356584
Two-Sample Prop Test (greater)
How many people would you need to survey to detect a clear difference in support for an issue between two groups? \(H_0\) is that groups A and B have the same support rate. As before, you set a minimum effect size of 10 percentage points. A one-sided test assesses whether A has a higher (lower) support rate than B.
h <- pwr::ES.h(p1 = .60, p2 = .50)

(pwr <- pwr::pwr.2p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 304.9725
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: same sample sizes
Whoa, we need twice as many people for each group! Why is that? Recall from your notes that the two proportions test is a test on the distribution of proportion difference, \(d = p_1 - p_2\). The sampling distribution of \(d\) has a standard error \(SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\). The test statistic is \(Z = (d-d_0) / SE\).
There is a similar test that allows for unequal group sample sizes. If the first group has 305 people, the second group also needs 305. If the first group has 400, the second needs 247. The total sample size is smallest when the two groups are equal.
pwr::pwr.2p2n.test(
  h = h,
  n1 = 400,
  sig.level = .05,
  power = .80,
  alternative = "greater"
)
difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n1 = 400
n2 = 246.4288
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: different sample sizes
n <- ceiling(pwr$n)

SE <- sqrt(.5 * (1 - .5) / n + .6 * (1 - .6) / n)

(crit_val <- qnorm(p = .05, mean = 0, sd = SE, lower.tail = FALSE))
## [1] 0.06592883
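As a rough check, using the crit_val and SE just computed and a normal approximation (assuming the true difference really is .10), the chance of observing a difference beyond the critical value is about 80%:

# P(observed difference > critical value | true difference = .10), ~0.80
pnorm(crit_val, mean = .10, sd = SE, lower.tail = FALSE)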
Two-Sample Prop Test (two-sided)
How many people would you need to survey to detect a clear difference in support for an issue between two groups? \(H_0\) is that groups A and B have the same support rate. Again, suppose you care about a difference of at least 10 percentage points. In the two-sided test, group A can be higher or lower than B.
h <- pwr::ES.h(p1 = .60, p2 = .50)

(pwr <- pwr::pwr.2p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 387.1677
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: same sample sizes
The two-sided test requires a larger sample and the critical values are a little further from zero.
n <- ceiling(pwr$n)

SE <- sqrt(.5 * (1 - .5) / n + .6 * (1 - .6) / n)

(crit_val_lwr <- qnorm(p = .025, mean = 0, sd = SE, lower.tail = TRUE))
## [1] -0.06965147

(crit_val_upr <- qnorm(p = .025, mean = 0, sd = SE, lower.tail = FALSE))
## [1] 0.06965147
One-sample t-test (greater)
How many people would you need to survey to detect a clear level of agreement for an item measured on a 7-level Likert scale? This is an ordinal variable, but you can treat it as continuous. \(H_0\) is 4 (the middle value) and \(H_a\) is >4. By “clear agreement” you might mean “at least .5 points”. If the true average is above 4 but less than 4.5, you are okay with not rejecting \(H_0\).
sd <- 1 # assumption
cohen.d <- (4.5 - 4) / sd

(pwr <- pwr::pwr.t.test(
  d = cohen.d,
  sig.level = .05,
  power = .80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 26.13753
d = 0.5
sig.level = 0.05
power = 0.8
alternative = greater
If you repeatedly survey random samples of 27 people from a population where the average preference level is 4.5 of 7, you should expect to reject H0 80% of the time.
n <- ceiling(pwr$n)
se <- sd / sqrt(n)

(crit_val <- qt(p = .05, df = n - 1, lower.tail = FALSE))
## [1] 1.705618

(crit_rating <- crit_val * se + 4)
## [1] 4.328246
You can see this in a plot of the theoretical distributions for samples of size 27 from populations centered at 4 and 4.5. ~20% of the mean 4.5 curve is to the left of the critical value.
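As a rough normal-approximation check using the se and crit_rating just computed (and assuming the true mean really is 4.5 with sd = 1), the chance that a sample mean from n = 27 clears the critical rating is close to the nominal 80%:

# P(x_bar > crit_rating | true mean = 4.5), roughly 0.81
pnorm(crit_rating, mean = 4.5, sd = se, lower.tail = FALSE)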
One-sample t-test (two-sided)
How many people would you need to survey to detect a clear level of preference for an item measured on a 7-level Likert scale? Again, treat the ordinal variable as continuous. \(H_0\) is 4 (the middle value) and \(H_a\) is \(\ne\) 4. “Clear agreement” might mean “at least .5 points” in either direction. If the true average is between 3.5 and 4.5, you are okay with not rejecting \(H_0\).
sd <- 1 # assumption
cohen.d <- (4.5 - 4) / sd

(pwr <- pwr::pwr.t.test(
  d = cohen.d,
  sig.level = .05,
  power = .80,
  type = "one.sample",
  alternative = "two.sided"
))
One-sample t test power calculation
n = 33.36713
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
The sample size increased from 27 to 34. If you repeatedly survey random samples of 34 people from a population where the average preference level is 3.5 or 4.5 of 7, you should expect to reject H0 80% of the time.
n <- ceiling(pwr$n)
se <- sd / sqrt(n)

(crit_val_lwr <- qt(p = .05, df = n - 1, lower.tail = TRUE))
## [1] -1.69236

(crit_val_upr <- qt(p = .05, df = n - 1, lower.tail = FALSE))
## [1] 1.69236

(crit_rating_lwr <- crit_val_lwr * se + 4)
## [1] 3.709763

(crit_rating_upr <- crit_val_upr * se + 4)
## [1] 4.290237
You can see this in a plot of the theoretical distributions for samples of size 34 from populations centered at 3.5, 4, and 4.5. ~20% of the mean 4.5 curve is to the left of the upper critical value, and ~20% of the mean 3.5 curve is to the right of the lower critical value.
Indep Samples t-test (greater)
How many people would you need to survey to detect a clear group difference in agreement for an item measured in a 7-level Likert scale? \(H_0\) is that groups A and B have the same average rating. As before, you set a minimum effect size of .5 points. The one-sided test assesses whether A has a higher (lower) average than B.
s1 <- 1 # assumption
s2 <- 1 # assumption
cohen.d <- (4.5 - 4) / s1

(pwr <- pwr::pwr.t2n.test(
  n1 = 54,
  d = cohen.d,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
t test power calculation
n1 = 54
n2 = 46.80756
d = 0.5
sig.level = 0.05
power = 0.8
alternative = greater
We need about twice as many people for each group! Recall from your notes that the independent samples t-test is a test on the distribution of the difference in means, \(\hat{d} = \bar{x}_1 - \bar{x}_2\). The sampling distribution of \(\hat{d}\) has a standard error \(SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\). The test statistic is \(T = (\hat{d}-d_0) / SE\).
n1 <- 54
n2 <- ceiling(pwr$n2)
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)
SE <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
df <- n1 + n2 - 2

(crit_val <- qt(p = .05, df = df, lower.tail = FALSE))
## [1] 1.660391

(crit_diff <- crit_val * SE)
## [1] 0.3312267
Indep Samples t-test (two-sided)
How many people would you need to survey to detect a clear group difference in agreement for an item measured in a 7-level Likert scale? \(H_0\) is that groups A and B have the same average rating. As before, you set a minimum effect size of .5 points. The two-sided test assesses whether A has a different average than B.
s1 <- 1 # assumption
s2 <- 1 # assumption
cohen.d <- (4.5 - 4) / s1

(pwr <- pwr::pwr.t2n.test(
  n1 = 54,
  d = cohen.d,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
t test power calculation
n1 = 54
n2 = 77.74746
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
If \(n_1\) is 54, then \(n_2\) needs to be 78 - higher than the one-sided test.
n1 <- 54
n2 <- ceiling(pwr$n2)
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)
SE <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
df <- n1 + n2 - 2

(crit_val_l <- qt(p = .05, df = df, lower.tail = TRUE))
## [1] -1.656659

(crit_val_u <- qt(p = .05, df = df, lower.tail = FALSE))
## [1] 1.656659

(crit_diff_l <- crit_val_l * SE)
## [1] -0.2932757

(crit_diff_u <- crit_val_u * SE)
## [1] 0.2932757
1.6 p-value Intuition
What distribution of p-values would you expect if there is a true effect and you repeated the study many times? What if there is no true effect? The answer is completely determined by the statistical power of the study.1
To see this, run 100,000 simulations of an experiment measuring the average IQ from a sample of size n = 26. The samples will be 26 random values from the normal distribution centered at 106 with a standard deviation of 15. H0 is \(\mu\) = 100.
# 100,000 random samples of IQ simulations from a normal distribution where
# SIGMA = 15. True population value is 100, but we'll try other values.
n_sims <- 1E5
mu <- 100     # null hypothesis value passed to t.test()
SIGMA <- 15

# MU_0 is the mean of the population actually being sampled (the "true" mean
# for the simulation); n is the sample size of each simulated study.
run_sim <- function(MU_0 = 106, n = 26) {
  data.frame(i = 1:n_sims) %>%
    mutate(
      x = map(i, ~ rnorm(n = n, mean = MU_0, sd = SIGMA)),
      z = map(x, ~ t.test(., mu = mu)),
      p = map_dbl(z, ~ .x$p.value),
      x_bar = map_dbl(x, mean)
    ) %>%
    select(x_bar, p)
}
The null hypothesis is that the average IQ is 100. Our rigged simulation finds an average IQ of 106 - an effect size of 6.
sim_106_26 <- run_sim(MU_0 = 106, n = 26)
glimpse(sim_106_26)
## Rows: 100,000
## Columns: 2
## $ x_bar <dbl> 112.4862, 101.5958, 104.4396, 106.2315, 102.1960, 108.6005, 104.…
## $ p <dbl> 0.0001718698, 0.6059147572, 0.0985672856, 0.0428457450, 0.443911…
mean(sim_106_26$x_bar)
## [1] 105.9953
The statistical power achieved by the simulations is 50%. That is, the typical simulation detected the effect size of 6 at the .05 significance level about 50% of the time.
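You can confirm the empirical power directly from the simulated p-values; the share below .05 should land near .50. The theoretical calculation below agrees.

# Share of simulations with p < .05 -- the empirical power, ~0.50.
mean(sim_106_26$p < .05)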
pwr.t.test(
n = 26,
d = (106 - 100) / 15,
sig.level = .05,
type = "one.sample",
alternative = "two.sided"
)
One-sample t test power calculation
n = 26
d = 0.4
sig.level = 0.05
power = 0.5004646
alternative = two.sided
That means that given a population with an average IQ of 106, a two-sided hypothesis test of H0: \(\mu\) = 100 from a sample of size 26 will measure an \(\bar{x}\) with a p-value under .05 only 50% of the time. You can see that in this histogram of p-values.
sim_106_26 %>% plot_sim()
Had there been no effect to observe, you’d expect all p-values to be equally likely, so the 20 bins would each hold about 5% of the simulations – i.e., the p-values are uniformly distributed under the null. This is sometimes called “0 power”, although 5% of the p-values will still be significant at the .05 level. That 5% of p-values < .05 is the Type I error rate - the probability of a positive test result when there is no actual effect to observe.
run_sim(MU_0 = 100, n = 26) %>%
plot_sim(MU_0 = 100)
If you want a higher powered study that would detect the effect at least 80% of the time (the usual standard), you’ll need a larger sample size. How large? Conduct the power analysis again, but specify the power and leave out the sample size.
pwr.t.test(
power = 0.80,
d = (106 - 100) / 15,
sig.level = .05,
type = "one.sample",
alternative = "two.sided"
)
One-sample t test power calculation
n = 51.00945
d = 0.4
sig.level = 0.05
power = 0.8
alternative = two.sided
You need 51 people (technically, you might want to round up to 52). Here’s what that looks like. 80% of p-values are below .05 now.
run_sim(MU_0 = 106, n = 51) %>%
plot_sim(MU_0 = 106)
So far, we’ve discovered that when there is an effect, the probability that the measured p-value is under the \(\alpha\) significance level equals the power of the study, 1 - \(\beta\) (the true positive rate), and the remaining \(\beta\) of p-values will be above \(\alpha\) (the false negative rate). We’ve also discovered that when there is no effect, all p-values are equally likely, so \(\alpha\) of them will be below the \(\alpha\) level of significance (the false positive rate), and 1 - \(\alpha\) will be above it (the true negative rate).
It’s not the case that all p-values below 0.05 are strong support for the alternative hypothesis. If the statistical power is high enough, a p-value just under .05 can actually be less likely under the alternative hypothesis than under the null.
run_sim(MU_0 = 108, n = 51) %>%
  mutate(bin = case_when(p < .01 ~ "0.00 - 0.01",
                         p < .02 ~ "0.01 - 0.02",
                         p < .03 ~ "0.02 - 0.03",
                         p < .04 ~ "0.03 - 0.04",
                         p < .05 ~ "0.04 - 0.05",
                         TRUE ~ "other")
  ) %>%
  janitor::tabyl(bin)
bin n percent
0.00 - 0.01 86544 0.86544
0.01 - 0.02 5044 0.05044
0.02 - 0.03 2340 0.02340
0.03 - 0.04 1308 0.01308
0.04 - 0.05 908 0.00908
other 3856 0.03856
(Recall that under H0, all p-values are equally likely, so each of the percentile bins would contain 1% of p-values.)
In fact, at best, a p-value between .04 and .05 can only be about four times as likely under the alternative hypothesis as the null hypothesis. If your p-value is just under .05, it is at best weak support for the alternative hypothesis.
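One way to probe that claim with the run_sim() function above (a sketch - it reruns the full simulation for several hypothetical true means, so it is slow) is to compare the share of p-values landing between .04 and .05 to the 1% expected under H0:

library(dplyr)
library(purrr)

# For each assumed true mean, what fraction of p-values falls in (.04, .05],
# and how does it compare with the 1% expected under the null?
map_dfr(c(102, 104, 106, 108), \(m) {
  sims <- run_sim(MU_0 = m, n = 26)
  tibble(
    true_mean = m,
    share_04_05 = mean(sims$p > .04 & sims$p <= .05),
    ratio_vs_null = share_04_05 / .01
  )
})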
This section is based on ideas I learned from homework assignment 1 in Daniel Lakens’s Coursera class Improving your statistical inferences.↩︎