1 Frequentist Statistics
1.1 Central Tendency and Dispersion
Suppose you have a data set with \(j = 1, \ldots, p\) variables. Each variable has a distribution that can be characterized by its mean and variance, and considering the variables together lets you see how they covary. For the matrix algebra that follows, assume the data are organized in rows, so \(X_j\) is a row vector of \(n\) observations and \(X_{ij}\) refers to observation \(i\) of variable \(j\).
The mean of row vector \(X_j\) is \(\bar{x}_j = \frac{1}{n} \sum_{i = 1}^n X_{ij}\). \(\bar{x}_j\) estimates the population mean, \(\mu_j = E(X_j)\). The collection of means forms a column vector.
\[\boldsymbol{\bar{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}\]
The variance of row vector \(X_j\) is the average squared difference from the mean, \(s_j^2 = \frac{1}{n-1} \sum_{i=1}^n (X_{ij} - \bar{x}_j)^2\). \(s_j^2\) estimates the population variance, \(\sigma_j^2 = E(X_j - \mu_j)^2\). Again, the collection is represented as a column vector,
\[\boldsymbol{s}^2 = \begin{pmatrix} s_1^2 \\ s_2^2 \\ \vdots \\ s_p^2 \end{pmatrix}\]
The square root of \(s_j^2\) is called the standard deviation, \(s_j\). The concept of variance extends to pairs of variables, \(j\) and \(k\). The covariance of \(X_j\) and \(X_k\) is the average product of their differences from their respective means, \(s_{jk} = \frac{1}{n-1} \sum_{i=1}^n (X_{ij} - \bar{x}_j) (X_{ik} - \bar{x}_k)\). \(s_{jk}\) estimates the population covariance, \(\sigma_{jk} = E\{ (X_j - \mu_j) (X_k - \mu_k)\}\). The sign of \(s_{jk}\) tells you how the variables relate: it is positive if one variable tends to be above its mean when the other is, and zero if the value of one variable tells you nothing about the other. It can be shown that \(s_{jk}\) is equivalently expressed as
\[ s_{jk} = \frac{1}{n-1} \left[ \sum_{i=1}^n X_{ij}X_{ik} - \frac{\sum_{i = 1}^n X_{ij} \sum_{i = 1}^n X_{ik}}{n} \right] \]
This is how it is actually calculated. The first term is the dot product, \(X_j \cdot X_k\). The second term is the product of the sums divided by \(n\). Matrix algebra generalizes this across all \(p\) variables to form the variance-covariance matrix.
\[ \begin{align} S &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{x}) (X_i - \bar{x})' \\ &= \frac{1}{n-1} \left[ \sum_{i=1}^n X_i X_i^{'} - \frac{\left(\sum_{i = 1}^n X_i\right) \left(\sum_{i = 1}^n X_i\right)'}{n} \right] \end{align} \]
\(S\) estimates the population variance-covariance matrix, \(\boldsymbol{\Sigma}\). Divide each covariance by the product of the corresponding standard deviations to get the correlation, \(r_{jk} = \frac{s_{jk}}{s_j s_k}\). \(r_{jk}\) estimates the population correlation, \(\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}\).
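Here is a minimal R sketch of these estimates with made-up data. Note that R's cov() and cor() expect variables in columns, the transpose of the row layout used in the matrix algebra above; the manual calculation at the end applies the computational form of \(S\) and reproduces cov().

# Minimal sketch with made-up data: p = 3 variables in columns, n = 100 rows.
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 3), ncol = 3)

x_bar <- colMeans(X)  # vector of sample means
S <- cov(X)           # variance-covariance matrix (n - 1 denominator)
R <- cor(X)           # correlation matrix, r_jk = s_jk / (s_j * s_k)

# The computational form of S gives the same answer.
S_alt <- (t(X) %*% X - n * outer(x_bar, x_bar)) / (n - 1)
all.equal(S, S_alt, check.attributes = FALSE)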
1.2 Hypothesis Testing
P-values express how surprising a summary measure is given the null hypothesis (H0). Suppose you hypothesize that IQs have increased from their intended mean of \(\mu\) = 100, \(\sigma\) = 15. H0 is \(\mu_0\) = 100 with alternative H1 that \(\mu\) > 100. Suppose also that you are right: \(\mu\) is actually 106. The presumed and actual distributions of IQs would look like this:
According to the Central Limit Theorem (CLT), repeated samples of size n from a large population will yield \(\bar{x}\) values that approach a normal distribution centered at \(\mu\) with a standard deviation equal to the standard error, \(SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{30}}\) = 2.7 for samples of size n = 30. Under H0, repeated sample means would be distributed \(\sim N(100, 2.7^2).\)
Suppose you take a random sample of n = 30 IQs from the population and measure \(\bar{x}\) = 104, SD = 15.2. How surprising is this if H0 is true? I.e., what is the probability of observing an \(\bar{x}\) of 104 if \(\mu\) is 100 and \(\sigma\) is 15? The situation looks like this:
The \(\alpha\) region starts at qnorm(.05, 100, 2.7, lower.tail = FALSE) = 104.5. \(\bar{x}\) would have had to be at least that large to reject H0. The probability of measuring \(\bar{x} \ge 104\) is derived from the z score, \(z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{104 - 100}{2.7}\) = 1.5. The probability of measuring a z score that high given H0 is \(P(z \ge 1.5)\) = 0.072, meaning 7.2% of the area under the null distribution is to the right of 104. Therefore, do not reject H0 at the \(\alpha\) = .05 level of significance.
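These numbers are easy to reproduce in R. A small sketch (using the unrounded standard error, which is why the z score comes out just under 1.5):

mu_0 <- 100; sigma <- 15; n <- 30; x_bar <- 104
se <- sigma / sqrt(n)                                 # 2.74

qnorm(.05, mean = mu_0, sd = se, lower.tail = FALSE)  # alpha cutoff, ~104.5
z <- (x_bar - mu_0) / se                              # ~1.46
pnorm(z, lower.tail = FALSE)                          # p-value, ~0.072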
1.3 Type I and II Errors
Either H0 (\(\mu = 100\)) or H1 (\(\mu > 100\)) is correct. You must choose to either reject or not reject H0. That means there are four possible states at the end of your analysis. If your summary measure is extreme enough for you to declare a “positive” result and reject H0, you are either correct (true positive) or incorrect (false positive). False positives are called Type I errors. Alternatively, if it is not extreme enough, you are either correct (true negative) or incorrect (false negative). False negatives are called Type II errors.
The probabilities of landing in these four states depend on your chosen significance level, \(\alpha\), and on the statistical power of the study, 1 - \(\beta\).
| | H0 True | H0 False |
|---|---|---|
| Positive test, reject H0. | False Positive (Type I Error), Probability = \(\alpha\) | True Positive (Good Call!), Probability = \(1 - \beta\) |
| Negative test, do not reject H0. | True Negative (Good Call!), Probability = \(1 - \alpha\) | False Negative (Type II Error), Probability = \(\beta\) |
\(\alpha\) is the expected Type I error rate - extreme summary measures occurring by chance when there is no effect to measure. \(\beta\) is the expected Type II error rate - summary measures that by chance were not extreme enough to reject H0 even though there is an actual effect.
If the population mean really was \(\mu\) = 100, any sample mean greater than 104.5 would mistakenly reject H0 at the \(\alpha\) = .05 significance level, a Type I error. However, the population mean is actually \(\mu\) = 106, and our measured \(\bar{x}\) = 104 only achieved a p-value of 0.072, so we failed to reject H0, a Type II error.
Had the sample size been larger, the study would have had the power to reject H0. Here’s the plot with n = 50. The null and actual distributions are tighter, and the \(\alpha\) region starts at 103.5.
The area under the actual distribution to the left of the \(\alpha\) cutoff is labeled \(\beta\). It’s the Type II error zone. You don’t usually see much discussion of \(\beta\) in reports because \(\beta\) is based on the unknown actual population distribution. \(\beta\) is relevant at the design stage: it informs how large your sample needs to be in order to reject H0 given some expected \(\bar{x}\). Interestingly, you can see how there can be such a thing as too much power. If n is large enough, you can reject H0 with even a trivial effect size.
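A sketch of that calculation for n = 50, assuming the same \(\sigma\) = 15 and an actual mean of 106:

# With n = 50 the alpha cutoff drops to ~103.5, the actual distribution clears
# it more often, so beta shrinks and power rises above .80.
n <- 50
se <- 15 / sqrt(n)                                 # 2.12
cutoff <- qnorm(.05, 100, se, lower.tail = FALSE)  # ~103.5
(beta <- pnorm(cutoff, mean = 106, sd = se))       # ~0.12, Type II error rate
1 - beta                                           # ~0.88, power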
1.4 Statistical Power
The ability to detect a difference when it exists (the true positive) is called the power of the test. It’s measured by the area outside of \(\beta\). Changing n from 30 to 50 reduced the area in the \(\beta\) region, increasing the power of the test.
Statistical power is an increasing function of sample size, effect size, and significance level. The positive association with significance level means there is a trade-off between Type I and Type II error rates. A small \(\alpha\) sets a high bar for rejecting H0, but you run the risk of failing to appreciate a real difference. On the other hand, a large \(\alpha\) sets a low bar for rejecting H0, but you run the risk of mistaking a random difference as real.
The 1 - \(\beta\) statistical power threshold is usually set at .80, similar to the \(\alpha\) = .05 level of significance threshold convention. Given a real effect, a study with a statistical power of .80 will only find a positive test result 80% of the time. You may think more power is better, but beware that with a large enough sample size, even trivial effect sizes may yield a positive test result. You need to consider both sides of this coin.
A power analysis determines the sample size required to detect a threshold effect size given an \(\alpha\) level of significance. A power analysis expresses the relationship among four components. If you know any three, it tells you the fourth. The components are n, \(\alpha\), 1 - \(\beta\), and Cohen’s d = \((\bar{x} - \mu_0)/\sigma\).
Suppose you set power at .80, significance level at .05, and n = 30. What effect size will this design detect?
(pwr <- pwr::pwr.t.test(
  n = 30,
  sig.level = .05,
  power = 0.80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 30
d = 0.464949
sig.level = 0.05
power = 0.8
alternative = greater
An effect size of d = 0.465 will fall in the \(\alpha\) = .05 region with probability 1 - \(\beta\) = .80 when the sample size is n = 30. Multiply d = 0.465 by \(\sigma\) = 15 to convert it to IQ units: about 7.0 points. More likely, you will use the power test to determine the required sample size. Suppose you set \(1 - \beta\) = .8 and \(\alpha\) = .05, and want to detect an effect size of \(5 / 15\).
(pwr <- pwr::pwr.t.test(
  d = 5 / 15,
  sig.level = .05,
  power = 0.80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 57.02048
d = 0.3333333
sig.level = 0.05
power = 0.8
alternative = greater
You need a larger sample, n = 58. You can run the power calculation for a range of sample sizes to see the relationship between n and the detectable effect size. Note: the y-axis multiplies Cohen’s d by \(\sigma\) to express the effect size in original units.
The dashed lines show a sample size of 30 is required to detect an effect size of 7 at a .05 significance level with 80% probability.
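Here is a sketch of how a curve like that can be generated: solve pwr::pwr.t.test() for d over a grid of sample sizes and convert to IQ points (the plotting code itself is omitted).

library(pwr)
library(purrr)

# Detectable effect size (in IQ points) by sample size, holding power = .80
# and alpha = .05 fixed.
n_grid <- seq(10, 100, by = 5)
d_grid <- map_dbl(n_grid, \(n) {
  pwr.t.test(n = n, sig.level = .05, power = .80,
             type = "one.sample", alternative = "greater")$d
})
data.frame(n = n_grid, detectable_iq_pts = d_grid * 15)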
1.5 Power Analyses
One-Sample Prop Test (greater)
How large must a sample be to detect a proportion meaningfully greater than 50% (null hypothesis H0: \(\pi_0 = .50\))? You might define meaningfully as at least 10 percentage points.
The effect size for a proportion power calculation uses the arcsine transformation. 10 percentage points equates to an effect size of about .20.
\[ h = 2*\arcsin(\sqrt{p1})-2*\arcsin(\sqrt{p2}) \tag{1.1}\]
(h <- pwr::ES.h(p1 = .60, p2 = .50))
[1] 0.2013579
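The same value comes straight from Equation 1.1:

2 * asin(sqrt(.60)) - 2 * asin(sqrt(.50))
## [1] 0.2013579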
Use the power calculation to find n for an effect size of .20 with 80% power and a .05 level of significance.
(pwr_calc <- pwr::pwr.p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 152.4863
sig.level = 0.05
power = 0.8
alternative = greater
If you repeatedly draw samples of size 153 from a population with proportion \(\pi\) = .60, expect to reject H0 80% of the time. The critical value for a one-sided test is 87 (a proportion of 0.57), meaning only measured proportions of at least 0.57 reject H0.
n <- ceiling(pwr_calc$n)

(crit_val <- qbinom(p = .05, size = n, prob = .50, lower.tail = FALSE))
## [1] 87

(crit_prop <- crit_val / n)
## [1] 0.5686275
In a plot of the theoretical distributions for samples of size 153 from populations centered at \(\pi_0\) = .50 and \(\pi\) = .60, 20% of the \(\pi\) distribution is to the left of the critical value.
<- "H0: p = .50"
H0_str <- .50
H0 <- "Actual: pi = .60"
act_str <- .60
pop <- c("gray30", "darkgoldenrod")
pal names(pal) <- c(H0_str, act_str)
<-
sampling_dist tibble(
grp = factor(c(rep(H0_str, n), rep(act_str, n)), levels = c(H0_str, act_str)),
cnt = rep(seq(1, n, 1), 2),
prob = c(rep(H0, n), rep(pop, n)),
prop = cnt / n,
dens = pmap_dbl(list(cnt, prob), \(x, prob) dbinom(x, size = n, prob))
)
|>
sampling_dist ggplot() +
geom_line(aes(x = prop, y = dens, color = grp, fill = grp)) +
geom_area(aes(x = prop, fill = grp,
y = if_else(cnt >= crit_val & grp == H0_str, dens, NA_real_)),
alpha = .2) +
geom_area(aes(x = prop, fill = grp,
y = if_else(cnt <= crit_val & grp == act_str, dens, NA_real_)),
alpha = .2) +
annotate("text", x = .55, y = .005, label = "beta", parse = TRUE, size = 4.5,
color = pal[act_str]) +
annotate("text", x = .58, y = .005, label = "alpha", parse = TRUE, size = 4.5,
color = pal[H0_str]) +
geom_vline(xintercept = H0, color = pal[H0_str], linetype = 2, linewidth = 1) +
geom_vline(xintercept = pop, color = pal[act_str], linetype = 2, linewidth = 1) +
scale_fill_manual(values = pal) +
scale_color_manual(values = pal) +
scale_x_continuous(breaks = c(H0, pop, crit_prop), labels = comma_format(.01)) +
coord_cartesian(xlim = c(.3, .7)) +
guides(fill = "none") +
theme(axis.text.x = element_text(angle = 90, vjust = .5)) +
labs(
x = "Proportion", y = "Density", color = NULL,
title = "Power Analysis for One Proportion, One-sided",
subtitle = str_wrap(glue(
"Effect size h = .2 (10 pct pts), 80% power, Sample size n = {comma(n)}. ",
"Proportions > {comma(crit_prop, .01)} reject {H0_str} at alpha = .05."
95)
), )
One-Sample Prop Test (two-sided)
How does this change for a two-sided test of the same effect size?
(pwr_calc <- pwr::pwr.p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 193.5839
sig.level = 0.05
power = 0.8
alternative = two.sided
You need more people for a two-sided test. n increased from 153 to 194.
n <- ceiling(pwr_calc$n)

(crit_val_lwr <- qbinom(p = .025, size = n, prob = .50, lower.tail = TRUE))
## [1] 83

(crit_prop_lwr <- crit_val_lwr / n)
## [1] 0.4278351

(crit_val_upr <- qbinom(p = .025, size = n, prob = .50, lower.tail = FALSE))
## [1] 111

(crit_prop_upr <- crit_val_upr / n)
## [1] 0.5721649
Here are the theoretical distributions.
~20% of the area under the “Actual: pi = .40” distribution is to the right of 0.43, and ~20% of the area under the “Actual: pi = .60” distribution is to the left of 0.57.
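The pop_lwr and pop_upr objects used in the check below are not defined in the code shown on this page; presumably they hold the two alternative proportions from the plot. Defining them explicitly (an assumption based on the text above):

pop_lwr <- .40  # assumed: "Actual: pi = .40"
pop_upr <- .60  # assumed: "Actual: pi = .60"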
pbinom(q = crit_val_lwr, size = n, prob = pop_lwr, lower.tail = FALSE)
## [1] 0.1932935
pbinom(q = crit_val_upr, size = n, prob = pop_upr, lower.tail = TRUE)
## [1] 0.2356584
Two-Sample Prop Test (greater)
How many people would you need to survey to detect a clear difference in support for an issue between two groups? \(H_0\) is that groups A and B have the same support rate. As before, you set a minimum effect size of 10 percentage points. A one-sided test assesses whether A has a higher (lower) support rate than B.
h <- pwr::ES.h(p1 = .60, p2 = .50)

(pwr <- pwr::pwr.2p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 304.9725
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: same sample sizes
Whoa, we need twice as many people for each group! Why is that? Recall from your notes that the two proportions test is a test on the distribution of proportion difference, \(d = p_1 - p_2\). The sampling distribution of \(d\) has a standard error \(SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\). The test statistic is \(Z = (d-d_0) / SE\).
There is a similar test that allows for unequal group sample sizes. If the first group has 305 people, the second group also needs 305. If the first group has 400, the second needs 247. The total sample size is smallest when the two groups are equal.
pwr::pwr.2p2n.test(
  h = h,
  n1 = 400,
  sig.level = .05,
  power = .80,
  alternative = "greater"
)
difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n1 = 400
n2 = 246.4288
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: different sample sizes
n <- ceiling(pwr$n)

SE <- sqrt(.5 * (1 - .5) / n + .6 * (1 - .6) / n)

(crit_val <- qnorm(p = .05, mean = 0, sd = SE, lower.tail = FALSE))
## [1] 0.06592883
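As a rough check, using the crit_val and SE just computed and a normal approximation (assuming the true difference really is .10), the chance of observing a difference beyond the critical value is about 80%:

# P(observed difference > critical value | true difference = .10), ~0.80
pnorm(crit_val, mean = .10, sd = SE, lower.tail = FALSE)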
Two-Sample Prop Test (two-sided)
How many people would you need to survey to detect a clear difference in support for an issue between two groups? \(H_0\) is that groups A and B have the same support rate. Again, suppose you care about a difference of at least 10 percentage points. In the two-sided test, group A can be higher or lower than B.
h <- pwr::ES.h(p1 = .60, p2 = .50)

(pwr <- pwr::pwr.2p.test(
  h = h,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.2013579
n = 387.1677
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: same sample sizes
The two-sided test requires a larger sample and the critical values are a little further from zero.
n <- ceiling(pwr$n)

SE <- sqrt(.5 * (1 - .5) / n + .6 * (1 - .6) / n)

(crit_val_lwr <- qnorm(p = .025, mean = 0, sd = SE, lower.tail = TRUE))
## [1] -0.06965147

(crit_val_upr <- qnorm(p = .025, mean = 0, sd = SE, lower.tail = FALSE))
## [1] 0.06965147
One-sample t-test (greater)
How many people would you need to survey to detect a clear level of agreement for an item measured on a 7-level Likert scale? This is an ordinal variable, but you can treat it as continuous. \(H_0\) is 4 (the middle value) and \(H_a\) is >4. By “clear agreement” you might mean “at least .5 points”. If the true average is above 4 but less than 4.5, you are okay with not rejecting \(H_0\).
sd <- 1 # assumption
cohen.d <- (4.5 - 4) / sd

(pwr <- pwr::pwr.t.test(
  d = cohen.d,
  sig.level = .05,
  power = .80,
  type = "one.sample",
  alternative = "greater"
))
One-sample t test power calculation
n = 26.13753
d = 0.5
sig.level = 0.05
power = 0.8
alternative = greater
If you repeatedly survey random samples of 27 people from a population where the average preference level is 4.5 of 7, you should expect to reject H0 80% of the time.
n <- ceiling(pwr$n)
se <- sd / sqrt(n)

(crit_val <- qt(p = .05, df = n - 1, lower.tail = FALSE))
## [1] 1.705618

(crit_rating <- crit_val * se + 4)
## [1] 4.328246
You can see this in a plot of the theoretical distributions for samples of size 27 from populations centered at 4 and 4.5. ~20% of the mean 4.5 curve is to the left of the critical value.
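As a rough normal-approximation check using the se and crit_rating just computed (and assuming the true mean really is 4.5 with sd = 1), the chance that a sample mean from n = 27 clears the critical rating is close to the nominal 80%:

# P(x_bar > crit_rating | true mean = 4.5), roughly 0.81
pnorm(crit_rating, mean = 4.5, sd = se, lower.tail = FALSE)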
One-sample t-test (two-sided)
How many people would you need to survey to detect a clear level of preference for an item measured on a 7-level Likert scale? Again, treat the ordinal variable as continuous. \(H_0\) is 4 (the middle value) and \(H_a\) is \(\ne\) 4. “Clear agreement” might mean “at least .5 points” in either direction. If the true average is between 3.5 and 4.5, you are okay with not rejecting \(H_0\).
sd <- 1 # assumption
cohen.d <- (4.5 - 4) / sd

(pwr <- pwr::pwr.t.test(
  d = cohen.d,
  sig.level = .05,
  power = .80,
  type = "one.sample",
  alternative = "two.sided"
))
One-sample t test power calculation
n = 33.36713
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
The sample size increased from 27 to 34. If you repeatedly survey random samples of 34 people from a population where the average preference level is 3.5 or 4.5 of 7, you should expect to reject H0 80% of the time.
n <- ceiling(pwr$n)
se <- sd / sqrt(n)

(crit_val_lwr <- qt(p = .05, df = n - 1, lower.tail = TRUE))
## [1] -1.69236

(crit_val_upr <- qt(p = .05, df = n - 1, lower.tail = FALSE))
## [1] 1.69236

(crit_rating_lwr <- crit_val_lwr * se + 4)
## [1] 3.709763

(crit_rating_upr <- crit_val_upr * se + 4)
## [1] 4.290237
You can see this in a plot of the theoretical distributions for samples of size 34 from populations centered at 3.5, 4, and 4.5. ~20% of the mean 4.5 curve is to the left of the upper critical value, and ~20% of the mean 3.5 curve is to the right of the lower critical value.
Indep Samples t-test (greater)
How many people would you need to survey to detect a clear group difference in agreement for an item measured in a 7-level Likert scale? \(H_0\) is that groups A and B have the same average rating. As before, you set a minimum effect size of .5 points. The one-sided test assesses whether A has a higher (lower) average than B.
s1 <- 1 # assumption
s2 <- 1 # assumption
cohen.d <- (4.5 - 4) / s1

(pwr <- pwr::pwr.t2n.test(
  n1 = 54,
  d = cohen.d,
  sig.level = .05,
  power = .80,
  alternative = "greater"
))
t test power calculation
n1 = 54
n2 = 46.80756
d = 0.5
sig.level = 0.05
power = 0.8
alternative = greater
We need about twice as many people for each group! Recall from your notes that the independent samples t-test is a test on the distribution of the difference in means, \(\hat{d} = \bar{x}_1 - \bar{x}_2\). The sampling distribution of \(\hat{d}\) has a standard error \(SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\). The test statistic is \(T = (\hat{d}-d_0) / SE\).
n1 <- 54
n2 <- ceiling(pwr$n2)
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)
SE <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
df <- n1 + n2 - 2

(crit_val <- qt(p = .05, df = df, lower.tail = FALSE))
## [1] 1.660391

(crit_diff <- crit_val * SE)
## [1] 0.3312267
Indep Samples t-test (two-sided)
How many people would you need to survey to detect a clear group difference in agreement for an item measured in a 7-level Likert scale? \(H_0\) is that groups A and B have the same average rating. As before, you set a minimum effect size of .5 points. The two-sided test assesses whether A has a different average than B.
s1 <- 1 # assumption
s2 <- 1 # assumption
cohen.d <- (4.5 - 4) / s1

(pwr <- pwr::pwr.t2n.test(
  n1 = 54,
  d = cohen.d,
  sig.level = .05,
  power = .80,
  alternative = "two.sided"
))
t test power calculation
n1 = 54
n2 = 77.74746
d = 0.5
sig.level = 0.05
power = 0.8
alternative = two.sided
If \(n_1\) is 54, then \(n_2\) needs to be 78 - higher than the one-sided test.
n1 <- 54
n2 <- ceiling(pwr$n2)
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)
SE <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
df <- n1 + n2 - 2

(crit_val_l <- qt(p = .05, df = df, lower.tail = TRUE))
## [1] -1.656659

(crit_val_u <- qt(p = .05, df = df, lower.tail = FALSE))
## [1] 1.656659

(crit_diff_l <- crit_val_l * SE)
## [1] -0.2932757

(crit_diff_u <- crit_val_u * SE)
## [1] 0.2932757
1.6 p-value Intuition
What distribution of p-values would you expect if there is a true effect and you repeated the study many times? What if there is no true effect? The answer is completely determined by the statistical power of the study.1
To see this, run 100,000 simulations of an experiment measuring the average IQ from a sample of size n = 26. The samples will be 26 random values from the normal distribution centered at 106 with a standard deviation of 15. H0 is \(\mu\) = 100.
# 100,000 random samples of IQ simulations from a normal distribution where
# SIGMA = 15. True population value is 100, but we'll try other values.
n_sims <- 1E5
mu <- 100     # null hypothesis value passed to t.test()
SIGMA <- 15

# MU_0 is the mean of the population actually being sampled (the "true" mean
# for the simulation); n is the sample size of each simulated study.
run_sim <- function(MU_0 = 106, n = 26) {
  data.frame(i = 1:n_sims) %>%
    mutate(
      x = map(i, ~ rnorm(n = n, mean = MU_0, sd = SIGMA)),
      z = map(x, ~ t.test(., mu = mu)),
      p = map_dbl(z, ~ .x$p.value),
      x_bar = map_dbl(x, mean)
    ) %>%
    select(x_bar, p)
}
The null hypothesis is that the average IQ is 100. Our rigged simulation finds an average IQ of 106 - an effect size of 6.
sim_106_26 <- run_sim(MU_0 = 106, n = 26)
glimpse(sim_106_26)
## Rows: 100,000
## Columns: 2
## $ x_bar <dbl> 112.4862, 101.5958, 104.4396, 106.2315, 102.1960, 108.6005, 104.…
## $ p <dbl> 0.0001718698, 0.6059147572, 0.0985672856, 0.0428457450, 0.443911…
mean(sim_106_26$x_bar)
## [1] 105.9953
The statistical power achieved by the simulations is 50%. That is, the typical simulation detected the effect size of 6 at the .05 significance level about 50% of the time.
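You can confirm the empirical power directly from the simulated p-values; the share below .05 should land near .50. The theoretical calculation below agrees.

# Share of simulations with p < .05 -- the empirical power, ~0.50.
mean(sim_106_26$p < .05)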
pwr.t.test(
n = 26,
d = (106 - 100) / 15,
sig.level = .05,
type = "one.sample",
alternative = "two.sided"
)
One-sample t test power calculation
n = 26
d = 0.4
sig.level = 0.05
power = 0.5004646
alternative = two.sided
That means that given a population with an average IQ of 106, a two-sided hypothesis test of H0: \(\mu\) = 100 from a sample of size 26 will measure an \(\bar{x}\) with a p-value under .05 only 50% of the time. You can see that in this histogram of p-values.
sim_106_26 %>% plot_sim()
Had there been no effect to observe, you’d expect all p-values to be equally likely, so the 20 bins would each hold about 5% of the simulations – i.e., the p-values are uniformly distributed under the null. This is sometimes called “0 power”, although 5% of the p-values will still be significant at the .05 level. That 5% of p-values < .05 is the Type I error rate - the probability of a positive test result when there is no actual effect to observe.
run_sim(MU_0 = 100, n = 26) %>%
plot_sim(MU_0 = 100)
If you want a higher powered study that would detect the effect at least 80% of the time (the usual standard), you’ll need a larger sample size. How large? Conduct the power analysis again, but specify the power and leave out the sample size.
pwr.t.test(
power = 0.80,
d = (106 - 100) / 15,
sig.level = .05,
type = "one.sample",
alternative = "two.sided"
)
One-sample t test power calculation
n = 51.00945
d = 0.4
sig.level = 0.05
power = 0.8
alternative = two.sided
You need 51 people (technically, you might want to round up to 52). Here’s what that looks like. 80% of p-values are below .05 now.
run_sim(MU_0 = 106, n = 51) %>%
plot_sim(MU_0 = 106)
So far, we’ve discovered that when there is an effect, the probability that the measured p-value is under the \(\alpha\) significance level equals the power of the study, 1 - \(\beta\) (the true positive rate), and the remaining \(\beta\) of p-values will be above \(\alpha\) (the false negative rate). We’ve also discovered that when there is no effect, all p-values are equally likely, so \(\alpha\) of them will be below the \(\alpha\) level of significance (the false positive rate), and 1 - \(\alpha\) will be above it (the true negative rate).
It’s not the case that all p-values below 0.05 are strong support for the alternative hypothesis. If the statistical power is high enough, a p-value just under .05 can actually be less likely under the alternative hypothesis than under the null.
run_sim(MU_0 = 108, n = 51) %>%
  mutate(bin = case_when(p < .01 ~ "0.00 - 0.01",
                         p < .02 ~ "0.01 - 0.02",
                         p < .03 ~ "0.02 - 0.03",
                         p < .04 ~ "0.03 - 0.04",
                         p < .05 ~ "0.04 - 0.05",
                         TRUE ~ "other")
  ) %>%
  janitor::tabyl(bin)
bin n percent
0.00 - 0.01 86544 0.86544
0.01 - 0.02 5044 0.05044
0.02 - 0.03 2340 0.02340
0.03 - 0.04 1308 0.01308
0.04 - 0.05 908 0.00908
other 3856 0.03856
(Recall that under H0, all p-values are equally likely, so each of the percentile bins would contain 1% of p-values.)
In fact, at best, a p-value between .04 and .05 can only be about four times as likely under the alternative hypothesis as the null hypothesis. If your p-value is just under .05, it is at best weak support for the alternative hypothesis.
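One way to probe that claim with the run_sim() function above (a sketch - it reruns the full simulation for several hypothetical true means, so it is slow) is to compare the share of p-values landing between .04 and .05 to the 1% expected under H0:

library(dplyr)
library(purrr)

# For each assumed true mean, what fraction of p-values falls in (.04, .05],
# and how does it compare with the 1% expected under the null?
map_dfr(c(102, 104, 106, 108), \(m) {
  sims <- run_sim(MU_0 = m, n = 26)
  tibble(
    true_mean = m,
    share_04_05 = mean(sims$p > .04 & sims$p <= .05),
    ratio_vs_null = share_04_05 / .01
  )
})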
This section is based on ideas I learned from homework assignment 1 in Daniel Lakens’s Coursera class Improving your statistical inferences.↩︎