Today we’re going to explore how to use the t-distribution and t-tests to calculate confidence intervals and carry out hypothesis tests related to means.
In a previous lecture we talked about confidence intervals for the mean, but we made an important assumption – that our samples were relatively large (>30 observations). Today we’re going to relax that assumption and see how this change makes the t-distribution more appropriate than the normal distribution for calculating confidence intervals for the mean and carrying out hypothesis tests related to the mean.
library(tidyverse)
library(magrittr)
Under the assumption that:
We learned that the sampling distribution of the mean for samples of size \(n\) is \(\sim N(\mu,\sigma/\sqrt{n})\).
In general, we do not know the population paramters, \(\mu\) and \(\sigma\), so instead we use the information we do have:
We can then use our sample standard deviation to estimate the standard error of the mean as so:
\[ SE_\overline{x} = \frac{s_x}{\sqrt{n}} \]
Such estimates of the standard error feed into our calculations of confidence intervals and p-values. For example, we calculate confidence intervals in terms of multiples of the SE:
\[ {CI}_\overline{x} = \overline{x} \pm (z \times {SE}_\overline{x}) \]
One of the important assumptions underlying the discussion above was that the size of our sample was relatively large.
We would hope that, regardless of sample size, the sampling distributions of both the mean and standard deviation should be centered around the true population value, \(\mu\) and \(\sigma\) respectively.
Let’s use simulation to explore how well this is expectation is met for small samples – say 3, 5, 10, or 20 observations?
We first setup some parameters for doing our simulations:
# set seed for random number generations
set.seed(20161107)
# setup the parameters of the popn distribution we're sampling from
mu <- 0
sigma <- 1
# list of sample sizes we'll generate
ssizes <- c(2, 3, 4, 5, 7, 10, 20, 30)
# number of simulations to carry out *for each sample size*
nsims <- 2500
We then simulate the sampling distribution of the mean and standard deviation for samples of varying size (we also calculate some other values which will be useful for our exposition below):
df.combined <- data_frame(sample.size = double(), sample.mean = double(), sample.sd = double(),
estimated.SE = double(), sample.zscore = double())
for (i in ssizes) {
# create empty vectors to hold simulation stats (for efficiency)
s.means <- rep(NA, nsims)
s.sds <- rep(NA, nsims)
s.SEs <- rep(NA, nsims)
s.zscores <- rep(NA, nsims)
for (j in 1:nsims) {
s <- rnorm(i, mean = mu, sd = sigma) # draw random sample
s.means[j] <- mean(s)
s.sds[j] <- sd(s)
SE <- sd(s)/sqrt(i)
s.SEs[j] <- SE
s.zscores[j] <- (mean(s) - mu)/SE
}
df <- data.frame(sample.size = i, sample.mean = s.means, sample.sd = s.sds,
estimated.SE = s.SEs, sample.zscore = s.zscores)
df.combined <- bind_rows(df.combined, df)
}
df.combined %<>%
mutate(sample.sz = as.factor(sample.size))
In our simulation we specified that the mean and the standard deviation of the underlying population, were \(\mu = 0\) ad \(\sigma = 1\), respectively. Let’s examine how the well centered the sampling distributions of the two statistiscs are around these true values, as a function of sample size.
First a table summarized
by.sample.size <-
df.combined %>%
group_by(sample.size) %>%
summarize(mean.of.means = mean(sample.mean),
mean.of.sds = mean(sample.sd))
by.sample.size
# A tibble: 8 x 3
sample.size mean.of.means mean.of.sds
<dbl> <dbl> <dbl>
1 2 0.006602 0.7950
2 3 0.017028 0.8658
3 4 -0.007289 0.9141
4 5 0.003658 0.9446
5 7 0.002972 0.9612
6 10 0.003743 0.9721
7 20 0.002456 0.9843
8 30 -0.002114 0.9922
We see that the sampling distributions of means are well centered around the true mean (zero), and there is no systematic bias one way or the other. By contrast the sampling distribution of standard deviations tends to underestimate the true standard deviation.
We can visualize this bias as shown here:
ggplot(by.sample.size, aes(x = sample.size, y = mean.of.sds)) +
geom_point(color = 'red') +
geom_line(color = 'red') +
geom_hline(yintercept = 1, color = 'black', linetype='dashed') +
labs(x = "Sample Size", y = "Mean of Sampling Distn of Std Dev") +
ylim(0, 1.1)
The source of this bias is clear if we look at the sampling distribution of the standard deviation for samples of size 3, 5, and 30.
filtered.df <-
df.combined %>%
filter(sample.size %in% c(3, 5, 30))
ggplot(filtered.df, aes(x = sample.sd, y = ..density.., fill = sample.sz)) +
geom_histogram(bins=50, alpha=0.65) +
facet_wrap(~sample.size, nrow = 1) +
geom_vline(xintercept = 1, linetype = 'dashed') +
labs(x = "Std Deviations", y = "Density",
title = "Sampling distributions of the standard deviation\nAs a function of sample size")
There’s very clear indication that the the sampling distribution of standard deviations is not centered around 1 for \(n=3\) and for \(n=5\), however with samples of size 30 the sampling distribution of the standard deviation appears fairly well centered around the true value of the underlying population.
If when sample sizes are small, sample standard deviations, \(s_x\), tend to underestime the true standard deviation, \(\sigma\), then it follows that sample estimates of the standard error of the mean, \(SE_\overline{x} = \frac{s_x}{\sqrt{n}}\), must tend to understimate the true standard error of the mean, \(SE_\mu = \frac{\sigma}{\sqrt{n}}\).
This means that when we use \(SE_\overline{x}\) to estimate confidence intervals as follows: \[ {CI}_\overline{x} = \overline{x} \pm (z \times {SE}_\overline{x}) \] If we use \(z\) values based on the standard Normal distribution, our confidence intervals will be too small for the desired level of confidence. Instead we must turn to an alternate distribution called the “t-distribution” to get appropriate multiples of the standard error.
The problem associated with estimating the standard error of the mean were recognized in the early 20th century by William Gosset, an employee at the Guinness Brewing Company. He published a paper, under the pseudonym “Student”, giving the appropriate distribution for describing the standard error of the mean as a function of the sample size \(n\). Gosset’s distribution is known as the “t-distribution” or “Student’s t-distribution”.
The t-distribution is specified by a single parameter, called degrees of freedom (\(df\)) where \({df} = n - 1\). As \(df\) increases, the t-distribution becomes more and more like the normal such that when \(n \geq 30\) it’s nearly indistinguishable from the standard normal distribution.
In the figures below we compare the t-distribution and the normal distribution for sample sizes \(n = {4, 8, 32}\).
x <- seq(-6, 6, length.out = 200)
n <- c(5, 10, 30) # sample sizes
distns.df <- data_frame(sample.size = double(), z.or.t = double(),
norm.density = double(), t.density = double())
for (i in n) {
norm.density <- dnorm(x, mean = mu, sd = 1)
t.density <- dt(x, df = i - 1)
df.temp <- data_frame(sample.size = i, z.or.t = x,
norm.density = norm.density, t.density = t.density)
distns.df <- bind_rows(distns.df, df.temp)
}
distns.df %<>% mutate(df = as.factor(sample.size - 1))
ggplot(distns.df, aes(x = z.or.t, y = t.density, color = df)) +
geom_line() +
geom_line(aes(y = norm.density), color='black', linetype="dotted") +
labs(x = "z or t value", y = "Probablity density",
title = "Standard normal distribution (black dotted line)\nversus t-distributions for different degrees of freedom")
With our knowledge of the t-distribution, we now revisit how to calculate a confidence interval for the mean. To calculate a \((1-\alpha)\%\) confidence interval for the mean, use the equation:
\[ CI_\overline{x} = \overline{x} \pm t_{1-\alpha/2, df} \times SE_\overline{x} \]
Where \(t_{1-\alpha/2, df}\) is the critical t-value, corresponding to the \(1-\alpha/2\) cutoff of the t-distribution with \(df\) degrees of freedom. We use \(1-\alpha/2\) to account for the fact that for our uncertainty includes both the right and left tails of the t-distribution.
t.values <- seq(-4, 4, length.out = 200)
t.density <- dt(t.values, df = 4)
left.interval <- seq(-4, qt(0.025, df = 4), length.out = 50)
right.interval <- seq(qt(0.975, df = 4), 4, length.out = 50)
ggplot() +
geom_line(aes(x = t.values, y = t.density)) +
geom_area(aes(x = left.interval, y = dt(left.interval, df = 4)), fill = "firebrick") +
geom_area(aes(x = right.interval, y = dt(right.interval, df = 4)), fill = "firebrick") +
annotate("text", x = -3.25, y = 0.05, label = "2.5% left") +
annotate("text", x = 3.25, y = 0.05, label = "2.5% right") +
labs(x = "t", y = "Probability density",
title = "Critical values of t-distribution with df = 4\ncorresponding to 5% cutoff" )
The function qt()
(quantile function for t-distribution) calculates the appropriate critical t-value. For example, to calculate \(t_{0.975, 4}\), which corresponds to the left border of the region labeled “2.5% right” in the figure above, we do:
qt(0.975, 4)
[1] 2.776
For each of the following, the mean, standard deviation, sample size, and desired confidence interval are given. Find the critical values of \(t\) required for a confidence interval of the mean. [0.5 pts each]
You treat a sample of mice with a drug (Drug X) and measure the expression of the gene YFG1 following treatment. For a sample of five mice you observe the following expression values:
What is the 95% confidence interval for the mean expression of YFG1?
alpha = 0.05
yfg1.1sample <- data_frame(YFG1 = c(11.25, 10.5, 12, 11.75, 10))
yfg1.1sample %>%
summarize(mean.YFG1 = mean(YFG1),
sd.YFG1 = sd(YFG1),
se.mean.YFG1 = sd.YFG1/sqrt(n()),
df = n() - 1,
t = qt(1 - alpha/2, df),
left.CI.95 = mean.YFG1 - t * se.mean.YFG1,
right.CI.95 = mean.YFG1 + t * se.mean.YFG1)
# A tibble: 1 x 7
mean.YFG1 sd.YFG1 se.mean.YFG1 df t left.CI.95 right.CI.95
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11.1 0.8404 0.3758 4 2.776 10.06 12.14
We consider three different situations for hypothesis tests regarding means.
A one sample t-test is appropriate when you want to compare an observed sample mean to some a priori hypothesis about what the mean should be.
Calculate the test statistic:
\[ t^{\star} = \frac{\overline{x} - \mu_0}{SE_\overline{x}} \]
where \(\overline{x}\) is the sample mean and \(SE_\overline{x}\) is the sample standard error of the mean.
Compare the test statistic to the t-distribution to calculate the probability that you’d observe a mean value at least as extreme as \(\overline{x}\) if the null hypothesis was true:
\[ P = \text{Pr}[t < -|t^\star|)] + \text{Pr}[t > |t^\star|] \]
Since the t-distribution is symmetric, this simplifies to:
\[ P = 2 \times \text{Pr}[t > |t^\star|] \]
You have established from previous studies that the expression level of YFG1 in control mice is 10. You wish to determine whether the average expression of YFG1 in mice treated with Drug X differs from control mice.
mu_0 = 10
yfg1.1sample %>%
summarize(mean.YFG1 = mean(YFG1),
sd.YFG1 = sd(YFG1),
se.mean.YFG1 = sd.YFG1/sqrt(n()),
df = n() - 1,
t.star = (mean.YFG1 - mu_0)/se.mean.YFG1,
P.value = 2 * pt(abs(t.star), df = df, lower.tail = FALSE))
# A tibble: 1 x 6
mean.YFG1 sd.YFG1 se.mean.YFG1 df t.star P.value
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11.1 0.8404 0.3758 4 2.927 0.04295
Using a type I error cutoff of \(\alpha = 0.05\), we reject the null hypothesis that the mean expression of YFG1 in mice treated with Drug X is the same as the mean expression of YFG1 in control mice. Our data support the hypothesis that Drug X increases the expression of YFG1.
t.test
function in RThe built-in t.test()
function will take care of the all the calcuations we did by hand above. Take a moment to read the t.test
documentation.
For a one sample t-test t.test
we need to specify the a priori mean value, mu
, we are comparing to:
t.test(yfg1.1sample$YFG1, mu = 10)
One Sample t-test
data: yfg1.1sample$YFG1
t = 2.9, df = 4, p-value = 0.04
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
10.06 12.14
sample estimates:
mean of x
11.1
Global warming is predicted to lead to shifts in the geographic ranges of species. Chen et al. (2011) studied recent changes in the highest elevation at which species occur. Typically, higher elevations are coolor than lower elevations. The CSV file ABD-range-shifts.csv gives the changes in highest elevation (measured in meters) for a wide variety of taxa, as measure in the over the last 30 or so years. Positive numbers indicate upward shifts in elevation, and negative numbers indicate shifts to lower elevations.
Each of the following questions is worth 0.5 pts:
Draw a histogram showing the distribution of changes in elevation. Make sure to follow good practice (appropriate bins, axis labels, etc) in creating your figure.
What is the sample size for this data set?
What is the mean of the elevation shift data?
What is the standard deviation of the elevational shifts?
What is the standard error of the mean for these data?
How many degrees of freedome will be associated with a confidence interval and one-sample t-test for the mean elevational shift?
What is the value of \(\alpha\) needed for a a 95% confidence interval?
What is the critical value of \(t\) for this \(alpha\) and number of degrees of freedom?
Calculate the 95% confidence interval for the mean using these data
For the one-sample t-test, write the appropriate null and alternative hypotheses
Carry out a one sample t-test using t.test()
What is your conclusions whether species have changed their highest elevation on average?
We use a two sample t-test to analyze the different between the means of the same variable measured in two different groups or treatments. It is assumed that the two groups are independent samples from two populations.
In a two-sample t-test, we have to account for the uncertainty associated with the means of both groups, which we express in terms of the standard error of the difference in the means between the groups:
\[ SE_{\overline{X}_1 - \overline{X}_2} = \sqrt{s^2_p\left(\frac{1}{n_1} + \frac{1}{n_2} \right)} \]
where
\[ s^2_p = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2} \]
\(s^2_p\) is called the “pooled sample variance” and is a weighted average of the sample variances, \(s_1^2\) and \(s_2^2\), of the two groups.
Given the standard error for the difference in means between groups as defined above, we define our test statistic for a two sample t-test as:
\[ t^\star = \frac{(\overline{X}_1 - \overline{X}_2)}{SE_{\overline{X}_1 - \overline{X}_2}} \]
The degrees of freedom for this test statistic is:
\[ df = df_1 + df_2 = n_1 + n_2 - 2 \]
You treat samples of mice with two drugs, X and Y. We want to know if the two drugs have the same average effect on expression of the gene YFG1. The measurements of YFG1 in samples treated with X and Y are as follows:
For simplicity, we skip the “by-hand” calculations and simply use the built in t.test
function:
yfg1.2sample <- data_frame(YFG1.X = c(11.25, 10.5, 12, 11.75, 10),
YFG1.Y = c(8.75, 10, 11, 9.75, 10.5))
t.test(yfg1.2sample$YFG1.X, yfg1.2sample$YFG1.Y)
Welch Two Sample t-test
data: yfg1.2sample$YFG1.X and yfg1.2sample$YFG1.Y
t = 2.1, df = 8, p-value = 0.07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1311 2.3311
sample estimates:
mean of x mean of y
11.1 10.0
Using a type I error cutoff of \(\alpha = 0.05\), we fail to reject the null hypothesis that the mean expression of YFG1 is different in mice treated with Drug X versus those treated with Drug Y.
It has been hypothesized that many North Americans dreamt in black and white in the era of black-and-white television and shifted to color after the introduction of color media (movies and TV). To test this hypothesis Murzyn (2008) queried 30 older individuals who grew up in the black-and-whte era and 30 younger individuals who grew up with color media. She recorded the percentage of color dreams for each individuals. In younger individuals, a mean of 68.4% their dreams were in color, with a standard deviation of 31.8%. For the older individuals, the mean was 33.9% with a standard deviation of 36.9%. The scores were approximately normally distributed in each group. Is the difference between the two means statistically significant?
Each of the following questions is worth 0.5 pts:
State the null and alternate hypotheses for this test.
What are the sample variances for each group?
How many degrees of freedom are associated with each group?
Calculate the pooled variance for these data.
What is the standard error of the difference between means?
Calculate the test statistic, t, for these data
What is the critical value of \(t\) with \(\alpha = 0.05\) for this test?
Calculate a p-value for this comparison.
What can you conclude about the difference between the two groups in how often they dream in color? Interpret the conlusions of the test in terms of the original question.
In a paired t-test there are two groups/treatments, but the samples in the two groups are paired or matched. This typically arises in “before-after” studies where the same individual/object is measured at different time points, before and after application of a treatment. The repeated measurement of the same individual/object means that we can’t treat the two sets of observations as independent. Null and alternative hypotheses are thus typically framed in terms of a mean diffrenes between time points/conditions
The test statistic is thus: \[ t^\star = \frac{\overline{D}}{SE(\overline{D})} \]
Under the null hypothesis, this statistic follows a t-distribution with \(n-1\) degrees of freedom.
You measure the expression of the gene YFG1 in five mice. You then treat those five mice with drug Z and measure gene expression again.
yfg1.paired <- data_frame(YFG1.before = c(12, 11.75, 11.25, 10.5, 10),
YFG1.after = c(11, 10, 10.50, 8.75, 9.75))
t.test(yfg1.paired$YFG1.before, yfg1.paired$YFG1.after,
paired = TRUE)
Paired t-test
data: yfg1.paired$YFG1.before and yfg1.paired$YFG1.after
t = 3.8, df = 4, p-value = 0.02
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2905 1.9095
sample estimates:
mean of the differences
1.1
Using a type I error cutoff of \(\alpha = 0.05\), we reject the null hypothesis of no difference in the average expression of YFG1 before and after treatment with Drug Z.