Definition: The process of drawing conclusions on the basis of statistical testing of collected data.
Goal: To draw conclusions about a population on the basis of data obtained from a sample of that population.
How it works: In hypothesis testing, we start with some default theory, called the null hypothesis, and we ask whether the data provide sufficient evidence (via a test statistic) to reject it. If not, we retain the null hypothesis.
Testing diff. in means: Single-sample t-test
Testing diff. in means: Two-sample t-test
A t-test is a type of hypothesis test used to compare means between groups when the population standard deviation is unknown. It helps us decide whether the difference we see between sample means is likely due to random variation or reflects a real difference in the populations.
The t-test is used to determine if the means of two sets of data are significantly different from each other.
It is mostly used when the data sets, like one recording the outcomes of flipping a coin 100 times, are expected to follow an approximately normal distribution and may have unknown variances. A t-test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.
A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the statistical significance. To conduct a test with three or more means, one must use an analysis of variance.
The test statistic follows a t-distribution, which looks similar to the normal distribution but has heavier tails. This accounts for the extra uncertainty that comes from estimating the standard deviation using sample data instead of knowing the true population value.
Use a t-test when you are comparing the mean of one group to a reference value (single-sample) or the means of two groups to each other (two-sample), the population standard deviation is unknown, and the data are approximately normally distributed (or the sample is large enough for the sample means to be approximately normal).
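To make this concrete, here is a minimal sketch on simulated data (not from the course datasets) showing both kinds of t-test in R:
# Sketch on simulated data: single-sample and two-sample t-tests
set.seed(1)
x <- rnorm(30, mean = 1)   # one sample, true mean 1
y <- rnorm(30, mean = 0)   # a second, independent sample, true mean 0
t.test(x, mu = 0)          # single-sample: is the mean of x different from 0?
t.test(x, y)               # two-sample (Welch): do x and y have different means?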
If there is a significant linear relationship between the independent quantitative variable \(x\) and the dependent quantitative variable \(y\), the slope will not equal zero.
State the hypotheses: \[ H_0: b_1 = 0,\ \ \ H_A: b_1 \neq 0. \]
The null hypothesis states that the slope is equal to zero, and the alternative hypothesis states that the slope is not equal to zero.
Note: Think of analogy to “innocent until proven guilty”.
Select a significance level, \(\alpha\). The most common one is 5%. Some scientific questions (e.g. elementary particles) and medical questions (e.g. vaccines) require lower significance levels (e.g. 1% or 0.1%). This means the test is more conservative, so it’s harder to get significance by chance.
\(\alpha\) is also the probability of obtaining a Type I error, and \(\beta\) is the probability of obtaining a Type II error.
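To see what \(\alpha\) means in practice, here is a small simulation sketch (not from the original notes): when the null hypothesis is true, a test at \(\alpha = 0.05\) should falsely reject about 5% of the time.
# Sketch: simulate the Type I error rate of a t-test when H0 is actually true
set.seed(2)
p.values <- replicate(10000, t.test(rnorm(30), mu = 0)$p.value)  # 10,000 tests under H0
mean(p.values < 0.05)  # proportion of false rejections; should be close to alpha = 0.05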
In practice, use R. But what does R do?
Using sample data, it finds the i) standard error of the slope, ii) the slope of the regression line, iii) the degrees of freedom, iv) the test statistic, and the v) \(p\)-value associated with the test statistic.
\(SE = s_{b_1} = \sqrt{ \frac{ \sum(y_i - \hat{y}_i)^2 / (n - 2) }{ \sum(x_i - \overline{x})^2 } },\)
where \(y_i\) is the value of the dependent variable for observation \(i\), \(\hat{y}_i\) is the estimated value of the dependent variable for observation \(i\), \(x_i\) is the observed value of the independent variable for observation \(i\), \(\overline{x}\) is the mean of the independent variable, and \(n\) is the number of observations.
The test statistic is \[ t = \frac{b_1}{SE}, \] where \(b_1\) is the slope of the sample regression line, and \(SE\) is the standard error of the slope. Under the null hypothesis, \(t\) follows a t-distribution with \(n - 2\) degrees of freedom.
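To see what these formulas compute, here is a minimal sketch on simulated data (not from the original notes) that calculates the slope, its standard error, and the t-statistic by hand and checks the result against lm():
# Sketch on simulated data: compute b1, SE(b1), and t by hand, then compare with lm()
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
b1   <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope of the sample regression line
b0   <- mean(y) - b1 * mean(x)                                      # intercept
yhat <- b0 + b1 * x                                                 # fitted values
se   <- sqrt((sum((y - yhat)^2) / (length(x) - 2)) / sum((x - mean(x))^2))  # SE of the slope (formula above)
t.stat <- b1 / se                                                   # test statistic
fit <- lm(y ~ x)
c(by.hand = t.stat, from.lm = summary(fit)$coefficients["x", "t value"])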
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the \(p\)-value to the significance level, and rejecting the null hypothesis when the \(p\)-value is less than the significance level.
Note: We say “reject the null (in favor of the alternative) or don’t reject the null,” NOT “accept the null”.
Analogy: Defendant is not shown to be innocent. Only “guilty or not guilty”.
One of the most common statistical tasks is to compare an outcome between two groups. The example here looks at comparing birth weight between smoking and non-smoking mothers.
To start, it always helps to plot things.
# Create boxplot showing how birthwt.grams varies between the two groups of mothers
birthwt %>% ggplot(aes(x=mother.smokes, y=birthwt.grams)) +
geom_boxplot() +
labs(x = "Mother smokes", y="Birthweight (grams)")
This plot suggests that smoking is associated with lower birth weight. But how can we assess whether this difference is statistically significant?
Let’s compute a summary table.
The standard deviation is good to have, but to assess statistical significance we really want to have the standard error (which is the standard deviation divided by the square root of the group size).
birthwt %>%
group_by(mother.smokes) %>%
summarize(num.obs = n(),
mean.birthwt = round(mean(birthwt.grams), 0),
sd.birthwt = round(sd(birthwt.grams), 0),
se.birthwt = round(sd(birthwt.grams) / sqrt(num.obs), 0))
## # A tibble: 2 × 5
## mother.smokes num.obs mean.birthwt sd.birthwt se.birthwt
## <fct> <int> <dbl> <dbl> <dbl>
## 1 no 115 3056 753 70
## 2 yes 74 2772 660 77
This difference looks substantial. But let's confirm it with a formal test.
To run a two-sample t-test, we can simply use the t.test() function.
birthwt.t.test <- t.test(birthwt.grams ~ mother.smokes, data = birthwt)
birthwt.t.test
##
## Welch Two Sample t-test
##
## data: birthwt.grams by mother.smokes
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 78.57486 488.97860
## sample estimates:
## mean in group no mean in group yes
## 3055.696 2771.919
We see from this output that the difference is statistically significant (p ≈ 0.007). The t.test() function also outputs a confidence interval for us.
Notice that the function returns a lot of information, and we can access this information element by element. The ability to pull specific information from the output of the hypothesis test allows you to report your results using inline code chunks. That is, you don’t have to hardcode estimates, p-values, confidence intervals, etc.
names(birthwt.t.test)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
birthwt.t.test$p.value # p-value
## [1] 0.007002548
birthwt.t.test$estimate # group means
## mean in group no mean in group yes
## 3055.696 2771.919
birthwt.t.test$conf.int # confidence interval for difference
## [1] 78.57486 488.97860
## attr(,"conf.level")
## [1] 0.95
attr(birthwt.t.test$conf.int, "conf.level") # confidence level
## [1] 0.95
Define a few things:
# Calculate difference in means between smoking and nonsmoking groups
birthwt.t.test$estimate
## mean in group no mean in group yes
## 3055.696 2771.919
birthwt.smoke.diff <- birthwt.t.test$estimate[1] - birthwt.t.test$estimate[2]
# Confidence level as a %
conf.level <- attr(birthwt.t.test$conf.int, "conf.level") * 100
Our study finds that birth weights are on average 283.8g higher in the non-smoking group compared to the smoking group (t-statistic 2.73, p=0.007, 95% CI [78.6, 489]g).
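The sentence above can be generated with inline code chunks rather than hard-coded numbers. A sketch of how the R Markdown source might read (the exact wording and rounding in the original source may differ):
Our study finds that birth weights are on average
`r round(birthwt.smoke.diff, 1)`g higher in the non-smoking group compared to
the smoking group (t-statistic `r round(birthwt.t.test$statistic, 2)`,
p=`r round(birthwt.t.test$p.value, 3)`,
`r conf.level`% CI [`r round(birthwt.t.test$conf.int[1], 1)`, `r round(birthwt.t.test$conf.int[2], 1)`]g).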
For this example, we use the same dataset as in Exercises 5, on Philadelphia housing prices.
library(tidyverse)
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/philadelphia_house_prices.csv")
First, though, we filter out the outliers, just so we can focus on the hypothesis testing for now.
dat_no_outliers <- dat %>% filter(price<500)
We draw the scatterplot to see the relationship between square footage and price.
dat_no_outliers %>% ggplot(aes(x=price, y=sqft)) + geom_point()
We see that there is a positive, strong, linear association, with no outliers.
Then, we fit the linear regression model, and check the diagnostic plots.
out <- lm(sqft~price, data=dat_no_outliers)
par(mfrow=c(2,2))
plot(out)
par(mfrow=c(1,1))
The diagnostic plots suggest that the four assumptions of linear regression are satisfied: linearity, independence of errors, normality of errors, and homoscedasticity.
Now we test whether the slope coefficient of the linear regression is statistically significant. The null hypothesis is that the two variables, price and square footage, are not associated with each other.
summary(out)
##
## Call:
## lm(formula = sqft ~ price, data = dat_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -794.54 -165.00 8.86 176.11 640.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 594.4443 36.9483 16.09 <2e-16 ***
## price 4.9217 0.1263 38.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 249 on 993 degrees of freedom
## Multiple R-squared: 0.6047, Adjusted R-squared: 0.6043
## F-statistic: 1519 on 1 and 993 DF, p-value: < 2.2e-16
We can read the coefficient and its corresponding standard error (SE), t-statistic, and p-value in the output line that starts with "## price".
We see that, indeed, there is a statistically significant association between price and square footage in Philadelphia. The three stars represent statistical significance at a value lower than 0.001. We usually use a p-value of 0.05 in social science. Therefore, we can interpret the coefficient as follows:
An additional US dollar is associated with an additional 4.92 square feet, and this is statistically significant at the 0.05 level.
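As with the t-test output, these numbers can be pulled out of the fitted model object rather than hard-coded; a minimal sketch using the out object fitted above:
# Extract the coefficient table from the fitted model as a matrix
coef.table <- summary(out)$coefficients
coef.table["price", "Estimate"]    # slope
coef.table["price", "Std. Error"]  # standard error of the slope
coef.table["price", "t value"]     # t-statistic
coef.table["price", "Pr(>|t|)"]    # p-value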