Statistical inference

  • Definition: The process of analyzing data to infer properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Hypothesis testing

  • Definition: The process of drawing conclusions on the basis of statistical testing of collected data.

  • Goal: To draw conclusions about a population on the basis of data obtained from a sample of that population.

  • How it works: In hypothesis testing, we start with some default theory, called a null hypothesis, and we ask whether the data provide sufficient evidence (via a test statistic) to reject it. If not, we retain the null hypothesis.

Testing differences in means: Single-sample t-test

Testing differences in means: Two-sample t-test

t-test

  • A t-test is a type of hypothesis test used to compare means between groups when the population standard deviation is unknown. It helps us decide whether the difference we see between sample means is likely due to random variation or reflects a real difference in the populations.

  • The t-test is used to determine if the means of two sets of data are significantly different from each other.

  • It is typically used when the data are approximately normally distributed (for example, the number of heads recorded from flipping a coin 100 times is approximately normal) and the population variances are unknown. A t-test is used as a hypothesis-testing tool, which allows testing of an assumption about a population.

  • A t-test uses the t-statistic, the t-distribution, and the degrees of freedom to determine statistical significance. To compare three or more means, one must use an analysis of variance (ANOVA).

Why “t”?

The test statistic follows a t-distribution, which looks similar to the normal distribution but has heavier tails. This accounts for the extra uncertainty that comes from estimating the standard deviation using sample data instead of knowing the true population value.
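
As a quick sketch (not from the original notes), we can compare the 97.5th-percentile cutoffs of the t-distribution and the standard normal in R; the heavier tails of the t-distribution demand a larger cutoff, especially with few degrees of freedom:

# Critical values for a two-sided test at alpha = 0.05 (illustrative sketch)
qnorm(0.975)         # standard normal cutoff, about 1.96
qt(0.975, df = 10)   # t cutoff with 10 degrees of freedom, about 2.23
qt(0.975, df = 100)  # with 100 degrees of freedom, about 1.98 (close to normal)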

When to use a t-test

Use a t-test when:

  • Your outcome variable is quantitative (e.g., crime rate, sentence length, years of education).
  • You want to compare means (average values).
  • Your data come from independent random samples or paired observations.
  • The data are approximately normally distributed (especially important for small samples); a quick visual check is sketched after this list.
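
A minimal sketch of a visual check of the normality condition within each group, assuming the birthwt data used later in these notes (and the tidyverse) are already loaded:

# Q-Q plots of birth weight within each smoking group; roughly straight lines
# suggest approximate normality
birthwt %>%
  ggplot(aes(sample = birthwt.grams)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ mother.smokes)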

Logic of a t-test

  • State hypotheses \(H_0\) and \(H_1\).
  • Compute the t-statistic: \(t=\frac{\overline{X}_1- \overline{X}_2}{SE_{difference}}\).
  • Find the p-value using the t-distribution with appropriate degrees of freedom.
  • Interpret: If \(p < \alpha\) (e.g., \(\alpha = 0.05\)), reject \(H_0\); otherwise, do not reject it. (A by-hand sketch of these steps follows below.)
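
A minimal sketch of these steps by hand in R, using two made-up samples x1 and x2 (the Welch standard error and degrees of freedom used here are what t.test() computes by default):

# By-hand Welch two-sample t-test on hypothetical data
x1 <- c(5.1, 4.9, 6.2, 5.8, 5.5)
x2 <- c(4.2, 4.8, 4.5, 5.0, 4.1)
se.diff <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))   # standard error of the difference
t.stat <- (mean(x1) - mean(x2)) / se.diff                      # t-statistic
df <- (var(x1) / length(x1) + var(x2) / length(x2))^2 /
  ((var(x1) / length(x1))^2 / (length(x1) - 1) +
   (var(x2) / length(x2))^2 / (length(x2) - 1))                # Welch degrees of freedom
p.value <- 2 * pt(-abs(t.stat), df = df)                       # two-sided p-value
t.test(x1, x2)  # should match the by-hand results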

Hypothesis test for linear regression slope

If there is a linear relationship between the independent quantitative variable \(x\) and the dependent quantitative variable \(y\) in the population, the population slope will not equal zero.

State the hypotheses: \[ H_0: \beta_1 = 0,\ \ \ H_A: \beta_1 \neq 0. \]

The null hypothesis states that the population slope is equal to zero, and the alternative hypothesis states that it is not.

Note: Think of the analogy to “innocent until proven guilty”: we assume the null hypothesis is true unless the data provide strong evidence against it.

Significance level

Select a significance level, \(\alpha\). The most common one is 5%. Some scientific questions (e.g. elementary particles) and medical questions (e.g. vaccines) require lower significance levels (e.g. 1% or 0.1%). This means the test is more conservative, so it’s harder to get significance by chance.

\(\alpha\) is also the probability of obtaining a Type I error, and \(\beta\) is the probability of obtaining a Type II error.
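
As an illustrative sketch (not part of the original notes), a small simulation shows that when the null hypothesis is true, a test at \(\alpha = 0.05\) rejects about 5% of the time, which is exactly the Type I error rate:

# Simulate many two-sample t-tests where both groups come from the same population
set.seed(1)
p.values <- replicate(1000, t.test(rnorm(30), rnorm(30))$p.value)
mean(p.values < 0.05)  # proportion of false rejections, should be close to 0.05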

How to run the test?

In practice, use R. But what does R do?

Using sample data, it finds i) the standard error of the slope, ii) the slope of the regression line, iii) the degrees of freedom, iv) the test statistic, and v) the \(p\)-value associated with the test statistic. A by-hand sketch of these computations follows the list below.

    1. Standard error: If you need to calculate the standard error of the slope (SE) by hand, use the following formula:

\(SE = s_{b_1} = \sqrt{ \frac{ \sum(y_i - \hat{y}_i)^2 / (n - 2) }{ \sum(x_i - \overline{x})^2 } },\)

where \(y_i\) is the value of the dependent variable for observation \(i\), \(\hat{y}_i\) is the estimated value of the dependent variable for observation \(i\), \(x_i\) is the observed value of the independent variable for observation \(i\), \(\overline{x}\) is the mean of the independent variable, and \(n\) is the number of observations.

    2. Slope: The estimated slope \(b_1\) is reported by R.
    3. Degrees of freedom: For simple linear regression (one independent and one dependent variable), the degrees of freedom (DF) is equal to \(DF = n - 2\), where \(n\) is the number of observations in the sample.
    4. Test statistic: The test statistic is a \(t\)-statistic defined by the following equation: \[ t = \frac{b_1}{SE}, \]

where \(b_1\) is the slope of the sample regression line, and \(SE\) is the standard error of the slope.

    5. p-value: The \(p\)-value is the probability of observing a test statistic at least as extreme as the one computed, given that the null hypothesis is true. Since the test statistic is a \(t\)-statistic, use the \(t\)-distribution with the degrees of freedom computed above to assess this probability (in R, via pt()).
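
A minimal sketch of steps 1-5 by hand in R, using made-up vectors x and y; the summary(lm()) call at the end is what R reports and should match the by-hand values:

# By-hand test of the regression slope on hypothetical data
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1)
n <- length(x)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)      # slope
b0 <- mean(y) - b1 * mean(x)                                         # intercept
y.hat <- b0 + b1 * x                                                 # fitted values
SE <- sqrt((sum((y - y.hat)^2) / (n - 2)) / sum((x - mean(x))^2))    # standard error of the slope
t.stat <- b1 / SE                                                    # test statistic
p.value <- 2 * pt(-abs(t.stat), df = n - 2)                          # two-sided p-value
summary(lm(y ~ x))  # the "x" row should match b1, SE, t.stat, and p.value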

Conclusion from test

  • If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the \(p\)-value to the significance level, and rejecting the null hypothesis when the \(p\)-value is less than the significance level.

  • Note: We say “reject the null (in favor of the alternative) or don’t reject the null,” NOT “accept the null”.

  • Analogy: the defendant is never shown to be innocent; the verdict is only “guilty” or “not guilty”.

Testing differences in means

One of the most common statistical tasks is to compare an outcome between two groups. The example here looks at comparing birth weight between smoking and non-smoking mothers.

To start, it always helps to plot things.

# Create boxplot showing how birthwt.grams varies between the two groups of mothers
birthwt %>% ggplot(aes(x=mother.smokes, y=birthwt.grams)) + 
  geom_boxplot() + 
  labs(x = "Mother smokes", y="Birthweight (grams)")

This plot suggests that smoking is associated with lower birth weight. But how can we assess whether this difference is statistically significant?

A summary table

Let’s compute a summary table.

The standard deviation is good to have, but to assess statistical significance we really want the standard error (the standard deviation divided by the square root of the group size).

birthwt %>%
  group_by(mother.smokes) %>%
  summarize(num.obs = n(),
            mean.birthwt = round(mean(birthwt.grams), 0),
            sd.birthwt = round(sd(birthwt.grams), 0),
            se.birthwt = round(sd(birthwt.grams) / sqrt(num.obs), 0))
## # A tibble: 2 × 5
##   mother.smokes num.obs mean.birthwt sd.birthwt se.birthwt
##   <fct>           <int>        <dbl>      <dbl>      <dbl>
## 1 no                115         3056        753         70
## 2 yes                74         2772        660         77

This difference looks substantial relative to the standard errors. But let’s confirm it with a formal test.

t-test via t.test()

To run a two-sample t-test, we can simply use the t.test() function.

birthwt.t.test <- t.test(birthwt.grams ~ mother.smokes, data = birthwt)
birthwt.t.test
## 
##  Welch Two Sample t-test
## 
## data:  birthwt.grams by mother.smokes
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
##  mean in group no mean in group yes 
##          3055.696          2771.919

We see from this output that the difference is highly significant. The t.test() function also outputs a confidence interval for us.

Notice that the function returns a lot of information, and we can access this information element by element. The ability to pull specific information from the output of the hypothesis test allows you to report your results using inline code chunks. That is, you don’t have to hardcode estimates, p-values, confidence intervals, etc.

names(birthwt.t.test)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
birthwt.t.test$p.value   # p-value
## [1] 0.007002548
birthwt.t.test$estimate  # group means
##  mean in group no mean in group yes 
##          3055.696          2771.919
birthwt.t.test$conf.int  # confidence interval for difference
## [1]  78.57486 488.97860
## attr(,"conf.level")
## [1] 0.95
attr(birthwt.t.test$conf.int, "conf.level")  # confidence level
## [1] 0.95

Define a few things:

# Calculate difference in means between smoking and nonsmoking groups
birthwt.t.test$estimate
##  mean in group no mean in group yes 
##          3055.696          2771.919
birthwt.smoke.diff <- birthwt.t.test$estimate[1] - birthwt.t.test$estimate[2]

# Confidence level as a %
conf.level <- attr(birthwt.t.test$conf.int, "conf.level") * 100

Conclusion:

Our study finds that birth weights are on average 283.8 g higher in the non-smoking group compared to the smoking group (t-statistic 2.73, p = 0.007, 95% CI [78.6, 489] g).
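
As a sketch of how a sentence like this can be written with inline code chunks in the R Markdown source (the rounding choices here are assumptions, not the original source), one could write:

Our study finds that birth weights are on average `r round(birthwt.smoke.diff, 1)`g higher in the non-smoking group (p = `r round(birthwt.t.test$p.value, 3)`, `r conf.level`% CI [`r round(birthwt.t.test$conf.int[1], 1)`, `r round(birthwt.t.test$conf.int[2], 1)`]g).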

Example with linear regression

For this example, we use the same dataset as in Exercises 5, on Philadelphia housing prices.

library(tidyverse)
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/philadelphia_house_prices.csv")

First, we filter out the outliers, just so we can focus on the hypothesis testing for now.

dat_no_outliers <- dat %>% filter(price<500)

We draw the scatterplot to see the relationship between square footage and price.

dat_no_outliers %>% ggplot(aes(x=price, y=sqft)) + geom_point()

We see that there is a positive, strong, linear association, with no outliers.

Then, we fit the linear regression model, and check the diagnostic plots.

# Fit the linear regression of square footage on price
out <- lm(sqft~price, data=dat_no_outliers)
# Show the four diagnostic plots in a 2x2 grid
par(mfrow=c(2,2))
plot(out)

# Reset the plotting layout to a single panel
par(mfrow=c(1,1))

The four assumptions of linear regression appear reasonably satisfied: linearity, independence of errors, normality of errors, and homoscedasticity (constant variance of errors).

Now we test whether the slope coefficient of the linear regression is statistically significant. The null hypothesis is that the slope is zero, i.e., that price and square footage are not linearly associated.

summary(out)
## 
## Call:
## lm(formula = sqft ~ price, data = dat_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -794.54 -165.00    8.86  176.11  640.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 594.4443    36.9483   16.09   <2e-16 ***
## price         4.9217     0.1263   38.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 249 on 993 degrees of freedom
## Multiple R-squared:  0.6047, Adjusted R-squared:  0.6043 
## F-statistic:  1519 on 1 and 993 DF,  p-value: < 2.2e-16

We can read the coefficient and its corresponding standard error, t-statistic, and p-value in the row of the coefficients table labeled “price”.
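
As with the t-test object, these numbers can be pulled out programmatically instead of being read off the printout; one way (a sketch using the coefficient matrix returned by summary()) is:

# Extract the estimate, standard error, t value, and p-value for price
summary(out)$coefficients["price", ]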

We see that, indeed, there is a statistically significant association between price and square footage in Philadelphia. The three stars indicate statistical significance at a level below 0.001; the usual cutoff in social science is 0.05, so this result easily clears it. Therefore, we can interpret the coefficient as follows:

Conclusion:

An additional unit of price is associated with an additional 4.92 square feet, and this association is statistically significant at the 0.05 level.
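
A complementary way to express this (a sketch, not part of the original conclusion) is a 95% confidence interval for the slope, which should exclude zero:

# 95% confidence interval for the intercept and the price coefficient
confint(out, level = 0.95)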