Note: use city_crime_spending.csv.
library(tidyverse)
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/city_crime_spending.csv")
1. Question For two quantitative variables (x and y): Write down the requirements for each of these steps.
Instructions Fit a simple linear regression of police_spending on population.
2. Question What is the null hypothesis for doing inference? What is the research question you could ask here?
The null hypothesis is that there is no linear relationship between population and police_spending; equivalently, the slope of population in the regression is zero. A research question could be: do cities with larger populations spend more on police?
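As a sketch (using the model object out that is fit in Question 3 below), the t-test for this null hypothesis can be read off the slope row of the coefficient table:
coef(summary(out))["population", ]  # estimate, std. error, t value, and p-value for the slope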
3. Question What is the coefficient of determination? (The one called Multiple R-squared. The other one has a penalty for adding more covariates, so we won’t need it here.) What does this mean?
I fit a linear model:
out <- lm(police_spending ~ population, data=dat)
And look at the R squared.
summary(out)
##
## Call:
## lm(formula = police_spending ~ population, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -148.345  -25.665   -2.046   20.840  168.405
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.078e+01  3.800e+00   23.89   <2e-16 ***
## population  3.186e-04  6.558e-06   48.59   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.17 on 498 degrees of freedom
## Multiple R-squared: 0.8258, Adjusted R-squared: 0.8254
## F-statistic: 2360 on 1 and 498 DF, p-value: < 2.2e-16
The (Multiple) R-squared is 0.8258. This means that about 82.6% of the variance in police_spending is explained by the linear regression on population, which indicates a strong linear association between the two variables.
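As a check (a sketch using the objects already created above), the R-squared of a simple linear regression can be recomputed by hand as one minus the residual sum of squares over the total sum of squares, and with a single predictor it equals the squared correlation between x and y:
y <- dat$police_spending
1 - sum(resid(out)^2) / sum((y - mean(y))^2)  # reproduces the Multiple R-squared, about 0.826
cor(dat$population, dat$police_spending)^2    # squared correlation gives the same value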
4. Question Test the assumptions of the linear regression.
First, draw a scatterplot.
dat %>% ggplot(aes(x=population, y=police_spending)) + geom_point()
For quantitative EDA, I check the conditions for correlation: x and y are quantitative variables, the relationship between them is straight enough, and there are no noticeable outliers. The scatterplot suggests these conditions hold, so I go ahead and calculate the correlation.
dat %>% summarize(correlation = cor(population, police_spending))
## # A tibble: 1 × 1
## correlation
## <dbl>
## 1 0.909
I draw the diagnostic plots:
par(mfrow=c(2,2))  # arrange the four diagnostic panels in a 2x2 grid
plot(out)          # standard lm diagnostic plots
par(mfrow=c(1,1))  # reset the plotting layout
I then use these plots to test the assumptions: the Residuals vs Fitted panel checks linearity, the Normal Q-Q panel checks normality of the residuals, the Scale-Location panel checks constant variance, and the Residuals vs Leverage panel flags influential observations.
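The visual checks can be supplemented with formal tests. This is only a sketch and assumes the lmtest package is installed, which the code above does not load:
shapiro.test(resid(out))  # Shapiro-Wilk test of normality of the residuals
lmtest::bptest(out)       # Breusch-Pagan test for non-constant error variance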
5. Question Should you interpret the coefficients? If so, go ahead and interpret the coefficients. If not, then give a statement of why you should not interpret them.
Since the assumptions of linear regression are NOT fully met, I will not interpret the coefficients.