Note: use city_crime_spending.csv.

library(tidyverse)
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/city_crime_spending.csv")

1. Question For two quantitative variables (x and y): Write down the requirements for each of these steps.

Instructions Fit a simple linear regression for police_spending regressed onto population.

2. Question What is the null hypothesis for doing inference? What is the research question you could ask here?

The null hypothesis is that there is no linear relationship between population and police_spending, i.e., that the slope on population is zero. A research question could be: is a city's police spending associated with the size of its population?
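
One way to carry out this test (a sketch only; it reuses dat loaded above and fits the same model that appears in Question 3) is to look at the t-test on the population slope and its confidence interval:

out <- lm(police_spending ~ population, data = dat)  # same model as in Question 3
coef(summary(out))["population", ]  # estimate, std. error, t value, Pr(>|t|) for the slope
confint(out, "population")          # 95% confidence interval for the slope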

3. Question What is the coefficient of determination? (The one called Multiple R-squared. The other one has a penalty for adding more covariates, so we won’t need it here.) What does this mean?

I fit a linear model:

out <- lm(police_spending ~ population, data=dat)

And look at the R squared.

summary(out)
## 
## Call:
## lm(formula = police_spending ~ population, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -148.345  -25.665   -2.046   20.840  168.405 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.078e+01  3.800e+00   23.89   <2e-16 ***
## population  3.186e-04  6.558e-06   48.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.17 on 498 degrees of freedom
## Multiple R-squared:  0.8258, Adjusted R-squared:  0.8254 
## F-statistic:  2360 on 1 and 498 DF,  p-value: < 2.2e-16

The (Multiple) R-squared is 0.8258. This means that about 83% of the variance in police_spending is explained by its linear relationship with population, which indicates a strong association.
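
If needed, the R-squared can also be pulled directly from the fitted model object rather than read off the printout (a small sketch using the summary object):

summary(out)$r.squared      # Multiple R-squared
summary(out)$adj.r.squared  # Adjusted R-squared (the penalized version, not needed here)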

4. Question Test the assumptions of the linear regression.

First, draw a scatterplot.

dat %>% ggplot(aes(x=population, y=police_spending)) + geom_point()
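
As an extra visual check (a sketch, not part of the original output), the least-squares line can be overlaid on the scatterplot to help judge whether the relationship is straight enough:

dat %>%
  ggplot(aes(x = population, y = police_spending)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # add the fitted line, without the confidence band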

For quantitative EDA, I check the conditions for correlation: x and y are quantitative variables, the relationship between x and y is straight enough, and there are no noticeable outliers. The scatterplot suggests these conditions are met, so I go ahead and calculate the correlation.

dat %>% summarize(correlation = cor(population, police_spending))
## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.909

I draw the diagnostic plots:

par(mfrow=c(2,2))
plot(out)

par(mfrow=c(1,1))

And then test assumptions:

  1. Linear relationship between x and y: There is a linear relationship between x and y, shown in the scatterplot.
  2. Independence between observations: I cannot test for independence between observations, but I don’t see any evidence of clumping to suggest strong dependencies.
  3. Homoscedasticity of errors: There is a clear difference in the variance of the errors, shown in the scale-location plot as well as in the residuals vs. fitted plot and the scatterplot. This assumption is NOT met (a formal check is sketched after this list).
  4. Normality of y for a given x: The normal Q-Q plot of the residuals shows the points falling mostly on the diagonal, with some deviation at the ends, so this assumption is sufficiently met.
  5. Influential outliers: The residuals vs. leverage plot shows no points with unusually large Cook's distance, so there are no influential outliers.
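
The visual checks for homoscedasticity and influential points can be supplemented with formal diagnostics. This is only a sketch: it assumes the lmtest package is installed (its bptest() function runs the Breusch-Pagan test), and the 4/n cutoff for Cook's distance is just a common rule of thumb.

library(lmtest)
bptest(out)  # Breusch-Pagan test; H0: constant error variance

cooks <- cooks.distance(out)   # influence of each observation on the fit
which(cooks > 4 / nrow(dat))   # flag observations above the rule-of-thumb cutoff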

5. Question Should you interpret the coefficients? If so, go ahead and interpret the coefficients. If not, then give a statement for why you should not interpret them.

Since the assumptions of linear regression are NOT fully met (the homoscedasticity assumption fails), I will not interpret the coefficients: with non-constant error variance, the reported standard errors, and therefore the p-values and confidence intervals, are not reliable.