Note: For this exam, don’t worry about including the correct labels and titles for the plots. Using the variable names and no title is ok.
Question:
Load the data called pretrial_df.csv using the proper R command and file path.
Code (5 points):
library(tidyverse) # provides read_csv(), %>%, and the dplyr and ggplot2 functions used below
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/pretrial_df.csv")
Question:
Read the codebook. What statistical type is each of the following variables: race, court_type, bail_amount, days_until_trial?
Text answer (5 points):
Question:
How many observations and variables are there in the data?
Code (3 points):
dim(dat)
## [1] 500 4
Text answer (2 points):
There are 500 observations and 4 variables.
Question:
What percentage of the sample is Black and what percentage is white?
Code (3 points):
dat %>% count(race) %>% mutate(prop = prop.table(n))
## # A tibble: 4 × 3
##   race         n  prop
##   <chr>    <int> <dbl>
## 1 Black      139 0.278
## 2 Hispanic    99 0.198
## 3 Other       55 0.11
## 4 White      207 0.414
Text answer (2 points):
27.8% of the sample is Black and 41.4% is White.
Question:
Make a histogram of bail_amount. Describe this histogram.
Code (3 points):
dat %>% ggplot(aes(x=bail_amount)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Text answer (2 points):
The distribution is bimodal and skewed right, with a long right tail. There are no clear outliers.
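The `stat_bin()` message above suggests choosing the bin width explicitly instead of relying on the default of 30 bins. A minimal sketch of one way to do that follows. The data frame `toy` and its simulated values are hypothetical stand-ins for `dat`, and the $1,000 bin width is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for dat: invented bail amounts in dollars,
# with a cluster at 0 and a long right tail (not the exam data)
set.seed(1)
toy <- data.frame(bail_amount = c(rep(0, 20), rexp(180, rate = 1 / 8000)))

# binwidth = 1000 makes every bar cover $1,000;
# boundary = 0 places a bin edge at $0
p <- ggplot(toy, aes(x = bail_amount)) +
  geom_histogram(binwidth = 1000, boundary = 0)
p
```

Setting `binwidth` also silences the `stat_bin()` message, and trying a few widths is a quick check that a feature such as bimodality is not an artifact of the binning.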
Question:
Describe the centrality and spread of this variable.
Note: a large difference between the median and the mean is one sign that a distribution is skewed, in case that was not already clear from visual inspection.
Code (3 points):
summ_stats_bail_amount <- dat %>%
  summarize(median = median(bail_amount),
            IQR = IQR(bail_amount))
summ_stats_bail_amount
## # A tibble: 1 × 2
##   median   IQR
##    <dbl> <dbl>
## 1  6639. 9712.
Text answer (2 points):
Median is $6,638.93 and IQR is $9,712.44.
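The note in this question about comparing the mean and the median can be illustrated with a small sketch. The tibble `toy` and its seven values are made up for illustration and are not taken from pretrial_df.csv.

```r
library(dplyr)

# Hypothetical bail amounts in dollars: six modest values and one very large one
toy <- tibble(bail_amount = c(0, 500, 1000, 2000, 3000, 5000, 40000))

toy_stats <- toy %>%
  summarize(mean   = mean(bail_amount),    # pulled upward by the 40000 value
            median = median(bail_amount),  # middle value, here 2000
            sd     = sd(bail_amount),      # also inflated by the 40000 value
            IQR    = IQR(bail_amount))     # based on quartiles only
toy_stats
```

In this toy example the mean is far above the median, which is the pattern the note describes for a right-skewed variable. That is also a reason to report the median and IQR for such a variable, as the answer above does.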
Question:
Make a barplot for race and describe it.
Code (3 points):
dat %>% ggplot(aes(x=race)) + geom_bar()
Text answer (2 points):
White individuals are the largest group (207), followed by Black (139), Hispanic (99), and Other (55).
Question:
Split up the histogram of bail_amount by race. Describe what you see.
Code (3 points):
dat %>% ggplot(aes(x=bail_amount, fill=race)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Text answer (2 points):
It looks like there is a large peak at $0 for Black individuals. The other racial groups look similar to one another in terms of centrality and spread.
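One caveat when reading a histogram drawn with `fill`: `geom_histogram()` stacks the groups by default, so only the bottom group is drawn from a common baseline. A hedged sketch of two alternatives follows, overlaying semi-transparent bars and putting each group in its own panel. The data frame `toy` is simulated and does not reflect the exam data, and the bin width and transparency are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for dat: two groups with invented bail amounts
set.seed(2)
toy <- data.frame(
  race = rep(c("Black", "White"), each = 100),
  bail_amount = c(rexp(100, rate = 1 / 3000), rexp(100, rate = 1 / 9000))
)

# Option 1: overlay the groups instead of stacking them;
# alpha makes the overlapping bars see-through
p_overlay <- ggplot(toy, aes(x = bail_amount, fill = race)) +
  geom_histogram(binwidth = 1000, boundary = 0,
                 position = "identity", alpha = 0.5)

# Option 2: one panel per group, stacked vertically on a shared x axis
p_facet <- ggplot(toy, aes(x = bail_amount)) +
  geom_histogram(binwidth = 1000, boundary = 0) +
  facet_wrap(~race, ncol = 1)

p_overlay
p_facet
```

With either option every group's bars start at zero, which makes it easier to compare peaks across groups.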
Question:
Find the summary statistics (centrality and spread) of bail_amount for each race.
Note, you can use the group_by command for this.
Code (3 points):
dat %>%
  group_by(race) %>%
  summarize(median = median(bail_amount),
            IQR = IQR(bail_amount))
## # A tibble: 4 × 3
##   race     median   IQR
##   <chr>     <dbl> <dbl>
## 1 Black     1238. 4746.
## 2 Hispanic  2794. 7631.
## 3 Other     7944. 8592.
## 4 White     9078. 7455.
Text answer (2 points):
Black individuals have a median bail amount of $1,237.97 and White individuals have a median bail amount of $9,078.23. This is a large difference. Hispanic individuals also have a low median bail amount, at $2,793.94. The Other group falls in between, but it is difficult to say more about this group since we do not know its composition. The spread (IQR) is similar across groups, except that it is lower for Black individuals.
Question:
Do individuals of a certain race tend to go to a specific type of court, and do those have lower bail amounts? For instance, do Black individuals tend to go to municipal court, and is that where all those super low bail amounts come from?
To answer this question, you can make a histogram for bail_amount, and then split it up by court_type and race (one way to do this is by faceting, and another is by using fill or color).
Code (3 points):
dat %>% ggplot(aes(x=bail_amount)) + geom_histogram() + facet_grid(court_type~race)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Text answer (2 points):
It looks like Black and Hispanic individuals do get lower bail amounts in municipal court.
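The first half of this question, whether individuals of a certain race tend to go to a specific type of court, can also be checked with a table of counts. Below is a minimal sketch, assuming a data frame shaped like `dat`. The tibble `toy`, its values, and the court label "district" are hypothetical, since only municipal court is named in the question.

```r
library(dplyr)

# Hypothetical stand-in for dat; values and the "district" label are invented
toy <- tibble(
  race        = c("Black", "Black", "Black", "White", "White", "White"),
  court_type  = c("municipal", "municipal", "district",
                  "district", "district", "municipal"),
  bail_amount = c(0, 500, 6000, 9000, 8000, 1000)
)

court_by_race <- toy %>%
  group_by(race, court_type) %>%
  summarize(n = n(),
            median_bail = median(bail_amount),
            .groups = "drop_last") %>%         # result stays grouped by race
  mutate(prop_within_race = n / sum(n)) %>%    # share of each race seen in each court
  ungroup()
court_by_race
```

Reading `prop_within_race` alongside `median_bail` shows whether a group is concentrated in one court type and whether that court type has lower bail amounts, which complements the faceted histogram.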