Note: For this exam, don’t worry about including the correct labels and titles for the plots. Using the variable names and no title is ok.

1. Loading data (5 points)

Question:

Load the data called pretrial_df.csv using the proper R command and file path.

Code (5 points):

library(tidyverse)  # provides read_csv(), %>%, the dplyr verbs, and ggplot()
dat <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/pretrial_df.csv")
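
Note: the sketch below shows one way to check a file path before loading, so that a wrong path fails with a clear message. The temporary file and the toy rows are made up for illustration only, and base R's read.csv() stands in for read_csv() so the sketch runs without extra packages.

```r
# Hypothetical toy rows, written to a temporary file for illustration only.
toy <- data.frame(
  race = c("Black", "White", "Hispanic"),
  bail_amount = c(0, 9000, 2500)
)
csv_path <- file.path(tempdir(), "pretrial_toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Stop early if the path is wrong, then read the file back in.
stopifnot(file.exists(csv_path))
toy_read <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(toy_read)  # number of rows, then number of columns
```

With the exam data, the same check would be applied to the path of pretrial_df.csv on your own machine.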

2. Variable types (5 points)

Question:

Read the codebook. What statistical type is each of these variables: race, court_type, bail_amount, days_until_trial?

Text answer (5 points):
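
Note: the codebook is the authority on variable types, but the storage class R assigns to each column can serve as a rough cross-check. The sketch below uses a hypothetical toy data frame; its values, its column classes, and the court type label "state" are assumptions and are not taken from the real file.

```r
# Hypothetical toy data frame; the column classes here are assumptions.
toy <- data.frame(
  race = c("Black", "White"),
  court_type = c("municipal", "state"),
  bail_amount = c(0, 9000.5),
  days_until_trial = c(12L, 40L),
  stringsAsFactors = FALSE
)

# Storage class of each column. Character columns usually hold categorical
# variables; numeric and integer columns usually hold quantitative ones.
col_classes <- sapply(toy, class)
col_classes
```

Storage class is only a hint: a numeric code such as 1, 2, 3 for court type would still be a categorical variable.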

3. Data dimensions (5 points)

Question:

How many observations and variables are there in the data?

Code (3 points):

dim(dat)
## [1] 500   4

Text answer (2 points):

There are 500 observations and 4 variables.
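
Note: dim() reports rows first and columns second. If that order is easy to mix up, nrow() and ncol() give each number by name. A minimal sketch on a hypothetical toy data frame:

```r
# Hypothetical toy data frame with 5 rows and 3 columns.
toy <- data.frame(
  x = 1:5,
  y = letters[1:5],
  z = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

n_obs <- nrow(toy)   # observations (rows)
n_vars <- ncol(toy)  # variables (columns)
c(observations = n_obs, variables = n_vars)
```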

4. Quantitative EDA (5 points)

Question:

What percentage of the sample is Black and what percentage is white?

Code (3 points):

dat %>% count(race) %>% mutate(prop = prop.table(n))
## # A tibble: 4 × 3
##   race         n  prop
##   <chr>    <int> <dbl>
## 1 Black      139 0.278
## 2 Hispanic    99 0.198
## 3 Other       55 0.11 
## 4 White      207 0.414

Text answer (2 points):

27.8% is Black and 41.4% is white.
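
Note: the percentages can be cross-checked in base R. The sketch below rebuilds the race column from the counts printed in the tibble above, so it does not need the data file; the variable names race and race_pct are chosen here for illustration.

```r
# Rebuild the race column from the counts printed in the tibble above.
race <- rep(
  c("Black", "Hispanic", "Other", "White"),
  times = c(139, 99, 55, 207)
)

# table() counts each category; prop.table() converts counts to proportions.
race_pct <- round(100 * prop.table(table(race)), 1)
race_pct
```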

5. Visual EDA (5 points)

Question:

Make a histogram of bail_amount. Describe this histogram.

Code (3 points):

dat %>% ggplot(aes(x=bail_amount)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Text answer (2 points):

The distribution is bimodal and skewed right with a long tail. There are no clear outliers.
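
Note: the stat_bin() message appears because no bin width was chosen. One way to respond to it is to set binwidth explicitly. The sketch below uses hypothetical toy bail amounts, and the width of 1000 is an arbitrary choice for illustration, not a recommended value for the exam data.

```r
library(ggplot2)

# Hypothetical toy bail amounts in dollars, for illustration only.
toy <- data.frame(
  bail_amount = c(0, 0, 0, 500, 1200, 3000, 6500, 8000, 9500, 12000, 25000)
)

# binwidth = 1000 makes each bar cover $1,000;
# boundary = 0 places a bin edge exactly at zero.
p <- ggplot(toy, aes(x = bail_amount)) +
  geom_histogram(binwidth = 1000, boundary = 0)
p
```

Trying a few different widths is a quick way to check that features such as the second mode are not an artifact of the binning.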

6. Quantitative EDA (5 points)

Question:

Describe the centrality and spread of bail_amount.

Notice that if the median and mean are very different from each other, that is one way of telling that the distribution is skewed, if you didn’t already know from visual inspection.

Code (3 points):

summ_stats_bail_amount <- dat %>% summarize(median=median(bail_amount),
                  IQR = IQR(bail_amount))

summ_stats_bail_amount
## # A tibble: 1 × 2
##   median   IQR
##    <dbl> <dbl>
## 1  6639. 9712.

Text answer (2 points):

Median is $6,638.93 and IQR is $9,712.44.
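
Note: the remark about the mean and median can be checked numerically. The sketch below uses a hypothetical right-skewed toy vector (not the exam data) to show the mean being pulled toward the long tail while the median stays put, which is why the median and IQR are reported for a skewed variable.

```r
# Hypothetical right-skewed toy values in dollars, for illustration only.
x <- c(0, 500, 1000, 2000, 5000, 50000)

center_mean <- mean(x)      # pulled upward by the one large value
center_median <- median(x)  # resistant to the large value
spread_iqr <- IQR(x)        # resistant measure of spread, pairs with the median
spread_sd <- sd(x)          # sensitive measure of spread, pairs with the mean

c(mean = center_mean, median = center_median, IQR = spread_iqr, sd = spread_sd)
```

Here the mean is several times larger than the median, which signals right skew.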

7. Visual EDA (5 points)

Question:

Make a barplot for race and describe it.

Code (3 points):

dat %>% ggplot(aes(x=race)) + geom_bar()

Text answer (2 points):

White individuals are the largest group, followed by Black, then Hispanic, then Other.
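
Note: the ordering of the bars can be confirmed from the counts. The sketch below copies the counts from the count() output in question 4 and sorts them from largest to smallest; the names race_counts and sorted_counts are chosen here for illustration.

```r
# Counts per race, copied from the count() output in question 4.
race_counts <- c(Black = 139, Hispanic = 99, Other = 55, White = 207)

# Sort from largest to smallest group to read off the bar heights in order.
sorted_counts <- sort(race_counts, decreasing = TRUE)
sorted_counts
```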

8. Visual EDA (5 points)

Question:

Split up the histogram of bail_amount by race. Describe what you see.

Code (3 points):

dat %>% ggplot(aes(x=bail_amount, fill=race)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Text answer (2 points):

It looks like there is a large peak at 0 for Black individuals. The rest of the races look similar in terms of centrality and spread.
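
Note: a peak at 0 seen in a stacked histogram can be backed up with a number, for example the share of zero-dollar bail amounts within each race. The sketch below uses hypothetical toy data, so its output says nothing about the exam data; it only shows the calculation.

```r
# Hypothetical toy data, for illustration only; not the exam data.
toy <- data.frame(
  race = c("Black", "Black", "Black", "White", "White", "Hispanic", "Hispanic"),
  bail_amount = c(0, 0, 4000, 9000, 0, 2500, 0)
)

# The mean of a logical vector is the share of TRUE values, so this gives
# the share of zero-dollar bail amounts within each race.
share_zero <- tapply(toy$bail_amount == 0, toy$race, mean)
round(share_zero, 2)
```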

9. Quantitative EDA (5 points)

Question:

Find the summary statistics (centrality and spread) of bail_amount for each race.

Note, you can use the group_by command for this.

Code (3 points):

dat %>% group_by(race) %>% summarize(median = median(bail_amount),
                                     IQR = IQR(bail_amount))
## # A tibble: 4 × 3
##   race     median   IQR
##   <chr>     <dbl> <dbl>
## 1 Black     1238. 4746.
## 2 Hispanic  2794. 7631.
## 3 Other     7944. 8592.
## 4 White     9078. 7455.

Text answer (2 points):

Black individuals have a median bail amount of $1,237.97 and White individuals have a median bail amount of $9,078.23. This is a large difference. Hispanic individuals also have a low median bail amount at $2,793.94. Other is somewhere in between, but it’s difficult to say more about this group since we do not know its composition. The spread is similar for all groups except Black individuals, for whom it is lower.
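
Note: "a large difference" can be made concrete by computing the gap and the ratio between the group medians. The sketch below copies the three medians quoted in the text answer above; the names median_bail, gap_dollars, and ratio are chosen here for illustration.

```r
# Medians in dollars, copied from the text answer above.
median_bail <- c(Black = 1237.97, Hispanic = 2793.94, White = 9078.23)

# Absolute gap and ratio between the White and Black medians.
gap_dollars <- median_bail[["White"]] - median_bail[["Black"]]
ratio <- median_bail[["White"]] / median_bail[["Black"]]

round(c(gap = gap_dollars, ratio = ratio), 2)
```

The White median is roughly seven times the Black median.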

Extra credit. Visual EDA (5 points)

Question:

Do individuals of a certain race tend to go to a specific type of court, and do those have lower bail amounts? For instance, do Black individuals tend to go to municipal court, and is that where all those super low bail amounts come from?

To answer this question, you can make a histogram for bail_amount, and then split it up by court_type and race (one way to do this is by faceting, and another is by using fill or color).

Code (3 points):

dat %>% ggplot(aes(x=bail_amount)) + geom_histogram() + facet_grid(court_type~race)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Text answer (2 points):

It looks like Black and Hispanic individuals do get lower bail amounts in municipal court.
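
Note: the first half of the question (whether individuals of a certain race tend to go to a specific type of court) can also be checked with a two-way table of race by court_type. The sketch below uses hypothetical toy data; the court type label "state" is an assumption, and the output says nothing about the exam data.

```r
# Hypothetical toy data; the label "state" is an assumption for illustration.
toy <- data.frame(
  race = c("Black", "Black", "Black", "White", "White", "White"),
  court_type = c("municipal", "municipal", "state", "state", "state", "municipal")
)

# Rows are race and columns are court type. margin = 1 makes each row sum
# to 1, so each cell is the share of that race seen in that court type.
court_share <- prop.table(table(toy$race, toy$court_type), margin = 1)
round(court_share, 2)
```

If one row puts most of its share in a single court type, that race tends to appear in that court, and the faceted histogram then shows whether that court also has the lower bail amounts.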