The Grammar of Graphics is a book published in 1999 by the statistician Leland Wilkinson. Hadley Wickham later built on it to write the R package ggplot (now ggplot2); the “gg” stands for Grammar of Graphics.
It is called a grammar because it gives a structure for building plots out of components, much as a language builds sentences out of parts. Many current plotting tools have been informed by this book: its influence can be found in Tableau, Plotly, and the Python libraries bokeh, altair, seaborn, and plotnine. The most complete implementation of the grammar is found in ggplot2.
In Wickham’s adaptation of the grammar of graphics, a plot can be decomposed into seven elements: data, aesthetics, geometries, facets, statistics, coordinates, and themes.
Why it’s called “aesthetics”: it’s not about style or beauty; it’s about visual mappings. An aesthetic is a visual property of the plot that can be tied to a variable in the data: x and y position, color, fill, shape, size, alpha (transparency), and linetype. This contrasts with “settings”. If you don’t want an aesthetic to depend on the data, you set it to a fixed value outside aes(). For example:
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(color = "red")
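To make the contrast concrete, here is a minimal sketch that draws the same scatterplot twice. The choice of factor(cyl) as the mapped variable is only an illustration and is not part of the example above.

```r
library(ggplot2)

# Mapping: color depends on a data column, so it goes inside aes()
# and ggplot2 draws a legend for it.
p_mapped <- ggplot(mtcars, aes(x = mpg, y = hp, color = factor(cyl))) +
  geom_point()

# Setting: color is one fixed value, so it goes outside aes()
# and no legend is drawn.
p_set <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(color = "red")

p_mapped
p_set
```

In the first plot every point takes its color from its cyl value; in the second every point is red regardless of the data.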
We will use the mpg dataset, which is built into the ggplot2 package, which is part of the tidyverse package.
We will use the variables cty and hwy as our quantitative variables, and the variables drv and cyl as our categorical ones.
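Before plotting, it can help to check how these four variables are stored. A minimal sketch using glimpse(), assuming the tidyverse is installed:

```r
library(tidyverse)

# Show the storage type and first few values of the variables used below
mpg %>%
  select(cty, hwy, drv, cyl) %>%
  glimpse()
```

Note that cyl is stored as a number even though we treat it as categorical here, which is why it sometimes needs to be wrapped in factor() when plotting.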
# load packages
library(tidyverse)
library(ggthemes)
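If you call ggplot() with only the data, you get an empty gray canvas. Nothing is drawn because no variables or geometry have been specified yet.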
mpg %>% ggplot()
If you add the aesthetics, you get a coordinate system where the plot will go. Here we have mapped one quantitative variable, cty, to x.
mpg %>% ggplot(aes(x=cty))
If you then add the geometry (shortened as geom), then you get a plot.
mpg %>% ggplot(aes(x=cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
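The message above is a reminder that 30 bins is only a default. A minimal sketch of setting the bin width explicitly; the value 1 is an illustrative choice, not a recommendation from these notes:

```r
library(tidyverse)

# cty is recorded in whole miles per gallon, so with binwidth = 1
# each bar covers a range one unit wide
p_hist <- mpg %>%
  ggplot(aes(x = cty)) +
  geom_histogram(binwidth = 1)

p_hist
```

Supplying binwidth (or bins) also silences the message. Trying a few values is worthwhile, since the apparent shape of the distribution can change with the bin width.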
You can change the color of the bars for fun, without it having any particular meaning. Here color sets the outline of the bars and fill sets their interior. Note that there is no legend: this is SETTING a constant color, not mapping it to a variable. That’s why it’s outside aes().
mpg %>% ggplot(aes(x=cty)) + geom_histogram(color = "steelblue", fill="darkblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Or you can use color with meaning. Here we use it to show how the counts split up by a categorical variable, drv. Each bar is divided by fill color, one color for each category of drv. This is MAPPING color to a variable, so a legend appears.
mpg %>% ggplot(aes(x=cty, fill = drv)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can use facets instead to see how the distribution differs according to a categorical variable. This splits up the plot into three plots, one for each category of drv.
mpg %>% ggplot(aes(x=cty)) + geom_histogram() + facet_grid(~drv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can also change the coordinates (although we won’t do this often).
# coord_polar() for pie/polar charts
mpg %>% ggplot(aes(x=cty)) + geom_histogram() + coord_polar()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can add “statistics”, meaning some modeling, to your plot. Note that ggplot is doing this in the background; we will learn later how to fit models ourselves.
mpg %>% ggplot(aes(x=cty)) +
  geom_histogram(bins=40, aes(y = after_stat(density))) +
  # Computes a kernel density estimate
  stat_density(geom = "line", color = "blue", linewidth = 1) +
  # Plots any function (e.g., theoretical Normal distribution)
  stat_function(fun = dnorm,
                args = list(mean = mean(mpg$cty), sd = sd(mpg$cty)),
                color = "red", linetype = "dashed", linewidth=1.5)
Finally, you can change the look of the plot.
Themes
Ready-made themes are available, for example theme_economist() from the ggthemes package:
mpg %>% ggplot(aes(x=cty)) + geom_histogram() + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# make sure you've loaded the ggthemes package
You can see more options here: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
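If ggthemes is not installed, ggplot2 ships with several themes of its own. A minimal sketch using theme_minimal(); the choice of theme and of binwidth = 1 is only for illustration:

```r
library(tidyverse)

# theme_minimal() is part of ggplot2 itself, so ggthemes is not needed here
p_theme <- mpg %>%
  ggplot(aes(x = cty)) +
  geom_histogram(binwidth = 1) +
  theme_minimal()

p_theme
```

Other built-in options include theme_bw(), theme_classic(), and theme_void().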
Palettes
You can also change the palettes. A nice set is the Brewer palettes (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html), designed by Cynthia Brewer (https://en.wikipedia.org/wiki/Cynthia_Brewer).
There is also an interactive page you can play around with that shows how color is used in ggplot: https://r-graph-gallery.com/ggplot2-color.html
library(tidyverse)
mpg %>% ggplot(aes(x=cty, fill=drv)) + geom_histogram() + scale_fill_brewer(palette = "Dark2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
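Putting the pieces of this walkthrough together, here is a sketch of a single plot that touches each element of the grammar (data, aesthetics, geometry, statistics, facets, coordinates, theme), with comments marking which line plays which role. The particular combination is an illustration, not a recommended default.

```r
library(tidyverse)

p_all <- mpg %>%                            # data
  ggplot(aes(x = cty, fill = drv)) +        # aesthetics: cty mapped to x, drv to fill
  geom_histogram(binwidth = 1) +            # geometry; its statistic (stat_bin) counts per bin
  facet_grid(~drv) +                        # facets: one panel per drv category
  coord_cartesian() +                       # coordinates: the default system, written out
  scale_fill_brewer(palette = "Dark2") +    # palette for the fill colors
  theme_minimal()                           # theme

p_all
```

Each line can be removed or swapped independently, which is what makes the grammar composable.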
Visual EDA
Histogram
Plot a histogram to visualize a single quantitative variable.
mpg %>% ggplot(aes(x=cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boxplot
Plot a boxplot to visualize a single quantitative variable.
mpg %>% ggplot(aes(y=cty)) + geom_boxplot()
Quantitative EDA
To summarize a quantitative variable, first determine how many modes the distribution has. If it’s uniform, then report the range of the variable.
# If uniform (which is not the case for cty, but I'll show you how to calculate a range with it anyway):
range(mpg$cty)
## [1] 9 35
If it’s multimodal, then see if you can split up the distribution into its modes, by using a categorical variable. If not, then you can just report where the modes are.
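As a sketch of splitting a distribution by a categorical variable, the code below plots hwy pooled and then separately for each drv category. Both hwy and binwidth = 2 are illustrative choices; whether drv actually separates the peaks is something to judge from the plots.

```r
library(tidyverse)

# hwy pooled over all cars
p_pooled <- mpg %>%
  ggplot(aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# The same variable, one panel per drive train, stacked vertically
# so the panels share an x-axis and are easy to compare
p_split <- mpg %>%
  ggplot(aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~drv, ncol = 1)

p_pooled
p_split
```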
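If it’s unimodal, then you can give measures of its centrality and spread. If it’s symmetric, then give the mean and standard deviation, and if it’s asymmetric, give the median and IQR. Note that in a symmetric distribution the mean and the median are approximately equal, so a large gap between them suggests skew. The reverse does not always hold: equal mean and median do not by themselves guarantee symmetry, so check the histogram as well.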
# If unimodal and symmetric:
mpg %>% summarize(mean=mean(cty), sd=sd(cty))
## # A tibble: 1 × 2
## mean sd
## <dbl> <dbl>
## 1 16.9 4.26
# If unimodal and asymmetric:
mpg %>% summarize(median=median(cty), IQR=IQR(cty))
## # A tibble: 1 × 2
## median IQR
## <dbl> <dbl>
## 1 17 5
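One rough way to choose between the two summaries is to compute the mean and median side by side and compare them against the histogram. A minimal sketch (the object name cty_summary is just for illustration):

```r
library(tidyverse)

# Center and spread for cty, both versions in one table
cty_summary <- mpg %>%
  summarize(
    mean   = mean(cty),
    median = median(cty),
    sd     = sd(cty),
    IQR    = IQR(cty)
  )

cty_summary
```

For cty the mean and median are close, as the outputs above show, but the histogram is still the better guide to whether the distribution is symmetric.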
Visual EDA
Barplot
Draw a barplot to visualize a single categorical variable.
mpg %>% ggplot(aes(x=drv)) + geom_bar()
Pie chart
(Actually, don’t use pie charts. They’re often more misleading than helpful, because humans are bad at comparing areas to each other. Just stick to barplots or tables.)
#mpg %>% ggplot(aes(x="", fill=drv)) + geom_bar() + coord_polar("y") + theme_void()
Quantitative EDA
How to make a table to summarize a categorical variable:
dat_students <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/students.csv")
## Rows: 10497 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, gender, year_in_college, favorite_color
## dbl (2): age, grade
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat_students %>% count(year_in_college)
## # A tibble: 4 × 2
## year_in_college n
## <chr> <int>
## 1 First year 2129
## 2 Fourth year 2108
## 3 Second year 3144
## 4 Third year 3116
How to add proportions:
dat_students %>%
  count(year_in_college) %>%
  mutate(prop = n / sum(n))
## # A tibble: 4 × 3
## year_in_college n prop
## <chr> <int> <dbl>
## 1 First year 2129 0.203
## 2 Fourth year 2108 0.201
## 3 Second year 3144 0.300
## 4 Third year 3116 0.297
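Because students.csv is a local file, here is the same pattern on the built-in mpg data, so it can be run anywhere. This is a sketch; drv is chosen only as an example of a categorical variable.

```r
library(tidyverse)

# Count each category of drv, then divide by the total to get proportions
drv_table <- mpg %>%
  count(drv) %>%
  mutate(prop = n / sum(n))

drv_table
```

The prop column always sums to 1, which is a quick sanity check on the table.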
Visual EDA
Side-by-side barplots
mpg %>% ggplot(aes(x=cyl)) + geom_bar() + facet_wrap(~drv)
Different color barplots
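Note: position = "dodge" is almost always preferable to the default, position = "stack". Stacking puts the bars on top of one another, and it’s very difficult for humans to compare the sizes of bars that start at different heights. (position = "identity" is different again: it draws the bars on top of each other, so some can be hidden.)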
mpg %>% ggplot(aes(x=cyl, fill=factor(drv))) + geom_bar(position="dodge")
# factor() tells ggplot to treat a variable as categorical; this matters for numeric variables such as cyl (drv is already stored as text, so factor() is optional here)
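To see the difference between the two positions, the sketch below draws the same barplot with the default stacking and with dodging. The object names are only for illustration.

```r
library(tidyverse)

# Default position ("stack"): bars for each drv sit on top of one another,
# so only the bottom segment starts at zero
p_stack <- mpg %>%
  ggplot(aes(x = cyl, fill = drv)) +
  geom_bar()

# position = "dodge": bars for each drv sit side by side, all starting at zero
p_dodge <- mpg %>%
  ggplot(aes(x = cyl, fill = drv)) +
  geom_bar(position = "dodge")

p_stack
p_dodge
```

In the dodged version every bar shares the same baseline, so heights can be compared directly.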
Quantitative EDA
You can make a two-way contingency table. (With more variables, say three categorical variables, you can make a three-way contingency table instead.)
# in tidyverse
mpg %>%
  count(cyl, drv) %>%
  pivot_wider(
    names_from = drv,
    values_from = n,
    values_fill = 0
  )
## # A tibble: 4 × 4
## cyl `4` f r
## <int> <int> <int> <int>
## 1 4 23 58 0
## 2 5 0 4 0
## 3 6 32 43 4
## 4 8 48 1 21
# the code is simpler in base R...
table(mpg$cyl, mpg$drv)
##
## 4 f r
## 4 23 58 0
## 5 0 4 0
## 6 32 43 4
## 8 48 1 21
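Raw counts can be hard to compare across rows with different totals. A minimal sketch that converts the base R table to row proportions and adds totals, assuming row proportions are what you want (margin = 1; use margin = 2 for column proportions):

```r
library(ggplot2)  # provides the mpg data

tab <- table(cyl = mpg$cyl, drv = mpg$drv)

# Proportion of each drv within each cyl value; every row sums to 1
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)

# The original counts with row and column totals appended
addmargins(tab)
```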
Visual EDA
Side-by-side histograms
mpg %>% ggplot(aes(x=cty)) + geom_histogram() + facet_grid(~drv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Side-by-side boxplots
mpg %>% ggplot(aes(y=cty)) + geom_boxplot() + facet_grid(~drv)
Different color histograms
mpg %>% ggplot(aes(x=cty, fill=factor(drv))) + geom_histogram(alpha=.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Quantitative EDA
You can report the same summary statistics as for a single quantitative variable, computed separately for each category.
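A minimal sketch of per-group summaries using group_by(), here for cty within each drv category. Both pairs of statistics are included so you can pick whichever suits the shape of each group.

```r
library(tidyverse)

# One row of summary statistics per drv category
cty_by_drv <- mpg %>%
  group_by(drv) %>%
  summarize(
    n      = n(),
    mean   = mean(cty),
    sd     = sd(cty),
    median = median(cty),
    IQR    = IQR(cty)
  )

cty_by_drv
```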
Visual EDA
Scatterplot
mpg %>% ggplot(aes(x=cty, y=hwy)) + geom_point()
Visual EDA
Scatterplot with different colors by the categories
mpg %>% ggplot(aes(x=cty, y=hwy, color=drv)) + geom_point()
Scatterplot with color shaded by a third, quantitative variable
mpg %>% ggplot(aes(x=cty, y=hwy, color=displ)) + geom_point()
Quantitative EDA
Three-way contingency table.
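For the three-way contingency table mentioned in this section, here is a sketch with three categorical variables. Using year as the third variable is an assumption made for illustration; it takes only two values in mpg.

```r
library(tidyverse)

# tidyverse: one row per combination of drv, cyl, and year that occurs
mpg %>% count(drv, cyl, year)

# base R: xtabs() builds the three-way table, and ftable() flattens it
# into a single readable grid
three_way <- xtabs(~ drv + cyl + year, data = mpg)
ftable(three_way)
```

The count() version drops combinations that never occur, while the xtabs() version shows them as zeros.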