The grammar of graphics

The Grammar of Graphics is a book published in 1999 by the statistician Leland Wilkinson. Hadley Wickham later built on it to write an R package called ggplot (now ggplot2). The "gg" in ggplot stands for Grammar of Graphics.

It is called a grammar because it is a structure for producing plots that resembles a language: a small set of components that combine according to rules. Many current plotting tools have been informed by this book. Its influence can be found in Tableau, Plotly, and the Python libraries bokeh, altair, seaborn, and plotnine. The most complete implementation of the grammar is found in the R package ggplot2.

In Wickham’s adaptation of the grammar of graphics, a plot can be decomposed into seven elements:

Data: The data frame that contains the data you want to visualize.
Aesthetic (aes) mapping of the variables in the data to visual cues: What is your x or y.

Why it’s called “aesthetics”: The name is not about style or beauty; it is about visual mappings. An aesthetic is any visual property of the plot that can be driven by the data: x and y position, color, fill, shape, size, alpha (transparency), and linetype. This contrasts with “settings”: if you don’t want a visual property to depend on the data, you set it to a constant outside aes(). For example: ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(color = "red").

Geometry: The shapes used to encode the observations on the plot, i.e., what kind of plot you’re making (a histogram, a bar plot, a line, etc.).
Facets: If you want to split up the plots by categories, e.g., one plot for females and one for males.

Statistics: If you want to fit a model to the data. Geoms decide how to draw (points, lines, bars). Stats decide what to draw by computing summaries (counts, means, model fits).
Coordinates: If you want to change the coordinate system (e.g., polar instead of Cartesian coordinates).
Theme: If you want to make it pretty.
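
As a rough sketch of how the seven elements fit together, the following single plot uses each of them once. It assumes the mpg data that ships with ggplot2. The particular geom, smoother, axis limits, and theme are illustrative choices, not the only way to fill each slot.

```r
library(ggplot2)

p_all <- ggplot(data = mpg,                        # 1. data
                mapping = aes(x = cty, y = hwy)) + # 2. aesthetic mapping
  geom_point(alpha = 0.5) +                        # 3. geometry (alpha here is a setting, not a mapping)
  facet_wrap(~drv) +                               # 4. facets: one panel per drive type
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                        # 5. statistics: a fitted line in each panel
  coord_cartesian(xlim = c(5, 40)) +               # 6. coordinates: zoom without dropping data
  theme_minimal()                                  # 7. theme

p_all
```

Each component is added with +, so you can remove or swap any one line and still have a valid plot.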

ggplot components

We will use the mpg dataset, which is built into the ggplot2 package, which is part of the tidyverse package.

We will use the variables cty and hwy as our quantitative variables, and the variables drv and cyl as our categorical ones.

# load packages
library(tidyverse)
library(ggthemes)

If you call ggplot() with only the data, you get an empty canvas: nothing is drawn yet.

mpg %>% ggplot()

If you add the aesthetics, you get a coordinate system where the plot will go. Here we’ve mapped one quantitative variable, cty, to x.

mpg %>% ggplot(aes(x=cty))

If you then add the geometry (shortened as geom), then you get a plot.

mpg %>% ggplot(aes(x=cty)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
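
The message above is ggplot2 noting that it fell back to its default of 30 bins. Here is a minimal sketch of choosing the bins yourself. The binwidth of 2 (miles per gallon) is an arbitrary illustrative value, not a recommendation.

```r
library(ggplot2)

# binwidth is expressed in the units of x (here, miles per gallon)
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2)

p_hist

# Alternatively, fix the number of bins instead of their width:
# ggplot(mpg, aes(x = cty)) + geom_histogram(bins = 15)
```

Setting either argument silences the message. It is worth trying a few values, since the apparent shape of the distribution can change with the bin size.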

You can change the color of the bars for fun, without it having any particular meaning: color sets the outline of each bar and fill sets its interior. Note that there is no legend. This is SETTING a constant color, not mapping (no legend). That’s why it’s outside aes().

mpg %>% ggplot(aes(x=cty)) + geom_histogram(color = "steelblue", fill="darkblue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Or you can use color with meaning. Here we use it to tell us how the counts split up by a second, categorical variable, drv. This colors the bars by category, one fill color for each category of drv. This is MAPPING color to a variable (legend appears).

mpg %>% ggplot(aes(x=cty, fill = drv)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can use facets instead to see how the distribution differs across a categorical variable. This splits up the plot into three plots, one for each category of drv.

mpg %>% ggplot(aes(x=cty)) + geom_histogram() + facet_grid(~drv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can also change the coordinates (although we won’t do this often).

# coord_polar() for pie/polar charts
mpg %>% ggplot(aes(x=cty)) + geom_histogram() + coord_polar()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can add “statistics”, meaning summaries or models computed from the data and drawn on your plot. Note that ggplot does this computation in the background; we will learn later how to fit models ourselves.

mpg %>% ggplot(aes(x=cty)) + 
  geom_histogram(bins=40, aes(y = after_stat(density))) + 
  stat_density(geom = "line", color = "blue", linewidth = 1) + # Computes a kernel density estimate
  stat_function(fun = dnorm, 
                args = list(mean = mean(mpg$cty), sd = sd(mpg$cty)),  # Plots any function (e.g., theoretical Normal distribution)
                color = "red", linetype = "dashed", linewidth=1.5)

Finally, you can change the look of the plot.

Themes

Themes are made for you:

mpg %>% ggplot(aes(x=cty)) + geom_histogram() + theme_economist()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# make sure you've loaded the ggthemes package

You can see more options here: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

Palettes

And you can also change the palettes. There’s a nice set of palettes called the Brewer palettes: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html. They were designed by Cynthia Brewer: https://en.wikipedia.org/wiki/Cynthia_Brewer.

There’s a nice interactive example you can play around with, which shows how color is used in ggplot, here: https://r-graph-gallery.com/ggplot2-color.html

library(tidyverse)
mpg %>% ggplot(aes(x=cty, fill=drv)) + geom_histogram() + scale_fill_brewer(palette = "Dark2")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Single variable EDA

Quantitative variable

Visual EDA

Histogram

Plot a histogram to visualize a single quantitative variable.

mpg %>% ggplot(aes(x=cty)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Boxplot

Plot a boxplot to visualize a single quantitative variable.

mpg %>% ggplot(aes(y=cty)) + geom_boxplot()

Quantitative EDA

To summarize a quantitative variable, first determine how many modes the distribution has. If it’s uniform, then report the range of the variable.

# If uniform (which is not the case for cty, but I'll show you how to calculate a range with it anyway):
range(mpg$cty)

## [1]  9 35

If it’s multimodal, then see if you can split up the distribution into its modes, by using a categorical variable. If not, then you can just report where the modes are.

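If it’s unimodal, then you can give measures of its center and spread. If it’s symmetric, then give the mean and standard deviation, and if it’s asymmetric, give the median and IQR. Note that in a symmetric distribution the mean and the median are approximately equal, so a large gap between them signals skew. The converse does not hold: an equal mean and median do not by themselves guarantee symmetry, so check the histogram as well.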

# If unimodal and symmetric:
mpg %>% summarize(mean=mean(cty), sd=sd(cty))

## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  16.9  4.26

# If unimodal and asymmetric:
mpg %>% summarize(median=median(cty), IQR=IQR(cty))

## # A tibble: 1 × 2
##   median   IQR
##    <dbl> <dbl>
## 1     17     5
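
One way to choose between the two summaries is to compute all four numbers at once and compare the mean with the median. This is only a rough check, under the assumption that the distribution is unimodal. The object name cty_summary and the column names are arbitrary.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

cty_summary <- mpg %>%
  summarize(mean = mean(cty), median = median(cty),
            sd = sd(cty), IQR = IQR(cty))

cty_summary
```

A large gap between the mean and the median would point to skew. A small gap is consistent with symmetry, but it should still be confirmed on the histogram.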

Categorical variable

Visual EDA

Barplot

Draw a barplot to visualize a single categorical variable.

mpg %>% ggplot(aes(x=drv)) + geom_bar()

Pie chart

(Actually, don’t use pie charts. They’re often more deceiving than helpful - this is because humans are bad at comparing areas to each other. Just stick to barplots, or tables.)

#mpg %>% ggplot(aes(x="", fill=drv)) + geom_bar() + coord_polar("y") + theme_void()

Quantitative EDA

How to make a table to summarize a categorical variable:

dat_students <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/students.csv")

## Rows: 10497 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, gender, year_in_college, favorite_color
## dbl (2): age, grade
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dat_students %>% count(year_in_college)

## # A tibble: 4 × 2
##   year_in_college     n
##   <chr>           <int>
## 1 First year       2129
## 2 Fourth year      2108
## 3 Second year      3144
## 4 Third year       3116

How to add proportions:

dat_students %>% 
  count(year_in_college) %>% 
  mutate(prop = n / sum(n))

## # A tibble: 4 × 3
##   year_in_college     n  prop
##   <chr>           <int> <dbl>
## 1 First year       2129 0.203
## 2 Fourth year      2108 0.201
## 3 Second year      3144 0.300
## 4 Third year       3116 0.297
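
The students.csv file is read from a local path, so here is a sketch of the same count-and-proportion pattern applied to the built-in mpg data instead. The choice of drv as the categorical variable is just for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

drv_table <- mpg %>%
  count(drv) %>%               # one row per category, with its frequency n
  mutate(prop = n / sum(n))    # divide by the total so the proportions sum to 1

drv_table
```

Adding arrange(desc(n)) at the end of the pipe would sort the table from the most to the least common category.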

Two variable EDA

Two categorical variables

Visual EDA

Side-by-side barplots

mpg %>% ggplot(aes(x=cyl)) + geom_bar() + facet_wrap(~drv)

Different color barplots

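Note: position = "dodge" is almost always preferable to the default, position = "stack", because stacking piles the bars on top of each other, and it’s very difficult for humans to compare the sizes of bars that start at different heights. (position = "identity" draws the bars in the same place, so they overlap and hide one another.)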

mpg %>% ggplot(aes(x=cyl, fill=factor(drv))) + geom_bar(position="dodge")

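# note: drv is already a character variable, so factor() is optional here; but a categorical
# variable stored as a number (such as cyl) must be wrapped in factor() when mapped to
# fill or color, otherwise ggplot treats it as continuous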
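
The factor() note above comes down to column types: ggplot2 picks a continuous or a discrete scale based on how the variable is stored. Here is a small sketch of the difference, using cyl (stored as an integer) mapped to color in a scatterplot. The plots are only illustrative.

```r
library(ggplot2)

class(mpg$cyl)  # integer, so ggplot2 treats it as continuous
class(mpg$drv)  # character, so ggplot2 treats it as discrete

# cyl as a number: the legend shows a continuous color gradient
p_continuous <- ggplot(mpg, aes(x = cty, y = hwy, color = cyl)) +
  geom_point()

# cyl as a factor: the legend shows one distinct color per category
p_discrete <- ggplot(mpg, aes(x = cty, y = hwy, color = factor(cyl))) +
  geom_point()

p_continuous
p_discrete
```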

Quantitative EDA

You can make a two-way contingency table. (For higher dimensions, let’s say you want to compare three categorical variables, then you can also do a three-way contingency table.)

# in tidyverse
mpg %>%
  count(cyl, drv) %>% 
  pivot_wider(
    names_from = drv,
    values_from = n,
    values_fill = 0
  )

## # A tibble: 4 × 4
##     cyl   `4`     f     r
##   <int> <int> <int> <int>
## 1     4    23    58     0
## 2     5     0     4     0
## 3     6    32    43     4
## 4     8    48     1    21

# the code is simpler in base R...
table(mpg$cyl, mpg$drv)

##    
##      4  f  r
##   4 23 58  0
##   5  0  4  0
##   6 32 43  4
##   8 48  1 21

A quantitative variable and a categorical variable

Visual EDA

Side-by-side histograms

mpg %>% ggplot(aes(x=cty)) + geom_histogram() + facet_grid(~drv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Side-by-side boxplots

mpg %>% ggplot(aes(y=cty)) + geom_boxplot() + facet_grid(~drv)

Different color histograms

mpg %>% ggplot(aes(x=cty, fill=factor(drv))) + geom_histogram(alpha=.7)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Quantitative EDA

You can give the same summary statistics as for a single quantitative variable, but computed separately for each category (that is, one set of statistics per histogram).
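
A minimal sketch of per-group summaries, assuming cty as the quantitative variable and drv as the categorical one, to match the plots above. The name cty_by_drv is arbitrary.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

cty_by_drv <- mpg %>%
  group_by(drv) %>%                 # one group per category of drv
  summarize(n = n(),
            mean = mean(cty), sd = sd(cty),
            median = median(cty), IQR = IQR(cty))

cty_by_drv
```

As in the single-variable case, report the mean and standard deviation for groups that look symmetric, and the median and IQR for groups that look skewed.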

Two quantitative variables

Visual EDA

Scatterplot

mpg %>% ggplot(aes(x=cty, y=hwy)) + geom_point()

Three variable EDA

Three categorical variables

Visual EDA

Quantitative EDA

Three-way contingency table.
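
A minimal sketch of a three-way contingency table. The notes do not say which third variable to use. Using year (the model year, 1999 or 2008 in mpg) is an assumption made here for illustration.

```r
library(ggplot2)  # provides the mpg data

# xtabs() builds the three-dimensional table of counts
tab3 <- xtabs(~ cyl + drv + year, data = mpg)

# ftable() flattens it into a single two-dimensional printout
ftable(tab3)

# tidyverse equivalent, with one row per observed combination:
# mpg %>% count(cyl, drv, year)
```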

Two quantitative variables and a categorical variable

Scatterplot with different colors by the categories

mpg %>% ggplot(aes(x=cty, y=hwy, color=drv)) + geom_point()

Three quantitative variables

Scatterplot with the third quantitative variable mapped to a color gradient

mpg %>% ggplot(aes(x=cty, y=hwy, color=displ)) + geom_point()

Exploratory Data Analysis

Maria Cuellar

2025-09-16
