Change appearance colors in RStudio

Navigate to Tools > Global Options > Appearance (or RStudio > Preferences > Appearance on macOS) and select a desired theme from the dropdown menu.

Load data

Load the tidyverse, a collection of packages (dplyr, ggplot2, readr, and others) for importing, manipulating, and plotting data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data, assigning each dataset to an object with a descriptive name:

dat_kentucky <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/kentucky-derby-2018.csv")
## Rows: 144 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Date, Winner
## dbl (7): Year, Year_no, Mins, Secs, Time.in.Sec, Distance (mi), Speed (mph)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat_students <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/students.csv")
## Rows: 10497 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, gender, year_in_college, favorite_color
## dbl (2): age, grade
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
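
The messages above name two ways to quiet the column-specification output. Here is a minimal sketch of both, assuming a small stand-in CSV written to a temporary file (the file and its three rows are illustrative, not the course data files):

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (illustrative only).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Year,Winner", "1984,Swale", "1985,Spend a Buck", "1986,Ferdinand"),
  tmp
)

# Option 1: let readr guess the column types, but silence the message.
dat_quiet <- read_csv(tmp, show_col_types = FALSE)

# Option 2: state the column types explicitly.
dat_typed <- read_csv(
  tmp,
  col_types = cols(
    Year   = col_double(),
    Winner = col_character()
  )
)

spec(dat_typed)  # prints the full column specification
```

Stating the types explicitly also protects you if a later version of the file would make readr guess a different type.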

Data selection

How to select an observation:

year1986 <- dat_kentucky %>% filter(Year==1986)

year1986
## # A tibble: 1 × 9
##    Year Year_no Date     Winner     Mins  Secs Time.in.Sec `Distance (mi)`
##   <dbl>   <dbl> <chr>    <chr>     <dbl> <dbl>       <dbl>           <dbl>
## 1  1986     112 3-May-86 Ferdinand     2   2.8        123.            1.25
## # ℹ 1 more variable: `Speed (mph)` <dbl>

Note on the pipe operator (%>% or |>): the pipe takes the result of the expression on its left-hand side (LHS) and passes it as the first argument to the function on its right-hand side (RHS). This lets you chain several operations together in a clear, sequential order.
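
To see that the pipe only changes how the code reads, here is a small sketch that writes the same operation three ways. It assumes a three-row stand-in table called toy (a made-up name) instead of the full dataset:

```r
library(dplyr)

# Tiny stand-in table (three rows only, for illustration).
toy <- tibble(
  Year   = c(1984, 1985, 1986),
  Winner = c("Swale", "Spend a Buck", "Ferdinand")
)

# Without a pipe: read from the inside out.
nested <- select(filter(toy, Year == 1986), Winner)

# With the magrittr pipe: read from left to right.
piped <- toy %>% filter(Year == 1986) %>% select(Winner)

# With the native pipe (available in R 4.1 and later).
native <- toy |> filter(Year == 1986) |> select(Winner)

# Compare the three results.
identical(nested, piped)
identical(piped, native)
```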

How to select a variable:

dat_kentucky$Winner
##   [1] "Aristides"          "Vagrant"            "Baden-Baden"       
##   [4] "Day Star"           "Lord Murphy"        "Fonso"             
##   [7] "Hindoo"             "Apollo"             "Leonatus"          
##  [10] "Buchanan"           "Joe Cotton"         "Ben Ali"           
##  [13] "Montrose"           "Macbeth II"         "Spokane"           
##  [16] "Riley"              "Kingman"            "Azra"              
##  [19] "Lookout"            "Chant"              "Halma"             
##  [22] "Ben Brush"          "Typhoon II"         "Plaudit"           
##  [25] "Manuel"             "Lieut. Gibson"      "His Eminence"      
##  [28] "Alan-a-Dale"        "Judge Himes"        "Elwood"            
##  [31] "Agile"              "Sir Huon"           "Pink Star"         
##  [34] "Stone Street"       "Wintergreen"        "Donau"             
##  [37] "Meridian"           "Worth"              "Donerail"          
##  [40] "Old Rosebud"        "Regret"             "George Smith"      
##  [43] "*Omar Khayyam"      "Exterminator"       "Sir Barton"        
##  [46] "Paul Jones"         "Behave Yourself"    "Morvich"           
##  [49] "Zev"                "Black Gold"         "Flying Ebony"      
##  [52] "Bubbling Over"      "Whiskery"           "Reigh Count"       
##  [55] "Clyde Van Dusen"    "Gallant Fox"        "Twenty Grand"      
##  [58] "Burgoo King"        "Brokers Tip"        "Cavalcade"         
##  [61] "Omaha"              "Bold Venture"       "War Admiral"       
##  [64] "Lawrin"             "Johnstown"          "Gallahadion"       
##  [67] "Whirlaway"          "Shut Out"           "Count Fleet"       
##  [70] "Pensive"            "Hoop Jr."           "Assault"           
##  [73] "Jet Pilot"          "Citation"           "Ponder"            
##  [76] "Middleground"       "Count Turf"         "Hill Gail"         
##  [79] "Dark Star"          "Determine"          "Swaps"             
##  [82] "Needles"            "Iron Liege"         "Tim Tam"           
##  [85] "*Tomy Lee"          "Venetian Way"       "Carry Back"        
##  [88] "Decidedly"          "Chateaugay"         "Northern Dancer"   
##  [91] "Lucky Debonair"     "Kauai King"         "Proud Clarion"     
##  [94] "Forward Pass**"     "Majestic Prince"    "Dust Commander"    
##  [97] "Canonero II"        "Riva Ridge"         "Secretariat"       
## [100] "Cannonade"          "Foolish Pleasure"   "Bold Forbes"       
## [103] "Seattle Slew"       "Affirmed"           "Spectacular Bid"   
## [106] "Genuine Risk"       "Pleasant Colony"    "Gato Del Sol"      
## [109] "Sunny's Halo"       "Swale"              "Spend a Buck"      
## [112] "Ferdinand"          "Alysheba"           "Winning Colors"    
## [115] "Sunday Silence"     "Unbridled"          "Strike the Gold"   
## [118] "Lil E. Tee"         "Sea Hero"           "Go for Gin"        
## [121] "Thunder Gulch"      "Grindstone"         "Silver Charm"      
## [124] "Real Quiet"         "Charismatic"        "Fusaichi Pegasus"  
## [127] "Monarchos"          "War Emblem"         "Funny Cide"        
## [130] "Smarty Jones"       "Giacomo"            "Barbaro"           
## [133] "Street Sense"       "Big Brown"          "Mine That Bird"    
## [136] "Super Saver"        "Animal Kingdom"     "I'll Have Another" 
## [139] "Orb"                "California Chrome2" "American Pharoah"  
## [142] "Nyquist"            "Always Dreaming"    "Justify"
thewinner <- dat_kentucky %>% select(Winner)

thewinner
## # A tibble: 144 × 1
##    Winner     
##    <chr>      
##  1 Aristides  
##  2 Vagrant    
##  3 Baden-Baden
##  4 Day Star   
##  5 Lord Murphy
##  6 Fonso      
##  7 Hindoo     
##  8 Apollo     
##  9 Leonatus   
## 10 Buchanan   
## # ℹ 134 more rows
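
The two approaches above return different kinds of objects: $ gives a plain vector, while select() gives a one-column tibble. Here is a short sketch of the difference, assuming a three-row stand-in table called toy (a made-up name). It also uses dplyr's pull(), which is not used elsewhere in these notes:

```r
library(dplyr)

# Tiny stand-in table (three rows only, for illustration).
toy <- tibble(
  Year   = c(1984, 1985, 1986),
  Winner = c("Swale", "Spend a Buck", "Ferdinand")
)

winner_vec    <- toy$Winner               # base R: a character vector
winner_tbl    <- toy %>% select(Winner)   # dplyr: a one-column tibble
winner_pulled <- toy %>% pull(Winner)     # dplyr: a vector, like $

class(winner_vec)
class(winner_tbl)
identical(winner_vec, winner_pulled)
```

Use select() when the next step expects a data frame, and $ or pull() when it expects a vector (for example, table() or mean()).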

How to select both (observation and variable):

thewinnerin1986 <- dat_kentucky %>% 
  filter(Year==1986) %>% 
  select(Winner)

thewinnerin1986
## # A tibble: 1 × 1
##   Winner   
##   <chr>    
## 1 Ferdinand

First look at the data

Which variables are in the dataset? Look at the Data pane, or use names():

names(dat_kentucky)
## [1] "Year"          "Year_no"       "Date"          "Winner"       
## [5] "Mins"          "Secs"          "Time.in.Sec"   "Distance (mi)"
## [9] "Speed (mph)"

What type is a variable? Look at the Data pane, or use class():

class(dat_kentucky$Secs)
## [1] "numeric"

What are the dimensions of the data? Look at the Data pane, or use dim():

dim(dat_kentucky)
## [1] 144   9
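
The three checks above can also be done in one pass. Here is a minimal sketch, assuming a three-row stand-in table called toy (a made-up name). sapply() and glimpse() are additions that these notes do not otherwise use:

```r
library(dplyr)

# Tiny stand-in table (three rows only, for illustration).
toy <- tibble(
  Year   = c(1984, 1985, 1986),
  Winner = c("Swale", "Spend a Buck", "Ferdinand")
)

names(toy)          # variable names
sapply(toy, class)  # type of every variable at once
dim(toy)            # number of rows, then number of columns
glimpse(toy)        # names, types, and first values in one view
```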

Note: using commands instead of the Data pane leaves a record in your script, so I can see what you did.

Categorical variable as factor

In R, a factor is a special data type used to represent categorical variables, which are variables that take on a limited set of values such as “Male”/“Female” or “Low”/“Medium”/“High”. Factors are stored internally as integers with associated labels, making them both memory-efficient and useful for analysis.

Unlike plain character variables, factors explicitly tell R and tidyverse packages (like ggplot2 and dplyr) that a variable is categorical, which allows for better control over ordering and visualization. You can create factors with the factor() function, customize the level order with the levels argument, or specify an ordered factor when categories have a natural ranking (e.g., “Low” < “Medium” < “High”). See the code below.

Factors are essential for working with categorical data, ensuring correct ordering in plots and statistical models, and preventing unintended numeric operations.

library(tidyverse)

# This is the simplest way to make a categorical variable into a factor
dat_students <- dat_students %>% mutate(year_in_college = as.factor(year_in_college))


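# This allows you to set the order of the levels.
# The levels must match the values stored in the column exactly;
# any value that is not listed becomes NA.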
dat_students <- dat_students %>%
  mutate(year_in_college = factor(year_in_college, levels = c("1", "2", "3", "4")))
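
Here is a small sketch of how the levels argument behaves, including the ordered factors mentioned above. The labels are hypothetical and chosen only for illustration; they are not taken from the students data:

```r
# Hypothetical labels, for illustration only.
x <- c("Low", "High", "Medium", "Low")

# Levels that match the data: every value is kept, in the order given.
f_ok <- factor(x, levels = c("Low", "Medium", "High"))
table(f_ok)

# Levels that do not match the data: every value becomes NA.
f_bad <- factor(x, levels = c("1", "2", "3"))
table(f_bad, useNA = "ifany")

# An ordered factor supports comparisons between levels.
f_ord <- factor(x, levels = c("Low", "Medium", "High"), ordered = TRUE)
f_ord < "High"

# Check which values are present before choosing the levels.
unique(x)
```

Running unique() on the column first is a cheap way to avoid ending up with a factor that is entirely NA.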

Summarize categorical variable

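How to make a table to summarize a categorical variable. (In the output below, every count is 0 and all 10497 rows are NA: the levels "1" to "4" set above do not match the values stored in year_in_college, so every value was turned into NA.)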

names(dat_students)
## [1] "name"            "gender"          "age"             "year_in_college"
## [5] "favorite_color"  "grade"
table(dat_students$year_in_college) # Base R
## 
## 1 2 3 4 
## 0 0 0 0
dat_students %>% count(year_in_college) # tidyverse
## # A tibble: 1 × 2
##   year_in_college     n
##   <fct>           <int>
## 1 <NA>            10497

How to add proportions:

dat_students %>% 
  count(year_in_college) %>% 
  mutate(prop = n / sum(n))
## # A tibble: 1 × 3
##   year_in_college     n  prop
##   <fct>           <int> <dbl>
## 1 <NA>            10497     1

How to make a table to summarize two categorical variables:

table(dat_students$favorite_color, dat_students$year_in_college) # It's trickier in tidyverse, but check out the janitor package
##        
##         1 2 3 4
##   blue  0 0 0 0
##   green 0 0 0 0
##   red   0 0 0 0
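
A two-way table can also be built in the tidyverse without extra packages. This is one possible sketch, not the janitor approach mentioned above. It assumes a small made-up table called toy_students with hypothetical values, and uses tidyr's pivot_wider() to spread the counts into columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical data, for illustration only.
toy_students <- tibble(
  favorite_color  = c("blue", "blue", "green", "red", "red", "red"),
  year_in_college = c("Freshman", "Sophomore", "Freshman",
                      "Freshman", "Sophomore", "Sophomore")
)

two_way <- toy_students %>%
  count(favorite_color, year_in_college) %>%  # one row per combination
  pivot_wider(
    names_from  = year_in_college,            # one column per year
    values_from = n,
    values_fill = 0L                          # combinations that never occur
  )

two_way
```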

Visualize quantitative variable

How to draw a histogram:

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram(bins=60) # Changes number of bins

dat_kentucky_lowtime <- dat_kentucky %>% filter(Time.in.Sec<140)

dat_kentucky_lowtime %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

dat_kentucky %>% 
  filter(Time.in.Sec < 140) %>% 
  ggplot(aes(x=Time.in.Sec)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
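
The `stat_bin()` message appears whenever neither bins nor binwidth is given. Here is a short sketch of both options, using the faithful dataset that ships with R as a stand-in (the bin settings are arbitrary choices for illustration):

```r
library(ggplot2)

# `faithful` ships with R; `waiting` is the time between eruptions in minutes.

# Option 1: fix the number of bins.
p_bins <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 20)

# Option 2: fix the width of each bin, in the units of the x variable.
p_width <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)

p_bins
p_width
```

binwidth is often easier to explain in a report, because it is expressed in the units of the variable (here, 5 minutes per bar).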

You can save the plot as an object

p <- dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()

library(ggthemes)
p + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + labs(title="Histogram") + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

How to change the fill and outline colors of the bars

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram(bins = 30, fill = "steelblue", color = "white")

How to change the axis labels and title

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram() +
  labs(
    x = "Time in seconds",   # new label for x-axis
    y = "Count",             # new label for y-axis
    title = "Histogram of Time in Seconds"
  )
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

This is equivalent to

p +
  labs(
    x = "Time in seconds",   # new label for x-axis
    y = "Count",             # new label for y-axis
    title = "Histogram of Time in Seconds"
  )
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

You can change the theme too

p + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + theme_stata()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

p + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Visualize categorical variable

How to draw a bar plot of year_in_college:

# using base R, it's just barplot()
barplot(table(dat_students$year_in_college))

dat_students %>% ggplot(aes(x = year_in_college)) + geom_bar()
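
The order of the bars follows the order of the factor levels. Here is a minimal sketch, assuming a small made-up table called toy_students with hypothetical labels (not the values in the students data):

```r
library(ggplot2)

# Hypothetical data, for illustration only.
toy_students <- data.frame(
  year_in_college = c("Sophomore", "Freshman", "Junior",
                      "Freshman", "Senior", "Freshman")
)

# As plain text, the bars would be sorted alphabetically
# (Freshman, Junior, Senior, Sophomore).
# Setting the levels puts them in the intended order.
toy_students$year_in_college <- factor(
  toy_students$year_in_college,
  levels = c("Freshman", "Sophomore", "Junior", "Senior")
)

p_bar <- ggplot(toy_students, aes(x = year_in_college)) +
  geom_bar() +
  labs(x = "Year in college", y = "Count")

p_bar
```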