Change appearance colors in RStudio

Navigate to Tools > Global Options > Appearance (or RStudio > Preferences > Appearance on macOS) and select a desired theme from the dropdown menu.

Load data

Load the tidyverse, a collection of packages (dplyr, ggplot2, readr, and others) that provides most of the tools used below.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data (make sure you assign each data set to an object with a useful name)

dat_kentucky <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/kentucky-derby-2018.csv")
## Rows: 144 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Date, Winner
## dbl (7): Year, Year_no, Mins, Secs, Time.in.Sec, Distance (mi), Speed (mph)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat_students <- read_csv("/Users/mariacuellar/Github/crim_data_analysis/data/students.csv")
## Rows: 10497 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, gender, year_in_college, favorite_color
## dbl (2): age, grade
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
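
The messages above suggest either specifying the column types or setting show_col_types = FALSE. Here is a minimal sketch of both options. It assumes a small temporary CSV written just for this example; the file and its three rows (three Derby winners that also appear in dat_kentucky) are stand-ins, not the course data files.

```r
library(readr)

# Write a tiny illustrative CSV to a temporary file
# (a hypothetical stand-in for the course data files)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,Winner",
             "1985,Spend a Buck",
             "1986,Ferdinand",
             "1987,Alysheba"), tmp)

# Option 1: let readr guess the types, but silence the message
dat_quiet <- read_csv(tmp, show_col_types = FALSE)

# Option 2: state the column types explicitly
dat_typed <- read_csv(tmp, col_types = cols(
  Year   = col_integer(),
  Winner = col_character()
))

class(dat_typed$Year)
```

Stating the types explicitly also protects you if a later version of the file contains a value that would change readr's guess.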

Data selection

How to select an observation:

year1986 <- dat_kentucky %>% filter(Year==1986)

year1986
## # A tibble: 1 × 9
##    Year Year_no Date     Winner     Mins  Secs Time.in.Sec `Distance (mi)`
##   <dbl>   <dbl> <chr>    <chr>     <dbl> <dbl>       <dbl>           <dbl>
## 1  1986     112 3-May-86 Ferdinand     2   2.8        123.            1.25
## # ℹ 1 more variable: `Speed (mph)` <dbl>

Note on the pipe operator (%>% or |>): the pipe takes the output of the expression on its left-hand side (LHS) and passes it as the first argument to the function on its right-hand side (RHS). This lets you chain multiple operations together in a clear, sequential manner.
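
To make the pipe concrete, here is a small sketch comparing nested function calls with the same steps written as a pipe. It uses the native pipe |>, which needs R 4.1 or later; the vector nums is made up for the example. In this example %>% would behave the same way once the tidyverse is loaded.

```r
nums <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(nums)), 1)

# The same steps with the native pipe, read left to right
piped <- nums |> sqrt() |> mean() |> round(1)

identical(nested, piped)
## [1] TRUE
```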

How to select a variable:

dat_kentucky$Winner
##   [1] "Aristides"          "Vagrant"            "Baden-Baden"       
##   [4] "Day Star"           "Lord Murphy"        "Fonso"             
##   [7] "Hindoo"             "Apollo"             "Leonatus"          
##  [10] "Buchanan"           "Joe Cotton"         "Ben Ali"           
##  [13] "Montrose"           "Macbeth II"         "Spokane"           
##  [16] "Riley"              "Kingman"            "Azra"              
##  [19] "Lookout"            "Chant"              "Halma"             
##  [22] "Ben Brush"          "Typhoon II"         "Plaudit"           
##  [25] "Manuel"             "Lieut. Gibson"      "His Eminence"      
##  [28] "Alan-a-Dale"        "Judge Himes"        "Elwood"            
##  [31] "Agile"              "Sir Huon"           "Pink Star"         
##  [34] "Stone Street"       "Wintergreen"        "Donau"             
##  [37] "Meridian"           "Worth"              "Donerail"          
##  [40] "Old Rosebud"        "Regret"             "George Smith"      
##  [43] "*Omar Khayyam"      "Exterminator"       "Sir Barton"        
##  [46] "Paul Jones"         "Behave Yourself"    "Morvich"           
##  [49] "Zev"                "Black Gold"         "Flying Ebony"      
##  [52] "Bubbling Over"      "Whiskery"           "Reigh Count"       
##  [55] "Clyde Van Dusen"    "Gallant Fox"        "Twenty Grand"      
##  [58] "Burgoo King"        "Brokers Tip"        "Cavalcade"         
##  [61] "Omaha"              "Bold Venture"       "War Admiral"       
##  [64] "Lawrin"             "Johnstown"          "Gallahadion"       
##  [67] "Whirlaway"          "Shut Out"           "Count Fleet"       
##  [70] "Pensive"            "Hoop Jr."           "Assault"           
##  [73] "Jet Pilot"          "Citation"           "Ponder"            
##  [76] "Middleground"       "Count Turf"         "Hill Gail"         
##  [79] "Dark Star"          "Determine"          "Swaps"             
##  [82] "Needles"            "Iron Liege"         "Tim Tam"           
##  [85] "*Tomy Lee"          "Venetian Way"       "Carry Back"        
##  [88] "Decidedly"          "Chateaugay"         "Northern Dancer"   
##  [91] "Lucky Debonair"     "Kauai King"         "Proud Clarion"     
##  [94] "Forward Pass**"     "Majestic Prince"    "Dust Commander"    
##  [97] "Canonero II"        "Riva Ridge"         "Secretariat"       
## [100] "Cannonade"          "Foolish Pleasure"   "Bold Forbes"       
## [103] "Seattle Slew"       "Affirmed"           "Spectacular Bid"   
## [106] "Genuine Risk"       "Pleasant Colony"    "Gato Del Sol"      
## [109] "SunnyÕs Halo"      "Swale"              "Spend a Buck"      
## [112] "Ferdinand"          "Alysheba"           "Winning Colors"    
## [115] "Sunday Silence"     "Unbridled"          "Strike the Gold"   
## [118] "Lil E. Tee"         "Sea Hero"           "Go for Gin"        
## [121] "Thunder Gulch"      "Grindstone"         "Silver Charm"      
## [124] "Real Quiet"         "Charismatic"        "Fusaichi Pegasus"  
## [127] "Monarchos"          "War Emblem"         "Funny Cide"        
## [130] "Smarty Jones"       "Giacomo"            "Barbaro"           
## [133] "Street Sense"       "Big Brown"          "Mine That Bird"    
## [136] "Super Saver"        "Animal Kingdom"     "I'll Have Another" 
## [139] "Orb"                "California Chrome2" "American Pharoah"  
## [142] "Nyquist"            "Always Dreaming"    "Justify"
thewinner <- dat_kentucky %>% select(Winner)

thewinner
## # A tibble: 144 × 1
##    Winner     
##    <chr>      
##  1 Aristides  
##  2 Vagrant    
##  3 Baden-Baden
##  4 Day Star   
##  5 Lord Murphy
##  6 Fonso      
##  7 Hindoo     
##  8 Apollo     
##  9 Leonatus   
## 10 Buchanan   
## # ℹ 134 more rows

How to select both (observation and variable):

thewinnerin1986 <- dat_kentucky %>% 
  filter(Year==1986) %>% 
  select(Winner)

thewinnerin1986
## # A tibble: 1 × 1
##   Winner   
##   <chr>    
## 1 Ferdinand
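
The result above is still a tibble (1 row, 1 column). If you want the value itself, one option is pull(), and base R indexing gives the same answer. This is a sketch on a three-row stand-in for dat_kentucky; the object name derby_small is made up for the example, and its rows are three consecutive entries from the Winner listing above.

```r
library(dplyr)

# Small stand-in for dat_kentucky (hypothetical object name)
derby_small <- tibble(
  Year   = c(1985, 1986, 1987),
  Winner = c("Spend a Buck", "Ferdinand", "Alysheba")
)

# filter() + select() returns a 1 x 1 tibble
derby_small %>% filter(Year == 1986) %>% select(Winner)

# pull() returns the column as a plain character vector
winner_1986 <- derby_small %>% filter(Year == 1986) %>% pull(Winner)
winner_1986

# Base R equivalent: logical row indexing on the column
winner_base <- derby_small$Winner[derby_small$Year == 1986]
winner_base
```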

First look at the data

Which variables are in the dataset? You can look at the Data pane, or use names()

names(dat_kentucky)
## [1] "Year"          "Year_no"       "Date"          "Winner"       
## [5] "Mins"          "Secs"          "Time.in.Sec"   "Distance (mi)"
## [9] "Speed (mph)"

What type is a variable? You can look at the Data pane, or use class()

class(dat_kentucky$Secs)
## [1] "numeric"

What are the dimensions of the data (rows, then columns)? You can look at the Data pane, or use dim()

dim(dat_kentucky)
## [1] 144   9
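
names(), class(), and dim() each answer one question. As a sketch, glimpse() (from dplyr) or str() (base R) shows the dimensions, variable names, and types in a single call. The three-row tibble derby_small is a made-up stand-in for dat_kentucky.

```r
library(dplyr)

# Small stand-in for dat_kentucky (hypothetical object name)
derby_small <- tibble(
  Year   = c(1985, 1986, 1987),
  Winner = c("Spend a Buck", "Ferdinand", "Alysheba")
)

# glimpse() prints the dimensions, then one line per variable with its type
glimpse(derby_small)

# str() is the base R counterpart
str(derby_small)

# The type of every column at once
col_types <- sapply(derby_small, class)
col_types
```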

Note: using commands (rather than only looking at the Data pane) leaves a record in your script, which lets me see what you did.

Summarize categorical variable

How to make a table to summarize a categorical variable:

names(dat_students)
## [1] "name"            "gender"          "age"             "year_in_college"
## [5] "favorite_color"  "grade"
table(dat_students$year_in_college) # Base R
## 
##  First year Fourth year Second year  Third year 
##        2129        2108        3144        3116
dat_students %>% count(year_in_college) # tidyverse
## # A tibble: 4 × 2
##   year_in_college     n
##   <chr>           <int>
## 1 First year       2129
## 2 Fourth year      2108
## 3 Second year      3144
## 4 Third year       3116

How to add proportions:

dat_students %>% 
  count(year_in_college) %>% 
  mutate(prop = n / sum(n))
## # A tibble: 4 × 3
##   year_in_college     n  prop
##   <chr>           <int> <dbl>
## 1 First year       2129 0.203
## 2 Fourth year      2108 0.201
## 3 Second year      3144 0.300
## 4 Third year       3116 0.297
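
For comparison, a base R sketch of the same idea: prop.table() divides each count in a table by the total. The vector year below is made up for illustration and is not the course data.

```r
# Made-up values for illustration only (not the course data)
year <- c("First year", "Second year", "Second year", "Third year")

counts <- table(year)
props  <- prop.table(counts)   # each count divided by the total

round(props, 2)
```

The proportions always sum to 1, which is a quick check that nothing was dropped along the way.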

How to make a table to summarize two categorical variables:

table(dat_students$favorite_color, dat_students$year_in_college) # It's trickier in tidyverse, but check out the janitor package
##        
##         First year Fourth year Second year Third year
##   blue         214         205         301        274
##   green        417         413         616        658
##   red         1498        1490        2227       2184
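
The comment above points to the janitor package for a tidyverse-style two-way table. The sketch below takes a different route that needs only dplyr and tidyr: count both variables, then spread one of them across the columns with pivot_wider(). The data frame toy_students is made up for illustration and is not the course data.

```r
library(dplyr)
library(tidyr)

# Made-up rows for illustration only (not the course data)
toy_students <- tibble(
  favorite_color  = c("red", "red", "blue", "green", "red", "blue"),
  year_in_college = c("First year", "Second year", "First year",
                      "First year", "First year", "Second year")
)

two_way <- toy_students %>%
  count(favorite_color, year_in_college) %>%   # one row per combination
  pivot_wider(
    names_from  = year_in_college,             # one column per year
    values_from = n,
    values_fill = 0                            # combinations never seen get 0
  )

two_way
```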

Visualize quantitative variable

How to draw a histogram:

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram(bins=60) # Changes number of bins

dat_kentucky_lowtime <- dat_kentucky %>% filter(Time.in.Sec<140)

dat_kentucky_lowtime %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

dat_kentucky %>% 
  filter(Time.in.Sec < 140) %>% 
  ggplot(aes(x=Time.in.Sec)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
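
The stat_bin() message suggests picking a value with binwidth. As a sketch, setting binwidth fixes how wide each bar is (in the units of the x variable) instead of how many bars there are. The data frame sim holds simulated times for illustration only, not the Derby data.

```r
library(ggplot2)

# Simulated finishing times, for illustration only (not the Derby data)
set.seed(1)
sim <- data.frame(Time.in.Sec = rnorm(200, mean = 125, sd = 4))

# Each bar now covers 2 seconds; with binwidth set, the message is not shown
p_binwidth <- ggplot(sim, aes(x = Time.in.Sec)) +
  geom_histogram(binwidth = 2)

p_binwidth
```

A bin width in meaningful units (here, seconds) is often easier to explain to a reader than a number of bins.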

You can save the plot as an object

p <- dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram()

library(ggthemes)
p + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + labs(title="Histogram") + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How to change the aesthetics (fill and outline color)

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram(bins = 30, fill = "steelblue", color = "white")

How to change the axis labels and title

dat_kentucky %>% ggplot(aes(x=Time.in.Sec)) + geom_histogram() +
  labs(
    x = "Time in seconds",   # new label for x-axis
    y = "Count",             # new label for y-axis
    title = "Histogram of Time in Seconds"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is equivalent to

p +
  labs(
    x = "Time in seconds",   # new label for x-axis
    y = "Count",         # new label for y-axis
    title = "Histogram of Time in Seconds"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can change the theme too

p + theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + theme_stata()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualize categorical variable

How to draw a barplot for year_in_college

# using base R, it's just barplot()
barplot(table(dat_students$year_in_college))

dat_students %>% ggplot(aes(x = year_in_college)) + geom_bar()
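
Because year_in_college is a character column, R orders its categories alphabetically (First, Fourth, Second, Third), as in the tables earlier. One way to get the natural order is to convert the column to a factor with explicit levels before plotting. This is a sketch on made-up rows, not the course data.

```r
library(ggplot2)

# Made-up rows for illustration only (not the course data)
toy_students <- data.frame(
  year_in_college = c("First year", "Second year", "Second year",
                      "Third year", "Fourth year")
)

# Convert the character column to a factor with the levels in study order
toy_students$year_in_college <- factor(
  toy_students$year_in_college,
  levels = c("First year", "Second year", "Third year", "Fourth year")
)

# geom_bar() draws the bars in the order of the factor levels
p_bar <- ggplot(toy_students, aes(x = year_in_college)) + geom_bar()
p_bar
```

The same factor also controls the order of the categories in table() output.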