This intro assumes that readers know the basics of R. To keep it concise, the descriptions are very short, and pointers to other references are scattered throughout. Because the goal is to present particular features of ggplot2, some of the graphics here are not ideal: better visualizations of the same data are possible.

After working through this intro, you should be able to …

understand the key components of a ggplot plot
explore many other features of ggplot2 on your own.

Motivation

What is ggplot2?

A data visualization package for R created by Hadley Wickham. See Wikipedia.

Why ggplot2?

Ability to construct plots in layers.
Nice looking graphics (relative to base graphics, in general).
Follows the grammar of graphics, which makes plot specifications more intuitive.

See the article by Hadley Wickham for more information about the grammar of graphics.

etc.

See this github page for more reasons to use ggplot2.

Layering

Five components of a layer:

Data

a data frame containing the dataset of interest

Aesthetic mappings

descriptions of how the variables in the dataset are mapped to visual properties
specified by the aes() function

Geom

geometric objects
control the type of plots
key ones: line (lineplot), point (scatterplot), bar (barplot), histogram, density

Stat

statistical transformation of the data

Position Adjustments

minor adjustments of the position of layer elements
mainly used for discrete data
dodge, fill, identity, jitter, stack

We will focus on the first three in this tutorial.
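
As a rough sketch of how the five components fit together, the low-level function layer() accepts all of them explicitly. The call below is only illustrative, and the object names p_explicit and p_shortcut are made up for this sketch. In everyday use a geom_*() shortcut such as geom_point() fills in the stat and position defaults for you.

```r
library(ggplot2)

# One layer with all five components spelled out.
p_explicit <- ggplot() +
  layer(
    data     = iris,                                    # data
    mapping  = aes(x = Petal.Length, y = Petal.Width),  # aesthetic mappings
    geom     = "point",                                 # geom
    stat     = "identity",                              # stat (no transformation)
    position = "identity"                               # position adjustment (none)
  )
p_explicit

# The usual shortcut: geom_point() defaults to stat = "identity"
# and position = "identity", so it draws the same points.
p_shortcut <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()
```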

Example 1:

Using the iris dataset, create a scatterplot of petal widths (y-axis) versus petal lengths (x-axis), color coded by species. In addition, plot the regression line (petal widths vs petal lengths) with a 95% confidence band.

Let’s construct the plot step-by-step.

First, we would like to initialize our plot using ggplot().

library(ggplot2)

p1 <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width))
p1

What does the code do? data = iris tells ggplot() to look at the dataset iris, and aes(x = Petal.Length, y = Petal.Width) maps x to the variable Petal.Length in iris and y to the variable Petal.Width in iris (this is evident from the x-axis and the y-axis of the above plot).

You may wonder why there is nothing shown on the plot. The reason is that we haven’t specified what we want to see on the plot! This is where geom comes into play.

p2 <- p1 + geom_point(aes(color = Species))
p2

Three things are added to the plot:

Points representing the observations
Colors corresponding to the species
A legend that describes the color coding

What happened? geom_point() generates a scatterplot via a layer of points based on x and y, and aes(color = Species) maps color to the variable Species. One nice feature of ggplot() is that the legend is created automatically when color-coding/shape-coding via aesthetic mappings.
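
To make the role of aes() a little more concrete, here is a small sketch contrasting mapping an aesthetic to a variable with setting it to a constant. The object names (p_base, p_mapped, p_fixed) and the color "blue" are arbitrary choices for illustration.

```r
library(ggplot2)

p_base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width))

# Mapping: color varies with the data, so it goes inside aes()
# and a legend is generated.
p_mapped <- p_base + geom_point(aes(color = Species))

# Setting: one fixed color for every point, so it goes outside aes()
# and no legend is needed.
p_fixed <- p_base + geom_point(color = "blue")
```

A common slip is writing aes(color = "blue"), which maps color to a constant string and lets the color scale pick the actual color instead of drawing blue points.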

Finally, we use geom_smooth() with the argument method='lm' to plot the regression line with a confidence band.

p3 <- p2 + geom_smooth(method='lm')
p3

Creating/modifying the title and the axis labels is straightforward.

p4 <- p3 + xlab("Petal Length (cm)") + ylab("Petal Width (cm)") + ggtitle("Petal Width versus Petal Length")
p4
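
As an alternative sketch, labs() can set the axis labels, the title, and the legend title in a single call. The label text and the object name p_labs below are illustrative only.

```r
library(ggplot2)

p_labs <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  labs(
    x     = "Petal Length (cm)",
    y     = "Petal Width (cm)",
    title = "Iris petal measurements",
    color = "Iris species"   # legend title for the color aesthetic
  )
p_labs
```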

To illustrate how the plot evolves, let’s arrange the four plots on a single page. An easy way is to use the function grid.arrange() in the package gridExtra.

library(gridExtra)
library(grid)
grid.arrange(p1 + ggtitle("ggplot(data = iris, \naes(x = Petal.Length, \ny = Petal.Width))"),
             p2 + ggtitle("+ geom_point(aes(color = Species))"), 
             p3 + ggtitle("+ geom_smooth(method='lm')"), 
             p4, nrow = 2,
             top = textGrob("Evolution of the plot in Example 1",
                            gp=gpar(fontsize=20)))

Exercise 1

Import the datasets using the following commands:

data(presidential, package = 'ggplot2')
sp500 <- read.csv("http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC&a=00&b=3&c=1950&d=00&e=31&f=2016&g=d&ignore=.csv")

Create the following plot.

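[Figure: line plot of the log S&P 500 adjusted closing price over time, with background rectangles shaded by the party of the sitting president]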

Hint: geom_rect() might be useful.

Remark: This exercise is inspired by this tutorial.

One possible solution:

sp500$Date <- as.Date(sp500$Date)

# Create a line plot of S&P500 log price over time.
layer_line <- geom_line(
  mapping = aes(x = Date, y = log(Adj.Close)),
  data = sp500
)

# Create a layer of rectangles indicating the political
# affiliation of each president over the corresponding time period.
layer_rect <- geom_rect(
  data = presidential, 
  aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = party), 
  alpha = 0.4
)

# Put everything together.
ggplot() + layer_line + layer_rect
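
Layers are drawn in the order in which they are added, so in the solution above the semi-transparent rectangles sit on top of the line. The sketch below adds the rectangles first so that the line is drawn over them. It uses the economics dataset that ships with ggplot2 as a stand-in for the S&P 500 data (an assumption made only to keep the example self-contained), and the names pres and p_order are made up.

```r
library(ggplot2)

# Stand-in data: `economics` and `presidential` both ship with ggplot2.
# Keep only the presidential terms that start after the series begins.
pres <- subset(presidential, start > economics$date[1])

p_order <- ggplot() +
  # Bottom layer: shaded presidential terms, drawn first.
  geom_rect(
    data = pres,
    aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = party),
    alpha = 0.4
  ) +
  # Top layer: the time series, drawn over the rectangles.
  geom_line(data = economics, aes(x = date, y = log(unemploy)))
p_order
```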

Faceting

Faceting splits the data into subsets and draws one panel per subset, arranging the panels on the same page. There are two types of faceting:

facet_grid

a 2d grid of panels

facet_wrap

a 1d ribbon of panels wrapped into 2d
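
A minimal sketch of the difference, using the mpg dataset that ships with ggplot2 (chosen here only because it has several categorical variables; it is not used elsewhere in this intro). The object names are made up.

```r
library(ggplot2)

p_mpg <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_grid: rows by drive train (drv), columns by number of cylinders (cyl).
# Combinations with no data still get an (empty) panel.
p_grid <- p_mpg + facet_grid(drv ~ cyl)

# facet_wrap: one panel per vehicle class, wrapped into 3 columns.
p_wrap <- p_mpg + facet_wrap(~ class, ncol = 3)
```

facet_grid() is a natural fit when two variables define the rows and columns, while facet_wrap() is handy for a single variable with many levels.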

Example 2:

Make a 1x3 panel of histograms of petal lengths, one for each species, and mark the mean petal length of each species with a vertical line.

library(plyr)
vline_df <- ddply(iris, .(Species), function(sp){
  mean(sp$Petal.Length)
})
ggplot(data = iris, aes(x = Petal.Length)) + 
  geom_histogram() + 
  geom_vline(data = vline_df, aes(xintercept = V1, color = Species)) +
  facet_grid(.~Species) +
  ggtitle("Petal Lengths of each species")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
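
The message above is a reminder that the default of 30 bins is rarely a good choice. The sketch below sets binwidth explicitly (0.25 cm is an arbitrary value picked for illustration) and computes the species means with base R's aggregate() instead of plyr. The names species_means and p_hist are made up.

```r
library(ggplot2)

# Mean petal length per species; the result keeps the column name Petal.Length.
species_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

p_hist <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.25) +   # each bin is 0.25 cm wide
  geom_vline(data = species_means,
             aes(xintercept = Petal.Length, color = Species)) +
  facet_grid(. ~ Species) +
  ggtitle("Petal Lengths of each species")
p_hist
```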

As a side note, each of the three plots above has a considerable amount of empty space. It might be better to simply overlay the histograms for better visualization.

ggplot(data = iris, aes(x=Petal.Length, fill=Species)) + 
  geom_histogram(alpha=0.2, position="identity") +
  geom_vline(data = vline_df, aes(xintercept = V1, color = Species)) +
  ggtitle("Petal Lengths of each species") + 
  annotate("segment", x=4.8, y=14.5, xend=4.8, yend=5, size=0.5, 
      arrow=arrow(length=unit(.2, "cm"))) + 
  annotate("text", label="Overlap", x=4.8, y=15, size=3,    
      fontface="bold")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Example 3:

Create a visualization for explaining Simpson’s paradox using the gender bias dataset about UC Berkeley 1973 Graduate Admissions.

library(datasets)
UCBdt <- as.data.frame(UCBAdmissions)
overall <- ddply(UCBdt, .(Gender), function(gender) {
  temp <- c(sum(gender[gender$Admit == "Admitted", "Freq"]), 
                sum(gender[gender$Admit == "Rejected", "Freq"])) / sum(gender$Freq)
  names(temp) <- c("Admitted", "Rejected")
  temp
})
departmentwise <- ddply(UCBdt, .(Gender,Dept), function(gender) {
  temp <- gender$Freq / sum(gender$Freq)
  names(temp) <- c("Admitted", "Rejected")
  temp
})

# A barplot for overall admission percentage for each gender.
p1 <- ggplot(data = overall, aes(x = Gender, y = Admitted, width = 0.2))
p1 <- p1 + geom_bar(stat = "identity") + 
  ggtitle("Overall admission percentage") + ylim(0,1) 

# A 1x6 panel of barplots, each of which represents the 
# number of admitted students for a department
p2 <- ggplot(data = UCBdt[UCBdt$Admit == "Admitted", ], aes(x = Gender, y = Freq))
p2 <- p2 + geom_bar(stat = "identity") + facet_grid(. ~ Dept) +
  ggtitle("Number of admitted students\nfor each department") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

# A 1x6 panel of barplots, each of which represents the 
# admission percentage for a department
p3 <- ggplot(data = departmentwise, aes(x = Gender, y = Admitted))
p3 <- p3 + geom_bar(stat = "identity") + facet_grid(. ~ Dept) + ylim(0,1) +
  ggtitle("Admission percentage\nfor each department") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# A 1x6 panel of barplots, each of which represents the 
# number of applicants for a department
p4 <- ggplot(data = UCBdt, aes(x = Gender, y = Freq))
p4 <- p4 + geom_bar(stat = "identity") + facet_grid(. ~ Dept) + 
  ggtitle("Number of Applicants\nfor each department") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Arrange the four plots on a page.
grid.arrange(p1, p2, p3, p4, nrow=2, 
             top = textGrob("Simpson's Paradox: UC Berkeley 1973 Admissions", gp=gpar(fontsize=20)))
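
The paradox can also be checked numerically. The following sketch uses only base R on the UCBAdmissions table; the names by_gender, overall_rate, and dept_rate are made up for this illustration.

```r
# Counts collapsed over departments: rows are Admit, columns are Gender.
by_gender <- margin.table(UCBAdmissions, margin = c(1, 2))

# Overall admission rate for each gender.
overall_rate <- prop.table(by_gender, margin = 2)["Admitted", ]

# Admission rate within each Gender x Dept cell.
dept_rate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

round(overall_rate, 3)
round(dept_rate, 3)

# Number of departments in which women have the higher admission rate.
sum(dept_rate["Female", ] > dept_rate["Male", ])
```

Overall, men are admitted at a higher rate than women (about 44.5% versus 30.4%), yet within four of the six departments women have the higher admission rate. The reversal arises because women applied disproportionately to departments with low admission rates.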

The command theme(axis.text.x = element_text(angle = 90, hjust = 1)) rotates the x-axis tick labels 90 degrees counterclockwise with some horizontal adjustment. See the documentation of ggplot2 themes for more information on the control of non-data elements of a ggplot object.

Maps

There are tools for dealing with map data in ggplot2. The function borders() allows us to draw borders on a map.

Example 4

Suppose I would like to study the housing market in Texas cities with a population of at least 100,000 between January and June 2015.

# Obtain the dataset.
library(maps)

data(us.cities) # from the package maps
data(txhousing) # from the package ggplot2

# Preprocessing
tx.cities <- subset(us.cities, country.etc == "TX" & pop >= 100000)
tx.cities$city <- unlist(strsplit(tx.cities$name, " TX"))
txhousing.2015 <- subset(txhousing, year == 2015 & month <= 6 &
                             city %in% tx.cities$city)
temp <- tx.cities[tx.cities$city %in% txhousing.2015$city, c("pop", "lat", "long")]
temp <- temp[rep(seq_len(nrow(temp)), each = 6), ]
txhousing.2015.geo <- cbind(txhousing.2015, temp)

# Create the plot.
ggplot(txhousing.2015.geo, aes(x = long, y = lat, size = sales, colour = cut(median, 5))) +
  borders("county", "texas", colour = "grey70") + 
  geom_point(shape=1, stroke = 1.1) + facet_wrap(~month) +
  ggtitle("Housing market for populous cities in Texas \n (Jan-Jun 2015)") +
  scale_colour_discrete(name  = "Median price") +
  scale_size_continuous(name  = "Number of Sales")
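
The preprocessing above pairs housing rows with city coordinates by position (rep() followed by cbind()), which relies on both tables listing the cities in the same order and on every city having exactly six monthly rows. A possibly more robust sketch joins the two tables on the city name with merge(). The object names tx_big, housing_2015, and housing_geo are made up for this illustration.

```r
library(ggplot2)  # provides txhousing
library(maps)     # provides us.cities

data(us.cities)

# Texas cities with a population of at least 100,000.
tx_big <- subset(us.cities, country.etc == "TX" & pop >= 100000)
tx_big$city <- sub(" TX$", "", tx_big$name)   # "Austin TX" -> "Austin"

# Housing records for January to June 2015.
housing_2015 <- subset(txhousing, year == 2015 & month <= 6)

# Inner join on city: rows without a match in the other table are dropped,
# and each housing row gets the coordinates of its own city.
housing_geo <- merge(housing_2015,
                     tx_big[, c("city", "pop", "lat", "long")],
                     by = "city")
```

housing_geo can then be passed to ggplot() in place of txhousing.2015.geo.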

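Note that to adjust the legend, a command of the form scale_xxx_yyy(...) is used, where xxx is the aesthetic (e.g. colour, size) and yyy is the type of scale (e.g. discrete, continuous). See the Cookbook for R page on ggplot2 legends for more detail.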
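
As a small sketch of this naming pattern on the iris data, the plot below adjusts the colour legend with scale_colour_manual() and the size legend with scale_size_continuous(). The colors, legend titles, and the object name p_scales are arbitrary choices for illustration.

```r
library(ggplot2)

p_scales <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                             colour = Species, size = Sepal.Length)) +
  geom_point(alpha = 0.6) +
  # colour aesthetic, manually chosen values
  scale_colour_manual(
    name   = "Iris species",
    values = c(setosa = "darkorange", versicolor = "purple",
               virginica = "cyan4")
  ) +
  # size aesthetic, continuous variable; range sets the smallest
  # and largest point sizes
  scale_size_continuous(name = "Sepal length (cm)", range = c(1, 4))
p_scales
```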

As a side note, there is a package called ggmap which might be useful for creating spatial plots. See here for a quick introduction/tutorial of ggmap.

Example 5

Create a visual display of the population of Texas cities on a map.

library(ggmap)

tx_center = as.numeric(geocode("Texas"))

## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Texas&sensor=false

txMap = ggmap(get_googlemap(center=tx_center, scale=2, zoom=6), extent="normal")

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=31.968599,-99.901813&zoom=6&size=640x640&scale=2&maptype=terrain&sensor=false

tx.cities.all <- subset(us.cities, country.etc == "TX")
txMap + geom_point(aes(x=long, y=lat, size = pop), col = "orange", 
                   data = tx.cities.all, alpha=0.4) +
  ggtitle("Population of Texas cities")

References

ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham
Quick Introduction to ggplot2 by Edwin Chen
ggmap: Spatial Visualization with ggplot2 by David Kahle and Hadley Wickham. The R Journal, 5(1), 144-161. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

A (very) brief introduction to ggplot2

Johnny Hong

January 30, 2016