Introduction

Having already explored various ways to visualize the distributions of single variables (or one continuous variable plus a qualitative variable), we now turn our attention to bivariate and multivariate representations.

library(tidyverse)

Subsetting data using dplyr::filter

We’ll continue to use the iris data set for our examples.

Initially we’ll use just the I. setosa specimens from the Iris data set for our visualizations. To select just the I. setosa specimens, we introduce a function called filter which is found in the dplyr package (part of the tidyverse!).

CAUTION: base R also provides a function named filter (in the stats package), so make sure you load tidyverse or dplyr to get the expected behavior.

filter returns the rows of a data frame that match the specified criteria. For example, the following returns only the setosa specimens:

setosa <- filter(iris, Species == "setosa")
dim(setosa)
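
One way to sidestep the name clash mentioned in the caution above is to namespace-qualify the call. The following is a minimal sketch; the variable name setosa.explicit is only illustrative.

```r
library(dplyr)

# dplyr::filter names the package explicitly, so this call uses the dplyr
# version no matter which other packages define a function called filter
setosa.explicit <- dplyr::filter(iris, Species == "setosa")
nrow(setosa.explicit)  # number of rows kept
```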

If we wanted to get both setosa and virginica specimens we could use the OR operator (|) as so:

setosa.or.virginica <- filter(iris, Species == "setosa" | Species == "virginica")
dim(setosa.or.virginica)
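
When the list of values to match grows, chaining comparisons with | becomes tedious. A possible alternative, sketched here with the illustrative variable name two.species, is R's %in% operator:

```r
library(dplyr)

# %in% tests membership in a set of values, which is a compact
# alternative to chaining several == comparisons with |
two.species <- filter(iris, Species %in% c("setosa", "virginica"))
dim(two.species)
```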

You can also create AND statements using the & operator. For example, if you wanted to get only setosa specimens with sepal widths bigger than 3.5 cm you could do:

big.setosa <- filter(iris, Species == "setosa" & Sepal.Width > 3.5)
dim(big.setosa)
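
filter also accepts several conditions separated by commas and keeps only the rows that satisfy all of them, so the comma form should behave like &. A small sketch to check this (the variable names are illustrative):

```r
library(dplyr)

# the same two conditions, combined with & and with a comma
with.ampersand <- filter(iris, Species == "setosa" & Sepal.Width > 3.5)
with.comma     <- filter(iris, Species == "setosa", Sepal.Width > 3.5)

# TRUE if both calls select exactly the same rows
identical(with.ampersand, with.comma)
```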

In-class Assignment #1

Create an R Markdown notebook, load any necessary libraries, and read the possums.csv data set. Write code blocks to solve the following problems:

  1. Use the dplyr filter function to create a new data frame with just the female possums [1 pt]

  2. Use filter to create a new data frame that includes only female possums that were assigned to age class 9. How many possums meet these criteria? [1 pt]

  3. Use filter to select those possums in age class 5 or older, or for which age information is unavailable (hint: see the help for the is.na function). How many possums meet these criteria? [2 pts]

Scatter plots

A scatter plot is one of the simplest representations of a bivariate distribution. Scatter plots are simple to create in ggplot2 by specifying the appropriate X and Y variables in the aesthetic mapping and using geom_point for the geometric mapping.

ggplot(setosa)  + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))

ggplot layers can be assigned to variables

ggplot not only facilitates drawing of plots, but also returns a “plot object” that we can assign to a variable. The following example illustrates this:

# create base plot object and assign to variable p
# this does NOT draw the plot
p <- ggplot(setosa, mapping = aes(x = Sepal.Length, y = Sepal.Width))   

In the code above we created a plot object and assigned it to the variable p. However, the plot wasn’t drawn. To draw the plot object we evaluate it as so:

p  # draw the plot object

Did the output from this code generate the image you expected?

In the previous steps we created a plot object, and drew it, but we haven’t yet specified a geom to determine how our data should be drawn. As we’ve seen in the past, geoms are layers that we can add to our plot object. Here we simply add a geom to our pre-created plot object as so:

# add a point geom to our base layer and draw the  plot
p + geom_point()
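
Typing p on its own draws the plot only because R auto-prints values at the top level of a script or notebook chunk. Inside a for loop or a function nothing is auto-printed, so print must be called explicitly. A minimal sketch, with point.size as an illustrative loop variable:

```r
library(ggplot2)
library(dplyr)

setosa <- filter(iris, Species == "setosa")
p <- ggplot(setosa, mapping = aes(x = Sepal.Length, y = Sepal.Width))

# inspect the class of the object returned by ggplot()
class(p)

# inside a loop nothing is drawn unless print() is called explicitly
for (point.size in c(1, 3)) {
  print(p + geom_point(size = point.size))
}
```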

ggplot2 themes

What use is creating an intermediate variable p that points to a plot object? One thing it allows us to do is to try different formatting options without having to repeatedly write the same code. In this section we illustrate how this flexibility can be exploited by showing how to set different “themes” for a ggplot2 generated figure.

By now you’re probably familiar with the default “look” of plots generated by ggplot2, in particular the ubiquitous gray background with a white grid. This default works fairly well in the context of RStudio notebooks and HTML output, but might not work as well for a published figure or a slide presentation. Almost every individual aspect of a plot can be tweaked, but ggplot2 provides an easier way to make consistent changes to a plot using “themes”. You can think of a theme as adding another layer to your plot. Themes should generally be applied after all the other graphical layers are created (geoms, facets, labels) so the changes they create affect all the prior layers.

There are eight default themes included with ggplot2, which can be invoked by calling the corresponding theme functions: theme_gray, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal, theme_classic, and theme_void (see http://ggplot2.tidyverse.org/reference/ggtheme.html for a visual tour of all the default themes).

For example, let’s regenerate our scatter plot using theme_bw, which gets rid of the gray background:

p + geom_point() + theme_bw()

Another theme, theme_classic, removes the grid lines completely, and also gets rid of the top-most and right-most axis lines.

p + geom_point() + theme_classic()

Further customization with ggplot2::theme

In addition to the eight complete themes, there is a theme function in ggplot2 that allows you to tweak particular elements of a theme (see ?theme for all the possible options). For example, to tweak just the aspect ratio of a plot (the ratio of the panel’s height to its width), you can set the aspect.ratio argument in theme:

p + geom_point() + theme_classic() + theme(aspect.ratio = 1)

Theme-related function calls can be combined to generate new themes. For example, let’s create a theme we’ll call my.theme by combining theme_classic with a call to theme:

my.theme <- theme_classic()  + theme(aspect.ratio = 1)

We can then apply this theme as so:

p + geom_point() + my.theme
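
If the same theme is wanted for every figure in a notebook, one option is to make it the session default with ggplot2's theme_set function, which returns the previously active theme so that it can be restored later. A sketch, assuming the my.theme definition from above (old.theme is an illustrative name):

```r
library(ggplot2)
library(dplyr)

setosa <- filter(iris, Species == "setosa")
p <- ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width))
my.theme <- theme_classic() + theme(aspect.ratio = 1)

# theme_set() changes the default theme and returns the previous one
old.theme <- theme_set(my.theme)

p + geom_point()  # drawn with my.theme, no explicit "+ my.theme" needed

# restore the previous default so later plots are unaffected
theme_set(old.theme)
```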

Other aspects of ggplots can be assigned to variables

Plot objects and themes are not the only aspects of a figure that can be assigned to variables for later use. For example, we can create a label object:

sepal.labels <- labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)",
                     title = "Relationship between Sepal Length and Width",
                     caption = "data from Anderson (1935)")

Combining all of our variables as so, we generate our new plot:

p + 
  geom_point() + 
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
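
Several components can also be bundled together: ggplot2 allows a list of components to be added to a plot with a single +. The sketch below uses the illustrative names setosa.style and styled, and rebuilds the labels and theme inline so that it runs on its own.

```r
library(ggplot2)
library(dplyr)

setosa <- filter(iris, Species == "setosa")
p <- ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width))

# a list of plot components can be added to a plot in a single step
setosa.style <- list(
  labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)",
       subtitle = "I. setosa data only"),
  theme_classic(),
  theme(aspect.ratio = 1)
)

styled <- p + geom_point() + setosa.style
styled
```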

Adding a trend line to a scatter plot

ggplot2 makes it easy to add trend lines to plots. I use “trend lines” here to refer to representations like regression lines, smoothing splines, or other representations meant to help visualize the relationship between pairs of variables. We’ll spend a fair amount of time exploring the mathematics and interpretation of regression lines and related techniques in later lectures, but for now just think about trend lines as summary representations for bivariate relationships.

Trend lines can be created using geom_smooth. Let’s add a default trend line to our I. setosa scatter plot of the Sepal Width vs Sepal Length:

p + 
  geom_jitter() +  # using geom_jitter to avoid overplotting of points
  geom_smooth() +
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme

The default trend line that geom_smooth fits to a data set of this size is generated by a technique called “LOESS regression”. LOESS regression is a non-linear curve fitting method, hence the squiggly trend line we see above. The smoothness of the LOESS regression is controlled by a parameter called span, which is related to the proportion of points used in each local fit. We’ll discuss LOESS in detail in a later lecture, but here’s an illustration of how changing the span affects the smoothness of the fitted curve:

p + 
  geom_jitter() +  # using geom_jitter to avoid overplotting of points
  geom_smooth(span = 0.95) +
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
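
To compare spans directly, two LOESS curves can be overlaid on the same plot. This is only a sketch: the span values 0.5 and 0.95, the colors, and the name span.comparison are arbitrary choices, and se = FALSE is used to hide the confidence bands so the two curves are easier to see.

```r
library(ggplot2)
library(dplyr)

setosa <- filter(iris, Species == "setosa")
p <- ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width))

span.comparison <- p +
  geom_jitter(alpha = 0.5) +
  # smaller span: each local fit uses fewer points, so the curve is wigglier
  geom_smooth(method = "loess", span = 0.5, se = FALSE, color = "blue") +
  # larger span: each local fit uses more points, so the curve is smoother
  geom_smooth(method = "loess", span = 0.95, se = FALSE, color = "red")

span.comparison
```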

Linear trend lines

If instead we want a straight trend line, as would typically be depicted for a linear regression model, we can specify a different statistical method:

p + 
  geom_jitter() +  # using geom_jitter to avoid overplotting of points
  geom_smooth(method = "lm", color = "red") + # using linear model ("lm")
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
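
With method = "lm", geom_smooth fits y as a linear function of x by default, so the red line should correspond to an ordinary lm fit of sepal width on sepal length. A sketch for looking at the underlying intercept and slope (fit is an illustrative name; regression itself is covered in later lectures):

```r
library(dplyr)

setosa <- filter(iris, Species == "setosa")

# fit the same kind of model that geom_smooth(method = "lm") draws
fit <- lm(Sepal.Width ~ Sepal.Length, data = setosa)

# intercept and slope of the fitted line
coef(fit)
```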

In-class Assignment #2

The following questions apply to the possums data set.

  1. Draw a scatter plot to depict the relationship between age (an ordered but discrete variable in this data) and total body length (totlngth). Make sure you plot age on the x-axis. [1 pt]

  2. Redraw the body length vs age scatter plot, adding a LOESS trend line to depict the age vs body length relationship. Explore different values of the span smoothing parameter between 0.3 and 2 to find a value that generates a trend line that “makes biological sense”. Explain your reasoning. [3 pts]

  3. Redraw the body length vs age scatter plot, including a linear trend line. Discuss the possible differences in biological interpretation one might make when comparing the LOESS trend line versus the linear trend line. [3 pts]

Bivariate density plots

The density plot, which we introduced as a visualization for univariate data, can be extended to two-dimensional data. In a one-dimensional density plot, the height of the curve is related to the relative density of points in the surrounding region. In a 2D density plot, nested contours (or contours plus colors) indicate regions of higher local density. Let’s illustrate this with an example:

p + 
  geom_density_2d() + 
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme

The relationship between the 2D density plot and a scatter plot can be made clearer if we combine the two:

p + 
  geom_density_2d() + 
  geom_jitter(alpha=0.35) +
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
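
According to the ggplot2 documentation, the 2D density estimate behind these contours is computed with MASS::kde2d. The sketch below calls that function directly to show what is being contoured. The grid size n = 50 and the name dens are assumptions made here, and ggplot2 may use different grid and bandwidth settings, so the contours need not match exactly.

```r
library(dplyr)

setosa <- filter(iris, Species == "setosa")

# estimate the joint density on a 50 x 50 grid spanning the data
dens <- MASS::kde2d(setosa$Sepal.Length, setosa$Sepal.Width, n = 50)

dim(dens$z)   # one density estimate per grid cell

# base R contour plot of the same kind of estimate
contour(dens$x, dens$y, dens$z,
        xlab = "Sepal Length (cm)", ylab = "Sepal Width (cm)")
```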

Combining Scatter Plots and Density Plots with Categorical Information

As with many of the univariate visualizations we explored, it is often useful to depict bivariate relationships as we change a categorical variable. To illustrate this, we’ll go back to using the full iris data set.

all.length.vs.width <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

all.length.vs.width + 
  geom_point(aes(color = Species, shape = Species), size = 2, alpha = 0.6) +
  sepal.labels + labs(subtitle = "All species") +
  my.theme

Notice how in our aesthetic mapping we specified that both color and shape should be used to represent the species categories.

The same thing can be accomplished with a 2D density plot.

all.length.vs.width + 
  geom_density_2d(aes(color = Species)) +
  sepal.labels + labs(subtitle = "All species") +
  my.theme
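
When several layers share the same mapping, it can be declared once in the ggplot call, since layers inherit the aesthetics of the base plot by default. A sketch with the illustrative names by.species and combined:

```r
library(ggplot2)

# color is mapped once in ggplot(), so every layer added later inherits it
by.species <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                               color = Species))

combined <- by.species +
  geom_density_2d(alpha = 0.5) +
  geom_point(alpha = 0.6)

combined
```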

As you can see in the density plots above, when you have multiple categories and there is significant overlap in the range of each sub-distribution, figures can become quite busy. As we’ve seen previously, faceting (conditioning) can be a good way to deal with this. Below is a combination of scatter plots and 2D density plots, combined with faceting on the species variable.

all.length.vs.width + 
  geom_density_2d(aes(color = Species), alpha = 0.5) + 
  geom_point(aes(color = Species), alpha=0.5, size=1) + 
  facet_wrap(~ Species) +
  sepal.labels + labs(subtitle = "All species") +
  theme_bw() + 
  theme(aspect.ratio = 1, legend.position = "none")  # get rid of legend

In this example I went back to using a theme that includes grid lines to facilitate more accurate comparisons of the distributions across the facets. I also got rid of the legend, because the information there was redundant.

Density plots with fill

Let’s revisit our earlier single species 2D density plot. Instead of simply drawing contour lines, let’s use color information to help guide the eye to areas of higher density. To draw filled contours, we use a sister function to geom_density_2d called stat_density_2d:

p + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") + 
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme

Using the default color scale, areas of low density are drawn in dark blue, whereas areas of high density are drawn in light blue. I personally find this dark-to-light color scale non-intuitive for density data, and would prefer that darker regions indicate areas of higher density. If we want to change the color scale, we can use a scale function (in this case scale_fill_continuous) to set the color values used for the low and high values (this function will interpolate the intervening values for us).

NOTE: when specifying color names, R accepts many of the standard HTML color names (see the Wikipedia page on web colors for a list, or run colors() to see the names R recognizes). We’ll also see other ways to set color values in a later class session.

p + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") + 
  
  # lavenderblush is the HTML standard name for a light purplish-pink color
  scale_fill_continuous(low="lavenderblush", high="red") +
  
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme

The two contour plots we generated look a little odd because the contours are cut off where the contour regions extend beyond the limits of the plot. To fix this, we can change the plot limits using the lims function, as shown in the following code block. We’ll also add the (jittered) scatter plot to emphasize the relationship between the density levels and the points, and we’ll change the title of the color legend on the right by specifying a text label for the fill argument in the labs function.

p + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  scale_fill_continuous(low="lavenderblush", high="red") +
  geom_jitter(alpha=0.5, size = 1.1) +
  
  # customize labels, including legend label for fill
  labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)",
       title = "Relationship between sepal length and width",
       subtitle = "I. setosa specimens only",
       fill = "Density") +
  
  # Set plot limits. We'll discuss what this c() notation means next lecture
  lims(x = c(4,6), y = c(2.5, 4.5)) +
  my.theme 
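
One thing to keep in mind is that lims sets the limits of the scales, and observations that fall outside those limits are dropped from the plot (ggplot2 reports this with a warning). Before fixing limits it can help to compare them against the observed ranges. A sketch, with outside as an illustrative name:

```r
library(dplyr)

setosa <- filter(iris, Species == "setosa")

# observed ranges of the two plotted variables
range(setosa$Sepal.Length)
range(setosa$Sepal.Width)

# flag specimens that fall outside the limits x = c(4, 6), y = c(2.5, 4.5)
outside <- with(setosa,
                Sepal.Length < 4 | Sepal.Length > 6 |
                Sepal.Width < 2.5 | Sepal.Width > 4.5)

sum(outside)  # number of specimens that lims() would drop
```

If any specimens are flagged, widening the corresponding limit keeps them in the plot.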

2D bin and hex plots

Two-dimensional bin and hex plots are alternative ways to represent the joint density of points in the Cartesian plane. Here are examples of how to generate these plot types. Compare them to our previous examples.

A 2D bin plot can be thought of as a 2D histogram:

p + 
  geom_bin2d(binwidth = 0.2) + 
  scale_fill_continuous(low="lavenderblush", high="red") +
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
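
The “2D histogram” idea can be made concrete by counting specimens per square bin with base R. In this sketch the bin edges (0.2 cm apart, starting at 4 and 2) and the variable names are assumptions chosen for illustration; they need not coincide with the edges geom_bin2d picks.

```r
library(dplyr)

setosa <- filter(iris, Species == "setosa")

# assign each specimen to a bin along each axis
length.bins <- cut(setosa$Sepal.Length, breaks = seq(4, 6, by = 0.2))
width.bins  <- cut(setosa$Sepal.Width,  breaks = seq(2, 4.6, by = 0.2))

# cross-tabulate: each cell is the number of specimens in that square bin
bin.counts <- table(length.bins, width.bins)
bin.counts
```

The fill color in the 2D bin plot encodes counts of this kind.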

A hex plot is similar to a 2D bin plot but uses hexagonal regions instead of squares. Hexagonal bins are useful because they can sometimes avoid visual artefacts that are apparent with square bins:

p + 
  geom_hex(binwidth = 0.2) + 
  scale_fill_continuous(low="lavenderblush", high="red") +
  sepal.labels + labs(subtitle = "I. setosa data only") +
  my.theme
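
Note that geom_hex depends on the hexbin package, which is installed separately from ggplot2. The following sketch checks for it and falls back to square bins if it is missing; the names has.hexbin and base.plot, and the choice of bins = 10, are illustrative.

```r
library(ggplot2)

# geom_hex relies on the hexbin package, which is installed separately
has.hexbin <- requireNamespace("hexbin", quietly = TRUE)

base.plot <- ggplot(subset(iris, Species == "setosa"),
                    aes(x = Sepal.Length, y = Sepal.Width))

if (has.hexbin) {
  print(base.plot + geom_hex(bins = 10))
} else {
  # fall back to square bins when hexbin is unavailable
  print(base.plot + geom_bin2d(bins = 10))
}
```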