# Introduction

Having already explored various ways to visualize the distributions of single variables (or one continuous variable plus a qualitative variable), we now turn our attention to bivariate and multivariate representations.

``library(tidyverse)``

# Subsetting data using `dplyr::filter`

We’ll continue to use the `iris` data set for our examples.

Initially we’ll use just the I. setosa specimens from the `iris` data set for our visualizations. To select them, we introduce a function called `filter`, which is found in the `dplyr` package (part of the tidyverse!).

CAUTION: base R also ships a built-in `filter` function (`stats::filter`) that does something entirely different, so make sure you load tidyverse or dplyr to get the behavior described here.
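One way to sidestep the name clash is to call the function with its package prefix. The following is a minimal sketch, assuming dplyr is installed; the variable name `setosa.explicit` is only illustrative.

```r
library(dplyr)

# The dplyr:: prefix always resolves to dplyr's filter,
# regardless of which packages were loaded and in what order
setosa.explicit <- dplyr::filter(iris, Species == "setosa")
nrow(setosa.explicit)
```

The prefix is optional once dplyr is loaded, but it can make scripts more robust when several attached packages define a `filter`.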

`filter` keeps only the rows of a data frame that match the specified criteria. For example, the following returns only the setosa specimens:

``````setosa <- filter(iris, Species == "setosa")
dim(setosa)``````

If we wanted to get both setosa and virginica specimens, we could use the OR operator (`|`) like so:

``````setosa.or.virginica <- filter(iris, Species == "setosa" | Species == "virginica")
dim(setosa.or.virginica)``````
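When a single column is compared against several values, base R's `%in%` operator is a compact alternative to chaining `|` comparisons. A small sketch of that idea (the name `two.species` is just for illustration):

```r
library(dplyr)

# Keep rows whose Species is any of the listed values
two.species <- filter(iris, Species %in% c("setosa", "virginica"))
dim(two.species)
```

This form tends to scale better as the list of accepted values grows.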

You can also create an AND statement using the `&` operator. For example, if you wanted to get only setosa specimens with sepal widths greater than 3.5 cm, you could do:

``````big.setosa <- filter(iris, Species == "setosa" & Sepal.Width > 3.5)
dim(big.setosa)``````
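`filter` also accepts several conditions separated by commas and combines them with AND, so the `&` form can be written another way. The sketch below compares the two styles; the variable names are illustrative only.

```r
library(dplyr)

# Explicit AND operator
big.setosa.and <- filter(iris, Species == "setosa" & Sepal.Width > 3.5)

# Comma-separated conditions are combined with AND as well
big.setosa.comma <- filter(iris, Species == "setosa", Sepal.Width > 3.5)

# Check that both approaches select the same rows
identical(big.setosa.and, big.setosa.comma)
```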

# In-class Assignment #1

Create an R Markdown notebook, load any necessary libraries, and read the `possums.csv` data set. Write code blocks to solve the following problems:

1. Use the dplyr `filter` function to create a new data frame with just the female possums [1 pt]

2. Use `filter` to create a new data frame that includes only female possums that were assigned to age class 9. How many possums meet these criteria? [1 pt]

3. Use `filter` to select those possums in age class 5 or older, or for which age information is unavailable (hint: see the help for the `is.na` function). How many possums meet these criteria? [2 pts]
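As background for the `is.na` hint: `filter` drops rows where the condition evaluates to `NA`, so missing values have to be requested explicitly. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `id` and `score`), not the possum data.

```r
library(dplyr)

# Hypothetical toy data with two missing scores
toy <- data.frame(id = 1:5, score = c(10, NA, 7, NA, 3))

# Rows with a missing score are silently dropped by the comparison
high.only <- filter(toy, score >= 7)

# Adding is.na() keeps the rows where score is missing
high.or.missing <- filter(toy, score >= 7 | is.na(score))

nrow(high.only)
nrow(high.or.missing)
```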

# Scatter plots

A scatter plot is one of the simplest representations of a bivariate distribution. Scatter plots are simple to create in ggplot2 by specifying the appropriate X and Y variables in the aesthetic mapping and using `geom_point` for the geometric mapping.

``````ggplot(setosa) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))``````

# ggplot layers can be assigned to variables

`ggplot` not only facilitates drawing of plots, but also returns a “plot object” that we can assign to a variable. The following example illustrates this:

``````# create base plot object and assign to variable p
# this does NOT draw the plot
p <- ggplot(setosa, mapping = aes(x = Sepal.Length, y = Sepal.Width))``````

In the code above we created a plot object and assigned it to the variable `p`. However, the plot wasn’t drawn. To draw the plot object we evaluate it like so:

``p  # draw the plot object``

Did the output from this code generate the image you expected?

In the previous steps we created a plot object and drew it, but we haven’t yet specified a geom to determine how our data should be drawn, which is why only an empty set of axes appeared. As we’ve seen in the past, geoms are layers that we can add to our plot object. Here we simply add a geom to our pre-created plot object like so:

``````# add a point geom to our base layer and draw the plot
p + geom_point()``````
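Because adding a layer returns a new plot object rather than modifying `p`, the same base object can be reused for several variants. The following sketch rebuilds `setosa` and `p` so it runs on its own; the names `p.points` and `p.red` are illustrative, and inspecting the `layers` element is just one informal way to see what a plot object contains.

```r
library(dplyr)
library(ggplot2)

setosa <- filter(iris, Species == "setosa")
p <- ggplot(setosa, mapping = aes(x = Sepal.Length, y = Sepal.Width))

# Each addition produces a new plot object; p itself stays unchanged
p.points <- p + geom_point()
p.red <- p + geom_point(color = "red")

# The base object has no layers, while the derived plot has one
length(p$layers)
length(p.points$layers)

# Evaluate a stored plot object to draw it
p.red
```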