library(tidyverse)
library(stringr)
library(magrittr) # a new library!
We’ll once again use the NC births data set we explored in our last class session.
births <- read_tsv("https://raw.githubusercontent.com/Bio204-class/bio204-datasets/master/births.txt")
dplyr
includes a very useful operator available called a pipe available to us. Pipes are actually defined in another packaged called magrittr
. We’ll look at the basic pipe operator and then look at a few additional “special” pipes that magrittr
provides.
Pipes are powerful because they allow us to chain together sets of operations in a very intuitive fashion while minimizing nested function calls. We can think of pipes as taking the output of one function and feeding it as the first argument to another function call, where we’ve already specified the subsequent arguments.
The pipe operator is designated by %>%
. Using pipes, the expression x %>% f()
is equivalent to f(x)
and the expression x %>% f(y)
is equivalent to f(x,y)
.
births %>% head() # same as head(births)
births %>% head # you can even leave the parentheses out
births %>% head(10) # same as head(births, 10)
Multiple pipes can be chained together, such that x %>% f() %>% g() %>% h()
is equivalent to h(g(f(x)))
.
# equivalent to: head(arrange(births, weight), 10)
births %>% arrange(weight) %>% head(10)
The documentation on pipes (see ?magrittr
) uses the notation lhs %>% rhs
where lhs
and rhs
are short for “left-hand side” and “right-hand side” respectively. I’ll use this same notation in some of the explanations that follow.
To illustrate how pipes help us, first let’s look at an example set of analysis steps without using pipes. Let’s say we wanted to explore the relationship between father’s age and baby’s birth weight. We’ll start this process of exploration by generating a bivariate scatter plot. Being good scientists we want to express our data in SI units, so we’ll need to converts pounds to kilograms. You’ll also recall that a number of the cases have missing data on father’s age, so we’ll want to remove those before we plot them. Here’s how we might accomplish these steps:
# add a new column for weight in kg
births.kg <- mutate(births, weight.kg = weight / 2.2)
# filter out the NA fathers
filtered.births <- filter(births.kg, !is.na(fAge))
# create our plot
ggplot(filtered.births, aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
Notice that we created two “temporary” data frames along the way – births.kg
and filtered.births
. These probably aren’t of particular interest to us, but we needed to generate them to build the plot we wanted. If you were particularly masochistic you could avoid these temporary data frames by using nested functions call like this:
# You SHOULD NOT write nested code like this.
# Code like this is hard to debug and understand!
ggplot(filter(mutate(births, weight.kg = weight / 2.2), !is.na(fAge)),
aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
The pipe operator makes the output of one statement (lhs
) as the first input of a following function (rhs
). This simplifies the above example to:
births %>%
mutate(weight.kg = weight / 2.2) %>%
filter(!is.na(fAge)) %>%
ggplot(aes(x = fAge, y = weight.kg)) +
geom_point() +
labs(x = "Father's Age (years)", y = "Birth Weight (kg)")
In the example above, we feed the data frame into the mutate
function. mutate
expects a data frame as a first argument, and subsequent arguments specify the new variables to be created. births %>% mutate(weight.kg = weight / 2.2)
is thus equivalent to mutate(births, weight.kg = weight / 2.2))
. We then pipe the output to filter
, removing NA fathers, and then pipe that output as the input to ggplot.
It’s good coding style to write each discrete step as its own line when using piping. This make it easier to understand what the steps of the analysis are as well as facilitating changes to the code (commenting out lines, adding lines, etc)
It’s important to recognize that pipes are simply a convenient way to chain together a series of expression. Just like any other compound expression, the output of a series of pipe statements can be assigned to a variable, like so:
stats.old.moms <-
births %>%
filter(mAge > 35) %>%
summarize(median.gestation = median(weeks), mean.weight = mean(weight))
stats.old.moms
When working with pipes, sometimes you’ll want to use the lhs
in multiple places on the rhs
, or as something other than the first argument to the rhs
. magrittr
provides for this situation by using the dot (.
) operator as a placeholder. Using the dot operator, the expression y %>% f(x, .)
is equivalent to f(x,y)
.
c("dog", "cakes", "sauce", "house") %>%
sample(1) %>%
str_c("hot", .)
magrittr
defines another operator called the “exposition pipe operator”, designed %$%
. This operator exposes the names in the lhs
to the expression on the rhs
.
Here is an example of using the exposition pipe operator to simply return the vector of weights:
births %>%
filter(premature == "premie") %$% # note the different pipe operator!
weight
If we wanted to calculate the minimum and maximum weight of premature babies in the data set we could do:
births %>%
filter(mAge > 35) %$% # note the different pipe operator!
c(min(weight), max(weight))
Write a code block that uses pipes to count the number of premature births in the data set. [1 pt]
Write a code block that uses pipes to calculate the mean weight, in kilograms, of babies classified as premature. [1 pts]
Write a code block that uses pipes to create a scatter plot depicting birth weight in kilograms (y-axis) versus weeks of gestation (x-axis) for babies born to non-smoking mothers, coloring the points according to whether the baby was premature or full term. Your figure should look similar to the one below [2 pts]
Consider this code:
sample(5:10, 1) %>% sample_n(births, .)
What does the function sample
do? [0.5 pt]
What does the function sample_n
do? [0.5 pt]
What does the code block above do? [1 pt]
Consider the following code block which illustrates two ways to calculate the mean and median gestation time for babies of mothers who smoke:
smokers.1 <-
births %>%
filter(smoke == "smoker") %>%
summarize(mean.gestation = mean(weeks), median.gestation = median(weeks))
smokers.2 <-
births %>%
filter(smoke == "smoker") %$%
c(mean(weeks), median(weeks))
What is the type and class of smokers.1
? [0.5 pt]
What is the type and class of smokers.2
? [0.5 pt]
Why does smokers.1$mean.gestation
work, while smokers.2$mean.gestation
raises an error? [1 pt]
So far we’ve been using a variety of built in functions in R. However the real power of a programming language is the ability to write your own functions. Functions are a mechanism for organizing and abstracting a set of related computations. We usually write functions to represent sets of computations that we apply frequently, or to represent some conceptually coherent set of manipulations to data.
The general form of an R function is as follows:
funcname <- function(arg1, arg2) {
# one or more expressions that operate on the fxn arguments
# last expression is the object returned
# or you can explicitly return an object
}
To make this concrete, here’s an example where we define a function to calculate the area of a circle:
area.of.circle <- function(r){
return(pi * r^2)
}
Since R returns the value of the last expression in the function, the return
call is optional and we could have simply written:
area.of.circle <- function(r){
pi * r^2
}
Very short and concise functions are often written as a single line. In practice I’d probably write the above function as:
area.of.circle <- function(r) {pi * r^2}
The area.of.circle
function takes one argument, r
, and calculates the area of a circle with radius r. Having defined the function we can immediately put it to use:
area.of.circle(3)
radius <- 4
area.of.circle(radius)
If you type a function name without parentheses R shows you the function’s definition. This works for built-in functions as well (thought sometimes these functions are defined in C code in which case R will tell you that the function is a .Primitive
).
Function arguments can specify the data that a function operates on or parameters that the function uses. Function arguments can be either required or optional. In the case of optional arguments, a default value is assigned if the argument is not given.
Take for example the log
function. If you examine the help file for the log
function (type ?log
now) you’ll see that it takes two arguments, refered to as x
and base
. The argument x
represents the numeric vector you pass to the function and is a required argument (see what happens when you type log()
without giving an argument). The argument base
is optional. By default the value of base
is \(e = 2.71828\ldots\). Therefore by default the log
function returns natural logarithms. If you want logarithms to a different base you can change the base
argument as in the following examples:
log(2) # log of 2, base e
log(2,2) # log of 2, base 2
log(2, 4) # log of 2, base 4
Because base 2 and base 10 logarithms are fairly commonly used, there are convenient aliases for calling log
with these bases.
log2(8)
log10(100)
To write a function that has an optional argument, you can simply specify the optional argument and its default value in the function definition as so:
# a function to substitute missing values in a vector
sub.missing <- function(x, sub.value = -99){
x[is.na(x)] <- sub.value
return(x)
}
You can then use this function as so:
m <- c(1, 2, NA, 4)
sub.missing(m, -999) # explicitly define sub.value
sub.missing(m, sub.value = -333) # more explicit syntax
sub.missing(m) # use default sub.value
m # notice that m wasn't modified within the function
Notice that when we called sub.missing
with our vector m
, the vector did not get modified in the function body. Rather a new vector, x
was created within the function and returned. However, if you did the missing value subsitute outside of a function call, then the vector would be modified:
n <- c(1, 2, NA, 4)
n[is.na(n)] <- -99
n
When you define a function at the interactive prompt and then close the interpreter your function definition will be lost. The simple way around this is to define your R functions in a script that you can than access at any time.
In RStudio choose File > New File > R Script
. This will bring up a blank editor window. Type your function(s) into the editor. Everything in this file will be interpretted as R code, so you should not use the code block notation that is used in Markdown notebooks. Save the source file in your R working directory with a name like myfxns.R
.
# functions defined in myfxns.R
area.of.circle <- function(r) {pi * r^2}
area.of.rectangle <- function(l, w) {l * w}
area.of.triangle <- function(b, h) {0.5 * b * h }
Once your functions are in a script file you can make them accesible by using the source
function, which reads the named file as input and evaluates any definitions or statements in the input file (See also the Source
button in the R Studio GUI):
source("myfxns.R")
Having sourced the file you can now use your functions like so:
radius <- 3
len <- 4
width <- 5
base <- 6
height <- 7
area.of.circle(radius)
area.of.rectangle(len, width)
area.of.triangle(base, height)
Note that if you change the source file, such as correcting a mistake or adding a new function, you need to call the source
function again to make those changes available.
There is no built-in is.not.na
function to test if the elements of a vector are not NA. Write a function that implements is.not.na
. I’ve provided a template below [1 pts]:
is.not.na <- function(x){
# write your code here
}
# test your function by uncommenting this line
# is.not.na(c("a", "b", NA, "d", NA))
# The expected output is:
## [1] TRUE TRUE FALSE TRUE FALSE
Write a function, which.na
, that returns the indices of the elements of a vector that are NA. The functions is.na
and which
should be helpful in writing your function [1 pt]
# Define your which.na function here
# test your function by uncommenting this line
# which.na(c("a", "b", NA, "d", NA))
# The expected output is:
## [1] 3 5
Write a function called my.summary
that takes a numeric vector, x
, as input, and returns a list object that gives the minimum, maximum, range (= max - min), mean, and median of x
. [2 pts]
# Define your my.summary function here
# uncomment the following line to test your function
# my.summary(births$weight)
# The expected output should look like this:
## > my.summary(births$weight)
## $min
## [1] 1.63
##
## $max
## [1] 10.13
##
## $range
## [1] 8.5
##
## $mean
## [1] 7.046
##
## $median
## [1] 7.31