Chapter 5 More R Basics: Data structures
In computer science, the term “data structure” refers to the ways that data are stored, retrieved, and organized in a computer’s memory. Common examples include lists, hash tables (also called dictionaries), sets, queues, and trees. Different types of data structures are used to support different types of operations on data.
In R, the three basic data structures are vectors, lists, and data frames.
5.1 Vectors
Vectors are the core data structure in R. Vectors store an ordered list of items, all of the same type (i.e. the data in a vector are “homogeneous” with respect to their type).
The simplest way to create a vector at the interactive prompt is to use the c()
function, which is shorthand for “combine” or “concatenate”.
Vectors in R always have a type (accessed with the typeof()
function) and a length (accessed with the length()
function).
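As a minimal sketch of these ideas, here is a small numeric vector (the name x and its values are chosen purely for illustration):

```r
x <- c(2, 4, 6, 8)   # combine four numbers into a vector
x
#> [1] 2 4 6 8
typeof(x)            # numbers typed this way are stored as doubles
#> [1] "double"
length(x)
#> [1] 4
```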
Vectors don’t have to be numerical; logical and character vectors work just as well.
y <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
y
#> [1] TRUE TRUE FALSE TRUE FALSE FALSE
typeof(y)
#> [1] "logical"
length(y)
#> [1] 6
z <- c("How", "now", "brown", "cow")
z
#> [1] "How" "now" "brown" "cow"
typeof(z)
#> [1] "character"
length(z)
#> [1] 4
You can also use c()
to concatenate two or more vectors together.
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7, 9) # create another vector, labeled y
xy <- c(x,y) # combine two vectors
xy
#> [1] 2 4 6 8 1 3 5 7 9
z <- c(pi/4, pi/2, pi, 2*pi)
xyz <- c(x, y, z) # combine three vectors
xyz
#> [1] 2.0000000 4.0000000 6.0000000 8.0000000 1.0000000 3.0000000 5.0000000
#> [8] 7.0000000 9.0000000 0.7853982 1.5707963 3.1415927 6.2831853
5.1.1 Vector Arithmetic
The basic R arithmetic operations work on numeric vectors as well as on single numbers (in fact, behind the scenes in R single numbers are vectors!).
x <- c(2, 4, 6, 8, 10)
x * 2 # multiply each element of x by 2
#> [1] 4 8 12 16 20
x - pi # subtract pi from each element of x
#> [1] -1.1415927 0.8584073 2.8584073 4.8584073 6.8584073
y <- c(0, 1, 3, 5, 9)
x + y # add together each matching element of x and y
#> [1] 2 5 9 13 19
x * y # multiply each matching element of x and y
#> [1] 0 4 18 40 90
x/y # divide each matching element of x and y
#> [1] Inf 4.000000 2.000000 1.600000 1.111111
Basic numerical functions operate element-wise on numerical vectors:
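For instance, squaring and taking square roots both apply to every element in turn (the choice of functions here is illustrative):

```r
x <- c(2, 4, 6, 8, 10)
x^2       # square each element of x
#> [1]   4  16  36  64 100
sqrt(x)   # square root of each element of x
#> [1] 1.414214 2.000000 2.449490 2.828427 3.162278
```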
5.1.2 Vector recycling
When vectors are not of the same length, R “recycles” the elements of the shorter vector to make the lengths conform.
x <- c(2, 4, 6, 8, 10)
length(x)
#> [1] 5
z <- c(1, 4, 7, 11)
length(z)
#> [1] 4
x + z
#> [1] 3 8 13 19 11
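In the example above z was treated as if it were the vector (1, 4, 7, 11, 1). R also emits a warning in this case, because the length of the longer vector (5) is not a multiple of the length of the shorter one (4).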
5.1.3 Simple statistical functions for numeric vectors
Now that we’ve introduced vectors as the simplest data structure for holding collections of numerical values, we can introduce a few of the most common statistical functions that operate on such vectors.
First let’s create a vector to hold our sample data of interest. Here I’ve taken a random sample of the lengths of the last names of students enrolled in Bio 723 during Spring 2018.
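The definition of the sample vector is not shown here. As a stand-in, the vector below can be used to follow along; its values are an assumption, chosen so that they reproduce every statistic reported in the rest of this section, and they are not necessarily the original sample.

```r
# Assumed values: consistent with the sum, min, max, mean, median,
# variance, and standard deviation reported in this section
len.name <- c(7, 7, 6, 2, 9, 9, 7, 4, 10, 5)
length(len.name)
#> [1] 10
```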
Some common statistics of interest include minimum, maximum, mean, median, variance, and standard deviation:
sum(len.name)
#> [1] 66
min(len.name)
#> [1] 2
max(len.name)
#> [1] 10
mean(len.name)
#> [1] 6.6
median(len.name)
#> [1] 7
var(len.name) # variance
#> [1] 6.044444
sd(len.name) # standard deviation
#> [1] 2.458545
The summary()
function applied to a vector of doubles produces a useful table of some of these key statistics:
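A short sketch of such a call, using assumed sample values that match the statistics reported above:

```r
len.name <- c(7, 7, 6, 2, 9, 9, 7, 4, 10, 5)  # assumed sample values
s <- summary(len.name)  # minimum, quartiles, median, mean, and maximum
s
names(s)                # the labels R uses for each statistic
```

The result reports the minimum, first quartile, median, mean, third quartile, and maximum in a single call.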
5.1.4 Indexing Vectors
Accessing the elements of a vector is called “indexing”. Indexing is the process of specifying the numerical positions (indices) of the elements that you want to access.
For a vector of length \(n\), we can access the elements by the indices \(1 \ldots n\). We say that R vectors (and other data structures like lists) are “one-indexed”. Many other programming languages, such as Python, C, and Java, use zero-indexing, where the elements of a data structure are accessed by the indices \(0 \ldots n-1\). Indexing errors are a common source of bugs.
Indexing a vector is done by specifying the index in square brackets as shown below:
x <- c(2, 4, 6, 8, 10)
length(x)
#> [1] 5
x[1] # return the 1st element of x
#> [1] 2
x[4] # return the 4th element of x
#> [1] 8
Negative indices are used to exclude particular elements. x[-1]
returns all elements of x
except the first.
You can get multiple elements of a vector by indexing by another vector. For example, x[c(3,5)]
returns the third and fifth elements of x.
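Both negative indices and indexing by a vector can be tried out on the vector x defined above (repeated here so the sketch stands on its own):

```r
x <- c(2, 4, 6, 8, 10)
x[-1]          # all elements except the first
#> [1]  4  6  8 10
x[c(3, 5)]     # the third and fifth elements
#> [1]  6 10
x[-c(1, 2)]    # negative indices can be given as a vector too
#> [1]  6  8 10
```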
5.1.5 Comparison operators applied to vectors
When the comparison operators, such as “greater than” (>
), “less than or equal to” (<=
), equality (==
), etc. are applied to numeric vectors, they return logical vectors:
x <- c(2, 4, 6, 8, 10, 12)
x < 8 # returns TRUE for all elements less than 8
#> [1] TRUE TRUE TRUE FALSE FALSE FALSE
Here’s a fancier example:
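One possible version combines two comparisons with the element-wise “and” operator (&); the particular conditions are chosen only for illustration:

```r
x <- c(2, 4, 6, 8, 10, 12)
x > 4 & x < 10    # TRUE where both conditions hold
#> [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
x %% 4 == 0       # TRUE for elements evenly divisible by 4
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```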
5.1.6 Combining Indexing and Comparison of Vectors
A very powerful feature of R is the ability to combine the comparison operators (which return TRUE or FALSE values) with indexing. This facilitates data filtering and subsetting.
Here’s an example:
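A minimal sketch, reusing the vector x from the previous section:

```r
x <- c(2, 4, 6, 8, 10, 12)
x[x > 5]    # keep only the elements for which x > 5 is TRUE
#> [1]  6  8 10 12
```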
In the first example we retrieved all the elements of x
that are larger than 5 (read as “x where x is greater than 5”). Notice how we got back all the elements where the statement in the brackets was TRUE
.
You can string together comparisons for more complex filtering.
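For instance, two comparisons can be joined with the element-wise “or” operator (|), again using the same x:

```r
x <- c(2, 4, 6, 8, 10, 12)
x[x < 4 | x > 6]    # elements smaller than 4 or greater than 6
#> [1]  2  8 10 12
```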
In the second example we retrieved those elements of x
that were smaller than four or greater than six. Combining indexing and comparison is a powerful concept which we’ll use repeatedly in this course.
5.1.7 Vector manipulation
You can combine indexing with assignment to change the elements of a vector:
You can also use indexing vectors to change multiple values at once:
Using logical vectors to manipulate the elements of a vector also works:
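The three kinds of assignment described in this section can be sketched as follows (the vector and the replacement values are illustrative):

```r
x <- c(2, 4, 6, 8, 10)

x[2] <- -4                # change a single element
x
#> [1]  2 -4  6  8 10

x[c(1, 3)] <- c(0, 60)    # change several elements at once
x
#> [1]  0 -4 60  8 10

x[x < 0] <- 0             # logical indexing: set negative values to zero
x
#> [1]  0  0 60  8 10
```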
5.1.8 Vectors from regular sequences
There are a variety of functions for creating regular sequences in the form of vectors.
1:10 # create a vector with the integer values from 1 to 10
#> [1] 1 2 3 4 5 6 7 8 9 10
20:11 # a vector with the integer values from 20 to 11
#> [1] 20 19 18 17 16 15 14 13 12 11
seq(1, 10) # like 1:10
#> [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10, by = 2) # 1:10, in steps of 2
#> [1] 1 3 5 7 9
seq(2, 4, by = 0.25) # 2 to 4, in steps of 0.25
#> [1] 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00
5.1.9 Additional functions for working with vectors
The function unique()
returns the unique items in a vector:
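A short sketch, using the vector x whose values appear in the outputs later in this section:

```r
x <- c(5, 2, 1, 4, 6, 9, 8, 5, 7, 9)
unique(x)    # repeated values (5 and 9) are kept only once
#> [1] 5 2 1 4 6 9 8 7
```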
rev()
returns the items in reverse order (without changing the input vector):
y <- rev(x)
y
#> [1] 9 7 5 8 9 6 4 1 2 5
x # x is still in original order
#> [1] 5 2 1 4 6 9 8 5 7 9
There are a number of useful functions related to sorting. Plain sort()
returns a new vector with the items in sorted order:
sorted.x <- sort(x) # returns items of x sorted
sorted.x
#> [1] 1 2 4 5 5 6 7 8 9 9
x # but x remains in its unsorted state
#> [1] 5 2 1 4 6 9 8 5 7 9
The related function order()
gives the indices which would rearrange the items into sorted order:
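A brief sketch with the same x as above; indexing x by the result reproduces the sorted vector:

```r
x <- c(5, 2, 1, 4, 6, 9, 8, 5, 7, 9)
order(x)       # positions of the smallest, next smallest, ... elements
#> [1]  3  2  4  1  8  5  9  7  6 10
x[order(x)]    # same result as sort(x)
#> [1] 1 2 4 5 5 6 7 8 9 9
```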
order()
can be useful when you want to sort one vector by the values of another:
students <- c("fred", "tabitha", "beatriz", "jose")
class.ranking <- c(4, 2, 1, 3)
students[order(class.ranking)] # get the students sorted by their class.ranking
#> [1] "beatriz" "tabitha" "jose" "fred"
any()
and all()
return single logical values based on a logical vector (typically the result of a comparison) provided as an argument:
y <- c(2, 4, 5, 6, 8)
any(y > 5) # returns TRUE if any of the elements are TRUE
#> [1] TRUE
all(y > 5) # returns TRUE if all of the elements are TRUE
#> [1] FALSE
which()
returns the indices of the vector for which the input is true:
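For example, with the vector y from above:

```r
y <- c(2, 4, 5, 6, 8)
which(y > 5)       # positions of the elements greater than 5
#> [1] 4 5
y[which(y > 5)]    # the elements at those positions
#> [1] 6 8
```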
5.2 Lists
R lists are like vectors, but unlike a vector where all the elements are of the same type, the elements of a list can have arbitrary types (even other lists). Lists are a powerful data structure for organizing information, because there are few constraints on the shape or types of the data included in a list.
Lists are easy to create:
Note that lists can contain arbitrary data. Lists can even contain other lists:
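The sketch below builds a simple list and then a nested one with the list() function. The contents are inferred from the printed output of l that follows, so treat the exact call as a reconstruction:

```r
# a list mixing character and numeric items
l <- list("Bob", pi, 10)

# the same list with a fourth item that is itself a list
l <- list("Bob", pi, 10, list("foo", "bar", "baz", "qux"))
```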
Lists are displayed with a particular format, distinct from vectors:
l
#> [[1]]
#> [1] "Bob"
#>
#> [[2]]
#> [1] 3.141593
#>
#> [[3]]
#> [1] 10
#>
#> [[4]]
#> [[4]][[1]]
#> [1] "foo"
#>
#> [[4]][[2]]
#> [1] "bar"
#>
#> [[4]][[3]]
#> [1] "baz"
#>
#> [[4]][[4]]
#> [1] "qux"
In the example above, the correspondence between the list and its display is obvious for the first three items. The fourth element may be a little confusing at first. Remember that the fourth item of l
was another list. So what’s being shown in the output for the fourth item is the nested list.
An alternative way to display a list is using the str()
function (short for “structure”). str()
provides a more compact representation that also tells us what type of data each element is:
str(l)
#> List of 4
#> $ : chr "Bob"
#> $ : num 3.14
#> $ : num 10
#> $ :List of 4
#> ..$ : chr "foo"
#> ..$ : chr "bar"
#> ..$ : chr "baz"
#> ..$ : chr "qux"
5.2.1 Length and type of lists
Like vectors, lists have length:
But the type of a list is simply “list”, not the type of the items within the list. This makes sense because lists are allowed to be heterogeneous (i.e. hold data of different types).
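Both points can be checked on the list l from above (recreated here so the sketch is self-contained):

```r
l <- list("Bob", pi, 10, list("foo", "bar", "baz", "qux"))
length(l)    # the nested list counts as a single item
#> [1] 4
typeof(l)
#> [1] "list"
```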
5.2.2 Indexing lists
Lists have two indexing operators. Indexing a list with single brackets, like we did with vectors, returns a new list containing the element at index \(i\). Lists also support double bracket indexing (x[[i]]
) which returns the bare element at index \(i\) (i.e. the element without the enclosing list). This is a subtle but important point so make sure you understand the difference between these two forms of indexing.
5.2.2.1 Single bracket list indexing
First, let’s demonstrate single bracket indexing of the list l
we created above.
l[1] # single brackets, returns list('Bob')
#> [[1]]
#> [1] "Bob"
typeof(l[1]) # notice the list type
#> [1] "list"
When using single brackets, lists support indexing with ranges and numeric vectors:
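A small sketch of both forms, again using the list l (recreated here so the code runs on its own):

```r
l <- list("Bob", pi, 10, list("foo", "bar", "baz", "qux"))

str(l[c(1, 3)])   # a new list holding items 1 and 3
#> List of 2
#>  $ : chr "Bob"
#>  $ : num 10

length(l[3:4])    # a range works too; the result is a list of 2 items
#> [1] 2
```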
5.2.2.2 Double bracket list indexing
If double bracket indexing is used, the object at the given index in a list is returned:
l[[1]] # double brackets, return plain 'Bob'
#> [1] "Bob"
typeof(l[[1]]) # notice the 'character' type
#> [1] "character"
Double bracket indexing does not support multiple indices, but you can chain together double bracket operators to pull out the items of sublists. For example:
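With the nested list l (recreated below), chaining looks like this:

```r
l <- list("Bob", pi, 10, list("foo", "bar", "baz", "qux"))
typeof(l[[4]])    # the fourth item is itself a list
#> [1] "list"
l[[4]][[1]]       # first item of the nested list
#> [1] "foo"
```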
5.2.3 Naming list elements
The elements of a list can be given names when the list is created:
You can retrieve the names associated with a list using the names
function:
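A minimal sketch of creating a named list and retrieving its names; the list p and its contents are hypothetical, chosen for illustration:

```r
# hypothetical example list with three named elements
p <- list(first.name = "Alice", last.name = "Qux", age = 27)
names(p)
#> [1] "first.name" "last.name"  "age"
```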
If a list has named elements, you can retrieve the corresponding elements by indexing with the quoted name in either single or double brackets. Consistent with previous usage, single brackets return a list with the corresponding named element, whereas double brackets return the bare element.
For example, make sure you understand the difference in the output generated by these two indexing calls:
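The contrast can be sketched with a hypothetical named list p (an illustrative example, not data from this chapter):

```r
p <- list(first.name = "Alice", last.name = "Qux", age = 27)

p["first.name"]      # single brackets: a list of length 1
#> $first.name
#> [1] "Alice"
#>

p[["first.name"]]    # double brackets: the bare character vector
#> [1] "Alice"
```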
5.2.4 The $
operator
Retrieving named elements of lists (and data frames as we’ll see), turns out to be a pretty common task (especially when doing interactive data analysis) so R has a special operator to make this more convenient. This is the $
operator, which is used as illustrated below:
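A brief sketch, again with a hypothetical named list p:

```r
p <- list(first.name = "Alice", last.name = "Qux", age = 27)
p$first.name    # equivalent to p[["first.name"]]
#> [1] "Alice"
p$age
#> [1] 27
```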
5.2.5 Changing and adding list items
Combining indexing and assignment allows you to change items in a list:
suspect <- list(first.name = "unknown",
last.name = "unknown",
aka = "little")
suspect$first.name <- "Bo"
suspect$last.name <- "Peep"
suspect[[3]] <- "LITTLE"
str(suspect)
#> List of 3
#> $ first.name: chr "Bo"
#> $ last.name : chr "Peep"
#> $ aka : chr "LITTLE"
By combining assignment with a new name or an index past the end of the list you can add items to a list:
suspect$age <- 17 # add a new item named age
suspect[[5]] <- "shepardess" # create an unnamed item at position 5
Be careful when adding an item using indexing, because if you skip an index an intervening NULL value is created:
# there are only five items in the list, what happens if we
# add a new item at position seven?
suspect[[7]] <- "wanted for sheep stealing"
str(suspect)
#> List of 7
#> $ first.name: chr "Bo"
#> $ last.name : chr "Peep"
#> $ aka : chr "LITTLE"
#> $ age : num 17
#> $ : chr "shepardess"
#> $ : NULL
#> $ : chr "wanted for sheep stealing"
5.2.6 Combining lists
The c
(combine) function we introduced to create vectors can also be used to combine lists:
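A minimal sketch; the list names and contents are made up for illustration:

```r
list.a <- list("little", "bo", "peep")
list.b <- list("has lost", "her", "sheep")
list.c <- c(list.a, list.b)    # items of list.a followed by items of list.b
length(list.c)
#> [1] 6
list.c[[4]]
#> [1] "has lost"
```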
5.2.7 Converting lists to vectors
Sometimes it’s useful to convert a list to a vector. The unlist()
function takes care of this for us.
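A first, simple case is a list whose items all have the same type (the name ex1 and its values are assumed for illustration):

```r
# a homogeneous list
ex1 <- list(2, 4, 6, 8)
unlist(ex1)
#> [1] 2 4 6 8
```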
When you convert a list to a vector make sure you remember that vectors are homogeneous, so items within the new vector will be “coerced” to have the same type.
# a heterogeneous list
ex2 <- list(2, 4, 6, c("bob", "fred"), list(1 + 0i, 'foo'))
unlist(ex2)
#> [1] "2" "4" "6" "bob" "fred" "1+0i" "foo"
Note that unlist()
also unpacks nested vectors and lists as shown in the second example above.
5.3 Data frames
Along with vectors and lists, data frames are one of the core data structures when working in R. A data frame is essentially a list which represents a data table, where each column in the table has the same number of rows and every item in a column has to be of the same type. Unlike standard lists, the objects (columns) in a data frame must have names. We’ve seen data frames previously, for example when we loaded data sets using the read_csv
function.
5.3.1 Creating a data frame
While data frames will often be created by reading in a data set from a file, they can also be created directly in the console as illustrated below:
age <- c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20)
sex <- rep(c("M","F"), 5)
wt.in.kg <- c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)
df <- data.frame(age = age, sex = sex, wt = wt.in.kg)
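Here we created a data frame with three columns, each of length 10. Note that the outputs in this chapter show sex as a factor (<fct>), which is what data.frame() produced by default before R 4.0; in R 4.0 and later, string columns remain character vectors unless you pass stringsAsFactors = TRUE.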
5.3.2 Type and class for data frames
Data frames can be thought of as specialized lists, and in fact the type of a data frame is “list” as illustrated below:
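This can be checked directly on the data frame df created above (rebuilt here so the sketch runs on its own):

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))
typeof(df)
#> [1] "list"
```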
To distinguish a data frame from a generic list, we have to ask about its “class”.
class(df) # the class of our data frame
#> [1] "data.frame"
class(l) # compare to the class of our generic list
#> [1] "list"
The term “class” comes from a style/approach to programming called “object oriented programming”. We won’t go into explicit detail about how object oriented programming works in this class, though we will exploit many of the features of objects that have a particular class.
5.3.3 Length and dimension for data frames
Applying the length()
function to a data frame returns the number of columns. This is consistent with the fact that data frames are specialized lists:
To get the dimensions (number of rows and columns) of a data frame, we use the dim()
function. dim()
returns a vector, whose first value is the number of rows and whose second value is the number of columns:
We can get the number of rows and columns individually using the nrow()
and ncol()
functions:
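The functions described in this section can be tried on the example data frame df (rebuilt here so the code is self-contained):

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))
length(df)    # number of columns
#> [1] 3
dim(df)       # rows, then columns
#> [1] 10  3
nrow(df)
#> [1] 10
ncol(df)
#> [1] 3
```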
5.3.4 Indexing and accessing data frames
Data frames can be indexed by either column index, column name, row number, or a combination of row and column numbers.
5.3.4.1 Single bracket indexing of the columns of a data frame
The single bracket operator with a single numeric index returns a data frame with the corresponding column.
df[1] # get the first column (=age) of the data frame
#> # A tibble: 10 x 1
#> age
#> <dbl>
#> 1 30
#> 2 26
#> 3 21
#> 4 29
#> 5 25
#> 6 22
#> 7 28
#> 8 24
#> 9 23
#> 10 20
The single bracket operator with multiple numeric indices returns a data frame with the corresponding columns.
df[1:2] # first two columns
#> # A tibble: 10 x 2
#> age sex
#> <dbl> <fct>
#> 1 30 M
#> 2 26 F
#> 3 21 M
#> 4 29 F
#> 5 25 M
#> 6 22 F
#> 7 28 M
#> 8 24 F
#> 9 23 M
#> 10 20 F
df[c(1, 3)] # columns 1 (=age) and 3 (=wt)
#> # A tibble: 10 x 2
#> age wt
#> <dbl> <dbl>
#> 1 30 88
#> 2 26 76
#> 3 21 67
#> 4 29 66
#> 5 25 56
#> 6 22 74
#> 7 28 71
#> 8 24 60
#> 9 23 52
#> 10 20 72
Column names can be substituted for indices when using the single bracket operator:
df["age"]
#> # A tibble: 10 x 1
#> age
#> <dbl>
#> 1 30
#> 2 26
#> 3 21
#> 4 29
#> 5 25
#> 6 22
#> 7 28
#> 8 24
#> 9 23
#> 10 20
df[c("age", "wt")]
#> # A tibble: 10 x 2
#> age wt
#> <dbl> <dbl>
#> 1 30 88
#> 2 26 76
#> 3 21 67
#> 4 29 66
#> 5 25 56
#> 6 22 74
#> 7 28 71
#> 8 24 60
#> 9 23 52
#> 10 20 72
5.3.4.2 Single bracket indexing of the rows of a data frame
To get specific rows of a data frame, we use single bracket indexing with an additional comma following the index. For example, to get the first row of a data frame we would do:
This syntax extends to multiple rows:
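A sketch of both cases using the example data frame df (rebuilt here; printed output is omitted because its layout depends on how data frames are displayed in your session):

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))

first.row  <- df[1, ]      # the first row, all columns
first.rows <- df[1:3, ]    # the first three rows, all columns

dim(first.row)
#> [1] 1 3
dim(first.rows)
#> [1] 3 3
```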
5.3.4.3 Single bracket indexing of both the rows and columns of a data frame
Single bracket indexing of data frames extends naturally to retrieve both rows and columns simultaneously:
df[1, 2] # first row, second column
#> [1] M
#> Levels: F M
df[1:3, 2:3] # first three rows, columns 2 and 3
#> # A tibble: 3 x 2
#> sex wt
#> * <fct> <dbl>
#> 1 M 88
#> 2 F 76
#> 3 M 67
# you can even mix numerical indexing (rows) with named indexing of columns
df[5:10, c("age", "wt")]
#> # A tibble: 6 x 2
#> age wt
#> * <dbl> <dbl>
#> 1 25 56
#> 2 22 74
#> 3 28 71
#> 4 24 60
#> 5 23 52
#> 6 20 72
5.3.4.4 Double bracket and $
indexing of data frames
Whereas single bracket indexing of the columns of a data frame (e.g. df["age"]) returns a new data frame, double bracket indexing and indexing using the $
operator return vectors.
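The difference can be seen on the example data frame df (rebuilt here so the sketch is self-contained):

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))

is.data.frame(df["age"])    # single brackets keep the data frame wrapper
#> [1] TRUE
df[["age"]]                 # double brackets return the column as a vector
#> [1] 30 26 21 29 25 22 28 24 23 20
df$wt                       # so does the $ operator
#> [1] 88 76 67 66 56 74 71 60 52 72
```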
5.3.5 Logical indexing of data frames
Logical indexing using boolean values works on data frames in much the same way it works on vectors. Typically, logical indexing of a data frame is used to filter the rows of a data frame.
For example, to get all the subjects in our example data frame who are older than 25 we could do:
# NOTE: the comma after 25 is important to ensure we're indexing rows!
df[df$age > 25, ]
#> # A tibble: 4 x 3
#> age sex wt
#> * <dbl> <fct> <dbl>
#> 1 30 M 88
#> 2 26 F 76
#> 3 29 F 66
#> 4 28 M 71
Similarly, to get all the individuals whose weight is between 60 and 70 kg we could do:
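One way to write this filter is sketched below. It assumes that “between” includes both endpoints; use < and > instead if the bounds should be excluded.

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))

# rows where wt is at least 60 and at most 70 (inclusive bounds assumed)
mid.wt <- df[df$wt >= 60 & df$wt <= 70, ]
mid.wt$wt
#> [1] 67 66 60
```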
5.3.6 Adding columns to a data frame
Adding columns to a data frame is similar to adding items to a list. The easiest way to do so is using named indexing. For example, to add a new column to our data frame that gives the individuals’ ages in number of days, we could do:
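A minimal sketch; the column name age.in.days and the conversion factor of 365 days per year (which ignores leap years) are assumptions made for illustration:

```r
df <- data.frame(age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
                 sex = rep(c("M", "F"), 5),
                 wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72))

# add a new column by assigning to a name that does not exist yet
df$age.in.days <- df$age * 365
ncol(df)
#> [1] 4
head(df$age.in.days, 3)
#> [1] 10950  9490  7665
```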