Chapter 5 More R Basics: Data structures

In computer science, the term “data structure” refers to the ways that data are stored, retrieved, and organized in a computer’s memory. Common examples include lists, hash tables (also called dictionaries), sets, queues, and trees. Different types of data structures are used to support different types of operations on data.

In R, the three basic data structures are vectors, lists, and data frames.

5.1 Vectors

Vectors are the core data structure in R. A vector stores an ordered list of items, all of the same type (i.e. the data in a vector are “homogeneous” with respect to their type).

The simplest way to create a vector at the interactive prompt is to use the c() function, which is shorthand for “combine” or “concatenate”.

Vectors in R always have a type (accessed with the typeof() function) and a length (accessed with the length() function).

Vectors don’t have to be numerical; logical and character vectors work just as well.

You can also use c() to concatenate two or more vectors together.
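As a minimal sketch of the points above (the variable names and values are hypothetical, chosen only for illustration), we can create vectors with c(), inspect their type and length, and concatenate them:

```r
# Hypothetical values, chosen only for illustration
x <- c(2, 4, 6, 8, 10)             # numeric vector
typeof(x)                          # "double"
length(x)                          # 5

b <- c(TRUE, FALSE, TRUE)          # logical vector
s <- c("alpha", "beta", "gamma")   # character vector
typeof(b)                          # "logical"
typeof(s)                          # "character"

y <- c(x, c(12, 14))               # concatenate two vectors
length(y)                          # 7
```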

5.1.2 Vector recycling

When two vectors in an arithmetic operation are not of the same length, R “recycles” the elements of the shorter vector to make the lengths conform.

In the example above, z was treated as if it were the vector (1, 4, 7, 11, 1).
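A minimal sketch consistent with that description, assuming z is the length-four vector (1, 4, 7, 11) and x is a hypothetical vector of length five:

```r
x <- c(2, 4, 6, 8, 10)   # length 5 (assumed values)
z <- c(1, 4, 7, 11)      # length 4

# z is recycled to (1, 4, 7, 11, 1) before the addition.
# R also warns that the longer length is not a multiple of the shorter.
x + z                    # 3 8 13 19 11
```

The warning is worth heeding: recycling is convenient when intended, but it can hide a length mismatch that is really a bug.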

5.1.3 Simple statistical functions for numeric vectors

Now that we’ve introduced vectors as the simplest data structure for holding collections of numerical values, we can introduce a few of the most common statistical functions that operate on such vectors.

First let’s create a vector to hold our sample data of interest. Here I’ve taken a random sample of the lengths of the last names of students enrolled in Bio 723 during Spring 2018.

Some common statistics of interest include minimum, maximum, mean, median, variance, and standard deviation:

The summary() function applied to a vector of doubles produces a useful table of some of these key statistics:
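The following sketch shows these functions in use. The data are hypothetical stand-ins, not the actual class sample, and the variable name name_lengths is an assumption made for illustration:

```r
# Hypothetical name lengths (NOT the real course data)
name_lengths <- c(7, 7, 6, 2, 9, 9, 7, 4, 10, 5)

min(name_lengths)      # 2
max(name_lengths)      # 10
mean(name_lengths)     # 6.6
median(name_lengths)   # 7
var(name_lengths)      # sample variance
sd(name_lengths)       # sample standard deviation, sqrt of the variance

summary(name_lengths)  # minimum, quartiles, median, mean, and maximum
```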

5.1.4 Indexing Vectors

Accessing the elements of a vector is called “indexing”. Indexing is the process of specifying the numerical positions (indices) of the elements that you want to access in the vector.

For a vector of length \(n\), we can access the elements by the indices \(1 \ldots n\). We say that R vectors (and other data structures like lists) are “one-indexed”. Many other programming languages, such as Python, C, and Java, use zero-indexing where the elements of a data structure are accessed by the indices \(0 \ldots n-1\). Indexing errors are a common source of bugs.

Indexing a vector is done by specifying the index in square brackets as shown below:

Negative indices are used to exclude particular elements. x[-1] returns all elements of x except the first.

You can get multiple elements of a vector by indexing by another vector. In the example below, x[c(3,5)] returns the third and fifth elements of x.
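The three forms of indexing described above can be sketched as follows, assuming a hypothetical vector x of length five:

```r
x <- c(2, 4, 6, 8, 10)   # hypothetical values

x[1]         # first element: 2
x[-1]        # everything except the first element: 4 6 8 10
x[c(3, 5)]   # third and fifth elements: 6 10
```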

5.1.5 Comparison operators applied to vectors

When comparison operators such as “greater than” (>), “less than or equal to” (<=), and equality (==) are applied to numeric vectors, they return logical vectors:
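As a small illustration (the vector x is hypothetical), each comparison is applied element by element and yields one TRUE or FALSE per element:

```r
x <- c(2, 4, 6, 8, 10)   # hypothetical values

x > 5    # FALSE FALSE  TRUE  TRUE  TRUE
x <= 4   #  TRUE  TRUE FALSE FALSE FALSE
x == 6   # FALSE FALSE  TRUE FALSE FALSE
```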

Here’s a fancier example:

5.1.6 Combining Indexing and Comparison of Vectors

A very powerful feature of R is the ability to combine the comparison operators (which return TRUE or FALSE values) with indexing. This facilitates data filtering and subsetting.

Here’s an example:

In the first example we retrieved all the elements of x that are larger than 5 (read as “x where x is greater than 5”). Notice how we got back all the elements where the statement in the brackets was TRUE.

You can string together comparisons for more complex filtering.

In the second example we retrieved those elements of x that were smaller than four or greater than six. Combining indexing and comparison is a powerful concept which we’ll use repeatedly in this course.
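The two filtering operations described above can be sketched like this, assuming a hypothetical vector x. The | operator is the element-wise logical “or”:

```r
x <- c(2, 4, 6, 8, 10)   # hypothetical values

# "x where x is greater than 5"
x[x > 5]           # 6 8 10

# elements smaller than four OR greater than six
x[x < 4 | x > 6]   # 2 8 10
```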

5.1.7 Vector manipulation

You can combine indexing with assignment to change the elements of a vector:

You can also use indexing vectors to change multiple values at once:

Using logical vectors to manipulate the elements of a vector also works:
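A minimal sketch of all three kinds of assignment, using a hypothetical vector x and tracking its contents after each step:

```r
x <- c(2, 4, 6, 8, 10)    # hypothetical values

x[2] <- 40                # single index:       2 40  6  8 10
x[c(1, 3)] <- c(20, 60)   # indexing vector:   20 40 60  8 10
x[x > 30] <- 0            # logical vector:    20  0  0  8 10
```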

5.1.9 Additional functions for working with vectors

The function unique() returns the unique items in a vector:

rev() returns the items in reverse order (without changing the input vector):

There are a number of useful functions related to sorting. Plain sort() returns a new vector with the items in sorted order:

The related function order() gives the indices which would rearrange the items into sorted order:

order() can be useful when you want to sort one vector by the values of another:

any() and all() each return a single Boolean value indicating whether any or all of the elements of a logical vector (typically the result of a comparison) are TRUE:

which() returns the indices of the vector for which the input is true:
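The functions in this section can be sketched together on one small example. The vectors y and tags are hypothetical names introduced only for illustration:

```r
y <- c(9, 2, 7, 2, 5)                 # hypothetical values
tags <- c("a", "b", "c", "d", "e")    # a second vector, parallel to y

unique(y)        # 9 2 7 5
rev(y)           # 5 2 7 2 9   (y itself is unchanged)
sort(y)          # 2 2 5 7 9
order(y)         # 2 4 5 3 1   (indices that would sort y)
tags[order(y)]   # "b" "d" "e" "c" "a"   (tags sorted by the values of y)

any(y > 8)       # TRUE
all(y > 8)       # FALSE
which(y == 2)    # 2 4
```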

5.2 Lists

R lists are like vectors, but unlike a vector where all the elements are of the same type, the elements of a list can have arbitrary types (even other lists). Lists are a powerful data structure for organizing information, because there are few constraints on the shape or types of the data included in a list.

Lists are easy to create:

Note that lists can contain arbitrary data. Lists can even contain other lists:

Lists are displayed with a particular format, distinct from vectors:

In the example above, the correspondence between the list and its display is obvious for the first three items. The fourth element may be a little confusing at first. Remember that the fourth item of l was another list. So what’s being shown in the output for the fourth item is the nested list.

An alternative way to display a list is using the str() function (short for “structure”). str() provides a more compact representation that also tells us what type of data each element is:
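To make this concrete, here is a sketch with a hypothetical four-element list l whose fourth item is itself a list. The contents are made up for illustration and are not necessarily those of the original example:

```r
# Hypothetical contents: a string, a number, a logical, and a nested list
l <- list("alice", 3.5, TRUE, list("x", 42))

print(l)   # default display: one [[i]] block per element, nested for item 4
str(l)     # compact display that also reports the type of each element
```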

5.2.1 Length and type of lists

Like vectors, lists have length:

But the type of a list is simply “list”, not the type of the items within the list. This makes sense because lists are allowed to be heterogeneous (i.e. hold data of different types).
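A short sketch, again assuming a hypothetical list l with mixed contents:

```r
l <- list("alice", 3.5, TRUE, list("x", 42))   # hypothetical contents

length(l)        # 4 (the nested list counts as a single element)
typeof(l)        # "list"
typeof(l[[2]])   # "double" (the type of an individual item)
```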

5.2.2 Indexing lists

Lists have two indexing operators. Indexing a list with single brackets, like we did with vectors, returns a new list containing the element at index \(i\). Lists also support double bracket indexing (x[[i]]) which returns the bare element at index \(i\) (i.e. the element without the enclosing list). This is a subtle but important point so make sure you understand the difference between these two forms of indexing.

5.2.2.1 Single bracket list indexing

First, let’s demonstrate single bracket indexing of the list l we created above.

When using single brackets, lists support indexing with ranges and numeric vectors:

5.2.2.2 Double bracket list indexing

If double bracket indexing is used, the object at the given index in a list is returned:

Double bracket indexing does not support multiple indices, but you can chain together double bracket operators to pull out the items of sublists. For example:
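The difference between the two operators can be sketched as follows, assuming a hypothetical list l whose fourth element is a nested list:

```r
l <- list("alice", 3.5, TRUE, list("x", 42))   # hypothetical contents

# Single brackets always return a list
l[1]              # a list of length 1 containing "alice"
typeof(l[1])      # "list"
l[2:3]            # a list holding the second and third elements
l[c(1, 4)]        # a list holding the first and fourth elements

# Double brackets return the bare element
l[[1]]            # "alice"
typeof(l[[1]])    # "character"

# Chained double brackets reach into the nested list
l[[4]][[2]]       # 42
```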

5.2.3 Naming list elements

The elements of a list can be given names when the list is created:

You can retrieve the names associated with a list using the names function:

If a list has named elements, you can retrieve the corresponding elements by indexing with the quoted name in either single or double brackets. Consistent with previous usage, single brackets return a list with the corresponding named element, whereas double brackets return the bare element.

For example, make sure you understand the difference in the output generated by these two indexing calls:
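A sketch of named lists and name-based indexing. The list p and its element names are hypothetical:

```r
# Hypothetical named list
p <- list(name = "alice", age = 27, languages = c("R", "Python"))

names(p)       # "name" "age" "languages"

p["age"]       # single brackets: a list with one element named "age"
p[["age"]]     # double brackets: the bare value 27
```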

5.2.4 The $ operator

Retrieving named elements of lists (and data frames, as we’ll see) turns out to be a common task, especially when doing interactive data analysis, so R has a special operator to make this more convenient. This is the $ operator, which is used as illustrated below:
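For instance, with a hypothetical named list p, the $ operator returns the same bare element as double bracket indexing with the quoted name:

```r
# Hypothetical named list
p <- list(name = "alice", age = 27, languages = c("R", "Python"))

p$age            # 27, equivalent to p[["age"]]
p$languages[1]   # "R" (the result can be indexed further)
```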

5.2.7 Converting lists to vectors

Sometimes it’s useful to convert a list to a vector. The unlist() function takes care of this for us.

When you convert a list to a vector make sure you remember that vectors are homogeneous, so items within the new vector will be “coerced” to have the same type.

Note that unlist() also unpacks nested vectors and lists as shown in the second example above.
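Both behaviors, coercion and flattening, can be sketched with two hypothetical lists:

```r
# Mixed types are coerced to a common type (here, character)
unlist(list(1, "a", TRUE))             # "1" "a" "TRUE"

# Nested vectors and lists are flattened into a single vector
unlist(list(1, c(2, 3), list(4, 5)))   # 1 2 3 4 5
```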

5.3 Data frames

Along with vectors and lists, data frames are one of the core data structures when working in R. A data frame is essentially a list which represents a data table, where each column in the table has the same number of rows and every item in a column has to be of the same type. Unlike standard lists, the objects (columns) in a data frame must have names. We’ve seen data frames previously, for example when we loaded data sets using the read_csv function.

5.3.1 Creating a data frame

While data frames will often be created by reading in a data set from a file, they can also be created directly in the console as illustrated below:

Here we created a data frame with three columns, each of length 10.
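A sketch of such a data frame, built with the base R data.frame() function. The column names (age, sex, wt) and all of the values are hypothetical, chosen to match the description of three columns of length 10:

```r
# Hypothetical data: three columns, ten rows
df <- data.frame(
  age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),   # years
  sex = rep(c("M", "F"), 5),
  wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)    # kg
)

df   # prints the table with one row per individual
```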

5.3.2 Type and class for data frames

Data frames can be thought of as specialized lists, and in fact the type of a data frame is “list” as illustrated below:

To distinguish a data frame from a generic list, we have to ask about its “class”.
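A brief sketch of the distinction, using a small hypothetical data frame:

```r
# Hypothetical data frame
df <- data.frame(age = c(30, 26, 21), wt = c(88, 76, 67))

typeof(df)   # "list"
class(df)    # "data.frame"
```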

The term “class” comes from a style/approach to programming called “object oriented programming”. We won’t go into explicit detail about how object oriented programming works in this class, though we will exploit many of the features of objects that have a particular class.

5.3.3 Length and dimension for data frames

Applying the length() function to a data frame returns the number of columns. This is consistent with the fact that data frames are specialized lists:

To get the dimensions (number of rows and columns) of a data frame, we use the dim() function. dim() returns a vector, whose first value is the number of rows and whose second value is the number of columns:

We can get the number of rows and columns individually using the nrow() and ncol() functions:
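These functions can be sketched on a hypothetical data frame with ten rows and three columns:

```r
# Hypothetical data: three columns, ten rows
df <- data.frame(
  age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
  sex = rep(c("M", "F"), 5),
  wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)
)

length(df)   # 3 (number of columns)
dim(df)      # 10 3 (rows, then columns)
nrow(df)     # 10
ncol(df)     # 3
```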

5.3.4 Indexing and accessing data frames

Data frames can be indexed by either column index, column name, row number, or a combination of row and column numbers.

5.3.4.2 Single bracket indexing of the rows of a data frame

To get specific rows of a data frame, we use single bracket indexing with an additional comma following the index. For example, to get the first row of a data frame we would do:

This syntax extends to multiple rows:
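Row indexing can be sketched as follows on a hypothetical data frame. Note the trailing comma, which tells R that the index refers to rows and that all columns should be kept:

```r
# Hypothetical data: three columns, ten rows
df <- data.frame(
  age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
  sex = rep(c("M", "F"), 5),
  wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)
)

df[1, ]            # first row
df[1:3, ]          # rows one through three
df[c(1, 3, 5), ]   # rows one, three, and five
```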

5.3.4.4 Double bracket and $ indexing of data frames

Whereas single bracket indexing of a data frame typically returns a new data frame, double bracket indexing and indexing with the $ operator return vectors.
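A sketch of the contrast, assuming a small hypothetical data frame with a column named age:

```r
# Hypothetical data frame
df <- data.frame(age = c(30, 26, 21), wt = c(88, 76, 67))

df["age"]      # single brackets: a one-column data frame
df[["age"]]    # double brackets: the numeric vector 30 26 21
df$age         # $ operator: the same numeric vector
```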

5.3.5 Logical indexing of data frames

Logical indexing using boolean values works on data frames in much the same way it works on vectors. Typically, logical indexing of a data frame is used to filter the rows of a data frame.

For example, to get all the subjects in our example data frame who are older than 25 we could do:

Similarly, to get all the individuals whose weight is between 60 and 70 kg we could do:
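Both filters can be sketched on a hypothetical data frame. The column names age and wt are assumptions, and “between” is taken here to include the endpoints:

```r
# Hypothetical data: three columns, ten rows
df <- data.frame(
  age = c(30, 26, 21, 29, 25, 22, 28, 24, 23, 20),
  sex = rep(c("M", "F"), 5),
  wt  = c(88, 76, 67, 66, 56, 74, 71, 60, 52, 72)
)

# Rows where age is greater than 25
older <- df[df$age > 25, ]

# Rows where weight is between 60 and 70 kg (inclusive);
# & is the element-wise logical "and"
mid_wt <- df[df$wt >= 60 & df$wt <= 70, ]

nrow(older)    # 4
nrow(mid_wt)   # 3
```

The logical vector goes before the comma because we are selecting rows; leaving the column position empty keeps every column.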

5.3.6 Adding columns to a data frame

Adding columns to a data frame is similar to adding items to a list. The easiest way to do so is using named indexing. For example, to add a new column to our data frame that gives the individuals’ ages in days, we could do:
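A minimal sketch, assuming a hypothetical data frame with an age column in years and approximating a year as 365 days. The new column name age_days is also an assumption:

```r
# Hypothetical data frame
df <- data.frame(age = c(30, 26, 21), wt = c(88, 76, 67))

# Named indexing with a new name creates the column
df[["age_days"]] <- df$age * 365   # assumes 365 days per year

names(df)     # "age" "wt" "age_days"
df$age_days   # 10950 9490 7665
```

The same assignment can be written with the $ operator, as df$age_days <- df$age * 365.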