Data Exploration: The Titanic

Background
Dataset
Libraries
Read data
What’s in the data?
Categorizing passengers
How does class affect survival?
Passenger ages
How does age relate to survival?
Conclusion

Background

On April 10, 1912 the RMS Titanic left Southhampton, England headed for New York. Aboard were 2,435 passengers and 892 crew members. Five days later, about 20 minutes before midnight, the Titanic hit an iceberg in the frigid waters about 375 miles south of New Foundland. Within approximately two and a half hours the ship had split apart and sunk, leaving just over 700 survivors.

The Titanic

Dataset

The titanic_data.csv file (available on the course git repository) containers information on 1309 passengers from aboard the Titanic (CSV stands for Comma-Separated-Values, a simple plain text format for storing spreadhsheet data). Variables in this data set include gender, age, ticketed class, the passenger’s destitation, whether they survived, etc. We’ll use this data set to explore some of the demographics of the passengers who were aboard the ship, and how their relationship to whether a passenger survived or not. For a detailed description of this data set, see this link.

We’ll use this data to explore whether the saying “Women and children first!” applied on the Titanic.

Libraries

First we’ll load some R libraries (packages) that contain useful functions that will make our analyses quicker and more efficient. We’ll discuss the functions that these libraries provide, and how to use libraries in general, in greater detail in a future lecture.

> library(ggplot2)
> library(readr)
> library(dplyr)
> library(tidyr)
> library(forcats)

Read data

We start by reading in the data from the CSV file.

> titanic <- read_csv("~/Downloads/titanic_data.csv")

Parsed with column specification:
cols(
  pclass = col_integer(),
  survived = col_integer(),
  name = col_character(),
  sex = col_character(),
  age = col_double(),
  sibsp = col_integer(),
  parch = col_integer(),
  ticket = col_character(),
  fare = col_double(),
  cabin = col_character(),
  embarked = col_character(),
  boat = col_character(),
  body = col_integer(),
  home.dest = col_character()
)

The function read_csv does exactly what it advertises – reads a data set from a CSV file and returns it as an object we can compute on. In this case we assigned the variable name titanic to our data set. Simple enough!

What’s in the data?

The function read_csv returns a table, where the columns represent the variables of interest (e.g. sex, age, etc) and the rows represent individuals or observations. Let’s take a look:

> titanic

# A tibble: 1,309 x 14
   pclass survived                                            name    sex
    <int>    <int>                                           <chr>  <chr>
 1      1        1                   Allen, Miss. Elisabeth Walton female
 2      1        1                  Allison, Master. Hudson Trevor   male
 3      1        0                    Allison, Miss. Helen Loraine female
 4      1        0            Allison, Mr. Hudson Joshua Creighton   male
 5      1        0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female
 6      1        1                             Anderson, Mr. Harry   male
 7      1        1               Andrews, Miss. Kornelia Theodosia female
 8      1        0                          Andrews, Mr. Thomas Jr   male
 9      1        1   Appleton, Mrs. Edward Dale (Charlotte Lamson) female
10      1        0                         Artagaveytia, Mr. Ramon   male
# ... with 1,299 more rows, and 10 more variables: age <dbl>, sibsp <int>,
#   parch <int>, ticket <chr>, fare <dbl>, cabin <chr>, embarked <chr>,
#   boat <chr>, body <int>, home.dest <chr>

If we simply wanted the dimensions of the data we could do:

> dim(titanic)

[1] 1309   14

whereas, if we wanted to get a list of the column names in the data we could do:

> names(titanic)

 [1] "pclass"    "survived"  "name"      "sex"       "age"      
 [6] "sibsp"     "parch"     "ticket"    "fare"      "cabin"    
[11] "embarked"  "boat"      "body"      "home.dest"

Simple data wrangling

Two variables of interest to us are pclass (“passenger class”) and survived. These are categorical variables encoded as numbers. Before exploring the data we’re going to create derived “factor” variables from these, which will make our analyses more convenient. I’m also going to recode the “survived” information as the classes “died” and “lived”.

> titanic <- mutate(titanic, 
+                   passenger.class = fct_recode(as.factor(pclass),
+                                                "1st" = "1", "2nd" = "2", "3rd" = "3"),
+                   survival = fct_recode(as.factor(survived), 
+                                         "died" = "0", "lived" = "1"))

Having added to new variables to our data set, the dimensions and column names have changed:

> dim(titanic)

[1] 1309   16

and

> names(titanic)

 [1] "pclass"          "survived"        "name"           
 [4] "sex"             "age"             "sibsp"          
 [7] "parch"           "ticket"          "fare"           
[10] "cabin"           "embarked"        "boat"           
[13] "body"            "home.dest"       "passenger.class"
[16] "survival"

Note that there are now 16 columns in our data, the original 14 plus our two new derived variables passenger.class and survival.

Categorizing passengers

Let’s start by exploring various aspects of the 1309 passengers in our data set.

First, let’s break them down by class:

> count(titanic, passenger.class)

# A tibble: 3 x 2
  passenger.class     n
           <fctr> <int>
1             1st   323
2             2nd   277
3             3rd   709

We could also represent this data as a bar graph (though a simple table is more efficient in this case):

> ggplot(titanic) + 
+   geom_bar(aes(x = passenger.class))

Both our table and bar graph tell us that the largest number of passengers (in the available data) were travelling on third-class tickets. The numbers of first- and second-class ticketed passengers are fairly similar.

How many male and female passengers?

Let’s look at the gender breakdown:

> count(titanic, sex)

# A tibble: 2 x 2
     sex     n
   <chr> <int>
1 female   466
2   male   843

> ggplot(titanic) + 
+   geom_bar(aes(x = sex, fill = sex))

There are almost twice as many men in our data set as women.

How many people survived?

Now let’s look at survival information:

> count(titanic, survival)

# A tibble: 2 x 2
  survival     n
    <fctr> <int>
1     died   809
2    lived   500

> ggplot(titanic) + 
+   geom_bar(aes(x = survival))

Women first?

We can take our simple explorations a step further, by considering the joint counts of passengers with respect to multiple variables. For example, let’s look at the relationship between gender and survival:

> count(titanic, sex, survival)

# A tibble: 4 x 3
     sex survival     n
   <chr>   <fctr> <int>
1 female     died   127
2 female    lived   339
3   male     died   682
4   male    lived   161

Contingency tables

When looking at counts of multiple variables simultaneously, a more traditional representation than the one above is a “contingency table”. The cells in a contingency table give the counts of individuals with respect to combinations of variables (e.g. # of women who survived, # of women who died, etc). Here’s the same data on sex and survival reprsented as a contingency table:

> count(titanic, sex, survival) %>%
+   spread(survival, n)

# A tibble: 2 x 3
     sex  died lived
*  <chr> <int> <int>
1 female   127   339
2   male   682   161

In the code above the symbol %>% can be read as “pipe” or “send”. The pipe operator inserts the object before the pipe as the first argument to the function after the pipe. Here we’re piping the output of the count function as the input into the spread function. We’ll see in later lectures that piping objects makes for very powerful workflows when we do more sophisticated analyses.

We can also create a bar plot to represent the contingency table:

> ggplot(titanic) +
+   geom_bar(aes(sex, fill = survival))

A bar plot using proportions rather than counts

Sometimes it’s useful to look at proportions rather than absolute numbers. Here’s a figure that allows us to visually assess the different proportions of men and women passengers who survived:

> ggplot(titanic) +
+   geom_bar(aes(sex, fill = survival), position = "fill")

Mosaic plots

A slightly more sophisticated version of a bar plot is called a “mosaic plot”. A mosaic plot is similar to a proportional bar plot but the width of the bars also varies, indicating the relative numbers of observations in each class. To create a mosaic plot we need to import the geom_mosaic function from a library called ggmosaic.

> library(ggmosaic)
> 
> ggplot(titanic) + 
+   geom_mosaic(aes(x = product(sex), fill = survival)) + 
+   labs(x = "Sex", y = "Proportion surviving")

As you can see, the mosaic plot emphasizes both that there were more men than women on the Titanic, as well as the fact that a greater fraction of women survived. This seems like strong evidence for the first part of the phrase “Women and children first”.

How does class affect survival?

A passenger’s ticket class (a proxy for socioeconomic status) is another interesting variable to consider in conjunction with sex and survival. Here is a bar plot that takes into account all three of these variables, by representing passenger class as a “facet” variable – we create different subplots for each class grouping.

> ggplot(titanic) +
+   geom_bar(aes(sex, fill = survival), position = "fill") + 
+   facet_wrap(~ passenger.class) + 
+   labs(y = "Portion Died/Lived", title = "Titanic Survival by Sex and Class")

Passenger ages

Let’s create a plot to better understand the distribution of ages of passengers on the Titanic. A histogram is a common way to visualize the distribution of a continuous variable. Creating a histogram is a simple modification of our earlier examples where we created bar plots.

> ggplot(titanic) +
+   geom_histogram(aes(x = age), bins = 35)

The histogram provides a quick visual representation of the frequency of different age groups. It appears that the the most common (modal) value of age is a little over 20 years old. We can explicitly calculate the mean and median age as follows:

> mean(titanic$age, na.rm = TRUE)

[1] 29.88113

> median(titanic$age, na.rm = TRUE)

[1] 28

Note that we have to explicitly tell the mean and median functions to drop any missing (NA) values. Alternately, we could have use pipes to calculate the mean and median as so:

> titanic %>%
+   filter(!is.na(age)) %>%
+   summarize(mean(age), median(age))

# A tibble: 1 x 2
  `mean(age)` `median(age)`
        <dbl>         <dbl>
1    29.88113            28

How does age relate to survival?

Now we turn to the question of how age relates to the probability of survival. Age is a continuous variable, while survival is a binary variable.

Strip charts

We’ll start by creating “strip charts” that plot age on the y-axis, and survival on the x-axis. First we filter out individuals for which we have no age data, then we use a pipe to send the filtered data to the ggplot function:

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot() + 
+     geom_jitter(aes(survival, age), width = 0.1, alpha = 0.5)

Recall that sex was an important variable in our earlier assessment of the data. Let’s create a second strip plot that takes into account both age and sex.

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot() + 
+   geom_jitter(aes(survival, age, color = sex), width = 0.1, alpha = 0.5) +
+   facet_wrap(~sex)

Box plots

Another way to look at the data is to use a summary figure called a “box plot”. First the simple boxplot, relating age and survival:

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot() + 
+   geom_boxplot(aes(survival, age))

A box plot depicts information about the median value (thick central line), the first and third quartiles (lower 25% value, upper 75% value) and outliers. From this simple boxplot, it doesn’t look like there is a great difference in the age distribution of passengers who lived vs died.

Now we look at box plott for age-by-survival, conditioned on sex.

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot() + 
+   geom_boxplot(aes(survival, age, fill = sex)) +
+   facet_wrap(~sex)

Here we start to see a more interesting pattern. When comparing male passengers, the median age of survivors appears to be a little younger than those who died. However, when comparing female passengers the opposite pattern seems to be at play.

Fitting a model relating survival to age and sex

We can get more sophisticated by fitting a formal model to our data. Here we use a technique called logistic regression to model the probability of survival as a function of age, broken down for female and male passengers separately.

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot(aes(x = age, y = survived, color = sex)) + 
+   geom_jitter(height = 0.05, alpha = 0.35) + 
+   geom_smooth(method="glm", method.args = list(family="binomial"))  + 
+   facet_wrap(~sex) + 
+   labs(x = "Age", y = "Probability of survival")

The logistic regression model fit here, seems to support the trend we saw in the box plots. When breaking the data down by sex we find that there is a decreasing probability of survival as age increases for male passengers, but the opposite trend for female passengers. This is pretty interesting – “children first” seemed to hold for men but not for women.

Extending the logistic regression model to consider class, age, and sex

Now let’s add ticketing class into the mix again. Here we treat both sex and passenger class as faceting variables, fitting six different logistic regression models (all combinations of the three classes and two sexes).

> titanic %>%
+   filter(!is.na(age)) %>%
+   ggplot(aes(x = age, y = survived, color = sex)) + 
+   geom_jitter(height = 0.05, alpha = 0.35) + 
+   geom_smooth(method="glm", method.args = list(family="binomial"))  + 
+   facet_grid(passenger.class~sex) + 
+   labs(x = "Age", y = "Probability of survival")

By conditioning on both sex and ticketing class, we gain even more insights into the data and our assesments of relationships between variables can change.

For female passengers, ticketed class appears to have a strong influence on the relationship between age and survival. We see that almost all of the female first-class passengers survived, and the relationship between age and survival is thus flat. The second class female passengers are fairly similar, though with a slight decrease in the probability of survival among the oldest passengers. It’s not until we get to the third class passengers that we see a strong indication of the “children first” relationship playing out in terms of survival.

For the male passengers, the “children first” model seems to fit across classes, but note the generally lower probability of survival across ages when comparing first and third class passengers.

Conclusion

We’ve only scratched the surface of the possible explorations we could undertake of this intersting data set. However, this introductory “data story” illustrates many of the tools and ideas we’ll encounter in this class. You’ll be undertaking even more complex data explorations on your own by the end of the semester!