Logistic regression

Logistic regression is used when the dependent variable is discrete (often binary). The explanatory variables may be either continuous or discrete.

We model the binary response variable, \(Y\), as a function of the predictor variables \(X_1, X_2, \ldots, X_p\):

\[ P(Y = 1|X_1,\ldots,X_p) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]

So we’re modeling the probability that \(Y = 1\) as a function of a linear combination of the predictor variables.

For logistic regression, \(f\) is the logistic function: \[ f(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}} \]

Therefore, the logistic regression with a single predictor, \(X\), is given by: \[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Note that \(\beta_0\) here is akin to the intercept in our standard linear regression.
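
To build some intuition for the shape of \(f\) before using the app below, it helps to plot it. Here’s a minimal sketch in base R, using plogis(), base R’s implementation of the logistic function:

curve(plogis(x), from = -6, to = 6,
      xlab = "z", ylab = "f(z)", main = "The logistic function")
# dotted guide lines: f is bounded by 0 and 1, and f(0) = 0.5
abline(h = c(0, 0.5, 1), lty = "dotted")

Because \(f(z)\) is bounded between 0 and 1, it is a sensible choice for modeling a probability.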

In-class Assignment #1

To help you develop an intuition for the logistic regression equation, I’ve developed a small web app that allows you to explore how the shape of the regression curve responds to changes in the regression coefficients \(\beta_0\) and \(\beta_1\).

Open the Exploring Logistic Regression app in another browser window and then answer the following questions.

  1. Set both coefficients, B0 and B1, to zero. Given these coefficients, what is \(P(Y = 1)\) when X = 0 (i.e. the intercept)? How does this probability change as X increases? [1 pt]

  2. Set the coefficient B1 to zero. Use the slider to change the value of B0. How does increasing/decreasing the value of B0 affect \(P(Y = 1)\) at the Y-intercept? [1 pt]

  3. Set the coefficient B0 to zero. Use the slider to change the value of B1. How does increasing/decreasing the value of B1 affect \(P(Y = 1)\)? [1 pt]

  4. Explore a range of values for both B0 and B1 simultaneously. What range of coefficients gives curves with decreasing \(P(Y=1)\) as X increases? What range of coefficients generate curves with increasing \(P(Y=1)\) as X increases? [1 pt]

Titanic data set

titanic.csv contains information about passengers on the Titanic. Variables in this data set include information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not they survived the sinking of the ship (0 = died, 1 = survived).

library(tidyverse)
library(broom)
library(cowplot)  # for plot_grid fxn
library(ggthemes)  # install this package for additional ggplot themes
titanic <- read_csv("https://github.com/pmagwene/Bio723/raw/master/datasets/titanic.csv")
names(titanic)
 [1] "row.names" "pclass"    "survived"  "name"      "age"      
 [6] "embarked"  "home.dest" "room"      "ticket"    "boat"     
[11] "sex"      

Subsetting the data

We’ve all heard the phrase, “Women and children first”, so we might expect that the probability that a passenger survived the sinking of the Titanic is related to their sex and/or age. Let’s create separate data subsets for the male and female passengers.

male <- filter(titanic, sex == "male")
dim(male)
[1] 850  11
female <- filter(titanic, sex == "female")
dim(female)
[1] 463  11

Visualizing survival as a function of age

Let’s create visualizations of survival as a function of age for the male and female passengers.

fcolor = "lightcoral"
mcolor = "lightsteelblue"

female.plot <- ggplot(female, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.05, color = fcolor) +
  labs(title = "Female Passengers")

male.plot <- ggplot(male, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.05, color = mcolor) + 
  labs(title = "Male Passengers")

plot_grid(female.plot, male.plot)

Fitting the logistic regression model

The function glm (generalized linear model) can be used to fit the logistic regression model (as well as many other types of model). Setting the argument family = binomial gives us logistic regression. Note that when fitting the model the dependent variable needs to be numeric; if the data are coded as Boolean (logical) TRUE/FALSE values, they should first be converted to integers using as.numeric().
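
For example, if survival had been recorded as a logical column (a hypothetical coding; the Titanic data already uses 0/1), the conversion would look like this:

# hypothetical example: survival coded as TRUE/FALSE rather than 0/1
df <- tibble(survived = c(TRUE, FALSE, TRUE), age = c(29, 40, 2))
df <- mutate(df, survived = as.numeric(survived))  # TRUE/FALSE -> 1/0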

First we fit the regression for the female passengers.

fit.female <- glm(survived ~ age, family = binomial, female)
summary(fit.female)

Call:
glm(formula = survived ~ age, family = binomial, data = female)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.192   0.484   0.593   0.680   0.849  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.8309     0.3663    2.27    0.023 *
age           0.0234     0.0119    1.97    0.049 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 229.88  on 242  degrees of freedom
Residual deviance: 225.81  on 241  degrees of freedom
  (220 observations deleted due to missingness)
AIC: 229.8

Number of Fisher Scoring iterations: 4
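
The coefficients are reported on the log-odds scale, so they’re easier to interpret after transforming them back. A quick sketch (the approximate values in the comments are derived from the summary above):

# predicted probability of survival at age 0, i.e. f(b0):
plogis(coef(fit.female)["(Intercept)"])  # ~0.70
# multiplicative change in the odds of survival per year of age, i.e. exp(b1):
exp(coef(fit.female)["age"])             # ~1.02, odds increase ~2% per year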

Now we repeat the same step for the male passengers.

fit.male <- glm(survived ~ age, family = binomial, male)
summary(fit.male)

Call:
glm(formula = survived ~ age, family = binomial, data = male)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.062  -0.737  -0.628  -0.407   2.239  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.26565    0.28982   -0.92  0.35936    
age         -0.03593    0.00949   -3.79  0.00015 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 401.15  on 389  degrees of freedom
Residual deviance: 385.37  on 388  degrees of freedom
  (460 observations deleted due to missingness)
AIC: 389.4

Number of Fisher Scoring iterations: 4
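
Note that for the male passengers the age coefficient is negative. Expressed as an odds ratio, as in the sketch above:

# odds of survival decrease with age for male passengers:
exp(coef(fit.male)["age"])  # ~0.96, odds drop ~3.5% per year of age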

Visualizing the logistic regression

To visualize the logistic regression fit, we first use the predict function to generate the model predictions about probability of survival as a function of age.

ages <- seq(0, 75, 1) # predict survival for ages 0 to 75

predicted.female <- predict(fit.female, 
                            newdata = data.frame(age = ages),
                            type = "response")

predicted.male <- predict(fit.male,
                          newdata = data.frame(age = ages),
                          type = "response")

Having generated the predicted probabilities of survival, we can then add these prediction lines to our previous plots using geom_line.

female.logistic.plot <- female.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.female),
            color = fcolor, size = 1)

male.logistic.plot <- male.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.male),
            color = mcolor, size = 1)

plot_grid(female.logistic.plot, male.logistic.plot)

Quick and easy visualization

Here’s an alternative “quick and easy” way to generate the plot above using the awesome power of ggplot. The downside of this approach is that we don’t generate the detailed information on the model fits, which is something you’d certainly want to have in any real analysis.

ggplot(titanic, aes(x=age, y=survived, color=sex)) + 
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",  method.args = list(family="binomial"))  + 
  labs(x = "Age", y = "P(Survival)") +
  facet_wrap(~ sex) +
  scale_color_manual(values = c(fcolor, mcolor))

Impact of sex and passenger class on the models

In the figure above we are faceting just on sex. It can sometimes be useful to facet on multiple variables simultaneously. Below we extend our visualization to look at the regression faceted on both passenger class and sex, using facet_grid:

ggplot(titanic, aes(x=age, y=survived, color=sex)) + 
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method="glm",  method.args = list(family="binomial"))  + 
  labs(x = "Age", y = "P(Survival)") +  
  facet_grid(pclass ~ sex) +
  scale_color_manual(values = c(fcolor, mcolor)) + theme_few()

Fitting multiple models based on groupings using dplyr::do

In the figure above we used facet_grid to condition on both sex and class. What if we wanted to fit a separate logistic regression for each combination of these two categorical variables? There are three passenger classes and two sexes, so we’d have to create six data subsets and fit the model six times if we used the same approach we took above when dealing with just the two sexes. Luckily, dplyr provides a powerful function called do that allows us to carry out arbitrary computations on grouped data.

There are two ways to use do(). The first is to give the expression you evaluate in do() a name, in which case do() will store the results in a column with that name. The second is for the expression to return a data frame.

In this first example, the model fits are stored in the fits column. When using do() you can refer to the data for the current group using a period (.):

grouped.models <-
  titanic %>%
  group_by(sex, pclass) %>%
  do(fits = glm(.$survived ~ .$age, family = binomial))

grouped.models
Source: local data frame [6 x 3]
Groups: <by row>

# A tibble: 6 x 3
     sex pclass      fits
*  <chr>  <chr>    <list>
1 female    1st <S3: glm>
2 female    2nd <S3: glm>
3 female    3rd <S3: glm>
4   male    1st <S3: glm>
5   male    2nd <S3: glm>
6   male    3rd <S3: glm>

Having stored the model fits in a data frame, we can access the column holding the fits just like any other variable:

# get the summary of the second logistic regression (Female, 2nd Class) 
summary(grouped.models$fits[[2]])

Call:
glm(formula = .$survived ~ .$age, family = binomial)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.208   0.398   0.472   0.528   0.692  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   2.8799     0.9164    3.14   0.0017 **
.$age        -0.0297     0.0276   -1.08   0.2823   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.576  on 84  degrees of freedom
Residual deviance: 60.393  on 83  degrees of freedom
  (22 observations deleted due to missingness)
AIC: 64.39

Number of Fisher Scoring iterations: 5
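
Indexing the fits by position, as above, is fragile if the grouping ever changes. A safer sketch is to filter on the grouping variables first:

# pull out the female, 2nd class fit by name rather than position
female.2nd <- grouped.models %>% filter(sex == "female", pclass == "2nd")
summary(female.2nd$fits[[1]])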

Now we illustrate the second approach to using do(). When no name is provided, do() expects its expression to return a data frame. Here we use the broom::tidy() function to get the key results of each fitted model into a data frame:

titanic %>%
  group_by(sex, pclass) %>%
  do(tidy(glm(.$survived ~ .$age, family = binomial)))
# A tibble: 12 x 7
# Groups:   sex, pclass [6]
      sex pclass        term estimate std.error statistic   p.value
    <chr>  <chr>       <chr>    <dbl>     <dbl>     <dbl>     <dbl>
 1 female    1st (Intercept)  2.51051   1.18564    2.1174 3.422e-02
 2 female    1st       .$age  0.01216   0.03094    0.3930 6.943e-01
 3 female    2nd (Intercept)  2.87985   0.91644    3.1424 1.675e-03
 4 female    2nd       .$age -0.02969   0.02762   -1.0751 2.823e-01
 5 female    3rd (Intercept)  0.23541   0.57455    0.4097 6.820e-01
 6 female    3rd       .$age -0.01200   0.02262   -0.5304 5.958e-01
 7   male    1st (Intercept)  1.50502   0.61105    2.4630 1.378e-02
 8   male    1st       .$age -0.05425   0.01503   -3.6096 3.067e-04
 9   male    2nd (Intercept)  1.72884   0.72020    2.4005 1.637e-02
10   male    2nd       .$age -0.14388   0.03174   -4.5330 5.816e-06
11   male    3rd (Intercept) -0.46851   0.58948   -0.7948 4.267e-01
12   male    3rd       .$age -0.06281   0.02586   -2.4290 1.514e-02
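
Because tidy() returns an ordinary data frame, the combined results can be manipulated with the usual dplyr verbs. For example, a sketch that keeps just the age terms, making it easy to compare slopes across the six groups:

titanic %>%
  group_by(sex, pclass) %>%
  do(tidy(glm(.$survived ~ .$age, family = binomial))) %>%
  filter(term == ".$age")  # keep only the age coefficients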

In-class Assignment #2

Bumpus (1898) described a sample of house sparrows that he collected after a very severe storm. The sample included 136 birds, sixty-four of which perished during the storm. Also included in his description were a variety of morphological measurements on the birds and information about their sex and age (for male birds). This data set has become a benchmark in the evolutionary biology literature for demonstrating methods for analyzing natural selection. The bumpus data set is available from the class website as a tab-delimited file, bumpus-data.txt. Reference: Bumpus, H. C. 1898. The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol. Lectures: Woods Hole Marine Biological Laboratory, 209–225.

  1. Before you examine the data, write down your prediction about how body size might influence survival of house sparrows during a storm. There are no right/wrong answers. I’m just looking for your prediction and a sentence about your reasoning. [1 pt]

  2. Measures like weight and skeletal dimensions are often used as proxies for body size. Fit logistic regression models for: a) survival as a function of weight; and b) survival as a function of skull width, and report the coefficients of both models. [1 pt]

  3. Illustrate the two regression models from the previous question using scatter plots overlain with logistic regression curves. Combine the two plots as subplots A) and B) in a single figure, using cowplot. [2 pts]

  4. Summarize your findings from the previous two questions regarding the relationship between survival and the various morphological traits. Are the patterns you observed consistent with your prediction in the first question? Point out any unexpected or unusual aspects of your findings. [2 pts]

  5. Use dplyr::group_by() and dplyr::do() to fit logistic regressions of survival as a function of weight for male and female birds separately, using broom::tidy() to return a table of coefficients (and other parameters) from the model fits. [2 pts]

  6. Use faceting to produce a figure illustrating the logistic regression of survival as a function of weight for male and female birds separately. [1 pt]