Logistic regression is used when the dependent variable is discrete (often binary). The explanatory variables may be either continuous or discrete.
Examples of binary response variables include survived/died, infected/uninfected, and presence/absence of a trait.
We model the binary response variable, \(Y\), as a function of the predictor variables, \(X_1\), \(X_2\), etc., as:
\[ P(Y = 1|X_1,\ldots,X_p) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]
So we’re modeling the probability of the state of Y as a function of a linear combination of the predictor variables.
For logistic regression, \(f\) is the logistic function: \[ f(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}} \]
Therefore, the bivariate logistic regression is given by: \[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
Note that \(\beta_0\) here is akin to the intercept in our standard linear regression.
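To make the shape of \(f\) concrete, here’s a minimal sketch in R of the logistic function (base R also provides it as plogis()):

```r
# The logistic function maps any real number to the interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

logistic(0)    # 0.5: when b0 + b1*X = 0, P(Y = 1) is exactly 0.5
logistic(4)    # ~0.982: large positive linear predictors push P toward 1
logistic(-4)   # ~0.018: large negative linear predictors push P toward 0
```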
To help you develop an intuition for the logistic regression equation, I’ve developed a small web app that allows you to explore how the shape of the regression curve responds to changes in the regression coefficients \(\beta_0\) and \(\beta_1\).
Open the Exploring Logistic Regression app in another browser window and then answer the following questions.
Set both coefficients, B0 and B1, to zero. Given these coefficients, what is \(P(Y = 1)\) when X = 0 (i.e. the intercept)? How does this probability change as X increases? [1 pt]
Set the coefficient B1 to zero. Use the slider to change the value of B0. How does increasing/decreasing the value of B0 affect \(P(Y = 1)\) at the Y-intercept? [1 pt]
Set the coefficient B0 to zero. Use the slider to change the value of B1. How does increasing/decreasing the value of B1 affect \(P(Y = 1)\)? [1 pt]
Explore a range of values for both B0 and B1 simultaneously. What range of coefficients gives curves with decreasing \(P(Y=1)\) as X increases? What range of coefficients generate curves with increasing \(P(Y=1)\) as X increases? [1 pt]
The file titanic.csv contains information about passengers on the Titanic. Variables in this data set include information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not they survived the sinking of the ship (0 = died, 1 = survived).
library(tidyverse)
library(broom)
library(cowplot) # for plot_grid fxn
library(ggthemes) # install this package for additional ggplot themes
titanic <- read_csv("https://github.com/pmagwene/Bio723/raw/master/datasets/titanic.csv")
names(titanic)
[1] "row.names" "pclass" "survived" "name" "age"
[6] "embarked" "home.dest" "room" "ticket" "boat"
[11] "sex"
We’ve all heard the phrase, “Women and children first”, so we might expect that the probability that a passenger survived the sinking of the Titanic is related to their sex and/or age. Let’s create separate data subsets for male and female passengers.
male <- filter(titanic, sex == "male")
dim(male)
[1] 850 11
female <- filter(titanic, sex == "female")
dim(female)
[1] 463 11
Let’s create visualizations of survival as a function of age for the male and female passengers.
fcolor = "lightcoral"
mcolor = "lightsteelblue"
female.plot <- ggplot(female, aes(x = age, y = survived)) +
geom_jitter(width = 0, height = 0.05, color = fcolor) +
labs(title = "Female Passengers")
male.plot <- ggplot(male, aes(x = age, y = survived)) +
geom_jitter(width = 0, height = 0.05, color = mcolor) +
labs(title = "Male Passengers")
plot_grid(female.plot, male.plot)
The function glm() (generalized linear model) can be used to fit the logistic regression model (as well as other models). Setting the argument family = binomial gives us logistic regression. Note that when fitting the model the dependent variable needs to be numeric, so if the data is provided as Boolean (logical) TRUE/FALSE values, they should be converted to integers using as.numeric().
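For instance, if the response had been recorded as a logical vector (hypothetical values shown here), the conversion would look like:

```r
# Hypothetical example: a logical response converted to 0/1 for glm()
alive <- c(TRUE, FALSE, TRUE, FALSE)
as.numeric(alive)  # 1 0 1 0
```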
First we fit the regression for the female passengers.
fit.female <- glm(survived ~ age, family = binomial, female)
summary(fit.female)
Call:
glm(formula = survived ~ age, family = binomial, data = female)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.192 0.484 0.593 0.680 0.849
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8309 0.3663 2.27 0.023 *
age 0.0234 0.0119 1.97 0.049 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 229.88 on 242 degrees of freedom
Residual deviance: 225.81 on 241 degrees of freedom
(220 observations deleted due to missingness)
AIC: 229.8
Number of Fisher Scoring iterations: 4
Now we repeat the same step for the male passengers.
fit.male <- glm(survived ~ age, family = binomial, male)
summary(fit.male)
Call:
glm(formula = survived ~ age, family = binomial, data = male)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.062 -0.737 -0.628 -0.407 2.239
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.26565 0.28982 -0.92 0.35936
age -0.03593 0.00949 -3.79 0.00015 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 401.15 on 389 degrees of freedom
Residual deviance: 385.37 on 388 degrees of freedom
(460 observations deleted due to missingness)
AIC: 389.4
Number of Fisher Scoring iterations: 4
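The estimates reported by summary() are on the log-odds scale. A quick sketch of one way to interpret them, using the age estimates shown above: exponentiating a slope gives the multiplicative change in the odds of survival per year of age.

```r
# Slopes from logistic regression are log odds ratios; exponentiate to
# get odds ratios (multiplicative change in odds per unit of predictor).
exp(0.0234)    # ~1.024: female survival odds rise ~2.4% per year of age
exp(-0.03593)  # ~0.965: male survival odds fall ~3.5% per year of age
```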
To visualize the logistic regression fit, we first use the predict() function to generate the model predictions about probability of survival as a function of age.
ages <- seq(0, 75, 1) # predict survival for ages 0 to 75
predicted.female <- predict(fit.female,
newdata = data.frame(age = ages),
type = "response")
predicted.male <- predict(fit.male,
newdata = data.frame(age = ages),
type = "response")
Having generated the predicted probabilities of survival, we can then add these prediction lines to our previous plot using geom_line().
female.logistic.plot <- female.plot +
geom_line(data = data.frame(age = ages, survived = predicted.female),
color = fcolor, size = 1)
male.logistic.plot <- male.plot +
geom_line(data = data.frame(age = ages, survived = predicted.male),
color = mcolor, size = 1)
plot_grid(female.logistic.plot, male.logistic.plot)
Here’s an alternative “quick and easy” way to generate the plot above using the awesome power of ggplot. The downside of this approach is we don’t generate the detailed information on the model, which is something you’d certainly want to have in any real analysis.
ggplot(titanic, aes(x=age, y=survived, color=sex)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method="glm", method.args = list(family="binomial")) +
labs(x = "Age", y = "P(Survival)") +
facet_wrap(~ sex) +
scale_color_manual(values = c(fcolor, mcolor))
In the first figure we are faceting just on sex. It can sometimes be useful to facet on multiple variables simultaneously. Below we extend our visualization to look at the regression faceted on both class and sex, using facet_grid():
ggplot(titanic, aes(x=age, y=survived, color=sex)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method="glm", method.args = list(family="binomial")) +
labs(x = "Age", y = "P(Survival)") +
facet_grid(pclass ~ sex) +
scale_color_manual(values = c(fcolor, mcolor)) + theme_few()
dplyr::do
In the figure above we used facet_grid to condition on both sex and class. What if we wanted to fit a separate logistic regression for each combination of these two categorical variables? There are three passenger classes and two sexes, meaning we’d have to create six data subsets and fit the model six times if we used the same approach we used above when dealing with sex alone. Luckily, dplyr provides a powerful function called do() that allows us to carry out arbitrary computations on grouped data.
There are two ways to use do(). The first is to give the expression you evaluate in do() a name, in which case do() will store the results in a column of that name. The second is for the expression to return a data frame.
In this first example, the model fits are stored in the fits column. When using do() you can refer to the current group using a period (.):
grouped.models <-
titanic %>%
group_by(sex, pclass) %>%
do(fits = glm(.$survived ~ .$age, family = binomial))
grouped.models
Source: local data frame [6 x 3]
Groups: <by row>
# A tibble: 6 x 3
sex pclass fits
* <chr> <chr> <list>
1 female 1st <S3: glm>
2 female 2nd <S3: glm>
3 female 3rd <S3: glm>
4 male 1st <S3: glm>
5 male 2nd <S3: glm>
6 male 3rd <S3: glm>
Having stored the model fits in a data frame, we can access the columns with the fits just like any other variable:
# get the summary of the second logistic regression (Female, 2nd Class)
summary(grouped.models$fits[[2]])
Call:
glm(formula = .$survived ~ .$age, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.208 0.398 0.472 0.528 0.692
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8799 0.9164 3.14 0.0017 **
.$age -0.0297 0.0276 -1.08 0.2823
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 61.576 on 84 degrees of freedom
Residual deviance: 60.393 on 83 degrees of freedom
(22 observations deleted due to missingness)
AIC: 64.39
Number of Fisher Scoring iterations: 5
Now we illustrate the second approach to using do(). When no name is provided, do() expects its expression to return a data frame. Here we use the broom::tidy() function to get the key results of each fitted model into a data frame:
titanic %>%
group_by(sex, pclass) %>%
do(tidy(glm(.$survived ~ .$age, family = binomial)))
# A tibble: 12 x 7
# Groups: sex, pclass [6]
sex pclass term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 female 1st (Intercept) 2.51051 1.18564 2.1174 3.422e-02
2 female 1st .$age 0.01216 0.03094 0.3930 6.943e-01
3 female 2nd (Intercept) 2.87985 0.91644 3.1424 1.675e-03
4 female 2nd .$age -0.02969 0.02762 -1.0751 2.823e-01
5 female 3rd (Intercept) 0.23541 0.57455 0.4097 6.820e-01
6 female 3rd .$age -0.01200 0.02262 -0.5304 5.958e-01
7 male 1st (Intercept) 1.50502 0.61105 2.4630 1.378e-02
8 male 1st .$age -0.05425 0.01503 -3.6096 3.067e-04
9 male 2nd (Intercept) 1.72884 0.72020 2.4005 1.637e-02
10 male 2nd .$age -0.14388 0.03174 -4.5330 5.816e-06
11 male 3rd (Intercept) -0.46851 0.58948 -0.7948 4.267e-01
12 male 3rd .$age -0.06281 0.02586 -2.4290 1.514e-02
Bumpus (1898) described a sample of house sparrows which he collected after a very severe storm. The sample included 136 birds, sixty-four of which perished during the storm. Also included in his description were a variety of morphological measurements on the birds and information about their sex and age (for male birds). This data set has become a benchmark in the evolutionary biology literature for demonstrating methods for analyzing natural selection. The bumpus data set is available from the class website as a tab-delimited file, bumpus-data.txt.
Reference: Bumpus, H. C. 1898. The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol. Lectures: Woods Hole Marine Biological Laboratory, 209–225.
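Assuming you’ve downloaded bumpus-data.txt to your working directory, loading it might look like this (the file is tab-delimited, so read_tsv() is appropriate):

```r
library(tidyverse)

# Assumes bumpus-data.txt has been downloaded to the working directory
bumpus <- read_tsv("bumpus-data.txt")
names(bumpus)
```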
Before you examine the data, write down your prediction about how body size might influence survival of house sparrows during a storm. There are no right/wrong answers. I’m just looking for your prediction and a sentence about your reasoning. [1 pt]
Measures like weight and skeletal dimensions are often used as proxies for body size. Fit logistic regression models for: a) survival as a function of weight; and b) survival as a function of skull width, and report the coefficients of the models. [1 pt]
Illustrate the two regression models from the previous question using scatter plots overlain with logistic regression curves. Combine the two plots as subplots A) and B) in a single figure, using cowplot. [2 pts]
Summarize your findings from the previous two questions regarding the relationship between survival and the various morphological traits. Are the patterns you observed consistent with your prediction in the first question? Point out any unexpected or unusual aspects of your findings. [2 pts]
Use dplyr::group_by() and dplyr::do() to fit logistic regressions of survival as a function of weight for male and female birds separately, using broom::tidy() to return a table of coefficients (and other parameters) from the model fits. [2 pts]
Use faceting to produce a figure illustrating the logistic regression of survival as a function of weight for male and female birds separately. [1 pt]