Chapter 23 Multiple regression

23.1 Libraries to install

We’ll be using several new packages for this class session. Install the following packages via one of the standard install mechanisms:

  • HistData – provides the GaltonFamilies example data sets we’ll work with
  • plot3D – for generating 3D plots
  • rgl – NOTE: On OS X, rgl requires you to install a program called XQuartz. XQuartz can be downloaded from the XQuartz Home Page. If you’re on a Mac, install XQuartz before installing rgl. You may have to reboot your computer after installing XQuartz.
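All three packages can be installed in a single call from the R console:

```r
# One-time setup: install the packages used in this chapter
install.packages(c("HistData", "plot3D", "rgl"))
```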

23.2 Review of bivariate regression

Recall the model for bivariate least-squares regression. When we regress \(Y\) on \(X\) we’re looking for a linear function, \(f(X)\), for which the following sum of squared deviations is minimized:

\[ \sum_{i=1}^n (y_i - f(x_i))^2 \]

The general form of a linear function of one variable is a line,

\[ \widehat{Y} = f(X) = a + bX \]

where \(b\) is the slope of the line and \(a\) is the intercept.

23.3 Multiple regression

The idea behind multiple regression is almost exactly the same as bivariate regression, except now we try to fit a linear model for \(Y\) using multiple explanatory variables, \(X_1, X_2,\ldots, X_m\). That is, we’re looking for a linear function, \(f(X_1, X_2,\ldots,X_m)\), that minimizes:

\[ \sum_{i=1}^n(y_i - f(x_{i1}, x_{i2},\ldots, x_{im}))^2 \]

A linear function of more than one variable is written as:

\[ \widehat{Y} = f(X_1, X_2,\ldots,X_m) = a + b_1X_1 + b_2X_2 + \cdots + b_mX_m \]

where \(a\) is the intercept and \(b_1, b_2,\ldots,b_m\) are the regression coefficients.

23.3.1 Geometrical interpretation

Geometrically the regression coefficients have the same interpretation as in the bivariate case – slopes with respect to the corresponding variable. When there are two predictor variables, the linear regression is geometrically a plane in 3-space, as shown in the figure below. When there are more than two predictor variables, the regression solution is a hyper-plane.

Multiple regression, two predictor variables


Mathematically, the best-fitting regression coefficients, \(b_1, b_2,\ldots,b_m\), are found using linear algebra. Since we haven’t covered linear algebra in this course, I will omit the details. Conceptually, the thing to remember is that the regression coefficients are related to the standard deviations of the predictor variables and the covariances between the predictor and outcome variables.

23.3.2 Coefficient of determination for multiple regression

As in bivariate regression, the coefficient of determination (\(R^2\)) provides a measure of the proportion of variance in the outcome variable (\(Y\)) “explained” by the predictor variables (\(X_1, X_2, \ldots\)).

23.4 Interpreting multiple regression

Here are some things to keep in mind when interpreting a multiple regression:

  • In most cases of regression, causal interpretation of the model is not justified.

  • Standard bivariate and multiple regression assumes that the predictor variables (\(X_1, X_2, \ldots\)) are observed without error. That is, uncertainty in the regression model is only associated with the outcome variable, not the predictors.

  • Comparing the size of regression coefficients only makes sense if all the predictor (explanatory) variables have the same scale.

  • If the explanatory variables (\(X_1, X_2,\ldots,X_m\)) are highly correlated, then the regression solution can be “unstable” – a small change in the data could lead to a large change in the regression model.

23.6 Example data set: mtcars

The mtcars dataset contains information on fuel consumption and ten other aspects of car design (see ?mtcars for more info). We’ll use multiple regression to model the relationship between fuel consumption (mpg) and a vehicle’s weight (wt) and horsepower (hp).

23.7 Visualizing and summarizing the variables of interest

Before carrying out any regression modeling, it’s always a good idea to start with visualizations of the individual variables.
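For example, a quick histogram of each variable can be generated with ggplot2 (shown here for mpg; the plots for wt and hp are analogous):

```r
library(ggplot2)

# Distribution of fuel economy in the mtcars data
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon (mpg)")
```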

Let’s also create some quick data summaries for our variables:
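One simple way to do this is base R’s summary() function, applied to just the columns of interest:

```r
# Minima, maxima, quartiles, and means for the three variables of interest
summary(mtcars[, c("mpg", "wt", "hp")])
```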

And a correlation matrix to summarize the bivariate associations between the variables:
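This can be computed with cor() on the same subset of columns:

```r
# Pairwise correlations between fuel economy, weight, and horsepower
cor(mtcars[, c("mpg", "wt", "hp")])
```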

We can use the GGally::ggpairs() function, which we’ve seen previously, to create a visualization of the bivariate relationships:
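For example:

```r
library(GGally)

# Scatter plots, density plots, and correlations for all pairs of variables
ggpairs(mtcars[, c("mpg", "wt", "hp")])
```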

From the scatter plots and correlation matrix we see that weight and horsepower are positively correlated with each other, but both are negatively correlated with fuel economy. This jibes with our intuition – bigger cars with more powerful engines generally get lower gas mileage.

23.8 3D plots

Since we’re building a model that involves three variables, it makes sense to look at a 3D plot. ggplot2 has no built-in facilities for 3D scatter plots, so we’ll use a package called plot3D. plot3D follows the conventions of base R graphics, so we can’t build up figures in layers the way we do in ggplot. Instead we pass all the formatting arguments to a single function call.

To create a 3D scatter plot we can use the plot3D::points3D function. The argument pch sets the type of plotting character to use in the plot (for a graphical key of the available plotting characters see this link).
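A minimal example, with axis labels chosen here for illustration:

```r
library(plot3D)

# 3D scatter plot of weight, horsepower, and fuel economy
# pch = 16 uses filled circles as the plotting character
with(mtcars,
     points3D(wt, hp, mpg, pch = 16,
              xlab = "Weight", ylab = "Horsepower", zlab = "MPG"))
```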

We can change the angle of the 3D plot using the arguments theta and phi, which change the “azimuthal direction” and “colatitude” (inclination angle). See the wikipedia page on spherical coordinate systems for more explanation of these values.
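For example, to rotate the view (the particular angles here are arbitrary):

```r
library(plot3D)

# theta rotates the azimuthal direction; phi changes the inclination
with(mtcars,
     points3D(wt, hp, mpg, pch = 16, theta = 30, phi = 15,
              xlab = "Weight", ylab = "Horsepower", zlab = "MPG"))
```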

If you want the points to have a uniform color, specify a single color in the col argument. Here we also add vertical lines to the plot using the type argument.
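A sketch combining both options (the color choice is arbitrary):

```r
library(plot3D)

# A single uniform color plus vertical drop lines (type = "h")
with(mtcars,
     points3D(wt, hp, mpg, pch = 16, col = "steelblue", type = "h",
              xlab = "Weight", ylab = "Horsepower", zlab = "MPG"))
```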

For more examples of how you can modify plots generated with the plot3D package see this web page.

23.9 Fitting a multiple regression model in R

Using the lm() function, fitting multiple regression models is a straightforward extension of fitting a bivariate regression model.
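We simply add additional predictors on the right-hand side of the model formula:

```r
# Regress fuel economy on both weight and horsepower
fit.mpg <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit.mpg)
```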

As was the case for bivariate regression, the broom package functions tidy, glance, and augment can be useful for working with the results from fitting the model.
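For example, assuming the model fit above:

```r
library(broom)

fit.mpg <- lm(mpg ~ wt + hp, data = mtcars)

tidy(fit.mpg)           # coefficient estimates, std. errors, p-values
glance(fit.mpg)         # model-level summaries such as R-squared
head(augment(fit.mpg))  # original data plus fitted values and residuals
```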

23.11 Interactive 3D Visualizations Using OpenGL

The package rgl is another package that we can use for 3D visualization. rgl is powerful because it lets us create interactive plots we can rotate and zoom in/out on.

Once you’ve installed and loaded rgl try the following code.
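A minimal sketch using rgl’s plot3d() function (this opens an interactive window you can drag to rotate and scroll to zoom):

```r
library(rgl)

# Interactive 3D scatter plot; type = "s" draws points as spheres
plot3d(mtcars$wt, mtcars$hp, mtcars$mpg,
       type = "s", size = 1,
       xlab = "Weight", ylab = "Horsepower", zlab = "MPG")
```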

We can add a 3d plane to our plot, representing the multiple regression model, with the rgl.planes() function as shown below.
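A sketch, assuming an rgl scatter plot of the data is already open. rgl.planes() draws the plane \(ax + by + cz + d = 0\), so the fitted model \(\widehat{mpg} = a + b_1\,wt + b_2\,hp\) is rewritten as \(b_1 x + b_2 y - z + a = 0\):

```r
fit.mpg <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- coef(fit.mpg)

# Add the fitted regression plane to the open rgl scene
rgl.planes(a = coefs["wt"], b = coefs["hp"], c = -1,
           d = coefs["(Intercept)"],
           alpha = 0.5, col = "lightblue")
```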

23.12 Examining the residuals

Residual plots are useful for multiple regression, just as they were for bivariate regression.

First we plot the residuals versus each of the predictor variables individually.
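One way to do this, using the augmented model output (the plot for hp is analogous):

```r
library(broom)
library(ggplot2)

fit.mpg <- lm(mpg ~ wt + hp, data = mtcars)
fit.aug <- augment(fit.mpg)  # adds .fitted, .resid, etc.

# Residuals versus weight, with a reference line at zero
ggplot(fit.aug, aes(x = wt, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```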

And now we plot the residuals in 3D space, with a plane parallel to the xy-plane (wt, hp-plane) representing the plane about which the residuals should be homogeneously scattered if the assumptions of the linear regression model hold.
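One way to sketch this with rgl: plot the residuals against the two predictors, then add the plane \(z = 0\) (i.e. \(0x + 0y + 1z + 0 = 0\)):

```r
library(rgl)

fit.mpg <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals in 3D: ideally scattered evenly about the z = 0 plane
plot3d(mtcars$wt, mtcars$hp, residuals(fit.mpg),
       type = "s", size = 1,
       xlab = "Weight", ylab = "Horsepower", zlab = "Residual")

# Reference plane at z = 0
rgl.planes(a = 0, b = 0, c = 1, d = 0, alpha = 0.4, col = "gray")
```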