Simple Linear Regression and Analysis

I’ve been trying to get more accustomed with using Hadley Wickham’s ggplot2 so today I’ll be going over simple linear regression and some easy ways to analyze the models you’ve created. ggplot2 is a phenomenal package for data visualization and I am actually lucky enough to have taken a class with Hadley at Rice University.

First we’ll load some packages and attach the “fat” dataset:

1
2
3
4
5
library(faraway)
library(ggplot2)
library(gridExtra)
data(fat)
attach(fat)

I loaded faraway to have access to the dataset. gridExtra will be used later on for viewing multiple plots at once. Then, I created several linear models with Brozek (Brozek is an estimate of body fat % using density, click link for more information) as the response variable and other various “fat” variables as predictors.

1
2
3
4
g1 <- lm(brozek ~ weight)
g2 <- lm(brozek ~ abdom)
g3 <- lm(brozek ~ age)
g4 <- lm(brozek ~ thigh)

I predict that age will not be a very good predictor of body fat percentage, while all of the other variables will be. My suspicions are confirmed after I analyze the summaries of each linear model and find the correlations of brozek with respect to each of the predictors.

1
2
3
4
5
6
7
8
9
10
h1 <- summary(g1)
h2 <- summary(g2)
h3 <- summary(g3)
h4 <- summary(g4)

cor(brozek, weight)
cor(brozek, abdom)
cor(brozek, age)
cor(brozek, thigh)

We can check the R2 values of h1:h4 to check how well the regression line fits the set of data by using h1$r.squared and cycling it from h1 to h4. The R2 values respectively are 0.3759604, 0.6621178, 0.08362132, 0.3150402. As we can see, age has a very low R2 and thus the linear model is a fairly bad predictor of the data.

The correlation tells the same story. The respective correlations are 0.6131561, 0.8137062, 0.2891735, 0.5612844. The abdom model had the largest R2 value and cor(brozek, abdom) was also highest amongst the pairs. It seems as if abdominal size is good indicator of body fat percentage.

Next we can plot the residuals. These graphs help visualize that the sum of the residuals must always be 0. There doesn’t seem to be much heteroscedasticity present in these plots so the variance of our residuals is almost constant. It is always good to check the validity of our model because when conducting a regression we assume that the variance of the residuals is constant.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
par(mfrow=c(2,2))

plot(weight, residuals(g1), xlab="Weight", ylab="Residuals", pch=19)
abline(h=0)

plot(abdom, residuals(g2), xlab="Abdom", ylab="Residuals", pch=19)
abline(h=0)

plot(age, residuals(g3), xlab="Age", ylab="Residuals", pch=19)
abline(h=0)

plot(thigh, residuals(g4), xlab="Thigh", ylab="Residuals", pch=19)
abline(h=0)

Finally, we can use ggplot2 to make some very aesthetic graphs of the regressions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Dataset1 <- data.frame(weight, brozek)
plot1 <- ggplot(data = Dataset, aes(x = weight, y = brozek)) +
    geom_point() + geom_smooth(method = "lm") + opts(title="Brozek vs Weight")

Dataset2 <- data.frame(abdom, brozek)
plot2 <- ggplot(data = Dataset, aes(x = abdom, y = brozek)) +
    geom_point() + geom_smooth(method = "lm") + opts(title="Brozek vs Abdom")

Dataset3 <- data.frame(abdom, brozek)
plot3 <- ggplot(data = Dataset, aes(x = age, y = brozek)) +
    geom_point() + geom_smooth(method = "lm") + opts(title="Brozek vs Age")

Dataset4 <- data.frame(abdom, brozek)
plot4 <- ggplot(data = Dataset, aes(x = thigh, y = brozek)) +
    geom_point() + geom_smooth(method = "lm") + opts(title="Brozek vs Thigh")

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

The straight lines represent the linear model fit. It’s pretty easy to see which plots exhibit a good linear relationship.