Exploring GDP Per Capita vs. Educational Attainment

The inspiration for this post came while I was browsing articles about US GDP: I wondered which variables might have a positive relationship with GDP and would be interesting to graph and explore.

I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough so I decided to write up a quick R program to see what I could find.

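library(ggplot2)

# Load the two state-level datasets
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)

# Join on state name (inner join: only states present in both files are kept)
new.data <- merge(bachelors, GDP, by = "State")

# Rename columns; this assumes each file holds only State plus one value column
colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")

# Simple linear regression of GDP per capita on percent with a bachelor's degree
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)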

First I imported the two datasets and merged them into one data frame. I then built a linear model with GDP per capita as the response variable and the percentage of the population with a bachelor's degree as the predictor.
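One thing worth checking after a merge like this is that no state was lost along the way, since merge() does an inner join by default and drops rows whose keys do not match exactly. Here is a minimal sketch of that check. The toy data frames and their values are invented stand-ins for the two CSV files, not the real data:

```r
# Toy stand-ins for the two CSV files; all values are invented for illustration.
bachelors_toy <- data.frame(
  State = c("Alaska", "Delaware", "Wyoming", "Ohio"),
  Pct   = c(28.0, 29.0, 25.0, 26.0)
)
gdp_toy <- data.frame(
  State = c("Alaska", "Delaware", "Wyoming", "ohio"),  # note the case mismatch
  GDP   = c(70000, 65000, 62000, 45000)
)

merged_toy <- merge(bachelors_toy, gdp_toy, by = "State")

# merge() keeps only keys present in both frames, so "Ohio" vs "ohio" is dropped.
dropped <- setdiff(bachelors_toy$State, merged_toy$State)

nrow(merged_toy)  # 3
dropped           # "Ohio"
```

On the real data, the same setdiff() call against new.data$State should come back empty if all 50 states survived the merge.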

The summary of this simple linear model is featured below:

> summary(model)

Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)

Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564,	Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF,  p-value: 4.737e-06

As we can see, the p-value is very small, so the relationship is statistically significant. However, the R-squared of 0.356 leaves much to be desired: the model accounts for only about 36% of the variance in GDP per capita, and I question whether it is a good fit.
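To put the R-squared in perspective: in a regression with a single predictor, R-squared is just the squared correlation between the two variables. The sketch below demonstrates this on simulated data (the numbers are made up and are not the state data), then applies it to the R-squared reported above:

```r
# Simulated data only; these are not the state-level figures.
set.seed(1)
x <- runif(50, 20, 40)
y <- 20000 + 1000 * x + rnorm(50, sd = 7000)

fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared

# With one predictor, R-squared equals the squared Pearson correlation.
same <- isTRUE(all.equal(r2, cor(x, y)^2))
same  # TRUE

# Correlation implied by the R-squared reported in the summary above
implied_r <- sqrt(0.3564)
round(implied_r, 3)  # 0.597
```

So the reported fit corresponds to a correlation of roughly 0.6 in magnitude between the two variables, which is moderate rather than negligible.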

Nevertheless, we can see a pretty interesting graph below:

g <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("GDP Per Capita")

g <- g + geom_text(aes(label = State))

g

Aha! We can attribute some of our lack of fit to the outliers Wyoming, Alaska, and Delaware. With only 50 observations (one per state), three outliers can have a strong impact on the fit.
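Spotting outliers by eye works here, but they can also be flagged numerically. The sketch below uses Cook's distance with the common 4/n rule of thumb, which is my choice for illustration and not something used in the analysis above. The data are synthetic, with one outlier planted on purpose:

```r
# Synthetic data with a single planted outlier; not the state data.
n <- 50
x <- seq(20, 40, length.out = n)
y <- 1000 * x + 1000 * sin(seq_len(n))  # smooth trend plus deterministic wiggle
y[n] <- y[n] + 30000                    # plant an outlier at the last point

fit <- lm(y ~ x)

# Cook's distance measures how much each observation shifts the fitted values.
cd <- cooks.distance(fit)

# 4/n is a rule-of-thumb cutoff, not a formal test.
flagged <- which(cd > 4 / n)
flagged
```

On the real model, which(cooks.distance(model) > 4 / nrow(new.data)) would give the row indices to look up in new.data$State.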

I tried the Box-Cox procedure in R to see whether the fit might improve if we transformed the response variable. The Box-Cox plot suggests that raising the response to the negative first power (that is, taking its reciprocal) might be beneficial.
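For anyone who wants to reproduce this step, here is a rough sketch of how a Box-Cox profile can be computed. It assumes the boxcox() function from the MASS package, which may or may not be the function used for the plot described above, and it runs on simulated data where the reciprocal of the response is linear in the predictor:

```r
library(MASS)  # assumption: boxcox() from MASS

# Simulated data in which 1/y is linear in x; not the state data.
set.seed(7)
x <- runif(50, 20, 40)
y <- 1 / (3e-5 - 4e-7 * x + rnorm(50, sd = 1.5e-6))

# y = TRUE stores the response in the fit, which boxcox() needs.
fit <- lm(y ~ x, y = TRUE)

# Profile log-likelihood over a grid of candidate powers (lambda).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# The power with the highest log-likelihood is the suggested transformation.
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
```

A value near -1 corresponds to the reciprocal transformation, near 0 to a log, and near 1 to leaving the response alone. Setting plotit = TRUE draws the usual plot with a confidence interval for lambda.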

This new model is outlined below:

> model2 <- lm(GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)
> summary(model2)

Call:
lm(formula = GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

Residual standard error: 2.967e-06 on 48 degrees of freedom
Multiple R-squared: 0.4199,	Adjusted R-squared: 0.4079
F-statistic: 34.75 on 1 and 48 DF,  p-value: 3.627e-07

The p-value of this new model is even lower, and the R-squared improved slightly, from 0.356 to 0.420. Still, the model is far from perfect, and it would be unwise to claim a strong linear relationship between the two variables.
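One caveat when comparing these two R-squared values: they are computed on different scales (GDP per capita versus its reciprocal), so they are not strictly comparable. A possible way around this, sketched below on simulated data, is to back-transform the fitted values from the reciprocal model and measure agreement on the original scale. This is an illustration of the idea, not part of the analysis above:

```r
# Simulated data; not the state data.
set.seed(42)
x <- runif(50, 20, 40)
y <- 1 / (3e-5 - 4e-7 * x + rnorm(50, sd = 1.5e-6))

fit_raw <- lm(y ~ x)         # response on the original scale
fit_inv <- lm(I(1 / y) ~ x)  # response on the reciprocal scale

# Undo the reciprocal so predictions are back in the units of y.
pred_inv <- 1 / fitted(fit_inv)

# Both quantities now describe agreement with y on the original scale.
r2_raw <- summary(fit_raw)$r.squared
r2_inv_original_scale <- cor(y, pred_inv)^2

c(raw = r2_raw, reciprocal = r2_inv_original_scale)
```

If the reciprocal model still comes out ahead on the original scale, that is better evidence that the transformation actually helped.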

Graphing this we get:

h <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita^-1)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("1 divided by GDP Per Capita")

h <- h + geom_text(aes(label = State))

h

Once again we see the same three outliers, which have a large impact on the fit of a model built on so few observations. Either way, this model seems to show a somewhat stronger relationship between the two variables.