Some Statistical Analysis of Diamonds

Today I’ll be dealing with a dataset that has the price, carat, and several other attributes of almost 54,000 diamonds. It is publicly available in the ggplot2 package. Let’s jump right into it.

library(ggplot2)
library(gridExtra)

data(diamonds)

plot1 <- qplot(cut, data = diamonds)
plot2 <- qplot(carat, data = diamonds, binwidth = .1)

grid.arrange(plot1, plot2, ncol = 2)

These two histograms help us visualize how common each cut is, as well as the range of carat weights present in this dataset.

qplot(carat, data = diamonds, fill = cut, binwidth = .1)

This histogram combines the two ideas above and gives an idea of how many diamonds had a certain cut at a specific carat.

qplot(carat, price, colour = cut, data = diamonds)

A simple plot of price vs. carat with color based on cut shows us that price tends to increase with carat as we might expect.

> by(diamonds$carat, diamonds$cut, summary)
diamonds$cut: Fair
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.220   0.700   1.000   1.046   1.200   5.010
----------------------------------------------------------------
diamonds$cut: Good
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2300  0.5000  0.8200  0.8492  1.0100  3.0100
----------------------------------------------------------------
diamonds$cut: Very Good
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2000  0.4100  0.7100  0.8064  1.0200  4.0000
----------------------------------------------------------------
diamonds$cut: Premium
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.200   0.410   0.860   0.892   1.200   4.010
----------------------------------------------------------------
diamonds$cut: Ideal
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2000  0.3500  0.5400  0.7028  1.0100  3.5000


The above output gives us a summary of diamond carats separated by cut. We can see that ‘Fair’ cut diamonds tended to be the largest.

> by(diamonds$price, diamonds$cut, summary)
diamonds$cut: Fair
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 337    2050    3282    4359    5206   18570
----------------------------------------------------------------
diamonds$cut: Good
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 327    1145    3050    3929    5028   18790
----------------------------------------------------------------
diamonds$cut: Very Good
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 336     912    2648    3982    5373   18820
----------------------------------------------------------------
diamonds$cut: Premium
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 326    1046    3185    4584    6296   18820
----------------------------------------------------------------
diamonds$cut: Ideal
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 326     878    1810    3458    4678   18810

This is a summary of diamond prices separated by cut.

Now I’ll fit a few linear models to the data.

g1 <- lm(price ~ carat + cut, data = diamonds)
g2 <- lm(price ~ carat + cut + carat:cut, data = diamonds)

The first model has ‘price’ as the response variable with ‘carat’ and ‘cut’ as the two predictors. The second model is the same, except that it also includes a carat:cut interaction term.

My static blogging framework, Ruhoh, messes up when I try to paste the summary outputs, but if you load up R you can confirm that these two linear models are almost identical. Judging by the R² values and residual standard errors, the interactive model does look a tiny bit better, but the difference isn’t large enough to be conclusive.
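Since I can’t paste the summaries here, a short sketch of how you could compare the two models yourself: an F-test on the nested models (via `anova`) gives a more direct comparison than eyeballing R², and the key fit statistics can be pulled out of the summaries without printing the whole thing. This assumes the `g1` and `g2` models fitted above.

```r
# F-test on the nested models: a small p-value indicates the
# interaction terms add real explanatory power.
anova(g1, g2)

# Pull out R-squared and the residual standard error directly.
summary(g1)$r.squared
summary(g2)$r.squared
summary(g1)$sigma
summary(g2)$sigma
```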

Next we want to take a look at the residuals to see if they are correlated, have non-constant variance, or are not normally distributed:

plot3 <- qplot(fitted(g1), residuals(g1)) + geom_hline(yintercept = 0) +
       ggtitle("Residuals of Regular Model")

plot4 <- qplot(fitted(g2), residuals(g2)) + geom_hline(yintercept = 0) +
       ggtitle("Residuals of Interactive Model")

grid.arrange(plot3, plot4, ncol = 2)

In general, the residuals are poorly behaved for both models and look correlated. This violates vital assumptions we make when constructing a linear model. The QQplots of these two models confirm the non-normality of the errors:

par(mfrow=c(1,2))

qqnorm(g1$residuals, main="QQPLOT Residuals Regular")
qqline(g1$residuals, col="red")

qqnorm(g2$residuals, main="QQPLOT Residuals Interactive")
qqline(g2$residuals, col="red")

Next, we can explore some transformations that will hopefully fit better:

qplot(log10(carat), log10(price), colour = cut, data = diamonds)

The plot of log(price) vs. log(carat) (both base 10) seems to exhibit a much stronger linear relationship than the untransformed parameters.

I fit a model with log10(price) as the response and log10(carat) as the predictor along with the interactive ‘cut’ terms.

g3 <- lm(log10(price) ~ log10(carat) + cut + log10(carat):cut, data = diamonds)

Once again the summary function messes up Ruhoh, so take my word for the following: the tiny residual standard error and higher R² value of this model, compared to the one with untransformed parameters, give some evidence that it is indeed a better fit.
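If you’d rather check the numbers yourself without wading through the full summary output, the relevant statistics can be extracted directly (this assumes the `g3` model fitted above):

```r
# R-squared and residual standard error of the log-log model.
summary(g3)$r.squared
summary(g3)$sigma
```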

qplot(fitted(g3), residuals(g3)) + geom_hline(yintercept = 0)

The residuals now seem to have nearly constant variance and no correlation. The normality of the residuals is confirmed by the QQPlot, which exhibits a strong linear pattern:

qqnorm(g3$residuals, main="QQPLOT Residuals")
qqline(g3$residuals, col="red")

Last bit of code to show that log(price) and log(carat) exhibit a nice linear relationship:

qplot(log10(carat), log10(price), data = diamonds) + geom_smooth(method = "lm", color = "red")

Now that we’ve built a pretty good linear model to predict log10(price) based on a few other variables, we can back-transform if needed to predict actual price.
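As a quick sketch of that back-transformation, reusing the first few rows of the data purely as example inputs (and assuming the `g3` model fitted above):

```r
# Predict on the log10 scale, then raise 10 to the predictions
# to recover prices in the original dollar units.
new_diamonds <- head(diamonds[, c("carat", "cut")], 3)
log_price_hat <- predict(g3, newdata = new_diamonds)
price_hat <- 10^log_price_hat
price_hat
```

One caveat worth knowing: naively exponentiating the predictions estimates the median price rather than the mean on the original scale, so the back-transformed values run slightly conservative.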

This was a pretty lengthy one - hope you enjoyed it!