Some Statistical Analysis of Diamonds
Today I’ll be dealing with a dataset that has the price, carat, and several other attributes of almost 54,000 diamonds. It is publicly available in the ggplot2 package. Let’s jump right into it.

These two histograms help us visualize how common the ‘cuts’ were as well as the range of carats present in this dataset.
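Here is a minimal sketch of how two plots like these could be drawn. It assumes only that ggplot2 is installed; the bin width, axis labels, and variable names are arbitrary choices for illustration, not necessarily the settings behind the plots described above.

```r
library(ggplot2)  # also provides the diamonds data frame

# Counts per cut. Since cut is a factor, geom_bar() plays the role of a histogram.
p_cut <- ggplot(diamonds, aes(x = cut)) +
  geom_bar() +
  labs(x = "Cut", y = "Number of diamonds")

# Distribution of carat. binwidth = 0.1 is an arbitrary choice.
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Carat", y = "Number of diamonds")

print(p_cut)
print(p_carat)

# The same cut counts as a plain table
cut_counts <- table(diamonds$cut)
print(cut_counts)
```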

This histogram combines the two ideas above and gives an idea of how many diamonds had a certain cut at a specific carat.
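A combined view like this can be sketched by mapping cut to the fill colour of a carat histogram. The bin width and the 0.5-carat table bins below are assumptions chosen for readability.

```r
library(ggplot2)

# Carat histogram with each bar split by cut (stacked by default)
p_stack <- ggplot(diamonds, aes(x = carat, fill = cut)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Carat", y = "Number of diamonds", fill = "Cut")
print(p_stack)

# Numeric counterpart: carat binned in 0.5 steps, cross-tabulated with cut.
# base::cut() is the binning function, diamonds$cut is the column.
carat_bins <- base::cut(diamonds$carat, breaks = seq(0, 5.5, by = 0.5))
bin_counts <- table(carat_bins, diamonds$cut)
print(bin_counts)
```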

A simple plot of price vs. carat with color based on cut shows us that price tends to increase with carat as we might expect.
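A scatter plot of this kind could look roughly like the following. The transparency setting is an assumption to cope with overplotting, and the rank correlation at the end is an extra numeric check that was not part of the original analysis.

```r
library(ggplot2)

# Price against carat, one point per diamond, coloured by cut
p_scatter <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.3) +
  labs(x = "Carat", y = "Price (USD)", color = "Cut")
print(p_scatter)

# Spearman rank correlation: close to 1 means price rises almost
# monotonically with carat
rho <- cor(diamonds$carat, diamonds$price, method = "spearman")
print(rho)
```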

The above output gives us a summary of diamond carats separated by cut. We can see that ‘Fair’ cut diamonds tended to be the largest.
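One way to get a per-cut summary like this is base R's tapply(). This is a sketch of one option; by() or dplyr would work just as well, and the variable names are made up for illustration.

```r
library(ggplot2)  # for the diamonds data

# Six-number summary of carat within each cut
carat_summary <- tapply(diamonds$carat, diamonds$cut, summary)
print(carat_summary)

# Mean carat per cut, to see which cut tends to be largest
mean_carat <- tapply(diamonds$carat, diamonds$cut, mean)
print(round(mean_carat, 3))
```

Swapping `diamonds$price` in for `diamonds$carat` gives the matching summary of prices by cut.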

This is a summary of diamond prices separated by cut.
Now I’ll fit a few linear models to the data.

The first model has ‘price’ as the response variable with ‘carat’ and ‘cut’ as the two predictors. The second model is the same except that it also includes the interaction between ‘carat’ and ‘cut’.
My static blogging framework Ruhoh messes up when I try to paste the summary outputs, but if you load up R you can confirm that these two linear models are almost identical. Judging by the R² values and residual standard errors, the interaction model does look a tiny bit better, but the improvement is too small to be conclusive.
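If you want to reproduce the comparison yourself, the two models described above can be sketched as follows. The object names (`fit_add`, `fit_int`) and the helper `fit_stats` are hypothetical, chosen just for this example.

```r
library(ggplot2)  # for the diamonds data

# Additive model: price explained by carat and cut
fit_add <- lm(price ~ carat + cut, data = diamonds)

# Interaction model: carat * cut expands to carat + cut + carat:cut
fit_int <- lm(price ~ carat * cut, data = diamonds)

# cut is an ordered factor, so R codes it with polynomial contrasts
# (cut.L, cut.Q, ...) rather than one dummy variable per level.

# Pull out the fit statistics discussed in the text
fit_stats <- function(fit) {
  s <- summary(fit)
  c(r_squared = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    resid_se = s$sigma)
}

comparison <- rbind(additive = fit_stats(fit_add),
                    interaction = fit_stats(fit_int))
print(comparison)
```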
Next we want to take a look at the residuals to see if they are correlated, have nonconstant variance, or are not normally distributed:

In general, the residuals are poorly behaved for both models and look correlated. This violates vital assumptions we make when constructing a linear model. The Q-Q plots of these two models confirm the non-normality of the errors.
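Diagnostics like these can be sketched with base graphics. The example below uses the additive model only, and the same calls work for the interaction model. The plot layout and point style are assumptions.

```r
library(ggplot2)  # for the diamonds data

fit_add <- lm(price ~ carat + cut, data = diamonds)
res <- resid(fit_add)

op <- par(mfrow = c(1, 2))  # two panels side by side

# Residuals vs fitted values: a well-behaved model shows a
# structureless band around zero
plot(fitted(fit_add), res, pch = ".",
     xlab = "Fitted price", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points should follow the reference line
# if the residuals are roughly normal
qqnorm(res, pch = ".")
qqline(res)

par(op)  # restore the previous graphics settings
```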

Next we can explore some transformations that will hopefully give a better fit.

The plot of log(price) vs. log(carat) (both base 10) shows a much stronger linear relationship than the untransformed variables do.
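A rough sketch of the transformed plot follows, along with a quick numeric check of the "stronger linear relationship" claim using Pearson correlations. The correlation comparison is an addition for illustration and was not part of the original analysis.

```r
library(ggplot2)

# Same scatter plot as before, but on log10 scales for both variables
p_log <- ggplot(diamonds, aes(x = log10(carat), y = log10(price), color = cut)) +
  geom_point(alpha = 0.2) +
  labs(x = "log10(carat)", y = "log10(price)", color = "Cut")
print(p_log)

# Pearson correlation measures linear association, so a larger value
# after transforming supports the visual impression
r_raw <- cor(diamonds$carat, diamonds$price)
r_log <- cor(log10(diamonds$carat), log10(diamonds$price))
print(c(raw = r_raw, log10 = r_log))
```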
I fit a model with log10(price) as the response and log10(carat) as the predictor, along with ‘cut’ and its interaction terms.

Once again the summary function messes up Ruhoh, so take my word for the following: the higher R² value of this model compared to the one with untransformed variables gives some evidence that it is indeed a better fit. The residual standard error is also tiny, but it is measured in log10(price) units, so it can’t be compared directly with the untransformed model’s.
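Rather than taking anyone's word for it, the fit statistics can be pulled out directly. This is a sketch assuming the log-log model is specified as described above; the object names are hypothetical.

```r
library(ggplot2)  # for the diamonds data

# Log-log model with the carat-by-cut interaction
fit_log <- lm(log10(price) ~ log10(carat) * cut, data = diamonds)
s_log <- summary(fit_log)

# Untransformed interaction model, for comparison
s_raw <- summary(lm(price ~ carat * cut, data = diamonds))

# R-squared is unitless, so the two values can be set side by side
print(c(raw = s_raw$r.squared, log10 = s_log$r.squared))

# Residual standard errors are in different units (dollars vs log10 dollars),
# so these two numbers are NOT directly comparable
print(c(raw_usd = s_raw$sigma, log10_units = s_log$sigma))
```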

The residuals now seem to have almost constant variance as well as no correlation. The normality of the residuals is supported by the Q-Q plot, which shows a strong linear pattern.
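The same two diagnostic plots can be drawn for the log-log model. As before, this is only a sketch with assumed plot settings.

```r
library(ggplot2)  # for the diamonds data

fit_log <- lm(log10(price) ~ log10(carat) * cut, data = diamonds)
res_log <- resid(fit_log)

op <- par(mfrow = c(1, 2))

# Look for a band of roughly constant width around zero
plot(fitted(fit_log), res_log, pch = ".",
     xlab = "Fitted log10(price)", ylab = "Residual",
     main = "Residuals vs fitted (log-log)")
abline(h = 0, lty = 2)

# Points near the line suggest approximately normal residuals
qqnorm(res_log, pch = ".")
qqline(res_log)

par(op)
```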

Last bit of code to show that log(price) and log(carat) exhibit a nice linear relationship:
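A sketch of what such a plot could look like, with a least-squares line overlaid. The simple one-predictor fit here (no cut terms) is used only to draw and report a single slope; it is an assumption, not the model fitted earlier.

```r
library(ggplot2)

# Log-log scatter with a straight-line fit drawn on top
p_line <- ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "log10(carat)", y = "log10(price)")
print(p_line)

# Slope of that line: the approximate percentage change in price
# for a one percent change in carat
fit_simple <- lm(log10(price) ~ log10(carat), data = diamonds)
slope <- unname(coef(fit_simple)[2])
print(slope)
```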

Now that we’ve built a pretty good linear model to predict log10(price) based on a few other variables, we can back-transform (raise 10 to the predicted value) if needed to predict actual price.
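As a small worked example of back-transforming, the sketch below predicts the price of a made-up 1-carat ‘Ideal’ diamond. The stone and the object names are hypothetical. Keep in mind that 10 raised to the predicted log is closer to a typical (median-like) price than to the mean price, since the mean of a log is not the log of the mean.

```r
library(ggplot2)  # for the diamonds data

fit_log <- lm(log10(price) ~ log10(carat) * cut, data = diamonds)

# A hypothetical stone; cut must be an ordered factor with the same
# levels as in the training data
new_stone <- data.frame(
  carat = 1,
  cut = factor("Ideal", levels = levels(diamonds$cut), ordered = TRUE)
)

# Prediction on the log10 scale, then back to dollars
pred_log <- predict(fit_log, newdata = new_stone)
pred_price <- unname(10^pred_log)
print(pred_price)
```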
This was a pretty lengthy one, hope you enjoyed it!