Introducing FlyVis

After a couple of all-nighters, we're finally done with our undergraduate statistics thesis. The abstract provides a brief overview of what we were trying to accomplish:

We explore the possibility of improving data analysis through the use of interactive visualization. Exploration of data and models is an iterative process. We hypothesize that dynamic, interactive visualizations greatly reduce the cost of new iterations and thus facilitate agile investigation and rapid prototyping. Our web-application framework, flyvis.com, offers evidence for such a hypothesis for a dataset consisting of airline on-time flight performance between 2006-2008. Utilizing our framework we are able to study the feasibility of modeling subsets of flight delays from temporal data, which fails on the full dataset.

Technically, this was a very fun project. Shiny is an extremely powerful package which provides the interactive framework necessary to build such applications. We also made use of the JavaScript library leaflet.js for the interactive map. All in all, I learned quite a bit about writing efficient R code, as the dataset we were using had over 18 million observations.
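To give a flavor of what "efficient" meant at this scale (this is an illustration, not our exact code): with ~18 million rows, base `read.csv` becomes painfully slow, and switching to `data.table::fread` plus keyed subsetting was the kind of change that made iteration bearable. The file name below is a placeholder.

```r
library(data.table)

# Placeholder file name -- the on-time performance CSV from the BTS.
# fread parses the file far faster than read.csv and returns a data.table.
flights <- fread("On_Time_Performance.csv")

# Setting a key sorts the table by Origin, so per-airport subsets
# (which back every FlyVis plot) become fast binary searches
# instead of full scans over 18M rows.
setkey(flights, Origin)
mem <- flights["MEM"]   # all flights departing Memphis
```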

To learn more about the app check out the projects page or the actual application website FlyVis.com.

FlyVis lets you dynamically explore the airport on-time dataset, which yields some pretty interesting graphs. For example, if we look at the intraday distribution of flights and delays for Memphis:

we see a pretty interesting pattern. It turns out that FedEx shipments account for most of the flights out of Memphis, which gives the distribution this unique shape.

The beauty of an interactive application is that you (the reader) can discover something that I haven't even considered. I merely provide the tools; you get to explore. If anybody finds a cool pattern at a particular airport, I'd love to hear about it by e-mail or in the comments.

Once we polish the application a bit more, we will release the source code on GitHub. A disclaimer: the site initially takes a minute or two to load, since our server has to read the massive dataset into memory, and each plot takes 5-6 seconds to generate. Both delays are due to the size of our data and are something we are currently trying to optimize.

Finally, what would this post be without some R code? Here’s what we used for the Calendar Heatmap plot:

myCalPlot <- function(dates, values) {
  # || (not &) so we stop if *either* package fails to load
  if (!require(ggplot2) || !require(plyr))
    stop("The packages ggplot2 and plyr are required to use myCalPlot")

  # projectDate() is our pre-processing helper (year/month/weekday columns)
  tp <- projectDate(dates, drop = FALSE)
  tp$values <- values
  tp$week <- as.numeric(format(dates, "%W"))
  tp$month <- factor(tp$month, levels = as.character(1:12),
                     labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
                     ordered = TRUE)

  # Week-of-month index: renumber each month's weeks starting from 1
  tp <- ddply(tp, .(year, month), transform,
              monthweek = 1 + week - min(week))

  ggplot(tp, aes(monthweek, weekday, fill = values)) +
    geom_tile(colour = "white") +
    facet_grid(. ~ month) +
    scale_fill_gradientn(colours = rev(c("#D61818", "#FFAE63",
                                         "#FFFFBD", "#B5E384"))) +
    theme_bw()
}

This doesn’t include the pre-processing, but that will be out in early 2014.
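Until the source is released, readers who want to try the function will be missing the `projectDate()` helper it calls. Here is a rough stand-in based on what the plotting code appears to need (year, month, and weekday columns from a `Date` vector) — treat it as my guess, not our actual pre-processing:

```r
# Hypothetical stand-in for the projectDate() helper used by myCalPlot:
# it projects a vector of Dates onto calendar components. The drop
# argument mirrors the drop = FALSE call above and is ignored here.
# Note: format(dates, "%a") gives locale-dependent weekday labels.
projectDate <- function(dates, drop = FALSE) {
  data.frame(
    year    = as.numeric(format(dates, "%Y")),
    month   = as.numeric(format(dates, "%m")),
    weekday = factor(format(dates, "%a"),
                     levels = rev(c("Mon", "Tue", "Wed", "Thu",
                                    "Fri", "Sat", "Sun")),
                     ordered = TRUE)
  )
}

# Example: a calendar heatmap of a year of (random) daily values
dates  <- seq(as.Date("2007-01-01"), as.Date("2007-12-31"), by = "day")
values <- rnorm(length(dates))
myCalPlot(dates, values)
```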