I always get giddy when I can apply real statistics and math to problems in my life. Recently, I had an opportunity to apply the ‘Taxicab Problem’ to something that came up at work. Given that I work for a ridesharing platform and I was quite literally counting “taxis” (or at least cars meant to drive others around), this was doubly exquisite.

For the uninitiated, the Taxicab / German Tank problem is as follows:

Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?

This was also applied to counting German tanks in World War II to decide when/if to attack. Statistical methods ended up being accurate to within a few tanks (on a scale of 200-300), while “intelligence” (unintelligence) operations overestimated numbers by about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).

Somebody suggested using $2\bar{x}$ where $\bar{x}$ is the mean of the current observations. This made me feel all sorts of weird inside but I couldn’t quite put my finger on what was immediately wrong with it (outside of the glaringly obvious, touched upon shortly). On one hand, it makes some sense intuitively. If the distribution of the numbers you see is random, you can be somewhat sure that the mean of your subset is approximately the mean of the true distribution. Then you simply multiply by 2 to get the max since the distribution is uniform. However, this doesn’t take into account that $2\bar{x}$ might be lower than $m$, where $m$ is the max number you’ve seen so far. An easy fix to this method is to simply use $\max(m, 2\bar{x})$.
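Here’s a quick simulation sketch (my own, not from anything above) comparing the plain $2\bar{x}$ rule with the $\max(m, 2\bar{x})$ fix; the true fleet size and the number of observed cabs are arbitrary choices for illustration:

```r
# Simulate observing k cab numbers out of N (without replacement) and
# compare the two rules. N and k are made-up values for the example.
set.seed(42)
N <- 1000
k <- 5

sims <- replicate(10000, {
  x <- sample.int(N, k)
  c(two_xbar = 2 * mean(x),
    with_max = max(max(x), 2 * mean(x)))
})

rowMeans(sims)      # average estimate under each rule
apply(sims, 1, sd)  # spread of each estimator
```

By construction, the second rule never falls below the largest number already seen.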

Read on →

bayesAB 0.7.0

Quick announcement that my package for Bayesian AB Testing, bayesAB, has been updated to 0.7.0 on CRAN. Some improvements on the backend as well as a few tweaks for a more fluid UX/API. Some links:

Now, on to the good stuff.

Why should we care about priors?

Most questions I’ve gotten since I released bayesAB have been along the lines of:

  • Why/how is Bayesian AB testing better than Frequentist hypothesis AB testing?
  • Why do I need priors?
  • Do I really really really need priors?
  • How do I choose priors?

Question 1 has a few objective and a few subjective answers. The main benefits are ones I’ve already highlighted in the README/vignette of the bayesAB package. To briefly summarize, we get direct probabilities for A > B (rather than p-values) and distributions over the parameter estimates rather than point estimates. Finally, we can also leverage priors, which help with the low-sample-size and low-base-rate problems.

To start, let’s go back to what a prior actually is in a Bayesian context. There are countless mathematical resources out there (including part of my previous blog post) so I’ll only talk about this conceptually. Simply put, a prior lets you specify some sort of, ahem, prior information about a certain parameter so that the end posterior on that parameter encapsulates both the data you saw and the prior you put in. Priors can come from a variety of places including past experiments, literature, and domain expertise in the problem. See this blogpost for a great example of somebody combining their own past data and the literature to form very strong priors.
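As a concrete (if simplified) illustration in base R, here is how a Beta prior on a conversion rate combines with Bernoulli data into a Beta posterior; the prior parameters and simulated data below are made up for the example:

```r
# Conjugate Beta-Bernoulli update: the posterior is
# Beta(alpha + successes, beta + failures). Numbers here are illustrative.
prior_alpha <- 65
prior_beta  <- 200

set.seed(1)
obs <- rbinom(50, 1, 0.25)  # 50 simulated conversion outcomes

post_alpha <- prior_alpha + sum(obs)
post_beta  <- prior_beta + length(obs) - sum(obs)

curve(dbeta(x, post_alpha, post_beta), from = 0, to = 1,
      xlab = "conversion rate", ylab = "density")              # posterior
curve(dbeta(x, prior_alpha, prior_beta), add = TRUE, lty = 2)  # prior (dashed)
```

With only 50 observations, the posterior is still pulled heavily toward the prior; as more data comes in, the likelihood takes over.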

Read on →

This is a blog post to showcase my new R package, bayesAB. Below is the knitted HTML of the introduction.Rmd vignette of the package. Formatting/colors may be a bit off since the vignette has its own styling. Check the full vignette here. Check it out on Github here. A newer version has since been released.




Introduction to bayesAB

Frank Portman - fportman.com - frank1214@gmail.com

Most A/B test approaches are centered around frequentist hypothesis tests that produce a single hard-to-interpret value: the p-value, the probability of seeing data at least as extreme as what was observed if the null hypothesis were true. Oftentimes, the statistician or data scientist laying the groundwork for the A/B test will have to run a power analysis to determine sample size and then interface with a Product Manager or Marketing Exec to relay the results. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as A/B testing with informative priors and the ability to inspect an entire distribution over a parameter, not just a point estimate.

Enter Bayesian A/B testing.

Bayesian methods provide several benefits over frequentist methods in the context of A/B tests - namely in interpretability. Instead of p-values you get direct probabilities on whether A is better than B (and by how much). Instead of point estimates your posterior distributions are parametrized random variables which can be summarized any number of ways. Bayesian tests are also immune to ‘peeking’ and are thus valid whenever a test is stopped.

This document is meant to provide a brief overview of the bayesAB package with a few usage examples. A basic understanding of statistics (including Bayesian) and A/B testing is helpful for following along.

Methods

Unlike a frequentist method, in a Bayesian approach you first encapsulate your prior beliefs mathematically. This involves choosing a distribution over which you believe your parameter might lie. As you expose groups to different tests, you collect the data and combine it with the prior to get the posterior distribution over the parameter(s) in question. Mathematically, you are looking for P(parameter | data), which is a combination of the prior, P(parameter), and the likelihood of the data you observed (the math, while relatively straightforward, is outside the scope of this brief intro).
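For reference, here is a minimal usage sketch along the lines of the package’s Bernoulli example (data simulated; the prior values are illustrative rather than recommendations):

```r
library(bayesAB)

# Simulated conversion data for two variants
A <- rbinom(250, 1, 0.25)
B <- rbinom(250, 1, 0.20)

# Beta prior on each variant's conversion rate, then fit the test
ab <- bayesTest(A, B,
                priors = c('alpha' = 65, 'beta' = 200),
                distribution = 'bernoulli')

summary(ab)  # P(A > B), credible intervals, and posterior expected loss
plot(ab)     # prior, posterior, and Monte Carlo 'samples' plots
```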

Read on →

My New Year’s resolution is to make more than one blog post in 2016. I’m halfway to my minimum goal as of January 2nd so things are looking good.

Background

Twitter released a new R package earlier this year named AnomalyDetection (link to Github). The GitHub page goes into a bit more detail, but at a high level it uses a Seasonal Hybrid ESD (S-H-ESD), which is built upon the Generalized ESD (Extreme Studentized Deviate) test for outliers. The S-H-ESD is particularly noteworthy since it can detect both local and global outliers. That is, it can detect outliers within local short-term seasonal trends, as well as global outliers that fall far above or below all other values.

I’ve been dealing with a similar problem as part of my personal work (and FT job as it happens) so this package came up alongside the literature.

Improvements

One area in which this package is lacking is data visualization, as we’ll see below. Since the purpose of this package is to find anomalies in time-series data, dygraphs jumps to mind (a rough sketch of combining the two follows the list below). A few of the benefits and features of dygraphs:

  • Fully reactive JavaScript (sexy web graphics, etc.)
  • Can add a bunch of widgets without needing to throw together a full-fledged Shiny app (or other)
  • Far greater control over the exploration of time-series data. For example, you can drag vertical or horizontal sections to zoom in on the plot, and double-click to reset.
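Here’s a rough sketch of the pairing I have in mind, using the `raw_data` example data frame that ships with AnomalyDetection; the column handling follows its README example and may need tweaking for your own data:

```r
library(AnomalyDetection)
library(dygraphs)
library(xts)

data(raw_data)  # example data frame: timestamp, count

# Detect anomalies, skipping the package's static ggplot output
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = 'both', plot = FALSE)

# Interactive dygraph of the series with a range selector
series <- xts(raw_data[, 2], order.by = as.POSIXct(raw_data[, 1]))
g <- dygraph(series, main = "S-H-ESD anomalies") %>% dyRangeSelector()

# Mark each detected anomaly with a dashed vertical event line
for (i in seq_len(nrow(res$anoms))) {
  g <- dyEvent(g, res$anoms$timestamp[i], label = "anomaly",
               labelLoc = "bottom")
}
g
```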
Read on →

Let’s take $n$ distinct points on the real line:

$$t_1, t_2, \ldots, t_n$$

Yay. We can now define the Lagrange polynomials:

$$p_k(t) = \prod_{j \ne k} \frac{t - t_j}{t_k - t_j}, \qquad k = 1, \ldots, n$$

Why am I making you look at this beauté? Turns out there are some neat mathematical properties here - namely in the subject of polynomial interpolation. That’s a fancy way of saying ‘fit a polynomial through certain points’ but we’ll get to that.

Now, it’s easy to see that

$$p_k(t_j) = \begin{cases} 1 & k = j \\ 0 & k \ne j \end{cases}$$

since for $k = j$ every term in the product will be 1 ($t = t_k$ in each factor), while for $k \ne j$ there will be one term with $t_j - t_j$ in the numerator (rendering the whole product 0). Keep in mind that the product above skips the term where $j = k$; otherwise we’d have a division-by-zero term.

See where this is going? Let’s introduce $y_1, …, y_n$ as arbitrary real numbers. How would we construct a polynomial that goes through the $(t_i, y_i)$ points? That is, can we find a polynomial $p$ such that $p(t_j) = y_j$ for each index $j = 1, …, n$? Hint: I didn’t write all that stuff up there for no reason.

Given what we know about $p_k(t_j)$ we can do the following:

$$p(t) = \sum_{k=1}^{n} y_k \, p_k(t)$$
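To make this concrete, here’s a small R sketch (my own, not from the original post) that evaluates the interpolant at new points:

```r
# Evaluate the Lagrange interpolant through (t_i, y_i) at points t_new.
lagrange_interp <- function(t, y, t_new) {
  n <- length(t)
  sapply(t_new, function(x) {
    # p_k(x) = product over j != k of (x - t_j) / (t_k - t_j)
    basis <- sapply(seq_len(n), function(k) {
      prod((x - t[-k]) / (t[k] - t[-k]))
    })
    sum(y * basis)  # p(x) = sum_k y_k * p_k(x)
  })
}

# Example: interpolate sin() through 5 points and check a few values
t <- seq(0, pi, length.out = 5)
y <- sin(t)
lagrange_interp(t, y, c(0.5, 1.5, 2.5))
sin(c(0.5, 1.5, 2.5))  # close, as expected
```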

Read on →

After a couple all-nighters we’re finally done with our undergraduate statistics thesis. The abstract provides a brief overview of what we were trying to accomplish:

We explore the possibility of improving data analysis through the use of interactive visualization. Exploration of data and models is an iterative process. We hypothesize that dynamic, interactive visualizations greatly reduce the cost of new iterations and thus facilitate agile investigation and rapid prototyping. Our web-application framework, flyvis.com, offers evidence for such a hypothesis for a dataset consisting of airline on-time flight performance between 2006-2008. Utilizing our framework we are able to study the feasibility of modeling subsets of flight delays from temporal data, which fails on the full dataset.

Technically, this was a very fun project. Shiny is an extremely powerful package which provides the interactive framework necessary to build such applications. We also made use of the JavaScript library leaflet.js for the interactive map. All in all, I learned quite a bit about writing efficient R code, as the dataset we were using had over 18 million observations.

To learn more about the app check out the projects page or the actual application website FlyVis.com.

FlyVis lets you dynamically explore the airline on-time dataset, which yields some pretty interesting graphs. For example, if we look at the intraday distribution of flights and delays for Memphis:

Read on →

Although virtually obsolete, Roman Numerals are subtly embedded into our culture. From the Super Bowl and Olympics to royal titles, Roman Numerals refuse to be fully extinguished from our everyday lives. And that’s not without reason. All numbers are beautiful, and Roman Numerals are no exception, even if they are written a little differently from their Arabic counterparts.

In this post, we’ll examine some fascinating properties of Roman Numerals - namely the lengths of Roman Numerals in succession.

First, we define a simple Arabic-to-Roman Numeral converter. Start by creating two vectors, one for the 13 Roman symbols and another for their Arabic counterparts. Next, a simple for/while combination iterates through the vectors and chooses the appropriate Roman symbols while iteratively decreasing the input value.
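Here’s a minimal sketch of that approach (my own rendering of the described logic, not necessarily the post’s exact code):

```r
# Greedy Arabic-to-Roman conversion using the 13 symbol/value pairs
arabic <- c(1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1)
roman  <- c("M", "CM", "D", "CD", "C", "XC", "L", "XL",
            "X", "IX", "V", "IV", "I")

to_roman <- function(n) {
  out <- ""
  for (i in seq_along(arabic)) {
    while (n >= arabic[i]) {        # take the largest value that still fits
      out <- paste0(out, roman[i])
      n   <- n - arabic[i]
    }
  }
  out
}

to_roman(1987)  # "MCMLXXXVII"
```

From there, something like nchar(to_roman(n)) over a range of n gives the numeral lengths examined in the rest of the post.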

Read on →

Many of my research and personal interests lie in the realm of Machine Learning for several reasons. To me, it is the perfect blend of mathematics, statistics, and computer science. Also, it is extremely pervasive in today’s society. Everything from improved web-search to self-driving cars can be attributed to developments in Machine Learning.

For one of my computational finance classes, I attempted to implement a Machine Learning algorithm in order to predict stock prices, namely S&P 500 Adjusted Close prices. In order to do this, I turned to Artificial Neural Networks (ANN) for a plethora of reasons. ANNs have been known to work well for computationally intensive problems where a user may not have a clear hypothesis of how the inputs should interact. As such, ANNs excel at picking up hidden patterns within the data so well that they often overfit!

Keeping this in mind, I experimented with a technique known as a ‘sliding window’. Rather than train the model on years of S&P 500 data, I created an ANN that trains over the past 30 days (t-30, …, t) to predict the close price at t+1. A 30-day sliding window seemed like a good fit: it wasn’t so wide that it failed to capture the current market atmosphere, but it also wasn’t so narrow as to be hypersensitive to recent erratic movements.
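As a rough sketch of the sliding-window idea (my own simplification using the nnet package with lagged closes as inputs; the post’s actual architecture and features may well differ):

```r
library(nnet)

# Fit a small neural net on the last `window` closes and forecast t + 1.
# `prices` is assumed to be a numeric vector of adjusted close prices;
# in practice you'd want to scale or difference the inputs first.
predict_next <- function(prices, t, window = 30, lags = 5) {
  w <- prices[(t - window + 1):t]
  # Build (lags -> next value) training pairs; embed() puts the most
  # recent value in column 1.
  E <- embed(w, lags + 1)
  y <- E[, 1]
  x <- E[, -1, drop = FALSE]
  fit <- nnet(x, y, size = 3, linout = TRUE, trace = FALSE)
  newest <- matrix(rev(tail(w, lags)), nrow = 1)  # lag vector for t + 1
  as.numeric(predict(fit, newest))
}

# Example with a simulated price series
set.seed(1)
prices <- 100 + cumsum(rnorm(100))
predict_next(prices, t = 60)
```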

Read on →

I’m gonna share a short code snippet that I thought was interesting. This post is inspired by one of my engineering computation classes at Rice. The program initializes a pair of coordinates ‘z’ and iteratively updates z by matrix multiplication, with the transformation chosen by random number generation at each step. After each successive update, the new ‘z’ is plotted.
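The post’s actual matrices and probabilities aren’t shown in this preview, so the snippet below is a generic stand-in (a Sierpinski-triangle-style iterated function system) that follows the same structure: pick a transformation at random, update z by matrix multiplication, and record the new point (here all points are plotted at the end rather than one at a time, for speed):

```r
set.seed(1)
n <- 20000
z <- c(0, 0)
pts <- matrix(NA_real_, nrow = n, ncol = 2)

A <- 0.5 * diag(2)                             # contraction matrix
vertices <- list(c(0, 0), c(1, 0), c(0.5, 1))  # points to pull toward

for (i in seq_len(n)) {
  v <- vertices[[sample(3, 1)]]       # random choice of transformation
  z <- as.vector(A %*% z + 0.5 * v)   # matrix update of the coordinates
  pts[i, ] <- z
}

plot(pts, pch = ".", asp = 1, xlab = "", ylab = "")
```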

Read on →

I was inspired by a few animated GIFs that I saw recently, so I decided to make one of my own. For this project, I sought out a way to effectively visualize how McDonald’s expanded throughout the world. To do this, I created a heatmap of the world and, using animation, I was able to efficiently map out how McDonald’s became more popular over time.

The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R, simply because some of the countries’ spellings in the article did not match what is used in the R ‘maps’ package.
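For a flavor of the approach, here’s a minimal sketch using the ‘maps’ package; the `mcd` data frame (with `country` and `year_opened` columns) is hypothetical, standing in for the cleaned Wikipedia table:

```r
library(maps)

# Shade countries that have a McDonald's by a given year; one call per
# animation frame. `mcd` is a hypothetical data frame whose `country`
# values are spelled to match map("world") region names.
draw_frame <- function(mcd, year) {
  map("world", fill = TRUE, col = "grey90", border = "grey70")
  opened <- mcd$country[mcd$year_opened <= year]
  if (length(opened) > 0) {
    map("world", regions = opened, fill = TRUE, col = "firebrick",
        add = TRUE)
  }
  title(main = paste("Countries with a McDonald's,", year))
}

# Write one PNG per year, then stitch the frames into a GIF externally
# for (yr in 1940:2013) {
#   png(sprintf("frame_%d.png", yr)); draw_frame(mcd, yr); dev.off()
# }
```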

Read on →