For the uninitiated, the Taxicab / German Tank problem is as follows:
Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?
This was also applied to counting German tanks in World War II to know when/if to attack. Statistical methods ended up being accurate to within a few tanks (on a scale of 200-300) while “intelligence” (unintelligence) operations overestimated the numbers by about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).
Somebody suggested using $2 * \bar{x}$ where $\bar{x}$ is the mean of the current observations. This made me feel all sorts of weird inside but I couldn’t quite put my finger on what was immediately wrong with this (outside of the glaringly obvious, touched upon shortly). On one hand, it makes some sense intuitively. If the distribution of the numbers you see is random, you can be somewhat sure that the mean of your subset is approximately the mean of the true distribution. Then you simply multiply by 2 to get the max since the distribution is uniform. However, this doesn’t take into account that $2 * \bar{x}$ might be lower than $m$ where $m$ is the max number you’ve seen so far. An easy fix to this method is to simply use $\max(m, 2 * \bar{x})$.
To be most correct, we can use the MVUE (Minimum Variance Unbiased Estimator). I won’t bore you with the derivation (it’s also on the Wiki), but it works out to be $m + \frac{m}{N} - 1$ where $m$ is again the max number you’ve seen and $N$ is the total number of observations. Let’s compare all 3 methods below:
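The post’s comparison was done in R; here is a rough Python sketch of the three estimators (the true max of 5000 mirrors the number discussed later, but the simulation setup and names are my own):

```python
import random

def estimates(obs):
    """Return the three estimators for a sample of observed serial numbers."""
    m = max(obs)
    n = len(obs)
    mean2 = 2 * sum(obs) / n   # naive doubled-mean estimator
    fixed = max(m, mean2)      # patched version: never below the observed max
    mvue = m + m / n - 1       # minimum variance unbiased estimator
    return mean2, fixed, mvue

random.seed(42)
true_max = 5000
obs = random.sample(range(1, true_max + 1), 100)  # observe 100 distinct taxis
print(estimates(obs))
```

For the toy sample [1, 2, 3, 4, 5] this returns (6.0, 6.0, 5.0).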
Let’s plot it:
Zoom in a bit:
And this one is just prettttyyyy:
As we can see, the green line converges to the true max of 5000 very quickly, and has far less variance than the two other methods. The modified method I proposed is just an asymmetric version of $2 * \bar{x}$, which approaches the true max but seems to level out in local minima/maxima at certain points. Ultimately, you can’t deny that the MVUE method is best.
Another approach we can take is a Bayesian one. Since I’d never pass up an opportunity to showcase my R package for Bayesian AB testing, bayesAB (GitHub repo here), let’s work through an example with bayesAB. This is actually an extra cool example since it showcases that while bayesAB is primarily meant to fit and compare two sets of posteriors, it can still be used to fit a single Bayesian model.
For brevity’s sake, we fit a Bayesian model using the Pareto distribution as the conjugate prior (if this is all a foreign language to you, definitely check out my other blog posts on bayesAB 1 2). We can use very small values for xm and alpha as diffuse priors so the model we build for the Uniform distribution is solely based on the data.
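For intuition, the conjugate update itself is simple enough to do by hand: a Pareto(xm, alpha) prior on the Uniform maximum becomes Pareto(max(xm, max(data)), alpha + n) after n observations. A hand-rolled Python sketch (the observation values are made up for illustration; bayesAB does this internally in R):

```python
def pareto_posterior(xm, alpha, data):
    """Conjugate update for a Uniform(0, theta) max under a Pareto(xm, alpha) prior."""
    return max(xm, max(data)), alpha + len(data)

def pareto_median(xm, alpha):
    """Median of a Pareto(xm, alpha) distribution: xm * 2^(1/alpha)."""
    return xm * 2 ** (1 / alpha)

# very small xm and alpha act as a diffuse prior, so the data dominate
xm_post, alpha_post = pareto_posterior(0.001, 0.001, [4800, 4975, 3120, 4500])
estimate = pareto_median(xm_post, alpha_post)  # sits a bit above the observed max
```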
As we can see, the posterior is fairly thin between 4980 and 5000. Taking the mean or median of this is a solid estimate for the maximum. Let’s see what the median says using the summary generic.
For our purposes, we can ignore everything besides the quantiles of the posteriors. Using this, we arrive at an estimate of 4983 as our max. Using $2 * \bar{x}$ for this same sample results in 5034 which is overshooting by quite a bit.
Quick announcement that my package for Bayesian AB Testing, bayesAB, has been updated to 0.7.0 on CRAN. Some improvements on the backend as well as a few tweaks for a more fluid UX/API.
Now, on to the good stuff.
Most questions I’ve gotten since I released bayesAB have been along the lines of:
Question 1 has a few objective and a few subjective answers to it. The main benefits are ones that I’ve already highlighted in the README/vignette of the bayesAB package. To briefly summarize, we get direct probabilities for A > B (rather than p-values) and distributions over the parameter estimates rather than point estimates. Finally, we can also leverage priors, which help with the low sample size and low base rate problems.
To start, let’s go back to what a prior actually is in a Bayesian context. There are countless mathematical resources out there (including part of my previous blog post) so I’ll only talk about this conceptually. Simply put, a prior lets you specify some sort of, ahem, prior information about a certain parameter so that the end posterior on that parameter encapsulates both the data you saw and the prior you inputted. Priors can come from a variety of places including past experiments, literature, and domain expertise in the problem. See this blogpost for a great example of somebody combining their own past data and literature to form very strong priors.
Priors can be weak or strong. The weakest prior will be completely objective and thus assign an equal probability to each value for the parameter. Examples of this include a Beta(1, 1) prior for the Bernoulli distribution. In these cases, the posterior distribution is completely reliant on the data. A strong prior will convey a very precise belief as to where a parameter’s values may lie. For example:
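The plotting snippet that followed is gone, but the contrast is easy to quantify: a flat Beta(1, 1) prior has mean .5 and a huge spread, while a strong prior such as Beta(300, 700) pins the parameter near .3. A Python sketch (the specific numbers are mine, for illustration; the post used R):

```python
from math import sqrt

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

weak_mean, weak_sd = beta_mean_sd(1, 1)          # flat prior: mean .5, sd ~ .29
strong_mean, strong_sd = beta_mean_sd(300, 700)  # concentrated prior: mean .3, sd ~ .014
```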
The stronger the prior, the more say it has in the posterior distribution. Of course, according to the Bernstein–von Mises theorem, the posterior is effectively independent of the prior once a large enough sample size has been reached for the data. How quickly this happens depends on the strength of your prior.
Do you need (weak/strong) priors? Not necessarily. You can still leverage the interpretability benefits of Bayesian AB testing even without priors. At worst, you’ll also get slightly more pertinent results since you can parametrize your metrics as the appropriate random variable. However, without priors of some kind (and to be clear, not random bullshit priors either) you run into similar issues as with Frequentist AB testing, namely Type 1 and Type 2 errors. A Type 1 error is calling one version better when it really isn’t, and a Type 2 error is calling a better version equal or worse. Both typically arise from low sample size/base rate and are controlled by reaching an appropriate sample size as per a power calculation.
Have no fear! Even without good and/or strong priors there are still ways to control for false positives and all that good stuff. We use something called Posterior Expected Loss, or “based on the current winner, what is the expected loss you would see should you choose wrongly”. If this value is lower than your threshold of caring (on the scale of abs(A - B)) then you can go ahead and call your test. This value implicitly encompasses the uncertainty about your posteriors.
Okay cool, that roughly answers Questions 1-4 in some order.
Let’s do a quick simulation to illustrate some of the above points. Let’s make three examples: weak priors, strong priors, and diffuse priors (quick tip: the Jeffreys prior of a Gamma distribution is Gamma(eps, eps) where eps is smallllll). We’ll be taking 2 x 100 samples from a Poisson distribution with the same $\lambda$ parameter for both groups. The strong and weak priors will be centered around this value of 2.3.
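The simulation code itself did not survive extraction; the mechanics can be sketched in Python (bayesAB does this in R). The Gamma-Poisson conjugate update and the Monte Carlo comparison below are the same idea, with my own names and a single pair of groups rather than the full loop:

```python
import math
import random

rng = random.Random(7)

def rpois(lam, n):
    """n Poisson(lam) draws via Knuth's algorithm (stdlib random has no Poisson)."""
    out = []
    for _ in range(n):
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                out.append(k)
                break
            k += 1
    return out

def prob_a_gt_b(a_data, b_data, shape, rate, n_draws=20000):
    """Gamma(shape, rate) prior + Poisson data -> Gamma posterior on lambda.
    Returns the Monte Carlo estimate of P(lambda_A > lambda_B)."""
    def draws(data):
        s, r = shape + sum(data), rate + len(data)
        return [rng.gammavariate(s, 1 / r) for _ in range(n_draws)]
    a, b = draws(a_data), draws(b_data)
    return sum(x > y for x, y in zip(a, b)) / n_draws

lam = 2.3
A, B = rpois(lam, 100), rpois(lam, 100)    # no true difference between groups
p_diffuse = prob_a_gt_b(A, B, 1e-3, 1e-3)  # diffuse prior, per the Gamma(eps, eps) tip
p_strong = prob_a_gt_b(A, B, 230, 100)     # strong prior centered at 2.3
# the strong prior shrinks both posteriors toward 2.3, pulling P(A > B)
# toward 0.5 and making a spurious "significant" call less likely
```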
Now, A and B shouldn’t have any difference between them, but occasionally we will see a Type 1 error. That’s what the bottom 3 lines are doing. If P(A > B) is <= 0.05 or >= 0.95 we call one of the recipes “significantly” better. Observe what happens with each case of prior.
The diffuse priors have the most Type 1 errors, followed by the weak priors, followed by the strong priors, as expected.
Finally, we can fit another bayesTest (:D) to determine whether the differences between Type 1 error percents across priors are different from one another.
As we can see, it’s somewhat clear that the diffuse prior is worse than the weak priors and very clear that it’s worse than the strong priors. Note that in our case I use a diffuse prior of Beta(1, 1) since I have no idea what’s normal going into this simulation.
Finally, we can check the output of summary to see if the Posterior Expected Loss is within our constraints.
If the Posterior Expected Loss is lower than our threshold for caring on abs(A - B) then we can call this test and accept the current results. The PEL is small in both cases, and possibly 0/NaN for t2, so it’s quite clear that priors, even weak ones, have a significant positive effect on Type 1 errors. Remember that we see this effect partially because our priors were of a similar shape to the data. If the priors and the data disagree, the effects might not be so clear-cut and you will need more data to have a stable posterior.
Check out bayesAB 0.7.0 to supercharge your AB tests and reach out to me if you have any questions!
Frank Portman | fportman.com | frank1214@gmail.com
Most A/B test approaches are centered around frequentist hypothesis tests used to come up with a point estimate (probability of rejecting the null) of a hard-to-interpret value. Oftentimes, the statistician or data scientist laying down the groundwork for the A/B test will have to run a power test to determine sample size and then interface with a Product Manager or Marketing Exec in order to relay the results. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as A/B testing given informative priors and the ability to inspect an entire distribution over a parameter, not just a point estimate.
Enter Bayesian A/B testing.
Bayesian methods provide several benefits over frequentist methods in the context of A/B tests, namely in interpretability. Instead of p-values you get direct probabilities on whether A is better than B (and by how much). Instead of point estimates, your posterior distributions are parametrized random variables which can be summarized any number of ways. Bayesian tests are also immune to ‘peeking’ and are thus valid whenever a test is stopped.
This document is meant to provide a brief overview of the bayesAB package with a few usage examples. A basic understanding of statistics (including Bayesian) and A/B testing is helpful for following along.
Unlike a frequentist method, in a Bayesian approach you first encapsulate your prior beliefs mathematically. This involves choosing a distribution over which you believe your parameter might lie. As you expose groups to different tests, you collect the data and combine it with the prior to get the posterior distribution over the parameter(s) in question. Mathematically, you are looking for P(parameter | data), which is a combination of the prior and the likelihood of the data (the math, while relatively straightforward, is outside of the scope of this brief intro).
As mentioned above, there are several reasons to prefer Bayesian methods for A/B testing (and other forms of statistical analysis!). First of all, interpretability is everything. Would you rather say “P(A > B) is 10%”, or “Assuming the null hypothesis that A and B are equal is true, the probability that we would see a result this extreme in A vs B is equal to 3%”? I think I know my answer. Furthermore, since we get a probability distribution over the parameters of the distributions of A and B, we can say something such as “There is a 74.2% chance that A’s $\lambda$ is between 3.7 and 5.9.” directly from the methods themselves.
Secondly, by using an informative prior we alleviate many common issues in regular A/B testing. For example, repeated testing is an issue in A/B tests. This is when you repeatedly calculate the hypothesis test results as the data comes in. In a perfect world, if you were trying to run a Frequentist hypothesis test in the most correct manner, you would use a power calculation to determine sample size and then not peek at your data until you hit the amount of data required. Each time you run a hypothesis test calculation, you incur a probability of a false positive. Doing this repeatedly makes the possibility of any single one of those ‘peeks’ being a false positive extremely likely. An informative prior means that your posterior distribution should make sense any time you wish to look at it. If you ever look at the posterior distribution and think “this doesn’t look right!”, then you probably weren’t being fair with yourself and the problem when choosing priors.
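To see why peeking is so dangerous, treat each peek (pessimistically) as an independent test at the 5% level; the chance of at least one false positive grows quickly. Real peeks are correlated, so this toy calculation overstates the rate, but the direction holds:

```python
def worst_case_false_positive(alpha, peeks):
    """Chance of at least one false positive across `peeks` independent
    level-alpha hypothesis tests."""
    return 1 - (1 - alpha) ** peeks

# one look: 5%; five looks: ~23%; twenty looks: ~64%
rates = [worst_case_false_positive(0.05, k) for k in (1, 5, 20)]
```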
Furthermore, an informative prior will help with the low base-rate problem (when the probability of a success or observation is very low). By indicating this in your priors, your posterior distribution will be far more stable right from the outset.
One criticism of Bayesian methods is that they are computationally slow or inefficient. bayesAB leverages the notion of conjugate priors to sample from analytical distributions very quickly (1e6 samples in <1s).
We’ll walk through two examples: one Bernoulli random variable modeling click-through rate onto a page, and one Poisson random variable modeling the number of selections one makes once on that page. We will also go over how to combine these into a more arbitrary random variable.
Let’s say we are testing two versions of Page 1, to see the CTR onto Page 2. For this example, we’ll just simulate some data with the properties we desire.
Of course, we can see the probabilities we chose for the example, but let’s say our prior knowledge tells us that the parameter p in the Bernoulli distribution should roughly fall over the .2-.3 range. Let’s also say that we’re very sure about this prior range and so we want to choose a pretty strict prior. The conjugate prior for the Bernoulli distribution is the Beta distribution (see ?bayesTest for more info).
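One way to sanity-check a “strict” Beta prior over the .2-.3 range is to simulate from it and measure how much mass actually lands there. Beta(65, 195) is an illustrative choice of mine with mean .25 (not necessarily the post’s exact prior), sketched in Python:

```python
import random

rng = random.Random(0)
alpha, beta = 65, 195  # mean .25; illustrative values, not the post's exact prior
draws = [rng.betavariate(alpha, beta) for _ in range(100_000)]
mean = sum(draws) / len(draws)
in_range = sum(0.2 <= d <= 0.3 for d in draws) / len(draws)  # roughly 94% of the mass
```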
Now that we’ve settled on a prior, let’s fit our bayesTest object.
bayesTest objects come coupled with print, plot and summary generics. Let’s check them out:
print talks about the inputs to the test, summary will do a P((A - B) / B > percentLift) calculation and a credible interval on (A - B) / B, and plot will plot the priors, posteriors, and the Monte Carlo ‘integrated’ samples.
Now we are on Page 2. On Page 2 you have any number of ‘interactions’ you can make (being vague is fun). Let’s say we wish to parametrize the amount of ‘interactions’ a user has by the Poisson distribution. Let’s also say our priors would have us believe that the bulk of users will make between 5-6 interactions but we aren’t too sure on that number so we will allow a reasonable probability for other values. The conjugate prior for the Poisson distribution is the Gamma distribution.
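A quick way to check that a Gamma prior matches that belief is to look at its mean and spread. For a Gamma(shape, rate) parametrization, Gamma(11, 2) (my illustrative numbers, not necessarily the post’s) has mean 5.5 with enough spread to leave room for other values:

```python
from math import sqrt

def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of Gamma(shape, rate)."""
    return shape / rate, sqrt(shape) / rate

mean, sd = gamma_mean_sd(11, 2)  # mean 5.5, sd ~ 1.66: bulk at 5-6, but not dogmatic
```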
Let’s fit our bayesTest object in a similar manner.
Another feature of bayesAB is the ability to decompose your end distribution into a series of intermediate distributions which are easier to parametrize. For example, let’s take the above example and say we want to test the effect of JUST Page 1 on Page 2’s interactions. Sure, we can try to come up with a way to parametrize the behaviors on Page 2 in the context of the conversion from Page 1, but isn’t it easier to encapsulate both parts as their own random variables, with their own informed priors from past traffic data? Using the combine function in bayesAB we can make this possible. Let’s consider the same test objects we have already fit. A combined object would look like this:
Small note: the magrittr example may not look very elegant, but it comes in handy when chaining together more than just 2 bayesTests.
For the combined distribution, we use the * function (multiplication) since each value of the Poisson distribution for Page 2 is multiplied by the corresponding probability of landing on Page 2 from Page 1 in the first place. The resulting distribution can be thought of as the ‘Expected number of interactions on Page 2’ so we have chosen the name ‘Expectation’. The bayesTest class is closed under combine, meaning the resulting object is also a bayesTest, so the same generics apply.
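Under the hood this kind of combination is just elementwise arithmetic on the Monte Carlo posterior samples. A Python sketch with stand-in samples (in bayesAB the samples come from the fitted tests; the prior parameters below are mine, for illustration):

```python
import random

rng = random.Random(3)

# stand-in posterior samples for one recipe
p_samples = [rng.betavariate(30, 70) for _ in range(50_000)]           # CTR near .3
lam_samples = [rng.gammavariate(550, 1 / 100) for _ in range(50_000)]  # interactions near 5.5

# 'combine' with *: expected interactions on Page 2 per visitor to Page 1
expectation = [p * lam for p, lam in zip(p_samples, lam_samples)]
est = sum(expectation) / len(expectation)  # hovers around .3 * 5.5 = 1.65
```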
This document was meant to be a quickstart guide to A/B testing in a Bayesian light using the bayesAB package. Feel free to read the help documents for the individual functions shown above (some have default params that can be changed, including the generics). Report any issues or grab the development version from our Github.
Twitter released a new R package earlier this year named AnomalyDetection (link to Github). The Github goes into a bit more detail, but at a high level it uses a Seasonal Hybrid ESD (S-H-ESD) test which is built upon the Generalized ESD (Extreme Studentized Deviate) test, a test for outliers. The S-H-ESD is particularly noteworthy since it can detect both local and global outliers. That is, it can detect outliers within local short-term seasonal trends, as well as global outliers that fall far above or below all other values.
I’ve been dealing with a similar problem as part of my personal work (and FT job as it happens) so this package came up alongside the literature.
One area in which this package is lacking is data visualization, as we’ll see below. Since the purpose of this problem is to find anomalies in time-series data, dygraphs jumps to mind. A few of the benefits and features of dygraphs:
I’m sure I’m missing a couple but those who haven’t used it yet will quickly see why it’s great and particularly well suited to this type of problem. Plus, it just LOOKS better.
I figure the best way to show the extensions is to go through a few use cases using the AnomalyDetection package and show the normal plotting vs. the new and improved plotting.
Let’s load a few packages first:
In this case, we won’t restrict our potential anomalies to only the last day, hour, week, etc., and will instead focus on the entire time series.
The benefits are noticed immediately. Try experimenting with the ‘roll period’ in the bottom left of the interactive plot to smooth out the time series. Or, try zooming in on a cluster of anomalies by using the slider on the bottom, or by simply selecting a shaded region horizontally. Double-click to reset, and try a bunch more stuff to explore your data even further. Notice that if you hover over an anomaly you get a popup with the date (this can be customized to say anything). On top of all of this, a reactive component in the top-right keeps track of the date and ‘y’ value on which you are hovering.
The AnomalyDetection package also has an option where you can only look for anomalies in the last ‘x’ periods (days, weeks, months, etc.). In this case, not only does it find anomalies only in the specified period, the plot changes as well. The Github link goes into this in more depth, but basically it cuts off the plot earlier and dims the ‘irrelevant’ time series. Of course, this isn’t needed for the dygraphs case since you can zoom in/out at your own will. Let’s check the examples out:
We see it’s easy to have the dygraph start zoomed in on the same days as the original plot, but we have the ability to easily look further back. Also, recreating the shading isn’t hard to do. Note that while it begins zoomed in on a 6-day window (to mirror the base package plot), double-clicking it will zoom out fully as with the other examples. To ‘reset’ the plot to the 6-day window you’ll have to reload the page. You can also manually restrict the axis upon creation of the plot, but since you can zoom in and out fluidly as much as you like, you may not ever need to do this.
All in all, we see how a few simple modifications greatly enhance the plotting capabilities of the AnomalyDetection package. If you’re like me (and most other Data Scientists) a good chunk of your work relates to presenting information to non- or less technical people in a palatable format. Rather than writing your own wrappers to digest this information in a tabular or textual format from the internals of the R package, by simply employing dygraphs you allow anybody to do their own exploration of the data, zoom in on anomalies, smooth out data to see the underlying trends, and much more. This would pair extremely well with a Shiny app that lets the user swap in time series and change other parameters such as sensitivity and direction.
To extend this work I am writing my own wrappers around the package to automate a lot of the plot building. As you noticed, for this quick example I’m manually filling in values such as title, axis labels, date ranges, and so on. However, this shouldn’t be too hard to do programmatically from the package internals. Whether I want to implement this as a separate method within the package itself and potentially do a pull request is still up in the air, as the package’s architecture is a little messy for that sort of contribution. Perhaps if some time frees up.
Yay. We can now define the Lagrange polynomials:

$$p_k(t) = \prod_{j \ne k} \frac{t - t_j}{t_k - t_j}$$
Why am I making you look at this beauté? Turns out there’s some neat mathematical properties  namely in the subject of polynomial interpolation. That’s a fancy way of saying ‘fit a polynomial through certain points’ but we’ll get to that.
Now, it’s easy to see that

$$p_k(t_j) = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$

since for $k = j$ every term in the product will be 1 ($t = t_k$ in the product), while for $k \ne j$ there will be one term with $t_j - t_j$ in the numerator (rendering the whole product 0). Keep in mind that the product above is not defined for the term where $j = k$, otherwise we’d have a division-by-zero term.
See where this is going? Let’s introduce $y_1, …, y_n$ as arbitrary real numbers. How would we construct a polynomial that goes through the $(t_i, y_i)$ points? That is, can we find a polynomial $p$ such that $p(t_j) = y_j$ for each index $j = 1, …, n$? Hint: I didn’t write all that stuff up there for no reason.
Given what we know about $p_k(t_j)$ we can do the following:

$$p(t) = \sum_{k=1}^{n} y_k \, p_k(t)$$

Now we’ve defined an equation based on arbitrary $t_i$ and $y_i$ that will go through the points $(t_i, y_i)$ and behave otherwise elsewhere (as we will soon see).
How do we code something like this up? As usual there’s a lot of different ways but I decided to go with closures as function factories because I tend to think functionally, and also because the code is a bit more concise in this case. First we can define a $p_k(t)$ function factory.
Given inputs k and ts (a vector), we get back a $p_k$ that can be evaluated at any real point. It also has the above-mentioned properties.
Next a $p(t)$ function factory.
It takes in a Y vector and a list of polynomials ($p_k(t)$) and outputs a real-valued function of $t$. The list of polynomials in this case is a list of functions (the output of the p_k function).
Finally a quick function to create the combined polynomial given a vector of Y and T (constructing the $p_k$ polynomials in the process).
We see that it has the desired effect:
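The R closures themselves did not survive extraction; an equivalent sketch in Python (same structure: a basis-polynomial factory and a combined interpolant) looks like this:

```python
def make_pk(k, ts):
    """Factory for the k-th Lagrange basis polynomial over nodes ts."""
    def pk(t):
        prod = 1.0
        for j, tj in enumerate(ts):
            if j != k:  # skip j == k to avoid the division-by-zero term
                prod *= (t - tj) / (ts[k] - tj)
        return prod
    return pk

def lagrange(ts, ys):
    """Interpolating polynomial p with p(ts[j]) = ys[j] for every j."""
    basis = [make_pk(k, ts) for k in range(len(ts))]
    def p(t):
        return sum(y * pk(t) for y, pk in zip(ys, basis))
    return p

ts = [i / 10 for i in range(-10, 11, 2)]  # equispaced points in [-1, 1]
p = lagrange(ts, [t ** 3 for t in ts])    # interpolate y = t^3
```

Since the data come from a cubic, the interpolant reproduces it exactly: p(0.5) returns 0.125 (up to floating point).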
Finally, a few applications:
We let t be 10 equispaced points between 1 and 1, and Y be the cube of those points. As we can see, the polynomial interpolation of those points looks to be roughly equal to $f(x) = x ^ 3$ which is pretty ‘good’.
Now we try it with points that aren’t monotone increasing in both $t$ and $y$. The result is a little weird with sharp corners near the edges.
And what if we tried to interpolate on completely random points (i.e. not trying to ‘retrieve’ some sort of function transformation)?
Pretty cool. You can play around with this but the polynomials have massive spikes near $t_1$ and $t_n$ if you introduce too many more points.
If you’re a statistician, alarms should be going off in your head as far as using polynomial interpolation to describe a generalized relationship between two vectors of data (overfitting, anybody?). Any form of function fitting (even nonparametric) requires some pretty strong assumptions on the smoothness of the curve involved. It also quickly becomes clear that if you add in three or more non-collinear points the polynomial interpolation will start to have very sharp edges. This is where other options (i.e. linear regression) come into play and help fix the bias-variance problem we are observing. Also, more modern ‘smoothing’ techniques such as splining exist.
That being said, under the right conditions (monotonicity, uncomplicated transformations, etc.) the polynomial retrieved from this method may not be a bad generalization of the relationship between $t$ and $y$. All in all, there are some cool mathematical properties at play here and this isn’t too shabby for something that was intuitively derived in the 1700s.
We explore the possibility of improving data analysis through the use of interactive visualization. Exploration of data and models is an iterative process. We hypothesize that dynamic, interactive visualizations greatly reduce the cost of new iterations and thus facilitate agile investigation and rapid prototyping. Our web-application framework, flyvis.com, offers evidence for such a hypothesis for a dataset consisting of airline on-time flight performance between 2006-2008. Utilizing our framework we are able to study the feasibility of modeling subsets of flight delays from temporal data, which fails on the full dataset.
Technically, this was a very fun project. Shiny is an extremely powerful package which provides the interactive framework necessary to build such applications. We also made use of the JavaScript library leaflet.js for the interactive map. All in all, I learned quite a bit about writing efficient R code, as the dataset we were using had over 18 million observations.
To learn more about the app check out the projects page or the actual application website FlyVis.com.
FlyVis lets you dynamically explore the airports ontime dataset which yields some pretty interesting graphs. For example, if we look at the intraday distribution of flights and delays for Memphis:
we see a pretty interesting pattern. Turns out the FedEx shipments control most of the flights out of Memphis which gives us this unique shape.
The beauty of an interactive application is that you (the reader) can discover something that I haven’t even considered. I merely provide the tools and you can explore. If anybody finds some cool patterns in certain airports then I’d love to hear about it over email or comment.
Once we polish the application a bit more we will release the source code on GitHub. Disclaimer: The site will initially take a minute or two to load since our server has to load the massive dataset into memory. Also, the plots do take 5-6 seconds to generate. Again, this is due to the size of our data and is something we are currently trying to optimize.
Finally, what would this post be without some R code? Here’s what we used for the Calendar Heatmap plot:
Doesn’t include the preprocessing but that’ll be out early 2014.
In this post, we’ll examine some fascinating properties of Roman Numerals, namely the lengths of Roman Numerals in succession.
First, we define a simple Arabic → Roman Numeral converter. Start by creating two vectors, one for the 13 Roman symbols and another for the Arabic counterparts. Next, a simple for/while combination iterates through the arrays and chooses the appropriate Roman symbols while iteratively decreasing the input variable.
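The R implementation is missing from the extracted post; the same greedy idea in Python, with the 13 value/symbol pairs (including the subtractive forms) as one list:

```python
# the 13 Roman symbols and their Arabic counterparts, largest first
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Greedy conversion: repeatedly peel off the largest value that still fits."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)
```

The lengths for the plot are then just len(to_roman(n)) for n over the range of interest.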
Next we want to find the lengths of each of these representations:
Finally, plot the result:
What a fascinating relationship! The lengths seem to take a sinusoidal shape with positive drift. We can see that although the average length keeps rising, the plot goes through very regular cycles of rising and falling.
For one of my computational finance classes, I attempted to implement a Machine Learning algorithm in order to predict stock prices, namely S&P 500 Adjusted Close prices. In order to do this, I turned to Artificial Neural Networks (ANNs) for a plethora of reasons. ANNs have been known to work well for computationally intensive problems where a user may not have a clear hypothesis of how the inputs should interact. As such, ANNs excel at picking up hidden patterns within the data, so well that they often overfit!
Keeping this in mind, I experimented with a technique known as a ‘sliding window’. Rather than train the model with years of S&P 500 data, I created an ANN that would train over the past 30 days (t-30, …, t) to predict the close price at t+1. A 30-day sliding window seemed to be a good fit. It wasn’t so wide that it failed to capture the current market atmosphere, but it also wasn’t so narrow as to be hypersensitive to recent erratic movements.
Then, I had to decide on the input variables I was going to use. Many stock market models are pure time-series autoregressive functions, but the benefit of ANNs is that we can use them as a more traditional Machine Learning technique, with several inputs (and not only previous prices). This is helpful because there exist an extremely large number of technical indicators which should uncover some significance in the market. I defined several of my own inputs that I thought would be significant predictors, such as:
In addition, I did some research and introduced several other technical indicators. Since there were so many, I fit a Random Forest model in order to reduce the dimensionality of the problem. After taking only the most important variables, I was left with:
Finally, in order to obtain a stable model, I had to scale the inputs so that all variables lay in the range of [-1, 1]. This is done to avoid oversaturating the individual neurons in the ANN with outliers. There are several ways to do this, and I just created a simple function called ‘myScale’ in order to get the job done.
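The ‘myScale’ code itself was lost; a linear rescale to [-1, 1] is a one-liner, sketched here in Python (the original was R):

```python
def my_scale(xs):
    """Linearly rescale a sequence to the range [-1, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # degenerate constant input
    return [2 * (x - lo) / (hi - lo) - 1 for x in xs]

scaled = my_scale([3, 5, 7, 9])  # endpoints map to -1 and 1
```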
I won’t be sharing the final model, but I did create a simulation of my methodology for the 2013 trading year.
The animation demonstrates the mechanics of a Sliding Window model. We see that the ‘Training’ Window keeps moving up, regardless of whether the prediction was accurate or not. The RMSE of the model for the 2013 dates is 1.2115, which is extremely small considering the prices of the S&P 500 are over 1000. We know from the RMSE and from the animation that the fit tends to be good. When the predicted values are slightly off, we can still see that the ANN usually correctly predicts the direction of the S&P 500 for that day. Perhaps in the future I will revisit this problem as a classification instead due to the results we’re seeing.
The beauty of my result is that although this is a naive implementation of a sophisticated model, we can see that the results are really good. I can spend many weeks tuning the model and playing around with the inputs to get a better fit.
Pretty.
The GTD provides over 100,000 observations of terrorist incidents between 1970 and 2011. Of these, there are about 2400 observations in the USA. While this is not a large number, the graph still provides some interesting and intuitive results.
In order to obtain meaningful results, rather than simply plotting the number of terrorist incidents per state, I divided each state’s count by its 2010 population. I know this is not entirely correct, since state populations have fluctuated (with respect to one another) from 1970 to 2011, but it was fine for my purposes. I noticed some clustering in the frequencies of terrorist attacks, so I took a log10 transform of those numbers to spread them out more smoothly.
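The normalization step can be illustrated with made-up numbers: divide each state’s incident count by its 2010 population, then take log10 of the rate:

```r
# Made-up counts to illustrate the transform; the real numbers came from
# the GTD and the 2010 census.
states <- data.frame(
  state     = c("NY", "WY"),
  incidents = c(400, 3),
  pop2010   = c(19378102, 563626)
)
states$rate     <- states$incidents / states$pop2010   # incidents per capita
states$log_rate <- log10(states$rate)                  # spreads clustered rates out
```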
When John was a little kid he didn’t have much to do. There was no internet, no Facebook, and no programs to hack on. So he did the only thing he could… he evaluated the beauty of strings in a quest to discover the most beautiful string in the world.
Given a string s, little Johnny defined the beauty of the string as the sum of the beauty of the letters in it.
The beauty of each letter is an integer between 1 and 26, inclusive, and no two letters have the same beauty. Johnny doesn’t care about whether letters are uppercase or lowercase, so that doesn’t affect the beauty of a letter. (Uppercase ‘F’ is exactly as beautiful as lowercase ‘f’, for example.)
You’re a student writing a report on the youth of this famous hacker. You found the string that Johnny considered most beautiful. What is the maximum possible beauty of this string?
The input file consists of a single integer m, followed by m lines.
You can find the input file here.
To start, I read input.txt (as provided by Facebook) into R, saving each line of input as a separate character string in the ‘inputs’ vector. I then saved the first entry of ‘inputs’ as m, in accordance with Facebook’s description of the problem, and subsetted the rest of the vector to include only the strings we are to analyze.
After allocating some memory for an outputs vector and using R’s built-in letters variable (it might have been redundant to name it as a variable), I began a for loop to calculate the maximum beauty of each string. Inside the loop, I created local variables x and y: x was simply the i-th input in all lowercase (for homogeneity), and y split the string into a vector with one character per entry, with all non-letter components removed through subsetting.
From there, I created a table of the number of times each letter appeared in the i-th string and sorted it in decreasing order. A nested for loop then assigned values to the elements of the table: 26 to the letter that appeared most frequently, 25 to the second most frequent, and so on. In this way we calculate the MAXIMUM POSSIBLE beauty of each string.
Finally, I filled the outputs vector by creating a string in correspondence with Facebook’s requirements. Once we rinse and repeat this process a total of m times, we have a complete outputs vector storing the maximum beauty of each string in input.txt.
In order to produce the outputs as a .txt file to upload to Facebook, I employed a simple writeLines function, which placed every element of the outputs vector on a separate line.
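The steps above can be condensed into a small helper; this is a reconstruction of the described approach, not the original script:

```r
# Greedily assign beauty 26 to the most frequent letter, 25 to the next,
# and so on; case and non-letter characters are ignored.
max_beauty <- function(s) {
  y <- strsplit(tolower(s), "")[[1]]
  y <- y[y %in% letters]                       # keep letters only
  freqs <- sort(table(y), decreasing = TRUE)   # most frequent first
  sum(freqs * seq(26, by = -1, length.out = length(freqs)))
}
```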
After exploring some of the package’s capabilities, I decided to conduct a pretty basic sentiment analysis on some tweets with various hashtags. Specifically, I analyzed the polarity of each tweet: whether it is positive, negative, or neutral.
The hashtags I used were: #YOLO, #FML, #blessed, #bacon
The actual script is fairly simple and repetitive but does yield some interesting results:
The histogram portrays some peculiar information. For one, all of these hashtags seemed to be associated with positive messages. I did not expect #fml to be associated with positive results, since the fmylife site is a place for people to post negative things that have happened to them. Nevertheless, the other hashtags skewing positive was expected.
Next, I decided to explore some of the functions of the ‘wordcloud’ package in R. In order to do so, I mined tweets that contained #rstats, and built a wordcloud that sized and placed words based on their frequencies.
I had to use the tm_map function to ensure all tweets were encoded properly before using TermDocumentMatrix. As we can see, ‘shiny’ is by far the most popular word tweeted with ‘#rstats’. This should come as no surprise: Shiny is RStudio’s new and exciting way to integrate R with web applications.
I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough, so I decided to write up a quick R program to see what I could find.
First I imported the two datasets and merged them into one frame. I then built a linear model using GDP per capita as the response variable and percent of the population with bachelor’s degrees as the predictor.
The summary of this simple linear model is featured below:
As we can see, the p-value is very small (small enough for this model to be significant). However, the low R-squared leaves much to be desired. Very little of the variance in the data is accounted for by our statistical model, and I question whether it is a good fit.
Nevertheless, we can see a pretty interesting graph below:
Aha! We can attribute some of our lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (50 states in the USA), 3 outliers can definitely have a strong impact on the fit.
I tried a Box-Cox test in R to see whether our fit might be improved if we transformed the response variable. The Box-Cox plot suggests that raising the response to the negative first power might be beneficial.
This new model is outlined below:
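As a sketch of what the two fits look like in R (the numbers here are invented; the real frame merged the two Wikipedia tables):

```r
# MASS::boxcox on the untransformed fit; a lambda near -1 motivates
# regressing the reciprocal of GDP per capita on the bachelor's percentage.
library(MASS)

d <- data.frame(bachelors = c(19, 22, 26, 29, 31, 35, 38),
                gdp_pc    = c(35, 39, 44, 48, 51, 58, 63) * 1000)

fit     <- lm(gdp_pc ~ bachelors, data = d)       # original model
bc      <- boxcox(fit, plotit = FALSE)            # lambda grid + log-likelihoods
fit_inv <- lm(I(1 / gdp_pc) ~ bachelors, data = d) # transformed model
```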
The p-value of this new model is even lower, and the R-squared value also improved slightly. Still, the model is not perfect, and it would be unwise to claim that we have a strong linear relationship between the two.
Graphing this we get:
Once again we see the same 3 outliers, which definitely have a huge impact on the significance of such a small model. Either way, this model seems to exhibit a somewhat stronger relationship between the two variables.
The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R, just because some of the countries’ spellings in the article did not match those used in the R ‘maps’ package.
The .gif starts in 1940, the year the first McDonald’s opened in the US. There is a bit of downtime between 1940 and 1967 before McDonald’s expands to Canada. After that, the company rapidly takes off all over the globe.
The animation package is phenomenal and makes use of GraphicsMagick. That part alone took me over an hour to figure out because command line typing on Windows is absolutely terrible.
Please email or leave a comment if you have any questions; I would be happy to help you understand the code better.
A Pythagorean triplet is a set of three natural numbers such that $a^2 + b^2 = c^2$.
To solve this problem, we first see that $c = \sqrt{a^2 + b^2}$. This means we only need to loop over a and b, since c is uniquely determined by any such pair.
The code I used:
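A sketch reconstructing that loop (the original listing is lost; this follows the description below):

```r
# Test a and b up to 499, derive c from the Pythagorean relation, and
# check whether the triplet sums to 1000. Note 'c' here is a numeric
# variable, shadowing nothing we call in the loop.
for (a in 1:499) {
  for (b in 1:499) {
    c <- sqrt(a^2 + b^2)
    if (a + b + c == 1000) {
      answer <- a * b * c
      break
    }
  }
}
```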
I use nested for loops to test values of a and b between 1 and 499. Neither a nor b can exceed 499: since $c = \sqrt{a^2 + b^2}$ is larger than both, a value of 500 or more would force $a + b + c$ to be greater than 1000.
The if statement in the nested loops checks whether a, b, and c add up to 1000. If they do, it prints their product and breaks the loop. This short script produces the correct answer in a few milliseconds. The trick was expressing c in terms of a and b to reduce the number of loops we need to run.
I then plotted the data to see if there was a discernible relationship between the three. Most of the work for this project had to do with merging and cleaning the data. I began by pasting the tables from the Wikipedia articles into a .csv file. Since the tables were all different lengths and some countries were missing values of the parameters, I had to tidy the dataset up quite a bit.
The result, featured below the code, is pretty interesting.
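The merge step can be illustrated with a toy version of the three tables (hypothetical column names; the real data came from the Wikipedia articles):

```r
# Merge Quality of Life, Nobel laureates per capita, and income inequality
# by country, keeping countries that are missing from some tables (all = TRUE).
qol   <- data.frame(country = c("Norway", "Chile", "Ghana"), qol = c(8.1, 6.8, 5.2))
nobel <- data.frame(country = c("Norway", "Chile"), laureates_pc = c(2.5, 0.3))
gini  <- data.frame(country = c("Norway", "Ghana"), gini = c(26.8, 42.8))

combined <- Reduce(function(a, b) merge(a, b, by = "country", all = TRUE),
                   list(qol, nobel, gini))
```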
We can see that there is a weak positive relationship between the Quality of Life of a country and its Nobel Laureates per capita, as I expected. Perhaps most interesting is how the income inequality fits into all of this. We see that countries that have a high Quality of Life Score AND high Nobel Laureates per capita tend to have very low income inequality. On the other hand, most of the countries that had relatively low Quality of Life scores and very few Nobel Laureates per capita tended to also have high levels of income inequality.
It is curious to note that some countries with higher levels of income inequality still had very few Nobel Laureates per capita, even if they were on the medium-high end of the Quality of Life scoring.
These two histograms help us visualize how common the ‘cuts’ were as well as the range of carats present in this dataset.
This histogram combines the two ideas above and gives an idea of how many diamonds had a certain cut at a specific carat.
A simple plot of price vs. carat with color based on cut shows us that price tends to increase with carat as we might expect.
The above output gives us a summary of diamond carats separated by cut. We can see that ‘Fair’ cut diamonds tended to be the largest.
This is a summary of diamond prices separated by cut.
Now I’ll fit a few linear models to the data.
The first model has ‘price’ as the response variable with ‘carat’ and ‘cut’ as the two predictors. The second model is the same except it accounts for interaction.
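The two model structures look like this; synthetic data stand in for the diamonds set so the sketch is self-contained:

```r
# Additive model vs. interaction model. With a 3-level 'cut' factor, the
# interaction adds a separate carat slope for each non-baseline level.
set.seed(2)
n <- 500
d <- data.frame(carat = runif(n, 0.2, 3),
                cut   = factor(sample(c("Fair", "Good", "Ideal"), n, TRUE)))
d$price <- 4000 * d$carat + rnorm(n, sd = 500)   # made-up price process

m_add <- lm(price ~ carat + cut, data = d)   # 'price' on 'carat' and 'cut'
m_int <- lm(price ~ carat * cut, data = d)   # same, plus interaction terms
```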
My static blogging framework Ruhoh messes up when I try to paste the summary outputs, but if you load up R you can confirm that these two linear models are almost identical. Judging by the R-squared values and residual standard error, the interaction model does look a TINY bit better, but not by enough to be conclusive.
Next we want to take a look at the residuals to see if they are correlated, have non-constant variance, or are not normally distributed:
In general, the residuals are poorly behaved for both models and look correlated. This violates vital assumptions we make when constructing a linear model. The Q-Q plots of these two models confirm the non-normality of the errors.
Next we can explore some transformations that will hopefully fit better.
The plot of log(price) vs. log(carat) (both base 10) seems to exhibit a much stronger linear relationship than the untransformed parameters.
I fit a model with log10(price) as the response and log10(carat) as the predictor, along with the interaction ‘cut’ terms.
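A sketch of this transformed fit, again with synthetic stand-in data:

```r
# log10(price) regressed on log10(carat), with per-cut interaction slopes.
set.seed(4)
n <- 300
d <- data.frame(carat = runif(n, 0.2, 3),
                cut   = factor(sample(c("Fair", "Good", "Ideal"), n, TRUE)))
# Made-up log-linear price process with small noise
d$price <- 10^(3.4 + 1.7 * log10(d$carat) + rnorm(n, sd = 0.05))

m_log <- lm(log10(price) ~ log10(carat) * cut, data = d)
```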
Once again the summary function messes up Ruhoh, so take my word for the following: the tiny residual standard error and higher R-squared value of this model, compared to the one with untransformed parameters, give some evidence that it is indeed a better fit.
The residuals now seem to have almost constant variance as well as no correlation. The normality of the residuals is confirmed by the Q-Q plot exhibiting a strong linear pattern:
Last bit of code to show that log(price) and log(carat) exhibit a nice linear relationship:
Now that we’ve built a pretty good linear model to predict log10(price) based on a few other variables, we can back-transform if needed to predict actual price.
This was a pretty lengthy one; hope you enjoyed it!
I first saved the number in a text file and used the scan function to import it into R. At first it is a vector of 20 character strings, because scan separates items based on line breaks. The paste function then allows me to concatenate the strings into one character string. After that, I use strsplit to separate each digit of that string into its own position in a vector. Finally, I convert the strings vector into a numeric vector so we can use mathematical operations on it.
It was easy to see that if we are taking products of 5 consecutive digits of a number, there are a total of x - 4 products, where x is the total number of digits. (If it’s a 5-digit number, only 1 product exists. For a 6-digit number, we can multiply digits 1 through 5 or 2 through 6, for 2 different products.)
Therefore, for this problem we had to consider 996 products. After allocating memory for an empty numeric vector, I ran a for loop 996 times to fill up the entries of the ‘products’ vector. The first entry became the product of the first 5 digits of our number, the second entry the product of digits 2 through 6, and so on.
The for loop terminates very quickly and we can use a simple max(products) command to get our answer.
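The whole pipeline can be sketched on a short stand-in number (the real input has 1000 digits spread over 20 lines):

```r
# Join the scanned lines into one string, split into single digits, then
# take the max over all products of 5 consecutive digits.
lines  <- c("73167", "1765")       # stand-in for scan("number.txt", what = "")
digits <- as.numeric(strsplit(paste(lines, collapse = ""), "")[[1]])

products <- numeric(length(digits) - 4)   # one product per 5-digit window
for (i in seq_along(products)) {
  products[i] <- prod(digits[i:(i + 4)])
}
answer <- max(products)
```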
The blue points represent zip codes in the lowest income group, while the red points are in the wealthiest group. The size of the points corresponds to the reported salaries in those zip codes. For this post, I only considered zip codes that reported more than 5,000 returns.
We can see that many of the least wealthy zip codes are in the southeast while the New York/New Jersey area as well as LA/San Francisco have high densities of wealthy zip codes.
Pictured above is Charles Minard’s flow map of Napoleon’s march. It is simply amazing that such a detailed and innovative graphic was published in 1869 (way before the first computer). Minard was truly a pioneer in the use of graphics in engineering and statistics.
The troops text file contains the longitude and latitude of Napoleon’s army, both on the attack and retreat, on his march. The cities file contains the longitudes and latitudes of a few major Russian cities that were in Napoleon’s path.
The red line represents the attacking march while the blue line represents the retreat. The thickness of the line corresponds to the size of Napoleon’s army at that point.
Some improvements that could be made:
Datasets and inspiration from Hadley Wickham’s STAT 405 class at Rice University.