For the uninitiated, the Taxicab / German Tank problem is as follows:
Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?
This was also applied to counting German tanks in World War II to know when/if to attack. Statistical methods ended up being accurate to within a few tanks (on a scale of 200-300) while “intelligence” (unintelligence) operations overestimated numbers by about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).
Somebody suggested using $2 * \bar{x}$ where $\bar{x}$ is the mean of the current observations. This made me feel all sorts of weird inside but I couldn’t quite put my finger on what was immediately wrong with this (outside of the glaringly obvious, touched upon shortly). On one hand, it makes some sense intuitively. If the distribution of the numbers you see is random, you can be somewhat sure that the mean of your subset is approximately the mean of the true distribution. Then you simply multiply by 2 to get the max since the distribution is uniform. However, this doesn’t take into account that $2 * \bar{x}$ might be lower than $m$ where $m$ is the max number you’ve seen so far. An easy fix to this method is to simply use $\max(m, 2\bar{x})$.
To be most correct, we can use the MVUE (Minimum Variance Unbiased Estimator). I won’t bore you with the derivation (and it’s also on the Wiki), but it works out to be $m + \frac{m}{N} - 1$ where $m$ is again the max number you’ve seen and $N$ is the total number you’ve seen. Let’s compare all 3 methods below:
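A minimal sketch of the comparison in R (the true max of 5000 and the sample size here are my own assumptions for illustration):

```r
set.seed(42)
true_max <- 5000  # assumed true number of taxis

# Draw a growing sample and compute all three estimators at each step
obs <- sample(1:true_max, 300)
est <- t(sapply(seq_along(obs), function(n) {
  x <- obs[1:n]
  m <- max(x)
  c(two_mean = 2 * mean(x),
    modified = max(m, 2 * mean(x)),
    mvue     = m + m / n - 1)
}))
tail(est, 1)  # by n = 300 all three should sit near 5000
```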
Let’s plot it:
Zoom in a bit:
And this one is just prettttyyyy:
As we can see, the green line converges to the true max of 5000 very quickly, and has far less variance than the other two methods. The modified method I proposed is just an asymmetric version of $2 * \bar{x}$, which approaches the true max but seems to level out at local minima/maxima at certain points. Ultimately, you can’t deny that the MVUE method is best.
Another approach we can take is a Bayesian one. Since I’d never pass up an opportunity to showcase my R package for Bayesian AB testing, bayesAB (GitHub repo here), let’s work through an example with bayesAB. This is actually an extra cool example since it showcases that while bayesAB is primarily meant to fit and compare two sets of posteriors, it can still be used to fit a single Bayesian model.
For brevity’s sake, we fit a Bayesian model using the Pareto distribution as the conjugate prior (if this is all a foreign language to you, definitely check out my other blog posts on bayesAB 1 2). We can use very small values for xm and alpha as diffuse priors so the model we build for the Uniform distribution is based solely on the data.
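The conjugate update itself is simple enough to do by hand; here is a base-R sketch of the math (the post uses bayesAB, but the function and data here are my own illustration, with the small diffuse values of xm and alpha mentioned above):

```r
# Pareto(xm, alpha) is conjugate for the max of a Uniform(0, theta):
# posterior xm' = max(xm, data), posterior alpha' = alpha + n
update_pareto <- function(data, xm = 0.005, alpha = 0.005) {
  list(xm = max(xm, data), alpha = alpha + length(data))
}

set.seed(1)
obs <- runif(100, 0, 5000)
post <- update_pareto(obs)

# Posterior median of a Pareto(xm, alpha) is xm * 2^(1 / alpha)
post$xm * 2^(1 / post$alpha)
```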
As we can see, the posterior is fairly thin between 4980 and 5000. Taking the mean or median of this is a solid estimate for the maximum. Let’s see what the median says using the summary generic.
For our purposes, we can ignore everything besides the quantiles of the posteriors. Using this, we arrive at an estimate of 4983 as our max. Using $2 * \bar{x}$ for this same sample results in 5034 which is overshooting by quite a bit.
Quick announcement that my package for Bayesian AB Testing, bayesAB, has been updated to 0.7.0 on CRAN. Some improvements on the backend as well as a few tweaks for a more fluid UX/API. Some links:
Now, on to the good stuff.
Most questions I’ve gotten since I released bayesAB have been along the lines of:
Question 1 has a few objective and a few subjective answers to it. The main benefits are ones that I’ve already highlighted in the README/vignette of the bayesAB package. To briefly summarize, we get direct probabilities for A > B (rather than p-values) and distributions over the parameter estimates rather than point estimates. Finally, we can also leverage priors, which help with the low sample size and low base rate problems.
To start, let’s go back to what a prior actually is in a Bayesian context. There are countless mathematical resources out there (including part of my previous blog post) so I’ll only talk about this conceptually. Simply put, a prior lets you specify some sort of, ahem, prior information about a certain parameter so that the end posterior on that parameter encapsulates both the data you saw and the prior you inputted. Priors can come from a variety of places including past experiments, literature, and domain expertise in the problem. See this blogpost for a great example of somebody combining their own past data and literature to form very strong priors.
Priors can be weak or strong. The weakest prior will be completely objective and thus assign an equal probability to each value for the parameter. Examples of this include a Beta(1, 1) prior for the Bernoulli distribution. In these cases, the posterior distribution is completely reliant on the data. A strong prior will convey a very precise belief as to where a parameter’s values may lie. For example:
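For instance, a quick base-R comparison of a flat prior against a strong one (the Beta(300, 700) here is my own illustrative choice, concentrating belief near p = 0.3):

```r
# Compare a flat Beta(1, 1) prior with a strong Beta(300, 700) prior
p <- seq(0, 1, by = 0.001)
plot(p, dbeta(p, 300, 700), type = "l", ylab = "density")
lines(p, dbeta(p, 1, 1), lty = 2)  # the flat prior is a constant 1
```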
The stronger the prior, the more say it has in the posterior distribution. Of course, according to the Bernstein–von Mises theorem, the posterior is effectively independent of the prior once a large enough sample size has been reached for the data. How quickly this happens depends on the strength of your prior.
Do you need (weak/strong) priors? Not necessarily. You can still leverage the interpretability benefits of Bayesian AB testing even without priors. At worst, you’ll also get slightly more pertinent results since you can parametrize your metrics as the appropriate distribution random variable. However, without priors of some kind (and to be clear, not random bullshit priors either) you run into similar issues as with Frequentist AB testing, namely Type 1 and Type 2 errors. A Type 1 error is calling one version better when it really isn’t, and a Type 2 error is calling a better version equal or worse. Both typically arise from low sample size/base rate and are controlled by reaching appropriate sample size as per a power calculation.
Have no fear! Even without good and/or strong priors there are still ways to control for false positives and all that good stuff. We use something called Expected Posterior Loss, or “based on the current winner, what is the expected loss you would see should you choose wrongly?” If this value is lower than your threshold of caring on abs(A - B), then you can go ahead and call your test. This value implicitly encompasses the uncertainty about your posteriors.
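A Monte Carlo sketch of that idea (the posteriors here are made-up Beta draws; bayesAB’s internal calculation may differ):

```r
# Posterior expected loss: if we declare A the winner, how much do we
# expect to lose in the cases where B is actually better?
posterior_loss <- function(A_samples, B_samples) {
  mean(pmax(B_samples - A_samples, 0))
}

set.seed(3)
A <- rbeta(1e5, 30, 70)  # hypothetical posterior for A's rate
B <- rbeta(1e5, 25, 75)  # hypothetical posterior for B's rate
posterior_loss(A, B)     # small value -> safe to call the test for A
```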
Okay cool, that roughly answers Questions 1-4 in some order.
Let’s do a quick simulation to illustrate some of the above points. Let’s make three examples: weak priors, strong priors, and diffuse priors (quick tip: the Jeffreys prior of a Gamma distribution is Gamma(eps, eps) where eps is smallllll). We’ll be taking 2 x 100 samples from a Poisson distribution with the same $\lambda$ parameter. The strong and weak priors will be centered around this value of 2.3.
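Here is the gist of such a simulation using conjugate Gamma updates in base R (my own sketch; the post’s version uses bayesAB itself):

```r
# One simulated test: A and B share the same lambda; with a
# Gamma(shape, rate) prior the posterior is Gamma(shape + sum(x), rate + n)
sim_type1 <- function(shape, rate, lambda = 2.3, n = 100, draws = 1e4) {
  A <- rpois(n, lambda)
  B <- rpois(n, lambda)
  pA <- rgamma(draws, shape + sum(A), rate + n)
  pB <- rgamma(draws, shape + sum(B), rate + n)
  p <- mean(pA > pB)
  p <= 0.05 || p >= 0.95  # TRUE means a Type 1 error was made
}

set.seed(7)
# error rate for a diffuse (Jeffreys-ish) prior over 500 simulated tests
mean(replicate(500, sim_type1(1e-5, 1e-5)))
```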
Now, A and B shouldn’t have any difference between the two, but occasionally we will see a Type 1 error. That’s what the bottom 3 lines are doing. If P(A > B) is <= 0.05 or >= 0.95 we call one of the recipes “significantly” better. Observe what happens with each case of prior.
The diffuse priors have the most Type 1 errors, followed by the weak priors, followed by the strong priors, as is to be expected.
Finally, we can fit another bayesTest (:D) to determine whether the differences between Type 1 error percents across priors are different from one another.
As we can see, it’s somewhat clear that the diffuse prior is worse than the weak priors, and very clear that the diffuse prior is worse than the strong priors. Note that in our case I use a diffuse prior of Beta(1, 1) since I have no idea what’s normal going into this simulation.
Finally, we can check the output of summary to see if the Posterior Expected Loss is within our constraints.
If the Posterior Expected Loss is lower than our threshold for caring on abs(A - B) then we can call this test and accept the current results. The PEL is small in both cases, and possibly 0/NaN for t2, so it’s quite clear that priors, even weak ones, have a significant positive effect on Type 1 errors. Remember that we see this effect partially because our priors were of a similar shape to the data. If the priors and the data disagree, the effects might not be so clear-cut and you will need more data to have a stable posterior.
Check out bayesAB 0.7.0 to supercharge your AB tests and reach out to me if you have any questions!
Frank Portman - fportman.com - frank1214@gmail.com
Most A/B test approaches are centered around frequentist hypothesis tests used to come up with a point estimate (probability of rejecting the null) of a hard-to-interpret value. Oftentimes, the statistician or data scientist laying down the groundwork for the A/B test will have to do a power test to determine sample size and then interface with a Product Manager or Marketing Exec in order to relay the results. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as A/B testing given informative priors and the ability to inspect an entire distribution over a parameter, not just a point estimate.
Enter Bayesian A/B testing.
Bayesian methods provide several benefits over frequentist methods in the context of A/B tests, namely in interpretability. Instead of p-values you get direct probabilities on whether A is better than B (and by how much). Instead of point estimates, your posterior distributions are parametrized random variables which can be summarized any number of ways. Bayesian tests are also immune to ‘peeking’ and are thus valid whenever a test is stopped.
This document is meant to provide a brief overview of the bayesAB package with a few usage examples. A basic understanding of statistics (including Bayesian) and A/B testing is helpful for following along.
Unlike a frequentist method, in a Bayesian approach you first encapsulate your prior beliefs mathematically. This involves choosing a distribution over which you believe your parameter might lie. As you expose groups to different tests, you collect the data and combine it with the prior to get the posterior distribution over the parameter(s) in question. Mathematically, you are looking for P(parameter | data), which is a combination of the prior and the likelihood of the data (the math, while relatively straightforward, is outside of the scope of this brief intro).
As mentioned above, there are several reasons to prefer Bayesian methods for A/B testing (and other forms of statistical analysis!). First of all, interpretability is everything. Would you rather say “P(A > B) is 10%”, or “Assuming the null hypothesis that A and B are equal is true, the probability that we would see a result this extreme in A vs B is equal to 3%”? I think I know my answer. Furthermore, since we get a probability distribution over the parameters of the distributions of A and B, we can say something such as “There is a 74.2% chance that A’s $\lambda$ is between 3.7 and 5.9.” directly from the methods themselves.
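With posterior draws in hand, such statements are one-liners; for example (the Gamma shape and rate here are invented purely for illustration):

```r
# Hypothetical posterior draws of A's lambda
set.seed(11)
lambda_A <- rgamma(1e5, shape = 230, rate = 48)

# Direct probability statements are just sample means
mean(lambda_A > 3.7 & lambda_A < 5.9)   # P(3.7 < lambda < 5.9)
quantile(lambda_A, c(0.025, 0.975))     # a 95% credible interval
```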
Secondly, by using an informative prior we alleviate many common issues in regular A/B testing. For example, repeated testing is an issue in A/B tests. This is when you repeatedly calculate the hypothesis test results as the data comes in. In a perfect world, if you were trying to run a Frequentist hypothesis test in the most correct manner, you would use a power test calculation to determine sample size and then not peek at your data until you hit the amount of data required. Each time you run a hypothesis test calculation, you incur a probability of false positive. Doing this repeatedly makes the possibility of any single one of those ‘peeks’ being a false positive extremely likely. An informative prior means that your posterior distribution should make sense any time you wish to look at it. If you ever look at the posterior distribution and think “this doesn’t look right!”, then you probably weren’t being fair with yourself and the problem when choosing priors.
Furthermore, an informative prior will help with the low base-rate problem (when the probability of a success or observation is very low). By indicating this in your priors, your posterior distribution will be far more stable right from the onset.
One criticism of Bayesian methods is that they are computationally slow or inefficient. bayesAB leverages the notion of conjugate priors to sample from analytical distributions very quickly (1e6 samples in <1s).
We’ll walk through two examples: one Bernoulli random variable modeling click-through rate onto a page, and one Poisson random variable modeling the number of selections one makes once on that page. We will also go over how to combine these into a more arbitrary random variable.
Let’s say we are testing two versions of Page 1, to see the CTR onto Page 2. For this example, we’ll just simulate some data with the properties we desire.
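A simulation along those lines might look like this (the true rates 0.25 and 0.20 are assumptions for the example):

```r
# Simulate click-through data for two versions of Page 1
set.seed(5)
A_binom <- rbinom(250, size = 1, prob = 0.25)  # clicks for version A
B_binom <- rbinom(250, size = 1, prob = 0.20)  # clicks for version B
mean(A_binom)  # empirical CTR, which should land near 0.25
```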
Of course, we can see the probabilities we chose for the example, but let’s say our prior knowledge tells us that the parameter p in the Bernoulli distribution should roughly fall over the .2-.3 range. Let’s also say that we’re very sure about this prior range and so we want to choose a pretty strict prior. The conjugate prior for the Bernoulli distribution is the Beta distribution (?bayesTest for more info).
Now that we’ve settled on a prior, let’s fit our bayesTest object.
bayesTest objects come coupled with print, plot, and summary generics. Let’s check them out:
print talks about the inputs to the test, summary will do a P((A - B) / B > percentLift) and credible interval on (A - B) / B calculation, and plot will plot the priors, posteriors, and the Monte Carlo ‘integrated’ samples.
Now we are on Page 2. On Page 2 you have any number of ‘interactions’ you can make (being vague is fun). Let’s say we wish to parametrize the amount of ‘interactions’ a user has by the Poisson distribution. Let’s also say our priors would have us believe that the bulk of users will make between 5-6 interactions but we aren’t too sure on that number, so we will allow a reasonable probability for other values. The conjugate prior for the Poisson distribution is the Gamma distribution.
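One way to encode that belief (the shape and rate values are my own picks; the prior mean of a Gamma(shape, rate) is shape/rate):

```r
# A Gamma prior with its bulk near 5-6 but non-trivial spread
shape <- 9
rate <- 1.6
shape / rate                        # prior mean, 5.625
qgamma(c(0.05, 0.95), shape, rate)  # a wide range stays plausible
```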
Let’s fit our bayesTest object in a similar manner.
Another feature of bayesAB is the ability to decompose your end distribution into a series of intermediate distributions which are easier to parametrize. For example, let’s take the above example and say we want to test the effect of JUST Page 1 on Page 2’s interactions. Sure, we can try to come up with a way to parametrize the behaviors on Page 2 in the context of the conversion from Page 1, but isn’t it easier to encapsulate both parts as their own random variables, with their own informed priors from past traffic data? Using the combine function in bayesAB we can make this possible. Let’s consider the same test objects we have already fit. A combined object would look like this:
Small note: the magrittr example may not look very elegant but it comes in handy when chaining together more than just 2 bayesTests.
For the combined distribution, we use the * function (multiplication) since each value of the Poisson distribution for Page 2 is multiplied by the corresponding probability of landing on Page 2 from Page 1 in the first place. The resulting distribution can be thought of as the ‘Expected number of interactions on Page 2’ so we have chosen the name ‘Expectation’. The bayesTest class is closed under combine, meaning the resulting object is also a bayesTest. That means the same generics apply.
This document was meant to be a quickstart guide to A/B testing in a Bayesian light using the bayesAB package. Feel free to read the help documents for the individual functions shown above (some have default params that can be changed, including the generics). Report any issues or grab the development version from our Github.
Twitter released a new R package earlier this year named AnomalyDetection (link to Github). The Github goes into a bit more detail, but at a high level it uses a Seasonal Hybrid ESD (S-H-ESD) test, which is built upon the Generalized ESD (Extreme Studentized Deviate) test, a test for outliers. The S-H-ESD is particularly noteworthy since it can detect both local and global outliers. That is, it can detect outliers within local short-term seasonal trends, as well as global outliers that fall far above or below all other values.
I’ve been dealing with a similar problem as part of my personal work (and FT job as it happens) so this package came up alongside the literature.
One area in which this package is lacking is data visualization, as we’ll see below. Since the purpose of this problem is to find anomalies in time-series data, dygraphs jumps to mind. A few of the benefits and features of dygraphs:
I’m sure I’m missing a couple but those who haven’t used it yet will quickly see why it’s great and particularly well suited to this type of problem. Plus, it just LOOKS better.
I figure the best way to show the extensions is to go through a few use cases using the AnomalyDetection package and show the normal plotting vs. the new and improved plotting.
Let’s load a few packages first:
In this case, we won’t restrict our potential anomalies to only the last day, hour, week, etc., and will instead focus on the entire time-series.
The benefits are noticed immediately. Try experimenting with the ‘roll period’ in the bottom left of the interactive plot to smooth out the time-series. Or, try zooming in on a cluster of anomalies by using the slider on the bottom, or by simply selecting a shaded region horizontally. Double-click to reset, and try a bunch more stuff to explore your data even further. Notice that if you hover over an anomaly you get a popup with the date (this can be customized to say anything). On top of all of this, a reactive component in the top-right keeps track of the date and ‘y’ value on which you are hovering.
The AnomalyDetection package also has an option where you can look for anomalies in only the last ‘x’ periods (days, weeks, months, etc.). In this case, not only does it find anomalies only in the specified period, the plot changes as well. The Github link goes into this in more depth, but basically it cuts off the plot earlier and dims the ‘irrelevant’ time-series. Of course, this isn’t needed for the dygraphs case since you can zoom in/out at your own will. Let’s check the examples out:
We see it’s easy to have the dygraph start zoomed in on the same days as the original plot, but we have the ability to easily look further back. Also, recreating the shading isn’t hard to do. Note that while it begins zoomed in on a 6-day window (to mirror the base package plot), double-clicking it will zoom out fully as with the other examples. To ‘reset’ the plot to the 6-day window you’ll have to reload the page. You can also manually restrict the axis upon creation of the plot, but since you can zoom in and out fluidly as much as you like, you may not ever need to do this.
All in all, we see how a few simple modifications greatly enhance the plotting capabilities of the AnomalyDetection package. If you’re like me (and most other Data Scientists) a good chunk of your work relates to presenting information to non- or less technical people in a palatable format. Rather than writing your own wrappers to digest this information in a tabular or textual format from the internals of the R package, by simply employing dygraphs you allow anybody to do their own exploration of the data, zoom in on anomalies, smooth out data to see the underlying trends, and much more. This would pair extremely well with a Shiny app that lets the user swap in time-series and change other parameters such as sensitivity and direction.
To extend this work I am writing my own wrappers around the package to automate a lot of the plot building. As you noticed, for this quick example I’m manually filling in values such as title, axis labels, date ranges, and so on. However, this shouldn’t be too hard to do programmatically from the package internals. Whether I want to implement this as a separate method within the package itself and potentially do a pull request is still in the air; the package’s architecture is a little messy for that sort of contribution. Perhaps if some time frees up.
Yay. We can now define the Lagrange polynomials:
Why am I making you look at this beauté? Turns out there are some neat mathematical properties, namely in the subject of polynomial interpolation. That’s a fancy way of saying ‘fit a polynomial through certain points’ but we’ll get to that.
Now, it’s easy to see that
since for $k = j$ every term in the product will be 1 ($t = t_k$ in the product), while for $k \ne j$ there will be one term with $t_j - t_j$ in the numerator (rendering the whole product 0). Keep in mind that the product above is not defined for the term where $j = k$; otherwise we’d have a division-by-zero term.
See where this is going? Let’s introduce $y_1, …, y_n$ as arbitrary real numbers. How would we construct a polynomial that goes through the $(t_i, y_i)$ points? That is, can we find a polynomial $p$ such that $p(t_j) = y_j$ for each index $j = 1, …, n$? Hint: I didn’t write all that stuff up there for no reason.
Given what we know about $p_k(t_j)$ we can do the following:
Now we’ve defined an equation based on arbitrary $t_i$ and $y_i$ that will go through the points $(t_i, y_i)$ and behave otherwise elsewhere (as we will soon see).
How do we code something like this up? As usual there’s a lot of different ways but I decided to go with closures as function factories because I tend to think functionally, and also because the code is a bit more concise in this case. First we can define a $p_k(t)$ function factory.
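A sketch of such a factory (my reconstruction of the idea, not necessarily the original code):

```r
# Function factory: given index k and knot vector ts, return the
# closure p_k(t) = prod_{j != k} (t - t_j) / (t_k - t_j)
p_k <- function(k, ts) {
  function(t) prod((t - ts[-k]) / (ts[k] - ts[-k]))
}

ts <- c(0, 1, 2, 3)
p2 <- p_k(2, ts)
p2(ts[2])  # 1 at its own knot
p2(ts[3])  # 0 at every other knot
```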
Given inputs k and ts (a vector), we get back a $p_k$ that can be evaluated at any real point. It also has the above-mentioned properties.
Next a $p(t)$ function factory.
It takes in a Y vector and a list of polynomials ($p_k(t)$) and outputs a realvalued function of $t$. The list of polynomials in this case is a list of functions (output of the p_k function).
Finally a quick function to create the combined polynomial given a vector of Y and T (constructing the $p_k$ polynomials in the process).
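Putting it together, a self-contained sketch of the combined polynomial (again my reconstruction):

```r
# Build p(t) = sum_k y_k * p_k(t) from vectors of knots and values
lagrange <- function(ts, ys) {
  # p_k(t) = prod_{j != k} (t - t_j) / (t_k - t_j), built as a closure
  p_k <- function(k) function(t) prod((t - ts[-k]) / (ts[k] - ts[-k]))
  pks <- lapply(seq_along(ts), p_k)
  function(t) sum(mapply(function(y, pk) y * pk(t), ys, pks))
}

# Four knots on t^3 recover the cubic exactly
ts <- c(-1, -1/3, 1/3, 1)
p <- lagrange(ts, ts^3)
p(0.5)  # ~0.125, i.e. 0.5^3
```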
We see that it has the desired effect:
Finally, a few applications:
We let t be 10 equispaced points between 1 and 1, and Y be the cube of those points. As we can see, the polynomial interpolation of those points looks to be roughly equal to $f(x) = x ^ 3$ which is pretty ‘good’.
Now we try it with points that aren’t monotone increasing in both $t$ and $y$. The result is a little weird with sharp corners near the edges.
And what if we tried to interpolate on completely random points (i.e. not trying to ‘retrieve’ some sort of function transformation)?
Pretty cool. You can play around with this but the polynomials have massive spikes near $t_1$ and $t_n$ if you introduce too many more points.
If you’re a statistician, alarms should be going off in your head as far as using polynomial interpolation to describe a generalized relationship between two vectors of data (overfitting anybody?). Any form of function fitting (even nonparametric) requires some pretty strong assumptions on the smoothness of the curve involved. It also quickly becomes clear that if you add in three or more non-colinear points, the polynomial interpolation will start to have very sharp edges. This is where other options (i.e. linear regression) come into play and help fix the bias-variance problem we are observing. Also, more modern ‘smoothing’ techniques such as splining exist.
That being said, under the right conditions (monotonicity, non-complicated transformations, etc.) the polynomial retrieved from this method may not be a bad generalization of the relationship between $t$ and $y$. All in all, there are some cool mathematical properties at play here, and this isn’t too shabby for something that was intuitively derived in the 1700s.
We explore the possibility of improving data analysis through the use of interactive visualization. Exploration of data and models is an iterative process. We hypothesize that dynamic, interactive visualizations greatly reduce the cost of new iterations and thus facilitate agile investigation and rapid prototyping. Our web-application framework, flyvis.com, offers evidence for such a hypothesis for a dataset consisting of airline on-time flight performance between 2006-2008. Utilizing our framework, we are able to study the feasibility of modeling subsets of flight delays from temporal data, which fails on the full dataset.
Technically, this was a very fun project. Shiny is an extremely powerful package which provides the interactive framework necessary to build such applications. We also made use of the JavaScript library leaflet.js for the interactive map. All in all, I learned quite a bit about writing efficient R code, as the dataset we were using had over 18 million observations.
To learn more about the app check out the projects page or the actual application website FlyVis.com.
FlyVis lets you dynamically explore the airports on-time dataset, which yields some pretty interesting graphs. For example, if we look at the intraday distribution of flights and delays for Memphis:
we see a pretty interesting pattern. Turns out the FedEx shipments control most of the flights out of Memphis which gives us this unique shape.
The beauty of an interactive application is that you (the reader) can discover something that I haven’t even considered. I merely provide the tools and you can explore. If anybody finds some cool patterns in certain airports then I’d love to hear about it over email or comment.
Once we polish the application a bit more we will release the source code on GitHub. Disclaimer: the site will initially take a minute or two to load since our server has to load the massive dataset into memory. Also, the plots do take 5-6 seconds to generate. Again, this is due to the size of our data and is something we are currently trying to optimize.
Finally, what would this post be without some R code? Here’s what we used for the Calendar Heatmap plot:
Doesn’t include the preprocessing but that’ll be out early 2014.
In this post, we’ll examine some fascinating properties of Roman numerals, namely the lengths of Roman numerals in succession.
First, we define a simple Arabic to Roman numeral converter. Start by creating two vectors, one for the 13 Roman symbols and another for the Arabic counterparts. Next, a simple for/while combination iterates through the arrays and chooses the appropriate Roman symbols while iteratively decreasing the input variable.
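A reconstruction of such a converter (the greedy loop over 13 value/symbol pairs; the names here are mine):

```r
# Greedy Arabic -> Roman converter using the 13 standard symbols
to_roman <- function(n) {
  vals <- c(1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1)
  syms <- c("M", "CM", "D", "CD", "C", "XC", "L", "XL",
            "X", "IX", "V", "IV", "I")
  out <- ""
  for (i in seq_along(vals)) {
    while (n >= vals[i]) {
      out <- paste0(out, syms[i])
      n <- n - vals[i]
    }
  }
  out
}

to_roman(1987)  # "MCMLXXXVII"
```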
Next we want to find the lengths of each of these representations:
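As an aside, base R ships its own converter, as.roman in the utils package, so the lengths can be computed in one line:

```r
# nchar of the Roman representation for 1..500
lens <- nchar(as.character(as.roman(1:500)))
lens[1:10]  # 1 2 3 2 1 2 3 4 2 1
```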
Finally, plot the result:
What a fascinating relationship! The lengths seem to take a sinusoidal shape with positive drift. We can see that although the average length keeps rising, the plot goes through very regular cycles of rising and falling.
For one of my computational finance classes, I attempted to implement a machine learning algorithm in order to predict stock prices, namely S&P 500 Adjusted Close prices. In order to do this, I turned to Artificial Neural Networks (ANNs) for a plethora of reasons. ANNs have been known to work well for computationally intensive problems where a user may not have a clear hypothesis of how the inputs should interact. As such, ANNs excel at picking up hidden patterns within the data, so well that they often overfit!
Keeping this in mind, I experimented with a technique known as a ‘sliding window’. Rather than train the model with years of S&P 500 data, I created an ANN that would train over the past 30 days (t-30, …, t) to predict the close price at t+1. A 30-day sliding window seemed to be a good fit: it wasn’t so wide that it failed to capture the current market atmosphere, but it also wasn’t so narrow as to be hypersensitive to recent erratic movements.
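The mechanics of the window can be sketched on a toy series (the “model” here is just the window mean, standing in for the ANN, and the series itself is simulated):

```r
set.seed(9)
series <- cumsum(rnorm(120)) + 1500  # stand-in for S&P close prices
window <- 30

# Train on days (t - 30)..(t - 1), predict day t, slide forward
preds <- sapply((window + 1):length(series), function(t) {
  mean(series[(t - window):(t - 1)])  # an ANN would be fit here instead
})
rmse <- sqrt(mean((series[(window + 1):length(series)] - preds)^2))
```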
Then, I had to decide on the input variables I was going to use. Many stock market models are pure time-series autoregressive functions, but the benefit of ANNs is that we can use them as a more traditional machine learning technique, with several inputs (and not only previous prices). This is helpful because there exists an extremely large number of technical indicators which should uncover some significance in the market. I defined several of my own inputs that I thought would be significant predictors, such as:
In addition, I did some research and introduced several other technical indicators. Since there were so many, I fit a Random Forest model in order to reduce the dimensionality of the problem. After taking only the most important variables, I was left with:
Finally, in order to obtain a stable model, I had to scale the inputs so that all variables lay in the range of [-1, 1]. This is done to avoid oversaturating the individual neurons in the ANN with outliers. There are several ways to do this, and I just created a simple function called ‘myScale’ to get the job done.
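A minimal min-max version of such a function (my reconstruction, not necessarily the original myScale):

```r
# Linearly map a vector onto [-1, 1]
myScale <- function(x) 2 * (x - min(x)) / (max(x) - min(x)) - 1

myScale(c(1, 2, 3, 4, 5))  # -1.0 -0.5 0.0 0.5 1.0
```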
I won’t be sharing the final model, but I did create a simulation of my methodology for the 2013 trading year.
The animation demonstrates the mechanics of a sliding window model. We see that the ‘Training’ window keeps moving up, regardless of whether the prediction was accurate or not. The RMSE of the model for the 2013 dates is 1.2115, which is extremely small considering the prices of the S&P 500 are over 1000. We know from the RMSE, and from the animation, that the fit tends to be good. When the predicted values are slightly off, we can still see that the ANN usually correctly predicts the direction of the S&P 500 for that day. Perhaps in the future I will revisit this problem as a classification instead, due to the results we’re seeing.
The beauty of my result is that although this is a naive implementation of a sophisticated model, we can see that the results are really good. I can spend many weeks tuning the model and playing around with the inputs to get a better fit.
The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R, just because some of the countries’ spellings in this article did not match what is used in the R ‘maps’ package.
The .gif starts from the year 1940, the year when the first McDonald’s opened in the US. There is a bit of downtime between 1940 and 1967 before McDonald’s expands to Canada. After that, the company rapidly takes off all over the globe.
The animation package is phenomenal and makes use of GraphicsMagick. That part alone took me over an hour to figure out because command line typing on Windows is absolutely terrible.
Please email or leave a comment if you have any questions  I would be happy to help you understand the code better.
Pictured above is Charles Minard’s flow map of Napoleon’s march. It is simply amazing that such a detailed and innovative graphic was published in 1869 (way before the first computer). Minard was truly a pioneer in the use of graphics in engineering and statistics.
The troops text file contains the longitude and latitude of Napoleon’s army, both on the attack and retreat, on his march. The cities file contains the longitudes and latitudes of a few major Russian cities that were in Napoleon’s path.
The red line represents the attacking march while the blue line represents the retreat. The thickness of the line corresponds to the size of Napoleon’s army at that point.
Some improvements that could be made:
Datasets and inspiration from Hadley Wickham’s STAT 405 class at Rice University.
I created a function named “SieveOfE” which takes the input of a number and returns all primes up to that number. After inputting a number, “n”, the function creates a logical vector that is “n” in length with “TRUE” in every position. It then manually sets the first spot to be “FALSE” and defines the last prime to be 2.
This is where the efficiency of this algorithm shines. The function then goes through and eliminates all multiples of the last prime (2 for the first iteration) by making their positions’ values “FALSE”. The next “last.prime” is then chosen by first generating a subvector from the “primes” vector of all those values that are still “TRUE” (starting from 1 + the previous prime used) and finding the minimum of that vector (this will find the minimum position relative to the previous last prime rather than the actual next prime number, which is why we add this value to the last prime to get the new “last.prime” variable).
This process is repeated until “last.prime” is less than or equal to the square root of the input number (this greatly optimizes the function). Finally, the function outputs which values in the “primes” vector are still true.
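Putting the description above into code gives something like this (a reconstruction; the names follow the prose):

```r
# Sieve of Eratosthenes over a logical vector of candidate primes
SieveOfE <- function(n) {
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2
  while (last.prime <= sqrt(n)) {
    # knock out every multiple of the current prime
    primes[seq(last.prime * 2, n, by = last.prime)] <- FALSE
    # next prime = first position still TRUE after the current one
    last.prime <- last.prime + min(which(primes[(last.prime + 1):n]))
  }
  which(primes)
}

SieveOfE(30)  # 2 3 5 7 11 13 17 19 23 29
```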
As you can see, this function is lightning quick.