Latest Posts
Many of my research and personal interests lie in the realm of Machine Learning for several reasons. To me, it is the perfect blend of mathematics, statistics, and computer science. Also, it is extremely pervasive in today's society. Everything from improved web-search to self-driving cars can be attributed to developments in Machine Learning.
For one of my computational finance classes, I attempted to implement a Machine Learning algorithm in order to predict stock prices, namely S&P 500 Adjusted Close prices. In order to do this, I turned to Artificial Neural Networks (ANN) for a plethora of reasons. ANNs have been known to work well for computationally intensive problems where a user may not have a clear hypothesis of how the inputs should interact. As such, ANNs excel at picking up hidden patterns within the data so well that they often overfit!
Keeping this in mind, I experimented with a technique known as a 'sliding window'. Rather than train the model with years of S&P 500 data, I created an ANN that would train over the past 30 days (t-30, ..., t) to predict the close price at t+1. A 30 day sliding window seemed to make a good fit. It wasn't so wide as to not capture the current market atmosphere, but also it wasn't narrow enough to be hypersensitive to recent erratic movements.
Then, I had to decide on the input variables I was going to use. Many stock market models are pure time-series autoregressive functions, but the benefit of ANNs is that we can use them as a more traditional Machine Learning technique, we several inputs (and not only previous prices). This is helpful, because there exist an extremely large number of technical indicators which should uncover some significance in the market. I defined several of my own inputs that I thought would be significant predictors such as:
- 28 Day Moving Average
- 14 Day Moving Average
- 7 Day Moving Average
- Previous Day Close Price
In addition, I did some research and introduced several other technical indicators. Since there were so many, I fit a Random Forest model in order to reduce the dimensionality of the problem. After taking only the most important variables, I was left with:
- Volatility
- SAR
- MACD Oscillator
- ATR
Finally, in order to obtain a stable model, I had to scale the inputs so that all variables lied in the range of [-1, 1]. This is done to avoid oversaturating the individual neurons in the ANN with outliers. There are several ways to do this, and I just created a simple function called 'myScale' in order to get the job done.
myScale <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
I won't be sharing the final model, but I did create a simulation of my methodology for the 2013 trading year.
trainPlot <- function(x) {
g <- ggplot() + geom_path(aes(x = date, y = Adj.Close), colour = I("blue"),
data = sp.2013)
I'm gonna share a short code snippet that I thought was interesting. This post is inspired by one of my engineering computation classes at Rice. The program initializes a pair of coordinates 'z' and iteratively updates z by matrix multiplication based on some random number generation criteria. After each successive coordinate update, the new 'z' is plotted.
library(ggplot2)
z <- matrix(c(0, 0), nrow = 2)
x <- c()
y <- c()
for (i in 1:40000) {
r <- runif(1)
if (r < .01) {
z <- matrix(c(0, 0, 0, .16), nrow = 2, byrow = T) %*% z
}
else if (r < .86) {
z <- matrix(c(.85, .04, -.04, .85), nrow = 2, byrow = T) %*% z + c(0, 1.6)
}
else if (r < .93) {
z <- matrix(c(0.2, -.26, .23, .22), nrow = 2, byrow = T) %*% z + c(0, 1.6)
}
else {
z <- matrix(c(-.15, .28, .26, .24), nrow = 2, byrow = T) %*% z + c(0, .44)
}
x[i] <- z[1]
y[i] <- z[2]
I've been using the Global Terrorism Database a lot lately so I decided to share an interesting plot I made with the data.
The GTD provides over 100,000 observations of terrorist incidents between 1970 and 2011. Of these, there are about 2400 observations in the USA. While this is not a large number, the graph still provides some interesting and intuitive results.
## Load libraries
library(ggplot2)
library(plyr)
library(maps)
library(stringr)
## Load terrorism data
gtd.data <- read.csv("gtd.csv", stringsAsFactors = F)
##
## Begin USA heatmap plot
##
## Subset data to only include terrorist attacks in the USA
gtd.usa <- subset(gtd.data, country_txt == "United States")
## Clean provstate column
gtd.usa$provstate <- str_replace(gtd.usa$provstate, "(U.S. State)", "")
gtd.usa$provstate <- str_replace(gtd.usa$provstate, "[(]", "")
gtd.usa$provstate <- str_replace(gtd.usa$provstate, "[)]", "")
## Trim whitespaces
gtd.usa$provstate <- str_trim(gtd.usa$provstate)
Now that the Facebook Hacker Cup is coming to an end I figured I'd post my solution to one of the challenges. Unfortunately I only made it to Round 1, but I was able to answer this rather interesting Qualification Round problem.
The Problem:
When John was a little kid he didn't have much to do. There was no internet, no Facebook, and no programs to hack on.
So he did the only thing he could... he evaluated the beauty of strings in a quest to discover the most beautiful string in the world.
Given a string s, little Johnny defined the beauty of the string as the sum of the beauty of the letters in it.
The beauty of each letter is an integer between 1 and 26, inclusive, and no two letters have the same beauty.
Johnny doesn't care about whether letters are uppercase or lowercase, so that doesn't affect the beauty of a letter. (Uppercase 'F' is exactly as beautiful as lowercase 'f', for example.)
You're a student writing a report on the youth of this famous hacker. You found the string that Johnny considered most beautiful.
What is the maximum possible beauty of this string?
The input file consists of a single integer m, followed by m lines.
You can find the input file here.
My solution:
inputs <- readLines("inputs.txt", n = -1)
m <- inputs[1]
m <- as.numeric(m)
inputs <- inputs[2: (m + 1)]
outputs <- c()
letters <- letters
for (i in 1:m) {
x <- inputs[i]
x <- tolower(x)
I've been playing around with the 'twitteR' package for R ever since I heard of its existence. Twitter is great and easy to mine because the messages are all-text and most people's profiles are public. This process is made even easier with the 'twitteR' package, which takes advantage of the Twitter API.
After exploring some of the package's capabilities, I decided to conduct a pretty basic sentiment analysis on some tweets with various hashtags. Specifically, I analyzed the polarity of each tweet - whether the tweet is positive, negative, or neutral.
The hashtags I used were: #YOLO, #FML, #blessed, #bacon
The actual script is fairly simple and repetitive but does yield some interesting results:
library(twitteR)
library(sentiment)
library(ggplot2)
library(RJSONIO)
library(wordcloud)
yolo_tweets <- searchTwitter("#yolo", n = 1500)
yolo_tweets <- twListToDF(yolo_tweets)
yolo_tweets <- yolo_tweets$text
yolo_emotions <- classify_emotion(yolo_tweets)
yolo_polarity <- classify_polarity(yolo_tweets)
yolo_polarity.new <- matrix(nrow = 1500, ncol = 2)
yolo_polarity.new[1:1500, 1] <- yolo_polarity[, 4]
yolo_polarity.new[1:1500, 2] <- "yolo"
fml_tweets <- searchTwitter("#fml", n = 1500)
fml_tweets <- twListToDF(fml_tweets)
fml_tweets <- fml_tweets$text
fml_emotions <- classify_emotion(fml_tweets)
fml_polarity <- classify_polarity(fml_tweets)
fml_polarity.new <- matrix(nrow = 1500, ncol = 2)
fml_polarity.new[1:1500, 1] <- fml_polarity[, 4]
fml_polarity.new[1:1500, 2] <- "fml"
The inspiration for this post came as I was browsing texts and articles about USA's GDP and I wondered what might have a positive relationship to GDP that would be interesting to graph and explore.
I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough so I decided to write up a quick R program to see what I could find.
library(ggplot2)
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)
new.data <- merge(bachelors, GDP, by = "State")
colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)
First I imported the two datasets and merged them into one frame. I then built a linear model using GDP per Capita as the response variable and Percent of the Population with Bachelors are the predictor.
The summary of this simple linear model is featured below:
> summary(model)
Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)
Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564, Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF, p-value: 4.737e-06
As we can see, the p-value is very small (small enough for this model to be significant). However the low R-Squared leaves much to be desired. Very little of the variance in the data is accounted for by our statistical model and I question whether it is a good fit or not.
Nevertheless, we can see a pretty interesting graph below:
g <- ggplot(new.data, aes(x = Percent.Bachelors,
y = GDP.Per.Capita)) +
xlab("Proportion of Population with Bachelor's or Higher") +
ylab("GDP Per Capita")
I was inspired by a few animated gifs that I saw recently so I decided to make one of my own. For this project, I sought out a way to effectively visualize how Mcdonald's expanded throughout the world. To do this, I created a heatmap of the world and using animations I was able to efficiently map out how McDonald's became more popular over time.
The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R just because some of the countries' spellings from this article did not match with what is used in the R 'maps' package.
library(ggplot2)
library(maps)
library(mapproj)
library(lubridate)
library(animation)
mcdonalds <- read.csv("mcdonalds.csv", header = TRUE)
mcdonalds$region <- mcdonalds$Country
mcdonalds <- mcdonalds[, 2:3]
world <- map_data("world")
merged.data <- merge(world, mcdonalds, sort=FALSE, by = "region")
merged.data <- merged.data[order(merged.data$order), ]
merged.data$Year <- as.numeric(merged.data$Year)
plotIfStore <- function(x) {
df = subset(merged.data, Year <= x)
g <- ggplot() + geom_path(aes(x = long, y = lat, group = group),
data = world)
g <- g + ggtitle("Growth of Mcdonalds from 1940-2011")
g <- g + geom_polygon(aes(x = long, y = lat, group = group, fill = "red"),
data = df)
In Problem 9 of Project Euler we are tasked with finding the product (abc) of the Pythagorean Triplet (a, b, c) such that a + b + c = 1000.
A Pythagorean triplet is a set of three natural numbers such that a2 + b2 = c2.
To solve this problem, we first see that c = (a2 + b2)1/2. Without loss of generality, we can only run the for loop for a and b, since c will be uniquely determined given a certain a and b.
The code I used:
for (a in 1:499) {
for (b in 1:499) {
if (a + b + sqrt(a^2 + b^2) == 1000) {
print(a * b * sqrt(a^2 + b^2))
break
}
}
if (a + b + sqrt(a^2 + b^2) == 1000) {
break
}
}
I use nested for loops to test values of a and b between 1 and 499. a and b cannot take values greater than 499 because then a + b + sqrt(a2 + b2) would be greater than 1000.
The if statement in the nested loops checks to see whether a, b, and c add up to 1000. If they do, it prints their product and then breaks the loop. This is a short script which produces the correct answer in a few milliseconds. The trick here was expressing c in terms of a and b to reduce the amount of loops we need to run.
Now that I'm done with finals, I finally have time to update my blog. I managed to find three separate Wikipedia entries: one about the Quality of Life Scores of several different countries, one about the number of Nobel Laureates per capita, and one that is a List of Countries by Income Inequality which uses Gini index to rank countries.
I then plotted the data to see if there was a discernable relationship between the three. Most of the work for this project had to do with merging and cleaning the data. I began by pasting the tables from the wikipedia articles into a .csv file. Since the tables were all different lengths and some countries were missing values of the parameters, I had to tidy the dataset up quite a bit.
The result, featured below the code, is pretty interesting.
library(ggplot2)
nobel.data <- read.csv("nobel.csv", header = TRUE)
fixed.nobel.data <- matrix(nrow = 64, ncol = 4)
colnames(fixed.nobel.data) <- c("Country", "Laureates.Per.10.Million", "Quality.Of.Life.Score", "Gini.Score")
fixed.nobel.data[ , 1] <- paste(intersect(intersect(nobel.data$Country,
nobel.data$Country.or.territory),
nobel.data$Country.1))
for(i in 1:64) {
country <- fixed.nobel.data[i, 1]
for (j in 1:48) {
if (nobel.data[j, 2] == country) {
fixed.nobel.data[i , 2] <- nobel.data[j, 3]
}
}
}
for(i in 1:64) {
country <- fixed.nobel.data[i, 1]
for (j in 1:111) {
if (nobel.data[j, 4] == country) {
fixed.nobel.data[i , 3] <- nobel.data[j, 5]
}
}
}
Today I'll be dealing with a dataset that has the price, carat, and several other attributes of almost 54,000 diamonds. It is publically available in the ggplot2 package. Let's jump right into it.
library(ggplot2)
library(gridExtra)
data(diamonds)
plot1 <- qplot(cut, data = diamonds)
plot2 <- qplot(carat, data = diamonds, binwidth = .1)
grid.arrange(plot1, plot2, ncol = 2)

These two histograms help us visualize how common the 'cuts' were as well as the range of carats present in this dataset.
qplot(carat, data = diamonds, fill = cut, binwidth = .1)

This histogram combines the two ideas above and gives an idea of how many diamonds had a certain cut at a specific carat.
qplot(carat, price, colour = cut, data = diamonds)

A simple plot of price vs. carat with color based on cut shows us that price tends to increase with carat as we might expect.
> by(diamonds[ , 1], diamonds[ , 2], summary)
diamonds[, 2]: Fair
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.220 0.700 1.000 1.046 1.200 5.010
----------------------------------------------------------------
diamonds[, 2]: Good
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2300 0.5000 0.8200 0.8492 1.0100 3.0100
----------------------------------------------------------------
diamonds[, 2]: Very Good
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2000 0.4100 0.7100 0.8064 1.0200 4.0000
----------------------------------------------------------------
diamonds[, 2]: Premium
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.200 0.410 0.860 0.892 1.200 4.010
----------------------------------------------------------------
diamonds[, 2]: Ideal
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2000 0.3500 0.5400 0.7028 1.0100 3.5000