Now that the Facebook Hacker Cup is coming to an end, I figured I’d post my solution to one of the challenges. Unfortunately, I only made it to Round 1, but I was able to answer this rather interesting Qualification Round problem.

The Problem:

When John was a little kid he didn’t have much to do. There was no internet, no Facebook, and no programs to hack on. So he did the only thing he could… he evaluated the beauty of strings in a quest to discover the most beautiful string in the world.

Given a string s, little Johnny defined the beauty of the string as the sum of the beauty of the letters in it.

The beauty of each letter is an integer between 1 and 26, inclusive, and no two letters have the same beauty. Johnny doesn’t care about whether letters are uppercase or lowercase, so that doesn’t affect the beauty of a letter. (Uppercase ‘F’ is exactly as beautiful as lowercase ‘f’, for example.)

You’re a student writing a report on the youth of this famous hacker. You found the string that Johnny considered most beautiful. What is the maximum possible beauty of this string?

The input file consists of a single integer m, followed by m lines.

You can find the input file here.
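
The key observation is that, to maximize the total beauty, the most frequent letter in the string should be assigned beauty 26, the next most frequent 25, and so on. Below is a minimal R sketch of that greedy idea; the input file name is my own placeholder.

lines <- readLines("beauty_input.txt")  # placeholder file name
m <- as.integer(lines[1])
for (i in seq_len(m)) {
  # Lowercase the string and keep only letters, since case doesn't affect beauty
  chars <- strsplit(gsub("[^a-z]", "", tolower(lines[i + 1])), "")[[1]]
  # Sort letters by frequency: the most common gets 26, the next gets 25, etc.
  counts <- sort(table(chars), decreasing = TRUE)
  beauty <- sum(counts * seq(26, by = -1, length.out = length(counts)))
  cat(sprintf("Case #%d: %d\n", i, as.integer(beauty)))
}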

Read on →

I’ve been playing around with the ‘twitteR’ package for R ever since I heard of its existence. Twitter is great and easy to mine because the messages are all-text and most people’s profiles are public. This process is made even easier with the ‘twitteR’ package, which takes advantage of the Twitter API.

After exploring some of the package’s capabilities, I decided to conduct a fairly basic sentiment analysis on some tweets with various hashtags. Specifically, I analyzed the polarity of each tweet: whether it is positive, negative, or neutral.

The hashtags I used were: #YOLO, #FML, #blessed, #bacon

The actual script is fairly simple and repetitive but does yield some interesting results:
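
The core of it looks something like the sketch below, which is a simplified stand-in rather than the full script: it pulls tweets with searchTwitter and scores polarity with a simple word-list scorer. The lexicon file names and the scoring scheme are my own assumptions.

library(twitteR)

# Positive/negative word lists, e.g. Hu & Liu's opinion lexicon
# (file names are placeholders)
pos.words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg.words <- scan("negative-words.txt", what = "character", comment.char = ";")

# Score a tweet: +1 for each positive word, -1 for each negative word
score.tweet <- function(text) {
  words <- strsplit(gsub("[^a-z ]", "", tolower(text)), " +")[[1]]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Assumes you have already authenticated with the Twitter API
tweets <- searchTwitter("#YOLO", n = 100)
scores <- sapply(tweets, function(t) score.tweet(t$getText()))
table(sign(scores))  # -1 = negative, 0 = neutral, 1 = positive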

Read on →

The inspiration for this post came as I was browsing texts and articles about the USA’s GDP, and I wondered what might have a positive relationship with GDP that would be interesting to graph and explore.

I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough so I decided to write up a quick R program to see what I could find.

library(ggplot2)

# Read the two datasets: educational attainment and GDP per capita by state
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)

# Join the tables on their shared State column
new.data <- merge(bachelors, GDP, by = "State")

colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")

# Regress GDP per capita on the percentage of bachelor's degree holders
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)
Read on →

I was inspired by a few animated GIFs that I saw recently, so I decided to make one of my own. For this project, I sought out a way to effectively visualize how McDonald’s expanded throughout the world. To do this, I created an animated heatmap of the world that maps out how McDonald’s spread over time.

The data I am using is from this Wikipedia page. It took a small amount of manual cleaning before I could import it into R, because some of the country spellings in the article did not match those used in the R ‘maps’ package.
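
One quick way to find those mismatches is to compare the article’s country names against the region names that ‘maps’ knows about. A sketch, where "mcdonalds.csv" and its "country" column are placeholders for the cleaned Wikipedia table:

library(maps)

mcd <- read.csv("mcdonalds.csv", header = TRUE)  # placeholder file/column names

# Region names in the 'maps' world database, with subregions stripped
regions <- unique(sub(":.*", "", map("world", namesonly = TRUE, plot = FALSE)))

# Country names from the article that the 'maps' package doesn't recognize
setdiff(mcd$country, regions)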

Read on →

In Problem 9 of Project Euler we are tasked with finding the product abc of the Pythagorean triplet (a, b, c) such that a + b + c = 1000.

A Pythagorean triplet is a set of three natural numbers such that a² + b² = c².

To solve this problem, we first note that c = √(a² + b²). We therefore only need to run the for loop over a and b, since c is uniquely determined by any given a and b.
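
A minimal brute-force sketch of that idea in R:

# Loop over a and b; derive c and check that it is a natural number
# and that the triplet sums to 1000
for (a in 1:998) {
  for (b in a:998) {
    c <- sqrt(a^2 + b^2)
    if (c %% 1 == 0 && a + b + c == 1000) {
      cat("a =", a, "b =", b, "c =", c, "product =", a * b * c, "\n")
    }
  }
}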

Read on →

Now that I’m done with finals, I finally have time to update my blog. I managed to find three separate Wikipedia entries: one about the Quality of Life scores of several different countries, one about the number of Nobel laureates per capita, and one that is a List of Countries by Income Inequality, which uses the Gini index to rank countries.

I then plotted the data to see if there was a discernible relationship among the three. Most of the work for this project had to do with merging and cleaning the data. I began by pasting the tables from the Wikipedia articles into a .csv file. Since the tables were all different lengths and some countries were missing values for some parameters, I had to tidy the dataset up quite a bit.
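
The merging step itself is short once the tables are in .csv form. A sketch, assuming each file has a shared "Country" column (file names are placeholders):

qol <- read.csv("quality_of_life.csv", header = TRUE)
nobel <- read.csv("nobel_per_capita.csv", header = TRUE)
gini <- read.csv("gini.csv", header = TRUE)

# Keep only countries present in all three tables, then drop incomplete rows
combined <- merge(merge(qol, nobel, by = "Country"), gini, by = "Country")
combined <- na.omit(combined)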

The result, featured below the code, is pretty interesting.

Read on →

Today I’ll be dealing with a dataset that has the price, carat, and several other attributes of almost 54,000 diamonds. It is publicly available in the ggplot2 package. Let’s jump right into it.

library(ggplot2)
library(gridExtra)

data(diamonds)

# Bar chart of cut quality and histogram of carat weight
plot1 <- qplot(cut, data = diamonds)
plot2 <- qplot(carat, data = diamonds, binwidth = .1)

# Arrange the two plots side by side
grid.arrange(plot1, plot2, ncol = 2)
Read on →

Problem 8 of Project Euler asks us to find the greatest product of five consecutive digits in a 1000-digit number. The problem and the 1000-digit number can be found here.

I first saved the number in a text file and used the scan function to import it into R. At first it is a character vector of 20 strings, because scan separates items on line breaks. The paste function then lets me concatenate those strings into a single character string. After that, I use the strsplit function to separate each digit into its own position in a vector. Finally, I convert the character vector into a numeric vector so we can use mathematical operations on it.
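
That pipeline looks something like this ("number.txt" is my placeholder for the saved file):

chunks <- scan("number.txt", what = "character")      # 20 strings, one per line
big.string <- paste(chunks, collapse = "")            # one 1000-character string
digits <- as.numeric(strsplit(big.string, "")[[1]])   # numeric vector of 1000 digits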

It was easy to see that if we take products of 5 consecutive digits of a number, there are x - 4 total products, where x is the total number of digits. (If it’s a 5-digit number, only 1 product exists. For a 6-digit number, we can multiply digits 1 through 5 or 2 through 6, for 2 different products.)
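
Continuing the sketch above, the x - 4 window products can be computed directly:

n <- length(digits)   # 1000 digits, so n - 4 = 996 products
products <- sapply(1:(n - 4), function(i) prod(digits[i:(i + 4)]))
max(products)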

Read on →

I recently discovered an awesome R package called zipcode, so I decided to play around with it a little bit. I found some IRS data on the 100 highest- and 100 lowest-income zip codes in the US. After cleaning up and modifying the data a little bit, I plotted it onto a map projection of the US.

library(ggplot2)
library(maps)
library(zipcode)

data(zipcode)

high.income <- read.csv("high.csv", header = TRUE)
low.income <- read.csv("low.csv", header = TRUE)

# Label each set so the points can be colored by income type
high.income$type <- "High"
low.income$type <- "Low"

# Standardize zip codes (restores dropped leading zeros)
high.income$zip <- clean.zipcodes(high.income$zip)
low.income$zip <- clean.zipcodes(low.income$zip)

# Attach latitude/longitude from the zipcode dataset
high.income <- merge(high.income, zipcode, by = "zip")
low.income <- merge(low.income, zipcode, by = "zip")

#### Remove Hawaii
low.income <- low.income[-which(low.income$state == "HI"), ]

states <- map_data("state")

# Draw state outlines, then overlay the zip codes sized by salary
g <- ggplot() + geom_path(aes(x = long, y = lat, group = group), data = states)

g <- g + geom_point(aes(x = longitude, y = latitude, color = type, size = Salary),
                    data = high.income)

g <- g + geom_point(aes(x = longitude, y = latitude, color = type, size = Salary),
                    data = low.income)

g <- g + scale_size_continuous(range = c(3, 10))

g <- g + theme_bw() + labs(x = NULL, y = NULL) + ggtitle("Zip Codes by Salary")

g
Read on →

For this post, I attempted to reconstruct a famous visualization of Napoleon’s march to Moscow. The French Invasion of Russia is considered a major turning point in the Napoleonic Wars. Napoleon’s army was vast when the campaign began; by the end of the march on Moscow, it had been reduced to a tiny fraction of its original size.

Pictured above is Charles Minard’s flow map of Napoleon’s march. It is simply amazing that such a detailed and innovative graphic was published in 1869 (way before the first computer). Minard was truly a pioneer in the use of graphics in engineering and statistics.

Read on →