Fun With Twitter

I’ve been playing around with the ‘twitteR’ package for R ever since I heard of its existence. Twitter is easy to mine because tweets are plain text and most people’s profiles are public. The ‘twitteR’ package makes the process even easier by wrapping the Twitter API.

After exploring some of the package’s capabilities, I decided to conduct a pretty basic sentiment analysis on tweets with various hashtags. Specifically, I analyzed the polarity of each tweet: whether it is positive, negative, or neutral.
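
To make "polarity" concrete, here is a deliberately naive sketch that labels a tweet by counting words from two tiny lists. The word lists and the `toy_polarity` helper are made up for illustration; this is not the method the ‘sentiment’ package uses, only a stand-in for the idea of mapping text to a positive, negative, or neutral label.

```r
# Tiny made-up word lists, for illustration only.
positive_words <- c("love", "great", "happy", "blessed")
negative_words <- c("hate", "awful", "sad", "worst")

# Hypothetical helper: label one tweet by comparing word counts.
toy_polarity <- function(text) {
  # Lower-case the text and split on anything that is not a letter.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("I love bacon, great morning",
            "Worst day ever #fml",
            "Just landed")
labels <- vapply(tweets, toy_polarity, character(1), USE.NAMES = FALSE)
print(labels)
```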

The hashtags I used were: #YOLO, #FML, #blessed, #bacon

The actual script is fairly simple and repetitive but does yield some interesting results:

library(twitteR)
library(sentiment)
library(ggplot2)
library(RJSONIO)
library(wordcloud)

yolo_tweets <- searchTwitter("#yolo", n = 1500)
yolo_tweets <- twListToDF(yolo_tweets)
yolo_tweets <- yolo_tweets$text
yolo_emotions <- classify_emotion(yolo_tweets)
yolo_polarity <- classify_polarity(yolo_tweets)
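# Pair each tweet's best-fit polarity label (column 4) with its hashtag.
# cbind() sizes the matrix to however many tweets were actually returned,
# which can be fewer than the 1500 requested.
yolo_polarity.new <- cbind(yolo_polarity[, 4], "yolo")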

fml_tweets <- searchTwitter("#fml", n = 1500)
fml_tweets <- twListToDF(fml_tweets)
fml_tweets <- fml_tweets$text
fml_emotions <- classify_emotion(fml_tweets)
fml_polarity <- classify_polarity(fml_tweets)
# cbind() avoids assuming that exactly 1500 tweets came back.
fml_polarity.new <- cbind(fml_polarity[, 4], "fml")


blessed_tweets <- searchTwitter("#blessed", n = 1500)
blessed_tweets <- twListToDF(blessed_tweets)
blessed_tweets <- blessed_tweets$text
blessed_emotions <- classify_emotion(blessed_tweets)
blessed_polarity <- classify_polarity(blessed_tweets)
# cbind() avoids assuming that exactly 1500 tweets came back.
blessed_polarity.new <- cbind(blessed_polarity[, 4], "blessed")


bacon_tweets <- searchTwitter("#bacon", n = 1500)
bacon_tweets <- twListToDF(bacon_tweets)
bacon_tweets <- bacon_tweets$text
bacon_emotions <- classify_emotion(bacon_tweets)
bacon_polarity <- classify_polarity(bacon_tweets)
# cbind() avoids assuming that exactly 1500 tweets came back.
bacon_polarity.new <- cbind(bacon_polarity[, 4], "bacon")


polarities <- rbind(yolo_polarity.new, fml_polarity.new,
                    blessed_polarity.new, bacon_polarity.new)
colnames(polarities) <- c("Polarities", "Hashtag")

qplot(polarities[, 2], fill = polarities[, 1]) + xlab("Hashtags") +
      scale_fill_discrete(name = "Text Polarity") +
      ggtitle("Polarities of Different Hashtags on Twitter")
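
Since the four hashtag blocks differ only in the search term, the repetition could be folded into one helper. The sketch below is one possible shape, not the script I actually ran: `polarity_for`, `fake_fetch`, and `fake_classify` are hypothetical names, and the fetch and classify steps are passed in as functions so the example runs without Twitter credentials or the ‘sentiment’ package.

```r
# Hypothetical helper: fetch tweets for one hashtag, classify them, and
# return one row per tweet with its polarity label and hashtag.
polarity_for <- function(hashtag, fetch, classify) {
  texts <- fetch(hashtag)
  data.frame(Polarities = classify(texts),
             Hashtag = rep(hashtag, length(texts)),
             stringsAsFactors = FALSE)
}

# Stand-ins for illustration only; they return made-up data.
fake_fetch <- function(hashtag) paste("sample tweet", 1:3, hashtag)
fake_classify <- function(texts) rep("positive", length(texts))

hashtags <- c("yolo", "fml", "blessed", "bacon")

# Apply the helper to every hashtag and stack the results.
polarities <- do.call(rbind, lapply(hashtags, polarity_for,
                                    fetch = fake_fetch,
                                    classify = fake_classify))
print(table(polarities$Hashtag))
```

With the real packages loaded, `fetch` could wrap the `searchTwitter()` and `twListToDF()` calls from the script, and `classify` could return column 4 of `classify_polarity()`.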

The bar chart shows something peculiar: all of these hashtags seemed to be associated mostly with positive messages. I did not expect that for #fml, since the fmylife site is a place for people to post negative things that have happened to them. The other hashtags had more positives, which was expected.
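
A plot makes the comparison easy to eyeball, but the same question can be answered with a table of shares per hashtag. Here is a small sketch using base R’s `table()` and `prop.table()`. The toy `polarities` matrix mirrors the two-column layout built in the script, but its labels are invented for illustration and are not my actual results.

```r
# Toy stand-in for the two-column `polarities` matrix; labels are made up.
polarities <- cbind(
  Polarities = c("positive", "negative", "positive", "neutral",
                 "positive", "positive", "negative", "positive"),
  Hashtag    = c("fml", "fml", "fml", "fml",
                 "blessed", "blessed", "blessed", "blessed")
)

# Count tweets for every hashtag / polarity combination.
counts <- table(Hashtag = polarities[, "Hashtag"],
                Polarity = polarities[, "Polarities"])

# Convert counts to row-wise shares so that hashtags with different
# numbers of tweets can be compared directly.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```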

Next, I decided to explore some of the functions of the ‘wordcloud’ package in R. To do so, I mined tweets that contained #rstats and built a wordcloud that sized and placed words based on their frequencies.

rstats_tweets <- searchTwitter("#rstats", n = 1500)
rstats_tweets <- twListToDF(rstats_tweets)
rstats_tweets <- rstats_tweets$text


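library(tm)  # provides Corpus(), tm_map(), and TermDocumentMatrix()

rstats_corpus <- Corpus(VectorSource(rstats_tweets))
# Clean up the encoding of every tweet; sub = "byte" replaces any byte that
# cannot be converted with its <xx> hex code. content_transformer() is
# needed in tm 0.6 and later so that the result stays a corpus document.
rstats_corpus <- tm_map(rstats_corpus,
                        content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))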

tdm <- TermDocumentMatrix(rstats_corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = c("rstats",
                                                       stopwords("english")),
                                         removeNumbers = TRUE,
                                         tolower = TRUE))

m <- as.matrix(tdm)

word_freqs <- sort(rowSums(m), decreasing = TRUE)

dm <- data.frame(word = names(word_freqs), freq = word_freqs)

wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))

I had to use the tm_map function to ensure all tweets were encoded properly before using TermDocumentMatrix. As we can see, ‘shiny’ is by far the most popular word tweeted with ‘#rstats’. This should come as no surprise: Shiny is RStudio’s new and exciting way to integrate R with web applications.
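
For anyone curious what the frequency table behind the wordcloud boils down to, here is a rough base-R sketch that skips ‘tm’ entirely. The three tweets and the short stop-word list are made up for illustration, and unlike the script above this version keeps digits so that ‘ggplot2’ survives as one word.

```r
# Made-up tweets, for illustration only.
tweets <- c("Shiny makes R web apps easy #rstats",
            "New shiny release today #rstats",
            "Plotting with ggplot2 #rstats")

# A tiny hand-picked stop-word list (an assumption, not tm's English list).
stop_words <- c("rstats", "with", "makes", "new", "r")

# Lower-case, split on anything that is not a letter or digit,
# then drop empty strings and stop words.
words <- unlist(strsplit(tolower(tweets), "[^a-z0-9]+"))
words <- words[nzchar(words) & !(words %in% stop_words)]

# Count each word and sort from most to least frequent.
word_freqs <- sort(table(words), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs),
                 freq = as.vector(word_freqs),
                 stringsAsFactors = FALSE)
print(head(dm, 3))
```

A data frame shaped like `dm` is what gets handed to `wordcloud()`, which then scales each word by its `freq` value.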