R Sentiment Analysis

This is a second exercise I followed from Julia Silge ( https://juliasilge.com/blog/ten-thousand-tweets/ ) and involves word frequency counts and some sentiment analysis. We are still trying to get a feel for the content of the data. This only constitutes part of the picture, and I would guess it is about one third of the basic high-level analysis.

tweets <- read.csv("D:/CivicTechYYC/3thingsforcanada3.csv",stringsAsFactors = FALSE)
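
Before doing anything else it is worth a quick look at what was read in. A minimal sketch (the only columns relied on later are text and timestamp; the rest are whatever the Twitter export produced):

str(tweets)            # column names, types and number of rows
head(tweets$text, 3)   # a few raw tweet texts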

Wordcloud

Other than the items flagged in the warning, we see the top 100 words used across all the texts we retrieved from Twitter. “Canada 150”, “things” and “community” are the most frequently referenced after excluding the long strings, and these are expected and confirm the consistent context of the tweets. You can tell there is a lot of activity in Calgary and Guelph, and references to Mayor Nenshi really stand out. Generally we see a lot of positive words, and maybe they are more about raising awareness and are calls to action?

library(tm)
## Loading required package: NLP
library(stringr)
library(wordcloud)
## Loading required package: RColorBrewer
library(tidytext)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RColorBrewer)

nohandles <- str_replace_all(tweets$text, "@\\w+", "")        # strip @handles from the tweet text
wordCorpus <- Corpus(VectorSource(nohandles))                 # one document per tweet
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, removeWords, c("amp", "2yo", "3yo", "4yo"))  # leftover entity and noise tokens
wordCorpus <- tm_map(wordCorpus, stripWhitespace)

replace_reg1 <- "https://t.co/[A-Za-z\\d]+|"
replace_reg2 <- "http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
replace_reg <- paste0(replace_reg1, replace_reg2)
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
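
These patterns come from the tidytext tweet examples and are only defined here, not applied to the corpus above. A sketch of how they could be used to strip t.co links, HTML entities and RT markers from the raw text before building the corpus:

# Sketch: remove URLs, &amp;/&lt;/&gt; entities and RT from the de-handled text
cleanedtext <- str_replace_all(nohandles, replace_reg, "")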

pal <- brewer.pal(9, "YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)
wordcloud(words = wordCorpus, scale=c(30,0.1), max.words=100, random.order = FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=pal)
## Warning in wordcloud(words = wordCorpus, scale = c(30, 0.1), max.words =
## 100, : 3thingsforcanada could not be fit on page. It will not be plotted.
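
The warning is largely a side effect of the very aggressive scale = c(30, 0.1), which makes the top term enormous. A gentler scale (an assumption about what fits on a default graphics device) usually lets it plot:

# Sketch: smaller maximum text size so the most frequent word fits
wordcloud(words = wordCorpus, scale = c(4, 0.3), max.words = 100,
          random.order = FALSE, rot.per = 0.35, colors = pal)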

Extracting the corpus

The code below creates the term-document matrix that basically keeps track of which word occurs in which tweet. The matrix tells us that there are 677 tweets and 2325 words, and it also indicates where there is overlap. The 25 most frequent words are shown in a bar chart, and by the time we get to “share” at number 25 we are at a word that occurs less than 4% of the time. So the TermDocumentMatrix does indicate a sparse and highly variable set of words. A wordcloud of the words that occur exactly 6 times was also plotted; you can see there are a lot of them and they are diverse, which may mean it will take more work to sort out the real tweets of good deeds.

tdm <- TermDocumentMatrix(wordCorpus)
tdm
## <<TermDocumentMatrix (terms: 2325, documents: 677)>>
## Non-/sparse entries: 6186/1567839
## Sparsity           : 100%
## Maximal term length: 172
## Weighting          : term frequency (tf)
#inspect(tdm[1:10, 1:10])
dfwords <- data.frame(inspect(tdm[1:2325, 1:677]))
## <<TermDocumentMatrix (terms: 2325, documents: 677)>>
## Non-/sparse entries: 6186/1567839
## Sparsity           : 100%
## Maximal term length: 172
## Weighting          : term frequency (tf)
## Sample             :
##                   Docs
## Terms              207 394 588 616 624 629 639 641 642 667
##   3thingsforcanada   1   1   1   1   1   1   1   0   0   1
##   canada             0   0   0   0   0   0   0   0   0   0
##   canada150          0   0   1   0   0   0   0   0   0   0
##   community          0   0   0   0   0   0   0   0   0   0
##   great              0   0   1   0   1   0   0   0   0   1
##   guelph             0   1   0   0   1   0   0   0   0   0
##   make               0   1   0   0   0   0   1   0   0   0
##   mayor              1   0   0   1   0   0   0   0   0   0
##   one                0   1   0   0   0   0   0   0   0   0
##   things             0   0   0   0   0   0   0   0   0   0
#findFreqTerms(tdm, 10)
m <- as.matrix(tdm)                         # full term-document matrix
v <- sort(rowSums(m), decreasing=TRUE)      # total frequency of each term
d <- data.frame(word = names(v), freq=v)    # frequency table, one row per term
#d[1:25,]
ggplot(d[2:26,]) +
  geom_bar(stat = 'identity', aes(reorder(word, -freq), freq, fill = freq)) +
  coord_flip() + ggtitle("Top 25 Words (excluding 3thingsforcanada)")
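
The frequency table d makes the “less than 4%” claim easy to check. A quick sketch (the row names of d are the terms themselves; dividing by the number of tweets treats every occurrence as if it came from a different tweet, so this is only approximate):

d["share", "freq"]                        # raw count of "share"
d["share", "freq"] / length(tweets$text)  # rough proportion of the 677 tweets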

set.seed(1234)
mention <- d[d$freq < 8 & d$freq > 1,]
wordcloud(words = mention[mention$freq == 6, 1], 
          max.words=Inf, scale=c(4,.5),random.order=FALSE, rot.per =.15, colors=palette(),
          vfont = c("sans serif", "plain"))
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : sesquicentennial could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : everyone could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : awesome could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : public could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : ways could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : blood could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : donated could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : tix could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : video could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : edmonton could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : cfnwab could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : families could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : water could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = mention[mention$freq == 6, 1], max.words =
## Inf, : n3thingsforcanadancfnwab could not be fit on page. It will not be
## plotted.

Who is mentioned in tweets

These are the people who are mentioned within a tweet. As you can see, Mayor Nenshi seems to be at the center of it all.

friends <- str_extract_all(tweets$text, "@\\w+")
namesCorpus <- Corpus(VectorSource(friends))

set.seed(146)
wordcloud(words = namesCorpus, scale=c(20,0.5), max.words=40, random.order=FALSE, rot.per =0.10,
          use.r.layout=FALSE, colors=pal)
## Warning in wordcloud(words = namesCorpus, scale = c(20, 0.5), max.words =
## 40, : character could not be fit on page. It will not be plotted.
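
str_extract_all() returns a list with one character vector per tweet, including empty vectors for tweets with no @mentions, which is likely where the odd “character” term in the warning comes from. A sketch of an alternative (an assumption, not what was run above) is to flatten the list before building the corpus:

# Assumption: flatten so each extracted handle becomes its own element
allmentions <- unlist(friends)
mentionsCorpus <- Corpus(VectorSource(allmentions))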

Sentiment Analysis

Using R sentiment analysis we can mark words with a sentiment derived from the tweet context. The highest totals are for anticipation, joy and trust, and these describe the optimism associated with the 3things effort.

library(syuzhet)
library(reshape2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
mySentiment <- get_nrc_sentiment(tweets$text)
head(mySentiment)
##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     0            0       0    0   0       0        0     0        0
## 2     0            0       0    0   0       0        0     1        0
## 3     0            1       0    0   0       0        0     0        0
## 4     0            2       0    0   2       0        2     0        0
## 5     0            0       0    0   0       0        0     0        0
## 6     0            0       0    0   1       0        0     0        0
##   positive
## 1        1
## 2        2
## 3        1
## 4        3
## 5        2
## 6        2
tweets <- cbind(tweets, mySentiment)

sentimentTotals <- data.frame(colSums(tweets[,c(11:18)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets")

Positive vs Negative Sentiment

As expected, the sentiment is overwhelmingly positive and there is very little negative. Some of the most positive tweets appear after Canada’s 150th and during the August 2017 timeframe.

tweets$timestamp <- with_tz(ymd_hms(tweets$timestamp), "America/Chicago")
posnegtime <- tweets %>% 
  group_by(timestamp = cut(timestamp, breaks="2 months")) %>%
  summarise(negative = mean(negative),
            positive = mean(positive)) %>% melt
## Using timestamp as id variables
names(posnegtime) <- c("timestamp", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])

ggplot(data = posnegtime, aes(x = as.Date(timestamp), y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Over Time")

Sentiment over weeks and months

The year seems to end on a positive note. There is some indication of Tuesday and Friday being associated with more joy, trust and anticipation. With such a sparse dataset, perhaps this can only serve as a guideline or a way to form hypotheses for future analysis; being a sample dataset, there are probably too few points to gain real insight. It definitely leaves more to look into.

tweets$weekday <- wday(tweets$timestamp, label = TRUE)
weeklysentiment <- tweets %>% group_by(weekday) %>% 
  summarise(anger = mean(anger), 
            anticipation = mean(anticipation), 
            disgust = mean(disgust), 
            fear = mean(fear), 
            joy = mean(joy), 
            sadness = mean(sadness), 
            surprise = mean(surprise), 
            trust = mean(trust)) %>% melt
## Using weekday as id variables
names(weeklysentiment) <- c("weekday", "sentiment", "meanvalue")

ggplot(data = weeklysentiment, aes(x = weekday, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, 0.6) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment During the Week")
## Warning: Removed 8 rows containing missing values (geom_point).
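
Spelling out all eight mean() calls works but is repetitive. A more compact sketch of the same aggregation (assuming dplyr 1.0+ for across(), and the nrc_emotions vector sketched earlier):

# Sketch: compute all eight emotion means in one step, then melt to long form
weeklysentiment2 <- tweets %>%
  group_by(weekday) %>%
  summarise(across(all_of(nrc_emotions), mean)) %>%
  melt(id.vars = "weekday", variable.name = "sentiment", value.name = "meanvalue")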

tweets$month <- month(tweets$timestamp, label = TRUE)
monthlysentiment <- tweets %>% group_by(month) %>% 
  summarise(anger = mean(anger), 
            anticipation = mean(anticipation), 
            disgust = mean(disgust), 
            fear = mean(fear), 
            joy = mean(joy), 
            sadness = mean(sadness), 
            surprise = mean(surprise), 
            trust = mean(trust)) %>% melt
## Using month as id variables
names(monthlysentiment) <- c("month", "sentiment", "meanvalue")

ggplot(data = monthlysentiment, aes(x = month, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment During the Year")