3 Things Text Study

Based on Julia Silge’s blog post on text mining. The goal is to extract information from the 3 Things tweets so we can learn more about them.

tweets <- read.csv("D:/CivicTechYYC/3thingsforcanada3.csv",stringsAsFactors = FALSE)

Start with distribution plots

A first look at the data takes the form of plots.

The tweets were extracted over 2017, chosen because we wanted to capitalize on the Canada 150 celebration. A count of tweets binned by timestamp shows that enthusiasm was greater in the first part of the year, with a cluster of high activity around Canada’s birthday.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)
library(scales)

tweets$timestamp <- ymd_hms(tweets$timestamp)
ggplot(data = tweets, aes(x = timestamp)) + 
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Tweets Over Time") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
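
The `stat_bin()` message above asks for an explicit bin width. As a minimal sketch of one option, the block below counts tweets per calendar week in base R. The timestamps are synthetic stand-ins for `tweets$timestamp`, and the weekly bin is an assumption for illustration, not a choice made in this study.

```r
# Synthetic timestamps spread over 2017 (made-up data, not the real tweets).
set.seed(1)
ts <- as.POSIXct("2017-01-01", tz = "UTC") + runif(200, 0, 365 * 24 * 3600)

# cut() with breaks = "week" assigns each timestamp to its calendar week
# (weeks start on Monday); table() then counts the tweets in each week.
weekly <- table(cut(ts, breaks = "week"))
print(head(weekly))

# In ggplot2 a POSIXct axis is measured in seconds, so a one-week bin
# would be geom_histogram(binwidth = 7 * 24 * 3600).
```

Weekly bins keep the Canada Day cluster visible while smoothing out day-to-day noise; a daily or monthly width is an equally valid choice depending on the question.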

Patterns over different ranges

Plotting the number of tweets by month shows a peak right before Canada’s birthday, which is expected. Tweets seem to build during the week, with the highest count on Friday and the lowest on Saturday and Sunday. It seems there is more tweeting on working days; the time of day of the tweets might help us here.

ggplot(data = tweets, aes(x = month(timestamp))) + 
  geom_histogram(breaks = seq(.5, 12.5, by = 1), aes(fill = ..count..)) +  # upper break at 12.5 so December gets a bin
  theme(legend.position = "none") +
  xlab("Month") + ylab("Number of tweets") + ggtitle("Tweets by Month") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

ggplot(data = tweets, aes(x = wday(timestamp))) + 
  geom_histogram(breaks = seq(.5, 7.5, by=1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Days of Week") + ylab("Number of tweets") + ggtitle("Tweets by Day of the Week") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
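
To follow up on whether tweeting tracks working time, the weekday and the hour can be cross-tabulated. The sketch below uses base R on synthetic timestamps. The 09:00 to 17:00 "work hours" window and all variable names are assumptions for illustration.

```r
# Synthetic timestamps stand in for tweets$timestamp (made-up data).
set.seed(42)
start <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
ts <- start + runif(500, min = 0, max = 365 * 24 * 3600)

lt <- as.POSIXlt(ts, tz = "UTC")

# Weekday as a factor, Sunday first to match the default of lubridate::wday().
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekday <- factor(day_names[lt$wday + 1], levels = day_names)

# Rough part of day: assumed working hours (09:00-16:59) versus everything else.
part_of_day <- ifelse(lt$hour >= 9 & lt$hour < 17, "work hours", "other")

day_by_part <- table(weekday, part_of_day)
print(day_by_part)

# Share of each weekday's tweets that fall inside the assumed working hours.
print(round(prop.table(day_by_part, margin = 1), 2))
```

On real data, a markedly higher work-hours share on weekdays than on weekends would support the idea that people tweet more while working.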

Measuring consistency of week patterns

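Chi-squared tests check whether tweets are spread evenly across the days of the week. The first test compares the daily counts with a uniform distribution; its p-value is far below any usual threshold, so a uniform distribution by day of the week is very unlikely, and the monthly counts look no more uniform. The second test looks at weekdays versus the weekend, with the code grouping Monday to Thursday against Friday to Sunday. The first group averages about 22% more tweets per day, but when the counts are tested against a model that gives those days 25% more tweets each, the p-value is again far too low, so even this simple weekly pattern is not supported.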

chisq.test(table(wday(tweets$timestamp, label = TRUE)))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(wday(tweets$timestamp, label = TRUE))
## X-squared = 92.142, df = 6, p-value < 2.2e-16
x3table <- table(wday(tweets$timestamp, label = TRUE))
mean(x3table[c(2:5)]/mean(x3table[c(1,6,7)]))
## [1] 1.218023
chisq.test(table(wday(tweets$timestamp, label = TRUE)), p = c(4, 5, 5, 5, 5, 4, 4)/32)
## 
##  Chi-squared test for given probabilities
## 
## data:  table(wday(tweets$timestamp, label = TRUE))
## X-squared = 95.845, df = 6, p-value < 2.2e-16
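# Seconds since midnight; fixing the units keeps the later POSIXct conversion valid
tweets$timeonly <- as.numeric(difftime(tweets$timestamp, trunc(tweets$timestamp, "days"), units = "secs"))
# Blank out times that fall exactly on the hour, which may indicate missing time information;
# the column is indexed by name rather than by position
tweets[(minute(tweets$timestamp) == 0 & second(tweets$timestamp) == 0), "timeonly"] <- NA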
mean(is.na(tweets$timeonly))
## [1] 0
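
To make the goodness-of-fit tests above less of a black box, here is a minimal sketch that computes the Pearson chi-squared statistic by hand and compares it with `chisq.test()`. The daily counts are made up for illustration; only the 4:5 weighting is taken from the test above.

```r
# Made-up counts per weekday, Sunday to Saturday (not the real tweet data).
observed <- c(Sun = 40, Mon = 55, Tue = 60, Wed = 52, Thu = 58, Fri = 45, Sat = 38)

# Expected shares: Monday-Thursday weighted 5, Friday-Sunday weighted 4.
p <- c(4, 5, 5, 5, 5, 4, 4) / 32
expected <- sum(observed) * p

# Pearson statistic: sum over days of (observed - expected)^2 / expected.
x2 <- sum((observed - expected)^2 / expected)
p_value <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

# chisq.test() with given probabilities reports the same two numbers.
res <- chisq.test(observed, p = p)
print(c(by_hand = x2, chisq_test = unname(res$statistic)))
print(c(by_hand = p_value, chisq_test = res$p.value))
```

A small p-value here means the observed counts are unlikely under the stated shares. It says nothing about which alternative pattern holds.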

Time of tweets

Most tweets happen during the day, with more after 4:00 pm and into the evening. There are very few late-night tweets. The times may be somewhat distorted because I did not correct the data for time zone.

class(tweets$timeonly) <- "POSIXct"
ggplot(data = tweets, aes(x = timeonly)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Time of Tweets") +
  scale_x_datetime(breaks = date_breaks("3 hours"), 
                   labels = date_format("%H:00")) +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
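
Since the times above were not corrected for time zone, the following sketch shows how such a correction could look in base R. It assumes the stored timestamps are in UTC and that the audience is mostly in the `America/Edmonton` zone; both are assumptions, not facts about this dataset. lubridate's `with_tz()` offers the same conversion.

```r
# Made-up UTC timestamps standing in for tweets$timestamp.
utc <- as.POSIXct(c("2017-07-01 02:30:00",
                    "2017-07-01 18:00:00",
                    "2017-01-15 05:45:00"), tz = "UTC")

# Re-express the same instants in the assumed local zone and read off the hour.
local <- as.POSIXlt(utc, tz = "America/Edmonton")
local_hour <- local$hour

# A 02:30 UTC tweet in July becomes an evening tweet on the previous local day.
print(data.frame(utc = format(utc, "%Y-%m-%d %H:%M"),
                 local = format(local, "%Y-%m-%d %H:%M"),
                 local_hour = local_hour))
```

Because the offset changes with daylight saving time, converting by zone name is safer than subtracting a fixed number of hours.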

latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Hashtags

Nearly all tweets have hashtags, which is expected since we are looking for at least #3thingsforcanada or something similar as a marker. A few appear not to have one; I am assuming they are replies, but this can be noted and checked later if needed.

ggplot(tweets, aes(factor(grepl("#", text)))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Tweets with Hashtags") +
  scale_x_discrete(labels=c("No hashtags", "Tweets with hashtags"))
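
Beyond checking whether a hashtag is present, the tags themselves can be pulled out and counted, and the tweets without one can be listed for the later check mentioned above. This is a minimal base-R sketch on made-up tweet texts; the regular expression is a simplifying assumption and does not cover every character Twitter allows in a hashtag.

```r
# Made-up tweet texts standing in for tweets$text.
text <- c("Volunteer, vote, recycle #3thingsforcanada #Canada150",
          "@someone great list!",
          "My #3ThingsForCanada: smile more")

has_hashtag <- grepl("#", text)

# gregexpr() finds every match per tweet; regmatches() extracts them.
tags <- regmatches(text, gregexpr("#[[:alnum:]_]+", text))

# Lower-case before counting so differently capitalised tags merge.
tag_counts <- sort(table(tolower(unlist(tags))), decreasing = TRUE)
print(tag_counts)

# Tweets without any hashtag, for manual inspection.
print(text[!has_hashtag])
```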

Replies, Retweets and Tweets

Judging by the retweets and replies columns, about as many tweets were retweeted as were not, and very few are associated with replies.

ggplot(tweets, aes(factor(retweets != 0))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Retweeted Tweets") +
  scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))

ggplot(tweets, aes(factor(replies != 0))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Replied Tweets") +
  scale_x_discrete(labels=c("Not in reply", "Replied tweets"))

tweets$type <- "tweet"
tweets[(tweets$retweets > 0), c("type")] <- "RT"
tweets[(tweets$replies > 0),c("type")] <- "reply"
# Set the level order explicitly rather than by position, which depends on the locale's sort order
tweets$type <- factor(tweets$type, levels = c("tweet", "reply", "RT"))

ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram() +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
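
The labelling code above assigns "reply" after "RT", so a tweet with both a retweet count and a reply count ends up as "reply". The sketch below shows that precedence on a made-up data frame; the column names mirror the ones used above, and the values are invented.

```r
# Made-up engagement counts standing in for tweets$retweets and tweets$replies.
demo <- data.frame(retweets = c(0, 3, 0, 2),
                   replies  = c(0, 0, 1, 4))

demo$type <- "tweet"
demo$type[demo$retweets > 0] <- "RT"
demo$type[demo$replies > 0] <- "reply"   # assigned last, so it overrides "RT"

# Explicit levels fix the legend and stacking order in later plots.
demo$type <- factor(demo$type, levels = c("tweet", "reply", "RT"))

print(demo)
print(table(demo$type))
```

If retweets should take priority instead, swapping the two assignment lines is enough.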