Introduction to rtweet: Collecting Twitter data

Michael W. Kearney May 25, 2017

Getting started with Twitter data

Twitter data were already trendy, but the unpresidented 2016 U.S. election has escalated things to a fever pitch. One of the biggest drivers of the trend is the widespread availability of Twitter data. Twitter makes much of its user-generated data freely available to the public via Application Programming Interfaces (APIs). An API is a set of protocols and procedures for interacting with a site. Twitter maintains several APIs. The two most conducive to data collection are the REST API and the stream API, both of which I describe below.

Twitter’s REST API provides a set of protocols for exploring and interacting with Twitter data related to user statuses (tweets), user profiles and timelines, and user network connections. The data are “restful” in the sense that they have already been archived by Twitter. Navigating these REST endpoints can, at times, be resource intensive, but it also makes it possible to perform highly complex and specific queries.

Twitter data not yet archived and accessible via the REST API can be accessed using Twitter’s stream API. As its name suggests, the stream API provides users with a live stream of Twitter data. Because the data are streamed, or pushed, to the user, the stream API reduces overhead associated with performing queries on archived data sources. This makes it possible to collect large amounts of data very quickly and with relatively little strain on computational resources. The downside to the stream API is that it is limited to prospective (tracking, monitoring, etc.) but not retrospective (surveying, searching, etc.) queries.
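To give a flavor of the difference, here is a minimal sketch of the two access patterns using rtweet functions introduced later in this post (installation and authorization are covered in the next two sections). The query terms and the timeout value are arbitrary placeholders.

## load rtweet
library(rtweet)

## REST API: query the archive of recent tweets (placeholder query)
archived <- search_tweets("rstats", n = 100)

## stream API: collect live tweets for 60 seconds (placeholder query and timeout)
live <- stream_tweets(q = "rstats", timeout = 60)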

Installing rtweet

Install from CRAN using install.packages.

## install from CRAN
install.packages("rtweet")

Alternatively, install the most recent development version from GitHub using install_github (from the devtools package).

## install from GitHub (dev version)
if (!"devtools" %in% installed.packages()) {
    install.packages("devtools")
}
devtools::install_github("mkearney/rtweet", build_vignettes = TRUE)

API authorization

I’ve tried to make the API token (OAuth) process as painless as possible. That’s why I’ve included the “auth” vignette, which ships with the package and contains step-by-step instructions on how to create and manage your Twitter API token. The vignette also includes instructions for saving the token as an environment variable, which automates the token-loading process for all future sessions (at least on the machine you’re using). View the authorization vignette online or enter the following code into your R console to load the vignette locally:

## Open Twitter token vignette in web browser.
vignette(topic = "auth", package = "rtweet")

Package documentation

In addition to the API authorization vignette, rtweet also includes a brief package overview vignette as well as a vignette demonstrating how to access Twitter’s stream API. To open the vignettes locally, use the code below.

## overview of rtweet package
vignette(topic = "intro", package = "rtweet")

## accessing Twitter's stream API
vignette(topic = "stream", package = "rtweet")

And thanks to pkgdown, rtweet now has a dedicated package documentation website. By the way, while I’m on the subject of package documentation and maintenance, I’d also like to point out rtweet’s GitHub page. Contributions are welcome, and if you run into any bugs or other issues, please open a GitHub issue.

Some applied examples

search_tweets

Searching for tweets is easy. For example, we could search for all publicly available statuses from the past 7-10 days that use the hashtags #ica17 or #ica2017. In the code below I’ve specified 18,000 statuses (tweets), which is the maximum number a user may request every 15 minutes.

## load rtweet
library(rtweet)

## search for tweets containing ICA17 or ICA2017 (not case sensitive)
ica17 <- search_tweets(
    "#ica17 OR #ica2017", n = 18000, include_rts = FALSE
)

If more than 18,000 statuses (a) fit the search query and (b) were posted within the last 7-10 days (the window imposed by Twitter), then you can continue where you left off by using the max_id parameter. Since Twitter statuses are returned in order from newest to oldest, the max_id value should simply be the last status ID returned by the previous search.

## select last (oldest) status ID from previous search
last_status_id <- ica17$status_id[nrow(ica17)]

## pass last_status_id to max_id and run search again.
ica17_contd <- search_tweets(
    "#ica17 OR #ica2017", n = 18000, include_rts = FALSE,
    max_id = last_status_id
)

The data returned by search_tweets are quite extensive, and one recently added feature makes navigating them a bit easier. As of version 0.4.3, rtweet returns tibble data frames (assuming the user has installed the tibble package, which is a dependency for nearly all packages in the tidyverse). Tibbles are especially nice when working with larger data sets because accidental printing in R has been known to take years off of one’s life (needs citation).
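A tibble prints only the first ten rows and as many columns as fit on screen; if you want to see more, you can be explicit. The value of n below is just an example.

## print the first 20 rows of the search results
print(ica17, n = 20)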

ts_filter and ts_plot

Included in the rtweet package are a few convenience functions designed to assist in the analysis of Twitter data. One of these is ts_plot, a plot-based wrapper around ts_filter. The ts_plot and ts_filter functions aggregate the frequency of tweets over specified intervals of time, hence the “ts” (time series) naming convention. In addition to aggregating the frequency of statuses, ts_plot also plots the time series.

## aggregate freq of tweets in one-hour intervals
agg <- ts_filter(ica17, by = "hours")

## view data
agg
## # A tibble: 212 x 3
##                   time  freq filter
##                 <dttm> <dbl>  <chr>
##  1 2017-05-16 20:00:00     2       
##  2 2017-05-16 21:00:00     0       
##  3 2017-05-16 22:00:00     0       
##  4 2017-05-16 23:00:00     0       
##  5 2017-05-17 00:00:00     1       
##  6 2017-05-17 01:00:00     0       
##  7 2017-05-17 02:00:00     0       
##  8 2017-05-17 03:00:00     0       
##  9 2017-05-17 04:00:00     0       
## 10 2017-05-17 05:00:00     0       
## # ... with 202 more rows

## plot data
ts_plot(agg)

The plot produced by ts_plot depends on whether the user has installed ggplot2, which is a suggested but not required dependency for rtweet. If you haven’t installed ggplot2, I highly recommend doing so. Assuming you have, the object returned by ts_plot can be treated like any other ggplot object, meaning you can easily add layers and customize the plot to your liking.

## load ggplot2
library(ggplot2)
## plot a time series of tweets, aggregating by one-hour intervals
p1 <- ts_plot(ica17, "hours") +
    labs(
        x = "Date and time",
        y = "Frequency of tweets",
        title = "Time series of #ICA17 tweets",
        subtitle = "Frequency of Twitter statuses calculated in one-hour intervals."
    ) +
    ## a custom ggplot2 theme I mocked up for ICA
    theme_ica17()

## render plot
p1

plain_tweets

The second convenience function for analyzing tweets is plain_tweets. As you might guess, plain_tweets strips the text of the tweets down to plain text. Because the default tweets data already include separate variables for links, hashtags, and mentions, those entities are stripped out of the text as well. What’s returned is lower-case plain text. Below I’ve applied the function to the first few ICA17 tweets.

## strip text of tweets
plain_tweets(ica17$text[1:5])
## [1] "excellent posttruth preconference heres some background"                   
## [2] "panel w"                                                                   
## [3] "nato"                                                                      
## [4] ""                                                                          
## [5] "inspiring talk by about creating engaging progressive and subversive media"

The plain_tweets function does a reasonable job of cutting through the clutter, but the result still may not be ready for quick and easy analysis. For that, you can use the tokenize argument in plain_tweets, which returns a vector of plain-text words for each tweet.

## tokenize by word
wrds <- plain_tweets(ica17$text, tokenize = TRUE)
wrds[1:5]
## [[1]]
## [1] "excellent"     "posttruth"     "preconference" "heres"        
## [5] "some"          "background"   
## 
## [[2]]
## [1] "panel" "w"    
## 
## [[3]]
## [1] "nato"
## 
## [[4]]
## character(0)
## 
## [[5]]
##  [1] "inspiring"   "talk"        "by"          "about"      
##  [5] "creating"    "engaging"    "progressive" "and"        
##  [9] "subversive"  "media"

Identifying stop words

This can easily be converted into a word frequency table, but that still leaves one problem: the most common words probably aren’t going to tell us much about our specific topic or set of tweets.

## get word counts
wrds <- table(unlist(wrds))

## view top 40 words
head(sort(wrds, decreasing = TRUE), 40)
## 
##           the            to            of           and            in 
##           614           504           429           388           363 
##           for            on            at             a            is 
##           290           284           247           235           181 
##          from         media         about            as           san 
##           121           115           111           110           110 
##          with         diego            be             i          this 
##           108           105            98            92            91 
##            we            by           you           are           our 
##            90            88            85            78            76 
## preconference            my            it        social          that 
##            75            71            66            65            63 
##          data           not           see      research communication 
##            61            60            59            55            52 
##           now           but            up         great            us 
##            51            50            50            49            49

As you can see, these words aren’t particularly distinctive of ICA 2017. Of course, we could always find a premade list of stopwords to exclude, but such a list may not appropriately reflect the medium (Twitter). With rtweet, however, it’s possible to create your own dictionary of stopwords by locating the overlap between (a) a particular sample of tweets of interest and (b) a more general sample of tweets.

To do this, we’re going to search for each letter of the alphabet separated by the Boolean OR operator. It’s a bit hacky, but it returns a massive number of tweets about a wide range of topics. If we can then identify which words are distinctive to our sample, we may yet accomplish our goal.

In the code below, I’ve excluded retweets since those add unnecessary redundancies (and, ideally, we’d want a diverse pool of tweets). It’s still not perfect, but it gives us a systematic starting point that I imagine could be developed into a more reliable method.

## construct boolean-exploiting search query
all <- paste(letters, collapse = " OR ")

## conduct search for 5,000 original (non-retweeted) tweets
sw <- search_tweets(all, n = 5000, include_rts = FALSE)
## create freq table of all words from general pool of tweets
stopwords <- plain_tweets(sw$text, tokenize = TRUE)
stopwords <- table(unlist(stopwords))

Now that we’ve identified the frequencies of words in this more general pool of tweets, we can exclude all ICA tweet words that appear more than N times in the general pool.

## cutoff
N <- 5L

## exclude all ica17 words that appear more than N times in stopwords
wrds <- wrds[!names(wrds) %in% names(stopwords[stopwords > N])]

## check top words again
head(sort(wrds, decreasing = TRUE), 40)
## 
##         diego preconference          data      research communication 
##           105            75            61            55            52 
##      politics    conference        indigo     political       excited 
##            48            42            40            37            36 
##       forward         panel       preconf    presenting          comm 
##            34            29            29            29            26 
##        online         paper       digital      populism      ballroom 
##            25            24            23            23            19 
##     discourse           ica          nato       session      altheide 
##            19            19            19            19            18 
##       between          join      populist   interesting         hills 
##            18            18            18            17            16 
##    technology          fear        friday       hashtag        papers 
##            16            15            15            15            15 
##  presentation      sapphire      schedule      scholars      students 
##            15            15            15            15            15

Creating a word cloud

That turned out well! These words look a lot more specific to the topic. We can quickly survey them with a simple word cloud.

## get some good colors
cols <- sample(rainbow(10, s = .5, v = .75), 10)

## plot word cloud
par(bg = "black")
suppressWarnings(wordcloud::wordcloud(
    words = names(wrds),
    freq = wrds,
    min.freq = 5,
    random.color = FALSE,
    colors = cols,
    family = "Roboto Condensed",
    scale = c(4, .25))
)

Filtering topics

If we wanted to compare the salience of different topics, we could conduct two searches for tweets over the same time period and then compare the frequencies of tweets over time. That’s what I’ve done in the example below.

First I searched for tweets mentioning “North Korea”, since I know North Korea conducted another missile test on Monday.

## search tweets mentioning north korea (missile test on Monday)
nk <- search_tweets(
    "north korea", n = 18000, include_rts = FALSE
)

Then I searched for tweets mentioning “CBO health care” (in any order, anywhere in the tweet), since I know the CBO score was released on Wednesday.

## search for tweets about the CBO score (released on Wed.)
cbo <- search_tweets(
    "CBO health care", n = 18000, include_rts = FALSE
)

And then I combined the two data sets into one big data frame.

## create query (search) variable
cbo$query <- "CBO health care"
nk$query <- "North Korea"

## row bind into single data frame
df <- rbind(cbo, nk)

Using the ts_plot function, I then provide a list of filter words (as regular expressions; the vertical bar, |, acts as an OR). Use the key argument if you want nicer-looking filter labels. By default, ts_plot will create groups based on the text of the tweet and the filters provided. However, you can pass along the name of any variable in df and the function will use that variable to classify groups. In the code below, I applied plain_tweets to the text to create a new variable, and then specified that I wanted to apply the filters to that variable by using the txt argument in ts_plot.

## create plain tweets variable
df$text_plain <- plain_tweets(df$text)

## filter by search topic
p3 <- ts_plot(
    df, by = "15 mins",
    filter = c("cbo|health|care|bill|insured|deficit|budget",
               "korea|kim|jung un|missile"),
    key = c("CBO", "NKorea"),
    txt = "text_plain"
)

Now it’s easy to add more layers and make this plot look nice.

## add theme and more style layers
p3 <- p3 + 
    theme_ica17() +
    scale_x_datetime(date_labels = "%b %d %H:%M") +
    theme(legend.title = element_blank()) +
    labs(x = NULL, y = NULL, 
         title = "Tracing topic salience in Twitter statuses",
         subtitle = paste("Tweets (N = 23,467) were aggregated in 15-minute",
                          "intervals. Retweets were not included.")
    )

## render plot
p3

Tidy sentiment analysis

The syuzhet package makes sentiment analysis criminally easy.

## conduct sentiment analysis
sa <- syuzhet::get_nrc_sentiment(df$text_plain)

Within a few seconds, the analysis returns coded variables for several categories of emotion and valence. A preview of the sentiment scores returned by get_nrc_sentiment is provided below.

## view output
tibble::as_tibble(sa)
## # A tibble: 23,467 x 10
##    anger anticipation disgust  fear   joy sadness surprise trust
##    <dbl>        <dbl>   <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>
##  1     1            0       0     1     0       1        0     0
##  2     0            0       0     0     0       0        0     1
##  3     0            0       0     0     0       1        1     0
##  4     0            0       0     0     0       1        1     0
##  5     0            0       0     0     0       1        1     0
##  6     0            0       0     0     0       0        0     0
##  7     0            1       0     0     0       0        0     1
##  8     0            0       0     0     0       0        0     0
##  9     0            0       0     0     0       1        1     0
## 10     0            0       1     0     0       1        0     0
## # ... with 23,457 more rows, and 2 more variables: negative <dbl>,
## #   positive <dbl>

Since the return object is a data frame with the same number of rows as the CBO/North Korea data, the columns can easily be combined to create one data frame.

## bind columns
df <- cbind(df, sa)

This data structure is useful for most media researchers, but it’s not very flexible, either for summarizing the data or for visualizing it. Fortunately, recent advancements[1] in data wrangling in R make converting this wide data to tidy, long data a breeze. In the code below, I’ve written a small helper function to assist with time rounding, and I’ve enlisted dplyr and tidyr to do the dirty work.

## load dplyr
suppressPackageStartupMessages(library(dplyr))

## create function for rounding date-time vectors to a given interval
round_time <- function(x, interval = 60) {
    ## round down to the start of the interval (in seconds)
    rounded <- floor(as.numeric(x) / interval)
    ## shift by half an interval so the value is the interval mid-point
    rounded <- rounded + 0.5
    ## return to date-time
    as.POSIXct(rounded * interval, origin = "1970-01-01")
}

## use pipe (%>%) operator for linear syntax
long_emotion_ts <- df %>%
    ## select variables (columns) of interest
    dplyr::select(created_at, query, anger:positive) %>%
    ## convert created_at variable to desired interval 
    ## here I chose 3-hour intervals (3 * 60 * 60 seconds = 3 hours)
    mutate(created_at = round_time(created_at, 3 * 60 * 60)) %>%
    ## transform data to long form
    tidyr::gather(sentiment, score, -created_at, -query) %>%
    ## group by time, query, and sentiment
    group_by(created_at, query, sentiment) %>%
    ## get mean for each grouping
    summarize(score = mean(score, na.rm = TRUE), 
              n = n()) %>% 
    ungroup()

The result is a tidy data paradise:

## view data
long_emotion_ts
## # A tibble: 250 x 5
##             created_at           query    sentiment      score     n
##                 <dttm>           <chr>        <chr>      <dbl> <int>
##  1 2017-05-23 08:30:00 CBO health care        anger 0.13043478    23
##  2 2017-05-23 08:30:00 CBO health care anticipation 1.30434783    23
##  3 2017-05-23 08:30:00 CBO health care      disgust 0.08695652    23
##  4 2017-05-23 08:30:00 CBO health care         fear 0.26086957    23
##  5 2017-05-23 08:30:00 CBO health care          joy 0.86956522    23
##  6 2017-05-23 08:30:00 CBO health care     negative 0.21739130    23
##  7 2017-05-23 08:30:00 CBO health care     positive 1.43478261    23
##  8 2017-05-23 08:30:00 CBO health care      sadness 0.08695652    23
##  9 2017-05-23 08:30:00 CBO health care     surprise 0.82608696    23
## 10 2017-05-23 08:30:00 CBO health care        trust 0.21739130    23
## # ... with 240 more rows

Which we can pass right along to ggplot2 for the finish:

## plot data
long_emotion_ts %>%
    ggplot(aes(x = created_at, y = score, color = query)) + 
    geom_point(aes(size = n)) + 
    geom_smooth(method = "loess") + 
    facet_wrap(~ sentiment, scales = "free_y", nrow = 2) + 
    theme_bw() + 
    theme(text = element_text(family = "Roboto Condensed"),
          plot.title = element_text(face = "bold"),
          legend.position = "bottom", 
          axis.text = element_text(size = 9),
          legend.title = element_blank()) + 
    labs(x = NULL, y = NULL, 
         title = "Sentiment analysis of Twitter statuses over time",
         subtitle = "Tweets aggregated by hour on topics of the CBO and North Korea") +
    scale_x_datetime(date_breaks = "18 hours", date_labels = "%b %d")

And that’s it!

[1]: I’ll admit that for a time I was hesitant to embrace the collection of packages collectively known as the tidyverse (formerly known as the Hadleyverse; see: https://github.com/hadley). But the tidyverse, and especially dplyr, is really quite amazing.