Illuminating seeks to understand the political discourses engaged in by the 2016 U.S. presidential candidates on social media. We developed a platform for collecting, analyzing, and visualizing data using advanced computational approaches. This article puts particular emphasis on the data aspect of the project. We present how we collected raw data from social media sites, how we developed gold standard datasets as the ground truth about the performative speech nature of a message, and how we used them to train classification models. As byproducts of the platform, our datasets and models can be adopted to answer other questions in the context of political communication. Our methodologies also make a meaningful contribution to computational social science research, especially the literature on communication on social media.
Although the context of this project centers on presidential campaigns, our automated classification models were developed using both 2014 U.S. gubernatorial election data and 2016 U.S. presidential campaign data. This section presents the data collections on both Twitter and Facebook.
We used two primary collections of presidential data: one for Twitter and one for Facebook. We identified lists of Twitter handles and Facebook public pages for each of the presidential candidates from the Republican and Democratic parties as well as third parties. The account information was obtained from the candidates' campaign websites or by searching for their names on both platforms. We collected data from 32 candidates during the period of their campaigns. Specifically, we collected their tweets and posts from the date they officially announced their candidacy until the date they dropped out. The last two candidates in the collection, Hillary Clinton and Donald Trump, were followed through President Trump's inauguration. The longest period of collection is 1.7 years for Hillary Clinton, and the shortest is about 3.5 months for Tim Kaine.
The primary Twitter collection was developed using Twitter's timeline API and stored in a MySQL database. Using the timeline API, we retrieved a) user profile metadata (e.g., profile description, account creation date, and follower count at the time of collection); and b) all tweets created by the candidates. The collector ran periodically to update the associated metadata of each tweet, such as retweet, like, and follower counts. This design is ideally suited to the project because it enables us to repeatedly collect tweets from many different handles at a specified time interval over a long period of time. It also ensures that we keep an up-to-date collection and do not lose data if a candidate drops out of the race and deletes their account. The timeline collection comprises 132,239 tweets, an average of 4,132 tweets per candidate.
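The core of the periodic collection can be illustrated with a small sketch. This is not the project's actual collector code; the function and field names are hypothetical, and the point is only the merge logic: on each run, newly fetched tweets are inserted, while previously stored tweets keep their text but have their volatile counters refreshed.

```python
def upsert_tweets(store, fetched):
    """Merge a batch of fetched tweets into the local store, keyed by tweet id.

    `store` maps tweet id -> tweet dict; `fetched` is a list of tweet dicts
    as returned by a timeline request. Existing entries keep their original
    content but take the latest retweet and like counts.
    """
    for tweet in fetched:
        tid = tweet["id"]
        if tid in store:
            # Refresh only the volatile counters on repeat collection runs.
            store[tid]["retweet_count"] = tweet["retweet_count"]
            store[tid]["like_count"] = tweet["like_count"]
        else:
            store[tid] = dict(tweet)
    return store
```

Because the store is keyed by tweet id, a candidate deleting their account later does not remove tweets already collected, which is the property the design relies on.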
In addition, we developed redundant collections using Twitter's streaming API to supplement our primary collection. They were used to cross-check whether we missed any tweets and to verify count data (e.g., retweet and follower counts). They also contained information about members of the public who retweeted, replied to, and mentioned our candidates. To develop the redundant collections, we used an open source toolkit that collected data from Twitter's streaming API. To ensure a robust collection effort, we deployed two primary servers, each with a backup, for four servers in total. We used the same set of Twitter handles as our search terms for all four collections but used two different collection strategies: one pair of servers used Twitter's follow parameter, and the other pair used the track parameter. The follow parameter returns tweets that were a) created by one of the candidates in the list; b) retweeted by a candidate; c) replies to a candidate; and d) retweets of the candidates' tweets. The track parameter returns tweets in which one of the candidates' Twitter handles appeared in the text, a URL, or a screen_name of a tweet. The track collection tends to include everything the follow collection does but also captures cases where a tweet contains a candidate's handle without an '@' symbol ("realDonaldTrump") or as a hashtag ("#realDonaldTrump"), as well as retweets of tweets posted by non-candidates that @mention one of our candidates. Thus, the track collections are much larger. After combining the collection on each main server with the one on its backup and removing duplicates, the follow collection comprises 97,878,756 tweets and the track collection 322,578,912 tweets. Of these, 105,744 tweets were posted by our candidates.
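The main-plus-backup merge step described above amounts to concatenating two collections and dropping duplicate tweet ids. A minimal sketch, with illustrative names rather than the project's actual code:

```python
def merge_collections(main, backup):
    """Combine two lists of tweet dicts and drop duplicates by tweet id.

    The first copy of each tweet id encountered is kept, so the backup
    collection only contributes tweets the main server missed.
    """
    seen = set()
    merged = []
    for tweet in main + backup:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            merged.append(tweet)
    return merged
```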
For Facebook data, we used an open source toolkit that collected data from Facebook's Graph API and stored it in a MySQL database. Using the Graph API, we retrieved a) user metadata (e.g., profile description, page creation date, and the page's number of likes); b) all posts created by the pages; and c) data associated with the posts (e.g., numbers of post likes and comments). As with the Twitter timeline collection, we ran this collector periodically to update the numbers of likes and comments. This also ensures that we keep the data up to date if a post is changed and that we do not lose data if a candidate drops out of the race and deletes their page. This dataset contains 65,532 Facebook posts, an average of 2,047.88 posts per candidate.
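Collecting all posts from a page via the Graph API means following its cursor-based pagination. The sketch below is illustrative rather than the toolkit's actual code; it assumes the documented Graph API response shape, where each page of results carries a `data` list and a `paging.next` URL, and takes the page-fetching function as a parameter so the traversal logic stands alone.

```python
def collect_all_posts(fetch_page, first_url):
    """Follow `paging.next` links until exhausted, accumulating posts.

    `fetch_page` is any callable that takes a URL and returns a decoded
    Graph API-style response dict ({"data": [...], "paging": {...}}).
    """
    posts, url = [], first_url
    while url:
        page = fetch_page(url)
        posts.extend(page.get("data", []))
        # Absent "paging" or "next" means the last page has been reached.
        url = page.get("paging", {}).get("next")
    return posts
```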
Before the presidential collections, we created two other collections for candidates running in the 2014 gubernatorial elections. We constructed the list of Twitter handles and Facebook public pages for each of the 72 candidates from 36 states, relying on the information provided on their campaign websites or searching both platforms for their names. For the Twitter collection, we used the same toolkit, collecting tweets from the list of Twitter handles via the streaming API with the follow parameter. Of the 72 candidates, two were not active on Twitter and were thus dropped from the analysis: Robert Goodman (Nevada) and Charles Brown (Tennessee). The collection spans September 14, 2014 to November 11, 2014 and contains 28,981 tweets.
We used an initial version of the same Facebook toolkit for the collection of gubernatorial candidates on Facebook. Of the 72 candidates, we missed the data for five because their accounts were deleted soon after the election, before we could collect them. The same two candidates, Robert Goodman and Charles Brown, were not active on Facebook either and were thus dropped from the analysis. The collection spans September 14, 2014 to November 11, 2014 and contains 8,045 Facebook posts.
Gold Standard Data Development
This section presents the development of gold standard datasets for training our classification models. For each platform, we created a subset of data by randomly selecting 20% of the gubernatorial and 10% of the presidential campaign data from an early phase of the campaigns. We manually annotated each of these messages to generate ground truth about the performative speech categories according to our codebook, and we used the annotated data to train the classification models.
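The sampling step can be sketched in a few lines. The fractions come from the text (20% gubernatorial, 10% presidential); the function name and the fixed seed are purely illustrative, the latter only so the sketch is reproducible.

```python
import random

def sample_for_annotation(messages, fraction, seed=42):
    """Randomly select a fraction of messages for manual annotation."""
    rng = random.Random(seed)
    k = round(len(messages) * fraction)
    return rng.sample(messages, k)
```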
Our codebook is composed of six main categories: call-to-action, strategic message, informative, endorsement, ceremonial, and conversational (the last applies only to Twitter). We delved deeper into strategic messages by examining their type (attack or advocate) and focus (image or issue). Call-to-action messages are further broken into six types: traditional engagement, digital engagement, media appearance, voting, giving money, and buying merchandise. Each category reflects a different function of a message employed by a candidate: to urge supporters to act (call-to-action), advocate for themselves or go on the attack (strategic message), provide neutral information (informative), feature an endorsement (endorsement), honor people or holidays (ceremonial), or use the affordances of social media to interact with the public (conversational). With the established codebook, we trained four annotators and conducted a pilot study to measure inter-coder agreement on a random sample of 648 messages. Our codebook is reliable, as indicated by high Krippendorff's alphas: 0.79 on the main categories, 0.77 on the type of strategic messages, 0.72 on the focus of strategic messages, and 0.80 on the call-to-action sub-categories. Details about our codebook and human-annotator training can be found here.
The high Krippendorff's alphas give us confidence that our codebook is clearly defined and that the categories are mutually exclusive. The annotators then continued to develop the gold standard datasets through rounds of annotation and reconciliation where disagreements occurred. Our gold standard datasets comprise 7,136 messages for Twitter and 5,132 messages for Facebook. These datasets were then used to train the classification models discussed in the following section.
Automated Text Classification
With the gold standard data, we developed classification models to automatically categorize campaign messages based on support vector machines (SVM), a machine learning classification algorithm. Because message characteristics differ between Twitter and Facebook, we trained models for each platform separately. Our models were fine-tuned through a series of experiments to achieve the highest performance (details can be found here).
We developed four sets of models: one for the main categories, one for the type of strategic message, one for the focus of strategic message, and one for the call-to-action sub-categories. All models were evaluated using 5-fold cross-validation and reported with a micro-average F1 score. Our main-category models were trained on the full gold standard data: 7,136 tweets and 5,132 Facebook posts. The sub-category models were trained on the subsets of gold standard data of the corresponding types. Specifically, the type and focus models were trained on the 2,780 tweets and 1,660 Facebook posts labeled as strategic messages, and the call-to-action sub-category models were trained on the 1,575 tweets and 2,058 Facebook posts labeled as call-to-action.
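The evaluation setup can be sketched with scikit-learn. This is a minimal illustration, not the project's actual configuration: the text does not specify the feature representation or SVM variant, so a TF-IDF vectorizer and a linear SVM are assumed here purely for concreteness.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_classifier(texts, labels):
    """Return the mean micro-average F1 over 5 cross-validation folds.

    Assumed pipeline: TF-IDF features feeding a linear SVM; the actual
    features and hyperparameters used in the project are not given here.
    """
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_micro")
    return scores.mean()
```

The same routine applies to each of the four model sets; only the input subset and label scheme change.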
For Twitter data, our main-category model achieves a micro-average F1-score of 0.74. The sub-category models for strategic messages also perform quite well: 0.81 for type (attack or advocacy) and 0.77 for focus (image or issue). For call-to-action, all sub-categories except traditional engagement and buying merchandise achieve an F1-score greater than 0.78.
The main-category model for Facebook data achieves a micro-average F1-score of 0.76. The sub-category models for strategic messages obtained an F1-score of 0.80 for type (attack or advocacy) and 0.78 for focus (image or issue). For call-to-action, all sub-categories except traditional engagement achieve an F1-score greater than 0.78.
We conducted an experiment to evaluate the models' reliability through a human-machine comparison. Specifically, we constructed the test data by randomly selecting 1,000 messages from each platform over the course of the 2016 presidential campaign. We used our models to predict the category of each message and asked our annotators to correct each of the predicted categories. We then calculated F1-scores of the machine-predicted categories against the human-corrected results. With the exceptions of ceremonial and endorsement, F1-scores are all over 0.75 for Twitter and 0.74 for Facebook. This indicates that our models perform well and generalize to a wide range of political messages. We are using the models to study presidential candidates' campaign messages in real time; details can be found here.
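The human-machine comparison reduces to computing a per-category F1, treating the human-corrected labels as ground truth. A minimal pure-Python sketch (names illustrative) that makes the arithmetic explicit:

```python
def f1_per_category(machine, human):
    """Return {category: F1}, with human-corrected labels as ground truth.

    For each category: true positives are messages both label with it,
    false positives are machine-only, false negatives are human-only.
    """
    categories = set(machine) | set(human)
    scores = {}
    for cat in categories:
        tp = sum(m == cat and h == cat for m, h in zip(machine, human))
        fp = sum(m == cat and h != cat for m, h in zip(machine, human))
        fn = sum(m != cat and h == cat for m, h in zip(machine, human))
        denom = 2 * tp + fp + fn
        scores[cat] = 2 * tp / denom if denom else 0.0
    return scores
```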
This article presents the datasets we used in the Illuminating project. We developed a number of large data collections on both Twitter and Facebook during the period of the electoral campaigns. The datasets were used to develop gold standard data for training automated classification models. Our models are currently being used to explore the performative speech categories of political messages. As a platform, Illuminating provides a set of tools for data collection, analysis, and visualization. The byproducts of the platform are the large collections of social media data and the classification models, all of which can be extended in future studies.