Campaign 2016 Usage Guidelines

Our Data

Illuminating 2016 focuses on presidential Facebook posts and tweets. We used open-source tools to collect real-time data from Twitter and Facebook. Separate collectors were set up on multiple servers to build a collection system that does not miss any data during collection. All of the data that we collect goes through our preprocessing pipelines. Because Facebook accounts have different identifiers than Twitter accounts (e.g., @realdonaldtrump on Twitter vs. “Donald Trump” on Facebook), we link each candidate’s Facebook and Twitter accounts. Although candidates produce roughly the same volume of content on Facebook and Twitter, the substance of that content differs between the two platforms, so the two datasets were kept separate throughout both the human annotation and machine learning classification processes.

Our Message Categories

The Illuminating team developed a codebook to classify political speech found in political candidates’ social media posts. Using this codebook, coders could apply up to seven labels to a given message. Prior to creating any gold-standard datasets, human coders trained in pairs on pre-existing sets of gold-standard messages until they achieved consistent intercoder agreement. The gold-standard labeled dataset was then used to train the machine learning algorithms. From there, the trained ML models labeled the complete set of candidate messages, which is used in the interactive dashboards.
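Intercoder agreement is commonly measured with a chance-corrected statistic such as Cohen’s kappa. The document does not name the exact statistic the team used, so the following is only an illustrative sketch, using scikit-learn and made-up binary labels (e.g., whether two hypothetical coders applied the “Attack” label to each of eight messages):

```python
# Illustrative sketch: measuring intercoder agreement with Cohen's kappa.
# The coder labels below are hypothetical, not from the Illuminating data.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 1, 0, 0]  # coder A: did each message get the label?
coder_b = [1, 0, 1, 0, 0, 1, 0, 1]  # coder B: same messages, independently coded

# Kappa corrects raw agreement (6/8 here) for agreement expected by chance.
kappa = cohen_kappa_score(coder_a, coder_b)
print(round(kappa, 2))  # prints 0.5
```

Training in pairs until kappa stabilizes at an acceptable level is a standard way to ensure the gold-standard labels are reliable before they are used to train models.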

Below are the seven categories used to classify candidates’ social media posts. With the exception of the Campaigning Information category, a message can receive as many of the following labels as apply.

Persuasive content

Persuasive messages attempt to persuade the reader to support a candidate and/or reject their opponent(s). We categorize four forms of persuasive messaging:

  • Advocacy messages advocate for a candidate, highlight their strengths as a leader, emphasize their popularity, describe their positive personality characteristics, and/or explain their prior/future policies.
  • Attack messages criticize a clear target, such as the opponent, opposing administration, or opposing party, affiliated political institutions, social media platforms, newspapers, and/or politicized figures.
  • Image (Persona) messages primarily describe the candidate or the opponent’s character, personality, style and/or values, such as their competence, popularity, moral character, charm or level of benevolence.
  • Issue messages primarily describe the candidate or opponent’s issue/policy positions, past or future. This can refer to economic issues, social programs, immigration and citizenship, environmental issues, safety, military, foreign policy, governance, social/cultural issues and media issues. The Issue category includes broad claims about the state of the country and/or national (rather than individual) values.

In addition to classifying persuasive content, we categorize the following message types:

  • Call to Action messages are those that include a clear directive for readers to take (e.g., watch, retweet, share, etc.).
  • Ceremonial messages contain a social element and help to build community even if they aren’t, on the face of it, directly/explicitly related to the campaign. Ceremonial messages may give thanks, praise, pay tribute, honor, pray, or express condolences. Ceremonial messages are typically directed at: supporters, volunteers, attendees, family members, the public, or cities.
  • Campaigning Information describes messages about the campaign’s organization, ground game, and strategy. Messages that belong to this category are typically directed to, or about, supporters and observers of the campaign. We do not apply the campaigning information label to messages that have been labeled as persuasive and/or call to action.

How we classify our data

We use supervised machine learning algorithms, specifically the Support-Vector Machine (SVM), to classify our data. The models are trained on a sample of the corpus labeled by human annotators. These trained models then label the whole corpus, and the graphs on our website are based on those labels. Each label is treated as a distinct binary class; labels are not interdependent.
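Because each label is an independent binary class, classification amounts to training one binary SVM per label. A minimal sketch of that per-label setup, assuming scikit-learn’s LinearSVC and TfidfVectorizer (the toy messages and the Advocacy labels below are invented for illustration):

```python
# Sketch: one independent binary SVM per label, shown here for "Advocacy".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training messages and their binary Advocacy labels.
messages = [
    "Proud of our record creating jobs for working families",
    "My opponent has failed this country at every turn",
    "Thank you to the volunteers who made tonight possible",
    "Retweet and share to show your support",
]
advocacy = [1, 0, 0, 0]  # 1 = message carries the Advocacy label

# Convert text to TF-IDF features, then fit a binary SVM for this label.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(messages)
clf = LinearSVC(class_weight="balanced")
clf.fit(X, advocacy)

# Each of the other labels (Attack, Issue, Call to Action, ...) would get
# its own classifier, trained the same way on the same feature matrix.
pred = clf.predict(vectorizer.transform(["Proud of our record creating jobs"]))
```

Treating labels independently means a single message can receive several labels at once, matching the codebook’s multi-label design.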

Machine Learning training and data filtering

Our models use the supervised learning algorithm Support-Vector Machine (SVM). We follow standard NLP text-preprocessing steps using Python’s NLTK package, such as tokenization and stopword removal, and use TF-IDF to convert text into features. We restricted input features to unigrams and bigrams, both to avoid overfitting and to limit the feature space. To optimize the number of features and overall performance, we used five-fold cross-validation to tune the SVM model’s hyperparameters. Given the uneven distribution of codes, we balanced the classes with weights inversely proportional to the label proportions in our data. Finally, because precision and recall were equally important for our research, we used the F1 score to optimize our SVM models.
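The steps above (unigram/bigram TF-IDF features, balanced class weights, five-fold cross-validation tuned on F1) can be sketched as a single scikit-learn pipeline. This is an assumption-laden sketch, not the team’s actual code: it uses scikit-learn’s built-in English stopword list in place of NLTK for brevity, an invented hyperparameter grid, and tiny synthetic training data:

```python
# Sketch of the described training setup; grid values and data are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Unigrams and bigrams only, with stopword removal (sklearn's list,
    # standing in for NLTK's), converted to TF-IDF features.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    # Class weights inversely proportional to label frequency.
    ("svm", LinearSVC(class_weight="balanced")),
])

# Hypothetical grid; the document does not say which hyperparameters were tuned.
param_grid = {"svm__C": [0.1, 1, 10]}

# Five-fold cross-validation, optimizing F1 as the document describes.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")

# Toy stand-in for the annotated gold-standard sample (one label, binary).
texts = ["we will cut taxes and create jobs"] * 10 \
      + ["thank you to our amazing volunteers"] * 10
labels = [1] * 10 + [0] * 10
search.fit(texts, labels)
print(search.best_params_)
```

In practice `texts` and `labels` would be the human-annotated gold-standard sample for one label, and the winning model would then label the full corpus.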

To cite the categories and the machine learning work that created the 2016 data:

Zhang, F., Stromer-Galley, J., Tanupabrungsun, S., Hegde, Y., McCracken, N., & Hemsley, J. (2017). Understanding discourse acts: Political campaign messages classification on Facebook and Twitter. In Lee, D., Lin, Y. R., Osgood, N., Thomson, R. (Eds). Social, cultural, and behavioral modeling. SBP-BRiMS 2017. Lecture Notes in Computer Science, vol. 10354. Springer, Cham. DOI: 10.1007/978-3-319-60240-0_29.