Campaign 2020 Usage Guidelines
Illuminating 2020 focuses on collecting presidential candidates' political advertisements and categorizing them with our previously developed codebook. We use Apache Airflow to run the data pipelines that collect streaming ad data from the Facebook Ad Library API. The database is updated every 4 hours with new metrics and new posts, which are then tagged by our machine learning models. The data contain ads from Facebook and Instagram for all presidential candidates with valid Facebook and Instagram accounts. We pull data only from the official candidate pages, not from other entities advertising on the candidates' behalf, and only for candidates who ran long enough to be included in debates. The Facebook Ad Library API reports spending and impressions for each ad as a range with a minimum and a maximum. To estimate an ad's spending and impressions, we take the midpoint of each range. The figures we provide are thus estimates and should be treated as such.
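The midpoint estimation described above can be sketched in a few lines. This is a minimal illustration of the calculation, not the pipeline's actual code; the example ranges are made up, though they resemble the bracketed ranges the Ad Library API returns.

```python
def midpoint(lower, upper):
    """Estimate a point value from a [lower, upper] range by taking its midpoint."""
    return (lower + upper) / 2

# Example: an ad reported with a $100-$499 spend range and 1,000-4,999 impressions.
# (Illustrative numbers, not real API output.)
spend_estimate = midpoint(100, 499)          # 299.5
impressions_estimate = midpoint(1000, 4999)  # 2999.5
```

Because every ad's true value could fall anywhere in its range, the midpoint is an unbiased-looking but still approximate summary, which is why the dashboards should be read as estimates.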
The Illuminating team developed a codebook to classify political speech found in political candidates’ social media posts, which was then applied to political advertisements. Using this codebook, coders could apply up to five labels to a given ad. Prior to creating any gold-standard datasets, human coders trained in pairs on pre-existing sets of gold-standard social media messages until achieving consistent intercoder agreement. The gold-standard labels were used to train the Machine Learning algorithms. From there, the ML models labeled the complete set of candidate ads, which are used in the interactive dashboards.
Below are the five categories used to classify candidates' political advertisements. Messages can carry as many of the following labels as are applicable.
- Advocacy messages advocate for a candidate, highlight their strengths as a leader, emphasize their popularity, describe their positive personality characteristics, and/or explain their prior/future policies.
- Attack messages criticize a clear target, such as the opponent, the opposing administration or party, affiliated political institutions, social media platforms, newspapers, and/or politicized figures.
- Persona messages primarily describe the candidate's or the opponent's character, personality, style, and/or values, such as their competence, popularity, moral character, charm, or level of benevolence.
- Issue messages primarily describe the candidate's or opponent's issue/policy positions, past or future. This can refer to economic issues, social programs, immigration and citizenship, environmental issues, safety, military, foreign policy, governance, social/cultural issues, and media issues. The Issue category includes broad claims about the state of the country and/or national (rather than individual) values.
- Call to Action messages are those that include a clear directive for readers to take (e.g., watch, retweet, share, etc.).
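Since an ad can carry several of these five labels at once, an ad's annotation is naturally represented as a binary vector over the label set. The sketch below is illustrative only; the label names come from the list above, but the encoding is not the team's internal data format.

```python
LABELS = ["Advocacy", "Attack", "Persona", "Issue", "Call to Action"]

def encode_labels(applied):
    """Encode the set of labels applied to one ad as a 0/1 vector over LABELS."""
    applied = set(applied)
    return [1 if label in applied else 0 for label in LABELS]

# An ad that both attacks an opponent and discusses policy:
vec = encode_labels({"Attack", "Issue"})  # [0, 1, 0, 1, 0]
```

This multi-label representation is what allows a single ad to be counted under several categories in the dashboards rather than forced into one.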
The Illuminating team developed a codebook to classify political speech acts found in political candidates' social media posts, which was then applied to political advertisements. Using this codebook, coders could apply the categories of civil and uncivil. Prior to creating any gold-standard datasets, human coders trained in pairs on pre-existing sets of gold-standard social media messages until achieving consistent intercoder agreement. The gold-standard labels were used to train the Machine Learning algorithms. From there, the ML models labeled the complete set of candidate ads, which are used in the interactive dashboards. Unlike other approaches to classifying civility, we do not use a lexicon (word-list) approach. This yields a more nuanced classification of civility than simply matching hostile and derogatory words.
Thus, for this project we define Uncivil messages as those with a disrespectful or rude tone: for instance, insulting language (demeaning, belittling, mean, or vulgar) directed at opponents or groups, accusations of lying and deception, mockery, and misrepresentative exaggeration.
We use specialized deep learning algorithms for natural language processing, such as BERT, to classify our data. The pre-trained BERT models are fine-tuned for text classification on a sample of 2016 and 2020 Facebook post content labeled by annotators. The fine-tuned models are then used to label the whole corpus, and the graphs on our website are based on those labels. Each label is treated as a distinct binary class; labels are not interdependent.
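Treating each label as an independent binary class means the model produces one score per label and each score is thresholded on its own, rather than the model choosing a single best label. A minimal sketch of that decision step, with made-up logits and a standard 0.5 threshold (not the team's actual scores or settings):

```python
import math

def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Independently threshold one logit per label; any subset of labels may fire."""
    return {label: sigmoid(z) >= threshold for label, z in logits.items()}

# Hypothetical per-label logits from a fine-tuned classifier head:
preds = predict_labels({"Advocacy": 2.1, "Attack": -1.3, "Issue": 0.4})
# → {"Advocacy": True, "Attack": False, "Issue": True}
```

Because the decisions are independent, an ad can come back with zero, one, or several labels, which matches the multi-label codebook described above.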
We used the pre-trained BERT (Bidirectional Encoder Representations from Transformers) Base model (12 layers, 768 hidden units, 12 attention heads, 110M parameters), which is pre-trained on English Wikipedia and the BooksCorpus.
We used Facebook post data from 2016 and 2020 to fine-tune the pre-trained BERT models. The models were then tested on 700 unique ads from the 2020 Facebook ads data and subsequently used to label the whole data corpus. Optimal hyperparameters (epochs, batch size, input length, class-weight balancing, etc.) were selected separately for each label.
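Since hyperparameters are tuned per label, the configuration can be kept as one record per label's binary classifier. The values below are placeholders to show the shape of such a configuration, not the tuned settings actually used.

```python
# Hypothetical per-label fine-tuning settings; the actual tuned values are not published here.
HYPERPARAMS = {
    "Attack":   {"epochs": 4, "batch_size": 16, "max_input_len": 128, "pos_weight": 2.0},
    "Advocacy": {"epochs": 3, "batch_size": 32, "max_input_len": 128, "pos_weight": 1.0},
}

def config_for(label, default=None):
    """Look up the fine-tuning settings for one label's binary classifier."""
    return HYPERPARAMS.get(label, default)
```

Keeping one configuration per label reflects the design above: each binary classifier is trained and evaluated on its own, so nothing forces the labels to share a single set of hyperparameters.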
To cite the categories and the machine learning work that created the 2020 data:
Gupta, S., Bolden, S.E., Kachhadia, J., Korsunska, A., Stromer-Galley, J. (2020) PoliBERT: Classifying political social media messages with BERT. Paper presented at the Social, Cultural and Behavioral Modeling (SBP-BRIMS 2020) conference. Washington, DC, October 18-21, 2020.