Campaign 2020 Usage Guidelines

Our Data

Illuminating 2020 collects presidential candidates’ political advertisements and categorizes them using our previously developed codebook. We use Apache Airflow to run the data pipelines that collect streaming ad data from the Facebook Ad Library API. The database is updated every 4 hours with new metrics and new posts, which are then tagged by our machine learning models.

The data contain ads from Facebook and Instagram for all presidential candidates with valid Facebook and Instagram accounts. We pull ads and their metadata from the main candidate pages, as well as the ads the Trump and Biden campaigns purchased on other affiliated pages. We do not pull ads for other entities advertising on the candidates’ behalf, such as political action committees. We only pull data for candidates who ran long enough to be included in debates.

The Facebook Ad Library API reports spending and impressions for each ad as a range with a minimum and a maximum amount. To estimate the spending and impressions for an ad, we take the midpoint of each range. The data we provide are thus estimates and should be treated as such.
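The midpoint estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's pipeline code. It assumes each range arrives as a mapping with string-valued `lower_bound` and `upper_bound` keys; the sample record and the fallback for a missing upper bound are hypothetical choices made for the example.

```python
from typing import Mapping, Optional


def midpoint(value_range: Mapping[str, str]) -> Optional[float]:
    """Return the midpoint of a reported range, or None if no lower bound is given."""
    lower = value_range.get("lower_bound")
    upper = value_range.get("upper_bound")
    if lower is None:
        return None
    if upper is None:
        # Assumption: an open-ended top range has no upper bound,
        # so the lower bound is used as a conservative estimate.
        return float(lower)
    return (float(lower) + float(upper)) / 2


# Hypothetical ad record; the values are made up for illustration.
ad = {
    "spend": {"lower_bound": "100", "upper_bound": "199"},
    "impressions": {"lower_bound": "1000", "upper_bound": "4999"},
}

estimated_spend = midpoint(ad["spend"])              # 149.5
estimated_impressions = midpoint(ad["impressions"])  # 2999.5
```

Because every ad is reduced to a single midpoint, sums over many ads inherit the uncertainty of the underlying ranges, which is why the totals should be read as estimates.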

Message Type

The Illuminating team developed a codebook to classify political speech found in political candidates’ social media posts, which was then applied to political advertisements. Using this codebook, coders could apply up to five labels to a given ad. Prior to creating any gold-standard datasets, human coders trained in pairs on pre-existing sets of gold-standard social media messages until achieving consistent intercoder agreement. The gold-standard labels were used to train the Machine Learning algorithms. From there, the ML models labeled the complete set of candidate ads, which are used in the interactive dashboards.

Below are the five categories used to classify candidates’ political advertisements. A message can carry as many of the following labels as are applicable.

  • Advocacy messages advocate for a candidate, highlight their strengths as a leader, emphasize their popularity, describe their positive personality characteristics, and/or explain their prior/future policies.
  • Attack messages criticize a clear target, such as the opponent, the opposing administration or party, affiliated political institutions, social media platforms, newspapers, and/or politicized figures.
  • Persona messages primarily describe the candidate’s or the opponent’s character, personality, style and/or values, such as their competence, popularity, moral character, charm or level of benevolence.
  • Issue messages primarily describe the candidate’s or the opponent’s issue/policy positions, both past and future. This can refer to economic issues, social programs, immigration and citizenship, environmental issues, safety, military, foreign policy, governance, social/cultural issues, and media issues. The Issue category includes broad claims about the state of the country and/or national (rather than individual) values.
  • Call to Action messages are those that include a clear directive for readers to follow (e.g., watch, retweet, share).

Message Topic

The Illuminating team iteratively developed a codebook to classify political speech acts found in political candidates’ advertisements on social media into 12 topic categories. These topics are likely to correspond to the “Issue” category under the message type classification, but they are not limited to messages of that type: the two codebooks, gold-label datasets and machine learning algorithms were developed independently. Human coders went through an iterative process of annotation and of codebook and lexicon development until stable definitions and category boundaries were agreed upon. Subsequently, gold-standard datasets were created and used to train and test the machine learning algorithm, which uses a lexicon-based approach. Once precision and recall were deemed satisfactory for all 12 topics, the ML models were used to label the complete set of candidate ads, which are visualized in the interactive dashboards.

Below are the 12 topic descriptions. Topics are not mutually exclusive, and one message can carry anywhere from 1 to all 12 topic labels. The descriptions below are non-exhaustive.

  • Economic

    Economic includes messages regarding federal spending, taxes, wages, infrastructure (transportation, roads, housing, etc.), business policy, financial sector issues (mortgages, investments, banking, unions), discussions about socioeconomic classes (upper class, the middle class, 99%, gender pay gap), U.S. trade policy, job creation and reparations payments.

  • Social Programs

    Social Programs includes messages regarding health care (health insurance, PPE, costs, Obamacare), Social Security (Medicaid, Medicare), welfare programs (affordable housing) and general solutions to social/cultural issues.

  • Immigration

    Immigration includes messages regarding border protection (border wall, illegal immigration, ICE), Citizenship (Green cards, visas) as well as any immigrant-specific issues (immigrant students, DACA).

  • Environment

    Environment includes messages regarding Global warming/Climate change (greenhouse gases, pollution, overpopulation, natural disasters, water pollution), Renewable Energy (clean coal, wind, wind farms, electric cars) and related policy (Green New Deal, Blue New Deal).

  • Safety

    Safety includes messages regarding crime and safety in our country (crime rate, school violence, murder rates, suicide rate, death penalty, mass shooting), drugs & alcohol-related issues, gun legislation (gun laws, second amendment, gun ownership, automatic weapons), NSA-related issues (Snowden, surveillance of the public, cybersecurity, domestic terrorism) and general statements regarding the failure of government to keep people safe.

  • Military

    Military includes messages regarding Veterans’ Affairs (healthcare, treatment, PTSD, housing), service (military preparedness, draft, selective service), military technology R&D, and military budget and spending.

  • Foreign Policy

    Foreign Policy includes messages regarding trade agreements or pacts with other nations, NATO, policies related to foreign governments and institutions, aid to other countries, war on terror, Homeland Security.

  • Governance

    Governance includes messages regarding policies around how government works (the size of government, corruption, shutdowns), the process of governance (elections and campaigns, executive action, congressional hearings), judicial matters (balance of powers, Supreme Court), campaigning (finance, political parties, cabinet positions, transparency of tax returns), voting (elections, electoral college, right to vote, voter fraud, voter turnout), citizen rights and media-related issues (freedom of speech, social media policies). This category does not include issues of governance in other countries.

  • Social and cultural

    Social and cultural includes messages regarding values on how we should treat all people, especially minority and disenfranchised populations; women’s issues (abstinence programs, contraception, stem cells, Planned Parenthood, reproductive rights); LGBT issues (gay and lesbian rights, gay marriage, civil unions); racial issues (HBCUs); religious issues (e.g., Muslims and their rights); disability rights; and issues affecting the aging population.

  • Health

    Health includes messages regarding virus outbreaks (Zika, Ebola, COVID-19), vaccines (existing and new development), health science (new technology, new treatments), illnesses, treatments, access to health-related goods or services (food, prison healthcare, abortion, big pharma raising prices, etc.), the health system (Obamacare, health insurance, paying for treatment) and mental health issues (PTSD, addiction, suicide, etc.).

  • Education

    Education includes messages regarding financing education (debt, debt-free college, loans, paying teachers, for-profit universities), education standards, safety in schools (shootings, abuse), teacher education (training), subject-specific learning, education for prisoners/training to rejoin society, and discussion of HBCUs.

  • COVID

    COVID includes messages regarding the origins of the virus, government efforts to stop the spread of the virus (flatten the curve, lockdown, quarantine, social distancing, closing borders, etc.), medical professionals (hospital capacity, overworked doctors, etc.), medical supplies (PPE, ventilators, masks, disinfectants, etc.), possible treatments (vaccines, remdesivir, hydroxychloroquine, etc.) and economic outcomes (job losses, recession, stimulus checks, panic shopping, shortages, etc.). Note: Many of the COVID messages will overlap with the health, economic and foreign policy categories.
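The per-topic precision and recall check mentioned in the Message Topic overview can be illustrated with a short Python sketch. It is not the team's evaluation code: the function name and the toy gold and predicted label sets are hypothetical, and only the standard definitions of precision and recall are assumed. Each message is represented as a set of topic labels, since topics are not mutually exclusive.

```python
from typing import Dict, List, Set, Tuple


def per_label_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Compute (precision, recall) for each label, treating it as its own binary task."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        # Report 0.0 when a label was never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy data: four messages with made-up gold and predicted topic sets.
gold = [{"Economic", "COVID"}, {"Health"}, {"Economic"}, {"COVID", "Health"}]
predicted = [{"Economic"}, {"Health", "COVID"}, {"Economic"}, {"COVID"}]

scores = per_label_precision_recall(gold, predicted, ["Economic", "Health", "COVID"])
# scores["Economic"] == (1.0, 1.0)
# scores["Health"]   == (1.0, 0.5)
# scores["COVID"]    == (0.5, 0.5)
```

Scoring each topic separately makes it visible when one topic lags behind the others, which a single pooled score would hide.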

Message Tone

The Illuminating team developed a codebook to classify political speech acts found in political candidates’ social media posts, which was then applied to political advertisements. Using this codebook, coders labeled each message as civil or uncivil. Prior to creating any gold-standard datasets, human coders trained in pairs on pre-existing sets of gold-standard social media messages until achieving consistent intercoder agreement. The gold-standard labels were used to train the machine learning algorithms. From there, the ML models labeled the complete set of candidate ads, which are used in the interactive dashboards. Unlike other approaches to classifying civility, we do not use a lexicon (a list of words) approach. Our approach provides a more nuanced classification of civility than simply flagging hostile and derogatory words.

Thus, for this project we define Uncivil messages as having a disrespectful or rude tone, for instance, using insulting language (demeaning, belittling, mean or vulgar) towards opponents or groups, accusations of lying and deception, mockery, and misrepresentative exaggeration.
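The guidelines do not say which statistic was used to measure intercoder agreement. As one common possibility, and purely as an assumption for illustration, the sketch below computes Cohen's kappa for two coders who each label the same messages as civil or uncivil. The function name and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of messages.")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    # Chance agreement: probability both coders pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both coders used a single identical category throughout.
    return (observed - expected) / (1 - expected)


# Made-up labels from two coders for the same eight messages.
coder_a = ["civil", "civil", "uncivil", "civil", "uncivil", "civil", "civil", "uncivil"]
coder_b = ["civil", "civil", "uncivil", "civil", "civil", "civil", "uncivil", "uncivil"]

kappa = cohens_kappa(coder_a, coder_b)  # 6 of 8 agree; kappa is about 0.47
```

Kappa is lower than the raw 75% agreement here because most messages are civil, so two coders would often agree by chance alone.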

How we classify our data

We use deep learning models for natural language processing, such as BERT, to classify our data. Pre-trained BERT models are fine-tuned for text classification on a sample of 2016 and 2020 Facebook post content labeled by annotators. The fine-tuned models are then used to label the whole corpus, and the graphs on our website are based on that labeled corpus. Each label is treated as a distinct binary class, and labels are not interdependent.
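Treating each label as its own binary class means a message's final label set is simply the union of independent yes/no decisions. The Python sketch below illustrates that idea for the five message types. The 0.5 decision threshold, the function name and the example scores are assumptions made for illustration and are not taken from the project.

```python
from typing import Dict, List

MESSAGE_TYPES = ["Advocacy", "Attack", "Persona", "Issue", "Call to Action"]


def assign_labels(probabilities: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose own binary classifier scores at or above the threshold."""
    return [
        label
        for label in MESSAGE_TYPES
        if probabilities.get(label, 0.0) >= threshold
    ]


# Hypothetical scores for one ad, one score per independent binary model.
scores = {
    "Advocacy": 0.91,
    "Attack": 0.08,
    "Persona": 0.35,
    "Issue": 0.77,
    "Call to Action": 0.64,
}

labels = assign_labels(scores)  # ["Advocacy", "Issue", "Call to Action"]
```

Since no decision depends on another, a message may end up with zero labels, a single label, or all of them.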

Machine Learning training and data filtering

We used a pre-trained BERT (Bidirectional Encoder Representations from Transformers) Base model (12 layers, 768 hidden units, 12 attention heads, 110M parameters), which was pre-trained on English Wikipedia and the BooksCorpus collection of books.

We fine-tuned the pre-trained BERT models on Facebook post data from 2016 and 2020. The resulting models were tested on 700 unique ads from the 2020 Facebook ads data and then used to label the whole data corpus. The best-performing hyperparameters (epochs, batch size, input length, weight balancing, etc.) were selected separately for each label.
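The guidelines list weight balancing among the hyperparameters but do not give the formula used. One widely used scheme, shown here only as an assumption, weights each class inversely to its frequency so that a rare positive class (for example, a label that applies to few ads) is not drowned out during training. The function name and the label counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up training labels for one binary label: 20 positive and 80 negative examples.
attack_labels = [1] * 20 + [0] * 80

weights = balanced_class_weights(attack_labels)  # {1: 2.5, 0: 0.625}
```

With these weights, an error on a positive example counts four times as much as an error on a negative one, matching the 80:20 imbalance in the toy data.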

Citation

To cite the categories and the machine learning work that created the 2020 data:

Gupta, S., Bolden, S.E., Kachhadia, J., Korsunska, A., Stromer-Galley, J. (2020) PoliBERT: Classifying political social media messages with BERT. Paper presented at the Social, Cultural and Behavioral Modeling (SBP-BRIMS 2020) conference. Washington, DC, October 18-21, 2020.