Predicting election results in Brazil based on social media data, some basic assumptions and a little bit of AI

Our analytical election prediction model, applied to the 2022 Brazilian Presidential runoff election, came within 0.5% of the actual results.

Reinaldo Bergamaschi
17 min read · Feb 16, 2023

We have been monitoring daily social media data (only public data) for about 5 years now, from all types of accounts, e.g., politicians, celebrities, influencers, products, companies, sports teams, colleges, you name it. Everything that happens in the real world gets commented on, explored, faked, blown up, or discredited on social media. Still, one thing that has become very clear is that people (real, not bots) are honestly blunt when engaging or posting something on social media, especially if they feel strongly about some product, company, or politician.

For a while now, we have been fascinated by the relationship between social media and elections. We have analyzed the social media contents of candidates in several elections in Brazil and the US and looked for correlations between social media indicators and the final election results. At the end of this article, you can find links to our analyses of past elections.

In this article we detail our approach for predicting election results, based solely on social media data, some basic assumptions about the size of the electorate and the candidates, and a little bit of AI (for data mining and sentiment analysis). In the 2022 Brazilian presidential election, our prediction on the eve of the runoff came within 0.5% of the final result, which was more accurate than all but one of the opinion polls published the day before the vote. This article explains the methodology behind that prediction.

In the last several years, with the increasing popularity and accessibility of machine learning technologies, many scholars have published methods to predict the outcome of elections based on data. Not data collected in an interview or phone call, but data volunteered by people, by means of social media.

Research [1] has shown that many people feel more at ease expressing their personal views online than when talking to a pollster. For better or worse, interactions on social media tend to be more polarized and less susceptible to the social norms that are usually present in the work or family environment. They are thus perhaps more representative of the person’s real intention at the ballot box.

The machine learning methods for predicting election results use a variety of social media data, most often, number of fans and followers, number of likes and shares, and number of comments mentioning a candidate positively or negatively. Sentiment analysis is widely used for assessing whether a comment is favorable to a given candidate or not.

Early studies were able to correlate the volume of social media posts associated with a given candidate or party, and the sentiment of the comments posted about the candidates with the election results [2, 3].

More recently, researchers have used more sophisticated models, considering various metrics such as number of posts, likes, shares, and comments (number and sentiment) and using machine learning algorithms to predict election results [4, 5].

Fake news and bots make the monitoring and understanding of real engagement and opinions more difficult. Moreover, the language used in social media tends to be very informal, sarcastic, ironic, and full of slang and grammar and spelling mistakes, which makes any type of analysis (e.g., detecting sentiment or opinion) much less precise.

Politicians, especially the very polarizing ones, usually have huge numbers of followers and very high engagement on their social media posts. The majority of fans and followers are normally the converted ones, that is, the people who will vote for the candidate anyway, and they do not represent the undecided electorate. In very competitive elections, the undecided voters (if they choose to vote) will settle on a candidate closer to the election and may swing the victory to a candidate who has far fewer total fans and followers. Thus, treating total quantities in social media as an indicator of electorate trends can be very misleading.

The incremental change in social media indicators, especially closer to the elections, is a much better indicator of the trends among the undecided voters.

This effect can be estimated quantitatively by looking at the variations of fans and followers over time, comparatively for all candidates in an election. When a person who is not yet a fan or follower of any candidate starts following a given candidate, especially close to the election, it likely means that something in that candidate’s recent message resonated with the person, enough for them to engage with the candidate on social media (and perhaps vote for them).

This can be measured as follows: consider an election with 4 candidates in which, on a given day, 100 new people start following the 4 candidates (summing up the new followers). Let’s say that candidate A had 40 new followers, B had 25, C had 20 and D had 15. From this we can estimate that, of all “undecided” people on this day, 40% chose candidate A, 25% chose candidate B, 20% chose C and only 15% chose D. At this point, if we have an estimate of the number of undecided voters, we can estimate how their votes will be allocated to each candidate.
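The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the share calculation, not our production code; the function name `new_follower_shares` is made up for this sketch, and it assumes the daily count of new followers is positive for the day being analyzed.

```python
def new_follower_shares(new_followers: dict[str, int]) -> dict[str, float]:
    """Return each candidate's share (0..1) of the day's new followers."""
    total = sum(new_followers.values())
    if total <= 0:
        # Assumption of this sketch: days with no net growth are not usable.
        raise ValueError("no net new followers on this day")
    return {name: count / total for name, count in new_followers.items()}


# Numbers taken from the example in the text: 100 new followers in one day.
day = {"A": 40, "B": 25, "C": 20, "D": 15}
shares = new_follower_shares(day)

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```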

To give an idea of this daily effect, we monitored the total number of fans and followers, and their daily incremental change, for the top four presidential candidates in the 2022 Brazilian elections, for a period prior to the 1st round election, including a period when candidates were very active on TV and campaigning. The figures below show the details.

Both graphs above show the same information, the number of new daily fans/followers gained by each candidate, as a percentage of the total of new daily fans/followers.

The first graph is annotated with specific events that triggered distinct reactions. For example, on the day(s) after each candidate was interviewed by “Jornal Nacional” (JN), a popular TV news program, their number of new fans/followers went up. On Aug/22, the incumbent president and candidate Jair Bolsonaro was interviewed by JN, and immediately after (Aug/23) he gathered 75% of the new fans/followers that day. Similarly, Lula was interviewed by JN on Aug/25, and on Aug/26 he collected 71% of the new fans that day. On Aug/28 there was a Presidential Debate on TV, in which candidate Simone Tebet was considered to have won, and on the day after, Aug/29, Simone collected 30% of the new fans, more than any other candidate that day, and more than she collected on any other day. On Sep/7, president Bolsonaro participated in a large event celebrating Brazilian Independence Day, and on the following day he gathered 61% of the new fans.

These daily variations also show that the effect of an event is usually short lived, that is, the spotlight candidate gets proportionally more fans immediately after the event (and the other candidates proportionally fewer), but on later days their levels go back to normal. Nevertheless, we found that the average of these daily variations does correlate with the amounts of undecided voters that make a decision for a candidate. We will detail how to use it quantitatively later.

Analytical Model for Election Prediction

Our analytical model for election prediction consists of the following steps:

  1. Initial Estimates of:
    a. Total number of eligible voters (eV)
    b. Number of actual voters (aV), i.e., people who actually cast a vote
    c. Number of valid votes (vV)
    d. Number of invalid votes (iV)

For the purposes of estimating the election results, we will consider the number of valid votes only, in a fully proportional voting system. These numbers can be taken from the official electoral sites and estimated based on previous elections. They can clearly change for each election, but in Brazil, where voting is compulsory, the percentage of valid votes with respect to eligible voters has not changed drastically over the last 3 presidential elections.

2. Percentage of potential votes (ppv)
In the days leading to election day (this could be one day, one week, one month, depending on the granularity and reliability of the data available), each candidate is associated with a percentage of potential votes (ppv). For example, before a 1st round election (or single round), since we do not have any “real” numbers, we could use the average of the latest opinion polls for each candidate. If there is a 2nd round/runoff election, we should use the actual number of votes that each candidate got in the 1st round election.
Let ppvₖ denote this percentage of potential votes for candidate k, then

pVₖ = aV * ppvₖ

is the number of potential votes to be received by candidate k in the election (1st or single round), or the actual votes received by candidate k in a 1st round election.
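As a small illustration of step 2, the sketch below applies pVₖ = aV * ppvₖ to every candidate. The turnout figure and the percentages are invented for the example (they are not data from any election), and the function name `potential_votes` is hypothetical.

```python
def potential_votes(actual_voters: float, ppv: dict[str, float]) -> dict[str, float]:
    """Step 2: pV_k = aV * ppv_k for every candidate k."""
    return {k: actual_voters * share for k, share in ppv.items()}


# Hypothetical inputs: 1,000,000 actual voters and poll averages per candidate.
aV = 1_000_000
ppv = {"A": 0.45, "B": 0.40, "C": 0.15}

pV = potential_votes(aV, ppv)
print(pV)
```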

3. Definite and Undecided Votes
Out of the potential votes for a given candidate k (item 2 above), we estimate the percentage that will definitely go to candidate k on election day (percentage of definite votes = pdv), and the percentage that are still undecided (percentage of undecided votes = puv). These percentages can be taken from opinion polls, which normally ask the voters how certain they are about their chosen candidate and whether they may change. Hence the definite votes (dVₖ) and undecided votes (uVₖ) associated with candidate k can be estimated as:

dVₖ = pVₖ * pdvₖ
uVₖ = pVₖ * puvₖ

The total number of undecided votes (uV) is given by the sum over all N candidates:

uV = uV₁ + uV₂ + … + uV_N
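Step 3 can be sketched as follows. The sketch takes pdv and puv as two separate inputs per candidate, as in the text, and does not assume that they add up to 100%. All numbers are invented for illustration, and the function name `split_votes` is hypothetical.

```python
def split_votes(pV, pdv, puv):
    """Step 3: definite votes dV_k, undecided votes uV_k, and their total."""
    dV = {k: pV[k] * pdv[k] for k in pV}
    uV = {k: pV[k] * puv[k] for k in pV}
    total_undecided = sum(uV.values())
    return dV, uV, total_undecided


# Hypothetical inputs: A and B keep all their voters, 20% of C's are undecided.
pV = {"A": 450_000, "B": 400_000, "C": 150_000}
pdv = {"A": 1.0, "B": 1.0, "C": 0.0}
puv = {"A": 0.0, "B": 0.0, "C": 0.2}

dV, uV, total_uV = split_votes(pV, pdv, puv)
print(dV, uV, total_uV)
```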

4. Vote Transfer
In a 2-stage election, only the top 2 candidates in the 1st round will compete in the runoff election (if none gets more than 50% of the valid votes). The voters who voted for the other candidates will then have to choose between the 2 remaining candidates (or abstain or invalidate the ballot). Some of these voters may stay undecided for longer (the uVₖ factor explained in item 3), and the remaining ones will transfer their votes. In this step we estimate this transfer.

If candidates A and B are the top 2 in the 1st round, and candidate C is left behind, the percent transfer of C to A and of C to B, denoted as pt_C->A and pt_C->B, is estimated based on 2 factors:

  • a) How aligned is the party of candidate C with the parties of A and B? If C is totally aligned with A (e.g., two right-wing parties, or two left-wing parties), then pt_C->A = 100% and pt_C->B = 0%. So, some political context is needed here.
  • b) In the absence of a definite alignment (e.g., a center party and a left/right wing party), we will estimate the transfer percentage based on data mining and sentiment analysis. See details below.

In the formulas below, we denote the percentage of votes transferred from candidate j to candidate k as pt_j->k. Let’s assume that in a 1st round election with N candidates, each candidate i got Vᵢ votes, of which uVᵢ are undecided. In the event of a runoff election between the first two candidates (1 and 2, to make the formulas cleaner), the total votes transferred from candidates 3 to N to each finalist k (k = 1 or 2) will be:

tVₖ = (V₃ − uV₃) * pt_3->k + … + (V_N − uV_N) * pt_N->k
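A minimal sketch of the transfer step is shown below. It assumes that the transferable votes of each eliminated candidate are their 1st round votes minus their undecided votes, and that each transfer percentage is already known (either set by hand or estimated from comments). The numbers and the function name `transferred_votes` are invented for the example.

```python
def transferred_votes(votes, undecided, pt, finalists):
    """Step 4: sum the votes each finalist receives from eliminated candidates."""
    totals = {f: 0.0 for f in finalists}
    for j, v in votes.items():
        if j in finalists:
            continue  # finalists do not transfer votes to each other here
        transferable = v - undecided.get(j, 0.0)
        for f in finalists:
            totals[f] += transferable * pt[j][f]
    return totals


# Hypothetical 1st round with 4 candidates; A and B go to the runoff.
votes = {"A": 450_000, "B": 400_000, "C": 100_000, "D": 50_000}
undecided = {"C": 20_000, "D": 10_000}
pt = {
    "C": {"A": 0.60, "B": 0.40},  # e.g., estimated from comments
    "D": {"A": 0.25, "B": 0.75},  # e.g., set from party alignment
}

tV = transferred_votes(votes, undecided, pt, finalists=("A", "B"))
print(tV)
```

With these inputs, C transfers 80,000 votes and D transfers 40,000, so the two totals add up to 120,000.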

5. Assigning the Undecided Votes
Assuming that these voters (still undecided in the days leading to the election) are going to the polls, they will have to choose between the remaining candidates. We will estimate the percentage of undecided votes that go to each candidate based on the average percentage of new daily fans/followers (pdF) gathered by each candidate in the time interval being considered (these daily values can be seen in the figures above).
The number of undecided votes that “decide” for candidate i is denoted as udVᵢ, where uV is the total number of undecided votes from step 3:

udVᵢ = uV * pdFᵢ
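The allocation of undecided votes can be sketched as below. The daily shares are invented, the function name `assign_undecided` is hypothetical, and the renormalization of the averages is an extra safeguard added for this sketch (in case the averaged shares do not sum exactly to 1), not something stated in the method itself.

```python
from statistics import mean


def assign_undecided(total_undecided, daily_shares):
    """Step 5: split the undecided pool by average daily new-follower share."""
    avg = {c: mean(series) for c, series in daily_shares.items()}
    norm = sum(avg.values())  # safeguard: renormalize the averages
    return {c: total_undecided * a / norm for c, a in avg.items()}


# Hypothetical daily shares of new fans/followers over three days.
daily_shares = {
    "A": [0.55, 0.60, 0.50],
    "B": [0.45, 0.40, 0.50],
}

udV = assign_undecided(30_000, daily_shares)
print(udV)
```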

6. Final Tally
The final estimated number of votes that candidate i will receive in the election (fVᵢ) is given by the sum of definite votes (dVᵢ), transferred votes (tVᵢ, from step 4) and the assigned share of the undecided votes (udVᵢ, from step 5):

fVᵢ = dVᵢ + tVᵢ + udVᵢ

The final percentage of valid votes estimated for candidate i will be given by:

100 * fVᵢ / (fV₁ + … + fVₘ)

where m is the final number of candidates (e.g., 2 in a runoff election)
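The final tally is a simple sum followed by a normalization, which can be sketched as follows. The inputs are invented numbers for a two-candidate runoff and the function name `final_tally` is hypothetical.

```python
def final_tally(dV, tV, udV):
    """Step 6: final votes per candidate and their percentage of the total."""
    fV = {c: dV[c] + tV[c] + udV[c] for c in dV}
    total = sum(fV.values())
    pct = {c: 100.0 * v / total for c, v in fV.items()}
    return fV, pct


# Hypothetical inputs for finalists A and B.
dV = {"A": 450_000, "B": 400_000}  # definite votes
tV = {"A": 58_000, "B": 62_000}    # transferred votes
udV = {"A": 16_500, "B": 13_500}   # assigned undecided votes

fV, pct = final_tally(dV, tV, udV)
for c in fV:
    print(f"{c}: {fV[c]:,.0f} votes ({pct[c]:.2f}%)")
```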

Putting it together

The figure below illustrates how the final number of votes is composed, according to the 6 steps presented above, in a particular case of a runoff election, after a 1st round election with 4 candidates, assuming candidates A and B are the top 2. For example, the final votes predicted for candidate A (fVA) will be the sum of: (1) votes directly transferred from the 1st round election (dVA), (2) the portion of undecided votes that go to A based on the average percentage of new daily fans/followers gathered by A (udVA) and (3) the votes transferred from candidates C (tV_C->A) and D (tV_D->A).

A little bit of AI

As indicated in item 4b. above, we use data mining and sentiment analysis to estimate the percentage of votes transferred from any candidate who does not make it to a runoff election to another who does.

We have collected the posts and the comments made directly onto the candidates’ social media accounts (Facebook and Twitter) for several months before the elections. In particular, if we want to evaluate the transfer of votes, we need to look at the comments between the day after the 1st round election and the day before the runoff election posted on the social media accounts of the losing candidates. It is very common that a supporter of a losing candidate will post comments telling their candidate to support A or B in the runoff election; or the comment will directly criticize or support candidate A or B or their parties.

First, we create a long list of expressions associated with the winning candidates and their parties. Then we apply data mining to select only those comments that contain any expression similar to those in the list. Lastly, we run sentiment analysis on the selected comments. The proportion of positive comments associated with candidate A, plus negative comments associated with B, represents the percentage of votes transferred to candidate A, and conversely for candidate B. The figure below illustrates these steps.
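To make the counting rule concrete, here is a toy sketch. It is a deliberate simplification and not the pipeline we actually ran: the expression lists, the comments and the tiny word-list "sentiment" function are all invented, matching is by exact substring rather than similarity, and comments that mention both finalists (or neither) are simply skipped.

```python
POSITIVE = {"support", "vote", "good", "best"}      # toy lexicon (assumption)
NEGATIVE = {"never", "against", "worst", "corrupt"}  # toy lexicon (assumption)


def toy_sentiment(text):
    """Return +1, -1 or 0 from a naive word count (stand-in for a real model)."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)


def transfer_percentages(comments, expressions):
    """Positive about X or negative about the rival counts as a point for X."""
    a, b = list(expressions)  # exactly two finalists expected
    points = {a: 0, b: 0}
    for text in comments:
        lowered = text.lower()
        mentioned = [c for c, exprs in expressions.items()
                     if any(e in lowered for e in exprs)]
        if len(mentioned) != 1:
            continue  # skip comments naming neither or both finalists
        sentiment = toy_sentiment(text)
        if sentiment == 0:
            continue  # neutral comments carry no signal in this sketch
        target = mentioned[0]
        rival = b if target == a else a
        points[target if sentiment > 0 else rival] += 1
    total = sum(points.values())
    if total == 0:
        raise ValueError("no usable comments")
    return {c: p / total for c, p in points.items()}


expressions = {"A": ["alpha"], "B": ["beta"]}  # hypothetical expression lists
comments = [
    "please support candidate alpha in the runoff",
    "never voting for the beta party",
    "beta is the best choice",
    "nice weather today",
]
result = transfer_percentages(comments, expressions)
print(result)
```

In this toy run, two comments count toward A and one toward B, so about two thirds of the transferable votes would go to A.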

Our Predictions for the 2nd round Brazilian Presidential Election 2022

The 1st round of the elections was held on Sunday, October 2, 2022, with 11 candidates. The table below shows the actual results. We will use these numbers as a starting point for our 2nd round predictions.

The top 2 candidates, Lula and Bolsonaro, would compete in the runoff election on October 30, 2022.

On October 29 we evaluated our prediction model to come up with the likely winner on the following day. Our assumptions were the following.

Definite and Undecided Votes

Since this was an extremely polarized election, we chose to use 100% as the percentage of definite votes for Lula and Bolsonaro. That is, we assumed that everyone that voted for Lula or Bolsonaro in the 1st round would vote in the same way in the 2nd round.

According to our modeling, the votes received by the other candidates in the 1st round would be either transferred to one of the top candidates, or go into a pool of undecided votes. The percentages used and our reasoning are given in the table below.

The percentages of undecided votes (puv) in the table above are input parameters to our model and were defined based on our assessment of the political parties and alliances being formed in the days before the 2nd round election. They are certainly subjective.

Vote Transfer

As presented in step 4. above, the votes given to a non-top-2 candidate become either undecided or transferred to one or both of the top-2 candidates according to a percentage of vote transfer (pt_i->1, pt_i->2). This percent transfer is either an input parameter (green cells in the tables below) to the model (when the political alignment is clear), or estimated based on data mining and sentiment analysis of the comments posted on the social media accounts of the non-top-2 candidates (gray cells).

We performed data mining and sentiment analysis on all comments posted in the social media accounts (Facebook, Twitter and YouTube) of candidates Simone Tebet and Ciro Gomes by their fans/followers. The other candidates had considerably fewer comments and their party alignment was better defined, thus we used estimated values. The tables below present the vote transfer percentages. The total number of votes to be transferred from each non-top-2 candidate is the total votes received in the 1st round minus the number of undecided votes.

Assigning the Undecided Votes

As explained in Step 5. above, the undecided votes are “assigned” to one of the top-2 candidates based on the average percentage of new daily fans/followers gathered by each candidate.

The figure below shows the daily incremental change in the total number of fans and followers in the social media accounts of Lula and Bolsonaro (considering the sum of Facebook, Instagram and Twitter) in the period between the 1st round and the 2nd round elections.

Using the average values from the figure above and the total number of undecided votes, we assign the undecided votes to each candidate as follows:

Now we are able to estimate the final number of votes to be received by Lula and Bolsonaro in the runoff election, using the estimated values of Definite Votes, Transfer Votes and Undecided Votes and the formula on Step 6.

The whole process can be visualized in the following diagram. According to our projections, when going from the 1st round to the runoff election, Lula would gain an additional 3.5 million votes, and Bolsonaro would collect another 6.3 million. However, due to Lula’s advantage in the 1st round, Lula would still win the election in the end.

Final Results and Opinion Polls

We produced this analysis on the eve of the runoff election (Oct/29): Lula would defeat Bolsonaro by a small margin, 51.41% against 48.59%. On the same day, there were several opinion polls released. The table below compares our estimates with the other opinion polls and the official election results.

As the table shows, our predictions were better than all but one of the polls (see references [6,7,8] for the sources of these numbers).

Final Considerations

This article presented a methodology for estimating the results of elections based on 3 main components:

  1. Daily fans/followers collected from the social media accounts of the candidates.
  2. Data mining and sentiment analysis of comments posted on those accounts.
  3. Basic assumptions about the candidates’ initial positions (through polls or 1st round elections), and their political alignments.

Using these 3 components, we estimated the number of decided votes, transfer of votes and votes from previously undecided people.

Several simplifying assumptions were made, such as: (1) nationwide proportional voting (as is the case in Brazil), (2) a reasonable expectation of the number of valid votes on election day (based on compulsory voting and historical trends), and (3) a very polarized election where voters felt very strongly for one candidate or another. These assumptions held in Brazil in 2022.

Our estimates were as good as the best 2 opinion polls published in Brazil on the eve of the runoff election, at a considerably lower cost. The price for conducting opinion polls in Brazil in 2022 varied from over five hundred thousand Brazilian Reais (approximately one hundred thousand USD) for the more prestigious ones (Datafolha, IPEC) with bigger sample population, to fifty thousand Reais (about 10K USD) for lesser known companies and smaller samples [9].

Our methodology is fully based on data collected from social media [13] and can be evaluated daily at no extra cost, after the initial setup. The main cost is associated with the social media data collection, which is significantly lower than the cheapest opinion poll, while giving results comparable to the best polls in the elections we tested.

Our simplifying assumptions may not be directly applicable to different election systems (such as the Electoral College process in the US), or to elections that are much less polarized. For US Presidential elections, we would need to collect social media data on a state by state basis, which is not always available (note that all data we collect from social media is public — there is no personal information used). As for elections that are significantly less polarized, we could adapt our input parameters to handle that, perhaps at some cost in accuracy (this is also a problem for opinion polls).

We have tested our methodology in the Brazilian elections for President 2018 [10], and 2022 (this article), and for Mayor of São Paulo in 2020 [11], as well as the US Georgia Senate elections in 2021 [12]. In all cases we obtained results comparable to the best opinion poll results.

Social media continues to draw a lot of criticism for the negative effects of overuse (especially in kids), the potential for cyberbullying, and the dissemination of fake news, among many others. However, in the particular case of expressing opinions about candidates and elections, we have found social media to be fairly representative of the general population (subtracting the fake news and bots, of course) and a useful source of data for predicting election results. The analytical methodology described here shows that this is feasible.

References

  1. Andrea Ceron, Luigi Curini and Stefano Iacus, How pollsters could use social media data to improve election forecasts (2016), Washington Post, Dec. 21, 2016.
  2. A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment (2010), Fourth International AAAI Conference on Weblogs and Social Media.
  3. B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, From tweets to polls: Linking text sentiment to public opinion time series (2010), Fourth International AAAI Conference on Weblogs and Social Media.
  4. Kellyton dos Santos Brito and Paulo J. L. Adeodato, Predicting Brazilian and U.S. Elections with Machine Learning and Social Media Data (2020), 2020 International Joint Conference on Neural Networks (IJCNN).
  5. Z. Zhou, M. Serafino, L. Cohan, G. Caldarelli and H. Makse, Why polls fail to predict elections (2021), Journal of Big Data 8, 137 (2021).
  6. Official Election Results — Brasil (in Portuguese)
  7. Gabriel Sestrem, As pesquisas eleitorais acertaram ou erraram no 2.º turno? (2022), Jornal Gazeta do Povo, Oct. 30, 2022 (in Portuguese).
  8. Pesquisas eleitorais (2022), G1 Globo web site (in Portuguese).
  9. Lula x Bolsonaro: veja as últimas pesquisas que serão divulgadas hoje (29) (2022), Jornal Estado de Minas, Oct. 29, 2022 (in Portuguese).
  10. Podem Mídias Sociais Prever o Resultado das Eleições para Presidente? (31/out/2018), LinkedIn article (in Portuguese).
  11. Prevendo o Resultado das Eleições em São Paulo (2020), Odysci Blog (in Portuguese).
  12. The Polls Are Dead — Long Live Social Media (2021), Odysci Blog.
  13. All social media data was collected using the Odysci Media Analyzer.

Reinaldo Bergamaschi

From circuit design, to cad, to simulation, to search, to data mining, to AI and a lot of coding in between. Researcher, Entrepreneur, Manager, Developer.