# Detection of statistical outliers of voting tables in Peruvian election

Posted 4 months ago

Introduction

In this post I describe and apply a methodology to detect voting tables that may be considered statistical outliers. The analysis is done at two levels: district and polling station. The second round of the Peruvian election had two candidates representing two political parties: Pedro Castillo for Perú Libre (PL) and Keiko Fujimori for Fuerza Popular (FP).

Warning: the original article was written in Spanish, so some of the code uses Spanish variable names.
Theory and hypothesis: votes distribution in the voting tables
Random distribution of existing votes

Let’s assume that we already know the number of votes for each candidate, plus the blank and invalid votes. These values are fixed for a given delimited area (which could be a province, district or polling station). Suppose we have a total of $\mathcal{T}$ votes, distributed as $\mathcal{A}$ votes for candidate 1, $\mathcal{C}$ votes for candidate 2, $\mathcal{B}$ blank votes and $\mathcal{N}$ invalid votes ($\mathcal{T} = \mathcal{A} + \mathcal{C} + \mathcal{B} + \mathcal{N}$), and that a voting table is filled by drawing votes at random, without replacement, from that pool. The distribution generated by choosing randomly in that way is known as the hypergeometric distribution (with more than two vote types, it is the multivariate hypergeometric distribution).

Here is a practical example of how it can be deduced. Let’s say we have $\mathcal{A} = 11$, $\mathcal{C} = 10$, $\mathcal{B} = 1$, $\mathcal{N} = 3$, which means $\mathcal{T} = 25$. Now let’s fill a voting table with $m = 5$ votes using that data.

Suppose the voting table has all 5 votes for $\mathcal{A}$. What is the probability of that happening? The probability of drawing an $\mathcal{A}$ out of that bag of 25 is 11/25, the probability of then drawing another $\mathcal{A}$ is 10/24, and so on:

In[]:= (11/25) (10/24) (9/23) (8/22) (7/21)
Out[]= 1/115

In terms of factorials this can be expressed as:

In[]:= (11!/6!)/(25!/20!)
Out[]= 1/115

If instead the table has 4 $\mathcal{A}$ and 1 $\mathcal{C}$, the probability is 11/25 for the first $\mathcal{A}$, 10/24 for the second, 9/23 and 8/22 for the third and fourth, and 10/21 for the fifth vote, a $\mathcal{C}$. But the $\mathcal{C}$ could also be drawn first, second, third or fourth, so we have to consider all those cases, and the total probability is the sum over them:

In[]:= (11/25) (10/24) (9/23) (8/22) (10/21) + (11/25) (10/24) (9/23) (10/22) (8/21) + (11/25) (10/24) (10/23) (9/22) (8/21) + (11/25) (10/24) (10/23) (9/22) (8/21) + (10/25) (11/24) (10/23) (9/22) (8/21)
Out[]= 10/161

All 5 orderings have the same probability, so this is equivalent to:

In[]:= 5 (11!/7!) 10/(25!/20!)
Out[]= 10/161

Now let’s try to fill a voting table with $m = 5$ votes of any type. How many different ways are there to fill it?
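We can use Permutations to get the full list of possible ordered arrangements (𝒜 and 𝒞 stand for votes for candidates 1 and 2, ℬ for blank and 𝒩 for invalid votes):

In[]:= votos={𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒜,𝒞,𝒞,𝒞,𝒞,𝒞,𝒞,𝒞,𝒞,𝒞,𝒞,ℬ,𝒩,𝒩,𝒩};
In[]:= Permutations[votos,{5}]//Short
Out[]//Short= {{𝒜,𝒜,𝒜,𝒜,𝒜},{𝒜,𝒜,𝒜,𝒜,𝒞},{𝒜,𝒜,𝒜,𝒜,ℬ},{𝒜,𝒜,𝒜,𝒜,𝒩},{𝒜,𝒜,𝒜,𝒞,𝒜},{𝒜,𝒜,𝒜,𝒞,𝒞},{𝒜,𝒜,𝒜,𝒞,ℬ},«618»,{𝒩,𝒩,𝒩,𝒜,𝒞},{𝒩,𝒩,𝒩,𝒜,ℬ},{𝒩,𝒩,𝒩,𝒞,𝒜},{𝒩,𝒩,𝒩,𝒞,𝒞},{𝒩,𝒩,𝒩,𝒞,ℬ},{𝒩,𝒩,𝒩,ℬ,𝒜},{𝒩,𝒩,𝒩,ℬ,𝒞}}
In[]:= %//Length
Out[]= 632

There are 632 ordered ways in which we can fill those 5 votes. But as we saw before, these arrangements are not all equally probable, because each vote in the pool corresponds to a different voter. So we have to count the ways of choosing the voters instead. In general, the number of ways to choose 5 out of 25 is:

In[]:= Binomial[25,5]
Out[]= 53130

That is the denominator. The numerator of the probability we are estimating is the product, over the vote types, of the binomial coefficient of the total number of votes of that type and the number of votes chosen. For the example of 4 𝒜 and 1 𝒞 it gives the value that we computed before:

In[]:= Binomial[11,4] Binomial[10,1] Binomial[1,0] Binomial[3,0]/Binomial[25,5]
Out[]= 10/161

Generalizing, let $\mathcal{A}$, $\mathcal{C}$, $\mathcal{B}$ and $\mathcal{N}$ be the pool totals for candidate 1, candidate 2, blank and invalid votes, with $\mathcal{T}$ their sum. The probability of finding $a$ votes for $\mathcal{A}$, $c$ for $\mathcal{C}$, $b$ for $\mathcal{B}$ and $n$ for $\mathcal{N}$ in a table with a total of $m = a + c + b + n$ votes is:

$$P(a, c, b, n) = \frac{\binom{\mathcal{A}}{a}\binom{\mathcal{C}}{c}\binom{\mathcal{B}}{b}\binom{\mathcal{N}}{n}}{\binom{\mathcal{T}}{m}}$$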
And that’s the probability function of the multivariate hypergeometric distribution.
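As a quick check of the formula, the following minimal sketch compares the product-of-binomials form with the built-in distribution for the example pool (11, 10, 1, 3), and also sums the draw-by-draw probabilities of ordered sequences. The helper names `pdfBinomial` and `seqProb` and the string labels are illustrative assumptions, not part of the original notebook.

```wolfram
(* Pool from the example: candidate 1, candidate 2, blank, invalid *)
pool = {11, 10, 1, 3};

(* Hypothetical helper: product of binomials over the binomial of the totals *)
pdfBinomial[counts_List, totals_List] :=
  (Times @@ MapThread[Binomial, {totals, counts}])/
   Binomial[Total[totals], Total[counts]];

dist = MultivariateHypergeometricDistribution[5, pool];

pdfBinomial[{5, 0, 0, 0}, pool]    (* 1/115 *)
pdfBinomial[{4, 1, 0, 0}, pool]    (* 10/161 *)
pdfBinomial[{4, 1, 0, 0}, pool] == PDF[dist, {4, 1, 0, 0}]

(* Hypothetical helper: probability of one ordered sequence, drawing
   without replacement. The folded state is
   {probability so far, votes left per label}. *)
start = Association["c1" -> 11, "c2" -> 10, "blank" -> 1, "invalid" -> 3];
seqProb[seq_List] := First@Fold[
    Function[{state, s},
     {state[[1]]*state[[2]][s]/Total[state[[2]]],
      MapAt[# - 1 &, state[[2]], Key[s]]}],
    {1, start}, seq];

(* The 5 orderings of four "c1" and one "c2" add up to the same value *)
Total[seqProb /@ Permutations[{"c1", "c1", "c1", "c1", "c2"}]]   (* 10/161 *)

(* The distinct ordered sequences are not equally likely, but their
   probabilities add up to 1 *)
votes = Join[ConstantArray["c1", 11], ConstantArray["c2", 10],
   {"blank"}, ConstantArray["invalid", 3]];
Total[seqProb /@ Permutations[votes, {5}]]   (* 1 *)
```

Working with exact rationals keeps the comparison free of rounding, so the equality with the built-in `PDF` should hold exactly rather than approximately.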
Example simulation

We can also test this function with a simulation. First, let’s generate 1000 random voting tables out of the votes proposed in the example:

In[]:= randomSample=Table[RandomSample[votos,5],1000];

Then let’s count how many votes of each type (𝒜 and 𝒞 for the two candidates, ℬ blank, 𝒩 invalid) there are in each voting table:

In[]:= countsPerTable=Lookup[Rule@@@Tally[#],{𝒜,𝒞,ℬ,𝒩},0]&/@randomSample;
In[]:= countsPerTable//Short
Out[]//Short= {{1,2,0,2},{3,1,0,1},{2,3,0,0},{4,1,0,0},{3,2,0,0},{3,1,0,1},{3,2,0,0},{1,4,0,0},«985»,{1,3,0,1},{3,1,0,1},{5,0,0,0},{2,3,0,0},{2,2,1,0},{1,1,1,2},{3,2,0,0}}

Now we can compare the marginal distribution of each variable:

In[]:= Table[Show[Histogram[countsPerTable[[All,v]],{1},"ProbabilityDensity"],DiscretePlot[PDF[MarginalDistribution[MultivariateHypergeometricDistribution[5,{11,10,1,3}],v],i],{i,0,5},ExtentSize->Scaled[1/2]],PlotLabel->Row[{"Distribution for ",{𝒜,𝒞,ℬ,𝒩}[[v]]}]],{v,4}]

[Figure: four panels, one per vote type, each overlaying the sample histogram and the theoretical marginal distribution.]

There is good agreement between the random sample histogram and the theoretical distribution.

Now let’s repeat the simulation with 100000 voting tables and look at their counts and distributions:

In[]:= randomSample=Table[RandomSample[votos,5],100000];
In[]:= countsPerTable=Lookup[Rule@@@Tally[#],{𝒜,𝒞,ℬ,𝒩},0]&/@randomSample;
In[]:= countsPerTable//Short
Out[]//Short= {{3,1,0,1},{1,4,0,0},{1,4,0,0},{2,3,0,0},{1,4,0,0},{0,3,1,1},{1,3,1,0},«99986»,{2,3,0,0},{2,1,1,1},{3,1,1,0},{2,1,0,2},{2,2,0,1},{2,1,1,1},{3,2,0,0}}

And again we compare the marginal distribution of each variable:

In[]:= Table[Show[Histogram[countsPerTable[[All,v]],{1},"ProbabilityDensity"],DiscretePlot[PDF[MarginalDistribution[MultivariateHypergeometricDistribution[5,{11,10,1,3}],v],i],{i,0,5},ExtentSize->Scaled[1/2]],PlotLabel->Row[{"Distribution for ",{𝒜,𝒞,ℬ,𝒩}[[v]]}]],{v,4}]

[Figure: the same four panels for the sample of 100000 tables.]

With the larger sample, the random sample distribution converges to the theoretical distribution.
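The same comparison can be sketched numerically without histograms, by sampling the distribution directly with `RandomVariate` and comparing the sample means with the exact means. This is only an illustrative shortcut and not the procedure used in the post; the seed and the variable names are arbitrary choices.

```wolfram
(* Example pool: candidate 1, candidate 2, blank, invalid; tables of 5 votes *)
dist = MultivariateHypergeometricDistribution[5, {11, 10, 1, 3}];

SeedRandom[2021];                      (* arbitrary seed, for repeatability *)
sample = RandomVariate[dist, 100000];  (* one count vector per simulated table *)

exactMean = Mean[dist]       (* exact means 5*{11, 10, 1, 3}/25 = {11/5, 2, 1/5, 3/5} *)
sampleMean = N[Mean[sample]] (* should be close to {2.2, 2., 0.2, 0.6} *)

(* Largest absolute deviation between sample and exact means *)
Max[Abs[sampleMean - exactMean]]
```

Every simulated table sums to 5 votes, so `Total /@ sample` should contain only the value 5, which is a cheap sanity check on the sampling.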
Distribution by groups

There is an important consideration to take into account: the grouping of voters into tables is not random. Within a polling station, voters are assigned to voting tables sorted by last name, which generates clusters and a bias that a random sample like the one analyzed above would not have.

Null hypothesis

For this analysis the null hypothesis is that the distribution of votes among the voting tables of a given area (district or polling station) follows a multivariate hypergeometric distribution, and we will test whether that hypothesis is rejected or not. Then we will evaluate whether we can identify voting tables that may be considered outliers. This hypothesis implicitly assumes that the voters within that area have been distributed randomly among the voting tables, which means that we do not account for the bias generated by the last-name clusters, which cannot be determined, as far as we know.
About the data
Utility code
Load data

Let’s load the full list of voting tables:

In[]:= actasCompletas=Uncompress@CloudGet[CloudObject["https://www.wolframcloud.com/obj/franciscoj/elecciones/2021-2/actasCompletas"]];

The structure of this data is a table in which each row is a list of the following fields:

{“voting table code”, “polling station code”, “ubigeo”, “electors”, “votes”, “PL”, “FP”, “blanks”, “invalids”, “disputed votes”, “observation 1”, “observation 2”}

“ubigeo” is the code for a district, which may be an actual Peruvian district (or municipality) or a foreign state (like NY, needed for voters outside Peru).

In[]:= TextGrid[RandomSample[actasCompletas,5],Frame->All]

[Output: grid with five randomly sampled rows.]
The values found in the two observation fields are:

In[]:= TextGrid[Tally[actasCompletas[[All,11]]],Frame->All]

[Output: tally of the “observation 1” values.]

In[]:= TextGrid[Tally[actasCompletas[[All,12]]],Frame->All]

[Output: tally of the “observation 2” values.]

We can check the total vote count over all the voting tables:

In[]:= showTotalVotes@Total[actasCompletas[[All,4;;9]]]

[Output: table of totals.]

and verify that it is exactly the same as the one published by ONPE when the count reached 100% at https://www.resultadossep.eleccionesgenerales2021.pe/SEP2021/EleccionesPresidenciales/RePres/T
Filtering by type

In this analysis we will only consider the “ACTA ELECTORAL NORMAL” case, that is, the voting tables that did not go to the JEE (Jurado Electoral Especial). When the JEE resolves a table, it may change the results or invalidate the table, so we will focus only on the tables that have not been modified after election day. The number of unmodified tables is 84864.

List of unmodified tables:

In[]:= actasNormales=Cases[actasCompletas,{__,"ACTA ELECTORAL NORMAL",_}];

The list of tables that went to the JEE and were then counted includes 1397 cases:

In[]:= actasResueltasContabilizadas=Cases[actasCompletas,{__,"ACTA ELECTORAL RESUELTA","CONTABILIZADAS NORMALES"}];
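To illustrate how these `Cases` patterns select rows by their last two fields, here is a minimal sketch on made-up rows with the same 12-field layout. All values, including the empty string used for “observation 2” in normal tables, are invented for the example.

```wolfram
(* Made-up rows: table code, polling station, ubigeo, electors, votes,
   PL, FP, blanks, invalids, disputed, observation 1, observation 2 *)
rows = {
   {"000001", "0001", "140137", 300, 250, 120, 110, 5, 15, 0,
    "ACTA ELECTORAL NORMAL", ""},
   {"000002", "0001", "140137", 300, 240, 100, 120, 5, 15, 0,
    "ACTA ELECTORAL RESUELTA", "CONTABILIZADAS NORMALES"},
   {"000003", "0002", "140130", 280, 230, 90, 125, 5, 10, 0,
    "ACTA ELECTORAL NORMAL", ""}};

(* __ matches any leading fields; the last two positions are fixed *)
normal = Cases[rows, {__, "ACTA ELECTORAL NORMAL", _}];
resolved = Cases[rows,
   {__, "ACTA ELECTORAL RESUELTA", "CONTABILIZADAS NORMALES"}];

Length /@ {normal, resolved}   (* {2, 1} *)
```

Because `__` matches any sequence of leading fields, the pattern depends only on where the observation fields sit relative to the end of each row.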
Basic stats Let’s see in each group of tables the distribution of the different numeric variables (electors, votes, votes for Perú Libre, for Fuerza Popular, blanks and invalids).
Utility code
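Histograms of normal voting tables (which didn’t go to the JEE)

[Figure: histograms of electors, votes, PL, FP, blanks and invalids for the normal voting tables.]

Histograms of voting tables resolved by the JEE and counted

[Figure: the same histograms for the tables resolved by the JEE and counted.]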
District stats

Let’s analyze stats at district level. In Peru there are 1874 districts, and we will also count as “districts” all the “ubigeos” used by ONPE, which include foreign countries. That makes a total of 2082 ubigeos. For each district we compute the totals of electors, votes, PL, FP, blank and invalid votes, using only normal voting tables.

In[]:= totalByDistrict=GroupBy[actasNormales,Extract[3],Total[#[[All,4;;9]]]&];
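The grouping step can be sketched on a few made-up rows to show what the resulting association looks like. The rows below are hypothetical and shortened to the first nine fields.

```wolfram
(* Made-up rows: table code, polling station, ubigeo, electors, votes,
   PL, FP, blanks, invalids *)
rows = {
   {"t1", "s1", "d1", 300, 250, 120, 110, 5, 15},
   {"t2", "s1", "d1", 290, 240, 100, 120, 8, 12},
   {"t3", "s2", "d2", 280, 230, 90, 125, 5, 10}};

(* Group by the third field and add up the numeric columns 4 to 9 *)
totals = GroupBy[rows, Extract[3], Total[#[[All, 4 ;; 9]]] &]
(* association with "d1" -> {590, 490, 220, 230, 13, 27}
   and "d2" -> {280, 230, 90, 125, 5, 10} *)

(* District with the most electors (first entry of each total vector) *)
MaximalBy[totals, First]
```

`Total` applied to a list of equal-length rows adds them column by column, which is why each district ends up with a single vector of six totals.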
Let’s look at some extreme cases to get an idea of how the different districts behave.

Largest district:

In[]:= showTotalVotes/@MaximalBy[totalByDistrict,First]
Out[]= 140137

That ubigeo code corresponds to San Juan de Lurigancho, Lima.

District with the largest margin of Perú Libre votes over Fuerza Popular:

In[]:= showTotalVotes/@MaximalBy[totalByDistrict,#[[3]]-#[[4]]&]
Out[]= 200901

That ubigeo code corresponds to Juliaca, San Román, Puno.

District with the largest margin of Fuerza Popular votes over Perú Libre:

In[]:= showTotalVotes/@MaximalBy[totalByDistrict,#[[4]]-#[[3]]&]
Out[]= 140130

That ubigeo code corresponds to Santiago de Surco, Lima.

District with more than 50000 electors and the smallest absolute difference between both parties:

In[]:= showTotalVotes/@MinimalBy[Select[totalByDistrict,First[#]>50000&],Abs[#[[4]]-#[[3]]]&]
Out[]= 130301

That ubigeo code corresponds to Lambayeque, Lambayeque, Lambayeque. The details of how the distribution behaves in each of these districts can be seen in the Appendix section.
Computations and results
Simulation and distribution fit test of null hypothesis
At district level

The multivariate hypergeometric distribution is defined for a fixed number of draws, while the voting tables in a district have different numbers of votes. So, to test whether we reject the hypothesis or not, we generate for each table a random sample from the distribution with that table’s number of votes, and combine the samples of all the tables in the district. We draw 1000 samples for each voting table, which effectively means that we simulate 1000 scenarios using this distribution.

In the code below, tables with zero votes are dropped and districts with four or fewer remaining tables are skipped (None). The reference vector is the district total of PL, FP, blank and invalid votes, and the observed per-table counts are tested with DistributionFitTest against the empirical distribution of all the simulated tables.

In[]:= valorPDistribucionEmpiricaDistritos=(ilocal=0;PrintTemporary[Dynamic[ilocal]];GroupBy[SortBy[actasNormales,Extract[3]],Extract[3],Function[{lista},ilocal++;With[{totales=Total[lista[[All,4;;10]]],lista0=DeleteCases[lista,{_,_,_,_,0,___}]},If[totales[[2]]==0||Length[lista0]<=4,None,With[{vectorreferencia=totales[[3;;6]],total=totales[[2]]},With[{simulaciones=Developer`ToPackedArray[Join@@(RandomVariate[MultivariateHypergeometricDistribution[#,vectorreferencia],1000]&/@lista0[[All,5]])]},DistributionFitTest[lista0[[All,6;;9]],EmpiricalDistribution[simulaciones]]]]]]]]);//AbsoluteTiming
Out[]= {2079.63,Null}

Summary:

[Output: summary of the p-values by district.]

Percentage of districts with enough information where the null hypothesis is not rejected: 78.2166 %
At polling station level

We do the same computation at polling station level, grouping by polling station code instead of ubigeo.

In[]:= valorPDistribucionEmpiricaLocal=(ilocal=0;PrintTemporary[Dynamic[ilocal]];GroupBy[actasNormales,Extract[2],Function[{lista},ilocal++;With[{totales=Total[lista[[All,4;;10]]],lista0=DeleteCases[lista,{_,_,_,_,0,___}]},If[totales[[2]]==0||Length[lista0]<=4,None,With[{vectorreferencia=totales[[3;;6]],total=totales[[2]]},With[{simulaciones=Developer`ToPackedArray[Join@@(RandomVariate[MultivariateHypergeometricDistribution[#,vectorreferencia],1000]&/@lista0[[All,5]])]},DistributionFitTest[lista0[[All,6;;9]],EmpiricalDistribution[simulaciones]]]]]]]]);//AbsoluteTiming
Out[]= {1549.15,Null}

Summary:

[Output: summary of the p-values by polling station.]

Percentage of polling stations with enough information where the null hypothesis is not rejected: 96.7424 %
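The test applied to each group can be illustrated on a small synthetic case. The sketch below is a simplified assumption-laden version: the reference totals and table sizes are made up, the "observed" tables are themselves drawn from the null distribution, and only 200 simulated tables are generated per observed table to keep it fast. It mirrors the structure of the computation above but is not the author's code, and the resulting p-value depends on the random seed.

```wolfram
SeedRandom[1];  (* arbitrary seed, for repeatability *)

(* Hypothetical group totals: PL, FP, blanks, invalids *)
reference = {5200, 4300, 150, 350};

(* Hypothetical numbers of votes in 40 tables *)
sizes = RandomInteger[{200, 300}, 40];

(* "Observed" tables drawn from the null itself, one count vector each *)
observed = RandomVariate[
     MultivariateHypergeometricDistribution[#, reference]] & /@ sizes;

(* Simulated scenarios: 200 draws for every table size, all joined *)
simulated = Join @@ (RandomVariate[
       MultivariateHypergeometricDistribution[#, reference], 200] & /@ sizes);

(* p-value of the observed tables against the simulated distribution *)
pValue = DistributionFitTest[observed, EmpiricalDistribution[simulated]]
```

Since the observed tables here really do come from the null distribution, a small p-value would be expected only occasionally. With real data the observed counts replace `observed`, and the reference vector comes from the group's own totals.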
Evaluation of outliers

Given that at district level and at polling station level there are many cases where the distribution is not rejected, we will assign a probability value to each voting table using the multivariate hypergeometric distribution. We will consider as an outlier any case with a probability lower than $10^{-6}$.
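The idea of scoring a table can be sketched as follows, assuming the score is simply the multivariate hypergeometric PDF evaluated at the table's counts with the group totals as the reference vector. The helper name `tableProb` and all the numbers are hypothetical.

```wolfram
(* Hypothetical group totals: PL, FP, blanks, invalids *)
groupTotals = {5200, 4300, 150, 350};

(* Hypothetical helper: probability of one table's counts under the null *)
tableProb[counts_List, totals_List] :=
  PDF[MultivariateHypergeometricDistribution[Total[counts], totals], counts];

typical = {130, 108, 4, 8};  (* close to the group proportions *)
odd = {40, 200, 2, 8};       (* far from the group proportions *)

(* Numerical probabilities of both tables *)
N[tableProb[#, groupTotals]] & /@ {typical, odd}

(* Tables below the 10^-6 threshold; with these made-up numbers
   only the second table is flagged *)
outliers = Select[{typical, odd}, tableProb[#, groupTotals] < 10^-6 &]
```

Even a table that matches the group proportions closely has a probability of only about one in a thousand here, because the probability mass is spread over many possible count vectors. That is why the threshold has to sit many orders of magnitude below typical values.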
Utility code
At district level In[]:= actasNormalesConProbabilidadPorDistrito=GroupBy[SortBy[actasNormales,Extract[3]],Extract[3],Function[{lista},With[{totales=Total[lista[ |