
PREDICTION MODEL FOR US PRESIDENTIAL ELECTIONS USING R SOFTWARE

Explanation of the model:

 This model is built for election forecasting: the art and science of predicting the winner of
an election before the votes are actually cast, using polling data from likely voters.
 The primary focus here is the US presidential election, which is held every 4 years.
 There are two main competitive candidates: the Republican candidate and the Democratic
candidate.
 In most countries the winner is decided by a majority of the votes cast; in the USA, that
isn't the case.
 There are 50 states in the United States, and each is assigned a number of electoral votes
based on its population.
 The candidate who receives the most votes in a state gets all of that state's electoral votes.
Across the entire country, the candidate who receives the most electoral votes wins the
presidential election.
 The data come from RealClearPolitics.com and consist of polling data collected in the
months leading up to the 2004, 2008, and 2012 US presidential elections.
 Each row in the data set represents a state in a particular election year.
 The dependent variable, called Republican, is a binary outcome.
 It is 1 if the Republican won that state in that election year, and 0 if the Democrat
won.
 The independent variables are all derived from polling data in that state.
 For instance, the Rasmussen and SurveyUSA variables come from two major polls that
are conducted across many different states in the United States.
 Each of these two variables is the percentage of voters who said they were likely to vote
Republican minus the percentage who said they were likely to vote Democrat. For instance, if
SurveyUSA has the value -6 for a state, 6% more voters said they were likely to vote Democrat
than said they were likely to vote Republican in that state.
 DiffCount counts the number of all the polls leading up to the election that predicted a
Republican winner in the state, minus the number of polls that predicted a Democratic
winner.
 And PropR, or proportion Republican, has the proportion of all those polls leading up to the
election that predicted a Republican winner.
 Because it is not known in advance which model will fit best, we first fit regressions on the
2004 and 2008 data, choose the best model, and then apply it to the 2012 data.
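To make the poll-derived variables described above concrete, here is a minimal R sketch. The poll margins are invented for illustration, and the convention that a tied poll counts for neither party is an assumption, not something stated in the data description.

```r
# Hypothetical poll margins for one state in one election year:
# Republican % minus Democrat % (invented numbers, not from the data set).
poll_margins <- c(-6, -3, 2, -1, 4, -5)

# DiffCount: polls predicting a Republican win minus polls predicting
# a Democratic win. Assumption: a tie (margin 0) counts for neither.
diff_count <- sum(poll_margins > 0) - sum(poll_margins < 0)

# PropR: proportion of all polls that predicted a Republican win.
prop_r <- mean(poll_margins > 0)

print(c(DiffCount = diff_count, PropR = prop_r))
```

Here two polls favour the Republican and four favour the Democrat, so DiffCount is negative and PropR is below 0.5.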
 EXPLANATION AND RESULTS
 Here the model is trained on data from the 2004 and 2008 elections and tested on data from
the 2012 presidential election.
 First, a data frame called Train is created with the subset function, which keeps only the
observations from the original polling data frame where Year is 2004 or 2008.
 A second subset, Test, holds the 2012 polling data.
 Next, we need a baseline model against which to compare the logistic regression model.
 For this, we check the breakdown of the dependent variable in the training set.
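The split and the baseline check described above might look like the following sketch. The data frame name `polling` and its toy contents are assumptions made for illustration; only the column names Year and Republican and the names Train and Test come from the text.

```r
# Toy stand-in for the polling data frame (invented values).
polling <- data.frame(
  State      = rep(c("A", "B", "C", "D"), times = 3),
  Year       = rep(c(2004, 2008, 2012), each = 4),
  Republican = c(1, 0, 1, 1,  0, 0, 1, 1,  0, 0, 1, 0)
)

# Training set: 2004 and 2008. Test set: 2012.
Train <- subset(polling, Year == 2004 | Year == 2008)
Test  <- subset(polling, Year == 2012)

# Breakdown of the dependent variable in the training set.
outcome_counts <- table(Train$Republican)
print(outcome_counts)

# The naive baseline always predicts the more common outcome,
# so its accuracy is the share of that outcome.
baseline_accuracy <- max(outcome_counts) / sum(outcome_counts)
print(baseline_accuracy)
```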

 Interpretation: in 47 of the 100 training observations the Democrat won the state, and in
53 the Republican won.
 The baseline model therefore always predicts the more common outcome, a Republican win,
and is 53% accurate on the training set.
 It is a weak model because it predicts Republican even for a state where the Democrat is
polling 15 to 20% ahead of the Republican.
 A better baseline uses a single poll. Here the Rasmussen poll is used: the prediction for
each state is the party that the poll shows in the lead.
 The sign function does this: it returns 1 if the Republican is ahead, -1 if the Democrat is
ahead, and 0 if the poll is tied, in which case the baseline is inconclusive.
 Results of the baseline model:

 The baseline predicts a Republican win in 55 observations and a Democratic win in 42, and is
inconclusive in 3.
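As a small illustration of the sign-based baseline, the sketch below applies R's `sign` function to a handful of invented Rasmussen margins. The numbers are not from the data set.

```r
# Hypothetical Rasmussen margins (Republican % minus Democrat %).
rasmussen <- c(12, -7, 0, 3, -15, 5, 0, -2)

# sign() returns 1 (Republican ahead), -1 (Democrat ahead) or 0 (tie).
smart_baseline <- sign(rasmussen)

# Count how often each prediction occurs.
counts <- table(smart_baseline)
print(counts)
```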
 Comparison of Baseline model 1 with baseline model 2

 Here, 0 and 1 denote a Democratic win and a Republican win, respectively.


 The results show that for 42 observations model 2 correctly predicted a Democratic win. In
total, model 2 made 3 mistakes and had 2 inconclusive results, whereas model 1 made 47
mistakes.
 Model 2 is clearly the better baseline.
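A comparison table of this kind can be built by cross-tabulating the actual outcome against the sign of the poll margin. The sketch below uses a tiny invented training set. The column names Republican and Rasmussen follow the text, while the values are assumptions.

```r
# Toy training data (invented values).
Train <- data.frame(
  Republican = c(1, 0, 0, 1, 0, 1, 1, 0),
  Rasmussen  = c(12, -7, 0, 3, -15, -5, 0, 2)
)

# Rows: actual outcome (0 = Democrat, 1 = Republican).
# Columns: baseline prediction (-1 = Democrat, 0 = inconclusive, 1 = Republican).
comparison <- table(Train$Republican, sign(Train$Rasmussen))
print(comparison)

# Correct predictions sit in the cells (0, -1) and (1, 1).
correct      <- comparison["0", "-1"] + comparison["1", "1"]
inconclusive <- sum(comparison[, "0"])
mistakes     <- sum(comparison) - correct - inconclusive
print(c(correct = correct, inconclusive = inconclusive, mistakes = mistakes))
```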

LOGISTIC REGRESSION MODEL

 Before fitting the regression, we need to consider possible multicollinearity among the
predictors.

 To do so, we check the correlation of the independent variables with each other and with
the response variable.
 The following results were obtained:

The independent variables are highly correlated with one another, so we start by using only one
predictor at a time.
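A correlation check of this kind could be sketched as follows. The data are simulated so that all predictors share a common underlying margin, which is an assumption made only to reproduce the kind of high correlation described above. The real correlation values are not reproduced here.

```r
set.seed(1)

# Simulated stand-in for the training set: every predictor is a noisy
# function of the same latent Republican lead, so they correlate strongly.
margin <- rnorm(100, mean = 0, sd = 10)
Train <- data.frame(
  Rasmussen  = margin + rnorm(100, sd = 3),
  SurveyUSA  = margin + rnorm(100, sd = 3),
  DiffCount  = round(margin / 2 + rnorm(100, sd = 1)),
  PropR      = plogis(margin / 3),
  Republican = as.integer(margin + rnorm(100, sd = 4) > 0)
)

# Pairwise correlations among the predictors and the response.
cor_matrix <- cor(Train)
print(round(cor_matrix, 2))
```

Note that `cor` requires numeric columns. With real data, any non-numeric columns such as the state name would have to be left out first.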

Starting with PropR,

In model 1, we predict the probability of a Republican win from the polling variable PropR.
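A single-predictor logistic regression of this form can be fitted with `glm` and `family = binomial`. The sketch below uses simulated data (an assumption; the outcome is made noisy on purpose so that the classes are not perfectly separated), and the model name `mod1` is illustrative.

```r
set.seed(42)

# Simulated stand-in for the training set.
n <- 100
PropR <- runif(n)
Republican <- rbinom(n, size = 1, prob = plogis(6 * (PropR - 0.5)))
Train <- data.frame(PropR, Republican)

# Logistic regression of the binary outcome on PropR.
mod1 <- glm(Republican ~ PropR, data = Train, family = binomial)

# Coefficients are on the log-odds scale; summary() also reports
# standard errors, p-values and the AIC.
print(summary(mod1))
```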

The results are:

 Interpretation:
 For every 1-unit increase in PropR, the log-odds of a Republican win increase by 11.390,
and the coefficient is statistically significant.
 Next, we check the predictive power of this model on the training set.
 Interpretation: in the table, 0 indicates that the Democrat won and 1 that the Republican
won. TRUE indicates that the model predicted Republican and FALSE that it predicted
Democrat.
 The model made 4 mistakes on the training set, an accuracy very close to that of the second
baseline.
 So the model needs further improvement.
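The prediction table described above can be produced by thresholding predicted probabilities and cross-tabulating them against the actual outcome. In practice the probabilities would typically come from `predict(model, type = "response")`. Here they are invented values, and the 0.5 cutoff is an assumption, so the sketch stays deterministic.

```r
# Actual outcomes (0 = Democrat, 1 = Republican) and invented
# predicted probabilities of a Republican win.
actual    <- c(0, 0, 0, 1, 1, 1, 1, 0)
predicted <- c(0.10, 0.40, 0.70, 0.90, 0.60, 0.30, 0.80, 0.20)

# TRUE means the model predicts Republican at a 0.5 threshold.
confusion <- table(actual, predicted >= 0.5)
print(confusion)

# Accuracy: correct predictions (the diagonal) over all observations.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```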
 TWO VARIABLE MODEL
 Going back to the correlation matrix, we look for a pair of variables that have comparatively
low correlation with each other.
 For model 2, we use SurveyUSA and DiffCount as independent variables and compute
the predictions.

 Interpretation: with a 1-unit increase in SurveyUSA, holding DiffCount fixed, the log-odds of
a Republican win increase by 0.2976.
 With a 1-unit increase in DiffCount, holding SurveyUSA fixed, the log-odds of a Republican
win increase by 0.76.
 Both coefficients are statistically significant.
 The AIC, which is used to compare models, is lower than for the previous model, which
indicates a better fit.
 Prediction results:
 The model makes one fewer mistake than the previous one and is also a better fit.
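A comparison of a one-variable and a two-variable model by AIC might be sketched as below. The data are simulated (an assumption), and the model names `mod_one` and `mod_two` are illustrative. The coefficients and AIC values it prints will not match those reported above.

```r
set.seed(7)

# Simulated stand-in for the training set.
n <- 100
SurveyUSA  <- rnorm(n, sd = 10)
DiffCount  <- round(SurveyUSA / 3 + rnorm(n, sd = 2))
Republican <- rbinom(n, 1, plogis(0.15 * SurveyUSA + 0.3 * DiffCount))
Train <- data.frame(SurveyUSA, DiffCount, Republican)

# One-variable and two-variable logistic regressions.
mod_one <- glm(Republican ~ SurveyUSA, data = Train, family = binomial)
mod_two <- glm(Republican ~ SurveyUSA + DiffCount, data = Train,
               family = binomial)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(mod_one, mod_two)
print(aic_table)
```

AIC penalises each extra parameter, so the larger model is preferred only if its improvement in fit outweighs the penalty.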

PREDICTION FOR 2012:

 First we run the smart baseline on the 2012 data. The results are:

 In 18 cases the smart baseline predicted a Democratic win and was correct.
 In 21 cases it predicted a Republican win and was correct.
 In 2 cases it was inconclusive.
 In 4 cases it predicted a Republican win but the Democrat actually won.

Next, we check the predictions of the logistic regression model on the 2012 data.

It made only one mistake, which is considerably better than the smart baseline, which failed to
give a correct prediction in 6 cases (4 mistakes and 2 inconclusive results).
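Scoring a held-out test set with a fitted model could look like the following sketch. The data generator `make_data`, the simulated values and the 0.5 threshold are all assumptions for illustration. The names TestPrediction and TestPredictionBinary mirror the column names in the summary table.

```r
set.seed(3)

# Hypothetical helper that simulates polling-style data.
make_data <- function(n) {
  SurveyUSA  <- rnorm(n, sd = 10)
  DiffCount  <- round(SurveyUSA / 3 + rnorm(n, sd = 2))
  Republican <- rbinom(n, 1, plogis(0.15 * SurveyUSA + 0.3 * DiffCount))
  data.frame(SurveyUSA, DiffCount, Republican)
}
Train <- make_data(100)  # stand-in for 2004 and 2008
Test  <- make_data(45)   # stand-in for 2012

mod <- glm(Republican ~ SurveyUSA + DiffCount, data = Train,
           family = binomial)

# Predicted probability of a Republican win for each test observation.
TestPrediction <- predict(mod, newdata = Test, type = "response")

# Binary prediction at a 0.5 threshold (1 = Republican, 0 = Democrat).
TestPredictionBinary <- as.integer(TestPrediction >= 0.5)

# Actual outcome (rows) against predicted outcome (columns).
print(table(Test$Republican, TestPredictionBinary))
```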

VISUALIZATION OF PREDICTED RESULTS THROUGH MAPS:


 The map shows the transition from Democratic preference (0) to Republican preference
(1) across states.

This map is a more discrete depiction of the same predictions.

SUMMARY OF PREDICTION RESULTS:


      Test Prediction   TestPredictionBinary   Test.State region
7     0.9739028         1                      Arizona
10    0.9994949         1                      Arkansas
13    0.0000926         0                      California
16    0.0094330         0                      Colorado
19    0.0000343         0                      Connecticut
24    0.9640395         1                      Florida
27    0.9901680         1                      Georgia
30    0.0000478         0                      Hawaii
33    0.9996372         1                      Idaho
36    0.0000926         0                      Illinois
39    0.9992970         1                      Indiana
42    0.0648667         0                      Iowa
45    0.9506137         1                      Kansas
48    0.9901659         1                      Kentucky
51    0.9994949         1                      Louisiana
54    0.0009383         0                      Maine
57    0.0000024         0                      Maryland
60    0.0000001         0                      Massachusetts
63    0.0000177         0                      Michigan
66    0.0004843         0                      Minnesota
69    0.9325489         1                      Mississippi
72    0.9990219         1                      Missouri
75    0.9986385         1                      Montana
78    0.9998655         1                      Nebraska
81    0.0001795         0                      Nevada
84    0.0000665         0                      New Hampshire
87    0.0000127         0                      New Jersey
90    0.0018172         0                      New Mexico
93    0.0000013         0                      New York
96    0.9506205         1                      North Carolina
99    0.9998655         1                      North Dakota
102   0.0000024         0                      Ohio
105   0.9996372         1                      Oklahoma
108   0.0035166         0                      Oregon
111   0.0000926         0                      Pennsylvania
114   0.0004844         0                      Rhode Island
117   0.9994949         1                      South Carolina
120   0.9949023         1                      South Dakota
123   0.9996372         1                      Tennessee
126   0.9973641         1                      Texas
129   0.9992969         1                      Utah
134   0.0181252         0                      Virginia
137   0.0000246         0                      Washington
140   0.9981049         1                      West Virginia
143   0.0006740         0                      Wisconsin

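This table shows, for each state in the 2012 test set, the predicted probability that the Republican
wins (Test Prediction) and the corresponding binary prediction (TestPredictionBinary): 1 for a
Republican win and 0 for a Democratic win.

The predicted probabilities are close to 1 in the states predicted for the Republican and close to 0
in the states predicted for the Democrat, so the model separates the two groups clearly.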

Conclusion:

1. Logistic regression models are mainly used for predicting probabilities; their predictions
always lie in a defined range, between 0 and 1.
2. Linear regression does not satisfy this criterion, since its predictions can fall below 0 or
above 1, and hence it is not suited to predicting probabilities.
3. Before applying a forecasting technique, it is advisable to test it on previous data, check the
accuracy of the model, and only then use it to predict new outcomes.
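The first two points can be illustrated with a small sketch that fits both a linear and a logistic model to the same binary outcome and predicts at extreme predictor values. The data and variable names are invented for illustration.

```r
# Invented data: x is a poll margin, y is the binary outcome.
d <- data.frame(
  x = c(-20, -10, -5, -2, 2, 5, 10, 20),
  y = c(0, 0, 0, 1, 0, 1, 1, 1)
)

lin   <- lm(y ~ x, data = d)                      # linear regression
logit <- glm(y ~ x, data = d, family = binomial)  # logistic regression

# Predict at margins well outside the observed range.
new_d <- data.frame(x = c(-40, 40))
lin_pred   <- predict(lin, newdata = new_d)
logit_pred <- predict(logit, newdata = new_d, type = "response")

# The linear predictions leave the [0, 1] range, while the logistic
# predictions stay inside it.
print(rbind(linear = lin_pred, logistic = logit_pred))
```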
