Data Project - Election Analysis

Election Data Project - Polls and Donors
In this Data Project we will be looking at data from the 2012 election.
In this project we will analyze two datasets. The first data set will be the results of political polls. We will analyze this aggregated poll data and answer some questions:
1.) Who was being polled and what was their party affiliation?
2.) Did the poll results favor Romney or Obama?
3.) How do undecided voters affect the poll?
4.) Can we account for the undecided voters?
5.) How did voter sentiment change over time?
6.) Can we see an effect in the polls from the debates?
We'll discuss the second data set later on!
Let's go ahead and start with our standard imports:
In [604… # True division on Python 2; __future__ imports go before everything else
from __future__ import division

# For data
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
The data for the polls will be obtained from HuffPost Pollster. You can check their website here. There are some pretty awesome political data sets to play with there, so I encourage you to go and explore them yourself after completing this project.
We're going to use the requests module to import some data from the web. For more information on requests, check out the documentation here.
We will also be using StringIO to work with the csv data we get from HuffPost. StringIO provides a convenient means of working with text in memory using the file API. Find out more about it here.
In [605… # Use to grab data from the web(HTTP capabilities)

import requests
# We'll also use StringIO to work with the csv file, the DataFrame will require a .read() method
from StringIO import StringIO
In [606… # This is the url link for the poll data in csv form
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
# Use requests to get the information in text form

source = requests.get(url).text
# Use StringIO to avoid an IO error with pandas

poll_data = StringIO(source)
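The cells above are written for Python 2. On Python 3 the StringIO class lives in the io module instead. Here is a minimal sketch of the same idea, where sample_csv is a made-up stand-in for the text returned by requests.get(url).text (the pollster names and numbers are invented for illustration):

```python
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

# Hypothetical stand-in for the text returned by requests.get(url).text
sample_csv = (
    "Pollster,Obama,Romney,Undecided\n"
    "Pollster A,47,47,6\n"
    "Pollster B,49,48,\n"
)

# Wrap the text in a file-like object so read_csv can call .read() on it
sample_df = pd.read_csv(StringIO(sample_csv))
print(sample_df)
```

The empty Undecided field in the second row is read as NaN, which is the same reason the real Undecided column has fewer non-null entries than the other columns.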
Now that we have our data, we can set it as a DataFrame.
In [607… # Set poll data as pandas DataFrame

poll_df = pd.read_csv(poll_data)
# Let's get a glimpse at the data

poll_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 589 entries, 0 to 588
Data columns (total 14 columns):
Pollster 589 non-null object
Start Date 589 non-null object
End Date 589 non-null object
Entry Date/Time (ET) 589 non-null object
Number of Observations 567 non-null float64
Population 589 non-null object
Mode 589 non-null object
Obama 589 non-null int64
Romney 589 non-null int64
Undecided 422 non-null float64
Pollster URL 589 non-null object
Source URL 587 non-null object
Partisan 589 non-null object
Affiliation 589 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 69.0+ KB
Great! Now let's get a quick look with .head()
In [608… # Preview DataFrame

poll_df.head()
Out[608]:
   Pollster                   Start Date  End Date    Entry Date/Time (ET)                Number of Observations  Population     Mode           Obama  Romney  Undecided  Pollster URL                                       Source URL                                         Partisan
0  Politico/GWU/Battleground  2012-11-04  2012-11-05  2012-11-06 2000-01-01 08:40:26 UTC  1000                    Likely Voters  Live Phone     47     47      6          http://elections.huffingtonpost.com/pollster/p...  http://www.politico.com/news/stories/1112/8338...  Nonpartisan
1  UPI/CVOTER                 ...         ...         2012-11-05 2000-01-01 18:30:15 UTC  3000                    ...            ...            49     48      NaN        http://elections.huffingtonpost.com/pollster/p...  NaN                                                Nonpartisan
2  Gravis Marketing           ...         ...         2012-11-06 2000-01-01 09:22:02 UTC  872                     Likely ...     Automated ...  48     48      4          http://elections.huffingtonpost.com/pollster/p...  http://www.gravispolls.com/2012/11/gravis-mark...  Nonpartisan
3  JZ Analytics/Newsmax       2012-11-03  2012-11-05  2012-11-06 2000-01-01 07:38:41 UTC  1041                    Likely Voters  Internet       47     47      6          http://elections.huffingtonpost.com/pollster/p...  http://www.jzanalytics.com/                        Sponsor
4  Rasmussen                  ...         ...         2012-11-06 2000-01-01 08:47:50 UTC  1500                    ...            ...            48     49      NaN        http://elections.huffingtonpost.com/pollster/p...  http://www.rasmussenreports.com/public_content...  Nonpartisan
Let's go ahead and get a quick visualization overview of the affiliation for the polls.
In [609… # Factorplot the affiliation

sns.factorplot('Affiliation',data=poll_df)
<seaborn.axisgrid.FacetGrid at 0xb9bd37b8>
Out[609]:
Looks like we are overall relatively neutral, but still leaning towards Democratic affiliation. It will be good to keep this in mind. Let's see if sorting by the Population hue gives us any further insight into the data.
In [610… # Factorplot the affiliation by Population

sns.factorplot('Affiliation',data=poll_df,hue='Population')
<seaborn.axisgrid.FacetGrid at 0xcbc90fd0>
Out[610]:
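The two plots are just counts, so the same numbers can be checked directly in pandas. Here is a small sketch on a made-up toy DataFrame (toy_polls and its rows are invented for illustration, not taken from poll_df). As a side note, newer seaborn releases renamed factorplot to catplot, where kind='count' appears to be the closest equivalent of the calls used here.

```python
import pandas as pd

# Toy stand-in for poll_df; the rows are made up for illustration
toy_polls = pd.DataFrame({
    "Affiliation": ["None", "Dem", "None", "Rep", "None", "Dem"],
    "Population": ["Likely Voters", "Registered Voters", "Likely Voters",
                   "Likely Voters", "Adults", "Likely Voters"],
})

# Number of polls per affiliation (what the first plot shows)
affiliation_counts = toy_polls["Affiliation"].value_counts()

# Affiliation broken down by polled population (what the hue adds)
affiliation_by_population = pd.crosstab(toy_polls["Affiliation"],
                                        toy_polls["Population"])

print(affiliation_counts)
print(affiliation_by_population)
```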
Looks like we have a strong showing of Likely Voters and Registered Voters, so the poll data should hopefully be a good reflection of the populations polled. Let's take another quick overview of the DataFrame.
In [611… # Let's look at the DataFrame again

poll_df.head()
Let's go ahead and take a look at the averages for Obama, Romney, and the polled people who remained undecided.
In [612… # First we'll get the average

avg = pd.DataFrame(poll_df.mean())
avg.drop('Number of Observations',axis=0,inplace=True)
# After that let's get the error

std = pd.DataFrame(poll_df.std())
std.drop('Number of Observations',axis=0,inplace=True)
# now plot using pandas built-in plot, with kind='bar' and yerr='std'
avg.plot(yerr=std,kind='bar',legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x3a3a6898>
Out[612]:
Interesting to see how close these polls seem to be, especially considering the undecided factor. Let's take a look at the numbers.
In [613… # Concatenate our Average and Std DataFrames

poll_avg = pd.concat([avg,std],axis=1)
#Rename columns
poll_avg.columns = ['Average','STD']
#Show
poll_avg
Out[613]: Average STD
Obama 46.772496 2.448627
Romney 44.573854 2.927711
Undecided 6.549763 3.702235
Looks like the polls indicate a fairly close race, but what about the undecided voters? Most of them will likely vote for one of the candidates once the election occurs. If we assume the undecided voters split evenly between the two candidates, the observed difference should be an unbiased estimate of the final difference.
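To make the even-split argument concrete, here is a small worked sketch using the averages from the poll_avg table. The 50/50 split is the assumption stated in the text, not something measured from the data.

```python
# Averages taken from the poll_avg table
obama_avg = 46.772496
romney_avg = 44.573854
undecided_avg = 6.549763

# Assumption: undecided voters break 50/50 between the two candidates
half_undecided = undecided_avg / 2

obama_adj = obama_avg + half_undecided
romney_adj = romney_avg + half_undecided

gap_before = obama_avg - romney_avg
gap_after = obama_adj - romney_adj

print("Gap before split: %.2f points" % gap_before)
print("Gap after split:  %.2f points" % gap_after)
```

Both candidates gain the same amount, so the gap is unchanged. Any uneven split would shift the gap by the difference between the two shares.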
In [614… # Take a look at the DataFrame again

poll_df.head()
If we wanted to, we could also do a quick (and messy) time series analysis of the voter sentiment by plotting Obama/Romney favor versus the Poll End Dates. Let's take a look at how we could quickly do that in pandas.
Note: The time is in reverse chronological order. Also keep in mind that there are multiple polls per end date.
In [615… # Quick plot of sentiment in the polls versus time.

poll_df.plot(x='End Date',y=['Obama','Romney','Undecided'],marker='o',linestyle='')
<matplotlib.axes._subplots.AxesSubplot at 0xd2f09c50>
Out[615]:
While this may give you a quick idea, go ahead and try creating a new DataFrame or editing poll_df to make a better visualization of the above idea!
To lead you along the right path for plotting, we'll go ahead and answer another question related to plotting the sentiment versus time. Let's go ahead and plot the difference between Obama and Romney and how it changes as time moves along. Remember that in the last data project we used the datetime module to create timestamps. Let's go ahead and import it now.
In [616… # For timestamps

from datetime import datetime
Now we'll define a new column in our poll_df DataFrame to take into account the difference between Romney and Obama in the polls.
In [617… # Create a new column for the difference between the two candidates
poll_df['Difference'] = (poll_df.Obama - poll_df.Romney)/100
# Preview the new column
poll_df.head()
Great! Keep in mind that the Difference column is Obama minus Romney, thus a positive difference indicates a leaning towards Obama in the polls.
Now let's go ahead and see if we can visualize how this difference in sentiment changes over time. We will start by using groupby to group the polls by their Start Date, which also sorts them by that Start Date.
In [618… # Set as_index=False to keep the 0,1,2,... index. Then we'll take the mean of the polls on that day.
poll_df = poll_df.groupby(['Start Date'],as_index=False).mean()
# Let's go ahead and see what this looks like

poll_df.head()
Out[618]: Start Date Number of Observations Obama Romney Undecided Difference
0 2009-03-13 1403 44 44 12 0.00
1 2009-04-17 686 50 39 11 0.11
2 2009-05-14 1000 53 35 12 0.18
3 2009-06-12 638 48 40 12 0.08
4 2009-07-15 577 49 40 11 0.09
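Here is a small sketch of the same groupby step on a made-up toy DataFrame (the pollsters and numbers are invented). One assumption worth flagging: on newer pandas releases, calling .mean() on a group that still contains text columns may raise an error, so the sketch passes numeric_only=True to average only the numeric columns.

```python
import pandas as pd

# Toy stand-in for poll_df; values are made up for illustration
toy = pd.DataFrame({
    "Pollster": ["A", "B", "C"],
    "Start Date": ["2012-10-02", "2012-10-01", "2012-10-01"],
    "Obama": [48, 50, 46],
    "Romney": [46, 45, 47],
})

# Same definition as the Difference column: Obama minus Romney, as a fraction
toy["Difference"] = (toy["Obama"] - toy["Romney"]) / 100

# One row per Start Date, averaging all polls that started that day
daily = toy.groupby("Start Date", as_index=False).mean(numeric_only=True)
print(daily)
```

The two polls starting on 2012-10-01 collapse into one row, and the text column Pollster is dropped from the result.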
Great! Now plotting the Difference versus time should be straightforward.
In [619… # Plotting the difference in polls between Obama and Romney

fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple')
It would be very interesting to plot marker lines on the dates of the debates and see if there is any general insight to the poll results.
The presidential debates were held on Oct 3rd, Oct 16th, and Oct 22nd (the vice-presidential debate was on Oct 11th). Let's plot some lines as markers and then zoom in on the month of October.
In order to find where to set the x limits for the figure we need to find out where the index for the month of October in 2012 is. Here's a simple for loop to find that row. Note, the string format of the date makes this difficult to do without using a lambda expression or a map.
In [620… # Set row count and xlimit list

row_in = 0
xlimit = []
# Cycle through dates until 2012-10 is found, then print row index
for date in poll_df['Start Date']:
    if date[0:7] == '2012-10':
        xlimit.append(row_in)
        row_in +=1
    else:
        row_in += 1
print min(xlimit)
print max(xlimit)
329
356
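The same lookup can also be written without an explicit row counter. Here is a sketch using a boolean mask on a made-up toy column (the dates are invented; the real poll_df would give 329 and 356):

```python
import pandas as pd

# Toy stand-in for the grouped poll_df: ISO date strings in ascending order
toy = pd.DataFrame({"Start Date": ["2012-09-28", "2012-10-01", "2012-10-15",
                                   "2012-10-30", "2012-11-01"]})

# Boolean mask of rows whose date string starts with '2012-10'
in_october = toy["Start Date"].str.startswith("2012-10")

# Index labels of the matching rows, then the first and last of them
october_rows = toy.index[in_october]
first_row, last_row = october_rows.min(), october_rows.max()
print(first_row, last_row)
```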
Great, now we know where to set our x limits for the month of October in our figure.
In [621… # Start with original figure

fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple',xlim=(329,356))
# Now add the debate markers

plt.axvline(x=329+2, linewidth=4, color='grey')
<matplotlib.lines.Line2D at 0x4025aa58>
Out[621]:
Surprisingly, these polls reflect a dip for Obama after the second debate against Romney, even though memory serves that he performed much worse against Romney during the first debate.
For all these polls it is important to remember how geographical location can affect the value of a poll in predicting the outcomes of a national election.
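The marker position in the plot cell was worked out by hand (329+2). As a hedged sketch, the row position for each debate could instead be looked up from the dates themselves. The start_dates list here is a made-up stand-in for list(poll_df['Start Date']), and the debate dates are the three 2012 presidential debates (October 3, 16 and 22).

```python
from bisect import bisect_left

# Hypothetical stand-in for list(poll_df['Start Date']) after the groupby:
# ISO-formatted strings in ascending order
start_dates = ["2012-10-01", "2012-10-04", "2012-10-15",
               "2012-10-18", "2012-10-23"]

# The three 2012 presidential debate dates
debate_dates = ["2012-10-03", "2012-10-16", "2012-10-22"]

# ISO date strings sort chronologically, so a binary search finds the row
# position of the first poll that started on or after each debate
marker_rows = [bisect_left(start_dates, d) for d in debate_dates]
print(marker_rows)
```

Each position could then be passed to plt.axvline(x=row, linewidth=4, color='grey') in the same way as the single marker in the plot cell.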
Donor Data Set

Let's go ahead and switch gears and take a look at a data set consisting of information on donations to the federal campaign.
This is going to be the biggest data set we've looked at so far. You can download it here, then make sure to save it to the same folder your IPython Notebooks are in.
The questions we will be trying to answer while looking at this data set are:
1.) How much was donated and what was the average donation?
2.) How did the donations differ between candidates?
3.) How did the donations differ between Democrats and Republicans?
4.) What were the demographics of the donors?
5.) Is there a pattern to donation amounts?
In [622… # Set the DataFrame as the csv file

donor_df = pd.read_csv('Election_Donor_Data.csv')
In [623… # Get a quick overview

donor_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id 1001731 non-null object
cand_id 1001731 non-null object
cand_nm 1001731 non-null object
contbr_nm 1001731 non-null object
contbr_city 1001712 non-null object
contbr_st 1001727 non-null object
contbr_zip 1001620 non-null object
contbr_employer 988002 non-null object
contbr_occupation 993301 non-null object
contb_receipt_amt 1001731 non-null float64
contb_receipt_dt 1001731 non-null object
receipt_desc 14166 non-null object
memo_cd 92482 non-null object
memo_text 97770 non-null object
form_tp 1001731 non-null object
file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 129.9+ MB
In [624… # let's also just take a glimpse

donor_df.head()
Out[624]:
   cmte_id    cand_id    cand_nm             contbr_nm           contbr_city         contbr_st  contbr_zip    contbr_employer        contbr_occupation      contb_receipt_amt  contb_receipt_dt  receipt_desc  memo_cd  memo_text  form_tp  file_num
0  C00410118  P20002978  Bachmann, Michelle  HARVEY, WILLIAM     MOBILE              AL         3.660103e+08  RETIRED                RETIRED                250                20-JUN-11         NaN           NaN      NaN        SA17A    736166
1  ...        ...        Bachmann, Michelle  HARVEY, WILLIAM     ...                 ...        ...           ...                    ...                    ...                ...               ...           ...      ...        ...      ...
2  C00410118  P20002978  Bachmann, Michelle  SMITH, LANIER       LANETT              AL         3.686334e+08  INFORMATION REQUESTED  INFORMATION REQUESTED  250                05-JUL-11         NaN           NaN      NaN        SA17A    749073
3  C00410118  P20002978  Bachmann, Michelle  BLEVINS, DARONDA    PIGGOTT             AR         7.245483e+08  NONE                   RETIRED                250                01-AUG-11         NaN           NaN      NaN        SA17A    749073
4  C00410118  P20002978  Bachmann, Michelle  WARDENBURG, HAROLD  HOT SPRINGS NATION  AR         7.190165e+08  NONE                   RETIRED                300                20-JUN-11         NaN           NaN      NaN        SA17A    736166
What might be interesting to do is get a quick glimpse of the donation amounts, and the average donation amount. Let's go ahead and break down the data.
In [625… # Get a quick look at the various donation amounts

donor_df['contb_receipt_amt'].value_counts()
Out[625]:
100.0 178188
50.0 137584
25.0 110345
250.0 91182
500.0 57984
2500.0 49005
35.0 37237
1000.0 36494
10.0 33986
200.0 27813
20.0 17565
15.0 16163
150.0 14600
75.0 13647
201.2 11718
...
0.88 1
19.35 1
58.18 1
71.20 1
70.68 1
163.90 1
14.97 1
264.39 1
162.60 1
81.15 1
45.15 1
106.56 1
62.20 1
58.82 1
73.83 1
Length: 8079, dtype: int64
8079 different amounts! That's quite a variation. Let's look at the average and the std.
In [626… # Get the mean donation

don_mean = donor_df['contb_receipt_amt'].mean()
# Get the std of the donation

don_std = donor_df['contb_receipt_amt'].std()
print 'The average donation was %.2f with a std of %.2f' %(don_mean,don_std)
The average donation was 298.24 with a std of 3749.67
Wow! That's a huge standard deviation! Let's see if there are any large donations or other factors messing with the distribution of the donations.
In [627… # Let's make a Series from the DataFrame, use .copy() to avoid view errors
top_donor = donor_df['contb_receipt_amt'].copy()
# Now sort it
top_donor.sort()
# Then check the Series

top_donor
Out[627]:
114604 -30800.00
226986 -25800.00
101356 -7500.00
398429 -5500.00
250737 -5455.00
33821 -5414.31
908565 -5115.00
456649 -5000.00
574657 -5000.00
30513 -5000.00
562267 -5000.00
30584 -5000.00
86268 -5000.00
708920 -5000.00
665887 -5000.00
...
90076 10000.00
709859 10000.00
41888 10000.00
65131 12700.00
834301 25000.00
823345 25000.00
217891 25800.00
114754 33300.00
257270 451726.00
335187 512710.91
319478 526246.17
344419 1511192.17
344539 1679114.65
326651 1944042.43
325136 2014490.51
Name: contb_receipt_amt, Length: 1001731, dtype: float64
Looks like we have some negative values, as well as some huge donation amounts! The negative values are due to the FEC recording refunds as well as donations. Let's go ahead and only look at the positive contribution amounts.
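Here is a small sketch of the refund filter on a made-up toy Series (the amounts are invented). It also uses sort_values, which on newer pandas releases replaces the in-place Series.sort() call used in this notebook, and prints the median next to the mean to show how much a single large value pulls the mean upward.

```python
import pandas as pd

# Toy stand-in for donor_df['contb_receipt_amt']; amounts are made up
toy_amounts = pd.Series([250.0, -5000.0, 100.0, 2500.0, -25.0, 100.0])

# Keep only positive amounts (refunds are recorded as negatives)
donations_only = toy_amounts[toy_amounts > 0]

# sort_values returns a new sorted Series instead of sorting in place
donations_sorted = donations_only.sort_values()

# The median is less sensitive than the mean to a few very large values
print(donations_only.mean(), donations_only.median())
```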
In [628… # Get rid of the negative values

top_donor = top_donor[top_donor >0]
# Sort the Series

top_donor.sort()
# Look at the top 10 most common donations value counts

top_donor.value_counts().head(10)
Out[628]:
100 178188
50 137584
25 110345
250 91182
500 57984
2500 49005
35 37237
1000 36494
10 33986
200 27813
dtype: int64
Here we can see that the top 10 most common donations ranged from 10 to 2500 dollars.
A quick question we could check is whether donations are usually made in round number amounts (e.g. 10, 20, 50, 100, 500, etc.). We can quickly visualize this by making a histogram and checking for peaks at those values. Let's go ahead and do this for the most common amounts, up to 2500 dollars.
In [629… # Create a Series of the common donations limited to 2500

com_don = top_donor[top_donor < 2500]
# Set a high number of bins to account for the non-round donations and check histogram for spikes.
com_don.hist(bins=100)
<matplotlib.axes._subplots.AxesSubplot at 0xdf456d68>
Out[629]:
Looks like our intuition was right, since we see spikes at the round numbers.
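The histogram is a visual check. A rough numeric check is also possible. This sketch uses a made-up toy Series and assumes that "round" means a multiple of 5 dollars, which is just one possible definition.

```python
import pandas as pd

# Toy stand-in for com_don; amounts are made up for illustration
toy_don = pd.Series([100.0, 50.0, 25.0, 201.2, 35.0, 19.35, 250.0, 10.0])

# Assumption: "round" means a multiple of 5 dollars
is_round = (toy_don % 5 == 0)

# The mean of a boolean Series is the share of True values
round_share = is_round.mean()
print("Share of round donations: %.2f" % round_share)
```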
Let's dive deeper into the data and see if we can separate donations by party. In order to do this we'll have to figure out a way of creating a new 'Party' column. We can do this by starting with the candidates and their affiliation. Now let's go ahead and get a list of candidates.
In [630… # Grab the unique object from the candidate column

candidates = donor_df.cand_nm.unique()
#Show
candidates
Out[630]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
'Perry, Rick'], dtype=object)
Let's go ahead and separate Obama from the Republican candidates by adding a Party Affiliation column. We can do this by using map along with a dictionary of party affiliations. Lecture 36 has a review of this topic.
In [631… # Dictionary of party affiliation

party_map = {'Bachmann, Michelle': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'Huntsman, Jon': 'Republican',
'Johnson, Gary Earl': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Obama, Barack': 'Democrat',
'Paul, Ron': 'Republican',
'Pawlenty, Timothy': 'Republican',
'Perry, Rick': 'Republican',
"Roemer, Charles E. 'Buddy' III": 'Republican',
'Romney, Mitt': 'Republican',
'Santorum, Rick': 'Republican'}
# Now map the party with candidate

donor_df['Party'] = donor_df.cand_nm.map(party_map)
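One property of map is worth checking after a step like this: any name that is missing from the dictionary comes back as NaN rather than raising an error. Here is a small sketch on made-up toy data (toy_names and toy_party_map are illustrative names, and the last entry is deliberately misspelled without a space):

```python
import pandas as pd

# Toy candidate names; the last one is deliberately misspelled (no space)
toy_names = pd.Series(["Obama, Barack", "Romney, Mitt", "Paul, Ron",
                       "Obama,Barack"])

toy_party_map = {"Obama, Barack": "Democrat",
                 "Romney, Mitt": "Republican",
                 "Paul, Ron": "Republican"}

# Vectorized lookup: each name is replaced by its dictionary value
toy_party = toy_names.map(toy_party_map)

# Names that were not found in the dictionary show up as NaN
unmapped = toy_names[toy_party.isnull()]
print(unmapped)
```

An empty unmapped Series is a quick sign that every candidate name matched a dictionary key exactly.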
A quick note, we could have done this same operation manually using a for loop, however this operation would be much slower than using the map method.
In [632… '''
for i in xrange(0,len(donor_df)):
if donor_df['cand_nm'][i] == 'Obama,Barack':
donor_df['Party'][i] = 'Democrat'
else:
donor_df['Party'][i] = 'Republican'
'''
"\nfor i in xrange(0,len(donor_df)):\n if donor_df['cand_nm'][i] == 'Obama,Barack':\n donor_df['Party'][i] = 'Democrat'\n else:\n donor_df['Party'][i] =

Out[632]:
'Republican'\n"
Let's look at our DataFrame and also make sure we clear refunds from the contribution amounts.
In [633… # Clear refunds

donor_df = donor_df[donor_df.contb_receipt_amt >0]
# Preview DataFrame
donor_df.head()
Out[633]:
   cmte_id    cand_id    cand_nm             contbr_nm           contbr_city         contbr_st  contbr_zip    contbr_employer        contbr_occupation      contb_receipt_amt  contb_receipt_dt  receipt_desc  memo_cd  memo_text  form_tp  file_num
0  ...        ...        Bachmann, Michelle  HARVEY, WILLIAM     ...                 ...        ...           ...                    ...                    ...                ...               ...           ...      ...        ...      ...
1  ...        ...        Bachmann, Michelle  HARVEY, WILLIAM     ...                 ...        ...           ...                    ...                    ...                ...               ...           ...      ...        ...      ...
2  C00410118  P20002978  Bachmann, Michelle  SMITH, LANIER       LANETT              AL         3.686334e+08  INFORMATION REQUESTED  INFORMATION REQUESTED  250                05-JUL-11         NaN           NaN      NaN        SA17A    749073
3  C00410118  P20002978  Bachmann, Michelle  BLEVINS, DARONDA    PIGGOTT             AR         7.245483e+08  NONE                   RETIRED                250                01-AUG-11         NaN           NaN      NaN        SA17A    749073
4  C00410118  P20002978  Bachmann, Michelle  WARDENBURG, HAROLD  HOT SPRINGS NATION  AR         7.190165e+08  NONE                   RETIRED                300                20-JUN-11         NaN           NaN      NaN        SA17A    736166
Let's start by aggregating the data by candidate. We'll take a quick look at the total amounts received by each candidate. First we will look at the total number of donations and then at the total amount.
In [634… # Groupby candidate and then display the total number of donations
donor_df.groupby('cand_nm')['contb_receipt_amt'].count()
Out[634]:
cand_nm
Bachmann, Michelle 13082
Cain, Herman 20052
Gingrich, Newt 46883
Huntsman, Jon 4066
Johnson, Gary Earl 1234
McCotter, Thaddeus G 73
Obama, Barack 589127
Paul, Ron 143161
Pawlenty, Timothy 3844
Perry, Rick 12709
Roemer, Charles E. 'Buddy' III 5844
Romney, Mitt 105155
Santorum, Rick 46245
Name: contb_receipt_amt, dtype: int64
Clearly Obama is the front-runner in number of donations, which makes sense, since he is not competing with any other Democratic nominees. Let's take a look at the total dollar amounts.
In [635… # Groupby candidate and then display the total amount donated
donor_df.groupby('cand_nm')['contb_receipt_amt'].sum()
Out[635]:
cand_nm
Bachmann, Michelle 2.711439e+06
Cain, Herman 7.101082e+06
Gingrich, Newt 1.283277e+07
Huntsman, Jon 3.330373e+06
Johnson, Gary Earl 5.669616e+05
McCotter, Thaddeus G 3.903000e+04
Obama, Barack 1.358774e+08
Paul, Ron 2.100962e+07
Pawlenty, Timothy 6.004819e+06
Perry, Rick 2.030575e+07
Roemer, Charles E. 'Buddy' III 3.730099e+05
Romney, Mitt 8.833591e+07
Santorum, Rick 1.104316e+07
Name: contb_receipt_amt, dtype: float64
This isn't super readable, and an important aspect of data science is to clearly present information. Let's go ahead and just print out these values in a clean for loop.
In [636… # Start by setting the groupby as an object

cand_amount = donor_df.groupby('cand_nm')['contb_receipt_amt'].sum()
# Our index tracker

i = 0
for don in cand_amount:
    print " The candidate %s raised %.0f dollars " %(cand_amount.index[i],don)
    print '\n'
    i += 1
The candidate Bachmann, Michelle raised 2711439 dollars
The candidate Cain, Herman raised 7101082 dollars
The candidate Gingrich, Newt raised 12832770 dollars
The candidate Huntsman, Jon raised 3330373 dollars
The candidate Johnson, Gary Earl raised 566962 dollars
The candidate McCotter, Thaddeus G raised 39030 dollars
The candidate Obama, Barack raised 135877427 dollars
The candidate Paul, Ron raised 21009620 dollars
The candidate Pawlenty, Timothy raised 6004819 dollars
The candidate Perry, Rick raised 20305754 dollars
The candidate Roemer, Charles E. 'Buddy' III raised 373010 dollars
The candidate Romney, Mitt raised 88335908 dollars
The candidate Santorum, Rick raised 11043159 dollars
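The index tracker in the loop can be avoided by iterating over name and total pairs directly. Here is a Python 3 sketch on a two-entry toy Series, with totals copied from the rounded output of the loop:

```python
import pandas as pd

# Toy stand-in for cand_amount, using two of the rounded totals printed by the loop
toy_amount = pd.Series({"Obama, Barack": 135877427.0,
                        "Romney, Mitt": 88335908.0})

# .items() yields (index label, value) pairs, so no manual counter is needed
lines = []
for name, total in toy_amount.items():
    lines.append("The candidate %s raised %.0f dollars" % (name, total))

print("\n".join(lines))
```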
This is okay, but it's hard to do a quick comparison just by reading this information. How about just a quick graphic presentation?
In [637… # Plot out total donation amounts

cand_amount.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xacd06278>
Out[637]:
Now the comparison is very easy to see. As we saw before, clearly Obama is the front-runner in donation amounts, which makes sense, since he is not competing with any other Democratic nominees. How about we just compare Democrat versus Republican donations?
In [638… # Groupby party and then sum donations

donor_df.groupby('Party')['contb_receipt_amt'].sum().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x3a4a8358>
Out[638]:
Looks like Obama couldn't compete against all the Republicans, but he certainly has the advantage of their funding being splintered across multiple candidates.
Finally, to start closing out the project, let's look at donations and who they came from (as far as occupation is concerned). We will start by grabbing the occupation information from the donor_df DataFrame and then using pivot_table to make the index defined by the various occupations, with the columns defined by the Party (Republican or Democrat). Finally, we'll also pass an aggregation function to the pivot table. In this case a simple sum function will add up all the contributions by anyone with the same profession.
In [639… # Use a pivot table to extract and organize the data by the donor occupation
occupation_df = donor_df.pivot_table('contb_receipt_amt',
index='contbr_occupation',
columns='Party', aggfunc='sum')
In [640… # Let's go ahead and check out the DataFrame

occupation_df.head()
Out[640]: Party Democrat Republican
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100 NaN
AREA VICE PRESIDENT 250 NaN
RESEARCH ASSOCIATE 100 NaN
TEACHER 500 NaN
THERAPIST 3900 NaN
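To see how the pivot table produces those NaN entries, here is a small sketch on a made-up toy DataFrame (occupations and amounts are invented). An occupation with no donations to one party gets NaN in that party's column.

```python
import pandas as pd

# Toy stand-in for donor_df; rows are made up for illustration
toy_donor = pd.DataFrame({
    "contbr_occupation": ["TEACHER", "TEACHER", "CEO", "CEO", "RETIRED"],
    "Party": ["Democrat", "Democrat", "Republican", "Democrat", "Republican"],
    "contb_receipt_amt": [100.0, 400.0, 2500.0, 500.0, 250.0],
})

# Rows: occupations, columns: parties, cells: summed contributions
toy_occupation = toy_donor.pivot_table(values="contb_receipt_amt",
                                       index="contbr_occupation",
                                       columns="Party",
                                       aggfunc="sum")
print(toy_occupation)
```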
Great! Now let's see how big the DataFrame is.
In [641… # Check size

occupation_df.shape
Out[641]: (45067, 2)
Wow! This is probably far too large to display effectively with a small, static visualization. What we should do is have a cut-off for total contribution amounts. After all, small donations of 20 dollars by one type of occupation won't give us too much insight. So let's set our cut-off at 1 million dollars.
In [642… # Set a cut-off point at 1 million dollars of total contributions

occupation_df = occupation_df[occupation_df.sum(1) > 1000000]
In [643… # Now let's check the size!

occupation_df.shape
Out[643]: (31, 2)
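Here is a sketch of how the cut-off works, on a made-up toy table with a much smaller threshold. The row sum skips NaN entries by default, so an occupation that donated to only one party is still counted by its single value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for occupation_df; values are made up for illustration
toy_occ = pd.DataFrame(
    {"Democrat": [500.0, np.nan, 500.0],
     "Republican": [2500.0, 250.0, np.nan]},
    index=["CEO", "RETIRED", "TEACHER"],
)

# Total per occupation across both parties (NaN is skipped by default)
row_totals = toy_occ.sum(axis=1)

# Keep only occupations above the (toy) cut-off
large = toy_occ[row_totals > 400]
print(large)
```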
Great! This looks much more manageable! Now let's visualize it.
In [644… # plot out with pandas

occupation_df.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xde7771d0>
Out[644]:
This is a bit hard to read, so let's use kind = 'barh' (horizontal) to set the occupation on the correct axis.
In [645… # Horizontal plot, use a conveniently colored cmap

occupation_df.plot(kind='barh',figsize=(10,12),cmap='seismic')
<matplotlib.axes._subplots.AxesSubplot at 0x466637f0>
Out[645]:
Looks like there are some occupations that are either mislabeled or aren't really occupations. Let's get rid of: Information Requested occupations and let's combine CEO and C.E.O.
In [646… # Drop the unavailable occupations

occupation_df.drop(['INFORMATION REQUESTED PER BEST EFFORTS','INFORMATION REQUESTED'],axis=0,inplace=True)
Now let's combine the CEO and C.E.O rows.
In [647… # Set new ceo row as sum of the current two

occupation_df.loc['CEO'] = occupation_df.loc['CEO'] + occupation_df.loc['C.E.O.']
# Drop C.E.O.
occupation_df.drop('C.E.O.',inplace=True)
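One caveat with adding two rows using +: if either row has a NaN in a column, the sum for that column is NaN as well. Here is a hedged sketch on a made-up toy table that uses Series.add with fill_value=0 so a missing value is treated as zero. This is an alternative to the cell above, not what the notebook itself runs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for occupation_df; values are made up for illustration
toy = pd.DataFrame(
    {"Democrat": [1000.0, np.nan], "Republican": [2000.0, 500.0]},
    index=["CEO", "C.E.O."],
)

# Add the two rows, treating a missing value in either row as 0
combined = toy.loc["CEO"].add(toy.loc["C.E.O."], fill_value=0)

# Store the combined totals under 'CEO' and drop the duplicate label
toy.loc["CEO"] = combined
toy = toy.drop("C.E.O.")
print(toy)
```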
Now let's repeat the same plot!
In [648… # Repeat previous plot!

occupation_df.plot(kind='barh',figsize=(10,12),cmap='seismic')
<matplotlib.axes._subplots.AxesSubplot at 0xd6233358>
Out[648]:
Awesome! Looks like CEOs are a little more conservative leaning. This may be due to the tax philosophies of each party during the election.
Great Job!
There's still so much to discover in these rich datasets! Come up with your own political questions you want answered! Or just play around with different methods of visualizing the data!
For more on general data analysis of politics, I highly suggest the 538 website!
Again, great job on getting through the course this far! Go ahead and search the web for more data to discover!