Download as pdf or txt
Download as pdf or txt
You are on page 1of 1

Election Data Project - Polls and Donors

In this Data Project we will be looking at data from the 2012 election.

In this project we will analyze two datasets. The first data set will be the results of political polls. We will analyze this aggregated poll data and answer some questions:

1.) Who was being polled and what was their party affiliation?
2.) Did the poll results favor Romney or Obama?
3.) How do undecided voters effect the poll?
4.) Can we account for the undecided voters?
5.) How did voter sentiment change over time?
6.) Can we see an effect in the polls from the debates?

We'll discuss the second data set later on!

Let's go ahead and start with our standard imports:

In [604… # For data


import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

from __future__ import division

The data for the polls will be obtained from HuffPost Pollster. You can check their website here. There are some pretty awesome politcal data stes to play with there so I encourage you to go and mess around with it
yourself after completing this project.

We're going to use the requests module to import some data from the web. For more information on requests, check out the documentation here.

We will also be using StringIO to work with csv data we get from HuffPost. StringIO provides a convenient means of working with text in memory using the file API, find out more about it here

In [605… # Use to grab data from the web(HTTP capabilities)


import requests

# We'll also use StringIO to work with the csv file, the DataFrame will require a .read() method
from StringIO import StringIO

In [606… # This is the url link for the poll data in csv form
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"

# Use requests to get the information in text form


source = requests.get(url).text

# Use StringIO to avoid an IO error with pandas


poll_data = StringIO(source)

Now that we have our data, we can set it as a DataFrame.

In [607… # Set poll data as pandas DataFrame


poll_df = pd.read_csv(poll_data)

# Let's get a glimpse at the data


poll_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 589 entries, 0 to 588
Data columns (total 14 columns):
Pollster 589 non-null object
Start Date 589 non-null object
End Date 589 non-null object
Entry Date/Time (ET) 589 non-null object
Number of Observations 567 non-null float64
Population 589 non-null object
Mode 589 non-null object
Obama 589 non-null int64
Romney 589 non-null int64
Undecided 422 non-null float64
Pollster URL 589 non-null object
Source URL 587 non-null object
Partisan 589 non-null object
Affiliation 589 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 69.0+ KB

Great! Now let's get a quick look with .head()

In [608… # Preview DataFrame


poll_df.head()

Out[608]: Entry
Start End Number of
Pollster Date/Time Population Mode Obama Romney Undecided Pollster URL Source URL Partisan A
Date Date Observations
(ET)

2012-11-
06 2000-
2012- 2012- Likely Live
0 Politico/GWU/Battleground 01-01 1000 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.politico.com/news/stories/1112/8338... Nonpartisan
11-04 11-05 Voters Phone
08:40:26
UTC

2012-11-
05 2000-
2012- 2012- Likely Live
1 UPI/CVOTER 01-01 3000 49 48 NaN http://elections.huffingtonpost.com/pollster/p... NaN Nonpartisan
11-03 11-05 Voters Phone
18:30:15
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
2 Gravis Marketing 01-01 872 48 48 4 http://elections.huffingtonpost.com/pollster/p... http://www.gravispolls.com/2012/11/gravis-mark... Nonpartisan
11-03 11-05 Voters Phone
09:22:02
UTC

2012-11-
06 2000-
2012- 2012- Likely
3 JZ Analytics/Newsmax 01-01 1041 Internet 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.jzanalytics.com/ Sponsor
11-03 11-05 Voters
07:38:41
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
4 Rasmussen 01-01 1500 48 49 NaN http://elections.huffingtonpost.com/pollster/p... http://www.rasmussenreports.com/public_content... Nonpartisan
11-03 11-05 Voters Phone
08:47:50
UTC

Let's go ahead and get a quick visualization overview of the affiliation for the polls.

In [609… # Factorplot the affiliation


sns.factorplot('Affiliation',data=poll_df)

<seaborn.axisgrid.FacetGrid at 0xb9bd37b8>
Out[609]:

Looks like we are overall relatively neutral, but still leaning towards Democratic Affiliation, it will be good to keep this in mind. Let's see if sorting by the Population hue gives us any further insight into the data.

In [610… # Factorplot the affiliation by Population


sns.factorplot('Affiliation',data=poll_df,hue='Population')

<seaborn.axisgrid.FacetGrid at 0xcbc90fd0>
Out[610]:

Looks like we have a strong showing of likely voters and Registered Voters, so the poll data should hopefully be a good reflection on the populations polled. Let's take another quick overview of the DataFrame.

In [611… # Let's look at the DataFrame again


poll_df.head()

Out[611]: Entry
Start End Number of
Pollster Date/Time Population Mode Obama Romney Undecided Pollster URL Source URL Partisan A
Date Date Observations
(ET)

2012-11-
06 2000-
2012- 2012- Likely Live
0 Politico/GWU/Battleground 01-01 1000 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.politico.com/news/stories/1112/8338... Nonpartisan
11-04 11-05 Voters Phone
08:40:26
UTC

2012-11-
05 2000-
2012- 2012- Likely Live
1 UPI/CVOTER 01-01 3000 49 48 NaN http://elections.huffingtonpost.com/pollster/p... NaN Nonpartisan
11-03 11-05 Voters Phone
18:30:15
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
2 Gravis Marketing 01-01 872 48 48 4 http://elections.huffingtonpost.com/pollster/p... http://www.gravispolls.com/2012/11/gravis-mark... Nonpartisan
11-03 11-05 Voters Phone
09:22:02
UTC

2012-11-
06 2000-
2012- 2012- Likely
3 JZ Analytics/Newsmax 01-01 1041 Internet 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.jzanalytics.com/ Sponsor
11-03 11-05 Voters
07:38:41
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
4 Rasmussen 01-01 1500 48 49 NaN http://elections.huffingtonpost.com/pollster/p... http://www.rasmussenreports.com/public_content... Nonpartisan
11-03 11-05 Voters Phone
08:47:50
UTC

Let's go ahead and take a look at the averages for Obama, Romney , and the polled people who remained undecided.

In [612… # First we'll get the average


avg = pd.DataFrame(poll_df.mean())
avg.drop('Number of Observations',axis=0,inplace=True)

# After that let's get the error


std = pd.DataFrame(poll_df.std())
std.drop('Number of Observations',axis=0,inplace=True)

# now plot using pandas built-in plot, with kind='bar' and yerr='std'
avg.plot(yerr=std,kind='bar',legend=False)

<matplotlib.axes._subplots.AxesSubplot at 0x3a3a6898>
Out[612]:

Interesting to see how close these polls seem to be, especially considering the undecided factor. Let's take a look at the numbers.

In [613… # Concatenate our Average and Std DataFrames


poll_avg = pd.concat([avg,std],axis=1)

#Rename columns
poll_avg.columns = ['Average','STD']

#Show
poll_avg

Out[613]: Average STD

Obama 46.772496 2.448627

Romney 44.573854 2.927711

Undecided 6.549763 3.702235

Looks like the polls indicate it as a fairly close race, but what about the undecided voters? Most of them will likely vote for one of the candidates once the election occurs. If we assume we split the undecided evenly
between the two candidates the observed difference should be an unbiased estimate of the final difference.

In [614… # Take a look at the DataFrame again


poll_df.head()

Out[614]: Entry
Start End Number of
Pollster Date/Time Population Mode Obama Romney Undecided Pollster URL Source URL Partisan A
Date Date Observations
(ET)

2012-11-
06 2000-
2012- 2012- Likely Live
0 Politico/GWU/Battleground 01-01 1000 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.politico.com/news/stories/1112/8338... Nonpartisan
11-04 11-05 Voters Phone
08:40:26
UTC

2012-11-
05 2000-
2012- 2012- Likely Live
1 UPI/CVOTER 01-01 3000 49 48 NaN http://elections.huffingtonpost.com/pollster/p... NaN Nonpartisan
11-03 11-05 Voters Phone
18:30:15
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
2 Gravis Marketing 01-01 872 48 48 4 http://elections.huffingtonpost.com/pollster/p... http://www.gravispolls.com/2012/11/gravis-mark... Nonpartisan
11-03 11-05 Voters Phone
09:22:02
UTC

2012-11-
06 2000-
2012- 2012- Likely
3 JZ Analytics/Newsmax 01-01 1041 Internet 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.jzanalytics.com/ Sponsor
11-03 11-05 Voters
07:38:41
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
4 Rasmussen 01-01 1500 48 49 NaN http://elections.huffingtonpost.com/pollster/p... http://www.rasmussenreports.com/public_content... Nonpartisan
11-03 11-05 Voters Phone
08:47:50
UTC

If we wanted to, we could also do a quick (and messy) time series analysis of the voter sentiment by plotting Obama/Romney favor versus the Poll End Dates. Let's take a look at how we could quickly do tht in pandas.

Note: The time is in reverse chronological order. Also keep in mind the multiple polls per end date.

In [615… # Quick plot of sentiment in the polls versus time.


poll_df.plot(x='End Date',y=['Obama','Romney','Undecided'],marker='o',linestyle='')

<matplotlib.axes._subplots.AxesSubplot at 0xd2f09c50>
Out[615]:

While this may give you a quick idea, go ahead and try creating a new DataFrame or editing poll_df to make a better visualization of the above idea!

To lead you along the right path for plotting, we'll go ahead and answer another question related to plotting the sentiment versus time. Let's go ahead and plot out the difference between Obama and Romney and how it
changes as time moves along. Remember from the last data project we used the datetime module to create timestamps, let's go ahead and use it now.

In [616… # For timestamps


from datetime import datetime

Now we'll define a new column in our poll_df DataFrame to take into account the difference between Romney and Obama in the polls.

In [617… # Create a new column for the difference between the two candidates
poll_df['Difference'] = (poll_df.Obama - poll_df.Romney)/100
# Preview the new column
poll_df.head()

Out[617]: Entry
Start End Number of
Pollster Date/Time Population Mode Obama Romney Undecided Pollster URL Source URL Partisan A
Date Date Observations
(ET)

2012-11-
06 2000-
2012- 2012- Likely Live
0 Politico/GWU/Battleground 01-01 1000 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.politico.com/news/stories/1112/8338... Nonpartisan
11-04 11-05 Voters Phone
08:40:26
UTC

2012-11-
05 2000-
2012- 2012- Likely Live
1 UPI/CVOTER 01-01 3000 49 48 NaN http://elections.huffingtonpost.com/pollster/p... NaN Nonpartisan
11-03 11-05 Voters Phone
18:30:15
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
2 Gravis Marketing 01-01 872 48 48 4 http://elections.huffingtonpost.com/pollster/p... http://www.gravispolls.com/2012/11/gravis-mark... Nonpartisan
11-03 11-05 Voters Phone
09:22:02
UTC

2012-11-
06 2000-
2012- 2012- Likely
3 JZ Analytics/Newsmax 01-01 1041 Internet 47 47 6 http://elections.huffingtonpost.com/pollster/p... http://www.jzanalytics.com/ Sponsor
11-03 11-05 Voters
07:38:41
UTC

2012-11-
06 2000-
2012- 2012- Likely Automated
4 Rasmussen 01-01 1500 48 49 NaN http://elections.huffingtonpost.com/pollster/p... http://www.rasmussenreports.com/public_content... Nonpartisan
11-03 11-05 Voters Phone
08:47:50
UTC

Great! Keep in mind that the Difference column is Obama minus Romney, thus a positive difference indicates a leaning towards Obama in the polls.

Now let's go ahead and see if we can visualize how this sentiment in difference changes over time. We will start by using groupby to group the polls by their start data and then sorting it by that Start Date.

In [618… # Set as_index=Flase to keep the 0,1,2,... index. Then we'll take the mean of the polls on that day.
poll_df = poll_df.groupby(['Start Date'],as_index=False).mean()

# Let's go ahead and see what this looks like


poll_df.head()

Out[618]: Start Date Number of Observations Obama Romney Undecided Difference

0 2009-03-13 1403 44 44 12 0.00

1 2009-04-17 686 50 39 11 0.11

2 2009-05-14 1000 53 35 12 0.18

3 2009-06-12 638 48 40 12 0.08

4 2009-07-15 577 49 40 11 0.09

Great! Now plotting the Differencce versus time should be straight forward.

In [619… # Plotting the difference in polls between Obama and Romney


fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple')

It would be very interesting to plot marker lines on the dates of the debates and see if there is any general insight to the poll results.

The debate dates were Oct 3rd, Oct 11, and Oct 22nd. Let's plot some lines as markers and then zoom in on the month of October.

In order to find where to set the x limits for the figure we need to find out where the index for the month of October in 2012 is. Here's a simple for loop to find that row. Note, the string format of the date makes this
difficult to do without using a lambda expression or a map.

In [620… # Set row count and xlimit list


row_in = 0
xlimit = []

# Cycle through dates until 2012-10 is found, then print row index
for date in poll_df['Start Date']:
if date[0:7] == '2012-10':
xlimit.append(row_in)
row_in +=1
else:
row_in += 1

print min(xlimit)
print max(xlimit)

329
356

Great now we know where to set our x limits for the month of October in our figure.

In [621… # Start with original figure


fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple',xlim=(329,356))

# Now add the debate markers


plt.axvline(x=329+2, linewidth=4, color='grey')
plt.axvline(x=329+10, linewidth=4, color='grey')
plt.axvline(x=329+21, linewidth=4, color='grey')

<matplotlib.lines.Line2D at 0x4025aa58>
Out[621]:

Surprisingly, thse polls reflect a dip for Obama after the second debate against Romney, even though memory serves that he performed much worse against Romney during the first debate.

For all these polls it is important to remeber how geographical location can effect the value of a poll in predicting the outcomes of a national election.

Donor Data Set


Let's go ahead and switch gears and take a look at a data set consisting of information on donations to the federal campaign.

This is going to be the biggest data set we've looked at so far. You can download it here , then make sure to save it to the same folder your iPython Notebooks are in.

The questions we will be trying to answer while looking at this Data Set is:

1.) How much was donated and what was the average donation?
2.) How did the donations differ between candidates?
3.) How did the donations differ between Democrats and Republicans?
4.) What were the demographics of the donors?
5.) Is there a pattern to donation amounts?

In [622… # Set the DataFrame as the csv file


donor_df = pd.read_csv('Election_Donor_Data.csv')

In [623… # Get a quick overview


donor_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id 1001731 non-null object
cand_id 1001731 non-null object
cand_nm 1001731 non-null object
contbr_nm 1001731 non-null object
contbr_city 1001712 non-null object
contbr_st 1001727 non-null object
contbr_zip 1001620 non-null object
contbr_employer 988002 non-null object
contbr_occupation 993301 non-null object
contb_receipt_amt 1001731 non-null float64
contb_receipt_dt 1001731 non-null object
receipt_desc 14166 non-null object
memo_cd 92482 non-null object
memo_text 97770 non-null object
form_tp 1001731 non-null object
file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 129.9+ MB

In [624… # let's also just take a glimpse


donor_df.head()

Out[624]: cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num

Bachmann, HARVEY,
0 C00410118 P20002978 MOBILE AL 3.660103e+08 RETIRED RETIRED 250 20-JUN-11 NaN NaN NaN SA17A 736166
Michelle WILLIAM

Bachmann, HARVEY,
1 C00410118 P20002978 MOBILE AL 3.660103e+08 RETIRED RETIRED 50 23-JUN-11 NaN NaN NaN SA17A 736166
Michelle WILLIAM

Bachmann, INFORMATION INFORMATION


2 C00410118 P20002978 SMITH, LANIER LANETT AL 3.686334e+08 250 05-JUL-11 NaN NaN NaN SA17A 749073
Michelle REQUESTED REQUESTED

Bachmann, BLEVINS,
3 C00410118 P20002978 PIGGOTT AR 7.245483e+08 NONE RETIRED 250 01-AUG-11 NaN NaN NaN SA17A 749073
Michelle DARONDA

HOT
Bachmann, WARDENBURG,
4 C00410118 P20002978 SPRINGS AR 7.190165e+08 NONE RETIRED 300 20-JUN-11 NaN NaN NaN SA17A 736166
Michelle HAROLD
NATION

What might be interesting to do is get a quick glimpse of the donation amounts, and the average donation amount. Let's go ahead and break down the data.

In [625… # Get a quick look at the various donation amounts


donor_df['contb_receipt_amt'].value_counts()

100.0 178188
Out[625]:
50.0 137584
25.0 110345
250.0 91182
500.0 57984
2500.0 49005
35.0 37237
1000.0 36494
10.0 33986
200.0 27813
20.0 17565
15.0 16163
150.0 14600
75.0 13647
201.2 11718
...
0.88 1
19.35 1
58.18 1
71.20 1
70.68 1
163.90 1
14.97 1
264.39 1
162.60 1
81.15 1
45.15 1
106.56 1
62.20 1
58.82 1
73.83 1
Length: 8079, dtype: int64

8079 different amounts! Thats quite a variation. Let's look at the average and the std.

In [626… # Get the mean donation


don_mean = donor_df['contb_receipt_amt'].mean()

# Get the std of the donation


don_std = donor_df['contb_receipt_amt'].std()

print 'The average donation was %.2f with a std of %.2f' %(don_mean,don_std)

The average donation was 298.24 with a std of 3749.67

Wow! That's a huge standard deviation! Let's see if there are any large donations or other factors messing with the distribution of the donations.

In [627… # Let's make a Series from the DataFrame, use .copy() to avoid view errors
top_donor = donor_df['contb_receipt_amt'].copy()

# Now sort it
top_donor.sort()

# Then check the Series


top_donor

114604 -30800.00
Out[627]:
226986 -25800.00
101356 -7500.00
398429 -5500.00
250737 -5455.00
33821 -5414.31
908565 -5115.00
456649 -5000.00
574657 -5000.00
30513 -5000.00
562267 -5000.00
30584 -5000.00
86268 -5000.00
708920 -5000.00
665887 -5000.00
...
90076 10000.00
709859 10000.00
41888 10000.00
65131 12700.00
834301 25000.00
823345 25000.00
217891 25800.00
114754 33300.00
257270 451726.00
335187 512710.91
319478 526246.17
344419 1511192.17
344539 1679114.65
326651 1944042.43
325136 2014490.51
Name: contb_receipt_amt, Length: 1001731, dtype: float64

Looks like we have some negative values, as well as some huge donation amounts! The negative values are due to the FEC recording refunds as well as donations, let's go ahead and only look at the positive
contribution amounts

In [628… # Get rid of the negative values


top_donor = top_donor[top_donor >0]

# Sort the Series


top_donor.sort()

# Look at the top 10 most common donations value counts


top_donor.value_counts().head(10)

100 178188
Out[628]:
50 137584
25 110345
250 91182
500 57984
2500 49005
35 37237
1000 36494
10 33986
200 27813
dtype: int64

Here we can see that the top 10 most common donations ranged from 10 to 2500 dollars.

A quick question we could verify is if donations are usually made in round number amounts? (e.g. 10,20,50,100,500 etc.) We can quickly visualize this by making a histogram and checking for peaks at those values.
Let's go ahead and do this for the most common amounts, up to 2500 dollars.

In [629… # Create a Series of the common donations limited to 2500


com_don = top_donor[top_donor < 2500]

# Set a high number of bins to account for the non-round donations and check histogram for spikes.
com_don.hist(bins=100)

<matplotlib.axes._subplots.AxesSubplot at 0xdf456d68>
Out[629]:

Looks like our intuition was right, since we spikes at the round numbers.

Let's dive deeper into the data and see if we can seperate donations by Party, in order to do this we'll have to figure out a way of creating a new 'Party' column. We can do this by starting with the candidates and their
affliliation. Now let's go ahead and get a list of candidates

In [630… # Grab the unique object from the candidate column


candidates = donor_df.cand_nm.unique()
#Show
candidates

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',


Out[630]:
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
'Perry, Rick'], dtype=object)

Let's go ahead and seperate Obama from the Republican Candidates by adding a Party Affiliation column. We can do this by using map along a dictionary of party affiliations. Lecture 36 has a review of this topic.

In [631… # Dictionary of party affiliation


party_map = {'Bachmann, Michelle': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'Huntsman, Jon': 'Republican',
'Johnson, Gary Earl': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Obama, Barack': 'Democrat',
'Paul, Ron': 'Republican',
'Pawlenty, Timothy': 'Republican',
'Perry, Rick': 'Republican',
"Roemer, Charles E. 'Buddy' III": 'Republican',
'Romney, Mitt': 'Republican',
'Santorum, Rick': 'Republican'}

# Now map the party with candidate


donor_df['Party'] = donor_df.cand_nm.map(party_map)

A quick note, we could have done this same operation manually using a for loop, however this operation would be much slower than using the map method.

In [632… '''
for i in xrange(0,len(donor_df)):
if donor_df['cand_nm'][i] == 'Obama,Barack':
donor_df['Party'][i] = 'Democrat'
else:
donor_df['Party'][i] = 'Republican'
'''

"\nfor i in xrange(0,len(donor_df)):\n if donor_df['cand_nm'][i] == 'Obama,Barack':\n donor_df['Party'][i] = 'Democrat'\n else:\n donor_df['Party'][i] =


Out[632]:
'Republican'\n"

Let's look at our DataFrame and also make sure we clear refunds from the contribution amounts.

In [633… # Clear refunds


donor_df = donor_df[donor_df.contb_receipt_amt >0]

# Preview DataFrame
donor_df.head()

Out[633]: cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num

Bachmann, HARVEY,
0 C00410118 P20002978 MOBILE AL 3.660103e+08 RETIRED RETIRED 250 20-JUN-11 NaN NaN NaN SA17A 736166
Michelle WILLIAM

Bachmann, HARVEY,
1 C00410118 P20002978 MOBILE AL 3.660103e+08 RETIRED RETIRED 50 23-JUN-11 NaN NaN NaN SA17A 736166
Michelle WILLIAM

Bachmann, INFORMATION INFORMATION


2 C00410118 P20002978 SMITH, LANIER LANETT AL 3.686334e+08 250 05-JUL-11 NaN NaN NaN SA17A 749073
Michelle REQUESTED REQUESTED

Bachmann, BLEVINS,
3 C00410118 P20002978 PIGGOTT AR 7.245483e+08 NONE RETIRED 250 01-AUG-11 NaN NaN NaN SA17A 749073
Michelle DARONDA

HOT
Bachmann, WARDENBURG,
4 C00410118 P20002978 SPRINGS AR 7.190165e+08 NONE RETIRED 300 20-JUN-11 NaN NaN NaN SA17A 736166
Michelle HAROLD
NATION

Let's start by aggregating the data by candidate. We'll take a quick look a the total amounts received by each candidate. First we will look a the total number of donations and then at the total amount.

In [634… # Groupby candidate and then displayt the total number of people who donated
donor_df.groupby('cand_nm')['contb_receipt_amt'].count()

cand_nm
Out[634]:
Bachmann, Michelle 13082
Cain, Herman 20052
Gingrich, Newt 46883
Huntsman, Jon 4066
Johnson, Gary Earl 1234
McCotter, Thaddeus G 73
Obama, Barack 589127
Paul, Ron 143161
Pawlenty, Timothy 3844
Perry, Rick 12709
Roemer, Charles E. 'Buddy' III 5844
Romney, Mitt 105155
Santorum, Rick 46245
Name: contb_receipt_amt, dtype: int64

Clearly Obama is the front-runner in number of people donating, which makes sense, since he is not competeing with any other democratic nominees. Let's take a look at the total dollar amounts.

In [635… # Groupby candidate and then displayt the total amount donated
donor_df.groupby('cand_nm')['contb_receipt_amt'].sum()

cand_nm
Out[635]:
Bachmann, Michelle 2.711439e+06
Cain, Herman 7.101082e+06
Gingrich, Newt 1.283277e+07
Huntsman, Jon 3.330373e+06
Johnson, Gary Earl 5.669616e+05
McCotter, Thaddeus G 3.903000e+04
Obama, Barack 1.358774e+08
Paul, Ron 2.100962e+07
Pawlenty, Timothy 6.004819e+06
Perry, Rick 2.030575e+07
Roemer, Charles E. 'Buddy' III 3.730099e+05
Romney, Mitt 8.833591e+07
Santorum, Rick 1.104316e+07
Name: contb_receipt_amt, dtype: float64

This isn't super readable, and an important aspect of data science is to clearly present information. Let's go ahead and just print out these values in a clean for loop.

In [636… # Start by setting the groupby as an object


cand_amount = donor_df.groupby('cand_nm')['contb_receipt_amt'].sum()

# Our index tracker


i = 0

for don in cand_amount:


print " The candidate %s raised %.0f dollars " %(cand_amount.index[i],don)
print '\n'
i += 1

The candidate Bachmann, Michelle raised 2711439 dollars

The candidate Cain, Herman raised 7101082 dollars

The candidate Gingrich, Newt raised 12832770 dollars

The candidate Huntsman, Jon raised 3330373 dollars

The candidate Johnson, Gary Earl raised 566962 dollars

The candidate McCotter, Thaddeus G raised 39030 dollars

The candidate Obama, Barack raised 135877427 dollars

The candidate Paul, Ron raised 21009620 dollars

The candidate Pawlenty, Timothy raised 6004819 dollars

The candidate Perry, Rick raised 20305754 dollars

The candidate Roemer, Charles E. 'Buddy' III raised 373010 dollars

The candidate Romney, Mitt raised 88335908 dollars

The candidate Santorum, Rick raised 11043159 dollars

This is okay, but its hard to do a quick comparison just by reading this information. How about just a quick graphic presentation?

In [637… # PLot out total donation amounts


cand_amount.plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0xacd06278>
Out[637]:

Now the comparison is very easy to see. As we saw berfore, clearly Obama is the front-runner in donation amounts, which makes sense, since he is not competeing with any other democratic nominees. How about we
just compare Democrat versus Republican donations?

In [638… # Groupby party and then count donations


donor_df.groupby('Party')['contb_receipt_amt'].sum().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x3a4a8358>
Out[638]:

Looks like Obama couldn't compete against all the republicans, but he certainly has the advantage of their funding being splintered across multiple candidates.

Finally to start closing out the project, let's look at donations and who they came from (as far as occupation is concerned). We will start by grabing the occupation information from the dono_df DataFrame and then using
pivot_table to make the index defined by the various occupations and then have the columns defined by the Party (Republican or Democrat). FInally we'll also pass an aggregation function in the pivot table, in this case
a simple sum function will add up all the comntributions by anyone with the same profession.

In [639… # Use a pivot table to extract and organize the data by the donor occupation
occupation_df = donor_df.pivot_table('contb_receipt_amt',
index='contbr_occupation',
columns='Party', aggfunc='sum')

In [640… # Let's go ahead and check out the DataFrame


occupation_df.head()

Out[640]: Party Democrat Republican

contbr_occupation

MIXED-MEDIA ARTIST / STORYTELLER 100 NaN

AREA VICE PRESIDENT 250 NaN

RESEARCH ASSOCIATE 100 NaN

TEACHER 500 NaN

THERAPIST 3900 NaN

Great! Now let's see how big the DataFrame is.

In [641… # Check size


occupation_df.shape

(45067, 2)
Out[641]:

Wow! This is probably far too large to display effectively with a small, static visualization. What we should do is have a cut-off for total contribution amounts. Afterall, small donations of 20 dollars by one type of
occupation won't give us too much insight. So let's set our cut off at 1 million dollars.

In [642… # Set a cut off point at 1 milllion dollars of sum contributions


occupation_df = occupation_df[occupation_df.sum(1) > 1000000]

In [643… # Now let's check the size!


occupation_df.shape

(31, 2)
Out[643]:

Great! This looks much more manageable! Now let's visualize it.

In [644… # plot out with pandas


occupation_df.plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0xde7771d0>
Out[644]:

This is a bit hard to read, so let's use kind = 'barh' (horizontal) to set the ocucpation on the correct axis.

In [645… # Horizontal plot, use a convienently colored cmap


occupation_df.plot(kind='barh',figsize=(10,12),cmap='seismic')

<matplotlib.axes._subplots.AxesSubplot at 0x466637f0>
Out[645]:

Looks like there are some occupations that are either mislabeled or aren't really occupations. Let's get rid of: Information Requested occupations and let's combine CEO and C.E.O.

In [646… # Drop the unavailble occupations


occupation_df.drop(['INFORMATION REQUESTED PER BEST EFFORTS','INFORMATION REQUESTED'],axis=0,inplace=True)

Now let's combine the CEO and C.E.O rows.

In [647… # Set new ceo row as sum of the current two


occupation_df.loc['CEO'] = occupation_df.loc['CEO'] + occupation_df.loc['C.E.O.']
# Drop CEO
occupation_df.drop('C.E.O.',inplace=True)

Now let's repeat the same plot!

In [648… # Repeat previous plot!


occupation_df.plot(kind='barh',figsize=(10,12),cmap='seismic')

<matplotlib.axes._subplots.AxesSubplot at 0xd6233358>
Out[648]:

Awesome! Looks like CEOs are a little more conservative leaning, this may be due to the tax philosphies of each party during the election.

Great Job!
There's still so much to discover in these rich datasets! Come up with your own political questions you want answered! Or just play around with different methods of visualizing the data!

For more on general data analysis of politics, I highly suggest the 538 website!

Again, great job on getting through the course this far! Go ahead and search the web for more data to discover!

In [ ]:

You might also like