
Read the Data

The data, called itcont, comes from the US Federal Election Commission (FEC) website and
records individual contributions to each committee. The dataset is large, with more than
30 million records, one for every transaction. This report only uses the first 3 million
transactions.

Before reading the data, I noticed that the fields in each record are separated by a
vertical bar “|”. Therefore, when reading the file, I chose “|” as the separator in order to
split each long string into individual fields. Some fields hold invalid information, such as
“NA”, “N/A”, or nothing at all. To handle these consistently, I converted every cell
containing “NA”, “N/A”, or an empty value into the logical constant NA. These three cases
may not cover all the invalid data, but they occur most frequently.
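
The reading step can be sketched in a few lines. The report did this in R; the block below is only an illustrative Python sketch using the standard library, not the code used for the report. The column names CMTE_ID and STATE and the three sample rows are assumptions made up for the example (only TRANSACTION_DT is named in the report).

```python
import csv
import io

# Strings treated as missing, following the three cases named in the report.
MISSING = {"NA", "N/A", ""}

def read_pipe_delimited(handle, field_names):
    """Yield one dict per row, with missing markers replaced by None."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        cleaned = [None if value.strip() in MISSING else value for value in row]
        yield dict(zip(field_names, cleaned))

# Toy input with three hypothetical columns; the real file has many more.
sample = io.StringIO("C001|CA|N/A\nC002||09152019\nC003|NA|7042019\n")
rows = list(read_pipe_delimited(sample, ["CMTE_ID", "STATE", "TRANSACTION_DT"]))

for row in rows:
    print(row)
```

Here Python's None plays the role that NA plays in R. Other markers of invalid data, if any turn up, could be added to the MISSING set.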

Another change that I made to the dataset was converting the transaction date (variable:
TRANSACTION_DT) from a number into a date format. To do so, I read all the dates in a loop.
The value can be a 7-digit or an 8-digit number, depending on whether the month takes one
or two digits. I wrote an “if” condition for each of these two cases and converted the
values into a proper date format.
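
As an illustration of the date conversion, here is a minimal Python sketch (the report itself used an R loop). It assumes the digits are laid out as month, day, year (MMDDYYYY) and that a 7-digit value has simply lost the leading zero of its month; the function name parse_transaction_date is hypothetical.

```python
from datetime import date, datetime

def parse_transaction_date(raw):
    """Convert a 7- or 8-digit MMDDYYYY value to a date; return None otherwise."""
    if raw is None:
        return None
    digits = str(raw).strip()
    if not digits.isdigit() or len(digits) not in (7, 8):
        return None
    # Assumption: a 7-digit value is missing the leading zero of a one-digit month.
    padded = digits.zfill(8)
    try:
        return datetime.strptime(padded, "%m%d%Y").date()
    except ValueError:
        return None  # digits that do not form a valid calendar date

print(parse_transaction_date(7042019))
print(parse_transaction_date("12312019"))
```

Padding to eight digits first means one parsing rule covers both cases, so the separate “if” branches are not needed in this version.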

Plot the Data

Having unified most of the invalid inputs, we can use the data to plot the number of
contributions by state, the number of contributions by date, and the number of
contributions by state per capita.

When plotting the number of contributions by state and by date, we can simply use the
function “ggplot”. We remove all the cells with NAs and plot the frequency of each date and
state. One small challenge when plotting the number of contributions by state is that there
are so many state labels along the x-axis that they overlap each other. To leave more space
for each state abbreviation, I rotated the labels, as shown below. For the number of
contributions by date, I restricted the time range to 2019 through 2020, which is the
period covered by the itcont data.

[Figure: Number of contributions by date]

[Figure: Number of contributions by state]
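
The counting behind these two plots can be sketched as follows. This is an illustrative Python version using only the standard library, not the ggplot code from the report; the records are toy values, the STATE key is an assumed column name, and the exact date bounds are an assumption about how the 2019 to 2020 range was applied.

```python
from collections import Counter
from datetime import date

# Toy records standing in for cleaned rows; None marks a missing value.
records = [
    {"STATE": "CA", "TRANSACTION_DT": date(2019, 7, 1)},
    {"STATE": "CA", "TRANSACTION_DT": date(2019, 7, 1)},
    {"STATE": "NY", "TRANSACTION_DT": date(2019, 8, 15)},
    {"STATE": None, "TRANSACTION_DT": date(2019, 8, 15)},
    {"STATE": "DC", "TRANSACTION_DT": None},
    {"STATE": "NY", "TRANSACTION_DT": date(2018, 12, 31)},
]

# Assumed bounds for the 2019 to 2020 window.
START, END = date(2019, 1, 1), date(2020, 12, 31)

# Drop missing states, then count rows per state.
by_state = Counter(r["STATE"] for r in records if r["STATE"] is not None)

# Drop missing dates and dates outside the window, then count rows per date.
by_date = Counter(
    r["TRANSACTION_DT"]
    for r in records
    if r["TRANSACTION_DT"] is not None and START <= r["TRANSACTION_DT"] <= END
)

for state, n in sorted(by_state.items()):
    print(state, n)
for day, n in sorted(by_date.items()):
    print(day.isoformat(), n)
```

Each counter drops missing values only in its own column, so a row with a missing state still counts toward its date.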

A bigger challenge is plotting the number of contributions by state per capita, since I had
to find another dataset with the population of each state and then match the population
figures with the itcont data. To deal with this, I found two other datasets: one with the
state names and their populations in 2020, and another with the state abbreviations and
their names. I put them into two sheets in Excel and used the Excel function “VLOOKUP” to
match the abbreviations with the population data.

The next step is to match the newly gathered data in Excel with the itcont dataset. Here I
stored the state abbreviations and their populations in a data frame, and used the R
function “merge” to match the state populations with the state abbreviations in the itcont
data. The merged data frame contains a column “Freq” (the name R gives to the counts when a
frequency table is converted to a data frame), which shows the frequency of each state in
the itcont data. Therefore, we can calculate the contributions by state per capita by
dividing the frequency of contributions in each state by the population of that state.
Again, I rotated the labels along the x-axis to leave more space.

[Figure: Contributions by state per capita]
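
The two matching steps (state name to abbreviation, then abbreviation to contribution count) can be sketched with plain dictionaries. This is an illustrative Python version, not the Excel and R workflow used in the report. All numbers are made-up placeholders rather than census figures or real counts, and “ZZ” is a hypothetical code with no population entry.

```python
# Name -> abbreviation lookup, playing the role of the VLOOKUP step.
abbreviations = {"California": "CA", "District of Columbia": "DC"}

# Placeholder populations keyed by state name (NOT real census figures).
population_by_name = {"California": 1000, "District of Columbia": 100}

# Placeholder contribution counts keyed by abbreviation (NOT real counts).
contribution_counts = {"CA": 50, "DC": 20, "ZZ": 3}

# Re-key the populations by abbreviation so they can be matched with the counts.
population_by_abbr = {
    abbreviations[name]: population
    for name, population in population_by_name.items()
    if name in abbreviations
}

# Keep only states present in both tables, then divide count by population.
per_capita = {
    state: count / population_by_abbr[state]
    for state, count in contribution_counts.items()
    if state in population_by_abbr
}

for state, value in sorted(per_capita.items()):
    print(state, value)
```

Keeping only the states found in both tables mirrors the default behavior of R's merge, which drops unmatched rows. Codes such as “ZZ” therefore disappear from the per-capita result.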

Interpret the Plots

The three plots shown above show the level of contributions by date, by state, and by state
per capita. However, as we read only a limited number of transactions, the plots may fail
to show the actual behavior of contributors. For example, the plot of contributions by date
shows that the number of transactions peaked in July 2019 and decreased over the rest of
the year. This trend may not hold if we look at the entire dataset, since the first 3
million records are only a small portion of it and are not guaranteed to be randomly
selected.

However, some aspects of the plots seem convincing. For example, the plot of contributions
by state shows that California (CA) has the highest total number of transactions. This is
plausible because CA has the highest census population, a large base that can be expected
to produce a large number of contribution transactions. Also, the plot of contributions by
state per capita shows that Washington DC has the highest value. Washington DC, the capital
of the US, hosts many federal government agencies and embassies, which suggests that it may
have a large number of people engaged in politics.

Comparing with 2016

We tried to compare the total number of contribution transactions by state between 2016 and
2020. To do this, I took the difference of the two counts (each state's contribution
transactions in 2020 minus those in 2016) and plotted it. The plot is shown below. From the
plot, we can tell that CA increased by about 40,000 transactions, which is the largest
increase among all states. In contrast, PA decreased by about 40,000 transactions, which is
the largest decrease among all states. The rest of the states varied within a range of
about 20,000 transactions.
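
The difference computation can be sketched as below. This is an illustrative Python version, not the R code from the report. The counts are toy placeholders, and treating a state that is missing from one year as having zero transactions is an assumption.

```python
# Placeholder per-state transaction counts (NOT the real 2016 or 2020 figures).
counts_2016 = {"CA": 120, "PA": 90, "NY": 60}
counts_2020 = {"CA": 160, "PA": 50, "NY": 65, "TX": 10}

# Use the union of states so that a state missing from one year is still compared.
states = sorted(set(counts_2016) | set(counts_2020))

# Change = 2020 count minus 2016 count; a missing state is assumed to have 0.
change = {s: counts_2020.get(s, 0) - counts_2016.get(s, 0) for s in states}

largest_increase = max(change, key=change.get)
largest_decrease = min(change, key=change.get)

for state in states:
    print(state, change[state])
print("largest increase:", largest_increase)
print("largest decrease:", largest_decrease)
```

A raw difference like this depends on how many records were read for each year, so it is only comparable when both years are sampled in the same way.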

Interesting Findings

I tried to find the relationship between the average income of each state and the
contributions by state per capita. I found data on average income by state on the website
worldpopulationreview.com. Then I used the Excel function “VLOOKUP” again to match the
state names with their abbreviations. In RStudio, I used the function “merge” to combine
the state income data with the contributions per capita by state. The result is shown
below. From the result, we would conclude that there is a significant positive linear
relationship between the average household income and the contributions by state per
capita.
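
One way to quantify the strength of such a linear relationship is the Pearson correlation coefficient. The sketch below is an illustrative Python version with made-up placeholder numbers. It is not the analysis behind the report's result, and the function name pearson_r is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Placeholder values (NOT real income or contribution data).
income = [50, 60, 70, 80]
contributions_per_capita = [1.0, 1.5, 1.4, 2.1]

r = pearson_r(income, contributions_per_capita)
print("Pearson r:", r)
```

A coefficient close to 1 indicates a strong positive linear association, but the coefficient alone does not establish statistical significance. That requires a test on the fitted relationship, for example with R's cor.test.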
