Report 1
The itcont data from the US Federal Election Commission (FEC) website includes
data is large, with more than 30 million records, one for every transaction. This report
Before reading the data, I noticed that the fields are separated by
a vertical bar “|”. Therefore, when reading the file, I chose “|” as the separator, aiming to
split the long strings into individual pieces of data. Between the “|” characters, there is
sometimes invalid information, such as “NA”, “N/A”, and empty fields. To handle
these, I converted all the cells containing “NA”, “N/A”, or an empty value into the
logical constant NA. These three forms may not cover all the invalid data, but they occur
most frequently.
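The reading step described above can be sketched in base R as follows. This is a minimal illustration, not the exact call used for the report: the three sample rows are made up, and the four-column layout is a simplified assumption (only TRANSACTION_DT is named in this report; the other column names are placeholders).

```r
# Hypothetical sample rows in a pipe-delimited layout (not real FEC data).
sample_lines <- c(
  "C001|CA|07152019|250",
  "C002|N/A|7042019|NA",
  "C003|NY||100"
)

itcont_sample <- read.table(
  text = sample_lines,
  sep = "|",                          # fields are separated by a vertical bar
  header = FALSE,
  col.names = c("CMTE_ID", "STATE", "TRANSACTION_DT", "TRANSACTION_AMT"),
  colClasses = "character",           # keep everything as text at this stage
  na.strings = c("NA", "N/A", ""),    # the three invalid forms become NA
  quote = "",                         # do not treat quote characters specially
  comment.char = "",                  # do not treat "#" as a comment
  stringsAsFactors = FALSE
)

print(itcont_sample)
```

Passing the invalid forms through `na.strings` handles them while the file is being read, so no separate pass over the cells is needed afterwards.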
Another change that I made to the dataset is that I converted the numeric transaction
date (variable: TRANSACTION_DT) into a date format. To change the format, I read all the
“dates” in a loop. The variable can be a 7-digit or an 8-digit number, depending on whether
the month has one or two digits. I wrote an “if” condition for these two cases and changed them into a
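As an alternative to the loop with an “if” condition, the same conversion can be sketched in a vectorized way. This example assumes the dates are in month-day-year order (MMDDYYYY), so that a 7-digit value is simply missing the leading zero of the month; the sample values are made up.

```r
# Hypothetical raw values: 7-digit (month < 10) and 8-digit dates, plus one NA.
raw_dates <- c("7042019", "07152019", "12312019", NA)

# Assumption: a 7-digit value has lost the leading zero of its month.
needs_pad <- !is.na(raw_dates) & nchar(raw_dates) == 7
padded <- raw_dates
padded[needs_pad] <- paste0("0", raw_dates[needs_pad])

# Parse the now uniform 8-character strings as month, day, year.
transaction_date <- as.Date(padded, format = "%m%d%Y")

print(transaction_date)
```

Working on the whole column at once avoids an explicit loop, which matters for a file with millions of rows.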
Now that we have unified most of the invalid inputs, we can use the data to plot the
number of contributions by state, the number of contributions by date, and the number of
When plotting the number of contributions by state and date, we can simply use the
function “ggplot”. We remove all the cells with NAs and plot the frequency of each date and
state. One small challenge when plotting the number of contributions by state is that there are
so many labels along the x-axis that they overlap each other. To save more
space for each state abbreviation, I rotated the labels as shown below. As for
plotting the number of contributions by date, I restricted the time range to between 2019 and
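The counting and label rotation described above can be sketched as follows. The state vector is a made-up sample, and the column name STATE is an assumption; the rotation uses the ggplot2 theme element `axis.text.x`.

```r
library(ggplot2)

# Hypothetical sample of state abbreviations, including one missing value.
states <- c("CA", "NY", "CA", NA, "TX", "CA", "NY")

# table() drops NA by default; as.data.frame() yields columns STATE and Freq.
state_counts <- as.data.frame(table(STATE = states), stringsAsFactors = FALSE)

p <- ggplot(state_counts, aes(x = STATE, y = Freq)) +
  geom_col() +
  labs(x = "State", y = "Number of contributions") +
  # Rotate the x-axis labels by 90 degrees so the abbreviations do not overlap.
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p)
```

With only three states the rotation is not needed, but with more than fifty abbreviations on the axis it keeps each label readable.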
A bigger challenge is plotting the number of contributions by state per capita,
since I had to find another dataset with the population of each state, and then match the
population figures with the itcont data. To deal with this, I found two other datasets: one
with the state names and their populations in 2020, and another with the state abbreviations and their
names. Then I put them into two sheets in Excel, and used the Excel function “VLOOKUP” to
The next step is to match the newly gathered data in Excel with the itcont dataset.
Here I stored the state abbreviations and their populations in a data frame, and used the R
function “merge” to match the state populations with the state abbreviations in the itcont data.
The merged data frame contains a column “Freq”, which shows the
frequency of each state in the itcont data. Therefore, we can calculate the contributions by state
per capita by dividing the frequency of contributions in each state by the population of that
state. Again, I rotated the labels along the x-axis to save space.
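The matching steps can also be done entirely in R. The sketch below uses “merge” twice: once in place of the Excel lookup (names to abbreviations) and once to attach the populations to the contribution counts. All numbers are toy values chosen for easy arithmetic, not real populations or counts, and the column names are assumptions.

```r
# Toy lookup tables (hypothetical numbers, not census data).
pop_by_name <- data.frame(
  NAME = c("California", "District of Columbia", "Texas"),
  POPULATION = c(1000, 10, 500),
  stringsAsFactors = FALSE
)
abbr_by_name <- data.frame(
  NAME = c("California", "District of Columbia", "Texas"),
  STATE = c("CA", "DC", "TX"),
  stringsAsFactors = FALSE
)

# Step 1: attach the abbreviation to each population row (the lookup step).
state_pop <- merge(pop_by_name, abbr_by_name, by = "NAME")

# Toy contribution counts per state, in the shape produced by table().
state_counts <- data.frame(
  STATE = c("CA", "DC", "TX"),
  Freq = c(300, 50, 120),
  stringsAsFactors = FALSE
)

# Step 2: match counts with populations and compute contributions per capita.
per_capita <- merge(state_counts, state_pop, by = "STATE")
per_capita$PER_CAPITA <- per_capita$Freq / per_capita$POPULATION

print(per_capita[, c("STATE", "Freq", "POPULATION", "PER_CAPITA")])
```

By default “merge” keeps only the states present in both data frames, so abbreviations without a population entry drop out of the per capita plot.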
the contributions by state per capita
The three plots shown above show the number of contributions by date, state, and state
per capita. However, as we read only a limited number of transactions, the plots may fail to
reflect the actual behavior of contributors. For example, the plot of contributions by date
shows that the number of transactions peaked in July 2019 and decreased over the rest of the
year. This trend may not hold if we look at the entire dataset, since the first 3 million records are
only a small portion of the entire dataset and are not guaranteed to be randomly selected.
However, some aspects of the plots are plausible. For example, the plot
of contributions by state shows that California (CA) has the highest total number of
transactions. This seems reasonable because CA has the largest census population, a large
base that makes a large number of contribution transactions likely. Also, the plot of
contributions by state per capita shows that Washington DC has the highest value.
Washington DC, the capital of the US, has a lot of federal government agencies and embassies,
We tried to compare the total number of contribution transactions by state between 2016 and
2020. To do this, I took the difference between the two counts (each state’s contribution
transactions in 2020 minus those in 2016) and plotted it. The plot is shown
below. From the plot, we can tell that CA increased by about 40,000 transactions, which is the
largest increase among all states. In contrast, PA decreased by about 40,000
transactions, which is the largest decrease among all states. The rest of the states varied within
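The year-over-year difference can be sketched as follows. The counts are toy values, the column names are assumptions, and treating a state that is missing in one year as having zero transactions is a choice made only for this illustration.

```r
# Toy per-state transaction counts for the two years (hypothetical numbers).
counts_2016 <- data.frame(
  STATE = c("CA", "PA", "TX"),
  N_2016 = c(100, 90, 20),
  stringsAsFactors = FALSE
)
counts_2020 <- data.frame(
  STATE = c("CA", "PA", "NY"),
  N_2020 = c(140, 50, 30),
  stringsAsFactors = FALSE
)

# Keep every state that appears in either year.
both <- merge(counts_2016, counts_2020, by = "STATE", all = TRUE)

# Assumption: a state absent from one year had zero transactions that year.
both$N_2016[is.na(both$N_2016)] <- 0
both$N_2020[is.na(both$N_2020)] <- 0

# 2020 minus 2016, sorted from largest decrease to largest increase.
both$CHANGE <- both$N_2020 - both$N_2016
both <- both[order(both$CHANGE), ]

print(both)
```

Using `all = TRUE` keeps states that appear in only one of the two years, which the default inner merge would silently drop.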
Interesting findings
I tried to find the relationship between the average income of states and the
contributions by state per capita. I found data on average income by state from the website
worldpopulationreview.com. Then I used the Excel function “VLOOKUP” again to match the
state names with their abbreviations. In RStudio, I used the function “merge” to combine the
state income data with the contributions per capita by state. The result is shown
below. From the result, we would conclude that there is a significant positive linear
relationship between the average household income and the contributions by state per capita.
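One way to check such a relationship is a simple linear regression, sketched below with “merge” followed by “lm”. The five states and all values are made up for illustration and do not reproduce the result reported here; the column names are assumptions.

```r
# Toy data (hypothetical states and values, not the report's data).
income <- data.frame(
  STATE = c("AA", "BB", "CC", "DD", "EE"),
  INCOME = c(50000, 60000, 70000, 80000, 90000),
  stringsAsFactors = FALSE
)
contrib <- data.frame(
  STATE = c("AA", "BB", "CC", "DD", "EE"),
  PER_CAPITA = c(0.010, 0.013, 0.015, 0.021, 0.024),
  stringsAsFactors = FALSE
)

# Combine the two tables on the state abbreviation.
combined <- merge(income, contrib, by = "STATE")

# Fit contributions per capita as a linear function of income.
fit <- lm(PER_CAPITA ~ INCOME, data = combined)
slope <- coef(fit)[["INCOME"]]
p_value <- summary(fit)$coefficients["INCOME", "Pr(>|t|)"]
r <- cor(combined$INCOME, combined$PER_CAPITA)

print(summary(fit))
```

A positive slope with a small p-value supports a positive linear association, though with state-level data it does not by itself show that income causes more contributions.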