

Name: Zainab Pate

UNI: zmp2111

Homework 3

Plotting Politics By Proxy

All students on both tracks must complete this homework.


Goal:

Using unsupervised learning, we're going to produce a plot of US politicians via their individual language usage on Twitter, starting with ingesting
the tweets, preparing these tweets for PCA, and finally performing PCA. We're going to use this exercise to reflect on, analyze, and explore how
such practices both reify and challenge our contemporary understanding of community and public discourse.

More specifically, you're going to:

1. tokenize and get word counts for a set of tweets;
2. produce a PCA plot of Twitter user accounts based on word usage;
3. explore how word usage is (or isn't) a good proxy for political affiliation for people in public office.

To do this homework, you will make use of the following files:


1. 10K2016tweets.csv
2. pol_aff.csv

You can access them at these URLs (use the URL as the first argument to the read_csv function):

http://www.columbia.edu/~chw2/Courses/PPF/10K2016tweets.csv

http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv
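If you'd rather not use the Colab upload widget, here's a minimal sketch (assuming these URLs are reachable from your runtime) that reads both files directly; the name pol_aff is illustrative:

import pandas as pd

# read each course file straight from its URL (the URL is the first argument)
tweets = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/10K2016tweets.csv")
pol_aff = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv")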

Policies
This assignment is to be done on your own, but you can talk about the assignment with your classmates if you get stuck.
Feel free to also use stackoverflow but please provide citation and link to the specific answer if you do this.
You may use Generative AI, but please note which LLM technologies/resources you used (e.g., ChatGPT (OpenAI), Gemini (Google), etc.),
and you must include your prompt(s).
You may also visit Ananya, Andrew, or Colby during TA office hours or post in the #homework channel on Slack.

Instructions
Be sure to rename this homework notebook so that it includes your name.

Provide your code to justify your answer to each question. Your code must run with the "10K2016tweets.csv" and "pol_aff.csv" files as
originally provided to you.

Where not specified, please run functions with default argument settings.

Please 'Restart and Run All' prior to submission.

Only include relevant information in your output. We should be able to see your answer clearly.

Save pdf in Landscape and check that all of your code is shown in the submission.

When submitting in Gradescope, be sure to submit to both the pdf and ipynb assignments.


List any students you talked with about this assignment here:

1. [person 1]
2. [person 2]
3. etc.

Note: Lab 5 will be very helpful

Homework Problems


Step 1: Getting Word Frequencies for All Words in Each Tweet
question 1 [5 points]
1. Ingest the file 10K2016tweets.csv into pandas as a dataframe entitled tweets . Note that each tweet is a row. We've provided the libraries
you'll need for this assignment.
2. Use the code provided to drop rows with NaN values.
3. Use tweets.tail() to get a sense of what the data frame is.

We've given you some libraries you'll need for this assignment.

import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

# insert your code to read tweet csv into a dataframe called "tweets".

# WRITE CODE HERE

from google.colab import files


uploaded = files.upload()

10K2016tweets.csv (text/csv) - 1557106 bytes, last modified: 5/15/2024 - 100% done
Saving 10K2016tweets.csv to 10K2016tweets (1).csv

tweets = pd.read_csv("10K2016tweets.csv")


tweets

id user_id created_at tweet_text

0 752888708320329729 15160884 2016-07-12 15:33:50 RT @HouseJudiciary: .@Randy_Forbes to @Loretta...

1 752888783922683904 15160884 2016-07-12 15:34:08 RT @RebeccaARainey: @Randy_Forbes grills @Lore...

2 735952848937177089 15160884 2016-05-26 21:56:47 RT @aircraftcarrier: .@USNavy duty is to defen...

3 735870521502486529 15160884 2016-05-26 16:29:38 Regardless of the cause of this crash-- we nee...

4 735870454116782082 15160884 2016-05-26 16:29:22 Our pilots aren't getting the flying time they...

... ... ... ... ...

8887 738088645987557376 234128524 2016-06-01 19:23:40 In this e-news update: 1) Nominate a vet for C...

8888 738082324638773250 234128524 2016-06-01 18:58:33 Because of great POW/MIA work by History Fligh...

8889 717729778699214848 234128524 2016-04-06 15:04:48 To learn more try the following link: https://...

8890 737729206348570625 234128524 2016-05-31 19:35:23 RT @IndianaDCS: Thx @RepToddYoung for supporti...

8891 737658165966327808 234128524 2016-05-31 14:53:06 Trouble dealing with a federal agency? We're h...

8892 rows × 4 columns


# Now run this code to drop rows with NAN values


tweets.dropna(inplace=True)
tweets = tweets.reset_index(drop=True)

# Now use tail() to see last five rows of dataframe "tweets"

# WRITE CODE HERE


tweets.tail(5)

      id                  user_id    created_at           tweet_text
8887  738088645987557376  234128524  2016-06-01 19:23:40  In this e-news update: 1) Nominate a vet for C...
8888  738082324638773250  234128524  2016-06-01 18:58:33  Because of great POW/MIA work by History Fligh...
8889  717729778699214848  234128524  2016-04-06 15:04:48  To learn more try the following link: https://...
8890  737729206348570625  234128524  2016-05-31 19:35:23  RT @IndianaDCS: Thx @RepToddYoung for supporti...
8891  737658165966327808  234128524  2016-05-31 14:53:06  Trouble dealing with a federal agency? We're h...

question 2 [10 points]


1. What day and time was the oldest tweet in the data set posted? Hint: use tweets['created_at'].min() . Also, be sure to make use of
the to_datetime() command.

# Answer: January 1, 2016, at 2:02:28 PM

# WRITE CODE HERE

tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets['created_at'].min()

Timestamp('2016-01-01 14:02:28')

2. What day and time was the most recent tweet in the data set posted?

# WRITE CODE HERE


tweets['created_at'].max()

Timestamp('2017-01-29 23:58:06')

3. How many tweets does the data set have?

# WRITE CODE HERE


len(tweets)

8892

4. How many unique twitter users does the data set have? Hint: use the nunique() command.

# WRITE CODE HERE


test = tweets['user_id'].nunique()
print(test)

14

question 3 [30 points]


Use sklearn's CountVectorizer to produce a table of word counts in which each tweet is a row and each word is a column for the first 100
tweets in your dataset. Name this table "wc".

For an EXAMPLE of using sklearn countvectorizer to produce this wc , see the code in the following 5 cells. Note that this
example just uses three made-up tweets so that it's easier to see what's happening.

EXAMPLE CODE STARTS HERE.

from sklearn.feature_extraction.text import CountVectorizer

# Here we list a few sample strings, where each string can be
# considered one text. For the tweets dataframe you'll need to
# produce a list in which each tweet is one string. The string
# "The animal is large." is tweet 0 below; the string "is is is"
# is tweet 1 below; etc.
three_tweets = ["The animal is large.", "is is is", "The can't"]

# create vectorizer
cv = CountVectorizer()

# tokenize texts & get vocabulary
cv.fit(three_tweets)  # note that the sklearn tokenizer turns "can't" into "can"

# create a dictionary where each word is given an index number.


# If you need to know what word a number represents, refer back to this dictionary
word_index = cv.vocabulary_
word_index

{'the': 4, 'animal': 0, 'is': 2, 'large': 3, 'can': 1}

# to find out what column number a particular word (say, "animal") has, we can use
word_index["animal"]

# generate a table of tweets vs words
word_counts = cv.transform(three_tweets)

# and get the "shape" of this table like this:
word_counts.shape
# where the 3 rows are the three tweets and the
# 5 columns are the 5 words in this corpus

(3, 5)

# to get a dataframe with words instead of word index numbers:
feature_names = cv.get_feature_names_out()
word_counts_df = pd.DataFrame.sparse.from_spmatrix(word_counts, columns=feature_names).fillna(0)
word_counts_df

animal can is large the

0 1 0 1 1 1

1 0 0 3 0 0

2 0 1 0 0 1


# Finally, you can dump the word_counts table into an array
# where each row is a tweet, each column is a word type, and each
# element is a word token count (where each word is denoted by a number
# as defined in the word_index above). For the difference between word types
# and word tokens, see "type-token distinction" in Wikipedia.
wc = word_counts.toarray()
wc

array([[1, 0, 1, 1, 1],
[0, 0, 3, 0, 0],
[0, 1, 0, 0, 1]])

THIS CONCLUDES THE EXAMPLE CODE.
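As an aside on why the example's "can't" became "can": CountVectorizer's default token pattern, r"(?u)\b\w\w+\b", lowercases text and keeps only runs of two or more word characters, so the apostrophe splits "can't" and the lone "t" is dropped. A minimal sketch of this behavior under sklearn's defaults:

from sklearn.feature_extraction.text import CountVectorizer

# fit on a single string to inspect which tokens survive the default tokenizer
cv = CountVectorizer()
cv.fit(["The can't"])
print(cv.get_feature_names_out())  # ['can' 'the'] -- the single-character "t" is gone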

Now do the same thing we did for the example above, but substitute the "three_tweets" list in the above example with the first 100 tweets that
appear in your data set.

You can get a list called "all_tweets" of all your tweet texts using the following code: all_tweets = tweets['tweet_text'] . To get the first
100 tweets, you can use a python slice: all_tweets[0:100] . We've provided this code to get you started.

At the very end of your code for this question, you should output wc just like we did in the example above.

Put your code for question 3 below:

Used OpenAI (ChatGPT); the prompt was the error message I received.

# extract all_tweets from df as a list of strings.
all_tweets = tweets['tweet_text']
hundred_tweets = all_tweets[0:100]

from sklearn.feature_extraction.text import CountVectorizer

# create vectorizer
cv = CountVectorizer()

# tokenize texts & get vocabulary
cv.fit(hundred_tweets)  # note that the sklearn tokenizer turns "can't" into "can"

# create a dictionary where each word is given an index number.
# If you need to know what word a number represents, refer back to this dictionary
word_index = cv.vocabulary_
word_index

{...
 'virginia': 772,
 'abasummit': 33,
 'stay': 681,
 'warm': 784,
 'winter': 806,
 'storm': 687,
 'tips': 733,
 'here': 341,
 's1iewdwagy': 617,
 'winterstormjonas': 807,
 'rbn844aade': 571,
 'review': 601,
 'future': 299,
 'io03mqmqrv': 388,
 'americasstreng': 60,
 'intro': 383,
 'takes': 698,
 'steps': 682,
 ...}

# generate a table of tweets vs words
word_counts = cv.transform(hundred_tweets)

# and get the "shape" of this table like this:
word_counts.shape
# where the 100 rows are the hundred tweets and the
# 828 columns are the 828 word types in this corpus

(100, 828)

# to get a dataframe with words instead of word index numbers:
feature_names = cv.get_feature_names_out()
word_counts_df = pd.DataFrame.sparse.from_spmatrix(word_counts, columns=feature_names).fillna(0)
word_counts_df

0bvj2mgqrf 0cgonkid 10 10am 14 1b 20 2015 2016 2017 ... wrong wrote

0 0 0 0 0 0 0 0 0 0 0 ... 0 0

1 0 0 0 0 0 0 0 0 0 0 ... 0 0

2 0 0 0 0 0 0 0 0 0 0 ... 0 0

3 0 0 0 0 0 0 0 0 0 0 ... 0 0

4 0 0 0 0 0 0 0 0 0 0 ... 0 0

... ... ... ... ... ... ... ... ... ... ... ... ... ...

95 0 0 0 0 0 0 0 0 0 0 ... 0 0

96 0 0 0 1 0 0 0 0 0 0 ... 0 0

97 0 0 0 0 0 0 0 0 0 0 ... 0 0

98 0 0 0 0 0 0 0 0 0 0 ... 0 1

99 0 0 0 0 0 0 0 0 0 0 ... 0 0

100 rows × 828 columns

from sklearn.feature_extraction.text import CountVectorizer

# WRITE CODE HERE

cv = CountVectorizer()
cv.fit(hundred_tweets)
word_index = cv.vocabulary_
word_counts = cv.transform(hundred_tweets)
wc = word_counts.toarray()  # dump the counts into an array, as in the example above
word_counts

<100x828 sparse matrix of type '<class 'numpy.int64'>'
with 1764 stored elements in Compressed Sparse Row format>

Step 2: Find Word Usage for Each Twitter Account


question 4 [30 points]
The table wc gives us the word counts for all tweets, but what we really want is a table of word counts for each Twitter account in our first 100
tweets.

To do this: Make a new array called wc_accounts where each row is a Twitter account user and the columns are word token counts for unique
word types. Note that the dataframe "tweets" has a user_id provided for each tweet.

To make wc_accounts , use a FOR loop that runs over all user_id s. On each pass through the FOR loop you should do the following:

1. get all the row numbers of one user's tweets from tweets in a list and convert this list to a numpy array;
2. use this array of tweet rows to index into the wc array you built in question 3 (which will return all the word counts of that user's tweets), and use the sum()
command on this resultant array to produce a single row with total word counts for one user. When taking the sum, be careful that you
take the COLUMN sum. For the numpy sum functions ( np.ndarray.sum() or np.sum() ), you have to SPECIFY the axis you sum over.
The normal sum() function gives the column sum;
3. take this single row of total word counts for one user and append this row to temp_results .

Once the FOR loop is completed, you can save temp_results as wc_accounts .

To get you started, here's a way to get a list of all user_id s, which you need to run your FOR loop: twitter_users =
tweets['user_id'].unique() . We've provided this code.

We've also provided you with some additional code. Your job is to fill in the appropriate commands in the FOR loop provided.

Once you're done, be sure to output wc_accounts . (It will resemble the output for wc , but with different numbers.) We've provided this code for
you too.
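For reference, here's a minimal sketch of the loop the numbered steps above describe, assuming the wc array and the reset-index tweets dataframe from question 3 are in scope (the names first100 and user_total are illustrative, not required):

# scope the loop to the first 100 tweets, since wc only covers those rows
first100 = tweets[0:100]
twitter_users = first100['user_id'].unique()

temp_results = []
for user in twitter_users:
    # step 1: row numbers of this user's tweets, as a numpy array
    rows = np.array(first100.index[first100['user_id'] == user])
    # step 2: index into wc and take the COLUMN sum (axis=0) -> one row of totals
    user_total = wc[rows].sum(axis=0)
    # step 3: append this user's row of total word counts
    temp_results.append(user_total)

wc_accounts = np.array(temp_results)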

Put your code for question 4 below:

tweets

      id                  user_id    created_at           tweet_text
0     752888708320329729  15160884   2016-07-12 15:33:50  RT @HouseJudiciary: .@Randy_Forbes to @Loretta...
1     752888783922683904  15160884   2016-07-12 15:34:08  RT @RebeccaARainey: @Randy_Forbes grills @Lore...
2     735952848937177089  15160884   2016-05-26 21:56:47  RT @aircraftcarrier: .@USNavy duty is to defen...
3     735870521502486529  15160884   2016-05-26 16:29:38  Regardless of the cause of this crash-- we nee...
4     735870454116782082  15160884   2016-05-26 16:29:22  Our pilots aren't getting the flying time they...
...   ...                 ...        ...                  ...
8887  738088645987557376  234128524  2016-06-01 19:23:40  In this e-news update: 1) Nominate a vet for C...
8888  738082324638773250  234128524  2016-06-01 18:58:33  Because of great POW/MIA work by History Fligh...
8889  717729778699214848  234128524  2016-04-06 15:04:48  To learn more try the following link: https://...
8890  737729206348570625  234128524  2016-05-31 19:35:23  RT @IndianaDCS: Thx @RepToddYoung for supporti...
8891  737658165966327808  234128524  2016-05-31 14:53:06  Trouble dealing with a federal agency? We're h...

8892 rows × 4 columns

# get the users that appear in the first 100 tweets (wc covers only these rows)
new_tweets = tweets[0:100]
twitter_users = new_tweets['user_id'].unique()

temp_results = []
for user in twitter_users:
    user_tweets = new_tweets[new_tweets['user_id'] == user]['tweet_text'].tolist()
    user_tweets_string = ' '.join(user_tweets)
    user_word_counts = cv.transform([user_tweets_string])  # counts for all of this user's tweets
    total_word_counts = user_word_counts.sum(axis=0)       # column sums -> one row per user
    temp_results.append(total_word_counts.A[0])

wc_accounts = np.array(temp_results)

Step 3: Perform PCA on wc_accounts


Now that we have wc_accounts , we'd like to plot it to see how all these accounts use language. But how can we plot this in two dimensions? Let's use
a form of unsupervised learning called PCA, where each account (i.e., each row) represents one data point. This is the same PCA we
performed on the Iris and Spearman data sets in our labs. We can plot our Twitter account data in exactly the same way we did then.

However, we have one additional wrinkle in our task. It'd be helpful to know which political affiliation each individual Twitter account has.
Fortunately, I have this data for you! In pol_aff.csv I have included each user_ID and their respective political affiliation. Use this csv to get the political
affiliation of each user account. Note that this file includes user_IDs that don't exist in your corpus of tweets, so you'll need to selectively add
political affiliation based on user IDs.
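As a quick shape sanity check (on toy data, not the homework data), PCA maps an (n_samples, n_features) array to (n_samples, n_components), so each account becomes one 2-D point:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = rng.integers(0, 5, size=(14, 828))   # stand-in for wc_accounts: 14 accounts x 828 word types
points = PCA(n_components=2).fit_transform(toy)
print(points.shape)                        # (14, 2): one (PC1, PC2) coordinate pair per account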

question 5 [40 points]


Perform PCA on individual user accounts using the data in wc_accounts . Plot this in 2 dimensions where each data point represents one
Twitter account. Color-code each data point according to the account's political affiliation, with red = republicans and blue = democrats. To make
overlapping data points easier to see, reduce the opacity (i.e., the "alpha" parameter) to 0.3.

You can break this up into several steps:

Step 1: Generate a dataframe called merged_data with two columns: user_id (which comes from tweets ) and affliation (which comes from
pol_aff.csv ). You should have the same number of rows as you have Twitter users from question 2. Here are the commands you'll
need (a sketch of this chain follows the note below):

1. read_csv() (to read in pol_aff.csv from http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv ; you're familiar with this by now, so we've
done it for you)
2. unique() (to get the unique Twitter users in tweets and put them in a new data frame)
3. merge() (to merge the dataframe with user_id into the dataframe with affliation )
4. drop() (to drop any columns in your final dataframe that you don't need)

Note: In the pol_aff.csv dataset, the column affliation is misspelled (make sure you are referring to it as affliation , not
affiliation ).
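A minimal sketch of that read_csv/unique/merge/drop chain, assuming pol_aff.csv has the columns id, screen_name, description, and affliation (the intermediate names here are illustrative):

# read the affiliation file and build a one-column frame of the users in tweets
pol_aff = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv")
users = pd.DataFrame({'user_id': tweets['user_id'].unique()})

# left-merge so only users present in tweets are kept, then drop unneeded columns
merged_data = users.merge(pol_aff, left_on='user_id', right_on='id', how='left')
merged_data = merged_data.drop(columns=['id', 'screen_name', 'description'])
# merged_data now has two columns: user_id and affliation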

Step 2: Next you want to produce a dataframe called Graph in which the rows are Twitter users, and the three columns are each Twitter user's
respective principal component 1 coordinate, principal component 2 coordinate, and affliation. To get the principal components, you'll need
to run PCA on wc_accounts (from question 4) and then place the results of this PCA into a dataframe that you merge with the affliation
column from step 1.

Step 3: Once you've constructed the Graph dataframe in Step 2, just run the code provided to generate your PCA plot.

Put your code for question 5 below:

# Step 1

# insert your code to read the pol_aff csv into a dataframe called "merged_data".

# WRITE CODE HERE

from google.colab import files

uploaded = files.upload()

pol_aff.csv (text/csv) - 90859 bytes, last modified: 5/15/2024 - 100% done
Saving pol_aff.csv to pol_aff (1).csv

merged_data = pd.read_csv("pol_aff.csv")

merged_data


id screen_name description affliation

Official Twitter Account for


0 2962813893 RepStefanik republican
Congresswoman Eli...

Proud to represent New York’s


1 1058256326 RepChrisCollins republican
27th Congression...

Official account of Senator


2 17494010 SenSchumer democrat
Chuck Schumer - Ne...

3 2955485182 SenatorRounds U.S. Senator for South Dakota republican

Proudly representing the 4th


4 339852137 SupJaniceHahn democrat
District of Los A...

... ... ... ... ...

I represent the 3rd District of


663 252819642 RepKevinYoder republican
Kansas in the ...

Follow me on instagram:
664 233693291 RepRickCrawford republican
https://t.co/z0CXmmrsL...

Proudly Representing
665 1128514404 RepDavidValadao republican

toggle_off
C lif i ' 21 t C
Next steps: View recommended plots

# note: the merge below joins on the original 'id' column, so no rename is needed
unique_users = tweets['user_id'].unique()
unique_users_df = pd.DataFrame({'user_id': unique_users})
unique_users_df

user_id

0 15160884

1 76649729

2 21111098

3 1074278209

4 817076257770835968

5 2914571740

6 1058256326

7 140519774

8 2869746172

9 236916916

10 18967498

11 235373000

12 558769636

13 234128524



merged_df = pd.merge(unique_users_df, merged_data, left_on='user_id', right_on='id', how='left')


merged_df

  user_id             id                  screen_name   description                                         affliation
0 15160884            15160884            Randy_Forbes  Representing Virginia's Fourth Congressional D...  republican
1 76649729            76649729            SenAlexander  Follow Sen. Alexander in TN, in DC, and in the...  republican
2 21111098            21111098            SenShelby     The official Twitter page for U.S. Senator Ric...  republican
3 1074278209          1074278209          BilboBagman   Social Progressive but, in this fascist area o...  democrat
4 817076257770835968  817076257770835968  RepEspaillat  U. S. Representative proudly serving New York'...  democrat
...

# Check the shape of wc_accounts
print("Shape wc_accounts:", wc_accounts.shape)
print(wc_accounts[:5])

Shape wc_accounts: (2, 828)


[[0 0 1 ... 0 0 1]
[1 1 1 ... 1 2 0]]

# Step 2
# WRITE CODE HERE

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principal_components = pca.fit_transform(wc_accounts)

Graph = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
# attach each account's affiliation, aligned to the user order in twitter_users
Graph['affliation'] = [merged_df.loc[merged_df['user_id'] == u, 'affliation'].iloc[0]
                       for u in twitter_users]
Graph

   PC1        PC2           affliation
0  31.693848  8.643278e-14  republican
1 -31.693848  8.643278e-14  republican

# Code to run for Step #3 -- DON'T CHANGE THIS, JUST RUN IT

# RUN TO PLOT PCA DATA ONCE YOU HAVE CONSTRUCTED THE GRAPH DATAFRAME

fig = plt.figure(figsize = (8,8))

ax = fig.add_subplot(1,1,1)
ax.set_xlabel('PC1', fontsize = 15)
ax.set_ylabel('PC2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['democrat', 'republican']
colors = ['b','r']
for target, color in zip(targets,colors):
    indicesToKeep = Graph['affliation'] == target
    ax.scatter(Graph.loc[indicesToKeep, 'PC1']
               , Graph.loc[indicesToKeep, 'PC2']
               , c = color
               , s = 50
               , alpha =.3)
ax.legend(targets)
ax.grid()
