Zainab Pate Data PPF #3
Name:
UNI:
Homework 3
Using unsupervised learning, we're going to produce a plot of US politicians via their individual language usage on Twitter, starting with ingesting
the tweets, preparing these tweets for PCA, and finally performing PCA. We're going to use this exercise to reflect on, analyze, and explore how
such practices both reify and challenge our contemporary understanding of community and the public discourse.
You can access them at these URLS (use as the first argument in the read_csv function):
http://www.columbia.edu/~chw2/Courses/PPF/10K2016tweets.csv
http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv
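For example, a minimal sketch of reading one of these files into a dataframe (the variable name here is just illustrative):
import pandas as pd
df = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/10K2016tweets.csv")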
Policies
This assignment is to be done on your own, but you can talk about the assignment with your classmates if you get stuck.
Feel free to also use Stack Overflow, but please provide a citation and a link to the specific answer if you do.
You may use Generative AI, but please note which LLM technologies/resources you used (e.g., ChatGPT (OpenAI), Gemini (Google), etc.),
and you must include your prompt(s).
You may also visit Ananya, Andrew, or Colby during TA office hours or post in the #homework channel on Slack.
Instructions
Be sure to rename this homework notebook so that it includes your name.
Provide your code to justify your answer to each question. Your code must run with the "10K2016tweets.csv" and "pol_aff.csv" files as
originally provided to you.
Where not specified, please run functions with default argument settings.
Only include relevant information in your output. We should be able to see your answer clearly.
Save the PDF in landscape orientation and check that all of your code is shown in the submission.
When submitting in Gradescope, be sure to submit to both the pdf and ipynb assignments.
List any students you talked with about this assignment here:
1. [person 1]
2. [person 2]
3. etc.
We've given you some libraries you'll need for this assignment.
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
# insert your code to read tweet csv into a dataframe called "tweets".
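# A minimal completion (sketch), using the tweets URL given at the top of the assignment:
tweets = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/10K2016tweets.csv")
tweets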
3 735870521502486529 15160884 2016-05-26 16:29:38 Regardless of the cause of this crash-- we nee...
4 735870454116782082 15160884 2016-05-26 16:29:22 Our pilots aren't getting the flying time they...
8887 738088645987557376 234128524 2016-06-01 19:23:40 In this e-news update: 1) Nominate a vet for C...
8888 738082324638773250 234128524 2016-06-01 18:58:33 Because of great POW/MIA work by History Fligh...
8889 717729778699214848 234128524 2016-04-06 15:04:48 To learn more try the following link: https://...
8890 737729206348570625 234128524 2016-05-31 19:35:23 RT @IndianaDCS: Thx @RepToddYoung for supporti...
8891 737658165966327808 234128524 2016-05-31 14:53:06 Trouble dealing with a federal agency? We're h...
1. What day and time was the oldest tweet in the data set posted?
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets['created_at'].min()
Timestamp('2016-01-01 14:02:28')
2. What day and time was the most recent tweet in the data set posted?
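A minimal sketch of the code for this, assuming the created_at conversion above:
tweets['created_at'].max()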
Timestamp('2017-01-29 23:58:06')
8892
4. How many unique twitter users does the data set have? Hint: use the nunique() command.
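A minimal sketch using the hinted command:
tweets['user_id'].nunique()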
14
For an EXAMPLE of using sklearn countvectorizer to produce this wc , see the code in the following 5 cells. Note that this
example just uses three made-up tweets so that it's easier to see what's happening.
from sklearn.feature_extraction.text import CountVectorizer

# create vectorizer
cv = CountVectorizer()
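# Sketch of the intermediate example cells, which aren't captured in this export.
# The three made-up tweets below are placeholders chosen to be consistent with the
# example output shown further down; the original example texts may have differed.
three_texts = ["The animal sat. Run!", "Run run run", "The cat"]
# build the word-count array wc: one row per text, one column per word type
wc = cv.fit_transform(three_texts).toarray()
# word_index maps each word type to its column number in wc
# (an assumption: the original likely used cv.vocabulary_ for this)
word_index = cv.vocabulary_
wc.shape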
# to find out what column number a particular word is, say, "animal" we can use
word_index["animal"]
(3, 5)
0 1 0 1 1 1
1 0 0 3 0 0
2 0 1 0 0 1
array([[1, 0, 1, 1, 1],
       [0, 0, 3, 0, 0],
       [0, 1, 0, 0, 1]])
Now do the same thing we did for the example above, but substitute the "three_texts" list in the above example with the first 100 tweets that
appear in your data set.
You can get a list called "all_tweets" of all your tweet texts using the following code: all_tweets = tweets['tweet_text'] . To get the first
100 tweets, you can use a python slice: all_tweets[0:100] . We've provided this code to get you started.
At the very end of your code for this question, you should output wc just like we did in the example above.
# create vectorizer
cv = CountVectorizer()
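# One way to fill in this cell (a sketch): vectorize the first 100 tweet texts,
# mirroring the example above. all_tweets and the [0:100] slice come from the
# prompt; word_index = cv.vocabulary_ is an assumption carried over from the example.
all_tweets = tweets['tweet_text']
wc = cv.fit_transform(all_tweets[0:100]).toarray()
word_index = cv.vocabulary_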
'virginia': 772,
'abasummit': 33,
'stay': 681,
'warm': 784,
'winter': 806,
'storm': 687,
'tips': 733,
'here': 341,
's1iewdwagy': 617,
'winterstormjonas': 807,
'rbn844aade': 571,
'review': 601,
'future': 299,
'io03mqmqrv': 388,
'americasstreng': 60,
'intro': 383,
'takes': 698,
'steps': 682,
(100, 828)
0 0 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 0 0 0 0 0 0 0 0 0 0 ... 0 0
96 0 0 0 1 0 0 0 0 0 0 ... 0 0
97 0 0 0 0 0 0 0 0 0 0 ... 0 0
98 0 0 0 0 0 0 0 0 0 0 ... 0 1
99 0 0 0 0 0 0 0 0 0 0 ... 0 0
To do this: Make a new array called wc_accounts where each row is a twitter account user and the columns are word token counts for unique
word types. Note that the dataframe "tweets" has a user_id provided for each tweet.
To make wc_accounts use a FOR loop which runs over all user_id s. For each iteration of the FOR loop you should do the following:
1. get all the rows of one user's tweets from tweets in a list and convert this to a numpy array
2. use this array of tweet rows to index into the wc array you built in question 3 (which returns the word counts of that user's tweets), and use the sum()
command on the resulting array to produce a single row with total word counts for one user. When taking the sum, be careful that you
take the COLUMN sum. For the numpy sum functions ( np.ndarray.sum() or np.sum() ), you have to SPECIFY the axis you sum over.
The plain Python sum() function gives the column sum.
3. take this single row of total word counts for one user and append it to a list called temp_results .
Once the for loop is completed, you can save temp_results as wc_accounts .
To get you started, here's a way to get a list of all user_id s, which you need to run your FOR loop: twitter_users = tweets['user_id'].unique() . We've provided this code.
We've also provided you with some additional code. Your job is to fill in the appropriate commands in the FOR loop provided.
Once you're done, be sure to output wc_accounts . (It will resemble the output for wc , but with different numbers.) We've provided this code for
you too.
tweets

0  752888708320329729  15160884  2016-07-12 15:33:50  RT @HouseJudiciary: .@Randy_Forbes to @Loretta...
1  752888783922683904  15160884  2016-07-12 15:34:08  RT @RebeccaARainey: @Randy_Forbes grills @Lore...
...
# list of all user_ids (provided)
twitter_users = tweets['user_id'].unique()

temp_results = []
for user in twitter_users:
    # this user's tweet texts joined into one string (assuming the full "tweets"
    # dataframe; a "new_tweets" cell isn't shown in this export)
    user_tweets = tweets[tweets['user_id'] == user]['tweet_text'].tolist()
    user_tweets_string = ' '.join(user_tweets)
    # word counts from the question-3 vectorizer, column-summed into one row per user
    user_word_counts = cv.transform([user_tweets_string])
    total_word_counts = user_word_counts.sum(axis=0)
    temp_results.append(total_word_counts.A[0])
wc_accounts = np.array(temp_results)
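A quick sanity check (a sketch): with one row per user and the vocabulary fitted on the first 100 tweets, wc_accounts.shape should come out as (14, 828) here.
wc_accounts.shape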
Now that we have wc_accounts , we'd like to plot this to see how all these accounts use language. But how do we plot this in two dimensions? Let's use
a form of unsupervised learning called PCA, where each account (i.e., each row) represents one data point. This is the same as the PCA we
performed on the Iris and Spearman data sets in our labs. We can plot our twitter account data in exactly the same way we did then.
However, we have one additional wrinkle in our task. It'd be helpful to know which political affiliation each individual twitter account has.
Fortunately, I have this data for you! In pol_aff.csv I have included each user_ID and their respective political affiliation. Use this csv to get the political
affiliation of each user account. Note that this file includes user_IDs that don't exist in your corpus of tweets, so you'll need to selectively add
political affiliation based on user IDs.
Step 1: Generate a dataframe called merged_data with two columns: user_id (that comes from tweets ) and affliation (that comes from
pol_aff.csv ). You should have the same number of rows as you have twitter users that you found in question 2. Here are the commands you'll
need:
1. read_csv() (to read in pol_aff.csv from http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv, you're familiar by now so we've
done this for you)
2. unique() (to get unique twitter users in tweets & put them in a new data frame)
3. merge() (to merge dataframe with user_id with dataframe with affliation )
4. drop() (to drop any columns you have in your final dataframe that you don't need)
Note: In the pol_aff.csv dataset, the column affliation is misspelled (make sure you are referring to it as affliation , not
affiliation ).
Step 2: Next you want to produce a dataframe called Graph in which the rows are twitter users, and the three columns are the twitter user's
respective principal component 1 coordinates, principal component 2 coordinates, and affliation. To get the principal components, you'll need
to do PCA on wc_accounts (from question #4) and then place the results of this PCA into a dataframe that you merge with the affliation
column from Step 1.
Step 3: Once you've constructed the Graph dataframe in Step 2, just run the code provided to generate your PCA plot.
# Step 1
# read pol_aff.csv into a dataframe called "merged_data" (provided).
merged_data = pd.read_csv("http://www.columbia.edu/~chw2/Courses/PPF/pol_aff.csv")
merged_data
...
664  233693291   RepRickCrawford  Follow me on instagram: https://t.co/z0CXmmrsL...  republican
665  1128514404  RepDavidValadao  Proudly Representing California's 21st C...        republican
merged_data = merged_data.rename(columns={'id': 'user_id'})  # assign back so the rename persists
unique_users = tweets['user_id'].unique()
user_id
0 15160884
1 76649729
2 21111098
3 1074278209
4 817076257770835968
5 2914571740
6 1058256326
7 140519774
8 2869746172
9 236916916
10 18967498
11 235373000
12 558769636
13 234128524
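The cell that actually performs the merge isn't visible above. A minimal sketch, assuming the rename above took effect so that pol_aff's id column is now called user_id:
users_df = pd.DataFrame({'user_id': unique_users})                    # one row per unique twitter user
merged_data = users_df.merge(merged_data, on='user_id', how='left')   # attach each user's affliation
# drop() can then remove any columns you don't need, e.g. (hypothetical column name):
# merged_data = merged_data.drop(columns=['description'])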
merged_data
0  15160884            15160884            Randy_Forbes  Representing Virginia's Fourth Congressional D...  republican
1  76649729            76649729            SenAlexander  Follow Sen. Alexander in TN, in DC, and in the...  republican
2  21111098            21111098            SenShelby     The official Twitter page for U.S. Senator Ric...  republican
3  1074278209          1074278209          BilboBagman   Social Progressive but, in this fascist area o...  democrat
4  817076257770835968  817076257770835968  RepEspaillat  U. S. Representative proudly serving New York’...  democrat
...
print(wc_accounts[:5])
# Step 2
# WRITE CODE HERE
pca = PCA(n_components=2)
principal_components = pca.fit_transform(wc_accounts)

Graph = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
Graph['affliation'] = merged_data['affliation']  # Assuming merged_data contains 'affliation'
Graph
PC1 PC2 affliation
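Step 3's provided plotting cell isn't captured in this export; a minimal sketch of the PCA scatter plot, colored by affliation (the color mapping is an assumption):
colors = {'republican': 'red', 'democrat': 'blue'}
plt.figure(figsize=(8, 6))
for aff, group in Graph.groupby('affliation'):
    plt.scatter(group['PC1'], group['PC2'], label=aff, color=colors.get(aff))
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.legend()
plt.show()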