
Social and Web Computing

Gareth Tyson


How did the web work?

● More symmetrical power

● More resilient?
● No central points of failure?
● Superior privacy?
How does the web work?

● Critical mass of users?

● Orchestrated deployments?
● Economies of scale?
How does the Fediverse work?
How does Mastodon work?

Learning objectives
1. To understand the importance of social messaging applications, and
how we can gather data from them

2. To understand how information and misinformation spread on
these messaging platforms

3. To understand ways to mitigate abuse on these platforms, and

some further
What do you think when you
hear “social media”?
But this only covers a fraction of
online communications…
Who uses WeChat?
Questions you might want to ask…
• How much data do you provide WeChat?

• How much data does WeChat expose?

• To whom is that data exposed?

• What are the risks of this centralization?

• How can such platforms be misused?

Next question:
Who uses WhatsApp?
You’re not alone…

• WhatsApp is one of the most popular social apps in the world

• 1.5 billion active users each day!
• 5 billion downloads from the Android Play Store alone!
• Over 60 billion texts, 100 million audio and 55 million video calls daily!
WhatsApp Data Collection for Social
Computing Studies

Kiran Garimella and Gareth Tyson. WhatsApp, Doc? A First Look at WhatsApp Public Group Data. In 12th International AAAI Conference on Web and Social Media (ICWSM), Stanford, CA (2018).
Getting data from WhatsApp is tough!
• About 80-90% of messages are unicast

• There are no Application Programming Interfaces providing control of


• End-to-end encryption prevents even WhatsApp from gathering message content

But WhatsApp also supports groups

So, can we collect data from
these public groups?
Data collection options

• Manually
• Web WhatsApp
• Rooted broken phone
Step 1: Obtain a list of public group URLs
• Use public listings, e.g.
• Use search engines or social media
• Search for
• And then manually filter
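The link-gathering step could be sketched as below. This is only an illustration: the slides do not show the search pattern, so the `chat.whatsapp.com/<code>` link shape, the code-length bounds, and the function name `extract_invite_links` are all assumptions.

```python
import re

# Assumed shape of a public group invite link (not taken from the slides).
INVITE_RE = re.compile(
    r"https?://chat\.whatsapp\.com/(?:invite/)?([A-Za-z0-9]{16,32})"
)


def extract_invite_links(pages):
    """Return de-duplicated invite URLs found in text blobs, in first-seen order."""
    seen = {}
    for text in pages:
        for match in INVITE_RE.finditer(text):
            code = match.group(1)
            # Normalise every variant of the link to one canonical form.
            seen.setdefault(code, f"https://chat.whatsapp.com/{code}")
    return list(seen.values())
```

The resulting list would still need the manual filtering step, since a pattern match says nothing about what a group is about.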
Step 2: Join the groups
• Create a dedicated WhatsApp account

• Run our script (which uses the interface)

• It takes a list of WhatsApp join URLs and programmatically joins them
Step 3: Receive the updates
• Messages will start to come through (via the phone app)
• Our script extracts them from the phone’s local SQLite database file
• storage/WhatsApp/Databases/msgstore.db.crypt12

• But it is encrypted - this is where rooting is necessary
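Once the database has been decrypted, reading it is ordinary SQLite access. The sketch below assumes a simplified, hypothetical schema (a `messages` table with `group_id`, `sender`, `sent_at` and `body` columns); the real layout of `msgstore.db` is not given in the slides and is likely to differ.

```python
import sqlite3


def read_group_messages(conn):
    """Group rows from a (hypothetical) messages table by group, oldest first."""
    query = (
        "SELECT group_id, sender, sent_at, body "
        "FROM messages ORDER BY sent_at"
    )
    grouped = {}
    for group_id, sender, sent_at, body in conn.execute(query):
        grouped.setdefault(group_id, []).append((sender, sent_at, body))
    return grouped
```

In practice `conn` would come from `sqlite3.connect(path_to_decrypted_file)`; passing the connection in keeps the function easy to exercise against an in-memory database.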
What sort of data might you see?

• Group metadata
• Text content
• User behaviour
• Geographic information

COVID-19 (Mis)Information
Sharing on WhatsApp

Rana Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal, Gareth Tyson, Ignacio Castro and Kiran Garimella. A Deep Dive into COVID-19-Related Messages on WhatsApp
in Pakistan. In Social Network Analysis and Mining (SNAM) (2022).
How are public WhatsApp
groups used to share COVID-19
Data collection (Step 1)
• We compiled a list of 227 political groups in Pakistan using
Google and Twitter

• 60K messages from 18.5K users

Message Type    #        %
Text            28.5K    47%
Images          14.6K    24.5%
Videos          2.6K     18.6%
URLs            3.2K     2.5%
Data collection (Step 2)
• Next, we need to extract COVID-19-related messages!
• We compiled a list of keywords, e.g. covid, covid19
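A keyword filter of this kind might look as follows. It is a minimal sketch: only the two example keywords from the slide are used (the full list is not shown), and matching on lower-cased alphanumeric tokens is an assumption.

```python
import re

# Only the two examples given on the slide; the full keyword list is unknown.
COVID_KEYWORDS = frozenset({"covid", "covid19"})


def is_covid_related(text, keywords=COVID_KEYWORDS):
    """True if any lower-cased alphanumeric token of the text is a keyword."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return any(token in keywords for token in tokens)


def filter_covid_messages(messages, keywords=COVID_KEYWORDS):
    """Keep only the messages that mention at least one keyword."""
    return [m for m in messages if is_covid_related(m, keywords)]
```

Token matching means "COVID-19" is caught through the token "covid", while a keyword embedded in a longer word is not.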

5K messages
across the
Data collection (Step 3)
• But this won’t work for images…
• Two annotators tagged a total of 6,699 images

35% of
images are
Let’s ask some
questions of our
What type of messages are shared?
• Majority of content is simple information, e.g. news articles, government actions
• Large volume of religious commentary
• But 14% of the total messages are
What types of misinformation are shared?
• Fake news accounts for 45% of misinformation texts, e.g.
  • COVID-related deaths of world figures such as Ivanka Trump, Prince William, Imran Khan
  • Conspiracy theories about Bill Gates intending to place RFID chips in people to track COVID-19
• Fake origins are also prominent, e.g.
  • COVID-19 developed in a research lab in Lake Corona in Kazakhstan
  • Predicted in films such as Resident Evil
• Fake remedies are less prominent but circulate for longer
How long do these messages last..?
Most prominent in the tail: 2% of misinformation exceeds 100 hours
Let’s drill into the details…

• ‘Fake News’ category has the shortest lifespan
• ‘Fake Remedies’ category has a mean life of 10 hrs
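One way to compute such lifespans is sketched below. It assumes a message's lifespan is the time between its first and last sighting across all groups; the slides do not state the exact definition, and the function names are illustrative.

```python
from collections import defaultdict


def lifespans_hours(sightings):
    """sightings: iterable of (message_id, unix_timestamp_in_seconds)."""
    first, last = {}, {}
    for message_id, ts in sightings:
        first[message_id] = min(ts, first.get(message_id, ts))
        last[message_id] = max(ts, last.get(message_id, ts))
    # Assumed definition: last sighting minus first sighting, in hours.
    return {m: (last[m] - first[m]) / 3600 for m in first}


def mean_lifespan_by_category(lifespans, categories):
    """Average lifespan per category; categories maps message_id -> label."""
    buckets = defaultdict(list)
    for message_id, hours in lifespans.items():
        buckets[categories[message_id]].append(hours)
    return {label: sum(v) / len(v) for label, v in buckets.items()}
```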
Who shares what?
The majority
Only 37 users
exclusively shared
Does content spread across platforms?
• We gathered 67K Twitter images using hashtags, e.g.
#CovidPakistan, #CoronaFreePakistan
• 1.5K shared across both WhatsApp and Twitter.
• 1/3 were COVID-19 related
• Largest category shared across both Twitter and
WhatsApp is misinformation (29%)
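Finding images shared on both platforms amounts to a set intersection over image fingerprints. The sketch below uses exact SHA-256 digests of the raw bytes, which is a simplifying assumption: the slides do not say how images were matched, and exact hashing misses resized or re-encoded copies.

```python
import hashlib


def fingerprint(image_bytes):
    """Exact-content fingerprint; only byte-identical images collide."""
    return hashlib.sha256(image_bytes).hexdigest()


def shared_images(whatsapp_images, twitter_images):
    """Return fingerprints of images that appear in both collections."""
    on_whatsapp = {fingerprint(b) for b in whatsapp_images}
    on_twitter = {fingerprint(b) for b in twitter_images}
    return on_whatsapp & on_twitter
```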
Who influences whom?
Okay, so maybe graphs are useful for
understanding the interconnection of
these groups…?
Let’s look at how graph data can
be used

Using data from Brazil

Data gathered from Brazil

• Truck drivers’ strike in Brazil

• May 21st to June 2nd 2018

• Brazilian presidential elections

• August 16th to October 7th 2018
What type of images are shared?
A look at the group network

Truck drivers’ strike | Election

A look at the user network
Can we use these graphs to study
misinformation spread?
Labelling misinformation
Misinformation spread
• Each node is a group
• An edge indicates that the group spread information to another group
• Size of a node represents the number of images with misinformation posted on that group
• Color represents the total number of images that were “first seen” in that group
• Few groups are responsible for spreading misinformation
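The graph described above could be assembled from raw posts roughly as follows. This sketch assumes an edge runs from the group where an image was first seen to every other group that later posted it; the slides do not spell out the edge rule, and the names are illustrative.

```python
from collections import Counter, defaultdict


def build_spread_graph(posts):
    """posts: iterable of (image_id, group_id, timestamp).

    Returns (edges, posted, first_seen):
      edges[(src, dst)] -> images first seen in src that later appeared in dst
      posted[group]     -> distinct misinformation images posted (node size)
      first_seen[group] -> images whose earliest sighting was here (node colour)
    """
    by_image = defaultdict(list)
    for image_id, group_id, ts in posts:
        by_image[image_id].append((ts, group_id))

    edges, posted, first_seen = Counter(), Counter(), Counter()
    for sightings in by_image.values():
        sightings.sort()  # earliest sighting first
        origin = sightings[0][1]
        first_seen[origin] += 1
        for group_id in {g for _, g in sightings}:
            posted[group_id] += 1
            if group_id != origin:
                edges[(origin, group_id)] += 1
    return edges, posted, first_seen
```

A handful of groups with high `first_seen` counts and many outgoing edges would correspond to the few groups responsible for most of the spread.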
How does this differ from the web?
• The authors used Google to find webpages that host the same misinformation
• Twitter shares many misinformation images…but who influences whom?
• The central node represents all
WhatsApp groups
• Color represents average time
difference between
appearance of image on
WhatsApp and on the specific

• Images that were first

published on the Web take
much longer to reach the
WhatsApp groups (more than
a year) than the other way
around (only a few days) for
both types of images
But images aren’t the only

Maros, Alexandre, et al. "Analyzing the use of audio messages in Whatsapp groups." Proceedings of The Web Conference 2020. 2020.
Audio messages are growing in

Let’s ask some questions…
• RQ1: What are the characteristics of audio messages in terms of
content properties and propagation dynamics?

• RQ2: What are the properties of audio content (e.g., gender of

speaker, music versus speech content) and how do these properties
correlate with propagation dynamics?
Data summary

32% (truckers) and 21% (election) of all users in the monitored groups shared 1+ audio message
How long are messages?
What is in the audio messages?
• Used LIWC to categorize words in
the transcript
• Calculate relative difference
between messages shared more
than 20 times vs. a single time
• Most popular words were related to
• Sad emotions, negations, needs,
achievement, family, work, time,
money, anxiety, and future
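The comparison between widely shared and single-share messages can be illustrated with a toy lexicon. LIWC itself is a licensed dictionary, so `TOY_LEXICON` below is a hypothetical stand-in, and the relative-difference formula (popular rate minus single rate, divided by single rate) is an assumption about how the comparison was computed.

```python
import re
from collections import Counter

# Hypothetical stand-in for LIWC categories; not the real dictionary.
TOY_LEXICON = {
    "money": {"money", "pay", "salary"},
    "family": {"family", "mother", "son"},
}


def category_rates(transcripts, lexicon):
    """Fraction of all tokens that fall into each lexicon category."""
    counts, total = Counter(), 0
    for text in transcripts:
        for token in re.findall(r"[a-z']+", text.lower()):
            total += 1
            for category, words in lexicon.items():
                if token in words:
                    counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in lexicon}


def relative_difference(popular, single, lexicon):
    """(popular - single) / single per category; skips categories absent in single."""
    p = category_rates(popular, lexicon)
    s = category_rates(single, lexicon)
    return {c: (p[c] - s[c]) / s[c] for c in lexicon if s[c] > 0}
```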
Wow. So, misbehavior is common!

What else
might happen…
Dissemination of in public
“Spam” is
unsolicited and
unwanted messages
sent out in bulk

“Ham” refers to the

Data collection
• Gathered data from 5,051 political groups:
2.6 million messages posted by over 172K users
• We keep messages in Hindi, English, Telugu and Tamil
(74%) and filter out boilerplate

• Labeled posts as spam vs ham

• Identify similar text and images and group them into
“message clusters” aka spam campaigns
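Grouping similar text into message clusters can be approximated by normalising each message and bucketing identical results. This is a deliberately simple sketch: the slides do not describe the similarity measure, so masking URLs and digits before exact matching is an assumption.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case, mask URLs and numbers, and collapse whitespace."""
    text = re.sub(r"https?://\S+", "<url>", text.lower())
    text = re.sub(r"\d+", "<num>", text)
    return " ".join(text.split())


def cluster_messages(messages):
    """Return lists of messages that share the same normalised form."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[normalise(message)].append(message)
    return list(clusters.values())
```

Masking numbers and links lets one campaign template match even when each copy carries a different amount or tracking URL.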
Identifying spam
1. Create a ground truth
  • Identify a seed set of users who were manually removed from at least two groups by their admins (257 users, 68K messages)
  • Use human annotators to manually label frequently seen messages as spam or ham
2. Construct a dictionary of spam words
  1. Extract commonly occurring words (5 times)
  2. Manually filter strong signals to produce 324 spam words
  3. Extract all messages containing spam words from frequently sent
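The dictionary-building steps might be sketched as below. Reading "5 times" as a minimum count of 5 is an assumption, and the manual filtering step is represented only by whichever word set the caller passes to `contains_spam_word`.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def candidate_spam_words(spam_messages, min_count=5):
    """Words seen at least min_count times in annotated spam (assumed threshold)."""
    counts = Counter(tok for msg in spam_messages for tok in tokenize(msg))
    return {word for word, count in counts.items() if count >= min_count}


def contains_spam_word(message, spam_words):
    """True if the message contains any word from the curated dictionary."""
    return any(tok in spam_words for tok in tokenize(message))
```

The candidates returned by `candidate_spam_words` would still need the manual pass that reduces them to strong signals.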
Who spreads spam?

• Not individual phone numbers!

• And large clusters of messages tend to be spam
  • Mean 83.6 in spam clusters
  • …vs 35 for ham clusters

• Mostly from India but…

What is contained in spam?
• Jobs largely containing phone numbers
• Ham tend to not include URLs or phone numbers
• Yet click & earn are mostly URLs
• Over half of spam contains
How long does spam circulate?
• Spam messages live much longer
• But non-spam users circulate for longer
Why do non-spammers live longer?
• The majority of removals are spammers
• Few spammers are added by
How can spammers avoid removal?
• Under ½ days during spam
campaign are active with 10+
• With noticeable outliers
• Get Free Win Award!
• Pay with Reward!

• Spammers also tend to leave & join regularly
But how can we deal with spam
on WhatsApp?

Particularly with end-to-end encryption!

What is end-to-end encryption?

Hi Holly! How are you?
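The idea can be illustrated with a toy example in which only the two endpoints hold the key and the server merely relays bytes it cannot read. This is not the protocol WhatsApp uses; the XOR one-time pad and the function names are assumptions chosen purely for illustration.

```python
import secrets


def xor_bytes(data, key):
    """XOR data with a key of at least the same length (toy one-time pad)."""
    if len(key) < len(data):
        raise ValueError("key must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


def relay(ciphertext):
    """The server forwards the bytes unchanged; it never sees the key."""
    return ciphertext


message = b"Hi Holly! How are you?"
shared_key = secrets.token_bytes(len(message))  # known only to sender and receiver
ciphertext = xor_bytes(message, shared_key)     # what the server gets to see
received = xor_bytes(relay(ciphertext), shared_key)
```

Because the server only handles `ciphertext`, any content-based filtering it might want to run has nothing readable to work with.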

So, let’s build some spam classifiers
Classifier 1: Let’s assume we don’t have end-to-end encryption…so
we can run a text spam classifier on the server (similar to email)
Content-based spam detection on-server

• We use an off-the-shelf
email spam classifier

• Accuracy of 87%

Relies on text
Classifier 2: The problem is that message text is inaccessible to the server
because of end-to-end encryption!
Metadata-based spam detection on-server
• We build a ‘metadata’ classifier
• Create user profiles containing counts for the different actions
• Train a Random Forest Classifier
• Accuracy of 90%

Hmmm, but isn’t this

Feature                 Importance
Posted message          0.52
Non-domestic number     0.15
Posted URL              0.12
Joined via link         0.08
Posted phone number     0.05
Left group              0.04
Added by member         0.023
Added by admin          0.021
Removed from group      0.01
Number changed          0.003
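The per-user profiles of action counts could be built along these lines. The action names mirror the feature table but their identifiers are made up for this sketch, and the "non-domestic number" feature is left out because it is a property of the phone number rather than a counted action.

```python
from collections import Counter

# Illustrative identifiers for the counted actions in the feature table.
ACTIONS = [
    "posted_message", "posted_url", "posted_phone_number",
    "joined_via_link", "left_group", "added_by_member",
    "added_by_admin", "removed_from_group", "number_changed",
]


def build_profiles(events):
    """events: iterable of (user_id, action). Returns user_id -> count vector."""
    counts = {}
    for user_id, action in events:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        counts.setdefault(user_id, Counter())[action] += 1
    # Fixed feature order so every user maps to a comparable vector.
    return {u: [c[a] for a in ACTIONS] for u, c in counts.items()}
```

Each vector, paired with a spam or ham label, could then be fed to a random forest implementation such as scikit-learn's `RandomForestClassifier`; no message text is needed, which is the point of the metadata approach.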
Classifier 3: Why not run a classifier on the user’s device?!
Content & metadata detection
• We can use both user profile and text-based scores
• We train local Random Forest model for each group on device
• 86% mean accuracy across

~80% of groups do well!
Others very

Let’s conclude with some
Challenges in using WhatsApp for social
computing studies
• Public messaging group data is highly biased

• We only ever have a lower bound of activity

• Groups are independent and may be managed differently

• Difficult to definitively link behaviours across platforms or identify

Further Reading
• Pushkal Agarwal, Aravind Raman, Damilola Ibosiola, Nishanth Sastry, Kiran Garimella,
Gareth Tyson. “Countering Spam in the Era of End-to-End Encryption: A Study of Indian
Political WhatsApp Groups”. In Web Conference (WWW), Lyon, France (2022).
• Rana Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal,
Gareth Tyson, Ignacio Castro, Kiran Garimella. “A First Look at COVID-19 Messages on
WhatsApp in Pakistan”. In IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining (ASONAM), The Hague, Netherlands (2020).
• Resende, Gustavo, et al. “(Mis)information dissemination in WhatsApp: Gathering,
analyzing and countermeasures”. In The World Wide Web Conference (2019).
• Resende, Gustavo, et al. “Analyzing textual (mis)information shared in WhatsApp
groups”. In Proceedings of the 10th ACM Conference on Web Science (2019).
• Kiran Garimella, Gareth Tyson. “WhatsApp, Doc? A First Look at WhatsApp Public Group
Data”. In AAAI International Conference on Web and Social Media (ICWSM), Stanford, CA
(2018).
