
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi – 590 014

Assignment work -1
Module-1
“CASE-STUDY ON TWITTER”
Submitted in partial fulfilment of the requirements for
BIG DATA ANALYTICS [18CS72]
IN
7th SEM COMPUTER SCIENCE AND ENGINEERING 2022-2023
Submitted by:
Neha K K 1GG19CS028
Deepak D Gowda 1GG19CS012
Harshitha M V 1GG19CS017
Sowmya S Tawarkhed 1GG19CS041
UNDER THE GUIDANCE OF
Dr Shabeen Taj G
Assistant Prof. Dept of CS&E
GEC Ramanagara

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


GOVERNMENT ENGINEERING COLLEGE, B.M. ROAD, RAMANAGARA - 562 159
CONTENTS

1. INTRODUCTION

2. LITERATURE SURVEY

3. FRAMEWORK

4. METHODOLOGY

5. TOOLS

6. CONCLUSION

7. REFERENCES
ABSTRACT

With the huge growth of social media, and especially with 500 million Twitter messages being posted
per day, analyzing these messages has attracted intense interest from researchers. Topics of interest
include micro-blog summarization, breaking news detection, opinion mining and discovering
trending topics. In information extraction, researchers face challenges in applying data mining
techniques because tweets are short compared with normal, longer text documents. Short messages
lead to less accurate results. This has motivated the investigation of efficient algorithms to overcome
problems that arise from the short and often informal text of tweets. Another challenge that
researchers face is stream data, which refers to the huge and dynamic flow of text generated
continuously from social media. In this paper, we discuss the possibility of implementing successful
solutions that can be used to overcome the inconclusiveness of short texts. In addition, we discuss
methods that overcome stream data problems.
INTRODUCTION

By the term social media, we mean Internet-based applications that include methods for
communication among their users. One of the fastest growing social media applications is Twitter.
Currently, Twitter gains about 135,000 new users every day, and it had a total of 645,750,000 users in
2013. Social networks have received the attention of analysts and researchers because decision
makers rely on statistics, such as summaries of people's opinions, that can be obtained from analysis
of social media. We focus on Twitter as a case study in this paper because it has become a tool
that can help decision makers in various domains connect with changing and disparate groups of
consumers and other stakeholders at various levels. The reason is that Twitter posts reflect people's
instantaneous opinions regarding an event or a product, and these opinions spread quickly [39].

As researchers, we concentrate on Twitter for three reasons. The first reason for choosing Twitter
is its popularity. Enormous numbers of people constantly post on Twitter regarding many varied
topics. Topics could be politics, sports, religion, marketing, people’s opinions or friends’
conversations. Being a huge, constantly updated repository of facts, opinions, banter and other
minutiae, Twitter has received a large amount of attention from business leaders, decision makers,
and politicians.

This attention comes from the desire to know people's views and opinions regarding specific topics
[71]. The second reason for using Twitter is the structure of its data, which is easy for software
developers to deal with. The data is structured in such a way that all information regarding a tweet
is rolled into one block using the JSON file format. A block consists of many fields covering user
information, the tweet description and the re-tweet status.

This type of structure eases the difficulty of mining for specific information, such as tweet content,
while ignoring other details such as the user or re-tweet status. Finally, Twitter provides data filtering
based on features such as retrieving tweets in a specific language or from a certain location. This
flexibility in retrieving data encourages developers to perform research and analysis using Twitter.
However, there is a limit on the amount of data that can be retrieved within a certain period of time;
to retrieve more than 5% of all tweets, developers need special permission from Twitter.
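The sketch below illustrates how a single tweet block might be parsed so that only the tweet content is mined while user and re-tweet details are ignored. The sample payload and its field names (e.g. "user", "text", "retweeted_status") are assumptions based on the classic Twitter tweet object; the exact fields depend on the API version used.

```python
import json

# Hypothetical tweet block; field names are assumptions modelled on the
# classic Twitter tweet object, not an exact API response.
raw_block = '''
{
  "id_str": "1234567890",
  "text": "Big data analytics on Twitter streams",
  "lang": "en",
  "user": {"screen_name": "example_user", "followers_count": 42},
  "retweeted_status": null
}
'''

tweet = json.loads(raw_block)

# Mine only the tweet content, ignoring user and re-tweet details,
# and keep only English-language, non-retweet entries.
if tweet.get("lang") == "en" and tweet.get("retweeted_status") is None:
    print(tweet["text"])
```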
LITERATURE REVIEW

Datasets

Analysis of structured data has been widely practised; in such cases, a traditional Relational Database
Management System (RDBMS) can handle the data. With the increasing amounts of
unstructured data from various sources (e.g. the Web, social media and blogs) that are considered
Big Data, a single computer processor cannot process such huge amounts of data. Hence, an
RDBMS cannot deal with this unstructured data; a non-traditional database, called a NoSQL database,
is needed to process it. Most studies have focused on tools such as R (a programming
language and software environment for data analysis).

R has limitations when processing Twitter data and is not efficient at dealing with large volumes of
data. To solve this problem, a hybrid big data framework is usually employed, such as Apache
Hadoop (an open-source Java framework for processing and querying vast amounts of data on
large clusters of commodity hardware) [14]. Hadoop also deals with structured and semi-structured
data, for example XML/JSON files. The strength of Hadoop lies in storing and processing
large volumes of data, while the strength of R lies in analyzing the already-processed data.
There are different types of Twitter data, such as user profile data and tweet messages. The former
is considered static, while the latter is dynamic.

Tweets can be textual, images, videos, URLs or spam. Most studies do not take spam tweets
and automatic tweet engines into account, as they can affect the accuracy and
add noise and bias to analysis results. In [3], a Firefox add-on and the Clean Tweet
filter were employed to remove users that had been on Twitter for less than a day, and
tweets that contained more than three hashtags were also removed.
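A minimal sketch of such a spam filter is shown below. It applies the two heuristics described in [3] (accounts younger than one day, tweets with more than three hashtags); the function name, thresholds and sample data are illustrative assumptions, not the exact filter used in that study.

```python
def is_probable_spam(tweet_text, account_age_days):
    """Flag tweets from very new accounts or with excessive hashtags,
    loosely mirroring the Clean Tweet style filtering described in [3]."""
    too_new = account_age_days < 1            # user on Twitter for less than a day
    too_many_tags = tweet_text.count("#") > 3  # more than three hashtags
    return too_new or too_many_tags

# Hypothetical (tweet, account age in days) pairs.
tweets = [
    ("Check #big #data #analytics #now win prizes", 0.5),
    ("Interesting paper on streaming trend detection", 120),
]

clean = [text for text, age in tweets if not is_probable_spam(text, age)]
print(clean)
```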

Data Retrieval

Before retrieving the data, some questions should be addressed: What are the characteristics of
the data? Is the data static, such as the user profile information (name, user ID and bio), or dynamic,
such as the user's tweets and the user's network? Why is the data important? How will the data be
used? And how big is the data? It is important to note that it is easier to track a keyword
attached to a hashtag than a keyword that is not. The Twitter API is a widely used
interface to retrieve, read and write Twitter data. Other studies, as in [2], have used GNU/GPL
applications such as the YourTwapperKeeper tool, a web-based application that stores social
media data in MySQL tables. However, YourTwapperKeeper exhibits limitations in storing and
handling large amounts of data, as MySQL and spreadsheet databases can only store a
limited amount of data. Using a hybrid big data technology, as suggested above, might address
such limitations.
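For illustration, the sketch below retrieves recent tweets for a hashtag over the Twitter API's recent-search endpoint. The bearer token is a placeholder, and the endpoint, query operators and field names assume the v2 recent-search interface; access limits and required developer credentials depend on the account's API tier.

```python
import requests

# Placeholder credential; a real token must be obtained from the Twitter developer portal.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

def search_hashtag(tag, max_results=10):
    """Retrieve recent English-language tweets containing a hashtag."""
    url = "https://api.twitter.com/2/tweets/search/recent"
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {"query": f"#{tag} lang:en", "max_results": max_results}
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

for tweet in search_hashtag("CityLogistics"):
    print(tweet["id"], tweet["text"])
```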

Ranking and Classifying Twitter Users

There are different types of user networks: a network of users within a specific event (hashtag),
a network of users around a specific user's account, and a network of users within a group in the
network, that is, Twitter Lists. Lists are used to group sets of users into topical or other categories
to better organize and filter incoming tweets [24]. To rank Twitter users, it is important to study the
characteristics of Twitter by studying the network topology (number of followers/followed) for
each user in the dataset. Many techniques have been employed in ranking analysis. In [3], Twitter
users are ranked by the number of followers, by PageRank, and by retweet rate. That study used
41.7 million user profiles, 1.47 billion social relations, and 106 million tweets. In [24], a new
methodology is introduced to rank Twitter users by using Twitter Lists to classify users into elite
users (celebrities, news media, politicians, bloggers, and organizations) and ordinary users.
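The sketch below shows PageRank-based ranking on a toy follower graph, one of the ranking signals used in [3]. The user handles and edges are made up for illustration, and the damping factor of 0.85 is a common default rather than a value taken from that study.

```python
import networkx as nx

# Toy follower graph: an edge (a, b) means user a follows user b.
follows = [
    ("alice", "brand_x"), ("bob", "brand_x"), ("carol", "brand_x"),
    ("carol", "bob"), ("dave", "carol"), ("brand_x", "carol"),
]

G = nx.DiGraph(follows)

# PageRank over the follower graph; users followed by well-followed users rank higher.
scores = nx.pagerank(G, alpha=0.85)

for user, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user:10s} {score:.3f}")
```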

Homophily

Homophily is defined as the tendency for contacts among similar users to occur at a higher rate than
among dissimilar users [3]; that is, similar users tend to follow each other. Studying it requires the
static characteristics of Twitter data, such as the profile name and the geographic attributes of each
user in the Twitter network. Both [3] and [24] studied homophily on Twitter: [3] studied the
geographical attributes of users to investigate the similarity between users based on their location,
while in [24] homophily was studied using Twitter Lists to identify the similarity between elite and
ordinary users.
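A simple way to quantify location-based homophily is to compare the fraction of follow edges connecting same-location users against a random baseline, as in the sketch below. The user locations, edges and baseline procedure are illustrative assumptions, not the measures used in [3] or [24].

```python
import random

# Hypothetical user locations and follow edges (a follows b).
location = {"u1": "Bengaluru", "u2": "Bengaluru", "u3": "Delhi",
            "u4": "Delhi", "u5": "Bengaluru"}
edges = [("u1", "u2"), ("u2", "u5"), ("u3", "u4"), ("u1", "u3"), ("u5", "u1")]

def same_location_rate(edge_list):
    """Fraction of follow edges that connect users sharing a location."""
    return sum(location[a] == location[b] for a, b in edge_list) / len(edge_list)

observed = same_location_rate(edges)

# Crude baseline: rate expected if ties were formed at random among the same users.
random.seed(0)
users = list(location)
random_edges = [(a, b) for a, b in
                ((random.choice(users), random.choice(users)) for _ in range(1000))
                if a != b]
baseline = same_location_rate(random_edges)

print(f"observed {observed:.2f} vs random baseline {baseline:.2f}")
```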
FRAMEWORK

The framework of the proposed approach is as follows. Twitter datasets are collected using the
Twitter API. Each tweet passes through a domain knowledge inference module, in which AlchemyAPI
is used to infer tweet taxonomies. A big data infrastructure is used for data storage. A metric
incorporating a number of attributes based on user analysis and content analysis is investigated in
the trust evaluation approach; the output of this approach is a domain-based trustworthiness value
for each user. The established framework has proven its ability to address the stated classification
problem, as evidenced by the good results obtained from almost all of the incorporated machine
learning algorithms. This is a report on work in progress, part of an ongoing project intended to
develop a methodology for Social Business Intelligence (SBI) that incorporates semantic analysis
and trust notions to enrich textual data and determine its trustworthiness, respectively
[7,10,22,34]. The approaches developed here have produced promising results.
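As a rough illustration of how such a trust metric could combine user-analysis and content-analysis attributes, consider the sketch below. The attribute names, their normalization to [0, 1] and the weights are all assumptions made for illustration; the paper does not specify the actual metric.

```python
def trust_score(user_attrs, weights=None):
    """Illustrative domain-based trust score: a weighted combination of
    assumed user-analysis and content-analysis attributes (all in [0, 1])."""
    weights = weights or {"followers_ratio": 0.3, "account_age": 0.2,
                          "on_topic_ratio": 0.3, "retweet_rate": 0.2}
    return sum(weights[name] * user_attrs[name] for name in weights)

# Hypothetical, pre-normalized attribute values for one user.
example_user = {"followers_ratio": 0.8, "account_age": 0.6,
                "on_topic_ratio": 0.9, "retweet_rate": 0.4}

print(round(trust_score(example_user), 3))
```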
METHODOLOGY

The proposed methodology for analyzing City Logistics content is shown in Figure 1. Data
collection was performed by scraping the Twitter website with the search terms "City Logistics",
"Last Mile Logistics", "Urban Logistics" and "Urban Freight". The collected entries (i.e. tweets)
were filtered to remove repeated entries. Text cleaning and lemmatization consist of
removing undesired content from the data (such as links, symbols and linking words) and then
lemmatizing the text inputs. Lemmatization is the process of grouping several forms of a word
together so they can be analyzed as a single item. For example, the verb "to contribute" may appear
as "contributed", "contributes", "contributing", etc. The base form "contribute" (i.e. the one in the
dictionary) is called the lemma of the word.
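The short sketch below shows this lemmatization step using NLTK's WordNet lemmatizer on the "contribute" example; the paper does not state which lemmatizer was used, so this tool choice is an assumption.

```python
from nltk.stem import WordNetLemmatizer
# nltk.download("wordnet") may be required the first time this is run.

lemmatizer = WordNetLemmatizer()

# Different surface forms of "contribute" collapse to one lemma
# when lemmatized as verbs (pos="v").
for form in ["contributed", "contributes", "contributing"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
```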

In the first part of the analysis we built an interest map of features by performing four steps. (1) The
input content is transformed into a feature vector in which the lemmas are grouped into n-grams (sets
of 1, 2 or 3 words); this vector is then used to build a sparse matrix, a binary matrix that indicates
whether each feature is present in each entry. The sparse matrix has a very large number of
dimensions and is almost empty. (2) Dimensionality reduction is performed. We use Truncated
Singular Value Decomposition (SVD) to reduce the number of dimensions; SVD is preferred over
Principal Component Analysis because it works with sparse matrices more efficiently (Pedregosa
et al., 2011). The resulting matrix is denser and has continuous values. (3) K-Means is used to group
features that are "close" in terms of user interest. (4) A manifold learning algorithm is applied to
obtain a two-dimensional result; the algorithm used is t-Distributed Stochastic Neighbour Embedding,
which can reveal data that lie in multiple different manifolds or clusters (Pedregosa et al., 2011;
Van Der Maaten and Hinton, 2008). The second part of the methodology involved performing
sentiment analysis on the inputs. Sentiment analysis is the procedure in which information is
extracted from the opinions, appraisals and emotions of people with regard to entities, events and
their attributes (Unnisa et al., 2016).
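A compressed sketch of steps (1)-(4) with scikit-learn is given below. The placeholder tweets and the parameter values (number of SVD components, clusters, t-SNE perplexity) are assumptions chosen only so the example runs on a toy corpus; the original study's settings are not reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Placeholder tweets standing in for the scraped City Logistics corpus.
docs = [
    "last mile delivery in dense urban areas",
    "urban freight consolidation centres reduce congestion",
    "city logistics pilots for electric cargo bikes",
    "last mile logistics costs keep rising",
    "freight data sharing between city and carriers",
    "cargo bikes for urban deliveries",
]

# (1) Binary 1-3 word n-gram features; X marks which features occur in which entry.
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(docs)
features = vectorizer.get_feature_names_out()

# Work at the feature level, as in the paper: one row per n-gram feature.
F = X.T

# (2) Truncated SVD reduces the sparse matrix to a denser, lower-dimensional one.
F_reduced = TruncatedSVD(n_components=4, random_state=0).fit_transform(F)

# (3) K-Means groups features that are close in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(F_reduced)

# (4) t-SNE projects the reduced matrix to two dimensions for the interest map.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(F_reduced)

for name, label, (x, y) in list(zip(features, labels, coords))[:10]:
    print(f"cluster {label}  ({x:7.1f}, {y:7.1f})  {name}")
```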

In this research, we are interested in finding whether the tweets related to City Logistics carry
positive, negative or neutral sentiment. This analysis is performed using the NLTK library (Bird et
al., 2009). The first step of sentiment analysis consists of calculating the polarity score (negative vs.
positive) of each input document; this was done with VADER (Valence Aware Dictionary and
sEntiment Reasoner), a rule-based sentiment intensity analyzer (Hutto and Gilbert, 2014). The second
step consists of computing traditional statistical metrics.
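The sketch below applies NLTK's VADER analyzer to a few made-up tweets. The +/-0.05 compound-score thresholds for labelling tweets as positive, negative or neutral are a common convention from the VADER documentation and are assumed here, not taken from the paper.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download("vader_lexicon") may be required the first time this is run.

analyzer = SentimentIntensityAnalyzer()

# Made-up example tweets about city logistics.
tweets = [
    "Love the new cargo-bike delivery scheme, the city feels quieter already!",
    "Another delivery van blocking the bike lane. Urban freight is a mess.",
    "City logistics workshop scheduled for next week.",
]

for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    # The compound score lies in [-1, 1]; thresholding at +/-0.05 gives a label.
    if scores["compound"] >= 0.05:
        label = "positive"
    elif scores["compound"] <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:8s} {scores['compound']:+.2f}  {tweet}")
```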
TWITTER TOOLS FOR ANALYTICS

1. Twitter Analytics Dashboard


Every Twitter account has free access to the Twitter Analytics Dashboard. View how many
impressions and engagements your tweets get at specific times of the day and week. You can also
track the performance of your Twitter cards.

2. Hootsuite Analytics
Get real-time data about your key Twitter metrics using Hootsuite Analytics. Reports are clear and
concise, and you can export and share them with your team.

3. TruFan
Want to know all the juicy deets about your followers? Generate first-party data that’s both ethical
and high quality, and then export and re-market to those target audiences.

4. Cloohawk
Cloohawk watches your social media metrics like, well, a hawk. The AI engine continuously
monitors your own activities and the actions of your user base, and then makes suggestions to
improve your engagement. Cloohawk is available in the Hootsuite App Directory.

5. SocialBearing
Dig deep with this robust (and free!) Twitter analytics tool that allows you to find, filter and sort
tweets or followers by categories like location, sentiment, or engagement. You can also view via
timeline or Twitter map to process the data in whatever way works best for your brain.
CONCLUSION

The results show that for premier brands, Twitter can be a very effective way to communicate with
consumers, with the best performing luxury brands achieving millions of followers; Louis Vuitton
did so with only 1.3 tweets per day. More surprisingly, the results show that even low-involvement
products can obtain very large follower numbers, with the best performing FMCG brand (Pampers)
having more than 100,000 followers with fewer than thirteen tweets per day, and receiving more
than 400 retweets for its most popular tweet, thus reaching an even wider audience through
retweets.

The results also show an evolution of Twitter tactics over the comparison period, with much higher
use of hashtags across all industries but diverging practice in other areas. Although social media
is often argued to be an interactive medium, luxury brands' Twitter handles (the industry with
the largest number of followers) had become significantly less engaged with their followers over
the year, with fewer replies, mentions and retweets of others; yet those brands had still experienced
a large increase in the number of followers. In contrast, auto brands were replying much more in
their tweets (73%, up from 57%) but had not achieved the same increase in retweeting. Some
brands can clearly be very successful on Twitter with very limited interaction with followers. The
comparison across industries also revealed divergent strategies: luxury brands were primarily
broadcasting favorable company information using weblinks and embedded photos, while auto
and FMCG brands were primarily interactive. Some FMCG brands primarily posted corporate
communications news, with very little interaction.
REFERENCES

International Journal of Computer Applications (0975-8887), Volume 111. Classifying Short Text
in Social Media: Twitter as Case Study.

Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of Twitter posts for
user profile construction on the social web. In Grigoris Antoniou, Marko Grobelnik, Elena
Simperl, Bijan Parsia, Dimitris Plexousakis, Pieter Leenheer, and Jeff Pan, editors, The Semantic
Web: Research and Applications, volume 6644 of Lecture Notes in Computer Science, pages
375–389. Springer Berlin Heidelberg, 2011.

James Benhardus and Jugal Kalita. Streaming trend detection in Twitter. Int. J. Web Based
Communities, 9(1):122–139, January 2013.
