
Suleyman Demirel University

Faculty of Engineering and Natural Science

Twitter Trend Analysis Using Latent Dirichlet Allocation

Aizada Amankeldi

RESEARCH PROPOSAL
Abstract

This proposal concerns Twitter trend analysis using Latent Dirichlet Allocation (LDA),
a machine learning technique that can discover hidden topics in a collection of
text documents. The goal is to understand the types and triggers of social trends
on Twitter, and to develop a system that can search and display the latest trends
based on user input.

Keywords: Twitter trends, social media trends, Latent Dirichlet Allocation (LDA), trending tweets, trigger typology, news, ongoing events, memes, commemoratives, keyword search.
INTRODUCTION
What is the significance of Twitter in the modern social media land-
scape, and why are trends on Twitter important?

Twitter is a prominent micro-blogging platform that facilitates online social interaction by allowing users to post short messages, or tweets, each limited to 140
characters. While Twitter is primarily employed for sharing news and updates, the
quality and content of tweets can vary significantly. As a result, there is a critical
need for the development of algorithms capable of effectively sorting and classifying
relevant topics within this dynamic and concise communication medium.

The central aim of this study is to automatically categorize tweets within a
live stream, relying solely on the textual content of the tweet. This classification
approach has demonstrated success in handling more extensive textual sources,
such as news articles, Wikipedia entries, and other text-based content. A preva-
lent method for uncovering topics within such textual sources is Latent Dirichlet
Allocation (LDA). This research endeavors to assess the suitability of LDA for
identifying topics in a platform marked by very concise messages, such as Twitter.

It’s noteworthy that in 2013, Twitter was witnessing an average of 500 million
tweets posted daily, which highlights the immense volume of data generated on
this platform. Given this enormous data volume, there is a compelling need for
unsupervised algorithms capable of identifying topics efficiently. Should LDA prove
effective in this context, it has the potential to be employed for categorizing the
extensive archive of tweets ever posted, thus creating a substantial dataset of
classified tweets.

This research is motivated by the pressing need to enhance the automatic
categorization of tweets on Twitter, a platform that is inherently characterized
by its brevity and the vast number of tweets generated daily. By investigating
the applicability of LDA in this unique context, we seek to contribute to the
development of algorithms for real-time topic identification, with implications for
data analysis, information retrieval, and content recommendation.

An effective topic detection algorithm can serve a wide range of entities, from
intelligence agencies to private companies. Intelligence agencies could use topic
detection to automatically identify potential terrorist-related conversations by an-
alyzing message streams. These detected messages could then be flagged for further
manual review, reducing the volume of potentially relevant messages and enabling
a more in-depth analysis of each one. On the corporate side, companies can utilize
topic detection to gauge public sentiment and customer feedback regarding their
products and services. This is particularly valuable during product launches, as it
provides insights into customer reactions, both positive and negative.

To evaluate the efficacy of LDA, a Java-based system was developed. The eval-
uation encompasses both quantitative measurements, including Perplexity, which
assesses how well the model represents reality, and qualitative analysis of the sys-
tem’s output.

It’s essential to acknowledge the disparities between tweets and traditional news
articles. One significant distinction is that news articles are typically filtered and
considered trustworthy, whereas tweets can include noise and spam. LDA doesn’t
differentiate between noise, spam, and relevant content, potentially leading to the
emergence of topics generated by spam. Twitter’s ever-evolving language poses
another challenge since LDA necessitates a static vocabulary. While alternative
methods like those proposed by Ke Zhai and Jordan Boyd-Graber exist, which
involve character combinations in place of a static vocabulary, these were not
explored due to the scope limitations of this thesis.

In conclusion, this research aims to explore the applicability of LDA in classifying Twitter topics, recognizing its potential for broader applications in diverse fields, from national security to business analytics.

Literature Review
Social media data analysis has gained substantial attention in recent years, as platforms like Twitter have become a central medium for real-time information sharing, opinion expression, and trend formation. For both researchers and practitioners, it is necessary to understand Twitter trends and their origins in terms of the underlying topics. Among the most popular methods for addressing this challenge is Latent Dirichlet Allocation (LDA), a machine learning technique that uncovers hidden topics in collections of text documents. In this literature review, we frame Twitter trend analysis as the central problem, situate our research context, and discuss relevant studies that employ LDA for this purpose.

Methodology
Data Collection
1. Twitter API Access: Access to the Twitter API was secured through
Twitter Developer credentials, enabling the real-time retrieval of tweets via
the streaming API.

2. Keyword Selection: A set of pertinent keywords and hashtags was meticulously curated to encapsulate a diverse array of topics within the realm of interest for trend analysis.

3. Data Filtering: Rigorous filters were implemented to exclude retweets,
replies, and extraneous content, ensuring a focus on authentic, original tweets.
Stringent criteria were applied to eliminate spam and maintain data quality.

Preprocessing
1. Text Cleaning: Special characters, URLs, and non-text elements were ex-
punged from the tweets, promoting a standardized dataset by converting all
text to lowercase.

2. Tokenization: The cleaned text underwent tokenization, with the additional step of removing stop words to enhance the signal-to-noise ratio.

3. Lemmatization or Stemming: Employing lemmatization or stemming
techniques reduced words to their base forms, augmenting topic coherence
and interpretability.
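The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the stop word list is a tiny placeholder (real pipelines use a full list such as NLTK's), and the final suffix-stripping line is a crude stand-in for lemmatization or stemming.

```python
import re

# Illustrative stop word list; real pipelines use a much fuller set.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "are", "in", "for", "on", "me"}

def preprocess(tweet: str) -> list[str]:
    """Clean, tokenize, and filter one tweet, mirroring steps 1-3 above."""
    text = tweet.lower()                           # standardize case
    text = re.sub(r"http\S+", " ", text)           # strip URLs
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)     # strip special characters
    tokens = text.split()                          # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # crude plural stripping as a stand-in for lemmatization/stemming
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The Vaccines are working! https://t.co/x #covid19"))
# → ['vaccine', 'working', '#covid19']
```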

Latent Dirichlet Allocation (LDA) Implementation


1. Model Training: The LDA model was trained on the preprocessed tweet
data using Gensim, with an exploration of different topic numbers to identify
the optimal configuration.

2. Topic Identification: Topics and associated keywords were extracted from
the LDA model, and their coherence and interpretability were rigorously
evaluated.

3. Temporal Analysis: Temporal information was incorporated to discern
the evolution of topics over time. This facilitated the identification and
assessment of trending topics, including their duration and intensity.
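The study trained its model with Gensim; as a self-contained sketch of what LDA inference does under the hood, the toy collapsed Gibbs sampler below recovers topic-word counts on a tiny corpus. All names, the corpus, and the hyperparameter values are illustrative, not the study's configuration.

```python
import random

def gibbs_lda(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns the top 3 words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * k for _ in docs]        # document -> topic counts
    nkw = [[0] * V for _ in range(k)]    # topic -> word counts
    nk = [0] * k                         # topic totals
    z = []                               # topic assignment per token
    for d, doc in enumerate(docs):       # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):               # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                # resample this token's topic from the collapsed conditional
                weights = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + V * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    return [[vocab[wi] for wi in sorted(range(V), key=lambda wi: -nkw[t][wi])[:3]]
            for t in range(k)]
```

In Gensim, the equivalent step is training `LdaModel` on a bag-of-words corpus while varying `num_topics`, as described above.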

Evaluation
1. Perplexity Measurement: The perplexity of the LDA model was evalu-
ated, serving as a metric to gauge the model’s effectiveness in representing
the intricacies of the Twitter data.

2. Qualitative Analysis: A manual review of a sample of tweets from each
identified topic was conducted to validate the model’s accuracy. Adjustments
to model parameters were made based on qualitative insights.
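Perplexity is derived from the average per-word log-likelihood the model assigns to held-out text; lower values indicate a better fit. A minimal sketch of the computation (the log-likelihood values below are placeholders, not real model output):

```python
import math

def perplexity(word_log_likelihoods):
    """Perplexity = exp(-(1/N) * sum of per-word log-likelihoods)."""
    n = len(word_log_likelihoods)
    return math.exp(-sum(word_log_likelihoods) / n)

# A model assigning every word probability 1/8 (log p = log 1/8)
# has perplexity exactly 8, up to floating-point rounding:
print(perplexity([math.log(1 / 8)] * 10))
```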

Comparative Analysis
1. Compare with Previous Studies: The results obtained were bench-
marked against existing studies utilizing LDA for Twitter trend analysis.
Disparities were examined, and methodological variations were justified.

2. Cross-Validation: Cross-validation was undertaken to affirm the robustness of the model across distinct time periods or datasets, ensuring the reliability of findings.

Implications and Applications


1. Interpretation of Results: The identified topics were interpreted in the
context of Twitter trends, with a nuanced discussion on potential implica-
tions for applications such as sentiment analysis, crisis response, and mar-
keting strategies.

2. Recommendations for Future Research: Insights into areas for future
research and enhancements in Twitter trend analysis using LDA were pro-
vided, accompanied by an acknowledgment of study limitations and sugges-
tions for their mitigation in subsequent research endeavors.

Sentiment Analysis Integration


1. Sentiment Analysis Tool Selection: An additional layer of analysis was
introduced through sentiment analysis to discern the emotional tone of tweets
within identified topics. Natural Language Processing (NLP) tools, such as
VADER or TextBlob, were employed for this purpose.

2. Sentiment Labeling: Each tweet was labeled with sentiment scores, cat-
egorizing them as positive, negative, or neutral. The sentiment labels were
appended to the dataset, providing a nuanced perspective on the emotional
undercurrents within trending topics.
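The labeling step above relied on VADER or TextBlob. As a toy illustration of the lexicon-based approach those tools take, the sketch below averages word scores from a tiny hypothetical lexicon and buckets the result with a small threshold, loosely mirroring VADER's compound-score cutoffs; it is not the study's implementation.

```python
# Toy sentiment lexicon; real tools such as VADER ship large, weighted lexicons.
LEXICON = {"great": 1.0, "love": 0.8, "good": 0.5,
           "bad": -0.5, "terrible": -1.0, "hate": -0.8}

def label_sentiment(tokens, threshold=0.05):
    """Average lexicon scores over tokens, then bucket into pos/neg/neutral."""
    scores = [LEXICON.get(t, 0.0) for t in tokens]
    avg = sum(scores) / len(scores) if scores else 0.0
    if avg > threshold:
        return "positive"
    if avg < -threshold:
        return "negative"
    return "neutral"

print(label_sentiment(["love", "the", "vaccine"]))   # → positive
print(label_sentiment(["terrible", "news"]))         # → negative
```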

Correlation with Twitter Trends
1. Correlating Sentiment with Trending Topics: The sentiment-labeled
tweets were correlated with trending topics identified through LDA. This
correlation aimed to uncover patterns between the emotional tone of tweets
and the popularity or persistence of specific topics.

2. Temporal Sentiment Analysis: Temporal aspects of sentiment were explored to understand how emotional tones fluctuated over time, potentially influencing the trajectory of trending topics on Twitter.

Comparative Sentiment Analysis


1. Sentiment Comparison with Previous Studies: Sentiment results were
compared with sentiments observed in previous studies utilizing LDA for
Twitter trend analysis. Discrepancies and consistencies were examined to
provide a more holistic understanding of sentiment dynamics.

2. Cross-Validation of Sentiment Analysis: Cross-validation techniques
were applied to validate the reliability of sentiment analysis results across
different time periods or datasets, ensuring consistency and generalizability.

Implications for Practical Applications


1. Insights into Public Perception: The integration of sentiment analy-
sis provided nuanced insights into public sentiment towards trending topics.
Such insights are valuable for public relations, marketing, and crisis manage-
ment strategies.

2. Recommendations for Strategic Decision-Making: Recommendations
were formulated for leveraging sentiment insights in strategic decision-making
processes, ranging from marketing campaigns to public policy considerations.

Concluding Remarks
1. Comprehensive Understanding: The integration of sentiment analysis
enriched the overall analysis, offering a more comprehensive understanding of
Twitter trends, encompassing both the topics’ prevalence and the associated
public sentiment.

METHODOLOGY
The suggested methodology consists of multiple phases: gathering both static and real-time tweets from Twitter and performing trend analysis on them. Preprocessing is first required to prepare the tweets for further analysis. The static and real-time tweets are then subjected to a variety of machine learning techniques in order to identify trends.

This algorithm uses multiple techniques to assess Twitter trending topics. The tweets are first gathered and preprocessed in order to facilitate further analysis. The preprocessed data are then analyzed with several techniques, including hashtag counting, noun counting, cosine similarity, Jaccard similarity, LDA, and K-means. Each algorithm's performance is assessed, and the data are then examined to determine a pertinent subject.

Data Collection Methods


In this research, I describe the process of collecting data from social media, with a
focus on Twitter, using the Twitter API and subsequently processing the data in
Apache Spark. This process aims to collect tweets related to various domains such
as sports, healthcare, economics, politics, and social networks. The total number
of collected tweets is 20,000, and the data collection period spans from January
15, 2021, to June 30, 2021.

Utilizing Twitter API


For efficient data collection from Twitter, I utilized the Twitter API, gaining access
through a created application on the Twitter Developer Platform. This allowed
me to retrieve tweets related to specified keywords and hashtags.
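Keyword- and hashtag-based retrieval boils down to assembling a filter query. The helper below is an illustrative sketch that builds a query string in the Twitter API v2 search-operator syntax (the keywords and the helper name are assumptions, not the study's code); in practice the string would be passed to a client such as Tweepy.

```python
def build_query(keywords, hashtags, lang="en"):
    """Combine keywords and hashtags with OR, excluding retweets and
    replies via Twitter API v2 search operators."""
    terms = list(keywords) + ["#" + h.lstrip("#") for h in hashtags]
    return "(" + " OR ".join(terms) + f") -is:retweet -is:reply lang:{lang}"

print(build_query(["vaccine"], ["covid19"]))
# → (vaccine OR #covid19) -is:retweet -is:reply lang:en
```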

Selection of Keywords and Hashtags


Keywords and hashtags were defined for each category to facilitate more accurate
and relevant data collection. This involved a thorough analysis of the context and
primary themes of each domain.

Expanding Data Sources
To achieve more comprehensive coverage of topics, I explored the integration of
data from various sources, including news websites and forums, to complement
information gathered from Twitter.

Process Optimization
Data Filtering and Processing
Preliminary data processing was applied to eliminate duplicates and remove unnec-
essary characters, enhancing the quality of data subjected to processing in Apache
Spark.

Scaling and Optimization


Attention was given to optimizing processes for handling large volumes of data.
Parallelization of processes was applied to improve performance.

Data Security and Processing Rules


Data collection and processing adhered to security rules, considering data protec-
tion legislation.

The combination of Twitter API with Apache Spark allows for the collection
and processing of a substantial amount of data from social networks to gain insights
into various domains. Proper selection of keywords, hashtags, and expanding data
sources contribute to obtaining a more comprehensive and accurate representation
of each topic. Applying these methods in the data collection and processing process
enables a deeper analysis, revealing important trends in the areas of interest.

Data Analysis
The acquisition and analysis of a substantial volume of data are imperative for
ensuring the robustness of analytical endeavors. The efficacy of the analysis is
inherently tied to the breadth and diversity of the dataset under examination.
In the context of researching Twitter trends, a comprehensive approach entails
encompassing a significant corpus of tweets from diverse global sources, spanning
various topics. Twitter, as a platform, facilitates this research by providing access
to a wealth of user-generated content.

Nevertheless, the conventional approach of developing a program for the sys-
tematic collection, preprocessing, storage, and subsequent algorithmic analysis of
tweets necessitates a considerable investment of time and resources. An alter-
native strategy, which proves to be more efficient, involves the utilization of real-
time streaming in conjunction with Apache Spark (SPARK). This is made possible
through Twitter’s provision of data streaming capabilities. In this approach, the
incoming Twitter stream is directed to a TCP socket on a system, as opposed to
being stored in a file. A SPARK session is then initiated to read the incoming data
from this TCP socket. This dynamic fusion of real-time streaming and SPARK
can significantly enhance the outcomes of the aforementioned algorithms.

A notable advantage of this methodology lies in its immediacy. Unlike traditional batch processing, where the model awaits the completion of tweet gathering
before commencing analysis, the SPARK-enabled system begins processing Twitter
data as soon as it is written to the socket. This expeditious response is attributed
to SPARK’s structured streaming capability, ensuring that results are updated in
real-time as new data is analyzed. While SPARK inherently generates results in
batches, each batch reflects the outcomes corresponding to the data streamed up
to that specific moment.

The chosen data source for this research is a TCP socket, serving as the conduit
for Twitter data obtained through the Twitter API. Tweepy, a Python library for
accessing the Twitter API, facilitates the real-time acquisition of tweets, directing
them to the designated data source. Subsequently, PySpark processes the tweets
in batches during predefined time intervals, streaming both input and output in a
synchronized manner. The variable quantities of tweets within each batch dynam-
ically reflect the prevailing topics trending on Twitter at that particular moment.
This continuous and real-time transmission of data not only allows for the accu-
mulation of a larger volume of tweets compared to static datasets but also ensures
the provision of precise trends at any given time of day.
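In structured-streaming terms, each micro-batch updates a running aggregation whose latest snapshot reflects all data streamed so far. The pure-Python sketch below imitates that behavior on a list of "batches"; it illustrates the idea only and is not Spark code.

```python
from collections import Counter

def running_hashtag_counts(batches):
    """Yield cumulative hashtag counts after each micro-batch,
    mimicking a streaming running aggregation."""
    totals = Counter()
    for batch in batches:
        for tweet in batch:
            totals.update(t for t in tweet.split() if t.startswith("#"))
        yield dict(totals)

batches = [["#covid19 cases rise", "watching #football"],
           ["#covid19 vaccine news"]]
for snapshot in running_hashtag_counts(batches):
    print(snapshot)
# after batch 1: {'#covid19': 1, '#football': 1}
# after batch 2: {'#covid19': 2, '#football': 1}
```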

The developed LDA model has been utilized to obtain results using a sample dataset consisting of 15,000 tweets. The model's input was the preprocessed dataset, which was kept in a data frame. Different outcomes were obtained when the algorithm was run for varying numbers of topics (k). To identify an appropriate number of topics for the provided dataset, the coherence value was calculated for each value of k. This metric aids in determining the topics' coherence, or how effectively the identified themes complement one another. The coherence value for various k values is displayed in Figure 1.
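Coherence is typically computed with library tools (e.g., Gensim's CoherenceModel); as a self-contained illustration of the idea, the sketch below computes a simplified UMass-style coherence for a set of top words: for each pair, the smoothed co-occurrence document frequency relative to the document frequency of the earlier-ranked word. The function name and example corpus are illustrative, and it assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, docs):
    """Simplified UMass coherence over a topic's top words; higher means
    the words co-occur more often and the topic is more coherent."""
    doc_sets = [set(d) for d in docs]
    def df(*words):  # document frequency of a word (or word pair)
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((df(wi, wj) + 1) / df(wj))
            pairs += 1
    return score / pairs

docs = [["covid", "vaccine"], ["covid", "vaccine", "health"], ["football", "goal"]]
print(round(umass_coherence(["covid", "vaccine"], docs), 3))  # → 0.405
```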

In the process of classifying tweets into distinct documents based on various
subjects, cosine similarity was employed to assess the angular separation between
the documents treated as vectors. Initially, the documents were generated with a
focus on hashtags, and subsequently, the corresponding hashtags were appended
to the respective documents. In cases where a tweet lacked a hashtag, TF-IDF
(Term Frequency-Inverse Document Frequency) was utilized to ascertain its cosine
similarity with all previously established documents. The document exhibiting the
highest similarity was then assigned the appropriate hashtag, and the correspond-
ing count was incremented. Notably, in the current documentation derived from
the gathered dataset, the hashtag covid19 has emerged as a trending topic.
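The assignment step for hashtag-less tweets can be sketched as follows: build TF-IDF vectors for the existing documents, then assign the tweet to the document with the highest cosine similarity. This is a minimal pure-Python illustration with an invented toy corpus, not the study's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per doc."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["covid", "vaccine", "covid"], ["football", "goal"], ["vaccine", "dose"]]
vecs = tfidf_vectors(docs)
query = vecs[2]  # a tweet about vaccines, to be assigned to doc 0 or doc 1
best = max(range(2), key=lambda i: cosine(query, vecs[i]))
print(best)  # → 0: the covid/vaccine document is the closest match
```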

Figure 1

Figure 2

Data Preprocessing
In the proposed approach for Twitter data streaming, the initial step involves
sourcing input data from the specified dataset, which serves as the foundation for
the proficient information streaming. The Twitter input dataset is subjected to
preprocessing techniques, specifically tokenization and stop word removal, aimed
at eliminating inconsistent or noisy data. The data preprocessing pipeline encom-
passes the following procedures:

a. Tokenization. Tokenization, or symbolization, is the process of segmenting the input data into distinct units known as tokens, while simultaneously discarding certain characters such as punctuation. Tokenization breaks the provided text into units, which can be words, numbers, or punctuation samples; a primary objective of this step is to eliminate punctuation marks such as commas, periods, hyphens, and brackets. The tokenized data is represented as follows:

Data_tokenized = {t_i | i = 1, 2, 3, ..., n}

Here, t_i represents an individual token, and n is the total number of tokens.

b. Stop Word Removal. Following tokenization, the tokenized data serves as input for stop word removal, which excludes words that carry little topical information. Stop words are generally considered linguistically insignificant and include conjunctions, prepositions, articles, and other frequently occurring function words [36]. Common examples of stop words include "a," "me," "of," "the," "he," "she," and "you." The preprocessed dataset, obtained after stop word elimination, is denoted as:

PreprocessedData_stopword_removed = {w_i | i = 1, 2, 3, ..., n}

In this representation, w_i signifies an individual word, and n denotes the total number of words in the preprocessed set. This meticulous preprocessing enhances the quality of the input data by refining it for subsequent stages of analysis and information extraction.

Results
In this study, we conducted trend analysis employing various techniques and al-
gorithms to discern the most effective method for real-time analysis and accurate
results. Our analysis, presented through graphs and tables, demonstrated con-
sistent outcomes across different approaches. The experimentation encompassed
both static and dynamic data analyses. For static analysis, pre-collected data was
employed, stored in a file, and subjected to basic counting methods and machine
learning algorithms. Conversely, dynamic analysis involved real-time data stream-
ing, utilizing SPARK structured streaming for simultaneous analysis, facilitating
the processing of a substantial volume of tweets to derive precise trends.

Basic Counting Methods: Initially, we employed hashtag counting as a rudimentary method to identify prevalent topics on Twitter. The hashtag, being a
pivotal element of tweets, was frequently used to convey opinions and support
tweet content. In our sample dataset of 20,000 static tweets, "COVID19" emerged
as the most frequently used hashtag, occurring 87 times. Recognizing the limi-
tation of hashtag counting in neglecting tweet content, we introduced the noun
counting method. This method involved part-of-speech tagging to identify and
count nouns used repeatedly in tweets. The results indicated that the term "vac-
cine" was the most frequently used noun, complementing the hashtag counting
outcomes. Consequently, these two techniques were chosen for real-time analysis.
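The two counting baselines can be sketched as follows. Hashtag counting is a direct tally; for noun counting, the study used part-of-speech tagging, which the toy noun set below merely stands in for. The sample tweets are invented for illustration.

```python
from collections import Counter

tweets = [
    "new #covid19 vaccine rollout",                  # illustrative sample tweets
    "#covid19 cases drop as vaccine uptake grows",
    "great match today #football",
]

# Hashtag counting: tally every token that starts with '#'
hashtags = Counter(t for tw in tweets for t in tw.split() if t.startswith("#"))
print(hashtags.most_common(1))  # → [('#covid19', 2)]

# Noun counting: the study used POS tagging; a toy noun set stands in here
NOUNS = {"vaccine", "rollout", "cases", "uptake", "match"}
nouns = Counter(t for tw in tweets for t in tw.split() if t in NOUNS)
print(nouns.most_common(1))  # → [('vaccine', 2)]
```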

Analysis with Hashtag and Noun Counting: Figures 2 and 3 depict the results
of Twitter trend analysis using hashtag counting and noun counting, respectively.
These methods were selected for their robust correlation with real-world scenarios.

Figure 3

LDA Model Analysis: Utilizing a sample dataset of 7,000 tweets, we implemented a Latent Dirichlet Allocation (LDA) model for topic analysis. The preprocessed dataset was input into the model, and the algorithm was executed for varying numbers of topics (k). Coherence values were calculated for each k to determine the optimal number of topics. Figure 4 illustrates the coherence values for different k values. The selection of an appropriate k value is non-trivial, and in this case, k = 3 was deemed optimal. Further refinement can be achieved by adjusting parameters such as alpha, beta, or the number of iterations.

In conclusion, our comprehensive analysis, incorporating hashtag counting, noun counting, and LDA modeling, offers valuable insights into effective methods for trend analysis, both in static and real-time scenarios. These findings contribute to the advancement of techniques in the domain of social media trend analysis.

Conclusion
In conclusion, the multifaceted landscape of Twitter trends analysis reveals itself as
a potent tool for shaping and informing decision-making across diverse sectors. The
amalgamation of sophisticated methodologies such as subject modeling, machine
learning clustering, and brute force counting techniques provides decision-makers
with a versatile toolkit to glean insights from the rich tapestry of Twitter data.

The proposal to integrate real-time streaming with the Apache SPARK big
data analytics tool emerges as a pivotal advancement in the quest for expeditious
and seamless model runtimes. The demand for faster insights in today’s dynamic
environment is met by this amalgamation, promising to revolutionize the speed
and efficiency of Twitter trends analysis.

Our exploration into performance metrics underlines the reliability of methods
such as Latent Dirichlet Allocation (LDA) and the Jaccard technique, each offering
distinct advantages in discerning patterns within the Twitterverse. These insights,
whether applied to the business realm for sales optimization, political landscapes
for gauging public sentiment, or the entertainment industry for constructive cri-
tiques, showcase the far-reaching implications of Twitter trends analysis.

As the digital sphere continues to evolve, the ability to decipher and respond
to real-time trends becomes paramount. The proposed methods not only address
this need but also position decision-makers strategically, providing them with a
nuanced understanding of the Twitter landscape. The diverse approaches discussed
underscore the adaptability of Twitter data analysis, making it an invaluable asset
for those seeking to navigate the ever-shifting currents of public opinion, interests,
and sentiments. In the pursuit of leveraging Twitter trends for actionable insights,
the integration of cutting-edge tools and methodologies stands as a testament to
the continuous evolution of data analysis in our interconnected world.

References
1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
2. Paul, M. J., & Dredze, M. (2011). You Are What You Tweet: Analyzing Twitter for Public Health. ICWSM.
3. Hong, L., & Davison, B. D. (2010). Empirical Study of Topic Modeling in Twitter. Proceedings of the First Workshop on Social Media Analytics, ACM.
4. Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.
5. Weng, J., Lim, E. P., Jiang, J., & He, Q. (2010). TwitterRank: Finding Topic-sensitive Influential Twitterers. WSDM.
6. Suh, B., Hong, L., Pirolli, P., & Chi, E. H. (2010). Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. SocialCom.
7. Davidov, D., Tsur, O., & Rappoport, A. (2010). Enhanced sentiment learning using Twitter hashtags and smileys. COLING.
8. Petrovic, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story detection with application to Twitter. Proceedings of the International Conference on World Wide Web.
9. Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., & Li, X. (2011). Comparing Twitter and traditional media using topic models. ECIR.
10. Hong, L., Convertino, G., Gan, C., Hsieh, G., & Chi, E. H. (2011). Language matters in Twitter: A large scale study. ICWSM.
