Aizada Amankeldi
RESEARCH PROPOSAL
Abstract
This proposal concerns Twitter trend analysis using Latent Dirichlet Allocation (LDA),
a machine learning technique that discovers hidden topics in a collection of
text documents. The goal is to understand the types and triggers of social trends
on Twitter, and to develop a system that can search for and display the latest trends
based on user input.
It is noteworthy that in 2013, Twitter saw an average of 500 million
tweets posted daily, which highlights the immense volume of data generated on
this platform. Given this enormous data volume, there is a compelling need for
unsupervised algorithms capable of identifying topics efficiently. Should LDA prove
effective in this context, it could be employed to categorize the
extensive archive of all tweets ever posted, thus creating a substantial dataset of
classified tweets.
An effective topic detection algorithm can serve a wide range of entities, from
intelligence agencies to private companies. Intelligence agencies could use topic
detection to automatically identify potential terrorist-related conversations by an-
alyzing message streams. These detected messages could then be flagged for further
manual review, reducing the volume of potentially relevant messages and enabling
a more in-depth analysis of each one. On the corporate side, companies can utilize
topic detection to gauge public sentiment and customer feedback regarding their
products and services. This is particularly valuable during product launches, as it
provides insights into customer reactions, both positive and negative.
To evaluate the efficacy of LDA, a Java-based system was developed. The eval-
uation encompasses both quantitative measurements, including Perplexity, which
assesses how well the model represents reality, and qualitative analysis of the sys-
tem’s output.
It’s essential to acknowledge the disparities between tweets and traditional news
articles. One significant distinction is that news articles are typically filtered and
considered trustworthy, whereas tweets can include noise and spam. LDA doesn’t
differentiate between noise, spam, and relevant content, potentially leading to the
emergence of topics generated by spam. Twitter’s ever-evolving language poses
another challenge since LDA necessitates a static vocabulary. While alternative
methods like those proposed by Ke Zhai and Jordan Boyd-Graber exist, which
involve character combinations in place of a static vocabulary, these were not
explored due to the scope limitations of this thesis.
Literature Review
The analysis of social media data has gained substantial attention in recent years,
as platforms like Twitter have become a central medium for real-time information
sharing, opinion expression, and trend formation. For both researchers and
practitioners, it is necessary to understand Twitter trends and their origins in
terms of the underlying topics. Among the most popular methods for addressing this
challenge is Latent Dirichlet Allocation (LDA), a machine learning technique that
uncovers hidden topics in text documents. In this literature review, we frame
Twitter trend analysis as the central problem, describe our research context, and
survey relevant studies that employ LDA for this purpose.
Methodology
Data Collection
1. Twitter API Access: Access to the Twitter API was secured through
Twitter Developer credentials, enabling the real-time retrieval of tweets via
the streaming API.
Preprocessing
1. Text Cleaning: Special characters, URLs, and non-text elements were removed
from the tweets, and all text was converted to lowercase to standardize the
dataset.
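The cleaning step above can be sketched in Python. The regular expressions and the sample tweet here are illustrative assumptions, not the exact rules used in the system:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, user mentions, and special characters; lowercase the text."""
    text = re.sub(r"http\S+|www\.\S+", "", text)      # strip URLs
    text = re.sub(r"@\w+", "", text)                  # strip user mentions
    text = re.sub(r"[^a-z0-9\s#]", "", text.lower())  # keep letters, digits, spaces, '#'
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

print(clean_tweet("Check this out: https://t.co/xyz @user #COVID19!!!"))
# → check this out #covid19
```

The `#` character is deliberately retained so that hashtags survive cleaning and remain available to the later counting and trend-detection steps.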
Evaluation
1. Perplexity Measurement: The perplexity of the LDA model was evalu-
ated, serving as a metric to gauge the model’s effectiveness in representing
the intricacies of the Twitter data.
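For reference, perplexity is the exponentiated negative log-likelihood per token, so lower values indicate a better fit. A minimal sketch, with a toy log-likelihood standing in for the value an LDA implementation would report:

```python
import math

def perplexity(log_likelihood: float, num_tokens: int) -> float:
    """Perplexity = exp(-(log-likelihood per token)); lower is better."""
    return math.exp(-log_likelihood / num_tokens)

# Toy case: a model assigning each of 100 tokens probability 0.1
# behaves like a uniform 10-way choice, so perplexity is about 10.
ll = 100 * math.log(0.1)
print(perplexity(ll, 100))
```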
Comparative Analysis
1. Compare with Previous Studies: The results obtained were bench-
marked against existing studies utilizing LDA for Twitter trend analysis.
Disparities were examined, and methodological variations were justified.
2. Sentiment Labeling: Each tweet was labeled with sentiment scores, cat-
egorizing them as positive, negative, or neutral. The sentiment labels were
appended to the dataset, providing a nuanced perspective on the emotional
undercurrents within trending topics.
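The three-way labeling scheme can be illustrated with a toy lexicon-based scorer. The word lists below are invented for the example; a real pipeline would rely on a trained sentiment model or an established lexicon rather than these hand-picked sets:

```python
# Illustrative word lists only — not a real sentiment lexicon.
POSITIVE = {"great", "love", "effective", "good", "safe"}
NEGATIVE = {"bad", "hate", "dangerous", "terrible", "scam"}

def label_sentiment(tokens):
    """Label a tokenized tweet positive, negative, or neutral by lexicon hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_sentiment("the vaccine rollout was great".split()))  # → positive
```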
Correlation with Twitter Trends
1. Correlating Sentiment with Trending Topics: The sentiment-labeled
tweets were correlated with trending topics identified through LDA. This
correlation aimed to uncover patterns between the emotional tone of tweets
and the popularity or persistence of specific topics.
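One simple way to realize this correlation, sketched here with hypothetical topic labels and sentiment scores, is to aggregate the mean sentiment score of the tweets assigned to each topic:

```python
from collections import defaultdict

def mean_sentiment_per_topic(records):
    """records: iterable of (topic, sentiment_score) pairs.
    Returns the mean sentiment score for each topic."""
    sums, counts = defaultdict(float), defaultdict(int)
    for topic, score in records:
        sums[topic] += score
        counts[topic] += 1
    return {t: sums[t] / counts[t] for t in sums}

# Hypothetical (topic, score) pairs for illustration.
data = [("covid19", -0.4), ("covid19", 0.2), ("vaccine", 0.6)]
print(mean_sentiment_per_topic(data))
```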
Concluding Remarks
1. Comprehensive Understanding: The integration of sentiment analysis
enriched the overall analysis, offering a more comprehensive understanding of
Twitter trends, encompassing both the topics’ prevalence and the associated
public sentiment.
METHODOLOGY
The suggested methodology consists of multiple phases: gathering both static
and real-time tweets from Twitter and performing trend analysis on them.
Preprocessing is first required to prepare the tweets for further analysis.
These static and real-time tweets are then subjected to a variety of machine
learning techniques in order to identify trends.
Expanding Data Sources
To achieve more comprehensive coverage of topics, we explored the integration of
data from various sources, including news websites and forums, to complement
the information gathered from Twitter.
Process Optimization
Data Filtering and Processing
Preliminary data processing was applied to eliminate duplicates and remove unnec-
essary characters, enhancing the quality of data subjected to processing in Apache
Spark.
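The duplicate-elimination step can be sketched as an order-preserving filter. A real deployment would typically perform this inside Spark (e.g., with `dropDuplicates` on a DataFrame), but a plain-Python version shows the idea:

```python
def deduplicate(tweets):
    """Drop exact-duplicate tweets while preserving their first-seen order."""
    seen = set()
    unique = []
    for t in tweets:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

print(deduplicate(["a retweet", "fresh tweet", "a retweet"]))
# → ['a retweet', 'fresh tweet']
```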
The combination of Twitter API with Apache Spark allows for the collection
and processing of a substantial amount of data from social networks to gain insights
into various domains. Proper selection of keywords, hashtags, and expanding data
sources contribute to obtaining a more comprehensive and accurate representation
of each topic. Applying these methods during data collection and processing
enables a deeper analysis, revealing important trends in the areas of interest.
Data Analysis
The acquisition and analysis of a substantial volume of data are imperative for
ensuring the robustness of analytical endeavors. The efficacy of the analysis is
inherently tied to the breadth and diversity of the dataset under examination.
In the context of researching Twitter trends, a comprehensive approach entails
encompassing a significant corpus of tweets from diverse global sources, spanning
various topics. Twitter, as a platform, facilitates this research by providing access
to a wealth of user-generated content.
Nevertheless, the conventional approach of developing a program for the sys-
tematic collection, preprocessing, storage, and subsequent algorithmic analysis of
tweets necessitates a considerable investment of time and resources. An alter-
native strategy, which proves to be more efficient, involves the utilization of real-
time streaming in conjunction with Apache Spark (SPARK). This is made possible
through Twitter’s provision of data streaming capabilities. In this approach, the
incoming Twitter stream is directed to a TCP socket on a system, as opposed to
being stored in a file. A SPARK session is then initiated to read the incoming data
from this TCP socket. This dynamic fusion of real-time streaming and SPARK
can significantly enhance the outcomes of the aforementioned algorithms.
The chosen data source for this research is a TCP socket, serving as the conduit
for Twitter data obtained through the Twitter API. Tweepy, a Python library for
accessing the Twitter API, facilitates the real-time acquisition of tweets, directing
them to the designated data source. Subsequently, PySpark processes the tweets
in batches during predefined time intervals, streaming both input and output in a
synchronized manner. The variable quantities of tweets within each batch dynam-
ically reflect the prevailing topics trending on Twitter at that particular moment.
This continuous and real-time transmission of data not only allows for the accu-
mulation of a larger volume of tweets compared to static datasets but also ensures
the provision of precise trends at any given time of day.
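The socket hand-off described above can be sketched with the standard library alone. Here a background thread stands in for the Tweepy listener writing one tweet per line, and a plain client stands in for Spark, which would instead read the stream via `spark.readStream.format("socket")`. The host, port, and sample tweets are assumptions for the example:

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 50507  # hypothetical local endpoint for the example

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((HOST, PORT))
srv.listen(1)

def serve_tweets(tweets):
    """Stand-in for the Tweepy listener: writes one tweet per line to the socket."""
    conn, _ = srv.accept()
    for t in tweets:
        conn.sendall((t + "\n").encode())
    conn.close()

sample = ["#covid19 cases rising", "new #vaccine study out"]
threading.Thread(target=serve_tweets, args=(sample,), daemon=True).start()

# Stand-in for the Spark reader; the real system would instead call
# spark.readStream.format("socket").option("host", HOST).option("port", PORT).
cli = socket.create_connection((HOST, PORT))
received = cli.makefile().read().splitlines()
cli.close()
srv.close()
print(received)
```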
The developed LDA model was applied to a sample dataset consisting of 15,000
tweets. The model’s input was the preprocessed dataset, stored in a data frame.
Different outcomes were obtained when the algorithm was run with different
numbers of topics. To identify an appropriate number of topics for the provided
dataset, the coherence value was calculated for each value of k. This metric
indicates how coherent the identified topics are, that is, how well the top
words within each topic fit together. The coherence value for various values
of k is displayed in Figure 1.
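The text does not specify which coherence measure was used; one common choice that can be computed directly from document co-occurrence counts is UMass coherence, sketched here on a toy corpus:

```python
import math

def umass_coherence(top_words, docs):
    """UMass topic coherence: sum over ordered word pairs of
    log((co-document frequency + 1) / document frequency of the earlier word)."""
    doc_sets = [set(d) for d in docs]
    def df(w):
        return sum(w in d for d in doc_sets)
    def co_df(w1, w2):
        return sum(w1 in d and w2 in d for d in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((co_df(top_words[i], top_words[j]) + 1)
                              / df(top_words[j]))
    return score

# Toy corpus of three tokenized "documents".
docs = [["covid19", "vaccine", "cases"],
        ["covid19", "vaccine"],
        ["football", "goal"]]
print(umass_coherence(["covid19", "vaccine"], docs))
```

Values closer to zero indicate that a topic's top words tend to co-occur in the same documents, which is the intuition used when comparing candidate values of k.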
The documents were treated as vectors. Initially, the documents were generated with a
focus on hashtags, and subsequently, the corresponding hashtags were appended
to the respective documents. In cases where a tweet lacked a hashtag, TF-IDF
(Term Frequency-Inverse Document Frequency) was utilized to ascertain its cosine
similarity with all previously established documents. The document exhibiting the
highest similarity was then assigned the appropriate hashtag, and the corresponding
count was incremented. Notably, in the current documents derived from
the gathered dataset, the hashtag covid19 has emerged as a trending topic.
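The TF-IDF and cosine-similarity assignment can be sketched in plain Python; the two hashtag documents and the unlabeled tweet below are toy data for the example:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (as dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [["covid", "vaccine", "cases"],   # document accumulated under #covid19
          ["match", "goal", "league"],     # document accumulated under #football
          ["vaccine", "rollout"]]          # new tweet with no hashtag
hashtags = ["#covid19", "#football"]
vecs = tfidf_vectors(corpus)
tweet_vec = vecs[-1]
best = max(range(len(hashtags)), key=lambda i: cosine(tweet_vec, vecs[i]))
print(hashtags[best])  # → #covid19
```

The hashtag-less tweet shares the word "vaccine" with the #covid19 document and nothing with the #football document, so it is assigned #covid19 and that document's count is incremented.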
Figure 1
Figure 2
Data Preprocessing
In the proposed approach for Twitter data streaming, the initial step involves
sourcing input data from the specified dataset, which serves as the foundation for
efficient information streaming. The Twitter input dataset is subjected to
preprocessing techniques, specifically tokenization and stop word removal, aimed
at eliminating inconsistent or noisy data. The data preprocessing pipeline encom-
passes the following procedures:
The first step removes certain characters such as punctuation. Tokenization
breaks the provided text into units, which can be words, numbers, or punctuation
samples; its primary objective here is to eliminate punctuation marks such as
commas, periods, hyphens, and brackets. The tokenized data is represented as
follows:
Data_tokenized = { t_i | i = 1, 2, 3, . . . , n }
PreprocessedData_stopword_removed = { w_i | i = 1, 2, 3, . . . , n }
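The two preprocessing operations can be sketched as follows; the stop-word list and the tokenizing regular expression are illustrative assumptions:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}  # illustrative subset

def tokenize(text):
    """Split text into word tokens (the t_i), discarding punctuation."""
    return re.findall(r"[a-z0-9#]+", text.lower())

def remove_stop_words(tokens):
    """Keep only tokens outside the stop-word list (the w_i)."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The vaccine rollout, in short, is accelerating.")
print(remove_stop_words(tokens))
# → ['vaccine', 'rollout', 'short', 'accelerating']
```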
Results
In this study, we conducted trend analysis employing various techniques and al-
gorithms to discern the most effective method for real-time analysis and accurate
results. Our analysis, presented through graphs and tables, demonstrated con-
sistent outcomes across different approaches. The experimentation encompassed
both static and dynamic data analyses. For static analysis, pre-collected data was
employed, stored in a file, and subjected to basic counting methods and machine
learning algorithms. Conversely, dynamic analysis involved real-time data stream-
ing, utilizing SPARK structured streaming for simultaneous analysis, facilitating
the processing of a substantial volume of tweets to derive precise trends.
Hashtag counting tracks trends through hashtags alone, neglecting the rest of the
tweet content. In our sample dataset of 20,000 static tweets, "COVID19" emerged
as the most frequently used hashtag, occurring 87 times. Recognizing the limi-
tation of hashtag counting in neglecting tweet content, we introduced the noun
counting method. This method involved part-of-speech tagging to identify and
count nouns used repeatedly in tweets. The results indicated that the term "vac-
cine" was the most frequently used noun, complementing the hashtag counting
outcomes. Consequently, these two techniques were chosen for real-time analysis.
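Hashtag counting reduces to extracting `#` tokens and tallying them; noun counting would additionally require a part-of-speech tagger (e.g., NLTK's), so only the hashtag half is sketched here, on invented tweets:

```python
import re
from collections import Counter

def top_hashtags(tweets, k=3):
    """Count hashtag occurrences across tweets and return the k most common."""
    tags = [t.lower() for tw in tweets for t in re.findall(r"#\w+", tw)]
    return Counter(tags).most_common(k)

sample = ["New #COVID19 numbers out", "#covid19 and #vaccine update", "#vaccine works"]
print(top_hashtags(sample))
# → [('#covid19', 2), ('#vaccine', 2)]
```

Lowercasing before counting ensures that "#COVID19" and "#covid19" are tallied as the same hashtag.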
Analysis with Hashtag and Noun Counting: Figures 2 and 3 depict the results
of Twitter trend analysis using hashtag counting and noun counting, respectively.
These methods were selected for their robust correlation with real-world scenarios.
Figure 3
Conclusion
In conclusion, the multifaceted landscape of Twitter trends analysis reveals itself as
a potent tool for shaping and informing decision-making across diverse sectors. The
amalgamation of sophisticated methodologies such as subject modeling, machine
learning clustering, and brute force counting techniques provides decision-makers
with a versatile toolkit to glean insights from the rich tapestry of Twitter data.
The proposal to integrate real-time streaming with the Apache SPARK big
data analytics tool emerges as a pivotal advancement in the quest for expeditious
and seamless model runtimes. The demand for faster insights in today’s dynamic
environment is met by this amalgamation, promising to revolutionize the speed
and efficiency of Twitter trends analysis.
As the digital sphere continues to evolve, the ability to decipher and respond
to real-time trends becomes paramount. The proposed methods not only address
this need but also position decision-makers strategically, providing them with a
nuanced understanding of the Twitter landscape. The diverse approaches discussed
underscore the adaptability of Twitter data analysis, making it an invaluable asset
for those seeking to navigate the ever-shifting currents of public opinion, interests,
and sentiments. In the pursuit of leveraging Twitter trends for actionable insights,
the integration of cutting-edge tools and methodologies stands as a testament to
the continuous evolution of data analysis in our interconnected world.
References
1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3, 993–1022.
2. Paul, M. J., & Dredze, M. (2011). You Are What You Tweet: Analyzing
Twitter for Public Health. ICWSM.
3. Hong, L., & Davison, B. D. (2010). Empirical Study of Topic Modeling in
Twitter. Proceedings of the First Workshop on Social Media Analytics, ACM.
4. Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA:
A supervised topic model for credit attribution in multi-labeled corpora.
Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing.
5. Weng, J., Lim, E. P., Jiang, J., & He, Q. (2010). TwitterRank: Finding
Topic-sensitive Influential Twitterers. WSDM.
6. Suh, B., Hong, L., Pirolli, P., & Chi, E. H. (2010). Want to be retweeted?
Large scale analytics on factors impacting retweet in Twitter network.
SocialCom.
7. Davidov, D., Tsur, O., & Rappoport, A. (2010). Enhanced sentiment learning
using Twitter hashtags and smileys. COLING.
8. Petrovic, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story
detection with application to Twitter. Proceedings of the International
Conference on World Wide Web.
9. Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., & Li, X. (2011).
Comparing Twitter and traditional media using topic models. ECIR.
10. Hong, L., Convertino, G., Gan, C., Hsieh, G., & Chi, E. H. (2011). Language
matters in Twitter: A large scale study. ICWSM.