An Approach To Detecting Writing Styles Based On Clustering Technique
Abstract—With the advancement of AI technology, the generation of content has become easier and more accessible. In such a scenario, it is difficult to differentiate between human-generated text and AI-generated text. To address this concern, our methodology proposes an intelligent system capable of analyzing text files and classifying unique writing styles using stylometric analysis. This work also compares various clustering algorithms, including k-means, k-means++, hierarchical, and DBSCAN, using silhouette scores as the performance metric. This comparison assesses how effectively our system distinguishes between similar and dissimilar writing styles based on linguistic and structural features of text. Our tool separates text of different styles and clusters similar text together, providing a valuable solution for detecting plagiarism across multiple document files. Our system demonstrated its efficacy by successfully identifying two distinct writing styles within a document. This highlights the practical application of our approach in addressing the challenges associated with stylistic differences and its potential for plagiarism detection.

Keywords— plagiarism detection, clustering algorithms, stylometric analysis, authorship analysis, silhouette score.

I. INTRODUCTION

Various methods for plagiarism detection have been introduced by researchers. After World War II, when NLP was introduced, only text similarity (cosine, Jaccard similarity, etc.) was used. After the advancement of the internet, web scraping algorithms and techniques were also introduced. With the advent of modern AI/ML/DL techniques, it is necessary to incorporate additional resources for plagiarism detection, which include stylometry and metadata analysis.

In recent times, there has been interest in studying the linguistic as well as the structural features of a language. This is done primarily to identify the different writing styles in a given text. This type of analysis is known as stylometric analysis. Stylometry is also applied in digital text forensics to identify doubtful text written by one or more authors. The measurable features of literary style are sentence length, readability-based scores, the richness of the vocabulary, and the frequencies of the document's entities such as words, word lengths, word forms, etc. There are various tools available to find individual words or sentences within a document.

Stylometry is used in various problems such as detecting plagiarism in a given text, separation of genres, verification of authors and their attributions, author gender detection, etc. The primary task of stylometry is classifying the different writing styles in a given text.

II. RELATED WORK

The research papers of Muhammad Badruddin Khan, Mousumi Saha, and Saptarshi Ghosh have been particularly helpful in understanding the analysis and importance of each feature in feature extraction [3][4]. One research paper by Patrick Juola suggests that stylometric analysis can also be used in plagiarism detection [5][6]. Patrick Juola and Haris Muneer employ a statistical technique known as principal component analysis (PCA) for dimensionality reduction, reducing multidimensional features to 2D [6][2].
SCEECS 2024
These stages are data collection, data preprocessing, feature extraction, dimensionality reduction, machine learning models for clustering, and comparison with other algorithms using the silhouette score as the performance metric.

3.1 Dataset

We created one small dataset of 10 samples containing stories, poems, research papers, and articles by the same and by different authors.

3.2 Feature Extraction

Of all the stages, feature selection/feature extraction is a very important stage of our system. Since our system does not rely on a large corpus, we need to craft features very carefully. For feature extraction, we use two Python libraries: pdfplumber to extract text from the PDF and PyMuPDF to obtain font information. The font information includes details such as font name, type, encoding, and statistics. Subsequently, we perform stylometric analysis on the extracted text data, concentrating on three main parts shown in Fig 2.

Some important methods in our system are:

1. Type-token Ratio (TTR)
2. Average word length
3. Average word frequency
4. Average Syllables per Word
5. Page count
6. Functional word count

Type-token Ratio (TTR): This measure gives the proportion of distinct words (types) to the total word count (tokens) in a text. Equation-1 gives the formula for calculating the Type-Token Ratio (TTR).

$$\mathrm{TTR} = \frac{V}{N} \qquad \text{(Equation-1)}$$

where N represents the total word count (tokens) and V represents the total count of unique words.

Average word length: This measure finds the average number of characters per word.

Average word frequency: This measure finds the average number of times a word occurs in a text. It provides insight into the patterns of unique and repeated words.
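The three lexical measures described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `tokenize` and `lexical_features`, the regex-based tokenization, and the reading of "average word frequency" as tokens per distinct word (N / V) are assumptions, not details given in the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_features(text):
    """Return TTR, average word length, and average word frequency for a text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    n_tokens = len(tokens)  # N in Equation-1
    n_types = len(counts)   # V in Equation-1
    return {
        "ttr": n_types / n_tokens,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
        # Assumed definition: mean number of occurrences per distinct word.
        "avg_word_frequency": n_tokens / n_types,
    }


features = lexical_features("the cat sat on the mat and the dog sat too")
print(features)
```

With this definition the average word frequency is simply the reciprocal of the TTR, so a feature pipeline might keep only one of the two or define frequency differently.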
used in the text. This might be indicative of repetitive language or a more straightforward writing style. Methods used in our system to measure Vocabulary Richness are:

1. Uber's Index
2. HD-D Index
3. Tuldava's Measure U
4. Yule's Characteristic K
5. Honoré's R Measure
6. Sichel's Measure
7. Shannon Entropy
8. Simpson's Index

Uber's Index: It combines information on hapax dislegomena (words that occur exactly twice), hapax legomena (words that occur exactly once), and the total words in a text to assess vocabulary richness. Equation-2 gives the formula for computing Uber's Index.

$$U = \frac{(\log N)^2}{\log N - \log V} \qquad \text{(Equation-2)}$$

where N is the total count of tokens and V is the total count of types.

Hypergeometric Distribution of Diversity (HD-D) index: It is a measure of how the TTR changes as a text gets longer.

Tuldava's Measure U: It considers the number of unique stems (root forms) in relation to the total number of words.

Yule's Characteristic K: It assesses how evenly the frequencies of different words are distributed in a text. A smaller value of K indicates a more diversified vocabulary. On the other hand, a high value of K suggests that the author's vocabulary is heavily concentrated on a small number of repeated words. Equation-3 gives the formula for computing Yule's Characteristic K.

Shannon Entropy: This is a measure of the information entropy in a text. It quantifies the uncertainty or surprise associated with predicting the next word in a sequence. The formula for finding Shannon Entropy is given by Equation-4.

$$H = -\sum_{i=1}^{V} p(w_i) \log p(w_i) \qquad \text{(Equation-4)}$$

where p(w_i) is the probability of occurrence of the i-th word type.

Simpson's Index: It assesses the likelihood that two words chosen at random from a text will be identical. The range is 0 to 1, and values closer to 0 indicate greater diversity. The formula for finding Simpson's Index is given by Equation-5.

$$D = \sum_{i=1}^{V} \frac{n_i (n_i - 1)}{N (N - 1)} \qquad \text{(Equation-5)}$$

where n_i is the number of occurrences of the i-th word type.

Because Uber's Index, which we use, depends on hapax legomena (words that occur exactly once) and hapax dislegomena (words that occur exactly twice), we do not take Honoré's R Measure and Sichel's Measure into consideration.

C. Readability Scores: These represent the text's level of difficulty or simplicity, and they differ from person to person. The readability score incorporates theoretical linguistics, statistical modeling, and psychological theory to evaluate the level of accessibility of writing patterns (Dubay, 2004). The text is easier to read if its readability score is higher. Our system measures readability using the following techniques:

1. Flesch-Kincaid Grade Level
2. Dale Chall Readability Formula
3. Linsear Write Index
4. Gunning Fog Index

$$F_{RE} = 206.835 - \left(\frac{1.015 \cdot W}{St}\right) - \left(\frac{84.6 \cdot Sy}{W}\right) \qquad \text{(Eq-6)}$$
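As a rough illustration of the vocabulary richness measures in Equations 2, 4, and 5 and of the Flesch Reading Ease score in Eq-6, a minimal Python sketch follows. The function names, the natural-log base, the regex tokenization, and the vowel-group syllable heuristic are all assumptions made for illustration. The Yule's K line uses the commonly cited form 10^4 (sum of squared type counts minus N) / N^2, which is an assumption here because the formula itself is not recoverable from the text.

```python
import math
import re
from collections import Counter


def vocabulary_richness(tokens):
    """Uber's Index, Yule's K, Shannon entropy, and Simpson's Index for a token list."""
    counts = Counter(tokens)
    n = len(tokens)  # N: total tokens
    v = len(counts)  # V: distinct word types
    if n < 2 or v == n:
        # Uber's Index divides by log N - log V, which is zero when V == N.
        raise ValueError("need at least two tokens and one repeated word")
    return {
        "uber": math.log(n) ** 2 / (math.log(n) - math.log(v)),
        # Assumed standard form of Yule's K (not taken from the paper).
        "yule_k": 1e4 * (sum(c * c for c in counts.values()) - n) / (n * n),
        "shannon_entropy": -sum((c / n) * math.log(c / n) for c in counts.values()),
        "simpson": sum(c * (c - 1) for c in counts.values()) / (n * (n - 1)),
    }


def count_syllables(word):
    """Heuristic: one syllable per group of consecutive vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease with W words, St sentences, and Sy syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)


tokens = "a a b c".split()
richness = vocabulary_richness(tokens)
fre = flesch_reading_ease("The cat sat. The dog ran.")
print(richness, fre)
```

The syllable heuristic is deliberately crude (it over-counts silent vowels, for example), so a real system would likely rely on a pronunciation dictionary or a dedicated readability library instead.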
Equation-7 gives the formula for calculating the Flesch-Kincaid Grade Level.

Flesch-Kincaid Grade Level:

$$F_{KGL} = -15.59 + \left(\frac{0.39 \cdot W}{St}\right) + \left(\frac{11.8 \cdot Sy}{W}\right) \qquad \text{(Equation-7)}$$

where W denotes the word count, St denotes the sentence count, and Sy denotes the syllable count.

Dale Chall Readability Formula (DCRF): It evaluates a text's readability by taking into account both the overall word count and the number of difficult words. It provides a grade level estimate. Equation-8 gives the Dale-Chall readability formula.

$$D = \left(\frac{100 \cdot W_{difficult}}{W}\right) \qquad \text{(Equation-8)}$$

Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and Generalized Discriminant Analysis (GDA) are some of the methods used in dimensionality reduction. Our system used linear discriminant analysis to take the 15D feature vector and turn it into a 2D vector that preserves its essential structure.

3.3 Machine Learning Algorithms

The goal of the present work was to cluster related styles together; hence unsupervised ML algorithms were used. The algorithms we can use are k-means, k-means++, hierarchical clustering, and DBSCAN.

K-means: K-means is a popular clustering method that separates data into K clusters based on the similarities between data points. It minimizes the total squared distance between each data point and its corresponding cluster centroid.
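To make the k-means description concrete, here is a minimal pure-Python sketch of the standard Lloyd iteration on 2D points, such as the 2D vectors produced by dimensionality reduction. It is not the authors' implementation: the function name `kmeans`, the deterministic seeding with the first k points, and the toy data are assumptions chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster equal-length numeric tuples with Lloyd's algorithm."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The k-means++ variant changes only the seeding step: each new initial centroid is drawn with probability proportional to its squared distance from the centroids already chosen, which tends to give a better starting configuration.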
Equation-11 gives the formula for finding the silhouette score.

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \qquad \text{(Equation-11)}$$

The average distance between the i-th data point and the other data points in the same cluster is denoted by a(i). The minimum, taken across all other clusters, of the average distance between the i-th data point and the data points in a distinct cluster is denoted by b(i).

IV. RESULTS

First, we conducted an experiment with two different styles (k = 2) and found that the system separated the stories of one author and the poems of another author into two distinct clusters (see Fig 4). In this experiment, k-means, k-means++, and hierarchical clustering produced the same silhouette score.

When we used texts by the same author, the silhouette value was 60 – 70, but when texts by different authors were used, the silhouette value decreased to 40 – 50. For the silhouette score, an average silhouette width over 0.7 is considered "strong", a value over 0.5 "reasonable", and a value over 0.25 "weak".

When we increased the value of k from 2 to 5 on another dataset, the silhouette value decreased to 30 – 40 (see Fig 6), and those 5 clusters were very close to each other.

Fig 6: scatter plot when k = 5
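The silhouette score of Equation-11 can be computed directly from its definition. The sketch below is a plain-Python illustration under stated assumptions: Euclidean distance, a score of 0 for singleton clusters (a common convention), and hypothetical toy points; the function name `silhouette_score` is chosen for illustration only.

```python
import math


def silhouette_score(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical example: two tight, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(score)
```

Scores near 1 mean points sit much closer to their own cluster than to the nearest other cluster, which is why well-separated styles yield higher values than overlapping ones.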
CONCLUSION AND FUTURE SCOPE

From the above experiments and results, we conclude that our system works properly for determining different writing styles when k = 2 (when there are two clusters). However, as the value of k increases (indicating more clusters, that is, more author, story, and poem styles), the silhouette value decreases, making it difficult to discern style differences in documents and resulting in decreased accuracy.

[7] Krause, M., 2015. Stylometry-based fraud and plagiarism detection for learning at scale. In Proceedings of the KSS Workshop (Vol. 15).
[8] Khan, J., 2017. Style breach detection: An unsupervised detection model—Notebook for PAN at CLEF 2017. In Proceedings of the Conference and Labs of the Evaluation Forum and Workshop (CLEF'17).
[9] Terreau, E., Gourru, A. and Velcin, J., 2021, November. Writing style author embedding evaluation. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems (pp. 84-93).