
2024 IEEE International Students' Conference on Electrical, Electronics and Computer Science

An Approach to Detecting Writing Styles Based on Clustering Technique
Devkinandan Jagtap1, Shweta Ambekar2, Harshit Singh3, Nakul Sharma4
1 B. Tech, Artificial Intelligence and Data Science, VIIT, Pune, India – 411048
2 B. Tech, Artificial Intelligence and Data Science, VIIT, Pune, India – 411048
3 B. Tech, Artificial Intelligence and Data Science, VIIT, Pune, India – 411048
4 Assistant Professor, Dept. of Artificial Intelligence and Data Science, VIIT, Pune, India – 411048
1 devkinandan.22010627@viit.ac.in
2 shweta.22120183@viit.ac.in
3 harshit.22011072@viit.ac.in
4 nakul777@gmail.com

Abstract: With the advancement of AI technology, the generation of content has become easier and more accessible. In such a scenario, it is difficult to differentiate between human-generated text and AI-generated text. To address this concern, our methodology proposes an intelligent system capable of analyzing text files and classifying unique writing styles using stylometric analysis. This work also compares various clustering algorithms, including k-means, k-means++, hierarchical, and DBSCAN, using silhouette scores as the performance metric. This helps establish how effectively our system distinguishes between similar and dissimilar writing styles based on linguistic and structural features of text. Our tool separates texts of different styles and clusters similar ones together, providing a valuable means of detecting plagiarism across multiple document files. Our system demonstrated its efficacy by successfully identifying two distinct writing styles within a document. This highlights the practical application of our approach to the challenges associated with stylistic differences and its potential for plagiarism detection.

Keywords: plagiarism detection, clustering algorithms, stylometric analysis, authorship analysis, silhouette score.

I. INTRODUCTION

Various methods for plagiarism detection have been introduced by researchers. After World War II, when NLP was introduced, only text similarity measures (cosine, Jaccard similarity, etc.) were used. After the advancement of the internet, web scraping algorithms and techniques were also introduced. With the advent of modern AI/ML/DL techniques, it is necessary to incorporate additional resources for plagiarism detection, including stylometry and metadata analysis.

In recent times, there has been interest in studying the linguistic as well as the structural features of a language. This is done primarily to identify the different styles in which a given text is written. This type of analysis is called stylometric analysis. Stylometry is also applied in digital text forensics to identify doubtful text written by one or more authors. The measurable features of literary style include sentence length, readability-based scores, the richness of the vocabulary, and the frequencies of the document's entities such as words, word lengths, and word forms. Various tools are available to find individual words or sentences within a document.

Stylometry is used in a variety of problems, such as detecting plagiarism in a given text, separating genres, verifying and attributing authorship, and detecting author gender. The primary task of stylometry is classifying the different writing styles in a given text.

II. RELATED WORK

The research papers of Muhammad Badruddin Khan and of Mousumi Saha and Saptarshi Ghosh have been particularly helpful in understanding the analysis and importance of each feature in feature extraction [3][4]. A research paper by Patrick Juola suggests that stylometric analysis can also be used in plagiarism detection [5][6]. Patrick Juola and Haris Muneer employ a statistical technique known as principal component analysis (PCA) for dimensional reduction, reducing multidimensional features to 2D [6][2].

979-8-3503-4846-0/24/$31.00 ©2024 IEEE


One approach to recognizing different writing styles (Intrinsic Plagiarism Detection) is to train models on a large corpus of texts by different writers and then use them to identify which text belongs to which author. Once each author's writing style has been learned, the model assesses whether an author in a document has copied content from another author [1][8]. Instead of using a large corpus, the present work adopts the feature extraction idea of research papers [2] and [3]. We also gathered that much current research performs text analytics using stylometric features [2][3][4].

Some papers apply supervised learning algorithms such as SVM after feature extraction [7], while others use unsupervised algorithms such as k-means clustering [8]. One paper uses a KNN algorithm to detect plagiarism across multiple files, but it does not focus on the feature extraction part [10].

After feature extraction, we need to find the best algorithm to use in our system. It is evident from the literature that the complexity of an algorithm depends on various factors such as the dataset, the number of data points, the number of clusters, the number of iterations, and the number of dimensions [11][12][13]. Hence, in the proposed work, we compare three widely used clustering algorithms: k-means, hierarchical, and DBSCAN.

To compare different clustering algorithms and optimize the result, we need a strong performance measure that supports the analysis of clustering algorithms [14]. We found various candidates, such as the Silhouette Score, the Davies-Bouldin Index, and the Adjusted Rand Index (ARI).

The objective of this research paper is to optimize results by using different feature extraction methods and clustering algorithms, and to analyze how the clustering algorithms and silhouette values vary across datasets from the same and from different authors, as well as how the silhouette value varies as the number of clusters changes.

III. METHODOLOGY

The system goes through multiple stages before the final result is produced (see Fig 1).

Fig 1: Flowchart of our system

SCEECS 2024
These stages are data collection, data preprocessing, feature extraction, dimensional reduction, machine learning models for clustering, and comparison with other algorithms using the silhouette score as the performance metric.

3.1 Dataset

We created a small dataset of 10 samples containing stories, poems, research papers, and articles by the same and by different authors.

3.2 Feature Extraction

Of all the stages, feature selection/extraction is the most important in our system. Since our system does not rely on a large corpus, the features need to be crafted very carefully. For feature extraction, we use two Python libraries: pdfplumber to extract text from the PDF and PyMuPDF to obtain font information. The font information includes details such as font name, type, encoding, and statistics. Subsequently, we perform stylometric analysis on the extracted text data, concentrating on the three main parts shown in Fig 2.

Fig 2: Stylometric analysis feature extraction

A. Lexical Features: These measures capture word usage patterns within a text, in other words, the structure of the text. There are many lexical measures; the important ones used in our system are:

1. Type-token Ratio (TTR)
2. Average word length
3. Average word frequency
4. Average Syllables per Word
5. Page count
6. Functional word count

Type-token Ratio (TTR): This measure gives the proportion of distinct words (types) to the total word count (tokens) in a text. Equation-1 gives the formula for calculating the Type-Token Ratio (TTR):

$$\mathrm{TTR} = \frac{V}{N} \qquad \text{(Equation-1)}$$

where N represents the total word count (tokens) and V represents the total count of unique words (types).

Average word length: This measure finds the average number of characters per word.

Average word frequency: This measure finds the average number of times a word occurs in a text. It provides insight into the uniqueness and repetition patterns of words.

Average Syllables per Word: This measure is calculated by dividing the total count of syllables by the total count of words, giving an average count of syllables per word. It is used as a measure of linguistic complexity. A syllable is a word or part of a word that contains one vowel sound.

Functional word count: Functional words are commonly used words without significant lexical meaning, e.g., articles and conjunctions. They convey the grammatical relationships between words in a sentence.

B. Vocabulary Richness Features: These features evaluate the complexity, diversity, and originality of a text's vocabulary. A higher Vocabulary Richness value means that the text has a wide range of unique words relative to its total word count, suggesting a rich lexicon with a broad selection of words. This could indicate a diverse and varied vocabulary and, potentially, a sophisticated or expressive writing style. On the other hand, a lower Vocabulary Richness value means that the lexicon is more limited, with fewer unique words and less variation in the words used in the text. This might indicate repetitive language or a more straightforward writing style. The methods used in our system to measure Vocabulary Richness are:

1. Uber's Index
2. HD-D Index
3. Tuldava's Measure U
4. Yule's Characteristic K
5. Honoré's R Measure
6. Sichel's Measure
7. Shannon Entropy
8. Simpson's Index

Uber's Index: It combines information on hapax legomena (words that occur exactly once), hapax dislegomena (words that occur exactly twice), and the total number of words in a text to assess vocabulary richness. Equation-2 gives the formula for computing Uber's Index:

$$U = \frac{(\log N)^{2}}{\log N - \log V} \qquad \text{(Equation-2)}$$

where N is the total count of tokens and V is the total count of types.

Hypergeometric Distribution of Diversity (HD-D) index: It estimates, for each word type, the probability that the type appears in a random fixed-size sample of tokens drawn from the text. This makes it less sensitive than the plain TTR to how the ratio changes as a text gets longer.

Tuldava's Measure U: It considers the number of unique stems (root forms) in relation to the total number of words.

Yule's Characteristic K: It assesses how evenly the frequencies of different words are distributed in a text. A smaller value of K indicates a more diversified vocabulary. A high value of K, on the other hand, suggests that the author's vocabulary is heavily concentrated on a small number of repeated words. Equation-3 gives the formula for computing Yule's Characteristic K:

$$K = 10^{4} \cdot \frac{\sum_{X=1}^{N} f_X X^{2} - N}{N^{2}} \qquad \text{(Equation-3)}$$

where
f_X is the number of word types that occur exactly X times,
N is the total count of tokens,
X ranges over the observed word frequencies.

Shannon Entropy: This formula measures the information entropy of a text. It quantifies the uncertainty or surprise associated with predicting the next word in a sequence. The formula for Shannon Entropy is given by Equation-4:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \qquad \text{(Equation-4)}$$

where p(x_i) is the probability of occurrence of the i-th word type.

Simpson's Index: It is based on the likelihood that two words chosen at random from a text are identical, which is the sum of the squared word probabilities and lies between 0 and 1. Equation-5 takes the reciprocal of this likelihood, so higher values indicate greater diversity:

$$D = \frac{1}{\sum_{i=1}^{n} p(x_i)^{2}} \qquad \text{(Equation-5)}$$

where p(x_i) is the probability of occurrence of the i-th word type.

Because Uber's Index already reflects hapax legomena (words that occur exactly once) and hapax dislegomena (words that occur exactly twice), we do not take Honoré's R Measure and Sichel's Measure into consideration.

C. Readability Scores: These represent the text's level of difficulty or simplicity, and the perceived difficulty differs from person to person. Readability scoring incorporates theoretical linguistics, statistical modeling, and psychological theory to evaluate the accessibility of writing patterns (Dubay, 2004). For the Flesch Reading Ease score, a higher value means the text is easier to read, whereas for the grade-level indices a higher value means the text is harder to read. Our system measures readability using the following techniques:

1. Flesch-Kincaid Grade Level
2. Dale Chall Readability Formula
3. Linsear Write Index
4. Gunning Fog Index

Flesch-Kincaid Grade Level: This metric estimates the grade level in American schools at which a text can be understood. It takes into account variables such as the typical sentence length and the syllable count per word. Equation-6 gives the formula for calculating the Flesch Reading Ease score:

$$\mathrm{FRES} = 206.835 - \left(\frac{1.015 \cdot W}{St}\right) - \left(\frac{84.6 \cdot Sy}{W}\right) \qquad \text{(Equation-6)}$$
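As an illustration only, the count-based measures defined so far (Equations 1, 3, 4 and 6) can be sketched with the Python standard library. The tokenizer, the function names, and the choice of base-2 logarithms for the entropy are assumptions made for this sketch; the paper does not specify them, and this is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (a simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Equation-1: number of distinct words V divided by total words N."""
    return len(set(tokens)) / len(tokens)


def yules_k(tokens):
    """Equation-3: K = 10^4 * (sum_X f_X * X^2 - N) / N^2."""
    n = len(tokens)
    word_counts = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(word_counts.values())   # X -> number of types occurring X times (f_X)
    weighted = sum(f_x * x * x for x, f_x in spectrum.items())
    return 1e4 * (weighted - n) / (n * n)


def shannon_entropy(tokens, base=2):
    """Equation-4: -sum p(x_i) * log p(x_i); the log base is an assumption."""
    n = len(tokens)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(tokens).values())


def flesch_reading_ease(words, sentences, syllables):
    """Equation-6, computed from pre-counted words (W), sentences (St) and syllables (Sy)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
```

For example, `tokenize("The cat sat on the mat. The end.")` yields 8 tokens and 6 types, so the type-token ratio is 0.75. Syllable counting is deliberately left out of the sketch because it needs a pronunciation dictionary or a heuristic that the text does not describe.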

Equation-7 gives the formula for calculating the Flesch-Kincaid Grade Level:

$$\mathrm{FKGL} = -15.59 + \left(\frac{0.39 \cdot W}{St}\right) + \left(\frac{11.8 \cdot Sy}{W}\right) \qquad \text{(Equation-7)}$$

where W denotes the word count, St denotes the sentence count, and Sy denotes the syllable count.

Dale Chall Readability Formula (DCRF): It evaluates a text's readability by taking into account both the overall word count and the number of difficult words. It provides a grade level estimate. Equation-8 gives the Dale Chall readability formula:

$$P = \frac{100 \cdot W_{\text{difficult}}}{W}$$

$$\mathrm{DCRF} = 0.1579 \cdot P + 0.0496 \left(\frac{W}{St}\right) \qquad \text{(Equation-8)}$$

where W stands for the number of words, St for the number of sentences, and W_difficult for the number of difficult words.

Linsear Write Index:

$$\mathrm{LW} = \frac{100 - \left(\frac{100 \cdot W_{<3Sy}}{W}\right) + \left(\frac{300 \cdot W_{\geq 3Sy}}{W}\right)}{\left(\frac{100 \cdot St}{W}\right)} \qquad \text{(Equation-9)}$$

where W is the total number of words, W_{<3Sy} is the count of words with fewer than three syllables, W_{≥3Sy} is the count of words with three or more syllables, and St is the total number of sentences in the text.

Gunning Fog Index: It estimates the number of years of schooling needed to understand a text on a single reading. It takes into account elements such as the number of complex words and the sentence length. Equation-10 gives the Gunning Fog Index formula:

$$G = 0.4 \left[\left(\frac{W}{St}\right) + \left(\frac{100 \cdot C_w}{W}\right)\right] \qquad \text{(Equation-10)}$$

where W stands for the number of words, St for the number of sentences, and C_w for the number of complex words. Words with three or more syllables are considered complex.

Since we computed more than ten features, we had to use dimensional reduction to transform our feature vector into a two-dimensional vector. Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and Generalized Discriminant Analysis (GDA) are some of the methods used for dimensional reduction. Our system used linear discriminant analysis to turn the 15D vector into a 2D vector that retains its essential information.

3.3 Machine Learning Algorithms

The genesis of the present work was clustering related styles together; hence unsupervised ML algorithms were used. The algorithms we can use are k-means, k-means++, hierarchical, and DBSCAN.

K-means: K-means is a popular clustering method that separates data into K clusters based on the similarities between data points. It minimizes the total squared distance between each data point and its corresponding cluster centroid.

We apply the Elbow approach to determine the ideal value of K.

Fig 3: Elbow diagram at k = 3

K-means++: K-means++ is an improvement over K-means in how the initial centroids are selected. It uses a smarter initialization strategy to improve convergence.

Hierarchical Clustering: Hierarchical clustering creates a hierarchy of clusters that resembles a tree. It can be divisive (top-down) or agglomerative (bottom-up). Agglomerative hierarchical clustering is more commonly used.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a technique that groups nearby data points according to a density criterion. Finding clusters with irregular shapes is one of its main applications.

To find which algorithm is more accurate, we use the Silhouette Score. The silhouette value quantifies how cohesive an item is within its own cluster compared with how close it is to other clusters (separation). A high value indicates that the item matches well with its own cluster and poorly with nearby clusters. The silhouette ranges from −1 to +1.
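The K-means procedure and the Elbow approach described in this section can be sketched as follows. This is a minimal pure-Python illustration of plain K-means with randomly chosen initial centroids, not the K-means++ initialization and not the authors' code; the function names, the `seed` parameter, and the iteration cap are assumptions of the sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means. Returns (labels, centroids, wcss).

    wcss is the within-cluster sum of squared distances, the quantity
    that is plotted against k in an elbow diagram.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # assumed: random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:               # assignments stable: converged
            break
        labels = new_labels
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss
```

Running `kmeans(points, k)[2]` for k = 1, 2, 3, ... and plotting the returned values gives the curve whose bend is read off in the Elbow approach.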

Equation-11 gives the formula for finding the silhouette score:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \qquad \text{(Equation-11)}$$

The average distance between the i-th data point and the other data points in the same cluster is denoted by a(i).
The average distance between the i-th data point and the data points of another cluster, minimized across all other clusters, is denoted by b(i).

IV. RESULTS

First, we conducted an experiment with two different styles (k = 2) and found that the system separates the stories of one author and the poems of another author into two distinct clusters (see Fig 4). In this experiment, the k-means, k-means++, and hierarchical clustering algorithms had the same silhouette score.

Fig 4: Scatter plot when k = 2 (same author)

However, if we take stories by different authors and poems by different authors, the texts are still separated into clusters, but the data points within a cluster are farther apart than in the previous result, so the silhouette value decreases (see Fig 5). In this experiment, the silhouette value was 45.68 for k-means, 46.77 for k-means++, and 43.09 for hierarchical clustering. The slight difference in silhouette score arises because different authors have different styles while the format is the same.

Fig 5: Scatter plot when k = 2 (different authors)

When the texts came from the same author, the silhouette value was 60 to 70, but when texts from different authors were used, the silhouette value decreased to 40 to 50 (these values are stated on a percentage scale, so 60 corresponds to 0.60). For the silhouette score, an average silhouette width of over 0.7 is considered "strong", a value over 0.5 "reasonable", and a value over 0.25 "weak".

When we increased the value of k from 2 to 5 on another dataset, the silhouette value decreased further to 30 to 40 (see Fig 6), and the 5 clusters were very close to each other.

Fig 6: Scatter plot when k = 5

We ran this experiment on 10 samples with different values of k; the text data included stories, articles, and research papers. All of the text data and output screenshots are available on GitHub. Based on that output, we created a table (Fig 7) and observed that as the value of k increases from 2 to 3, 4, and 5, the silhouette score decreases. With k = 2, the silhouette value is the same for all clustering algorithms, but as k (the number of clusters) increases, the silhouette score decreases.

Fig 7: Comparison of silhouette scores
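To make the reported scores easier to reproduce, the following sketch computes the mean silhouette value for a given labelling, using the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), and maps it to the "strong", "reasonable", and "weak" bands quoted in the text. It is a hedged pure-Python illustration, not the authors' implementation: the function names, the score of 0 for singleton clusters, and the label returned below 0.25 are assumptions of the sketch. It expects at least two clusters and Python 3.8 or later (for `math.dist`).

```python
import math


def mean_silhouette(points, labels):
    """Average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)     # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def interpret(width):
    """Map an average silhouette width (on the -1 to +1 scale) to the bands in the text."""
    if width > 0.7:
        return "strong"
    if width > 0.5:
        return "reasonable"
    if width > 0.25:
        return "weak"
    return "below weak"            # assumed label; the text does not name this band
```

On the −1 to +1 scale, a value such as 0.4568 falls in the "weak" band, which is how a score reported as 45.68 would be read if it is indeed a percentage.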

V. CONCLUSION AND FUTURE SCOPE

From the above experiments and results, we conclude that our system works properly for determining different writing styles when k = 2 (when there are two clusters). However, as the value of k increases (that is, as the number of clusters, and hence of author, story, and poem styles, grows), the silhouette value decreases, making it difficult to discern style differences in documents and resulting in decreased accuracy.

When comparing different clustering algorithms, we observed that the outputs of the k-means and k-means++ algorithms are nearly the same, because the logic behind both algorithms is the same. However, the output of the hierarchical algorithm is less well optimized.

There are many opportunities for future research in this work, such as adding more parameters to evaluate style similarity and optimizing the current methodology so that it also gives good accuracy as the value of k increases. This work can be extended to enhance existing plagiarism detection algorithms. It can be particularly useful in academic settings for detecting plagiarism between assignments by detecting style similarities.

REFERENCES

[1] López-Escobedo, F., Méndez-Cruz, C.F., Sierra, G. and Solórzano-Soto, J., 2013. Analysis of stylometric variables in long and short texts. Procedia-Social and Behavioral Sciences, 95, pp.604-611.

[2] Elahi, H. and Muneer, H., 2018. Identifying Different Writing Styles in a Document Intrinsically using Stylometric Analysis. The complete code and detailed documentation is available on the attached GitHub link: https://github.com/harismuneer/Writing-Styles-Classification-Using-Stylometric-Analysis.

[3] Al Ghamdi, N.M. and Khan, M.B., 2022. Assessment of performance of machine learning based similarities calculated for different English translations of Holy Quran. International Journal of Computer Science & Network Security, 22(4), pp.111-118.

[4] Saha, M. and Ghosh, S., 2023. Identifying Stylometric Characteristics of Domain Specific Texts using Classification Algorithms: A Study of Library Science Articles Published in 2020. Journal of Information and Knowledge, pp.159-167.

[5] Juola, P., 2017, May. Detecting contract cheating via stylometric methods. In Proceedings on the Conference on Plagiarism across Europe and Beyond (pp. 187-198).

[6] Ison, D.C., 2020. Detection of Online Contract Cheating Through Stylometry: A Pilot Study. Online Learning, 24(2), pp.142-165.

[7] Krause, M., 2015. Stylometry-based fraud and plagiarism detection for learning at scale. In Proceeding of the KSS Workshop (Vol. 15).

[8] Khan, J., 2017. Style breach detection: An unsupervised detection model. Notebook for PAN at CLEF 2017. In Proceedings of the Conference and Labs of the Evaluation Forum and Workshop (CLEF'17).

[9] Terreau, E., Gourru, A. and Velcin, J., 2021, November. Writing style author embedding evaluation. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems (pp. 84-93).

[10] Sahu, M., 2016. Plagiarism detection using artificial intelligence technique in multiple files. International Journal of Scientific and Technology Research, 5(4).

[11] Sonagara, D. and Badheka, S., 2014. Comparison of basic clustering algorithms. Int. J. Comput. Sci. Mob. Comput, 3(10), pp.58-61.

[12] Sehgal, G. and Garg, D.K., 2014. Comparison of various clustering algorithms. International Journal of Computer Science and Information Technologies, 5(3), pp.3074-3076.

[13] Kumar, V., Chhabra, J.K. and Kumar, D., 2014. Performance evaluation of distance metrics in the clustering algorithms. INFOCOMP Journal of Computer Science, 13(1), pp.38-52.

[14] Saputra, D.M., Saputra, D. and Oswari, L.D., 2020, May. Effect of distance metrics in determining k-value in k-means clustering using elbow and silhouette method. In Sriwijaya International Conference on Information Technology and Its Applications (SICONIAN 2019) (pp. 341-346). Atlantis Press.

[15] Uzun, L., 2023. ChatGPT and academic integrity concerns: Detecting artificial intelligence generated content. Language Education and Technology, 3(1).
