IFAC PapersOnLine 55-39 (2022) 292–295
DOI: 10.1016/j.ifacol.2022.12.037

Performance Comparison of Machine Learning Algorithms for Albanian News Articles

Lamir Shkurti*, Faton Kabashi**, Vehebi Sofiu***, Arsim Susuri****

*Faculty of Contemporary Sciences and Technologies, South East European University, Tetova, North Macedonia (e-mail: ls29773@seeu.edu.mk)
**, ***Faculty of Computer Sciences and Engineering, UBT Higher Education Institution, Pristina, Kosova (e-mails: faton.kabashi@ubt-uni.net, vehebi.sofiu@ubt-uni.net)
****Faculty of Computer Sciences, University of Prizren "Ukshin Hoti", Prizren, Kosova (e-mail: arsim.susuri@uni-prizren.com)
Abstract: This paper presents a methodology for classifying news articles in the Albanian language with machine learning algorithms and NLP. In this proposed work, Multinomial Naïve Bayes, Logistic Regression, k-nearest neighbors, Bernoulli, Centroid, SVM, SGD, Perceptron, Passive Aggressive, Decision Tree, and Random Forest machine learning classification algorithms are trained to classify Albanian news articles and are compared in terms of accuracy, training time, and testing time. We have used 70% of the dataset for training and 30% for tests. The execution time of training and testing the algorithms is recorded and presented for different input sizes of data. For the experiments we used different numbers of inputs, starting from 8000 articles and continuing in increasing order by 8000 until the level of 80000 articles in 8 categories. Experimental results indicate that the Passive Aggressive algorithm shows the best performance in terms of accuracy for Albanian news articles, while the classifier with the lowest accuracy was Random Forest. Logistic Regression, SVM, and Decision Tree have a high delay in training. SVM also has the longest testing time compared to other classifiers.

Copyright © 2022 The Authors. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Machine learning, NLP, text classification, data mining, classification method, news articles, multinomial naïve bayes, logistic regression, k-nearest neighbors, bernoulli, centroid, SVM, SGD, perceptron, passive aggressive, decision tree, random forest.
1. INTRODUCTION

The rapid development of Internet technology is changing the way of receiving information. Now people can read many news texts on the Internet using their smartphones. News from the Internet can be of different categories, for example, news reports, news from politics, economy, sports, technology, lifestyle, etc. People's interests in reading news on the Internet are different; therefore, the classification of the news by categories is necessary. The classification of news text is a key technology for processing news information, which can help organize information effectively and distinguish information categories according to the needs of users quickly (Fang Miao, 2018). The majority of data used to categorize genres are often gathered from the web, newsgroups, bulletin boards, and broadcast or printed news. Due to their multi-source nature, they frequently have drastically varying writing styles, preferred vocabulary, and formats, even for documents belonging to the same genre. In particular, the data are heterogeneous (Ikonomakis, 2005). Text classification is a method used to confirm the category of an unlabeled text based on topic categories defined in advance. In mathematics, it is actually a mapping:

f: A → B

where A is the set of texts to be classified, B is the set of categories, and f is the classifier in the classification process. Text classification can be used in the classification of articles in the news, question answering, and sentiment analysis. Text categorization can be divided into binary classification and multi-class classification. Binary classification is a method to determine problems of Yes or No, Positive or Negative, and it is usually used in sentiment analysis. Multiclass classification problems can be divided into single-label multi-class and multi-label classification (X. Wang, 2017).

Over the past few decades, text categorization issues have been extensively explored and handled in numerous practical applications. Many researchers are increasingly interested in creating applications that make use of text categorization techniques, especially in light of recent advancements in Natural Language Processing (NLP) and text mining (Jiang et al., 2018; Kowsari et al., 2017; McCallum et al., 1998; Kowsari et al., 2018; Heidarysafa et al., 2018; Lai et al., 2015; Aggarwal et al., 2012).

NLP is used for a variety of tasks, including text categorization, information extraction and tracking, speech tagging, opinion mining, and much more (Ferrari, 2018). In this research, several machine learning algorithms are applied to a number of datasets to determine the accuracy of the classifiers. The machine learning algorithms include Multinomial Naïve Bayes, Logistic Regression, k-nearest neighbors, Bernoulli, Centroid, SVM, SGD, Perceptron, Passive Aggressive, Decision Tree, and Random Forest. In this paper, we have used these algorithms to train the classifiers with part of our dataset of Albanian news articles and classify the other part with the already learned classifiers.
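Returning to the mapping view above, a trained classifier is literally such a function f from texts to categories. The following is a tiny, self-contained sketch with scikit-learn; the toy Albanian examples and the choice of Passive Aggressive are purely illustrative, not the paper's setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

texts = ["qeveria miratoi buxhetin", "skuadra fitoi ndeshjen"]  # A: texts
labels = ["Political", "Sport"]                                 # B: categories

# f: A -> B, learned from labeled examples.
f = make_pipeline(TfidfVectorizer(), PassiveAggressiveClassifier())
f.fit(texts, labels)

print(f.predict(["skuadra fitoi"]))  # expected: ['Sport']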

The used categories are: Region, World, Lifestyle-health, Sport, Fun-Curiosity, Political, Entertainment, and Technology.

This paper is organized as follows. The second section presents the research methodology, the third section describes the classification algorithms, and the fourth section shows the results of the research. The fifth section is a discussion, and the sixth section is the conclusion.

2. METHODOLOGY

We used the Albanian news dataset "Kosova-News-Articles" taken from the URL: https://www.kaggle.com/datasets/gentrexha/kosovo-news-articles-dataset. The dataset contains more than 3 million articles in the Albanian language taken from various web news portals from 25.09.2007 through 27.08.2020. The dataset has 3.85 GB, of which 1907045 are unique values. From this dataset, we took 80000 articles from the year 2020, in the following categories: Region, World, Lifestyle-health, Sport, Fun-Curiosity, Political, Entertainment, and Technology. For each of these categories, we obtained 10000 articles. The dataset includes two columns: the content and the category of the news articles. Then a stop-words list was created to remove occurrences of the stop-words that provide no value to the overall sentence, aiming for better efficiency. After removing stop words, we used the TF-IDF vectorizer to vectorize the content. For a structured dataset in supervised learning, we vectorized the category column using Label-Encoder, which gives a unique numerical value to every category. Then we split the original data into 70% training and 30% testing sets and trained eleven different classifiers on the training data.

To do the training and prediction with our created dataset we used scikit-learn libraries and wrote code in the Python programming language. The algorithms that we have used for comparison involve the following classifiers: Multinomial Naïve Bayes, Logistic Regression, k-nearest neighbors, Bernoulli, Centroid, SVM (Support Vector Machine), SGD (Stochastic Gradient Descent), Perceptron, Passive Aggressive, Decision Tree, and Random Forest. To evaluate these algorithms, we use the accuracy score, training time, and testing time. After training with the eleven classifiers, we saved the resulting models in pickle files, so the datasets do not need to be trained again. In addition, we created a web application to predict Albanian news articles. Users can enter a text and choose one of the eleven algorithms mentioned above. They can also choose one of the trained datasets from S1 to S10. By clicking the Classify button, the predicted category and its accuracy are returned.

3. CLASSIFICATION ALGORITHMS

The Multinomial Naive Bayes classifier is appropriate for the classification of discrete features such as word counts for text classification. The multinomial distribution normally requires integer feature counts; however, in practice, fractional counts such as TF-IDF may also work. TF-IDF creates a feature matrix from a group of raw documents.

Logistic regression is a linear model for classification rather than regression.

Bernoulli is like Multinomial Naïve Bayes in that this classifier is used for discrete data. The difference is that while Multinomial Naïve Bayes works with occurrence counts, Bernoulli Naïve Bayes is designed for binary/boolean features. A logistic function is used to model the probabilities describing the possible outcomes of a single trial.

Neighbors-based classification is a type of non-generalizing learning in which no attempt is made to construct a general internal model; instead, instances of the training data are simply stored. The neighbor classifier implements learning based on each query point's k nearest neighbors, where k is an integer value specified by the user. The optimal k value is highly data-dependent: in general, a larger value suppresses noise effects but makes classification boundaries less distinct.

The centroid classifier starts by summarizing the training dataset into a set of per-class centers, which are then used to predict new examples. It works similarly to the neighbor classifier in that it assigns a new document to the class whose centroid is nearest in the training data.

SVM classifiers are a set of supervised learning methods used for classification, regression, and outlier detection. SVM is effective in high-dimensional spaces and uses a subset of training points in the decision function, so it is also memory efficient.

The stochastic gradient descent classifier is a simple but highly efficient technique for fitting linear models. It is especially useful when the dataset is large.

The Perceptron is an algorithm that can be used for large-scale learning. It does not require a learning rate by default, is not regularized, and updates its model only on errors. The Perceptron trains slightly faster than SGD with the hinge loss, and the resulting models are sparser.

The passive-aggressive algorithms are a type of large-scale learning algorithm. They are similar to the Perceptron in that no learning rate is required; however, unlike the Perceptron, they have a regularization parameter.

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting (scikit-learn.org, 2022).
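To make the methodology concrete, here is a minimal sketch of the training and evaluation loop in Python with scikit-learn. It is illustrative rather than the authors' exact code: the CSV file name, the stop-word list, and the subset of five of the eleven classifiers are assumptions.

import pickle
import time

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Hypothetical file with "content" and "category" columns, as in the
# Kosova-News-Articles dataset described in Section 2.
df = pd.read_csv("kosovo-news-articles-2020.csv")

# Illustrative subset of an Albanian stop-word list.
albanian_stop_words = ["dhe", "te", "e", "me", "per", "nga"]

# TF-IDF feature matrix from the raw article texts.
vectorizer = TfidfVectorizer(stop_words=albanian_stop_words)
X = vectorizer.fit_transform(df["content"])

# Encode the eight category names as unique integers.
encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])

# 70% training and 30% testing, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Five of the eleven classifiers compared in the paper.
classifiers = {
    "multinomial-nb": MultinomialNB(),
    "bernoulli-nb": BernoulliNB(),
    "svm": LinearSVC(),
    "sgd": SGDClassifier(),
    "passive-aggressive": PassiveAggressiveClassifier(),
}

for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)        # training time
    train_time = time.time() - start

    start = time.time()
    y_pred = clf.predict(X_test)     # testing time
    test_time = time.time() - start

    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.2f} "
          f"train={train_time:.1f}s test={test_time:.1f}s")

    # Save the fitted model so it does not need to be trained again.
    with open(f"{name}.pkl", "wb") as fh:
        pickle.dump(clf, fh)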

Figure 1. Web interface to predict news articles.
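The web interface in Figure 1 could be served by a small prediction endpoint. The sketch below assumes Flask and models pickled as in the previous sketch; the route, form fields, and file layout are hypothetical, not the authors' implementation.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    text = request.form["text"]            # article text entered by the user
    algorithm = request.form["algorithm"]  # one of the eleven classifiers
    dataset = request.form["dataset"]      # trained dataset, "S1" .. "S10"

    # Load the artifacts trained on the chosen input size (hypothetical
    # file layout; one vectorizer/encoder/model triple per dataset).
    with open(f"models/vectorizer-{dataset}.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(f"models/encoder-{dataset}.pkl", "rb") as fh:
        encoder = pickle.load(fh)
    with open(f"models/{algorithm}-{dataset}.pkl", "rb") as fh:
        model = pickle.load(fh)

    features = vectorizer.transform([text])
    label = model.predict(features)[0]
    category = encoder.inverse_transform([label])[0]
    return jsonify({"category": category})

if __name__ == "__main__":
    app.run()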



4. RESULTS

Testing of the algorithms is done iteratively for different numbers of inputs, starting from 8000 articles (from each category we obtained 1000 articles) and continuing in increasing order by 8000 until the level of 80000 articles is reached. The table presents the results of the eleven different classifiers for different numbers of inputs: S1=8000, S2=16000, S3=24000, S4=32000, S5=40000, S6=48000, S7=56000, S8=64000, S9=72000 and S10=80000.
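How the balanced input sets S1 to S10 could be built is sketched below, assuming the DataFrame from the earlier sketch; each step adds 1000 articles per category, i.e. 8000 articles in total, consistent with the description above.

import pandas as pd

# Hypothetical file with "content" and "category" columns.
df = pd.read_csv("kosovo-news-articles-2020.csv")

subsets = {}
for i in range(1, 11):              # S1 .. S10
    per_category = 1000 * i         # 1000, 2000, ..., 10000 per category
    subsets[f"S{i}"] = (
        df.groupby("category", group_keys=False).head(per_category)
    )

# With 8 balanced categories this yields 8000, 16000, ..., 80000 rows.
print({name: len(s) for name, s in subsets.items()})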

TABLE I. THE RESULTS OF THE CLASSIFIERS FOR DIFFERENT INPUTS


Classifier S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
Multinomial NB 0.85 0.87 0.88 0.89 0.88 0.88 0.88 0.88 0.88 0.87
Logistic Regression 0.85 0.86 0.88 0.89 0.88 0.88 0.88 0.88 0.88 0.88
k-nearest neighbors 0.85 0.83 0.84 0.85 0.85 0.84 0.84 0.84 0.84 0.84
Bernoulli 0.51 0.58 0.63 0.64 0.66 0.59 0.62 0.63 0.60 0.54
Centroid 0.81 0.78 0.78 0.80 0.79 0.79 0.79 0.79 0.79 0.80
SVM 0.88 0.89 0.89 0.90 0.90 0.90 0.89 0.89 0.89 0.89
SGD 0.89 0.89 0.88 0.89 0.88 0.88 0.87 0.87 0.87 0.86
Perceptron 0.87 0.88 0.88 0.89 0.89 0.89 0.89 0.88 0.89 0.88
Passive Aggressive 0.89 0.89 0.90 0.91 0.90 0.91 0.90 0.90 0.90 0.90
Decision Tree 0.67 0.68 0.70 0.71 0.72 0.72 0.72 0.72 0.72 0.72
Random Forest 0.42 0.52 0.49 0.46 0.47 0.46 0.41 0.44 0.45 0.41

For each classifier, the main metrics such as maximum and minimum value, average, median, and standard deviation are calculated.

TABLE II. METRICS FOR INVESTIGATED CLASSIFIERS


Classifier            MAX    MIN    AVERAGE    MEDIAN    DEVIATION
Multinomial NB 0.89 0.85 0.88 0.88 0.011
Logistic Regression 0.89 0.85 0.88 0.88 0.011
k-nearest neighbors 0.85 0.83 0.84 0.84 0.006
Bernoulli 0.66 0.51 0.60 0.61 0.045
Centroid 0.81 0.78 0.79 0.79 0.009
SVM 0.90 0.88 0.89 0.89 0.007
SGD 0.89 0.86 0.88 0.88 0.008
Perceptron 0.89 0.87 0.88 0.89 0.007
Passive Aggressive 0.91 0.89 0.90 0.90 0.005
Decision Tree 0.72 0.67 0.71 0.72 0.018
Random Forest 0.52 0.41 0.45 0.46 0.035
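The metrics in Table II follow directly from the corresponding row of Table I. A sketch for the Passive Aggressive row with NumPy is below; small differences in the deviation are expected because the Table I values are already rounded to two decimals.

import numpy as np

# Accuracy of Passive Aggressive for S1..S10, copied from Table I.
acc = np.array([0.89, 0.89, 0.90, 0.91, 0.90,
                0.91, 0.90, 0.90, 0.90, 0.90])

print("max:      ", acc.max())             # 0.91
print("min:      ", acc.min())             # 0.89
print("average:  ", round(acc.mean(), 2))  # 0.90
print("median:   ", np.median(acc))        # 0.90
print("deviation:", round(acc.std(), 3))   # ~0.006 (Table II: 0.005)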
We also measured the training and testing time for each of the classifiers for different numbers of inputs, starting from S1=8000 until S10=80000; these are presented in Figure 2 and Figure 3.

Figure 2. Classifiers training time in seconds depending on the input size.

Figure 3. Classifiers testing time in seconds depending on the input size.

5. DISCUSSION

From the results of the classifiers in Table 1, we see the accuracy scores for different numbers of inputs. As the amount of data increases, the accuracy also increases, although the increase is not very large compared to the growth of the input data. After the input set S4, the increase is minimal for some classifiers, while some classifiers even show a decrease in accuracy, though the decrease is almost minimal. From the table, it can be seen that the Passive Aggressive algorithm has the highest maximum value (0.91) as well as the highest average (0.90) and the lowest standard deviation (0.005). From this, we can conclude that Passive Aggressive shows the best performance in terms of accuracy on a collection of Albanian news articles.
The worst classifier in this context is the Random Forest classifier, with the maximum value (0.52), average (0.45), and standard deviation (0.035), followed by the Bernoulli classifier, with the maximum value (0.66), average (0.60), and standard deviation (0.045). Accuracy is the most important metric when classifying text, but when there is a large input of data, training time is also of vital importance. Some classifiers, such as Logistic Regression, SVM, and Decision Tree, have good performance in accuracy scores but a high delay in training compared to the other classifiers, while other classifiers combine good performance with a very fast training time, which varies from 1 to 10 seconds depending on the size of the input. For the testing time, SVM has the longest testing time compared to the other classifiers.

6. CONCLUSIONS

Text classification now has many application areas in news classification and sentiment analysis. This paper presents the results of classifying an Albanian news text set using eleven classifiers. To achieve the best result, we used the dataset in different input sizes. The performance of the machine learning algorithms was also evaluated in terms of accuracy, training time, and testing time. We demonstrated that this methodology provides accurate results when classifying text written in Albanian. Also, the created web application can be used to predict news articles in the Albanian language.

The results show that Passive Aggressive gives the best prediction accuracy. After the Passive Aggressive classifier, the SVM classifier has the highest accuracy, but it has the longest training and testing time compared to other classifiers, while the classifier with the lowest accuracy was Random Forest. Regarding the training time, the classifiers with the longest training time are SVM, Logistic Regression, and Decision Tree.

REFERENCES

Aggarwal, C.C.; Zhai, C. A survey of text classification algorithms. In Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012; pp. 163–222.

Aggarwal, C.C.; Zhai, C.X. Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012.

Fang Miao, P. Z. L. J. H. W., 2018. Chinese news text classification based on machine learning algorithm. International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Volume 2, pp. 48–51.

Ferrari, A. (2018). Natural language requirements processing: from research to practice. In: IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion), Gothenburg, pp. 536–537.

Heidarysafa, M.; Kowsari, K.; Brown, D.E.; Jafari Meimandi, K.; Barnes, L.E. An Improvement of Data Classification Using Random Multimodel Deep Learning (RMDL). IJMLC 2018, 8, 298–310.

Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text classification using machine learning techniques. WSEAS Transactions on Computers, 4(8), 966–974.

Jiang, M.; Liang, Y.; Feng, X.; Fan, X.; Pei, Z.; Xue, Y.; Guan, R. Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 2018, 29, 61–70.

Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Jafari Meimandi, K.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical Deep Learning for Text Classification. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017.

Kowsari, K.; Heidarysafa, M.; Brown, D.E.; Jafari Meimandi, K.; Barnes, L.E. RMDL: Random Multimodel Deep Learning for Classification. In Proceedings of the 2018 International Conference on Information System and Data Mining, Lakeland, FL, USA, 9–11 April 2018; doi:10.1145/3206098.3206111.

Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 333, pp. 2267–2273.

McCallum, A.; Nigam, K. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, WI, USA, 26–27 July 1998; Volume 752, pp. 41–48.

scikit-learn.org, "scikit-learn," [Online]. Available: scikit-learn.org/stable/modules/linear_model.html. [Accessed 10 07 2022].

X. Wang et al., "Research and Implementation of a Multi-label Learning Algorithm for Chinese Text Classification," 2017 3rd International Conference on Big Data Computing and Communications (BIGCOM), Chengdu, 2017, pp. 68–76.
