
2.1. Related works
Manek et al. proposed a method of feature extraction based on the Gini index with
classification by support vector machine (SVM). Hai et al. proposed a new
probabilistic supervised joint sentiment model (SJSM), which can not only identify
semantic sentiments in comment data but also infer the overall sentiment of that
data. Singh et al. used four machine learning algorithms (Naive Bayes, J48, BFTree
and OneR) for text sentiment analysis. Huang et al. proposed a multi-modal joint
sentiment topic model: building on user personality features and sentiment
influence factors, the model uses latent Dirichlet allocation (LDA) to uncover the
hidden user sentiments and topic types in Weibo text. Huq et al. used SVM and
k-nearest neighbors (KNN) algorithms to analyze the sentiment of Twitter data.
Long et al. used SVM to classify stock forum posts using additional samples
containing prior knowledge. Although machine learning-based methods can classify
text automatically, they often rely on manual feature selection. Deep
learning-based approaches, by contrast, require no such manual intervention: they
select and extract features automatically through the neural network structure and
can learn from their own errors.
2.2. Theoretical Basis
2.2.1. Sentiment analysis
According to Yelena Mejova (2009), sentiment analysis is a field of research
closely related to (or arguably a part of) computational linguistics, natural
language processing, and text mining. Proceeding from the study of affective state
(psychology) and judgment (appraisal theory), this field seeks to answer questions
long studied in other areas of discourse using the new tools provided by data
mining and computational linguistics.

2.2.2. Natural language processing
Natural language processing (NLP) is a theory-motivated range of computational
techniques for the automatic analysis and representation of human language. This
concept was introduced by E. Cambria and B. White in the research paper "Jumping
NLP Curves: A Review of Natural Language Processing Research"

2.2.3. N-gram analysis
N-gram modeling is a popular feature identification and analysis approach used in
language modeling and natural language processing. Its origins go back to Claude
Shannon's work in 1948.
N-gram models are among the most used features in text categorization; three main
types of frequency analysis of Latin-script text are of interest here:
- Monogram (unigram) analysis
- Bigram analysis
- Trigram analysis

Detecting Opinion Spam and Fake News Using N-gram Analysis and Semantic
Similarity (uvic.ca)
Practical Cryptography
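As an illustration, the three levels of analysis above can be produced by one generic n-gram extractor (the sample sentence and whitespace tokenization are illustrative only):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()

monograms = ngrams(tokens, 1)   # single tokens
bigrams = ngrams(tokens, 2)     # adjacent token pairs, e.g. ("to", "be")
trigrams = ngrams(tokens, 3)    # adjacent token triples
```

Counting the frequency of each n-gram then yields the feature vectors used in text categorization.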
2.2.4. Term Frequency - Inverse Document Frequency (TF-IDF method)
There are many techniques and algorithms that can be used to process data; this
study uses one of them, known as TF-IDF. According to Q. Shahzad and A. Ramsha
(2018), TF-IDF is a numerical statistic that shows the relevance of keywords to
specific documents; in other words, it supplies the keywords by which specific
documents can be identified or categorized.
Term Frequency (TF) measures how often a term occurs in a single document:

TF(t, d) = N(t, d) / T

in which:
N(t, d): number of occurrences of term t in document d
T: total number of terms in document d

Thus, for each document and word, a different TF(t, d) value is assigned.

Inverse Document Frequency (IDF) is a metric for determining how rare a term is
across a collection of documents:

IDF(t) = log(N / N(t))

in which:
N: total number of documents
N(t): number of documents in which the term t appears

The TF-IDF weight of a term in a document is the product of the two:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents
(researchgate.net)
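The TF and IDF definitions in this section translate directly into code; a minimal sketch, using a hypothetical three-document corpus (a practical system would add smoothing so that IDF is defined for unseen terms):

```python
import math

def tf(term, doc):
    # TF(t, d) = N(t, d) / T
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t) = log(N / N(t))
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the service was great".split(),
    "great food great staff".split(),
    "the food was cold".split(),
]
weight = tf_idf("great", docs[1], docs)
```

Note that a term appearing in every document gets IDF = log(1) = 0, so TF-IDF correctly suppresses uninformative words.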
2.2.5. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is a method to address
classification problems with an imbalanced class distribution. The key feature of
this method is that it combines under-sampling of the majority classes with
over-sampling of the minority class (Chawla et al., 2002).

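The core of SMOTE is interpolation between a minority-class point and one of its k nearest minority neighbours. A minimal NumPy sketch of that step, with toy minority points (real applications typically use the third-party imbalanced-learn package, which also handles sampling ratios):

```python
import numpy as np

def smote_sample(X_min, k=2, n_new=4, rng=None):
    """Generate synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from X_min[i] to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
X_new = smote_sample(X_minority, k=2, n_new=4, rng=0)
```

Because each synthetic point lies on a segment between two real minority points, SMOTE enlarges the minority region instead of merely duplicating samples.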
2.2.6. Classification algorithm
- Support Vector Machine (SVM): SVM is derived from Statistical Learning
Theory (SLT), a theory that studies machine learning rules in the case of
small samples. In 1995, Vapnik et al. developed this theory into the
powerful SVM algorithm. SVM performs very well on small and non-linear
samples, which is why it has attracted researchers in the field of machine
learning. SVMs are used for classification, regression and outlier
detection. The SVM classifier builds a model that assigns new data points
to one of the given categories; it can therefore be viewed as a
non-probabilistic binary linear classifier.
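Once a linear SVM has been trained, assigning a new point to a category reduces to the sign of a decision function; a minimal sketch, with hypothetical learned weights:

```python
def svm_predict(w, b, x):
    """Linear SVM decision rule: the sign of w·x + b assigns the class."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# hypothetical weights and bias from a trained model
w, b = [2.0, -1.0], -0.5
svm_predict(w, b, [1.0, 0.5])   # score 1.0 -> class 1
```

Training chooses w and b so that this hyperplane separates the classes with the largest possible margin.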
- Logistic Regression:
In 1990, the phrase "logistic models" was added to the Medical Subject
Headings (MeSH) thesaurus used by the National Library of Medicine to index
articles for the MEDLINE/PubMed database. Logistic regression models are
defined as "statistical models which describe the relationship between a
qualitative dependent variable (that is, one which can take only certain discrete
values, such as the presence or absence of a disease) and an independent
variable".
Logistic regression models are used to predict a categorical variable by one or
more continuous or categorical independent variables. The dependent variable
can be binary, ordinal or multicategorical.
The independent variables can be interval/scale, dichotomous, discrete, or a
mixture of them all.
The logistic regression equation, in the case where the dependent variable is
binomial, is expressed as follows:

P(Y = 1) = 1 / (1 + e^-(β0 + β1x1 + β2x2 + ... + βkxk))
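A minimal sketch of evaluating the binomial logistic model, with hypothetical coefficients:

```python
import math

def logistic(x, beta0, betas):
    """P(Y = 1 | x) for a binomial logistic regression model."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical fitted coefficients; two predictor variables
p = logistic([2.0, 1.0], beta0=-1.0, betas=[0.5, 0.5])
```

Because the output is a probability in (0, 1), a class label is usually obtained by thresholding, e.g. predicting Y = 1 when p > 0.5.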
- Decision Tree:
According to F. Yang (2019), a decision tree is a tree-based technique in
which any path beginning from the root is described by a data-separating
sequence until a Boolean outcome at the leaf node is reached.
An Extended Idea about Decision Trees (IEEE Xplore)
- The k-nearest neighbor (KNN):
The k-nearest neighbor algorithm is a technique for classifying objects based
on the closest training examples in the problem space. KNN is a type of
instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until classification
(Lloyd-Williams, 1998).

Performance Evaluation of SVM and K-Nearest Neighbor Algorithm over Medical
Data Set (psu.edu)
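Because all computation is deferred until classification, a complete KNN classifier fits in a few lines; the toy training points below are illustrative:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
         ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
knn_classify(train, (4.5, 5.0), k=3)   # -> "pos"
```

Note there is no training step at all: the "model" is simply the stored training set, which is what makes KNN a lazy learner.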
- Naive Bayes:
Naive Bayes classifiers are probabilistic classifiers based on applying
Bayes' theorem with strong (naive) independence assumptions between the
features. Despite its simplicity, the Naive Bayes classifier often performs
surprisingly well and is widely used because it often outperforms more
sophisticated classification methods.
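The naive independence assumption makes scoring a document a simple product of per-word probabilities; a sketch with hypothetical class-conditional probabilities:

```python
def naive_bayes_score(words, prior, likelihood):
    """Unnormalised posterior: P(c) * prod_i P(w_i | c), assuming the
    words are conditionally independent given the class c."""
    score = prior
    for w in words:
        score *= likelihood.get(w, 1e-6)   # tiny floor for unseen words
    return score

# hypothetical per-class word probabilities estimated from training data
pos = {"great": 0.4, "food": 0.3, "bad": 0.01}
neg = {"great": 0.02, "food": 0.3, "bad": 0.5}
doc = ["great", "food"]
label = "pos" if (naive_bayes_score(doc, 0.5, pos)
                  > naive_bayes_score(doc, 0.5, neg)) else "neg"
```

Production implementations work with log-probabilities instead of raw products to avoid numerical underflow on long documents.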

2.2.7. The confusion matrix
A confusion matrix is a technique for summarizing the performance of a classification
algorithm. Classification accuracy alone can be misleading if you have an unequal
number of observations in each class or if you have more than two classes in your
dataset. Calculating a confusion matrix gives a clearer picture of what the
classification model is getting right and what types of errors it is making.
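For a binary task, the confusion matrix reduces to four counts (true positives, false positives, false negatives, true negatives); a minimal sketch with toy labels:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs; for a binary task this yields
    TP, FP, FN and TN."""
    return Counter(zip(y_true, y_pred))

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
tp, fn = cm[(1, 1)], cm[(1, 0)]
fp, tn = cm[(0, 1)], cm[(0, 0)]
accuracy = (tp + tn) / len(y_true)
```

Metrics such as precision (tp / (tp + fp)) and recall (tp / (tp + fn)) follow directly from these four counts, which is why the confusion matrix is more informative than accuracy alone.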
2.2.8. ROC-AUC curve
Binary classification is a common task for which machine learning and computational
statistics are used, and the area under the receiver operating characteristic curve (ROC
AUC) has become the common standard metric to evaluate binary classifications in
most scientific fields. The ROC curve has true positive rate (also called sensitivity or
recall) on the y axis and false positive rate on the x axis, and the ROC AUC can range
from 0 (worst result) to 1 (perfect result). (Davide Chicco, 2023)

https://biodatamining.biomedcentral.com/articles/10.1186/s13040-023-00322-4
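ROC AUC can equivalently be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting one half); a minimal sketch with toy classifier scores:

```python
def roc_auc(labels, scores):
    """AUC = probability that a random positive is scored higher than a
    random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2]
auc = roc_auc(labels, scores)   # 5/6 for this toy example
```

A classifier that ranks every positive above every negative scores 1.0, while random scoring gives about 0.5, matching the 0-to-1 range described above.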
