
Project Report

COURSE NAME: Natural Language Processing

COURSE CODE: SWE1017

Faculty: DR TULASI PRASAD SARIKI


SLOT: G1

Title: Multi-label Classification

Project Report
Submitted by
S AMITH (18MIS1010)
K NIKHIL KUMAR REDDY (18MIS1065)


Introduction
Multi-label classification arose from the study of text categorization, in which each document may belong to several predefined subjects at the same time. Multi-label classification of textual data is an important problem, with news stories and emails as just two examples. It can be used, for instance, to determine from a plot summary which genres a film belongs to.
The training set in multi-label classification is made up of examples, each of which is associated
with a set of labels, and the goal is to predict the label sets of unseen instances by evaluating
training instances with known label sets.
The difference between multi-class and multi-label classification is that multi-class problems have mutually exclusive classes, whereas in multi-label problems each label represents a different classification task, although the tasks are related in some way.
Multi-class classification, for example, assumes that each sample is assigned to exactly one label: a fruit can be either an apple or a pear, but not both at once. A text, by contrast, could be about religion, politics, finance, and education at the same time, or about none of these. This is an example of multi-label classification.

Problem Definition
Toxic comment classification is a multi-label text classification challenge with an extremely skewed dataset. Our task is to develop a multi-label model capable of detecting several types of toxicity, such as threats, vulgarity, insults, and identity-based hate. For each comment, the model must predict the probability of each type of toxicity.

Exploratory Data Analysis


Exploratory data analysis is one of the most crucial steps in the data analysis process. The main focus here is on making sense of the data at hand: developing the right questions to ask of the dataset, manipulating the data sources to get the answers needed, and so on.

 Importing the necessary libraries (a consolidated sketch of these EDA steps follows this list)


 Load the dataset

 The dataset is from Kaggle.

 Counting the number of comments under each label


 The number of comments with multiple labels is counted and plotted.

 The most frequently used words in each comment category are represented as a
WordCloud.
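The steps above can be sketched as follows; the label columns match the Kaggle toxic-comment dataset, while the file path and plotting details are illustrative assumptions:

import pandas as pd
import matplotlib.pyplot as plt

# Label columns of the Kaggle "Toxic Comment Classification Challenge" data
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df = pd.read_csv("train.csv")  # hypothetical local path to the Kaggle file

# Count the number of comments under each label
print(df[labels].sum().sort_values(ascending=False))

# Count comments carrying multiple labels and plot the distribution
df["num_labels"] = df[labels].sum(axis=1)
df["num_labels"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Labels per comment")
plt.ylabel("Number of comments")
plt.show()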


Data Pre-Processing
We first convert the comments to lower case, then remove HTML elements, punctuation, and non-alphabetic characters with custom routines. Then, using the default set of stop-words available in the NLTK package, we remove all stop-words from the comments. A few extra stop-words are also added to the default list.
 Stop-words are a set of commonly used words found in any language, not just English. Removing them is significant in many applications because it lets us focus on the key words rather than the words that occur most often in a language.


 In data pre-processing we apply techniques that remove punctuation, HTML tags, and non-alphabetic characters (see the sketch below).
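A minimal sketch of these cleaning routines, assuming NLTK is installed (the extra stop-words shown are illustrative):

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Default NLTK stop-words, extended with a few extra words (extras are illustrative)
stop_words = set(stopwords.words("english"))
stop_words.update({"would", "could", "also"})

def clean_text(text):
    text = text.lower()                    # convert to lower case
    text = re.sub(r"<.*?>", " ", text)     # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and non-alphabetic chars
    return " ".join(w for w in text.split() if w not in stop_words)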

After that, we perform stemming. Several stemming algorithms exist, all of which essentially reduce words with similar semantics to a single standard form. The stem of amusing, amusement, and amused, for example, is amus.
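For instance, with NLTK's SnowballStemmer (any standard stemmer would behave similarly):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
# amusing, amusement, and amused all reduce to the common stem "amus"
print([stemmer.stem(w) for w in ["amusing", "amusement", "amused"]])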


After partitioning the dataset into train and test sets, we want to summarise the comments and turn them into numerical vectors. One method is to select the most commonly used terms (words with high term frequency, or tf). Raw frequency alone, however, is a weak indicator, because some words, such as 'this' and 'a', appear often across all documents. As a result, we also want a measure of how distinctive a word is, i.e. how infrequently it appears across all documents (inverse document frequency, or idf).
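A minimal tf-idf sketch with scikit-learn, continuing from the cleaning step above (the split ratio and vocabulary cap are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df["comment_text"].map(clean_text), df[labels].values,
    test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=10000)   # cap the vocabulary size
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn idf on the train split only
X_test_tfidf = vectorizer.transform(X_test)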

Multi-Label Classification Techniques


Most classical learning algorithms are designed for single-label classification tasks. As a result, several techniques in the literature decompose the multi-label problem into many single-label problems, so that single-label algorithms can still be applied.

1. OneVsRest
 Traditional two-class and multi-class problems can be converted to multi-label
problems by limiting each instance to one label. Multi-label problems, on the other
hand, are inherently more difficult to learn due to their generality. Decomposing a
multi-label problem into numerous independent binary classification problems is a
natural way to solve it (one per category).
 In a "one-to-rest" technique, one may create numerous independent classifiers and
choose the class with the highest confidence for an unknown instance.
 The primary premise is that the labels are mutually exclusive. In this method, you
ignore any underlying correlation between the classes.
 It's more like asking simple questions such as "Is the comment toxic or not?" or "Is the comment dangerous or not?", one label at a time. Also, because most of the comments are unlabelled, i.e., most of the comments are clean, there is a significant risk of overfitting here.
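A sketch of this approach with scikit-learn, continuing from the tf-idf step; the logistic-regression base classifier is our assumption, not something fixed by the method:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One independent binary classifier per label; correlations between labels are ignored
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train_tfidf, y_train)
y_pred = ovr.predict(X_test_tfidf)  # one 0/1 decision per label for each comment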

2. Binary Relevance
 In this scenario, a set of single-label binary classifiers, one for each class, is trained. Each classifier predicts whether an instance is a member or a non-member of its class, and the multi-label output is the union of all predicted classes. This method is popular because it is simple to implement, but it ignores possible correlations between class labels.
 In other words, if there are q labels, the binary relevance technique creates q new data sets from the original data, one per label, and trains a single-label classifier on each. The "binary" in "binary relevance" refers to each classifier answering yes or no to a question such as "does it include trees?" (an image-labelling example). This is a simple method; however, it fails when there are dependencies between the labels.


 OneVsRest and binary relevance appear to be quite similar: if multiple classifiers in OneVsRest are allowed to answer "yes", you are back in the binary relevance scenario.
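A binary relevance sketch using the scikit-multilearn package (assumes pip install scikit-multilearn; the naive-Bayes base classifier is illustrative):

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import MultinomialNB

# q independent binary problems, one per label; label dependencies are ignored
br = BinaryRelevance(classifier=MultinomialNB())
br.fit(X_train_tfidf, y_train)
y_pred = br.predict(X_test_tfidf)  # sparse 0/1 label-indicator matrix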

3. Classifier Chains
 A chain of binary classifiers C0, C1, …, Cn is built, where each classifier Ci uses the predictions of all preceding classifiers Cj (j < i) as additional input features. This allows the approach, known as classifier chains (CC), to take label correlations into account.
 The overall number of classifiers required for this strategy is the same as the number of
classes, but classifier training is more complicated.
 As an illustration, consider a classification problem with three categories chained in the order C1, C2, C3: C2 receives the input features plus C1's prediction, and C3 receives the input features plus the predictions of both C1 and C2.
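A sketch with scikit-learn's ClassifierChain; the base classifier and the (default) chain order are our assumptions:

from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression

# Each classifier in the chain sees the tf-idf features plus the predictions
# of all earlier classifiers, so label correlations can be exploited
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=42)
chain.fit(X_train_tfidf, y_train)
y_pred = chain.predict(X_test_tfidf)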

4. Label Powerset
 This method takes into account the possibility of class label correlations. Because each
member of the power set of labels in the training set is treated as a single label, this
method is more generally referred to as the label-powerset method.
 In the worst case this strategy requires a multi-class classifier with up to 2^|C| classes, one per label combination, and is computationally intensive.
 The number of distinct label combinations can grow exponentially with the number of classes. This easily results in combinatorial explosion and, as a result, computational infeasibility. Furthermore, some label combinations will have extremely few positive examples.
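A label-powerset sketch, again via scikit-multilearn (the base classifier is illustrative):

from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import LogisticRegression

# Every distinct label combination seen in training becomes one class of a
# single multi-class problem, so label correlations are captured implicitly
lp = LabelPowerset(classifier=LogisticRegression(max_iter=1000))
lp.fit(X_train_tfidf, y_train)
y_pred = lp.predict(X_test_tfidf)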


5. Adapted Algorithm

 Algorithm adaptation approaches for multi-label classification focus on changing cost/decision functions to adapt single-label classification algorithms to the multi-label case.
 We apply ML-KNN, a multi-label lazy learning strategy derived from the standard K-nearest neighbour (KNN) algorithm, in this project.
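A sketch with scikit-multilearn's MLkNN implementation; the neighbourhood size k is our assumption:

from skmultilearn.adapt import MLkNN

# For each test comment, ML-KNN inspects the labels of the k nearest training
# comments and assigns each label via a MAP estimate over the neighbour counts
mlknn = MLkNN(k=10)
mlknn.fit(X_train_tfidf, y_train)
y_pred = mlknn.predict(X_test_tfidf)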

Conclusion:

Results:
 Problem transformation methods and algorithm adaptation methods are the two basic
approaches to solving a multi-label classification problem.
 Problem transformation methods convert the multi-label problem into a set of binary classification problems, which can then be solved with single-label classifiers.
 Algorithm adaptation approaches, on the other hand, adapt algorithms to perform multi-label classification directly. In other words, rather than reducing the problem to a simpler form, they handle it in its entirety.
 In a thorough comparison of the different approaches, the label-powerset method outperformed the one-against-all strategy (a metric sketch follows this list).
 Because both ML-KNN and label-powerset take a long time to run on this dataset, testing was limited to a random sample of the training data.
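A sketch of how such a comparison can be scored, using two standard multi-label metrics from scikit-learn:

from sklearn.metrics import accuracy_score, hamming_loss

# Subset accuracy counts a comment as correct only if its entire label set is
# predicted exactly; Hamming loss is the fraction of wrong individual label decisions
print("subset accuracy:", accuracy_score(y_test, y_pred))
print("hamming loss:", hamming_loss(y_test, y_pred))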

Further improvements:

 In deep learning, LSTMs can be used to tackle the same problem.
 We could employ decision trees for increased speed, or ensemble models for a
reasonable trade-off between speed and accuracy.
 Multi-label classification problems can be solved using other frameworks like MEKA.
