
Toxic Comment Analyser

Introduction
• The proliferation of social media enables people to express their opinions widely online. At the same time, however, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection.
• Online hate, described variously as abusive language, aggression, cyberbullying and hatefulness, has been identified as a major threat on online social media platforms, and these platforms are the most prominent grounds for such toxic behavior.
Current Scenario
• There has been a remarkable increase in cases of cyberbullying and trolling on various social media platforms. Many celebrities and influencers face backlash and come across hateful and offensive comments. This can take a toll on anyone and affect them mentally, leading to depression, mental illness, self-hatred and suicidal thoughts.
• Internet comments can be bastions of hatred and vitriol. While online anonymity has provided a new outlet for aggression and hate speech, machine learning can be used to fight it. The problem we sought to solve was the tagging of internet comments that are aggressive towards other users. This means that insults to third parties such as celebrities will be tagged as inoffensive, but “u are an idiot” is clearly offensive.
Objective
Our goal is to build a prototype of an online hate and abuse comment classifier which can be used to classify hateful and offensive comments, so that the spread of hatred and cyberbullying can be controlled and restricted.
Multi-Class Classification Vs Multi-Label Classification

• The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, although the tasks are somehow related.
• For example, multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear, but not both at the same time. Multi-label classification, in contrast, can be employed to find all the genres that a movie belongs to based on the summary of its plot, since a movie may belong to several genres at once.
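As a rough illustration of this difference (the class and label names below are illustrative, not taken from the slides), a multi-class target is a single column of mutually exclusive class ids, while a multi-label target is an independent 0/1 indicator per label:

```python
import numpy as np

# Multi-class: each comment gets exactly one of the mutually exclusive classes.
multi_class_target = np.array([0, 2, 1])  # e.g. 0 = neutral, 1 = offensive, 2 = hate

# Multi-label: each comment gets an independent 0/1 flag per label,
# so a single comment can be both "malignant" and "rude" at the same time.
#                    malignant  rude  threat
multi_label_target = np.array([[1,     1,    0],
                               [0,     0,    0],
                               [1,     0,    1]])
```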
Software and Hardware Requirements

Software Used: Python, Visual Studio Code, Kaggle

Libraries Used: Pandas, Numpy, Matplotlib, Tensorflow, Sklearn, Re, Nltk, Word Cloud, Gradio
Exploration of Dataset
• There are 15,971 comments in the dataset in total.
• Around 90% of the comments are good/neutral in nature, while the remaining 10% are negative in nature.
• Among the negative comments, the malignant category is the most frequent, followed by the rude category.
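A minimal pandas sketch of this exploration, assuming the Kaggle data has been downloaded as train.csv and that the label columns are named after the categories used in this project (malignant, highly_malignant, rude, threat, abuse, loathe); both the file name and the column names are assumptions:

```python
import pandas as pd

# Load the labelled comments (path and column names are assumptions).
df = pd.read_csv("train.csv")
label_cols = ["malignant", "highly_malignant", "rude", "threat", "abuse", "loathe"]

total = len(df)
negative = df[label_cols].any(axis=1).sum()  # comments with at least one toxic label
print(f"Total comments: {total}")
print(f"Good/Neutral:   {total - negative} ({(total - negative) / total:.0%})")
print(f"Negative:       {negative} ({negative / total:.0%})")

# How often each toxicity category appears among the negative comments.
print(df[label_cols].sum().sort_values(ascending=False))
```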
Data Pre-Processing: Text Mining
• Convert the text to lowercase
• Remove the punctuation, digits and special characters
• Remove the stop words
• Apply text vectorization to convert the cleaned text into numeric form (a minimal sketch of these steps follows below)
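A minimal sketch of these pre-processing steps, assuming NLTK's English stop-word list and a TF-IDF vectorizer for the classical model; the clean_text helper and all variable names are illustrative:

```python
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    text = text.lower()                                          # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                        # drop punctuation, digits, special characters
    tokens = [w for w in text.split() if w not in stop_words]    # remove stop words
    return " ".join(tokens)

comments = ["U are an IDIOT!!!", "Thanks for the helpful edit :)"]
cleaned = [clean_text(c) for c in comments]

# Text vectorization: convert the cleaned comments into numeric features.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(cleaned)
```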
Word Cloud
• A word cloud is a visualization technique for text data wherein each word is drawn with a size that reflects its importance in the context or its frequency.
• The more commonly a term appears within the text being analysed, the larger the word appears in the generated image.
• Enlarged words are therefore the most frequently used ones, while small words are those used less often.
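A minimal sketch of generating such an image with the Word Cloud library, assuming cleaned is the list of pre-processed comments from the previous step:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all pre-processed comments into one string and build the cloud.
text = " ".join(cleaned)
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```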
Machine Learning Model Building
The different classification algorithms used in this project to build the ML models are listed below:
• Logistic Regression
• Bidirectional LSTM
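A minimal sketch of the Logistic Regression baseline for this multi-label problem, assuming X is the TF-IDF matrix from the pre-processing sketch and y is the 0/1 label matrix (e.g. df[label_cols].values); wrapping the classifier in one-vs-rest is an assumption about how the multi-label case was handled:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# X: TF-IDF features, y: one 0/1 column per toxicity category.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one logistic regression per label (one-vs-rest) for multi-label output.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
# With a multi-label indicator matrix this is subset accuracy (all labels correct).
print("Subset accuracy:", accuracy_score(y_test, preds))
```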
LSTM
• Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.
• They are required in complex problem domains like machine translation, speech recognition, and more.
• LSTM networks are similar to RNNs, with one major difference: hidden layer updates are replaced by memory cells. This makes them better at finding and exposing long-range dependencies in data, which is imperative for modelling sentence structure.
• One shortcoming of conventional RNNs is that they are only able to make use of previous context. Bidirectional RNNs (BRNNs) address this by processing the data in both directions with two separate hidden layers, which are then fed forward to the same output layer. Combining BRNNs with LSTM gives the Bidirectional LSTM, which can access long-range context in both input directions.
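In Keras terms, the bidirectional variant is obtained by wrapping an LSTM layer in the Bidirectional wrapper, which runs one copy of the layer forwards and one backwards over the sequence and concatenates their outputs; the layer size here mirrors the model on the next slide but is otherwise illustrative:

```python
from tensorflow.keras.layers import LSTM, Bidirectional

# One forward and one backward LSTM pass over the same input sequence;
# their outputs are concatenated, giving 2 * 32 = 64 features per sequence.
bi_lstm = Bidirectional(LSTM(32, activation="tanh"))
```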
Neural Network Model Architecture
1. **Embedding Layer**: An embedding layer is added as the first layer in the model. It maps each word index to a dense vector representation of fixed size (32 in this case).

2. **Bidirectional LSTM Layer**: A bidirectional LSTM layer is added to capture contextual information from the comment text. The LSTM layer has 32 units and uses the hyperbolic tangent (tanh) activation function.

3. **Fully Connected Layers**: Three fully connected layers with ReLU activation functions are added as feature extractors. The layer sizes are 128, 256, and 128, respectively.

4. **Final Layer**: The final layer is a dense layer with a sigmoid activation function, which outputs the probability of each toxicity category.

The model is compiled with the Binary Crossentropy loss function and the Adam optimizer.
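A minimal Keras sketch of this architecture; the vocabulary size, sequence length and the six toxicity categories are assumptions, since they are not stated on the slide:

```python
import tensorflow as tf
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Sequential

VOCAB_SIZE = 200_000   # assumed vocabulary size of the text vectorizer
SEQ_LEN = 1800         # assumed maximum comment length in tokens
NUM_LABELS = 6         # assumed number of toxicity categories

model = Sequential([
    Input(shape=(SEQ_LEN,)),
    Embedding(VOCAB_SIZE + 1, 32),                 # 1. word index -> 32-d dense vector
    Bidirectional(LSTM(32, activation="tanh")),    # 2. contextual features in both directions
    Dense(128, activation="relu"),                 # 3. fully connected feature extractors
    Dense(256, activation="relu"),
    Dense(128, activation="relu"),
    Dense(NUM_LABELS, activation="sigmoid"),       # 4. per-category toxicity probability
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer="adam")
model.summary()
```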
Ideally, the loss value for any deep-learning model should decrease as the number of epochs increases, while the accuracy should increase as the number of epochs increases. This gives us a fairly decent idea about the quality of our deep-learning model and whether it has been appropriately trained.

Training and validation loss values for the LSTM model over a span of 2 epochs.
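A minimal sketch of producing such a training curve, assuming X_train_seq holds the integer-encoded (vectorized) comments and y_train the label matrix; both names are illustrative:

```python
import matplotlib.pyplot as plt

# Train briefly and keep the per-epoch history of loss values.
history = model.fit(X_train_seq, y_train, validation_split=0.2, epochs=2, batch_size=32)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy loss")
plt.legend()
plt.show()
```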
Machine Learning Evaluation Metrics

Model                  Accuracy   Precision   Recall
Logistic Regression    0.920      0.89        0.94
Bidirectional LSTM     0.924      0.81        0.73
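A minimal sketch of how such scores can be computed with scikit-learn, assuming y_test is the true label matrix and preds the thresholded model predictions; micro-averaging is an assumption about how the multi-label precision and recall were aggregated:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy :", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds, average="micro"))
print("Recall   :", recall_score(y_test, preds, average="micro"))
```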
Working Interface
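A minimal sketch of a working interface built with Gradio, assuming the trained Keras model and a TextVectorization layer (text_vectorizer) are in scope; the score_comment helper and the category names are illustrative:

```python
import gradio as gr

LABELS = ["malignant", "highly_malignant", "rude", "threat", "abuse", "loathe"]  # assumed categories

def score_comment(comment: str) -> dict:
    # Vectorize the raw comment and return one probability per toxicity category.
    vector = text_vectorizer([comment])   # assumed Keras TextVectorization layer
    probs = model.predict(vector)[0]
    return {label: float(p) for label, p in zip(LABELS, probs)}

demo = gr.Interface(fn=score_comment,
                    inputs=gr.Textbox(lines=2, placeholder="Type a comment to score"),
                    outputs=gr.Label(num_top_classes=6),
                    title="Toxic Comment Analyser")
demo.launch()
```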
CONCLUSION

• The final model gives an accuracy score of 92.41%, which is a slight improvement compared to the earlier accuracy score of 92.00%.
• The LSTM classifier is the fastest algorithm compared to the others.
References

• Machine Learning Mastery: https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
• Prezi: https://prezi.com/p/caxzp2skyybc/toxic-comment-classification/
• ResearchGate: https://www.researchgate.net/publication/334123568_Toxic_Comment_Classification
• Kaggle: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data
• Towards Data Science: https://towardsdatascience.com/toxic-comment-classification-using-lstm-and-lstm-cnn-db945d6b7986
