Project Report: Toxic Comment Classifier
By
Ashish Kumar, Amarjit Hore, and Roshan Kumar
Bachelor of Technology
in
Computer Science and Engineering/Information Technology
November 2023
CERTIFICATE
The project has fulfilled all the requirements as per the regulations of
the Indian Institute of Information Technology Kalyani. In my opinion, it has
reached the standards needed for submission. The work, techniques, and
the results presented have not been submitted to any other university or
institute for the award of any other degree or diploma.
……………………….
Dr. Anirban Lakshman
Assistant Professor,
Department of Computer Science and Engineering
Indian Institute of Information Technology Kalyani
West Bengal 741235, India
10/11/2023
DECLARATION
We hereby affirm that the research presented in this report, titled "Toxic
Comment Classifier", has been submitted to the Indian Institute of
Information Technology Kalyani in partial fulfilment of the requirements
for the degree of Bachelor of Technology in Computer Science and
Engineering.
The work was conducted from July 2023 to November 2023 under the
guidance of Dr. Anirban Lakshman, Department of Computer Science and
Engineering, Indian Institute of Information Technology Kalyani, West
Bengal - 741235, India. We declare that the report does not contain any
classified information.
Candidates:
- Ashish Kumar 678
- Amarjit Hore 669
- Roshan Kumar 729
ACKNOWLEDGEMENT
Candidates:
Ashish Kumar 678
Amarjit Hore 669
Roshan Kumar 729
IIIT Kalyani
26/11/2023
ABSTRACT
This project builds a toxic comment classifier that automatically detects
harmful language in online discussions. A Kaggle dataset of comments
labelled across six categories (toxic, severe toxic, obscene, threat,
insult, and identity hate) is vectorized and used to train a deep learning
model built around a bidirectional LSTM with fully connected layers. The
final model reaches an accuracy of 92.41%, and a Gradio interface lets
users score arbitrary comments. The project fulfils the regulatory
requirements of the Indian Institute of Information Technology Kalyani and
was carried out under faculty supervision, drawing on natural language
processing and deep learning techniques.
CONTENTS
Chapters
1. Problem Statement
2. Objective of Problem
3. Literature Survey
4. Proposed System
5. Methodology
6. Output Interface
7. Observations
8. Conclusion
9. Future Scope of Work
10. References
Problem Statement
Toxic comments undermine the constructive nature of discussion on digital
platforms, and manual moderation cannot keep pace with the volume of
user-generated content. An automated system is therefore needed to detect
toxic language and categorize it by type.
Objective
Automated Detection of Toxic Comments:
- Develop a machine learning model capable of automatically
detecting toxic comments within digital content.
- Implement natural language processing techniques to analyze
and understand the linguistic features associated with toxicity.
Multi-Label Categorization
- Enable the classifier to assign toxic comments to one or more
specific classes: "toxic," "severe toxic," "obscene," "threat," "insult,"
and "identity hate" (a single comment may carry several labels at
once; see the example after this list).
- Provide a nuanced classification system to better understand
and address different forms of toxic language.
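Because the categories are not mutually exclusive, the targets are
naturally represented as a binary indicator vector per comment. A minimal
illustration (the label order matches the dataset; the example values are
hypothetical):

# Order of label columns in the dataset
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical comment flagged as both "toxic" and "insult"
y_example = [1, 0, 0, 0, 1, 0]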
Computational Efficiency
- Optimise the model for computational efficiency without
compromising accuracy.
User-Friendly Integration
- Develop an interface or integration method that allows users to
easily incorporate the Toxic Comment Classifier into various online
platforms.
Literature Survey
Early Approaches
a. Supervised Learning:
Early systems treated toxicity detection as a standard text
classification task, training classical supervised models such as logistic
regression, naive Bayes, and support vector machines on hand-engineered
features like bag-of-words and TF-IDF vectors.
b. Deep Learning:
The advent of deep learning has significantly impacted the field, with
recurrent neural networks (RNNs), long short-term memory networks
(LSTMs), and more recently, transformer-based models such as BERT and
GPT, achieving state-of-the-art performance. These models excel in
capturing contextual information and semantic relationships, enabling
them to effectively identify subtle instances of toxicity.
Challenges remain, however: the evolving patterns of toxicity and the
existence of cultural and contextual variations pose difficulties for
model generalisation. Additionally, issues related to bias and fairness in
models, especially those trained on biased datasets, need careful
consideration.
PROPOSED SYSTEM
In the dynamic realm of online communication, the unrestricted
exchange of ideas on digital platforms has empowered diverse
voices. However, this openness has also given rise to the persistent
challenge of toxic comments, which can undermine the constructive
nature of online discussions. Recognizing the gravity of this issue, our
project, the Toxic Comment Classifier, developed by undergraduate
students Ashish Kumar, Amarjit Hore, and Roshan Kumar of the
Department of Computer Science and Engineering at the Indian
Institute of Information Technology Kalyani, applies natural language
processing and machine learning to detect and categorize toxic
comments automatically.
METHODOLOGY
Dataset and Training
Dataset Description:
- Sourced from Kaggle; the dataset contains 15,971 comments in total.
- The Toxic Comment Classification project uses a dataset of
comments from online platforms labelled for various forms of toxicity.
- The dataset is provided in CSV format.
- The comments were manually classified into the following categories:
1. Toxic
2. Severe toxic
3. Obscene
4. Threat
5. Insult
6. Identity hate
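Before training, it is worth checking how frequently each label occurs,
since toxic classes are typically a small minority. A quick sketch,
assuming the CSV layout described above:

import pandas as pd

# Count how many comments carry each label (columns 2 onward are the labels)
df = pd.read_csv('train.csv')
print(df[df.columns[2:]].sum())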
1. Data Preprocessing
● Dataset Loading: Load the toxic comment dataset
(e.g., 'train.csv') containing comments and corresponding toxicity
labels.
● Text Vectorization: Utilise the TextVectorization layer to convert
raw text into numerical vectors, allowing for efficient processing by
the model.
● Label Preparation: Extract the target labels (toxicity categories)
and format them for model training.
import pandas as pd
from tensorflow.keras.layers import TextVectorization

# Load the dataset: 'comment_text' plus six binary label columns
df = pd.read_csv('train.csv')

# Features and labels (columns 2 onward are the toxicity labels)
X = df['comment_text']
y = df[df.columns[2:]].values

# Vocabulary cap; the constant was truncated in the report, 200000 is an assumed value
MAX_FEATURES = 200000

vectorizer = TextVectorization(max_tokens=MAX_FEATURES,
                               output_sequence_length=1800,
                               output_mode='int')
vectorizer.adapt(X.values)
vectorized_text = vectorizer(X.values)
2. Model Architecture
Logistic Regression (Baseline)
Advantages:
● Interpretability: Results are easily interpretable, providing
probabilities for class membership.
● Efficiency: Computationally efficient and does not require
high computational resources.
● Less Prone to Overfitting: Less susceptible to overfitting
compared to more complex models when the feature space
is small.
Disadvantages:
● Linear Decision Boundary: Limited to linear decision
boundaries, which might be a drawback for complex
datasets.
● Assumption of Linearity: Assumes a linear relationship
between independent variables and the log-odds of the
dependent variable.
● Sensitivity to Outliers: Sensitive to outliers, which can
impact the model's performance.
Use Cases:
● Binary Classification: Well-suited for problems with two
classes, such as spam detection or disease diagnosis.
● Probabilistic Predictions: Useful when probability estimates
for class membership are required.
Implementation:
● Algorithm: Uses the logistic function to model the
probability of a particular outcome.
● Optimization: Typically optimised using techniques like
gradient descent.
Scalability:
● Scalability: Scales well with the number of features but may
not be the best choice for large and highly complex
datasets.
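The report does not include code for this model; the following is a
minimal scikit-learn sketch of such a baseline, assuming TF-IDF features
and a one-vs-rest wrapper so each of the six labels gets its own binary
classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# One independent logistic regression per toxicity label
baseline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50000)),
    ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

# X: the raw comment strings, y: the (n_samples, 6) binary label matrix
baseline.fit(df['comment_text'], y)
predictions = baseline.predict(['an example comment to score'])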
Fully Connected Layers: Three fully connected layers with ReLU
activation functions are added as feature extractors. The layer sizes
are 128, 256, and 128, respectively.
The model is compiled with the Binary Cross Entropy loss function
and the Adam optimizer (a full architecture sketch is given under
"Bidirectional LSTM" below).
Understanding How LSTM Works:
1. Memory Cell:
- The core of an LSTM is its memory cell, which serves as a
storage unit capable of retaining information over long periods. This
memory cell is responsible for keeping track of relevant information
from earlier parts of the sequence.
2. Three Gates:
- LSTMs employ three gates to regulate the flow of information:
the input gate, the forget gate, and the output gate.
- The input gate determines which values from the input should
be stored in the memory cell.
- The forget gate decides what information to discard from the
memory cell.
- The output gate regulates the information that should be output
based on the current input and the memory cell content.
3. Cell State:
- The memory cell maintains a continuous 'cell state' that runs
through the entire sequence. This state is modified by the gates,
allowing the LSTM to selectively update, add, or remove information
from the cell state.
4. Hidden State:
- The hidden state is the LSTM's way of capturing and storing
information from previous time steps. It acts as a summary or
representation of the relevant information learned from the entire
sequence.
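These components can be summarised by the standard LSTM update equations,
where \sigma is the logistic sigmoid, \odot is element-wise
multiplication, x_t is the input, h_t the hidden state, and c_t the cell
state at time step t:

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}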
5. Advantages of LSTMs:
- Long-Term Dependencies: LSTMs excel at capturing and
learning dependencies over extended sequences, making them
suitable for tasks requiring an understanding of context over time.
- Gradient Flow: The gating mechanisms help in mitigating the
vanishing and exploding gradient problems that often hinder the
training of traditional RNNs.
- Versatility: LSTMs can be applied to a wide range of sequential
data tasks, including natural language processing, speech
recognition, and time series prediction.
This combination of gated memory and a persistent cell state makes LSTMs
a powerful tool for tasks that involve understanding context and
relationships across extended sequences.
Bidirectional LSTM
A bidirectional LSTM runs two LSTMs over the sequence, one forward and
one backward, and concatenates their hidden states, so every position
sees both left and right context.
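A minimal Keras sketch of the architecture described above: an embedding
layer, a bidirectional LSTM, the three ReLU feature-extractor layers
(128, 256, 128), and one sigmoid output per label. The embedding width
and LSTM units (32 each) are assumptions, as they are not stated in the
report:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential()
# Map token ids to dense vectors (+1 for the padding token); width 32 is an assumption
model.add(Embedding(MAX_FEATURES + 1, 32))
# Read the sequence in both directions; 32 units is an assumption
model.add(Bidirectional(LSTM(32, activation='tanh')))
# Three fully connected feature extractors, as described above
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
# One sigmoid output per toxicity label (multi-label)
model.add(Dense(6, activation='sigmoid'))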
# Binary cross entropy: each of the six labels is an independent yes/no decision
model.compile(loss='BinaryCrossentropy', optimizer='Adam')

# Assumes `train` and `val` are batched tf.data splits of the vectorized dataset
history = model.fit(train, epochs=6, validation_data=val)
from tensorflow.keras.metrics import Precision, Recall, CategoricalAccuracy

pre = Precision()
re = Recall()
acc = CategoricalAccuracy()

# Accumulate metrics over the test split (assumes a batched tf.data split named `test`)
for batch in test.as_numpy_iterator():
    X_true, y_true = batch
    yhat = model.predict(X_true)

    # Flatten so the metrics compare label-by-label across the whole batch
    y_true = y_true.flatten()
    yhat = yhat.flatten()

    pre.update_state(y_true, yhat)
    re.update_state(y_true, yhat)
    acc.update_state(y_true, yhat)

print(f'Precision: {pre.result().numpy()}, '
      f'Recall: {re.result().numpy()}, Accuracy: {acc.result().numpy()}')
def score_comment(comment):
    # Vectorize the raw comment and run the trained model
    vectorized_comment = vectorizer([comment])
    results = model.predict(vectorized_comment)

    # Report True/False for each label using a 0.5 decision threshold
    text = ''
    for idx, col in enumerate(df.columns[2:]):
        text += '{}: {}\n'.format(col, results[0][idx] > 0.5)
    return text
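A quick sanity check of the helper (the input string is a hypothetical
example):

# Prints one 'label: True/False' line per toxicity category
print(score_comment('you are pathetic and stupid'))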
import gradio as gr

# Wrap the scorer in a simple web UI; share=True also creates a temporary public URL
interface = gr.Interface(fn=score_comment,
                         inputs=gr.inputs.Textbox(lines=2,
                                                  placeholder='Comment to score'),
                         outputs='text')
interface.launch(share=True)
Output Interface
(Screenshot of the Gradio scoring interface.)
Observation
(Figure: loss graph.)
(Figure: model scores.)
Conclusion
The final model achieves an accuracy of 92.41%, a slight improvement over
the earlier score of 92.0%. Among the algorithms compared, the LSTM
classifier was also the fastest.
Future Scope Of Work
References