
PROJECT REPORT
On
Language detection with machine learning
Submitted to Centurion University of Technology & Management
in partial fulfillment of the requirement for award of the degree of

B. TECH
in
COMPUTER SCIENCE & ENGINEERING
Submitted By

SAGAR PATRO

Holding university registration number

210101120104

Under the Guidance of

MS. ARYALOPA MALLA

DEPT. OF COMPUTER SCIENCE & ENGINEERING

SCHOOL OF ENGINEERING & TECHNOLOGY,

CUTM, Paralakhemundi-761211

CERTIFICATE

This is to certify that the project entitled “Language detection with machine learning”, submitted for the Bachelor of Technology in Computer Science & Engineering at the School of Engineering & Technology, CUTM, Paralakhemundi, during the academic year 2022-2023, is a bona fide piece of project work carried out by “SAGAR PATRO” towards the partial fulfillment of the requirements for the award of the degree (B.Tech.) under the guidance of “MS. ARYALOPA MALLA”, and that no part thereof has been submitted by the candidate for any other degree, to the best of my knowledge.

Signature of Candidate Signature of Project Guide

Name of the Candidate Name of the Guide



EVALUATION SHEET
1. Title of the Project: Language detection with machine learning
2. Year of submission: 2023
3. Name of the degree: B. TECH (C.S.E.)
4. Date of Examination / Viva:
5. Student Name: SAGAR PATRO
6. Reg. No.: 210101120104
7. Name of the Guide: MS. ARYALOPA MALLA

[APPROVED/REJECTED]

Signature of Project Guide



CANDIDATE’S DECLARATION

I, “SAGAR PATRO”, B. Tech CSE (Semester IV) of the School of Engineering & Technology, CUTM, Paralakhemundi, hereby declare that the Project Report entitled “Language detection with machine learning” is an original work and that the data provided in the study is authentic. This report has not been submitted to any other Institute for the award of any other degree by me.

Signature of Student

INDEX
Sl. No.   Content
01        Abstract
02        Introduction
03        What is language detection?
04        Use-Cases
05        Installation and importing of libraries
06        Dataset
07        Importing the dataset
08        Differentiating independent from dependent features
09        Performing label encoding
10        Text preparation
11        CountVectorizer
12        Model evaluation
13        Visualization
14        Conclusion

Abstract

Language detection is an essential task in natural language processing (NLP) and has numerous applications in various fields, such as text classification, machine translation, and speech recognition. This report proposes a machine learning approach for language detection that utilizes character n-grams and word n-grams as features. We train and evaluate several models using a large dataset of text documents in multiple languages. Our results demonstrate that the proposed approach achieves high accuracy in identifying the language of the input text, outperforming previous state-of-the-art methods. The proposed method provides a practical solution for language detection in real-world applications and can be easily extended to support additional languages.

Introduction

Recently, a wide range of human sectors (e.g., Engineering, Education, Healthcare, Finance, Media, etc.) have shown a lot of interest in machine learning. ML’s attractiveness has largely been attributed to its ability to make decisions without human interference. One common ML task is NLP, and in this report we will create a model trained to take a text input and predict which language it is written in. The technique of determining the language of a text or document is known as language detection in natural language processing. It used to be difficult to identify languages using machine learning when little data was available about them; now that data is so easily accessible, there are a number of effective machine learning models for language detection.

What is language detection?

The initial stage in any pipeline for text analysis or natural language processing is language identification. All ensuing language-specific models will yield wrong results if the language of a document is incorrectly determined. For example, when an English language analyzer is applied to a French document, errors at this step of the analysis can accumulate and produce inaccurate conclusions. Each document’s language, and any elements written in another language, need to be identified. The language used in documents varies widely depending on the country and culture.

Use-Cases

• Monolingual chatbots: When a user starts speaking in a particular language, a bot must be able to recognize it even if it has not been properly trained to carry on a conversation in that language.

• Spam filtering: Spam filtering systems that support many languages must identify the language that emails, online comments, and other input are written in before applying the actual spam filtering algorithms. Without this identification, internet platforms cannot efficiently remove content from certain countries, regions, or locations suspected of creating spam.

• Recognizing the language used in emails and chats: Language detection identifies the language of a text as well as the words and sentences where the language diverges. Since business messages (chats, emails, and so on) may be written in a variety of languages, it is frequently used for this purpose.

• Linguistic blending: Some people are used to having bilingual conversations. Hinglish, an amalgam of Hindi and English used in India, is a good illustration of this. In these situations, a language detection model examines the number of words written in each language in a sentence; the language with the most words serves as the primary language of the interaction, while the secondary language is also recognized and receives a high confidence score in the ranking.

With this settled, let’s get our hands dirty by building a model which will be able to predict the language of a given text.

Installation and importing of libraries

We will import all of the necessary libraries first, but if you don’t have them already installed, I advise you to install them before moving on.
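If any of these are missing, they can typically be installed from the command line with pip; for example (package names assumed to be the standard PyPI ones):

pip install pandas numpy seaborn matplotlib scikit-learn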


import re
import warnings
warnings.simplefilter("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Dataset

We will make use of a small language detection dataset from Kaggle. Using this dataset, which contains text samples for 17 different languages, we will build an NLP model for predicting those 17 languages.

Languages: English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek.

We must build a model that can predict the language of a given piece of text. This provides a solution for many computational linguistics and artificial intelligence applications. Such prediction algorithms are frequently used for machine translation on robots as well as electronic devices like mobile phones and laptops. Additionally, language detection helps in managing and locating documents that are multilingual. Researchers are still active in this field of NLP.

Importing the dataset


df = pd.read_csv("Language Detection.csv")
df.head()

This dataset has 10,337 rows, two columns, and text details for 17 distinct languages. We can quickly calculate the value count for each language.

df["Language"].value_counts()

Differentiating independent from dependent features

The dependent variable, in this case, is the name of the language (y), and the independent variable is the text data (X), which we can now separate from each other.

X = df["Text"]
y = df["Language"]

Performing label encoding


Language names make up our output variable, which is a categorical variable. We perform label encoding on this output variable since we need to convert it into a numerical form for training the model. We import LabelEncoder from sklearn for this procedure.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
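To see which integer was assigned to which language, you can optionally inspect the fitted encoder (a small extra check, not part of the original walkthrough):

print(le.classes_)  # array of language names; index i is the integer label assigned to that language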

Text preparation

This dataset contains a lot of irrelevant/unwanted symbols and numbers that may degrade the performance of our model, so text preparation is required.

text_list = []
for text in X:
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    text_list.append(text)

In the code above, we create an empty list text_list for the preprocessed text, then iterate through all the text in X, remove the symbols and numbers, convert each text to lowercase, and finally append it to text_list.

CountVectorizer
Both the input and the output features must
take the form of numbers. We will use the
CountVectorizer’s Bag of Words model to
convert text into numerical form.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(text_list).toarray()
X.shape

You should get (10337, 39419) as the output.
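The abstract also mentions character n-grams as features. The walkthrough above uses the default word-level bag of words, but CountVectorizer can produce character n-grams as well; a minimal sketch of that variant (an optional alternative, not part of the code above) would be:

cv_char = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
X_char = cv_char.fit_transform(text_list)  # sparse matrix of character 1- to 3-gram counts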

Train Test split


Our input and output variables have been preprocessed, so the next stage is to split our dataset into training and test data. The training set is used to train the model and the test set is used to evaluate it. We will make use of train_test_split from sklearn for this procedure.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

The test size is just 20%.
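For reproducible splits, one might additionally pass a fixed random_state (an optional tweak, not in the original code):

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)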

Training and prediction of models



The process of creating the model is almost complete. We will use the Naive Bayes algorithm to build our model, and the model is then trained using the training set.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

We have trained our model on the training set. Next, we predict the results for the test set.

y_prediction = model.predict(x_test)

Model evaluation
After the successful completion of training and prediction, the next thing we always want to do is evaluate the model.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracy = accuracy_score(y_test, y_prediction)
confusion_m = confusion_matrix(y_test, y_prediction)
print("The accuracy is :", accuracy)

We got an accuracy of 97%.
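Since classification_report was imported above, you can also print per-language precision and recall (an optional extra check; it assumes le from the label-encoding step is still in scope):

print(classification_report(y_test, y_prediction, target_names=le.classes_))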



Visualization

Using the seaborn heatmap, let’s plot the confusion matrix for the purpose of visualization.
plt.figure(figsize=(15,10))
sns.heatmap(confusion_m, annot = True)
plt.show()
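To make the heatmap easier to read, the axes can optionally be labeled with the language names instead of the encoded integers (a small variation on the plot above, again assuming le is still in scope):

plt.figure(figsize=(15,10))
sns.heatmap(confusion_m, annot = True, xticklabels = le.classes_, yticklabels = le.classes_)
plt.show()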

Let’s try out the model prediction using text from several languages. We will write a function that will take in the text as input and predict the language in which the text is written.
def lang_predict(text):
    x = cv.transform([text]).toarray()
    lang = model.predict(x)
    lang = le.inverse_transform(lang)
    print("The language is", lang[0])

Here, cv is the CountVectorizer that converts the text into a bag-of-words vector, the variable lang stores the predicted language, and finally we print the predicted language to the user.

To test this, we call the lang_predict() function, pass any piece of text into it, and let it predict the language.
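For example, one could pass sentences such as the following (these sample sentences are arbitrary illustrations, not from the original report):

lang_predict("This is a simple sentence written in English.")
lang_predict("Ceci est une phrase écrite en français.")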

Conclusion
We have come to the end of this report, and I hope you now have a better understanding of how to predict the language of a text using machine learning. The data has to be examined and then preprocessed as necessary, and the text data is then represented using a bag-of-words model. In order to make accurate predictions in NLP, text extraction and vectorization are crucial tasks. For text classification problems like this, Naive Bayes consistently proves to be a strong model, leading to more accurate results.
