FML Cep

You might also like

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 10

A

Course end Project report on


Language Detection with Machine Learning

A6413-Funadamentals of Machine Learning

Submitted by
S. No Roll Number Name of the student
1 20881A04A9 S. Sravani
2 20881A04A5 P. Pavani

Course Instructor
Mrs. S. Sujana
Assistant Professor, Dept of ECE

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


VARDHAMAN COLLEGE OF ENGINEERING, HYDERABAD
Autonomous institute, affiliated to JNTUH
2022-23
Contents:

 Abstract

 Introduction

 Proposed method

 Results

 Conclusion

 References

 Code
Abstract:
Language detection is a natural language processing
task where we need to identify the language of a text
or document. Using machine learning for language
identification was a difficult task a few years ago
because there was not a lot of data on languages, but
with the availability of data with ease, several powerful
machine learning models are already available for
language identification. So, if we want to learn how to
train a machine learning model for language detection,
then this documentation will be helpful.
Introduction:
Language detection
As a human, you can easily detect the languages you
know. For example, I can easily identify Hindi and
English, but being an Indian, it is also not possible for
me to identify all Indian languages. This is where the
language identification task can be used. Google
Translate is one of the most popular language
translators in the world which is used by so many
people around the world. It also includes a machine
learning model to detect languages that you can use if
you don’t know which language you want to translate.

The most important part of training a language


detection model is data. The more data you have about
every language, the more accurate your model will
perform in real-time. The dataset that I am using is
collected from Kaggle, which contains data about 22
popular languages and contains 1000 sentences in
each of the languages, so it will be an appropriate
dataset for training a language detection model with
machine learning.
Proposed method:
 Let’s start the task of language detection with machine
learning by importing the necessary Python libraries and
the dataset.

List of libraries:
1. Pandas library
It reads in large data sets such as .csv files or SQL
databases and can help extract data based on a
meaningful range of values and/or indices. It also has
sets of statistical commands to get averages, sums,
medians, etc.
2. NumPy library
NumPy is a Python library used for working with arrays.
It also has functions for working in domain of linear
algebra, fourier transform, and matrices.
3. sklearn.feature_extraction.text
The sklearn. feature_extraction module can be used to
extract features in a format supported by machine
learning algorithms from datasets consisting of formats
such as text and image.
4. CountVectorizer
CountVectorizer is a great tool provided by the scikit-
learn library in Python. It is used to transform a given
text into a vector on the basis of the frequency (count)
of each word that occurs in the entire text.
5. sklearn.model_selection
Scikit-learn has a collection of classes which can be used
to generate lists of train/test indices for popular cross-
validation strategies. They expose a split method which
accepts the input dataset to be split and yields the
train/test set indices for each iteration of the chosen
cross-validation strategy.
6. train_test_split()
The train_test_split() method is used to split our
data into train and test sets. First, we need to
divide our data into features (X) and labels (y). The
dataframe gets divided into X_train, X_test,
y_train, and y_test. X_train and y_train sets are
used for training and fitting the model.
7. sklearn.naive_bayes
The Naive Bayes is a classification algorithm that is
suitable for binary and multiclass classification.
Naïve Bayes performs well in cases of categorical
input variables compared to numerical variables. It
is useful for making predictions and forecasting
data based on historical results
8. MultinomialNB
The multinomial Naive Bayes classifier is suitable
for classification with discrete features (e.g., word
counts for text classification). The multinomial
distribution normally requires integer feature
counts.
 Let’s have a look at whether this dataset contains any
null values or not:

 Now let’s have a look at all the languages present in this


dataset
This dataset contains 22 languages with 1000
sentences from each language. This is a very
balanced dataset with no missing values, so we can
say this dataset is completely ready to be used to
train a machine learning model.

 Now let’s split the data into training and test sets
I will be using the Multinomial Naïve Bayes algorithm
to train the language detection model as this
algorithm always performs very well on the
problems based on multiclass classification.

 Now let’s use this model to detect the language of a


text by taking a user input
Code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import
CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
data =
pd.read_csv("https://raw.githubusercontent.com/amankh
arwal/Website-data/master/dataset.csv")
print(data.head())
data.isnull().sum()
data["language"].value_counts()
x = np.array(data["Text"])
y = np.array(data["language"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X,
y,test_size=0.33,random_state=42)
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)
user = input("Enter a Text: ")
data = cv.transform([user]).toarray()
output = model.predict(data)
print(output)
Result:
Conclusion:

So our project is used for recognizing the language


used in emails, chats, etc.
Language detection identifies the language of a text
as well as the words and sentences where the
language diverges. Since business messages (chats,
emails, and so on) may be written in a variety of
languages, it is frequently utilized.

You might also like