
AUTOMATED LANGUAGE TRANSLATION
FOR INDIGENOUS LANGUAGES


MACHINE LEARNING PROJECT REPORT
Submitted by
Shaik Daniya Sulthana Begam (99220040194)
K. Lakshmi Sai Deepika (99220040283)
N. Yasaswini (99220040320)
P. Manasa (99220040326)
B. Tech - Computer Science and Engineering, Artificial
Intelligence and Machine Learning

Kalasalingam Academy of Research and Education
(Deemed to be University)
Anand Nagar, Krishnankoil-626126
March 2024

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DECLARATION BY THE STUDENT

We hereby declare that this project “AUTOMATED
LANGUAGE TRANSLATION FOR INDIGENOUS
LANGUAGES” is our genuine work and no part of it has been
reproduced from any other works.

Register No. Student Name Sign


99220040194 Sk. Daniya Sulthana Begam
99220040283 K. Lakshmi Sai Deepika
99220040320 N. Yasaswini
99220040326 P. Manasa

Date:
BONAFIDE CERTIFICATE
Certified that this project report “AUTOMATED LANGUAGE
TRANSLATION FOR INDIGENOUS LANGUAGES” is the
bonafide work of “Sk Daniya Sulthana Begam (99220040194),
K. Lakshmi Sai Deepika (99220040283), N. Yasaswini
(99220040320), P. Manasa (99220040326)” who carried out the
project work under my supervision.
Dr. N. Suresh Kumar
Head of the Department
Professor
Department of CSE
Kalasalingam Academy of Research and Education
Krishnankoil-626126

Mr. R. Mari Selvan
Supervisor
Assistant Professor
Department of CSE
Kalasalingam Academy of Research and Education
Krishnankoil-626126

Project Final Review Viva-voce held on ___________

Internal Examiner External Examiner

ACKNOWLEDGEMENT

We would like to express our sincere gratitude to all the
researchers, data scientists, and developers who have
contributed to the field of automated language translation for
indigenous languages using machine learning. Their innovative
work and dedication have paved the way for advancements in
language translation technology. We also acknowledge the
support and collaboration of industry partners, academic
institutions, and regulatory bodies who have provided valuable
insights and resources to enhance our understanding of language
translation techniques. Furthermore, we extend our appreciation
to the open-source community for sharing tools, libraries, and
datasets that have enabled us to experiment and develop
machine learning models for language translation. Lastly, we
thank our colleagues and team members for their hard work and
collaboration in researching, designing, and implementing
automated language translation systems. Their expertise and
commitment have been instrumental in the success of our
projects. Together, we are working towards a more inclusive
and accessible digital environment by leveraging the power of
machine learning for language translation.

Table of Contents
Chapter No. Title Page No.
1 Abstract 6
2 Introduction 7
3 Literature Survey 8-10
3.1 Features and advantages 11-12
3.2 Limitations and challenges 13
4 Methodology 14
4.1 Packages used 15
4.2 Data collection methods, sources 15
5 Proposed works
5.1 Flowchart 16
5.2 Code Implementation 17-20
6 Reference papers 21

CHAPTER 1
ABSTRACT

The preservation and translation of indigenous languages are
crucial for cultural heritage and societal inclusivity. This report
explores the advancements in automated language translation for
indigenous languages facilitated by machine learning
algorithms. It examines the challenges unique to translating
indigenous languages, such as limited linguistic resources,
dialectal variations, and non-standard syntax. Additionally, it
reviews the existing methodologies and technologies used in
machine translation and discusses their applicability to
indigenous languages. The report emphasizes the significance of
language translation in industries like education,
communication, and healthcare, highlighting the benefits of
using machine learning models for real-time language
translation. It also addresses the challenges associated with
implementing machine learning for language translation.
This report demonstrates the effectiveness of machine
learning algorithms in language translation tasks and shows
the potential of automated translation systems to support
communication and the preservation of indigenous languages.

CHAPTER 2
INTRODUCTION
• Automated language translation for indigenous languages
leverages the capabilities of machine learning algorithms to
decipher and translate text from one language to another.
The translator app is a versatile tool designed to facilitate
seamless communication across different languages.
• Leveraging cutting-edge technologies such as Streamlit,
speech recognition, and translation APIs, this application
offers users the ability to translate both text and speech in
real time.
• With globalization becoming increasingly prevalent in
today's world, the need for efficient language translation
solutions has never been more critical.
• The translator app aims to address this need by providing a
user-friendly platform for individuals and businesses to
overcome language barriers effortlessly.
• Whether you're traveling abroad, conducting international
business, or simply seeking to connect with people from
diverse linguistic backgrounds, this app helps users
communicate across the languages its translation services
support.
• Join us as we explore the features and functionalities of this
innovative translator app, revolutionizing the way we
interact and communicate in a multilingual world.

CHAPTER 3
LITERATURE SURVEY

1. “Survey on Automatic Speech Recognition Systems for Indic Languages”:
This survey focuses on Automatic Speech Recognition (ASR) systems for Indic
languages. It discusses various approaches and techniques for converting speech
to text. The goal is to help researchers incorporate essential parameters in their
speech recognition systems to overcome existing limitations.
2. “Automatic Language Identification in Texts: A Survey”:
This article provides a brief history of language identification (LI) research and
an extensive survey of features and methods used in the LI literature. It introduces a
unified notation for describing these features and methods. The paper also discusses
evaluation methods, applications of LI, and off-the-shelf LI systems that do not require
user training.
3. “ASRoIL: A Comprehensive Survey for Automatic Speech Recognition
(ASR) in Indian Languages”:
This paper systematically surveys existing literature related to ASR
(speech-to-text conversion) for Indian languages. It covers research on Indian
ASR datasets and associated investigations.
4. “A Survey on Speech Recognition in Indian Languages”:
This survey presents major research works in the development of Automatic
Speech Recognition (ASR) for Indian languages.
5. “A Comprehensive Survey on Indian Regional Language Processing”:
This survey discusses challenges, data sources, and future directions for enhancing
the processing of Indian regional languages in various language processing tasks.

Modules/Libraries used:

• Streamlit is an open-source Python library for creating and
sharing interactive web apps from Python scripts. It suits data
scientists and developers who want to build dashboards, generate
reports, or create chat applications without extensive front-end
development knowledge.

• SpeechRecognition (imported as speech_recognition) is a library
that allows Python developers to convert spoken language into text
through various speech recognition engines and APIs. It supports
multiple services, including the Google Web Speech API, Microsoft
Bing Voice Recognition, IBM Speech to Text, and others.

• pyttsx3 is a Python library for text-to-speech conversion. It is a
fork of the older pyttsx library that supports Python 3 and works
offline, using the speech engines installed on the system.

• Googletrans is an unofficial Python library that provides a
convenient interface to Google Translate. It allows you to perform
language detection and translation tasks within your Python code.

• gTTS (Google Text-to-Speech) is both a Python library and a
command-line interface (CLI) tool. It allows developers to
interface with Google Translate’s text-to-speech API, enabling the
conversion of text into speech in over 40 languages.

Feature Engineering and Selection:
Feature extraction:
• Character n-grams: Extract sequences of characters (e.g.,
bigrams, trigrams) from the text. This can capture patterns specific
to certain languages.
• Word n-grams: Similarly, extract sequences of words to capture
language-specific patterns.
• Language-specific Features: Incorporate linguistic features that
are known to be characteristic of certain languages. For example,
certain phonetic or orthographic features may be unique to specific
indigenous languages.
• Statistical Features: Compute statistics such as word frequencies,
average word length, or entropy of the text. These can provide
insights into the language's characteristics.
• Syntactic Features: Utilize syntactic features like part-of-speech
tags or syntactic dependencies. Some languages may exhibit
specific syntactic patterns.
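
The character n-gram, word n-gram, and statistical features listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's implementation: the function names and the sample sentence are hypothetical, and tokenization is assumed to be plain whitespace splitting, which may not suit every indigenous language.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams, splitting on whitespace (an assumption)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the character distribution of the text."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample text used only to exercise the functions.
sample = "the cat and the hat"
features = {
    "top_char_bigrams": char_ngrams(sample, 2).most_common(3),
    "word_bigrams": word_ngrams(sample, 2),
    "avg_word_length": average_word_length(sample),
    "char_entropy": char_entropy(sample),
}
print(features)
```

Counts like these are typically turned into a fixed-length vector before being passed to a classifier; that step is omitted here because the report does not specify a model.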

3.1 FEATURES AND ADVANTAGES

Features:

1. Phonetic and Orthographic Challenges: Indigenous languages
often have phonetic sounds and orthographic conventions that
differ significantly from widely spoken languages. Translation
systems need to be equipped to handle these variations
accurately.

2. Context Sensitivity: Indigenous languages often rely heavily on
context for meaning, including subtle cues and non-verbal
communication. Translation systems should be sensitive to these
contextual cues to provide accurate translations.

3. Customization and Adaptation: Translation systems should
allow for customization and adaptation to specific indigenous
languages, accounting for dialectal variations and linguistic
idiosyncrasies.

4. Feedback Mechanisms: Implementing mechanisms for
feedback from users within indigenous communities can help
improve translation quality over time by incorporating
community feedback and corrections.

5. Ethical Considerations: Ensuring that the development and
deployment of automated translation systems respect the rights,
autonomy, and cultural sensitivities of indigenous communities
is paramount.
Advantages:
• The system incorporates technologies such as speech recognition,
translation APIs, and user-friendly interfaces to enhance user
experience.

• The system offers both text and voice translation functionalities,
catering to a wide range of user preferences and needs.

• Users can input text in their preferred language and translate it into
multiple target languages with just a few clicks.

• Additionally, the system supports voice-to-text translation,
allowing users to speak in one language and receive translations in
their desired language.

• The translator app aims to improve cross-language
communication by providing accurate, efficient, and accessible
translation services.

• With its intuitive design and robust features, the proposed system
seeks to bridge language barriers and promote global connectivity
and understanding.

3.2 LIMITATIONS AND CHALLENGES
LIMITATIONS:
Limited Training Data: Indigenous languages often have
limited digital presence and resources compared to widely
spoken languages.
Complex Linguistic Structures: Many indigenous languages
have complex linguistic structures, including unique
grammatical rules, syntax, and semantics.
Lack of Standardization: Indigenous languages often lack
standardization across different dialects and regions.
Cultural Context: Indigenous languages are deeply
embedded in their respective cultures, and translations often
involve conveying cultural nuances, idiomatic expressions,
and contextual meanings.

CHALLENGES:
Code-Switching and Borrowing
Quality Control and Evaluation
Accessibility and Infrastructure
Community Involvement and Ownership
Ethical and Socioeconomic Implications

CHAPTER 4
METHODOLOGY
• The provided code initializes a Streamlit app with two main
sections: text translation and voice translation.
• In the text translation section, users can enter text, choose
source and target languages, and click the “translate”
button.
• The translated text is displayed, and an audio file is
generated and played.
• The voice translation section allows users to select the
source language for speech input.
• The recognized speech is translated, and the translated text
is displayed along with an audio playback.
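
The two flows described above can be sketched independently of any particular library. This is a hedged outline, not the project's code: every name below (`translate_text`, `translate_voice`, `TranslationResult`, and the `fake_*` stand-ins) is hypothetical, and the language codes are only examples. In the report's app the translate step would presumably be backed by Googletrans, the audio step by gTTS, and the recognition step by SpeechRecognition, none of which are called here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures for the pluggable back ends.
Translate = Callable[[str, str, str], str]  # (text, src, dest) -> translated text
Synthesize = Callable[[str, str], bytes]    # (text, lang) -> audio bytes
Recognize = Callable[[bytes, str], str]     # (audio, lang) -> recognized text


@dataclass
class TranslationResult:
    source_text: str
    translated_text: str
    audio: bytes


def translate_text(text: str, src: str, dest: str,
                   translate: Translate, synthesize: Synthesize) -> TranslationResult:
    """Text section: translate the entered text, then synthesize audio for it."""
    text = text.strip()
    if not text:
        raise ValueError("no text to translate")
    translated = translate(text, src, dest)
    return TranslationResult(text, translated, synthesize(translated, dest))


def translate_voice(audio: bytes, src: str, dest: str, recognize: Recognize,
                    translate: Translate, synthesize: Synthesize) -> TranslationResult:
    """Voice section: recognize the speech, then reuse the text flow."""
    recognized = recognize(audio, src)
    return translate_text(recognized, src, dest, translate, synthesize)


# Offline stand-ins so the sketch runs without network access or a microphone.
def fake_translate(text: str, src: str, dest: str) -> str:
    return f"[{src}->{dest}] {text}"


def fake_synthesize(text: str, lang: str) -> bytes:
    return text.encode("utf-8")


def fake_recognize(audio: bytes, lang: str) -> str:
    return audio.decode("utf-8")


result = translate_voice(b"hello", "en", "ta",
                         fake_recognize, fake_translate, fake_synthesize)
print(result.translated_text)  # prints "[en->ta] hello"
```

Passing the back ends in as callables keeps the voice flow a thin wrapper around the text flow, and lets the logic be exercised without the external services.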

4.1. PACKAGES USED
Streamlit
SpeechRecognition (imported as speech_recognition)
pyttsx3
Googletrans
gTTS

4.2. DATA COLLECTION METHODS AND SOURCES


Parallel Texts
Digital Archives and Repositories
Government and Institutional Documents
User-Generated Content
Collaboration with Indigenous Communities
Text Corpora

CHAPTER 5
PROPOSED WORKS

5.1. Flow Chart

5.2. Code Implementation


Output

CHAPTER 6
REFERENCE PAPERS
• “Ethical Considerations for Machine Translation of
Indigenous Languages”:
Discusses ethical challenges and emphasizes community
involvement.
• Canadian Indigenous Languages Technology Project:
Develops language technologies for Indigenous
languages.
• “IndT5: A Text-to-Text Transformer for 10 Indigenous
Languages”:
First Transformer model for Indigenous languages.
• “Enhancing Translation for Indigenous Languages”:
Investigates multilingual models for translation.
