
Department of Computer Science

University of Turbat

FINAL YEAR PROJECT PROPOSAL

NLP Data Collection for Balochi

Group Leader Name: Hafeez Ullah
Group Leader Roll Number: 1104422

Student-1 Name: Aqil Naseer
Student-1 Roll Number: 1103780

Student-2 Name: Sahib Dad
Student-2 Roll Number: 1103781

Student-3 Name: Mubashir Ali
Student-3 Roll Number: 1105150

Student-4 Name: Junaid Akhtar
Student-4 Roll Number: 1104070

Program and Batch: BSCS (2020-2023)

Supervisor: Balach Khan
Co-Supervisor: Junaid Qadir

NLP Data Collection for Balochi
1. Motivation
Far less text data is available for Balochi than for languages that already have mature NLP tools, and
NLP researchers and developers need such data to build comparable tools. To the best of our knowledge,
very few Balochi datasets exist, which makes it difficult for researchers to create NLP technologies for the
language. The purpose of this project is to create a web application that collects text data contributed by
individual users and gives NLP researchers and developers easy access to a Balochi dataset. Our motivation
is that every NLP task that can be performed in other languages should also be possible in Balochi. The
collected data will be organized in different forms to help researchers build efficient tools for Balochi that
already exist for other languages. By creating a platform where users can contribute Balochi words, stories,
poetry, and idioms, we also provide a valuable resource for people who are interested in learning more
about the language and its culture.

2. Overview
The core idea behind this project is to create a web application that gives researchers and
developers easy access to a dataset for Balochi. Through this platform, a user can contribute
Balochi content such as a word, story, poem, or idiom, as well as pictures with a short
description. The web application can also serve as a platform for people to connect, share their
stories, and learn from each other.

2.1 Significance of project

1. Development of NLP applications:

Collecting and creating good-quality datasets for the Balochi language can help in
developing natural language processing (NLP) applications such as machine translation,
which will enable Balochi speakers to communicate with others and to share Balochi
words or proverbs with people who want to speak the Balochi language.

2. Supporting Research:

The web application will also serve as a valuable resource for researchers and scholars who
are interested in studying the Balochi language and culture. By providing a platform for
users to contribute their knowledge, we can help build a good database of information
that can be used for academic research and analysis.

3. Advancing knowledge:

This project can contribute to advances in Balochi language technology, which
could have implications for further research and development.

4. Facilitating Communication:

The Balochi language is spoken by the population of the Balochistan region, and there are
communities of Balochi speakers living in other parts of the world. Collecting data for NLP
can facilitate communication between these communities around the world.

5. Solving problems:

This will help to solve particular problems or challenges faced by a community,
industry, or society, which could lead to real-world impact.

2.2 Description of project

The first step of this project is to identify the sources of data, which could include
words, sentences, and proverbs. Once the sources of data have been identified, the next step
is to organize and categorize the data. In addition, it is important to ensure that the collected
data represent the language as a whole. This means collecting data from people in different
areas, that is, from anybody who has registered on the site, to ensure a balanced and accurate
representation of the language.

Finally, contributors must obtain approval from the admin before the words, sentences, and
proverbs they contribute are accepted. Overall, collecting data for the Balochi language
involves careful planning, organization, and consideration of the different dialects and
variations of the language, as well as ensuring that data is ethically collected and
contributed.
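
The organizing, categorizing, and admin-approval steps described above could be modelled roughly as follows. This is a minimal Python sketch for illustration only, not the project's actual design: the class names, the fields (`contributor`, `region`), and the three-state review status are all assumptions introduced here, and the proposal itself plans to implement the application in Laravel/PHP or Java Spring Boot.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    # Categories named in the description: words, sentences, proverbs.
    WORD = "word"
    SENTENCE = "sentence"
    PROVERB = "proverb"


class Status(Enum):
    # Hypothetical review states; every contribution starts as PENDING.
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Contribution:
    text: str
    category: Category
    contributor: str            # registered user's name (assumed field)
    region: str = "unknown"     # contributor's area, to track dialect coverage
    status: Status = Status.PENDING


@dataclass
class Collection:
    items: list = field(default_factory=list)

    def submit(self, contribution):
        """Store a new contribution; it stays pending until an admin reviews it."""
        if not contribution.text.strip():
            raise ValueError("contribution text must not be empty")
        self.items.append(contribution)

    def review(self, contribution, approve):
        """Record the admin's decision on a contribution."""
        contribution.status = Status.APPROVED if approve else Status.REJECTED

    def approved_by_category(self):
        """Group the texts of approved contributions by category."""
        grouped = {category: [] for category in Category}
        for item in self.items:
            if item.status is Status.APPROVED:
                grouped[item.category].append(item.text)
        return grouped


if __name__ == "__main__":
    collection = Collection()
    sample = Contribution("نان", Category.WORD, contributor="user1", region="Turbat")
    collection.submit(sample)
    collection.review(sample, approve=True)
    print(collection.approved_by_category()[Category.WORD])
```

Keeping the review status on each record means that only admin-approved text reaches the dataset, while the `region` field gives a simple way to check whether contributions cover different areas.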

2.3 Background of project

To our knowledge, few datasets exist for Balochi, so the language faces a significant
challenge in terms of the availability of text data for Natural Language Processing (NLP)
research and development. We have studied an existing platform, “Kissah.org”, where
only Balochi kissah are available. In contrast, we will develop a platform where a user can
contribute Balochi poetry, proverbs, idioms, stories, and so on. We will then use the
collected data to build a good wordlist or dataset for Balochi, which will help NLP
researchers and developers to develop NLP technologies.

We have seen our seniors struggle to collect textual data for Balochi for their final year
projects. Their projects, “Next word prediction” and “Entity Name Recognition”, were
held back by the lack of a Balochi dataset. We therefore want to develop a platform where
users contribute Balochi text, to which we will apply NLP models to build a good dataset
for Balochi.

The availability of a comprehensive Balochi wordlist can contribute to the development of
language technologies such as spell checkers, grammar tools, and machine translation
systems tailored to the needs of Balochi speakers. It can also foster collaboration between
linguists, researchers, and language enthusiasts interested in studying and promoting the
Balochi language.

3. Methodology

At the initial stage of development, a development team must choose a methodology. Our
development team will use the Agile process model to design, implement, and test our
idea.

3.1 Requirement Gathering


At the initial stage, we will gather all the requirements needed to complete the project, following
the software development process. The requirements can be gathered by interviewing users, studying
existing software, and so on.

3.2 Design phase


Our development team will use different software tools to design the overall structure of the
project. We will use Adobe XD to design the prototype of our application. The prototype will
clearly demonstrate the overall structure of the web application.

3.3 Implementation phase


The project idea will be implemented using technologies such as HTML, CSS, JavaScript, and
Laravel/PHP or Java Spring Boot.

3.4 Testing phase


After the project is completed, we will test each component and function of the application using
test cases. In the testing phase, we will check whether the web application fulfills the user
requirements, testing each list and category of the web application. If the web application behaves
as planned, the test is successful.

4. Features
Below are some of the features of our project idea that will be implemented.

● User Authentication and Registration:


This feature allows users to create accounts, log in to the platform, and access the features of the
application.
● Word Contribution:
This feature allows users to contribute new Balochi words to the platform. Users can also add
meanings, usage examples, and an audio recording of each word's pronunciation.
● Contributions of Balochi data:
This feature allows users to contribute Balochi stories, Balochi poetry and Idioms to the platform.
● Search Functionality:
This feature allows users to search for content on the platform using keywords or filters. This could
include a search bar or advanced search options. Users can search for Balochi words, stories, poetry
and idioms contributed by the users.
● Feedback and Support:
This feature allows users to provide feedback and support to the platform’s administrators. This
could include a feedback form or a contact page.
● Content submission:
Users can submit Balochi words, phrases, and stories through a user-friendly

● Data Validation and Quality Control:


A validation mechanism ensures the accuracy and consistency of user-contributed
content.
● Natural Language Processing (NLP):
State-of-the-art NLP techniques are employed to process and analyze user-contributed
text. Text preprocessing, linguistic analysis, and feature extraction are performed for
wordlist generation.
● Wordlist Generation:
The processed data is used to generate a comprehensive and accurate Balochi wordlist.
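
The validation, preprocessing, and wordlist-generation features listed above could start from something as simple as the following Python sketch. It is a plain frequency-count baseline, not the state-of-the-art techniques the feature list mentions, and it rests on stated assumptions: contributions are written in the Arabic-based Balochi script, tokens containing any other script are discarded as a basic validation step, and the function names and the `min_count` parameter are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Runs of Unicode letters: excludes digits, punctuation, and underscores.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def normalize(text):
    """Apply NFKC normalization, then drop combining marks such as vowel signs."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def is_arabic_script(token):
    """Basic validation: accept only tokens made of Arabic-block letters."""
    return all(
        "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F"
        for ch in token
    )


def build_wordlist(texts, min_count=1):
    """Return (word, count) pairs, most frequent first, ties in code-point order."""
    counts = Counter()
    for text in texts:
        for token in WORD_PATTERN.findall(normalize(text)):
            if is_arabic_script(token):
                counts[token] += 1
    pairs = [(word, n) for word, n in counts.items() if n >= min_count]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Sample input only; Latin-script words and digits are filtered out.
    contributions = ["نان، نان دست", "دست hello 123 نان"]
    for word, count in build_wordlist(contributions):
        print(word, count)
```

Raising `min_count` is one crude quality-control lever: a word that appears only once across all contributions is more likely to be a typo than one submitted by several users.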

5. Project Planning
Here is a detailed schedule for the successful completion of our project. All the details are listed down.

6. Required Hardware and Software (For Development Phase)
The following hardware and software are the minimum requirements for the successful completion of
the project.

6.1 Hardware

● Processor: Intel(R) Core i7 CPU (recent generation)


● RAM: 8 GB
● Hard Disk: 500GB SSD

6.2 Software

● OS: Windows 10 (or later)


● Platform: Visual Studio Code, HTML & CSS, JavaScript, and Laravel/PHP or Java Spring Boot

7. Hardware and Software (For Implementation Phase)


7.1 Hardware

● The system can run on any desktop computer and on Android devices

7.2 Software

● Can run on the latest versions of the Android API and on Windows/macOS
● Supports Android version 2.3.3 and later

9. Diagrammatic Representation of the Overall System

11. References
The following references helped us to complete our project.
[1] https://paperswithcode.com/area/natural-language-processing
[2] https://kissah.org/
[3] https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp
[4] https://www.urdunlp.com/2019/05/urdu-tokenization-usingspacy.html
[5] https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
