Final Year Project 1
University of Turbat
NLP Data Collection for Balochi
1. Motivation
Far less text data is available for Balochi than for languages that already have mature NLP tools,
and to the best of our knowledge very few Balochi datasets exist. This makes it difficult for
researchers and developers to create NLP technologies for Balochi. The NLP tasks that can already be
performed in other languages are our motivation for supporting the same tasks in Balochi. The purpose of
this project is to create a web application that collects text data contributed by individual users and
gives NLP researchers and developers easy access to a Balochi dataset. The collected data will be
organized in different forms so that researchers can build efficient tools for Balochi of the kind that
already exist for other languages. By creating a platform where users can contribute Balochi words,
stories, poetry, and idioms, we also provide a valuable resource for people who are interested in
learning more about the language and its culture.
2. Overview
The core idea behind this project is to create a web application that gives researchers and
developers easy access to a dataset for Balochi. Through this platform, a user can contribute
Balochi content such as a word, story, poem, or idiom, as well as pictures with a short
description. The web application can also serve as a platform for people to connect, share their
stories, and learn from each other.
Collecting and creating good-quality datasets for the Balochi language can help in
developing natural language processing (NLP) applications such as machine translation. These
will enable Balochi speakers to communicate with others and to provide Balochi words or
proverbs to people who want to speak the Balochi language.
2. Supporting Research:
The web application will also serve as a valuable resource for researchers and scholars who
are interested in studying the Balochi language and culture. By providing a platform for
users to contribute their knowledge, we can help build a good database of information
that can be used for academic research and analysis.
3. Advancing knowledge:
4. Facilitating Communication:
Balochi is spoken by the population of the Balochistan region, and there are
communities of Balochi speakers living in other parts of the world. Collecting data for NLP
can facilitate communication between these communities around the world.
5. Solving problems:
This will help to solve particular problems or challenges faced by a
community, industry, or society, and could lead to real-world impact.
The first step of this project is to identify the sources of data. These could include
words, sentences, and proverbs. Once the sources of data have been identified, the next step
will be to organize and categorize the data. In addition, it is important to ensure that the collected
data represent the language as a whole. This means collecting data from people in different areas,
that is, from anybody who has registered on the site, to ensure a balanced and accurate representation
of the language.
Finally, it is important to obtain permission from the admin before words, sentences, and
proverbs related to the language are contributed. Overall, collecting data for the Balochi
language project involves careful planning, organization, and consideration of the different
dialects and variations of the language, as well as ensuring that data is ethically collected and
contributed.
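The categorization and admin-approval steps described above could be modeled roughly as follows. This is only a sketch under stated assumptions: the names `Contribution`, `Category`, `Status`, `review`, and `approved_by_category` are hypothetical and are not taken from the project's actual design, and the three categories are simply the ones named in the text (words, sentences, proverbs).

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Categories named in the proposal; more (stories, poetry, idioms) could be added.
    WORD = "word"
    SENTENCE = "sentence"
    PROVERB = "proverb"


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Contribution:
    text: str
    category: Category
    contributor: str          # hypothetical: id or name of the registered user
    dialect: str = "unknown"  # free-text label; no fixed dialect list is assumed
    status: Status = Status.PENDING


def review(item: Contribution, accept: bool) -> Contribution:
    """Record the admin's decision on a pending contribution."""
    if item.status is not Status.PENDING:
        raise ValueError("contribution has already been reviewed")
    item.status = Status.APPROVED if accept else Status.REJECTED
    return item


def approved_by_category(items):
    """Group the text of approved contributions by category."""
    grouped = {category: [] for category in Category}
    for item in items:
        if item.status is Status.APPROVED:
            grouped[item.category].append(item.text)
    return grouped
```

Keeping the dialect and contributor on each record is one possible way to check later whether the collected data covers different areas and speakers in a balanced way.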
To the best of our knowledge, very few datasets exist for Balochi, so the language faces a
significant challenge in terms of the availability of text data for Natural Language Processing
(NLP) research and development. We have studied an existing platform, “Kissah.org” [2], where
only Balochi kissah are available. We will develop a platform where a user can contribute
Balochi poetry, proverbs, idioms, stories, and so on. We will then take the collected data
and build a good wordlist or dataset for Balochi, which will help NLP researchers and
developers to develop NLP technologies.
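As an illustration of how contributed text might be turned into a wordlist, the sketch below counts word frequencies across a list of texts. The functions `tokenize` and `build_wordlist` and the punctuation set `EDGE_PUNCT` are assumptions made for this example, and splitting on whitespace is only a simple stand-in: real Balochi tokenization would likely need more care. The sample strings are placeholder tokens, not real Balochi data.

```python
from __future__ import annotations

from collections import Counter

# Punctuation stripped from the edges of each token. The set includes a few
# common Arabic-script marks; it is an assumption and should be reviewed by
# Balochi speakers before real use.
EDGE_PUNCT = ".,;:!?\"'()[]{}،؛؟۔"


def tokenize(text: str) -> list[str]:
    """Split on whitespace and strip punctuation from each token's edges."""
    tokens = (raw.strip(EDGE_PUNCT) for raw in text.split())
    return [tok for tok in tokens if tok]


def build_wordlist(texts: list[str]) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder contributions standing in for approved user submissions.
    sample = ["alpha beta, alpha", "beta gamma alpha."]
    for word, count in build_wordlist(sample):
        print(word, count)
```

A frequency-sorted list like this is one simple form the dataset could take; other forms, such as sentence-level corpora, would need their own processing.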
We have seen our seniors struggle to collect Balochi text data for their final year
projects, which included “Next word prediction” and “Entity Name Recognition”. They faced
problems because of the lack of a Balochi dataset. We therefore want to develop a platform
where users will contribute Balochi text, and we will use NLP models to turn it into a good
dataset for Balochi.
3. Methodology
At the initial stage of development, a development team must choose a methodology. Our
development team will use the Agile process model to implement our idea by designing,
implementing, and testing it.
4. Features
Below are some of the features of our project idea that will be implemented.
5. Project Planning
Here is a detailed schedule for the successful completion of our project. All the details are listed below.
6. Required Hardware and Software (for Development Phase)
The following hardware and software are the minimum requirements for the successful
completion of the project.
6.1 Hardware
6.2 Software
● The system can run on any desktop computer and on Android devices.
7.2 Software
9. Diagrammatic Representation of the Overall System
11. References
The following references helped us to complete our project.
[1] https://paperswithcode.com/area/natural-language-processing
[2] https://kissah.org/
[3] https://www.kaggle.com/code/satishgunjal/tokenization-in-nlp
[4] https://www.urdunlp.com/2019/05/urdu-tokenization-usingspacy.html
[5] https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4