Assignment 2: Transcribe it

Instructions:
- The aim of this assignment is to give you initial hands-on experience with
grapheme-to-phoneme conversion and its application in transliteration.
- You may only use the Python programming language.
- Carefully read the submission instructions, plagiarism policy and late days policy at
the end of the assignment.
- The deadline to submit this assignment is Sunday, 28th February 2021.

Part 1:
In this part you are required to design and implement rules to phonemically transcribe
Urdu text in IPA. You are provided with the Urdu Phonetic Inventory, which maps Urdu
letters to IPA symbols. The mapping is not entirely one-to-one, so you need to design your
rules carefully, taking the context of each character into consideration. Your program should
take its input (i.e. Urdu words) from input.txt and write its output (i.e. the phonemic
transcription in IPA) to output.txt. A sample conversion is shown in the table below:
Input ‫ا َ ْش َرفی‬
Output ə ʃ r ə f i:
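
A context-sensitive rule set of this kind can be sketched as follows. This is only a minimal illustration, not the expected solution: the handful of letter-to-IPA pairs, the rule that a word-final yeh is read as i: (and as j elsewhere), the treatment of a bare alif as a:, and the names `transcribe` and `process_file` are all assumptions chosen to reproduce the sample above. The real mappings should come from the provided Urdu Phonetic Inventory.

```python
# Code points are written as escapes so the rules do not depend on how an
# editor renders right-to-left text.
ALIF, SHEEN, RE, FE, YEH = "\u0627", "\u0634", "\u0631", "\u0641", "\u06CC"
ZABAR, SUKUN = "\u064E", "\u0652"  # short-vowel mark and "no vowel" mark

# Hypothetical excerpt of a consonant table, not the provided inventory.
CONSONANTS = {SHEEN: "ʃ", RE: "r", FE: "f"}


def transcribe(word):
    """Return a space-separated IPA string for one Urdu word."""
    phonemes = []
    for i, ch in enumerate(word):
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        is_last = i == len(word) - 1
        if ch == ALIF:
            # Assumed rule: alif followed by zabar only carries the vowel
            # mark; otherwise it is read as a long vowel.
            if next_ch != ZABAR:
                phonemes.append("a:")
        elif ch == ZABAR:
            phonemes.append("ə")
        elif ch == SUKUN:
            continue  # sukun marks the absence of a vowel
        elif ch == YEH:
            # Assumed rule: word-final yeh is a long vowel, otherwise a glide.
            phonemes.append("i:" if is_last else "j")
        elif ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
        else:
            raise ValueError(f"no rule for character U+{ord(ch):04X}")
    return " ".join(phonemes)


def process_file(in_path="input.txt", out_path="output.txt"):
    """Transcribe one word per line from in_path into out_path."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.strip()
            if word:
                dst.write(transcribe(word) + "\n")


# The sample word, spelled out with its diacritics in logical order.
sample = ALIF + ZABAR + SHEEN + SUKUN + RE + ZABAR + FE + YEH
print(transcribe(sample))  # ə ʃ r ə f i:
```

Raising an error on unknown characters, instead of silently skipping them, makes gaps in the rule set visible when the system is run on unseen words.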

In order to evaluate your rule-based system, you are provided with a lexicon that contains
manually annotated phonemic transcriptions of 50 Urdu words. You have to compare the
output of your system with the provided phonemic transcriptions. You will use the Word
Error Rate (WER) measure to compare the two transcriptions. The WER is derived from the
Levenshtein distance, a metric for measuring the difference between two sequences
(sequences of IPA symbols in your case). Informally, the Levenshtein distance between two
sequences is the minimum number of single-token edits (insertions, deletions or
substitutions) required to change one sequence into the other; the WER divides this count
by the number of tokens in the reference sequence. You don’t have to implement WER from
scratch and can use the JiWER Python package to calculate the WER between the output of
your system and the provided manual transcriptions.
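
To make the metric concrete, here is a small self-contained sketch of WER over space-separated IPA symbols. It is meant only to show what is being measured; the function names `edit_distance` and `wer` are illustrative, and for the actual evaluation the JiWER call `jiwer.wer(reference, hypothesis)` can be used instead.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Total edits divided by total reference tokens over all pairs."""
    total_edits = 0
    total_tokens = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_tokens += len(ref_tokens)
    if total_tokens == 0:
        raise ValueError("reference transcriptions are empty")
    return total_edits / total_tokens


# One substitution (ə -> a) out of six reference symbols.
print(wer(["ə ʃ r ə f i:"], ["ə ʃ r a f i:"]))
```

Because each IPA symbol is separated by a space, every phoneme counts as one "word", so a long vowel such as i: is a single token.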
Note: During evaluation/grading your system will be tested on words other than the
ones provided, so please don’t hard-code outputs; design rules that generalize well.

Part 2:
In this part you will use the output of Part 1 to Romanize the Urdu words. To achieve
Romanization you are required to design and implement rules to map IPA to English letters.
You can use the mappings provided here (see Figure 25.1). Again, the mapping is not entirely
one-to-one, so you need to design your rules carefully, taking the context of each character
into consideration. Your program should take its input (i.e. Urdu words) from input.txt and
write its output (i.e. the Romanization) to output.txt. A sample conversion is shown in the
table below:
Input ‫ا َ ْش َرفی‬
IPA ə ʃ r ə f i:
Output a sh r a f i
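
The IPA-to-Latin step can be sketched in the same style. The table below is a hypothetical excerpt that covers only the sample word, and the context rule for i: (written as i at the end of a word and as ee elsewhere) is an assumption for illustration, not part of the provided mapping.

```python
# Hypothetical IPA-to-Latin table covering only the sample word.
IPA_TO_ROMAN = {"ə": "a", "ʃ": "sh", "r": "r", "f": "f"}


def romanize(ipa):
    """Map a space-separated IPA string to space-separated Latin letters."""
    tokens = ipa.split()
    letters = []
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if token == "i:":
            # Assumed context rule: final long i is written "i", else "ee".
            letters.append("i" if is_last else "ee")
        elif token in IPA_TO_ROMAN:
            letters.append(IPA_TO_ROMAN[token])
        else:
            raise ValueError(f"no Roman mapping for IPA symbol {token!r}")
    return " ".join(letters)


print(romanize("ə ʃ r ə f i:"))  # a sh r a f i
```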

Notice that I have mapped ‘ə’ to the letter ‘a’, even though you won’t find a direct mapping
of ‘ə’ to ‘a’ in the above-mentioned IPA -> English mapping. This is because the English
alphabet has no dedicated letter for this short vowel, so I used my best guess to create this
mapping. It is open-ended and you are free to create your own rules; e.g. one could argue
that a system should Romanize ‫ اَ ْش َرفی‬ as ‘ashrafe’ or ‘ashrafy’, which is fine. However, if
your system Romanizes it as ‘ashrafa’, that would be considered incorrect.

Part 3:
In this part you will reverse the pipeline (rules) of Part 2 and apply it to convert Romanized
words to Urdu. Your program should take its input (i.e. Romanized words) from input.txt and
write its output (i.e. Urdu words) to output.txt.
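
One possible way to reverse the rules is greedy longest-match segmentation of the Romanized string, so that "sh" is read as one unit and not as "s" followed by "h". The sketch below is illustrative only: the table is a hypothetical excerpt, and it assumes that a word-initial "a" is written as alif while other short "a" vowels are left unwritten, since short-vowel diacritics are often omitted in Urdu text.

```python
ALIF, SHEEN, RE, FE, YEH = "\u0627", "\u0634", "\u0631", "\u0641", "\u06CC"

# Hypothetical Roman-to-Urdu table; multi-letter units must be listed here.
ROMAN_TO_URDU = {"sh": SHEEN, "r": RE, "f": FE, "i": YEH}
MAX_UNIT = max(len(unit) for unit in ROMAN_TO_URDU)


def to_urdu(roman):
    """Convert one Romanized word to Urdu letters (no diacritics)."""
    roman = roman.lower()
    out = []
    i = 0
    while i < len(roman):
        if roman[i] == "a":
            # Assumed rule: only a word-initial short "a" is written (alif).
            if i == 0:
                out.append(ALIF)
            i += 1
            continue
        # Try the longest unit first so "sh" wins over a single letter.
        for size in range(MAX_UNIT, 0, -1):
            unit = roman[i:i + size]
            if unit in ROMAN_TO_URDU:
                out.append(ROMAN_TO_URDU[unit])
                i += len(unit)
                break
        else:
            raise ValueError(f"no Urdu mapping at position {i} of {roman!r}")
    return "".join(out)


# Alif, sheen, re, fe, yeh: the sample word without its diacritics.
print(to_urdu("ashrafi") == ALIF + SHEEN + RE + FE + YEH)  # True
```

The reverse direction is inherently ambiguous (several Urdu letters can share one Roman spelling), so a real rule set would need additional context rules or a preference order among candidates.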

Submission Instructions:
Submit your code (.py) files and outputs (.txt) for each part. The name of each file should
start with your roll number, e.g. your_roll_number_part1.py
Outputs:
- For part 1, submit the output of your system (i.e. phonemic transcription) on the
provided list of 50 words and report the WER
- For part 2, submit the Romanization of the same 50 words provided in part 1
- For part 3, you can design your own test cases. Submit both input.txt (i.e. the
Romanized words) and output.txt (i.e. the corresponding Urdu words)
Zip all the files, name the archive `your_roll_number.zip` and submit it to LMS.

Plagiarism Policy:
All work (e.g. code, experiments, writing etc.) MUST be done independently. Any plagiarism,
or copying of work from others or the internet, will be immediately referred to the DC. If you
are confused about what constitutes plagiarism, it is YOUR responsibility to consult with the
instructor or the TA in a timely manner. No “after the fact” negotiations will be possible. The
only way to guarantee that you do not lose marks is “DO NOT LOOK AT ANYONE ELSE'S
CODE, WORK OR WRITE-UP NOR DISCUSS IT WITH THEM”.

Late Days Policy:
The deadline of the assignment is final. However, in order to accommodate eleventh-hour
issues there is a late submission policy, i.e. you can submit your assignment within 3 days
after the deadline with a 25% deduction for each late day.
