lecture 6

Ling 4807: Applications of Computer in Linguistics
Farig Sadeque
Assistant Professor
Computer Science and Engineering
BRAC University
Development of Bangla language technology: scope and necessity
Talking points
- Summary of the current status
- Components
- Spell and grammar checker
- Translation
- OCR
- Sentiment analysis
- Speech to text and text to speech
- Plagiarism checker
- Question answering
- Digital assistant
- Sign language to text converter
Summary
Existing Bangla NLP market analysis
- Market opportunity
- Existing tech
- Entry barrier and challenges
Spell and grammar checker
- Market opportunity
Target segmentation
- Journalists
- Renowned published writers
- Students
- Bloggers and social media users
- Translators
- Bangla copywriters and content writers
- Government employees
Existing tech
- One spell checker from EBLICT: https://spell.bangla.gov.bd/
- Some other online spell checkers
- No grammar check/correction tool
- Some SOTA research came out of the ভাষাভ্রম competition this year
Translation
- English to Bangla and Bangla to English
- Potential Market
- 75% of consumers are more likely to buy products from websites in their native language
- 65% of non-native English speakers prefer content in their native tongue
- The market was valued at USD 650 million in 2020 and is expected to reach USD 3 billion by 2027
- Interested communities:
- Technology & manufacturing: Translate manuals for various machinery and products
- Global business people: Translate to better understand culturally specific statements
- Finance and legal: Translate documents without any contextual mistakes
- Marketing (copy & content writers): Translate from Bangla to English or English to Bangla to advertise products
- E-commerce: Translate to communicate product information
- Healthcare: Translate important healthcare information
- Freelance writers
Existing tech
- Multiple government initiatives
- Amar Vasha was supposed to use artificial intelligence to translate Supreme Court orders and
decisions from English to Bangla
- BUET CSE published a 2.75 million sentence-pair translation corpus
- Google's proprietary machine translation technology, dubbed Google Neural
Machine Translation (GNMT), employs recurrent neural networks
- Over 4,000 volunteers from 81 locations throughout the nation entered at
least 400,000 words into the translation software on a single day to celebrate
Independence Day
Entry barriers
- Bangla language structure
- The collected corpus was never used to build proper software
- Why?
OCR
- Potential market: globally valued at 10.65 billion USD

Segment: Application
- Finance sector (banks): Achieving better transaction security, risk management, scanning handwriting
- ATM booths: Providing two-layer security at ATMs
- Number plate recognition: Providing better security to people
- Healthcare industry: Scanning and searching patient records
- Travel industry: Scanning passports, storing personal data
- Education: Converting handwritten documents into digital texts
- Retail: Transcribing written data (package info, product list, product descriptions, etc.)
Existing tech
- Bangla OCR has been studied since the 1980s
- BOCRA and Apona Pathak were introduced
- These weren't open source and weren't maintained
- CRBLP OCR, 2007
- Tesseract project
- Open source, maintained by Google
- Google Lens also works moderately well for OCR
- Puthi was developed by TeamEngine, with a claimed accuracy of 95%
- But the project failed for technical reasons and was never released for public use
- Apurba developed one which was funded by EBLICT
- Let’s see how well it works, shall we?
- https://ocr.bangla.gov.bd/
Entry barriers
- Developing a completely new dataset for Bangla is difficult. Why?
- Because alphasyllabary scripts use a cursive writing style and frequent diacritics, segmenting the graphical components into individual characters becomes very challenging.
- In a complicated script like Bangla, broad pixel regions in the upper or lower portion of a character cannot be removed during noise elimination, because doing so would erase not just noise but also the difference between two characters.
- The lettering of Bangla words can also make segmentation difficult.
- Complex typefaces, preservation issues, etc.
- No pipeline was developed
Sentiment Analysis
- Potential market: The Asia-Pacific market is expected to reach US$523.6
million by 2027, led by nations such as Australia, India, and South Korea.

Segment: Use case
- Brand management: Brand monitoring
- Customer service companies: NLP-based chatbots to understand sentiment
- Finance & stock: Making investments
- Business intelligence sector: Analyzing consumers' responses
- NPS surveys: Making consumers' experience better
- Market research: Competitive analysis and product/service demand
- Healthcare: Optimizing patients' experiences
Existing tech
- One publicly available app:
https://sentiment.bangla.gov.bd/sentiment-emotion-analysis
- Many researchers and students work on sentiment analysis, but there is still no publicly available corpus
Entry barriers
- Lack of quality data, no standard corpus
- A lot of researchers are willing to work on the problem because it’s trendy, not
because they actually want to develop software that can analyze emotions
- Social media data has issues
Speech-to-text
- Potential market: valued at USD 1 billion in 2019 and expected to grow to USD 3 billion by 2027
Segment: Application
- Students and teachers: Dictate and save time on documentation of lectures and reports
- Writers: Transcribe ideas and make a draft instantly
- People with certain disabilities: Making a draft using only voice
- Journalists: Making a draft of interviews instantly
- Customer service: Transcribe customer complaints and ideas instantly
Existing tech
- Some major datasets exist, but no usable model
- Not enough data
- Lacks variety
- Needs three major components:
- Acoustic model
- Pronunciation model
- Language model
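The three components listed above are usually combined by choosing the word W that maximizes P(audio | phonemes) x P(phonemes | W) x P(W). The Python below is a minimal sketch of that combination, not a working recognizer: every probability, phoneme, and word label in it is a made-up placeholder rather than data from any Bangla corpus, and a real decoder searches over whole sentences instead of two candidate words.

```python
import math

# All numbers below are made-up toy values for illustration only.

# Acoustic model: log P(audio | phoneme sequence) for one fixed audio clip.
acoustic_logp = {
    ("k", "a"): math.log(0.6),
    ("g", "a"): math.log(0.4),
}

# Pronunciation model: log P(phoneme sequence | word).
pronunciation_logp = {
    "word_a": {("k", "a"): math.log(1.0)},
    "word_b": {("g", "a"): math.log(0.9), ("k", "a"): math.log(0.1)},
}

# Language model: log P(word).
language_logp = {"word_a": math.log(0.2), "word_b": math.log(0.8)}


def score(word):
    """Best combined log score for a word over its known pronunciations.

    Takes the maximum over pronunciations (a Viterbi-style approximation)
    instead of summing over them.
    """
    return max(
        acoustic_logp[phones] + pron_lp + language_logp[word]
        for phones, pron_lp in pronunciation_logp[word].items()
        if phones in acoustic_logp
    )


def decode(candidate_words):
    """Return the candidate word with the highest combined score."""
    return max(candidate_words, key=score)


best = decode(["word_a", "word_b"])
print(best)
```

In this toy setup the acoustic model alone prefers the phonemes of word_a, but the language model's prior shifts the final decision to word_b, which is why all three components need enough data to be trained well.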
Entry barriers
- Data acquisition
- Need more than 10,000 hours of speech data
- None of the datasets mentioned earlier has more than 500 hours
Text-to-speech
- Potential market: the worldwide text-to-speech market is expected to reach USD 5,790.1 million by 2028, up from USD 2,543.1 million in 2021, at a CAGR of 12.3 percent between 2022 and 2028
Segment: Application
- Banking and finance: Customer service
- Tourism: Sharing real-time information
- Telecommunications: Improve the customer experience
- Automotive manufacturing: GPS system
- Brand management: Improving brand image through consistent voice
- People with disabilities: Improving learning and communication
Existing tech
- Kotha, based on Festival, was released in 2007 by CRBLP
- Other systems include Subachan and Anuprash
Entry barriers
- Lack of publicly accessible gold standard data
- Difficult to compare models
- Long term sustainability is an issue
- Kotha is still available online, but no one has maintained it in the last 10 years; it still needs Windows 7 to run
- Speech synthesis by its nature is a difficult task
Plagiarism checker
- The global market for anti-plagiarism software in the education sector is expected to grow at a CAGR of 13.8 percent between 2020 and 2027, from USD 819.5 million in 2020 to USD 2,029.4 million in 2027
- Due to the lack of a national plagiarism policy, institutions are sometimes unable to take action against plagiarized research. No university in Bangladesh even has a plagiarism policy
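The growth rate quoted for the anti-plagiarism software market can be cross-checked with the standard CAGR formula, (end / start)^(1 / years) - 1. The sketch below recomputes the rate from the two market sizes on this slide; the function names are chosen only for illustration.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10 percent)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` for `years` years."""
    return start_value * (1 + rate) ** years


# Figures from the slide: USD 819.5 million (2020) to USD 2,029.4 million (2027).
rate = cagr(819.5, 2029.4, 2027 - 2020)
print(f"Implied CAGR: {rate:.1%}")
print(f"Projected 2027 value: USD {project(819.5, rate, 7):,.1f} million")
```

The implied rate should come out close to the quoted 13.8 percent, which suggests the two market sizes and the growth rate on the slide are consistent with each other.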
Segment: Application
- Students: Check assignments and reports for plagiarism
- Teachers: Prevent students from plagiarising
- Content writers: Ensure content is plagiarism-free
Existing tech
- No reliable dedicated tool exists at the moment
- A couple of older efforts exist: one tried to detect plagiarism from NCTB books
Entry barriers
- Lack of plagiarism policy
- Extensive data is required
- Document similarity techniques are not new, but what are we going to compare a submitted document against?
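As an illustration of one long-established document similarity technique, the Python below compares overlapping word n-grams ("shingles") using Jaccard similarity. It is a minimal sketch: the function names and the tiny in-memory reference_corpus are hypothetical, and the hard part for Bangla remains the one raised above, namely building a large reference collection to compare against.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping word n-grams from whitespace-split text.

    Punctuation is not stripped here; a real system would normalize text first.
    """
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(submission: str, corpus: dict, n: int = 3):
    """Return (similarity, document name) for the closest reference document."""
    sub = shingles(submission, n)
    return max(
        ((jaccard(sub, shingles(doc, n)), name) for name, doc in corpus.items()),
        default=(0.0, None),
    )


# Hypothetical reference collection; in practice this is the missing piece.
reference_corpus = {
    "doc_1": "a quick brown fox jumps over a sleeping cat",
    "doc_2": "completely unrelated text about something else entirely",
}
submission = "the quick brown fox jumps over the lazy dog"

similarity, name = most_similar(submission, reference_corpus)
print(f"Closest match: {name} (Jaccard similarity {similarity:.2f})")
```

Whitespace splitting works on Bangla text as well as English, but the score is only meaningful if the reference collection actually contains the sources a writer might copy from.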
