eWika: Digitalization of Philippine Languages: Charibeth K. Cheng March 19, 2008


eWika: Digitalization of Philippine Languages

Translate

Isalin

Charibeth K. Cheng
March 19, 2008

Machine Translation
Automate translation
A study under Natural
Language Processing

Sentence in
SOURCE LANGUAGE

MT System

Sentence in
TARGET LANGUAGE

ENG-FIL MT System Project

3-year project
started last year
funded by DOST-PCASTRD
composition:
6 faculty members of College of
Computer Studies
15 computer science majors
assisted by the Filipino Dept and
Dept of English & Applied
Linguistics of DLSU-M

Agenda

Architecture of the MT System


Linguistic resources
Demo of the Translation Engine
Results for English to Filipino translation

Architectural Design of the Program


Source Text

User Interface

Target Text

MT: Example-based
MT: Rule-based
Translator Engine

Language Resources:
Lexicon (electronic dictionary),
Morphological Analyzer & Generator
Part-of-Speech tagger
Grammar,
Corpus (Tagged)

Output Modeller
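
The architecture above can be sketched as a translator engine that tries each MT approach in turn. This is only an illustration, not the project's implementation: the order of the engines, the exact-match example lookup, the word-for-word stand-in for the rule-based engine, and every name below (EXAMPLES, LEXICON, translate) are assumptions.

```python
from typing import Callable, Optional, Sequence

# Hypothetical example base: whole-sentence pairs taken from a parallel corpus.
EXAMPLES = {
    "the man bought an umbrella from the store.":
        "Bumili ang lalaki ng payong sa tindahan.",
}

# Hypothetical toy lexicon used by the stand-in rule-based engine.
LEXICON = {"man": "lalaki", "umbrella": "payong", "store": "tindahan"}


def example_based(sentence: str) -> Optional[str]:
    """Return a stored translation when the sentence matches an example exactly."""
    return EXAMPLES.get(sentence.strip().lower())


def rule_based(sentence: str) -> Optional[str]:
    """Stand-in for a rule-based engine: word-for-word lexicon lookup.

    A real engine would also use the grammar and the morphological
    analyzer and generator; unknown words are passed through unchanged.
    """
    words = sentence.strip().rstrip(".").lower().split()
    return " ".join(LEXICON.get(w, w) for w in words)


Engine = Callable[[str], Optional[str]]


def translate(sentence: str,
              engines: Sequence[Engine] = (example_based, rule_based)) -> str:
    """Try each engine in order and return the first translation produced."""
    for engine in engines:
        result = engine(sentence)
        if result is not None:
            return result
    return sentence
```

Calling translate("The man bought an umbrella from the store.") is answered by the example base, while an unseen input such as "umbrella store" falls through to the stand-in rule-based engine.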

Challenge!
Language resources
Quality of translation depends on them.
Built from almost non-existent digital forms
manual vs. automatic construction

Lexicon Builder
Used IsaWika! database as initial lexicon
Created a lexicon extraction program to
automatically determine candidate translation
pairs from corpora
Currently contains about 23,000 entries
Co-occurring words are likely translations of each other
Challenge: Lexical resources
parallel corpora
part-of-speech tagger
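
The co-occurrence idea can be sketched as follows. The slides do not say how the extraction program scores pairs, so the Dice coefficient, the threshold, the function name candidate_pairs, and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned English-Filipino pairs (not from the project corpus).
PARALLEL = [
    ("the man ate", "kumain ang lalaki"),
    ("the man slept", "natulog ang lalaki"),
    ("the woman ate", "kumain ang babae"),
]


def candidate_pairs(parallel, min_score=0.9):
    """Score word pairs by how often they appear in aligned sentences.

    Uses the Dice coefficient 2 * both / (src + tgt), where the counts are
    numbers of sentence pairs containing the word(s). This scoring choice
    is an assumption, not the project's documented method.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        pair_count.update(product(s_words, t_words))

    scored = []
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            scored.append((s, t, dice))
    # Highest score first, then alphabetical for a stable order.
    return sorted(scored, key=lambda item: (-item[2], item[0], item[1]))
```

On this toy corpus the pairs that always occur together (for example man/lalaki and ate/kumain) score 1.0. With so little data, a pair seen only once also scores 1.0, which is one reason larger parallel corpora are listed as a challenge.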

Database

Morphological Analyzer
Initially collected morphological rules from
grammar books
Developed an example-based morphological
phenomenon learner
learn from <inflected word, root-word>
example: <kumakain, kain>

Challenge: Lexical resources


lexicon
part-of-speech tagger
morphological rules
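
Learning from <inflected word, root-word> pairs can be illustrated with a small template learner. This is not the project's learner: it is a simplified sketch that assumes lowercase, consonant-initial roots and inflected forms that end with the unchanged root. The second training pair <sumusulat, sulat> and all function names are added here for illustration.

```python
from itertools import product

VOWELS = "aeiou"


def split_root(root):
    """Return (first consonant, first vowel) of a consonant-initial root."""
    first_vowel = next(ch for ch in root if ch in VOWELS)
    return root[0], first_vowel


def candidate_templates(inflected, root):
    """List every way to abstract the affix material in front of the root.

    Each affix character stays literal, or becomes the placeholder "C" or
    "V" when it equals the root's first consonant or first vowel.
    """
    if not inflected.endswith(root):
        return set()
    affix = inflected[: len(inflected) - len(root)]
    c, v = split_root(root)
    options = []
    for ch in affix:
        choices = {ch}
        if ch == c:
            choices.add("C")
        if ch == v:
            choices.add("V")
        options.append(sorted(choices))
    return set(product(*options))


def learn(examples):
    """Keep only the templates that are consistent with every example."""
    templates = None
    for inflected, root in examples:
        cands = candidate_templates(inflected, root)
        templates = cands if templates is None else templates & cands
    return templates or set()


def apply_template(template, root):
    """Generate an inflected form by filling the placeholders from a new root."""
    c, v = split_root(root)
    prefix = "".join(c if t == "C" else v if t == "V" else t for t in template)
    return prefix + root
```

From <kumakain, kain> alone, several templates remain possible; adding <sumusulat, sulat> leaves only ("C", "u", "m", "V"), which generates "bumibili" when applied to the root "bili".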

Generator

Part-Of-Speech Tagger
automatic association of parts-of-speech to
words in a document
existing Filipino tagger achieves < 80%
accuracy
Challenge: Lexical resources
tagged parallel corpora
lexicon
morphological analyzer
grammar
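
To make the accuracy figure concrete, here is a minimal sketch of a most-frequent-tag baseline and of how tagging accuracy is measured against a hand-tagged corpus. It is not the existing Filipino tagger mentioned above; the tag names (VERB, NOUN, MARKER), the default tag, and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentence.
TRAIN = [
    [("bumili", "VERB"), ("ang", "MARKER"), ("lalaki", "NOUN"),
     ("ng", "MARKER"), ("payong", "NOUN")],
]


def train(tagged_sentences):
    """Remember the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NOUN"):
    """Tag each word; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(model, gold_sentences):
    """Fraction of words whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        for (_, predicted), (_, gold) in zip(tag(model, words), sentence):
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0
```

A word missing from the training data, such as "kumain" here, falls back to the default tag and counts as an error, which shows why a small tagged corpus and lexicon limit accuracy.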

Grammar
Derived manually
Challenge: Free word order in sentence
formation.
The man bought an umbrella from the store.
Bumili ang lalaki ng payong sa tindahan.
Bumili sa tindahan ng payong ang lalaki.
Ang lalaki ay bumili ng payong sa tindahan.
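
The three Filipino sentences above contain the same phrases in different orders, which a short sketch can make explicit. It assumes, as a deliberate simplification, that each marker (ang, ng, sa) is followed by exactly one word and that "ay" carries no phrase of its own; the function name phrases is hypothetical.

```python
MARKERS = {"ang", "ng", "sa"}


def phrases(sentence):
    """Split a sentence into marker phrases plus the bare verb, ignoring order."""
    tokens = sentence.lower().rstrip(".").split()
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "ay":
            # Inversion marker: skipped in this simplified view.
            i += 1
        elif tok in MARKERS and i + 1 < len(tokens):
            # Marker plus the single word it introduces.
            result.append(f"{tok} {tokens[i + 1]}")
            i += 2
        else:
            # Anything else (here, the verb) stands alone.
            result.append(tok)
            i += 1
    return frozenset(result)


SENTENCES = [
    "Bumili ang lalaki ng payong sa tindahan.",
    "Bumili sa tindahan ng payong ang lalaki.",
    "Ang lalaki ay bumili ng payong sa tindahan.",
]
```

All three sentences reduce to the same set of phrases, so a grammar for translation has to map several surface orders to one underlying structure.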

Corpora
used by the lexicon extractor, part-of-speech tagger, and example-based MT
came from translation works of DLSU English
majors, verified by linguists
consists of 207,000 words, 5,000 of which are
tagged

Translation Rules
currently learned from the corpora
disadvantages
garbage-in-garbage-out
comprehensiveness

need for linguist-verified rules

Bringing it home
171 Philippine Languages (SIL)
No Philippine Corpora
Unfortunately, today, the Philippines has one of
the highest rates of dying languages (Solfed
Foundation Inc)
Without our language, we have no culture, we
have no identity, we are nothing. (Thorrson)

eWika: Digitalization of
Philippine Languages
Build the Philippine Corpus
Build software tools to study or
use the corpus
Across Languages
Across Regions
Across Forms and Genres
Across Land and Sea

Across Languages
171 Philippine Languages (SIL List)
Summer Institute of Linguistics
http://www.ethnologue.com/
Major languages
Near-extinction languages
How about the languages in-between?

Filipino Sign Language


The History of Sign Language in the
Philippines: Piecing Together the Puzzle (Abat
& Martinez, 9th Phil Linguistics Congress, 2006)
Deaf individuals: handicapped vs members of a
linguistic minority
Sign languages as true languages

Across Boundaries
Across Languages

Across Regions
Across Forms and Genres
Across Land and Sea

Across Regions
e-Wika: Connecting the Philippine Islands through Language
17 Regions: Ilocos Region (Region I),
Cagayan Valley (Region II), Central Luzon (Region III),
CALABARZON (Region IV-A), MIMAROPA (Region IV-B),
Bicol Region (Region V), Western Visayas (Region VI), Central
Visayas (Region VII), Eastern Visayas (Region VIII),
Zamboanga Peninsula (Region IX), Northern Mindanao (Region
X), Davao Region (Region XI), SOCCSKSARGEN (Region XII),
Caraga (Region XIII), Autonomous Region in Muslim Mindanao
(ARMM), Cordillera Administrative Region (CAR), National
Capital Region (NCR) (Metro Manila)

Across Boundaries
Across Time: historical, contemporary
Across Languages
Across Regions

Across Forms and Genres


Across Land and Sea

Across Forms and Genres


In various forms:
Text
Speech: speech to text system (ongoing
project)
Video: Filipino sign language
In various Genres: categories of entries in the
corpus
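
One way to picture a corpus that spans languages, regions, forms, genres, and time is a metadata record attached to each entry. The slides do not define a schema, so the field names, genre values, and file references below are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Form(Enum):
    TEXT = "text"
    SPEECH = "speech"
    VIDEO = "video"


@dataclass(frozen=True)
class CorpusEntry:
    language: str     # e.g. "Filipino", "Filipino Sign Language"
    region: str       # e.g. "NCR"
    form: Form        # text, speech, or video
    genre: str        # hypothetical category label
    period: str       # "historical" or "contemporary"
    content_ref: str  # hypothetical pointer to the stored file


def filter_entries(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        e for e in entries
        if all(getattr(e, field) == value for field, value in criteria.items())
    ]


# Hypothetical sample entries.
ENTRIES = [
    CorpusEntry("Filipino", "NCR", Form.TEXT, "news", "contemporary", "doc-001.txt"),
    CorpusEntry("Filipino Sign Language", "NCR", Form.VIDEO, "conversation",
                "contemporary", "clip-001.mp4"),
]
```

With such records, a study tool could select, for example, all video entries with filter_entries(ENTRIES, form=Form.VIDEO).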

Across Boundaries

Across Time: historical, contemporary


Across Languages
Across Regions
Across Forms and Genres

Across Land and Sea

Across Land and Sea


Web-based application: c/o Solomon See
(upload, download, tools)
Contributors (Main players)
Verifiers
Facilitators
Server: DLSU-M commits to host the server for
the next three years.
Terms of Use: Research purposes.

The dream of building Philippine language
resources and tools
Many many many major hurdles to overcome
Language Resources, Tools, & Peopleware:
Needed
