
Comparative analysis of actual language usage and selected grammar and orthographical rules for Filipino, Cebuano-Visayan and Ilokano: a Corpus-based Approach
Joel P. Ilao, Timothy Israel D. Santos, and Rowena Cristina L. Guevara, PhD
Digital Signal Processing Laboratory, Electrical and Electronics Engineering Institute
University of the Philippines – Diliman
Motivations for this study
• A requirement for successful Mother Tongue-Based Multilingual Education (MTBMLE):
  • a “writing system that will be acceptable to the majority of mother tongue speakers and to the government and will encourage members of the language communities to continue reading and writing in their language” (Malone, 2010; Nolasco, 2011)
• The high linguistic diversity of the Philippines means that not all of its languages are orthographically well developed.
• There is a need for a monitoring mechanism for refining grammatical and orthographic rules based on actual language usage.
Presentation Overview
• Philippine Linguistic Profile
• Developments in Philippine Orthography
• Spelling Variants and Areas of Variation
• Corpus-based Analysis of Actual Language Use
• Results and Analysis
• Conclusion and Recommendations for MTBMLE
Philippine Languages
• 171 living languages in 7,107 islands
• According to the 2000 National census:
• 10 major languages (speakers >= 1M)

Language           Number of Speakers
Tagalog            ~21.5 Million (28.15%)
Cebuano-Visayan    ~20 Million (26.14%)
Ilokano            ~7.7 Million (10.07%)
Hiligaynon         ~7 Million (9.15%)
Waray-waray        ~3.1 Million (4.05%)

C. Cheng, R. Roxas, and N. Lim, "Philippine Language Resources: Trends and Directions," Suntec, Singapore
http://en.wikipedia.org/wiki/Languages_of_the_Philippines
Developments in Tagalog Orthography
Year   Orthography Reform   Letters Included/Removed
1940   Abakada              a b k d e g h i l m n o p r s t u w y
1976   Tuntunin             ch f j ll ñ ng q rr v x z
1987   Patnubay             ch ll rr (removed)

• Lope K. Santos, "Balarila ng Wikang Pambansa". Manila: Surian ng Wikang Pambansa, 1940
• DECS Memorandum No. 94, s. 1976
• Surian ng Wikang Pambansa, "Alpabeto at Patnubay sa Ispeling ng Wikang Filipino". Surian ng Wikang Pambansa, 1987
• Komisyon sa Wikang Filipino, "Revisyon ng Alfabeto at Patnubay sa Ispeling ng Wikang Filipino". Komisyon sa Wikang Filipino, 2001
• UP Sentro ng Wikang Filipino, "Gabay sa Pagbaybay". UP - Sentro ng Wikang Filipino, 2004
• Zafra, G. et al., "Gabay sa Ispeling". Quezon City: Sentro ng Wikang Filipino - UP Diliman, 2008
Spelling Variants: Potential areas of confusion for the Philippine Languages
• Use of <o> vs <u> and <e> vs <i>
  • e.g. bilyonaryo / bilyunaryo, lalake / lalaki
• Symbolization of the off-glides (e.g. sya / siya, bwaya / buwaya)
• Representation of the juncture of <n> and <g> when they do not form <ng>
  • e.g. baranggay / barangay, singgalong / singalong
• Hyphenation (e.g. nagjo-jogging / nagjojogging)
• Hyphenation and glottal stop (e.g. ka-alaman / kaalaman)
• Contractions (e.g. bakit / ba’t, kaayo / kay)
• Consonant gemination (double consonants)
  • e.g. agkakarruba / agkakaruba

Roger Stone and Neri Zamora, "Designing an Alphabet for an Unwritten Language," in 1st MLE Conference,
“Reclaiming the Right to Learn in One's Own Language”, Capitol University, Cagayan de Oro, Feb 18-20, 2010.
Spelling Variants: Potential areas of confusion for the Philippine Languages (continued)
• Compound words (e.g. bahaykubo vs. bahay-kubo)
• Morphophonemics – the variation due to the collision of affixes
  • Assimilation – <d> vs <r> (e.g. nandito vs. narito)
  • Vowels that are dropped from words during affixation
    • e.g. maibibili / mabibili
• The use of the 8 new letters (ch f j ll ñ ng q rr v x z)
  • e.g. taxi vs. taksi
• Loan words (phonetic transliteration vs. foreign form)
  • e.g. girlfriend vs. gerlprend
• Interference/interaction from other local varieties
  • e.g. dingding / ringring (from Morong, Rizal)

Corpus-Based Analysis of a Language
[Flowchart: Start → Collection of Sentences and Phrases → Text Corpus of a given language → Spelling Variant Groups Extractor → Spelling Transformation Rules → Spelling Variant Counter → Spelling Variant Group Frequency Counts → End. A second input, Grammar and Orthography Books of a given language, supplies the Spelling and Grammar Rules.]
Corpus Collection
Corpus Miner Software
• Filipino/Tagalog
  • Bantay-Wika project
• Cebuano
  • Sun Star news website (www.sunstar.com.ph)
• Ilokano
  • Tawid News Magasin website (www.tawidnewsmag.com)
Lexicon Pruning
• Running lexicons (word lists) were built for each language considered.
• The raw lexicons contained many misspellings, non-standard terms (e.g. usernames such as astroboy212, Jejenese terms such as bk8), and numbers (e.g. dates, phone numbers).
• Solution: retain only entries made up solely of letters and the dash ‘-’ symbol.
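The pruning criterion can be sketched in a few lines of Python. This is only an illustration of the stated rule (letters and the dash only). The function names `is_clean_entry` and `prune_lexicon` are hypothetical, and the extra requirement that an entry contain at least one letter is an assumption added so that a bare dash is not kept.

```python
def is_clean_entry(word: str) -> bool:
    """Return True if the entry consists only of letters and '-' (assumed: at least one letter)."""
    has_letter = any(ch.isalpha() for ch in word)
    only_allowed = all(ch.isalpha() or ch == "-" for ch in word)
    return has_letter and only_allowed


def prune_lexicon(words):
    """Drop entries containing digits or other symbols; return the surviving unique entries, sorted."""
    return sorted({w for w in words if is_clean_entry(w)})


raw_lexicon = ["anu-ano", "astroboy212", "bk8", "2011-02-18", "kolehiyala", "-"]
pruned = prune_lexicon(raw_lexicon)
print(pruned)
```

Here the username, the Jejenese term, the date and the bare dash are discarded, while the hyphenated word anu-ano survives.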
Spelling Variant Extraction: Levenshtein Edit Operations*
Edit operation   Example
Substitution     anu-ano → ano-ano (tgl)
                 nagmaniho → nagmaneho (ceb)
                 kataltalonan → kataltalunan (ilk)
Deletion         hinantay → inantay (tgl)
                 makaaguwanta → makaagwanta (ceb)
                 pammutbuteng → pamutbuteng (ilk)
Insertion        kolehiala → kolehiyala (tgl)
                 nakakuhag → nakakuhaag (ceb)
                 nagdadakkel → nag-dadakkel (ilk)

Edit distance of N: N is the minimum number of edit operations needed to transform one string into another.
Example: a chain of N = 3 edit operations (substitution, deletion, insertion)
mag-asawa → nag-asawa → nagasawa → nagaasawa
(The minimum distance between the two endpoints is 2, because substituting <a> for the dash turns nag-asawa directly into nagaasawa.)
* Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.
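As an illustration of the edit distance described on this slide, the following is a minimal sketch of the standard dynamic-programming computation with unit cost for each substitution, deletion and insertion. The function name `levenshtein` is an assumption; the slides do not say how the distance was implemented.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character substitutions, deletions and insertions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("nagmaniho", "nagmaneho"))    # substitution example
print(levenshtein("hinantay", "inantay"))       # deletion example
print(levenshtein("kolehiala", "kolehiyala"))   # insertion example
```

Each of the three example pairs from the table is one edit apart. For mag-asawa and nagaasawa the function returns 2, since the two strings have the same length and differ in exactly two positions.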
Automatic extraction of Spelling Variants
• Candidate spelling variants were extracted from the cleaned lexicons
  • criterion: words with an edit distance of 2 or less
• The list of candidate spelling variants was manually analyzed to construct the Spelling Transformation rules
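The extraction criterion (edit distance of 2 or less) could be sketched as a brute-force comparison of all word pairs in a cleaned lexicon. This is an assumption about one possible implementation, not the authors' extractor: the names `levenshtein` and `candidate_variants` are hypothetical, and an all-pairs scan would be far too slow for a lexicon of several hundred thousand entries without some form of indexing.

```python
from itertools import combinations


def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance (substitution, deletion, insertion)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def candidate_variants(lexicon, max_distance=2):
    """Return sorted word pairs whose edit distance is at most max_distance."""
    words = sorted(set(lexicon))
    pairs = []
    for first, second in combinations(words, 2):
        # A length gap larger than max_distance already rules the pair out.
        if abs(len(first) - len(second)) > max_distance:
            continue
        if levenshtein(first, second) <= max_distance:
            pairs.append((first, second))
    return pairs


toy_lexicon = ["anu-ano", "ano-ano", "kuwento", "kwento", "bahay"]
candidates = candidate_variants(toy_lexicon)
print(candidates)
```

The candidate list still needs the manual analysis described above, because words that are close in spelling are not necessarily variants of the same word.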
Reference rule books and grammar sketches
Filipino / Tagalog
• Komisyon sa Wikang Filipino (2001). Alfabeto at Patnubay sa Ispeling ng Wikang Filipino. Manila.
• UP Sentro ng Wikang Filipino (2008). Gabay sa Ispeling. UP Diliman.

Cebuano
• Michael Tanangkingsing, A functional reference grammar of Cebuano: from a discourse perspective, Volume 1. LAP Lambert Academic Publishing, 2011.
• E.S. Godin (2007). Mga Batakan sa Panitik sa Binisaya-Sinugbuanon (Rules on Cebuano-Visayan Spelling). MSU-IIT, Iligan, Philippines.

Ilokano
• Noemi U. Rosal, Pagbasa at Pagsulat sa mga Wika ng Pilipinas (Ilokano). Agsursurotayo nga ag Ilokano. Quezon City: Sentro ng Wikang Filipino - Diliman Unibersidad ng Pilipinas, 2011.
• S.E. Benosa (2011). An Ilocano Orthography for MTB-MLE. Unpublished Term Project. UP – Diliman
Results and Analysis

Text Corpora description (all columns are total counts)

Language           Sentences   Unique sentences   Words        Unique words   Unique words after lexicon pruning
Filipino/Tagalog   667,671     406,904            18,883,496   376,348        301,171
Cebuano-Visayan    7,539       6,639              203,984      18,904         18,012
Ilokano            18,216      17,013             479,509      58,141         52,818
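The first four columns of the corpora table are simple counts. A minimal sketch of how such counts could be computed is shown below. The function name `corpus_statistics`, the whitespace tokenization and the lowercasing are all assumptions, since the slides do not describe the tokenizer used, and the three sentences are made up for illustration.

```python
def corpus_statistics(sentences):
    """Count sentences and words, in total and unique (assumed: lowercase, whitespace tokens)."""
    tokens = [word for sentence in sentences for word in sentence.lower().split()]
    return {
        "sentences": len(sentences),
        "unique_sentences": len(set(sentences)),
        "words": len(tokens),
        "unique_words": len(set(tokens)),
    }


toy_corpus = ["Maganda ang araw", "Maganda ang araw", "Mainit ang araw"]
stats = corpus_statistics(toy_corpus)
print(stats)
```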
Transformation rules
Rule #   Rule Name                              Example
1        <i> vs. <e>                            babai / babae (Tgl); agdingngeg / agdengngeg (Ilk); impliyado / empliyado / empleyado (Ceb)
2        <o> vs. <u>                            ano-ano / anu-ano (Tgl); gipaabot / gipaabut (Ceb); adayo / adayu (Ilk)
3        <y> vs. <i>                            amnesya / amnesia (Tgl); agbiag / agbyag (Ilk); madugay / madugai (Ceb)
4        <iya> vs. <ia>                         biblia / bibliya (Tgl); diyay / diay (Ilk); biyay / biay (Ceb)
5        <iye> vs. <ie>                         impiyerno / impierno (Tgl); giyem / giem (Ilk); gobiyerno / gobierno (Ceb)
6        <iyo> vs. <io>                         kolehiyo / kolehio (Tgl, Ilk); diyos / dios (Ceb)
7        <w> vs. <u>                            asawa / asaua (Tgl); agrubwat / agrubuat (Ilk); kawsa / kausa (Ceb)
8        <aw> vs. <ao>                          araw / arao (Tgl); daydayawenda / daydayaoenda (Ilk); mindanaw / mindanao (Ceb)
9        <uw> vs. <w>                           eskuwelahan / eskwelahan (Tgl); kaduwa / kadwa (Ilk); makaaguwanta / makaagwanta (Ceb)
10       <gg> vs. <g>                           ganggannaet / gangannaet (Ilk)
11       <mm> vs. <m>                           pammutbuteng / pamutbuteng (Ilk)
12       <nn> vs. <n>                           maikanniwas / maikaniwas (Ilk)
13       <rr> vs. <r>                           agkakarruba / agkakaruba (Ilk)
14       <c> vs. <k>                            balcon / balkon (Tgl); Ilocano / Ilokano (Ilk); caayo / kaayo (Ceb)
15       <c> vs. <s>                            princesa / prinsesa (Tgl); anuncio / anunsio (Ilk); cebuano / sebuano (Ceb)
16       <ch> vs. <ts>                          derecho / deretso (Tgl); lechon / letson (Ilk); niderecho / nideretso (Ceb)
17       <f> vs. <p>                            filosopo / pilosopo (Tgl); panagpiesta / panagfiesta (Ilk); filipino / pilipino (Ceb)
18       <j> vs. <h>                            jardin / hardin (Tgl, Ilk, Ceb)
19       <j> vs. <dy>                           janitor / dyanitor (Tgl); ijay / idyay (Ilk)
20       <ll> vs. <ly>                          martillo / martilyo (Tgl); kakaillan / kakailyan (Ilk); apellido / apelyido (Ceb)
21       <ñ> vs. <n>                            hañgarin / hangarin (Tgl); malacañang / malacanang (Ilk); osmeña / osmena (Ceb)
22       <ñ> vs. <ny>                           españa / espanya (Tgl); doña / donya (Ilk)
23       <qu> vs. <k>                           aquin / akin (Tgl); quinahanglan / kinahanglan (Ceb); daquel / dakel (Ilk)
24       <ks> vs. <x>                           taksi / taxi (Tgl, Ilk, Ceb)
25       <v> vs. <b>                            universidad / unibersidad (Tgl, Ilk); visayas / bisayas (Ceb)
26       <z> vs. <s>                            luzon / luson (Tgl); vizcaya / viscaya (Ilk); legazpi / legaspi (Ceb)
27       DASH vs. NODASH                        bahay-kubo / bahaykubo (Tgl); agan-nad / agannad (Ilk); gi-aksyonan / giaksyonan (Ceb)
28       IPINAGREDUP vs. IPINAGNODUP            pinag-aagawan / pinapag-agawan (Tgl); pinagrigrigat / pinapagrigat (Ilk)
29       IPINAREDUP vs. IPINANODUP              ipinamamalas / pinapamalas (Tgl)
30       MNAI vs. MNA                           maibibili / mabibili (Tgl); mailaklako / malaklako (Ilk)
31       MNAKAPAGREDUP vs. MNAKAPAGNODUP        makapag-iisa / makakapag-isa (Tgl)
32       MNAKAREDUP vs. MNAKANODUP              makasasama / makakasama (Tgl)
33       Presence or absence of -<g>-           amnagpintas / amnapintas (Ilk)
34       <o> vs. <ao>                           morag / maorag (Ceb)
35       Presence or absence of infix -<i>-     pannakaisurat / pannakasurat (Ilk)
36       Presence or absence of prefix <m>-     mag-ilokano / ag-ilokano (Ilk)
37       Presence or absence of -<r>            gikompirmar / gikompirma (Ceb)
38       Presence or absence of prefix <n>-     nagpa-tagaytay / agpa-tagaytay (Ilk)
39       -<m>- vs. -<n>-                        napimpintas / napinpintas (Ilk)
40       Presence or absence of -<h>-           ospital / hospital (Tgl); gipasakahan / gipasakaan (Ceb)
41       <ng> vs. <n>                           kangi-kangina / kani-kanina (Tgl)
42       <ng> vs. <g>                           panangbalbaliw / panagbalbaliw (Ilk)
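Many of the rules above are plain grapheme substitutions, so one way to picture how a rule is matched against a lexicon is as a single substring replacement. The sketch below is an illustration under that assumption only: the function name `variant_pairs` is hypothetical, and it does not cover the morphological rules (e.g. the REDUP / NODUP rules), which need more than a string replacement.

```python
def variant_pairs(lexicon, form_a, form_b):
    """Find (word_a, word_b) pairs where replacing one occurrence of form_a
    in word_a with form_b yields another word of the lexicon."""
    words = set(lexicon)
    pairs = set()
    for word in words:
        start = word.find(form_a)
        while start != -1:
            candidate = word[:start] + form_b + word[start + len(form_a):]
            if candidate != word and candidate in words:
                pairs.add((word, candidate))
            start = word.find(form_a, start + 1)
    return sorted(pairs)


toy_lexicon = ["kuwento", "kwento", "eskuwela", "eskwela", "tuwalya", "araw"]
uw_vs_w = variant_pairs(toy_lexicon, "uw", "w")
print(uw_vs_w)
```

In this toy lexicon tuwalya is not reported, because its counterpart twalya is absent.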
Results and Analysis
• Transformation rules in each language, ranked according to the number of distinct case pairs observed

Rank   Filipino/Tagalog               Cebuano                        Ilokano
1      DASH vs. NODASH                <o> vs. <u>                    <o> vs. <u>
2      <o> vs. <u>                    Presence or absence of -<g>-   Presence or absence of infix -<i>-
3      Presence or absence of -<h>-   DASH vs. NODASH                Presence or absence of -<g>-
4      <i> vs. <e>                    <ng> vs. <n>                   DASH vs. NODASH
5      <y> vs. <i>                    <i> vs. <e>                    <i> vs. <e>
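A ranking of this kind amounts to counting the distinct variant pairs found for each rule and sorting the rules by that count. The following sketch assumes the pairs have already been grouped by rule. The function name `rank_rules` is hypothetical, and the toy input reuses a few example pairs from the rule tables, so the resulting counts are illustrative and are not the study's figures.

```python
def rank_rules(pairs_by_rule):
    """Rank rules by number of distinct case pairs, highest first (ties broken by rule name)."""
    counts = {rule: len(set(pairs)) for rule, pairs in pairs_by_rule.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


toy_pairs = {
    "<o> vs. <u>": [("ano-ano", "anu-ano"), ("gipaabot", "gipaabut"), ("ano-ano", "anu-ano")],
    "DASH vs. NODASH": [("bahay-kubo", "bahaykubo")],
    "<uw> vs. <w>": [("eskuwelahan", "eskwelahan"), ("kaduwa", "kadwa"), ("makaaguwanta", "makaagwanta")],
}
ranking = rank_rules(toy_pairs)
print(ranking)
```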
Cases covered by rule books
(occurrence counts in parentheses)

Rule (rulebook-suggested variant): <o> vs <u> in reduplication: retain the <o> for the repeated stem (variant 1)
Variant 1             Variant 2              Agreement with Rulebook
ano-ano (222)         anu-ano (1254)         X
sino-sino (94)        sinu-sino (298)        X
halo-halo (46)        halu-halo (19)         √
salo-salo (24)        salu-salo (112)        X

Rule (rulebook-suggested variant): <o> vs <u>: the conjugated form of a stem originally ending with <o> is transformed to <u> (variant 2)
Variant 1             Variant 2              Agreement with Rulebook
gulohin (0)           guluhin (51)           √
Tugtogin (0)          Tugtugin (52)          √

Rule (rulebook-suggested variant): <uw> vs <w>: prefer the use of <uw> (variant 1) for academic and professional use
Variant 1             Variant 2              Agreement with Rulebook
kuwento (1949)        kwento (850)           √
eskuwela (56)         eskwela (99)           X
tuwalya (79)          twalya (6)             √
kuwalipikasyon (9)    kwalipikasyon (23)     X
Cases covered by rule books

Rule (rulebook-suggested variant): <o> vs <u>: dictionary forms usually indicate the use of <u> (variant 2)
Variant 1         Variant 2         Agreement with Rulebook
Kalihokan (7)     Kalihukan (37)    √
Kalibotan (2)     Kalibutan (20)    √
Kasaulogan (13)   Kasaulugan (1)    X
Lihok (5)         Lihuk (0)         X
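The agreement column in these tables is consistent with a simple reading: usage agrees with the rulebook when the rulebook-suggested variant is the more frequent one in the corpus. The sketch below encodes that reading. It is an assumption inferred from the rows shown, not a stated definition, and the function name `agrees_with_rulebook` is hypothetical.

```python
def agrees_with_rulebook(count_variant_1, count_variant_2, suggested_variant):
    """True if the rulebook-suggested variant (1 or 2) is strictly more frequent in the corpus."""
    if suggested_variant not in (1, 2):
        raise ValueError("suggested_variant must be 1 or 2")
    if suggested_variant == 1:
        return count_variant_1 > count_variant_2
    return count_variant_2 > count_variant_1


# Rows taken from the tables: (variant 1 count, variant 2 count, suggested variant)
print(agrees_with_rulebook(222, 1254, 1))   # ano-ano / anu-ano
print(agrees_with_rulebook(46, 19, 1))      # halo-halo / halu-halo
print(agrees_with_rulebook(0, 51, 2))       # gulohin / guluhin
print(agrees_with_rulebook(13, 1, 2))       # Kasaulogan / Kasaulugan
```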
Variation Index

\[
\mathrm{V.I.} = \left(1 - \frac{\text{absolute difference between total occurrence counts of spelling variants}}{\text{total occurrence counts of both spelling variants}}\right) \times 100\%
\]

Rank   Fil./Tagalog     Ilokano                        Cebuano
1      <ch> vs. <ts>    Presence or absence of -<g>-   <uw> vs. <w>
2      <uw> vs. <w>     <ll> vs. <ly>                  <v> vs. <b>
3      <y> vs. <i>      <ng> vs. <n>                   <aw> vs. <ao>
4      <ll> vs. <ly>    Presence or absence of -<m>-   <ch> vs. <ts>
5      <ng> vs. <n>     <ng> vs. <g>                   <y> vs. <i>
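For a single pair of variants, the Variation Index can be worked out directly from the two occurrence counts. The sketch below reads the formula as (1 − |c1 − c2| / (c1 + c2)) × 100%, which is an interpretation of the slide. The function name `variation_index` is hypothetical, and raising an error when both counts are zero is an added assumption.

```python
def variation_index(count_1: int, count_2: int) -> float:
    """Variation Index in percent: 100 when both variants are equally frequent,
    0 when only one variant is ever used."""
    total = count_1 + count_2
    if total == 0:
        raise ValueError("at least one variant must occur in the corpus")
    return (1 - abs(count_1 - count_2) / total) * 100


# ano-ano (222) vs. anu-ano (1254), counts taken from the rule book comparison
print(round(variation_index(222, 1254), 2))
# gulohin (0) vs. guluhin (51): no variation in actual usage
print(variation_index(0, 51))
```

A high value marks a pair where writers are split between the two spellings, and a value near zero marks a pair where usage has settled on one form.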
Conclusion
• Corpus-based analysis of written work makes it possible to identify areas of language usage that show high levels of variation.
• This information can be used to create better dictionaries and language reference books.
• Improvements to the study
  • Add date stamps to the corpora being studied
  • Experiment with different Levenshtein edit distances and weights for edit operations to detect more kinds of spelling variants
  • Use other technologies / methodologies (e.g. part-of-speech taggers and morphological analyzers)
ISIP Project
• Produce high-impact research on Philippine Speech and Language Technology
• ISIP Project 6: “Philippine Languages Database for Mother Tongue-Based Multi-lingual Education and Applications”
• Activities
  • Creation of the Philippine languages database
    • Tagalog, Cebuano, Ilokano, Hiligaynon or Ilonggo, Waray-Waray, Kapampangan, Chavacano (Spanish Creole), Northern Bicolano, Pangasinense, and a code-mixed language of Filipino and English
Acknowledgments
Department of Science and Technology (DOST)
Engineering Research and Development for Technology (ERDT)
