Research Proposal - An Approach To Machine Translation of Bantu Languages Using Cinyanja and Ciyawo

Research Proposal · May 2019

DOI: 10.13140/RG.2.2.11044.32641
1 author:
Zangaphee Chris Chimombo

University of Malawi
All content following this page was uploaded by Zangaphee Chris Chimombo on 17 May 2019.

Research Proposal
for
PhD in Applied Linguistics (Computational Linguistics),
Chancellor College, University of Malawi:
An Approach to Machine Translation of Bantu Languages
using ciNyanja and ciYawo
Zangaphee Christopher Chimombo
May 2019
1 Introduction
Technology exists today that enables a Malawian to travel to neighbouring Mozambique and
“converse” with a local resident, without knowing their language(s), using a smartphone and
specialised software. The languages would likely be English and Portuguese, of course, and not
ciNyanja and ciNyungwe, for instance. Yet the technology that enables the former language pair can
be readily extended to minority languages such as the latter pair, especially if the software is
available under an “open source” licence. The advances in computational linguistics in the major
languages of the world can thus be applied locally.
An approach for one Bantu language pair is likely to be readily modifiable (bootstrappable) for other
Bantu languages. This is due not only to commonalities amongst Bantu languages but also to ongoing
attempts to converge Bantu languages or at least stem their divergence (Chimombo, 2010). This
bootstrapping approach has been demonstrated – though not for a Bantu language pair – by
attempting to extend the Matxin open source deep transfer system from the Spanish-Basque pair to
include English-Basque (Mayor & Tyers, 2009).
CiNyanja (ISO-639-3 “nya”) is spoken in southeast Africa. Its Guthrie code is N.31.¹ CiYawo (ISO-639-3
“yao”) is also spoken in southeast Africa. Its Guthrie code is P.21.¹ The regional trend among
forward-looking institutions such as the Centre for the Advanced Study of African Societies (CASAS)
has been toward convergence and not further divergence; therefore, the umbrella term “ciNyanja” will
be used instead of “Chichewa” henceforth in this study unless the context demands otherwise.
Standard orthographies of the two languages will be used (Banda, et al., 2008; Center for
Language Studies, 2005), including unpublished amendments in an October 2010 memo from the
Director of the Centre for Language Studies.²
According to the 1998 population and housing census (National Statistical Office of Malawi, 2000),
70% of Malawians are first-language speakers of ciNyanja (as defined herein) and the next largest
ethnic group by first language (defined as the language spoken in the home) is that of ciYawo
speakers at 10.1%. Thus our study potentially caters for over 80% of Malawians. The 2008 census did
not include similar data, so reliable, more recent figures are not available. Once the 2018 census
reports are out, these figures may be updated.
The two languages are known to be sufficiently different to make this study a non-trivial one.
2 Literature Review and Theoretical Framework
Reviewing the literature reveals differences in approaches to Natural Language Processing (NLP)
between languages of the “developed” world and those, like Bantu languages, of the rest of the world.
¹ http://www.ethnologue.com, accessed December 2018.
² http://www.ciyawo.org, accessed December 2018.
The first difference is in the use of corpora. Keet and Khumalo (2016) indicate that for Bantu languages
there is a preference in the literature for knowledge-driven, bottom-up building of grammars (as opposed
to a corpus-driven, data-oriented learning approach), probably due to the scarcity of adequate and
sizable language corpora for Bantu languages.³
The knowledge-driven approach may yield more precise results but is restricted to those languages
with which the researchers are familiar whereas corpus-based approaches may be blindly applied to
many languages, depending on availability of corpora.
The second difference is in the use of statistical methods. One rare group of Bantuists to have
attempted this approach is Archangeli, Mielke, & Pulleyblank (2012), who try to demonstrate the
feasibility of building a grammar from the ground up using corpora from six languages including
ciNyanja and ciYawo. They test their hypothesis on the restricted phenomenon of Bantu vowel height
harmony and report moderate success.
This paucity of literature on statistical computational Bantu linguistics may be due in part to the
several known limitations summarised by (Vandeghinste, et al., 2013) who have the following to say
about statistical machine translation (MT).
It lacks a mechanism to deal with long-distance dependencies, it has no means to generalise
over non-overt linguistic information and it has limited word reordering capabilities.
Furthermore, in some cases the output quality may lack appropriate fluency and
grammaticality to be acceptable for actual MT users. Sometimes essential words are missing
from the translation.
It is thus apparent that only a handful of groups of researchers have tackled the issue of Bantu NLP.
One of the prominent languages to begin the search with is kiSwahili and indeed the work of (De Pauw,
Wagacha, & de Schryver, 2011) seems directly relevant. This is a report on a statistical machine
translation approach for English and kiSwahili using an open source platform called MOSES. NLP work
on kiSwahili dates as far back as the early 1990s.
Hurskainen has capped decades of research with an overview article on ostensibly statistical MT
systems – Google Translate and Microsoft Translate – in which he suggests “the more fundamental reason
may be that statistical translation systems do not give satisfactory results” and discusses a ten-step
rule-based MT (RBMT) process (Hurskainen, 2012). He presents an online computational linguistics
environment for developing Swahili and English applications that includes a mature MT system
between that language pair.⁴ A later article describes RBMT from English to Finnish but mentions four
similar approaches, including one using Apertium (Hurskainen & Tiedemann, 2017). These would need
to be reviewed before the study proceeds on a similar path.
It is also to be expected that the Republic of South Africa (RSA), with its eleven official languages (of
which nine are Bantu), would be at the forefront of Bantu linguistics. Indeed that is the case, owing to
state commitment (e.g. concurrent translation of parliamentary sessions) coupled with state
resources that are amongst the highest on the continent. Keet and Khumalo (2016) have already been
mentioned; others include (Bosch, Jones, Pretorius, & Anderson, 2007).
Some relevant work has also been done on natural language processing at the Computer Science
department at Makerere University in Uganda. Katushemererwe and Hanneforth (2010) do not,
however, cover the complete process of machine translation, just the initial (and final) morphological
analysis (and morphological generation) stages. Prominent computational linguist Lauri Karttunen has
also reportedly done some work on Lingala NLP. The Katushemererwe and Hanneforth paper also cites
several similar works on other Bantu languages: kiSwahili, Zulu, Ekegusii, Kinyarwanda and Setswana.
³ Other researchers use the term “rule-based” (Hurskainen & Tiedemann, 2017).
⁴ www.njas.helsinki.fi/salama, accessed March 2019.
The idea that an approach that involves one Bantu language is likely to make it “easier” for another
Bantu language has been embraced using an approach from machine learning called “bootstrapping”
where a human trainer corrects a machine learning analysis in cycles until an acceptable level of
correctness is achieved (Byamugisha, Keet, & DeRenzi, 2016). The researchers replace the machine
learning with a manual analysis and report success with applying an isiZulu controlled natural language
(CNL) to Runyankore.
An interesting very recent contribution to the field of Bantu historical linguistics comes from outsiders
to the field (Whiteley, Xue, & Wheeler, 2018). Techniques that have successfully been used in DNA
sequencing are applied to phonological patterns. The authors declare the superiority of their method
over “untestable authority statements” and overturn hypotheses such as the early split between east
and west Bantu, but it may be the case that this approach suffers similar drawbacks to those of the
statistical methods, e.g. the inability to take into account long-distance dependencies. Alas, such an
approach is beyond the scope of this current study.
Any systematic approach to the study of Bantu morphosyntactic features is likely to be of import to
this study. Thus, Maho’s generalised verbal template or Pan-Bantu Slot System (PBSS) provides a view
of Bantu verbal morphology that no decent Bantu MT can ignore (Maho, 2007). Similarly, a
parameterised approach for comparative Bantu such as that proposed by (Marten, Kula, & Thwala,
2007) is also relevant. Although this study is restricted to just two languages, an awareness of as
much as possible of the variation in other Bantu languages is useful, not just because the intent of
the study is to forge the way for more language pairs but also because it will inform the study on
what is generic to Bantu and what is specific to ciNyanja and ciYawo. It is possible to envisage
that the PBSS and parameters will be fully defined for all Bantu languages in the future.
2.1 Bantu Language Corpora
A recent review of Bantu language corpora is available (Khumalo, 2015). In it, prominent South African
linguist Langa Khumalo documents fifteen Bantu languages that have developed corpora ranging in
size from a few hundred thousand lexical items to over 20 million. An ambitious roadmap for further
developing isiZulu corpora is also presented. CiNyanja and ciYawo do not feature in this review. The
corpora available for these two languages that are the focus of the present study are few in number
and small in size.
The most active ciNyanja lexicographer is currently Dr Steven Paas. His dictionary is now in its fifth
edition and this latest edition is published by Oxford University Press. It is also available for online
querying.⁵ It is not known whether the 45,000 individual entries are available for bulk querying.
The Centre for Language Studies also has an online word querying facility available via its website but
this has not been updated for some time.
The first monolingual dictionary in ciNyanja is Mtanthauzira mawu wa chiNyanja (Centre for Language
Studies, 2000), with over 15,000 words and their translations with examples. The first monolingual
dictionary in ciYawo, also from the CLS, is Mgopolela maloŵe jwa ciYawo (Centre for Language Studies,
2013), with 5,715 words. These represent invaluable corpora, especially if available electronically with
words and example sentences aligned (without making use of a third language, English).
CiYawo has benefitted from a dictionary since the early 1950s, when George Meredith
Sanderson published his Dictionary of the Yao Language (Sanderson, 1954). He had earlier, in 1922,
published A Yao Grammar in the wake of much earlier works by Alexander Hetherwick and, even
earlier, Edward Steere. More recently, Ian Dicks has edited a 3,000-word English to Yawo (and
CiNgelesi – ciYawo) dictionary (Dicks, 2018).⁶ A Yawo interest website is also maintained.⁷ Previously
Dicks had authored a ciYawo language guide with Shawn Dollar that will be useful for this study (Dicks
& Dollar, 2010). These latter works by Dicks follow the standard orthography.
⁵ https://www.chichewadictionary.org, accessed December 2018.
The Swadesh comparative wordlist was originally developed for lexicostatistical purposes, but it is
applicable in this study due to the widely available online lists in several languages including Bantu
ones, in particular ciNyanja and ciYawo for our purposes. Although a mere 207 words, it will allow us
to craft sentences using just this lexicon in order to test and evaluate a rudimentary MT system.
The Comparative Bantu Online Dictionary (CBOLD) – hosted by the University of California at Berkeley
– was an ambitious attempt to develop an extensive standardised “collaborative, accessible database
for the use of researchers in Bantu languages”.⁸ Although the website does not seem to have been
updated for a very long time, language data is still available for download. The language data for
ciNyanja and ciYawo were provided by Al Mtenje and Armindo Ngunga respectively. The ciNyanja
corpus has around 6,200 words and the ciYawo one around 7,400. Each lexical item has a number of
records associated with it, including part-of-speech (PoS), tone characteristics and a gloss of
equivalent English words.
In 1999 the Universal Declaration of Human Rights (UDHR) broke a world record for being the most
translated document in the world. Today there are 515 different translations and counting.⁹ The
document, with 30 articles, is short, having only 1500 words in 200 sentences. However, ciNyanja and
ciYawo translations are available and such parallel texts are invaluable for tuning, testing and/or
evaluation purposes. Likewise, the Sustainable Development Goals have recently been translated into
both languages.
There has been a drive at the University of Ghent,¹⁰ Belgium, to turn BantUGent into a centre of
excellence for Bantu language studies. A recent output from there has been a set of illustrative articles
which describe the building from scratch of a corpus for an interlacustrine Bantu language – Lusoga –
primarily with lexicographical purposes in mind (de Schryver & Nabirye, 2018a, 2018b, 2018c). Thus,
it can be concluded that the situation for Bantu corpus linguistics is only going to improve, albeit at
different paces depending on the language.
Unsurprisingly, many language corpora begin with or are dominated by religious texts. The Bible, with
over 700,000 words, and the Koran, with slightly more than one-tenth of that, have been translated into
many languages, including ciNyanja and ciYawo.
One of the earliest non-religious Bantu parallel texts could well be Ntanu za Esopo (F.A.R., 1895). An
inscription by a “F.A.R.” reveals that it is a translation into ciNyanja from kiSwahili. It consists of 62
short tales from the classic Aesop’s Fables. This may represent a curious (if only in passing) parallel
Bantu corpus.
2.1.1 On the shortfalls of existing tagsets and the Swadesh list
There is a lack of standard terminology for morphosyntactic categories. Thus, three different authors
can and do use subject marker (SM), subject pronoun (SP) and subject concord (SC) to denote the
same morpheme. It is, therefore, valid to propose a standardised morpheme tagset (a list of
parts-of-speech) for both Bantu-specific lexical categories and Bantu morphosyntactic categories (Rose,
Beaudoin-Lietz, & Nurse, 2002). In our case, however, the traditional understanding of parts-of-speech
of words is being extended to encompass other lexical items, i.e. morphemes. Works such as (Hellan,
In preparation) have already attempted to standardise terminology for verbal morphosyntactic
categories as part of the ambitious TypeCraft project, which aims to provide a framework for analysis
of grammar constructions for comparative purposes.
⁶ The Oxford University Press is acknowledged by the dictionary editors for its use, under licence, of the 3,000
most common English words with examples. These examples can form an important corpus especially if their
ciNyanja equivalents are found. It will be hard to ignore English as an interlingua for many years to come.
⁷ http://www.ciyawo.org, accessed December 2018.
⁸ http://www.cbold.ish-lyon.cnrs.fr, accessed December 2018.
⁹ https://www.ohchr.org, accessed January 2019.
¹⁰ Recently it was host to an international conference on reconstructing proto-Bantu grammar as a 50-year
commemoration of Meeussen’s 1967 publication.
One starting point can be the 12-tag “universal” tagset (Bird, Klein, & Loper, 2009) which can be
extended into a Bantu-specific tagset. This would be informed by more established norms such as the
45-tag Penn Treebank Part-of-Speech Tagset, which is the most widely used English language tagset
from the University of Pennsylvania (Marcus, Santorini, & Marcinkiewicz, 1993).
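As a rough illustration of how such an extension might be organised, the Python sketch below lists the twelve universal tags as given in Bird, Klein, & Loper (2009) and adds the morpheme-level labels used in the glossed examples later in this section (NEG, SM, TAM, OM, STEM, EXT, FV). The names UNIVERSAL_TAGS, BANTU_MORPHEME_TAGS and is_valid_tag, and the decision to treat morpheme labels as a simple extension of the word-level set, are assumptions made for illustration only; this is not an established Bantu tagset.

```python
# The twelve "universal" part-of-speech tags listed in Bird, Klein & Loper (2009).
UNIVERSAL_TAGS = {
    "ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
    "NUM", "PRT", "PRON", "VERB", ".", "X",
}

# Morpheme-level labels taken from the glossed examples in this proposal.
# Merging them into one tagset is an illustrative assumption, not a standard.
BANTU_MORPHEME_TAGS = {
    "NEG": "negative marker",
    "SM": "subject marker",
    "TAM": "tense-aspect-mood marker",
    "OM": "object marker",
    "STEM": "verb stem",
    "EXT": "verbal extension",
    "FV": "final vowel",
}

EXTENDED_TAGSET = UNIVERSAL_TAGS | set(BANTU_MORPHEME_TAGS)


def is_valid_tag(tag: str) -> bool:
    """Return True if the tag belongs to the extended (hypothetical) tagset."""
    return tag in EXTENDED_TAGSET


print(len(UNIVERSAL_TAGS), len(EXTENDED_TAGSET))
print(is_valid_tag("SM"), is_valid_tag("SC"))
```

A check of this kind would reject the competing labels SP and SC until a single convention for the subject morpheme has been agreed.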
The Swadesh list is a general tool. A more Bantu-specific tool is possible, e.g. a list that includes such
nouns as “drum”, “mortar” and “pestle.” Indeed, a pan-African wordlist is available from SIL
International (Snider & Roberts, 2006). In addition to the actual lexicon, the handful of pronouns in
the Swadesh list is not adequate for capturing the full richness of Bantu pronominal microvariation.
Bantu adjectives, too, can be completely included in such a list since pure adjectives in Bantu number
fewer than twenty (Miti, 2006).
A negative morpheme (NEG) is a valid morphosyntactic category in Bantu languages:
(1) CiCewa:    si    – ndi   – dza   – mu    – lil   – il    – a
               NEG     SM      TAM     OM      STEM    EXT     FV
               not     I       FUT     him     cry     for
               sindidzamulilila. (Miti, 2006)
(2) CiYawo:    nga   – n     – ku    – pikan – a
               NEG     SM      TAM     STEM    FV
               not     I       PRES    hear
               ngangupikana. (Dicks & Dollar, 2010)
(3) KiSwahili: ha    – wa    – ku    – som   – a
               NEG     SM      TAM     STEM    FV
               not     they    PAST    read
               hawakusoma. (Perrott, 1950)
However, a single “not, ADV” list item, as is the case with the Swadesh list, does not suffice. An
additional list item (a morphosyntactic item, not just the word “not”) is required specifically for Bantu
languages in order to distinguish such phrases as the following two:
(4) Sindine mwana. I am not a child.
(5) Ndine bambo, osati mwana. I am a man, not a child.
An additional level of complexity is added when it is taken into account that the negation marker can
occupy multiple slots, e.g. in ciNyanja there are two slots depending on whether the mood is
imperative or not.
(6) Sindinapite. I did not go.
(7) Usapite. Do not go.
A single tag – NEG – would not distinguish the “not” in example (5) – the word osati – from the
morphemes si- and -sa- in examples (6) and (7). Note that there is a phonological effect in the
imperative example (7): si- + -a- → -sa-. Otherwise the negative morpheme is identical for the two
moods; what differs is just the position.
Maho (2007) identifies three possible slots for the negative morphemes amongst the medial markers
alone. Thus, more than two slots are required. The situation is further complicated when such
languages as kiSwahili are taken into account, where the negative morpheme depends on person (si-,
hu- and ha-) (Perrott, 1950).
In conclusion, amongst other changes, such a Bantu-specific list could maintain the existing Swadesh
“not, ADV” item but would require new list items for the negative marker morpheme. Each such item
would also require a “morphosyntactic tag” – not just a PoS tag – perhaps NEGXy, where X is a
capital letter I, M, or F (initial, medial or final) and y is a number denoting the exact position of the
morpheme within the slot system. Such “morphotags” have not yet been taken into account, certainly
not in the popular tagsets mentioned above.
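The NEGXy proposal can be made concrete with a small data structure. The Python sketch below is one possible encoding, not a settled design: the class name MorphoTag, the helper parse_morphotag, and above all the zone letters and slot numbers assigned to si- and -sa- are placeholders chosen for illustration and do not represent an analysis of either language.

```python
import re
from dataclasses import dataclass

# X in NEGXy: I (initial), M (medial) or F (final), as proposed in the text.
ZONES = {"I": "initial", "M": "medial", "F": "final"}
_TAG_RE = re.compile(r"([A-Z]+)([IMF])(\d+)")


@dataclass(frozen=True)
class MorphoTag:
    category: str  # morphosyntactic category, e.g. "NEG"
    zone: str      # "I", "M" or "F"
    position: int  # slot number within the zone (numbering is hypothetical)

    def __post_init__(self):
        if self.zone not in ZONES:
            raise ValueError(f"unknown zone: {self.zone!r}")
        if self.position < 1:
            raise ValueError("slot positions are counted from 1")

    def __str__(self) -> str:
        return f"{self.category}{self.zone}{self.position}"


def parse_morphotag(text: str) -> MorphoTag:
    """Parse a string such as 'NEGM1' back into its three components."""
    match = _TAG_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a morphotag: {text!r}")
    return MorphoTag(match.group(1), match.group(2), int(match.group(3)))


# Placeholder assignments for the negatives in examples (6) and (7).
NEG_BEFORE_SUBJECT = MorphoTag("NEG", "I", 1)  # si- in "Sindinapite"
NEG_AFTER_SUBJECT = MorphoTag("NEG", "M", 1)   # -sa- in "Usapite"

print(NEG_BEFORE_SUBJECT, NEG_AFTER_SUBJECT)
```

Under this encoding the free word osati in example (5) would keep the plain “not, ADV” entry, while the two bound negatives receive distinct morphotags.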
2.2 Theoretical Framework
Three trends can be observed to have shaped translation theory. The first, an application-neutral,
theoretical approach, was laid down by (Catford, 1965), whose essay is quite theoretical, philosophical
even, but applies to human translation, not machine translation. However, the problems that human
translators encounter are surely multiplied and compounded when a machine is left to carry out the
same task. Additional, machine-specific problems also exist. Nevertheless, it is recognised that MT
cannot achieve 100% perfection, so the theory provides a backdrop with which to frame MT.
The second trend, mostly American-driven but with significant European input, was driven both by
the advent of the computer and, with it, the promise of MT, and by the geopolitical realities of
the Cold War. The starting point is how to represent a grammar. According to (Jurafsky & Martin, 2017),
Chomsky (Chomsky, 1956) and Backus (Backus, 1959) were the first to independently formalise the
idea in the late 1950s. “Context free grammars are also called Phrase-Structure Grammars, and the
formalism is equivalent to Backus-Naur Form, or BNF”. There are numerous approaches to
representing context free grammars (CFGs), e.g. combinatory categorial grammar as proposed by
Steedman towards the end of the twentieth century (Steedman & Baldridge, 2011). Others, like
lexical-functional grammar, have been used by some Bantu linguists (Mchombo, 2003).
The third trend, of most significance to this study, is due to practical demands such as those of the
European Union. The pragmatic approach of “good enough” MT has thus been adopted.
A fourth emerging trend is likely to shape the future of MT due to the need for humans to
communicate with their smart devices. This is out of scope of the present study.
2.3 State of the Art
MT software has matured since the seminal theoretical works by Warren Weaver and Claude Shannon
in the mid-20th century (Hutchins, 1998), although the dream of a computer passing the Turing test
remains elusive still. The simplest MT strategy is direct lexical transfer where “each word of the source
language […] is directly linked to a corresponding unit in the target language…” according to
(Craciunescu, Gerding-Salas, & Stringer O’Keeffe, 2004). Our objective is not to develop Bantu MT
from scratch but to embrace and extend a viable, pre-existing open source software package.
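To make the direct lexical transfer strategy concrete, the following Python sketch substitutes each source word with its lexicon entry. It is a toy under stated assumptions: the function name direct_transfer is hypothetical, the source words are taken from example (5) in Section 2.1.1, and the target strings such as <yao:child> are placeholders standing in for ciYawo equivalents that this proposal does not supply. Marking unknown words with an asterisk is simply a choice made for this sketch.

```python
def direct_transfer(sentence: str, lexicon: dict[str, str]) -> str:
    """Word-for-word substitution; unknown words are passed through with a '*'."""
    output = []
    for token in sentence.lower().split():
        word = token.strip(".,;:!?")
        if not word:
            continue
        output.append(lexicon.get(word, "*" + word))
    return " ".join(output)


# Placeholder lexicon: the right-hand sides are NOT real ciYawo words.
toy_lexicon = {
    "mwana": "<yao:child>",
    "bambo": "<yao:man>",
}

print(direct_transfer("Ndine bambo, osati mwana.", toy_lexicon))
```

Even in this toy, inflected forms such as ndine are not found in the lexicon, which hints at why word-level lookup alone is unlikely to be adequate for agglutinative Bantu languages and why morphological analysis precedes transfer.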
Apertium was chosen over other technologies such as Python NLTK, Fsm2, Matxin, OpenLogos and NiuTrans.
The reasons for settling on this particular technology can be summarised as follows:
• Scope: Python NLTK does not yet cover end-to-end MT (Bird, Klein, & Loper, 2009). Fsm2
focuses on two-level (both analysis and generation) morphology but not the remaining steps
in MT.
• Currency: Matxin and OpenLogos were last updated in 2010.¹¹,¹²
• User-friendliness: Apertium covers end-to-end MT and is up-to-date. Its user-friendliness is
enhanced by easy-to-follow online tutorials as well as an active user community who may
answer technical questions that are not covered by the user guide tutorials.
• Approach: A statistical MT (SMT) system called NiuTrans, developed in China, was
encountered during the ongoing literature review (Xiao, Zhu, Zhang, & Li, 2012). However, it
has been found time and again that SMT for under-resourced languages would be a major
challenge. This is presumed to be even more the case for Neural MT.
¹¹ https://sourceforge.net/projects/matxin/files, accessed April 2019.
¹² https://sourceforge.net/projects/openlogos-mt/files, accessed January 2019.
Apertium was released as an open-source “shallow transfer” MT toolbox in 2005 as a culmination of
a large Spanish government-funded project involving several universities and commercial companies.
Initially supporting only a handful of Iberian peninsula language pairs, its use has since exploded,
especially around the European Union, to include 49 related language pairs, mainly European.
Notable non-European language pairs include: Indonesian-Malaysian, Maltese-Arabic, Crimean
Tatar-Turkish, Kazakh-Tatar, and Urdu-Hindi.¹³ There is thus a reasonable expectation that closely related
Bantu language pairs can be readily added. The only African language to have been released as a stable
Apertium language pair (with Dutch, of course) is Afrikaans.
Figure 1 Apertium MT Process. Adapted from (Armentano-Oller, et al., 2005) and (Jurafsky & Martin, 2017).
From the Apertium wiki page: “Apertium is a shallow-transfer type machine translation system. Thus,
it basically works on dictionaries and shallow transfer rules. In operation, shallow-transfer is
distinguished from deep-transfer in that it doesn't do full syntactic parsing, the rules are typically
operations on groups of lexical units, rather than operations on parse trees.”¹⁴
In terms of grammar theory, however, a formal definition of shallow transfer is yet to be encountered
in recent literature (Borsley & Börjars, 2011). Such a definition is likely to vary according to
formalism (i.e. depending on the formal grammar being used).
The toolbox consists of an engine suitable for related languages, extensive documentation which is
very useful for new users such as myself, linguistic data, and compilers to convert the data into a
high-speed (“tens-of-thousands of words a second”) format for use by the engine. Finite state transducers
are used for lexical processing; hidden Markov models for PoS tagging; and finite-state based chunking
for structural transfer (Armentano-Oller, et al., 2005).
¹³ http://wiki.apertium.org/wiki/Main_Page, accessed January 2019.
¹⁴ http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO, accessed January 2019.
The engine has an 8-module “assembly-line” as shown in Figure 1. Each stage will be described in more
depth in the main study.
2.3.1 Evaluation of MT
Metrics are important for easy evaluation of different systems as well as for evaluating improvements.
We shall need them mainly for the latter.
BLEU (the bilingual evaluation understudy) is an automated, straightforward, low-cost and
language-independent evaluation method that “correlates highly with human evaluation” (Papineni, Roukos,
Ward, & Zhu, 2002). The measure is used extensively in the MT community. It has also been reviewed
critically and thus it shall be important to bear in mind the shortfalls of such automated evaluation
metrics (Callison-Burch, Osborne, & Koehn, 2006). Alternatives such as WER (word-error rate) and
NIST (National Institute of Standards and Technology) will also be reviewed (Doddington).
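For orientation, the two simplest of these metrics can be sketched in a few lines of Python. The version of BLEU below is a deliberately reduced one (a single reference, sentence level, no smoothing, uniform weights over 1- to 4-grams), and WER is computed as word-level edit distance divided by reference length. The function names and the placeholder tokens w1, w2, … are assumptions for illustration; an actual evaluation would more likely rely on an established corpus-level implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed sentence-level BLEU with uniform weights."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Modified precision: each candidate n-gram count is clipped by the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def wer(candidate, reference):
    """Word error rate: word-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(candidate) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, cand_word in enumerate(candidate, start=1):
            cost = 0 if ref_word == cand_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


reference = ["w1", "w2", "w3", "w4", "w5"]   # placeholder tokens, not real data
hypothesis = ["w1", "w2", "wX", "w4", "w5"]  # one substituted word
print(bleu(reference, reference), wer(reference, reference))
print(bleu(hypothesis, reference), wer(hypothesis, reference))
```

With one substituted word in a five-word sentence, WER is 0.2 while unsmoothed BLEU collapses to zero because no trigram matches. On a test set as small as the one envisaged here, this sensitivity is one of the shortfalls to bear in mind.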
3 Problem Statement
The contrast between the technological advancement of the handful of global languages and the
sparsity of research into Bantu language MT could not be starker. There has been no attempt at end-
to-end Bantu MT. Yet with a guided approach based on one language pair it is likely that a number of
Bantu language translation pairs would follow.
4 Objectives
The general objective is to demonstrate that Bantu MT is possible by making use of existing available
technologies, using the ciNyanja and ciYawo language pair.
4.1 Specific Objectives
o To use the ciNyanja and ciYawo language pair for illustrative MT purposes so that other Bantu
language pairs may follow with greater ease (bootstrap). This will set a benchmark for the
accuracy of Bantu MT.
o To use available language data to build richer corpora for ciNyanja and ciYawo for the
purposes of carrying out MT between this language pair. One way to develop richer Bantu
corpora is to enhance existing word lists with PoS tagging data. However, a Bantu specific PoS
tagset will have to be developed to take into account Bantu specific features.
The study will also generate new (“raw”) language data where required. This may require crafting
sentences for translation and/or questions for answering in order to tease out the more
subtle aspects of each language. This will inform the construction of grammatical transfer
rules. Dahl’s set of questions seems to be a widely recognised starting point (Dahl, 1985). These
were used in a 2012 PhD dissertation by a non-native speaker which described and compared the
tense-aspect systems of Chichewa, Citumbuka and Cisena (Kiso, 2012). The three
tenses – past, present and future – are covered. The future tense in each of the languages has
at least two distinctions: near and distant. According to (Dicks & Dollar, 2010) ciYawo similarly
has two such demarcations of the future tense.
Sanderson suggests up to five future tenses for ciYawo (Sanderson, A Yao Grammar, 1922). It
will be interesting to analyse these in light of more recent pronouncements on tense and
aspect (Nurse, 2003).
Only twenty of Dahl’s two hundred or so questions are related to the future tense. Therefore,
more may be developed during the course of this study.
o To develop a Bantu-specific comparative list of lexical items (both words and morphemes).
5 Research Questions
The accuracy of MT is acknowledged to be short of 100% (Craciunescu, Gerding-Salas, & Stringer
O’Keeffe, 2004). In this study, it will be shown that MT is possible using a Bantu language pair. Thus,
the main question to be addressed by the research in this study is a qualitative one:
1. To what extent is Bantu MT using the ciNyanja and ciYawo language pair possible at present?
“Extent”, of course, will have to be defined in quantitative terms after reviewing the norms in
relevant literature.
The following are lesser questions that can be answered quantitatively:
2. How large are the language data corpora required both in terms of lexical items and grammar
rules for MT?
3. What Bantu-specific PoS tagset and morphosyntactic categories are required?
6 Methodology
The methodological design of this study varies depending on the stage reached in the process of building a
fully functional MT system. The first step is to prepare the data in a format that the selected
software package will interpret correctly. There are two main sets of data: lexical items and transfer
rules. The lexical data are straightforward. The transfer rule data (both lexical transfer rules and
structural transfer rules) are less straightforward. This latter step has been automated as a “training”
step in many recent MT studies. Tuning, testing and evaluation stages follow.
Each stage is detailed below with justification for whether it is descriptive, analytical or comparative
in approach (or a combination thereof). The data collection procedures are also presented and
justified.
6.1 Preparation of Lexical Item data (words and morphemes)
The approach that will be used in collecting lexical items in the source language (SL) – ciNyanja or
ciYawo – and matching them with corresponding items in the target language (TL) – again, ciNyanja
or ciYawo – will be manual to start off with. The Swadesh list will be relied on at first. Once this has
been exhausted, its shortcomings shall be addressed and a more complete Bantu-specific list that
includes other morphosyntactic categories such as morphemic markers will be developed and
populated with corresponding lexical items from both languages.
The scope should embrace all pronominals, all nominal classes, the complete set of “pure” adjectives
and as much of the verbal morphosyntax as possible. At this stage, the low number of lexical items
should not be a major concern, but the completeness of lexical categories and basic grammar as
outlined in such a language guide as (Dicks & Dollar, 2010) shall be.
Such an approach is qualitative due to its reliance on the intuition¹⁵ of the researcher – a native
speaker of ciNyanja – coupled with reliance on tools such as ciYawo-English dictionaries and
grammars. An assumption is that, due to the similarity between the chosen language pair,
correspondences between lexical items will usually be found to be one-to-one. It can thus be said that
the approach is comparative but analytical as well.
At this stage, it is anticipated that the list will contain at least 200 but not more than 1,000
lexical items. Once this initial basic list has been collected, an additional tool – a computer script –
may be developed to reduce the tedium (and error-proneness) of entering lexical items in the required
electronic format, especially if the source corpus is already in electronic form and the number of
items is large enough, e.g. the CBOLD data referred to earlier. Such bulk entry would only occur later
in the study, however, possibly overlapping with the tuning, testing and evaluation stages, when the
focus is on improving the accuracy of the setup; increasing the size of the lexicons is one way of
doing so. An indicative upper limit is 3,000, the number of words in the recent ciYawo dictionary,
which may be available in electronic form (Dicks, 2018).
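The bulk-entry script mentioned above might look something like the following minimal sketch. It assumes the word pairs have been exported from the spreadsheet as two-column CSV and emits entries in the `<e><p><l>…</l><r>…</r></p></e>` shape used by Apertium bilingual dictionaries. The function name `rows_to_bidix`, the sample CSV and the single noun tag `n` are illustrative assumptions, not part of the proposal; a real dictionary would need per-item tags agreed for the language pair.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_bidix(rows, pos_tag="n"):
    """Build a <section> of bilingual entries from (source, target) pairs."""
    section = ET.Element("section", {"id": "main", "type": "standard"})
    for source, target in rows:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # skip gaps rather than emit half-empty entries
        entry = ET.SubElement(section, "e")
        pair = ET.SubElement(entry, "p")
        left = ET.SubElement(pair, "l")
        left.text = source
        ET.SubElement(left, "s", {"n": pos_tag})  # assumed tag: every item a noun
        right = ET.SubElement(pair, "r")
        right.text = target
        ET.SubElement(right, "s", {"n": pos_tag})
    return section


# Hypothetical spreadsheet export; the word pairs are taken from Appendix II.
SAMPLE_CSV = "nya,yao\ndzuwa,lyuuva\nmwezi,mwesi\nmadzi,meesi\nudzu,\n"

reader = csv.reader(io.StringIO(SAMPLE_CSV))
next(reader)  # drop the header row
section = rows_to_bidix(reader)
bidix_xml = ET.tostring(section, encoding="unicode")
print(bidix_xml)
```

Rows with an empty cell (such as the last sample row) are skipped so that gaps in the list remain visible in the spreadsheet instead of producing malformed entries.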
15 Intuition has been presented as a valid rationale in a University of Oslo PhD dissertation (Khumalo, 2007).
6.2 Manual Training (Preparation of Transfer Rule data)
Whereas languages with a more established NLP tradition tend to use automated unsupervised learning
followed by evaluation, this study intends to take a more manual, knowledge-driven approach when
codifying grammar rules, as with the lexical items above.
It is proposed to use the UDHR at this stage. Its 200 sentences should be manageable in a manual
process. Since it contains only 1,500 or so words, its number of unique words should be far lower than
the number of items in the Bantu-specific list above. If not, the list will be grown. (This would not
necessarily change the Bantu-specific list that will have been defined in the first stage.)
For comparison’s sake, the NiuTrans statistical MT system used 400 sentences for tuning (Xiao, Zhu,
Zhang, & Li, 2012). Thus, our 200-sentence corpus may be on the low side. If this is indeed the case,
then other similar parallel texts, such as the Sustainable Development Goals, will also be used.
However, our knowledge-driven manual approach is expected to deliver more accurate results with less
data than the unsupervised statistical approach.
Each sentence of the two parallel texts of the UDHR – one in ciNyanja and the other in ciYawo – would
be translated using the MT software. This would first involve ascertaining that the lexicon is adequate,
before attempting the MT in both directions. Should the transfer rules not produce acceptable
translations then these would be revised until the outputs in both directions are acceptable.
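The step of ascertaining that the lexicon is adequate could be supported by a small coverage check such as the sketch below. It is only an illustration: the function name `missing_words`, the toy sentence and the toy lexicon are assumptions, and it compares whole surface forms, whereas agglutinative Bantu word forms would in practice first have to pass through morphological analysis.

```python
import re


def missing_words(sentences, lexicon):
    """Return the sorted word forms that occur in the text but not in the lexicon."""
    seen = set()
    for sentence in sentences:
        # Letters only, but keep internal apostrophes so forms like m'mene stay whole.
        seen.update(re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", sentence.lower()))
    return sorted(seen - set(lexicon))


# Toy data built from Appendix II items, not taken from the UDHR itself.
toy_text = ["Dzuwa ndi mwezi."]
toy_lexicon = {"dzuwa", "mwezi"}
print(missing_words(toy_text, toy_lexicon))
```

Any forms reported would be added to the lexicon before translation of the text is attempted in either direction.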
A cursory examination of the contents of the transfer rule files for a number of established (European)
Apertium language pairs reveals that two or three hundred rules (in one translation direction) may be
a reasonable expected maximum.16 Some, like the aforementioned Indonesian-Malay pair, have fewer
than a handful of rules and therefore represent a parallel wordlist rather than an MT pair.
Such an approach would be quantitative, since a decision on the correctness of each sentence would
be made. However, the prior step of describing grammatical transfer rules, albeit in a format
accepted by Apertium, is descriptive, hence qualitative. Thus this stage employs both a quantitative
and a qualitative methodology. The target at this stage will be 100% acceptability.
In statistical MT this would represent the training step. Statistical MT systems such as Moses rely
on training data (bilingual sentence pairs and word alignments) and tuning data (source sentences
with one or more reference translations).
In our case, however, this is a supervised “training” step where the MT software would be used against
parallel corpora to translate sentence-by-sentence, iteration-by-iteration until it translates a sentence
pair correctly. Each iteration would attempt to resolve the source of any incorrectness – be it missing
lexical item-pairs or transfer rules – before moving to the next sentence pair until the parallel corpora
are exhausted.
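The sentence-by-sentence iteration described above can be pictured with the following sketch. The word-for-word function is merely a toy stand-in for the MT software (it is not Apertium), and the names `word_for_word` and `first_failure` are hypothetical. Unknown words are flagged with a leading asterisk, echoing the way Apertium conventionally marks words it cannot analyse.

```python
def word_for_word(sentence, lexicon):
    """Toy stand-in for the MT engine: replace each word via the lexicon."""
    return " ".join(lexicon.get(word, "*" + word) for word in sentence.split())


def first_failure(pairs, translate):
    """Return (index, source, output, reference) for the first mismatch, or None."""
    for index, (source, reference) in enumerate(pairs):
        output = translate(source)
        if output != reference:
            return index, source, output, reference
    return None


# Toy parallel "corpus" of single-word sentences taken from Appendix II.
pairs = [("dzuwa", "lyuuva"), ("mwezi", "mwesi"), ("madzi", "meesi")]
lexicon = {"dzuwa": "lyuuva"}

# Iteration 1: the second pair fails because "mwezi" is not yet in the lexicon.
failure = first_failure(pairs, lambda s: word_for_word(s, lexicon))
print(failure)

# The cause is resolved (here, missing lexical pairs) and the check is rerun.
lexicon["mwezi"] = "mwesi"
lexicon["madzi"] = "meesi"
after_fix = first_failure(pairs, lambda s: word_for_word(s, lexicon))
print(after_fix)
```

In the actual study the fix at each iteration could equally be a new or revised transfer rule, and acceptability would be judged by the researcher, not by exact string equality.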
Many corpus linguistics approaches have automated this (as well as the next step) into a language
data “cleaning” step.
6.3 Refinement/ Tuning
The above data preparation step should suffice for a proof-of-concept demonstration of basic MT
between ciNyanja and ciYawo (and vice versa). However, even though a restricted lexicon and
restricted set of transfer rules may suffice in a curtailed situation, for MT to be of use it must be
applicable to diverse, albeit typical, situations.
The approach in this step is identical to that of the previous step, except that a much larger corpus
would be used and the target will be less than 100%. Another difference is that the translation will be
attempted against a longer text (or even the entire text) on each iteration (as opposed to
sentence-by-sentence as in the previous step). The BLEU score for each iteration will be calculated, and
an attempt will be made to improve it in the next iteration by intuitively selecting which rules to
prioritise for the most improvement.
16 https://github.com/apertium, accessed February 2019.
The size of the corpus used for tuning using the NiuTrans statistical MT system is described as 100,000
sentences. With over 700,000 words, and assuming 15-20 words per sentence, the Bible contains roughly
35,000 to 47,000 sentences, a number similar in order of magnitude to that of this comparable study.
Thus the fact that ciNyanja and ciYawo Bible and Koran parallel texts may be obtainable means that
these religious texts could prove invaluable in our study. The Koran is only about 10% of the size of
the Bible; however, it may still be used for the same reason given in the previous stage, namely the
probable better accuracy of knowledge-driven over statistical approaches.
6.4 Test and Evaluation
Testing or evaluating an MT setup can be done in a number of ways. The first has already been described
in the previous steps: comparison against a parallel text. The second, quite obviously, is to ask human
respondents to evaluate whether MT output is correct. A third way is to compare the output with that
of another MT system such as Google Translate (De Pauw, Wagacha, & de Schryver, 2011). In that study,
however, the researchers point out that a previous version of their corpus had been used to build the
Google Translate system in the first place. It must also be noted that the same study draws its
training data (90%) and its test data (10%) from the same corpus.
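A 90%/10% division of a parallel corpus of the kind noted above can be sketched as follows. This is an assumption about how such a split might be scripted, not a description of the procedure in the cited study. The function name `split_parallel`, the fixed seed and the placeholder sentence labels are all illustrative.

```python
import random


def split_parallel(pairs, test_fraction=0.1, seed=0):
    """Shuffle sentence pairs reproducibly and hold out a fraction for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split can be repeated
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder labels standing in for aligned ciNyanja/ciYawo sentences.
corpus = [(f"nya-{i}", f"yao-{i}") for i in range(100)]
development, held_out = split_parallel(corpus)
print(len(development), len(held_out))
```

Keeping the held-out sentences away from rule and lexicon development would let the test stage measure how well the setup copes with text it has not been tuned on.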
6.5 Tools
The main tool in the first stage will be a spreadsheet, since the main activity as we make the Swadesh
list more Bantu-specific will be to enter ciNyanja and ciYawo lexical items in their corresponding slots.
The preparation of the data in a spreadsheet is itself also an act of initial analysis. Deeper analytical
insights will be recorded, using a word processor, for eventual discussion in the main thesis.
The later stages will use the Apertium MT toolbox. Decent installation instructions are available at the
Apertium wiki page17. The main hardware is a personal computer (PC) – desktop or laptop; in the
researcher’s case, a laptop. The operating system (OS) used is Debian Linux, version 9.4.
Text editors designed for programmers, such as Notepad++ on Windows or gedit on Linux, will
be used. These can make writing computer code easier and less error-prone thanks to automatic
indentation, colour-coding and other features. These tools will be used not only in writing scripts but
also in preparing the lexical data in Extensible Markup Language (XML), which Apertium, like much
other software, uses as its default data entry format.
The third and fourth stages will require Python NLTK and a short script developed in Python to
calculate the BLEU measures. This will also run on the Debian Linux setup described above.
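To make concrete what such a BLEU script computes, the sketch below implements corpus-level BLEU along the lines of Papineni et al. (2002) using only the Python standard library: clipped n-gram precisions up to 4-grams, a geometric mean and a brevity penalty. It is a simplified stand-in with one reference per sentence, whitespace tokenisation and no smoothing, not the NLTK implementation; in the study itself the functions in NLTK's `nltk.translate.bleu_score` module would presumably be used instead. The token strings are arbitrary placeholders.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        ref_len += len(ref_tokens)
        hyp_len += len(hyp_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: output shorter than the reference is penalised.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


identical = corpus_bleu(["a b c d e"], ["a b c d e"])
truncated = corpus_bleu(["a b c d e"], ["a b c d"])
print(identical, truncated)
```

Because unsmoothed BLEU drops to zero whenever no 4-gram matches, it is better suited to whole-text scoring, as planned for the tuning stage, than to scoring single short sentences.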
6.5.1 Online Software Repository
As scripts for testing the MT setup are developed using Python NLTK, they will need to be stored
somewhere.
A software repository is normally an online storage location where computer files (usually in text
format) may be stored. Such a repository is useful not only for backup storage of files but also for
keeping track of different versions of the files as well as facilitation of collaboration across diverse
geographical locations. Many of these services are offered for free and store the data “permanently”.
Thus, the language data and scripts may remain online long after the study has concluded.
The researcher’s GitLab repository will be used.18 In fact, the Swadesh lists for ciNyanja and ciYawo
have already been uploaded there, together with a simple direct word translator. These are replicated
in the appendices herein.
17 http://wiki.apertium.org/wiki/Installation, accessed February 2019.
18 https://gitlab.com/zangaphee/cibantu, accessed February 2019.
7 Significance of the Study
Work has been done in rule-based Bantu MT, as is evident from the literature review, which is still
modest at this stage. However, that work concerns translation between a Bantu language and, typically,
English, a non-Bantu language. Much work has also been done on the individual stages of the end-to-end
MT process, e.g. morphological analysis, generation and tagging. The original contribution to research
that this study may make is as the first attempt at end-to-end MT for a Bantu language pair.
Indirectly, should a complete and accurate MT system become available online for a Bantu language
pair, this may also contribute to linguistic inclusion.
The study will have a bearing on comparative and historical Bantu linguistics. Developing detailed
transfer rules will necessarily involve comparing at least the two Bantu languages in question. It may
even have a bearing on Bantu grammatical universals and give insights into standardisation efforts.
8 Timeline
ACTIVITY                    TASK                      DATES                   COMMENT
Meetings with supervisors   Meeting main supervisor   Ongoing – monthly
                            Meeting all supervisors   Ongoing – six-monthly
Literature review                                     Ongoing
Progress reports                                      Every six months
Research proposal                                     Year One
Seminar                                               Annually
Journal Article                                       Submit annually
Data collection                                       Year Two
Writeup                                               Year Three
An interim summary will be produced one third of the way through data collection, possibly after
populating the Bantu-specific list with both SL and TL items. This could assist with a stop-go decision
on whether to continue collecting more data.
9 Resources
This study will involve extensive use of data that others have compiled. Ethical issues will therefore
need to be considered before using such data. The Chichewa Koran, for example, seems to be freely
available on the internet, but access to other data may require permission.
A fluent native speaker of ciYawo will be required to assist at the stage of preparing lexical data.
A translator (from ciYawo to ciNyanja) will also be required to translate example sentences.
Fluent speakers of both languages will be required to validate the data.
10 References
Archangeli, D., Mielke, J., & Pulleyblank, D. (2012). From Sequence Frequencies to Conditions in Bantu
Vowel Harmony: Building a grammar from the ground up. McGill Working Papers in
Linguistics, 22(1).
Armentano-Oller, C., Corbí-Bellot, A. M., Forcada, M. L., Ginestí-Rosell, M., Bonev, B., Ortiz-Rojas, S.,
. . . Sánchez-Martínez, F. (2005). An open-source shallow-transfer machine translation toolbox:
consequences of its release and availability. Proceedings of OSMaTran: Open-Source Machine
Translation, A workshop at Machine Translation Summit X (Phuket, Thailand, September 12-16,
2005).
Backus, J. W. (1959). The syntax and semantics of the proposed international algebraic language of
the Zurich ACM-GAMM Conference. In Information Processing: Proceedings of the
International Conference on Information Processing, Paris (pp. 125-132). Geneva: UNESCO.
Banda, F., Mtenje, A., Miti, L., Chanda, V., Kamwendo, G., Ngunga, A., . . . Nkolola, M. W. (2008). A
Unified Standard Orthography for South-Central African Languages (Malawi, Mozambique
and Zambia). Cape Town: The Centre for Advanced Studies of African Societies (CASAS).
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with
the Natural Language Toolkit. Sebastopol, California: O'Reilly Media.
Borsley, R. D., & Börjars, K. (2011). Non‐Transformational Syntax: Formal and Explicit Models of
Grammar. West Sussex: Wiley Blackwell.
Bosch, S., Jones, J., Pretorius, L., & Anderson, W. (2007). Computational Morphological Analysers and
Machine-Readable Lexicons for South African Bantu Languages. The International Journal of
Localisation, 6(1), 22-28.
Byamugisha, J., Keet, M., & DeRenzi, B. (2016). Bootstrapping a Runyankore CNL from an isiZulu CNL.
International Workshop on Controlled Natural Language, (pp. 25-36). Aberdeen.
Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the Role of BLEU in Machine
Translation Research. 11th Conference of the European Chapter of the Association for
Computational Linguistics, (pp. 249-256).
Catford, J. C. (1965). A Linguistic Theory of Translation. Oxford: Oxford University Press.
Centre for Language Studies. (2005). The Orthography of CiYawo. Chileka: E + V Publications.
Centre for Language Studies. (2000). Mtanthauzira mawu wa chiNyanja. Blantyre: Dzuka.
Centre for Language Studies. (2013). Mgopolela maloŵe jwa ciYawo. Blantyre: Dzuka.
Chimombo, Z. C. (2010). A Common CiBantu Strategy for the Creation of Scientific Terminology.
Malilime Malawian Journal of Linguistics, 5, 87-107.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information
Theory, 2(3), 113-124.
Craciunescu, O., Gerding-Salas, C., & Stringer O’Keeffe, S. (2004). Machine Translation and Computer-
Assisted Translation: a New Way of Translating? Translation and Computers, 8(3).
Dahl, O. (1985). Tense and Aspect Systems. Oxford: Basil Blackwell Ltd.
De Pauw, G., Wagacha, P., & de Schryver, G.-M. (2011). Towards English-Swahili Machine Translation.
Research Workshop of the Israel Science Foundation: Machine Translation and
Morphologically-rich Languages.
de Schryver, G.-M., & Nabirye, M. (2018a). Corpus-driven Bantu Lexicography Part 1: Organic Corpus
Building for Lusoga. Lexikos, 28, 32-78.
de Schryver, G.-M., & Nabirye, M. (2018b). Corpus-driven Bantu Lexicography Part 2: Lemmatisation
and Rulers for Lusoga. Lexikos, 28, 79-111.
de Schryver, G.-M., & Nabirye, M. (2018c). Corpus-driven Bantu Lexicography Part 3: Mapping
Meaning onto Use in Lusoga. Lexikos, 28, 112-151.
Dicks, I. D. (2018). English - Ciyawo Learner's Dictionary. Mzuzu: Mzuni Press.
Dicks, I. D., & Dollar, S. (2010). A Practical Guide to Understanding Ciyawo. Zomba: Kachere Series.
Doddington, G. (n.d.). Automatic Evaluation of Machine Translation Quality Using N-gram Co-
Occurrence Statistics. Retrieved January 17, 2019, from
http://www.itl.nist.gov/iad/mig//tests/mt/doc/ngram-study.pdf
F.A.R. (1895). Ntanu za Esopo. Cambridge: University Press Cambridge.
Hellan, L. (In preparation). Constructions formed through derivation from sign structure to sign
structure. In L. Hellan, TypeGram – a platform for cross-linguistic construction description (pp.
91-101).
Hurskainen, A. (2012). Quality Swahili machine translation. Multilingual, 23(7), 39-42.
Hurskainen, A., & Tiedemann, J. (2017). Rule-based Machine translation from English to Finnish.
Proceedings of the Conference on Machine Translation (WMT), Volume 2: Shared Task Papers
(pp. 323-329). Copenhagen: Association for Computational Linguistics.
Hutchins, W. J. (1998). Milestones in Machine Translation Part 2 – Warren Weaver’s memorandum
1949. Language Today, 6, 22-23.
Jurafsky, D., & Martin, J. H. (2017). Speech and Language Processing; An Introduction to Natural
Language Processing, Computational Linguistics and Speech Recognition (3rd ed.). New
Jersey: Prentice Hall.
Katushemererwe, F., & Hanneforth, T. (2010). Finite State Methods in Morphological Analysis of
Runyakitara Verbs. Nordic Journal of African Studies, 19(1), 1-22.
Keet, M. C., & Khumalo, L. (2016). Grammar rules for the isiZulu complex verb. Southern African
Linguistics and Applied Language Studies, 35(2), 183-200.
Khumalo, L. (2007). An Analysis of the Ndebele Passive Construction. Oslo: University of Oslo.
Khumalo, L. (2015). Advances in Developing Corpora in African Languages. Acalan Journal, 1(2), 21-30.
Kiso, A. (2012). Tense and aspect in Chichewa, Citumbuka and Cisena. Stockholm: US-AB.
Maho, J. F. (2007). The Linear Ordering of TAM/NEG Markers in the Bantu languages. SOAS Working
Papers in Linguistics, 15, 213-225.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
Marten, L., Kula, N. C., & Thwala, N. (2007). Parameters of morpho-syntactic variation in Bantu.
Transactions of the Philological Society, 105(3), 253–338.
Mayor, A., & Tyers, F. M. (2009). Matxin: Moving towards language independence. Proceedings of the
First International Workshop on Free/Open-Source Rule-Based Machine Translation, (pp. 11-
17). Alacant.
Mchombo, S. (2003). Choppin’ up Chichewa: Theoretical Approaches to Parsing an Agglutinative
Language. Malilime Malawian Journal of Linguistics, 3, 15-34.
Miti, L. (2006). Comparative Bantu Phonology and Morphology. Cape Town: Centre for the Advanced
Studies of African Society (CASAS).
National Statistical Office of Malawi. (2000). 1998 Population and Housing Census Analytical Report.
Zomba: National Statistical Office.
Nurse, D. (2003). Aspect and Tense in Bantu Languages. In D. Nurse, & G. Philippson, The Bantu
languages (pp. 90-102). London: Routledge.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of
Machine Translation. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, (pp. 311-318).
Perrott, D. V. (1950). Swahili: A Complete Course for Beginners. Sevenoaks: Hodder and Stoughton.
Rose, S., Beaudoin-Lietz, C., & Nurse, D. (2002). A Glossary of Terms for Bantu Verbal Categories, with
Special Emphasis on Tense and Aspect. München: Lincom Europa.
Sanderson, G. M. (1922). A Yao Grammar. London: The Macmillan Company.
Sanderson, G. M. (1954). A Dictionary of the Yao Language. Zomba: Government Print.
Snider, K., & Roberts, J. (2006). SIL COMPARATIVE AFRICAN WORDLIST (SILCAWL). Retrieved April
2019, from https://www.eva.mpg.de/lingua/tools-at-lingboard/pdf/Snider_silewp2006-
005.pdf
Steedman, M., & Baldridge, J. (2011). Combinatory Categorial Grammar. In R. D. Borsley, & K. Börjars
(Eds.), Non‐Transformational Syntax: Formal and Explicit Models of Grammar (pp. 181-224).
West Sussex: Wiley Blackwell.
Vandeghinste, V., Martens, S., Kotze, G., Tiedemann, J., Van den Bogaert, J., De Smet, K., . . . van Noord,
G. (2013). Parse and Corpus-Based Machine Translation. In P. Spyns, & J. Odijk, Essential
Speech and Language Technology for Dutch (pp. 305-319). Berlin: Springer Berlin Heidelberg.
Whiteley, P. M., Xue, M., & Wheeler, W. C. (2018). Revising the Bantu tree. Cladistics, 1-20.
Xiao, T., Zhu, J., Zhang, H., & Li, Q. (2012). NiuTrans: an open source toolkit for phrase-based and
syntax-based machine translation. Proceedings of the ACL 2012 System Demonstrations, (pp.
19-24).
11 Appendix I – Code Listing for Simple Bantu Word Translator
The following is the code for a simple Bantu word translator program, currently supporting ciNyanja,
ciYawo and kiSwahili, which uses a direct lexical transfer approach based on Swadesh lists.
The following is how to execute the above program on a Linux command line.
zanga@tsc:~/dev/nltk$ ./Bantu-simple-trans.py -s yao -t nya -w lyuuva
lyuuva in ciNyanja is: dzuwa
zanga@tsc:~/dev/nltk$ ./Bantu-simple-trans.py -s nya -t yao -w dzuwa
dzuwa in ciYawo is: lyuuva
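For orientation, a direct lexical transfer translator with the command-line behaviour shown above might be structured along the lines of the following sketch. This is not the listing from the researcher's repository: the table `SWADESH`, the functions `translate_word` and `main`, and the long option names are assumptions made for illustration. Only three ciNyanja and ciYawo items from Appendix II are included, and kiSwahili is omitted.

```python
import argparse

# Tiny excerpt of the Swadesh data in Appendix II, keyed by English concept.
SWADESH = {
    "sun": {"nya": "dzuwa", "yao": "lyuuva"},
    "moon": {"nya": "mwezi", "yao": "mwesi"},
    "water": {"nya": "madzi", "yao": "meesi"},
}
LANGUAGE_NAMES = {"nya": "ciNyanja", "yao": "ciYawo"}


def translate_word(word, source, target, table=SWADESH):
    """Look the word up in the source column and return the target-column form."""
    for forms in table.values():
        if forms.get(source) == word:
            return forms.get(target)
    return None


def main(argv=None):
    """Parse -s/-t/-w options, print the result and return the printed message."""
    parser = argparse.ArgumentParser(description="Direct lexical transfer sketch")
    parser.add_argument("-s", "--source", choices=sorted(LANGUAGE_NAMES), required=True)
    parser.add_argument("-t", "--target", choices=sorted(LANGUAGE_NAMES), required=True)
    parser.add_argument("-w", "--word", required=True)
    args = parser.parse_args(argv)
    result = translate_word(args.word, args.source, args.target)
    if result is None:
        message = f"{args.word} not found in the {LANGUAGE_NAMES[args.source]} list"
    else:
        message = f"{args.word} in {LANGUAGE_NAMES[args.target]} is: {result}"
    print(message)
    return message


# Equivalent of: ./Bantu-simple-trans.py -s yao -t nya -w lyuuva
main(["-s", "yao", "-t", "nya", "-w", "lyuuva"])
```

Calling `main()` without arguments would read the options from the real command line. Keying the table by concept means a further language can be added as one more column without changing the lookup code.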
12 Appendix II – Swadesh Lists19
EN NY, NYA YAO
I ndi- nne
you (sing) iwe waawo
he a, iye veleo, mwaanjawo
we ife uwe
you (pl) inu nmwe, mmwe
they iwo vanyawo, vanyao, vaanganya
this uyu, awa, uwu, iyi, ichi, izi, ili, aka, iti, apa, uku, umu p-eleepa, kweleeku, iyi, ciici aci, ciici, awu wuteende-wu, a-wu, a-pa, a-lu, a-ku
that uyo, awo, uku, poti, pakuti, muti, kuti, iyo, ako a-po, a-cila, a-co, a-dila, a-dyo, a-jila, a-jo, a-jula, a-kala, a-ko
here pano, kuno, muno p-eleepa, p-eepano, papapapepa, panopano, kunokuno, a-pano, a-pa, apa, a-kuno, a-ku
there apo, uko, umo, paja, kuja, muja p-eleepo, peepala, palapala, kulakula, a-po, a-pala, a-kula, akula, a-ko
who amene vaani, vaa, -ani
what cicii, chichi, -api
where mmene, pamene, kumene kuweesi
when pamene, m’mene, po- nkuti, caaka ci
how -nji wudi, -ati udi
not si ngati
all -nse wose, koose
many -mbiri yejinji, -jiinji
some -ngapo
few -ena, -chepa -noondi, -nnoono, -naandi
other -ina, -ja
one chimodzi -mo, chinpepe
two ziwiri -vidi, ivili
three zitatu -tatu, itatu
four zinai -cece, ncece, mcheche
five zisanu -saano, nsaano, nsano
big -kulu wukulu, -kula
long -tali uleu, -leewu
wide -takata, -lifupi -telekula
thick -khathamira, -limbirapo, -khuthala, -chindikala -kweeva, -gagaatala, -kaandapala, kwimbala (fat, thick)
heavy -lemera -sito
small -ng’ono unondi, -noondi, -nnoono, -naandi
short -fupi wiipi, -jiipi
narrow -chepetsa, -siyana maganizo penupenu, -tona, -vijikana
thin -pyapyala, -onda lipuuta, -sololoka, -pwa
19 https://en.wiktionary.org/wiki/Appendix:Bantu_Swadesh_lists, accessed March 2019.
woman mkazi, mayi vammasyeto, vakongwe
man mwamuna va-kuunganya
human munthu wu-muundu
child mwana mwanache, mw-aanace
wife mkazi vankwao
husband munthu valume, va-alume
mother mayi amao, maama
father tate atati, n-taati, baaba, v-eese
animal nyama chijama, ci-nyama
fish nsomba somba, soomba
bird mbalame chijuni, ci-juni
dog galu nanbwa, n-bwa
louse nsabwe njipi, n-jipi
snake njoka lijoka, di-ijoka
worm nyongolotsi, litsipa (tapeworm), msundu di-nyoongolosi
tree mtengo chitela, ci-koloongwe
forest nkhalango lipuluulu, wu-kweeti, n-situ
stick ndodo ntindiso, m-puuto
fruit chipatso isogosi, ci-sogosi
seed mbewu, njere, mbuyi, mbuyu, nthanga mbeju, n-beju, n-beeca
leaf tsamba, dzani lisamba, lisaamba, di-saamba
root muzu mchiga, n-ciga
bark (tree) kowola likuungwa
flower duwa liua, di-luuva
grass udzu manyasi, ma-nyasi
rope chingwe lu-koonji
skin khungu lipende, likova
meat nyama nyama
blood magazi miasi, my-aasi
bone fupa liupa, di-iwupa
fat (n.) mafuta mauta, ma-wuta
egg dzira lindanda, di-ndaanda, di-jeele
horn nyanga, liphondo, lipenga livengwa, n-seengo
tail mchira mchila, n-cila
feather nthenga di-nyuunya
hair tsitsi umbo, lu-wuumbo, lu-nyiidi
head mutu ntwe, n-twe
ear khutu liwiwi, di-piikanilo
eye diso liso, di-iso
nose mphuno mbula, lu-pula
mouth kamwa kang'wa
tooth dzino liino, di-ino
tongue lilime lulimi, lu-dimi
fingernail chikhadabo lukoose, lu-kalaveesa
foot phazi ci-kaambato, lu-sajo (footprint, sole)
leg mwendo lukongolo, lu-koongolo
knee bondo lilungo, di-luungo
hand dzanja nkono
wing khupe, phiko di-papiko
belly mimba lutumbo
guts
neck khosi, chitambo lukosi, lu-kosi
back msana mngongo, n'-goongo, nsaapa, ci-nyuma
breast bere mavele, ma-veele
heart mtima ntima, n-tima
liver chiwindi, mphafa litoga, di-tooga
to drink kumwa, -mwa ku-ng'wa, -n'wa, -mwa
to eat -dya ku-lya, -dya
to bite -kuluma -luma
to suck -yamwa, -kama, -fipa, -tsopa -oonga, -kweemba
to spit -mate -suna
to vomit -sanzi -selula, -tapika
to blow -omba -puga
to breathe -puma -pumula
to laugh -seka, -bwezuka, -kakadza -seka
to see -wona -lola, -wona
to hear -mva -pikana
to know -dziwa -manyilila
to think -ganiza -ganisya
to smell -nunkha -nunjila , -nunga
to fear -opa, -opana, -opsedwa, -thupsana -oga, -oogopa
to sleep -gona -gono
to live -khala -taama
to die -fa, -mwalira -wa
to kill -pha -wulaga
to fight ndewu (n.) -menyana, -putana
to hunt -saka -lupata
to hit -tiimba, -puuta, -nyeesya
to cut -duka, -tseteka -seenga, -salula, -kata
to split -balula -gala, -cecenukula, -cecenukuka, -pasula, -gaangula
to stab -baya -wulutula, -tota, -koosa, -copa, -soma
to scratch -kalasa, -kalasula, -kalula, -kanda -mwaaga, -ng'wanya, -kodola
to dig -mba -sola
to swim -sambila -suuga, -n'aambila
to fly -nonthoka -guluka
to walk -yenda -eenda
to come -dza -yika
to lie (bed) -gona -latama, -gadama, ku-gona lugali
to sit -tambalala, -labzi -wutamila, -tama, -taama
to stand -yima -iima
to turn -khota, -lowera -piinda
to fall -gwa -gwa
to give -patsa -pa
to hold -wetsa -pakata
to squeeze -khwinyata, -vungama, -khwinyala, -finya -pana, -minya
to rub -fikisa, -tikita, -tsangula, -tsitsitiza, -pekesa, -sisita -ticita, -sugula, -kulukuuta
to wash -samba -suka, -saula, -likunda
to wipe -khula -syula
to pull -koka -wuta
to push -kankha, -kunkhuniza, -thunya -tuta
to throw -ponya, -taya -iita
to tie -kadzira, -kwidzinga, -manga, -njata, -gwiriziza -tava
to sew -soka -tota, -sona
to count -werenga -valanga, -valaanga
to say -ti, -saala, -cidiwuka, -tanjila, -salila
to sing -yimba -imba
to play -doda, -sewera, -ponya -taamba, -n'aanda, -ceesa
to float -beruka, -yandama -jajavala, -palasila
to flow -yenda -jidima, -vidima
to freeze -pimisya
to swell -fufuma -yiimba
sun dzuwa lyuuva
moon mwezi mwesi
star nyenyezi, nthondwa, nthanda ndondwa
water madzi meesi
rain mvula wula
river mtsinje lusulo
lake nyanja nyaasa, litanda
sea nyanja ya mchere, nyanja zakuya kwambiri bwaani
salt muthira mchere njete
stone mwala liganga
sand mchenga, timiyala, muhabva nsanga
dust fumbi luundu
earth dziko litaka
cloud mtambo, khwithi, chifwirim liunde
fog nkhungu, mikhwithi daambwe
sky thambo kwiunde
wind chiphwisa nyeembeelela
snow madzi oundana oyera amaoneka pamwamba pa mapiri aatali ku mayiko ozizira, chipale, chifunga, nkhungu
ice madzi oundana, madzi ouma, matalala
smoke utsi osi
fire moto moto
ash chipala, phulusa
to burn -wocha, -pya -ooca, -kolela, -coma
road msewu vaangula
mountain phiri litumbi
red -fiira, -psuu -ceejewu
green -biriwira -gulukulu, -awusaamba
yellow -chikasu
white -yera cheswela
black mkuda, wokuda -piliu
night usiku chilo
day tsiku lyuva
year chaka chaaka
warm -funda -kosya, -oota, -vaamba
cold -zizira sisisisi, -bepo
full -khuta, -dzadza -too, coti
new -tsopano chanpya, -amoonje
old -kalamba, -khwima, -kale -wisala, -aceekulu, -kalaambala
good -bwino -ambone
bad -ipa, -uve, -chabe, -bziwi, -bi ngalumbana, ngalimate
rotten -ola, -vunda -ola
dirty -da, -ipsa, -detsa -wuuka
straight -ongoka, -salala, -tawatawa, -njo, -sapsatira -nyoloombola, -goloka
round -zungulira, -zunguza, -ndendera, -bungulira -aluwulu
sharp (knife) -thwa -tema, -sijidika
blunt (knife) -phinjika -piisa, -piinya, -angatema
smooth -salala -tidila
wet chinyezi, chinyontho nyakapya
dry -uma -umula, -juumu
correct -konza -luungamika
near -fupi, -yandika -vandika, -vaandika
far -kutali -talika
right -manja n'dyo
left -siyidwa, -tsala, kumanzere nciji
at pa
in
with ndi
and ndi nipo, kwiisa
if nkati, nkani, pokhala nambaga
because chifukwa pakuva
name dzina liina