Professional Documents
Culture Documents
A Study of A Non-Resourced Language An Algerian Dialect
A Study of A Non-Resourced Language An Algerian Dialect
UBMA LORIA
Badji Mokhtar University Campus scientifique
Informatic Department BP 139, 54500 Vandoeuvre Lès
BP 12, 23000 Annaba, Algeria Nancy Cedex, France
SLTU-2012 125
2. WHY ARE WE INTERESTED IN COLLOQUIAL same idea. If we consider only the word à B@ āl-ān (Now) in
ARABIC?
MSA, we remark that its equivalent in each of the considered
dialects differs from that used in the others: úGZñËX dilw-ty in
Egyptian, Éë halla in Syrian, øñK tawā in Tunisian, ¼PX
The existence of dialects of the language is a challenge for
Natural Language Processing (NLP) in general, as it adds an-
other set of dimensions of variation from a known standard. durk in Algerian and H . X daba in Moroccan.
The problem is particularly interesting in Arabic and its di- Now let us consider Maghreb spoken languages. There are
alects. Any practical and realistic approach of the treatment clearly two native languages in Morocco and Algeria, Alge-
of Arabic should report the use of dialect, because it is ubiq- rian or Moroccan Arabic and Berber3 (respectively 40 to 50%
uitous [5]. of Berbers in Morocco, and 25 to 30% in Algeria). In Tunisia,
We see at international conferences post September 11, 2001, there are only few Berbers (1 or 2%). In addition, the number
a craze increasingly important for machine translation of stan- of monolingual Berbers in rural areas is not negligible. On
dard Arabic to Indo-European languages. These studies are the other hand, the most optimistic estimates of illiteracy is
important when it comes to translating official documents, 50% in Morocco, Algeria 26% and 23% in Tunisia [8]. MSA
however if you want to develop applications for average citi- is therefore still possessed by a small minority. So, much of
zen, it is necessary to take into account his mother tongue, it the population is monolingual in Arabic Moroccan, Algerian
means his dialect. or Tunisian or bilingual Berber/Arab Moroccan or Algerian,
The main dialectal division is between the Maghreb dialects with snippets of standard Arabic and French.
and those of the middle east, followed by that between seden-
tary dialects and bedouin ones. 3. ALGERIAN SPOKEN ARABIC
Watson writes ”Dialects of Arabic form a roughly continuous
spectrum of variation, with the dialects spoken in the eastern In Algeria, as elsewhere, spoken Arabic differs from written
and western extremes of the Arab-speaking world being mu- Arabic; Algerian Arabic has a vocabulary inspired from Ara-
tually unintelligible” [6]. Effectively, while middle easterners bic but the original words have been altered phonologically,
can generally understand one another, they often have trou- with significant Berber substrates, and many new words and
ble understanding Maghrebis2 . Although the converse is not loanwords borrowed from French, Turkish and Spanish.
true, due to the popularity of middle eastern, especially egyp- Like all Arabic dialects, Algerian Arabic has dropped the case
tian, movies and other media. In some cases, people from endings of the written language. It is not used in schools, tele-
these countries are unable to understand each other, at most vision or newspapers, which usually use standard Arabic or
few words are unknown for them [7]. In other cases, people French, but is more likely, heard in songs if not just heard in
from one of the concerned country could find the grammatical Algerian homes and on the street. Algerian Arabic is spoken
structure of the neighbor country bit understandable. Table1 daily by the vast majority of Algerians [9].
provides a simple, yet interesting, example of how spoken va- Algerian Arabic is part of the Maghreb Arabic dialect con-
rieties of Arabic differ in intelligibility. The English sentence tinuum, and fades into Moroccan Arabic and Tunisian Arabic
I am going now is given in Syrian, Egyptian, Tunisian, Alge- along the respective borders. Algerian Arabic vocabulary is
rian and Moroccan dialects and in MSA with their respective pretty much similar throughout Algeria, although the eastern-
transliteration. ers sound closer to Tunisians while the westerners speak an
Arabic closer to that of the Moroccans.
We focus, in this paper, on one of the easterners dialects of
Table 1. Variants of Arabic dialects expressing the English
Algeria: Annaba’s dialect (AD). This choice is justified by
sentence I am going now
the fact that this dialect is the one we know best and practice.
MSA à B@ I.ë@ X AK @ -anā ¯dāhibun āl-ān We present in section 4 its peculiarities.
Egyptian úGZñËX l' @P AK @ -anā rāyih. dilw-ty
Syrian Éë hðP h@P rāh. rūh. halla 4. SPECIFICITIES OF ANNABA’S DIALECT
Tunisian øñK úæÖß AK . bāš nimšy tawā To develop any application based on a language, we think that
Algerian ¼PX hðQK h@P rāh. nrūh. durk at least a basic linguistic study is necessary even if we use sta-
tistical models. In this section, we present the main features
Moroccan H. X ø XA« AK @ -anā ġādy daba of the dialect of Annaba in which we are concerned.
Annaba’s dialect (AD) is spoken in the city of Annaba lo-
cated at the east of Algeria. It is spoken by more than one
These examples reflect clearly the phonetic and lexical
million people. Like for Maghreb Arabic dialects, the most
distance between dialectal sentences expressing exactly the
3 Berber or berber languages are a family of similar or closely related lan-
2 People from Tunisia, Algeria and Morocco guages and dialects indigenous to North Africa.
SLTU-2012 126
notable features of this dialect, is the collapse of short vow- 4.1.1. The separate personal pronouns
els4 in some positions. The word H . AJ» kitāb (book) in MSA
Let’s distinguish in the following their use in singular and plu-
corresponds to H. AJ» ktāb: the short vowel @ i kasra on the ral forms.
first consonant » k- in MSA is deleted in dialectal and re- • Singular form
placed by the sukūn5 .
In AD, the consonant q is generally pronounced ¬ v. For 1. AK @ -anā, úG @ -anı̄ (I).
example ÈA¯
qāl (to say) is pronounced ÈA¯ vāl. For some
words both alternatives exist like the word ©¢¯ qt.a, which 2. masc. I K nta; fem. I K nti (You).
can be also pronounced ©¢¯ vt.a,. We give in table 2 a list of 3. masc. ñë huwa (He); fem. ùë hiya (She).
other consonants which pronunciation differs from standard
Arabic, and their respective pronunciation. • Plural form
• Singular form
4.1. Personal pronouns
bet. When anyone of these vowels is coupled with a consonant, the consonant book).
develop a unique sound (see [10] for more details).
5 A consonant can carry no vowels for the default sound of the letter. No
SLTU-2012 127
2. Õ» kum is used for ”Your”, as Õ»P@X dārkum (Your 4.4. The negative sentence
house).
The form Ó maš (Not) is in general used as a negative par-
ticle, and may be found with all person form. It can also be
3. Ñë hum is used for ”Their”, as ÑëP@X dārhum combined with the personal pronouns 7 to get negative sen-
(Their house). tences: úæÓ mašnı̄ (I am not); ½Ó mašk, ÕºÓ maškum
è h ta-marbūta mašnā (We are not); ñÓ
(You are not); AJÓ mašū (He is
not); úæÓ mašı̄ (She is not) and ÑîDÓ mašhum (They are
In the case of feminine nouns ending with
as éªÖÞ šam,ah, the suffixes are úG tı̄, ½K tk, AîE thā, ...
The form ¨AK tā, combined with personal pronouns as suf-
not). The negative sentence can also be obtained by adding
affixes AÓ mā (as a prefix) and š (as a suffix) to verbs.
fixes is used to insist that something belongs to someone. It is Table 4 illustrates some examples of negative sentences.
ad-dār tā,kum
introduced after the noun to which the possessive refers as in
ú«AK H. AJºË lktāb tā,ı̄ (My book), Õº«AK P@ YË@ Table 4. Negative sentences
(Your house). English Annaba Dialect
I do not go maš rāyah.
l' @P Ó
4.3. Interrogative form I do not remember Qº®JÓ úæÓ
mašnı̄ matfakar
You did not eat
JÊ¿AÓ māklitı̄š
We list in table 3 the most common forms of interrogative par-
ticles and pronouns used in the dialect of Annaba. We can no-
tice that in comparison to MSA no dialect word corresponds
to the original pronoun, except the third one áK ð wayn which 4.5. The plural form
is a modified form of ayna áK @ -ayna. Algerian Arabic uses broken and regular plural. Like all other
In this dialect, any sentence can be turned into a question, in Arabic dialects, suffix àð wn used for the nominative in clas-
any one of two ways. sical Arabic is no longer in use in regular plural. Suffix áK yn
used in classical Arabic for the accusative and the genitive is
used for all cases. For example the plural of áÓñÓ mūman
? @Q® K h@P rāh. taqrā? (Will you revise?).
1. It may be spoken in an interrogative tone of voice, like
(believer) is áJÓñÓ mūmnı̄n.
For feminine nouns, the plural is mostly regular (obtained by
2. An interrogative pronoun or compound derived from a postfixing HA -at): the plural of I K. bant (girl) is HA JK.
. áK ð wayn ǧāt
pronoun may be used, as ?Õ»P@X HAg bn-at. Here, the grammatical rule used to constitute plural
dārkum? (where is your house?). forms for our dialect is the same one used in MSA. For some
words the broken plural is used: like ÉK. @ñ£ .tw-abl which is
the plural of éÊK. A£ .tāblah (table). Let’s remark here that, the
word éÊK. A£ .tāblah is loaned from the French Language and
Table 3. Interrogative particles and pronouns in AD and their
equivalents in MSA. correspond to the word ”Table”.
English Annaba dial. MSA We listed in the foregoing, the main features of the dialect
Who àñº škūn áÓ man of Annaba in order to understand the particularity of this lan-
guage before doing any particular processing. In the follow-
Which AKð wanā ø @ -ayu ing we will present how we proceed for developing a parallel
Where áK ð wayn áK @ -ayna corpus in order to use it in a future work in statistical transla-
SLTU-2012 128
other Algerian dialect, there is no corpus that can be used if they exist. Then all the possible sentences obtained by re-
to develop a translation system. In this project, we start then placing each word by its synonym are produced. the number
from scratch. To build a such corpus, a first step is to establish of sentences generated depends on the number of words hav-
a standard bilingual dictionary Arabic - Arabic dialect. The ing synonyms and the number of synonyms for each word.
dictionary will contain entries as: ¨Qå @ -sri, → H . P P@ -izrib Once the sentences are obtained, they are added to the appro-
priate corpus. A similar operation is then launched but instead
which corresponds to ”hurry up”. Developing a dictionary is
replacing each word by its synonyms we replaced each fem-
the first stone which will allow us to get a corpus.
inine pronoun by a masculine pronoun, a singular by plural
To build the dictionary and consequently the corpus, we
and so on.
recorded discussions ”in live” in different environments
(medical offices, pubs, markets, ...) to ensure a large vari-
ety of the vocabulary. These recordings contained very noisy 7. THE DIALECT’S VOCABULARY
segments that have been removed. Ultimately, we selected
the equivalent of 10 hours of speech transcribable. In this section we focus on the study of dialect’s vocabulary.
This approach could be surprising, but we would like to We notice that there are three types of words:
remind that this language has never been written, then no ref-
erence to a list of words nor sentences is available somewhere • Arabized loaned words: are words belonging to foreign
in our knowledge. Afterward we performed a manual tran- vocabulary (most of them are French words), which
scription of these recordings and extracted all words which were introduced in the dialect after having been natu-
will be analyzed. Subsequently, we assigned, to each ex- ralized phonetically and/or morphologically. Examples
tracted word, the corresponding standard Arabic form which of such words are given in table 6.
best fits. This achieves a dictionary MSA-Annaba’s dialect
and a written dialect corpus. In table 5, a sample of this dic- Table 6. Examples of Arabized Foreign words
tionary is given. We built a parallel corpus dialect-standard English Annaba Dialect Origin
Arabic by translating by hand the transcribed corpus. Man-
ual transcription and translation are very costly in time. We
Nurse úÎÓQ¯ farmlı̄ French ”Infirmier”
have managed to transcribe for the moment, only 30% of Place éCK. blās.ah French ”Place”
recordings. That’s enough ø QK yizzı̄ Berber
Ship PñJ.K. babūr Turkish
Table 5. A sample of the dictionary MSA-Annaba’s dialect.
Annaba Dialect
IK Qk. ǧrı̄t
MSA
Qk. ǧaraytu • Words for which we do not find any origins like XP@ñ
IK
àAJ.Ë@ āl-bustān
s.wārad (Money), ...
Ì
àAJm. lǧnān
àAJm.Ì lǧnān é®K YmÌ '@ āl-h.adı̄qah • Arabic words: The dialect of Annaba is largely based
@P ð wrā
Z@P ð warā-a on the standard Arabic. However, the words of Arab
Êg hallas. saddada
X Y
origin have undergone some distortions. In order to de-
termine these distortions, we computed the Levenshtein
AîDÓ ½J˘ Êg hallı̄k minhā AîDÓ ½«X da,ka minhā distance [12] between the original words and the dialec-
˘ tal ones. The results showed that the deformations per-
formed on Arabic words concern principally the pro-
nunciation. In fact, in general all consonants of the
6. ENRICHING CORPORA original word are kept in the dialect but the short vowels
are modified. In such cases, the Levenshtein distance is
As noted above, a machine translation system requires a large zero because actually the vowels are not written. For
amount of data. However, in order to increase the size of our other, original Arabic words are altered by inserting,
corpora, we propose to produce new sentences from the initial omitting or by substituting consonants. In general the
corpus. Producing new sentences is done by replacing each results achieved showed that the Levenshtein distance
word in the original sentence by its different synonyms. Each is less or equal to 3. For distances greater than three,
time a word is replaced by its synonym will produce a new words can be totally different, with the exception of
sentence which is added to the initial corpus. For the devel- one or two letters of Arabic words that are saved in
opment of such tool, we must necessarily start by the devel- some dialectal words. The dialect word can also be a
SLTU-2012 129
Fig. 1. A sample of parallel corpus MSA-Annaba’s dialect.
dialect words, their equivalents in standard Arabic and be used. So, to achieve that, we recorded real discussions be-
their corresponding Levenshtein distance. tween people that we transcribed. We subsequently developed
AD-MSA dictionary that we used to translate the dialect cor-
We summarize in figure 2 the results of the analysis per- pus in standard Arabic. We also presented a study of the di-
formed on the vocabulary constituting the corpus. alect’s vocabulary which has shown that it is mainly inspired
from standard Arabic (65%). The remaining words are either
100
of French origin (19%) or of Turkish or Berber (16%).The fu-
ture work, before developing a statistical machine translation
80 Fr.O = Words of French origin system, consists in increasing the available corpus. This will
Oth.O = Words of other origins be done by adapting to Annaba’s dialect the subtitles of some
60 Ar.O = Words of Arabic origin available Algerian movies. In fact, these ones are mostly ex-
pressed with the dialect of Algiers (Capital city of Algeria).
40
9. REFERENCES
20
[1] Abdel Monem A., Shaalan K., Rafea A., and Baraka
0
Ar.O Oth.O Fr.O
H., “Generating arabic text in multilingual
speech-to-speech machine translation framework,” in
Machine Translation, Springer, 2009.
Fig. 2. Results of corpus analysis in terms of percentage.
[2] Dina Al-Kassas, Une étude contrastive de l’Arabe et du
We can therefore confirm that the vocabulary of the di- Français dans une perspective de génération
alect of Annaba is a mixture of words from Standard Arabic, multilingue, Ph.D. thesis, Université PARIS 7, DENIS
French and the other words borrowed from other languages DIDEROT, UFR Linguistique, 2005.
such as Berber and Turkish or of unknown origin.
[3] Kirchhoff K., Bilmes J., Das S., Duta N., Egan M.,
8. CONCLUSION Ji G., He F., Henderson J., Liu D., Noamany M.,
Schone P., Schwartz R., and Vergyri D., “Novel
In this paper, we presented the main features of the dialect of approaches to arabic speech recognition: Report from
Annaba through a linguistic study. In our knowledge no other the 2002 johns-hopkins workshop,” in Proceedings of
corresponding work have been done before. the International Conference on Acoustics, Speech and
As we have already specified above, this work is part of a Signal Processing, Hong Kong, 2003.
project TORJMAN which is dedicated to translating standard
Arabic to Algerian Arabic dialects. To build a statistical ma- [4] Vergyri D. and Kirchhoff K., “Automatic diacritization
chine translation system, a parallel training data is necessary. of arabic for acoustic modeling in speech recognition,”
In the case of Annaba’s dialect, there is no corpus that can in Proceedings of the COLING Workshop on
SLTU-2012 130
Arabic-script Based Languages, Geneva, Switzerland,
2004.
[5] Belgacem M., “Construction dun corpus robuste de
différents dialectes arabes,” in proc. of Actes des
VIIIémes RJC Parole, Avignon, France, 2009.
[10] Debili F., Achour H., and Souissi E., “De l’étiquetage
grammatical à la voyellation automatique de l’arabe.,”
in Correspondances, 2002.
[11] De Lacy O’Leary, “Colloquial arabic,”
http://www.archive.org/details/colloquialarabic00oleauoft,
Digitized by the Internet Archive in 2007 with funding
from Microsoft corporation.
[12] Michael Gilleland, “Levenshtein distance, in three
flavors,” http://www.merriampark.com/ld.htm, 2012.
SLTU-2012 131
A. LEVENSHTEIN DISTANCE FOR DIALECT WORDS AND THEIR EQUIVALENTS IN MSA
SLTU-2012 132