Professional Documents
Culture Documents
I08 7017 PDF
I08 7017 PDF
I08 7017 PDF
99
The 6th Workshop on Asian Languae Resources, 2008
CRULP (on the lines of PENN Treebank tagset), developed by Urdu Dictionary Board of
available at its website www.CRULP.org. A cor- Government of Pakistan. See www.crulp.org/oud
pus of 100,000 words manually tagged on this tag for details.
set has also been developed based on text from CRULP has also developed a corpus based lexi-
Jang online news service. This CRULP POS con of 50,000 words with frequency data and an-
Tagged Jang News Corpus is available through the notation specifications defined by LC-STAR II
center. project (at http://www.lc-star.org/docs/LC-
Recently another corpus of about 40,000 words STAR_D1.1_v1.3.doc). Details of the lexicon an-
annotated with Named Entity tags was also made notation scheme are given by Ijaz et al. (2007).
available for Workshop on NER for South and There are also additional tools available through
South East Asian Languages organized at IJCNLP CRULP, and documented at its website, including
2008. The annotated corpus was donated by normalization, collations, spell checking, POS tag-
CRULP and IIIT Hyderabad and is available at ging and word segmentation applications.
http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5.
Tag set contains 12 tags. Details of these tags are 5 Conclusions
discussed at the link http://ltrc.iiit.ac.in/ner-ssea-
This paper lists some core linguistic resources of
08/index.cgi?topic=3. The CRULP portion of the
Urdu, available through CRULP and other sources.
data is also available at CRULP website, and is a
However, the paper identifies licensing constraints,
subset of the CRULP POS Tagged Jang News
a challenge for open distribution, which needs to
Corpus.
be addressed.
In earlier work at CRULP, a 230 spelling errors
corpus has also been developed based on typo-
graphical errors in Newspapers and student term
References
papers. See Naseem et al. (2007) for details. Gordon, Raymond G., Jr. (ed.). (2005). Ethnologue:
A corpus of Urdu Names has also been devel- Languages of the World, Fifteenth edition. Dallas,
oped by CRULP, based on the collective telephone Tex.: SIL International. Online ver-
directories of Pakistan Telecommunications Cor- sion: http://www.ethnologue.com/.
poration Limited (PTCL) from across all major Hardie, A. (2003). Developing a tag-set for automated
cities of Pakistan. A name list has also been ex- part-of-speech tagging in Urdu. In Archer, D, Ray-
tracted from the corpus for all person names, ad- son, P, Wilson, A, and McEnery, T (eds.) Proceed-
dresses and cities of Pakistan. ings of the Corpus Linguistics 2003 conference.
UCREL Technical Papers Volume 16. Department of
4 Lexica Linguistics, Lancaster University, UK.
Lexica are as critical for development of language Ijaz, M. and Hussain, S. (2007). Corpus Based Urdu
computing as corpora. One of the most Lexicon Development. In the Proceedings of Con-
comprehensive lexica available for Urdu was ference on Language Technology ‘07, University of
Peshawar, Peshawar, Pakistan.
recently released by CRULP (available through
CRULP website). The online version, called Naseem, T. and Hussain, S. (2007). Spelling Error
Online Urdu Dictionary (OUD) contains 120,000 Trends in Urdu. In the Proceedings of Conference
entries, with 80,000 words annotated with on Language Technology ‘07, University of Pesha-
significant information. The data of OUD is XML war, Peshawar, Pakistan.
tagged, as per the annotation schema discussed by McEnery, A., Baker, J., Gaizauskas, R. & Cunningham,
Rahman (2005; pp. 15), which contains about 20 H. (2000). EMILLE: towards a corpus of South
etymological, phonetic, morphological, syntactic, Asian languages, British Computing Society Machine
semantic and other parameters of information Translation Specialist Group, London, UK.
about a word. The dictionary also gives translation Rahman, S. (2005). Lexical Content and Design Case
of 12000 words in English and work is under way Study. Presented at From Localization to Language
to enable runtime user-defined queries on the Processing, Second Regional Training of PAN Local-
available XML tags. The contents of this lexicon ization Project. Online presentation version:
are based on the 21 volume Urdu Lughat http://panl10n.net/Presentations/Cambodia/Shafiq/Le
xicalContent&Design.pdf.
100