134692draft POS Tag Standard PDF
Copyright@TDIL
CONTENTS
1. INTRODUCTION
2. SCOPE
3. TERMINOLOGY
3.1 POS Tag
3.2 XML Schema
3.3 Metadata
4. WHAT IS A POS TAG
5. REQUIREMENTS OF A POS TAG
5.1 Need of XML Schema in designing common POS format
6. POS TAG SET FOR INDIAN LANGUAGES
7. XML INTERNATIONALIZATION BEST PRACTICES
7.1 What is Internationalization Tag Set (ITS)
8. XML SCHEMA
9. METADATA ON POS
10. ONE TO ONE MAPPING LABELS IN POS SCHEMA
11. POS SCHEMA BLOCK DIAGRAM
12. DRAFT POS SCHEMA FOR INDIAN LANGUAGES USING XML
13. ONE TO ONE MAPPING LABELS FOR INDIAN LANGUAGES
14. ALGORITHM FOR SELECTION OF NODES
15. REFERENCE BASED IMPLEMENTATION
16. REFERENCE
ANNEXURES
1. INTRODUCTION
Part-of-Speech (POS) tagging, which labels each word as a noun, pronoun, verb,
demonstrative, etc., is one of the key building blocks for developing Natural Language
Processing applications. This POS schema is based on W3C XML Internationalization best
practices, ISO 639-3 language codes for language identification, ISO 12620:1999 as the
metadata definition, and a one-to-one mapping table for all the labels used in the POS
Schema.
This document sets out the structural part of the XML Schema definition language and
explains how to build an XML POS Schema for tagging. It introduces the nature of XML
Schemas and the abstract data model of the XML POS Schema, defines the terminology used
throughout the document, and specifies the precise semantics of each component of the
abstract model and its representation in XML. The document also contains a block diagram
showing the flow of creating the XML schema for POS tagging, and an algorithm that
incorporates metadata as per ISO 12620:1999.
2. SCOPE
A common, unified XML-based POS Schema for Indian languages, based on W3C
Internationalization best practices, has been formulated. The schema has been developed
to take into account the NLP requirements of Web-based services in Indian languages.
This standard specifies the XML POS Schema for tagging and describes the labels that can
be used in it.
3. TERMINOLOGY
3.1 POS Tag: A POS tag is a label that marks the part of speech of a word. A
Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some
language and assigns a part-of-speech tag to each word.
3.2 XML Schema: An XML Schema expresses a shared vocabulary and allows machines to
carry out rules made by people. A schema defines a class of XML documents, so the
term "instance document" is often used to describe an XML document that conforms to
a particular schema.
3.3 Metadata: Metadata describes how and when and by whom a particular set of data
was collected, and how the data is formatted.
The input to a tagging algorithm is the string of words of a natural language sentence
and a specified tag set (a finite list of part-of-speech tags). The output is the single
best POS tag for each word.
A POS tagger can be used as a pre-processor. Text indexing and retrieval use POS
information, tagged corpora and Machine Translation systems are built with the help of
POS taggers, and speech processing uses POS tags to decide pronunciation.
A POS tagger is also needed to identify the tag for words that could not be analysed
by the morphological analyser. If the morphological analyser gives multiple tags for a
word, the tagger can be used to resolve the ambiguity.
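The input/output contract described above can be sketched as follows. This is a minimal illustration, not part of this standard: the toy lexicon stands in for a morphological analyser, and the function name tag_sentence, the lexicon entries and the tie-breaking rule (prefer the demonstrative reading when the next word can be a noun) are all assumptions. The tag labels follow the hierarchical pattern of the tag set in this document.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon standing in for a morphological analyser: each word
# maps to every tag the analyser might propose for it.
LEXICON: Dict[str, List[str]] = {
    "itu": ["PR__PRP", "DM__DMD"],  # ambiguous: pronoun or demonstrative
    "puttakam": ["N__NN"],
    "raajaa": ["N__NN"],
}


def tag_sentence(words: List[str], lexicon: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return exactly one tag for each input word."""
    tagged = []
    for i, word in enumerate(words):
        candidates = lexicon.get(word, [])
        if not candidates:
            # Word not analysed at all: fall back to the Unknown residual tag.
            tag = "RD__UNK"
        elif len(candidates) == 1:
            tag = candidates[0]
        else:
            # Illustrative disambiguation rule (an assumption): an ambiguous
            # word directly before a possible noun is read as a demonstrative.
            following = lexicon.get(words[i + 1], []) if i + 1 < len(words) else []
            next_is_noun = any(t.startswith("N__") for t in following)
            if next_is_noun and "DM__DMD" in candidates:
                tag = "DM__DMD"
            else:
                tag = candidates[0]
        tagged.append((word, tag))
    return tagged


print(tag_sentence(["itu", "puttakam"], LEXICON))
# [('itu', 'DM__DMD'), ('puttakam', 'N__NN')]
```

A real tagger would replace the single hand-written rule with a statistical or rule-based model; the sketch only shows that every word leaves with one tag drawn from the tag set.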
XML is needed for creating the POS tag set in order to standardize the POS tag
framework for all Indian languages.
The main benefits of using XML for the POS tag set for Indian languages (ILs) are:
• It supports multilingual documents and Unicode.
• XML allows developers to add extra information to a format without breaking
applications.
• XML documents can be stored without the involvement of a database administrator,
because they carry their own metadata in the form of tags and attributes.
• The tree structure of XML documents allows documents to be compared and
aggregated efficiently, element by element.
• XML documents can consist of nested elements that are distributed over multiple
remote servers.
• It is easier to convert data between different data types.
naTattam,
naTanam
5      Adjective      JJ
6      Adverb         RB                       Only manner adverbs
7      Postposition   PSP
8      Conjunction    CC      CC
8.1    Co-ordinator   CCD     CC__CCD
8.2    Subordinator   CCS     CC__CCS
8.2.1  Quotative      UT      CC__CCS__UT
9      Particles      RP      RP
9.1    Default        RPD     RP__RPD
9.2    Classifier     CL      RP__CL
9.3    Interjection   INJ     RP__INJ
9.4    Intensifier    INTF    RP__INTF
9.5    Negation       NEG     RP__NEG
10     Quantifiers    QT      QT
10.1   General        QTF     QT__QTF
10.2   Cardinals      QTC     QT__QTC
10.3   Ordinals       QTO     QT__QTO
11     Residuals      RD      RD
11.1   Foreign word   RDF     RD__RDF          A word written in script other than the script of the original text
11.2   Symbol         SYM     RD__SYM          For symbols such as $, & etc
11.3   Punctuation    PUNC    RD__PUNC         Only for punctuations
11.4   Unknown        UNK     RD__UNK
11.5   Echowords      ECH     RD__ECH
ਿਦੱ ਲੀ wAjamahila
ਤਾਜਮਿਹਲ
1.4    Nloc        NST    N__NST     ਉੱਤੇ, ਥੱਲੇ, ਅੱਗੇ, ਪਿੱਛੇ        uYwe, WaYle, aYge, piYCe
2      Pronoun     PR     PR         ਮੈਂ, ਤੂੰ, ਉਹ, ਇਹ, ਜੋ        mEz, wUM, uha, iha, jo
2.1    Personal    PRP    PR__PRP    ਮੈਂ, ਤੁੰ, ਉਹ               mEz, wuM, uha
ਪਿਹਲਾ pahilA
11     Residuals      RD      RD
11.1   Foreign word   RDF     RD__RDF                          A word written in script other than the script of the original text
11.2   Symbol         SYM     RD__SYM     $, &, *, (, )        For symbols such as $, & etc
11.3   Punctuation    PUNC    RD__PUNC    ., : ;               Only for punctuations
11.4   Unknown        UNK     RD__UNK
11.5   Echowords      ECH     RD__ECH     (ਪਾਣੀ-) ਧਾਣੀ, (ਚਾਹ-) ਚੂਹ    (pANI-) XANI, (cAha-) cUha
** The annotation is to be done using the lowest level tag of the type hierarchy. Once the
lower level tag is selected, the higher level tags should be stored automatically.
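The rule in the note above, that selecting the lowest-level tag implies all of its higher-level tags, can be sketched as below. The function name expand_tag is hypothetical; the only thing taken from this document is the annotation convention in which hierarchy levels are joined by a double underscore (for example CC__CCS__UT).

```python
from typing import List


def expand_tag(lowest_level_tag: str) -> List[str]:
    """List every level of the hierarchy implied by a lowest-level tag."""
    levels = lowest_level_tag.split("__")
    # Rebuild each ancestor by joining the first 1, 2, ... levels.
    return ["__".join(levels[: i + 1]) for i in range(len(levels))]


print(expand_tag("CC__CCS__UT"))  # ['CC', 'CC__CCS', 'CC__CCS__UT']
print(expand_tag("JJ"))           # ['JJ']
```

With such a helper the annotator only ever chooses the most specific tag, and the tool can store the ancestors automatically.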
3 Demonstrative DM DM
3.1 Deictic DMD DM__DMD
3.2 Relative DMR DM__DMR
3.3 Wh-word DMQ DM__DMQ
4 Verb V V
4.1 Main VM V__VM
4.1.1 Finite VF V__VM__VF
4.1.2 Non-finite VNF V__VM__VNF
4.1.3 Infinitive VINF V__VM__VINF
4.1.4 Gerund VNG V__VM__VNG
4.2 Verbal Noun Verbal noun NNV N_NNV Verbal Noun
raajaa,
puttakam
kaNNaaTi,
paTam
2 Pronoun PR PR itu,atu,avan
9 Particles RP RP maTTUm,
kuuTa
9.1 Default RPD RP__RPD maTTUm,
kuuTa
9.2 Classifier CL RP__CL Not required
10 Quantifiers QT QT koncam,
niRaiya, oru,
mutal
10.1 General QTF QT__QTF koncam,
niRaiya
10.2 Cardinals QTC QT__QTC onRu, iraNTu
mOhan
vItu
vellam,
pattam
പരസ്
രം
2.5 Wh-word PRQ PR__PRQ aaru, evan ആര,
എവ൯,
3 Demonstrative DM DM aa-, ii-, ആ, ഈ
3.1 Deictic DMD DM__DMD atu, itu അത,
ഇത,
3.2 Relative DMR DM__DMR eetu ഏത
3.3 Wh-word DMQ DM__DMQ eetu, ennane ഏത,
എങ്ങ
4 Verb V V pO, kazhi, േപാ,
Annu,ciri
കഴി
ആണ(Cop
ula), ചിരി
4.1 Main VM V__VM pO, kazhi, േപാ,
cirri,Annu(c
opula) കഴി,
ആണ,
(copula),
ചിരി
4.1.1 Finite VF V__VM__VF pOyi, േപായി,
cirikkum,
kazhikkunnu ചിരി
Akunnu(copu
ക്ക,
la)
കഴിക്
ന്,
ആകുന്
(copula)
4.1.2 Non-finite VNF V__VM__VNF pOya, േപായ,
ciricca,
kazhicca ചിരിച,
കഴിച,
4.1.3 Infinitive VINF V__VM__VINF pOkku, േപാക്,
cirikkukayAl
kazhikkee, ചിരിക്
varAn/varuv
കയാല,
An
കഴിക്,
വരാ൯/
വരുവാ
൯
എന്ന,
എന്
ലും
എങ്കില
8.1 Co-ordinator CCD CC__CCD -um ഉം
(rAmanum)
pakshe, (രാമനും)
പെക,
ഒരു,
ധാരാളം
10.1 General QTF QT__QTF kuraccu, കുറച്,
niraccu,
dharalam
നിറച്,
ധാരാളം
രണ്ട
11 Residuals RD RD
11.1 Foreign word RDF RD__RDF
11.2 Symbol SYM RD__SYM $, &, *, (, ), $, &, *, (, ),
ruu.
രൂ
11.3 Punctuation PUNC RD__PUNC ., : ; ., : ;
11.4 Unknown UNK RD__UNK
11.5 Echowords ECH RD__ECH
kena, kAra,
2.6 Indefinite PRI PR__PRI keu
tAhale
8.2.1  Quotative      UT      CC__CCS__UT   ----                            Not required
9      Particles      RP      RP
9.1    Default        RPD     RP__RPD       to, ye
9.2    Classifier     CL      RP__CL        jana, khAnA
9.3    Interjection   INJ     RP__INJ       Are, ei, hAya
9.4    Intensifier    INTF    RP__INTF      bhiShaNa, khuba, sA~NghAtika
9.5    Negation       NEG     RP__NEG       nA, naYa, chhA.DA
10     Quantifiers    QT      QT
10.1   General        QTF     QT__QTF       kichhu, alpa, aneka
10.2   Cardinals      QTC     QT__QTC       eka, dui, tina
10.3   Ordinals       QTO     QT__QTO       prathama, paYalA, dvitIYa
11     Residuals      RD      RD
11.1   Foreign word   RDF     RD__RDF                                       A word written in script other than the script of the original text
11.2   Symbol         SYM     RD__SYM       $, &, *, (, )                   For symbols such as $, & etc
11.3   Punctuation    PUNC    RD__PUNC      ., : ;                          Only for punctuations
11.4   Unknown        UNK     RD__UNK
11.5   Echowords      ECH     RD__ECH       jala Tala, khAbAra dAbAra
** The annotation is to be done using the lowest level tag of the type hierarchy. Once the
lower level tag is selected, the higher level tags should be stored automatically.
राजा (raajaa-king), पुस्तक (pustaka-book)
जो (jo-who), तो (to-he)
2.1    Personal       PRP     PR__PRP       तो (to-he), मी (mee-I), तू (tu-you), ते (te-they), तुम्ही (tumhi-you)
4      Verb           V       V             (padalaa-fell down), गेला (gelaa-went), झोपला (jhopalaa-slept), आहे (aahe-is),
4
4.2    Auxiliary      VAUX    V__VAUX       आहे (is), लागला (started),
5      Adjective      JJ                    सुंदर (sundara-beautiful), चांगला (chaangalaa-good), मोठा (moThaa-big)
6      Adverb         RB                    लवकर (lavakar-fast), हळूहळू (haLuuhaLuu-slowly)
7      Postposition   PSP                   Not in Marathi
8      Conjunction    CC      CC            आणि (aaNi-and), कारण (kaaraN-because)
8.1    Co-ordinator   CCD     CC__CCD       आणि (aaNi-and), पण (paNa-but), परंतु (parantu-but)
8.2    Subordinator   CCS     CC__CCS       कारण की (kaaraN-because of), का की (kaaraN kii-because of), जर-तर (jara-tara-if-then)
8.2.1  Quotative      UT      CC__CCS__UT   असा, म्हणून
9      Particles      RP      RP            तर (tara),
9.1    Default        RPD     RP__RPD       तर (tara) (then)
9.2    Classifier     CL      RP__CL        Not required
9.3    Interjection   INJ     RP__INJ       अरेरे! (arere), ओहो! (oho-oh!)
9.4    Intensifier    INTF    RP__INTF      खूप (khoop-lot, very), बराच (baraach-too much), अतिशय (atishaya-too much, very)
9.5    Negation       NEG     RP__NEG       नको (nako-not), न (na-Na)
10     Quantifiers    QT      QT            थोडे (thode-few), जास्त (jaasta-lot), काही (kaahi-few), एक (eka-one), पहिला (pahilaa-first),
10.1   General        QTF     QT__QTF       थोडे (thoDe-few), जास्त (jaasta-lot), काही (kaahi-few)
10.2   Cardinals      QTC     QT__QTC       एक (eka-one), दोन (dona-two)
10.3   Ordinals       QTO     QT__QTO       पहिला (pahilaa-first), दुसरा (dusaraa-second)
11     Residuals      RD      RD
11.1   Foreign word   RDF     RD__RDF       A word written in script other than the script of the original text
‘pen’,
‘spectacles’
‘Mohan’,
‘Ravi’
1.3    Nloc           NST     N__NST        upar, nIche, ahIM            ‘up’, ‘down’, ‘in front’
2      Pronoun        PR      PR
2.1    Personal       PRP     PR__PRP       huM, tuM, te                 ‘me’, ‘you’, ‘he/she’
2.2    Reflexive      PRF     PR__PRF       pote, jAte, svayam           ‘herself/himself’
2.3    Relative       PRL     PR__PRL       je, te, jyAM                 ‘who’, ‘where’
2.4    Reciprocal     PRC     PR__PRC       aras-paras, paraspar         ‘mutually’, ‘each other’
2.5    Wh-word        PRQ     PR__PRQ       koN, kyAre, kyAM             ‘who’, ‘when’, ‘where’
2.6    Indefinite                           koI, kaIMK, kashuM           ‘someone’, ‘something’
3      Demonstrative  DM      DM
3.1    Deictic        DMD     DM__DMD       A                            ‘this’
3.2    Relative       DMR     DM__DMR       je, jeNe                     ‘which/who’, ‘whom’
3.3    Wh-word        DMQ     DM__DMQ       koN, shuM, kem               ‘who’, ‘what’, ‘why’
3.4    Indefinite                           koI, kaIMK, kashuM           ‘someone’, ‘something’
4      Verb           V       V
4.1    Main           VM      V__VM         khAshe, khAdhu               ‘will eat’, ‘ate’
4.2    Auxiliary      VAUX    V__VAUX       chhe, hatuM, karyuM          ‘is’, ‘was’, ‘did’
5      Adjective      JJ
6      Adverb         RB
7      Postposition   PSP
8      Conjunction    CC      CC
8.1    Co-ordinator   CCD     CC__CCD       ane, ke                      ‘and’, ‘or’
8.2    Subordinator   CCS     CC__CCS       tethI, evuM, kAraNke         ‘so’, ‘like that’, ‘because’
9      Particles      RP      RP
9.1    Default        RPD     RP__RPD       paNa, ja, tO                 ‘but’, emph, topic
9.2    Interjection   INJ     RP__INJ       hE !!, arrrE !!, O !!
9.3    Intensifier    INTF    RP__INTF      bahu, ghaNuM                 ‘very’, ‘much’
9.4    Negation       NEG     RP__NEG       nahi, na                     ‘no’
10     Quantifiers    QT      QT
10.1   General        QTF     QT__QTF       thoduM, ghaNuM               ‘little’, ‘much’
10.2   Cardinals      QTC     QT__QTC       eka, be, traN                ‘one’, ‘two’, ‘three’
10.3   Ordinals       QTO     QT__QTO       paheluM, bIjI                ‘first’ (neu), ‘second’ (fem)
11     Residuals      RD      RD
11.1   Foreign word   RDF     RD__RDF       tv, perasitemol
11.2   Symbol         SYM     RD__SYM       $, *, &
11.3   Punctuation    PUNC    RD__PUNC      , : ; {} ()
11.4   Unknown        UNK     RD__UNK
11.5   Echowords      ECH     RD__ECH       kAm-bAm, pANi-bANi
अश�
7 Postposition PSP खातीर, पासत, बगर,
कडेन, लागीं
8 Conjunction CC CC
8.1 Co-ordinator CCD CC__CCD आनी, वा
8.2 Subordinator CCS CC__CCS जाल्या, जर-तर,
दे खन
ू , म्हणल्य,
पण
ु न
ू
8.2. Quotative UT CC__CCS__UT अश�, क�
1
9 Particles RP RP
9.1 Default RPD RP__RPD बी, आद�, इत्या�
9.2 Classifier CL RP__CL (पांच) जाण
9.3 Interjection INJ RP__INJ आरे , चप
ू
9.4 Intensifier INTF RP__INTF उपाट, भरपूर
9.5 Negation NEG RP__NEG ना, न्य
10 Quantifiers QT QT
10.1 General QTF QT__QTF थोडे, चड, कांय, खब
ू
10.2 Cardinals QTC QT__QTC एक, दोन
10.3 Ordinals QTO QT__QTO पयल� , दस
ु र�
11 Residuals RD RD
11.1 Foreign word RDF RD__RDF
11.2 Symbol SYM RD__SYM &, $
11.3 Punctuation PUNC RD__PUNC .,?-/
11.4 Unknown UNK RD__UNK
11.5 Echowords ECH RD__ECH जोवण-�बवण
�कउछ, कोनो
4 Verb V V
4.1 Main VM V__VM चलबैत, रौपेत,
पढइत, खाइत,
सत
ु त
ै , हँसत
ै
9 Particles RP RP
9.1 Default RPD RP__RPD भ�र, यौ, हौ, रौ
तीन, चा�र
10.3 Ordinals QTO QT__QTO प�हल, दोसर,
तेसर, चा�रम
11 Residuals RD RD
11.1 Foreign word RDF RD__RDF A word
written in
script other
than the
script of the
original text
11.2 Symbol SYM RD__SYM $, , *, (, ) For symbols
such as $, &
etc
11.3 Punctuation PUNC RD__PUNC ., : ; Only for
punctuations
11.4 Unknown UNK RD__UNK
11.5 Echowords ECH RD__ECH जलखे (तलखे),
म�ट (स�ट)
** The annotation is to be done using the lowest level tag of the type hierarchy. Once the
lower level tag is selected, the higher level tags should be stored automatically.
(level 1) (level 2)
(kitaab) ﮐﺘﺎﺏ
-)ﻣﻌﺮﻓہ ﺭﺷﻤﯽ
((m‘aarefa ،(Rashmi)
(Ravi) ﺭﻭی
،(aage) ﺁﮔﮯ
(piiche) ﭘﻴﭽﻬﮯ
(jo) ﺟﻮ
following
words are
also placed:
chand,
b‘aaz,
fulaan, sab,
bahut. Can
we have a
category/
subtype
like
indefinitive
demonstrati
ve (DMI)?
،(sonaa) ﺳﻮﻧﺎ
(haMstaa) ﮨﻨﺴﺘﺎ
،(gayaa) ﮔﻴﺎ
،(sonaa) ﺳﻮﻧﺎ
(haMstaa) ﮨﻨﺴﺘﺎ
Hindi as
Hindi does
not have
enough
information
at
the word
level.
)ﻏﻴﺮﻣﺤﺪﻭ
ghair -ﺩ
mahdoo
(d
-)ﻣﺼﺪﺭ
(masdar
)ﺣﺎﺻﻞ
-ﻣﺼﺪﺭ
haasil-e-
(masdar
،(cauRaa) ﭼﻮڑﺍ
(uuMcaa) ﺍﻭﻧﭽﺎ
ﮐﻴﻮں ﮐہ
(kyoMki)
(balki) ﺑﻠﮑہ
-)ﺍﻗﺘﺒﺎﺳﯽ
iqtabaas
(ii
(bhii) ﺑﻬﯽ
،(jii) ﺟﯽ
،(ahaa) ﺍﮨﺎ
(vaah) ﻭﺍﻩ
(kasiir) ﮐﺜﻴﺮ
11 Residuals RD RD
word written in
of the
original
text.
(ڈﺍﻟﺮdollar),
(ﭘﺎﻭﻧﮉpound)
etc.
naa -)ﻧﺎﻣﻌﻠﻮﻡ
(m‘aaloom
(ﻭﺍﺋﮯ-)ﭼﺎﺋﮯ
caa‘e-) vaa‘e)
** The annotation is to be done using the lowest level tag of the type hierarchy. Once the lower
level tag is selected, the higher level tags should be stored automatically.
ITS is a technology for easily creating XML that is internationalized and can be
localized effectively.
Main Attributes:
• xml:lang: mark-up for natural language labelling, defined for the root element of the
document and for any element where a change of language may occur.
• its:dir: mark-up to specify text direction, defined for the root element of the
document and for any element that has text content.
• its:translateRule: elements that indicate which elements and attributes should be
translated, that is, which elements have non-translatable content.
• its:withinTextRule: elements that provide information related to text segmentation by
indicating which elements should be treated either as part of their parents or as a
nested but independent run of text.
• xml:id: mark-up for unique identifiers; elements with translatable content can be
associated with a unique identifier.
• its:locNote: mark-up for notes to localizers; it allows content authors to provide
localization-related notes as attribute values, or to point to the location of the
relevant note text.
[For more details http://www.w3.org/TR/xml-i18n-bp/]
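As a small illustration of the attributes listed above, the sketch below builds one element that carries xml:lang, xml:id and its:dir, using Python's standard xml.etree.ElementTree module. The element name sentence, the identifier s1 and the helper make_sentence are hypothetical and not part of this schema; the language value "ur" is given as a typical xml:lang value and is likewise an assumption.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"          # ITS namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"   # namespace of xml:lang / xml:id

# Ask ElementTree to serialize the ITS namespace with the usual "its" prefix.
ET.register_namespace("its", ITS_NS)


def make_sentence(text: str, lang: str, direction: str, sent_id: str) -> ET.Element:
    """Build a hypothetical <sentence> element with language, id and direction."""
    elem = ET.Element("sentence")
    elem.set(f"{{{XML_NS}}}lang", lang)       # natural language labelling
    elem.set(f"{{{XML_NS}}}id", sent_id)      # unique identifier
    elem.set(f"{{{ITS_NS}}}dir", direction)   # text direction: ltr or rtl
    elem.text = text
    return elem


elem = make_sentence("کتاب", "ur", "rtl", "s1")
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```

The same pattern applies to left-to-right scripts by passing "ltr" as the direction.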
8. XML SCHEMA
An XML Schema expresses a shared vocabulary and allows machines to carry out rules made
by people. A schema defines a class of XML documents, so the term "instance document"
is often used to describe an XML document that conforms to a particular schema. It
provides a means for defining the structure, content and semantics of XML documents.
[For more details http://www.w3.org/TR/1999/NOTE-xml-schema-req-19990215]
9. METADATA ON POS
Metadata:
Metadata describes how, when and by whom a particular set of data was collected, and
how the data is formatted. It is essential for understanding information stored in data
warehouses and has become increasingly important in XML-based Web applications.
XML Metadata:
In XML, metadata is built into the document. Every element has a tag that identifies
where the data is stored in the document. Descriptive tags give structure to the
document and indicate, to a limited extent, what the data means. The limitation is that
a tag supplies only a name, which has meaning only to someone who already understands
what the element or attribute stands for.
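The idea that tags act as built-in metadata can be illustrated with a short sketch. The record below is a simplified, flattened stand-in that borrows element names from the metadata listing in this section; the helper read_metadata and the flat layout are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Simplified, hypothetical record: element names follow the metadata listing
# in this document, but the flat structure is an assumption.
RECORD = """<languageSection>
  <language>en</language>
  <version>1.0.0</version>
  <registrationStatus>standard</registrationStatus>
  <creationDate>1999-01-01</creationDate>
</languageSection>"""


def read_metadata(xml_text: str) -> Dict[str, str]:
    """Map each child tag name to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


for name, value in read_metadata(RECORD).items():
    print(name, "=", value)
```

The tag names make the record self-describing, but a reader still has to know what, for example, registrationStatus is meant to convey.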
Metadata ()
{
<?xml version="1.0"?>
<datasm-categorySelection xmlns="http://www.isocat.org/ns/dcif" dcif-version="1.0">
<globalInformation>............</globalInformation>
<languageSection>
<language>en</language>
<identifier>............ </identifier>
<version>1.0.0</version>
<registrationStatus>standard</registrationStatus> <!-- registered as a standard -->
<origin>ISO 12620:1999
<author>................</author>
<domain>............</domain>
</origin>
<creation>
<creationDate>1999-01-01</creationDate>
</creation>
<descriptionSection>
<definitionClass>
<definition xml:lang="en">.......................</definition>
<source>ISO 12620:1999</source>
</definitionClass>
</descriptionSection>
</languageSection>
In order to develop a common framework for the XML-based POS schema across all 22 Indian
languages, the labels defined in the POS Schema for English need a one-to-one mapping to
their equivalents in the Indian languages. The XML schema needs to have a complete tree
structure, as depicted in the fig. below:
The common XML Schema would select a particular Indian Language by and the Schema
then needs to be transformed into the POS Schema for that particular language. The
language-specific POS Schema can be enabled by switching particular branches of the tree
structure ‘off’. This is represented schematically under the next heading, POS Schema
Block Diagram.
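Switching branches of the common tree 'off' for one language can be sketched as a simple filter over the hierarchical tags. The names COMMON_TAGS, BRANCHES_OFF and tags_for_language are hypothetical, and only a small slice of the tag set is listed. The Marathi entries reflect the remarks in the Marathi table of this document (Postposition: not in Marathi; Classifier: not required); treating any other language as having nothing switched off is an assumption of the sketch.

```python
from typing import Dict, List

# A small slice of the common tag set, in hierarchical form.
COMMON_TAGS: List[str] = [
    "PSP", "RP", "RP__RPD", "RP__CL", "RP__INJ", "RP__INTF", "RP__NEG",
]

# Branches switched off per language, keyed by ISO 639-3 code.
BRANCHES_OFF: Dict[str, List[str]] = {
    "mar": ["PSP", "RP__CL"],
}


def tags_for_language(lang: str) -> List[str]:
    """Return the common tags minus every branch switched off for lang."""
    off = BRANCHES_OFF.get(lang, [])

    def is_off(tag: str) -> bool:
        # A tag is off if it is the branch root itself or one of its descendants.
        return any(tag == branch or tag.startswith(branch + "__") for branch in off)

    return [tag for tag in COMMON_TAGS if not is_off(tag)]


print(tags_for_language("mar"))
# ['RP', 'RP__RPD', 'RP__INJ', 'RP__INTF', 'RP__NEG']
```

Because the filter works on tag prefixes, switching off a top-level category also removes all of its subtypes.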
[Block diagram: Declare Metadata, followed by Select Script (Devanagari, Malayalam,
Bangla, Perso-Arabic, ...; n=12)]
Pos schema ()
<titleStmt>
<script>.................. </script>
<language>multilingual</language>
<type>multimodal</type>
--------------------------------------Noun Block--------------------------------------
<xs:element name="cat" POS cat=”noun” hin-cat=”सं�ा” brx-cat=”मंम
ु ा” mal-cat=”നാമം”
-------------------------------------Pronoun Block-----------------------------------
<xs:element name="cat" POS cat=”Pronoun” hin-cat=”सवर्ना” brx-cat=”मुंराइ” mal-
cat=”സരവവ്നാ” kas-cat=” ” ﭘَﺮﻧﺎ ُﻭﺕasm-cat=”সবর্না” kok-cat=”सवर्ना” guj-
cat”સવર્ના” tag=”PR”>
kok-cat=”पर
ु ूश सवर्न” guj-cat”�ુ�ુષવાચક” tag=”PRP">
----------------------------------Demonstrative Block------------------------------
<xs:element name="cat" POS cat=”Demonstrative” hin-cat=”�नष्चयवाच” brx-cat=”थाव�न
�दिन्थग” mal-cat=”നിരേദശകം” kas-cat=”ﺮﻧﺎﻭﺗۍ
ٕ َ ” ﮨﺎ َﻭﻥ ﭘasm-cat=”িনেদর ্শেবাধ” kok-
cat=”दशर्” guj-cat”દશર્ક” tag=”DM”>
-------------------------------------Verb Block---------------------------------------
<xs:element name="cat" POS cat=”Verb” hin-cat=”�क्र” brx-cat=”थाइजा” mal-cat=”്രകി”
------------------------------------Adjective Block----------------------------------
<xs:element name="cat" POS cat=”Adjective” hin-cat=”�वशे�ण” brx-cat=”थाइला�ल” mal-
cat=”നാമ വിേശഷണം” kas-cat=” ” ﺑﺎ ُﻭﺕasm-cat=”িবেশষণ” kok-cat=”�वशेशण” guj-
cat”િવશેષણ” tag=”JJ”>
---------------------------------------Adverb Block----------------------------------
<xs:element name="cat" POS cat=”Adverb” hin-cat=”�क्र �वशे�ण” brx-cat=”थाइजा�न
थाइला�ल” mal-cat=”്രകിയ വിേശഷണം” kas-cat=” ” ﻟَﮕ ٕہ ﺑٲﺵasm-cat=”ি�য়া িবেশষণ”
------------------------------------Conjunction Block-------------------------------
<xs:element name="cat" POS cat=”Conjunction” hin-cat=”योजक” brx-cat=”दाजाब महर�थ”
mal-cat=”സമുച്ച” kas-cat= ” ” ﻭﺍﮢ َﻮﻥasm-cat=”সংেযাজক” kok-cat=”जोड अव्य” guj-
cat”સંયોજકો” tag=”CC”>
------------------------------------Particles Block------------------------------------
<xs:element name="cat" POS cat=”Particles” hin-cat=”अव्य” brx-cat=”महर�थ” mal-
cat=”നിപാദം” kas-cat=” ” ﮢﻮﮢ ٕہ َﻭ ٕﻧﺘۍasm-cat=”আনুষংিগক অবয্” kok-cat=”अव्य” guj-
cat”િનપાત” tag=”RP”>
cat”સ્વયં �” tag=”RPD">
------------------------------------Quantifiers Block--------------------------------
<xs:element name="cat" POS cat=”Quantifiers” hin-cat=”संख्यावाच” brx-cat=”�बबां
�दिन्थग” mal-cat=”സംഖ�ാവാചി i” kas-cat=” ﻨﺪٛ ” ﮔﺮﻳasm-cat=”পিৰমাণবাচক” kok-
------------------------------------Residuals Block----------------------------------
<xs:element name="cat" POS cat=”Residuals” hin-cat=”अवशेष” brx-cat=”आद्” mal-
</xs:attribute>
</xs:element> </xs:schema>
}
To incorporate this facility in the XML Schema, a common one-to-one mapping table for
the labels has been developed, as presented in Table 1, Table 2 and Table 3.
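A one-to-one label mapping of this kind can be sketched as a dictionary together with a check that no two English labels share the same target label. The sample entries are taken from the Marathi column of the mapping table in this document; the names EN_TO_MAR, is_one_to_one and invert are hypothetical.

```python
from typing import Dict

# English label -> Marathi label (a few rows from the mapping table).
EN_TO_MAR: Dict[str, str] = {
    "Noun": "नाम",
    "Pronoun": "सर्वनाम",
    "Verb": "क्रियापद",
    "Residuals": "शेष",
}


def is_one_to_one(mapping: Dict[str, str]) -> bool:
    """True when every target label is used by exactly one source label."""
    return len(set(mapping.values())) == len(mapping)


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse lookup; only valid for a one-to-one mapping."""
    if not is_one_to_one(mapping):
        raise ValueError("mapping is not one-to-one")
    return {target: source for source, target in mapping.items()}


MAR_TO_EN = invert(EN_TO_MAR)
print(is_one_to_one(EN_TO_MAR), len(MAR_TO_EN))
```

The one-to-one property is what allows a label in any language to be translated back to the common English label without ambiguity.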
Languages: Assamese, Bodo, Kashmiri (Urdu Script), Kashmiri (Hindi Script), Marathi
S.No English Hindi Assamese Bodo Kashmiri Kashmiri Marathi
(Hindi)
1 Noun सं�ा িবেশষয मुंमा ﻧﺎ ُﻭﺕ नावुत नाम
common जा�तवाचक জািতবাচক फोलेर �दिन्थग ﻋﺎﻡ आम सामान्य
नाम
Proper व्य��वाच বয্ি�বাচ मंु �दिन्थग ﺧﺎﺹ ख़ास विशेष नाम
Verbal �क्रयामू ি�য়াবাচক हाबा �दिन्थग ﺍﻭﺗٲﻭۍ ٛ
ٕ ﮐﺮ क्रावतां धातुसाधित
नाम
/ कृदं त
Nloc दे श-काल �ানবাচক थाव�न �दिन्थग मुंमा ﻧﺎﻭﺗ ٕہ ﺟﺎﻳ ِہ ﮨﺎﻭ नाव त देश
जा�य हाव कालवाचक
सापे�
नाम
2 Pronoun सवर्ना সবর্না मंरु ाइ ﭘَﺮﻧﺎ ُﻭﺕ पर नावत
ु सर्वनाम
Personal व्य��वाच বয্ি�বাচ संबुं �दिन्थग ﺷﺨﺼﻴٲﺗﯽ शिख्सयांत पुरुषवाचक
सोमोन्द बो�हमी
Relative संबंध- वाचक স��বাচক सोमोन्दो ﺭٲﺑِﺘٲﻭۍ रो�बतांव् संबंधवाची
�दिन्थग
प्र�वा स��थ �दिन्थग ک ﻟﻔﻆ क-लफ़् प्रश्नार्थक
Wh-words ��েবাধক
সবর্না
Indefinite अ�न�यवाचक
Indefinite अ�न�यवाचक NA NA NA NA NA
4 Verb �क्र ি�য়া थाइजा ٚ
ﮐﺮﺍ ُﻭﺕ क्राव क्रियापद
Auxiliary सहायक সহায়কাৰী
लेङाइ थाइजा ڈﮐﻬ ٕہ ﮐﺮﺍ ُﻭﺕ डख क्राव सहायकारी
Verb ি�য়া क्रियापद
�क्र
Main Verb मुख्य মুখয্ ি�য় गुबै थाइजा ﺭﺍے ﮐﺮﺍ ُﻭﺕ राय क्राव मुख्य
�क्र क्रियापद
Finite प�र�मत সমািপকা जाफुंजा थाइजा ﺸﺮ ﮨﺎﻭ
ٕ ِﮨ �हशर आख्यात
हाव क्रियारूप
িবেশষণ
7 Post परसगर অনুসগর सोदोब उन महर�थ ﭘﻮﺕ ﺟﺎےٚ पोत अंत्यस्थान
Position
जाय
8 Conjunction योजक সংেযাজক दाजाब महर�थ ﻭﺍﮢ َﻮﻥ राटवन उभयान्वयी
अव्यय
Co-ordinator समन्वय সম�য়ক लोगो महर ﻭﺍﮢُﺖ वाटत/ NA
वाटथ
Subordinator अधीनस् NA लेङाइ लोगो महर ﺗﺤﺘ ُﻮﻥ तहतून NA
Quotative उ��-वाचक NA मुंख’�थ َﺩﭘَﻦ ﻧِﺸﺎﻧ ٕہ दपन उद्गारवाचक
�नशान
अव्य महर�थ ﮢﻮﮢ ٕہ َﻭ ٕﻧﺘۍ टोट वनत्
আনুষংিগক
9 Particles
অবয্
अव्यय/
निपात
Default व्य�तक गोरोिन् ِڈﻓﺎﻟﭧ �डफाल् सामान्य
वग�कारक �थ �दिन्थग् َﻭﺭ ٕﮔﮩﺎ वरगहा NA
Classifier িনিদর ্�তাবাচক
সগর
दाजाबदा
Interjection �वस्मया� িব�য়েবাধক सोमोनांनाय ژﻫﮣُﺖ छटत/ विस्मयवाचक
ग्रॆ
Ordinals क्रमसू �মবাচক
फा�र �बसान ﻨﺪٚٴﻭﻧۍ ﮔﺮﻳ वेन्य क्रमवाचक
সংখয্াবাচক
শ� ग्रॆ
11 Residuals अवशेष NA आद् ﺑﺎﻗﻴٲﺗﯽ बाक़यांती शेष
Foreign �वदे शी িবেদশী শ� गुबुन हादरा�र ﻏٲﺭ ُﻣﻠﮑﯽ गोर मुल्क� विदेशी शब्द
word ﻟَﻔﻆ लफ़ुज़
शब् सोदोब
Symbol प्रत �তীক नेस�न ﻋﻼ َﻣﺖ अलामत चिन्ह
Unknown अ�ात অ�াত �म�थ�य ﺍَﺯﻭﻥ अज़ोन अज्ञात
Punctuation �वरामा�द-�च� যিত িচন थाद’�सन खािन् ِ َﻟ
ﮩﺠ َﻮﻥ लहिजवन विरामचिन्हे
�क्रया)
Auxiliary Non
Finite
(अपणू ् पालवी
र
�क्रया)
Main Verb मुख्य �क् మ�ఖయ కరయ ്രപധാ ്രകി குறை எச்சம் मुखेल �क्रया
Finite प�र�मत సమపక പൂരണ ്രകി வினைப் பெயர் �न�ीत �क्रया
Infinitive �क्रयाथर्क स త�మ�ననరథకం ്രകിയാരൂപ வினை எச்சம் सादारण रू
Gerund �क्रयावा కరయవచకం NA பெயரடை �क्रयावा नाम
Non-Finite गैर-प�र�मत అసమపక
അപൂരണ ്രകി வினையடை अ�न�ीत
�क्रया
Participle कृदं त परक नाम NA NA பின்னுருபு NA
Noun
�वशेषण �वशेशण
നാമ വിേശഷണം இணைப்புச்
5 Adjective వశషణం
சொல்
�क्र-�वशेषण �क्रया�वशे
്രകിയ இணை
6 Adverb కరయవశషణం വിേശഷണം
இணைப்புச்
சொல்
7 Post परसगर పరసరగ അനു്രപേയ சார்பு संबंद� अव्य
Position ഗം இணைப்புச்
சொல்
8 Conjunction योजक సమ�చఛయం സമുച്ച நிரப்பு जोड अव्य
இடைச்சொல்
Co-ordinator
समन्वय సమనధకరణం ഏേകാപിത இடைச்சொல் समानाधीकरण जोड
സമുച്ച अव्य
Subordinator
अधीनस् వయధకరణం ആശ്ചര� முன்னிருப்பு आश्र जोड
ചക
अव्य
സമുച്ച
Quotative उ��-वाचक అనుకరకం ഉദ്ധാരണ இனப்பிரிப்பு अवतरणअथ�- उतर
ചി സമുച്ച ஒட்டு
9 Particles अव्य అవయయం നിപാദം வியப்பிடைச் अव्य
சொல்
Default व्य�तक వయతకరమం സാമാന�ം எதிர்மறை सरभरस अव्य
Classifier वग�कारक వరగకరకం വരഗ്ഗ மிகுவிப்பான் वगर् अव्य
Interjection �वस्मया�दबोध వసమయదబ� ధకం വ�ാേക്ഷപ அளவையடை उमाळी अव्य
Negation नकारात्म నకరతమకం നിേഷദം பொது न्हयकार अव्य
Intensifier तीव् అతశయరథకం തീ്ര നിപാദം எண்ணுப் तीव्रका अव्य
பெயர்
10 Quantifiers संख्यावाच సంఖయవచకం സംഖ�ാവാചി எண்ணு संख्यादशर
முறைப் பெயர்
Display (Metadata)
Eg: {
……………………………………………..
End if
Eg: {
……………………………………………..
End if
Eg: {
……………………………………………..
End if
Eg: {
……………………………………………..
End if
Eg: {
……………………………………………..
End if
Eg: {
……………………………………………..
End if
Eg: {
cat=”�િતવાચક” tag=”NN">
cat=”વ્ય�ક્તવા” tag=”NNP">
cat=”�ુ�ુષવાચક” tag=”PRP">
cat=”પ્રિત�બ��” tag=”PRF">
……………………………………………..
End if
Hindi
1. सप्तपुरियों\N_NNP के\PSP दर्शन\N_NN से\PSP मिलता\V_VM है\V_VM मोक्ष\N_NN !\RD_PUNC
2. हिंदू\N_NN धर्म\N_NN में\PSP तीर्थ\N_NN का\PSP बड़ा\JJ महत्व\N_NN है\V_VM ।\RD_PUNC
3. यूँ\RP_RPD तो\RP_RPD हर\QT_QTF तीर्थ\N_NN बड़ा\JJ और\CC_CCD अहम\JJ है\V_VM ,\RD_PUNC लेकिन\CC_CCS सात\QT_QTC स्थान\N_NN की\PSP बड़ी\JJ महत्ता\N_NN और\CC_CCD मान्यता\N_NN है\V_VM ।\RD_PUNC
4. ये\DM_DMD सातों\QT_QTC धर्मस्थल\N_NN सात\QT_QTC नगरों\N_NN या\RP_RPD सप्तपुरियों\N_NNP के\PSP रूप\N_NN में\PSP ग्रंथों\N_NN में\PSP वर्णित\V_VM हैं\V_VAUX ।\RD_PUNC
5. ऐसा\DM_DMD कहा\V_VM गया\V_VAUX है\V_VAUX कि\CC_CCS चतुर्मास\N_NNP में\PSP इन\DM_DMD सप्तपुरियों\N_NNP का\PSP दर्शन\N_NN मोक्ष\N_NN प्रदान\N_NN करने\V_VM वाला\PSP होता\V_VM है\V_VAUX ।\RD_PUNC
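The tagged Hindi sentences above write each token as word\TAG. A minimal sketch of reading that format back into (word, tag) pairs follows; the function name parse_tagged_line and the regular expression are assumptions, not part of this standard. The tags are returned exactly as written in the sample, which uses a single underscore (for example V_VM) where the tag tables use a double underscore.

```python
import re
from typing import List, Tuple

# A token is a run of non-space, non-backslash characters, optional spaces,
# a backslash, and then the tag (letters and underscores).
TOKEN = re.compile(r"([^\s\\]+)\s*\\([A-Za-z_]+)")


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split a 'word\\TAG word\\TAG ...' line into (word, tag) pairs."""
    return TOKEN.findall(line)


sample = r"बड़ा\JJ और\CC_CCD अहम\JJ है \V_VM ।\RD_PUNC"
for word, tag in parse_tagged_line(sample):
    print(word, tag)
```

A leading sentence number such as "1." is skipped automatically, because it is not followed by a backslash and a tag.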
Punjabi
ਹੈ\V_VAUX |\RD_PUNC
Tamil
Malayalam
Bangla
Marathi
Gujarati
માન્યત\N_NN છે .\V_VM
Konkani
Urdu
Oriya
16. REFERENCE
ANNEXURE-1
LANGUAGE TAGS
CONTRIBUTORS
1. Ms. Swaran Lata, Department of Information Technology, New Delhi
2. Prof. Girish Nath Jha, JNU, New Delhi
3. Dr. Somnath Chandra, Department of Information Technology, New Delhi
4. Dipti Misra Sharma, LTRC, IIIT-H
5. Somi Ram CDAC, NOIDA
6. Prof. Uma Maheswara Rao G, University of Hyderabad
7. Dr. Sobha L, AU-KBC, Chennai
8. Menak. S,
9. Kalika Bali, Microsoft, Bangalore
10. Prof. Pushpak Bhattacharyya, IIT-Bombay
11. Prof. Malhar Kulkarni, IIT-Bombay
12. Lata Popale, IIT-Bombay
13. Kirtida Shah, Gujarati University, Ahmedabad
14. Mona Parakh, LDCIL, Mysore
15. Jyoti Pawar, Goa University
16. Madhavi Sardesai, Goa University
17. Ramnath,
18. Aadil Kak, University of Kashmir
19. Nazima, University of Kashmir
20. Dr. Richa, LDCIL, Mysore
21. Mazhar Mehdi Hussain, JNU, New Delhi
22. Mr. Prashant Verma, W3C India, New Delhi
23. Swati Arora, W3C India, New Delhi