Automatic Annotation
• Some of these tools are freeware and can typically be found through universities such as:
• Stanford University
• University of Illinois at Urbana-Champaign
• First, being software programs rather than humans, these tools are never 100% accurate; only human annotators can be.
• Second, automatic tools are limited in the languages and linguistic levels they cover. For example, there are no semantic role labelers for Arabic, and there are no tools that analyze discourse structure, such as identifying a topic sentence, supporting sentences, and a concluding sentence.
• Automatic annotation tools can carry out only simple tasks, such as assigning grammatical classes to words. Semantic and morphological analysis are also possible, but culture-specific tasks, such as identifying figures of speech and idioms, cannot be annotated automatically; they require human annotation.
Session 7 - Corpus Annotation Tools 2
Part-of-Speech Taggers 1
• The first type of automatic annotation tools we will talk about is POS taggers.
• They take raw corpora as input and give the grammatical category of each word as
output.
• Language ambiguity: in many cases a word can belong to two or more grammatical classes, as in I still need to book my ticket versus I’m reading a new book now.
Part-of-Speech Taggers 2
• How can POS taggers overcome ambiguity?
• Through context: linguists and programmers feed the taggers contextual rules that help with disambiguation, such as rules about which grammatical classes can follow a given word.
• Some of these rules are 100% true, but some are not. For example, if a rule said ‘words after TO are verbs’, would it always be true? No: in I went to school, the word after to is a noun.
• That is why manual annotation is more accurate than automatic annotation. However, it is better to depend on automatic annotation with a margin of error than to start from scratch and spend hours and hours annotating the corpus manually.
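The kind of rule-based disambiguation described above can be sketched in a few lines of Python. The tiny lexicon and the two context rules below are invented for illustration and are far simpler than what real taggers use, but they show both how context resolves the book ambiguity and how the ‘words after TO are verbs’ rule fails:

```python
# A toy contextual disambiguator. The lexicon and rules are invented
# for illustration; real taggers use far richer information.

def tag(tokens):
    """Assign a coarse part-of-speech tag to each token using context rules."""
    lexicon = {"i": "PRON", "a": "DET", "the": "DET", "my": "DET",
               "to": "TO", "went": "VERB", "need": "VERB", "still": "ADV"}
    tags = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word in lexicon:
            tags.append(lexicon[word])
        elif i > 0 and tags[i - 1] == "DET":
            tags.append("NOUN")   # rule 1: a word after a determiner is a noun
        elif i > 0 and tags[i - 1] == "TO":
            tags.append("VERB")   # rule 2: a word after TO is a verb (not always true!)
        else:
            tags.append("NOUN")   # default guess
    return tags

# Context disambiguates 'book':
print(tag(["I", "need", "to", "book", "my", "ticket"]))  # 'book' -> VERB
print(tag(["I", "need", "a", "book"]))                   # 'book' -> NOUN
# ...but rule 2 fails here: 'school' is wrongly tagged VERB.
print(tag(["I", "went", "to", "school"]))
```

Real taggers replace such hand-written rules with statistical models trained on manually annotated corpora, which is why their rules hold only most of the time.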
Part-of-Speech Taggers 3
• Since there are many POS taggers for English, how can we choose the best one
for our project?
• There are two criteria:
1- Accuracy rate: we calculate the accuracy rate on a small portion of our corpus, similar to the steps for choosing an OCR tool. Try more than one tagger on a sample of the data and choose the one with the highest accuracy rate.
• Whenever there are multiple available tools that can perform the required task, you need an objective measure like this to help you choose the best one.
• Accuracy rate = (number of correctly tagged words ÷ total number of words) * 100
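The accuracy comparison in step 1 can be sketched as follows. The tags, the gold (manually annotated) sample, and the two hypothetical taggers’ outputs are all invented for illustration:

```python
# Compare each tagger's output against a manually annotated (gold) sample
# and keep the tagger with the highest accuracy rate.

def accuracy_rate(predicted, gold):
    """Percentage of tokens whose predicted tag matches the manual (gold) tag."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) * 100

gold     = ["PRON", "VERB", "TO", "VERB", "DET", "NOUN"]
tagger_a = ["PRON", "VERB", "TO", "VERB", "DET", "NOUN"]  # 6/6 correct
tagger_b = ["PRON", "NOUN", "TO", "NOUN", "DET", "NOUN"]  # 4/6 correct

print(accuracy_rate(tagger_a, gold))            # 100.0
print(round(accuracy_rate(tagger_b, gold), 1))  # 66.7
```

On this (tiny) sample, tagger A would be the one to choose; in practice the sample should be large enough for the difference to be meaningful.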
Part-of-Speech Taggers 3 (cont.)
• Since there are many POS taggers for English, how can we choose the best one
for our project?
2- The tagset: the set of labels used by the part-of-speech tagger to mark the grammatical class of each word.
• The classification of grammatical classes differs from one tagger to another. There is no single standard set that all taggers use; instead, every group of linguists chooses its own tagset.
• Before choosing a tagger, I should look at its tagset to see which grammatical classes it marks and whether they fit my research needs.
Session 7 - Corpus Annotation Tools 6
Part-of-Speech Taggers 4
• Tagsets differ in how detailed they are. For example, suppose you want to know whether reflexive pronouns can be used in UN formal writing. Which tagger would you choose?

Pronoun Tags
CLAWS                              TreeTagger
PN1   indefinite pronoun (one)     PRP    personal pronoun
PNP   personal pronoun (he/I)      PRP$   possessive pronoun (his)
PNQ   wh-pronoun (who)
PNX   reflexive pronoun (herself)

• You had better use CLAWS, since it is the one that distinguishes reflexive pronouns from other types of pronouns, even if it is less accurate.
• For pronouns, CLAWS uses four different tags, while TreeTagger uses one tag for all personal pronouns.
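The practical consequence of tagset granularity can be shown in a short sketch. The tagged examples below are invented (and simplified) for illustration; only the pronoun tags follow the CLAWS and TreeTagger conventions discussed above:

```python
# With CLAWS-style tags, reflexive pronouns can be filtered directly
# by their dedicated tag (PNX). With TreeTagger-style tags, 'she' and
# 'herself' share one tag (PRP), so the tags alone cannot separate them.

claws_tagged = [("she", "PNP"), ("taught", "VVD"), ("herself", "PNX"),
                ("who", "PNQ"), ("one", "PN1")]
treetagger_tagged = [("she", "PRP"), ("taught", "VVD"), ("herself", "PRP"),
                     ("who", "WP"), ("his", "PRP$")]

# CLAWS: reflexive pronouns have their own tag.
reflexives = [w for w, t in claws_tagged if t == "PNX"]
print(reflexives)  # ['herself']

# TreeTagger: the reflexive is indistinguishable from other personal pronouns.
prp_words = [w for w, t in treetagger_tagged if t == "PRP"]
print(prp_words)  # ['she', 'herself']
```

This is why the tagset, not just the accuracy rate, must match the research question before a tagger is chosen.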
Shallow Parsers
• Shallow parsers also work on the syntactic level but they try to analyze phrase
structure, that is to say, they try to identify: noun, verb, adjectival, adverbial,
and prepositional phrases. It doesn’t work at the word level but the phrase level.
• On the table
• There is a POS tagger and shallow parser for Arabic called MADAMIRA
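A shallow parser’s grouping step can be sketched over already-tagged input. The grammar below is a toy (two patterns: NP = determiner plus adjectives/nouns, PP = preposition plus NP), not the output of any real parser such as MADAMIRA:

```python
# A toy chunker: group (word, tag) pairs into simple NP and PP chunks
# instead of analyzing the full sentence structure.

def chunk(tagged):
    """Group (word, tag) pairs into NP and PP chunks; pass other words through."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        word, tag = tagged[i]
        if tag == "PREP":
            # PP = preposition + the noun phrase that follows it
            phrase, j = [word], i + 1
            while j < n and tagged[j][1] in ("DET", "ADJ", "NOUN"):
                phrase.append(tagged[j][0])
                j += 1
            chunks.append(("PP", " ".join(phrase)))
            i = j
        elif tag in ("DET", "ADJ", "NOUN"):
            phrase, j = [], i
            while j < n and tagged[j][1] in ("DET", "ADJ", "NOUN"):
                phrase.append(tagged[j][0])
                j += 1
            chunks.append(("NP", " ".join(phrase)))
            i = j
        else:
            chunks.append((tag, word))  # leave other words unchunked
            i += 1
    return chunks

tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "PREP"), ("the", "DET"), ("table", "NOUN")]
print(chunk(tagged))
# [('NP', 'the cat'), ('VERB', 'sat'), ('PP', 'on the table')]
```

Note that the parser never asks what the sentence as a whole means or which phrase is the subject; that deeper analysis is the job of deep parsers, discussed next.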
Deep Parsers
• Deep parsers also work on syntactic annotation. They analyze the
entire sentence to identify the Subject and the Predicate.
• For a computer, all these versions of the same word (كتاب، الكتاب، فالكتاب، بالكتاب، والكتاب، فكتاب، وكتاب) are totally different words. So, if you want to extract the collocations of كتاب from an Arabic corpus, you need to split off all attachments such as و، فـ، الـ، and بـ.
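The attachment-splitting step can be sketched naively in Python. The prefix list and ordering below are simplified assumptions for illustration; stripping by surface form alone would over-strip real words that merely begin with these letters, which is why tools like MADAMIRA use full morphological analysis instead:

```python
# Strip common Arabic proclitics so surface variants of كتاب are
# normalized to one form before collocation extraction. Simplified:
# real segmentation needs morphological analysis, not string slicing.

CONJUNCTIONS = ("\u0648", "\u0641")   # و (and), فـ (so/then)
PREPOSITION  = "\u0628"               # بـ (by/with)
ARTICLE      = "\u0627\u0644"         # الـ (the)

def strip_clitics(word):
    """Remove a leading conjunction, then بـ before الـ, then الـ itself."""
    if word[:1] in CONJUNCTIONS:
        word = word[1:]
    if word[:1] == PREPOSITION and word[1:3] == ARTICLE:
        word = word[1:]
    if word[:2] == ARTICLE:
        word = word[2:]
    return word

variants = ["كتاب", "الكتاب", "فالكتاب", "بالكتاب", "والكتاب", "فكتاب", "وكتاب"]
print({strip_clitics(w) for w in variants})  # {'كتاب'}
```

After normalization, all seven surface forms count as occurrences of the same word كتاب, so its collocations can be extracted from the corpus correctly.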
• Sources
• https://framenet.icsi.berkeley.edu/fndrupal/glossary
• https://framenet.icsi.berkeley.edu/fndrupal/the_book
Frame Semantics
• Charles Fillmore’s (1982) Frame Semantics offers an account of how the meanings of words are understood.
• The theory revolves around the notion that the meaning of a word is best understood not in isolation, but through the semantic frame with which the word is associated.
• For example, in we never open our presents until the morning, the frame of Christmas is invoked: the interpreter relies on cues in the text, such as presents and morning, to invoke that frame. Interpreters who share the same cultural experience will all invoke the same frame.