
Automatic Corpus Annotation

• Raw and annotated corpora


• Raw corpus from Gutenberg project
• Annotated corpus with linguistic analysis
• Automatic annotation relies on off-the-shelf (ready-made) tools to do the linguistic analysis at levels such as:
• Syntactic
• Semantic
• Morphological
• Automatic annotation tools do the linguistic annotation without any human
intervention: you input your text and get the annotations as output.

• Some of these tools are freeware and can typically be found through universities such as:
• Stanford University
• University of Illinois at Urbana-Champaign

Session 7 - Corpus Annotation Tools 1


Pros and Cons of Automatic Annotation
• Automatic annotation is fast and mostly available for free, so it saves time and effort.
However, it comes with some disadvantages.

• First, being a software program, not a human, it is never 100% accurate. Only
careful human annotation can approach that level of accuracy.
• Second, automatic tools are limited in the languages and linguistic levels they cover. For
example, semantic role labelers are hard to find for Arabic, and tools that analyze
discourse structure (a topic sentence, supporting sentences, and a concluding
sentence) are not readily available.
• Third, automatic annotation tools handle well-defined tasks best, such as assigning
grammatical classes to words. Semantic and morphological analysis are also
possible, but culture-specific tasks, such as identifying figures of speech and
idioms, cannot be done reliably by a tool and require human annotation.
Part-of-Speech Taggers 1
• The first type of automatic annotation tools we will talk about is POS taggers.

• They take raw corpora as input and give the grammatical category of each word as
output.

• Two famous English POS taggers are:


• CLAWS
• TreeTagger

• What can be a major challenge for POS taggers?

• Language ambiguity: in many cases a word can have two or more grammatical
classes, as in ‘I still need to book my ticket’ and ‘I’m reading a new book now’.
Part-of-Speech Taggers 2
• How can POS taggers overcome ambiguity?

• Through context: the linguists and programmers feed them with contextual information to help
with disambiguation. For example,

• Words after NEED/WANT TO are verbs.


• Words after articles {A, AN, THE} are nouns.

• Some of this information is 100% true but some is not. For the previous two pieces of information,
can you decide which one is always correct?

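• It is probably the second one, although even it has exceptions: an adjective can come between
the article and the noun, as in ‘a new book’. Rules must also be stated narrowly. If I wrote the first
rule as ‘words after TO are verbs’, would it always be true? No: in ‘I went to school’, the word
after TO is a noun.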
• That is why manual annotation is more accurate than automatic annotation. However, it is better to
rely on automatic annotation, with its margin of error, than to annotate the corpus from scratch
and spend hours and hours on it.
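The two contextual rules discussed on this slide can be sketched in a few lines of Python. This is only a toy illustration, not how CLAWS or TreeTagger work internally: the mini-lexicon, the tag names, and the function name `tag_ambiguous` are all assumptions made for the example.

```python
# Toy contextual disambiguation for an ambiguous word such as "book".
# The two rules mirror the slide; the lexicon and tag names are hypothetical.

ARTICLES = {"a", "an", "the"}
AMBIGUOUS = {"book": ("NOUN", "VERB")}  # hypothetical mini-lexicon


def tag_ambiguous(tokens):
    """Return (token, tag) pairs; words outside the lexicon get the tag None."""
    tagged = []
    for i, token in enumerate(tokens):
        word = token.lower()
        tag = None
        if word in AMBIGUOUS:
            prev = tokens[i - 1].lower() if i >= 1 else ""
            prev2 = tokens[i - 2].lower() if i >= 2 else ""
            if prev == "to" and prev2 in {"need", "want"}:
                tag = "VERB"     # rule 1: words after NEED/WANT TO are verbs
            elif prev in ARTICLES:
                tag = "NOUN"     # rule 2: words after articles are nouns
            else:
                tag = "UNKNOWN"  # neither rule applies
        tagged.append((token, tag))
    return tagged


print(tag_ambiguous("I still need to book my ticket".split()))
print(tag_ambiguous("I am reading a new book now".split()))
```

In the second sentence neither rule applies, because the adjective "new" stands between the article and "book". This is the kind of gap that gives automatic annotation its margin of error.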
Part-of-Speech Taggers 3
• Since there are many POS taggers for English, how can we choose the best one
for our project?
• There are two criteria:
1- The accuracy rate. Calculate it on a small portion of the corpus, in the same way as
when choosing an OCR tool: try more than one tagger on a sample of the data and
choose the one with the highest accuracy rate.
• When multiple available tools can perform the required task, you need an
objective measure to help you choose the best one.

• This measure can be the accuracy rate, which is calculated as:

• Accuracy rate = (number of correctly tagged words / total number of words in the sample) * 100
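As a minimal sketch, assuming the accuracy rate means the percentage of words in a hand-checked sample that the tagger labels correctly, the comparison between two taggers could look like this. The function name `accuracy_rate`, the tag names, and the sample data are made up for illustration.

```python
def accuracy_rate(predicted_tags, gold_tags):
    """Percentage of tags that match the hand-checked (gold) tags."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("predicted and gold tags must have the same length")
    if not gold_tags:
        raise ValueError("the sample must contain at least one word")
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags) * 100


# Hypothetical four-word sample, tagged by hand and by two taggers.
gold = ["PRON", "VERB", "DET", "NOUN"]
tagger_a = ["PRON", "VERB", "DET", "NOUN"]
tagger_b = ["PRON", "NOUN", "DET", "NOUN"]

print(accuracy_rate(tagger_a, gold))  # 100.0
print(accuracy_rate(tagger_b, gold))  # 75.0
```

On this sample, tagger A would be the better choice. In practice the sample should be large enough to represent the corpus.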
Part-of-Speech Taggers 3
• Since there are many POS taggers for English, how can we choose the best one
for our project?

2- There is another criterion that is equally important: the tagset.

• The tagset is the set of labels used by the part of speech tagger to mark the
grammatical class of the word.
• The classification of grammatical classes differs from one tagger to another.
• There is no one standard set that all taggers use. Instead, every group of linguists
chooses its own tagset.
• Before choosing a tagger, I should look at its tagset to see whether the grammatical
classes it marks fit my research needs.
Part-of-Speech Taggers 4
• Tag sets differ in how detailed they are. For example, maybe you want to know whether reflexive pronouns
can be used in UN formal writing or not. Which tagger will you choose?

Pronoun Tags
CLAWS                              TreeTagger
PNI  indefinite pronoun (one)      PRP   personal pronoun
PNP  personal pronoun (he/I)       PRP$  possessive pronoun (his)
PNQ  wh-pronoun (who)
PNX  reflexive pronoun (herself)

• You’d better use CLAWS since it is the one that distinguishes reflexive pronouns from other types of
pronouns, even if it is less accurate.
• CLAWS uses four different tags for pronouns, while TreeTagger uses one tag for all personal
pronouns.
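Checking a tagset against a research need can be sketched as a simple lookup. The dictionaries below copy only a subset of the pronoun tags from the table on this slide and are not the complete tagsets. The function name `taggers_marking` is hypothetical.

```python
# Subset of the pronoun tags listed on the slide (not the full tagsets).
TAGSETS = {
    "CLAWS": {
        "PNP": "personal pronoun",
        "PNQ": "wh-pronoun",
        "PNX": "reflexive pronoun",
    },
    "TreeTagger": {
        "PRP": "personal pronoun",
        "PRP$": "possessive pronoun",
    },
}


def taggers_marking(category, tagsets=TAGSETS):
    """Names of the taggers whose tag descriptions mention the category."""
    return [
        name
        for name, tags in tagsets.items()
        if any(category in description for description in tags.values())
    ]


print(taggers_marking("reflexive"))  # ['CLAWS']
print(taggers_marking("personal"))   # ['CLAWS', 'TreeTagger']
```

For a study of reflexive pronouns, only the first tagger in this subset offers a dedicated tag, which matches the recommendation above.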
Shallow Parsers
• Shallow parsers also work on the syntactic level, but they analyze phrase
structure: they try to identify noun, verb, adjectival, adverbial,
and prepositional phrases. They work at the phrase level, not the word level.
• Example: ‘on the table’ is a prepositional phrase.

• Shallow parsers differ in their accuracy but not in their tagset.

• Two known English shallow parsers are:


• The one from the University of Illinois.
• CLiPs

• There is a POS tagger and shallow parser for Arabic called MADAMIRA
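To make the idea of phrase-level analysis concrete, here is a minimal sketch that groups already-tagged words into prepositional phrases. It assumes a simplified pattern (a preposition, an optional determiner, any adjectives, then one or more nouns) and made-up tag names. Real shallow parsers are far more sophisticated than this.

```python
def find_prepositional_phrases(tagged):
    """Return phrases matching PREP (DET)? ADJ* NOUN+ as strings.

    `tagged` is a list of (word, tag) pairs with hypothetical tags.
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] != "PREP":
            i += 1
            continue
        j = i + 1
        if j < n and tagged[j][1] == "DET":      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":   # any number of adjectives
            j += 1
        first_noun = j
        while j < n and tagged[j][1] == "NOUN":  # one or more nouns
            j += 1
        if j > first_noun:
            phrases.append(" ".join(word for word, _ in tagged[i:j]))
            i = j
        else:
            i += 1  # a preposition without a following noun is not a phrase here
    return phrases


sentence = [("The", "DET"), ("book", "NOUN"), ("is", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("table", "NOUN")]
print(find_prepositional_phrases(sentence))  # ['on the table']
```

Note that the input is the output of a POS tagger: the chunker works on the tags, not on the raw words.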
Deep Parsers
• Deep parsers also work on syntactic annotation. They analyze the
entire sentence to identify the Subject and the Predicate.

• Look at these syntactic trees.
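A syntactic tree can also be written down as a nested structure. The sketch below is an assumption for illustration only: the sentence, the phrase labels, and the helper names are invented, and the tree is written by hand rather than produced by a parser. It shows how the Subject and the Predicate can be read off such a tree.

```python
# A hand-written tree for "The student read a book":
# each node is (label, child, child, ...); a leaf is (tag, word).
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "student")),
    ("VP", ("VBD", "read"), ("NP", ("DT", "a"), ("NN", "book"))),
)


def leaves(node):
    """Collect the words under a node, left to right."""
    _, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # leaf node: (tag, word)
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def subject_and_predicate(sentence):
    """Assume the sentence node has exactly two children: NP then VP."""
    _, noun_phrase, verb_phrase = sentence
    return " ".join(leaves(noun_phrase)), " ".join(leaves(verb_phrase))


print(subject_and_predicate(tree))  # ('The student', 'read a book')
```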



Morphological Analyzers
• Morphological analyzers are important for morphologically rich languages
such as Arabic.

• For a computer, all these versions of the same word – ‫الكتاب – فالكتاب – بالكتاب‬
‫ والكتاب – كتاب – فكتاب – وكتاب‬are totally different words. So, if you want to
extract the collocations of ‫ كتاب‬from an Arabic corpus, you need to split off
all attachments such as ‫ و‬,‫ فـ‬,‫الـ‬, and ‫بـ‬.

• The software program that splits off these attachments is referred to as a
morphological analyzer, stemmer, or tokenizer.

• Some freely available ones are Farasa and MADAMIRA.
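The effect of splitting off attachments can be shown with a deliberately naive sketch. It only strips the four prefixes mentioned on this slide and uses a hypothetical minimum stem length to avoid cutting into short words. It will still make mistakes on words that merely begin with these letters, and real analyzers do considerably more than this.

```python
# Naive prefix stripping for the attachments named on the slide.
# All names and the minimum stem length are assumptions for illustration.
CONJUNCTIONS = ("و", "ف")
PREPOSITIONS = ("ب",)
ARTICLE = ("ال",)
MIN_STEM_LENGTH = 3


def strip_prefixes(word):
    """Strip at most one conjunction, one preposition, and the article, in that order."""
    for group in (CONJUNCTIONS, PREPOSITIONS, ARTICLE):
        for prefix in group:
            remainder = word[len(prefix):]
            if word.startswith(prefix) and len(remainder) >= MIN_STEM_LENGTH:
                word = remainder
                break
    return word


forms = ["الكتاب", "فالكتاب", "بالكتاب", "والكتاب", "كتاب", "فكتاب", "وكتاب"]
print({strip_prefixes(form) for form in forms})
```

After stripping, all seven surface forms collapse into the single form كتاب, so their collocations can be counted together.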


Semantic Role Labelers
• Semantic role labelers work on the semantic level to tag every word
for its function, not its grammatical class. One example of an English
semantic role labeler is here.



FrameNet (a semantic-based annotation tool)
• Annotation
The assignment of semantic role tags to syntactic constituents.
• Frame semantics
A descriptive framework for characterizing lexical meaning in terms of semantic
frames.
• Frame (semantic frame)
A schematic representation of a situation involving various participants, props and
other conceptual roles, each of which is a frame element.
• Frame element (FE)
A frame-specific semantic role that is the basic unit of a frame.
• Core frame elements
Frame elements that are essential to the meaning of a frame are called "core" FEs
(e.g. Speaker in frames connected with communication); expressions of time, place
and manner are generally not core FEs.
FrameNet
• Lexical unit (LU)
A pairing of a lemma and a frame, i.e. a "word" taken in one of its
senses.

• Sources
• https://framenet.icsi.berkeley.edu/fndrupal/glossary
• https://framenet.icsi.berkeley.edu/fndrupal/the_book
Frame Semantics
• Charles Fillmore’s (1982) Frame Semantics is a theory of how
the meanings of words are understood.
• The theory basically revolves around the notion that the meaning of a
word is best understood not in isolation, but through the semantic
frame with which the word is associated.
• For example, consider ‘We never open our presents until the morning.’ In this
example, the frame of Christmas is invoked because the interpreter
relies on the contents of the text, such as presents and morning, to
invoke such a frame. Interpreters who share the same cultural experience
will invoke the same frame.



FrameNet
• FrameNet is an online lexical database that documents the semantic and
syntactic information of each lexical unit (Baker et al., 1998).
• FrameNet is a large and rich database that provides almost 10 thousand
LUs in more than 825 semantic frames (SFs), and all of this is
exemplified by more than 135 thousand annotated sentences.
• FrameNet is a lexicographic database that describes word meanings
based on the principles of frame semantics.
• Example: the lexical unit help has these frame elements:
• Helper
• Benefited party
• Goal
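The relation between a lexical unit, its frame, and the frame elements can be modeled with a few small classes. This sketch does not use the FrameNet database or any FrameNet software. The class names, the frame name "Assistance", the extra Time element, and the core/non-core flags are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrameElement:
    name: str
    core: bool = True  # core FEs are essential to the meaning of the frame


@dataclass
class Frame:
    name: str
    elements: list = field(default_factory=list)

    def core_elements(self):
        """Names of the frame elements flagged as core."""
        return [fe.name for fe in self.elements if fe.core]


@dataclass
class LexicalUnit:
    lemma: str
    frame: Frame  # an LU pairs a lemma with one frame


# Hypothetical encoding of the "help" example from the slide.
assistance = Frame(
    name="Assistance",  # assumed frame name
    elements=[
        FrameElement("Helper"),
        FrameElement("Benefited party"),
        FrameElement("Goal"),
        FrameElement("Time", core=False),  # time expressions are generally not core
    ],
)
help_lu = LexicalUnit(lemma="help", frame=assistance)

print(help_lu.frame.core_elements())  # ['Helper', 'Benefited party', 'Goal']
```

Keeping the frame separate from the lexical unit reflects the definition above: the same lemma in a different sense would be a different LU pointing to a different frame.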
