Session 2
Exercise 3: POS Tagging
• EVERYONE:
• Investigate the background of POS tagging with spaCy:
https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/?utm_content=cmp-true (read until “Spacy POS Tagging Example”)
• Evaluate the results of the spaCy POS tagger on one of your newspaper articles. How
accurate is it? What errors do you observe?
• Choose one of the following tasks:
• LEVEL 1: Which other POS tag sets (besides the one spaCy uses) are available? What are
the differences?
• LEVEL 2: Find an alternative Python library for POS tagging and make it run on one of
your newspaper articles. Compare the results to the spaCy POS tagger.
• LEVEL 2: Calculate the distribution of POS tags within your corpus (how many NPs/VPs
etc. occur?)
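The LEVEL 2 distribution task can be sketched as follows. The counting logic only needs the standard library; the spaCy lines in the comment (including the model name `en_core_web_sm` and the variable `article_text`) are assumptions about how you would obtain the real tags:

```python
from collections import Counter

def pos_distribution(tags):
    """Map each POS tag to its relative frequency in the tag list."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

# With spaCy, the tags for an article would come from something like:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   tags = [token.pos_ for token in nlp(article_text)]
tags = ["NOUN", "VERB", "NOUN", "ADP", "NOUN"]  # toy stand-in for real output
print(pos_distribution(tags))  # {'NOUN': 0.6, 'VERB': 0.2, 'ADP': 0.2}
```

Relative frequencies (rather than raw counts) make the distributions comparable across articles of different lengths.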
Exercise 4: Lemmatization
Lemmatization: Reducing the different forms of a word to a single form, e.g., reducing “running”,
“runs”, or “ran” to the lemma “run”.
• EVERYONE:
• Evaluate the results of the spaCy lemmatizer on one of your newspaper
articles. How accurate is it? What errors do you observe?
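One way to make "how accurate is it?" concrete is to hand-annotate a small gold sample and compare. The helper below is a minimal sketch; the spaCy call in the comment and the toy predicted/gold pairs are assumptions, not real tagger output:

```python
def lemma_accuracy(predicted, gold):
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# With spaCy, predicted lemmas would come from something like:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   predicted = [token.lemma_ for token in nlp(sentence)]
predicted = ["run", "run", "ran"]  # toy lemmatizer output
gold      = ["run", "run", "run"]  # hand-annotated reference
print(lemma_accuracy(predicted, gold))  # 2 of 3 correct
```

Inspecting the mismatching pairs directly is usually the quickest way to spot systematic errors (e.g., irregular verbs or proper nouns).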
Exercise 5: Stopwords
• EVERYONE:
• Remove the stopwords from your corpus (the file which contains all 30
newspaper articles). Run exercise 1 again. What differences do you
observe?
• Find out which languages the NLTK stopword module supports. Does it
support your native language?
• Have a closer look at the English stopword list and the stopword list from
your native language. What can you say about the quality of the lists? Are
important stopwords missing?
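A minimal stopword-removal sketch. The tiny `toy_stops` set here is only a stand-in for the real NLTK list; the commented lines show the assumed way to load it and to list the supported languages:

```python
def remove_stopwords(tokens, stopwords):
    """Keep only tokens that are not in the stopword set (case-insensitive)."""
    stops = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stops]

# With NLTK, the real list would be loaded via:
#   import nltk
#   nltk.download("stopwords")
#   from nltk.corpus import stopwords
#   stops = stopwords.words("english")
#   print(sorted(stopwords.fileids()))  # languages the module supports
toy_stops = {"the", "a", "of"}
print(remove_stopwords(["The", "results", "of", "a", "study"], toy_stops))
# ['results', 'study']
```

Lowercasing both sides avoids missing sentence-initial stopwords like "The", which would otherwise survive the filter.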
Exercise 6: Named Entity Recognition
• Named-entity recognition (NER) seeks to detect and classify named
entities mentioned in unstructured text into pre-defined categories
such as
• Person names
• Organizations
• Locations
• Time expressions
• etc.
https://de.shaip.com/blog/named-entity-recognition-and-its-types/
• EVERYONE:
• Investigate the NER tag set that spaCy uses. Which other NER tag sets are available?
What are the differences?
• Evaluate the results of the spaCy NER tagger on one of your newspaper articles. How
accurate is it? What errors do you observe?
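For the evaluation, tallying which entity types spaCy finds in an article is a useful first step before checking individual spans. The counting helper is stdlib-only; the spaCy lines in the comment (model name, `article_text`) are the assumed way to obtain the (text, label) pairs:

```python
from collections import Counter

def entity_label_counts(entities):
    """Count how often each NER label occurs in a list of (text, label) pairs."""
    return Counter(label for _, label in entities)

# With spaCy, the pairs would come from something like:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   entities = [(ent.text, ent.label_) for ent in nlp(article_text).ents]
entities = [("Heidelberg", "GPE"), ("Maria Becker", "PERSON"), ("Sunday", "DATE")]
print(entity_label_counts(entities))
```

Skewed label counts (e.g., far more GPE than PERSON than you would expect for the article) are often the first hint of systematic tagger errors worth inspecting.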
• Please submit the answers to Exercises 1-3 by Sunday, 14th of May, via email:
maria.becker@gs.uni-heidelberg.de
• Bullet points are sufficient!
• Next session we will discuss the results of Exercises 1-3, and then we will work
together on Exercises 4-6.