
Computational Knowledge Analysis –
Natural Language Processing with Python

Sessions 2 and 3
08/15.05.2023
Dr. Maria Becker
Summer Term 2023
What is Natural Language Processing?

• Natural language processing (NLP) is the intersection of computer science, linguistics and machine learning
• In NLP the goal is to make computers understand natural language (= unstructured text) and retrieve meaningful pieces of information from it
• NLP is a subfield of artificial intelligence concerned with the interactions between computers and humans
Areas of NLP (I): Data Preprocessing
• Involves preparing and cleaning text data for machines to be able to
analyze it
• Preprocessing puts data in workable form and highlights features in the
text that an algorithm can work with
• There are several ways this can be done, including:
• Tokenization: text is broken down into smaller units to work with
• Stop word removal: common words (articles etc.) are removed from text so unique
words that offer the most information about the text remain
• Lemmatization and stemming: words are reduced to their root forms to process
• Part-of-speech tagging: words are marked based on the part of speech they belong to, such as nouns, verbs and adjectives
•…
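The first two of these steps can be sketched with the Python standard library alone (the tiny stopword list below is a hand-written stand-in, not a real list such as the one shipped with NLTK; lemmatization and POS tagging additionally need a library such as spaCy or NLTK):

```python
import re

# Hand-written mini stopword list: a stand-in for a real list
# such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Tokenization: break text into lowercase word units."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Stop word removal: keep only the more informative words."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The cat is sitting in the garden of an old house.")
print(tokens)
print(remove_stop_words(tokens))
```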
Areas of NLP (II): Developing Algorithms
• Once the data has been preprocessed, an algorithm is developed to process it
• There are many different natural language processing algorithms; two main types are commonly used:
• Rules-based system: uses carefully designed linguistic rules. This approach was used early on in the development of natural language processing, and is still used
• Statistical system/Machine learning-based system: Machine learning algorithms use statistical methods. They learn to perform tasks based on training data they are fed, and adjust their methods as more data is processed.
Any problems?
Stack Overflow is (almost always) the solution
• https://stackoverflow.com/
• A public platform hosting a collection of coding questions & answers
• Query the platform with keywords related to your problem (usually someone has already asked the same question) or post a question/problem together with your code
What we will learn today and next week
• Investigating the properties of a corpus (word frequencies, lexical
richness, word clouds…)
• Investigating the properties of a specific word within a corpus
(concordances, similarities…)
• POS Tagging
• Lemmatization
• Removal of stopwords
• Named Entity Recognition
Google Colab

• Google Colaboratory (Google Colab) is a cloud-based platform hosted on Google Cloud servers
• The basic functions that we will use in our seminar are free of charge, but require a Google account (also free of charge) → https://support.google.com/accounts/answer/27441?hl=en
• Google Colab provides free access to computing resources along with pre-installed software tools for machine learning and data analysis.
• It is a Jupyter notebook environment that allows users to write, run, and share Python code in a collaborative environment.
Getting Started…
• Step 1: Download and unzip the folder "Exercises" in Moodle
• Step 2: Open Google Colab → https://colab.research.google.com/ (a Google Account is required)
• Step 3: Upload the first exercise file
• Upload → Select file (Exercise1_investigate_corpus_Python4NLP.ipynb) from your computer → Open
• Step 4: Upload your corpus
• Files → Upload → Select the txt file that contains the corpus you created (including 30+ articles) from your computer → Open
Getting Started…
• Step 5: Run the Code!
• Run all the blocks one by one
• Change the filename ("your_file.txt") manually to the name of your corpus file
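This loading step typically looks like the following sketch (the filename "your_file.txt" is the placeholder from the notebook; the demo first writes a tiny stand-in corpus so the snippet is self-contained, whereas in Colab you would upload your own file instead):

```python
from pathlib import Path

# For a self-contained demo we first create a tiny stand-in corpus;
# in Colab you would instead upload your own file and change the name.
filename = "your_file.txt"
Path(filename).write_text("First article...\nSecond article...\n", encoding="utf-8")

# Read the whole corpus into one string for further analysis.
with open(filename, encoding="utf-8") as f:
    corpus = f.read()

print(f"Loaded {len(corpus)} characters from {filename}")
```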
Exercises
• The exercises are a combination of programming tasks and tasks aimed at learning more about the background of the applied methods.
• You will have time to work on the exercises in this and the next session, and have to complete them as homework
• Please submit the answers to the questions (bullet points are sufficient!) via email: maria.becker@gs.uni-heidelberg.de
• Exercises 1-3 by the 14th of May
• Exercises 4-6 by the 21st of May
• There are tasks for everyone, and further tasks where you can choose between two levels:
• Level 1: mostly tasks where you have to do some research on the internet to learn some background information about the applied methods – these tasks are ideal for beginners in Python, but can also be interesting for people who already have some experience in programming with Python.
• Level 2: mostly tasks where you have to do some more advanced programming – these tasks are ideal for people who already have some experience in programming with Python, but can also be chosen by motivated beginners who want to learn more about programming.
• In case you have questions or problems, please send me a private chat message via heiconf
Exercise 1: Investigating your corpus
• EVERYONE:
• Run Exercise 1 on the file which contains all 30 newspaper articles. What insights do you gain about your corpus? Do they match your expectations?
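Two of the properties named above, word frequencies and lexical richness, can be computed with a few lines of standard-library Python (a sketch on a toy text, not the notebook's actual code; lexical richness is taken here as the common type/token ratio):

```python
from collections import Counter
import re

text = "the dog chased the cat and the cat chased the mouse"
tokens = re.findall(r"[a-z]+", text.lower())

# Word frequencies: how often does each token occur?
freq = Counter(tokens)
print(freq.most_common(3))

# Lexical richness as the type/token ratio: distinct words / all words.
richness = len(set(tokens)) / len(tokens)
print(f"lexical richness: {richness:.2f}")
```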
Exercise 2: Investigating words in your corpus
• EVERYONE:
• Run Exercise 2 on the file which contains all 30 newspaper articles by selecting 5 words you are interested in (and that you expect to play an important role within your data). What insights do you gain about those words? Do they match your expectations?
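A concordance, i.e. every occurrence of a word together with a bit of left and right context, can be sketched in plain Python (NLTK's `Text.concordance` does the same more comfortably; the function and toy text here are purely illustrative):

```python
import re

def concordance(text, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

text = "The cat sat on the mat. The cat chased the mouse."
for line in concordance(text, "cat"):
    print(line)
```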
Exercise 3: POS Tagging
POS Tagging: categorizing words in a text (corpus) in correspondence with a particular part of speech

• EVERYONE:
• Investigate the background of POS tagging with spaCy: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/?utm_content=cmp-true (read until "Spacy POS Tagging Example")
• Evaluate the results of the spaCy POS tagger on one of your newspaper articles. How accurate is it? What errors do you observe?
• Choose one of the following tasks:
• LEVEL 1: Which other POS tag sets (besides the one spaCy uses) are available? What are the differences?
• LEVEL 2: Find an alternative Python library for POS tagging and make it run on one of your newspaper articles. Compare the results to the spaCy POS tagger.
• LEVEL 2: Calculate the distribution of POS tags within your corpus (how many nouns, verbs etc. occur?)
Exercise 4: Lemmatization
Lemmatization: reducing the different forms of a word to one single form, e.g., reducing "running", "runs", or "ran" to the lemma "run".

• EVERYONE:
• Evaluate the results of the spaCy lemmatizer on one of your newspaper articles. How accurate is it? What errors do you observe?
• Choose one of the following tasks:
• LEVEL 1: Find out how the spaCy lemmatizer works. What are other approaches to reducing words to their basic forms?
• LEVEL 2: Find an alternative Python library for lemmatization and make it run on one of your newspaper articles. Compare the results to the spaCy lemmatizer.
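One such alternative approach is stemming, which chops suffixes off words instead of looking them up. The toy stemmer below (a deliberately crude illustration, not a real algorithm such as Porter's) shows why lemmatization is usually more accurate:

```python
def naive_stem(word):
    """A deliberately crude stemmer: strip a few common suffixes."""
    for suffix in ("ing", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# "running" becomes "runn" (not the lemma "run"), and "ran" is left
# untouched: unlike a lemmatizer, a stemmer knows nothing about
# irregular forms.
for word in ["running", "runs", "ran"]:
    print(word, "->", naive_stem(word))
```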
Exercise 5: Removing Stopwords
Stop words are basically a set of commonly used words in a language (function words such as articles, prepositions, pronouns etc.). The reason why stop word removal is critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead.

• EVERYONE:
• Remove the stopwords from your corpus (the file which contains all 30 newspaper articles). Run Exercise 1 again. What differences do you observe?
• Find out which languages the nltk stopword module supports. Does it
support your native language?
• Have a closer look at the English stopword list and the stopword list from
your native language. What can you say about the quality of the lists? Are
important stopwords missing?
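The effect of stopword removal is easy to see by comparing the most frequent words before and after it (the stopword list below is a hand-written stand-in; with NLTK you would use `nltk.corpus.stopwords.words('english')` after `nltk.download('stopwords')`):

```python
from collections import Counter
import re

# Hand-written stand-in for a real stopword list like NLTK's.
STOP_WORDS = {"the", "a", "an", "of", "and", "on", "in", "is"}

text = "The history of the city and the people of the region is long."
tokens = re.findall(r"[a-z]+", text.lower())

# Before removal the top words are all function words; after removal
# the content words of the text come to the front.
print("before:", Counter(tokens).most_common(3))
content = [t for t in tokens if t not in STOP_WORDS]
print("after: ", Counter(content).most_common(3))
```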
Exercise 6: Named Entity Recognition
• Named-entity recognition (NER) seeks to detect and classify named
entities mentioned in unstructured text into pre-defined categories
such as
• Person names
• Organizations
• Locations
• Time expressions
• etc.

https://de.shaip.com/blog/named-entity-recognition-and-its-types/
Exercise 6: Named Entity Recognition
• EVERYONE:
• Investigate the NER tag set that spaCy uses. Which other NER tag sets are available? What are the differences?
• Evaluate the results of the spaCy NER tagger on one of your newspaper articles. How accurate is it? What errors do you observe?
• Choose one of the following tasks:
• LEVEL 1: What are common applications of NER tagging? For which downstream tasks can NER tagging be helpful?
• LEVEL 2: Find an alternative Python library for NER tagging and make it run on one of your newspaper articles. Compare the results to the spaCy NER tagger.
• LEVEL 2: Calculate the distribution of NER tags within your corpus (how many persons, locations etc. occur?)
Your Tasks for Next Week

• Please submit the answers to Exercises 1-3 by Sunday, 14th of May, via email: maria.becker@gs.uni-heidelberg.de
• Bullet points are sufficient!
• In the next session we will discuss the results of Exercises 1-3, and then we will work together on Exercises 4-6.
