2023/6/21 21:37 Stanford PoS Tagger: tagging from Python [linguisticsweb.org]

Stanford PoS Tagger: tagging from Python


author: Sabine Bartsch, e-mail: mail@linguisticsweb.org

[tutorial status: work in progress - January 2019]

Related tutorial: Stanford PoS Tagger

While annotation tools are often run stand-alone from the command line, there are many scenarios in which we would like to
integrate an automatic annotation tool into a larger workflow, for example to run pre-processing and annotation steps as well as
analyses in one go. In this tutorial, we will run the Stanford PoS Tagger from a Python script.

The Stanford PoS Tagger is itself written in Java, so it can easily be integrated into and called from Java programs. However, many
linguists will prefer to stick with Python as their programming language, especially when they are using other Python packages
such as NLTK as part of their workflow. Although the Stanford PoS Tagger is not written in Python, it can nevertheless be
integrated more or less seamlessly into Python programs. In this tutorial, we will look at two principal ways of part-of-speech
tagging from Python with NLTK and show how this can be done with single files and with multiple files in a directory.

Running the Stanford PoS Tagger in NLTK


NLTK provides a part-of-speech tagging function, nltk.pos_tag(), that runs without a separate local installation of the Stanford
PoS Tagger. This is the simplest way of tagging from Python and a good way of getting started. Be aware, however, that
nltk.pos_tag() applies NLTK's own default English tagger (an averaged perceptron tagger in NLTK 3.x), not the Stanford tagger or
its models; importing StanfordTagger does not change this. The disadvantage is therefore that users have no choice between
models for tagging. Both word_tokenize() and pos_tag() rely on data packages that have to be fetched once with nltk.download().
The script below tags an example sentence in this way:

 1  # part-of-speech tagging of an example sentence from NLTK
 2
 3  import nltk
 4  from nltk import word_tokenize
 5  # nltk.pos_tag() uses NLTK's default tagger; no Stanford import is needed
 6
 7  text_tok = word_tokenize("Just a small snippet of text.")
 8
 9  # print(text_tok)
10  pos_tagged = nltk.pos_tag(text_tok)
11
12  # print the list of tuples: (word, word_class)
13  print(pos_tagged)
14
15  # for loop to extract the elements of the tuples in the pos_tagged list
16  # print the word and the pos_tag with the underscore as a delimiter
17  for word, word_class in pos_tagged:
18      print(word + "_" + word_class)

Note the for-loop in lines 17-18 that converts the tagged output (a list of tuples) into the two-column format: word_tag.
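The conversion performed by the for-loop can also be wrapped in a small helper. The following is a minimal sketch in plain Python: the function names to_word_tag and from_word_tag are hypothetical and not part of NLTK, and the sample tags are written by hand for illustration rather than produced by a tagger. It assumes the tagged output is a list of (word, tag) tuples as described above.

```python
# Hypothetical helpers (not part of NLTK) for the word_tag format.

def to_word_tag(pos_tagged, delimiter="_"):
    """Join each (word, tag) tuple into a 'word_tag' string."""
    return [word + delimiter + tag for word, tag in pos_tagged]


def from_word_tag(items, delimiter="_"):
    """Split 'word_tag' strings back into (word, tag) tuples.

    rpartition splits at the LAST delimiter, so a word that itself
    contains an underscore keeps it.
    """
    pairs = []
    for item in items:
        word, _, tag = item.rpartition(delimiter)
        pairs.append((word, tag))
    return pairs


# Hand-written example input in the shape of a tagger's output.
sample = [("Just", "RB"), ("a", "DT"), ("snippet", "NN"), (".", ".")]
lines = to_word_tag(sample)
print("\n".join(lines))
```

Splitting at the last delimiter is a design choice: it keeps the round trip intact for tokens that contain an underscore themselves.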

This same script can be easily modified to tag a file located in the file system:

 1  # part-of-speech tagging of a local file from NLTK
 2  import nltk
 3  from nltk import word_tokenize
 4  # nltk.pos_tag() uses NLTK's default tagger; no Stanford import is needed
 5
 6  # point this path to a utf-8 encoded plain text file in your own file system
 7  f = "C:/Users/Public/projects/python101-2018/data/sample-text.txt"
 8
 9  text_raw = open(f, encoding="utf-8").read()
10  text = word_tokenize(text_raw)
11  pos_tagged = nltk.pos_tag(text)
12
13  # print the list of tuples: (word, word_class)
14  # this is just a test, comment out if you do not want this output
15  print(pos_tagged)
16
17  # for loop to extract the elements of the tuples in the pos_tagged list
18  # print the word and the pos_tag with the underscore as a delimiter
19  for word, word_class in pos_tagged:
20      print(word + "_" + word_class)

Note that you need to adjust the path in line 7 above to point to a UTF-8 encoded plain text file that actually exists in your local
file system.
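A missing file or a wrong encoding is a common source of errors at this step. The sketch below shows one way to check for the file and read it with an explicit encoding; the helper name read_utf8_text is hypothetical, and the temporary file only stands in for your own sample text.

```python
import tempfile
from pathlib import Path


def read_utf8_text(path):
    """Return the contents of a UTF-8 plain text file, or raise a clear error."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    # An explicit encoding avoids depending on the platform default,
    # which is often not UTF-8 on Windows.
    return p.read_text(encoding="utf-8")


# Demonstration with a temporary file standing in for sample-text.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "sample-text.txt"
    sample_file.write_text("Just a small snippet of text.", encoding="utf-8")
    print(read_utf8_text(sample_file))
```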

Driving the Stanford PoS Tagger local installation from Python / NLTK
Instead of tagging via nltk.pos_tag(), you can drive the Stanford PoS Tagger through an NLTK wrapper module on the basis of a
local tagger installation. To make use of this scenario, you first have to create a local installation of the Stanford PoS Tagger
as described in the Stanford PoS Tagger tutorial under 2 Installation and requirements. In the code itself, you have to point
Python to the location of your Java installation:

import os

java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"
os.environ["JAVAHOME"] = java_path

You also have to explicitly state the paths to the Stanford PoS Tagger .jar file and the Stanford PoS Tagger model to be used for
tagging:

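jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"
# the model file name is truncated here; use the full name of a .tagger file from your models directory
model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirect..."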

Note that these paths vary according to your system configuration. You will need to check your own file system for the exact
locations of these files, although on a Windows system Java is likely to be installed somewhere in C:\Program Files\ or
C:\Program Files (x86).
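Because these locations differ from machine to machine, it can help to verify them before constructing the tagger. The following is a minimal sketch using only the standard library; find_java and missing_files are hypothetical helper names. It assumes, as in this tutorial, that JAVAHOME holds the path of the java executable itself rather than a directory.

```python
import os
import shutil
from pathlib import Path


def find_java(explicit_path=None):
    """Return a path to the java executable, or None if none is found.

    Order: an explicit path, then the JAVAHOME variable as set in this
    tutorial, then whatever 'java' is found on the system PATH.
    """
    candidates = [explicit_path, os.environ.get("JAVAHOME"), shutil.which("java")]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return str(candidate)
    return None


def missing_files(*paths):
    """Return those of the given paths that do not exist as files."""
    return [str(p) for p in paths if not Path(p).is_file()]


jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"
print("java found at:", find_java())
print("missing files:", missing_files(jar))
```

If missing_files returns a non-empty list, correct those paths before going on.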

Running the local Stanford PoS Tagger on a sample sentence


The next example illustrates how you can run the Stanford PoS Tagger on a sample sentence:

 1  # Stanford POS tagger - Python workflow for using a locally installed version of the Stanford POS Tagger
 2  # Python version 3.7.1 | Stanford POS Tagger stand-alone version 2018-10-16
 3
 4  import nltk
 5  import os
 6  from nltk.tag.stanford import StanfordPOSTagger
 7  from nltk.tokenize import word_tokenize
 8
 9  # enter the path to your local Java JDK; under Windows, the path should look very similar to this
10  java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"
11  os.environ["JAVAHOME"] = java_path
12
13  # enter the paths to the Stanford POS Tagger .jar file as well as to the model to be used
14  jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"
15  model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirect..."  # file name truncated; use the full name of your .tagger model file
16
17  pos_tagger = StanfordPOSTagger(model, jar, encoding="utf-8")
18
19  # Tagging this one example sentence as a test:
20  # this small snippet of text lets you test whether the tagger is running before you apply it to a locally
21  # stored file (see line 28)
22  text = "Just a small snippet of text to test the tagger."
23
24  # Tagging a locally stored plain text file:
25  # as soon as the example in line 22 is running ok, comment out that line (#) and comment in line 28;
26  # enter a path to a local file of your choice;
27  # the assumption made here is that the file is a plain text file with utf-8 encoding
28  # text = open("C:/Users/Public/projects/python101-2018/data/sample-text.txt", encoding="utf-8").read()
29
30  # nltk word_tokenize() is used here to tokenize the text and assign it to a variable 'words'
31  words = word_tokenize(text)
32  # print(words)
33  # the pos_tagger is called here with the parameter 'words' so that the value of the variable 'words' is tagged
34  tagged_words = pos_tagger.tag(words)
35  print(tagged_words)

Running the local Stanford PoS Tagger on a single local file


The code above can be run on a local file with very little modification. In this example, the sentence snippet in line 22 has been
commented out and the path to a local file has been commented in:

 1  # Stanford POS tagger - Python workflow for using a locally installed version of the Stanford POS Tagger
 2  # Python version 3.7.1 | Stanford POS Tagger stand-alone version 2018-10-16
 3
 4  import nltk
 5  import os
 6  from nltk.tag.stanford import StanfordPOSTagger
 7  from nltk.tokenize import word_tokenize
 8
 9  # enter the path to your local Java JDK; under Windows, the path should look very similar to this
10  java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"
11  os.environ["JAVAHOME"] = java_path
12
13  # enter the paths to the Stanford POS Tagger .jar file as well as to the model to be used
14  jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"
15  model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirect..."  # file name truncated; use the full name of your .tagger model file
16
17  pos_tagger = StanfordPOSTagger(model, jar, encoding="utf-8")
18
19  # Tagging this one example sentence as a test:
20  # this small snippet of text lets you test whether the tagger is running before you apply it to a locally
21  # stored file (see line 28)
22  # text = "Just a small snippet of text to test the tagger."
23
24  # Tagging a locally stored plain text file:
25  # as soon as the example in line 22 is running ok, comment out that line (#) and comment in line 28;
26  # enter a path to a local file of your choice;
27  # the assumption made here is that the file is a plain text file with utf-8 encoding
28  text = open("C:/Users/Public/projects/python101-2018/data/sample-text.txt", encoding="utf-8").read()
29
30  # nltk word_tokenize() is used here to tokenize the text and assign it to a variable 'words'
31  words = word_tokenize(text)
32  # print(words)
33  # the pos_tagger is called here with the parameter 'words' so that the value of the variable 'words' is tagged
34  tagged_words = pos_tagger.tag(words)
35  print(tagged_words)
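For comparison, a single file can also be tagged by calling the tagger's Java class directly, without the NLTK wrapper. The sketch below mainly shows how such a command could be assembled. The class name edu.stanford.nlp.tagger.maxent.MaxentTagger and the -model and -textFile options follow the README shipped with the tagger distribution. The function names, the memory setting and the placeholder file names are assumptions, so check the README of your tagger version before relying on them.

```python
import subprocess


def build_tagger_command(java, jar, model, text_file, memory="-mx300m"):
    """Build the argument list for running the tagger on one text file."""
    return [
        java, memory,
        "-cp", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,
        "-textFile", text_file,
    ]


def run_tagger(command):
    """Run the command and return the tagged text written to stdout.

    Needs a working Java installation and real jar/model paths.
    """
    result = subprocess.run(command, capture_output=True,
                            encoding="utf-8", check=True)
    return result.stdout


# Placeholder file names; replace them with the paths on your own system.
cmd = build_tagger_command("java", "stanford-postagger.jar",
                           "models/your-model.tagger", "sample-text.txt")
print(" ".join(cmd))
```

Passing the arguments as a list rather than as one string avoids quoting problems with paths that contain spaces, such as C:/Program Files.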

Running the local Stanford PoS Tagger on a directory of files

Please note down the name of the directory into which you have unpacked the Stanford PoS Tagger as well as the subdirectory in
which the tagging models are located. Also write down (or copy) the name of the directory in which the file(s) you would like to
part-of-speech tag are located. As we will be writing the output of the two subprocesses, tokenization and tagging, to files in
your file system, you have to create these output directories and again write down or copy their locations for further use. In
this example, these directories are called:

data_path
tokenized_data_path
tagged_data_path
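The output directories can also be created from Python rather than by hand. This is a minimal sketch with a hypothetical helper name, ensure_directories; it uses a temporary base directory for the demonstration, which you would replace with your own project path.

```python
import tempfile
from pathlib import Path


def ensure_directories(*paths):
    """Create each directory (and missing parents) if it does not exist yet."""
    ensured = []
    for path in paths:
        p = Path(path)
        # exist_ok=True makes the call safe to repeat on later runs
        p.mkdir(parents=True, exist_ok=True)
        ensured.append(p)
    return ensured


# Demonstration inside a temporary base directory.
with tempfile.TemporaryDirectory() as base:
    dirs = ensure_directories(Path(base) / "tokenized_data",
                              Path(base) / "tagged_data")
    print([d.name for d in dirs])
```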

Once you have installed the Stanford PoS Tagger, collected and adjusted all of this information in the file below and created the
respective directories, you are set to run the following Python program:

# Stanford POS tagger local installation to tag a directory of plain text files
import os

from nltk.tag.stanford import StanfordPOSTagger
from nltk.tokenize import word_tokenize

# environment variables for the Stanford PoS Tagger
java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"
os.environ["JAVAHOME"] = java_path

# the model file name is truncated here; use the full name of your .tagger model file
model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirect..."
jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"

pos_tagger = StanfordPOSTagger(model, jar, encoding="utf-8")

# data path of the input corpus files as well as separate output directories for tokenized and tagged data
data_path = "C:/Users/Public/projects/python101-2018/data/BPS/"
# this path is truncated here; presumably it ends in tokenized_data/, by analogy with tagged_data/
tokenized_data_path = "C:/Users/Public/projects/python101-2018/data/BPS/tokenized_dat..."
tagged_data_path = "C:/Users/Public/projects/python101-2018/data/BPS/tagged_data/"

# apply tokenization and pos-tagging to all the txt-files in the directory 'data' that start with "WC"
for filename in os.listdir(data_path):
    if filename.startswith("WC") and filename.endswith(".txt"):
        # read the raw text; the with-statement closes the file automatically
        with open(data_path + filename, encoding="utf-8") as fr:
            raw_text = fr.read()
        tok_text = word_tokenize(raw_text)
        # write the token list to the tokenized output directory
        with open(tokenized_data_path + "tok_" + filename, "w", encoding="utf-8") as fw1:
            fw1.write(str(tok_text))
        # write the list of (word, tag) tuples to the tagged output directory
        with open(tagged_data_path + "tag_" + filename, "w", encoding="utf-8") as fw:
            fw.write(str(pos_tagger.tag(tok_text)))
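The same directory loop can be written so that it is testable without Java. In the sketch below, the tokenizer and the tagger are passed in as functions. The name tag_directory, the stand-in tokenizer and tagger, and the one-pair-per-line output format are assumptions made for illustration; they are not what the script above does, since it writes the string form of a Python list.

```python
import tempfile
from pathlib import Path


def tag_directory(data_dir, tagged_dir, tokenize, tag, pattern="WC*.txt"):
    """Tokenize and tag every file in data_dir that matches pattern.

    tokenize: callable mapping a string to a list of tokens
    tag:      callable mapping a list of tokens to (word, tag) tuples
    Returns the list of output files written.
    """
    out_dir = Path(tagged_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for in_file in sorted(Path(data_dir).glob(pattern)):
        tokens = tokenize(in_file.read_text(encoding="utf-8"))
        tagged = tag(tokens)
        out_file = out_dir / ("tag_" + in_file.name)
        # one word_tag pair per line instead of the str() of a Python list
        out_file.write_text(
            "\n".join(word + "_" + pos for word, pos in tagged) + "\n",
            encoding="utf-8",
        )
        written.append(out_file)
    return written


# Demonstration with stand-ins: whitespace tokenization and a dummy tag "X".
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "WC_sample.txt").write_text("Just a test", encoding="utf-8")
    outputs = tag_directory(data, Path(tmp) / "tagged_data",
                            str.split, lambda toks: [(t, "X") for t in toks])
    print(outputs[0].read_text(encoding="utf-8"))
```

With the objects from the script above, the call would be tag_directory(data_path, tagged_data_path, word_tokenize, pos_tagger.tag).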

www.linguisticsweb.org/doku.php?id=linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger_python 4/4
