
Advanced Data Engineering & Analytics:

Text Processing Cont.

20 March 2024
Basic Text Processing
Case folding, stemming, stopping
Normalization (IR example)
• We often need to “normalize” terms
• E.g. in Information Retrieval indexed text and query terms should have the same form
• We want to match different language variants like color and colour, or analyze and
analyse
• Same for U.S.A. and USA and US
• We implicitly define equivalence classes of terms
• e.g., lowercasing, deleting periods in a term, etc.
• Alternative to normalization in IR: asymmetric expansion, e.g.:
• Enter: window Search: window, windows..
• Enter: windows Search: Windows, windows, window..
• Enter: Windows Search: Windows..
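A minimal sketch of what such equivalence classing could look like in code (the normalize function and its two rules, period deletion and lowercasing, are illustrative assumptions, not a prescribed implementation):

def normalize(term):
    # map a raw term to its equivalence class: delete periods, then lowercase
    return term.replace(".", "").lower()

print(normalize("U.S.A."), normalize("USA"))   # -> usa usa (same index entry)
print(normalize("Windows"))                    # -> windows (case information is lost)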
Case folding
• Applications like IR: reduce all letters to lower case
• Since users tend to use lower case in their search queries
• Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, machine translation, information extraction
• case of a word is helpful (Apple versus apple is important), so case folding is often not
done
Lemmatization
• Reduce inflections or variant forms to a base form
• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization: finding correct dictionary headword form
• Especially important for highly inflected and morphologically complex
languages (and useful for machine translation)
• Spanish quiero (‘I want’) and quieres (‘you want’) share the lemma querer (‘to want’)

Use existing lemmatizers: NLTK, spaCy, WordNetLemmatizer, etc.

Statements such as “adults know about 30,000 words” or “you need to know at least 5,000 words to be fluent”
do not refer to inflected word forms (take/takes/taking/taken/took) but to lemmas or dictionary
forms (take), under the assumption that if we know a lemma, we also know all of its inflected forms
spaCy
• Industrial-strength Natural Language Processing
• An open-source software library for advanced natural language
processing

https://textanalysisonline.com/spacy-word-lemmatize
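A minimal lemmatization sketch with spaCy (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("the boy's cars are different colors")
print([token.lemma_ for token in doc])
# roughly: ['the', 'boy', "'s", 'car', 'be', 'different', 'color']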
Morphology
• Morphemes:
• The smallest meaningful units that make up words
• Many word forms consist of a stem plus a number of affixes (prefixes or
suffixes)
• Stems: core meaning-bearing units - central morphemes which are usually free
morphemes (free morphemes can occur by themselves as words)
• Affixes: providing additional information (usually bound morphemes; bound
morphemes have to combine with others to form words)
• Often with grammatical functions

disgracefully: dis <prefix> + grace <stem> + ful <suffix> + ly <suffix>
Dealing with complex morphology is
sometimes necessary
• Some languages require complex morpheme segmentation
• Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• “(behaving) as if you are among those whom we could not civilize”
• Uygar + las + tir + ama + dik + lar + imiz + dan+ mis + siniz + casina
Stemming
• Morphological parsing can be complex so cruder methods are often used
• Reduce terms to their stems (often used in IR)
• Stemming is a crude chopping of affixes
• language dependent
• e.g., automate(s), automatic, automation are all reduced to automat

• e.g., the passage “for example compressed and compression are both accepted as equivalent to compress” is stemmed to “for exampl compress and compress ar both accept as equival to compress”
Porter Stemmer: The most common English
stemmer
• Algorithmic stemmer used in IR experiments since the ‘70s
• Consists of a series of rules designed to strip the longest possible suffix at each step
• Produces stems not words
• Makes a number of errors and is difficult to modify..
FYI: Porter Stemmer’s algorithm

Step 1a
sses → ss    (caresses → caress)
ies → i      (ponies → poni)
ss → ss      (caress → caress)
s → ø        (cats → cat)

Step 1b
(*v*)ing → ø (walking → walk)
sing → sing
(*v*)ed → ø  (plastered → plaster)

Step 2 (for longer stems)
ational → ate (relational → relate)
izer → ize    (digitizer → digitize)
ator → ate    (operator → operate)

Step 3 (for longer stems)
al → ø    (revival → reviv)
able → ø  (adjustable → adjust)
ate → ø   (activate → activ)
Based on a series of rewrite rules
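A quick way to try this is NLTK's off-the-shelf PorterStemmer (example words taken from these slides; exact stems can differ slightly between Porter implementations):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["compressed", "compression", "caresses", "ponies", "relational"]
print([stemmer.stem(w) for w in words])
# roughly: ['compress', 'compress', 'caress', 'poni', 'relat'] - stems, not words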
Viewing morphology in a corpus:
Why only stripping if there is a vowel in the stem?
(*v*)ing → ø walking → walk
sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
1312 King
548 being
541 nothing
388 king
375 bring
358 thing
307 ring
152 something
145 coming
130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr
548 being
541 nothing
152 something
145 coming
130 morning
122 having
120 living
117 loving
116 Being
102 going
Porter Stemmer errors

• Overstemming (type I error): too much of the word is removed, conflating unrelated words
• Understemming (type II error): too little is removed, so related words are not conflated
Stopping
• Function words (determiners, prepositions, etc.) have little
meaning on their own
• High occurrence frequencies
• Treated as stopwords (i.e. removed)
• reduce index space, improve response time, improve effectiveness
• Can, however, be important in combinations
• e.g., “to be or not to be”
Stopping
• Stopword list can be created from high-frequency
words or based on a standard list
• Lists are customized for applications, domains,
and even parts of documents
• e.g., “click” is a good stopword for anchor text

http://www.ranks.nl/stopwords
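A small sketch using NLTK's standard English stopword list (assumes nltk.download('stopwords') and nltk.download('punkt') have been run); note how the famous phrase is removed almost entirely:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_set = set(stopwords.words("english"))
tokens = word_tokenize("To be or not to be, that is the question")
print([t for t in tokens if t.lower() not in stop_set])
# roughly: [',', 'question']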
Basic Text Processing
Phrase detection
Collocation/Phrase Detection
• Text processing issue – how are collocations or phrases recognized?
• Possible approaches:
• Identify syntactic phrases using a part-of-speech (POS) tagger
• Compute word association measures (see the NLTK sketch below)
• Use the most common n-grams
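As a small illustration of the association-measure approach, this sketch ranks bigrams from the Brown corpus by pointwise mutual information using NLTK (assumes nltk.download('brown') has been run; the frequency filter of 5 is an arbitrary choice):

from nltk.corpus import brown
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(brown.words())
finder.apply_freq_filter(5)               # drop very rare pairs (PMI favours them otherwise)
print(finder.nbest(measures.pmi, 10))     # 10 most strongly associated word pairs
# measures.chi_sq and measures.dice can be used in the same way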
Common Collocation Types
• adverb + adjective: “completely satisfied”.
• adjective + noun: “excruciating pain”.
• noun + noun: “surge of anger”.
• noun + verb: “lions roar”.
• verb + noun: “drink coffee”.
• verb + expression with a preposition by its side: “burst into tears”.
• verb + adverb: “wave frantically”
• …
Term Association Measures
• Dice’s Coefficient

• Mutual Information
Term Association Measures
• Mutual Information measure favors low frequency terms
• Expected Mutual Information Measure (EMIM)
Term Association Measures
• Pearson’s Chi-squared (χ2) measure
• compares the number of co-occurrences of two words with the expected
number of co-occurrences if the two words were independent
• normalizes this comparison by the expected number
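For reference, these measures are usually written as follows (standard forms from the IR literature, e.g. Croft et al.'s Search Engines: Information Retrieval in Practice; n_a and n_b are the numbers of documents or windows containing a and b, n_ab the number containing both, and N the total number):

Dice(a, b) = 2 n_ab / (n_a + n_b)
MIM(a, b) = P(a, b) / (P(a) P(b)), rank-equivalent to n_ab / (n_a n_b)
EMIM(a, b) = n_ab · log( N · n_ab / (n_a n_b) )
χ²(a, b) ≈ ( n_ab − N · (n_a / N) · (n_b / N) )² / (n_a n_b)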
Different Term Association Measures
Association Measure-based Example

The most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Association Measure-based Example

Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Association Measure-based Example

Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
Basic Text Processing
Sentence segmentation
Sentence Segmentation
• !, ? are relatively unambiguous but period “.” is quite ambiguous, e.g.:
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Common algorithm: decide (using rules, regex, or ML) whether a period is
part of the word or is a sentence-boundary marker.
• Looks at a “.” and decides EndOfSentence/NotEndOfSentence
• An abbreviation dictionary can help
• Sentence segmentation can often be done simply by rules based on
tokenization results
Determining if a Word is EndOfSentence:
an Example of a Simple Decision Tree

E.g., Stanford CoreNLP has a similar rule-based sentence splitter
More sophisticated decision tree features
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number

• Numeric features
• Length of word with “.”
• Probability(word with “.” occurs at end-of-sentence)
• Probability(word after “.” occurs at beginning-of-sentence)
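A toy sketch of such a classifier using a couple of these features (the abbreviation list and the rules are made up for illustration; real systems learn this decision or use much larger dictionaries):

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "e.g.", "i.e."}   # toy abbreviation dictionary

def is_sentence_boundary(word_with_period, next_word):
    w = word_with_period.lower()
    if w in ABBREVIATIONS:
        return False                       # "Dr." etc.: the period is part of the word
    if any(ch.isdigit() for ch in w):
        return False                       # "4.3", ".02%": period inside a number
    # otherwise, an upper-case (or missing) next word suggests a boundary
    return next_word is None or next_word[:1].isupper()

print(is_sentence_boundary("Dr.", "Smith"))       # False
print(is_sentence_boundary("Britain.", "Three"))  # True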
Basic Text Processing
HTML cleaning and preprocessing
Detecting Duplicates
• Duplicate and near-duplicate documents occur in many situations
• Copies, versions, plagiarism, spam, mirror sites
• About 30% of the web pages in a large crawl are exact or near duplicates of
pages in the other 70%..
• In IR duplicates consume significant resources during, e.g., crawling,
indexing, and search
• Yet they present little value to most users
Duplicate Detection
• Exact duplicate detection is relatively easy
• Checksum techniques
• A checksum is a value computed based on the content of a
document
• e.g., sum of the bytes in the document file

• Possible for files with different text to have the same checksum
• Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
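A small sketch contrasting a naive byte-sum checksum with CRC-32 from Python's zlib (the two example "documents" are made up):

import zlib

def byte_sum_checksum(data):
    return sum(data) % 2**32      # ignores byte positions

doc1, doc2 = b"abcd", b"badc"     # different text, same bytes in a different order
print(byte_sum_checksum(doc1) == byte_sum_checksum(doc2))   # True  - the checksums collide
print(zlib.crc32(doc1) == zlib.crc32(doc2))                 # False - CRC takes positions into account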
Near-Duplicate Detection
• Exact duplicate detection is relatively easy
• Near-duplicate detection – a more challenging task
• E.g., are web pages with the same text content but different advertising or format
near-duplicates?
• A near-duplicate document is defined using a threshold value for some
similarity measure between pairs of documents
• e.g., document D1 is a near-duplicate of document D2 if more than 90% of the
words in the documents are the same
Near-Duplicate Detection
• Search:
• find near-duplicates of document D
• O(N) comparisons required
• Discovery:
• find all pairs of near-duplicate documents in the collection
• O(N2) comparisons
• For discovery, techniques used to generate compact representations are
necessary
Fingerprints
Example and algorithm: shown as figures on the slides.
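As a stand-in, the sketch below shows one typical n-gram fingerprinting scheme (word 3-grams, hashing, "0 mod p" selection, fingerprint overlap); the parameter values and the similarity function are illustrative assumptions rather than the exact algorithm from the slides:

import hashlib

def fingerprint(text, n=3, p=4):
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]   # overlapping word n-grams
    hashes = (int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams)  # hash each n-gram
    return {h for h in hashes if h % p == 0}                                # keep only the "0 mod p" subset

def similarity(fp1, fp2):
    return len(fp1 & fp2) / max(1, len(fp1 | fp2))   # overlap of the two fingerprints

d1 = "tropical fish include fish found in tropical environments around the world"
d2 = "tropical fish are fish found in tropical environments around the world"
print(similarity(fingerprint(d1), fingerprint(d2)))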
Social Media

• Remove images and other multimedia


• Filter different languages?
• Remove HTML / XML markup tags
• Filter non-language characters (e.g. emoticons)?
• Correct grammar?…
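A crude sketch of two of these steps (tag removal with a regular expression and emoji filtering by Unicode range); real pipelines typically use an HTML parser rather than regexes, and the character ranges here are only approximate:

import re

def strip_markup(text):
    return re.sub(r"<[^>]+>", " ", text)             # drop HTML/XML tags (crude)

def strip_emoji(text):
    return re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)   # common emoji/symbol ranges

raw = "<p>Great game!! 😀 <a href='x'>link</a></p>"
print(strip_emoji(strip_markup(raw)).strip())        # roughly: "Great game!!   link"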
Removing “Non-content” Noise
• Many web pages contain text, links, and pictures not directly related to the
main content of web pages
• This additional material is mostly noise that could negatively affect further
processes such as the ranking of the page in IR, or information extraction, etc.
• Techniques have been developed to detect the content blocks in a web page
• Non-content material may be either ignored or reduced in importance in the
indexing process
Noise Example
Random HTML Example
Finding Content Blocks
• Some approaches use DOM
structure and visual
(layout) features
Example Approach for Finding Content Blocks
A cumulative distribution of tags in the example web page:

Main text content of the page corresponds to the “plateau” in the middle of the distribution
Finding Content Blocks
• Represent a web page as a sequence of bits, where bn = 1 indicates that the
n-th token is a tag
• Optimization problem where we find values of i and j to maximize both the
number of tags below i and above j and the number of non-tag tokens
between i and j
• i.e., maximize
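The objective itself appears as a figure on the slide; a standard way of writing it (assuming tokens 0..N-1 and b_n as defined above) is:

maximize over i, j:   \sum_{n=0}^{i-1} b_n  +  \sum_{n=i}^{j} (1 - b_n)  +  \sum_{n=j+1}^{N-1} b_n

i.e., as many tags as possible outside the span [i, j] and as many non-tag (text) tokens as possible inside it.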
BoilerNet: Neural Sequence Labelling (LSTM)
based Approach

Leonhardt, Jurek, Avishek Anand, and Megha Khosla. "Boilerplate removal using a neural sequence labeling model." Companion Proceedings of the Web Conference 2020.
2020. https://github.com/mrjleo/boilernet
Basic Text Processing
ASR, OCR, etc.
OCR, ASR
• OCR and automatic speech recognition (ASR) produce noisy text. Output is text with
errors when compared with the original printed text or speech transcript
• Thanks to redundancy in text, it is possible to improve the output to some extent
• Problems may arise with short texts (there is less redundancy to exploit)

Transcript:
French prosecutors are investigating former Chilean strongman Augusto
Pinochet. The French justice minister may seek his extradition from
Britain. Three French families whose relatives disappeared in Chile
have filed a Complaint charging Pinochet with crimes against humanity.
The national court in Spain has ruled crimes committed by the
Pinochet regime fall under Spanish jurisdiction.

Speech recognizer output:


french prosecutors are investigating former chilean strongman of
coastal fish today the french justice minister may seek his
extradition from britain three french families whose relatives
disappeared until i have filed a complaint charging tenants say with
crimes against humanity the national court in spain has ruled crimes
committed by the tennessee with james all under spanish jurisdiction

Example ASR output and ground truth: see the transcript and recognizer output above.
Example OCR output and ground truth (figures on slides): original pages (.png images) shown alongside Google Vision’s output and ABBYY FineReader’s output.

Other Preprocessing Steps May Depend on
Document Genre..
• E.g., document layout extraction
• Extracting blocks
• Grouping subsequent blocks without headings

Ali, D., & Verstockt, S. (2021). Challenges in extraction and classification of news articles from historical newspapers.
In A. Maunoury (Ed.), The book of abstracts for What’s Past is Prologue : The NewsEye International Conference (pp. 8–9). Online: NewsEye.
Basic Text Processing
Byte Pair Encoding tokenization
More on tokenization
• Example:
• "Don't you love Transformers? We sure do.“
• Tokenizing by spaces:
• ["Don't", "you", "love", "Transformers?", "We", "sure", "do."]
• By spaces and punctuation:
• ["Don", "'", "t", "you", "love", "Transformers", "?", "We", "sure", "do", "."]
• By spaCy - a popular rule-based tokenizer:
• ["Do", "n't", "you", "love", "Transformers", "?", "We", "sure", "do", "."]
• Results in a large vocabulary (e.g., 267k words for Transformer XL)...

https://huggingface.co/docs/transformers/model_doc/transfo-xl
Dai, Zihang, et al. "Transformer-xl: Attentive language models beyond a fixed-length context." arXiv preprint arXiv:1901.02860 (2019).
Character tokenization?
• Transformers models often have a vocabulary size of less than 50k (when pretrained
only on a single language)
• Why not simply tokenize on characters?
• It would greatly reduce memory and time complexity
• But harder for the model to learn meaningful input representations
• E.g. learning a meaningful (context-independent) representation for the letter "t" vs. for the
word "tomorrow".
• Character tokenization in transformers is accompanied by a loss of performance
• To get the best of both worlds, transformers models usually use a hybrid solution between
word-level and character-level tokenization
Subword tokenization
• Subword tokenization algorithms rely on the principle that:
• Frequently used words should not be split into smaller subwords, but rare words
should be decomposed into meaningful subwords.
• E.g., "annoyingly" (a relatively rare word) can be decomposed into "annoying"
and "ly".
• "annoying" and "ly" as stand-alone subwords appear frequently while the meaning of
"annoyingly" is achieved by the composite meaning of "annoying" and "ly".
• Particularly useful in agglutinative languages such as Turkish
Subword tokenization
• Subword tokenization (because tokens are often parts of words)
• Data-driven approach (not grounded in morphosyntactic theory)
• Can however include common morphemes like -est or -er
• A morpheme is the smallest meaning-bearing unit of a language (unlikeliest has morphemes un-, likely,
and -est)
• Can encode rare and unknown words as sequences of subword units
• Often such words are new compositions from different subwords
Subword tokenization approaches
• Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)
• All have 2 parts:
• A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens)
• A token segmenter that takes a raw test text and tokenizes it according to that vocabulary

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ACL 2016
Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. ACL 2018
Schuster, M. and Nakajima, K. Japanese and korean voice search. ICASSP 2012
BERTTokenizer uses WordPiece

(uncased model, the sentence gets lowercased first)
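A small sketch with the Hugging Face transformers library (the example words are taken from the WordPiece slide later in this deck; the exact output depends on the model's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Hypatia lived at Fairfax in the 1910s"))
# expected along the lines of: ['h', '##yp', '##ati', '##a', 'lived', 'at', 'fairfax', 'in', 'the', '1910s']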


Byte Pair Encoding (BPE)
Word segmentation algorithm based on bottom-up clustering
Let the base vocabulary be the set of all individual characters = {A, B, C, D, …, a, b, c, d, …}
• Repeat:
• choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
• add a new merged symbol 'AB' to the vocabulary
• replace every adjacent 'A' 'B' in the corpus with 'AB'
• Until k merges have been done.

Bostrom, K. and Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. arXiv.
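A minimal, pure-Python sketch of this token learner (the function name and the single '_' end-of-word marker are choices made for this illustration; ties between equally frequent pairs are broken arbitrarily):

from collections import Counter

def learn_bpe(words, k):
    # words: training corpus as a list of whitespace-separated tokens
    vocab = Counter(tuple(w) + ("_",) for w in words)    # each word as symbols + end-of-word mark
    merges = []
    for _ in range(k):
        pairs = Counter()
        for word, freq in vocab.items():                 # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]              # most frequently adjacent pair
        merges.append((a, b))
        new_vocab = Counter()
        for word, freq in vocab.items():                 # replace every adjacent a b with 'ab'
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# toy corpus from the walkthrough on the following slides
corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
print(learn_bpe(corpus, 3))   # typically: [('e', 'r'), ('er', '_'), ('n', 'e')], matching the walkthrough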
Byte Pair Encoding (BPE) (SentencePiece
variant)
• Pre-tokenization step to produce a set of unique words: most subword
algorithms are run inside white-space separated tokens
• First add a special end-of-word symbol '__' before whitespace in training
corpus
• Next, separate into letters to form base vocabulary (base vocabulary consists
of all symbols that occur in the set of unique words)

Kudo et al., 2018 , SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
https://arxiv.org/pdf/1808.06226.pdf
BPE token learner
Suppose there is a simple corpus as follows:

low low low low low lowest lowest newer newer newer newer newer
newer wider wider wider new new

Add end-of-word tokens and segment:


BPE token learner (continued)
Merges learned on this corpus, in order:
• Merge e r → er
• Merge er _ → er_
• Merge n e → ne
The next merges are shown as figures on the slides.
BPE
• On the test data, run all merges learned from the training data:
• Greedily
• In the order we learned them
• (test frequencies don't play a role now)
• So: merge every e r to er, then merge er _ to er_, etc.
• Result:
• Test set’s "n e w e r _" would be tokenized as a full word
• Test set’s "l o w e r _" (unseen word) would be tokenized as two tokens: "low
er_"

low low low low low lowest lowest newer newer newer newer newer
newer wider wider wider new new
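A matching sketch of the token segmenter: split the test word into characters plus the end-of-word mark and greedily apply the learned merges in the order they were learned. The merge list below extends the three merges from the walkthrough with a plausible continuation (the later merges were only shown as a figure, so they are assumptions):

def apply_bpe(word, merges):
    symbols = list(word) + ["_"]                 # character split + end-of-word mark
    for a, b in merges:                          # apply merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("new", "er_"), ("l", "o"), ("lo", "w")]            # assumed learned merge sequence
print(apply_bpe("newer", merges))   # -> ['newer_']      (seen word: a single full-word token)
print(apply_bpe("lower", merges))   # -> ['low', 'er_']  (unseen word: two subword tokens)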
Unknowns
• Some new words may include symbols that were not in the base
vocabulary:
• E.g., likely to happen for very special characters like emojis.
• The vocabulary size, i.e. the base vocabulary size + the number of merges,
is a hyperparameter to choose.
• E.g., GPT-1 has a vocabulary size of 40,478: 478 base characters plus 40,000 learned merges.

Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever, https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
WordPiece and BERT
• BERT uses a variant of the WordPiece model
• Unlike BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes
the likelihood of the training data once added to the vocabulary.
• It means finding the symbol pair, whose probability divided by the probabilities of its first symbol
followed by its second symbol is the greatest among all symbol pairs.
• E.g. "u", followed by "g" would have only been merged if the probability of "ug" divided by "u", "g" would
have been greater than for any other symbol pairs
• (Relatively) common words are in the vocabulary:
• at, fairfax, 1910s
• Other words are built from word pieces:
• hypatia = h ##yp ##ati ##a

https://towardsdatascience.com/how-to-build-a-wordpiece-tokenizer-for-bert-f505d97dddbb
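In symbols (not taken from the slides, but the criterion is commonly written this way): at each step WordPiece merges the pair (A, B) that maximizes

score(A, B) = count(AB) / ( count(A) · count(B) ),  equivalently P(AB) / (P(A) · P(B))

whereas BPE simply maximizes count(AB).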
SentencePiece
• Most tokenization algorithms have the same problem:
• It is assumed that the input text uses spaces to separate words.
• But, not all languages use spaces to separate words.
• A possible solution is to use language-specific pre-tokenizers (e.g., XLM uses a specific Chinese, Japanese, and Thai pre-tokenizer).
• To solve this problem more generally:
• SentencePiece: A simple and language-independent subword tokenizer and detokenizer for
Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, including the
space in the set of characters and uses BPE (as actually shown in our example earlier)

Kudo et al., 2018, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
https://arxiv.org/pdf/1808.06226.pdf
XLM: https://huggingface.co/docs/transformers/model_doc/xlm
Paper 1

https://aclanthology.org/2021.acl-long.243.pdf
Paper 2

https://aclanthology.org/2021.ranlp-1.167.pdf
Paper 3

https://apps.dtic.mil/sti/pdfs/ADA586366.pdf
