EBUS622 - Week 5 - Lecture - Text Preparation
Dr Yuanjun Feng
University of Liverpool Management School
Email: Yuanjun.Feng2@liverpool.ac.uk
Office hour: Wednesday, 11:20-12:20, Room 125a
Overview
● Text Preprocessing
Two Approaches to Text Mining
Text Mining Workflow
(Figure: workflow from text preparation and text pre-processing to a bag-of-words representation with frequency-of-occurrence counts.)
Typical Process of Text Mining
1. Transform text into structured data.
2. Apply traditional data mining techniques to the above structured data.
Source: Yanchang Zhao, R and Data Mining: Examples and Case Studies.
Text Preparation Steps
Text Preparation Steps
The lecture notes for this part draw mainly on Anandarajan et al. (2019), Chapter 4.
Anandarajan, M., Hill, C., & Nolan, T. (2019). Practical Text Analytics: Maximizing the Value of Text Data (Advances in Analytics and Data Science, Vol. 2). Springer.
Text Pre-processing Steps
Pre-processing removes unnecessary information from the original text. In text mining, far more time is spent preparing and pre-processing the text data than in the analytics itself. The main steps are:
• Tokenization
• Standardization and cleansing
• Stop word removal
• Stemming
Hierarchy of terms and documents (Cont)
An example of a document collection in which ten people were told to
envision and describe their dog.
• The dog could be wearing dog clothes.
• Some of these people describe their own dogs, while others describe
a fictional dog.
Each document is made up of terms, and the documents together form the document collection, or corpus:

Document 1: My favorite dog is fluffy and tan.
Document 2: the dog is brown and cat is brown
Document 3: My favorite hat is brown and coat is pink
Document 4: My dog has a hat and leash.
Document 5: He has a fluffy coat and brown coats.
Document 6: The dog is brown and fluffy & has a brown coat.
Document 7: MY dog is white with brown spots.
Document 8: The white dog has a Pink coat and the Brown dog is fluffy
Document 9: The 3 fluffy dogs AND 2 brown hats are my favorites!
Document 10: MY fluffy dog has a white coat and hat.
Tokenisation
Text is split into a “bag of words”. Normally a single word is referred to as a “token”. Tokens serve as the features of the analysis. Whitespace or punctuation marks are often used to tokenise.
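As a minimal illustration in base R (not part of the slides), whitespace-and-punctuation tokenisation can be sketched with strsplit(); the sample sentence is Document 1 above:

```r
# Tokenise by splitting on runs of whitespace and punctuation
text <- "My favorite dog is fluffy and tan."
tokens <- unlist(strsplit(text, "[[:space:][:punct:]]+"))
tokens
```

Each resulting token is a candidate feature; note that case and inflection are still untouched at this stage.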
Standardization and Cleaning
Purpose: making the terms in each document comparable
o For example, we do not want “character”, “character,” and
“Character” to be considered separate items.
• Case conversion (e.g., U to u; A to a)
• Number removal (e.g., 0, 2, 123)
• Punctuation mark removal (e.g., ; ? . !)
• Special characters removal (e.g., ♥)
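A base-R sketch of these cleaning steps (illustrative only; the slides perform them later with the tm package), applied to Document 9 above:

```r
text <- "The 3 fluffy dogs AND 2 brown hats are my favorites!"
text <- tolower(text)                    # case conversion
text <- gsub("[0-9]+", "", text)         # number removal
text <- gsub("[[:punct:]]", "", text)    # punctuation removal
text <- gsub("\\s+", " ", trimws(text))  # collapse the leftover whitespace
text
```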
Stop Word Removal
Purpose: reducing the dimensionality of text documents
Stop words are common words that contribute little to meaning or provide little information in terms of content, e.g.:
• Articles – a, an, the
• Conjunctions – and, or
• Prepositions – as, by, of
• Pronouns – you, she, he, it
• Words that have high frequencies across all documents but do not provide informational content are good candidates for removal.
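A base-R sketch of stop word removal over tokens (the stop word list here is a tiny illustrative one, not tm's built-in English list), using the tokens of Document 2 above:

```r
tokens <- c("the", "dog", "is", "brown", "and", "cat", "is", "brown")
stop_words <- c("a", "an", "the", "and", "or", "is")  # tiny illustrative list
kept <- tokens[!tokens %in% stop_words]               # drop stop words
kept
```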
Stemming
Stemming is the process of reducing
derived words to their stem or root form.
Purpose: reduce the number of unique tokens within the analysis set by
combining words that contain the same root into a single token
• Typically achieved by removing prefixes and suffixes such as -ing, -s, -er, and -ed, and converting words to their singular form.
• For example: “sailor”, “sailing”, “sailed” → “sail”
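A deliberately crude suffix-stripping sketch in base R; real stemmers such as the Porter stemmer (used by tm's stemDocument()) apply a much larger rule set:

```r
words <- c("sailing", "sailed", "sails")
stems <- sub("(ing|ed|er|s)$", "", words)  # strip one common suffix, if present
stems
```

Such one-rule stripping both overstems and understems; for instance, it would not reduce "sailor" to "sail", which is why rule-based stemmers like Porter's are used in practice.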
Outcomes of text preprocessing
Term-Document Representation
Finally, to prepare the “bag of words” for analysis, we use a specific
type of matrix to represent a collection of documents in numeric format.
For example, a cell value of 4 records that the word “algorithm” appears four times in Document 5.
Matrix
• Text analysis is made possible using some concepts from matrix algebra.
• A matrix is a two-dimensional array with m rows and n columns.
• Each entry in the matrix is indexed as a_mn, where m is the row number and n is the column number of the entry.
(Figure: an example matrix with m = 5 rows and n = 5 columns.)
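Matrix indexing works the same way in R (a small base-R example, not from the slides):

```r
# A 2 x 3 matrix; R fills it column by column by default
a <- matrix(1:6, nrow = 2, ncol = 3)
a[2, 3]  # the entry in row m = 2, column n = 3
```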
Uses of the Term-Document Matrix
Visualisation
• Heat map
• Bar plot
• Word cloud, etc.
Text analysis
• Clustering
• Classification
• Topic models
• Sentiment analysis, etc.
   favorite dog fluffy tan brown cat hat coat pink leash
D1        1   1      1   1     0   0   0    0    0     0
D2        0   1      0   0     2   1   0    0    0     0
D3        1   0      0   0     1   0   1    1    1     0
D4        0   1      0   0     0   0   1    0    0     1
D5        0   0      1   0     1   0   0    2    0     0
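Before turning to tm, here is a base-R sketch of how such a document-term matrix can be built from already-tokenised documents (illustrative only, using the first two cleaned “dog” documents):

```r
docs <- list(
  D1 = c("favorite", "dog", "fluffy", "tan"),
  D2 = c("dog", "brown", "cat", "brown")
)
terms <- sort(unique(unlist(docs)))  # the vocabulary across all documents
# Count each term per document; rows = documents, columns = terms
dtm <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm
```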
Text Preparation using RStudio
- Taking the example of the “dog description” documents
Install and load package
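The code for this slide is not included in the extract; the standard way to install and load tm from CRAN would be (a sketch, not necessarily the slide's exact code):

```r
install.packages("tm")  # one-time installation from CRAN
library(tm)             # load the package for this session
```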
Create corpus
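The data frame df used below is not shown in this extract; a minimal stand-in with a description column (hypothetical, built from the first two “dog” documents) would be:

```r
# A minimal df with one description per document
df <- data.frame(
  description = c(
    "My favorite dog is fluffy and tan.",
    "the dog is brown and cat is brown"
  ),
  stringsAsFactors = FALSE
)
```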
# Convert df to a corpus
myCorpus <- Corpus(VectorSource(df$description))
Display the corpus
# Display the first five lines in the corpus
inspect(myCorpus[1:5])
Text preprocessing by tm_map()
1. Lower case conversion
tm_map(myCorpus, content_transformer(tolower))
2. Punctuation removal
tm_map(myCorpus, removePunctuation)
3. Numbers removal
tm_map(myCorpus, removeNumbers)
4. Stop words removal
tm_map(myCorpus, removeWords, stopwords("english"))
5. Stemming
tm_map(myCorpus, stemDocument)
Lower case conversion
# First, take a copy so we do not lose the original corpus
originalBackup <- myCorpus
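The conversion call itself is not shown in this extract. With current versions of tm, a plain function such as tolower should be wrapped in content_transformer() so the result remains a corpus (a hedged sketch, not necessarily the slide's exact code):

```r
# Lower-case conversion and display the first five lines
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
inspect(myCorpus[1:5])
```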
Punctuation removal
# Punctuation removal and display the first five lines
myCorpus <- tm_map(myCorpus, removePunctuation)
inspect(myCorpus[1:5])
Numbers removal
# Numbers removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeNumbers)
inspect(myCorpus[6:10])
Stop words removal
# Stop words removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeWords,
stopwords("english"))
inspect(myCorpus[6:10])
Stemming
# Stemming and display the last five lines
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[6:10])
Term-document matrix creation
# Create a term-document matrix
TDM <- TermDocumentMatrix(myCorpus)
# Convert the TDM to a regular matrix
# (note: this name masks base R's matrix() function; a name such as tdm_matrix is safer)
matrix <- as.matrix(TDM)
# Display the matrix
matrix