EBUS622 - Week 5 - Lecture - Text Preparation

Week 5 Lecture

Text mining – Text Preparation

Dr Yuanjun Feng
University of Liverpool Management School
Email: Yuanjun.Feng2@liverpool.ac.uk
Office hour: Wednesday, 11:20-12:20, Room 125a
Overview

● Text Mining Workflow

● Text Preprocessing

● Term Document Representation

● Text Preparation using RStudio

Two approaches to Text Mining

[Figure: diagram of the two approaches to text mining.]
Text Mining Workflow

[Figure: workflow from text preparation and text preprocessing to a bag-of-words representation with term frequencies of occurrence.]
Typical Process of Text Mining
1. Transform text into structured data.
2. Apply traditional data mining techniques to the above structured data.

Source: Yanchang Zhao. R and Data Mining: Examples and Case Studies.
Text Preparation Steps

1. Text Preprocessing (tokenization, number removal, punctuation removal, case conversion, stop word removal, stemming)

2. Term-Document Representation (term-document matrix, document-term matrix)
The lecture notes for this part were created mainly by referring to Anandarajan et al. (2019), Chapter 4:
Anandarajan, M., Hill, C., & Nolan, T. (2019). Practical Text Analytics: Maximizing the Value of Text Data (Advances in Analytics and Data Science, Vol. 2). Springer.
Text Pre-processing Steps

 Text pre-processing takes an input of raw text and returns cleansed tokens.
 It removes unnecessary information from the original text.
 In text mining, far more time is spent preparing and pre-processing the text data than on the analytics itself.
 Diligent and detailed work in pre-processing makes the analysis process smoother.

The steps, in order: Tokenization → Standardization and cleansing → Stop word removal → Stemming
What is a Document?

● For many text-mining tasks, the “document” is obvious: for example, e-mails, tweets, webpages, newspapers, customer complaints, or call records.
● We might have different types of text data.
● In longer documents, the entire document may be used, or it may be broken up into sections, paragraphs, or sentences.
Hierarchy of terms and documents

• Characters are combined to form words, or terms.
• A document is typically made up of many terms.
• Many documents together make up a document collection, or corpus.
Hierarchy of terms and documents (Cont)

An example of a document collection in which ten people were told to envision and describe their dog.
• The dog could be wearing dog clothes.
• Some of these people describe their own dogs, while others describe a fictional dog.

The ten descriptions below are the documents; together they form the document collection (corpus), and the individual words are the terms.

Document 1: My favorite dog is fluffy and tan.
Document 2: the dog is brown and cat is brown
Document 3: My favorite hat is brown and coat is pink
Document 4: My dog has a hat and leash.
Document 5: He has a fluffy coat and brown coats.
Document 6: The dog is brown and fluffy & has a brown coat.
Document 7: MY dog is white with brown spots.
Document 8: The white dog has a Pink coat and the Brown dog is fluffy
Document 9: The 3 fluffy dogs AND 2 brown hats are my favorites!
Document 10: MY fluffy dog has a white coat and hat.
Tokenisation

 Text is split into a “bag of words”.
 Normally, a single word is referred to as a “token”.
 Tokens serve as the features of the analysis.
 Whitespace or punctuation marks are often used to tokenise.

My favourite dog is fluffy.
[My] [favourite] [dog] [is] [fluffy] [.]
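As a sketch of what a tokeniser does, the example above can be reproduced in base R, splitting on whitespace and then separating trailing punctuation with a zero-width regex (a minimal illustration, not a production tokeniser):

```r
# Minimal whitespace tokeniser; trailing punctuation is split off
# with a zero-width lookaround pattern (illustration only).
text <- "My favourite dog is fluffy."
tokens <- unlist(strsplit(text, "\\s+"))  # split on whitespace
tokens <- unlist(strsplit(tokens, "(?<=\\w)(?=[[:punct:]])", perl = TRUE))
print(tokens)  # "My" "favourite" "dog" "is" "fluffy" "."
```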
Standardization and Cleaning

Purpose: making the terms in each document comparable.
• For example, we do not want “character”, “character,” and “Character” to be considered separate items.
• Case conversion (e.g., U to u; A to a)
• Number removal (e.g., 0, 2, 123)
• Punctuation mark removal (e.g., ; ? . !)
• Special character removal (e.g., ♥)
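These cleaning steps can be sketched with base R string functions (a minimal version; the tm package used later in this lecture provides ready-made equivalents):

```r
# Standardisation and cleaning with base R (illustrative only)
doc <- "The 3 fluffy dogs AND 2 brown hats are my favorites!"
doc <- tolower(doc)                    # case conversion
doc <- gsub("[0-9]+", "", doc)         # number removal
doc <- gsub("[[:punct:]]+", "", doc)   # punctuation mark removal
doc <- gsub("\\s+", " ", trimws(doc))  # collapse leftover whitespace
print(doc)  # "the fluffy dogs and brown hats are my favorites"
```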
Stop Word Removal

Purpose: reducing the dimensionality of text documents.
Stop words are common words that do not contribute highly to meaning or provide little information in terms of content, e.g.:
• Articles – a, an, the
• Conjunctions – and, or
• Prepositions – as, by, of
• Pronouns – you, she, he, it

 A collection of stop words is known as a stop list or dictionary. For example, the SMART dictionary (Salton 1971) has a stop word list containing 571 words: a, and, has, he, is, my, the, was, etc.
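Stop word removal amounts to a set operation on the token list. A base R sketch (the stop list here is a small illustrative subset, not the full SMART dictionary):

```r
# Drop tokens that appear in the stop list (illustrative subset)
tokens   <- c("my", "favorite", "dog", "is", "fluffy", "and", "tan")
stoplist <- c("a", "and", "has", "he", "is", "my", "the", "was")
kept <- tokens[!tokens %in% stoplist]
print(kept)  # "favorite" "dog" "fluffy" "tan"
```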
Stop Word Removal (Cont)

An alternative to using standard stop word dictionaries is to create a custom dictionary.
• Sometimes we also want to drop domain- or project-specific words.
• Words that have high frequencies across all documents but do not provide informational content are good candidates for removal.

Document 1: My favorite dog is fluffy and tan.
Document 2: the dog is brown and cat is brown
Document 3: My favorite hat is brown and coat is pink
Document 4: My dog has a hat and leash.
Document 5: He has a fluffy coat and brown coats.
Document 6: The dog is brown and fluffy & has a brown coat.
Document 7: MY dog is white with brown spots.
Document 8: The white dog has a Pink coat and the Brown dog is fluffy
Document 9: The 3 fluffy dogs AND 2 brown hats are my favorites!
Document 10: MY fluffy dog has a white coat and hat.
Stemming

Stemming is the process of reducing derived words to their stem or root form.
Purpose: reduce the number of unique tokens within the analysis set by combining words that contain the same root into a single token.
• Typically achieved by removing prefixes and suffixes – ing, s, er, ed, etc. – and converting words to their singular form.
• For example: “sailor”, “sailing”, “sailed” → “sail”
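A toy suffix-stripper illustrates the idea; real stemmers, such as the Porter stemmer in the SnowballC package used later via stemDocument, apply much more careful rules:

```r
# Naive suffix stripping (illustration only; use SnowballC in practice)
words <- c("sailing", "sailed", "dogs", "coats")
stems <- sub("(ing|ed|s)$", "", words)
print(stems)  # "sail" "sail" "dog" "coat"
```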
Outcomes of text preprocessing

[Figure: the documents as raw text alongside the corresponding preprocessed text.]
Term-Document Representation

Finally, to prepare the “bag of words” for analysis, we use a specific type of matrix to represent a collection of documents in numeric format.

[Figure: example matrix in which the word “algorithm” appears four times in Document 5.]
Matrix

• Text analysis is made possible using some concepts from matrix algebra.
• A matrix is a two-dimensional array with m rows and n columns.
• Each entry in the matrix is indexed as a_mn, where m is the row number and n is the column number of the entry.
• The matrix models the textual information of a document collection in two dimensions.

[Figure: an example matrix with m = 5 rows and n = 5 columns.]
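In R, a matrix uses exactly this row-then-column indexing, which we will rely on when inspecting the term-document matrix later:

```r
# a[m, n] selects the entry in row m, column n
a <- matrix(1:6, nrow = 2, ncol = 3)  # filled column-wise by default
print(a[2, 3])  # row 2, column 3
print(dim(a))   # 2 rows, 3 columns
```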


Term-document matrix vs. Document-term matrix

• In a term-document matrix (TDM), the rows correspond to terms, and the columns correspond to documents.
• In a document-term matrix (DTM), the rows correspond to documents, and the columns correspond to terms.
• The entries in a TDM or DTM represent the frequencies of terms in documents.
• The only difference between the two is the placement of the terms and documents, and either can be created and used in text mining analysis.
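Since the only difference is the placement of rows and columns, one representation is simply the transpose of the other. A small sketch with illustrative counts:

```r
# A tiny term-document matrix (counts are illustrative)
tdm <- matrix(c(1, 1, 0,    # column D1
                1, 2, 1),   # column D2
              nrow = 3,
              dimnames = list(c("dog", "brown", "cat"), c("D1", "D2")))
dtm <- t(tdm)               # the document-term matrix is the transpose
print(dtm["D2", "brown"])   # frequency of "brown" in document 2
```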
Term-document matrix of the “dog description” example

[Figure: term-document matrix with the terms as rows and the ten documents as columns.]
Uses of Term-document matrix

Visualisation
• Heat map
• Bar plot
• Word cloud, etc.

Text analysis
• Clustering
• Classification
• Topic models
• Sentiment analysis, etc.

[Figure: heat map visualising the term-document matrix.]
Build a Term-document matrix

Step 1. Identify the unique terms.
Step 2. Count the frequency of each term in each document.

[Figure: the five preprocessed documents, D1–D5.]

Next, we use this example to show how to build a term-document matrix.
Build a Term-document matrix (Cont)

Step 1. Identify the unique terms.
Ten unique terms: [favorite] [dog] [fluffy] [tan] [brown] [cat] [hat] [coat] [pink] [leash]

Step 2. Count the frequency of each term in each document.
[Figure: an empty term-document matrix with the ten terms as rows and D1–D5 as columns, ready to be filled in.]
Build a Term-document matrix (Cont)

Step 1. Identify the unique terms.
Ten unique terms: [favorite] [dog] [fluffy] [tan] [brown] [cat] [hat] [coat] [pink] [leash]

Step 2. Count the frequency of each term in each document.

           D1  D2  D3  D4  D5
favorite    1   0   1   0   0
dog         1   1   0   1   0
fluffy      1   0   0   0   1
tan         1   0   0   0   0
brown       0   2   1   0   1
cat         0   1   0   0   0
hat         0   0   1   1   0
coat        0   0   1   0   2
pink        0   0   1   0   0
leash       0   0   0   1   0
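The same two steps can be carried out in one go with base R's table(), which both identifies the unique terms and counts them per document (the document contents below are taken from the worked example):

```r
# Build a term-document matrix from preprocessed tokens
docs <- list(
  D1 = c("favorite", "dog", "fluffy", "tan"),
  D2 = c("dog", "brown", "cat", "brown"),
  D3 = c("favorite", "hat", "brown", "coat", "pink"),
  D4 = c("dog", "hat", "leash"),
  D5 = c("fluffy", "coat", "brown", "coat")
)
long <- data.frame(term = unlist(docs),
                   doc  = rep(names(docs), lengths(docs)))
tdm  <- table(long$term, long$doc)  # rows: terms, columns: documents
print(tdm["brown", ])               # counts of "brown" in D1..D5
```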
Build a Document-term matrix

favorite dog fluffy tan brown cat hat coat pink leash

D1 1 1 1 1 0 0 0 0 0 0
D2 0 1 0 0 2 1 0 0 0 0
D3 1 0 0 0 1 0 1 1 1 0
D4 0 1 0 0 0 0 1 0 0 1
D5 0 0 1 0 1 0 0 2 0 0

Text Preparation using RStudio
- Taking the example of the “dog description” documents

Step 1: Install and load packages
Step 2: Load raw text
Step 3: Create corpus
Step 4: Text preprocessing (lower case, punctuation, number, stop word, stemming)
Step 5: Build a term-document matrix
Install and load packages

# Install packages “tm” and “SnowballC”
install.packages("tm")
install.packages("SnowballC")
Note: package “tm” is for text mining; package “SnowballC” is for stemming.

# Load packages into R
library("tm")
library("SnowballC")
Load raw text

# Load "Dog_Raw.csv" into R
df <- read.csv("Dog_Raw.csv")
Note: we create a data frame “df” to store the text data in the csv file.

# View the loaded data
View(df)
Create corpus

# Convert df to a corpus
myCorpus <- Corpus(VectorSource(df$description))

Note: the $ operator extracts the values of a specific column in a data frame.
The VectorSource(x) function creates a vector source that interprets each element of the vector x as a document.
The Corpus(x) function creates a corpus from a document source.
Display the corpus
# Display the first five lines in the corpus
inspect(myCorpus[1:5])

Text preprocessing by tm_map()

1. Lower case conversion
tm_map(myCorpus, tolower)
2. Punctuation removal
tm_map(myCorpus, removePunctuation)
3. Numbers removal
tm_map(myCorpus, removeNumbers)
4. Stop words removal
tm_map(myCorpus, removeWords, stopwords("english"))
5. Stemming
tm_map(myCorpus, stemDocument)

Note: recent versions of the tm package recommend wrapping base functions such as tolower with content_transformer(), e.g., tm_map(myCorpus, content_transformer(tolower)), to keep the corpus structure intact.
Lower case conversion

# First, take a copy to avoid losing the original corpus
originalBackup <- myCorpus

# Lower case conversion and display the first five lines
myCorpus <- tm_map(myCorpus, tolower)
inspect(myCorpus[1:5])
Punctuation removal
# Punctuation removal and display the first five lines
myCorpus <- tm_map(myCorpus, removePunctuation)
inspect(myCorpus[1:5])

Numbers removal
# Numbers removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeNumbers)
inspect(myCorpus[6:10])

Stop words removal

# Stop words removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
inspect(myCorpus[6:10])
Stemming
# Stemming and display the last five lines
myCorpus <- tm_map(myCorpus, stemDocument)

inspect(myCorpus[6:10])

Term-document matrix creation

# Create a term-document matrix
TDM <- TermDocumentMatrix(myCorpus)
# Convert the TDM to a normal matrix
matrix <- as.matrix(TDM)
# Display the matrix
matrix

Note: alternatively, you can use inspect(TDM) to display the TDM. It will give you only a partial display as a sample, depending on the dimensions of your TDM. Also note that the name “matrix” shadows the built-in matrix() function; a name such as tdmMatrix avoids this.
Summary

● Text Mining Workflow (document extraction, transformation, feature extraction, dimensionality reduction, standard data mining)

● Text Preprocessing (tokenization, number removal, punctuation removal, case conversion, stop word removal, stemming)

● Term Document Representation (term-document matrix vs. document-term matrix; building a term-document matrix)

● Text Preparation using RStudio (Dog_Raw.csv)
