EBUS622 - Week 5 - Lecture - Text Preparation

Week 5 Lecture

Text mining – Text Preparation

Dr Yuanjun Feng
University of Liverpool Management School
Email: Yuanjun.Feng2@liverpool.ac.uk
Office hour: Wednesday, 11:20-12:20, Room 125a
Overview

● Text Mining Workflow

● Text Preprocessing

● Term Document Representation

● Text Preparation using RStudio

Two approaches to Text Mining

[Figure: diagram of the two approaches to text mining.]
Text Mining Workflow

[Figure: workflow from text preparation and text preprocessing to a bag-of-words representation with term frequencies of occurrence.]
Typical Process of Text Mining
1. Transform text into structured data.
2. Apply traditional data mining techniques to the above structured data.

Source: Yanchang Zhao. R and Data Mining: Examples and Case Studies.
Text Preparation Steps

1. Text Preprocessing (tokenization, number removal, punctuation removal, case conversion, stop word removal, stemming)

2. Term-Document Representation (term-document matrix, document-term matrix)
The lecture notes for this part were created mainly by referring to Anandarajan et al. (2019), Chapter 4:
Anandarajan, M., Hill, C., & Nolan, T. (2019). Practical Text Analytics: Maximizing the Value of Text Data (Advances in Analytics and Data Science, Vol. 2). Springer.
Text Pre-processing Steps

 Text pre-processing takes an input of raw text and returns cleansed tokens.
 It removes unnecessary information from the original text.
 In text mining, far more time is spent preparing and pre-processing the text data than on the analytics itself.
 Diligent and detailed work in pre-processing makes the analysis process smoother.

The steps, in order: Tokenization → Standardization and cleansing → Stop word removal → Stemming
What is a Document?

● For many text-mining tasks, the “document” is obvious: for example, e-mails, tweets, webpages, newspapers, customer complaints, or call records.
● We might have different types of text data.
● In longer documents, the entire document may be used, or it may be broken up into sections, paragraphs, or sentences.
Hierarchy of terms and documents

• Characters are combined to form words, or terms.
• A document is typically made up of many terms.
• Many documents together make up a document collection, or corpus.
Hierarchy of terms and documents (Cont)

An example of a document collection in which ten people were told to envision and describe their dog.
• The dog could be wearing dog clothes.
• Some of these people describe their own dogs, while others describe a fictional dog.

The ten descriptions below are the documents; together they form the document collection (corpus), and the individual words are the terms.

Document 1: My favorite dog is fluffy and tan.
Document 2: the dog is brown and cat is brown
Document 3: My favorite hat is brown and coat is pink
Document 4: My dog has a hat and leash.
Document 5: He has a fluffy coat and brown coats.
Document 6: The dog is brown and fluffy & has a brown coat.
Document 7: MY dog is white with brown spots.
Document 8: The white dog has a Pink coat and the Brown dog is fluffy
Document 9: The 3 fluffy dogs AND 2 brown hats are my favorites!
Document 10: MY fluffy dog has a white coat and hat.
Tokenisation

 Text is split into a “bag of words”.
 Normally, a single word is referred to as a “token”.
 Tokens serve as the features of the analysis.
 Whitespace or punctuation marks are often used to tokenise.

My favourite dog is fluffy.
[My] [favourite] [dog] [is] [fluffy] [.]
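As a sketch of what a tokeniser does, the example above can be reproduced in base R, splitting on whitespace and then separating trailing punctuation with a zero-width regex (a minimal illustration, not a production tokeniser):

```r
# Minimal whitespace tokeniser; trailing punctuation is split off
# with a zero-width lookaround pattern (illustration only).
text <- "My favourite dog is fluffy."
tokens <- unlist(strsplit(text, "\\s+"))  # split on whitespace
tokens <- unlist(strsplit(tokens, "(?<=\\w)(?=[[:punct:]])", perl = TRUE))
print(tokens)  # "My" "favourite" "dog" "is" "fluffy" "."
```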
Standardization and Cleaning

Purpose: making the terms in each document comparable.
• For example, we do not want “character”, “character,” and “Character” to be considered separate items.
• Case conversion (e.g., U to u; A to a)
• Number removal (e.g., 0, 2, 123)
• Punctuation mark removal (e.g., ; ? . !)
• Special character removal (e.g., ♥)
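These cleaning steps can be sketched with base R string functions (a minimal version; the tm package used later in this lecture provides ready-made equivalents):

```r
# Standardisation and cleaning with base R (illustrative only)
doc <- "The 3 fluffy dogs AND 2 brown hats are my favorites!"
doc <- tolower(doc)                    # case conversion
doc <- gsub("[0-9]+", "", doc)         # number removal
doc <- gsub("[[:punct:]]+", "", doc)   # punctuation mark removal
doc <- gsub("\\s+", " ", trimws(doc))  # collapse leftover whitespace
print(doc)  # "the fluffy dogs and brown hats are my favorites"
```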
Stop Word Removal

Purpose: reducing the dimensionality of text documents.
Stop words are common words that do not contribute highly to meaning or provide little information in terms of content, e.g.:
• Articles – a, an, the
• Conjunctions – and, or
• Prepositions – as, by, of
• Pronouns – you, she, he, it

 A collection of stop words is known as a stop list or dictionary. For example, the SMART dictionary (Salton 1971) has a stop word list containing 571 words: a, and, has, he, is, my, the, was, etc.
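Stop word removal amounts to a set operation on the token list. A base R sketch (the stop list here is a small illustrative subset, not the full SMART dictionary):

```r
# Drop tokens that appear in the stop list (illustrative subset)
tokens   <- c("my", "favorite", "dog", "is", "fluffy", "and", "tan")
stoplist <- c("a", "and", "has", "he", "is", "my", "the", "was")
kept <- tokens[!tokens %in% stoplist]
print(kept)  # "favorite" "dog" "fluffy" "tan"
```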
Stop Word Removal (Cont)

An alternative to using standard stop word dictionaries is to create a custom dictionary.
• Sometimes we also want to drop domain- or project-specific words.
• Words that have high frequencies across all documents but do not provide informational content are good candidates for removal.

Document 1: My favorite dog is fluffy and tan.
Document 2: the dog is brown and cat is brown
Document 3: My favorite hat is brown and coat is pink
Document 4: My dog has a hat and leash.
Document 5: He has a fluffy coat and brown coats.
Document 6: The dog is brown and fluffy & has a brown coat.
Document 7: MY dog is white with brown spots.
Document 8: The white dog has a Pink coat and the Brown dog is fluffy
Document 9: The 3 fluffy dogs AND 2 brown hats are my favorites!
Document 10: MY fluffy dog has a white coat and hat.
Stemming

Stemming is the process of reducing derived words to their stem or root form.
Purpose: reduce the number of unique tokens within the analysis set by combining words that contain the same root into a single token.
• Typically achieved by removing prefixes and suffixes – ing, s, er, ed, etc. – and converting words to their singular form.
• For example: “sailor”, “sailing”, “sailed” → “sail”
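A toy suffix-stripper illustrates the idea; real stemmers, such as the Porter stemmer in the SnowballC package used later via stemDocument, apply much more careful rules:

```r
# Naive suffix stripping (illustration only; use SnowballC in practice)
words <- c("sailing", "sailed", "dogs", "coats")
stems <- sub("(ing|ed|s)$", "", words)
print(stems)  # "sail" "sail" "dog" "coat"
```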
Outcomes of text preprocessing

[Figure: the documents as raw text alongside the corresponding preprocessed text.]
Term-Document Representation

Finally, to prepare the “bag of words” for analysis, we use a specific type of matrix to represent a collection of documents in numeric format.

[Figure: example matrix in which the word “algorithm” appears four times in Document 5.]
Matrix

• Text analysis is made possible using some concepts from matrix algebra.
• A matrix is a two-dimensional array with m rows and n columns.
• Each entry in the matrix is indexed as a_mn, where m is the row number and n is the column number of the entry.
• The matrix models the textual information of a document collection in two dimensions.

[Figure: an example matrix with m = 5 rows and n = 5 columns.]
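In R, a matrix uses exactly this row-then-column indexing, which we will rely on when inspecting the term-document matrix later:

```r
# a[m, n] selects the entry in row m, column n
a <- matrix(1:6, nrow = 2, ncol = 3)  # filled column-wise by default
print(a[2, 3])  # row 2, column 3
print(dim(a))   # 2 rows, 3 columns
```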


Term-document matrix vs. Document-term matrix

• In a term-document matrix (TDM), the rows correspond to terms, and the columns correspond to documents.
• In a document-term matrix (DTM), the rows correspond to documents, and the columns correspond to terms.
• The entries in a TDM or DTM represent the frequencies of terms in documents.
• The only difference between the two is the placement of the terms and documents, and either can be created and used in text mining analysis.
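Since the only difference is the placement of rows and columns, one representation is simply the transpose of the other. A small sketch with illustrative counts:

```r
# A tiny term-document matrix (counts are illustrative)
tdm <- matrix(c(1, 1, 0,    # column D1
                1, 2, 1),   # column D2
              nrow = 3,
              dimnames = list(c("dog", "brown", "cat"), c("D1", "D2")))
dtm <- t(tdm)               # the document-term matrix is the transpose
print(dtm["D2", "brown"])   # frequency of "brown" in document 2
```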
Term-document matrix of the “dog description” example

[Figure: term-document matrix with the terms as rows and the ten documents as columns.]
Uses of Term-document matrix

Visualisation
• Heat map
• Bar plot
• Word cloud, etc.

Text analysis
• Clustering
• Classification
• Topic models
• Sentiment analysis, etc.

[Figure: heat map visualising the term-document matrix.]
Build a Term-document matrix

Step 1. Identify the unique terms.
Step 2. Count the frequency of each term in each document.

[Figure: the five preprocessed documents, D1–D5.]

Next, we use this example to show how to build a term-document matrix.
Build a Term-document matrix (Cont)

Step 1. Identify the unique terms.
Ten unique terms: [favorite] [dog] [fluffy] [tan] [brown] [cat] [hat] [coat] [pink] [leash]

Step 2. Count the frequency of each term in each document.
[Figure: an empty term-document matrix with the ten terms as rows and D1–D5 as columns, ready to be filled in.]
Build a Term-document matrix (Cont)

Step 1. Identify the unique terms.
Ten unique terms: [favorite] [dog] [fluffy] [tan] [brown] [cat] [hat] [coat] [pink] [leash]

Step 2. Count the frequency of each term in each document.

           D1  D2  D3  D4  D5
favorite    1   0   1   0   0
dog         1   1   0   1   0
fluffy      1   0   0   0   1
tan         1   0   0   0   0
brown       0   2   1   0   1
cat         0   1   0   0   0
hat         0   0   1   1   0
coat        0   0   1   0   2
pink        0   0   1   0   0
leash       0   0   0   1   0
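The same two steps can be carried out in one go with base R's table(), which both identifies the unique terms and counts them per document (the document contents below are taken from the worked example):

```r
# Build a term-document matrix from preprocessed tokens
docs <- list(
  D1 = c("favorite", "dog", "fluffy", "tan"),
  D2 = c("dog", "brown", "cat", "brown"),
  D3 = c("favorite", "hat", "brown", "coat", "pink"),
  D4 = c("dog", "hat", "leash"),
  D5 = c("fluffy", "coat", "brown", "coat")
)
long <- data.frame(term = unlist(docs),
                   doc  = rep(names(docs), lengths(docs)))
tdm  <- table(long$term, long$doc)  # rows: terms, columns: documents
print(tdm["brown", ])               # counts of "brown" in D1..D5
```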
Build a Document-term matrix

favorite dog fluffy tan brown cat hat coat pink leash

D1 1 1 1 1 0 0 0 0 0 0
D2 0 1 0 0 2 1 0 0 0 0
D3 1 0 0 0 1 0 1 1 1 0
D4 0 1 0 0 0 0 1 0 0 1
D5 0 0 1 0 1 0 0 2 0 0

Text Preparation using RStudio
- Taking the example of the “dog description” documents

Step 1: Install and load packages
Step 2: Load raw text
Step 3: Create corpus
Step 4: Text preprocessing (lower case, punctuation, number, stop word, stemming)
Step 5: Build a term-document matrix
Install and load packages

# Install packages “tm” and “SnowballC”
install.packages("tm")
install.packages("SnowballC")
Note: package “tm” is for text mining; package “SnowballC” is for stemming.

# Load packages into R
library("tm")
library("SnowballC")
Load raw text

# Load "Dog_Raw.csv" into R
df <- read.csv("Dog_Raw.csv")
Note: we create a data frame “df” to store the text data in the csv file.

# View the loaded data
View(df)
Create corpus

# Convert df to a corpus
myCorpus <- Corpus(VectorSource(df$description))

Note: the $ operator extracts the values of a specific column in a data frame.
The VectorSource(x) function creates a vector source that interprets each element of the vector x as a document.
The Corpus(x) function creates a corpus from a document source.
Display the corpus
# Display the first five lines in the corpus
inspect(myCorpus[1:5])

Text preprocessing by tm_map()

1. Lower case conversion
tm_map(myCorpus, tolower)
2. Punctuation removal
tm_map(myCorpus, removePunctuation)
3. Numbers removal
tm_map(myCorpus, removeNumbers)
4. Stop words removal
tm_map(myCorpus, removeWords, stopwords("english"))
5. Stemming
tm_map(myCorpus, stemDocument)

Note: recent versions of the tm package recommend wrapping base functions such as tolower with content_transformer(), e.g., tm_map(myCorpus, content_transformer(tolower)), to keep the corpus structure intact.
Lower case conversion

# First, take a copy to avoid losing the original corpus
originalBackup <- myCorpus

# Lower case conversion and display the first five lines
myCorpus <- tm_map(myCorpus, tolower)
inspect(myCorpus[1:5])
Punctuation removal
# Punctuation removal and display the first five lines
myCorpus <- tm_map(myCorpus, removePunctuation)
inspect(myCorpus[1:5])

Numbers removal
# Numbers removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeNumbers)
inspect(myCorpus[6:10])

Stop words removal

# Stop words removal and display the last five lines
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
inspect(myCorpus[6:10])
Stemming
# Stemming and display the last five lines
myCorpus <- tm_map(myCorpus, stemDocument)

inspect(myCorpus[6:10])

Term-document matrix creation

# Create a term-document matrix
TDM <- TermDocumentMatrix(myCorpus)
# Convert the TDM to a normal matrix
matrix <- as.matrix(TDM)
# Display the matrix
matrix

Note: alternatively, you can use inspect(TDM) to display the TDM. It will give you only a partial display as a sample, depending on the dimensions of your TDM. Also note that the name “matrix” shadows the built-in matrix() function; a name such as tdmMatrix avoids this.
Summary

● Text Mining Workflow (document extraction, transformation, feature extraction, dimensionality reduction, standard data mining)

● Text Preprocessing (tokenization, number removal, punctuation removal, case conversion, stop word removal, stemming)

● Term Document Representation (term-document matrix vs. document-term matrix; building a term-document matrix)

● Text Preparation using RStudio (Dog_Raw.csv)
