Introduction to Text Mining

Lotfi NAJDI
2 / 54
Introduction
Text is still some of the most valuable data out there for those who know how
to use it.

In this lab we will take on some of the most important tasks in working
with text:

Basic text analysis and pattern matching

Tokenization for extracting relevant insights from text data

Building machine learning models with text data

3 / 54
Business problem
You're a consultant for DelFalco's Italian Restaurant, and the owner has asked
you to identify whether there are any foods on their menu that diners find
disappointing.

The business owner suggested you use diner reviews from the Yelp
website to determine which dishes people liked and disliked.

4 / 54
Business problem
review_id: 109
stars: 4

text: i used to work food service and my manager at the time recommended i try defalco's. he knows food well so i was excited to try one of his favorites spots. this place is really, really good. lot of authentic italian choices and they even have a grocery section with tons of legit italian goodies. i had a chicken parmigiana sandwich that was to die for. anytime my ex-manager comes back to town (he left for vegas and i think he misses defalco's more than anything else in the valley), he is sure to stop by and grab his favorite grub. parking is a bit tricky during busy hours and the wait times for food can get a bit long, so i recommend calling your order ahead of time (unless you want to take a look around while you wait, first-timers).

5 / 54
Business problem
The owner also gave you this list of menu items and common alternate
spellings.

menu_items
cheese steak
cheesesteak
steak and cheese
italian combo
tiramisu
cannoli
chicken salad
chicken spinach salad
meatball
pizza
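
In R, this list can be stored as a character vector. A minimal sketch, using just the items shown above (the full list provided with the lab contains more entries):

menu_items <- c("cheese steak", "cheesesteak", "steak and cheese",
                "italian combo", "tiramisu", "cannoli",
                "chicken salad", "chicken spinach salad", "meatball",
                "pizza")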

6 / 54
Your turn 1
Before you get to the analysis, run the code to load the data from the provided
Excel file using the read_excel() function.

Explore the data using glimpse() and by displaying the first rows of
DelFalco_reviews.

01:00
7 / 54
Your turn 1
library(readxl) # read_excel() comes from readxl, which is not attached by library(tidyverse)

DelFalco_reviews <- read_excel("data/yelp_reviews.xlsx")

DelFalco_reviews %>% select(text, stars)

glimpse(DelFalco_reviews)
# Rows: 1,321
# Columns: 4
# $ review_id <dbl> 109, 1013, 1204, 1251, 1354, 1504, 1739,~
# $ stars <dbl> 4, 4, 5, 1, 2, 5, 5, 4, 5, 3, 3, 5, 3, 4~
# $ date <dttm> 2013-01-27, 2015-04-15, 2011-03-20, 201~
# $ text <chr> "i used to work food service and my mana~

8 / 54
Any idea to solve the problem?
Given the data from Yelp and the list of menu items, do you have any ideas for
how you could find which menu items have disappointed diners?

We could identify the items mentioned by each review, and then calculate
the average rating for those items.

We can then tell which foods are mentioned in reviews with low scores, so the
restaurant can fix the recipe or remove those foods from the menu.

Pattern matching is a common text mining task: matching text elements
within chunks of text or whole documents.

9 / 54
Pattern Matching
Items in one review
As a first step, we will write code to extract the foods mentioned in a single
review:

text
i felt like i was eating in a storage room. soup was good...sandwich nothing
special.bread was like pizza dough.....doughy in the middle and done on the
outside....service was very good....might go back for take out and try the pizza.
if you ever had a pat's philly cheesesteak in philly this place can't compare.

10 / 54
Pattern Matching
Items in one review
Using the str_detect() function from the stringr package, we will check whether
single_review contains, for example, pizza 🍕.

single_review <- DelFalco_reviews %>%
  slice(10) %>%
  pull(text) # pull() extracts the text column as a character vector, which str_detect() expects

single_review %>% str_detect("pizza")
# [1] TRUE

11 / 54
Pattern Matching
Items in one review
Then we will check whether single_review contains more than one item, using the
same syntax with the menu_items vector:

# [1] "cheese steak" "cheesesteak" "steak and cheese"


# [4] "italian combo" "tiramisu" "cannoli"
# [7] "chicken salad" "chicken spinach salad" "meatball"
# [10] "pizza"

single_review %>% str_detect( menu_items)

# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

12 / 54
Pattern Matching
Items in one review
In order to identify the items mentioned in each review, we can use str_extract()
instead of str_detect() to extract the matching patterns (menu_items) from the
review.

single_review %>% str_extract(menu_items)

# [1] NA "cheesesteak" NA NA NA
# [6] NA NA NA NA "pizza"

13 / 54
Your turn 2
Create a vector single_review containing just the text of the review at position 10.

Using the str_detect() function from the stringr package, check whether single_review
contains pizza 🍕.

Then try str_detect() with the menu_items vector. Can you explain the
result?

Use str_extract() instead of str_detect() to extract the matching patterns
(menu_items) from the review.

03:00
14 / 54
Your turn 2
str_detect
single_review <- DelFalco_reviews %>%
  slice(10) %>%
  pull(text) # extract the text column as a character vector
single_review %>% str_detect("pizza")

single_review %>% str_detect( menu_items)

str_extract
single_review %>%
  str_extract(menu_items)

15 / 54
Pattern Matching
Scale pattern matching to items mentioned in all
reviews
Now let's consider the whole dataset and collect ratings for each menu item.
Each review has a rating (stars).

For each review we will extract the list of matching items.

To accomplish this task we will use the map functions from the purrr package.

16 / 54
Pattern Matching
Scale pattern matching to items mentioned in all
reviews
The map functions transform their input by applying a function to each
element and returning a vector the same length as the input.

numbers1 <- c(10, 5, -6,18)


numbers1
# [1] 10 5 -6 18

numbers2 <- map(numbers1, ~ abs(.x) + 2)

numbers2
# [[1]]
# [1] 12
#
# [[2]]
# [1] 7
#
# [[3]]
# [1] 8
#
# [[4]]
# [1] 20

17 / 54
Pattern Matching
Scale pattern matching to items mentioned in all
reviews
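
Note that map() always returns a list. The typed variants simplify the result to an atomic vector; a minimal sketch with map_dbl():

numbers3 <- map_dbl(numbers1, ~ abs(.x) + 2) # same transformation, simplified to a double vector
numbers3
# [1] 12  7  8 20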

18 / 54
Your turn 3
Complete the following chunk in order to extract the list of matching items
for each review.

Check the type of the item column.

item_ratings <- DelFalco_reviews %>%
  mutate(item = map(text, ~ str_extract(.x, menu_items))) %>%
  select(item, stars, everything())

19 / 54
Your turn 3
The item column is a list-column.

Use the unnest() function from the tidyr package to put each element of
the list on its own row.

Drop rows with NA, then select just the item and stars columns.

item_ratings <- DelFalco_reviews %>%
  mutate(item = map(text, ~ str_extract(.x, menu_items))) %>%
  unnest(item) %>%
  drop_na() %>%
  select(item, stars)

20 / 54
Analysis
Now we are ready for further analysis

mean_ratings <- item_ratings %>%


group_by(item) %>%
summarise(average_rating = mean(stars) , number_reviews = n() )
mean_ratings
# # A tibble: 43 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 artichoke salad 5 5
# 2 calzone 4.48 98
# 3 calzones 4.54 35
# 4 cannoli 4.5 94
# 5 cheese steak 4.44 84
# 6 cheesesteak 4.45 106
# 7 chicken cutlet 3.55 11
# 8 chicken parm 4.26 88
# 9 chicken parmesan 4.26 19
# 10 chicken parmigiana 4.47 17
# # ... with 33 more rows

21 / 54
Analysis
The 10 best rated items

mean_ratings %>%
arrange(-average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 artichoke salad 5 5
# 2 corned beef 5 2
# 3 fettuccini alfredo 5 6
# 4 turkey breast 5 1
# 5 steak and cheese 4.89 9
# 6 reuben 4.75 4
# 7 prosciutto 4.68 50
# 8 purista 4.67 63
# 9 chicken salad 4.6 5
# 10 chicken pesto 4.56 27

22 / 54
Analysis
The 10 worst rated items

mean_ratings %>%
arrange(average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 chicken cutlet 3.55 11
# 2 spaghetti 3.89 36
# 3 italian beef 3.92 25
# 4 macaroni 4 5
# 5 tuna salad 4 5
# 6 turkey sandwich 4 6
# 7 italian combo 4.05 22
# 8 garlic bread 4.13 39
# 9 roast beef 4.14 7
# 10 eggplant 4.16 69

23 / 54
Your turn 4
Calculate the mean ratings and the number of reviews for each menu
item.

Display the 10 best rated items.

Display the 10 worst rated items.

mean_ratings <- item_ratings %>%


group_by(item) %>%
summarise(average_rating = mean(stars) , number_reviews = n() )

24 / 54
Your turn 4
mean_ratings %>%
arrange(average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 chicken cutlet 3.55 11
# 2 spaghetti 3.89 36
# 3 italian beef 3.92 25
# 4 macaroni 4 5
# 5 tuna salad 4 5
# 6 turkey sandwich 4 6
# 7 italian combo 4.05 22
# 8 garlic bread 4.13 39
# 9 roast beef 4.14 7
# 10 eggplant 4.16 69

25 / 54
Your turn 4
mean_ratings %>%
arrange(-average_rating) %>% slice(1:10)
# # A tibble: 10 x 3
# item average_rating number_reviews
# <chr> <dbl> <int>
# 1 artichoke salad 5 5
# 2 corned beef 5 2
# 3 fettuccini alfredo 5 6
# 4 turkey breast 5 1
# 5 steak and cheese 4.89 9
# 6 reuben 4.75 4
# 7 prosciutto 4.68 50
# 8 purista 4.67 63
# 9 chicken salad 4.6 5
# 10 chicken pesto 4.56 27

26 / 54
Matching patterns with regular
expressions
Regular expressions are a concise and flexible tool for describing patterns
in strings.

They take a little while to get your head around, but once you understand
them, you'll find them extremely useful.

27 / 54
Matching patterns with regular
expressions
Basic matches
The simplest patterns match exact strings:

fruit <- c("apple", "banana", "pear", "pineapple")


str_view(fruit, "an")

apple
b<an>ana
pear
pineapple

(matches are shown between < >)

28 / 54
Matching patterns with regular
expressions
The next step up in complexity is ., which matches any character (except a
newline):

fruit <- c("apple", "banana", "pear", "pineapple")


str_view(fruit, ".a.")

apple
<ban>ana
p<ear>
pin<eap>ple

29 / 54
Matching patterns with regular
expressions
Anchors
By default, regular expressions will match any part of a string. It’s often useful
to anchor the regular expression so that it matches from the start or end of the
string. You can use:

^ to match the start of the string.

$ to match the end of the string.

fruit <- c("apple", "banana", "pear", "pineapple")


str_view(fruit, "^a")

<a>pple
banana
pear
pineapple
30 / 54
Matching patterns with regular
expressions
Anchors

fruit <- c("apple", "banana", "pear", "pineapple")


str_view(fruit, "a$")

apple
banan<a>
pear
pineapple

31 / 54
Matching patterns with regular
expressions
Character classes and alternatives
There are a number of special patterns that match more than one character.

There are four other useful tools:

\d : matches any digit.

\s : matches any whitespace (e.g. space, tab, newline).

[abc] : matches a, b, or c.

[^abc]: matches anything except a, b, or c.
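
For instance, a minimal sketch of \d and \s in action (in an R string, a literal backslash must itself be escaped, so the patterns are written as "\\d" and "\\s"):

str_view_all("Exit 238, Lebanon", "\\d") # highlights each digit: 2, 3, 8
str_view_all("Exit 238, Lebanon", "\\s") # highlights the two spaces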

32 / 54
Matching patterns with regular
expressions
Character classes and alternatives

str_view_all(fruit, "[pe]")

a<p><p>l<e>
banana
<p><e>ar
<p>inea<p><p>l<e>

33 / 54
Matching patterns with regular
expressions
Character classes and alternatives

strings <- c( "apple", "219 733 8965", "329-293-8753",


"Work: 579-499-7527; Home: 543.355.3679" )

str_view_all(strings, "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})")

apple
219 733 8965
329-293-8753
Work: 579-499-7527 ; Home: 543.355.3679

34 / 54
Stringr
The stringr package provides pattern matching functions and regular expression support.

There are four main families of functions in stringr:

1. Character manipulation: these functions allow you to manipulate individual
characters within the strings in character vectors.

2. Whitespace tools to add, remove, and manipulate whitespace.

3. Locale sensitive operations, whose results vary from locale to locale.

4. Pattern matching functions, including support for regular expressions.
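
As a quick sketch, one representative call per family (all four are real stringr functions):

str_to_upper("pizza") # character manipulation: "PIZZA"
str_trim("  cannoli  ") # whitespace tools: "cannoli"
str_sort(c("pear", "apple")) # locale sensitive sorting: "apple" "pear"
str_detect("cheesesteak", "steak") # pattern matching: TRUE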

35 / 54
Tokenization
A token is a meaningful unit of text, such as a word, that we are interested
in using for analysis, and tokenization is the process of splitting text into
tokens. Text Mining with R: A Tidy Approach

In order to illustrate this method, we will consider a case study about the
TED talks dataset created by Katherine M. Kinnaird and John Laudun for their
paper "TED Talks as Data".
36 / 54
Tokenization
speaker: Al Gore
text: Thank you so much, Chris. And it's truly a great honor to have the
opportunity to come to this stage twice; I'm extremely grateful. I
have been blown away by this conference, and I want to thank all
of you for the many nice comments about what I had to say the
other night. And I say that sincerely, partly because (Mock sob) I
need that. (Laughter) Put yourselves in my position. (Laughter) I
flew on Air Force Two for eight years. (Laughter) Now I have to take
off my shoes or boots to get on an airplane! (Laughter) (Applause)
I'll tell you one quick story to illustrate what that's been like for me.
(Laughter) It's a true story — every bit of this is true. Soon after
Tipper and I left the — (Mock sob) White House — (Laughter) we
were driving from our home in Nashville to a little farm we have 50
miles east of Nashville. Driving ourselves. (Laughter) I know it
sounds like a little thing to you, but — (Laughter) I looked in the
rear-view mirror and all of a sudden it just hit me. There was no
motorcade back there. (Laughter) You've heard of phantom limb
pain? (Laughter) This was a rented Ford Taurus. (Laughter) It was
dinnertime, and we started looking for a place to eat. We were on
I-40. We got to Exit 238, Lebanon, Tennessee. We got off the exit,
we found a Shoney's restaurant. Low-cost family restaurant chain, ...

37 / 54
Your turn 5
1. Start by loading the two packages below (first tidyverse, then tidytext)
by replacing the _ with the package names.

2. Complete the chunk to read the TED transcripts from the file ted_talks.rds.

3. Use glimpse() to display the variables in the ted_talks dataframe:

02:00
38 / 54
Your turn 5
library(tidyverse)
library(tidytext)
ted_talks <- read_rds("data/ted_talks.rds")

ted_talks %>% glimpse()


# Rows: 992
# Columns: 3
# $ talk_id <dbl> 1, 7, 53, 66, 92, 96, 49, 86, 71, 94, 54, ~
# $ text <chr> "Thank you so much, Chris. And it's truly ~
# $ speaker <chr> "Al Gore", "David Pogue", "Majora Carter",~

As you can see, there are three variables:

talk_id: the identifier from the TED website for this particular talk
text: the text of this TED talk
speaker: the main or first listed speaker (some TED talks have more than
one speaker)

39 / 54
Tokenization
Let's start with one talk.

We can use tidytext's unnest_tokens() function to break the text into
individual tokens and transform it to a tidy data structure.

tidy_talks <- ted_talks %>% # start from the original, non-tidy data
  slice(1) %>% # keep just the first talk
  unnest_tokens(word, text) # tokenize into `word`, from `text`

talk_id speaker word


1 Al Gore thank
1 Al Gore you
1 Al Gore so
1 Al Gore much
1 Al Gore chris
40 / 54
Tokenization
The unnest_tokens() function takes three arguments:

the input dataframe that contains your text (often you will use the pipe
%>% to send this argument to unnest_tokens())

the output column that you want to unnest to, and

the input column that you want to unnest from.
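
Spelled out with named arguments, the call from the previous slide is equivalent to this sketch:

ted_talks %>%
  slice(1) %>%
  unnest_tokens(output = word, input = text)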

41 / 54
Your turn 6
Try the unnest_tokens() function in order to tokenize the text column. You
might start with the first row before scaling to the whole dataset.

tidy_talks <- ted_talks %>% # We piped in the original, non-tidy data


slice(1) %>%
unnest_tokens(word, text)

02:00
42 / 54
Most common TED talk words
Now that our data is in a tidy format, a whole world of analysis opportunity
has opened up for us.

We can start by computing term frequencies in just one line. What are the
most common words in these TED talks?

Most_common_words <- tidy_talks %>%


count(word, sort = TRUE)

word n
the 95
to 75
and 71
of 62
a 59
i 50
in 40
43 / 54
Stop words
Words like "the", "and", and "to" that aren't very interesting for a text analysis are
called stop words. Often the best choice is to remove them. The tidytext
package provides access to stop word lexicons, with a default list and then
other options and other languages.

get_stopwords()
# # A tibble: 175 x 2
# word lexicon
# <chr> <chr>
# 1 i snowball
# 2 me snowball
# 3 my snowball
# 4 myself snowball
# 5 we snowball
# 6 our snowball
# 7 ours snowball
# 8 ourselves snowball
# 9 you snowball
# 10 your snowball
# # ... with 165 more rows

44 / 54
Stop words
When text data is in a tidy format, stop words can be removed using an
anti_join().

This type of join will "filter" or remove items that are in the right-hand side,
keeping those in the left-hand side.
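
A toy illustration of that direction (two hypothetical tables, not the lab data):

words <- tibble(word = c("the", "carbon", "to", "laughter"))
stops <- tibble(word = c("the", "to"))
anti_join(words, stops, by = "word") # keeps "carbon" and "laughter"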

These are now more interesting words and are starting to show the focus
of TED talks.

Most_common_words <- tidy_talks %>%


anti_join(get_stopwords()) %>%
count(word, sort = TRUE)

word n
laughter 22
going 10
can 9
carbon 9
much 9
45 / 54
Visualize Most common TED talk
words
We can fluently pipe from the code we just wrote straight to ggplot2
functions.

One of the significant benefits of using tidy data principles is consistency


with widely-used tools that are broadly supported.
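
A minimal sketch of that pipeline, reusing Most_common_words from the previous slide:

Most_common_words %>%
  slice_max(n, n = 15) %>% # keep the 15 most frequent words
  mutate(word = reorder(word, n)) %>% # order the bars by frequency
  ggplot(aes(n, word)) +
  geom_col()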

46 / 54
Your turn 7
Use count() to find the most common words.

tidy_talks %>%
count(word, sort = TRUE)
# # A tibble: 730 x 2
# word n
# <chr> <int>
# 1 the 95
# 2 to 75
# 3 and 71
# 4 of 62
# 5 a 59
# 6 i 50
# 7 in 40
# 8 you 39
# 9 it 37
# 10 that 33
# # ... with 720 more rows

01:00
47 / 54
Your turn 8
To exclude stop words, you might use anti_join() together with a call to the
get_stopwords() function from the tidytext package.

tidy_talks %>%
_____join(_________) %>%
count(word, sort = TRUE)

01:00
48 / 54
Your turn 9
Complete the following chunk in order to:

Take the word "laughter" out of the most common TED talk words.

Visualize the top 15 words by putting n on the x-axis and word on the y-axis.

49 / 54
Your turn 9
tidy_talks %>%
  filter(word != "laughter") %>%
  # remove stop words
  anti_join(get_stopwords()) %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  # put `n` on the x-axis and `word` on the y-axis
  ggplot(aes(n, word)) +
  geom_col() +
  theme_xaringan(background_color = "#FFFFFF")

50 / 54
Compare TED talk vocabularies

51 / 54
52 / 54
Text Mining with R

Text Mining with R: A Tidy Approach

This is the website for Text Mining with R! Visit the GitHub repository for
this site, find the book at O'Reilly, or buy it on Amazon.

This work by Julia Silge and David Robinson is licensed under a Creative
Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.

53 / 54
Text Mining with R

Text mining with tidy data principles
Julia Silge

Text data sets are diverse and ubiquitous, and tidy data principles provide
an approach to make text mining easier, more effective, and consistent with
tools already in wide use. In this tutorial, you will develop your text mining
skills using the tidytext (https://juliasilge.github.io/tidytext/) package in R,
along with other tidyverse (https://www.tidyverse.org/) tools. You will apply
these skills in several case studies:

1. Introduction
2. Thank you for coming to my TED talk
3. Shakespeare gets sentimental
4. Newspaper headlines
54 / 54
