RDataMining Slides Twitter Analysis
Yanchang Zhao
http://www.RDataMining.com
7 October 2016
Chapter 10: Text Mining, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Frequent Words and Word Cloud
Word Associations
Topic Modelling
Sentiment Analysis
Followers and Retweeting Analysis
Follower Analysis
Retweeting Analysis
R Packages
References and Online Resources
Twitter
RDataMining Twitter Account
- Techniques
  - Text mining
  - Topic modelling
  - Sentiment analysis
  - Social network analysis
- Tools
  - Twitter API
  - R and its packages:
    - twitteR
    - tm
    - topicmodels
    - sentiment140
    - igraph
Process
Retrieve Tweets
(n.tweet <- length(tweets))
## [1] 448
# print tweet #190 and make text fit for slide width
writeLines(strwrap(tweets.df$text[190], 60))
Stemming and Stem Completion
myCorpus <- tm_map(myCorpus, stemDocument) # stem words
writeLines(strwrap(myCorpus[[190]]$content, 60))
## r refer card data mine now provid link packag cran packag
## mapreduc hadoop ad
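Stem completion maps each stem back to a full word taken from a dictionary of the original, unstemmed words. The following is a minimal base R sketch of that idea, assuming prefix matching and a "most frequent match wins" rule. The dictionary and the helper complete_stem are hypothetical, and this is only an illustration, not tm's stemCompletion itself.

```r
# Toy dictionary: words as they appeared before stemming (hypothetical data).
dictionary <- c("mining", "mining", "miner", "provides", "packages",
                "package", "packages", "reference")

# Complete one stem with the most frequent dictionary word that starts with it.
complete_stem <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return(stem)  # no match: keep the stem as is
  counts <- table(candidates)
  names(counts)[which.max(counts)]
}

stems <- c("mine", "packag", "refer", "hadoop")
completed <- vapply(stems, complete_stem, character(1),
                    dictionary = dictionary, USE.NAMES = FALSE)
print(completed)
```

With prefix matching, the stem "mine" can only be completed to "miner" here, because "mining" does not start with "mine". Completed terms are therefore worth checking by hand.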
## 9 104
##         Docs
## Terms    21 22 23 24 25 26 27 28 29 30
##   data    0  1  0  0  1  0  0  0  0  1
##   mining  0  0  0  0  1  0  0  0  0  1
##   r       1  1  1  1  0  1  0  1  1  1
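A term-document matrix like the one above holds one row per term and one column per document, with each cell counting how often the term occurs in that document. A minimal base R sketch of how such a matrix could be built by hand follows. The three documents and the name tdm.toy are hypothetical, and tm's TermDocumentMatrix is not used here.

```r
# Three toy documents (hypothetical), already lower-cased and cleaned.
docs <- c("r data mining", "r and big data", "text mining with r")

# Split each document into words.
words <- strsplit(docs, " ", fixed = TRUE)

# Long format: one row per (term, document) occurrence.
long <- data.frame(
  term = unlist(words),
  doc  = rep(seq_along(words), lengths(words))
)

# Cross-tabulate into a terms-by-documents count matrix.
tdm.toy <- unclass(table(Terms = long$term, Docs = long$doc))
print(tdm.toy[c("data", "mining", "r"), ])
```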
Top Frequent Terms
library(ggplot2)
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip() +
theme(axis.text=element_text(size=7))
[Figure: horizontal bar chart of term counts. Axis "Terms" lists analysing, analytics, australia, big, canberra, course, data, example, group, introduction, learn, mining, network, package, position, r, rdatamining, research, science, slide, talk, text, tutorial, university. Axis "Count" runs from 0 to 200.]
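The bar chart code plots a data frame df with one row per term and the columns term and freq. A minimal sketch of how such a data frame could be derived from a term-document matrix follows, assuming a frequency threshold is used to keep only frequent terms. The toy matrix and the threshold of 3 are hypothetical.

```r
# Toy term-document count matrix (hypothetical values).
m <- matrix(c(3, 0, 2,
              1, 1, 0,
              0, 4, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("data", "text", "r"),
                            Docs  = c("1", "2", "3")))

# Total count of each term over all documents.
term.freq <- rowSums(m)

# Keep terms at or above a frequency threshold (3 is arbitrary here).
term.freq <- term.freq[term.freq >= 3]

# One row per term, in the shape the ggplot call expects.
df <- data.frame(term = names(term.freq), freq = unname(term.freq))
print(df)
```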
Wordcloud
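library(RColorBrewer) # provides brewer.pal
m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency
word.freq <- sort(rowSums(m), decreasing = TRUE)
# colors: the five darkest shades of the 9-colour "BuGn" palette
pal <- brewer.pal(9, "BuGn")[-(1:4)]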
[Figure: word cloud of the terms in the tweets.]
##              r
## code      0.27
## example   0.21
## series    0.21
## markdown  0.20
## user      0.20
##           data
## mining    0.48
## big       0.44
## analytics 0.31
## science   0.29
## poll      0.24
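The association scores above are correlations between the counts of a given term and the counts of other terms across documents, reported when they reach a lower limit. A minimal base R sketch of that calculation follows. The toy matrix and the helper assoc are hypothetical, and this is not tm's findAssocs itself.

```r
# Toy term-document matrix: rows are terms, columns are documents (hypothetical).
m <- matrix(c(1, 0, 1, 1, 0, 1,
              1, 0, 1, 0, 0, 1,
              0, 1, 0, 0, 1, 0,
              1, 1, 0, 1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("data", "mining", "text", "big"), NULL))

# Correlate one term's counts with every other term's counts across documents.
assoc <- function(m, term, corlimit) {
  r <- cor(t(m))[term, ]           # Pearson correlations with `term`
  r <- r[names(r) != term]         # drop the term itself
  r <- r[r >= corlimit]            # keep associations at or above the limit
  round(sort(r, decreasing = TRUE), 2)
}

res <- assoc(m, "data", 0.2)
print(res)
```

In this toy matrix only "mining" clears the limit for "data": "big" is uncorrelated with it and "text" is negatively correlated.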
Network of Terms
library(graph)
library(Rgraphviz)
plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
[Figure: network of frequent terms. Node labels that survive include talk, tutorial, slide, big and research.]
Topic Modelling
dtm <- as.DocumentTermMatrix(tdm)
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 7) # first 7 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1
## "r, data, mining, slide, position, series, application"
## Topic 2
## "r, mining, data, big, position, available, text"
## Topic 3
## "data, science, group, poll, kdnuggets, package, software"
## Topic 4
## "r, data, talk, slide, mining, analysing, dataset"
## Topic 5
## "r, mining, package, book, example, slide, analysing"
## Topic 6
## "big, r, mining, network, analysing, statistical, tutorial"
## Topic 7
## "data, r, slide, analytics, research, analysing, workshop"
## Topic 8
## "data, mining, research, canberra, big, event, text"
Topic Modelling
library(data.table) # provides as.IDate
topics <- topics(lda) # 1st topic identified for every document (tweet)
topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)
ggplot(topics, aes(date, fill = term[topic])) +
geom_density(position = "stack")
[Figure: stacked density plot of topics over time. x-axis: date; y-axis: density; legend "term[topic]" lists the term string of each topic.]
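The plot relies on giving every tweet a single topic, namely its most likely one, and then labelling it with that topic's term string. A minimal base R sketch of this step follows, assuming a matrix of per-document topic probabilities is available. The matrix gamma, its values and the three labels are hypothetical.

```r
# Hypothetical per-document topic probabilities: 4 documents x 3 topics.
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8,
                  0.3, 0.5, 0.2,
                  0.6, 0.3, 0.1),
                nrow = 4, byrow = TRUE)

# Most likely topic for each document (row-wise arg max).
doc.topic <- apply(gamma, 1, which.max)

# Hypothetical labels, one per topic, like the pasted term strings above.
term <- c("r, data, mining", "data, science, group", "big, r, network")

# Label every document with the terms of its most likely topic.
labelled <- data.frame(doc = seq_len(nrow(gamma)),
                       topic = doc.topic,
                       label = term[doc.topic])
print(labelled)
```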
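Sentiment Analysis
# sentiment analysis with package sentiment140 (installed from GitHub),
# which is loaded under the name "sentiment"
library(sentiment)
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)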
##
## neutral positive
## 428 20
# sentiment plot
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
plot(result, type = "l")
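The scoring step above turns each polarity label into +1, 0 or -1 and sums the scores per day, so the plotted line is the daily net sentiment. A minimal base R sketch on made-up data follows. The six labels and dates are hypothetical, and as.Date stands in for as.IDate so that no extra package is needed.

```r
# Hypothetical polarity labels and dates for six tweets.
sentiments <- data.frame(
  polarity = c("positive", "neutral", "negative",
               "positive", "positive", "neutral"),
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02",
                   "2016-01-02", "2016-01-03", "2016-01-03"))
)

# Map polarity to a numeric score: +1, 0 or -1.
score.map <- c(positive = 1, neutral = 0, negative = -1)
sentiments$score <- unname(score.map[as.character(sentiments$polarity)])

# Net sentiment per day: sum of scores grouped by date.
result <- aggregate(score ~ date, data = sentiments, sum)
print(result)
```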
Retrieve User Info and Followers
user <- getUser("RDataMining")
user$toDataFrame()
friends <- user$getFriends() # who this user follows
followers <- user$getFollowers() # this user's followers
followers2 <- followers[[1]]$getFollowers() # a follower's followers
##                [,1] ...
## description    "R and Data Mining. Group on LinkedIn: ...
## statusesCount  "583" ...
## followersCount "2376" ...
## favoritesCount "6" ...
## friendsCount   "72" ...
## url            "http://t.co/LwL50uRmPd" ...
## name           "Yanchang Zhao" ...
## created        "2011-04-04 09:15:43" ...
## protected      "FALSE" ...
## verified       "FALSE" ...
## screenName     "RDataMining" ...
## location       "Australia" ...
## lang           "en" ...
## id             "276895537" ...
Follower Map
@RDataMining Followers (#: 2376)
Based on Jeff Leek's twitterMap function at
http://biostat.jhsph.edu/~jleek/code/twitterMap.R
Active Influential Followers
[Figure: scatter plot of followers. x-axis: #followers / #friends (ticks 5, 10, 20, 50, 100); y-axis: #Tweets per day (ticks 0.2, 0.5, 2.0, 5.0, 10.0, 20.0). Labelled accounts: M Kautzar Ichramsyah, Prof. Diego Kuonen, Christopher D. Long, Murari Bhartia, #AI PR Girl, Robert Penner, Roby Zac S., Michal Illich, Sharon Machlis, StatsBlogs, David Smith, Marcel Molina, Mitch Sanders, Data Science London, Statistics Blog, Yichuan Wang, Learn R, biao, Rob J Hyndman, RDataMining, Data Mining, Antonio Piccolboni, Duccio Schiavon, LearnDataAnalysis.]
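The two axes of this plot can be derived from the user fields shown earlier (statusesCount, followersCount, friendsCount, created). A minimal sketch follows, assuming tweets per day means statuses divided by account age in days. The three accounts, the reference date as.of and the guard against zero friends are hypothetical choices.

```r
# Hypothetical follower records; column names follow the user fields above.
followers.df <- data.frame(
  name           = c("A", "B", "C"),
  followersCount = c(5000, 120, 900),
  friendsCount   = c(100, 240, 0),
  statusesCount  = c(7300, 365, 1460),
  created        = as.Date(c("2012-01-01", "2015-01-01", "2014-01-01"))
)
as.of <- as.Date("2016-01-01")  # hypothetical reference date

# Tweets per day since the account was created.
days <- as.numeric(as.of - followers.df$created)
followers.df$tweetsPerDay <- followers.df$statusesCount / days

# Followers-to-friends ratio; pmax avoids division by zero.
followers.df$ratio <- followers.df$followersCount /
  pmax(followers.df$friendsCount, 1)

print(followers.df[, c("name", "tweetsPerDay", "ratio")])
```

Accounts that score high on both measures tweet often and are followed by many more users than they follow, which is one way to read "active and influential".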
Top Retweeted Tweets
# plot them
dates <- strptime(tweets.df$created, format="%Y-%m-%d")
plot(x=dates, y=tweets.df$retweetCount, type="l", col="grey",
xlab="Date", ylab="Times retweeted")
colors <- rainbow(10)[1:length(selected)]
points(dates[selected], tweets.df$retweetCount[selected],
pch=19, col=colors)
text(dates[selected], tweets.df$retweetCount[selected],
tweets.df$text[selected], col=colors, cex=.9)
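The plotting code uses an index vector selected that is not defined on this slide. Two ways it could be built are sketched below on made-up data: a retweet-count threshold, or the top n tweets by retweet count. The data frame tweets.toy, the threshold of 9 and the name top3 are hypothetical.

```r
# Hypothetical retweet counts for eight tweets.
tweets.toy <- data.frame(
  text = paste("tweet", 1:8),
  retweetCount = c(2, 15, 0, 9, 4, 11, 10, 1)
)

# Option 1: indices of tweets at or above a threshold (9 is arbitrary here).
selected <- which(tweets.toy$retweetCount >= 9)

# Option 2: indices of the three most retweeted tweets.
top3 <- head(order(tweets.toy$retweetCount, decreasing = TRUE), 3)

print(selected)
print(top3)
```

Because the colours come from rainbow(10)[1:length(selected)], the selection should hold at most ten tweets.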
Top Retweeted Tweets
[Figure: line plot of retweet counts over time. x-axis: Date; y-axis: Times retweeted (0 to 15). The labelled top retweeted tweets are listed below.]
- Handling and Processing Strings in R an ebook in PDF format, 105 pages. http://t.co/UXnetU7k87
- A Twitter dataset for text mining: @RDataMining Tweets extracted on 3 February 2016. Download it at https://t.co/lQp94IvfPf
- Free online course on Computing for Data Analysis (with R), to start on 24 Sept 2012 https://t.co/Y617n30y
- Slides in 8 PDF files on Getting Data from the Web with R http://t.co/epT4Jv07WD
- Lecture videos of natural language processing course at Stanford University: 18 videos, with each of over 1 hr length http://t.co/VKKdA9Tykm
- The R Reference Card for Data Mining now provides links to packages on CRAN. Packages for MapReduce and Hadoop added. http://t.co/RrFypol8kw
Tracking Message Propagation
tweets[[1]]
retweeters(tweets[[1]]$id)
retweets(tweets[[1]]$id)
## [[1]]
## [1] "bobaiKato: RT @RDataMining: A Twitter dataset for te...
##
## [[2]]
## [1] "VipulMathur: RT @RDataMining: A Twitter dataset for ...
##
## [[3]]
## [1] "tau_phoenix: RT @RDataMining: A Twitter dataset for ...
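Each retweet above prints as "user: RT @author: text", so the retweeting account and the original author can be pulled out with regular expressions and turned into an edge list for a propagation graph. A minimal base R sketch follows. The strings are shortened copies of the output above, and the names retweeter, author and edges are hypothetical.

```r
# Strings in the "user: RT @author: text" form shown above (text shortened).
rt <- c("bobaiKato: RT @RDataMining: A Twitter dataset",
        "VipulMathur: RT @RDataMining: A Twitter dataset",
        "tau_phoenix: RT @RDataMining: A Twitter dataset")

# Screen name of the account that retweeted: everything before the first colon.
retweeter <- sub(":.*$", "", rt)

# Screen name of the original author: the handle after "RT @".
author <- sub("^.*RT @([A-Za-z0-9_]+):.*$", "\\1", rt)

# Edge list (author -> retweeter) describing how the message spread.
edges <- data.frame(from = author, to = retweeter)
print(edges)
```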
R Packages
Twitter Data Extraction Package twitteR
- userTimeline, homeTimeline, mentions, retweetsOfMe: retrieve various timelines
- getUser, lookupUsers: get information of Twitter user(s)
- getFollowers, getFollowerIDs: retrieve followers (or their IDs)
- getFriends, getFriendIDs: return a list of Twitter users (or user IDs) that a user follows
- retweets, retweeters: return retweets or users who retweeted a tweet
- searchTwitter: issue a search of Twitter
- getCurRateLimitInfo: retrieve current rate limit information
- twListToDF: convert into data.frame
https://cran.r-project.org/package=twitteR
Text Mining Package tm
- removeNumbers, removePunctuation, removeWords, stripWhitespace: remove numbers, punctuation, words or extra whitespace
- removeSparseTerms: remove sparse terms from a term-document matrix
- stopwords: various kinds of stopwords
- stemDocument, stemCompletion: stem words and complete stems
- TermDocumentMatrix, DocumentTermMatrix: build a term-document matrix or a document-term matrix
- termFreq: generate a term frequency vector
- findFreqTerms, findAssocs: find frequent terms or associations of terms
- weightBin, weightTf, weightTfIdf, weightSMART, WeightFunction: various ways to weight a term-document matrix
https://cran.r-project.org/package=tm
Topic Modelling and Sentiment Analysis Packages
topicmodels & sentiment140
Package topicmodels
https://cran.r-project.org/package=topicmodels
Package sentiment140
https://github.com/okugami79/sentiment140
Social Network Analysis and Visualization Package
igraph
- degree, betweenness, closeness, transitivity: various centrality scores
- neighborhood: neighborhood of graph vertices
- cliques, largest.cliques, maximal.cliques, clique.number: find cliques, i.e. complete subgraphs
- clusters, no.clusters: maximal connected components of a graph and the number of them
- fastgreedy.community, spinglass.community: community detection
- cohesive.blocks: calculate cohesive blocks
- induced.subgraph: create a subgraph of a graph (igraph)
- read.graph, write.graph: read and write graphs from and to files of various formats
https://cran.r-project.org/package=igraph
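To make two of the listed notions concrete, degree and neighbourhood can be read directly off an adjacency matrix. The sketch below uses base R only, so it runs without igraph and is not igraph's implementation. The four users and their edges are hypothetical.

```r
# Hypothetical undirected graph on four users, as a symmetric adjacency matrix.
users <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(users, users))
edges <- rbind(c("a", "b"), c("a", "c"), c("a", "d"), c("b", "c"))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # mirror each edge to keep the matrix symmetric

# Degree: number of neighbours of each vertex.
deg <- rowSums(adj)

# Neighbourhood of a vertex: the vertices it is directly connected to.
neighbours.a <- names(which(adj["a", ] == 1))

print(deg)
print(neighbours.a)
```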
References
Online Resources
- Chapter 10 Text Mining & Chapter 11 Social Network Analysis, in book R and Data Mining: Examples and Case Studies
  http://www.rdatamining.com/docs/RDataMining.pdf
- RDataMining Reference Card
  http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
- Online documents, books and tutorials
  http://www.rdatamining.com/resources/onlinedocs
- Free online courses
  http://www.rdatamining.com/resources/courses
- RDataMining Group on LinkedIn (22,000+ members)
  http://group.rdatamining.com
- Twitter (2,700+ followers)
  @RDataMining
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining