DAC Biz&Mkt LS Sept '16 - Session 1 - Handout


A Special Interests Student Club with SIM GE

Business & Marketing Lab Series


Text Data Mining & Sentiment Analysis
Session 1: Text Data Mining

1. Introduction to Text Data Mining & Sentiment Analysis

Text Data Mining


Text Data Mining, or simply text mining, is the process of extracting textual
(unstructured) data from a source and then deriving meaningful insights from it. A
generic way to describe text mining is that it turns qualitative objects into
quantitative ones, which can then be incorporated into other types of analyses. In
the modern context, using text mining to solve business problems is described as
text analytics.
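As a toy illustration of turning qualitative text into quantitative data, the base-R sketch below (not part of the lab exercise; the sentence is made up) counts word frequencies in a short piece of text:

```r
# A minimal sketch: turn a sentence (qualitative) into word counts (quantitative)
text <- "good service and good food but slow service"

# Split into lowercase words and tabulate their frequencies
words <- strsplit(tolower(text), "\\s+")[[1]]
word_counts <- sort(table(words), decreasing = TRUE)

print(word_counts)
# 'good' and 'service' each appear twice
```

The resulting table of counts is exactly the kind of quantitative object that can feed into further analyses.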

Sentiment Analysis
According to Wikipedia, sentiment analysis (a.k.a. opinion mining) refers to the use
of natural language processing, text analysis and computational linguistics to
identify and extract subjective information from source materials. One of the most
common applications of sentiment analysis is tracking attitudes and feelings in a
text source or on the web, especially for tracking products, services, brands or even
people. We then determine whether they are viewed positively or negatively by an
audience.
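A toy illustration of the idea, using tiny made-up positive/negative word lists (a real analysis would use an established sentiment lexicon):

```r
# Tiny lexicon-based sentiment sketch; the word lists here are illustrative only
positive_words <- c("good", "great", "love", "excellent")
negative_words <- c("bad", "poor", "hate", "terrible")

score_sentiment <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  # Score = number of positive matches minus number of negative matches
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

score_sentiment("great phone but terrible battery and poor camera")
# 1 positive hit vs 2 negative hits gives a score of -1 (negative overall)
```

A positive score suggests a favourable opinion, a negative score an unfavourable one.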

2. Unstructured Data

Before we move on to the techniques used for text mining and sentiment analysis,
we shall learn about unstructured data and its importance.

Structured VS Unstructured Data


Structured data refers to information found in databases with a high degree of
organisation. It is readily searchable by simple, straightforward search
algorithms or operations.
Unstructured data, on the other hand, lacks such architecture and often includes
email messages, document files, text from social media sites and even audio and
image data.
Why is unstructured data important?
An article titled Unstructured Data and the 80 Percent Rule by analytics strategy
consultant Seth Grimes states that it is a truism that 80 percent of business-
relevant information originates in unstructured form [1]. This is supported by the
International Data Corporation's (IDC) report that object-based storage is gaining
momentum: worldwide revenue for file-based and object-based storage will reach
$38 billion by 2017, a huge jump from the market's estimated $23-billion-plus
revenue in 2013 [2]. This means that techniques to process and extract
information from unstructured data are becoming more relevant with each passing
year.

3. Facebook for Developers

When we do text mining, we have a myriad of resources to choose from: email, PDF
files, blogs and even social media sites. For this session, we will be using Facebook
as the source for our text data. Before we jump straight into our R programming, we
first must create an application through Facebook's Developer site to use its
Application Programming Interface (API). An API is essentially a way for
programmers to communicate with a certain application and in this case, you will be
using the API to get data from a Facebook page. Your API is the middleman between
the R environment and Facebook.

Creating a Facebook application and getting access to its API

To create a Facebook application, first log in to your Facebook account. Go to the
following link: https://developers.facebook.com/docs/apps/register. Scroll down to
(2. Developer Account) and click on the button Create Developer Account to
Register as a Facebook Developer.

[1] https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/

[2] http://www.crn.com/news/storage/240159690/idc-file-based-object-based-storage-growth-far-exceeds-overall-storage-growth.htm

After registering, click on the Create new Facebook App button and select
Website as the platform.

Give the app a name and click on Create New Facebook App ID.

In the next window, choose a category for your app (I picked Apps for Pages, but
any one would work) and then click Create App ID. Complete the verification step
where you have to select pictures as prompted. On the page you end up at, click
on the Skip Quick Start button to go directly to the settings of your app.

Welcome to your very first Facebook app! Now, switch to RStudio and install
the required libraries/packages to call upon and utilise the Facebook web API.

> install.packages("devtools")
> library(devtools)

> install_github("pablobarbera/Rfacebook", subdir = "Rfacebook")
> library(Rfacebook)
> install.packages("httr")
> library(httr)

Package descriptions:

- devtools: aims to make package development easier by providing R functions
  that simplify common tasks; for this session we use it only to make the function
  install_github() available to us
- Rfacebook: provides us with an interface to the Facebook API
- httr: provides us tools for working with URLs and HTTP

After installing the packages, we need to connect our R session with the Facebook
app that we have just created and authenticate it for data mining. In other words, R
will be connected to your app, and when it conducts text mining it can access
whatever information is available on a page/profile. If certain information is made
accessible to your app, R is able to access it as well.

Copy your App ID and App Secret as arguments to the parameters needed in
the fbOAuth() function.

> fb_oauth <- fbOAuth(app_id = "yourappidhere",
+                     app_secret = "yourappsecrethere")

You will get a prompt as follows:

Copy the URL http://localhost:1410/ and go to the settings of your Facebook app
(Settings is accessible through the tab on the left side). Then, click on + Add
Platform and choose Website.

Copy the URL for the field as shown below and save the changes.

Go back to RStudio and hit the Enter key. A browser window will then open, and
you have to allow the app to access your Facebook account.

If everything worked, the browser should show the message:

And the console in R Studio would show:

Authentication complete.
Authentication successful.

You can then save your fb_oauth object

> save(fb_oauth, file="fb_oauth")

and use it in the future using the following command:

> load("fb_oauth")

Text Data Mining and Analysing Facebook with R

Now that we have connected everything and gained access to Facebook, we can test
some of the functions. We will start with getting our own profile information. Before
we can access such information to derive insights, we have to obtain an Access
token. To get this access token, first head to Tools & Support.

Select Graph API Explorer.

Click Get Token, then Get User Access Token, check all boxes, click Get Access
Token and finally click OK.

Copy the newly generated Access token.

Store it in a new variable as a temporary access token.

> temp_token <- "pasteyourtokenhere"

Note: The difference between the access token generated by the app (app
access token) and the temp access token (user access token) is that the latter
communicates with the API through your own Facebook profile instead of the app.
Some user data that would normally be visible to an app making a request with a
user access token isn't always visible with an app access token. If you're reading
user data and using it in your app, you should use a user access token instead of
an app access token. In addition, user access tokens are more secure in that they
must be refreshed every few hours.

4. Mining Data from Facebook

Following the steps above, we can now move on to mining data from Facebook since
we have established a connection to its API. However, before we get into that, we
should briefly look into what an API is.

What is an API?
Application Programming Interface, or API, is essentially a way for programmers or
developers to communicate with a certain application. So in this case, where you
would like to mine data from Facebook, you have to call its API. To do that, you
have to communicate with it using a specific language; in our case, that language is
R. The API serves as the middleman between the programmer/developer and an
application. This middleman receives requests and, should a request be valid,
returns the requested data.

Do note that from here on, we will be using the user access token to access the API.
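To make the middleman idea concrete, the base-R sketch below builds (but does not send) the kind of request URL that a Graph API call involves. The endpoint, fields and token value are illustrative placeholders:

```r
# Sketch of the request URL behind a Graph API call (nothing is actually sent)
base_url <- "https://graph.facebook.com"
endpoint <- "me"                   # example endpoint: your own profile
token    <- "pasteyourtokenhere"   # placeholder, not a real token

request_url <- sprintf("%s/%s?access_token=%s&fields=%s",
                       base_url, endpoint, token, "id,name")
print(request_url)
# "https://graph.facebook.com/me?access_token=pasteyourtokenhere&fields=id,name"
```

Packages such as Rfacebook and httr assemble and send requests like this for you, then parse the response into R objects.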

Get information of your own Facebook profile

We will start off with the simple task of mining details of your own Facebook profile.

> my_fb_profile <- getUsers("me", token = temp_token)

After executing the above, you will obtain a data frame with several variables
containing information regarding your Facebook profile. Let's try to access one of
these variables.

> my_fb_profile$name
[1] "Ryzal Kamis"

You can also get a list of pages that you have liked.

> my_fb_likes <- getLikes("me", token = temp_token)


> View(my_fb_likes)

From the data frame that you have created above, choose a Facebook page to mine
data from. Copy its unique ID from the variable id into the command to be
executed below. I will use the Analytics Vidhya (a great resource for topics
regarding data analytics) Facebook page as an example here.

> analytics.vidhya_fb.posts <- getPage(452065408218678,
+                                      token = temp_token,
+                                      n = 999999999)

For the parameter n, we pass a huge number because, by default, the function
would only get 25 posts. Now, we can utilise the xlsx package to write the data
frame we have created above to an Excel file.

> library(xlsx)
> write.xlsx(analytics.vidhya_fb.posts,
+ "Analytics Vidhya Facebook Posts.xlsx",
+ showNA = TRUE)

Now, you can view the contents shared by the page through the Excel file you have
just created. Next, we will search for Facebook pages that contain certain
keywords you specify.

> fb.pages_data <- searchPages("data",
+                              token = temp_token,
+                              n = 999999999)
200 pages 400 491
> View(fb.pages_data)

To identify the page with the most likes, or the highest talking_about_count:

> fb.pages_data$name[which.max(fb.pages_data$likes)]
[1] "TE Data"
> fb.pages_data$name[which.max(fb.pages_data$talking_about_count)]
[1] "MY DATA Online Store"

We can do the same for Facebook groups as well. Use the function searchGroup() to
create a data frame listing Facebook groups matching a certain keyword.

> fb.groups_data <- searchGroup("data",
+                               token = temp_token)

To get the list of posts shared by this group, you can use the function getGroup().
However, do note that this function is only applicable to groups with an OPEN
privacy. We will be using the local group DataScience SG as an example here.

> data.science.sg_fb.posts <- getGroup(157938781081987,
+                                      token = temp_token,
+                                      since = '2001/08/30',
+                                      until = '2016/08/30',
+                                      n = 999999)

Now, export the data frame to an Excel file.

> write.xlsx(data.science.sg_fb.posts,
+ "DataScience SG Facebook Group Posts.xlsx",
+ showNA = TRUE)

For cheap thrills, let us now try updating our Facebook status through R.

> updateStatus("I am updating my Facebook status using R programming language. Can
+ you feel the excitement?",
+ token = temp_token)
Success! Link to status update:
http://www.facebook.com/10152141763938264_10154064513808264

Awesome stuff, isn't it? There are many uses for the functions you have been
introduced to. The vast amount of information you can obtain with such
convenience speaks volumes of R's potential. This is just the tip of the iceberg.

5. Data Mining from SIM Confessions Page

For the next session of the Lab Series we will be learning how to construct Word
Clouds. We will use data from the SIM Confessions page. So first, we have to mine
data from the page.

Identify the id of the Facebook page.

> fb.pages_SIMCon <- searchPages("SIM Confessions",
+                                token = temp_token)
> View(fb.pages_SIMCon)
> SIMCon.page.id <- "500110393384651"

Mine all the posts created by the page since the date of its creation and export it to
an Excel file in .xlsx format.

> SIMCon.posts.all <- getPage(page = SIMCon.page.id,
+                             token = temp_token,
+                             n = 999999999,
+                             since = '2013/02/01',
+                             until = '2016/08/30')
> write.xlsx(SIMCon.posts.all,
+            "SIM Confessions Facebook Page Posts.xlsx")

We can use this Excel file as a data set to rely on for the next session.
Otherwise, the function save.image() can be used to retain the data frames you
have created in the environment.
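For instance, the entire workspace can be saved to disk and restored in a later session; the file name and the example data frame below are illustrative:

```r
# Save every object in the current R environment to a single file;
# "session1.RData" is just an example file name
example_df <- data.frame(x = 1:3)  # stand-in for your mined data frames
workspace_file <- file.path(tempdir(), "session1.RData")

save.image(file = workspace_file)

# In a later session, restore everything that was saved:
rm(example_df)
load(workspace_file)
print(example_df)  # the data frame is back
```

This avoids having to re-mine the data from Facebook at the start of the next session.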

END OF SESSION
