
UNIVERSITY OF THE PHILIPPINES OPEN UNIVERSITY

Fundamentals of Predictive Analytics
A Business Analytics Course

Course Writers:
Dr. Eugene Rex Jalao
Dr. Melinda Lumanta
Ms. Louise Antonette Villanueva
Asst. Prof. Reinald Adrian Pugoy
Asst. Prof. Mar Cruz

COMMISSION ON HIGHER EDUCATION

Welcome, dear students! This course will help you traverse the world of predictive
analytics. In predictive analytics (also sometimes called data mining), useful patterns are
extracted from available data and, in turn, used to predict the future. Moreover,
predictive analytics draws ideas from various fields such as machine learning, pattern
recognition, statistics, and database systems.

Before taking this course, you should have already completed the Fundamentals of
Descriptive Analytics course. And, one more thing! This very document is your course
guide. Kindly read this carefully before embarking on your journey to learn the
Fundamentals of Predictive Analytics.

COURSE OBJECTIVES

At the end of the course, you should be able to:


1. Discuss key concepts, frameworks, and algorithms in predictive analytics.
2. Identify appropriate predictive analytics/data mining methodologies for different
business problems.
3. Utilize appropriate predictive analytics/data mining methodologies to build and
validate models.
4. Evaluate the performance of the predictive model.

COURSE OUTLINE

MODULE 1: Introduction to Predictive Analytics


1. Definition of Predictive Analytics and Data Mining
2. Predictive Analytics Framework

MODULE 2: Data Preprocessing


1. Definition of Data Preprocessing
2. Tasks of Data Preprocessing

MODULE 3: Supervised Learning


1. Classification Methodologies



2. Regression Methodologies
3. Evaluation and Deployment

MODULE 4: Unsupervised Learning


1. Association Rule Mining
2. Sequential Pattern Mining
3. Clustering
4. Text Mining
5. Social Media Sentiment Analysis

COURSE MATERIALS

The course learning package consists of the items listed below. These will be made
available for access and download.
1. Course Guide
2. Study Guides
3. Video Lectures/Resources
4. Other Digital References

STUDY SCHEDULE

Week 1-2: Course Overview; Module 1: Introduction to Predictive Analytics
• Read the Course Guide.
• Introduce yourself via the Self-Introduction forum.
• Go through the Module 1 Study Guide.
• Watch the following video resources:
o Introduction to Predictive Analytics (ERL Jalao)
o Supervised Learning vs Unsupervised Learning (RADL Pugoy)
o Tools of Data Mining (ERL Jalao)
• Participate in Discussion Forum 1.

Week 2-3: Module 2: Data Preprocessing
• Go through the Module 2 Study Guide.
• Watch the following video resources:
o Data Preprocessing (ERL Jalao)
o Data Integration (ERL Jalao)
o Data Transformation (ERL Jalao)
o Data Encoding (ERL Jalao)
o Data Cleaning (ERL Jalao)
o Data Reduction and Manipulation (ERL Jalao)
• Participate in Discussion Forum 2.

Week 4-7: Module 3: Supervised Learning (Classification)
• Go through the Module 3 Study Guide (pp. 1-3).
• Watch the following video resources:
o Classification (ERL Jalao)
o Naive Bayes (ERL Jalao)
o Decision Trees (ERL Jalao)
o Nearest Neighbours (ERL Jalao)
o Artificial Neural Networks (ERL Jalao)
o Support Vector Machines (ERL Jalao)
o Ensembles (ERL Jalao)
o Random Forests (ERL Jalao)
o Model Evaluation (ERL Jalao)
• Submit Assignment 1.

Week 8-9: Module 3: Supervised Learning (Regression)
• Go through the Module 3 Study Guide (pp. 3-5).
• Watch the following video resources:
o Regression (ERL Jalao)
o Regression Model Evaluation (ERL Jalao)
o Indicator Variables (ERL Jalao)
o Multicollinearity (ERL Jalao)
o Logistic Regression (ERL Jalao)
• Submit Assignment 2.

Week 10-12: Module 4: Unsupervised Learning
• Go through the Module 4 Study Guide.
• Watch the following video resources:
o Association Rule Mining (ERL Jalao)
o Sequential Pattern Mining (ERL Jalao)
o K-Means Clustering (ERL Jalao)
o Hierarchical Clustering (ERL Jalao)
o Text Mining (ERL Jalao)
o Social Media Sentiment Analysis (ERL Jalao)
• Submit Assignment 3.

Week 13-14: Review and Integration
• Participate in Discussion Forum 3.

Week 15: Final Exam
• Accomplish the Final Exam.

COURSE REQUIREMENTS

To earn a digital certificate of accomplishment, you need to satisfactorily complete the
following:

1. Three (3) discussion forums (20%)


Discussion forums or DFs are venues for you to share your thoughts and
learnings and show comprehension of the subject matter at hand. Guide
questions shall be posted for each DF. Each of you is expected to provide
insightful answers, with properly cited references.

2. Three (3) assignments (60%)


The goal of the assignments is for you to actually apply learned methodologies
and principles to the problem or issue at hand. An assignment may be a position
paper, modelling exercise, or programming assignment.

3. One (1) final exam (20%)


The final exam shall serve as your summative assessment, i.e., it evaluates
everything that you have learned throughout this course.



MODULE 1: INTRODUCTION TO PREDICTIVE ANALYTICS

Introduction

This is the first module in the course. As such, it gives an overview of what you will be
learning in this course as a whole, i.e., predictive analytics. A brief overview of the
principles pertaining to predictive analytics is provided, and key concepts are defined and discussed.

Learning Objectives

After working on this module, you should be able to:


1. Define what predictive analytics is.
2. Discuss the fundamental ideas, concepts, and techniques associated with
predictive analytics.
3. Describe the Predictive Analytics Framework.

1.1. What is predictive analytics?

Predictive analytics (also sometimes called data mining) is the non-trivial extraction of
implicit, previously unknown, and potentially useful information from the data. In other
words, useful patterns are extracted from the data, and we hope that these patterns will
be repeated in the future. Another definition of data mining is that it is the exploration
and analysis of a large quantity of data to discover meaningful patterns by automatic or
semi-automatic means. Data mining is all about explaining the past to predict the future.
Predictive analytics draws ideas from various fields such as machine learning, pattern
recognition, statistics, and database systems.



Learning Resources
• Video on “Introduction to Predictive Analytics” by Dr. Eugene Rex Jalao.
• Video on “Predictive Analytics” by Mr. Dominic Ligot.

Discussion Forum 1
Discuss the potential applications of predictive analytics/data mining in your field of work.

1.2. Supervised Learning vs Unsupervised Learning

Before proceeding to the latter parts of the course, it is important to differentiate
supervised learning from unsupervised learning. Supervised learning refers to the task of
inferring a function from supervised (or labeled) training data. Each example in the
training data is a pair consisting of the input object and its corresponding label. On the
other hand, unsupervised learning pertains to the problem of trying to find hidden
structures in unlabeled data. Since the examples are unlabeled, it can be more
challenging to find and evaluate a potential solution.
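To make the distinction concrete, here is a minimal Python sketch, assuming scikit-learn is installed; the toy data and feature values are invented for illustration. The supervised model is fit on (input, label) pairs, while the unsupervised one receives only the inputs.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data: each row is an example with two input features (invented values).
X = [[1.0, 1.1], [0.9, 1.0], [4.0, 4.2], [4.1, 3.9]]
y = ["low", "low", "high", "high"]  # labels, available only in the supervised case

# Supervised learning: the model is fit on (input, label) pairs.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[1.05, 1.0]]))  # predicts a label for a new example

# Unsupervised learning: no labels; the algorithm looks for structure in X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignments discovered from the data
```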

Learning Resources
Video on “Supervised Learning vs Unsupervised Learning” by Asst. Prof.
Reinald Adrian Pugoy

Study Questions

1. What distinguishes supervised learning from unsupervised learning?
2. Give an example of a business problem suited to each type of learning.



1.3. Tools of Data Mining

Data mining tools are software packages usually downloaded or purchased from third-party
providers. One example is R, an open-source tool widely considered the most
utilized for data mining and predictive analytics. Other tools include SPSS, RapidMiner,
SAS, Excel, and Python, with Python closing the gap on R nowadays. Furthermore,
it is important to note that no single tool solves all predictive analytics problems. In other
words, solutions for such problems cannot be implemented with just one piece of
software. For this reason, a majority of companies worldwide use both free/open-source
software and commercial software.

Learning Resources
Video on “Tools of Data Mining” by Dr. Eugene Rex Jalao.

1.4. The Predictive Analytics Framework

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a non-
proprietary framework that allows us to implement predictive analytics solutions. It is a
standard process that helps ensure that data mining is reliable and can be easily
repeated by people with little to no data mining background. It also demonstrates the
maturity of data mining and reduces dependency on experts. Furthermore, the CRISP-
DM Model also serves as an aid to project planning and management, and it is also
considered a "comfort factor" for new adopters.

The CRISP-DM Model involves several phases. These are:

1. Business Understanding - The focus is on understanding the project objectives
and requirements from a business perspective, then converting this knowledge
into a data mining problem definition and a preliminary plan designed to achieve
the objectives.
2. Data Understanding - It starts with initial data collection and familiarization.
Data quality problems are identified to discover first insights into the data and to
detect interesting subsets.
3. Data Preparation - This covers all pre-processing activities to construct the final
dataset from the raw data. This process then allows the data to fit into the model
that will be used to solve a particular problem.
4. Modeling - This phase defines the modeling techniques, applications, and
parameters to calibrate specific models to solve the problem.
5. Evaluation - It identifies and verifies whether the output of the generated model
has solved the problem. If the model fails to solve the problem, there is a need to
go back to the previous phases and redefine the model.
6. Deployment - After validating the model, it is deployed so that the end users
of the organization can finally utilize and reap the benefits of the data
mining solution.

Figure 1. Generic tasks and output of the CRISP-DM Model.

Learning Resources
Video on “CRISP-DM (Cross-Industry Standard Process for Data Mining)” by Dr.
Eugene Rex Jalao.



Assignment 0/Discussion Forum 2

Apply the CRISP-DM Model to the case study to be given by your professor/faculty-in-charge.
Share your experience by posting it in our discussion forum.

Study Question
Why is it important for managers to understand how each phase of the CRISP-DM Model
is carried out?



MODULE 2: DATA PRE-PROCESSING

Introduction

This module intends to:

1. Introduce basic concepts in data pre-processing; and


2. Introduce methods of data pre-processing.

2.1. Basic Concepts in Data Pre-processing

Data preprocessing is an important step in data analytics. It aims at assessing and
improving the quality of data for secondary statistical analysis. With this, the data are
better understood, and the data analysis is performed more accurately and efficiently.

Learning Objectives

After studying the basic concepts in data preprocessing, you should be able to:

1. Explain what data preprocessing is and why it is important in data analytics; and
2. Describe different forms of data preprocessing.

A. What is Data Pre-Processing?

Data in the real world tend to be incomplete, noisy, and inconsistent. "Dirty" data can
lead to errors in parameter estimation and incorrect analyses that lead users to draw false
conclusions. Quality decisions must be based on quality data; hence, unclean data may
cause incorrect or even misleading statistical results and predictive analyses. Data
preprocessing is a data mining technique that involves transforming raw or source data
into an understandable format for further processing.



B. Tasks for Data Pre-processing

Several distinct steps are involved in preprocessing data. Here are the general steps
taken to pre-process data:

• Data cleaning
o This step deals with missing data, noise, outliers, and duplicate or
incorrect records while minimizing the introduction of bias into the database.
o Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data integration
o Extracted raw data can come from heterogeneous sources or be in
separate datasets. This step reorganizes the various raw datasets into a
single dataset that contains all the information required for the desired
statistical analyses.
o Involves integration of multiple databases, data cubes, or files.
o Data with different representations are put together, and conflicts within
the data are resolved.
• Data transformation
o This step translates and/or scales variables stored in a variety of formats
or units in the raw data into formats or units that are more useful for the
statistical methods that the researcher wants to use.
o Data is normalized, aggregated, and generalized.
• Data reduction
o After the dataset has been integrated and transformed, this step removes
redundant records and variables, as well as reorganizes the data in an
efficient and "tidy" manner for analysis.
o Pertains to obtaining a reduced representation in volume that produces the
same or similar analytical results.
o This step aims to present a reduced representation of the data in a data
warehouse.

Pre-processing is sometimes iterative and may involve repeating this series of steps until
the data are satisfactorily organized for the purpose of statistical analysis. During
preprocessing, one needs to take care not to accidentally introduce bias by modifying
the dataset in ways that will impact the outcome of statistical analyses. Similarly, we
must avoid reaching statistically significant results through “trial and error” analyses on
differently pre-processed versions of a dataset.

Learning Resources
Dr. Eugene Rex Jalao’s video on Data Preprocessing

Activity 2-1
Watch:
Dr. Eugene Rex Jalao’s video on Data Preprocessing.

Methods of Data Pre-processing


Data preprocessing consists of a series of steps to transform data extracted from different
data sources into a clean dataset prior to statistical analysis. Data pre-processing includes
data cleaning, data integration, data transformation, and data reduction.



Learning Objectives
After studying the methods of data preprocessing, you should be able to:

1. Understand the different methods of data preprocessing; and
2. Differentiate the different techniques of data preprocessing.

A. Data Integration

Data integration is the process of combining data derived from various data sources
(such as databases, flat files, etc.) into a consistent dataset. In data integration, data
from the different sources, as well as the metadata (the data about this data) from
different sources, are integrated to come up with a single data store. There are a number
of issues to consider during data integration, related mostly to possible differences in
standards among data sources. These issues include the entity identification problem, data
value conflicts, and redundant data. Careful integration of the data from multiple sources
may help reduce or avoid redundancies and inconsistencies and improve the speed and
quality of subsequent data mining. The four join types below are the basic methodologies;
a short pandas sketch follows the list.

Four Types of Data Integration Methodologies

1. Inner Join - creates a new result table by combining column values of two tables
(A and B) based upon the join predicate.
2. Left Join - returns all the values from an inner join plus all values in the left table
that do not match the right table, including rows with NULL (empty) values in
the link column.
3. Right Join - returns all the values from the right table and matched values from
the left table (NULL in the case of no matching join predicate).
4. Outer Join - the union of the left join and right join values.
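As a concrete illustration, here is a small pandas sketch of the four joins; the two tables and the key column cust_id are invented for the example.

```python
import pandas as pd

# Two invented source tables sharing the key column "cust_id".
A = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Ben", "Cai"]})
B = pd.DataFrame({"cust_id": [2, 3, 4], "spend": [150, 75, 200]})

inner = A.merge(B, on="cust_id", how="inner")  # only matching keys (2, 3)
left = A.merge(B, on="cust_id", how="left")    # all of A; NULL spend for key 1
right = A.merge(B, on="cust_id", how="right")  # all of B; NULL name for key 4
outer = A.merge(B, on="cust_id", how="outer")  # union of left and right joins
print(outer)
```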



Learning Resources
Dr. Eugene Rex Jalao’s video on Data Integration

Activity 2-1

Watch:
Dr. Jalao’s video on Data Integration.



Study Questions

1. What issues can arise when integrating data from heterogeneous sources?
2. Discuss the differences among the inner, left, right, and outer joins, and give a
situation where each is appropriate.

B. Data Transformation
Data transformation is the process of converting data from one format to another. It aims
to transform the data values into a format, scale, or unit that is more suitable for analysis.
Data transformation is an important step in data preprocessing and a prerequisite for
building predictive analytics solutions.

Here are a few common options for data transformation; a consolidated Python sketch
appears after this list.

1) Normalization - a way to scale a specific variable to fall within a small, specified range.
a) Min-max normalization - transforming values to a new scale such that all
attributes fall within a standardized range, typically 0 to 1.
b) Z-score standardization - transforming a numerical variable to a standard
normal scale (zero mean, unit variance).

2) Encoding and Binning
a) Binning - the process of transforming numerical variables into categorical
counterparts.
i) Equal-width (distance) partitioning - divides the range into N intervals of equal
size, thus forming a uniform grid.
ii) Equal-depth (frequency) partitioning - divides the range into N intervals, each
containing approximately the same number of samples.
b) Encoding - the process of transforming categorical values into binary or
numerical counterparts, e.g., mapping male or female for gender to 1 or 0.
Data encoding is needed because some data mining methodologies,
such as linear regression, require all data to be numerical.
i) Binary encoding (unsupervised) - transformation of categorical variables into
variables taking the values 0 or 1 to indicate the absence or presence of each
category. If the categorical variable has k categories, we would need to create
k binary variables.
ii) Class-based encoding (supervised)
• Discrete class - replace the categorical variable with just one new numerical
variable, and replace each category of the categorical variable with
its corresponding probability of the class variable.
• Continuous class - replace the categorical variable with just one new numerical
variable, and replace each category of the categorical variable with
its corresponding average of the class variable.
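Below is a consolidated sketch of these transformations, assuming pandas is available; the data frame, column names, and bin labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"income": [20, 35, 50, 80, 120],      # numerical variable
                   "gender": ["M", "F", "F", "M", "F"],  # categorical variable
                   "bought": [0, 1, 1, 0, 1]})           # class variable

# Min-max normalization: rescale income to the [0, 1] range.
inc = df["income"]
df["income_mm"] = (inc - inc.min()) / (inc.max() - inc.min())

# Z-score standardization: zero mean, unit variance.
df["income_z"] = (inc - inc.mean()) / inc.std()

# Equal-width binning: 3 intervals of equal size.
df["income_eqw"] = pd.cut(inc, bins=3, labels=["low", "mid", "high"])

# Equal-depth binning: 3 intervals with roughly equal counts.
df["income_eqd"] = pd.qcut(inc, q=3, labels=["low", "mid", "high"])

# Binary (one-hot) encoding: one 0/1 column per category (k columns).
df = pd.concat([df, pd.get_dummies(df["gender"], prefix="gender")], axis=1)

# Class-based encoding (discrete class): replace each category with the
# mean of the class variable for that category, i.e., its class probability.
df["gender_enc"] = df["gender"].map(df.groupby("gender")["bought"].mean())
print(df)
```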



Learning Resources
• Dr. Eugene Rex Jalao's video on Data Transformation
• Dr. Eugene Rex Jalao's video on Data Encoding

C. Data Cleaning

All data sources potentially include errors and missing values; data cleaning addresses
these anomalies. Data cleaning is the process of altering data in a given storage
resource to make sure that it is accurate and correct. Data cleaning routines attempt to
fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data, as well as resolve redundancy caused by data integration.



Data Cleaning Tasks (a short pandas sketch of missing-value handling follows this list):
a) Fill in missing values
Solutions for handling missing data:
i. Ignore the tuple
ii. Fill in the missing value manually
iii. Data imputation
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the
same class

b) Clean noisy data
Solutions for cleaning noisy data:
i. Binning - transforming numerical values into categorical
counterparts
ii. Clustering - grouping data into clusters and using the
cluster average to represent a value
iii. Regression - fitting a simple regression line to smooth a very
erratic data set
iv. Combined computer and human inspection - detecting suspicious
values and checking them through human intervention

c) Identify outliers
Solutions for identifying outliers:
i. Box plot
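Here is a minimal pandas sketch of the missing-value options above; the data are invented, and each option trades off bias differently in practice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "segment": ["A", "A", "B", "B", "B"]})

# i. Ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# iii. Data imputation with a global constant.
const_fill = df["age"].fillna(-1)

# Data imputation with the attribute mean.
mean_fill = df["age"].fillna(df["age"].mean())

# Data imputation with the attribute mean per class/segment.
class_fill = df.groupby("segment")["age"].transform(lambda s: s.fillna(s.mean()))
print(class_fill)
```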

Activity 2-2
Watch:
Dr. Jalao’s video on Data Cleaning.

D. Data Reduction and Manipulation

Data reduction is the process of obtaining a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same) analytical
results. The need for data reduction emerged from the fact that a database or data
warehouse may store terabytes of data, and complex data analysis/mining may take a
very long time to run on the complete data set.



Data Reduction Strategies (a sampling sketch in pandas follows this list):
a. Sampling - utilizing a smaller representative sample from the big data set or
population that generalizes to the entire population.
i. Types of Sampling
1. Simple random sampling - there is an equal probability of
selecting any particular item.
2. Sampling without replacement - as each item is selected, it is
removed from the population.
3. Sampling with replacement - objects are not removed from the
population as they are selected for the sample.
4. Stratified sampling - split the data into several partitions, then
draw random samples from each partition.

b. Feature Subset Selection - reduces the dimensionality of data by eliminating
redundant and irrelevant features.
i. Feature Subset Selection Techniques
1. Brute-force approach - try all possible feature subsets as input to
the data mining algorithm.
2. Embedded approaches - feature selection occurs naturally as part
of the data mining algorithm.
3. Filter approaches - features are selected before the data mining
algorithm is run.
4. Wrapper approaches - use the data mining algorithm as a black
box to find the best subset of attributes.

c. Feature Creation - creating new attributes that can capture the important
information in a data set much more efficiently than the original attributes.
i. Feature Creation Methodologies
1. Feature extraction
2. Mapping data to a new space
3. Feature construction
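The sampling strategies can be sketched with pandas as follows (invented data; stratified sampling is expressed as sampling within each partition):

```python
import pandas as pd

df = pd.DataFrame({"value": range(10),
                   "stratum": ["A"] * 6 + ["B"] * 4})

# Simple random sampling without replacement (the pandas default).
s_without = df.sample(n=4, random_state=0)

# Sampling with replacement: rows may be drawn more than once.
s_with = df.sample(n=4, replace=True, random_state=0)

# Stratified sampling: draw 50% from each partition (stratum).
s_strat = df.groupby("stratum").sample(frac=0.5, random_state=0)
print(s_strat)
```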

Learning Resources
Dr. Eugene Rex Jalao’s video on Data Reduction and Manipulation.



Activity 2-3

Watch:
Dr. Jalao’s video on Data Reduction and Manipulation
Other References Used for Module 2:

Malik, J.S., Goyal, P., & Sharma, A.K. (2010). A Comprehensive Approach Towards Data
Preprocessing Techniques & Association Rules. IES-IPS Academy, Indore, India. Available at
https://bvicam.ac.in/news/INDIACom%202010%20Proceedings/papers/Group3/INDIACom10_279_Paper%20(2).pdf

Son, N.H. (2006). Data Mining Course: Data Cleaning and Data Preprocessing. Warsaw
University. Available at http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf

Malley, B., Ramazzotti, D., & Wu, J.T. (2016). Data Pre-processing. In: Secondary Analysis
of Electronic Health Records. Springer, Cham. Available at
https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2



MODULE 3: SUPERVISED LEARNING

Introduction

In Module 1, you have already encountered the definition of supervised learning. This
time, Module 3 discusses supervised learning in greater detail. Specifically, supervised
learning methodologies can be categorized into two: classification, the prediction of a
class or category from several predictor variables; and regression, the prediction of a
numerical value from one or more predictors. This module also tackles how regression
and classification models may be evaluated.

Learning Objectives

After working on this module, you should be able to:


• Differentiate classification from regression.
• Identify and discuss appropriate supervised learning methodologies for various
scenarios and business problems.
• Build and validate prediction models by utilizing supervised learning
methodologies.
• Evaluate the performance of the prediction model.

3.1. Classification

Given a collection of records, suppose we have multiple predictor variables (x₁, x₂, …, xₚ)
and one categorical response (y). Here, we intend to find a model for predicting the
class variable from the multiple predictor variables. This is the essence of classification: a
categorical response is predicted from multiple predictor variables. In classification,
historical data are used to build a model, and the goal is to predict previously unseen
records. There are several classification algorithms, some of which are listed below;
a brief scikit-learn sketch follows the list.

ZeroR
The simplest classification methodology, which relies on the target and ignores all
predictors.

OneR
A simple yet accurate classification algorithm that generates one rule for each predictor in
the data.

Naïve-Bayes
A frequency-based classifier that uses a probabilistic framework and conditional
probability to classify data.

Decision Tree
It builds classification models in the form of a tree structure that represents rules that can
be easily understood.

Nearest Neighbours
An intuitive method that classifies unlabeled data based on their similarity with sample
examples in the training set. It utilizes distance as a similarity measure in making
predictions.

Artificial Neural Network (ANN)
A network of perceptrons or nodes that mimics a biological network of neurons in the
brain.

Support Vector Machine (SVM)
It performs classification by finding a hyperplane that maximizes the margin between the two
classes.

Ensemble
This predicts the class of previously unseen records by aggregating predictions made by
multiple classifiers.

Random Forests
A relatively modern algorithm that is essentially an ensemble of decision trees.
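As a hedged illustration, the sketch below trains two of the listed classifiers with scikit-learn on its bundled iris data set; hyperparameters are left at their defaults and are not tuned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Labeled historical data, split so unseen records can be predicted.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree: classification rules in a tree structure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Nearest neighbours: distance-based similarity to training examples.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("knn accuracy:", knn.score(X_test, y_test))
```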

Learning Resources
• Video on “Classification” by Dr. Eugene Rex Jalao.
• Video on “Naive Bayes” by Dr. Eugene Rex Jalao
• Video on “Decision Trees” by Dr. Eugene Rex Jalao
• Video on “Nearest Neighbours” by Dr. Eugene Rex Jalao
• Video on “Artificial Neural Networks” by Dr. Eugene Rex Jalao
• Video on “Support Vector Machines” by Dr. Eugene Rex Jalao
• Video on “Ensembles” by Dr. Eugene Rex Jalao
• Video on “Random Forests” by Dr. Eugene Rex Jalao



3.1.1. Classification Model Evaluation

We will not know how well a model performs unless model evaluation comes into the
picture. Model evaluation is a methodology used to find the model that best represents the
data and to gauge how well the chosen model will work in the future. Listed below are
questions that need to be answered in model evaluation:

1. How do we evaluate the performance of a classification model? Appropriate


metrics have to be defined.
2. How do we obtain reliable estimates of these errors? The methods for
performance evaluation, and the concepts of overfitting and underfitting have to
be considered.
3. How do we compare the relative performance among competing models?

Thus, if there are multiple prediction models or algorithms, how are these models
compared? Which one among them should be chosen for deployment in the business?
A short scikit-learn sketch of a typical evaluation follows.
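This is a minimal sketch, assuming scikit-learn: a held-out test set gives a more reliable error estimate than the training data, and the metrics shown are common defaults rather than the only valid choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Bundled binary-classification data; a test set is held out so the
# error estimate is not biased by the data the model was trained on.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

print(accuracy_score(y_test, pred))         # fraction of correct predictions
print(confusion_matrix(y_test, pred))       # per-class breakdown of mistakes
print(classification_report(y_test, pred))  # precision, recall, F1 per class
```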

Learning Resources
Video on “Model Evaluation” by Dr. Eugene Rex Jalao.

Assignment 1
Your faculty-in-charge will give you your assignment on Classification Methodologies.

3.2. Regression

Regression is a data mining task of predicting the target's value, i.e., a numerical
variable (y), by building a model based on one or more predictors, which can be
numerical or categorical variables.

There are many applications of regression, such as data description, parameter
estimation, prediction/estimation, and control. In this regard, regression analysis,
arguably the most widely used (and misused) statistical technique, is utilized. Multiple
linear regression (MLR) is a method used to model the linear relationship between a
target variable and two or more predictor variables.
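Here is a minimal MLR sketch with scikit-learn; the sales, ad-spend, and price figures are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: sales predicted from ad spend and price (two predictors).
X = np.array([[10, 5.0], [12, 4.8], [15, 4.5], [18, 4.4], [20, 4.0]])
y = np.array([100, 112, 130, 148, 165])

mlr = LinearRegression().fit(X, y)
print("intercept:", mlr.intercept_)
print("coefficients:", mlr.coef_)          # one slope per predictor
print("prediction:", mlr.predict([[16, 4.3]]))
```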



Learning Resources
Video on “Regression” by Dr. Eugene Rex Jalao.

3.2.1. Regression Model Evaluation

In regression, we predict actual values that are numerical in nature. Considering this,
how do we know whether the predictions are accurate or whether the regression model
is valid? Listed below are questions to be considered in evaluating a regression model.

1. Is at least one of the predictors useful in predicting the response? If this is not
the case, we cannot predict Y in the first place because none of the predictors
are useful.
2. How well does the model fit the data? Is it a good fit?
3. Given a set of predictor values, what response value should we predict?
4. Are there any outliers that might influence the coefficients?
5. Do all of the predictors help explain Y, or is only a subset of the predictors useful?
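A short sketch of two common diagnostics, assuming scikit-learn: R-squared speaks to question 2 (goodness of fit), while RMSE summarizes the typical prediction error in the units of y. The data are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # invented predictor
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])            # invented response

model = LinearRegression().fit(X, y)
pred = model.predict(X)

print("R^2:", r2_score(y, pred))                      # variance explained
print("RMSE:", np.sqrt(mean_squared_error(y, pred)))  # typical error size
```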

Learning Resources
Video on “Regression Model Evaluation” by Dr. Eugene Rex Jalao.

3.2.2. Indicator Variables

Qualitative variables do not have a scale of measurement. In this regard, indicator
variables, also known as dummy variables, are used to model qualitative variables in
regression. Indicator variables assign levels to qualitative variables/categories so that
regression analysis can be performed on them.
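A brief pandas sketch of creating indicator variables follows; the data frame is invented, and drop_first=True drops one baseline category to avoid the perfect collinearity that a full set of dummies would introduce.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "south", "east"],
                   "sales": [120, 90, 95, 110]})

# One 0/1 indicator column per category, dropping one baseline level.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
df = pd.concat([df.drop(columns="region"), dummies], axis=1)
print(df)
```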

Learning Resources
Video on “Indicator Variables” by Dr. Eugene Rex Jalao.



3.2.3. Multicollinearity

Multicollinearity is the inflation of coefficient estimates due to interdependent regressors.
This implies that if the regressors (or the predictor variables) are dependent on each
other (i.e., there is correlation among x₁, x₂, x₃, …), then the problem of multicollinearity
exists in the final regression model. The said problem does not exist if all regressors are
completely independent of each other; however, this is a rare occurrence in regression
analysis. Usually, there is interdependence among predictor variables to a certain extent.

The effect of strong multicollinearity is that it can result in large variances and
covariances for the least squares estimates of the coefficients. Large variances imply
unstable predictions, and coefficient estimates would be very sensitive to minor changes
in the regression model. Thus, the question now is: how is multicollinearity detected? We
want procedures that correctly identify the presence of multicollinearity and provide
insights as to which regressors are causing the problem.
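One standard detection procedure is the variance inflation factor (VIF). Here is a sketch assuming the statsmodels library; the simulated regressors are invented, and a VIF well above roughly 5 to 10 is commonly read as a warning sign.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)                        # independent regressor
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each regressor (column 0 is the constant, so start at 1);
# x1 and x2 should show large VIFs, x3 a VIF near 1.
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```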

Learning Resources
Video on “Multicollinearity” by Dr. Eugene Rex Jalao.

3.2.4. Logistic Regression

Logistic regression predicts the probability of an outcome that can only have two values.
As such, it can be considered a classification algorithm. It predicts a probability, but a
threshold can be applied: for example, anything below 0.5 is assigned to one
class/category, and anything above 0.5 to the other. Furthermore, the prediction uses one
or several predictors, which can be numerical or categorical in nature.
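A short scikit-learn sketch of this thresholding idea, using scikit-learn's bundled breast cancer data set; 0.5 is the conventional default cutoff, not a requirement.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

proba = model.predict_proba(X[:5])[:, 1]  # predicted probability of class 1
labels = (proba >= 0.5).astype(int)       # threshold at 0.5 to get classes
print(proba, labels)
print(model.predict(X[:5]))               # sklearn applies the same cutoff
```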

Learning Resources
Video on “Logistic Regression” by Dr. Eugene Rex Jalao.

Assignment 2
Your faculty-in-charge will give you your assignment on Regression Methodologies.



MODULE 4: UNSUPERVISED LEARNING

Introduction

Module 4 talks about unsupervised learning, where we find hidden patterns within the
data. There is no response or class variable as in classification or regression.
Moreover, in unsupervised learning, there is no guarantee that there are meaningful
patterns.

Learning Objectives

After working on this module, you should be able to:


1. Identify and discuss appropriate unsupervised learning methodologies for various
scenarios and business problems.
2. Build and validate prediction models by utilizing unsupervised learning
methodologies.
3. Evaluate the performance of the prediction model.

4.1. Association Rule Mining

Association rule mining is a rule-based method for discovering relationships between
variables. It is intended to identify strong rules using certain measures of
interestingness. Specifically, given a set of transactions, we find rules that will predict the
occurrence of an item based on the occurrences of other items in a particular
transaction.

Figure 1. Example of Association Rules
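As a hedged sketch, the third-party mlxtend library provides an Apriori implementation for exactly this task (assuming mlxtend is installed; the market-basket transactions are invented):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Invented market-basket transactions.
transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "beer"],
                ["bread", "milk", "diapers"]]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets first, then rules filtered by a confidence threshold.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```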



Learning Resources
Video on “Association Rule Mining” by Dr. Eugene Rex Jalao

4.2. Sequential Pattern Mining

Sequential pattern mining is concerned with finding statistically relevant patterns within
data where values are delivered in a sequence over time. A sequence is an
ordered list of elements or transactions, while an element contains a collection of
events or items. Each element is attributed to a specific time or location of a particular
transaction. Sequential pattern mining is performed by growing subsequences or
patterns one at a time.

Figure 2. Example of sequence data.

Learning Resources
Video on “Sequential Pattern Mining” by Dr. Eugene Rex Jalao

4.3. Clustering

Clustering is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each
other than to those belonging to other clusters.

4.3.1. K-Means Clustering


K-means clustering is a basic partitional clustering approach or algorithm that
classifies or groups objects based on attributes into k groups (clusters). The number
of clusters (k) must be specified first. Each cluster is associated with a center point
called the centroid, and each point in the data set is assigned to the cluster with the
closest centroid.



Figure 3. Overview of K-Means.
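A minimal K-means sketch with scikit-learn follows; the 2-D points are invented, and k is fixed at 2 up front, as the algorithm requires.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points forming two loose groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# k (n_clusters) must be specified first; each point is then assigned
# to the cluster with the closest centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("assignments:", km.labels_)
print("centroids:", km.cluster_centers_)
```

Learning Resources
Video on "K-Means Clustering" by Dr. Eugene Rex Jalao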

4.3.2. Hierarchical Clustering

Hierarchical clustering produces a set of nested clusters organized as a hierarchical
tree, which is usually visualized as a dendrogram. A dendrogram is a tree-like diagram
that records the sequences of merges or splits in the data set. One of the advantages of
this clustering method is that there is no need to assume any particular number of
clusters, unlike in K-means. Any desired number of clusters can be obtained by cutting the
dendrogram at an appropriate level.

Figure 4. Sample dendrogram.
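A sketch of hierarchical clustering with SciPy follows (points invented); linkage builds the merge tree, dendrogram can draw it, and fcluster cuts it into any desired number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]])

Z = linkage(X, method="average")                 # record of merges (the tree)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) draws the tree when a plotting backend is available, e.g.:
# import matplotlib.pyplot as plt; dendrogram(Z); plt.show()
```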



Learning Resources
Video on “Hierarchical Clustering” by Dr. Eugene Rex Jalao

4.4. Text Mining

Text mining, also known as text data mining or knowledge discovery in textual
databases, is a semi-automated process of extracting knowledge from unstructured data
sources. Benefits of text mining are obvious in text-rich data environments such as in
law, academic research, medicine, biology, technology, finance, and marketing. It can
also be utilized in electronic communication records; examples of which include spam
filtering, email prioritization and categorization, and automatic response generation.
Basically, text mining consists of these steps (a small sketch of the first two follows):

1. Establish the corpus.
2. Create the term-document matrix.
3. Extract knowledge from the term-document matrix.
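Here is a small sketch of steps 1 and 2 using scikit-learn's CountVectorizer; the three-document corpus is invented, and step 3 would apply further mining to the resulting matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Step 1: establish the corpus (invented documents).
corpus = ["predictive analytics predicts the future",
          "data mining extracts patterns from data",
          "text mining extracts knowledge from text"]

# Step 2: create the term-document matrix (rows: documents, columns: terms).
vec = CountVectorizer()
tdm = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
print(tdm.toarray())
```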

Learning Resources
Video on “Text Mining” by Dr. Eugene Rex Jalao

4.5. Social Media Sentiment Analysis

Social media sentiment analysis deals with two main types of textual information: facts and
opinions. We take note that factual statements can imply opinions too. Most current text
information processing methods (e.g., text mining) work with factual information. In
essence, work on such textual data can be extended into sentiment analysis or opinion
mining, defined as the computational study of opinions, sentiments, and emotions
expressed in text.

Opinions are important because whenever a decision has to be made, we want to hear
the opinions of others. In the past, we asked for opinions from friends, family, focus groups,
and consultants. Now, with the advent of the Internet, opinions are available on a global scale.
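As a hedged sketch, NLTK's VADER scorer assigns polarity scores to short, social-media-style texts (assuming nltk is installed and the vader_lexicon resource can be downloaded; the posts are invented):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

posts = ["I love this product!",
         "Worst service ever.",
         "The parcel arrived today."]
for post in posts:
    print(post, sia.polarity_scores(post))  # neg/neu/pos and compound score
```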



Learning Resources
Video on “Social Media Sentiment Analysis” by Dr. Eugene Rex Jalao

Assignment 3
Your faculty-in-charge will give you your assignment on Unsupervised Learning
Methodologies.



FINAL EXAM

IDENTIFICATION / FILL-IN THE BLANKS

1. Predictive Analytics / Data Mining - Exploration and analysis of a large quantity of
data to discover meaningful patterns by automatic or semi-automatic means.

2. Classification - Historical data are used to build a model, and the goal is to predict
previously unseen records.

3. ZeroR - The simplest classification methodology, which relies on the target and
ignores all predictors.

4. Naïve-Bayes - A frequency-based classifier that uses a probabilistic framework and
conditional probability to classify data.

5. Nearest Neighbours - An intuitive method that classifies unlabeled data based on
their similarity with sample examples in the training set.

6. Regression - A data mining task of predicting the target's value by building a model
based on one or more predictors, which can be numerical and categorical variables.

7. Indicator Variables - These are used to model qualitative variables in regression;
they assign levels to such qualitative variables so that regression analysis can be
performed on them.

8. Logistic Regression - It is considered a classification algorithm that predicts the
probability of an outcome that can only have two values.

9. Association Rule Mining - A rule-based method that is intended to identify strong
rules using certain measures of interestingness.

10. K-Means Clustering - A basic partitional approach or algorithm that classifies or
groups objects based on attributes into groups.

11. Dendrogram - A tree-like diagram that records the sequences of merges or splits in
the data set.

12. Text Mining - A semi-automated process of extracting knowledge from unstructured
data sources.

SOMETHING TO PONDER / ESSAY


In no more than 4 sentences, explain your answers to the following items:
1. Justify the necessity of utilizing the CRISP-DM Framework in predictive analytics.
2. Differentiate a decision tree from a random forest.
3. Differentiate supervised learning from unsupervised learning.
4. What are the differences among classification, regression, and clustering?
5. What is the significance of data preprocessing?

