Open University
Fundamentals of Predictive Analytics
A Business Analytics Course
Welcome, dear students! This course will help you traverse the world of predictive
analytics. In predictive analytics (also sometimes called data mining), useful patterns can
be extracted from available data and, in turn, be utilized to predict the future.
Moreover, predictive analytics draws ideas from various fields such as machine learning,
pattern recognition, statistics, and database systems.
Before taking this course, you should have already completed the Fundamentals of
Descriptive Analytics course. And, one more thing! This very document is your course
guide. Kindly read this carefully before embarking on your journey to learn the
Fundamentals of Predictive Analytics.
COURSE OBJECTIVES
COURSE OUTLINE
COURSE MATERIALS
The course learning package consists of the items listed below. These will be made
available for access and download.
1. Course Guide
2. Study Guides
3. Video Lectures/Resources
4. Other Digital References
STUDY SCHEDULE
COURSE REQUIREMENTS
Introduction
This is the first module in the course. As such, it gives an overview of what students
will be learning in the course as a whole, i.e., predictive analytics. The principles
pertaining to predictive analytics are briefly defined and discussed.
Learning Objectives
Predictive analytics (also sometimes called data mining) is the non-trivial extraction of
implicit, previously unknown, and potentially useful information from the data. In other
words, useful patterns are extracted from the data, and we hope that these patterns will
be repeated in the future. Another definition of data mining is that it is the exploration
and analysis of a large quantity of data to discover meaningful patterns by automatic or
semi-automatic means. Data mining is all about explaining the past to predict the future.
Predictive analytics draws ideas from various fields such as machine learning, pattern
recognition, statistics, and database systems.
Discussion Forum 1
Discuss the potential applications of predictive analytics/data mining in your field of work.
Learning Resources
Video on “Supervised Learning vs Unsupervised Learning” by Asst. Prof.
Reinald Adrian Pugoy
Study Questions
1. How does the DBMS perform the functionalities listed in this module?
2. How do the different components of a database system relate to one another?
Data mining tools are software packages usually downloaded or purchased from third-party
providers. An example of such a tool is R, an open-source package considered the most
widely used tool for data mining and predictive analytics. Other tools include SPSS,
RapidMiner, SAS, Excel, and Python, with Python closing the gap with R in recent years.
Furthermore, it is important to note that no single tool solves all predictive analytics
problems. In other words, solutions for such problems cannot be implemented with just one
piece of software. For this reason, a majority of companies worldwide use both
free/open-source software and commercial software.
Learning Resources
Video on “Tools of Data Mining” by Dr. Eugene Rex Jalao.
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a non-
proprietary framework for implementing predictive analytics solutions. It is a standard
process that ensures that data mining is reliable and can be easily repeated by people
with little to no data mining background. It also demonstrates the maturity of data
mining and reduces dependency on experts. Furthermore, the CRISP-DM model serves as an
aid to project planning and management and is considered a "comfort factor" for new
adopters.
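For reference, the CRISP-DM model organizes a data mining project into six phases:
business understanding, data understanding, data preparation, modeling, evaluation,
and deployment.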
Learning Resources
Video on “CRISP-DM (Cross-Industry Standard Process for Data Mining)” by Dr.
Eugene Rex Jalao.
Study Question
Why is it important for managers to know how entity relationship diagrams are
designed?
Introduction
Learning Objectives
After studying the basic concepts in data preprocessing, you should be able to:
1. Explain what data preprocessing is and why it is important in data analytics; and
2. Describe different forms of data preprocessing.
Data in the real world tend to be incomplete, noisy, and inconsistent. “Dirty” data can
lead to errors in parameter estimation and incorrect analysis leading users to draw false
conclusions. Quality decisions must be based on quality data; hence, unclean data may
cause incorrect or even misleading statistical results and predictive analysis. Data
preprocessing is a data mining technique that involves transforming raw or source data
into an understandable format for further processing.
Several distinct steps are involved in preprocessing data. Here are the general steps
taken to pre-process data:
• Data cleaning
o This step deals with missing data, noise, outliers, and duplicate or
incorrect records while minimizing introduction of bias into the database.
o Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data integration
o Extracted raw data can come from heterogeneous sources or be in
separate datasets. This step reorganizes the various raw datasets into a
single dataset that contains all the information required for the desired
statistical analyses.
o Involves integration of multiple databases, data cubes, or files.
o Data with different representations are put together and conflicts within
the data are resolved.
• Data transformation
Pre-processing is sometimes iterative and may involve repeating this series of steps until
the data are satisfactorily organized for the purpose of statistical analysis. During
preprocessing, one needs to take care not to accidentally introduce bias by modifying
the dataset in ways that will impact the outcome of statistical analyses. Similarly, we
must avoid reaching statistically significant results through “trial and error” analyses on
differently pre-processed versions of a dataset.
Learning Resources
Dr. Eugene Rex Jalao’s video on Data Preprocessing
Activity 2-1
Watch:
Dr. Eugene Rex Jalao’s video on Data Preprocessing.
A. Data Integration
Data integration is the process of combining data derived from various data sources
(such as databases, flat files, etc.) into a consistent dataset. In data integration, data
from the different sources, as well as the metadata - the data about this data - from
different sources are integrated to come up with a single data store. There are a number
of issues to consider during data integration, related mostly to possibly different
standards among data sources. These issues include the entity identification problem,
data value conflicts, and redundant data. Careful integration of the data from multiple
sources may help reduce or avoid redundancies and inconsistencies and improve data mining
speed and quality. The common join operations used in data integration, illustrated in a
short code sketch after this list, are the following:
1. Inner Join - creates a new result table by combining column values of two tables
(A and B) based upon the join-predicate.
2. Left Join - returns all the values from an inner join plus all values in the left
table that do not match the right table, with NULL (empty) values in the link
column for the unmatched rows.
3. Right Join - returns all the values from the right table and matched values from
the left table (NULL in the case of no matching join predicate).
4. Outer Join - the union of all the left join and right join values.
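As an illustration only, these four joins can be reproduced with the pandas library in
Python; the table names, columns, and values below are hypothetical.

import pandas as pd

# Hypothetical example tables; names, columns, and values are illustrative only.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Carla"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [150.0, 80.0, 200.0]})

inner = pd.merge(customers, orders, on="customer_id", how="inner")  # 1. matching rows only
left = pd.merge(customers, orders, on="customer_id", how="left")    # 2. all left rows, NULL (NaN) where unmatched
right = pd.merge(customers, orders, on="customer_id", how="right")  # 3. all right rows, NULL (NaN) where unmatched
outer = pd.merge(customers, orders, on="customer_id", how="outer")  # 4. union of the left and right joins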
Activity 2-1
Watch:
Dr. Jalao’s video on Data Integration.
1. How are the alternative data warehousing architectures different from the usual
architecture?
2. Discuss the advantages and disadvantages of the different alternative data
warehousing architectures.
B. Data Transformation
Data transformation is the process of converting data from one format to another. It aims
to transform the data values into a format, scale, or unit that is more suitable for
analysis. Data transformation is an important step in data preprocessing and a
prerequisite for building predictive analytics solutions.
1) Normalization - a way to scale a specific variable to fall within a small specified
range
a) Min-max normalization - transforming values to a new scale such that all
values fall within a standardized range, typically [0, 1].
2) Encoding a categorical variable when the class is continuous - replace the
categorical variable with a single new numerical variable, replacing each category
of the categorical variable with the corresponding average of the class variable
(both transformations are sketched in code below).
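A minimal Python sketch of both transformations, assuming a pandas DataFrame with a
hypothetical numerical column x, categorical column city, and continuous class variable y:

import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({"x": [10.0, 20.0, 40.0],
                   "city": ["A", "B", "A"],
                   "y": [1.0, 3.0, 5.0]})

# Min-max normalization: x' = (x - min) / (max - min) rescales x into [0, 1].
df["x_scaled"] = (df["x"] - df["x"].min()) / (df["x"].max() - df["x"].min())

# Continuous class: replace each category with the average of the class variable y.
df["city_encoded"] = df["city"].map(df.groupby("city")["y"].mean())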
C. Data Cleaning
All data sources potentially include errors and missing values; data cleaning addresses
these anomalies. Data cleaning is the process of altering data in a given storage
resource to make sure that it is accurate and correct. Data cleaning routines attempt to
fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data, as well as resolve redundancy caused by data integration.
c) Identifying outliers
Solutions for identifying outliers:
i. Box plot (its outlier rule is sketched in code below)
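The box plot flags points beyond 1.5 times the interquartile range (IQR) from the
quartiles as potential outliers. A minimal sketch of that rule in Python, with
hypothetical data:

import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 95])  # illustrative data only

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
# Box plot convention: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < lower) | (values > upper)])  # flags the extreme value 95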
Activity 2-2
Watch:
Dr. Jalao’s video on Data Cleaning.
Data reduction is a process of obtaining a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same) analytical results.
The need for data reduction emerged because a database or data warehouse may store
terabytes of data, and complex data analysis/mining may take a very long time to run on
the complete data set.
c. Feature Creation - creating new attributes that can capture the important
information in a data set much more efficiently than the original attributes.
i. Feature Creation Methodologies (feature extraction is sketched in code below):
1. Feature Extraction
2. Mapping Data to New Space
3. Feature Construction
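As one hedged illustration, principal component analysis (PCA) is a common feature
extraction technique that maps data to a new, smaller space. A minimal scikit-learn
sketch with synthetic data (all names and values illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 10))              # illustrative data: 100 records, 10 attributes

pca = PCA(n_components=2)              # keep only the 2 directions of largest variance
X_reduced = pca.fit_transform(X)       # reduced representation: 100 records, 2 features
print(pca.explained_variance_ratio_)   # share of variance each component captures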
Learning Resources
Dr. Eugene Rex Jalao’s video on Data Reduction and Manipulation.
Watch:
Dr. Jalao’s video on Data Reduction and Manipulation
Other References Used for Module 2:
Son, N.H. (2006). Data mining course - data cleaning and data preprocessing. Warsaw
University. Available at http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
Malley, B., Ramazzotti, D., & Wu, J.T. (2016). Data pre-processing. In: Secondary
Analysis of Electronic Health Records. Springer, Cham. Available at
https://link.springer.com/chapter/10.1007%2F978-3-319-43742-2_12#Sec2
Introduction
In Module 1, you have already encountered the definition of supervised learning. This
time, Module 3 discusses supervised learning in greater detail. Specifically, supervised
learning methodologies can be categorized into two: classification, the prediction of a
class or category from several predictor variables; and regression, the prediction of a
numerical value from one or more predictors. This module also tackles how regression
and classification models may be evaluated.
Learning Objectives
3.1. Classification
Given a collection of records, let us say that we have multiple predictor variables
(x1, x2, ..., xp) and one categorical response (y). Here, we intend to find a model for
predicting the class variable from the multiple predictor variables. This is the essence
of classification: a categorical response is predicted from multiple predictor variables.
In classification, historical data are used to build a model, and the goal is to predict
previously unseen records. There are several classification algorithms; some of these are
listed below, followed by a short illustrative sketch:
ZeroR
The simplest classification methodology which relies on the target and ignores all
predictors.
OneR
A simple yet reasonably accurate classification algorithm that generates one rule for
each predictor in the data and then keeps the rule with the smallest error.
Decision Tree
It builds classification models in the form of a tree structure that represents rules that can
be easily understood.
Nearest Neighbours
An intuitive method that classifies unlabeled data based on their similarity with
examples in the training set. It utilizes distance as a similarity measure in making
predictions.
Ensemble
This predicts the class of previously unseen records by aggregating predictions made by
multiple classifiers.
Random Forests
A relatively modern algorithm that is essentially an ensemble of decision trees.
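To make the shared workflow concrete (fit a model on historical, labeled data, then
predict unseen records), here is a minimal decision tree sketch using scikit-learn's
bundled iris dataset; it is an illustration, not a prescribed tool for this course:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # predictors and a categorical response
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3)  # a tree of easily understood rules
model.fit(X_train, y_train)                  # build the model from historical data
print(model.predict(X_test[:5]))             # predict previously unseen records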
Learning Resources
• Video on “Classification” by Dr. Eugene Rex Jalao.
• Video on “Naive Bayes” by Dr. Eugene Rex Jalao
• Video on “Decision Trees” by Dr. Eugene Rex Jalao
• Video on “Nearest Neighbours” by Dr. Eugene Rex Jalao
• Video on “Artificial Neural Networks” by Dr. Eugene Rex Jalao
• Video on “Support Vector Machines” by Dr. Eugene Rex Jalao
• Video on “Ensembles” by Dr. Eugene Rex Jalao
• Video on “Random Forests” by Dr. Eugene Rex Jalao
We will not know how well a model performs unless model evaluation comes into the
picture. Model evaluation is a methodology used to find the model that best represents
the data and to estimate how well the chosen model will perform in the future. Listed
below are questions that need to be answered in model evaluation:
Thus, if there are multiple prediction models or algorithms, how are these models
compared? Which one among them will be chosen for deployment in the business? A short
comparison sketch follows below.
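One common, hedged way to compare candidates is to hold out a test set and score every
model on it; a minimal scikit-learn sketch:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score each candidate model on the same held-out test set; deploy the best one.
for model in (DecisionTreeClassifier(), KNeighborsClassifier()):
    preds = model.fit(X_train, y_train).predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, preds))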
Learning Resources
Video on “Model Evaluation” by Dr. Eugene Rex Jalao.
Assignment 1
Your faculty-in-charge will give you your assignment on Classification Methodologies.
3.2. Regression
Regression is a data mining task of predicting the target’s value, i.e., the numerical
variable (y), by building a model based on one or more predictors, which can be
numerical and categorical variables.
In regression, we predict actual values that are numerical in nature. Considering this,
how do we know whether the predictions are accurate or whether the regression model
is valid? Listed below are questions to be considered in evaluating a regression model;
a short sketch of common numerical fit measures follows the list.
1. Is at least one of the predictors useful in predicting the response? If this is not
the case, we cannot predict Y in the first place because none of the predictors
are useful.
2. How well does the model fit the data? Is it a good fit?
3. Given a set of predictor values, what is the predicted response value?
4. Are there any outliers that might influence the coefficients?
5. Do all of the predictors help explain Y or is only a subset of the predictors useful?
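Questions 1 and 2 are usually assessed numerically, for example with R-squared (goodness
of fit) and RMSE (typical prediction error). A minimal sketch with synthetic data, so all
names and values here are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.random((100, 2))                               # two hypothetical predictors
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
preds = model.predict(X)
print("R-squared:", r2_score(y, preds))                # how well the model fits
print("RMSE:", np.sqrt(mean_squared_error(y, preds)))  # typical size of an error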
Learning Resources
Video on “Regression Model Evaluation” by Dr. Eugene Rex Jalao.
Learning Resources
Video on “Indicator Variables” by Dr. Eugene Rex Jalao.
Multicollinearity is a condition in which predictor variables in the final regression
model are strongly correlated with one another. The problem does not exist if all
regressors are completely independent of each other; however, this is a rare occurrence
in regression analysis. Usually, there is some degree of interdependence among predictor
variables.
The effect of strong multicollinearity is that it can result in large variances and
covariances for the least squares estimates of the coefficients. Large variances imply
unstable predictions, and coefficient estimates become very sensitive to minor changes
in the regression model. Thus, the question now is: how is multicollinearity detected?
We want procedures that correctly identify the presence of multicollinearity and provide
insights as to which regressors are causing the problem.
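One widely used diagnostic is the variance inflation factor (VIF): a VIF far above 1
(a common rule of thumb is above 10) flags a regressor involved in multicollinearity.
A minimal statsmodels sketch with synthetic, deliberately collinear data (all names and
values hypothetical):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.random(100)
X = pd.DataFrame({"x1": x1,
                  "x2": 0.9 * x1 + rng.normal(scale=0.01, size=100),  # nearly collinear with x1
                  "x3": rng.random(100)})

# Compute a VIF for each regressor; x1 and x2 should show inflated values.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))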
Learning Resources
Video on “Multicollinearity” by Dr. Eugene Rex Jalao.
Logistic regression predicts the probability of an outcome that can only have two values.
As such, it can be considered a classification algorithm: it predicts a probability, but
a cutoff can be applied, e.g., anything below 0.5 belongs to one class/category, and
anything above 0.5 to the other. Furthermore, the prediction uses one or several
predictors, which can be numerical or categorical in nature.
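A minimal scikit-learn sketch of this probability-plus-cutoff idea, using the library's
bundled two-class breast cancer dataset (illustrative only):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # an outcome with only two values
model = LogisticRegression(max_iter=5000).fit(X, y)

proba = model.predict_proba(X[:5])[:, 1]          # predicted probability of class 1
labels = (proba >= 0.5).astype(int)               # 0.5 cutoff turns probabilities into classes
print(proba, labels)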
Learning Resources
Video on “Logistic Regression” by Dr. Eugene Rex Jalao.
Assignment 2
Your faculty-in-charge will give you your assignment on Regression Methodologies.
Introduction
Module 4 talks about unsupervised learning where we find hidden patterns within the
data. There is no response or class variable like in classification or regression.
Moreover, in unsupervised learning, there is no guarantee that there are meaningful
patterns.
Learning Objectives
Sequential pattern mining is concerned with finding statistically relevant patterns
within time-series data, where values are delivered in a sequence. A sequence is an
ordered list of elements or transactions, whereas an element contains a collection of
events or items. Each element is attributed to a specific time or location of a
particular transaction. Sequential pattern mining is performed by growing subsequences
or patterns one at a time.
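At the heart of such algorithms is a subsequence test: does an ordered pattern of
elements occur, in order, within a sequence? A minimal pure-Python sketch with a
hypothetical market-basket sequence:

def contains_pattern(sequence, pattern):
    """True if the pattern's elements occur within the sequence, in order;
    each pattern element must be a subset of some later sequence element."""
    remaining = iter(sequence)
    return all(any(p <= e for e in remaining) for p in pattern)

# Each element is the set of items in one transaction, ordered by time.
sequence = [{"bread"}, {"milk", "eggs"}, {"bread", "butter"}]
pattern = [{"bread"}, {"butter"}]              # "bread, then later butter"
print(contains_pattern(sequence, pattern))     # True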
Learning Resources
Video on “Sequential Pattern Mining” by Dr. Eugene Rex Jalao
4.3. Clustering
Clustering is the task of assigning a set of objects into groups (called clusters) so
that objects in the same cluster are more similar (in some sense or another) to each
other than to those belonging to other clusters.
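K-means is perhaps the most familiar clustering algorithm; a minimal scikit-learn sketch
with synthetic data drawn around two centers (all names and values illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D data: two groups of points around different centers.
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])          # cluster assignment of the first few objects
print(kmeans.cluster_centers_)     # the two discovered cluster centers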
Text mining, also known as text data mining or knowledge discovery in textual
databases, is a semi-automated process of extracting knowledge from unstructured data
sources. Benefits of text mining are obvious in text-rich data environments such as in
law, academic research, medicine, biology, technology, finance, and marketing. It can
also be utilized in electronic communication records; examples of which include spam
filtering, email prioritization and categorization, and automatic response generation.
Basically, text mining consists of these steps:
Learning Resources
Video on “Text Mining” by Dr. Eugene Rex Jalao
Social media sentiment analysis deals with two main types of textual information: facts
and opinions (noting that factual statements can imply opinions too). Most current text
information processing methods (e.g., text mining) work with factual information. In
essence, these textual data can be extended into sentiment analysis, or opinion mining,
defined as the computational study of opinions, sentiments, and emotions expressed in
text.
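One deliberately simple, hedged illustration of opinion mining is lexicon-based scoring,
which counts positive and negative words; the tiny word lists below are hypothetical,
and real systems use far larger lexicons or learned models:

POSITIVE = {"good", "great", "love", "excellent"}   # hypothetical mini-lexicon
NEGATIVE = {"bad", "poor", "hate", "terrible"}      # hypothetical mini-lexicon

def sentiment_score(text):
    """Positive score suggests a positive opinion; negative, a negative one."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("the service was great but the food was terrible"))  # 0 (mixed)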
Opinions are important because whenever a decision has to be made, we want to hear
the opinions of others. In the past, we asked for opinions from friends, family, focus
groups, and consultants. Now, with the advent of the Internet, opinions are available on
a global scale.
Assignment 3
Your faculty-in-charge will give you your assignment on Unsupervised Learning
Methodologies.
# | Answer | Item
2 | Classification | Historical data are used to build a model, and the goal is to
predict previously unseen records.