
UNIT – 4: PREDICTIVE MODELING AND ANALYSIS

PREDICTIVE MODELING
Predictive modeling is the process of analysing current data and
known information to predict future outcomes. In predictive
analytics, predictive modeling algorithms are used to estimate
possible future outcomes.
With data science at its peak, predictive modeling has emerged as
a helpful data mining technique that has enabled organizations
and corporations to extract predictive outcomes based on
whatever data is known currently.
In the process of predictive modeling, data is recorded, a
statistical model or an algorithm is applied, and future outcomes
are predicted.
Although this concept has been in practice for more than half a
century, it has only recently gained the significance it deserved
from the start. While the early years went into investigating the
efficacy of this data science technique, recent times have seen
industries and organizations implement it successfully.

TECHNIQUES OF PREDICTIVE MODELING
While predictive modeling is defined as a predictive analytics tool
to extract future outcomes with the help of past data, it can also
be considered as a mathematical procedure used to calculate
future possibilities.
Also known as predictive analytics, predictive modeling
encompasses several different types of predictive models.
1. Classification Model
Among all the predictive modeling techniques in machine
learning, the classification model is one of the most widely used.
In classification predictive modeling, each input is assigned to a
specific category (a class label), and the model predicts the class
to which new inputs belong.
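For illustration only, a minimal classification sketch in Python (assuming scikit-learn is available; the features, labels, and choice of a decision tree are hypothetical, not part of these notes):

# Minimal classification sketch (assumes scikit-learn is installed).
# The toy features and labels below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, account_balance]; label: 1 = "will churn", 0 = "will stay"
X_train = [[25, 1200], [47, 300], [35, 5000], [52, 150], [23, 700]]
y_train = [0, 1, 0, 1, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class label for a new, unseen input.
print(model.predict([[30, 2500]]))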
2. Forecast Model
One of the most popular and accurate predictive models, the
forecast model is used to forecast/predict metric values based on
past data. With the help of historical data, the forecast model
estimates numerical values for new or future data points.
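A minimal forecasting sketch, assuming NumPy and a made-up series of monthly sales, fitting a simple linear trend (one of many possible forecast models):

import numpy as np

# Hypothetical historical data: sales for the past 12 months.
sales = np.array([110, 115, 120, 118, 125, 130, 128, 135, 140, 138, 145, 150])
months = np.arange(len(sales))

# Fit a straight-line trend to the historical data.
slope, intercept = np.polyfit(months, sales, deg=1)

# Forecast the next three months by extrapolating the trend.
future = np.arange(len(sales), len(sales) + 3)
print(intercept + slope * future)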
3. Outliers Model
The outliers model revolves around detecting the ‘outliers’, or
anomalies, in a dataset. Simply put, the outliers model is one of
the types of predictive models that helps to detect anomalies in a
data set and even predict related information about a particular
data set. Especially in the field of finance, the outliers model is
used to detect whether a transaction is fraudulent or safe.

For example, when identifying fraudulent transactions, the model
can assess not only amount, but also location, time, purchase
history and the nature of a purchase.
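A sketch of this idea, assuming scikit-learn's IsolationForest and a hypothetical set of transactions described only by amount and hour of day:

# Anomaly detection sketch (assumes scikit-learn); the transactions are made up.
from sklearn.ensemble import IsolationForest

# Hypothetical transactions: [amount, hour_of_day]
transactions = [
    [25.0, 14], [40.0, 12], [32.5, 16], [28.0, 13],
    [30.0, 15], [3500.0, 3],  # the last one looks anomalous
]

detector = IsolationForest(contamination=0.2, random_state=0)
detector.fit(transactions)

# -1 marks an outlier (a possible fraud), 1 marks a normal transaction.
print(detector.predict(transactions))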

APPLICATIONS OF PREDICTIVE MODELING

1. Sales
Sales are one of the most essential aspects of a business that
keeps it running. Based on the kind of sales a company has
achieved in the past, predictive analysis techniques and tools can
very well establish the future for the company in terms of sales
and profits.
Furthermore, it can also detect the areas wherein the sales
department is lagging, which leads to enhanced performance of
the company in those areas or demographic circles.
2. Marketing
Another application of predictive analytics is marketing. As
marketing is the act of promoting a particular service or a
commodity to a group of target customers, it involves predicting
the reaction of customers and forecasting customer
requirements based on data collected from customer feedback.
3. Social Media
Social media is the hub of unstructured, heterogeneous, and vast
data.
A platform where millions of people interact and use the internet
on a day-to-day basis, social media requires predictive modeling
for forecasting customer feedback and determining the kind of
response a product or a post on the platform will get.

That said, social media is one of the most widely used
applications of predictive modeling, helping various platforms to
detect customer activity and compute future outcomes
accordingly.
4. Risk Assessment
A major application of predictive modeling is risk assessment.
Risk assessment is often practiced in financial institutions and
fraud detection cases where one might want to assess the kind of
risk that s/he is subject to.

Based on data analysis of past records, predictive analytics tools
can help an individual, company, or organization to conduct a
risk assessment and determine the depth of risk or profit that the
future holds.

LOGIC DRIVEN MODELING


Predictive modeling leverages statistics to predict outcomes. Most often the event
one wants to predict is in the future, but predictive modeling can
be applied to any type of unknown event, regardless of when it
occurred. For example, predictive models are often used to detect
crimes and identify suspects, after the crime has taken place.
In many cases the model is chosen on the basis of detection
theory to try to estimate the probability of an outcome given a set
amount of input data; for example, given an email, determining
how likely it is to be spam.
Depending on definitional boundaries, predictive modeling is
synonymous with, or largely overlapping with, the field of
machine learning, as it is more commonly referred to in academic
or research and development contexts. When deployed
commercially, predictive modeling is often referred to as
predictive analytics.
 Nomograms are a useful graphical representation of a
predictive model. As in spreadsheet software, their use
depends on the methodology chosen. The advantage of
nomograms is the immediacy of computing predictions
without the aid of a computer.
 Point estimate tables are one of the simplest forms of
representing a predictive tool. Here, combinations of
characteristics of interest can be represented via either a
table or a graph, and the associated prediction read off the y-
axis or the table itself.
 Tree based methods (e.g. CART, survival trees) provide one
of the most graphically intuitive ways to present predictions.
However, their usage is limited to those methods that use
this type of modeling approach which can have several
drawbacks. Trees can also be employed to represent
decision rules graphically.
 Score charts are graphical or tabular tools used to
represent either predictions or decision rules.
 A statistical model embodies a set of assumptions concerning
the generation of the observed data, and similar data from a
larger population. A model represents, often in considerably
idealized form, the data-generating process. The model
assumptions describe a set of probability distributions, some
of which are assumed to adequately approximate the
distribution from which a particular data set is sampled.
 A logic-driven model is based on experience, knowledge, and
logical relationships of variables and constants connected to
the desired performance outcome. To help conceptualize the
relationships inherent in a system, diagramming methods
are useful.
 A cause-and-effect diagram enables a user to hypothesize
relationships between potential causes of an outcome.
 Influence diagrams are another tool to conceptualize
business performance relationships.

Thus, the economic value of a customer can be expressed in
terms of the following quantities:
 V = value of a loyal customer
 R = revenue per purchase
 F = purchase frequency (number visits per year)
 M = gross profit margin
 D = defection rate (proportion of customers not returning
each year)
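Combining these quantities, a common logic-driven expression (assuming, as an illustration, that a customer remains for 1/D years on average) is:

V = R × F × M × (1/D)

For example, with R = 50 per purchase, F = 10 purchases per year, M = 0.4 and D = 0.2, the value of a loyal customer would be V = 50 × 10 × 0.4 × 5 = 1000.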

STRATEGIES FOR BUILDING PREDICTIVE MODELS


1. Scope and define the predictive analytics model you want to
build. In this step you want to determine what business processes
should be analysed and what the desired business outcomes are,
such as the adoption of a product by a certain segment of
customers.
2. Explore and profile your data. Predictive analytics is data-
intensive. In this step you need to determine the needed data,
where it’s stored, whether it’s readily accessible, and its current
state.
3. Gather, cleanse and integrate the data. Once you know where
the necessary data is located, you may need to clean the data. You
will want to build your model from a consistent and
comprehensive set of information that is ready to be analysed.
4. Build the predictive model. Establish the hypothesis and then
build the test model. Your goal is to include, and rule out, different
variables and factors and then test the model using historical data
to see if the results produced by the model support the hypothesis
(a short code sketch of this step follows the list).
5. Incorporate analytics into business processes. To make the
model valuable, you need to integrate it into the business process
so it can be used to help achieve the outcome.
6. Monitor the model and measure the business results. We live
and market in a dynamic environment, where buying, competitive
and other factors change. You will need to monitor the model and
measure how effective it is at continuing to produce the desired
outcome. It may be necessary to make adjustments and fine tune
the model as conditions evolve.
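As an illustration of steps 4 and 6, a minimal sketch in Python (assuming scikit-learn; the generated dataset stands in for real historical business data):

# Sketch of building and evaluating a predictive model (assumes scikit-learn).
# The dataset and the choice of logistic regression are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for historical business data (e.g. customer attributes vs. product adoption).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out part of the historical data to test the hypothesis.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 6: monitor how well the model continues to produce the desired outcome.
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))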

DATA-DRIVEN MODELING
Data modeling is the process of creating a visual representation of
either a whole information system or parts of it to communicate
connections between data points and structures. The goal is to
illustrate the types of data used and stored within the system, the
relationships among these data types, the ways the data can be
grouped and organized and its formats and attributes.
Data models are built around business needs. Rules and
requirements are defined upfront through feedback from
business stakeholders so they can be incorporated into the design
of a new system or adapted in the iteration of an existing one.
Ideally, data models are living documents that evolve along with
changing business needs. They play an important role in
supporting business processes and planning IT architecture and
strategy. Data models can be shared with vendors, partners,
and/or industry peers.

TYPES OF DATA MODEL


 Conceptual data models. They are also referred to as domain
models and offer a big-picture view of what the system will
contain, how it will be organized, and which business rules
are involved. Conceptual models are usually created as part
of the process of gathering initial project requirements.
Typically, they include entity classes (defining the types of
things that are important for the business to represent in the
data model), their characteristics and constraints, the
relationships between them and relevant security and data
integrity requirements. Any notation is typically simple.

 Logical data models. They are less abstract and provide
greater detail about the concepts and relationships in the
domain under consideration. One of several formal data
modeling notation systems is followed. These indicate data
attributes, such as data types and their corresponding
lengths, and show the relationships among entities. Logical
data models don’t specify any technical system
requirements. This stage is frequently omitted in agile
or DevOps practices. Logical data models can be useful in
highly procedural implementation environments, or for
projects that are data-oriented by nature, such as data
warehouse design or reporting system development.

 Physical data models. They provide a schema for how the
data will be physically stored within a database. As such,
they’re the least abstract of all. They offer a finalized design
that can be implemented as a relational database, including
associative tables that illustrate the relationships among
entities as well as the primary keys and foreign keys that
will be used to maintain those relationships. Physical data
models can include database management system (DBMS)-
specific properties, including performance tuning.

DATA MODELING PROCESS


1. Identify the entities. The process of data modeling begins
with the identification of the things, events or concepts that
are represented in the data set that is to be modelled. Each
entity should be cohesive and logically discrete from all
others.
2. Identify key properties of each entity. Each entity type can be
differentiated from all others because it has one or more
unique properties, called attributes. For instance, an entity
called “customer” might possess such attributes as a first
name, last name, telephone number and salutation, while an
entity called “address” might include a street name and
number, a city, state, country and zip code.

3. Identify relationships among entities. The earliest draft of a
data model will specify the nature of the relationships each
entity has with the others. In the above example, each
customer “lives at” an address. If that model were expanded
to include an entity called “orders,” each order would be
shipped to and billed to an address as well. These
relationships are usually documented via unified modeling
language (UML).
4. Map attributes to entities completely. This will ensure the
model reflects how the business will use the data. Several
formal data modeling patterns are in widespread use.
Object-oriented developers often apply analysis patterns or
design patterns, while stakeholders from other business
domains may turn to other patterns.
5. Assign keys as needed, and decide on a degree of
normalization that balances the need to reduce redundancy
with performance requirements. Normalization is a
technique for organizing data models (and the databases
they represent) in which numerical identifiers, called keys,
are assigned to groups of data to represent relationships
between them without repeating the data. For instance, if
customers are each assigned a key, that key can be linked to
both their address and their order history without having to
repeat this information in the table of customer names.
Normalization tends to reduce the amount of storage space a
database will require, but it can come at a cost to query
performance (a short sketch of key-based normalization follows
this list).
6. Finalize and validate the data model. Data modeling is an
iterative process that should be repeated and refined as
business needs change.
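A small sketch of key-based normalization (step 5), assuming pandas and hypothetical customer, address, and order tables:

# Key-based normalization sketch (assumes pandas); the table contents are hypothetical.
import pandas as pd

# Customers are stored once, identified by a key.
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# Addresses and orders reference customers by key instead of repeating customer data.
addresses = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Delhi"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 2]})

# Relationships can be reassembled on demand by joining on the key.
print(orders.merge(customers, on="customer_id").merge(addresses, on="customer_id"))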

TYPES OF DATA MODELING

Data modeling has evolved alongside database management
systems, with model types increasing in complexity as businesses'
data storage needs have grown. Here are several model types:
 Hierarchical data models represent one-to-many
relationships in a treelike format. In this type of model, each
record has a single root or parent which maps to one or
more child tables. This model was implemented in the IBM
Information Management System (IMS), which was
introduced in 1966 and rapidly found widespread use,
especially in banking. Though this approach is less efficient
than more recently developed database models, it’s still used
in Extensible Markup Language (XML) systems and
geographic information systems (GISs).
 Relational data models were initially proposed by IBM
researcher E.F. Codd in 1970. They are still implemented
today in the many different relational databases commonly
used in enterprise computing. Relational data modeling
doesn’t require a detailed understanding of the physical
properties of the data storage being used. In it, data
segments are explicitly joined through the use of tables,
reducing database complexity.
Relational databases frequently employ structured query
language (SQL) for data management. These databases work well
for maintaining data integrity and minimizing redundancy.
They’re often used in point-of-sale systems, as well as for other
types of transaction processing.
 Entity-relationship (ER) data models use formal diagrams to
represent the relationships between entities in a database.
Several ER modeling tools are used by data architects to
create visual maps that convey database design objectives.
 Object-oriented data models gained traction as object-
oriented programming became popular in the mid-1990s.
The “objects” involved are abstractions of real-world
entities. Objects are grouped in class hierarchies, and have
associated features. Object-oriented databases can
incorporate tables, but can also support more complex data
relationships. This approach is employed in multimedia and
hypertext databases as well as other use cases.
 Dimensional data models were developed by Ralph Kimball,
and they were designed to optimize data retrieval speeds for
analytic purposes in a data warehouse. While relational and
ER models emphasize efficient storage, dimensional models
increase redundancy in order to make it easier to locate
information for reporting and retrieval. This modeling is
typically used across OLAP systems.

BENEFITS OF DATA MODELING


Data modeling makes it easier for developers, data architects,
business analysts, and other stakeholders to view and understand
relationships among the data in a database or data warehouse. In
addition, it can:
 Reduce errors in software and database development.
 Increase consistency in documentation and system design
across the enterprise.
 Improve application and database performance.
 Ease data mapping throughout the organization.
 Improve communication between developers and business
intelligence teams.
 Ease and speed the process of database design at the
conceptual, logical and physical levels.

SUPERVISED LEARNING
Supervised learning is the type of machine learning in which
machines are trained using well "labelled" training data, and on
the basis of that data, machines predict the output. The labelled
data means some input data is already tagged with the correct
output. In supervised learning, the training data provided to the
machines works as the supervisor that teaches the machines to
predict the output correctly. It applies the same concept as a
student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as
correct output data to the machine learning model. The aim of a
supervised learning algorithm is to find a mapping function to
map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk
Assessment, Image classification, Fraud Detection, spam filtering,
etc.

STEPS INVOLVED IN SUPERVISED LEARNING

o First, determine the type of training dataset.
o Collect/Gather the labelled training data.
o Split the dataset into a training set, a test set, and a
validation set.
o Determine the input features of the training dataset, which
should have enough knowledge so that the model can
accurately predict the output.
o Determine the suitable algorithm for the model, such as
support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we
need a validation set, a subset of the training data, to tune
the control parameters.
o Evaluate the accuracy of the model by providing the test set.
If the model predicts the correct outputs, our model is
accurate. (A short code sketch of these steps follows the list.)
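A short sketch of these steps, assuming scikit-learn; the iris dataset stands in for the labelled training data, and a support vector machine is chosen as the algorithm:

# Supervised learning steps sketch (assumes scikit-learn); data and algorithm are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Split into training, validation, and test sets.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Choose a suitable algorithm (here, a support vector machine) and train it.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)

# Use the validation set to tune, and the test set to evaluate accuracy.
print("Validation accuracy:", model.score(X_val, y_val))
print("Test accuracy:", model.score(X_test, y_test))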

TYPES OF SUPERVISED LEARNING ALGORITHMS

1. Regression
Regression algorithms are used if there is a relationship between
the input variable and the output variable. Below are some types-

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is
categorical, which means there are two classes such as Yes-No,
Male-Female, True-False, etc. A common example is spam
filtering. Some classification algorithms are:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict
the output on the basis of prior experiences.
o In supervised learning, we can have an exact idea about the
classes of objects.
o Supervised learning model helps us to solve various real-
world problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling
complex tasks.
o Supervised learning cannot predict the correct output if the
test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the
classes of objects.

SIMPLE REGRESSION

Basically, a simple regression analysis is a statistical tool that is
used in the quantification of the relationship between a single
independent variable and a single dependent variable based on
observations that have been carried out in the past.

The simple linear regression model can be expressed by the
simple regression formula:

y = β0 + β1X + ε

where β0 is the intercept, β1 is the slope coefficient, and ε is the
error term.
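As an illustration, a minimal least-squares fit of this model in Python (assuming NumPy; the data points are made up):

# Simple linear regression sketch (assumes NumPy); the data are illustrative.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Estimate beta1 (slope) and beta0 (intercept) by least squares.
beta1, beta0 = np.polyfit(x, y, deg=1)
print(f"y = {beta0:.2f} + {beta1:.2f} * x")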

ASSUMPTIONS OF SIMPLE REGRESSION

 Homogeneity of variance: this can also be referred to as
homoscedasticity. The core of this assumption states that
there is no significant change in the size of the error in our
prediction across the values of the independent variable.
 Independence of observations: here, statistically valid
sampling methods were used to collect the observations in
the dataset, and there exists no unknown relationships
among observations.
 Normality: this simply assumes that the data follows a normal
distribution.

LIMITS OF SIMPLE REGRESSION


Even the best data does not give perfection. Typically, simple
linear regression analysis is widely used in research to mark the
relationship that exists between variables. However, since
correlation does not imply causation, the relationship
between 2 variables does not mean that one causes the other to
occur. In fact, a line in a simple linear regression that describes
the data points well may not bring about a cause-and-effect
relationship.

MULTIPLE REGRESSION

Multiple regression, also known as multiple linear regression
(MLR), is a statistical technique that uses two or more explanatory
variables to predict the outcome of a response variable. In other
words, it can explain the relationship between multiple
independent variables against one dependent variable. These
independent variables serve as predictor variables, while the
single dependent variable serves as the criterion variable.
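A minimal multiple regression sketch, assuming scikit-learn and two hypothetical explanatory variables (advertising spend and price) predicting sales:

# Multiple linear regression sketch (assumes scikit-learn); the data are made up.
from sklearn.linear_model import LinearRegression

X = [[10, 5.0], [12, 4.8], [15, 4.5], [18, 4.4], [20, 4.0]]  # [ad_spend, price]
y = [100, 112, 130, 142, 158]                                # sales (response variable)

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)          # one coefficient per predictor
print("Prediction:", model.predict([[16, 4.6]]))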

ASSUMPTIONS OF MULTIPLE REGRESSION

 Assumption of linearity is important.
 Multiple regression models should be linear in nature.
 In multiple regression, the assumption of homoscedasticity is
required.
 Between the independent variables, there is a low degree of
correlation.
 The variance of the independent variable is constant at all
levels.
 In multiple regression, the assumption of normality is
required. It means that variables in multiple regression must
have a normal distribution.
 In multiple regression, the model should be specified in a
methodical manner. It suggests that the model should contain
just important variables and be accurate.

ADVANTAGES OF MULTIPLE REGRESSION

 Multiple regression analysis helps us to better study the
various predictor variables at hand.
 It increases reliability by avoiding dependence on just one
variable and having more than one independent variable to
support the event.
 Multiple regression analysis permits you to study more
elaborate hypotheses than would otherwise be possible.

LOGISTIC REGRESSION

Logistic regression estimates the probability of an event occurring,
such as voted or didn’t vote, based on a given dataset of
independent variables. Since the outcome is a probability, the
dependent variable is bounded between 0 and 1. In logistic
regression, a logit transformation is applied on the odds—that is,
the probability of success divided by the probability of failure.
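In symbols, the logit transformation mentioned above is log(p / (1 - p)), where p is the probability of success. A minimal sketch of fitting a binary logistic regression model (assuming scikit-learn, with made-up data):

# Binary logistic regression sketch (assumes scikit-learn); the data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # e.g. hours of campaign contact
y = np.array([0, 0, 0, 1, 1, 1])                          # e.g. voted (1) or did not vote (0)

model = LogisticRegression().fit(X, y)

# The model outputs a probability between 0 and 1 for each class.
print(model.predict_proba([[3.5]]))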

TYPES OF LOGISTIC REGRESSION

There are three types of logistic regression models, which are
defined based on categorical response.

 Binary logistic regression: In this approach, the response or
dependent variable is dichotomous in nature—i.e. it has only
two possible outcomes (e.g. 0 or 1). Some popular examples
of its use include predicting if an e-mail is spam or not spam
or if a tumor is malignant or not malignant. Within logistic
regression, this is the most commonly used approach, and
more generally, it is one of the most common classifiers for
binary classification.
 Multinomial logistic regression: In this type of logistic
regression model, the dependent variable has three or more
possible outcomes; however, these values have no specified
order. For example, movie studios want to predict what
genre of film a moviegoer is likely to see to market films more
effectively. A multinomial logistic regression model can help
the studio to determine the strength of influence a person's
age, gender, and dating status may have on the type of film
that they prefer. The studio can then orient an advertising
campaign of a specific movie toward a group of people likely
to go see it.
 Ordinal logistic regression: This type of logistic regression
model is leveraged when the response variable has three or
more possible outcomes, but in this case, these values do have
a defined order. Examples of ordinal responses include
grading scales from A to F or rating scales from 1 to 5.
