Unit - 4
MODELING AND ANALYSIS
PREDICTIVE MODELING
Predictive modeling is the process of analysing current outcomes
and known information to predict future outcomes. In predictive
analytics, predictive modeling algorithms are used to generate
possible future outcomes.
With data science at its peak, predictive modeling has emerged as
a helpful data mining technique that enables organizations and
corporations to extract predictive outcomes based on whatever
data is currently known.
In the process of predictive modeling, data is recorded, a
statistical model or an algorithm is applied, and future outcomes
are predicted.
Even though this concept has been in practice for more than half a
century, it has only recently gained the significance it deserved
from the start. While the early years went into investigating the
efficacy of this data science technique, recent times have seen
industries and organizations implement it successfully.
TECHNIQUES OF PREDICTIVE MODELING
While predictive modeling is defined as a predictive analytics tool
for extracting future outcomes with the help of past data, it can
also be considered a mathematical procedure used to calculate
future possibilities.
Predictive modeling, often used interchangeably with predictive
analytics, comes in several types of predictive models.
1. Classification Model
Among all the predictive modeling techniques in machine
learning, the classification model is one of the most widely used.
In classification predictive modeling, an input is classified into a
specific category: the model predicts a class label for each input.
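As a sketch of the idea (not from the text), the toy classifier below assigns each input the label of its nearest labelled example; the transaction amounts and labels are invented for illustration.

```python
# Minimal sketch of classification: a 1-nearest-neighbour classifier
# that assigns each input the label of its closest training example.
# The feature values and labels below are invented for illustration.

def predict(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# (transaction amount, label) pairs -- hypothetical training data
train = [(5, "low"), (12, "low"), (90, "high"), (120, "high")]

print(predict(train, 8))    # closest to 5 -> "low"
print(predict(train, 100))  # closest to 90 -> "high"
```

Real classification models (decision trees, logistic regression, etc.) learn richer decision boundaries, but the input-to-label mapping is the same idea.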
2. Forecast Model
One of the most popular and accurate predictive models, the
forecast model is used to forecast/predict metric values based on
past data. With the help of historical data, the forecast model
estimates numerical values for new data points based on patterns
learned from that data.
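A minimal illustration of the idea, with invented sales figures: forecast the next value of a series as the average of its last few observations.

```python
# Minimal sketch of a forecast model: predict the next value in a
# series as the mean of the last k observations (moving average).
# The monthly sales figures are invented for illustration.

def forecast(history, k=3):
    """Forecast the next value as the mean of the last k values."""
    window = history[-k:]
    return sum(window) / len(window)

monthly_sales = [100, 110, 120, 130, 140, 150]
print(forecast(monthly_sales))  # (130 + 140 + 150) / 3 = 140.0
```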
3. Outliers Model
The outliers model revolves around detecting 'outliers', or
anomalies, in a dataset. Simply put, the outliers model is one of
the types of predictive models that helps to detect anomalies in a
data set and even predict related information about it. Especially
in the field of finance, the outliers model helps predictive
modeling to detect whether a transaction is fraudulent or safe.
For example, when identifying fraudulent transactions, the model
can assess not only the amount but also the location, time,
purchase history, and nature of a purchase.
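One simple way to sketch this (the amounts below are invented; real fraud models use many more features) is a z-score rule that flags values far from the mean:

```python
# Minimal sketch of an outliers model: flag transactions whose
# amount lies more than 2 standard deviations from the mean.
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

amounts = [20, 25, 22, 24, 21, 23, 500]  # 500 looks anomalous
print(outliers(amounts))  # [500]
```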
APPLICATIONS OF PREDICTIVE MODELING
1. Sales
Sales are one of the most essential aspects of a business, keeping
it running. Based on the kind of sales a company has achieved in
the past, predictive analysis techniques and tools can project the
company's future in terms of sales and profits.
Furthermore, they can also detect areas where the sales
department is lagging, which leads to enhanced performance of
the company in the identified areas or demographic circles.
2. Marketing
Another application of predictive analytics is marketing. As
marketing is the act of promoting a particular service or
commodity to a group of target customers, it involves predicting
the reaction of customers and forecasting customer requirements
based on data collected from customer feedback.
3. Social Media
Social media is a hub of unstructured, heterogeneous, and vast
data.
As a platform where millions of people interact on a day-to-day
basis, social media requires predictive modeling for forecasting
customer feedback and determining the kind of response a
product or a post on the platform will receive.
Given this importance, social media is one of the most widely
used applications of predictive modeling, helping various
platforms detect customer activity and compute future outcomes
accordingly.
4. Risk Assessment
A major application of predictive modeling is risk assessment.
Risk assessment is often practiced in financial institutions and
fraud detection cases, where one might want to assess the kind of
risk that a customer or transaction is subject to.
Commercially, predictive modeling is often referred to as
predictive analytics.
Nomograms are a useful graphical representation of a
predictive model. As with spreadsheet software, their use
depends on the methodology chosen. The advantage of
nomograms is the immediacy of computing predictions
without the aid of a computer.
Point estimate tables are one of the simplest forms of
representing a predictive tool. Here, combinations of the
characteristics of interest can be represented via a table or a
graph, and the associated prediction read off the y-axis or the
table itself.
Tree-based methods (e.g. CART, survival trees) provide one
of the most graphically intuitive ways to present predictions.
However, their usage is limited to methods that use this type
of modeling approach, which can have several drawbacks.
Trees can also be employed to represent decision rules
graphically.
Score charts are graphical tabular tools to represent either
predictions or decision rules.
A statistical model embodies a set of assumptions concerning
the generation of the observed data, and similar data from a
larger population. A model represents, often in considerably
idealized form, the data-generating process. The model
assumptions describe a set of probability distributions, some
of which are assumed to adequately approximate the
distribution from which a particular data set is sampled.
A logic-driven model is based on experience, knowledge, and
logical relationships of variables and constants connected to
the desired performance outcome. To help conceptualize the
relationships inherent in a system, diagramming methods
are useful.
A cause-and-effect diagram enables a user to hypothesize
relationships between potential causes of an outcome.
Influence diagrams are another tool for conceptualizing
business performance relationships.
You will want to build your model from a consistent and
comprehensive set of information that is ready to be analysed.
4. Build the predictive model. Establish the hypothesis and then
build the test model. Your goal is to include, and rule out, different
variables and factors and then test the model using historical data
to see if the results produced by the model prove the hypothesis.
5. Incorporate analytics into business processes. To make the
model valuable, you need to integrate it into the business process
so it can be used to help achieve the outcome.
6. Monitor the model and measure the business results. We live
and market in a dynamic environment, where buying, competitive
and other factors change. You will need to monitor the model and
measure how effective it is at continuing to produce the desired
outcome. It may be necessary to make adjustments and fine-tune
the model as conditions evolve.
DATA-DRIVEN MODELING
Data modeling is the process of creating a visual representation of
either a whole information system or parts of it to communicate
connections between data points and structures. The goal is to
illustrate the types of data used and stored within the system, the
relationships among these data types, the ways the data can be
grouped and organized and its formats and attributes.
Data models are built around business needs. Rules and
requirements are defined upfront through feedback from
business stakeholders so they can be incorporated into the design
of a new system or adapted in the iteration of an existing one.
Ideally, data models are living documents that evolve along with
changing business needs. They play an important role in
supporting business processes and planning IT architecture and
strategy. Data models can be shared with vendors, partners,
and/or industry peers.
lengths, and show the relationships among entities. Logical
data models don’t specify any technical system
requirements. This stage is frequently omitted in agile
or DevOps practices. Logical data models can be useful in
highly procedural implementation environments, or for
projects that are data-oriented by nature, such as data
warehouse design or reporting system development.
3. Identify relationships among entities. The earliest draft of a
data model will specify the nature of the relationships each
entity has with the others. In the above example, each
customer “lives at” an address. If that model were expanded
to include an entity called “orders,” each order would be
shipped to and billed to an address as well. These
relationships are usually documented via unified modeling
language (UML).
4. Map attributes to entities completely. This will ensure the
model reflects how the business will use the data. Several
formal data modeling patterns are in widespread use.
Object-oriented developers often apply analysis patterns or
design patterns, while stakeholders from other business
domains may turn to other patterns.
5. Assign keys as needed, and decide on a degree of
normalization that balances the need to reduce redundancy
with performance requirements. Normalization is a
technique for organizing data models (and the databases
they represent) in which numerical identifiers, called keys,
are assigned to groups of data to represent relationships
between them without repeating the data. For instance, if
customers are each assigned a key, that key can be linked to
both their address and their order history without having to
repeat this information in the table of customer names.
Normalization tends to reduce the amount of storage space a
database will require, but it can come at a cost to query
performance.
6. Finalize and validate the data model. Data modeling is an
iterative process that should be repeated and refined as
business needs change.
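The key idea in step 5 above can be sketched with plain Python dictionaries; the customer, address, and order records below are invented for illustration.

```python
# Minimal sketch of normalization: customers get a key, and
# addresses and orders reference that key instead of repeating the
# customer's details. All records here are invented for illustration.

customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
addresses = {10: {"customer_id": 1, "city": "London"}}
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 1},
]

def orders_for(name):
    """Look up a customer's orders through the shared key."""
    key = next(k for k, c in customers.items() if c["name"] == name)
    return [o["order_id"] for o in orders if o["customer_id"] == key]

print(orders_for("Ada"))  # [100, 101]
```

Because the customer's name is stored once and referenced by key, changing it requires updating a single record rather than every address and order row.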
Data modeling has evolved alongside database management
systems, with model types increasing in complexity as businesses'
data storage needs have grown. Here are several model types:
Hierarchical data models represent one-to-many
relationships in a treelike format. In this type of model, each
record has a single root or parent which maps to one or
more child tables. This model was implemented in the IBM
Information Management System (IMS), which was
introduced in 1966 and rapidly found widespread use,
especially in banking. Though this approach is less efficient
than more recently developed database models, it’s still used
in Extensible Markup Language (XML) systems and
geographic information systems (GISs).
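A small sketch of the treelike, one-to-many structure using XML, one of the formats the text notes still uses this model; the element names are invented, and the parsing uses Python's standard library:

```python
# Minimal sketch of a hierarchical data model: each record has a
# single parent, shown here as an XML tree parsed with the standard
# library. The bank/branch/account names are invented.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bank>
  <branch name="Central">
    <account id="A1"/>
    <account id="A2"/>
  </branch>
</bank>
""")

# Each <account> has exactly one parent <branch> (one-to-many).
for branch in doc.findall("branch"):
    ids = [a.get("id") for a in branch.findall("account")]
    print(branch.get("name"), ids)
```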
Relational data models were initially proposed by IBM
researcher E.F. Codd in 1970. They are still implemented
today in the many different relational databases commonly
used in enterprise computing. Relational data modeling
doesn’t require a detailed understanding of the physical
properties of the data storage being used. In it, data
segments are explicitly joined through the use of tables,
reducing database complexity.
Relational databases frequently employ structured query
language (SQL) for data management. These databases work well
for maintaining data integrity and minimizing redundancy.
They’re often used in point-of-sale systems, as well as for other
types of transaction processing.
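A minimal sketch of explicitly joined tables, using Python's built-in sqlite3 module (the table and column names are invented):

```python
# Minimal sketch of the relational model: data segments held in
# separate tables and joined explicitly with SQL. Table and column
# names are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE sales (id INTEGER, customer_id INTEGER, total REAL)")
con.execute("INSERT INTO customers VALUES (1, 'Ada')")
con.execute("INSERT INTO sales VALUES (100, 1, 9.99)")

# The JOIN links the two tables through the shared customer key.
row = con.execute(
    """SELECT c.name, s.total
       FROM sales s JOIN customers c ON s.customer_id = c.id"""
).fetchone()
print(row)  # ('Ada', 9.99)
```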
Entity-relationship (ER) data models use formal diagrams to
represent the relationships between entities in a database.
Several ER modeling tools are used by data architects to
create visual maps that convey database design objectives.
Object-oriented data models gained traction as object-
oriented programming became popular in the mid-1990s.
The “objects” involved are abstractions of real-world
entities. Objects are grouped in class hierarchies, and have
associated features. Object-oriented databases can
incorporate tables, but can also support more complex data
relationships. This approach is employed in multimedia and
hypertext databases as well as other use cases.
Dimensional data models were developed by Ralph Kimball,
and they were designed to optimize data retrieval speeds for
analytic purposes in a data warehouse. While relational and
ER models emphasize efficient storage, dimensional models
increase redundancy in order to make it easier to locate
information for reporting and retrieval. This modeling is
typically used across OLAP systems.
SUPERVISED LEARNING
Supervised learning is the type of machine learning in which
machines are trained using well "labelled" training data and, on
the basis of that data, predict the output. Labelled data means
that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines
works as the supervisor that teaches the machines to predict the
output correctly. It applies the same concept as a student learning
under the supervision of a teacher.
Supervised learning is a process of providing input data as well as
correct output data to the machine learning model. The aim of a
supervised learning algorithm is to find a mapping function to
map the input variable (x) to the output variable (y).
In the real-world, supervised learning can be used for Risk
Assessment, Image classification, Fraud Detection, spam filtering,
etc.
o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, test dataset, and
validation dataset.
o Determine the input features of the training dataset, which
should carry enough information for the model to accurately
predict the output.
o Determine the suitable algorithm for the model, such as
support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we
need validation sets as the control parameters, which are a
subset of the training dataset.
o Evaluate the accuracy of the model by providing the test set.
If the model predicts the correct outputs, the model is
accurate.
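The steps above can be sketched end-to-end on a toy, one-feature dataset (all values invented): split the labelled data, fit a simple threshold classifier on the training split, and evaluate on the test split.

```python
# Minimal sketch of the supervised learning workflow: split labelled
# data, "train" a one-feature threshold classifier, and measure its
# accuracy on the held-out test split. The data is invented.

data = [(1, "no"), (2, "no"), (3, "no"), (8, "yes"),
        (9, "yes"), (10, "yes"), (2, "no"), (9, "yes")]
train, test = data[:6], data[6:]

# Training: place the threshold halfway between the class means.
yes = [x for x, y in train if y == "yes"]
no = [x for x, y in train if y == "no"]
threshold = (sum(yes) / len(yes) + sum(no) / len(no)) / 2

def predict(x):
    return "yes" if x > threshold else "no"

# Evaluation: fraction of test examples predicted correctly.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # 1.0 on this toy test set
```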
1. Regression
Regression algorithms are used if there is a relationship between
the input variable and the output variable. Below are some types:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is
categorical, such as in spam filtering. Some common algorithms
are:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling
complex tasks.
o Supervised learning cannot predict the correct output if the
test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the
classes of objects.
SIMPLE REGRESSION
Simple regression models the relationship between a single input
variable X and an output variable y as
y = β0 + β1X + ε,
where β0 is the intercept, β1 is the slope coefficient, and ε is the
random error term.
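As a sketch, the coefficients β0 and β1 can be estimated by ordinary least squares; the data points below are invented and lie exactly on y = 1 + 2x, so the fit recovers those values.

```python
# Minimal sketch of simple regression: estimate beta0 (intercept)
# and beta1 (slope) by ordinary least squares. The (x, y) pairs are
# invented and lie exactly on y = 1 + 2x.

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
beta0 = my - beta1 * mx
print(beta0, beta1)  # 1.0 2.0
```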
ASSUMPTIONS OF SIMPLE REGRESSION
MULTIPLE REGRESSION
18 | P a g e
ASSUMPTIONS OF MULTIPLE REGRESSION
LOGISTIC REGRESSION
Multinomial logistic regression: this type is used when the
response variable has three or more possible outcomes with
no defined order. For example, a movie studio might target
the marketing campaign of a specific movie toward a group
of people likely to go see it.
Ordinal logistic regression: this type of logistic regression
model is leveraged when the response variable has three or
more possible outcomes, but in this case, these values do
have a defined order. Examples of ordinal responses include
grading scales from A to F or rating scales from 1 to 5.
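As a sketch of the binary case, the model below fits a one-feature logistic regression by gradient descent on invented data (small x tends toward class 0, large x toward class 1):

```python
# Minimal sketch of binary logistic regression fitted by gradient
# descent on one feature. The data is invented: small x -> class 0,
# large x -> class 1.
import math

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # predicted probability
        b0 += lr * (y - p)        # gradient step on the intercept
        b1 += lr * (y - p) * x    # gradient step on the slope

def predict(x):
    """Predicted probability that x belongs to class 1."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(predict(1.0), predict(9.0))  # near 0 and near 1, respectively
```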