
Big Data Technology

13/14 Data Mining & Machine Learning

Cyrus Lentin
Data Mining

Data Mining Is The Computing Process Of Discovering Patterns In Large Data Sets Involving Methods
At The Intersection Of Machine Learning, Statistics, And Database Systems
It Is An Essential Process Where Intelligent Methods Are Applied To Extract Data Patterns
The Overall Goal Of The Data Mining Process Is To Extract Information From A Data Set And
Transform It Into An Understandable Structure For Further Use
Aside From The Raw Analysis Step, It Involves
Database And Data Management Aspects
Data Pre-processing
Model And Inference Considerations
Interestingness Metrics
Complexity Considerations
Post-processing Of Discovered Structures
Visualization, And Online Updating
Data Mining Is The Analysis Step Of The "Knowledge Discovery In Databases" Process, Or KDD

Big Data Technology - Cyrus Lentin 1


Knowledge Discovery In Database

Knowledge Discovery In Databases (KDD) Process Is Commonly Defined With The Stages:
Selection
Pre-processing
Transformation
Data Mining
Interpretation/Evaluation
Cross Industry Standard Process For Data Mining (CRISP-DM) Defines KDD In Six Phases:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
A Simplified Process Defines KDD As:
Pre-processing
Data Mining
Results Validation
Pre Processing

Before Data Mining Algorithms Can Be Used, A Target Data Set Must Be Assembled
As Data Mining Can Only Uncover Patterns Actually Present In The Data
The Target Data Set Must Be Large Enough To Contain These Patterns
The Target Data Should Also Remain Concise Enough To Be Mined Within An Acceptable Time Limit
A Common Source For Data Is A Data Mart Or Data Warehouse
Pre-processing Is Essential To Analyze The Multivariate Data Sets Before Data Mining
The Target Set Is Then Cleaned. Data Cleaning Removes The Observations Containing Noise And
Those With Missing Data
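The cleaning step described above can be sketched in a few lines of Python; the record layout and the `None`-for-missing convention here are illustrative assumptions, not part of any particular tool:

```python
# Toy target set: each record is (age, income); None marks a missing value.
records = [(25, 40000), (31, None), (47, 82000), (None, 55000), (38, 61000)]

def clean(rows):
    """Drop observations containing missing data (a minimal cleaning pass)."""
    return [r for r in rows if all(v is not None for v in r)]

cleaned = clean(records)
print(cleaned)  # only complete observations remain
```

A real pipeline would also filter noisy outliers and may impute rather than drop missing values, but the idea is the same: the target set is reduced to observations the mining algorithms can trust.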



Data Mining

Data mining involves six common classes of tasks:


Anomaly Detection (Outlier Detection / Change Detection / Deviation Detection)
The Identification Of Unusual Data Records, That Might Be Interesting Or Data Errors That Require
Further Investigation
Regression
Attempts To Find A Function Which Models The Data With The Least Error; That Is, It Estimates The
Relationships Among Data Or Datasets
Classification
Is The Task Of Generalizing Known Structure To Apply To New Data. For Example, An E-mail Program
Might Attempt To Classify An E-mail As "Legitimate" Or As "Spam"
Clustering
Is The Task Of Discovering Groups And Structures In The Data That Are In Some Way Or Another
"Similar", Without Using Known Structures In The Data
Association Rule Learning (Dependency Modelling)
Searches For Relationships Between Variables. This Is A Mathematical Modeling Technique Showing
That If You Buy A Certain Group Of Items, You Are Likely To Buy Another Group Of Items
Summarization
Providing A More Compact Representation Of The Data Set, Including Visualization And Report
Generation



Result Validation

Data mining can unintentionally be misused and can then produce results which appear to be
significant but which do not actually predict future behaviour, cannot be reproduced on a new
sample of data, and bear little use
Often this results from investigating too many hypotheses and not performing proper statistical
hypothesis testing
A simple version of this problem in machine learning is known as overfitting, but the same problem
can arise at different phases of the process and thus a train/test split - when applicable at all - may
not be sufficient to prevent this from happening
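The train/test split mentioned above is the simplest guard against this problem; a minimal sketch (the 80/20 ratio and seed are arbitrary illustrative choices):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction for testing,
    so results are validated on samples the model never saw."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

If patterns found on the training portion fail to hold on the held-out portion, the "discovery" is likely overfitting rather than real structure.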



Machine Learning
Machine learning is a subfield of computer science that evolved from the study of pattern recognition
and computational learning theory in artificial intelligence
Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn
without being explicitly programmed"
Machine learning explores the study and construction of algorithms that can learn from and make
predictions on data
Such algorithms operate by building a model from example inputs in order to make data-driven
predictions or decisions, rather than following strictly static program instructions
Machine learning is employed in a range of computing tasks where designing and programming
explicit algorithms is unfeasible
A Machine Learning Computer Program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E



Machine Learning
Types Of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning



Supervised Learning / Unsupervised Learning

Supervised Learning:
You are small and you see different types of animals; your father tells you that this particular
animal is a dog. After he gives you tips a few times, you see a new type of dog that you never
saw before - you identify it as a dog and not as a cat or a monkey or a potato.
There is a teacher and you learn through samples & examples, such that when a new sample comes
your way, you may still be able to identify it.
Supervised learning problems are categorized into "regression" and "classification" problems.
With supervised learning, there is always a feedback loop to correct you.

Unsupervised Learning:
You go to a new country and you do not know much about it - their food, culture, language etc.
However, from day 1, you start making sense of things, learning to eat new cuisines including
what not to eat, finding your way to the beach etc.
You have lots of information but you did not know what to do with it initially; there is no
teacher to guide you and you have to find a way out on your own; then, based on *some*
criteria, you start sorting that information into groups that make sense to you.
Unsupervised learning is generally done by clustering the data based on relationships
among the variables in the data.
With unsupervised learning there is no feedback based on the prediction results, ie there is no
teacher to correct you.


Supervised Machine Learning Workflow



Supervised Machine Learning Cross Validation
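Cross-validation splits the training data into folds so every observation is used once for validation and the rest of the time for training. A minimal fold generator (pure Python, purely illustrative) looks like:

```python
def k_fold_indices(n, k):
    """Split n sample indices into k roughly equal validation folds."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

# Each fold serves once as the validation set; the rest is training data.
for fold in k_fold_indices(10, 5):
    validation = set(fold)
    training = [i for i in range(10) if i not in validation]
    print(sorted(validation), len(training))
```

Averaging the model's score over all k rounds gives a more stable estimate of accuracy than a single train/test split.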



Supervised Machine Learning Accuracy
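Accuracy and the confusion matrix can both be computed directly from paired actual and predicted labels; the labels below are invented for illustration:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

actual    = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham",  "ham", "ham", "spam"]
print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))  # 3 correct out of 5 -> 0.6
```

The off-diagonal counts of the confusion matrix (the misclassified pairs) are what accuracy alone hides, which is why classification models are usually judged on the full matrix.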





Regression
Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable
Dependent Variable
The variable we wish to predict or explain
Independent Variable
The variable used to predict or explain the dependent variable
It is used to understand a phenomenon, make predictions, and/or test hypotheses
It is one of the most commonly used tools for business analysis
It is easy to use and applies to many situations



Linear Regression Simple
Single explanatory variable
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be directly related to changes in X

IMPORTANT
Independent Variables must be continuous numeric or categorical numeric values
Dependent Variables must also be continuous numeric values
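Simple linear regression has a closed-form solution: the slope is the covariance of X and Y divided by the variance of X. A minimal sketch with invented data (roughly y = 2x):

```python
# Closed-form least squares for y = a + b*x (simple linear regression).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

def fit_simple(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # b = covariance(x, y) / variance(x); a = mean_y - b * mean_x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

a, b = fit_simple(xs, ys)
print(round(a, 2), round(b, 2))  # intercept ~0.13, slope ~1.97
```

The fitted line can then predict Y for any new X via `a + b * x`, which is exactly the "changes in Y directly related to changes in X" assumption above.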



Linear Regression Multiple
Includes any number of explanatory variables
Two or more independent variables, Xs
Relationship between Xs and Y is described by a linear function
Changes in Y are assumed to be directly related to changes in Xs

IMPORTANT
Independent Variables must be continuous numeric or categorical numeric values
Dependent Variables must also be continuous numeric values



Linear Regression Assumptions
Correlation
There exists a correlation between the dependent and independent variables
No Internal Correlation
The independent variables do not have any linear relationships between each other
Linearity
There exists a linear relationship between the dependent and independent variables
Normality of Errors
The error term is normally distributed
The expected value of the error term, conditional on the independent variables is zero
Homoscedasticity
The error terms are homoscedastic, ie the variance of the error terms is constant for all the
observations
The expected value of the product of error terms is always zero, which implies that the error terms
are uncorrelated with each other



Linear Regression Applications
Human Resource - Salary Estimate
Predicting or estimating salary of a person based on set of attributes such as years of experience,
level of education, industry of work, previous job salary etc
Human Resources - Churn
Considering high level of employee churn, multiple regression based model to estimate months of
stickiness (or job with a new employer) at the time of recruitment based on candidate attributes
Human Resources - Resource Demand
Forecasting or Demand Estimation for each of the technology skills; bench levels in most of the
big IT services providers help win and deliver projects but also add to cost; an accurate
estimation of demand by skill is an important measure to manage requirements at the
right cost
Real Estate - House - Price Prediction
Predicting House Prices considering house, locality and builder characteristics
Real Estate - House Demand Forecast
Developing a forecasting model to find volume of houses on sales in a month given economic
factors, seasonality and other dimensions
Retailer - Sales Volume & Return On Investment
Finding out drivers of retail product sales as a function of spend across media channels, economic
factors and competitor actions



Linear Regression Applications
Banking/Financial Services - Customer Value Estimation
Considering customer level attributes, estimating the short-term value of the customers
Banking/Financial Services - Spend Value at a Customer
Spend on Credit Card is a strong indicator of customer engagement on the card and whether the card is
a front-of-wallet card; Predicting the Spend value of a card holder could help the product and marketing
teams in engaging the customers with an appropriate treatment strategy
Banking/Financial Services - Balance In Flow into Transaction or Saving Account
Predicting the amount of balance expected to be deposited into customers' transaction and savings accounts
using customer level characteristics
Banking/Financial Services - Drivers of New Account Volume
Building Marketing or Media Mix Model to find economic, advertisement spend (across media or
channels), competitor and offer related variables impacting new account open volume in a week
Insurance/Financial Services - Claim Amount Estimation
Insurance providers charge premium based on estimated claim amount for the target group of the
customers; Claim could be against Motor, Home or Pet Policy; also, the estimated claim amount could
be used for operational cash reserve calculations
Banking/Financial Services - Revenue Regression Model
Predicting revenue of customers and identifying parameters which are linked to increased revenue of
the customers; this helps business bankers in realigning the priority and focus
Logistic Regression
Logistic regression is a form of regression where the dependent variable (outcome variable) is binary
and the independent variables (influencing factors) can be either continuous or categorical
It is a prediction done with categorical variable as output

Logistic regression can be binomial (binary) or multinomial


In the binary logistic regression, the outcome can have only two possible types of values eg Yes or
No, True or False
Multinomial logistic refers to cases where the outcome can have three or more possible types of
values eg Good, Average, Bad

IMPORTANT
Independent Variables are all continuous numeric or categorical numeric values
Dependent Variables are always categorical numeric values
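A binary logistic regression can be sketched with one feature and plain gradient descent; the study-hours data, learning rate, and epoch count below are illustrative choices, not from the slides:

```python
import math

def sigmoid(z):
    """Squash a linear score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """One-feature binary logistic regression trained by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hours studied -> pass (1) or fail (0): a binary categorical outcome.
hours  = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)
# Predicted pass probability is low at 1 hour, high at 6 hours.
print(sigmoid(w * 1 + b) < 0.5, sigmoid(w * 6 + b) > 0.5)
```

Thresholding the predicted probability at 0.5 turns the continuous output into the categorical Yes/No prediction the slide describes.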



Logistic Regression Applications
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Personal Loan Applicants
Estimating probability and score of personal loan applicant who has higher chances of defaulting if
Personal Loan is given
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Credit Card Applicants
Estimating probability and score of Credit Card Applicants who have higher likelihood of Defaulting
on Credit Card Payments
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Home Loan Applicants
Estimating probability and score of Home Loan or Mortgage Applicants who have higher likelihood of
Defaulting on their Home Loan or Mortgage facilities
Banking/Financial Services - Application Fraud Model
Based on Application attributes, predicting whether an applicant is fraudulent; this helps in
minimizing the risk of issuing a product to fraudulent or suspect applicants
Banking/Financial Services - Transaction Fraud Model
Predicting whether an initiated transaction is fraudulent or genuine using transaction attributes
such as transaction time, merchant, and location of transaction
Insurance/Financial Services - Claims Fraud
Claim is one of the major expense items for an insurance underwriter and fraudulent claims can
have severe adverse consequences; Predictive Model can be built to find chances of a claim being
fraudulent and approve the genuine claims



Logistic Regression Applications
Banking/Financial Services - Customer and Marketing Response Model
Banks contact their customers on a regular basis and a typical response rate of the marketing
communications is around 1%. The aim of a Response Model is to find the customers who have higher
chances of responding. If customers are contacted for a Credit Card Balance Transfer
offer, who are more likely to respond positively?
Banking/Financial Services - Customer and Marketing Cross Sell Model
Each bank aims to acquire customers and sell other products post the acquisition. Cross Sell Model
helps in finding out customers who would be interested in a particular product given customer
transactions, interactions and behaviors.
Banking/Financial Services - Customer and Marketing Account Attrition Model
Building a predictive model which finds the likelihood of an account getting closed in a defined
performance period. E.g. Which of the customers have higher likelihood of closing Credit Card
account, or Paying off Personal Loan/Mortgage before maturity date.
Banking/Financial Services - Customer Churn or Customer Attrition Model
Finding out attributes of customers which help in identifying customers who have higher chances of
attrition. This helps in building relevant customer retention strategies.
Government/Taxation - Earning Manipulation Identification Model (Fraud Identification Model)
Based on balance sheet and income statement data, predict which firm has higher chances of
manipulating its earnings.
Time Series
A time series is a collection of observations made sequentially through time
The interval between observations can be any time interval (hours within days, days, weeks, months,
years, etc)
Time series can occur in a wide range of fields from economics to sociology, meteorology to
financial investment, etc

Some examples of time series are:


Monthly closings of the stock exchange index
Malaria incidence or deaths over calendar years
Daily maximum temperatures
Hourly records of babies born at a maternity hospital



Types Of Time Series
Continuous
Observations made continually in time give rise to a Continuous Time Series
Thermometer readings at a Met station (continuously measured)
Measurement of whether air pollution reached increasing levels of unacceptability at an industrial
site (air pollution levels are continuous)
Discrete
More often, observations are taken only at specific points in time, giving rise to a Discrete Time
Series
Annual number of road accidents (discrete)
Maximum daily temperature (continuous)
Whether or not there was daily rain (binary)



Objectives Of Time Series
Description (often with monitoring data)
Merely to describe the patterns over time
Explanation
Can the pattern observed over time be explained in terms of other factors or causes? Helps in
understanding the behavior of the series
Prediction (forecasting)
Can past records help us to predict what will happen in the future?
Improving the past system / behavior
If factors affecting the behavior of a variable over time can be identified, action may be taken to
improve the system, eg action over increasing levels of air pollution



Factors Influencing Time Series
Seasonality
Seasonal Variation or Seasonal Fluctuations
Cyclic Changes
Cyclical Variation or Cyclic Fluctuations
Trend
Secular Trend or Long Term Variation
Irregular Variation
Irregular Fluctuations
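The trend component can be separated from seasonal and irregular variation with a simple moving average; the monthly sales figures below are invented for illustration:

```python
def moving_average(series, window):
    """Smooth a series by averaging over a sliding window,
    damping seasonal and irregular fluctuations to expose the trend."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Monthly sales: a seasonal wobble on top of an upward secular trend.
sales = [10, 14, 9, 13, 12, 16, 11, 15, 14, 18, 13, 17]
trend = moving_average(sales, 4)
print(trend)  # steadily rising values once the wobble is averaged out
```

The smoothed series rises monotonically, which is the long-term (secular) trend hiding under the seasonal fluctuation.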



Classification
A classification model attempts to draw some conclusion from observed values
Given one or more inputs a classification model will try to predict the value of one or more
outcomes
Outcomes are labels that can be applied to a dataset
For example,
-- filtering emails as spam or not spam
-- validating transaction data as fraudulent or authorized

Types of Classification
Decision Trees
Random Forests
Naïve Bayes



Decision Trees
A decision tree is a mechanical way to make a decision by dividing the inputs into smaller decisions
Like other models, it involves mathematics, but it's not very complicated mathematics
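A full tree is more than a slide's worth of code, but the core idea - dividing the inputs by a threshold test - shows up already in a one-level tree (a "decision stump"); the transaction amounts and labels below are made up for illustration:

```python
def best_stump(values, labels):
    """Find the threshold on a single numeric feature that
    minimizes misclassifications (a one-level decision tree)."""
    best = None
    for threshold in sorted(set(values)):
        # Predict 1 when value > threshold, 0 otherwise.
        errors = sum(1 for v, y in zip(values, labels)
                     if (v > threshold) != bool(y))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best

# Feature: transaction amount; label: 1 = flagged, 0 = normal (toy data).
amounts = [10, 25, 40, 80, 95, 120]
flagged = [0, 0, 0, 1, 1, 1]
threshold, errors = best_stump(amounts, flagged)
print(threshold, errors)  # splits cleanly at 40 with 0 errors
```

A real decision tree applies this splitting step recursively to each resulting subset until the leaves are (nearly) pure.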



Random Forest
Random Forest grows many classification trees; the collection of trees is called an ensemble
To classify a new object, a random subset of the input data is given to each of the trees in the forest
Each tree gives a classification, and we say the tree "votes" for that class
The forest chooses the classification having the most votes (over all the trees in the forest)

Ensembles are a divide-and-conquer approach used to improve performance. The main principle
behind ensemble methods is that a group of weak learners can come together to form a strong
learner. Each classifier, individually, is a weak learner, while all the classifiers taken together are a
strong learner.
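The voting mechanism can be sketched directly; here three deliberately weak, hand-made classifiers (plain Python functions standing in for trees, with invented feature names) vote and the majority class wins:

```python
from collections import Counter

# Three weak classifiers - stand-ins for trees in the forest.
def tree_a(x): return "spam" if x["exclamations"] > 2 else "ham"
def tree_b(x): return "spam" if x["links"] > 1 else "ham"
def tree_c(x): return "spam" if x["caps_ratio"] > 0.5 else "ham"

def forest_predict(trees, x):
    """Each tree votes; the class with the most votes wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

email = {"exclamations": 5, "links": 0, "caps_ratio": 0.7}
print(forest_predict([tree_a, tree_b, tree_c], email))  # two of three vote "spam"
```

Even though `tree_b` votes the wrong way, the ensemble still classifies the email correctly - the group of weak learners forming a strong learner.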



Random Forest Features
It is unexcelled in accuracy among current algorithms
It runs efficiently on large data bases
It can handle thousands of input variables without variable deletion
It gives estimates of what variables are important in the classification
It generates an internal unbiased estimate of the generalization error as the forest building
progresses
It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing
It has methods for balancing error in data sets with unbalanced class populations
Prototypes are computed that give information about the relation between the variables and the
classification (K Nearest Neighbor typically used in Recommendation Engines)
It computes proximities between pairs of cases that can be used in clustering, locating outliers, or
(by scaling) give interesting views of the data
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering,
data views and outlier detection



Naïve Bayes
Naive Bayes is a simple technique for constructing classifier models that assign class labels to
problem instances, where the class labels are drawn from some finite set
It is not a single algorithm for training such classifiers, but a family of algorithms based on a common
principle: all naive Bayes classifiers assume that the value of a particular feature is independent of
the value of any other feature, given the class variable
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in
diameter
A naive Bayes classifier considers each of these features to contribute independently to the
probability that this fruit is an apple, regardless of any possible correlations between the color,
roundness, and diameter features
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have
worked quite well in many complex real-world situations
An advantage of naive Bayes is that it only requires a small amount of training data to estimate the
parameters necessary for classification
SPAM filtering is one of the most popular applications of Naïve Bayes
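A spam filter in this style can be sketched with word counts and add-one smoothing; the tiny corpus below is invented, and class priors are omitted because the two classes have equally many training documents here:

```python
import math
from collections import Counter

# Tiny labeled corpus (invented for illustration).
spam = ["win cash now", "cash prize win", "claim prize now"]
ham  = ["meeting at noon", "lunch at noon", "project meeting notes"]

def train(docs):
    """Count word occurrences per class."""
    words = Counter(w for d in docs for w in d.split())
    return words, sum(words.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(msg, counts, total):
    """Sum of per-word log likelihoods with add-one (Laplace) smoothing;
    words are treated as independent given the class - the 'naive' step."""
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in msg.split())

def classify(msg):
    return "spam" if log_prob(msg, spam_counts, spam_total) \
                   > log_prob(msg, ham_counts, ham_total) else "ham"

print(classify("win a cash prize"), classify("notes from the meeting"))
```

Unseen words ("a", "from", "the") contribute equally to both classes thanks to smoothing, so the known words decide the label - exactly the independent-feature reasoning described above.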



Classification - Applications
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Personal Loan Applicants
Estimating probability and score of personal loan applicant who has higher chances of defaulting if
Personal Loan is given
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Credit Card Applicants
Estimating probability and score of Credit Card Applicants who have higher likelihood of Defaulting
on Credit Card Payments
Banking/Financial Services - Credit Risk Predicting Credit Defaulters from Home Loan Applicants
Estimating probability and score of Home Loan or Mortgage Applicants who have higher likelihood of
Defaulting on their Home Loan or Mortgage facilities
Banking/Financial Services - Application Fraud Model
Based on Application attributes, predicting whether an applicant is fraudulent. This helps in
minimizing the risk of issuing a product to fraudulent or suspect applicants
Banking/Financial Services - Transaction Fraud Model
Predicting whether an initiated transaction is fraudulent or genuine using transaction attributes
such as transaction time, merchant, and location of transaction
Insurance/Financial Services - Claims Fraud
Claim is one of the major expense items for an insurance underwriter and fraudulent claims can
have severe adverse consequences. Predictive Model can be built to find chances of a claim being
fraudulent and approve the genuine claims.



Classification - Applications
Banking/Financial Services - Customer and Marketing Response Model
Banks contact their customers on a regular basis and a typical response rate of the marketing
communications is around 1%. The aim of a Response Model is to find the customers who have higher
chances of responding. If customers are contacted for a Credit Card Balance Transfer
offer, who are more likely to respond positively?
Banking/Financial Services - Customer and Marketing Cross Sell Model
Each bank aims to acquire customers and sell other products post the acquisition. Cross Sell Model
helps in finding out customers who would be interested in a particular product given customer
transactions, interactions and behaviors.
Banking/Financial Services - Customer and Marketing Account Attrition Model
Building a predictive model which finds the likelihood of an account getting closed in a defined
performance period. E.g. Which of the customers have higher likelihood of closing Credit Card
account, or Paying off Personal Loan/Mortgage before maturity date.
Banking/Financial Services - Customer Churn or Customer Attrition Model
Finding out attributes of customers which help in identifying customers who have higher chances of
attrition. This helps in building relevant customer retention strategies.
Government/Taxation - Earning Manipulation Identification Model (Fraud Identification Model)
Based on balance sheet and income statement data, predict which firm has higher chances of
manipulating its earnings.
What To Use When?

Algorithm            Independent Data Type                             Dependent Data Type                    Precision Criteria
Linear Regression    Continuous Numeric / Categorical Numeric          Continuous Numeric                     RMSE
Logistic Regression  Continuous Numeric / Categorical Numeric          Binary Numeric / Multinomial Numeric   Confusion Matrix
Decision Tree        Continuous / Categorical Numeric / Alphanumeric   Categorical Numeric / Alphanumeric     Confusion Matrix
Random Forest        Continuous / Categorical Numeric / Alphanumeric   Categorical Numeric / Alphanumeric     Confusion Matrix
Naïve Bayes          Continuous / Categorical Numeric / Alphanumeric   Categorical Numeric / Alphanumeric     Confusion Matrix

Classification algorithms were introduced because Logistic Regression cannot handle non-numeric data or a large
number of features
Decision Trees are rarely used in the real world except when there are few features
Naïve Bayes is used when we have many (but not a large number of) labels with alphanumeric data
Random Forest is used when we have a large number of labels with alphanumeric data



Clustering
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes,
called clusters
This helps users understand the natural grouping or structure in a data set
Clustering is unsupervised classification, ie there are no predefined classes
Used either as a stand-alone tool to get insight into data distribution or as a preprocessing step for
other algorithms
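The partitioning idea can be sketched with plain k-means on 2-D points; the point coordinates, k=2, and iteration count below are illustrative choices:

```python
import random

def k_means(points, k, iterations=20, seed=1):
    """Plain k-means on 2-D points: assign each point to the nearest
    centre, then move each centre to the mean of its cluster."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centres[i][0]) ** 2
                                      + (p[1] - centres[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:
                centres[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centres, clusters

# Two obvious groups of points, around (1, 1) and (9, 9).
pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
centres, clusters = k_means(pts, 2)
print(sorted(len(c) for c in clusters))  # the two natural groups: [3, 3]
```

No labels were supplied - the algorithm recovers the grouping purely from distances, which is the "no predefined classes" property of clustering.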



Clustering - Finding Patterns In Data
We can find patterns in data by isolating groups of data-points which are similar to each other in a
well defined sense


Clustering
A good clustering method will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The quality of a clustering result also depends on both the similarity measure used by the method
and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the
hidden patterns
However, objective evaluation is problematic: usually done by human / expert inspection
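The intra-/inter-cluster criterion above can be illustrated numerically; this hypothetical snippet (toy data, my own helper names) compares the average distance of points to their own centroid with the distance between centroids:

```python
def mean(xs):
    return sum(xs) / len(xs)

def intra_distance(cluster):
    # Average distance of points from their own cluster centroid
    # (lower means a tighter, higher-similarity cluster)
    c = mean(cluster)
    return mean([abs(p - c) for p in cluster])

def inter_distance(cluster_a, cluster_b):
    # Distance between the two cluster centroids
    # (higher means better-separated clusters)
    return abs(mean(cluster_a) - mean(cluster_b))

a = [0.9, 1.0, 1.1]
b = [9.8, 10.0, 10.2]
print(intra_distance(a), inter_distance(a, b))  # small intra, large inter
```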



Clustering - Applications
Cross Industry - Marketing
Help marketers discover distinct groups in their customer bases, and then use this knowledge to
develop targeted marketing programs
Government - Land / Revenue
Identification of areas of similar land use in an earth observation database
Government - City Planning / Urban Development
Identifying groups of houses according to their house type, value, and geographical location
Insurance - Claims Monitoring
Identifying groups of motor / medical insurance policy holders with a high average claim cost



Association Analysis - Introduction
Association Analysis is a mathematical modeling technique based upon the theory that if you buy a
certain group of items, you are likely to buy another group of items
It is used to analyze customer purchasing behavior and helps in increasing sales and maintaining
inventory by focusing on point-of-sale transaction data
Given a transaction dataset, the Apriori Algorithm identifies frequent product baskets and product
association rules



Association Analysis - Apriori Algorithm

Apriori Algorithm Is A Classical Algorithm In Data Mining

It Is Used For Mining Frequent Item-sets And Relevant Association Rules
It Is Devised To Operate On A Database Containing A Large Number Of Transactions
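A minimal illustrative sketch of the Apriori idea (my own toy implementation with invented baskets, not the full algorithm with all its optimizations): frequent item-sets are grown level by level, keeping only sets that meet a minimum support, since every subset of a frequent item-set must itself be frequent:

```python
from itertools import combinations  # stdlib; shown for completeness

def apriori(transactions, min_support):
    # Level-wise frequent item-set mining: candidates of size k+1 are
    # built only from frequent item-sets of size k (the Apriori property).
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    while k_sets:
        counts = {s: sum(1 for t in transactions if s <= t) for s in k_sets}
        level = {s: c / len(transactions) for s, c in counts.items()
                 if c / len(transactions) >= min_support}
        frequent.update(level)
        # Join frequent k-sets pairwise to build (k+1)-item candidates
        keys = list(level)
        k_sets = list({a | b for a in keys for b in keys
                       if len(a | b) == len(a) + 1})
    return frequent  # maps each frequent item-set to its support

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(baskets, min_support=0.5)
print(freq)
```

The key design point is that candidates of size k+1 are generated only from frequent k-item-sets, which is what keeps the search over the exponential space of item-sets tractable.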



Association Analysis - Metrics

Support
The Support Of An Itemset X, supp(X), Is The Proportion Of Transactions In The Database In Which The
Itemset X Appears. It Signifies The Popularity Of An Itemset
Confidence
conf(X → Y) = supp(X ∪ Y) / supp(X). It Signifies The Likelihood Of Item Y Being Purchased When Item X
Is Purchased
Lift
lift(X → Y) = conf(X → Y) / supp(Y). It Signifies The Likelihood Of Item Y Being Purchased When Item X
Is Purchased, While Taking Into Account The Popularity Of Y
Conviction
conv(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y)). It Is Interpreted As The Ratio Of The Expected Frequency
That X Occurs Without Y (If X And Y Were Independent) To The Observed Frequency Of The Rule Making An
Incorrect Prediction
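The four metrics can be computed directly from transaction data; the following sketch uses invented toy baskets and my own helper names:

```python
def support(itemset, transactions):
    # Proportion of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # Likelihood of buying Y given X was bought: supp(X u Y) / supp(X)
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    # Confidence corrected for how popular Y already is on its own
    return confidence(x, y, transactions) / support(y, transactions)

def conviction(x, y, transactions):
    # Expected frequency of X without Y (under independence) over the
    # observed frequency of the rule's incorrect predictions
    c = confidence(x, y, transactions)
    return float("inf") if c == 1 else (1 - support(y, transactions)) / (1 - c)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
x, y = {"bread"}, {"milk"}
print(support(x, baskets), confidence(x, y, baskets), lift(x, y, baskets))
```

Here lift comes out below 1, meaning bread buyers are actually slightly less likely than average to buy milk in this toy data; lift above 1 would indicate a positive association.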



Association Analysis - Pros & Cons

Pros Of The Apriori Algorithm

It Is An Easy-To-Implement And Easy-To-Understand Algorithm
It Can Be Used On Large Itemsets
Cons Of The Apriori Algorithm
It May Sometimes Need To Generate A Large Number Of Candidate Itemsets, Which Can Be Computationally
Expensive
Calculating Support Is Also Expensive Because It Has To Scan The Entire Database



Association Analysis - Applications

Market Basket Analysis

It Is Very Important For Effective Market Basket Analysis; It Helps Customers Purchase
Their Items With More Ease, Which Increases The Sales Of The Markets
Adverse Drug Reactions
It Has Also Been Used In Healthcare For The Detection Of Adverse Drug Reactions. It
Produces Association Rules That Indicate Which Combinations Of Medications And Patient
Characteristics Lead To Adverse Drug Reactions



Thank you!
Contact:
Cyrus Lentin
cyrus@lentins.co.in
+91-98200-94236

