Business Analytics
Index
Date Lecture Topics
23-11-2020 L1 Introduction
24-11-2020 L2 Clustering - K means clustering
25-11-2020 L3 K means clustering using Tanagra
27-11-2020 L4 Exercises in clustering
28-11-2020 L5 Association rule mining
30-11-2020 L6 Exercise - Association rule mining
01-12-2020 L7 Classification using logistic regression, EVP, Evaluating ML model
02-12-2020 L8
05-12-2020 L9
18-12-2020 L10 Predictive analytics
19-12-2020 L11 Logistic regression - How to select predictors
Video Lecture_1
22-12-2020 L12 Neural Networks
23-12-2020 L13
24-12-2020 L14
• The real-world phenomenon is COVID spread; the model takes input and attempts to describe that real-world phenomenon
□ Here, to find the relationship between two variables and explore whether they are related, use CORRELATION
▪ What is the contribution of pages viewed to the amount spent?
○ Predict rather than explain. Focus on accuracy of predictions
• Environment is unstable/uncertain
• Multiple criteria to evaluate, optimize operational cost
• Not diagnostic: not trying to explain anything
• Looking at past data to see what is selling well
• After this we can move on to diagnostics and try to find out why this happened (location, trends)
• Classification predictive db
• They used only backend data. If they had factored in external variables, it would become prescriptive
• Form highly cohesive clusters (points within a cluster are close; clusters are well separated)
• How?
○ Calculate the distance
• If we include another variable such as age, the distance changes because the variables are on different scales
• So it is ALWAYS better to normalize your data before calculating distances
• Normalize using:
○ Z-transform: subtract the mean and divide by the standard deviation (see the sketch below)
○ Negative values are below the mean; positive values are above the mean
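A minimal sketch (not from the lecture; Python with numpy, made-up income/age values) showing why normalization matters before computing distances:

```python
import numpy as np

# Hypothetical data: income (large scale) and age (small scale)
X = np.array([[52000, 25],
              [48000, 47],
              [61000, 33],
              [58000, 52]], dtype=float)

# Z-transform: subtract the column mean, divide by the column standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two customers, before and after normalization
dist_raw  = np.linalg.norm(X[0] - X[1])
dist_norm = np.linalg.norm(X_norm[0] - X_norm[1])
print(dist_raw, dist_norm)   # the raw distance is dominated by income; the normalized one is not
```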
• K means algorithm
• Iterative algorithm
• Picks k random points as initial centroids (here k=2)
• Calculates the distance of each point from the centroids
○ Now it sees that point a is closer to the circle centroid than to the triangle, so it moves it to the other cluster
• Now recalculates the centroid
• So iteratively it does (see the sketch after this list):
○ Calculate distances
○ Calculate centroids
○ Shift points to the centroid with the least distance and recalculate
• Stops when it finds that every point is closest to its assigned centroid
• Certain software lets you choose which points to start K-means with
• To handle outliers, you can use K-medoids instead of K-means, or just remove the outliers before doing k-means
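A rough sketch of the iterative loop described above (assign points, recompute centroids, repeat). The toy data and k value are made up, and empty-cluster handling is omitted:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# toy, already-normalized data
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [0.95, 0.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```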
Exercise
• Apply VLOOKUP to get a point (when you use Solver it needs a formula so that it can change values dynamically)
• Assign each point to the cluster whose centroid it has the minimum distance from
• Repeat until the total distance is minimized (sum of minimums, here in column I)
• Cluster 1
○ Elbow plot
○ Calculate the silhouette width for different k values and use the k value that gives the highest average silhouette width (see the sketch below)
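A hedged sketch of choosing k, assuming scikit-learn is available (the course itself uses Excel/Tanagra for this); the data matrix here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_norm: a normalized data matrix; synthetic here just to make the sketch runnable
rng = np.random.default_rng(0)
X_norm = np.vstack([rng.normal(0, 0.3, (30, 2)),
                    rng.normal(2, 0.3, (30, 2)),
                    rng.normal((0, 3), 0.3, (30, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_norm)
    wss = km.inertia_                              # within-cluster sum of squares (for the elbow plot)
    sil = silhouette_score(X_norm, km.labels_)     # average silhouette width
    print(k, round(wss, 1), round(sil, 3))
# Pick the k at the "elbow" of the WSS curve, or the k with the highest average silhouette width.
```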
Follow-up
• Read [3. Chateau Wiery (A)] up to page 4
• 6 variables available
• Need to classify the types of shoppers by clustering
• Difference between factor analysis and cluster analysis?
○ Factor analysis is a dimension reduction technique. Combining
variables based on shared variance. Grouping variables.
▪ You can check whether the shared variance is high by looking at the eigenvalue; then don't use multiple variables, instead group them to get only a single variable
• Usually you will normalize the data (here all values are on the same
scale)
• Normalize the data
• All marked yes are inputs that have been selected for k means
• Check the elbow plot to get k value
• Elbow plot is not available in this tool
• Here, calculate the WSS (within-cluster sum of squares) manually (run Tanagra with k=1, k=2, k=3, etc. and copy the values)
• silhouette plot
○ Size gives number of points in the cluster and wss represents the
compactness of the cluster (distance from the centroid)
○ Cluster 3 is most compact
• Ex: here you can see cluster 3 has a negative value for "shopping is fun"
• Ex: Cluster 2 is minimally interested in shopping (variable 5); instead they focus on price
• Paste in excel
Exercise
• Survey results
○ But take it with a pinch of salt: the sample size is small and the sample has more females
• Q2 Product categories
○ Normalize data
Recommendation
○ Asian customers are generally not too trusting of online shopping
• Scree plot
○ Filter cluster 2
○ Restricted in terms of time and geography (here, only US data)
○ These are concerns to be wary of
• For k=3 better discrimination
• Exercise 2
• B2B situation, trying to find out who are the best vendors or suppliers
• 4 cluster solution
• Cluster 2
• Ex:
○ Amazon: frequently bought together
▪ This was debunked
▪ This is a real example
□ 4 items, in pairs
□ Bread, cheese, Poptart, Beer
□ Find how many itemsets are possible?
□ Metrics to assess the goodness of a rule
Support
Worksheet
• This is called confidence, the second metric to check the goodness of a rule
H/W: Find the confidence when the order of the rule is reversed (A purchased after B)
• Confidence is the only metric in which the order of the rule matters (it is not symmetric)
• Confidence gives an indication of the direction in which the rule occurs
Lift
• Takes care of the popularity of the consequent
• A lift < 1 is also interesting, as it shows a negative association between the items (see the sketch below)
• Calculate lift
• And juice is purchased after cheese is purchased
• Order is preserved in the lift
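A small sketch of the three metrics on a hypothetical transaction list (items and counts are invented); note how confidence changes when the rule is reversed, which is the homework above:

```python
# Support, confidence and lift for a rule A -> B, computed from a small
# hypothetical transaction list.
transactions = [
    {"bread", "cheese"},
    {"bread", "cheese", "juice"},
    {"cheese", "juice"},
    {"bread", "beer"},
    {"cheese", "juice"},
]

def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n              # how often A and B occur together
    confidence = both / ante        # P(B | A); depends on the direction of the rule
    lift = confidence / (cons / n)  # corrects for how popular B is on its own
    return support, confidence, lift

print(rule_metrics(transactions, {"cheese"}, {"juice"}))   # cheese -> juice
print(rule_metrics(transactions, {"juice"}, {"cheese"}))   # reversed rule, as in the homework
```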
• Ex: Coke and Pepsi (if planning a promotion don’t place them
• If first column used countif formula (no need to check the sequence)
○ Export to tanagra
○ Increasing itemset to 4
Exercise
○ If we increase itemset to 3
• Better to have a limited number of rules in a retail situation. Also, there is little data, so it is better to use a higher support (only 100- rows)
• MSNBC website
• Page visit, visitor data for one day only
• Then adjust
○ Diagnostic model:
○ Predictive model: focus is on prediction accuracy
▪ We can no longer rely on the significance-based approach
▪ When you have a large db (you effectively have the entire population for your study), everything comes out as statistically significant, so significance (the p-value) cannot be used for generalization
▪ Need to use cross-validation (out-of-sample prediction): predict on out-of-sample data as well
□ Remove the target variable and ask the model to predict it. Compare with the original value (which we know) and assess the accuracy of the model
□ Ex: if your training data has only the younger age group, the model can work very well on the training data but will perform poorly on test data that contains older ages
□ Ensure heterogeneity and representation are present in the training data
□ Fix this issue with proper sampling when splitting the data
□ How to split between training and test data (see the sketch below)
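A sketch of proper sampling when splitting, assuming scikit-learn; the dataframe and column names are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny hypothetical dataset; column names are made up for illustration.
df = pd.DataFrame({
    "age":       [22, 25, 31, 45, 52, 60, 38, 29, 41, 57],
    "income":    [30, 35, 48, 60, 72, 80, 55, 40, 58, 75],
    "purchased": [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})

# Stratifying on the target keeps the class proportions similar in both splits,
# so the test data stays representative (avoids e.g. only one group ending up in training).
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="purchased"), df["purchased"],
    test_size=0.2, stratify=df["purchased"], random_state=42,
)
print(y_train.mean(), y_test.mean())   # similar proportion of the class of interest
```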
• Classification matrix
• Class of interest (people who respond, so "yes")
• But there are cases where "No" could be the class of interest, like credit card fraud
○ Check TP and TN
• Suppose a model simply predicts "No" for everyone; the accuracy will still look very high
○ But it will never detect a TP
○ This is the accuracy paradox: you cannot rely on the accuracy of a model alone to assess the 'goodness' of a model
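A sketch of the accuracy paradox with made-up counts: a model that always predicts "No" on imbalanced data gets high accuracy but zero sensitivity:

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    # Class of interest coded as 1 ("yes")
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall on the class of interest
    specificity = tn / (tn + fp)      # recall on the "no" class
    return accuracy, sensitivity, specificity

# 1000 cases, only 20 are "yes"; a model that always says "no"
y_true = np.array([1] * 20 + [0] * 980)
y_pred = np.zeros(1000, dtype=int)
print(confusion_metrics(y_true, y_pred))   # accuracy 0.98, but sensitivity 0.0 (never detects a TP)
```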
Exercise
• Test data
○ Confusion matrix
▪ Accuracy
▪ Sensitivity
▪ Specificity
▪ Cost benefit analysis.
▪ Identify
▪ Look at the monetary value: see which option returns the best value. Vary it and check which helps us identify cases better.
Exercise
• Customers felt it was an invasion of privacy, so they lost out on sales
• So Walmart assigned a cost to classifying inaccurately
• Logistic regression
• Identifying the variables (feature engineering) takes time and should be done carefully
• Ex: You are using income to predict purchase
• Purchase is either 0 or 1
• As the value of income increases, the tendency to purchase increases
• Values between 0 and 1 are mapped using the sigmoid curve (it helps connect x and y)
• Check with salary
• So it is a good candidate for a predictor
• Select data>
• The null model has only the intercept and none of the predictors (compare the sum of squared errors)
• Z-test
• Salary is significant (see the sketch below)
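A sketch of a one-predictor logistic regression, assuming statsmodels and synthetic salary/purchase data (the lecture does this in Excel/Tanagra); the z-test in the summary is the significance check mentioned above:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic example: does salary predict purchase (0/1)? Data are made up.
rng = np.random.default_rng(0)
salary = rng.normal(50, 15, 200)                       # in thousands
p = 1 / (1 + np.exp(-(-6 + 0.12 * salary)))            # true sigmoid relationship
purchase = rng.binomial(1, p)

X = sm.add_constant(salary)                            # intercept + predictor
model = sm.Logit(purchase, X).fit(disp=0)
print(model.summary())          # z-test per coefficient; a small p-value means salary is significant
print(np.exp(model.params[1]))  # odds ratio: multiplicative change in odds per unit of salary
```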
• Survey-based study to understand customers' willingness to buy domestic goods
○ Identify the key variables that the firm should focus on
○ Use regression
○ Low-income people have a higher propensity to buy
○ Target those who are highly educated, have children, and are ethnocentric (a personality trait cannot be used to segment, but use it in the messaging when you create ads)
• Admit yes - 1 no - 0
• Scores are continuous
• Undergraduate institute rank is a discrete categorical variable, so create dummy variables (see the sketch below)
• Over a period of time, as you refine the model, you will want to collapse (combine) the variables
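A sketch of creating dummy variables for a discrete categorical variable like institute rank, using pandas on made-up admissions data:

```python
import pandas as pd

# Hypothetical admissions data: admit (1 = yes, 0 = no), a continuous score,
# and undergraduate institute rank as a discrete categorical variable.
df = pd.DataFrame({
    "admit": [1, 0, 1, 0, 1, 0],
    "score": [720, 640, 690, 610, 700, 650],
    "rank":  [1, 3, 2, 4, 1, 3],
})

# Create dummy variables for rank; drop_first avoids the dummy-variable trap
# (rank 1 becomes the reference level).
dummies = pd.get_dummies(df["rank"].astype("category"), prefix="rank", drop_first=True)
df_model = pd.concat([df.drop(columns="rank"), dummies], axis=1)
print(df_model.head())
```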
• Chapter 7 in textbook
• Templates for proposal is present in the appendix of the book
• Nature of model chosen depends on the context. Better always to build a simple model
• How to check accuracy?
• So far we have done descriptive analytics (clustering, association mining) and checked how to assess the goodness of a classifier (accuracy, specificity, ...)
• Expected value framework
• Logistic regression (diagnostic model)
Predictive analytics
• Goal: predictive accuracy
• Achieved with cross validation
○ Split into training and test data
• Occam's Razor: the best possible model should be parsimonious; don't build complex models (using all variables for prediction, etc.)
• There is a tradeoff BIAS VS VARIANCE
• Expertise is required: consult domain experts, talk to stakeholders, and pick predictors
○ Build a model for cross-selling: the consumer has already bought products in the product family. If they have already bought all 3, you want to upsell them an upgrade to a higher-priced product
• Remove insignificant predictors and don't do complex transformations of predictors (interactions); the model will start overfitting
• Build models incrementally
○ There are 2 methods (feature engineering):
▪ Forward: add each variable to the model one at a time and check a criterion
□ This criterion varies for each model, e.g., for regression check the R^2; for logistic regression it is AIC
□ Backward: add all the variables and the algorithm will eliminate each variable and check (see the sketch below)
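A rough sketch of forward selection by AIC for a logistic regression, using statsmodels on synthetic data (variable names x1-x3 are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),     # noise variable
})
logit_p = 1 / (1 + np.exp(-(0.5 + 1.2 * df.x1 - 0.8 * df.x2)))
df["y"] = rng.binomial(1, logit_p)

selected, remaining = [], ["x1", "x2", "x3"]
best_aic = sm.Logit(df["y"], np.ones((n, 1))).fit(disp=0).aic   # null model: intercept only

improved = True
while improved and remaining:
    improved = False
    # try adding each remaining variable and keep the one that lowers AIC the most
    trials = {}
    for var in remaining:
        X = sm.add_constant(df[selected + [var]])
        trials[var] = sm.Logit(df["y"], X).fit(disp=0).aic
    best_var = min(trials, key=trials.get)
    if trials[best_var] < best_aic:
        best_aic = trials[best_var]
        selected.append(best_var)
        remaining.remove(best_var)
        improved = True

print(selected, round(best_aic, 1))   # typically picks x1 and x2, leaves the noise variable out
```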
• Exercises that will be covered
• This is the confusion matrix
• Look at the VIF (variance inflation factor) to check for multicollinearity (see the sketch below)
○ https://www.youtube.com/watch?v=0SBIXgPVex8
○ In case of skewness this may not be sufficient; we need to address that later
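A sketch of computing VIF with statsmodels; the predictors are synthetic and two of them are made collinear on purpose:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_mins": rng.normal(180, 50, n),
    "intl_mins":  rng.normal(10, 3, n),
})
df["total_charge"] = 0.17 * df["total_mins"] + rng.normal(0, 0.5, n)   # nearly collinear with total_mins

X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)   # a VIF well above ~5-10 flags multicollinearity; drop or combine such predictors
```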
• To get the probability, apply the logistic (sigmoid) function to the linear predictor
Churn example
• Predict whether customer will churn (leave)
• Very common in banking and telecom. Look to give promotional offers to them to help with
retention.
○ Use 80% of the data for training; the remaining 20% is used as test data
• STEP 1: remove variables not needed for prediction
○ Data> filter
○ So use only the minutes, not the number of calls and not the charges
○ Anyone who has a voice mail plan will have voice mail messages (highly correlated)
○ Remove these variables and run the model again without them
• Confusion matrix
○ Check for overfitting: check whether it performs very well on the training data but not well on the test data
• Find the best cutoff (see the sketch below)
○ Cutoff 0.10
○ SENSITIVITY is important here
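A sketch of sweeping the cutoff to see the accuracy/sensitivity trade-off; the churn labels and model scores here are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, 1000)                        # ~15% churners
prob = np.clip(0.2 * y_true + rng.beta(2, 8, 1000), 0, 1)   # higher scores for churners, with noise

for cutoff in (0.10, 0.15, 0.2, 0.3, 0.5):
    y_pred = (prob >= cutoff).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    accuracy = np.mean(y_true == y_pred)
    print(cutoff, round(accuracy, 3), round(sensitivity, 3))
# Lower cutoffs catch more churners (higher sensitivity) at the cost of more false alarms.
```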
• In this example also there are reasons to bin (see the sketch after this list)
○ Take customer service call
○ It is highly significant
○ After binning
○ Still significant but
○ Check AIC (it should have gone DOWN after you change your predictors)
○ Recommendation:
▪ Do binning and use a cutoff of 0.15
▪ Giving them promotional benefits will have costs, so it depends on the marketing budget
▪ But in health analytics this won't be acceptable; the cost of a miss is high, so you need at least 90%+ sensitivity
□ Same case for fraud or loan defaulters
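A sketch of binning a numeric predictor (customer service calls) with pandas; the data and bin edges are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
calls = pd.Series(rng.poisson(1.5, 500), name="cust_serv_calls")

# Bin the raw counts into a small number of categories
binned = pd.cut(calls, bins=[-1, 1, 3, np.inf], labels=["0-1", "2-3", "4+"])
print(binned.value_counts())

# The binned variable can then replace the raw one in the logistic regression;
# compare AIC before and after (it should go down if binning helped).
```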
Example
• Small number of records for the class of interest (5622 non-manipulators and 39 manipulators)
• Highly unbalanced data set
• Handle it by balancing the sample (see the sketch below)
• Come up with 220 records, 181 non-manipulators and 39 manipulators, for the sample
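A sketch of balancing the sample by undersampling the majority class; the 39/181/220 counts mirror the notes, while the feature values are simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=5661),
    "manipulator": [1] * 39 + [0] * 5622,
})

minority = df[df["manipulator"] == 1]                                # keep all 39 manipulators
majority = df[df["manipulator"] == 0].sample(181, random_state=0)    # sample 181 non-manipulators
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)  # shuffle -> 220 records
print(balanced["manipulator"].value_counts())
```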
• Look at VIF
• If a categorical variable has a limited number of levels, it's fine
• But if you have many values, e.g., region, location, etc., these will create issues if there are many levels
• Case
• Demographic data
▪ 1 is default 0 is otherwise
▪ We are not going to look at Defaulter type for this
○ Authors have split data set into build and validate
Link to video
○ Also delete these 2 columns because for the time being we are not going to deal with
categorical variables with multiple levels
○ And delete last column which identifies that it is the training dataset
○ Delete defaulter type (as it is a multiclass level, i.e: not binary classification)
If they receive salary on 31st they pay EMI on the same day
• Next look at the odds value (i.e., exp(coefficient)): it tells us the strength, so the higher the odds, the stronger the predictor
• Continuing example
• But we should not have too many levels in the variable so set a smaller number of bins
• Not always necessary to have equal bins but it is easier to analyse if that is the case
• Binned to get 3 categories for age and 3 for month
• Check for the entire original data (both training and test)
• How to deal with categorical variables that take multiple values: use zero-sum coding
• You have
• Crossing branch and region you might get a 0 cell count for some combinations; the model cannot estimate those and it won't run
• So we need to collapse (reduce the number of levels)
• Or
• Now check with the test data
Now
• You also need to scrutinize the expected value for this decision (see the sketch below)
• So we choose 0.2 as cutoff despite the fact that 0.5 gives better sensitivity
• So always look at associated costs as well
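A sketch of the expected-value comparison between cutoffs; all monetary values and the model scores are made-up assumptions:

```python
import numpy as np

# Weigh each confusion-matrix cell by a (hypothetical) monetary value.
benefit_tp = 100    # value of correctly flagging a true case
cost_fp    = -10    # cost of acting on a false alarm
cost_fn    = -150   # cost of missing a true case
value_tn   = 0

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.2, 2000)
prob = np.clip(0.35 * y_true + rng.beta(2, 6, 2000), 0, 1)   # synthetic model scores

for cutoff in (0.2, 0.5):
    y_pred = (prob >= cutoff).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    ev = (tp * benefit_tp + fp * cost_fp + fn * cost_fn + tn * value_tn) / len(y_true)
    print(cutoff, round(ev, 2))   # pick the cutoff with the better expected value, not just accuracy
```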
Neural Networks
• Ada Lovelace
• She worked with Babbage on analytical engine
• It is only a machine that takes input and gives output
• In ML we don't explicitly tell the machine what to do; the machine looks at the data and comes up with the rules for us. Machines can learn from examples
○ He batted with a stump hitting a ball against a water tank (curved surface)
○ He learnt through repetition until he perfected it
○ Similar to a regression equation
• W0 is the bias (or intercept)
○ If everything is 0 you don't want to lose information
• We want the ANN to learn to do this; it has an activation function (transfer function)
• Z is fed into the G(Z) function
○ Only when it meets a certain threshold will it predict Y or feed into the next neuron
○ Neural networks are not restricted by these functions (no precondition), unlike the functions above (see the sketch below)
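A sketch of a single neuron's forward pass, z = w0 + w·x followed by a sigmoid activation G(z); the weights and inputs are arbitrary illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x  = np.array([0.6, 1.4])        # two input variables
w  = np.array([0.8, -0.5])       # weights on the connections
w0 = 0.1                         # bias (intercept), so information isn't lost when all inputs are 0

z = w0 + w @ x                   # weighted sum, like a regression equation
y_hat = sigmoid(z)               # activation / transfer function G(z)
print(z, y_hat)                  # y_hat can be thresholded or fed into the next neuron
```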
▪ Say you have 2 variables plotted in 2-D space, to separate black and red dots
▪ It can also flip the dimensions and draw or it can draw curved lines etc
▪ The algorithm is searching for a solution
○ The error is fed back
○ This is called back-propagation
○ Adjust the weights and learn the correct value of y
○ Check the slope repeatedly; if it finds an increasing value it brings it back down, so it comes to the middle (hill climbing)
Optimization problem
• The NN is trying to find the minima
• If it finds it has a negative gradient (slope), it knows it is going down towards minimal loss
• Set the learning rate and see how many steps it takes to reach the optimal point (see the sketch below)
• If the learning rate is too large it will overstep the minima (here it overstepped in one step itself)
• If overshoots the ANN realizes it has overshot and readjusts the weights
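A sketch of gradient descent on a stand-in loss function, comparing a small learning rate with one that oversteps the minimum:

```python
# The loss f(w) = (w - 3)^2 is an arbitrary stand-in for the network's loss.
def grad(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

for lr in (0.1, 1.1):           # a reasonable learning rate vs. one that overshoots
    w = 0.0
    for step in range(20):
        w -= lr * grad(w)       # move against the gradient (downhill)
    print(lr, round(w, 4))      # lr=0.1 converges near 3; lr=1.1 overshoots and diverges
```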
• Inputs to ANN
• Used mostly in deep learning, not NN (the signal will drop over time in deep learning)
• This activation function is also a hyperparameter
• By default use Logit
• NN topology
• NN has 3 layers
○ Input layer
▪ Takes in the input through nodes (here 2 nodes, because there are 2 x variables/predictors)
○ Hidden layer:
▪ 1 layer with 3 nodes
▪ This is designed by the analyst
▪ There are weights associated with these connections
▪ Output is fed to the output layer
▪ Why 3 nodes?
□ For classification
□ If you see 100% accuracy, there is a good chance it is overfitting (especially likely with NNs)
□ Rule of thumb
□ So 3 nodes
□ There are algorithms to recommend the number of nodes
○ Bias
▪ We will look at the forward pass and backward pass (what happens inside the NN)
▪ Initially the NN will assign random values to the weights
▪ All the weights need to arrive at the optimal minima; it is not that simple
▪ Same for the bias, assign random values
□ It returns probability values for classification and we need to set the cutoff
▪ Backward propagation
□ Calculate the error at the hidden node and the activation node
□ Start correcting the weights using the delta value (the direction/gradient of the function); see the sketch below
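A deliberately tiny 2-2-1 network trained with plain back-propagation (numpy), just to make the forward pass, the deltas, and the weight adjustment concrete; the data and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs (AND-like pattern)
y = np.array([[0], [0], [0], [1]], dtype=float)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# random initial weights, zero biases
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)           # hidden layer activations
    out = sigmoid(h @ W2 + b2)         # output layer (probability)

    # backward pass: propagate the error and compute deltas
    d_out = (out - y) * out * (1 - out)          # delta at the output node
    d_h = (d_out @ W2.T) * h * (1 - h)           # delta at the hidden nodes

    # adjust the weights against the gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # after training, close to the targets 0, 0, 0, 1
```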
• Example
• Set parameters
• Will iterate through the entire dataset and try to find the minimal error
• If it doesn't seem to converge, stop the process, tweak the parameters and try again
• Example
Source of candidate
• Location 7 levels
• In logistic regression this would be a problem but not in NN
• Select columns and open in Tanagra
• Set parameters
• Execute
• Use min-max
Source
○ This is our back propagation NN
• Execute
• Use genetic algorithm to determine the features ( ideal number of nodes and layers)
• We have 12 variables
○ Need 10
○ But we have only 6000 records, which could lead to overfitting, so instead be conservative but still increase the number of nodes
• Some of these have many levels, so it will be difficult for regression to capture the non-linearity
• Import to Tanagra
• 70/30 split
○ To find the number of nodes, use the rule of thumb
○ Learning rate: start with 0.1 and work upwards; if the algorithm doesn't converge, increase it
○ The rest of the rows give you how the error rate changes if you drop a variable
○ 89%
○ If it had degraded, that means the model is not consistent
• So far we have used it only for classification