Business Analytics
Index
Date Lecture Topics
23-11-2020 L1 Introduction
24-11-2020 L2 Clustering - K means clustering
25-11-2020 L3 K means clustering using Tanagra
27-11-2020 L4 Exercises in clustering
28-11-2020 L5 Association rule mining
30-11-2020 L6 Exercise - Association rule mining
01-12-2020 L7 Classification using logistic regression, EVP, Evaluating ML model
02-12-2020 L8
05-12-2020 L9
18-12-2020 L10 Predictive analytics
19-12-2020 L11 Logistic regression - How to select predictors
Video Lecture_1
22-12-2020 L12 Neural Networks
23-12-2020 L13
24-12-2020 L14
• The real-world phenomenon is COVID spread; the model takes input and attempts to describe that real-world phenomenon
□ Here, to find the relationship between two variables and explore whether they are related, use CORRELATION
▪ What is the contribution of pages viewed to the amount spent?
○ Predict rather than explain. Focus on accuracy of predictions
• Environment is unstable/uncertain
• Multiple criteria to evaluate, optimize operational cost
• Not diagnostic: not trying to explain anything
• Looking at past data to see what is selling well
• After this we can move on to diagnostics and try to find out why this happened (location, trends)
• Classification predictive db
• They used only backend data. If they had factored in external variables, it would become prescriptive
• Form highly cohesive clusters (points within a cluster are close; clusters are well separated)
• How?
○ Calculate the distance
• If we include another variable such as age, the distance changes because the variables are on different scales
• So it is ALWAYS better to normalize your data before calculating distances
• Normalize using:
○ Z-transform: subtract the mean and divide by the standard deviation (see the sketch below)
○ Negative values are below the mean; positive values are above the mean
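A minimal sketch (not from the lecture; Python with numpy, made-up income/age values) showing why normalization matters before computing distances:

```python
import numpy as np

# Hypothetical data: income (large scale) and age (small scale)
X = np.array([[52000, 25],
              [48000, 47],
              [61000, 33],
              [58000, 52]], dtype=float)

# Z-transform: subtract the column mean, divide by the column standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two customers, before and after normalization
dist_raw  = np.linalg.norm(X[0] - X[1])
dist_norm = np.linalg.norm(X_norm[0] - X_norm[1])
print(dist_raw, dist_norm)   # the raw distance is dominated by income; the normalized one is not
```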
• K means algorithm
• Iterative algorithm
• Picks k random points as initial centroids (here k=2)
• Calculates the distance of each point from the centroids
○ Now it sees that point a is closer to the circle centroid than to the triangle, so it moves it to the other cluster
• Now recalculates the centroid
• So iteratively it does (see the sketch after this list):
○ Calculate distances
○ Calculate centroids
○ Shift points to the centroid with the least distance and recalculate
• Stops when it finds that every point is closest to its assigned centroid
• Certain software lets you choose which points to start K-means with
• To handle outliers, you can use K-medoids instead of K-means, or just remove the outliers before doing k-means
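A rough sketch of the iterative loop described above (assign points, recompute centroids, repeat). The toy data and k value are made up, and empty-cluster handling is omitted:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# toy, already-normalized data
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [0.95, 0.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```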
Exercise
• Apply VLOOKUP to get a point (when you use Solver it needs a formula so that it can change values dynamically)
• Assign each point to the cluster whose centroid it has the minimum distance from
• Repeat until the total distance is minimized (sum of minimums, here in column I)
• Cluster 1
○ Elbow plot
○ Calculate the silhouette width for different k values and use the k value that gives the highest average silhouette width (see the sketch below)
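A hedged sketch of choosing k, assuming scikit-learn is available (the course itself uses Excel/Tanagra for this); the data matrix here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_norm: a normalized data matrix; synthetic here just to make the sketch runnable
rng = np.random.default_rng(0)
X_norm = np.vstack([rng.normal(0, 0.3, (30, 2)),
                    rng.normal(2, 0.3, (30, 2)),
                    rng.normal((0, 3), 0.3, (30, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_norm)
    wss = km.inertia_                              # within-cluster sum of squares (for the elbow plot)
    sil = silhouette_score(X_norm, km.labels_)     # average silhouette width
    print(k, round(wss, 1), round(sil, 3))
# Pick the k at the "elbow" of the WSS curve, or the k with the highest average silhouette width.
```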
Follow-up
• Read [3. Chateau Wiery (A)] up to page 4
• 6 variables available
• Need to classify the types of shoppers by clustering
• Difference between factor analysis and cluster analysis?
○ Factor analysis is a dimension reduction technique. Combining
variables based on shared variance. Grouping variables.
▪ You can check whether the shared variance is high by looking at the eigenvalue; then don't use multiple variables, instead group them to get only a single variable
• Usually you will normalize the data (here all values are on the same
scale)
• Normalize the data
• All marked yes are inputs that have been selected for k means
• Check the elbow plot to get k value
• Elbow plot is not available in this tool
• Here, calculate the WSS (within-cluster sum of squares) manually (run Tanagra with k=1, k=2, k=3, etc. and copy the values)
• silhouette plot
○ Size gives number of points in the cluster and wss represents the
compactness of the cluster (distance from the centroid)
○ Cluster 3 is most compact
• Ex: here you can see cluster 3 has a negative value for "shopping is fun"
• Ex: Cluster 2 is minimally interested in shopping (variable 5); instead they focus on price
• Paste in excel
Exercise
• Survey results
○ But take it with a pinch of salt: the sample size is small and the sample has more females
• Q2 Product categories
○ Normalize data
Recommendation
○ Asian customers are generally not too trusting of online shopping
• Scree plot
○ Filter cluster 2
○ Restricted in terms of time and geography (here, only US data)
○ These are concerns to be wary of
• For k=3 better discrimination
• Exercise 2
• B2B situation, trying to find out who are the best vendors or suppliers
• 4 cluster solution
• Cluster 2
• Ex:
○ Amazon: frequently bought together
▪ This was debunked
▪ This is a real example
□ 4 items, in pairs
□ Bread, cheese, Poptart, Beer
□ Find how many itemsets are possible?
□ Metrics to assess the goodness of a rule
Support
Worksheet
• This is called confidence, the second metric to check the goodness of a rule
H/W: Find the confidence when the order of the rule is reversed (A purchased after B)
• Confidence is the only metric in which the order of the rule matters (it is not symmetric)
• Confidence gives an indication of the direction in which the rule occurs
Lift
• Takes care of the popularity of the consequent
• A lift < 1 is also interesting, as it shows a negative association between the items (see the sketch below)
• Calculate lift
• And juice is purchased after cheese is purchased
• Order is preserved in the lift
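A small sketch of the three metrics on a hypothetical transaction list (items and counts are invented); note how confidence changes when the rule is reversed, which is the homework above:

```python
# Support, confidence and lift for a rule A -> B, computed from a small
# hypothetical transaction list.
transactions = [
    {"bread", "cheese"},
    {"bread", "cheese", "juice"},
    {"cheese", "juice"},
    {"bread", "beer"},
    {"cheese", "juice"},
]

def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n              # how often A and B occur together
    confidence = both / ante        # P(B | A); depends on the direction of the rule
    lift = confidence / (cons / n)  # corrects for how popular B is on its own
    return support, confidence, lift

print(rule_metrics(transactions, {"cheese"}, {"juice"}))   # cheese -> juice
print(rule_metrics(transactions, {"juice"}, {"cheese"}))   # reversed rule, as in the homework
```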
• Ex: Coke and Pepsi (if planning a promotion don’t place them
• If first column used countif formula (no need to check the sequence)
○ Export to tanagra
○ Increasing itemset to 4
Exercise
○ If we increase itemset to 3
• Better to have a limited number of rules in a retail situation. Also, there is little data, so it is better to use a higher support (only 100- rows)
• MSNBC website
• Page visit, visitor data for one day only
• Then adjust
○ Diagnostic model:
○ Predictive model: focus is on prediction accuracy
▪ We can no longer rely on the significance-based approach
▪ When you have a large db (you effectively have the entire population for your study), everything comes out as statistically significant, so significance (the p-value) cannot be used for generalization
▪ Need to use cross-validation (out-of-sample prediction): predict on out-of-sample data as well
□ Remove the target variable and ask the model to predict it. Compare with the original value (which we know) and assess the accuracy of the model
□ Ex: if your training data has only the younger age group, the model can work very well on the training data but will perform poorly on test data that contains older ages
□ Ensure heterogeneity and representation are present in the training data
□ Fix this issue with proper sampling when splitting the data
□ How to split between training and test data (see the sketch below)
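A sketch of proper sampling when splitting, assuming scikit-learn; the dataframe and column names are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny hypothetical dataset; column names are made up for illustration.
df = pd.DataFrame({
    "age":       [22, 25, 31, 45, 52, 60, 38, 29, 41, 57],
    "income":    [30, 35, 48, 60, 72, 80, 55, 40, 58, 75],
    "purchased": [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})

# Stratifying on the target keeps the class proportions similar in both splits,
# so the test data stays representative (avoids e.g. only one group ending up in training).
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="purchased"), df["purchased"],
    test_size=0.2, stratify=df["purchased"], random_state=42,
)
print(y_train.mean(), y_test.mean())   # similar proportion of the class of interest
```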
• Classification matrix
• Class of interest (people who respond, so "yes")
• But there are cases where "No" could be the class of interest, like credit card fraud
○ Check TP and TN
• Suppose a model simply predicts "No" for everyone; the accuracy will still look very high
○ But it will never detect a TP
○ This is the accuracy paradox: you cannot rely on the accuracy of a model alone to assess the 'goodness' of a model
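A sketch of the accuracy paradox with made-up counts: a model that always predicts "No" on imbalanced data gets high accuracy but zero sensitivity:

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    # Class of interest coded as 1 ("yes")
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall on the class of interest
    specificity = tn / (tn + fp)      # recall on the "no" class
    return accuracy, sensitivity, specificity

# 1000 cases, only 20 are "yes"; a model that always says "no"
y_true = np.array([1] * 20 + [0] * 980)
y_pred = np.zeros(1000, dtype=int)
print(confusion_metrics(y_true, y_pred))   # accuracy 0.98, but sensitivity 0.0 (never detects a TP)
```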
Exercise
• Test data
○ Confusion matrix
▪ Accuracy
▪ Sensitivity
▪ Specificity
▪ Cost benefit analysis.
▪ Identify
▪ Look at the monetary value: see which option returns the best value. Vary it and check which helps us identify cases better.
Exercise
• Customers felt it was an invasion of privacy, so they lost out on sales
• So Walmart assigned a cost to classifying inaccurately
• Logistic regression
• Identifying the variables (feature engineering) takes time and should be done carefully
• Ex: You are using income to predict purchase
• Purchase is either 0 or 1
• As the value of income increases, the tendency to purchase increases
• Values between 0 and 1 are mapped using the sigmoid curve (it helps connect x and y)
• Check with salary
• So it is a good candidate for a predictor
• Select data>
• The null model has only the intercept and none of the predictors (compare the sum of squared errors)
• Z-test
• Salary is significant (see the sketch below)
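A sketch of a one-predictor logistic regression, assuming statsmodels and synthetic salary/purchase data (the lecture does this in Excel/Tanagra); the z-test in the summary is the significance check mentioned above:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic example: does salary predict purchase (0/1)? Data are made up.
rng = np.random.default_rng(0)
salary = rng.normal(50, 15, 200)                       # in thousands
p = 1 / (1 + np.exp(-(-6 + 0.12 * salary)))            # true sigmoid relationship
purchase = rng.binomial(1, p)

X = sm.add_constant(salary)                            # intercept + predictor
model = sm.Logit(purchase, X).fit(disp=0)
print(model.summary())          # z-test per coefficient; a small p-value means salary is significant
print(np.exp(model.params[1]))  # odds ratio: multiplicative change in odds per unit of salary
```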
• Survey-based study to understand customers' willingness to buy domestic goods
○ Identify the key variables that the firm should focus on
○ Use regression
○ Low-income people have a higher propensity to buy
○ Target those who are highly educated, have children, and are ethnocentric (a personality trait cannot be used to segment, but use it in the messaging when you create ads)
• Admit yes - 1 no - 0
• Scores are continuous
• Undergraduate institute rank is a discrete categorical variable, so create dummy variables (see the sketch below)
• Over a period of time, as you refine the model, you will want to collapse (combine) the variables
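A sketch of creating dummy variables for a discrete categorical variable like institute rank, using pandas on made-up admissions data:

```python
import pandas as pd

# Hypothetical admissions data: admit (1 = yes, 0 = no), a continuous score,
# and undergraduate institute rank as a discrete categorical variable.
df = pd.DataFrame({
    "admit": [1, 0, 1, 0, 1, 0],
    "score": [720, 640, 690, 610, 700, 650],
    "rank":  [1, 3, 2, 4, 1, 3],
})

# Create dummy variables for rank; drop_first avoids the dummy-variable trap
# (rank 1 becomes the reference level).
dummies = pd.get_dummies(df["rank"].astype("category"), prefix="rank", drop_first=True)
df_model = pd.concat([df.drop(columns="rank"), dummies], axis=1)
print(df_model.head())
```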
• Chapter 7 in textbook
• Templates for proposal is present in the appendix of the book
• Nature of model chosen depends on the context. Better always to build a simple model
• How to check accuracy?
• So far we have done descriptive analytics (clustering, association mining) and checked how to assess the goodness of a classifier (accuracy, specificity, ...)
• Expected value framework
• Logistic regression (diagnostic model)
Predictive analytics
• Goal: predictive accuracy
• Achieved with cross validation
○ Split into training and test data
• Occam's Razor: the best possible model should be parsimonious; don't build complex models (using all variables for prediction, etc.)
• There is a tradeoff BIAS VS VARIANCE
• Expertise is required: consult domain experts, talk to stakeholders, and pick predictors
○ Build a model for cross-selling: the consumer has already bought products in the product family. If they have already bought all 3, you want to upsell them an upgrade to a higher-priced product
• Remove insignificant predictors and don't do complex transformations of predictors (interactions); the model will start overfitting
• Build models incrementally
○ There are 2 methods (feature engineering):
▪ Forward: add each variable to the model one at a time and check a criterion
□ This criterion varies for each model, e.g., for regression check the R^2; for logistic regression it is AIC
□ Backward: add all the variables and the algorithm will eliminate each variable and check (see the sketch below)
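A rough sketch of forward selection by AIC for a logistic regression, using statsmodels on synthetic data (variable names x1-x3 are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),     # noise variable
})
logit_p = 1 / (1 + np.exp(-(0.5 + 1.2 * df.x1 - 0.8 * df.x2)))
df["y"] = rng.binomial(1, logit_p)

selected, remaining = [], ["x1", "x2", "x3"]
best_aic = sm.Logit(df["y"], np.ones((n, 1))).fit(disp=0).aic   # null model: intercept only

improved = True
while improved and remaining:
    improved = False
    # try adding each remaining variable and keep the one that lowers AIC the most
    trials = {}
    for var in remaining:
        X = sm.add_constant(df[selected + [var]])
        trials[var] = sm.Logit(df["y"], X).fit(disp=0).aic
    best_var = min(trials, key=trials.get)
    if trials[best_var] < best_aic:
        best_aic = trials[best_var]
        selected.append(best_var)
        remaining.remove(best_var)
        improved = True

print(selected, round(best_aic, 1))   # typically picks x1 and x2, leaves the noise variable out
```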
• Exercises that will be covered
• This is the confusion matrix
• Look at the VIF (variance inflation factor) to check for multicollinearity (see the sketch below)
○ https://www.youtube.com/watch?v=0SBIXgPVex8
○ In case of skewness this may not be sufficient; we need to address that later
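A sketch of computing VIF with statsmodels; the predictors are synthetic and two of them are made collinear on purpose:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_mins": rng.normal(180, 50, n),
    "intl_mins":  rng.normal(10, 3, n),
})
df["total_charge"] = 0.17 * df["total_mins"] + rng.normal(0, 0.5, n)   # nearly collinear with total_mins

X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)   # a VIF well above ~5-10 flags multicollinearity; drop or combine such predictors
```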
• To get the probability, apply the logistic (sigmoid) function to the linear predictor
Churn example
• Predict whether customer will churn (leave)
• Very common in banking and telecom. Look to give promotional offers to them to help with
retention.
○ Use 80% of the data for training; the remaining 20% is used as test data
• STEP 1: remove variables not needed for prediction
○ Data> filter
○ So use only the minutes, not the number of calls and not the charges
○ Anyone who has a voice mail plan will have voice mail messages (highly correlated)
○ Remove these variables and run the model again without them
• Confusion matrix
○ Check for overfitting: check whether it performs very well on the training data but not well on the test data
• Find the best cutoff (see the sketch below)
○ Cutoff 0.10
○ SENSITIVITY is important here
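A sketch of sweeping the cutoff to see the accuracy/sensitivity trade-off; the churn labels and model scores here are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, 1000)                        # ~15% churners
prob = np.clip(0.2 * y_true + rng.beta(2, 8, 1000), 0, 1)   # higher scores for churners, with noise

for cutoff in (0.10, 0.15, 0.2, 0.3, 0.5):
    y_pred = (prob >= cutoff).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    accuracy = np.mean(y_true == y_pred)
    print(cutoff, round(accuracy, 3), round(sensitivity, 3))
# Lower cutoffs catch more churners (higher sensitivity) at the cost of more false alarms.
```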
• In this example also there are reasons to bin (see the sketch after this list)
○ Take customer service call
○ It is highly significant
○ After binning
○ Still significant but
○ Check AIC (it should have gone DOWN after you change your predictors)
○ Recommendation:
▪ Do binning and use a cutoff of 0.15
▪ Giving them promotional benefits will have costs, so it depends on the marketing budget
▪ But in health analytics this won't be acceptable; the cost of a miss is high, so you need at least 90%+ sensitivity
□ Same case for fraud or loan defaulters
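A sketch of binning a numeric predictor (customer service calls) with pandas; the data and bin edges are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
calls = pd.Series(rng.poisson(1.5, 500), name="cust_serv_calls")

# Bin the raw counts into a small number of categories
binned = pd.cut(calls, bins=[-1, 1, 3, np.inf], labels=["0-1", "2-3", "4+"])
print(binned.value_counts())

# The binned variable can then replace the raw one in the logistic regression;
# compare AIC before and after (it should go down if binning helped).
```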
Example
• Small number of records for the class of interest (5622 non-manipulators and 39 manipulators)
• Highly unbalanced data set
• Handle it by balancing the sample (see the sketch below)
• Come up with 220 records, 181 non-manipulators and 39 manipulators, for the sample
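A sketch of balancing the sample by undersampling the majority class; the 39/181/220 counts mirror the notes, while the feature values are simulated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=5661),
    "manipulator": [1] * 39 + [0] * 5622,
})

minority = df[df["manipulator"] == 1]                                # keep all 39 manipulators
majority = df[df["manipulator"] == 0].sample(181, random_state=0)    # sample 181 non-manipulators
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)  # shuffle -> 220 records
print(balanced["manipulator"].value_counts())
```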
• Look at VIF
• If a categorical variable has a limited number of levels, it's fine
• But if you have many values, e.g., region, location, etc., these will create issues if there are many levels
• Case
• Demographic data
▪ 1 is default 0 is otherwise
▪ We are not going to look at Defaulter type for this
○ Authors have split data set into build and validate
Link to video
○ Also delete these 2 columns because for the time being we are not going to deal with
categorical variables with multiple levels
○ And delete last column which identifies that it is the training dataset
○ Delete defaulter type (as it is a multiclass level, i.e: not binary classification)
If they receive salary on 31st they pay EMI on the same day
• Next look at the odds value (i.e., exp(coefficient)): it tells us the strength, so the higher the odds, the stronger the predictor
• Continuing example
• But we should not have too many levels in the variable so set a smaller number of bins
• Not always necessary to have equal bins but it is easier to analyse if that is the case
• Binned to get 3 categories for age and 3 for month
• Check for the entire original data (both training and test)
• How to deal with categorical variables that take multiple values: use zero-sum coding
• You have
• Crossing branch and region you might get a 0 cell count for some combinations; the model cannot estimate those and it won't run
• So we need to collapse (reduce the number of levels)
• Or
• Now check with the test data
Now
• You also need to scrutinize the expected value for this decision (see the sketch below)
• So we choose 0.2 as cutoff despite the fact that 0.5 gives better sensitivity
• So always look at associated costs as well
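A sketch of the expected-value comparison between cutoffs; all monetary values and the model scores are made-up assumptions:

```python
import numpy as np

# Weigh each confusion-matrix cell by a (hypothetical) monetary value.
benefit_tp = 100    # value of correctly flagging a true case
cost_fp    = -10    # cost of acting on a false alarm
cost_fn    = -150   # cost of missing a true case
value_tn   = 0

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.2, 2000)
prob = np.clip(0.35 * y_true + rng.beta(2, 6, 2000), 0, 1)   # synthetic model scores

for cutoff in (0.2, 0.5):
    y_pred = (prob >= cutoff).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    ev = (tp * benefit_tp + fp * cost_fp + fn * cost_fn + tn * value_tn) / len(y_true)
    print(cutoff, round(ev, 2))   # pick the cutoff with the better expected value, not just accuracy
```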
Neural Networks
• Ada Lovelace
• She worked with Babbage on analytical engine
• It is only a machine that takes input and gives output
• In ML we don't explicitly tell the machine what to do; the machine looks at the data and comes up with the rules for us. Machines can learn from examples
○ He batted with a stump hitting a ball against a water tank (curved surface)
○ He learnt through repetition until he perfected it
○ Similar to a regression equation
• W0 is the bias (or intercept)
○ If everything is 0 you don't want to lose information
• We want the ANN to learn to do this; it has an activation function (transfer function)
• Z is fed into the G(Z) function
○ Only when it meets a certain threshold will it predict Y or feed into the next neuron
○ Neural networks are not restricted by these functions (no precondition), unlike the functions above (see the sketch below)
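A sketch of a single neuron's forward pass, z = w0 + w·x followed by a sigmoid activation G(z); the weights and inputs are arbitrary illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x  = np.array([0.6, 1.4])        # two input variables
w  = np.array([0.8, -0.5])       # weights on the connections
w0 = 0.1                         # bias (intercept), so information isn't lost when all inputs are 0

z = w0 + w @ x                   # weighted sum, like a regression equation
y_hat = sigmoid(z)               # activation / transfer function G(z)
print(z, y_hat)                  # y_hat can be thresholded or fed into the next neuron
```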
▪ Say you have 2 variables plotted in 2-D space, to separate black and red dots
▪ It can also flip the dimensions and draw or it can draw curved lines etc
▪ The algorithm is searching for a solution
○ The error is fed back
○ This is called back-propagation
○ Adjust the weights and learn the correct value of y
○ Check the slope repeatedly; if it finds an increasing value it brings it back down, so it comes to the middle (hill climbing)
Optimization problem
• The NN is trying to find the minima
• If it finds it has a negative gradient (slope), it knows it is going down towards minimal loss
• Set the learning rate and see how many steps it takes to reach the optimal point (see the sketch below)
• If the learning rate is too large it will overstep the minima (here it overstepped in one step itself)
• If overshoots the ANN realizes it has overshot and readjusts the weights
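A sketch of gradient descent on a stand-in loss function, comparing a small learning rate with one that oversteps the minimum:

```python
# The loss f(w) = (w - 3)^2 is an arbitrary stand-in for the network's loss.
def grad(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

for lr in (0.1, 1.1):           # a reasonable learning rate vs. one that overshoots
    w = 0.0
    for step in range(20):
        w -= lr * grad(w)       # move against the gradient (downhill)
    print(lr, round(w, 4))      # lr=0.1 converges near 3; lr=1.1 overshoots and diverges
```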
• Inputs to ANN
• Used mostly in deep learning, not NN (the signal will drop over time in deep learning)
• This activation function is also a hyperparameter
• By default use Logit
• NN topology
• NN has 3 layers
○ Input layer
▪ Takes in the input through nodes (here 2 nodes, because there are 2 x variables/predictors)
○ Hidden layer:
▪ 1 layer with 3 nodes
▪ This is designed by the analyst
▪ There are weights associated with these connections
▪ Output is fed to the output layer
▪ Why 3 nodes?
□ For classification
□ If you see 100% accuracy, there is a good chance it is overfitting (especially likely with NNs)
□ Rule of thumb
□ So 3 nodes
□ There are algorithms to recommend the number of nodes
○ Bias
▪ We will look at the forward pass and backward pass (what happens inside the NN)
▪ Initially the NN will assign random values to the weights
▪ All the weights need to arrive at the optimal minima; it is not that simple
▪ Same for the bias, assign random values
□ It returns probability values for classification and we need to set the cutoff
▪ Backward propagation
□ Calculate the error at the hidden node and the activation node
□ Start correcting the weights using the delta value (the direction/gradient of the function); see the sketch below
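A deliberately tiny 2-2-1 network trained with plain back-propagation (numpy), just to make the forward pass, the deltas, and the weight adjustment concrete; the data and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs (AND-like pattern)
y = np.array([[0], [0], [0], [1]], dtype=float)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# random initial weights, zero biases
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)           # hidden layer activations
    out = sigmoid(h @ W2 + b2)         # output layer (probability)

    # backward pass: propagate the error and compute deltas
    d_out = (out - y) * out * (1 - out)          # delta at the output node
    d_h = (d_out @ W2.T) * h * (1 - h)           # delta at the hidden nodes

    # adjust the weights against the gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # after training, close to the targets 0, 0, 0, 1
```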
• Example
• Set parameters
• Will iterate through the entire dataset and try to find the minimal error
• If it doesn't seem to converge, stop the process, tweak the parameters and try again
• Example
Source of candidate
• Location 7 levels
• In logistic regression this would be a problem but not in NN
• Select columns and open in Tanagra
• Set parameters
• Execute
• Use min-max
Source
○ This is our back propagation NN
• Execute
• Use genetic algorithm to determine the features ( ideal number of nodes and layers)
• We have 12 variables
○ Need 10
○ But we have only 6000 records, which could lead to overfitting, so instead be conservative but still increase the number of nodes
• Some of these have many levels, so it will be difficult for regression to capture the non-linearity
• Import to Tanagra
• 70/30 split
○ To find the number of nodes, use the rule of thumb
○ Learning rate: start with 0.1 and work upwards; if the algorithm doesn't converge, increase it
○ The rest of the rows give you how the error rate changes if you drop a variable
○ 89%
○ If it had degraded, that means the model is not consistent
• So far we have used it only for classification