W02-Measurement Scales and Data Understanding
Customer Analytics
Data Understanding
BUS445 D200 Fall 2023
Roadmap for Today
Measurement Scales
• In-class Quiz
• Analyzing horse racing performances
MAY BE MEASURED AS…
NOMINAL    Y  Y  Y  Y    (LESS INFORMATION)
ORDINAL    N  Y  Y  Y
INTERVAL   N  N  Y  Y
RATIO      N  N  N  Y    (MORE INFORMATION)
[Figure: 16 furlong races, comparing Battleship, BlackGold, and Bushranger; vertical axis from 0.062 to 0.067]
Cross Industry Standard Process for Data Mining (CRISP-DM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Cross Industry Standard Process for Data Mining (CRISP-DM)
• Business Understanding
• Data Understanding
• Modeling
– Supervised Learning (Continuous vs. Binary)
– W3: Multiple Linear Regression
– W4: Logistic Regression
– W6: Non-linear Effect
– W7: Tree models
– W8: Neural Network
– Unsupervised Learning
– W10: Cluster Analysis + PCA
– W11: Segmentation
• Evaluation
– W5: Model Assessment
• Deployment
– W9: MVP
Two-way Contingency Tables
Counts:
                        Blue-eyed   Non-blue-eyed   Total
Midterm: 80 or above        5             5           10
Midterm: 79 or below        7            33           40
Total                      12            38           50
Column percentages:
                        Blue-eyed   Non-blue-eyed
Midterm: 80 or above      41.7%         13.2%
Midterm: 79 or below      58.3%         86.8%
Total                     100%          100%
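The column percentages can be recomputed from the counts as a quick check. This is a minimal sketch in plain Python; the variable names (`counts`, `col_totals`, `col_pct`) are illustrative and not part of the course material.

```python
# Counts from the two-way table: rows = midterm band, columns = eye colour.
counts = {
    "Midterm: 80 or above": {"Blue-eyed": 5, "Non-blue-eyed": 5},
    "Midterm: 79 or below": {"Blue-eyed": 7, "Non-blue-eyed": 33},
}
columns = ["Blue-eyed", "Non-blue-eyed"]

# Column totals (12 blue-eyed, 38 non-blue-eyed students).
col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}

# Each cell as a percentage of its column total.
col_pct = {
    r: {c: 100 * row[c] / col_totals[c] for c in columns}
    for r, row in counts.items()
}

for r, row in col_pct.items():
    print(r, "  ".join(f"{c}: {row[c]:.1f}%" for c in columns))
```

Percentages are taken within each column because the comparison of interest is how the midterm bands split inside each eye-colour group.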
Beyond Eyeballing: Extending to more levels
Jack & Jill Data Set
Family size and categorical spending information on 557 families
• The three spending categories have 188, 183, and 186 families, respectively.
• The three family size categories have 187, 270, and 100 families, respectively.
• 9 possible combinations of spending category and family size
                        Family Size
                        1      2      3+     TOTAL
Spending    Low         ?      ?      ?      188
Category    Medium      ?      ?      ?      183
            High        ?      ?      ?      186
            TOTAL       187    270    100    557
Expected counts if no relation
The proportion of households with a specific family size and a certain spending level
= the proportion of households having the specific family size
× the proportion of households having the certain spending level
                        Family Size
                        1 Child   2 Children   3+ Children   Marginal Proportion
Spending    Low          .113       .163          .061            .337
Category    Medium       .111       .160          .059            .329
            High         .112       .162          .060            .334
Marginal Proportion      .336       .485          .180            1.00
Expected Counts if no relation
                        Family Size
                        1 Child   2 Children   3+ Children   TOTAL
Spending    Low           63         91           34          188
Category    Medium        61         89           33          183
            High          62         90           33          186
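The expected counts follow from the marginal totals alone: under independence, each cell is (row total × column total) / grand total. The sketch below uses the Jack & Jill totals from the slides; the variable names are illustrative. Values computed directly from the totals can differ by one unit from figures obtained by first rounding the proportions.

```python
# Marginal totals from the Jack & Jill data set (557 families).
row_totals = {"Low": 188, "Medium": 183, "High": 186}                 # spending category
col_totals = {"1 Child": 187, "2 Children": 270, "3+ Children": 100}  # family size
n = sum(row_totals.values())  # grand total

# Expected count under independence: row total * column total / grand total.
expected = {
    r: {c: rt * ct / n for c, ct in col_totals.items()}
    for r, rt in row_totals.items()
}

for r, row in expected.items():
    print(f"{r:<8}", "  ".join(f"{c}: {v:6.2f}" for c, v in row.items()))
```

Each row of expected counts sums back to its row total, which is a handy sanity check after rounding.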
Chi-square Test Statistics
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
Chi-square measures how much our data differ from what we’d expect
(given the hypothesis of independence)
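As a worked instance of the statistic, the sketch below applies it to the midterm and eye-colour counts from the two-way table earlier in the deck, since those observed cell counts are given in full. The function name `chi_square` is hypothetical, and the code is plain Python rather than any particular statistics package.

```python
def chi_square(observed):
    """Chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat


# Rows: midterm 80 or above, 79 or below. Columns: blue-eyed, non-blue-eyed.
observed = [[5, 5], [7, 33]]
chi2 = chi_square(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}")
```

With one degree of freedom, the statistic is compared against the chi-square distribution; one expected count here is small (2.4), so the approximation should be read with some caution.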
                        Family Size
                        1 Child   2 Children   3+ Children   % of TOTAL
Spending    Low          52.9%      28.9%         11%           34%
Category      (null)     (34%)      (34%)        (34%)
            Medium       33.2%      33.0%         32%           33%
              (null)     (33%)      (33%)        (33%)
            High         13.9%      38.1%         57%           33%
              (null)     (33%)      (33%)        (33%)
            TOTAL        100%       100%          100%          100%
W2 Tutorial Exercises
Data Understanding Tutorial
Chapter 3 + W02 Tutorial Exercise
SPEED       Category
15.3375     MEDIUM
14.75       FAST
15.275      MEDIUM
15.4625     MEDIUM
15.5625     SLOW
15.56875    SLOW
• and summarize counts of identical values as frequency distributions:
• 1 FAST, 3 MEDIUM, 2 SLOW
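The frequency distribution can be tallied with the standard library. A minimal sketch, assuming the six category labels are entered by hand from the table; the variable names are illustrative.

```python
from collections import Counter

# Category labels for the six SPEED observations, in table order.
categories = ["MEDIUM", "FAST", "MEDIUM", "MEDIUM", "SLOW", "SLOW"]

# Count how many times each identical value occurs.
freq = Counter(categories)

for label in ("FAST", "MEDIUM", "SLOW"):
    print(f"{label}: {freq[label]}")
```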
Data Understanding Tutorial
• Visualizing Data Variables (Section 3.5)
– When should we use equal-width vs. equal-frequency binning?
• Creating cross-tabs (Section 3.6)
– What is a chi-square test?
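To make the binning question concrete, the sketch below bins the six SPEED values from the tutorial example both ways. The choice of three bins and the helper names (`equal_width_bins`, `equal_frequency_bins`) are assumptions for illustration, not the textbook's procedure.

```python
speeds = [15.3375, 14.75, 15.275, 15.4625, 15.5625, 15.56875]
k = 3  # number of bins (an assumption for this illustration)


def equal_width_bins(values, k):
    """Assign each value to one of k bins of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land at index k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


width_bins = equal_width_bins(speeds, k)
freq_bins = equal_frequency_bins(speeds, k)
width_counts = [width_bins.count(b) for b in range(k)]
freq_counts = [freq_bins.count(b) for b in range(k)]
print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)
```

Equal-width bins keep the intervals comparable but can leave some bins nearly empty when values cluster, while equal-frequency bins balance the counts at the cost of uneven interval widths.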
W02 Tutorial Exercise