W02-Measurement Scales and Data Understanding

BUS445
Customer Analytics
Data Understanding
BUS445 D200 Fall 2023
Roadmap for Today
Measurement Scales
• In-class Quiz
• Analyzing horse racing performances
Data Understanding Tutorial
• Case: Jack & Jill
Intrinsic Scale of an Attribute
• How fast a horse goes is an intrinsic attribute
of a horse. Intuitively, think “speed”.
– What is the measurement scale of this attribute?
• We can measure this attribute at lower levels
– Position: scale?
– Time behind winner: scale?
• What is the disadvantage of measuring at
lower levels?
• Any advantages?
HIERARCHY

                       INTRINSIC SCALE OF THE ATTRIBUTE
MAY BE MEASURED AS…    NOMINAL   ORDINAL   INTERVAL   RATIO
NOMINAL                Y         Y         Y          Y        (LESS INFORMATION)
ORDINAL                N         Y         Y          Y
INTERVAL               N         N         Y          Y
RATIO                  N         N         N          Y        (MORE INFORMATION)

IF YOU DO MEASURE IN THE RED AREA (THE N CELLS),
REMEMBER THAT YOU ARE MAKING ASSUMPTIONS
Summarizing One Variable
(“univariate” statistics)
• Legit to summarize COUNTS of identical values of
a categorical (nominal, ordinal) variable:
– Black Gold has 3 1st places, 2 2nd places, and 1 3rd
place
– This is a frequency distribution but not a single number
– Battleship’s mode?
• We have the option to combine categories to
create new simpler distributions (may be necessary
for cross-tabs to give valid results)
– e.g. relabel 2nd and 3rd place as “not first”
– Battleship has 3 first places, and 3 not first places
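The counting, mode, and category-combining steps above can be sketched in a few lines of Python. This is only an illustration: it uses Black Gold's counts from the slide (3 firsts, 2 seconds, 1 third), and the race-by-race order of the placings is an assumption, since the slide gives only the totals.

```python
from collections import Counter

# Black Gold's finishing positions, rebuilt from the counts on the slide
# (3 firsts, 2 seconds, 1 third). The race-by-race order is an assumption.
placings = ["1st", "1st", "1st", "2nd", "2nd", "3rd"]

# Frequency distribution: count identical values of the ordinal variable.
freq = Counter(placings)

# The mode is the most frequent category.
mode, mode_count = freq.most_common(1)[0]

# Combine categories: relabel 2nd and 3rd place as "not first".
collapsed = Counter("first" if p == "1st" else "not first" for p in placings)

print(dict(freq))
print(mode, mode_count)
print(dict(collapsed))
```

Note that the mode and the counts are valid summaries for an ordinal variable, while an average of the placings would assume equal spacing between positions.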
Summarizing One Variable
(“univariate” statistics)
• Continuous (ratio, interval) variables can be
summarized using their actual values, rather
than just counting the identical values.
– e.g., Black Gold’s average time in 6 races is 164.5
seconds per race
– Compute new, more useful variables: e.g., to get
something closer to our intuitive notion of speed, we
can divide two continuous variables, such as time
by furlongs, and average over races to get 15.35
seconds per furlong.
Visualization to explore data, e.g.
[Figure: Average speed (furlongs/second) by race distance, for Battleship, BlackGold, and Bushranger, with separate series for 8 furlong races and 16 furlong races; vertical axis from 0.062 to 0.067.]
Cross Industry Standard
Process for Data Mining
(CRISP-DM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Cross Industry Standard
Process for Data Mining
(CRISP-DM)
• Business Understanding
• Data Understanding
– W2: Measurement Scales
• Data Preparation
– W8: Missing Value
• Modeling
– Supervised Learning (Continuous vs. Binary): W3: Multiple Linear Regression; W4: Logistic Regression
– Unsupervised Learning: W10: Cluster Analysis + PCA
• Evaluation
– W5: Model Assessment
• Deployment
• Other weekly topics shown on the diagram: W6: Non-linear Effect; W7: Tree models; W8: Neural Network; W9: MVP; W11: Segmentation
Two-way Contingency Tables
• A.K.A.: Cross-Tabulation or Cross-Tab
• Purpose: to analyze the relation between two
categorical variables
• Continuous variables (interval or ratio
measures) will need to be converted to
categories (“binning”)
A Two-way Contingency Table
A student suspects that blue-eyed students got higher
marks in the midterm.
The class has 50 students in total, with 12
having blue eyes and 38 not.
10 students got 80 or above in the midterm.
Half of those 10 students are blue-eyed!
A Two-way Contingency Table
Counts:
                        Blue-eyed   Non-blue-eyed   TOTAL
Midterm: 80 or above    5           5               10
Midterm: 79 or below    7           33              40
TOTAL                   12          38              50

Column percentages:
                        Blue-eyed   Non-blue-eyed
Midterm: 80 or above    41.7%       13.2%
Midterm: 79 or below    58.3%       86.8%
TOTAL                   100%        100%
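The percentages in the second table are column percentages: each count divided by the total of its eye-colour column. A minimal Python sketch of that arithmetic follows. The counts come from the slide; the variable names are illustrative.

```python
# Counts from the slide: rows are midterm bands, columns are eye colour.
counts = {
    "80 or above": {"Blue-eyed": 5, "Non-blue-eyed": 5},
    "79 or below": {"Blue-eyed": 7, "Non-blue-eyed": 33},
}
columns = ["Blue-eyed", "Non-blue-eyed"]

# Column totals: how many students have each eye colour.
col_totals = {c: sum(row[c] for row in counts.values()) for c in columns}

# Column percentages: share of each eye-colour group in each midterm band.
col_pct = {
    band: {c: 100 * row[c] / col_totals[c] for c in columns}
    for band, row in counts.items()
}

for band, row in col_pct.items():
    print(band, {c: round(p, 1) for c, p in row.items()})
```

Comparing within columns answers the student's question directly: the share scoring 80 or above is much higher among blue-eyed students than among the rest, although with only 12 blue-eyed students this alone does not show the difference is more than chance.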
Beyond Eyeballing: Extending to more levels
Jack & Jill Data Set
Family size and categorical spending information on 557 families
• The spending categories (Low, Medium, High) have 188, 183, and 186 families respectively.
• The family size categories (1, 2, 3+ children) have 187, 270 and 100 families respectively.
• 9 possible categories
If there is no relation, what count would we expect in each cell (???)?

                     Family Size
Spending Category    1 Child   2 Children   3+ Children   TOTAL
Low                  ???       ???          ???           188
Medium               ???       ???          ???           183
High                 ???       ???          ???           186
TOTAL                187       270          100           557
Expected counts if no relation
The proportion of households of a specific family size and a certain
spending level
= the proportion of households having the specific family size
× the proportion of households having the certain spending level

                      Family Size
Spending Category     1      2      3+     Marginal Proportion
Low                   .113   .163   .061   .337
Medium                .111   .160   .059   .329
High                  .112   .162   .060   .334
Marginal Proportion   .336   .485   .180   1.00
Expected Counts if no relation
                     Family Size
Spending Category    1      2      3+     TOTAL
Low                  63     91     34     188
Medium               61     89     33     183
High                 62     90     33     186
TOTAL                187    270    100    557
(cell counts are rounded to whole families, so they may not add exactly to the totals)
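The expected counts can be reproduced from the row and column totals alone: under no relation, each cell is (row total × column total) / grand total. The following Python sketch uses the totals from the slides; the dictionary layout and names are illustrative.

```python
# Marginal totals from the Jack & Jill slides.
row_totals = {"Low": 188, "Medium": 183, "High": 186}   # spending category
col_totals = {"1": 187, "2": 270, "3+": 100}            # family size
n = sum(row_totals.values())                            # 557 families

# Expected count under independence:
# n * (row proportion) * (column proportion) = row total * column total / n
expected = {
    r: {c: rt * ct / n for c, ct in col_totals.items()}
    for r, rt in row_totals.items()
}

for r, row in expected.items():
    print(r, {c: round(e, 1) for c, e in row.items()})
```

Working from the unrounded totals avoids the small discrepancies that appear when already-rounded proportions are multiplied together.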

Actual Counts
                     Family Size
Spending Category    1      2      3+     TOTAL
Low                  99     78     11     188
Medium               62     89     32     183
High                 26     103    57     186
TOTAL                187    270    100    557

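Comparing the Expected Counts
to the Actual Counts
Chi-square Test Statistic
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the actual (observed) count and $e_{ij}$ is the expected count in row $i$, column $j$ of a table with $r$ rows and $c$ columns.
Chi-square measures how much our data differ from what we’d expect
(given the hypothesis of independence)
The “p-value” is $6 \times 10^{-16}$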
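As a rough check of the statistic and p-value, the calculation can be sketched in plain Python. Two assumptions are worth flagging: the Medium row of actual counts is taken as the column totals minus the Low and High rows, and the p-value uses a closed-form upper-tail formula that holds only for 4 degrees of freedom, which is the case for a 3 × 3 table. In practice a statistics package would do this step.

```python
import math

# Actual counts (rows: Low, Medium, High spending; columns: 1, 2, 3+ children).
# Assumption: the Medium row is derived as column totals minus Low and High.
observed = [
    [99, 78, 11],
    [62, 89, 32],
    [26, 103, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability of a chi-square variable.
# This closed form is valid for df == 4 only: exp(-x/2) * (1 + x/2).
half = chi2 / 2
p_value = math.exp(-half) * (1 + half)

print(round(chi2, 2), df, p_value)
```

Under these assumptions the statistic comes out near 77 and the p-value on the order of 10^-16, in line with the value quoted on the slide.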

EXTRAPOLATE THIS SAMPLE to EVERYONE?
If there were no relation between family size and
spending in the whole country, what is the chance that
we accidentally picked a sample of 557 that is at least this far
away from the “no-relation” distribution?
The “p-value” of the “chi-square” statistic told us already:
The probability is $6 \times 10^{-16}$
A wildly unlikely coincidence!
Chi-square test may not work all the time
• Look at the expected count (under the null
hypothesis of no relation between the variables)
• If the expected count in any cell is less than 5, the
chi-square test becomes inaccurate.
• The more cells that have expected counts of less
than 5, the less accurate the test
• Most stats software will do the checking for you.
If not, it is easy to ask for the expected counts
Expected Counts if no relation
All greater than 5: Chi-square is valid!

                     Family Size
Spending Category    1      2      3+     TOTAL
Low                  63     91     34     188
Medium               61     89     33     183
High                 62     90     33     186
TOTAL                187    270    100    557
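The rule of thumb above is easy to check by hand when software does not report it. The sketch below, in Python, recomputes the expected counts from the Jack & Jill totals and counts how many cells fall below 5. The threshold of 5 is the one stated on the slide; the variable names are illustrative.

```python
# Marginal totals from the Jack & Jill slides.
row_totals = [188, 183, 186]   # Low, Medium, High spending
col_totals = [187, 270, 100]   # 1, 2, 3+ children
n = sum(row_totals)

# Expected count in each cell under the null hypothesis of no relation.
expected = [[r * c / n for c in col_totals] for r in row_totals]
cells = [e for row in expected for e in row]

smallest = min(cells)
n_small = sum(e < 5 for e in cells)   # cells that break the rule of thumb
chi_square_ok = n_small == 0

print(round(smallest, 1), n_small, chi_square_ok)
```

If some cells did fall below 5, one option mentioned earlier in the deck is to combine categories so that the remaining cells have larger expected counts.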

Of course, don’t forget the STORY!
Column percentages, with the share expected under no relation (null) in parentheses:

                     Family Size
Spending Category    1 Child        2 Children     3+ Children    % of TOTAL
Low                  52.9% (34%)    28.9% (34%)    11% (34%)      34%
Medium               33.2% (33%)    33.0% (33%)    32% (33%)      33%
High                 13.9% (33%)    38.1% (33%)    57% (33%)      33%
TOTAL                100%           100%           100%           100%
W2 Tutorial Exercises
Chapter 3 + W02 Tutorial Exercise
• Reading Data (Section 3.3)
– What are packages and functions in R?
– What is FACTOR?
– How do .csv, .r and .Rdata differ from one
another?
• Summarizing Data Variables (Section 3.4)
– Why would we care about “small” levels in a
category?
– What is correlation?
• Visualizing Data Variables (Section 3.5)
– When should we do equal-width vs. equal-frequency
binning?
Binning continuous variables
• E.g. converting % grades to letter grades
• E.g., we could put the ratio variable “seconds/furlong”
into ordinal categories
– E.g., put Black Gold seconds-per-furlong variable into three
categories:
• FAST less than 15 seconds/furlong,
• MEDIUM between 15 and 15.5 seconds/furlong,
• SLOW more than 15.5 seconds/furlong:

SPEED      Category
15.3375    MEDIUM
14.75      FAST
15.275     MEDIUM
15.4625    MEDIUM
15.5625    SLOW
15.56875   SLOW

• and summarize counts of identical values as frequency
distributions:
• 1 FAST, 3 MEDIUM, 2 SLOW
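The binning rule above can be written as a small function. This Python sketch uses the six seconds-per-furlong values and the cut-offs from the slide. How to treat a value exactly equal to 15 or 15.5 is an assumption (both are placed in MEDIUM here), since the slide does not say.

```python
from collections import Counter

def speed_category(sec_per_furlong):
    """Map seconds per furlong to an ordinal speed category."""
    if sec_per_furlong < 15:
        return "FAST"
    # Assumption: values exactly equal to 15 or 15.5 count as MEDIUM.
    if sec_per_furlong <= 15.5:
        return "MEDIUM"
    return "SLOW"

# Black Gold's seconds per furlong in six races (from the slide).
speeds = [15.3375, 14.75, 15.275, 15.4625, 15.5625, 15.56875]

categories = [speed_category(s) for s in speeds]

# Frequency distribution of the new ordinal variable.
freq = Counter(categories)

print(categories)
print(dict(freq))
```

The cut-offs here are fixed by hand, as on the slide; they do not correspond to equal-width or equal-frequency bins.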
• Creating cross-tabs (Section 3.6)
– What is a chi-square test?
W02 Tutorial Exercise
• Answer all the questions in “W2 Tutorial
Exercise.docx” (posted on Canvas)
• For the last question, 3-4 sentences for each
table will be sufficient. Describe the pattern,
note whether the p-values suggest that the
patterns are meaningful, and whether the
results make intuitive sense or are surprising.
