
BUS445

Customer Analytics
Data Understanding
BUS445 D200 Fall 2023
Roadmap for Today

Measurement Scales
• In-class Quiz
• Analyzing horse racing performances

Data Understanding Tutorial
• Case: Jack & Jill
Intrinsic Scale of an Attribute

• How fast a horse goes is an intrinsic attribute
of a horse. Intuitively, think “speed”.
– What is the measurement scale of this attribute?
• We can measure this attribute at lower levels
– Position: scale?
– Time behind winner: scale?
• What is the disadvantage of measuring at
lower levels?
• Any advantages?
HIERARCHY
                          INTRINSIC SCALE OF THE ATTRIBUTE
MAY BE MEASURED AS…       NOMINAL   ORDINAL   INTERVAL   RATIO
NOMINAL                      Y         Y         Y         Y      (LESS INFORMATION)
ORDINAL                      N         Y         Y         Y
INTERVAL                     N         N         Y         Y
RATIO                        N         N         N         Y      (MORE INFORMATION)

IF YOU DO MEASURE IN THE RED AREA,
REMEMBER THAT YOU ARE MAKING ASSUMPTIONS
Summarizing One Variable
(“univariate” statistics)
• Legit to summarize COUNTS of identical values of
a categorical (nominal, ordinal) variable:
– Black Gold has 3 1st places, 2 2nd places, and 1 3rd place
– This is a frequency distribution, not a single number
– Battleship’s mode?
• We have the option to combine categories to
create new, simpler distributions (may be necessary
for cross-tabs to give valid results),
e.g. relabel 2nd and 3rd place as “not first”
– Battleship has 3 first places and 3 not-first places
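A minimal Python sketch of the two ideas above: counting identical values of an ordinal variable, then collapsing categories. The 3/2/1 counts come from the slide. The order of the races and the variable names are assumptions made for illustration.

```python
from collections import Counter

# Finishing positions over six races (ordinal data).
# Counts follow the slide (three 1st, two 2nd, one 3rd); the order is assumed.
positions = ["1st", "1st", "1st", "2nd", "2nd", "3rd"]

# Frequency distribution: how often each identical value occurs.
freq = Counter(positions)

# The mode is the most frequent category.
mode, mode_count = freq.most_common(1)[0]

# Combine categories: relabel 2nd and 3rd place as "not first".
collapsed = Counter("first" if p == "1st" else "not first" for p in positions)

print(dict(freq))
print(mode)
print(dict(collapsed))
```

The frequency distribution keeps one count per category, while the mode reduces it to a single label.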
Summarizing One Variable
(“univariate” statistics)
• Continuous (ratio, interval) variables can be
summarized using their actual values, rather
than just counting the identical values.
– e.g., Black Gold’s average time in 6 races is 164.5
seconds per race
– Compute new, more useful variables: e.g., to get
something closer to our intuitive notion of speed, we
can divide one continuous variable by another, such
as time by furlongs, and average over races to get
15.35 seconds per furlong.
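As a rough illustration of deriving a new ratio variable, the sketch below divides time by distance for each race and then averages. The race records are HYPOTHETICAL values chosen for the example, not the horses' actual results.

```python
from statistics import mean

# HYPOTHETICAL race records: (distance in furlongs, finishing time in seconds).
races = [(8, 120.0), (8, 124.0), (16, 248.0)]

# Summarize a continuous variable directly: average time per race.
avg_time = mean(time for _, time in races)

# Derived variable: seconds per furlong for each race, then its average.
sec_per_furlong = [time / distance for distance, time in races]
avg_sec_per_furlong = mean(sec_per_furlong)

print(avg_time)
print(sec_per_furlong)
print(round(avg_sec_per_furlong, 2))
```

Averaging the per-race ratios treats each race equally. Dividing total time by total furlongs would weight longer races more heavily, which is a different summary.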
Visualization to explore data, e.g.
Average speed (furlongs/second) by race distance
[Chart: average speed in furlongs/second (vertical axis, 0.062 to 0.067) for Battleship, BlackGold, and Bushranger, with separate series for 8 furlong races and 16 furlong races]
Cross Industry Standard
Process for Data Mining
(CRISP-DM)
[Diagram: the CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
Cross Industry Standard
Process for Data Mining
(CRISP-DM)
[Diagram: the CRISP-DM phases annotated with course weeks]
• Business Understanding
• Data Understanding: W2 Measurement Scales
• Data Preparation: W8 Missing Value
• Modeling, Supervised Learning (Continuous vs. Binary): W3 Multiple Linear Regression, W4 Logistic Regression, W6 Non-linear Effect, W7 Tree models, W8 Neural Network
• Modeling, Unsupervised Learning: W10 Cluster Analysis + PCA, W11 Segmentation
• Evaluation: W5 Model Assessment
• W9: MVP
• Deployment
Two-way Contingency Tables

• A.K.A.: Cross-Tabulation or Cross-Tab
• Purpose: to analyze the relation between two
categorical variables
– Continuous variables (interval or ratio
measures) will need to be converted to
categories (“binning”)
A Two-way Contingency Table
A student suspects that blue-eyed students got higher
marks in the midterm.

The class has 50 students in total, with 12
having blue eyes and 38 not.
10 students got 80 or above in the midterm.

Half of the 10 students are blue-eyed!


                        Blue-eyed   Non-blue-eyed   TOTAL
Midterm: 80 or above        5             5           10
Midterm: 79 or below        7            33           40
TOTAL                      12            38           50

                        Blue-eyed   Non-blue-eyed
Midterm: 80 or above      41.7%         13.2%
Midterm: 79 or below      58.3%         86.8%
TOTAL                     100%          100%
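The column percentages above can be reproduced with a short Python sketch. The helper name `column_percentages` is an illustrative assumption and not part of any course package. Each cell is divided by its column total, so every column sums to 100%.

```python
def column_percentages(table):
    """Convert a table of counts (list of rows) to percentages of each column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [100.0 * cell / total for cell, total in zip(row, col_totals)]
        for row in table
    ]

# Rows: midterm 80 or above, midterm 79 or below.
# Columns: blue-eyed, non-blue-eyed.
counts = [[5, 5], [7, 33]]
pct = column_percentages(counts)

for row in pct:
    print([round(value, 1) for value in row])
```

Comparing the two columns row by row is the “eyeballing” step: 5 of 12 blue-eyed students scored 80 or above, versus 5 of 38 non-blue-eyed students.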
Beyond Eyeballing: Extending to more levels
Jack & Jill Data Set
Family size and categorical spending information on 557 families
• The spending categories have 188, 183, and 186 families each.
• The family size categories have 187, 270 and 100 families each.
• 9 possible categories
If there is no relation, what count belongs in each ??? cell?

                                Family Size
Spending Category     1 Child   2 Children   3+ Children   TOTAL
Low                     ???        ???           ???         188
Medium                  ???        ???           ???         183
High                    ???        ???           ???         186
TOTAL                   187        270           100         557
Expected counts if no relation
The proportion for households of a specific family size and a certain
spending level
= the proportion of households having the specific family size
× the proportion of households having the certain spending level

                                 Family Size
Spending Category      1 Child   2 Children   3+ Children   Marginal Proportion
Low                     .113       .163          .061             .337
Medium                  .111       .160          .059             .329
High                    .112       .162          .060             .334
Marginal Proportion     .336       .485          .180            1.00
Expected Counts if no relation
                                Family Size
Spending Category     1 Child   2 Children   3+ Children   TOTAL
Low                      63         91            34         188
Medium                   62         89            33         183
High                     62         90            33         186
TOTAL                   187        270           100         557
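A minimal Python sketch of the expected-count calculation, under the no-relation assumption: each expected count is the row total times the column total, divided by the grand total. This is the same as multiplying the two marginal proportions and scaling by 557. The function name `expected_counts` is an assumption for illustration.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts if the row and column variables are independent."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

row_totals = [188, 183, 186]   # spending: Low, Medium, High
col_totals = [187, 270, 100]   # family size: 1, 2, 3+ children

expected = expected_counts(row_totals, col_totals)
for row in expected:
    print([round(value, 1) for value in row])
```

The values are not whole numbers. The table above shows them rounded, which is why some rounded rows do not add up exactly to their totals.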


Actual Counts
                                Family Size
Spending Category     1 Child   2 Children   3+ Children   TOTAL
Low                      99         78            11         188
Medium                   62         89            32         183
High                     26        103            57         186
TOTAL                   187        270           100         557


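Comparing the Expected Counts
to the Actual Counts

Chi-square Test Statistic:
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the actual (observed) count and $e_{ij}$ the expected count in row $i$, column $j$ of a table with $r$ rows and $c$ columns.
Chi-square measures how much our data differ from what we’d expect
(given the hypothesis of independence)

The “p-value” is 6 × 10^-16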
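The statistic and its p-value can be checked with a short standard-library Python sketch using the actual counts from the Jack & Jill table. Two things here are assumptions and not taken from the slides: the function names, and the use of the closed-form chi-square tail probability that holds only for an even number of degrees of freedom (here (3 − 1) × (3 − 1) = 4). Statistical software would normally do this step for you.

```python
import math

def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count if no relation
            stat += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi2_upper_tail_even_dof(x, dof):
    """P(chi-square > x); this closed form is valid only for even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

# Actual counts: rows Low/Medium/High spending, columns 1/2/3+ children.
actual = [[99, 78, 11], [62, 89, 32], [26, 103, 57]]

stat, dof = chi_square_statistic(actual)
p_value = chi2_upper_tail_even_dof(stat, dof)
print(round(stat, 2), dof, p_value)
```

The largest contributions to the statistic come from the corner cells (Low spending with 1 child, High spending with 3+ children), where actual and expected counts differ most.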


EXTRAPOLATE THIS SAMPLE to EVERYONE?
If there were no relation between family size and
spending in the whole country, what is the chance that
we accidentally picked a sample of 557 that is so far
away from the “no-relation” distribution?

The “p-value” of the “chi-square” statistic told us
already:
The probability is 6 × 10^-16
A wildly unlikely coincidence!
Chi-square test may not work all the time

• Look at the expected counts (under the null
hypothesis of no relation between the variables)
• If the expected count in any cell is less than 5, the
chi-square test becomes inaccurate.
• The more cells that have expected counts of less
than 5, the less accurate the test

• Most stats software will do the checking for you.
If not, it is easy to ask for the expected counts
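A sketch of the check described above, applied to the two tables in this deck. The helper name `small_expected_cells` is hypothetical. The threshold of 5 is the rule of thumb stated on the slide.

```python
def small_expected_cells(observed, threshold=5.0):
    """List (row, column, expected count) for cells whose expected count is below threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n   # expected count under the null hypothesis
            if e < threshold:
                flagged.append((i, j, e))
    return flagged

blue_eyes = [[5, 5], [7, 33]]                             # midterm by eye colour
jack_jill = [[99, 78, 11], [62, 89, 32], [26, 103, 57]]   # spending by family size

print(small_expected_cells(blue_eyes))
print(small_expected_cells(jack_jill))
```

Under this rule the small blue-eyed table is flagged, because the expected count for blue-eyed students scoring 80 or above is 10 × 12 / 50 = 2.4. No cell is flagged in the Jack & Jill table.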
Expected Counts if no relation
All greater than 5
Chi-square is valid!

                                Family Size
Spending Category     1 Child   2 Children   3+ Children   TOTAL
Low                      63         91            34         188
Medium                   62         89            33         183
High                     62         90            33         186
TOTAL                   187        270           100         557


Of course, don’t forget the STORY!

                                Family Size
Spending Category     1 Child   2 Children   3+ Children   % of TOTAL
Low                    52.9%      28.9%          11%
  (null)               (34%)      (34%)         (34%)          34%
Medium                 33.2%      33.0%          32%
  (null)               (33%)      (33%)         (33%)          33%
High                   13.9%      38.1%          57%
  (null)               (33%)      (33%)         (33%)          33%
TOTAL                  100%       100%          100%          100%
W2 Tutorial Exercises
Data Understanding Tutorial
Chapter 3 + W02 Tutorial Exercise

• Reading Data (Section 3.3)
– What are packages and functions in R?
– What is FACTOR?
– How do .csv, .r and .Rdata differ from one
another?
• Summarizing Data Variables (Section 3.4)
– Why would we care about “small” levels in a
category?
– What is correlation?
Data Understanding Tutorial
• Visualizing Data Variables (Section 3.5)
– When should we do equal-width vs. equal-
frequency binning?
Binning continuous variables
• E.g. converting % grades to letter grades
• E.g., we could put the ratio variable “seconds/furlong”
into ordinal categories
– E.g., put Black Gold’s seconds-per-furlong variable into three
categories:
• FAST: less than 15 seconds/furlong,
• MEDIUM: between 15 and 15.5 seconds/furlong,
• SLOW: more than 15.5 seconds/furlong:

SPEED       Category
15.3375     MEDIUM
14.75       FAST
15.275      MEDIUM
15.4625     MEDIUM
15.5625     SLOW
15.56875    SLOW

• and summarize counts of identical values as frequency
distributions:
• 1 FAST, 3 MEDIUM, 2 SLOW
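The binning above can be sketched in Python using the six seconds-per-furlong values from the slide. The function name is an illustrative assumption. The slide also does not say which category a value of exactly 15 or 15.5 falls into, so assigning both boundaries to MEDIUM here is an assumption.

```python
from collections import Counter

def speed_category(sec_per_furlong):
    """Bin a seconds-per-furlong value into FAST / MEDIUM / SLOW."""
    if sec_per_furlong < 15.0:
        return "FAST"
    if sec_per_furlong <= 15.5:   # ASSUMPTION: exact boundary values count as MEDIUM
        return "MEDIUM"
    return "SLOW"

# Black Gold's seconds-per-furlong values from the slide.
values = [15.3375, 14.75, 15.275, 15.4625, 15.5625, 15.56875]

categories = [speed_category(v) for v in values]
counts = Counter(categories)

print(categories)
print(dict(counts))
```

Binning turns a ratio variable into an ordinal one: the counts are easy to cross-tabulate, but the exact differences between races are no longer visible.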
Data Understanding Tutorial
• Creating cross-tabs (Section 3.6)
– What is a chi-square test?
W02 Tutorial Exercise

• Answer all the questions in “W2 Tutorial
Exercise.docx” (posted on Canvas)
• For the last question, 3-4 sentences for each
table will be sufficient. Describe the pattern,
note whether the p-values suggest that the
patterns are meaningful, and whether the
results make intuitive sense or are surprising.
