CSE 385 - Data Mining and Business Intelligence - Lecture 04


DATA MINING AND BUSINESS INTELLIGENCE - LECTURE 04
Dr. Mahmoud Mounir
mahmoud.mounir@cis.asu.edu.eg
LEVELS OF MEASUREMENTS
STATISTICS
❑ DESCRIPTIVE
▪ Collecting
▪ Organizing
▪ Summarizing
▪ Presenting
❑ INFERENTIAL
▪ Generalization from samples to population
▪ Determining relations among variables
▪ Making predictions

DATA MINING - LECTURE 4 2


EXPLORING DATA
❑Cases, Variables and Levels of Measurements.



LEVELS OF MEASUREMENTS
VARIABLES
❑ Categorical (Qualitative)
▪ Ordinal: Ranking, Flight Classes, Grades
▪ Nominal: Color, Nationality, Marital Status
❑ Quantitative (Interval, Ratio)
▪ Discrete: No. of Goals, No. of Rooms, No. of Children
▪ Continuous: Height, Weight, Temperature

MEASURES OF CENTRAL TENDENCY AND DISPERSION
❑Besides summarizing data by means of tables and/or graphs, it is
useful to describe the center of a distribution. We do that with
measures of central tendency: the mode, the median and the mean.

❑To describe a distribution adequately, we also need information
about the variability, or dispersion, of the data. In other words, we
need measures of dispersion. Well-known measures of dispersion are
the range, the interquartile range, the variance and the standard
deviation. A graph that presents the variability of a distribution
well is the box plot.
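
As a minimal sketch of the measures named above, the following Python snippet uses only the standard library `statistics` module. The `scores` sample is made up for illustration and is not data from the lecture. Quartile conventions differ between textbooks and software, so the interquartile range computed here may differ slightly from a hand calculation.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 10]

# Measures of central tendency
mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # middle value of the sorted data
mean = statistics.mean(scores)      # arithmetic average

# Measures of dispersion
data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1
variance = statistics.variance(scores)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)      # sample standard deviation

print(mode, median, mean, data_range, iqr, variance, std_dev)
```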



MEASURES OF CENTRAL TENDENCY

❑ MODE

❑ MEAN (UNGROUPED DATA)

❑ MEAN (GROUPED DATA)

x =  xf
n
x = class midpoint


Age     Frequency (f)   Midpoint (x)   f*x
30-34   4               32             128
35-39   5               37             185
40-44   2               42             84
45-49   9               47             423
Total   20                             820

∑f = n = 20
∑f*x = 820
Mean = 820/20 = 41
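
The grouped-mean calculation in the table can be sketched in Python as follows. The class limits and frequencies are taken from the table; the function name `grouped_mean` and the tuple layout are illustrative choices, not part of the lecture.

```python
# Class intervals (lower limit, upper limit, frequency) from the age table.
classes = [(30, 34, 4), (35, 39, 5), (40, 44, 2), (45, 49, 9)]

def grouped_mean(classes):
    """Approximate the mean of grouped data using class midpoints."""
    n = sum(f for _, _, f in classes)
    # Each class contributes midpoint * frequency to the total.
    total = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
    return total / n

print(grouped_mean(classes))  # 41.0
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation of the mean of the raw data.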

❑ MEAN:
Balance point of the data



MEASURES OF DISPERSION OR VARIABILITY



RANGE, INTERQUARTILE RANGE AND BOX PLOT

❑ RANGE (R):

❑ INTERQUARTILE RANGE (IQR):

❑ BOX PLOT


The distribution is positively skewed



VARIANCE AND STANDARD DEVIATION

❑ VARIANCE (UNGROUPED DATA)

➢The mean is the point of balance, so we have positive and negative
deviations from the mean.
➢The deviations sum to zero. That’s why we don’t use the original
deviations, but the squared deviations.
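
The argument above can be written out as a short derivation. The notation is an assumption following common textbook usage: $x_i$ for the observations, $\bar{x}$ for the sample mean and $n$ for the sample size.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
\qquad\Longrightarrow\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\quad
s = \sqrt{s^2}
```

Squaring makes every deviation non-negative, so the sum no longer cancels to zero, and taking the square root returns the result to the original units.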


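$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}}$$
(s is the standard deviation; the variance s² is the quantity under the square root)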

❑ VARIANCE (GROUPED DATA)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n-1}}$$

x = class midpoint


$$s = \sqrt{\frac{\sum x^2 f - \frac{(\sum x f)^2}{n}}{n-1}}$$

x = class midpoint

Age     Frequency (f)   Midpoint (x)   x - Mean   (x - Mean)^2   (x - Mean)^2 f
30-34   4               32             -9         81             324
35-39   5               37             -4         16             80
40-44   2               42             1          1              2
45-49   9               47             6          36             324
Total   20                                                       730

∑f = n = 20
Mean = 820/20 = 41
∑(x - Mean)^2 f = 730
$$s = \sqrt{\frac{730}{20 - 1}} = \sqrt{38.42} \approx 6.20$$
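
As a check on the table, the sketch below computes the grouped standard deviation in two ways: from the squared deviations and from the computational form ∑x²f − (∑xf)²/n. The midpoints and frequencies come from the table; the variable names are illustrative.

```python
import math

# Class midpoints and frequencies from the age table.
midpoints = [32, 37, 42, 47]
freqs = [4, 5, 2, 9]

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Definitional form: squared deviations from the mean, weighted by frequency.
ss_def = sum((x - mean) ** 2 * f for x, f in zip(midpoints, freqs))

# Computational form: sum(x^2 f) - (sum(x f))^2 / n.
sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
ss_comp = sum_x2f - sum_xf ** 2 / n

variance = ss_def / (n - 1)
std_dev = math.sqrt(variance)
print(ss_def, ss_comp, round(variance, 2), round(std_dev, 2))
```

Both forms give a sum of squares of 730, matching the total in the last column of the table.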



Z-SCORE
❑Sometimes researchers want to know if a specific
observation is common or exceptional.
❑To answer that question, they express a score in terms of
the number of standard deviations it is removed from
the mean.
❑This number is what we call a z-score.
❑If we recode original scores into z-scores, we say that we
standardize a variable.
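
A minimal sketch of standardizing a variable, assuming the usual definition z = (x − mean) / standard deviation with the sample standard deviation. The `scores` values and the function name `z_scores` are hypothetical and used only for illustration.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value, using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical scores, for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(scores)

# After standardizing, the variable has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(statistics.mean(z), statistics.stdev(z))
```

A score equal to the mean gets a z-score of 0; positive z-scores lie above the mean and negative ones below it.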

❑ EMPIRICAL RULE “APPROXIMATION”
NORMAL DISTRIBUTION (BELL SHAPED)
▪ Approximately 68% of the data lie within one standard deviation of
the mean, that is, in the interval with endpoints x̄ ± s for samples
and μ ± σ for populations.
▪ Approximately 95% of the data lie within two standard deviations of
the mean, that is, in the interval with endpoints x̄ ± 2s for samples
and μ ± 2σ for populations.
▪ Approximately 99.7% of the data lie within three standard deviations
of the mean, that is, in the interval with endpoints x̄ ± 3s for
samples and μ ± 3σ for populations.
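
The three percentages can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The exact proportions are about 0.6827, 0.9545 and 0.9973, which the empirical rule rounds to 68%, 95% and 99.7%. The rule applies only to data that are approximately bell shaped.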



EXERCISE (1)
▪ What does the distribution of the variable look like?
▪ What is the center of the distribution?
▪ Study the variability of the distribution.
▪ Construct a box plot.
▪ What is the z-score of school #3?

EXERCISE (2)
• A relative frequency histogram for the data



EXERCISE (4)



EXERCISE (6)



EXERCISE (7)
(a) (5 Marks) The 70 highest dams in the world
have an average height of 206 meters with a
standard deviation of 35 meters. The Hoover and
Grand Coulee dams have heights of 221 and 168
meters, respectively. The Russian dams, the
Nurek and Charvak, have heights with z-scores
of +2.69 and –1.13, respectively. List the dams in
order of ascending size.
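
One possible way to approach this exercise, sketched in Python: convert the two z-scores back to heights with x = mean + z × standard deviation, then sort all four dams. The mean, standard deviation, heights and z-scores come from the exercise text; the function and variable names are illustrative.

```python
mean_height = 206  # metres, from the exercise
sd_height = 35     # metres, from the exercise

def height_from_z(z):
    """Convert a z-score back to a height in metres: x = mean + z * sd."""
    return mean_height + z * sd_height

dams = {
    "Hoover": 221,
    "Grand Coulee": 168,
    "Nurek": height_from_z(2.69),
    "Charvak": height_from_z(-1.13),
}

# Sort dam names by height, smallest first.
ascending = sorted(dams, key=dams.get)
print(ascending)
```

The same ordering follows from comparing z-scores directly, since converting to z-scores preserves the order of the heights.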



EXERCISE (8)
(2 Marks) Here are some summary statistics for the numbers of acres of
soybeans and peanuts harvested per county in Alabama in 2009, for
counties that planted those crops.
In one southern county, there were 9 thousand acres of soybeans
harvested and 3 thousand acres of peanuts harvested. Relative to its
crop, which plant had a better harvest?
