Data Science Tut
Why are we studying this
Accounting
• Public accounting firms use statistical sampling procedures
when conducting audits for their clients.
Economics
• Economists use statistical information in making forecasts
about the future of the economy or some aspect of it.
Finance
• Financial advisors use price-earnings ratios and dividend
yields to guide their investment advice.
Applications in Business and Economics
Marketing
• Electronic point-of-sale scanners at retail checkout counters
are used to collect data for a variety of marketing research
applications.
Production
• A variety of statistical quality control charts are used to
monitor the output of a production process.
Information Systems
• A variety of statistical information helps administrators
assess the performance of computer networks.
“Statistics never lie, they simply tell a different
story depending on the math of the tellers”
Data and Data Sets
• Data are the facts and figures collected,
analyzed, and summarized for presentation
and interpretation.
Elements, Variables, and
Observations
• Elements are the entities on which data are
collected.
• A variable is a characteristic of interest for the
elements.
• The set of measurements obtained for a particular
element is called an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data
set is the number of elements multiplied by the
number of variables.
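As a minimal sketch of these definitions, the snippet below builds a tiny data set and counts its elements, variables, and data values. The company names and figures are hypothetical and only serve the illustration.

```python
# Hypothetical data set: each dict holds the observation for one element.
data_set = [
    {"element": "Company A", "exchange": "NYSE", "share_price": 42.5},
    {"element": "Company B", "exchange": "NASDAQ", "share_price": 18.0},
    {"element": "Company C", "exchange": "NYSE", "share_price": 77.25},
]

# Variables are the characteristics recorded for every element.
variables = [key for key in data_set[0] if key != "element"]

n_elements = len(data_set)        # one observation per element
n_variables = len(variables)
n_data_values = n_elements * n_variables

print(n_elements, n_variables, n_data_values)
```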
Data, Data Sets, Elements, Variables, and Observations
[Figure: example data set with the variables labeled]
Basic Concepts
Population
• All entities of interest in a study
Sample
• Subset of the population
• Often randomly chosen
• Can follow different schemes
• Preferably representative of the population
Types of Data
• Numerical – arithmetic can be performed (age, salary)
• Categorical – takes values in categories (gender, state)
– Nominal – names or labels
– Ordinal – ordered data based on factors such as size or quality
– Dummy variable – 0-1 coded for categories
• Discrete and Continuous Data
• Time Series and Cross Sectional Data
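The 0-1 coding of a dummy variable can be sketched as follows. The list of states is a hypothetical sample, and the one-column-per-category layout is one common convention rather than the only one.

```python
# Hypothetical categorical variable (state) for four elements.
states = ["Ohio", "Texas", "Ohio", "Iowa"]

# One dummy variable per category: 1 if the element is in the category, else 0.
categories = sorted(set(states))
dummies = {
    category: [1 if state == category else 0 for state in states]
    for category in categories
}

for category, column in dummies.items():
    print(category, column)
```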
Scales of Measurement
• Scales of measurement include
– Nominal
– Ordinal
– Interval
– Ratio
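Nominal scale
• The data are labels or names used to identify an attribute of the element.
• A nonnumeric label or numeric code may be used.
Example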
Students of a university are classified by the school in which
they are enrolled using a nonnumeric label such as Business,
Humanities, Education, and so on.
Alternatively, a numeric code could be used for the school
variable (e.g. 1 denotes Business, 2 denotes Humanities, 3
denotes Education, and so on).
Scales of Measurement
Ordinal scale
• The data have the properties of nominal data and the order
or rank of the data is meaningful.
• A nonnumeric label or numeric code may be used.
Example
Students of a university are classified by their class standing
using a nonnumeric label such as Freshman, Sophomore,
Junior, or Senior.
Alternatively, a numeric code could be used for the class
standing variable (e.g. 1 denotes Freshman, 2 denotes
Sophomore, and so on).
Scales of Measurement
Interval scale
• The data have the properties of ordinal data, and the
interval between observations is expressed in terms
of a fixed unit of measure.
• Interval data are always numeric.
Example
Melissa has an SAT score of 1985, while Kevin has an
SAT score of 1880. Melissa scored 105 points more
than Kevin.
Scales of Measurement
Ratio scale
• The data have all the properties of interval data, and the ratio of two values is meaningful.
• Ratio data are always numeric.
• A zero value is included in the scale.
Example
The price of a book at a retail store is $200, while the price of the same book sold
online is $100. The ratio property shows that the retail store charges twice the online
price.
Categorical and Quantitative Data
• Data can be further classified as being
categorical or quantitative.
• The statistical analysis that is appropriate
depends on whether the data for the variable
are categorical or quantitative.
• In general, there are more alternatives for
statistical analysis when the data are
quantitative.
Categorical Data
• Labels or names are used to identify an
attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of
measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather
limited
Quantitative Data
• Quantitative data indicate how many or how
much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful
for quantitative data.
Scales of Measurement
[Diagram: Data are either Categorical (numeric or nonnumeric) or Quantitative (numeric)]
Cross-Sectional Data
Cross-sectional data are collected at the same or
approximately the same point in time.
Example
Data detailing the number of building permits
issued in November 2013 in each of the counties
of Ohio.
Time Series Data
Time series data are collected over several time periods.
Example
Data detailing the number of building permits issued in Lucas
County, Ohio in each of the last 36 months.
Time Series Data
Graph of Time Series Data
Household Data
Types of Data
Data Sources
Existing Sources
• Internal company records – almost any department
• Business database services – Dow Jones & Co.
• Government agencies - U.S. Department of Labor
• Industry associations – Travel Industry Association of
America
• Special-interest organizations – Graduate
Management Admission Council (GMAT)
• Internet – more and more firms
Data Sources
Data Available From Internal Company Records
Record: Some of the Data Available
Employee records: Name, address, social security number
Production records: Part number, quantity produced, direct labor cost, material cost
Inventory records: Part number, quantity in stock, reorder level, economic order quantity
Sales records: Product number, sales volume, sales volume by region
Credit records: Customer name, credit limit, accounts receivable balance
Customer profile: Age, gender, income, household size
Data Sources
Data Available From Selected Government Agencies
Government Agency (Web address): Some of the Data Available
Census Bureau (www.census.gov): Population data, number of households, household income
Federal Reserve Board (www.federalreserve.gov): Data on money supply, exchange rates, discount rates
Office of Mgmt. & Budget (www.whitehouse.gov/omb): Data on revenue, expenditures, debt of federal government
Department of Commerce (www.doc.gov): Data on business activity, value of shipments, profit by industry
Bureau of Labor Statistics (www.bls.gov): Customer spending, unemployment rate, hourly earnings, safety record
Data Sources
Statistical Studies – Observational
• In observational (nonexperimental) studies no
attempt is made to control or influence the variables
of interest.
• Example - Survey
• Studies of smokers and nonsmokers are
observational studies because researchers do not
determine or control who will smoke and who will
not smoke.
Data Sources
Statistical Studies – Experimental
• In experimental studies the variable of interest is first
identified. Then one or more other variables are
identified and controlled so that data can be
obtained about how they influence the variable of
interest.
• The largest experimental study ever conducted is
believed to be the 1954 Public Health Service
experiment for the Salk polio vaccine. Nearly two
million U.S. children (grades 1-3) were selected.
Data Acquisition Considerations
Time Requirement
• Searching for information can be time consuming.
• Information may no longer be useful by the time it is
available.
Cost of Acquisition
• Organizations often charge for information even when
it is not their primary business activity.
Data Errors
• Using any data that happen to be available or were
acquired with little care can lead to misleading
information.
Descriptive Statistics
• Most of the statistical information in newspapers,
magazines, company reports, and other publications
consists of data that are summarized and presented in a
form that is easy to understand.
• Such summaries of data, which may be tabular, graphical,
or numerical, are referred to as descriptive statistics.
Example
The manager of Hudson Auto would like to have a better
understanding of the cost of parts used in the engine tune-ups
performed in her shop. She examines 50 customer invoices
for tune-ups. The costs of parts, rounded to the nearest
dollar, are listed on the next slide.
Example: Hudson Auto Repair
Sample of Parts Cost ($) for 50 Tune-ups
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Tabular Summary: Frequency and
Percent Frequency
Graphical Summary: Histogram
Example: Hudson Auto
[Figure: histogram of tune-up parts cost; horizontal axis: Parts Cost ($) in classes 50-59, 60-69, 70-79, 80-89, 90-99, 100-109; vertical axis: Frequency]
Numerical Descriptive Statistics
• The most common numerical descriptive
statistic is the mean (or average).
• The mean provides a measure of the
central tendency, or central location, of the
data for a variable.
• Hudson’s mean cost of parts, based on the 50
tune-ups studied, is $79 (found by summing
the 50 cost values and then dividing by 50).
Statistical Inference
Population: The set of all elements of interest in a
particular study.
Sample: A subset of the population.
Statistical inference: The process of using data
obtained from a sample to make estimates and test
hypotheses about the characteristics of a population.
Census: Collecting data for the entire population.
Sample survey: Collecting data for a sample.
Process of Statistical Inference
Analytics
Frequency Distribution
Example: Marada Inn
• Guests staying at Marada Inn were asked to rate the quality of their
accommodations as being excellent, above average, average, below average, or
poor.
• The ratings provided by a sample of 20 guests are:
Frequency Distribution
• Example: Marada Inn
Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
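A relative frequency is a class frequency divided by the total number of observations, and a percent frequency is that value times 100. The sketch below applies this to the Marada Inn frequencies above; the variable names are only illustrative.

```python
# Frequencies from the Marada Inn rating table.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequency.values())

# Relative frequency = class frequency / total; percent frequency = 100 * relative.
relative = {rating: count / total for rating, count in frequency.items()}
percent = {rating: 100 * share for rating, share in relative.items()}

for rating in frequency:
    print(f"{rating:14s} {frequency[rating]:2d} {relative[rating]:.2f} {percent[rating]:.0f}%")
```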
Relative Frequency Distribution
Histogram
• A histogram is a graphical representation of
the distribution of data in which the data are
grouped into bins.
By summarizing information, descriptive statistics speed up and
simplify comprehension of a group’s characteristics.
Sample vs. Population
• The “population” in statistics includes all members of a defined group that we are
studying or collecting information on for data-driven decisions.
• A parameter is a characteristic of a population.
Mean
• The mean is the arithmetic average of the scores.
• The calculation of the mean considers both the number of scores and their value.
• The formula for the mean of the variable X is: $\bar{x} = \frac{\sum x_i}{n}$
Pros:
1. It is simple and rigidly defined.
2. It is not based on the position in the series.
3. The arithmetic average is easy to understand even if some of the details of the data are lacking.
Cons:
1. The mean can be badly affected by outliers.
2. It can give misleading conclusions when the data are skewed.
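Sample Mean $\bar{x}$
$\bar{x} = \frac{\sum x_i}{n}$
where: $\sum x_i$ = sum of the values of the n observations
n = number of observations in the sample
Population Mean $\mu$
$\mu = \frac{\sum x_i}{N}$
where: N = number of observations in the population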
Sample Mean 𝑥̅
• Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a college
town. The monthly rents for these apartments are listed below.
545 715 530 690 535 700 560 700 540 715
540 540 540 625 525 545 675 545 550 550
565 550 625 550 550 560 535 560 565 580
550 570 590 572 575 575 600 580 670 565
700 585 680 570 590 600 649 600 600 580
670 615 550 545 625 635 575 650 580 610
610 675 590 535 700 535 545 535 530 540
Sample Mean $\bar{x}$
$\bar{x} = \frac{\sum x_i}{n} = \frac{41{,}356}{70} = 590.80$
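The sample mean for the apartment rents can be checked with a few lines of Python. To keep the listing short, the 70 rents are stored as value-to-count pairs taken from the data above; that storage choice is just a convenience of this sketch.

```python
# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

n = len(rents)                 # number of observations in the sample
total_rent = sum(rents)        # sum of the n values
sample_mean = total_rent / n

print(n, total_rent, round(sample_mean, 2))
```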
Median
• The median is the middle point in an ordered distribution, with an equal number of
scores lying on each side of it.
• In skewed data, the median is a better measure than the mean.
Pros:
• The median can be calculated in all distributions.
Cons:
• It is not based on all the values.
Median
• For an odd number of observations, the median is the middle value:
26 18 27 12 14 27 19 (7 observations)
12 14 18 19 26 27 27 (in ascending order)
Median = 19
Median
• For an even number of observations, the median is the average of the two middle values:
26 18 27 12 14 27 30 19 (8 observations)
12 14 18 19 26 27 27 30 (in ascending order)
Median = (19 + 26)/2 = 22.5
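The odd and even cases can be written as one small function. This is a sketch; the function name is illustrative, and Python's built-in statistics.median follows the same rule.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: take the single middle value.
        return ordered[middle]
    # Even number of observations: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([26, 18, 27, 12, 14, 27, 19]))       # 7 observations
print(median([26, 18, 27, 12, 14, 27, 30, 19]))   # 8 observations
```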
Median
• Example: Apartment Rents
Averaging the 35th and 36th data values:
Median = (575 + 575)/2 = 575
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
Mode
• The mode is the highest point of the frequency distribution curve.
• It is not affected by extreme values, and it is an actual value of an
important part of the series.
1. The mode gives the most likely experience rather than the “typical”
or “central” experience.
2. In symmetric, single-peaked distributions, the mean, median, and mode
are the same.
3. In skewed data, the mean and median lie further toward the skew
than the mode.
Mode
• Example: Apartment Rents
550 occurred most frequently (7 times)
Mode = 550
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
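A short sketch of finding the mode by counting how often each rent occurs. The rents are stored as value-to-count pairs taken from the list above, purely to keep the listing compact.

```python
from collections import Counter

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

# Counter tallies the occurrences; most_common(1) returns the most frequent value.
mode_value, mode_frequency = Counter(rents).most_common(1)[0]

print(mode_value, mode_frequency)
```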
MEASURES OF DISPERSION
• RANGE
• INTER-QUARTILE RANGE
Range
The range of a set of data is the difference between the largest and the
smallest number in the set.
For example, consider the following set:
40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, and 50
To find the range you would take the largest number, 55, and
subtract the smallest number, 26: Range = 55 − 26 = 29.
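The same calculation in Python, as a minimal sketch using the set from the example:

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

largest = max(data)
smallest = min(data)
data_range = largest - smallest   # range = largest value - smallest value

print(largest, smallest, data_range)
```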
Interquartile Range
• The interquartile range (IQR) is a measure of variability, based on dividing a
data set into quartiles.
• The quartiles Q1, Q2 (the median), and Q3 divide a rank-ordered data set
into four equal parts, and IQR = Q3 − Q1.
Variance
• A measure of how data points differ from the mean.
• The average of the squared deviations.
Why squared deviations?
Sum of squared deviations: $\sum (x - \bar{x})^2$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard Deviation
• A measure of variation of scores about the mean.
Why standard deviation? Why not variance?
Population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
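The sketch below computes the variance and standard deviation step by step from the squared deviations and compares the results with Python's statistics module. The data are borrowed from the range example earlier in the deck, which is an arbitrary choice for illustration. One thing it makes visible is that the standard deviation is in the same units as the data, while the variance is in squared units.

```python
import math
import statistics

data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
n = len(data)
mean = sum(data) / n

# Squared deviations of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample formulas divide by n - 1; population formulas divide by N.
sample_variance = sum(squared_deviations) / (n - 1)
sample_std = math.sqrt(sample_variance)
population_variance = sum(squared_deviations) / n
population_std = math.sqrt(population_variance)

print(round(sample_variance, 2), round(sample_std, 2))
print(round(population_variance, 2), round(population_std, 2))
```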
Example
The monthly salaries of all
employees in a premier hotel
in Bangalore have been
reduced to half of their
existing salaries. The previous
salaries of a sample of size 10 are
given as (in units of Rs 1,000):
{22, 38, 16, 46, 52, 30, 18, 16,
20, 12}
What would be the new mean
and standard deviation of the
salaries, and how would they
compare to the old salaries?
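One way to explore this question is to halve every salary and recompute both statistics. The sketch below uses the sample standard deviation (n − 1 in the denominator), which is an assumption since the slide does not say which version is intended.

```python
import statistics

# Previous salaries of the sample, in units of Rs 1,000.
old_salaries = [22, 38, 16, 46, 52, 30, 18, 16, 20, 12]
new_salaries = [salary / 2 for salary in old_salaries]   # every salary is halved

old_mean = statistics.mean(old_salaries)
new_mean = statistics.mean(new_salaries)
old_std = statistics.stdev(old_salaries)
new_std = statistics.stdev(new_salaries)

# Multiplying every value by a constant multiplies the mean and the
# standard deviation by that same constant.
print(old_mean, new_mean)
print(round(old_std, 2), round(new_std, 2))
```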
Example
The monthly salaries of all
executives in an analytics
company in Bangalore have
been reduced by 40,000
rupees. The previous salaries of
a sample of size 10 are given as
(in units of Rs 1,000):
{220, 150, 120, 160, 145, 152,
188, 176, 120, 124}
What would be the new mean
and standard deviation of the
salaries, and how would they
compare to the old salaries?
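The same approach works here: subtract 40 (that is, Rs 40,000) from every salary and recompute. As before, using the sample standard deviation is an assumption of this sketch.

```python
import statistics

# Previous salaries of the sample, in units of Rs 1,000.
old_salaries = [220, 150, 120, 160, 145, 152, 188, 176, 120, 124]
new_salaries = [salary - 40 for salary in old_salaries]   # Rs 40,000 cut for everyone

old_mean = statistics.mean(old_salaries)
new_mean = statistics.mean(new_salaries)
old_std = statistics.stdev(old_salaries)
new_std = statistics.stdev(new_salaries)

# Subtracting a constant shifts the mean by that constant but leaves the
# spread, and therefore the standard deviation, unchanged.
print(old_mean, new_mean)
print(round(old_std, 2), round(new_std, 2))
```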
Percentile
• A percentile is the value below which a
given percentage of observations in a
population fall.
• For example, the 80th percentile is the value
at or below which at least 80% of the
observations may be found, and at or above
which at least 20% of the observations
may be found.
Percentile
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
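Several percentile conventions exist, and the slides do not say which one is intended. The sketch below assumes the nearest-rank rule (take the smallest value with at least p% of the observations at or below it); other conventions interpolate between neighbouring values and can give a slightly different answer.

```python
import math

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile (an assumed convention): p is between 0 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based position in the ordered data
    return ordered[max(rank, 1) - 1]


p80 = percentile_nearest_rank(rents, 80)
share_at_or_below = sum(rent <= p80 for rent in rents) / len(rents)
share_at_or_above = sum(rent >= p80 for rent in rents) / len(rents)
print(p80, share_at_or_below, share_at_or_above)
```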
Crosstabulation
• The left and top margin labels define the classes for the two variables.
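A crosstabulation can be sketched by counting pairs of classes. The six home records below are hypothetical and are not the Finger Lakes Homes sample; only the two variables (price class and home style) follow the slides.

```python
from collections import Counter

# Hypothetical (style, price) records, for illustration only.
homes = [
    ("Colonial", 310_000), ("Log", 180_000), ("A-Frame", 265_000),
    ("Colonial", 240_000), ("Log", 199_000), ("Split-Level", 275_000),
]


def price_class(price):
    """Assign a price to one of the two classes used on the left margin."""
    return "< $250,000" if price < 250_000 else ">= $250,000"


# Each cell counts the homes falling in one (price class, home style) pair.
crosstab = Counter((price_class(price), style) for style, price in homes)
row_totals = Counter(price_class(price) for _, price in homes)
column_totals = Counter(style for style, _ in homes)

for cell, count in sorted(crosstab.items()):
    print(cell, count)
```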
Crosstabulation
• Only three homes in the sample are an A-Frame style and priced
at $250,000 or more.
Crosstabulation
• Example: Finger Lakes Homes
Crosstabulation: Row or Column
Percentages
Crosstabulation: Row Percentages
Crosstabulation: Column Percentages
• Example: Finger Lakes Homes
Crosstabulation: Simpson’s Paradox
Side-by-Side Bar Chart
[Figure: side-by-side bar chart by Home Style (Colonial, Log, Split-Level, A-Frame)]
Stacked Bar Chart
• A stacked bar chart is another way to display and compare two variables on the
same display.
• If percentage frequencies are displayed, all bars will be of the same height (or
length), extending to the 100% mark.
[Figure: stacked bar chart, Finger Lake Homes; vertical axis: Frequency; horizontal axis: Home Style (Colonial, Log, Split, A-Frame); bars split by price class, below $250,000 and $250,000 or more]
[Figure: stacked bar chart, Finger Lake Homes; vertical axis: Percentage Frequency; same home styles and price classes]
Data Visualization: Best Practices in Creating Effective Graphical
Displays
Creating Effective Graphical Displays
Choosing the Type of Graphical Display
Box Plot
• Box Plot (aka Box and Whisker plot) is a graphical
representation of numerical data that can be used to
understand the variability of the data and the existence of
outliers.
• The whiskers of the box plot extend to Q1 – 1.5 IQR and Q3 +
1.5 IQR; observations beyond these two limits are potential
outliers.
[Figure: box plot annotated with 1st Quartile, Median, 3rd Quartile, and 3rd Quartile + 1.5 IQR]
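The limits of a box plot can be sketched for the apartment rents used earlier in the deck. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" method of Python's statistics.quantiles, so the quartiles may differ slightly from values computed under another convention.

```python
import statistics

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

# Three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(rents, n=4, method="inclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations beyond the limits are flagged as potential outliers.
potential_outliers = [rent for rent in rents if rent < lower_limit or rent > upper_limit]

print(q1, q2, q3, iqr)
print(lower_limit, upper_limit, potential_outliers)
```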
Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
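The skewness formula translates directly into code. The sketch below applies it to the apartment rents from earlier in the deck, with s taken as the sample standard deviation; a positive result indicates a longer tail to the right.

```python
import statistics

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

n = len(rents)
mean = statistics.mean(rents)
s = statistics.stdev(rents)   # sample standard deviation

# Skewness = n / ((n - 1)(n - 2)) * sum of cubed standardized deviations.
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in rents)

print(round(skewness, 2))
```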
Distribution Shape: Skewness
[Figure: relative frequency histogram of a distribution with Skewness = 0; vertical axis: Relative Frequency]
Distribution Shape: Skewness
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
Z-scores
• The z-score is often called the standardized
value.
• It denotes the number of standard deviations
a data value $x_i$ is from the mean:
$z_i = \frac{x_i - \bar{x}}{s}$
• Excel’s STANDARDIZE function can be used to
compute the z-score.
Z-scores
• An observation’s z-score is a measure of the
relative location of the observation in a data
set.
• A data value less than the sample mean will
have a z-score less than zero.
• A data value greater than the sample mean
will have a z-score greater than zero.
• A data value equal to the sample mean will
have a z-score of zero.
z-Scores
• Example: Apartment Rents
• z-Score of Smallest Value (525)
$z_i = \frac{x_i - \bar{x}}{s} = \frac{525 - 590.80}{54.74} = -1.20$
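The z-score calculation can be reproduced for every rent at once. This sketch uses the sample standard deviation for s, matching the value 54.74 used above.

```python
import statistics

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

mean = statistics.mean(rents)
s = statistics.stdev(rents)   # sample standard deviation

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in rents]

print(round(s, 2))
print(round(min(z_scores), 2), round(max(z_scores), 2))
```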
Chebyshev’s Theorem
• At least $(1 - 1/z^2)$ of the items in any data set
will be within z standard deviations of the
mean, where z is any value greater than 1.
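A small sketch of the bound in Chebyshev's theorem for a few values of z; the function name is illustrative.

```python
def chebyshev_lower_bound(z):
    """Minimum share of items within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem applies only for z greater than 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(z, round(chebyshev_lower_bound(z), 4))
```

For z = 2 the bound is 0.75, so at least 75% of the items in any data set lie within two standard deviations of the mean.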
Empirical Rule
• When the data are believed to approximate a bell-shaped
distribution, the empirical rule can be used to determine the
percentage of data values that must be within a
specified number of standard deviations of the
mean.
[Figure: bell-shaped curve; 68.26% of values lie within µ ± 1σ, 95.44% within µ ± 2σ, and 99.72% within µ ± 3σ]
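The empirical-rule percentages come from the normal distribution. As a sketch, the share of a normal distribution within k standard deviations of the mean can be computed with the error function. The results agree with the percentages above to within rounding (the figures of 68.26%, 95.44%, and 99.72% appear to come from rounded table values).

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{100 * share_within(k):.2f}%")
```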
Outliers
• A value or observation lying outside the norm
• A value may not be an outlier on any single variable, yet the
combination of variables may be extreme
• Outliers are sometimes eliminated for analysis, but this is not
always a good idea
• It is good to run the analysis both with and without
them
Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered
an outlier.
• It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set
Empirical Rule
• Example: Apartment Rents
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data
set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
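The |z| > 3 criterion can be applied mechanically. The sketch below standardizes the apartment rents with the sample standard deviation and lists any value whose z-score exceeds 3 in absolute value.

```python
import statistics

# Monthly rents from the apartment example, stored as rent: number of apartments.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in rent_counts.items() for _ in range(count)]

mean = statistics.mean(rents)
s = statistics.stdev(rents)

# Flag values whose z-score is below -3 or above +3.
outliers = [x for x in rents if abs((x - mean) / s) > 3]

print(outliers)   # an empty list means no value meets the criterion
```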