Data Science Tut

Session 1: Data representation
Rishideep Roy, IIM Bangalore

What is Statistics?
• The term statistics can refer to numerical facts
such as averages, medians, percentages, and
maximums that help us understand a variety
of business and economic situations.
• Statistics can also refer to the art and science
of collecting, analyzing, presenting, and
interpreting data.
Why are we studying this?
• Making more informed decisions, based on data.
• Can involve any discipline.
• Empirical evidence is essential for making decisions based on the past.
Applications in Business and Economics
Accounting
• Public accounting firms use statistical sampling procedures
when conducting audits for their clients.
Economics
• Economists use statistical information in making forecasts
about the future of the economy or some aspect of it.
Finance
• Financial advisors use price-earnings ratios and dividend
yields to guide their investment advice.
Applications in Business and Economics
Marketing
• Electronic point-of-sale scanners at retail checkout counters
are used to collect data for a variety of marketing research
applications.
Production
• A variety of statistical quality control charts are used to
monitor the output of a production process.
Information Systems
• A variety of statistical information helps administrators
assess the performance of computer networks.
“Statistics never lie, they simply tell a different
story depending on math of the tellers”
Data and Data Sets
• Data are the facts and figures collected,
analyzed, and summarized for presentation
and interpretation.
• All the data collected in a particular study are
referred to as the data set for the study.
Elements, Variables, and
Observations
• Elements are the entities on which data are
collected.
• A variable is a characteristic of interest for the
elements.
• The set of measurements obtained for a particular
element is called an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data
set is the number of elements multiplied by the
number of variables.
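Data, Data Sets, Elements, Variables, and Observations
Company        Stock Exchange   Annual Sales ($M)   Earnings per share ($)
Dataram        NQ               73.10               0.86
EnergySouth    N                74.00               1.67
Keystone       N                365.70              0.86
LandCare       NQ               111.40              0.33
Psychemedics   N                17.60               0.13
• The company names in the first column are the element names.
• The remaining columns (Stock Exchange, Annual Sales, Earnings per share) are the variables.
• Each row is one observation, and the whole table is the data set.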
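The counting rule above (n elements give n observations, and the number of data values is elements times variables) can be checked on the five-company table. This is a minimal sketch; the dictionary layout and the key names (`exchange`, `annual_sales_musd`, `eps_usd`) are illustrative assumptions, not part of the slides.

```python
# Each company is an element; each key in the inner dict is a variable.
# Key names are hypothetical labels for the three table columns.
data_set = {
    "Dataram":      {"exchange": "NQ", "annual_sales_musd": 73.10,  "eps_usd": 0.86},
    "EnergySouth":  {"exchange": "N",  "annual_sales_musd": 74.00,  "eps_usd": 1.67},
    "Keystone":     {"exchange": "N",  "annual_sales_musd": 365.70, "eps_usd": 0.86},
    "LandCare":     {"exchange": "NQ", "annual_sales_musd": 111.40, "eps_usd": 0.33},
    "Psychemedics": {"exchange": "N",  "annual_sales_musd": 17.60,  "eps_usd": 0.13},
}

n_elements = len(data_set)                           # one element per company
n_variables = len(next(iter(data_set.values())))     # variables measured per element
n_observations = n_elements                          # one observation per element
n_data_values = n_elements * n_variables             # elements x variables

# The observation for one element is its full set of measurements.
keystone_observation = data_set["Keystone"]

print(n_elements, n_variables, n_observations, n_data_values)
```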
Basic Concepts
Population
• All entities of interest in a study
Sample
• Subset of the population
• Often randomly chosen
• Can follow different schemes
• Preferably representative of the population
Types of Data
• Numerical – arithmetic can be performed (age, salary)
• Categorical – takes values in categories (gender, state)
– Nominal – names or labels
– Ordinal – ordered data based on factors such as size or quality
– Dummy variable – 0-1 coded for categories
• Discrete and Continuous Data
• Time Series and Cross Sectional Data
Scales of Measurement
• Scales of measurement include
– Nominal
– Ordinal
– Interval
– Ratio
• The scale determines the amount of
information contained in the data.
• The scale indicates the data summarization
and statistical analyses that are most
appropriate.
Nominal scale
• Data are labels or names used to identify an attribute of
the element.
• A nonnumeric label or numeric code may be used.
Example
Students of a university are classified by the school in which
they are enrolled using a nonnumeric label such as Business,
Humanities, Education, and so on.
Alternatively, a numeric code could be used for the school
variable (e.g. 1 denotes Business, 2 denotes Humanities, 3
denotes Education, and so on).
Ordinal scale
• The data have the properties of nominal data and the order
or rank of the data is meaningful.
• A nonnumeric label or numeric code may be used.
Example
Students of a university are classified by their class standing
using a nonnumeric label such as Freshman, Sophomore,
Junior, or Senior.
Alternatively, a numeric code could be used for the class
standing variable (e.g. 1 denotes Freshman, 2 denotes
Sophomore, and so on).
Interval scale
• The data have the properties of ordinal data, and the
interval between observations is expressed in terms
of a fixed unit of measure.
• Interval data are always numeric.
Example
Melissa has an SAT score of 1985, while Kevin has an
SAT score of 1880. Melissa scored 105 points more
than Kevin.
Ratio scale
• The data have all the properties of interval data, and the ratio of two values is
meaningful.
• Ratio data are always numeric.
• A zero value is included in the scale.
Example
The price of a book at a retail store is $200, while the price of the same book sold
online is $100. The ratio property shows that the retail store charges twice the online
price.
Categorical and Quantitative Data
• Data can be further classified as being
categorical or quantitative.
• The statistical analysis that is appropriate
depends on whether the data for the variable
are categorical or quantitative.
• In general, there are more alternatives for
statistical analysis when the data are
quantitative.
Categorical Data
• Labels or names are used to identify an
attribute of each element
• Often referred to as qualitative data
• Use either the nominal or ordinal scale of
measurement
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are rather
limited
Quantitative Data
• Quantitative data indicate how many or how
much.
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful
for quantitative data.
Data
• Categorical
– Numeric: Nominal, Ordinal
– Non-numeric: Nominal, Ordinal
• Quantitative
– Numeric: Interval, Ratio
Cross-Sectional Data
Cross-sectional data are collected at the same or
approximately the same point in time.
Example
Data detailing the number of building permits
issued in November 2013 in each of the counties
of Ohio.
Time Series Data
Time series data are collected over several time periods.
Example
Data detailing the number of building permits issued in Lucas
County, Ohio in each of the last 36 months.
Graphs of time series data help analysts
• understand what happened in the past,
• identify any trends over time, and
• project future levels for the time series.
Time Series Data
[Figure: graph of time series data]
Household Data
Types of Data
Data Sources
Existing Sources
• Internal company records – almost any department
• Business database services – Dow Jones & Co.
• Government agencies - U.S. Department of Labor
• Industry associations – Travel Industry Association of
America
• Special-interest organizations – Graduate
Management Admission Council (GMAT)
• Internet – more and more firms
Data Sources
Data Available From Internal Company Records
Record               Some of the Data Available
Employee records     Name, address, social security number
Production records   Part number, quantity produced, direct labor cost, material cost
Inventory records    Part number, quantity in stock, reorder level, economic order quantity
Sales records        Product number, sales volume, sales volume by region
Credit records       Customer name, credit limit, accounts receivable balance
Customer profile     Age, gender, income, household size
Data Sources
Data Available From Selected Government Agencies
Government Agency            Web address              Some of the Data Available
Census Bureau                www.census.gov           Population data, number of households, household income
Federal Reserve Board        www.federalreserve.gov   Data on money supply, exchange rates, discount rates
Office of Mgmt. & Budget     www.whitehouse.gov/omb   Data on revenue, expenditures, debt of federal government
Department of Commerce       www.doc.gov              Data on business activity, value of shipments, profit by industry
Bureau of Labor Statistics   www.bls.gov              Consumer spending, unemployment rate, hourly earnings, safety record
Data Sources
Statistical Studies – Observational
• In observational (nonexperimental) studies no
attempt is made to control or influence the variables
of interest.
• Example - Survey
• Studies of smokers and nonsmokers are
observational studies because researchers do not
determine or control who will smoke and who will
not smoke.
Data Sources
Statistical Studies – Experimental
• In experimental studies the variable of interest is first
identified. Then one or more other variables are
identified and controlled so that data can be
obtained about how they influence the variable of
interest.
• The largest experimental study ever conducted is
believed to be the 1954 Public Health Service
experiment for the Salk polio vaccine. Nearly two
million U.S. children (grades 1–3) were selected.
Data Acquisition Considerations
Time Requirement
• Searching for information can be time consuming.
• Information may no longer be useful by the time it is
available.
Cost of Acquisition
• Organizations often charge for information even when
it is not their primary business activity.
Data Errors
• Using any data that happen to be available or were
acquired with little care can lead to misleading
information.
Descriptive Statistics
• Most of the statistical information in newspapers,
magazines, company reports, and other publications
consists of data that are summarized and presented in a
form that is easy to understand.
• Such summaries of data, which may be tabular, graphical,
or numerical, are referred to as descriptive statistics.
Example
The manager of Hudson Auto would like to have a better
understanding of the cost of parts used in the engine tune-ups
performed in her shop. She examines 50 customer invoices
for tune-ups. The costs of parts, rounded to the nearest
dollar, are listed on the next slide.
Example: Hudson Auto Repair
Sample of Parts Cost ($) for 50 Tune-ups
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Tabular Summary: Frequency and
Percent Frequency
Parts Cost ($) Frequency Percent Frequency

50-59 2 4%
60-69 13 26%
70-79 16 32%
80-89 7 14%
90-99 7 14%
100-109 5 10%
TOTAL 50 100%
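The tabular summary above can be reproduced by binning the 50 parts costs into classes of width 10. This is a minimal sketch; the variable names and the use of integer division to assign classes are illustrative choices, not something the slides prescribe.

```python
from collections import Counter

# Parts cost ($) for the 50 tune-ups, copied from the Hudson Auto slide.
parts_cost = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

CLASS_WIDTH = 10
n = len(parts_cost)

# Map each cost to the lower limit of its class (e.g. 57 -> 50, 104 -> 100).
counts = Counter((cost // CLASS_WIDTH) * CLASS_WIDTH for cost in parts_cost)

# Build rows of (class label, frequency, percent frequency).
freq_table = [
    (f"{low}-{low + CLASS_WIDTH - 1}", counts[low], 100 * counts[low] / n)
    for low in sorted(counts)
]

for label, frequency, percent in freq_table:
    print(f"{label:>8} {frequency:>4} {percent:>5.0f}%")
```

The same `counts` object is also what a histogram of these data would plot, one bar per class.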
Graphical Summary: Histogram
Example: Hudson Auto
[Histogram: Tune-up Parts Cost; x-axis: Parts Cost ($) in classes 50-59 through 100-109; y-axis: Frequency]
Numerical Descriptive Statistics
• The most common numerical descriptive
statistic is the mean (or average).
• The mean provides a measure of the
central tendency, or central location, of the
data for a variable.
• Hudson’s mean cost of parts, based on the 50
tune-ups studied is $79 (found by summing up
the 50 cost values and then dividing by 50).
Statistical Inference
Population: The set of all elements of interest in a
particular study.
Sample: A subset of the population.
Statistical inference: The process of using data
obtained from a sample to make estimates and test
hypotheses about the characteristics of a population.
Census: Collecting data for the entire population.
Sample survey: Collecting data for a sample.
Process of Statistical Inference
Example: Hudson Auto
Step 1: The population consists of all tune-ups. The average cost of parts is unknown.
Step 2: A sample of 50 engine tune-ups is examined.
Step 3: The sample data provide a sample average parts cost of $79 per tune-up.
Step 4: The sample average is used to estimate the population average.
Analytics
Analytics is the scientific process of transforming data into
insight for making better decisions.
Techniques:
Descriptive analytics: This describes what has happened in the
past.
Predictive analytics: Use models constructed from past data to
predict the future or to assess the impact of one variable on
another.
Prescriptive analytics: The set of analytical techniques that yield
a best course of action.
DATA SUMMARIZATION
• Histograms
• Pie chart
• Bar Charts
• Scatterplot
• Box Plot
Tabular and Graphical Displays
Categorical Data
• Tabular displays: Frequency Distribution, Relative Frequency Distribution, Percent Frequency Distribution, Crosstabulation
• Graphical displays: Bar Chart, Pie Chart, Side-by-Side Bar Chart, Stacked Bar Chart
Quantitative Data
• Tabular displays: Frequency Distribution, Relative Frequency Distribution, Percent Frequency Distribution, Cumulative Frequency Distribution, Cumulative Relative Frequency Distribution, Cumulative Percent Frequency Distribution, Crosstabulation
• Graphical displays: Dot Plot, Histogram, Stem-and-Leaf Display, Scatter Diagram
Frequency Distribution
• A frequency distribution is a tabular summary of
data showing the number (frequency) of
observations in each of several non-overlapping
categories or classes.
• The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
Example: Marada Inn
• Guests staying at Marada Inn were asked to rate the quality of their
accommodations as being excellent, above average, average, below average, or
poor.
• The ratings provided by a sample of 20 guests are:
Below Average Average Above Average

Above Average Above Average Above Average
Above Average Below Average Below Average
Average Poor Poor
Above Average Excellent Above Average
Average Above Average Average
Above Average Average
• Example: Marada Inn
Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
Relative Frequency Distribution
• The relative frequency of a class is the fraction or
proportion of the total number of data items belonging
to the class.
Relative frequency of a class = (Frequency of the class) / n
• A relative frequency distribution is a tabular summary
of a set of data showing the relative frequency for each
class.
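The frequency and relative frequency distributions for the Marada Inn ratings can be tallied directly from the 20 responses. This is a minimal sketch using Python's `collections.Counter`; the variable names and the display order list are illustrative assumptions.

```python
from collections import Counter

# The 20 guest ratings, copied row by row from the Marada Inn slide.
ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

# Ordinal order of the classes, worst to best.
order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

n = len(ratings)
frequency = Counter(ratings)

# Relative frequency = frequency of the class / n.
relative_frequency = {rating: frequency[rating] / n for rating in order}

for rating in order:
    print(f"{rating:<14} {frequency[rating]:>2} {relative_frequency[rating]:.2f}")
```

Multiplying each relative frequency by 100 gives the percent frequency distribution.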
Histogram
• A histogram is a graphical representation of
the distribution of data in which the data are grouped into
bins.
• A histogram is usually used for continuous data.

Histogram
Data for First Income.
Pie chart
• For categorical variables
• Data for Satisfaction Likert Scale.
Bar Chart
Scatterplot
• For numerical variables
• Data for Utilities(first hundred values).
Descriptive Stats
• Descriptive statistics are used to report on populations and samples.
• By summarizing information, descriptive stats speed up and
simplify comprehension of a group’s characteristics.
Measures:
1. Measures of central tendency
2. Measures of spread
Sample Vs Population
• The “population” in statistics includes all
members of a defined group that we are
studying or collecting information on for
data-driven decisions.
• A parameter is a characteristic of a
population.
• A sample is a smaller group of members of
a population selected to represent the
population.
• A statistic is a characteristic of a
sample.
* A random sample is one in which every
member of a population has an equal
chance of being selected.
Descriptive Stats – Summarizing Data
• Central Tendency (or Groups’ “Middle
Values”)
1. Mean
2. Median
3. Mode
• Variation (or Summary of Differences
Within Groups)
1. Range
2. Interquartile Range
3. Variance
4. Standard Deviation
Central Tendency
• Measures of central tendency are statistical
measures which describe the position of a
distribution.
• They are computable values on a distribution that
describe the behavior of the center of the distribution.
Simpson and Kafka defined it as “A measure
of central tendency is a typical value around
which other figures congregate.”
Waugh has expressed “An average stands for the whole
group of which it forms a part yet represents the whole.”
Mean
• The mean is the arithmetic average of the scores.
• The calculation of the mean considers both the number of scores and their
value.
• The formula for the mean of the variable X is: $\bar{x} = \frac{\sum x_i}{n}$
Pros:
1. It is simple and rigidly defined.
2. It is not based on the position in the series.
3. It is easy to understand the arithmetic average even if some of
the details of the data are lacking.
Cons:
1. The mean can be badly affected by outliers.
2. It can give misleading conclusions.
Sample Mean 𝑥̅
$\bar{x} = \frac{\sum x_i}{n}$
where: $\sum x_i$ = sum of the values of the n observations
n = number of observations in the sample
Population Mean µ
$\mu = \frac{\sum X_i}{N}$
where: $\sum X_i$ = sum of the values of the N observations
N = number of observations in the population
Sample Mean 𝑥̅
• Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a college
town. The monthly rents for these apartments are listed below.
545 715 530 690 535 700 560 700 540 715
540 540 540 625 525 545 675 545 550 550
565 550 625 550 550 560 535 560 565 580
550 570 590 572 575 575 600 580 670 565
700 585 680 570 590 600 649 600 600 580
670 615 550 545 625 635 575 650 580 610
610 675 590 535 700 535 545 535 530 540
Sample Mean 𝑥̅
$\bar{x} = \frac{\sum x_i}{n} = \frac{41,356}{70} = 590.80$
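The sample mean of the apartment rents can be checked with a few lines of Python. This is a minimal sketch; to keep it short, the 70 rents are stored as a value-to-count tally (`RENT_COUNTS`, a name introduced here) built from the rent listing, then expanded back into the full list.

```python
import statistics

# Tally of the 70 monthly rents: rent value -> number of apartments.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
# Expand the tally into the full list of 70 observations.
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

n = len(rents)
total = sum(rents)
sample_mean = total / n                  # x-bar = (sum of x_i) / n
library_mean = statistics.mean(rents)    # same value via the standard library

print(n, total, round(sample_mean, 2))
```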
Median
• The median is the middle point in an ordered
distribution at which an equal number of
scores lie on each side of it.
• It is also known as the 50th percentile (P50),
or 2nd quartile (Q2).
• In skewed data, the median is a better measure
than the mean.
Pros:
The median can be calculated in all
distributions.
The median is relatively
unaffected by outliers.
Cons:
It is not based on all the values.
Median
• For an odd number of observations:
26 18 27 12 14 27 19 7 observations
12 14 18 19 26 27 27 in ascending order
The median is the middle value. Median = 19
Median
• For an even number of observations:
26 18 27 12 14 27 30 19 8 observations
12 14 18 19 26 27 27 30 in ascending order
The median is the average of the middle two values.

Median = (19 + 26)/2 = 22.5
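The odd and even cases above follow one rule: sort the values, then take the middle value or the average of the two middle values. A minimal sketch of that rule (the function name `median` is just a label used here):

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


odd_result = median([26, 18, 27, 12, 14, 27, 19])        # 7 observations
even_result = median([26, 18, 27, 12, 14, 27, 30, 19])   # 8 observations

print(odd_result, even_result)
```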
Median
For the 70 apartment rents, the median is the average of the 35th and 36th data values:
Median = (575 + 575)/2 = 575
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
Note: Data is in ascending order.
Mode
• The mode is the most frequently occurring value, the highest point
of the frequency distribution curve.
• It is not affected by extreme
values, and it is an actual value of an
important part of the series.
1. The mode gives the most likely
experience rather than the “typical”
or “central” experience.
2. In symmetric distributions, the
mean, median, and mode are the
same.
3. In skewed data, the mean and
median lie further toward the skew
than the mode.
Mode
550 occurred most frequently (7 times)
Mode = 550
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
Note: Data is in ascending order.
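The mode of the rents can be found by counting how often each value occurs. This is a minimal sketch; the rents are stored as a value-to-count tally (`RENT_COUNTS`, a name introduced here) built from the sorted listing above and then expanded into the full list.

```python
from collections import Counter

# Tally of the 70 monthly rents: rent value -> number of apartments.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

# most_common(1) returns the single (value, count) pair with the highest count.
mode_value, mode_count = Counter(rents).most_common(1)[0]

print(mode_value, mode_count)
```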
MEASURES OF DISPERSION
• RANGE
• VARIANCE AND STANDARD DEVIATION
• PERCENTILES AND QUARTILES
• INTER-QUARTILE RANGE
Range
The range of a set of data is the difference between
the largest and the smallest number in the set.
For example, consider the following set:
40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, and 50
To find the range you would take the largest number, 55, and
subtract the smallest number, 26.
The range is 55 - 26 = 29.
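The same calculation as a one-line check in Python (a minimal sketch; the variable names are illustrative):

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

# Range = largest value minus smallest value.
data_range = max(data) - min(data)

print(max(data), min(data), data_range)
```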

Interquartile Range
• The interquartile
range (IQR) is a
measure of variability,
based on dividing a
data set into quartiles.
• Quartiles divide a
rank-ordered data set
into four equal parts.
The three cut points are
denoted by Q1, Q2, and
Q3, and IQR = Q3 – Q1.
Variance
• A measure of how data points differ from the mean.
• The average of the squared deviations about the mean is
called the variance.
Why squared deviation?
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
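Standard Deviation
• A measure of variation of scores about the mean.
Why standard deviation? Why not variance?
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$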
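The population formulas (divide by N) and sample formulas (divide by n - 1) can be compared on the apartment rents using Python's `statistics` module. This is a minimal sketch; the rents are stored as a value-to-count tally (`RENT_COUNTS`, a name introduced here) and expanded into the full list. One common answer to "why not variance?" is visible in the units: the variance is in squared rupees or dollars, while the standard deviation is in the same units as the data.

```python
import statistics

# Tally of the 70 monthly rents: rent value -> number of apartments.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

# Treating the 70 rents as a sample: divide by n - 1.
sample_variance = statistics.variance(rents)
sample_std = statistics.stdev(rents)

# Treating the same 70 rents as a whole population: divide by N.
population_variance = statistics.pvariance(rents)
population_std = statistics.pstdev(rents)

print(round(sample_variance, 2), round(sample_std, 2))
print(round(population_variance, 2), round(population_std, 2))
```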
Example
The monthly salaries of all
employees in a premier hotel
in Bangalore have been
reduced to half of their
existing salaries. The previous
salaries of a sample of size 10 are
given as (in units of Rs 1000):
{22, 38, 16, 46, 52, 30, 18, 16,
20, 12}
What would be the new mean
and standard deviation of the
salaries, and how would they
compare to the old ones?
Example
The monthly salaries of all
executives in an analytics
company in Bangalore have
been reduced by 40,000
rupees. The previous salaries of
a sample of size 10 are given as
(in units of Rs 1000):
{220, 150, 120, 160, 145, 152,
188, 176, 120, 124}
What would be the new mean
and standard deviation of the
salaries, and how would they
compare to the old ones?
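One way to explore the two salary examples is to apply each change to the sample and recompute. This is a sketch, not the intended classroom solution; it assumes the sample standard deviation (n - 1 divisor) is wanted, and the variable names are illustrative. It shows that multiplying every value by a constant scales both the mean and the standard deviation, while subtracting a constant shifts the mean and leaves the standard deviation unchanged.

```python
import math
import statistics

# Hotel example: salaries (Rs 1000) are halved.
hotel_old = [22, 38, 16, 46, 52, 30, 18, 16, 20, 12]
hotel_new = [salary / 2 for salary in hotel_old]

# Analytics company example: salaries (Rs 1000) are reduced by 40.
exec_old = [220, 150, 120, 160, 145, 152, 188, 176, 120, 124]
exec_new = [salary - 40 for salary in exec_old]

hotel_mean_old, hotel_mean_new = statistics.mean(hotel_old), statistics.mean(hotel_new)
hotel_sd_old, hotel_sd_new = statistics.stdev(hotel_old), statistics.stdev(hotel_new)

exec_mean_old, exec_mean_new = statistics.mean(exec_old), statistics.mean(exec_new)
exec_sd_old, exec_sd_new = statistics.stdev(exec_old), statistics.stdev(exec_new)

print("halved:  mean", hotel_mean_old, "->", hotel_mean_new,
      " sd", round(hotel_sd_old, 2), "->", round(hotel_sd_new, 2))
print("minus 40: mean", exec_mean_old, "->", exec_mean_new,
      " sd", round(exec_sd_old, 2), "->", round(exec_sd_new, 2))
```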
Percentile
• A percentile is the value below which a
given percentage of observations in a
population fall.
• For example, the 80th percentile is the value
at or below which 80% of the
observations may be found, and at or above
which 20% of the observations
may be found.
Percentile
• The 50th percentile is the median.
• The 25th percentile is the first quartile.
To compute the pth percentile:
• Arrange the data in ascending order.
• Compute Lp, the location of the pth percentile:
Lp = (p/100)(n + 1)
80th Percentile
Lp = (p/100)(n + 1) = (80/100)(70 + 1) = 56.8
(the 56th value plus .8 times the
difference between the 57th and 56th values)
80th Percentile = 635 + .8(649 – 635) = 646.2
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
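The location rule Lp = (p/100)(n + 1) with linear interpolation can be written as a short function. This is a minimal sketch; the function name `percentile` and the clamping of locations that fall outside 1..n are assumptions added here, and the rents are stored as a value-to-count tally (`RENT_COUNTS`) expanded into the sorted list. Note that statistical packages offer several percentile conventions, so other tools may give slightly different values.

```python
# Tally of the 70 monthly rents in ascending order: rent value -> count.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]


def percentile(sorted_values, p):
    """pth percentile using the location rule Lp = (p/100)(n + 1)."""
    n = len(sorted_values)
    location = (p / 100) * (n + 1)         # 1-based position, may be fractional
    location = min(max(location, 1), n)    # assumption: clamp to the data range
    k = int(location)                      # position of the lower neighbour
    if k == n:
        return sorted_values[-1]
    fraction = location - k
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + fraction * (upper - lower)


p80 = percentile(rents, 80)   # location 56.8: between the 56th and 57th values
p50 = percentile(rents, 50)   # location 35.5: the median

print(round(p80, 1), p50)
```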
Crosstabulation
• A crosstabulation is a tabular summary of data for two variables.
• Crosstabulation can be used when:
– one variable is categorical and the other is quantitative,
– both variables are categorical, or
– both variables are quantitative.
• The left and top margin labels define the classes for the two variables.
Crosstabulation
• Example: Finger Lakes Homes
The number of Finger Lakes homes sold for each style and price for
the past two years is shown below.
                       Home Style
Price Range    Colonial   Log   Split   A-Frame   Total
< $250,000     18         6     19      12        55
≥ $250,000     12         14    16      3         45
Total          30         20    35      15        100
Crosstabulation
Insights Gained from Preceding Crosstabulation
• The greatest number of homes (19) in the sample are a split-level
style and priced at less than $250,000.
• Only three homes in the sample are an A-Frame style and priced
at $250,000 or more.
Crosstabulation: Row or Column
Percentages
• Converting the entries in the table into row
percentages or column percentages can
provide additional insight about the
relationship between the two variables.
Crosstabulation: Row Percentages
                       Home Style
Price Range    Colonial   Log     Split   A-Frame   Total
< $250,000     32.73      10.91   34.55   21.82     100
≥ $250,000     26.67      31.11   35.56   6.67      100
(Colonial and ≥ $250K)/(All ≥ $250K) x 100 = (12/45) x 100
Crosstabulation: Column Percentages
                       Home Style
Price Range    Colonial   Log     Split   A-Frame
< $250,000     60.00      30.00   54.29   80.00
≥ $250,000     40.00      70.00   45.71   20.00
Total          100        100     100     100
(Colonial and ≥ $250K)/(All Colonial) x 100 = (12/30) x 100
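Both percentage tables can be derived from the Finger Lakes counts: divide each cell by its row total for row percentages, or by its column total for column percentages. This is a minimal sketch; the dictionary layout is an illustrative choice, and the second price class is labeled `>= $250,000` here, following the "at $250,000 or more" wording on the insights slide.

```python
styles = ["Colonial", "Log", "Split", "A-Frame"]

# Cell counts from the crosstabulation, one list per price class.
counts = {
    "< $250,000":  [18, 6, 19, 12],
    ">= $250,000": [12, 14, 16, 3],
}

# Row percentages: each cell divided by its row (price class) total.
row_pct = {
    price: [round(100 * c / sum(row), 2) for c in row]
    for price, row in counts.items()
}

# Column percentages: each cell divided by its column (home style) total.
col_totals = [sum(col) for col in zip(*counts.values())]
col_pct = {
    price: [round(100 * c / total, 2) for c, total in zip(row, col_totals)]
    for price, row in counts.items()
}

for price in counts:
    print(price, "row %:", row_pct[price], "column %:", col_pct[price])
```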
Crosstabulation: Simpson’s Paradox
• Data in two or more crosstabulations are often aggregated to produce a
summary crosstabulation.
• We must be careful in drawing conclusions about the relationship between the
two variables in the aggregated crosstabulation.
• In some cases the conclusions based upon an aggregated crosstabulation can
be completely reversed if we look at the unaggregated data. The reversal of
conclusions based on aggregate and unaggregated data is called Simpson’s
paradox.
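A small numeric illustration may make the reversal concrete. The counts below are hypothetical, chosen only to exhibit the effect; they do not come from this deck. Option A has the higher success rate within each group, yet option B looks better once the two groups are aggregated, because the options are applied to the groups in very different proportions.

```python
# Hypothetical (successes, trials) counts for two options in two groups.
# These numbers are illustrative only and are not from the slides.
groups = {
    "group_1": {"A": (81, 87),   "B": (234, 270)},
    "group_2": {"A": (192, 263), "B": (55, 80)},
}

# Success rate of each option within each group (unaggregated view).
within = {
    name: {option: s / t for option, (s, t) in cells.items()}
    for name, cells in groups.items()
}

# Aggregate the two groups into one summary crosstabulation.
totals = {
    option: (
        sum(groups[name][option][0] for name in groups),
        sum(groups[name][option][1] for name in groups),
    )
    for option in ("A", "B")
}
aggregate = {option: s / t for option, (s, t) in totals.items()}

for name, rates in within.items():
    print(name, {k: round(v, 3) for k, v in rates.items()})
print("aggregate", {k: round(v, 3) for k, v in aggregate.items()})
```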
Side-by-Side Bar Chart
• A side-by-side bar chart is a graphical display for depicting
multiple bar charts on the same display.
• Each cluster of bars represents one value of the first
variable.
• Each bar within a cluster represents one value of the
second variable.
Side-by-Side Bar Chart
[Side-by-side bar chart: Finger Lakes Homes; x-axis: Home Style (Colonial, Log, Split-Level, A-Frame); y-axis: Frequency; one bar per price class (< $250,000 and > $250,000)]
Stacked Bar Chart
• A stacked bar chart is another way to display and compare two variables on the
same display.
• It is a bar chart in which each bar is broken into rectangular segments of a
different color.
• If percentage frequencies are displayed, all bars will be of the same height (or
length), extending to the 100% mark.
Stacked Bar Chart
[Stacked bar chart: Finger Lakes Homes; x-axis: Home Style (Colonial, Log, Split, A-Frame); y-axis: Frequency; segments for < $250,000 and > $250,000]
Stacked Bar Chart
[Stacked bar chart: Finger Lakes Homes; x-axis: Home Style (Colonial, Log, Split, A-Frame); y-axis: Percentage Frequency; segments for < $250,000 and > $250,000]
Data Visualization: Best Practices in Creating Effective Graphical
Displays
• Data visualization describes the use of graphical displays to summarize and
present information about a data set.
• The goal is to communicate as effectively and clearly as possible the key
information about the data.
Creating Effective Graphical Displays
• Creating effective graphical displays is as much art as it is science.
• Here are some guidelines:
– Give the display a clear and concise title.
– Keep the display simple.
– Clearly label each axis and provide the units of measure.
– If colors are used, make sure they are distinct.
– If multiple colors or lines are used, provide a legend.
Choosing the Type of Graphical Display
• Displays used to show the distribution of data:
– Bar Chart to show the frequency distribution or relative frequency
distribution for categorical data
– Pie Chart to show the relative frequency or percent frequency for
categorical data
– Dot Plot to show the distribution for quantitative data over the entire range
of the data
– Histogram to show the frequency distribution for quantitative data over a
set of class intervals
– Stem-and-Leaf Display to show both the rank order and shape of the
distribution for quantitative data
Box Plot
• A box plot (also known as a box-and-whisker plot) is a graphical
representation of numerical data that can be used to
understand the variability of the data and the existence of
outliers. It shows:
– Lower quartile (1st quartile), median, and upper quartile (3rd
quartile)
– Lowest and highest value
– Interquartile range (IQR)
Outliers Using Box Plot
• It is possible that the data may contain values beyond Q1 –
1.5 IQR and Q3 + 1.5 IQR.
• The whisker of the box plot extends till Q1 – 1.5 IQR and Q3 +
1.5 IQR, observations beyond these two limits are potential
outliers.
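The two limits can be computed for the apartment rents used elsewhere in this deck. This is a sketch under a stated assumption: the quartiles are taken with the location rule Lp = (p/100)(n + 1) from the percentile slides, and other quartile conventions can shift the limits slightly. The helper name `percentile` and the tally `RENT_COUNTS` are introduced here for illustration.

```python
# Tally of the 70 monthly rents in ascending order: rent value -> count.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]


def percentile(sorted_values, p):
    """pth percentile using the location rule Lp = (p/100)(n + 1)."""
    n = len(sorted_values)
    location = min(max((p / 100) * (n + 1), 1), n)  # clamp to the data range
    k = int(location)
    if k == n:
        return sorted_values[-1]
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + (location - k) * (upper - lower)


q1 = percentile(rents, 25)
q3 = percentile(rents, 75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations beyond the limits are flagged as potential outliers.
potential_outliers = [x for x in rents if x < lower_limit or x > upper_limit]

print(q1, q3, iqr, lower_limit, upper_limit, potential_outliers)
```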
[Figure: box plot marking the 1st quartile, median, 3rd quartile, and the upper limit at 3rd quartile + 1.5 IQR]
Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
• Skewness can be easily computed using statistical software.
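As one example of computing skewness in software, the sketch below applies the sample formula, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, to the 70 apartment rents used elsewhere in this deck. The tally `RENT_COUNTS` and the variable names are introduced here for illustration.

```python
import statistics

# Tally of the 70 monthly rents: rent value -> number of apartments.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

n = len(rents)
mean = statistics.mean(rents)
s = statistics.stdev(rents)          # sample standard deviation (n - 1 divisor)

# Sum of cubed standardized deviations, scaled by n / ((n - 1)(n - 2)).
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in rents)

print(round(skewness, 2))
```

A positive result indicates a right-skewed distribution, which matches the mean (590.80) exceeding the median (575) for these data.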
• Symmetric (not skewed)

• Skewness is zero.
• Mean and median are equal.
[Histogram: relative frequency; Skewness = 0]
• Moderately Skewed Left

• Skewness is negative.
• Mean will usually be less than the median.
[Histogram: relative frequency; Skewness = -.31]
• Moderately Skewed Right

• Skewness is positive.
• Mean will usually be more than the median.
[Histogram: relative frequency; Skewness = .31]
• Highly Skewed Right

• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Histogram: relative frequency; Skewness = 1.25]

Seventy efficiency apartments were randomly sampled in a college town. The
monthly rent prices for the apartments are listed below in ascending order.
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715

[Histogram: relative frequency of the apartment rents; Skewness = .92]
Z-scores
• The z-score is often called the standardized
value.
• It denotes the number of standard deviations
a data value $x_i$ is from the mean:
$z_i = \frac{x_i - \bar{x}}{s}$
• Excel’s STANDARDIZE function can be used to
compute the z-score.
Z-scores
• An observation’s z-score is a measure of the
relative location of the observation in a data
set.
• A data value less than the sample mean will
have a z-score less than zero.
• A data value greater than the sample mean
will have a z-score greater than zero.
• A data value equal to the sample mean will
have a z-score of zero.
z-Scores
• z-Score of Smallest Value (525)
$z = \frac{x_i - \bar{x}}{s} = \frac{525 - 590.80}{54.74} = -1.20$
Standardized Values for Apartment Rents

-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
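The table of standardized values can be regenerated by subtracting the sample mean from each rent and dividing by the sample standard deviation. This is a minimal sketch; the tally `RENT_COUNTS` and the variable names are introduced here, and the last line applies the common |z| > 3 screening rule for outliers.

```python
import statistics

# Tally of the 70 monthly rents in ascending order: rent value -> count.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

mean = statistics.mean(rents)
s = statistics.stdev(rents)

# z-score: number of sample standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in rents]

smallest_z = round(z_scores[0], 2)     # z-score of the smallest rent (525)
largest_z = round(z_scores[-1], 2)     # z-score of the largest rent (715)

# Values with |z| > 3 would be flagged as possible outliers.
flagged = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(smallest_z, largest_z, flagged)
```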
Chebyshev’s Theorem
• At least (1 - 1/z²) of the items in any data set
will be within z standard deviations of the
mean, where z is any value greater than 1.
• Chebyshev’s theorem requires z > 1, but z
need not be an integer.
• For example, at least 75% of the data values must be within
z = 2 standard deviations of the mean.


Chebyshev’s Theorem
Let z = 1.5 with 𝑥̅ = 590.80 and s = 54.74
At least (1 - 1/(1.5)²) = 1 - 0.44 = 0.56 or 56%
of the rent values must be between
𝑥̅ - z(s) = 590.80 - 1.5(54.74) = 509
and
𝑥̅ + z(s) = 590.80 + 1.5(54.74) = 673
(Actually, 86% of the rent values
are between 509 and 673.)
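The guaranteed bound and the actual share of rents inside the interval can be compared directly. This is a minimal sketch; the tally `RENT_COUNTS` and the variable names are introduced here for illustration.

```python
import statistics

# Tally of the 70 monthly rents: rent value -> number of apartments.
RENT_COUNTS = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [rent for rent, count in RENT_COUNTS.items() for _ in range(count)]

z = 1.5
mean = statistics.mean(rents)
s = statistics.stdev(rents)

# Chebyshev: at least 1 - 1/z^2 of the values lie within z standard deviations.
chebyshev_bound = 1 - 1 / z**2

lower = mean - z * s
upper = mean + z * s

# Actual count and proportion of rents inside the interval.
inside = sum(1 for x in rents if lower <= x <= upper)
actual_proportion = inside / len(rents)

print(round(chebyshev_bound, 2), round(lower), round(upper),
      inside, round(actual_proportion, 2))
```

The bound is a guarantee for any data set, so the observed proportion is usually well above it, as it is here.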
Empirical Rule
• When the data are believed to approximate a bell-shaped
distribution:
• The empirical rule can be used to determine the
percentage of data values that must be within a
specified number of standard deviations of the
mean.
• The empirical rule is based on the normal
distribution, which is covered in Chapter 6.
Empirical Rule
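For data having a bell-shaped distribution:
• 68.26% of the values of a normal random variable are within +/- 1
standard deviation of its mean.
• 95.44% of the values of a normal random variable are within +/- 2
standard deviations of its mean.
• 99.72% of the values of a normal random variable are within +/- 3
standard deviations of its mean.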
Empirical Rule
[Figure: bell-shaped curve over x showing 68.26% of values within µ ± 1σ, 95.44% within µ ± 2σ, and 99.72% within µ ± 3σ]
Outliers
• A value or observation lying outside the norm
• May not be an outlier based on individual variables
• A combination of variables may be extreme
• Sometimes eliminated for analysis
• Elimination is not always a good idea
• It is good to run the analysis both with and without
them
Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered
an outlier.
• It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set
Empirical Rule
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data
set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
