Lec 2 To 5 - Describing Data Sampling Design-2

MODULE TWO : Describing Data & Sampling Design – 2023-24
Mathematical and Statistical Methods
ECON F213
Dr. Rahul Arora (IC)

Assistant Professor,
Department of Economics & Finance,
BITS Pilani, Pilani Campus
rahul.arora@Pilani.bits-Pilani.ac.in
Mob: +91 – 7607481292
Background design is taken from the presentation slides of Salvatore:

International Economics, 10th Edition © 2013 John Wiley & Sons, Inc.
Lecture 2
Data Collection &
Description (Graphically)
Statistics: An Introduction
According to Croxton & Cowden, statistics is the science of –
1. Collecting data
2. Organizing data
3. Presenting data
4. Analyzing data
5. Interpreting data
to assist in making effective decisions.
Five Stages in a Statistical Investigation

Types of Statistics
❑ Descriptive – methods of organizing, summarizing, and presenting data in an informative way
❑ Data can be organized in various ways
❑ Inferential – methods to estimate a population property based on a sample
❑ Also known as statistical inference
Data Collection
Sources of Data
Primary Data –
➢ Data collected directly by the investigator, through either –
✓ Census technique
✓ Sample technique
Secondary Data –
➢ Available data collected by some agency
Data Collection Methods (R4: pp 39-61)
Methods to collect Primary Data
✓ Direct Personal Interviews
✓ Indirect Oral Interviews
✓ Information from Correspondents
✓ Mailed Questionnaire
✓ Schedules sent through Enumerators
Direct Personal Interviews
Instruments –
✓ Interview Schedule – Structured / Semi-structured
Kinds of Interview –
✓ Structured
✓ Semi-Structured
Interviews conducted by the investigator over the phone are also a part of this method
Merits –
✓ Face-to-face interaction
✓ Hidden questions can be quickly asked
✓ Provide supplementary information
✓ Language can be adjusted based on conditions
✓ Possibility of more accurate information
Indirect Oral Interviews
Contacting third parties (witnesses) for the information
Instruments –
✓ Witnesses
✓ Trained interviewer
Merits –
✓ Suitable when a direct source of information does not exist
✓ Suitable when direct respondents are reluctant
Precautions –
✓ Don’t rely on the views of one person
✓ Take care while selecting a third person
Information from Correspondents
Appoint local agents to collect information
Instruments –
✓ Questionnaire given to correspondent for recording
information
Merits –
✓ Larger area can be covered
✓ Adopted by government for regular information
Mailed Questionnaire Method
A questionnaire is prepared and sent to the informants by post.
Instruments –
✓ Questionnaire – Disguised/non-disguised
Merits –
✓ Larger area can be covered
✓ Personal questions can be asked easily
Demerits –
✓ Can be adopted only when respondents are literate
✓ Difficult to check the accuracy of questionnaire
Schedule Sent Through Enumerators
Hire enumerators and send them with schedules to be filled in from the actual
respondents
Instruments –
✓ Schedules of Interview
Merits –
✓ Filled by enumerators
✓ Can be adopted where informants are illiterate
Demerits –
✓ Has to bear cost of hiring enumerators
✓ Data collected by too many
Types of Variables & Levels of Measurement
Variable
Qualitative – Nominal, Ordinal
Quantitative – Discrete, Continuous (Interval, Ratio)
Types of Data and Their
Stacking
❑ Cross-sectional data
❑ Time-series data
❑ Panel data
Data Organization &
Presentation
Formation of a Frequency Distribution
❑ A frequency distribution, or frequency table, is simply a table in which the
data are grouped into classes under a variable and the number of cases that
fall in each class is recorded. The numbers in each class are referred to as
frequencies, hence the term frequency distribution. When the number of items
in each class is expressed as a proportion of the total, the table is usually
referred to as a relative frequency distribution or simply a percentage
distribution (Morris Hamburg)
❑ A variable can be either discrete or continuous.
❑ A continuous variable is capable of taking every fractional value
within the range of possibilities. Its data are obtained by
numerical measurement rather than by counting.
❑ A discrete variable is one that can vary only by finite jumps.
Frequency Distributions – of two types

Discrete Frequency Distribution
No. of Children    Frequency (No. of families)
0                  10
1                  40
2                  80
3                  100
4                  250
5                  150
6                  50
Total              680

Continuous Frequency Distribution
Weight (in lbs.)   Frequency (No. of persons)
100-110            10
110-120            15
120-130            40
130-140            45
140-150            20
150-160            4
160-170            6
Total              140
Formation of Discrete Frequency Distribution
❑ Set a variable that varies by the classes/categories given in the data
❑ Count the number of times a particular value is repeated, which is called
the frequency of that class

Raw data – Height in Centimeters:
182, 152, 154, 166, 165, 165, 163, 164, 165, 167

Class (Height in Centimeters)   Frequency   Relative Frequency
152                             1           0.1
154                             1           0.1
163                             1           0.1
164                             1           0.1
165                             3           0.3
166                             1           0.1
167                             1           0.1
182                             1           0.1
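The counting step can be sketched in Python. This is a minimal illustration using the ten heights from the slide; the variable names (`heights`, `freq_table`) are chosen here for the example and are not part of the lecture.

```python
from collections import Counter

# Raw heights (cm) taken from the slide's data
heights = [182, 152, 154, 166, 165, 165, 163, 164, 165, 167]

counts = Counter(heights)   # frequency of each distinct value
n = len(heights)

# One row per class: (class value, frequency, relative frequency)
freq_table = [(x, f, f / n) for x, f in sorted(counts.items())]

for x, f, rel in freq_table:
    print(f"{x:>5} {f:>3} {rel:>6.2f}")
```

Dividing each frequency by the number of observations gives the relative frequency column.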
Formation of Continuous Frequency Distribution
❑ Define class limits – the lowest and highest values that can be
included in the class
❑ Define class interval (i) – the difference between the upper and
lower limits of the class
$i \ge \frac{\text{Largest item } (L) - \text{Smallest item } (S)}{\text{Number of classes } (k)}$
❑ Define number of classes (k) –
Using Sturges’ rule: $k = 1 + 3.322 \log_{10} N$
where N is the number of observations
e.g., if N = 10 then k would be [1 + (3.322 × 1)] = 4.322, or 4
Or
Use the smallest k such that $2^k > N$
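Both rules for the number of classes, and the resulting minimum class interval, can be checked with a short Python sketch. The function names are hypothetical and the height range (152 to 182) is borrowed from the earlier example only for illustration.

```python
import math

def sturges_classes(n_obs: int) -> float:
    """Sturges' rule: k = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n_obs)

def two_to_k_classes(n_obs: int) -> int:
    """Smallest integer k such that 2**k > N."""
    k = 1
    while 2 ** k <= n_obs:
        k += 1
    return k

def min_class_interval(largest: float, smallest: float, k: int) -> float:
    """Lower bound on the class interval: i >= (L - S) / k."""
    return (largest - smallest) / k

k = two_to_k_classes(10)                # 2**4 = 16 > 10
print(sturges_classes(10))              # about 4.322
print(k, min_class_interval(182, 152, k))
```

In practice the interval is rounded up to a convenient value, such as a multiple of five.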
Frequency construction under Class Intervals
Exclusive Method –
❑ Fixed class intervals in which the upper limit of one class is the lower
limit of the next class
❑ It ensures continuity
❑ A value is included in a class only if it is greater than or equal to (≥)
the lower limit but less than (<) the upper limit
❑ To avoid confusion about which values are included, the class can also be
described in words:
❑ One way – 10-15
❑ Other way – 10 but under 15
Frequency construction under Class Intervals
Inclusive Method –
❑ Fixed class intervals in which upper limit of one class is
included in that class itself
For example – 10-14; 15-19; …
Where to use which method?
✓ If the variable is continuous, such as height, the exclusive method is
preferable
✓ If the variable is discrete, use the inclusive method
Points to Remember
➢ No hard and fast rules because everything depends upon the nature
of the data
➢ Still, preferable rules are as follows –
➢ The number of classes should preferably be between 5 and 20
➢ Take class intervals of either five or a multiple of five
➢ The starting point, i.e., the lower limit of the first class, should be
either zero or a multiple of 5
➢ Try adopting the exclusive method to get correct class intervals.
In the case of the inclusive method, adjust the limits to get the corrected
class intervals. The process is as follows:
➢ $\text{Correction Factor (CF)} = \frac{\text{Lower limit of 2nd class} - \text{Upper limit of 1st class}}{2}$
➢ Adjust the inclusive intervals as follows: deduct CF from the lower limits
of all classes and add CF to the upper limits of all classes
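The correction-factor adjustment can be sketched as below, using the inclusive classes 10-14, 15-19 from the earlier slide plus an assumed third class 20-24. The helper name `to_exclusive` is hypothetical.

```python
def to_exclusive(classes):
    """Convert inclusive class limits to continuous (exclusive-style) limits.

    classes: list of (lower, upper) inclusive limits in ascending order.
    Returns the correction factor and the adjusted limits.
    """
    upper_first = classes[0][1]
    lower_second = classes[1][0]
    cf = (lower_second - upper_first) / 2          # correction factor
    adjusted = [(lo - cf, hi + cf) for lo, hi in classes]
    return cf, adjusted

cf, adjusted = to_exclusive([(10, 14), (15, 19), (20, 24)])
print(cf)        # 0.5
print(adjusted)  # [(9.5, 14.5), (14.5, 19.5), (19.5, 24.5)]
```

After the adjustment, the upper limit of each class coincides with the lower limit of the next, which restores continuity.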
Two way frequency distribution
❑ Known as Bivariate Frequency Distribution

❑ Height and Weight example
Graphic Description of Data
❑ Qualitative Data –
❑ Bar Chart – show the frequency or relative
frequency (e.g., vehicles sold…)
❑ Pie Chart – show the relative frequency
❑ Quantitative Data –
❑ Histogram – show the frequency distribution
❑ Frequency Polygon – show the class mid-points
❑ Cumulative Frequency Polygon – show the cumulative frequency
Any Questions?
Lecture 3
Data Description
(Numerical Measures)
Numeric Description of Data
❑ Locational Measure – to identify the center of the
values
❑ Arithmetic mean (Simple & Weighted)
❑ Median
❑ Mode
❑ Geometric mean
❑ Measures of Dispersion – to check spread of values

❑ Range
❑ Mean deviation
❑ Standard deviation
❑ Variance
Types of Data Series
Individual Series   Discrete Series               Continuous Series
Height (in Cms)     Height (X)   Frequency (f)    Height (X)   Frequency (f)
152                 152          2                152-154      2
154                 154          3                154-156      3
163                 163          1                156-158      5
164                 164          1                158-160      8
165                 165          3                160-162      3
166                 166          1                162-164      2
167                 167          1                164-166      3
182                 182          1                166-168      4
Arithmetic Mean
Individual Series: $\bar{X} = \frac{\sum X}{N}$
Discrete Series: $\bar{X} = \frac{\sum fX}{\sum f}$
f – frequency
Continuous Series: $\bar{X} = \frac{\sum fm}{\sum f}$
m – mid-point of classes
Merits –
✓ Single value
✓ Based on all values
✓ Easy to compute
✓ Sum of deviations $\sum (X - \bar{X})$ taken from the actual mean is zero
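As a worked check, the three formulas can be applied to the individual, discrete, and continuous height series shown on the "Types of Data Series" slide. This is a minimal Python sketch; the variable names are chosen for the example.

```python
# Individual series: plain list of observations
individual = [152, 154, 163, 164, 165, 166, 167, 182]
mean_individual = sum(individual) / len(individual)

# Discrete series: value -> frequency
discrete = {152: 2, 154: 3, 163: 1, 164: 1, 165: 3, 166: 1, 167: 1, 182: 1}
mean_discrete = sum(f * x for x, f in discrete.items()) / sum(discrete.values())

# Continuous series: (lower limit, upper limit, frequency)
continuous = [(152, 154, 2), (154, 156, 3), (156, 158, 5), (158, 160, 8),
              (160, 162, 3), (162, 164, 2), (164, 166, 3), (166, 168, 4)]
total_f = sum(f for _, _, f in continuous)
# Each class is represented by its mid-point m = (lower + upper) / 2
mean_continuous = sum(f * (lo + hi) / 2 for lo, hi, f in continuous) / total_f

# Deviations from the actual mean sum to zero
deviation_sum = sum(x - mean_individual for x in individual)

print(mean_individual, mean_discrete, mean_continuous, deviation_sum)
```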
Representations of Arithmetic Mean
Population mean –
$\mu = \frac{\sum X}{N}$
Sample mean –
$\bar{X} = \frac{\sum X}{n}$
Parameter & Statistic –
▪ Any measurable characteristic of a population is known as a parameter
▪ Any measurable characteristic of a sample is known as a statistic
Weighted Mean
❑ When the items in a data series vary in importance, the weighted mean is a
better average than the simple arithmetic mean
❑ Any measure of importance can serve as the weight
Individual Series: $\bar{X}_w = \frac{\sum wX}{\sum w}$
A weighted average is most often computed to equalize the
frequency of the values in a data set.
X     1     2     3     4     ΣX = 10   X̄ = 2.5
W1    0.25  0.25  0.25  0.25  ΣW = 1    X̄w = 2.5
W2    0.1   0.1   0.7   0.1   ΣW = 1    X̄w = 2.8
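The two weighting schemes in the table can be reproduced with a few lines of Python. This is an illustrative sketch; `weighted_mean` is a helper name introduced here.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * X) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

X = [1, 2, 3, 4]
W1 = [0.25, 0.25, 0.25, 0.25]   # equal weights: same as the simple mean
W2 = [0.1, 0.1, 0.7, 0.1]       # most weight placed on the value 3

simple_mean = sum(X) / len(X)
print(simple_mean, weighted_mean(X, W1), weighted_mean(X, W2))
```

With equal weights the weighted mean coincides with the simple mean; shifting weight toward 3 pulls the average up to about 2.8.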
Combined Mean
❑ The mean of two or more separate groups, calculated by combining the means
of all the groups
❑ Similar to the weighted mean, where the weights are the sizes of the groups
$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
Median – Positional Average
In case of very high/low extreme values in the data, the median is a more
useful measure
Median – the middle value after arranging the data in order
• In case of an odd no. of observations: M = size of the $\frac{N+1}{2}$th item
• In case of an even no. of observations: M = average of the two middle
position values
Discrete Series – Compute $\frac{\sum f + 1}{2}$, look at the cumulative
frequency (c.f.) and find the total equal to or next higher than it; the
corresponding X is the median
Continuous Series – $M = L + \frac{\frac{\sum f}{2} - \text{c.f. of preceding class}}{\text{frequency of median class}} \times i$
where L – lower limit of the median class; i – class interval
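The continuous-series formula can be sketched in Python and applied to the continuous height series from the "Types of Data Series" slide. The function name `grouped_median` is an assumption made for this example.

```python
def grouped_median(classes):
    """Median of a continuous frequency distribution.

    classes: list of (lower, upper, frequency) in ascending order.
    Uses M = L + ((sum(f)/2 - c.f. of preceding class) / f_median) * i.
    """
    total = sum(f for _, _, f in classes)
    half = total / 2
    cum = 0                              # c.f. of the preceding classes
    for lower, upper, f in classes:
        if cum + f >= half:              # this is the median class
            return lower + (half - cum) / f * (upper - lower)
        cum += f
    raise ValueError("empty distribution")

heights = [(152, 154, 2), (154, 156, 3), (156, 158, 5), (158, 160, 8),
           (160, 162, 3), (162, 164, 2), (164, 166, 3), (166, 168, 4)]
print(grouped_median(heights))   # 159.25
```

Here the total frequency is 30, so the median class is 158-160 (cumulative frequency 18), with a preceding cumulative frequency of 10.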

Median – Positional Average
Merits –
✓ Most appropriate average dealing with qualitative
data (in case of preferences)
✓ Useful in case of high/low extreme values in the data
✓ It can be determined graphically, whereas the mean cannot be located
graphically
✓ Sum of deviations taken from the median, ignoring signs, is the least
Other Positional Measures
❑ Quartiles – divide the total frequency into four equal parts
❑ Deciles – divide the total frequency into ten equal parts
❑ Percentiles – divide the total frequency into a hundred equal parts
❑ Calculations –
Q1 = size of the $\frac{N+1}{4}$th item
Q2 = size of the $\frac{2(N+1)}{4}$th item
D4 = size of the $\frac{4(N+1)}{10}$th item
P60 = size of the $\frac{60(N+1)}{100}$th item
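A short Python sketch of these positional formulas follows. It assumes linear interpolation between neighbouring items when the position is fractional, which is a common convention but is not stated on the slide; `positional_value` is a hypothetical helper and the data are the eight heights of the individual series.

```python
def positional_value(sorted_data, fraction):
    """Value at position fraction * (N + 1), counting items from 1.

    Fractional positions are resolved by linear interpolation
    (an assumption made for this sketch).
    """
    n = len(sorted_data)
    pos = fraction * (n + 1)
    lower = int(pos)                 # whole part of the position
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    frac = pos - lower
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

heights = sorted([152, 154, 163, 164, 165, 166, 167, 182])
q1 = positional_value(heights, 1 / 4)      # position 2.25
q2 = positional_value(heights, 2 / 4)      # position 4.5 (the median)
d4 = positional_value(heights, 4 / 10)     # position 3.6
p60 = positional_value(heights, 60 / 100)  # position 5.4
print(q1, q2, d4, p60)
```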
Mode
The value that occurs most often in the data, i.e., with the greatest
frequency

Raw data – Height in Centimeters:
182, 152, 154, 166, 165, 165, 163, 164, 165, 167

Class (Height in Centimeters)   Frequency
152                             1
154                             1
163                             1
164                             1
165                             3
166                             1
167                             1
182                             1

Case of ill-defined mode –
Mode = 3 Median – 2 Mean (by Karl Pearson)
Relative Position of Mean, Median,
and Mode
Geometric Mean
❑ Special case of arithmetic mean whose value is always less than or equal
to the arithmetic mean
❑ It is useful for finding the average change in percentages,
ratios, indexes, or growth rates over time
Individual Series: $\text{G.M.} = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$
$\text{G.M.} = \text{Antilog}\left(\frac{\sum \log X}{N}\right)$
Discrete Series: $\text{G.M.} = \text{Antilog}\left(\frac{\sum f \log X}{\sum f}\right)$
Continuous Series: $\text{G.M.} = \text{Antilog}\left(\frac{\sum f \log m}{\sum f}\right)$
Geometric Mean - Calculations
Problem: The increment in income in the first year is 5 percent and in the
second year 15 percent. Calculate the average increment in income.
Hint: Use GM, because the data are given in percentages
Step 1: Convert the data into growth factors: a 5 percent hike over the
previous income means 105% (or 1.05), and a 15 percent hike means 115% (or 1.15)
Step 2: Calculate GM using the converted data:
$\text{G.M.} = \sqrt{(1.05)(1.15)} = 1.09886$
The average annual percent increment in income is 9.886% (or 0.09886)
Verification with income level 3000
❑ Increment 1 in rupees: 150 (5% of 3000)
❑ Increment 2 in rupees: 472.50 (15% of 3150)
❑ Total increment in rupees: 622.50
Using GM: 3000(0.09886) + 3296.58(0.09886) = 622.48 (the small gap from 622.50
is due to rounding)
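The calculation and its verification can be reproduced in Python. This sketch computes the GM both directly and through logarithms (the "antilog" form), then compares the final income under the actual and the average growth rates; variable names are chosen for the example.

```python
import math

growth_factors = [1.05, 1.15]          # 5% and 15% increments
n = len(growth_factors)

# Direct form: n-th root of the product
gm = math.prod(growth_factors) ** (1 / n)

# Log form: antilog of the mean of the logs
gm_log = math.exp(sum(math.log(g) for g in growth_factors) / n)

income = 3000
final_actual = income * 1.05 * 1.15    # apply the actual increments
final_gm = income * gm ** n            # apply the average rate twice

print(round(gm, 5), round((gm - 1) * 100, 3))   # 1.09886 9.886
print(final_actual, final_gm)
```

Without rounding the GM, both routes lead to the same final income of 3622.50, i.e., a total increment of 622.50.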

Any Questions?
Lecture 4
Data Description
Measure of Dispersion
❑ To measure Variability of the observations
Wage Earner (ID)   Wages in Factory A   Wages in Factory B   Wages in Factory C
L1                 100                  97                   15
L2                 100                  105                  395
L3                 100                  102                  52
L4                 100                  103                  33
L5                 100                  93                   5
Total              500                  500                  500
Mean (AM)          100                  100                  100
Merits –
✓ A single value represents the variability and serves as a basis for
controlling it
✓ It is an average of averages – a second-order average – and it determines
the reliability of an average
✓ Facilitates the use of other statistical measures
Range – Positional Measure
❑ Difference between the value of the largest (L) item and
the value of the smallest (S) item
In a frequency distribution –
✓ Difference between upper limit of the highest class

and lower limit of the lowest class
Decision –
✓ Distribution with smaller range has less dispersion
Mean Deviation – True Measure
❑ Average absolute difference between the items in a distribution
and the average (mean or median) value of that series
Individual Series: $MD = \frac{1}{N} \sum |X - \text{Avg}|$
Discrete Series: $MD = \frac{1}{\sum f} \sum f\,|X - \text{Avg}|$
Continuous Series: $MD = \frac{1}{\sum f} \sum f\,|m - \text{Avg}|$
Decision –
✓ In case of a small MD, the distribution is compact or uniform
Standard Deviation
❑ Square root of the mean of squared deviations from the
arithmetic mean.
Individual Series: $\sigma = \sqrt{\frac{\sum x^2}{N}}$, where $x = (X - \bar{X})$
Discrete Series: $\sigma = \sqrt{\frac{\sum f x^2}{N}}$, where $N = \sum f$
Decision –
✓ The greater the SD, the greater the magnitude of the
deviations of the values from their mean
Standard Deviation – Cont…
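❑ Use of an assumed mean (A) instead of the actual mean
Individual Series –
$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$, where $d = (X - A)$
Discrete Series –
$\sigma = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2}$
Continuous Series – Step Deviation Method
$\sigma = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2} \times i$, where $d = \frac{m - A}{i}$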
Variance
❑ Mean of squared deviations from the arithmetic mean
OR
❑ Square of the standard deviation: $\text{Variance} = \sigma^2$
Decision –
✓ More variance implies greater variability
Coefficient of Variation (CV) –
Relative Measure
❑ Used to compare the variability among distributions
$CV = \frac{\sigma}{\bar{X}} \times 100$
Decision –
✓ The higher the CV, the greater the variability
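To see how the measures of dispersion separate distributions with identical means, the following Python sketch computes the range, mean deviation (about the mean), population SD, and CV for the three factories in the wage table at the start of this lecture. The helper name `summarize` is an assumption made for the example.

```python
import statistics

wages = {
    "A": [100, 100, 100, 100, 100],
    "B": [97, 105, 102, 103, 93],
    "C": [15, 395, 52, 33, 5],
}

def summarize(values):
    """Range, mean deviation about the mean, population SD, and CV (%)."""
    mean = statistics.fmean(values)
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "md": sum(abs(x - mean) for x in values) / len(values),
        "sd": statistics.pstdev(values),      # divides by N
        "cv": statistics.pstdev(values) / mean * 100,
    }

summary = {name: summarize(v) for name, v in wages.items()}
for name, s in summary.items():
    print(name, s)
```

All three factories have a mean wage of 100, yet every measure of dispersion ranks them A < B < C.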
Uses of Standard Deviation
❑ Chebyshev developed a theorem that allows us to determine the
minimum proportion of the values that lie within a specified number of
SDs of the mean
Theorem – For any set of observations (sample or population), the proportion of
the values that lie within k SDs of the mean is at least $1 - \frac{1}{k^2}$,
where k is any value greater than 1. The relationship applies regardless of the
shape of the distribution.
For a symmetrical and bell-shaped distribution, one can be more precise in
explaining the dispersion about the mean
✓ Approx. 68% of observations lie within ±1 SD of the mean (i.e., $\bar{X} \pm 1\sigma$)
✓ Approx. 95% of observations lie within ±2 SD of the mean (i.e., $\bar{X} \pm 2\sigma$)
✓ Approx. 99.7% of observations lie within ±3 SD of the mean (i.e., $\bar{X} \pm 3\sigma$)
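Chebyshev's bound can be checked numerically. The sketch below computes the minimum proportion for a given k and compares it with the observed proportion in the highly skewed Factory C wages from the dispersion table; the function names are hypothetical.

```python
import statistics

def chebyshev_min_proportion(k: float) -> float:
    """Minimum proportion of values within k SDs of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(values, k: float) -> float:
    """Observed proportion of values within k population SDs of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(x - mean) <= k * sd for x in values) / len(values)

wages_c = [15, 395, 52, 33, 5]
print(chebyshev_min_proportion(2))      # 0.75
print(proportion_within(wages_c, 2))    # at least 0.75, whatever the shape
```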
Combined Standard Deviation
❑ It is possible to calculate the combined SD of two or more groups
$\sigma_{12} = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$
Where,
$d_1 = \bar{X}_1 - \bar{X}_{12}$
$d_2 = \bar{X}_2 - \bar{X}_{12}$
$\bar{X}_{12}$ is the combined mean
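The combined-SD formula can be verified against the SD of the pooled data. This Python sketch uses two illustrative groups chosen for the example (they are not from the slide) with different means, so that the $d_1$ and $d_2$ terms are non-zero.

```python
import math
import statistics

group1 = [10, 12, 14, 16]           # illustrative data (assumed)
group2 = [20, 25, 30, 35, 40, 45]   # illustrative data (assumed)

n1, n2 = len(group1), len(group2)
m1, m2 = statistics.fmean(group1), statistics.fmean(group2)
s1, s2 = statistics.pstdev(group1), statistics.pstdev(group2)

# Combined mean, weighted by group size
m12 = (n1 * m1 + n2 * m2) / (n1 + n2)

# Deviations of the group means from the combined mean
d1, d2 = m1 - m12, m2 - m12

# Combined SD from the group summaries only
s12 = math.sqrt((n1 * s1**2 + n2 * s2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2))

# Direct computation on the pooled observations, for comparison
pooled_sd = statistics.pstdev(group1 + group2)
print(m12, s12, pooled_sd)
```

The formula needs only each group's size, mean, and SD, which is useful when the raw observations are not available.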
Population & Sample Difference
❑ Population & Sample Variance & Standard Deviation
Population SD: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample SD: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
❑ Using the sample mean to calculate the sample SD underestimates the true
SD, so one is deducted from the denominator to increase the final value
❑ The formula with n − 1 gives an unbiased estimator of the population
variance
Any Questions?
Lecture 5
Data Description
Skewness
Why Skewness –
Two distributions may have the same mean and standard deviation but
still differ in their overall appearance
❑ Any measure of skewness indicates the difference between the manner in
which items are distributed in a particular distribution and in a
symmetrical (normal) distribution [R4: pp 338]
Skewness – Lack of symmetry. It tells us the direction of the variation
Measures of Skewness - Absolute
𝑆𝑘 = 𝑀𝑒𝑎𝑛 − 𝑀𝑜𝑑𝑒
𝑆𝑘 = 𝑄3 + 𝑄1 − 2𝑀𝑒𝑑𝑖𝑎𝑛
Decision –
✓ Positive difference means positively skewed distribution
and vice versa
Measures of Skewness - Relative
❑ The Karl Pearson’s coefficient of skewness
❑ The Bowley’s coefficient of skewness
❑ The Kelly’s coefficient of skewness
❑ Moments based Measure
The Karl Pearson’s coefficient of skewness –
$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
In case the mode is ill-defined –
$SK_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
✓ Provides both the direction and the extent of skewness
✓ The value mostly lies between -1 and +1 (rarely very high)
✓ In case the mode is ill-defined, the value mostly lies between -3 and +3
The Bowley’s coefficient of skewness –
$SK_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
✓ Based on quartiles
✓ Its numerical value lies between -1 and +1
The Kelly’s coefficient of skewness –
$SK_K = \frac{P_{10} + P_{90} - 2\,\text{Median}}{P_{90} - P_{10}}$
$SK_K = \frac{D_1 + D_9 - 2\,\text{Median}}{D_9 - D_1}$
✓ Based on percentiles and deciles
Moments
Why Moments –
The first four moments are sufficient to show the basic
characteristics – location, scatteredness, asymmetry and
peakedness
❑ The arithmetic mean of various powers of deviations (taken from the
actual mean) in any distribution is called the moments of the
distribution about the mean [R4: pp 338]
❑ Denoted by the Greek letter μ
Moments – about Mean
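Individual Series:
$\mu_1 = \frac{\sum (X - \bar{X})}{N} = \frac{\sum x}{N}$
$\mu_2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{\sum x^2}{N}$
$\mu_3 = \frac{\sum (X - \bar{X})^3}{N} = \frac{\sum x^3}{N}$
$\mu_4 = \frac{\sum (X - \bar{X})^4}{N} = \frac{\sum x^4}{N}$
Frequency Distribution:
$\mu_1 = \frac{\sum f (X - \bar{X})}{N} = \frac{\sum f x}{N}$
$\mu_2 = \frac{\sum f (X - \bar{X})^2}{N} = \frac{\sum f x^2}{N}$
$\mu_3 = \frac{\sum f (X - \bar{X})^3}{N} = \frac{\sum f x^3}{N}$
$\mu_4 = \frac{\sum f (X - \bar{X})^4}{N} = \frac{\sum f x^4}{N}$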
Moments – about Arbitrary Origin
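Individual Series:
$\mu_1' = \frac{\sum (X - A)}{N}$
$\mu_2' = \frac{\sum (X - A)^2}{N}$
$\mu_3' = \frac{\sum (X - A)^3}{N}$
$\mu_4' = \frac{\sum (X - A)^4}{N}$
Frequency Distribution:
$\mu_1' = \frac{\sum f (X - A)}{N}$
$\mu_2' = \frac{\sum f (X - A)^2}{N}$
$\mu_3' = \frac{\sum f (X - A)^3}{N}$
$\mu_4' = \frac{\sum f (X - A)^4}{N}$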
Moments – Conversion
$\mu_1 = \mu_1' - \mu_1' = 0$
$\mu_2 = \mu_2' - (\mu_1')^2$
$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$
$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6(\mu_1')^2\mu_2' - 3(\mu_1')^4$
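The conversion formulas can be checked numerically by computing the moments about an arbitrary origin, converting them, and comparing with the central moments computed directly. This Python sketch uses the eight heights of the individual series and an assumed origin A = 160; the names are chosen for the example.

```python
import math

data = [152, 154, 163, 164, 165, 166, 167, 182]   # heights (cm)
A = 160                                            # arbitrary origin (assumed)
N = len(data)

def moment_about(origin, r):
    """r-th moment of the data about the given origin."""
    return sum((x - origin) ** r for x in data) / N

# Moments about the arbitrary origin A
m1, m2, m3, m4 = (moment_about(A, r) for r in (1, 2, 3, 4))

# Conversion to moments about the mean
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

# Central moments computed directly, for comparison
mean = sum(data) / N
direct2, direct3, direct4 = (moment_about(mean, r) for r in (2, 3, 4))

print(mu2, direct2)
print(mu3, direct3)
print(mu4, direct4)
```

Note that the first moment about A equals the mean minus A, which is why the mean can be recovered as A plus that moment.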
Moments – about Zero
$v_1 = \frac{\sum fX}{N}$; $v_2 = \frac{\sum fX^2}{N}$; $v_3 = \frac{\sum fX^3}{N}$; $v_4 = \frac{\sum fX^4}{N}$
Summary of Moments –
First moment about zero – Mean
Second moment about the mean – Variance
Third moment about the mean – Skewness
Fourth moment about the mean – Kurtosis
Moment based Measure –
$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$
✓ Based on the third and second moments
✓ For a symmetrical distribution, its value becomes zero
✓ The greater the value, the more skewed the distribution
✓ Can measure the extent of skewness but not its direction
$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}}$
✓ Based on the third and second moments
✓ This is known as Karl Pearson’s gamma measure
✓ The greater the value, the more skewed the distribution
✓ Can measure both the extent and the direction of skewness
Kurtosis
Why Kurtosis –
To know the peakedness of the frequency distribution
curve
❑ Degree of sharpness of the peak of the frequency distribution curve.
It is measured relative to the peakedness of the normal curve
Three variants –
✓ More peaked than the normal curve: Leptokurtic
✓ Flatter than the normal curve: Platykurtic
✓ Same as the normal curve: Mesokurtic
Measures of Kurtosis
$\beta_2 = \frac{\mu_4}{\mu_2^2}$
✓ Based on the fourth and second moments
✓ In case of a normal distribution, $\beta_2 = 3$ (mesokurtic)
✓ $\beta_2 > 3$ implies the curve is more peaked and leptokurtic
✓ $\beta_2 < 3$ implies the curve is less peaked and platykurtic
Measures of Kurtosis
𝛾2 = 𝛽2 − 3
✓ In case of normal distribution, value of 𝛾2 =0 (mesokurtic)

✓ 𝛾2 > 0 implies curve is leptokurtic
✓ 𝛾2 <0 implies curve is platykurtic
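The moment-based measures of skewness and kurtosis can be computed together. The following Python sketch is illustrative: the two small data sets are assumptions chosen to show a symmetric and a right-skewed case, and `shape_measures` is a hypothetical helper.

```python
def central_moment(data, r):
    """r-th moment about the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

def shape_measures(data):
    """Return (beta1, gamma1, beta2, gamma2) from the central moments."""
    mu2, mu3, mu4 = (central_moment(data, r) for r in (2, 3, 4))
    beta1 = mu3 ** 2 / mu2 ** 3      # extent of skewness only
    gamma1 = mu3 / mu2 ** 1.5        # extent and direction of skewness
    beta2 = mu4 / mu2 ** 2           # kurtosis, normal curve = 3
    gamma2 = beta2 - 3               # excess kurtosis, normal curve = 0
    return beta1, gamma1, beta2, gamma2

symmetric = [1, 2, 3, 4, 5]            # illustrative data (assumed)
right_skewed = [1, 1, 2, 2, 3, 10]     # illustrative data (assumed)

print(shape_measures(symmetric))
print(shape_measures(right_skewed))
```

For the symmetric data both skewness measures are zero and $\gamma_2$ is negative (platykurtic); for the second data set $\gamma_1$ is positive, indicating a positively skewed distribution.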
Sampling Design
Sampling – An Introduction
❑ Process of learning about the population on the basis of sample
drawn from it
Why Sampling –
✓ Difficult to reach entire population
✓ Financial constraints
✓ Time constraints
✓ Sometimes sampling is sufficient even if funds and time are
available
✓ Studying entire population is destructive sometimes
Laws of Sampling
❑ Law of Statistical Regularity
❑ A random sample taken from a large population generally
possesses almost the same characteristics as the population
❑ Law of Inertia of Large Numbers
❑ The larger the sample size, the more accurate the results
Essentials of Sampling
❑ Representativeness
✓ Random selection is the key
❑ Adequacy
✓ Size of sample should be large enough
❑ Independence
✓ Selection of one item in one draw has no influence on
probability of selection in any other draw
❑ Homogeneity
✓ Nature of Sample units remains same as in the population
Sampling Elements
1. Selection of a Sample
➢ Sample size
➢ Types of respondents
➢ Location of respondents
➢ Data collection method
2. Collection of Information
➢ Pilot study
➢ Final data collection
➢ Describing data
3. Making an Inference
➢ Estimation techniques to infer
Sampling Methods
❑ Probability Sampling
➢ Simple Random Sampling (unrestricted)
➢ Systematic Random Sampling (restricted)
➢ Stratified Random Sampling (restricted)
➢ Cluster Sampling (restricted)
❑ Non-Probability Sampling
❑ Judgement Sampling
❑ Convenience Sampling
❑ Quota Sampling
Probability Sampling - SRS
❑ Methods
✓ Lottery Method (With or without replacement)
✓ Random Numbers Table
❑ Merits
✓ Each unit has equal chance of selection
✓ Unbiased
✓ Easy to assess accuracy of the estimate
❑ Demerits
✓ Requires detailed information on each population unit
✓ Results are more dispersed than restricted random
sampling
Probability Sampling – Systematic
Sampling
❑ Method
✓ Select one unit at random and the remaining units on the basis of an
evenly spaced interval (k), calculated as: $k = \frac{N}{n}$
❑ Merits
✓ Useful when a list of the population is available
✓ Useful when population units are ordered
✓ Relatively simpler than SRS
❑ Demerits
✓ There is a chance of bias introduced by the investigator
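A minimal Python sketch of systematic selection follows. It assumes the population list is already ordered and that the interval is taken as the whole-number part of N/n; the function name and the `seed` parameter are introduced only for this illustration.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start in the first interval, then every k-th unit."""
    N = len(population)
    k = N // n                       # sampling interval k = N / n (floored)
    rng = random.Random(seed)
    start = rng.randrange(k)         # random start among the first k units
    return [population[start + j * k] for j in range(n)]

units = list(range(1, 101))          # population of 100 numbered units
sample = systematic_sample(units, n=10, seed=42)
print(sample)
```

Only the first unit is chosen at random; every later unit is fixed by the interval, which is what makes the method a restricted form of random sampling.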
Probability Sampling – Stratified
Sampling
❑ Method
✓ Divide the total population into mutually exclusive groups
(strata) and then use SRS for selection
❑ Properties of Good Stratified Sampling
✓ There should be marked difference between different
strata
✓ Homogeneity within each stratum
✓ Limited strata should be defined (≤ 6)
❑ Merits
✓ More representative and accurate but requires skilled
supervisors
Probability Sampling – Cluster
Sampling
❑ Method
✓ Random selection is made of primary, intermediate and
final units from a given population or stratum. Also known
as multi-stage sampling
❑ Merits
✓ Uses SRS at multiple stages
✓ Flexible and covers larger area or population
❑ Demerits
✓ Less accurate than single stage random sampling having
same number of final stage units
Difference between Stratified and
Cluster Sampling
❑ In stratified sampling, a random selection is made from every
stratum formed from the population
❑ In cluster sampling, the selection is made from randomly chosen
clusters of the population
❑ Procedure –
❑ In stratified sampling, first divide the population into strata and then
select a sample from each stratum – a one-step random approach
❑ In cluster sampling, study objects/groups are first partitioned out
of the population through random selection, and the final selection is
then made from them – a random approach with more than one step
Non-Probability Sampling - Judgement
❑ Merits
✓ Selection is based on the investigator’s judgement
✓ Useful in case of small population
✓ Useful in quick policy decisions
❑ Demerits
✓ Create bias
✓ No objective way to check the reliability of sampling results
Non-Probability Sampling -
Convenience
❑ Merits
✓ Selection is on the basis of the Convenience
✓ Useful for conducting pilot studies
❑ Demerits
✓ Create bias and produce unsatisfactory results
✓ No objective way to check the reliability of sampling
results
Non-Probability Sampling - Quota
❑ Merits
✓ Most commonly used Non-Prob. Sampling
✓ Quota is fixed on the basis of some criteria
✓ Investigator is free to choose respondent
✓ Provides satisfactory results if the interviewer is carefully
trained and follows instructions
❑ Demerits
✓ Create bias because of investigator’s judgement
✓ No objective way to check the reliability of sampling
results
Sample Size Determination
❑ The sample should be neither too small nor too large; it should be optimum
❑ In the words of Parten, “Optimum size is one that fulfills the
requirements of efficiency, representativeness, reliability and flexibility”
[R4: pp 81]
❑ Sample size depends upon – population size, availability of
resources, desired level of precision, type of population
(heterogeneous or homogeneous), study’s nature, sampling method,
nature of respondents
❑ Numerical Methods – Cochran’s and Slovin’s
Sample Size determination
Slovin’s Formula –
When you don’t know anything about the population, Slovin’s formula can be
helpful in determining the sample size
$n = \frac{N}{1 + N E^2}$
n – sample size
N – size of population
E – maximum allowed error
5 percent (0.05) or 1 percent (0.01) tolerance limit
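Slovin's formula is easy to evaluate directly. The Python sketch below assumes a population of 1,000 units (an illustrative figure, not from the slide) and rounds the result up to the next whole unit, which is a common practice rather than part of the formula.

```python
import math

def slovin_sample_size(N: int, E: float) -> int:
    """Slovin's formula n = N / (1 + N * E**2), rounded up."""
    return math.ceil(N / (1 + N * E ** 2))

print(slovin_sample_size(1000, 0.05))   # 5% tolerance
print(slovin_sample_size(1000, 0.01))   # 1% tolerance
```

A tighter tolerance raises the required sample sharply: for the assumed population of 1,000, moving from 5 percent to 1 percent error takes the sample from 286 to 910 units.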
❑ The decision to fix the sample size is based on three variables:
1. The margin of error a researcher can tolerate
$\text{Error} = Z \frac{\sigma}{\sqrt{n}}$
2. The level of confidence
• A high level of confidence is preferable (95% or higher)
3. The dispersion of the population being studied
• If the population is widely dispersed, a large sample is
required
• If the population SD is not known, one can calculate
the SD from a small sample taken for the pilot study
Solving the following equation for n:
$\text{Error} = Z \frac{\sigma}{\sqrt{n}}$
$n = \left(\frac{z\sigma}{E}\right)^2$
n – sample size
z – value of z corresponding to the desired level of confidence
E – maximum allowed error
σ – population SD
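The relationship between the margin of error and the sample size can be sketched in Python. The inputs below (z = 1.96 for 95% confidence, σ = 10, E = 2) are illustrative assumptions, and the result is rounded up so that the achieved error does not exceed E.

```python
import math

def sample_size(z: float, sigma: float, E: float) -> int:
    """n = (z * sigma / E)**2, rounded up to a whole unit."""
    return math.ceil((z * sigma / E) ** 2)

def margin_of_error(z: float, sigma: float, n: int) -> float:
    """Error = z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

n = sample_size(z=1.96, sigma=10, E=2)
print(n)                                  # 97
print(margin_of_error(1.96, 10, n))       # just under 2
```

Because n depends on the square of σ/E, halving the allowed error roughly quadruples the required sample.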
Statistical Terminology
❑ Parameter –
✓ A characteristic of a population – Any measurable characteristic
of a population
❑ Statistic –
✓ A characteristic of a sample
❑ Estimator –
✓ It is a statistic used to infer the value of an unknown parameter
– Method of estimation
❑ Estimate –
✓ Numerical value representing the estimate of the parameter on
the basis of sample
Sampling Errors
❑ Errors can occur at any stage of sampling while inferring about the
population
❑ Biased Errors – arise from any kind of bias at any stage of sampling
❑ Unbiased Errors – arise due to chance differences between the
members of the population included in the sample and those who are
not included.
Error = Value of Statistic – Value of Parameter
Note – As the sample size increases, the error due to chance differences
declines; however, the biased error does not decrease
Sampling Errors – Causes of Bias
❑ Fault in process of selection
❑ Fault in collection
❑ Usage of faulty method for analysis
Non-Sampling Errors
❑ Inadequate data specification
❑ Inappropriate statistical unit
❑ Inaccurate interview method
❑ Lack of experienced investigator
❑ Errors while data processing
❑ Errors during presentation
❑ … among others
Testing of Sample Reliability
❑ When the population characteristics are known, compare them with the
characteristics of the sample to check reliability
❑ Draw more than one sample from the same universe (population) and
compare the results of the samples to check reliability
❑ Draw sub-samples from a sample and calculate the results.
Compare them with the results of the main sample
Similar results in each case indicate that the sample is reliable
Reference
(Statistics) – Gupta, S.P., Statistical Methods, Sultan Chand and
Sons, 45th Revised Edition (2017)
TB - Statistics – Lind, Douglas A., Marchal, William G. and
Wathen, Samuel A., Statistical Techniques in Business and
Economics, McGraw-Hill, 14th International Edition (2010)
Sample Size determination Video:
https://www.youtube.com/watch?v=-pwQJYjWWMc&ab_channel=MohiniYadav
Module 2 – Summary
✓ Discussed the organization of raw data into some meaningful form
✓ Covered various graphical methods to present data
✓ Covered various locational measures to comment on one value
representing the entire data
✓ Covered various measures of dispersion showing the variability in
the data
✓ Covered various measures to comment on the shape of the data
✓ Covered sampling design and methods
✓ Two tutorial sheets with numerical problems are shared for practice
End of Module – 2
Describing Data – Graphically and Numerically
& Sampling Design
*****Happy Learning******
