Lec 2 To 5 - Describing Data Sampling Design-2
2 Mathematical and
Statistical Methods
ECON F213
4
Data Collection
5
Sources of Data
Primary Data –
➢ The investigator collects the data directly, either through –
✓ Census technique
✓ Sample technique
Secondary Data –
➢ Existing data already collected by some other agency
6
Data Collection Methods (R4: pp 39-61)
Methods to collect Primary Data
✓ Direct Personal Interviews
✓ Indirect Oral Interviews
✓ Information from Correspondents
✓ Mailed Questionnaire
✓ Schedules sent through Enumerators
7
Direct Personal Interviews
Instruments –
✓ Interview Schedule – Structured / Semi-structured
Kinds of Interview –
✓ Structured
✓ Semi-Structured
Interviews conducted by the investigator over the phone are also a part of this method
Merits –
✓ Face-to-face interaction
✓ Hidden questions can be quickly asked
✓ Provide supplementary information
✓ Language can be adjusted based on conditions
✓ Possibility of more accurate information
8
Indirect Oral Interviews
Contacting third parties (witnesses) for the information
Instruments –
✓ Witnesses
✓ Trained interviewer
Merits –
✓ Suitable when the direct source of information does not exist
✓ Suitable when direct respondents are reluctant
Precautions –
✓ Don’t rely on the views of one person
✓ Take care while selecting a third person
9
Information from Correspondents
Instruments –
✓ Questionnaire given to correspondent for recording
information
Merits –
✓ Larger area can be covered
✓ Adopted by government for regular information
10
Mailed Questionnaire Method
A questionnaire is prepared and sent to the informants by post.
Instruments –
✓ Questionnaire – Disguised / non-disguised
Merits –
✓ Larger area can be covered
✓ Personal questions can be asked easily
Demerits –
✓ Can be adopted only when respondents are literate
✓ Difficult to check the accuracy of the responses to the questionnaire
11
Schedule Sent Through Enumerators
Hire enumerators and send them with schedules to be filled in from the actual
respondents
Instruments –
✓ Schedules of Interview
Merits –
✓ Filled by enumerators
✓ Can be adopted where informants are illiterate
Demerits –
✓ The investigator has to bear the cost of hiring enumerators
✓ Data are collected by too many enumerators
12
Types of Variables & Levels of
Measurement
Variable
Qualitative Quantitative
Interval Ratio
13
Types of Data and Their
Stacking
❑ Cross-sectional data
❑ Time-series data
❑ Panel data
14
Data Organization &
Presentation
15
Formation of a Frequency Distribution
17
Formation of Discrete Frequency Distribution
Exclusive Method –
❑ Fixed class intervals in which upper limit of one class is lower
limit of the next class
❑ It ensures continuity
❑ In such a case, a class includes only those values that are
greater than or equal to (≥) the lower limit but less than (<) the
upper limit
❑ To avoid confusion about which values fall in a class, the classes
can also be specified by description, as follows
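The exclusive rule can be illustrated with a minimal Python sketch. The marks, the starting limit, and the class width below are hypothetical values chosen only to show how a boundary value is treated; the function name is an assumption, not part of the lecture.

```python
def exclusive_frequency(values, lower, width, n_classes):
    """Count values into classes [lower, lower + width), [lower + width, ...), and so on."""
    counts = [0] * n_classes
    for v in values:
        idx = int((v - lower) // width)  # a value on a boundary goes to the higher class
        if 0 <= idx < n_classes:
            counts[idx] += 1
    classes = [(lower + k * width, lower + (k + 1) * width) for k in range(n_classes)]
    return list(zip(classes, counts))

# Hypothetical marks: 20 and 30 sit on class boundaries.
marks = [5, 10, 12, 19, 20, 20, 25, 30, 38, 39]
table = exclusive_frequency(marks, lower=0, width=10, n_classes=4)
for (lo, hi), f in table:
    print(f"{lo}-{hi}: {f}")
```

Here the value 20 is counted in the class 20-30, not in 10-20, which is exactly the ≥ lower limit and < upper limit rule.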
20
Frequency construction under Class Intervals
Inclusive Method –
❑ Fixed class intervals in which upper limit of one class is
included in that class itself
21
Points to Remember
➢ No hard and fast rules because everything depends upon the nature
of the data
➢ Still, preferable rules are as follows –
➢ The number of classes should preferably be between 5 and 20
➢ Take the class width as five or a multiple of five
➢ The starting point, i.e., the lower limit of the first class, should either
be zero or a multiple of 5
➢ Try adopting an exclusive method for getting correct class intervals.
In the case of the inclusive method, adjust to get the corrected class
interval. The process is given as follows:
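One common way to make this adjustment is sketched below in Python. It assumes the usual correction factor, half the gap between the upper limit of one class and the lower limit of the next; the class limits and the function name are hypothetical.

```python
def to_exclusive(inclusive_classes):
    """Convert inclusive limits such as (10, 19), (20, 29) into continuous class boundaries."""
    # Correction factor: half the gap between one class's upper limit
    # and the next class's lower limit (assumed equal for all classes).
    gap = inclusive_classes[1][0] - inclusive_classes[0][1]
    cf = gap / 2
    return [(lo - cf, hi + cf) for lo, hi in inclusive_classes]

inclusive = [(10, 19), (20, 29), (30, 39)]  # hypothetical inclusive classes
corrected = to_exclusive(inclusive)
print(corrected)
```

After the correction, the upper boundary of each class coincides with the lower boundary of the next, so the classes become continuous.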
22
Two way frequency distribution
23
Graphic Description of Data
❑ Qualitative Data –
❑ Bar Chart – shows the frequency or relative
frequency (e.g., vehicles sold…)
❑ Quantitative Data –
❑ Histogram – shows the frequency distribution
25
Lecture 3
Data Description
(Numerical Measures)
Numeric Description of Data
❑ Locational Measure – to identify the center of the
values
❑ Arithmetic mean (Simple & Weighted)
❑ Median
❑ Mode
❑ Geometric mean
27
Types of Data Series
Individual Series | Discrete Series             | Continuous Series
Height (in cm)    | Height (X) | Frequency (f)  | Height (X) | Frequency (f)
152               | 152        | 2              | 152-154    | 2
154               | 154        | 3              | 154-156    | 3
163               | 163        | 1              | 156-158    | 5
164               | 164        | 1              | 158-160    | 8
165               | 165        | 3              | 160-162    | 3
166               | 166        | 1              | 162-164    | 2
167               | 167        | 1              | 164-166    | 3
182               | 182        | 1              | 166-168    | 4
28
Arithmetic Mean
Individual Series – $\bar{X} = \frac{\sum X}{N}$
Discrete Series – $\bar{X} = \frac{\sum fX}{\sum f}$, where f is the frequency
Continuous Series – $\bar{X} = \frac{\sum fm}{\sum f}$, where m is the mid-point of each class
Merits –
✓ Single value
✓ Based on all values
✓ Easy to compute
✓ Sum of deviations taken from the actual mean, $\sum (X - \bar{X})$, is zero
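The three formulas can be checked with a short Python sketch that uses the height data from the Types of Data Series table. The variable names are illustrative assumptions.

```python
# Individual series: raw heights in cm
heights = [152, 154, 163, 164, 165, 166, 167, 182]
mean_individual = sum(heights) / len(heights)

# Discrete series: heights (X) with frequencies (f)
x = [152, 154, 163, 164, 165, 166, 167, 182]
f = [2, 3, 1, 1, 3, 1, 1, 1]
mean_discrete = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)

# Continuous series: class limits with frequencies; m is the class mid-point
classes = [(152, 154), (154, 156), (156, 158), (158, 160),
           (160, 162), (162, 164), (164, 166), (166, 168)]
fc = [2, 3, 5, 8, 3, 2, 3, 4]
mids = [(lo + hi) / 2 for lo, hi in classes]
mean_continuous = sum(fi * m for fi, m in zip(fc, mids)) / sum(fc)

# Sum of deviations from the actual mean is zero (up to rounding)
deviation_sum = sum(h - mean_individual for h in heights)

print(mean_individual, round(mean_discrete, 3), mean_continuous)
```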
Representations of Arithmetic Mean
Population mean – $\mu = \frac{\sum X}{N}$
Sample mean – $\bar{X} = \frac{\sum X}{n}$
Parameter & Statistic –
▪ Any measurable characteristic of a population is known
as a parameter
▪ Any measurable characteristic of a sample is known as a
statistic
30
Weighted Mean
❑ When the items in a data series differ in importance, the
weighted mean is a better average than the simple arithmetic mean
❑ Any measure of importance can be used as the weight
Individual Series – $\bar{X}_w = \frac{\sum wX}{\sum w}$
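A minimal Python sketch of the weighted mean follows. The scores and weights are hypothetical (for example, three course components with unequal credit), and the function name is an assumption.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * X) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

scores = [70, 80, 90]   # hypothetical component scores
weights = [1, 2, 3]     # hypothetical importance of each component

simple = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(simple, round(weighted, 2))
```

Because the highest score carries the largest weight, the weighted mean here is above the simple mean.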
32
Median – Positional Average
When the data contain very high or very low extreme values, the median is the
more useful measure
Discrete Series – Look at the cumulative frequency (c.f.) and find the total
equal to or next higher than $\frac{\sum f + 1}{2}$; the corresponding X is the median
Continuous Series – $M = L + \frac{\frac{\sum f}{2} - \text{c.f. of preceding class}}{\text{frequency of median class}} \times i$
where L is the lower limit of the median class and i is its class width
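Both median rules can be sketched in Python with the discrete and continuous height series from the Types of Data Series table. The function names are illustrative assumptions, and the continuous version assumes sorted, contiguous classes.

```python
def median_discrete(xs, fs):
    """Return the X whose cumulative frequency first reaches (sum(f) + 1) / 2."""
    target = (sum(fs) + 1) / 2
    cum = 0
    for x, f in zip(xs, fs):
        cum += f
        if cum >= target:
            return x

def median_continuous(classes, fs):
    """M = L + ((sum(f) / 2 - c.f. of preceding class) / f of median class) * i."""
    half = sum(fs) / 2
    cum = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, fs):
        if cum + f >= half:
            return lower + (half - cum) / f * (upper - lower)
        cum += f

x = [152, 154, 163, 164, 165, 166, 167, 182]
f = [2, 3, 1, 1, 3, 1, 1, 1]
classes = [(152, 154), (154, 156), (156, 158), (158, 160),
           (160, 162), (162, 164), (164, 166), (166, 168)]
fc = [2, 3, 5, 8, 3, 2, 3, 4]

m_discrete = median_discrete(x, f)
m_continuous = median_continuous(classes, fc)
print(m_discrete, m_continuous)
```

For the continuous series the median class is 158-160, since the cumulative frequency first reaches 15 there.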
34
Other Positional Measures
❑ Quartiles – divide the total frequency into four equal parts
35
Mode
A value that occurs most often in the data, i.e., with the greatest
frequency
Height in Centimeters (individual observations): 182, 152, 154, 166, 165, 165, 163, 164, 165, 167
Class | Frequency
152   | 1
154   | 1
163   | 1
164   | 1
165   | 3
166   | 1
167   | 1
182   | 1
Case of Ill-defined mode –
Mode = 3 Median – 2 Mean (by Karl Pearson)
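A short Python sketch of both approaches, using the ten heights from this slide. The empirical relation is applied here only for illustration; it is meant for cases where the mode is ill-defined.

```python
from collections import Counter
from statistics import mean, median

heights = [182, 152, 154, 166, 165, 165, 163, 164, 165, 167]

# Mode by counting: the value with the greatest frequency
counts = Counter(heights)
mode_value, mode_freq = counts.most_common(1)[0]

# Karl Pearson's empirical relation: Mode = 3 Median - 2 Mean
empirical_mode = 3 * median(heights) - 2 * mean(heights)

print(mode_value, mode_freq, round(empirical_mode, 1))
```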
36
Relative Position of Mean, Median,
and Mode
37
Geometric Mean
❑ The antilog of the arithmetic mean of the logs of the values; for positive
data its value is always less than or equal to the arithmetic mean
❑ It is useful for finding the average change in percentages,
ratios, indexes, or growth rates over time
Individual Series – $G.M. = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$
$G.M. = \text{Antilog}\left(\frac{\sum \log X}{N}\right)$
Discrete Series – $G.M. = \text{Antilog}\left(\frac{\sum f \log X}{\sum f}\right)$
Continuous Series – $G.M. = \text{Antilog}\left(\frac{\sum f \log m}{\sum f}\right)$
38
Geometric Mean - Calculations
Problem: Increment in income in first year = 5 percent and increment in income
in second year = 15 percent. Calculate the average increment in income.
Hint: Use GM, because the data are given as percentage changes
Step 1: Convert the data into growth factors: a 5 percent hike on the previous income
means 105% (or 1.05) and a 15 percent hike means 115% (or 1.15)
Step 2: Calculate GM using the converted data:
$G.M. = \sqrt{1.05 \times 1.15} = 1.09886$
The average annual percent increment in income is 9.886 % (or 0.09886)
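The calculation can be reproduced in Python, both as the n-th root of the product and through logarithms. This is a minimal sketch; the variable names are illustrative.

```python
import math

factors = [1.05, 1.15]  # growth factors for the two years
n = len(factors)

# n-th root of the product
gm_product = math.prod(factors) ** (1 / n)

# Antilog of the mean of the logs (natural logs used here)
gm_logs = math.exp(sum(math.log(x) for x in factors) / n)

average_increment_pct = (gm_product - 1) * 100
print(round(gm_product, 5), round(average_increment_pct, 3))
```

The simple arithmetic mean of 5 and 15 percent would give 10 percent, slightly overstating the compound average growth.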
40
Lecture 4
Data Description
(Numerical Measures)
Measure of Dispersion
❑ To measure Variability of the observations
Wage Earner (ID) | Wages in Factory A | Wages in Factory B | Wages in Factory C
L1               | 100                | 97                 | 15
L2               | 100                | 105                | 395
L3               | 100                | 102                | 52
L4               | 100                | 103                | 33
L5               | 100                | 93                 | 5
Total            | 500                | 500                | 500
Mean (AM)        | 100                | 100                | 100
Merits –
✓ Single value representing variability, which serves as a basis for controlling it
✓ It is an average of an average (a second-order average); it determines the
reliability of an average
✓ Facilitates the use of other statistical measures
Range – Positional Measure
❑ Difference between the value of the largest (L) item and
the value of the smallest (S) item
In a frequency distribution –
Decision –
43
Mean Deviation – True Measure
❑ Average difference between the items in a distribution
and the average (mean or median) value of that series
Individual Series – $MD = \frac{1}{N} \sum |X - \text{Avg}|$
Discrete Series – $MD = \frac{1}{\sum f} \sum f\,|X - \text{Avg}|$
Continuous Series – $MD = \frac{1}{\sum f} \sum f\,|m - \text{Avg}|$
Decision –
✓ In case of small MD, the distribution is compact or
uniform
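As a minimal Python sketch, the mean deviation of the Factory B wages from the dispersion table can be taken about the mean or about the median. The function name is an assumption.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)

wages_b = [97, 105, 102, 103, 93]  # Factory B

md_about_mean = mean_deviation(wages_b, mean(wages_b))
md_about_median = mean_deviation(wages_b, median(wages_b))
print(md_about_mean, md_about_median)
```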
44
Standard Deviation
❑ Square root of the mean of the squared deviations from the
arithmetic mean.
Individual Series – $\sigma = \sqrt{\frac{\sum x^2}{N}}$, where $x = X - \bar{X}$
Discrete Series – $\sigma = \sqrt{\frac{\sum f x^2}{N}}$, where $N = \sum f$
Decision –
✓ Greater the SD, greater is the magnitude of
deviations of the values from their mean
45
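Standard Deviation – Cont…
❑ Use of an assumed mean (A) instead of the actual mean
Individual Series – $\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$, where $d = X - A$
Discrete Series – $\sigma = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2}$
Continuous Series – Step Deviation Method
$\sigma = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2} \times i$, where $d = \frac{m - A}{i}$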
46
Variance
Decision –
✓ More variance implies greater variability
47
Coefficient of Variation (CV) –
Relative Measure
❑ Used to compare the variability among distributions
$CV = \frac{\sigma}{\bar{X}} \times 100$
Decision –
✓ The higher the CV, the greater the variability
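A minimal Python sketch of the CV, applied to the three factories from the dispersion table. All three have the same mean wage of 100, so the CV separates them purely on variability. The population SD (divisor N) is used, and the function name is an assumption.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = (population SD / mean) * 100."""
    return pstdev(values) / mean(values) * 100

factory_a = [100, 100, 100, 100, 100]
factory_b = [97, 105, 102, 103, 93]
factory_c = [15, 395, 52, 33, 5]

cv_a = coefficient_of_variation(factory_a)
cv_b = coefficient_of_variation(factory_b)
cv_c = coefficient_of_variation(factory_c)
print(round(cv_a, 2), round(cv_b, 2), round(cv_c, 2))
```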
48
Uses of Standard Deviation
❑ Chebyshev developed a theorem that allows us to determine the
minimum proportion of the values that lie within a specified number of
SD of the mean
Theorem – For any set of observations (sample or population), the proportion of
the values that lie within k SDs of the mean is at least $1 - \frac{1}{k^2}$, where k is any value
greater than 1. The relationship applies regardless of the shape of the distribution.
For a symmetrical and bell-shaped distribution, one can be more precise in
explaining the dispersion about the mean
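Chebyshev's bound can be sketched in Python and compared with the observed proportion for a small data set. The Factory C wages from the dispersion table are used as an example of a far-from-symmetric distribution; the function names are assumptions.

```python
from statistics import mean, pstdev

def chebyshev_min_proportion(k):
    """Minimum proportion of values within k SDs of the mean: 1 - 1/k^2 (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def observed_proportion(values, k):
    """Share of values actually lying within k SDs of the mean."""
    m, s = mean(values), pstdev(values)
    inside = [x for x in values if abs(x - m) <= k * s]
    return len(inside) / len(values)

factory_c = [15, 395, 52, 33, 5]

bound_2 = chebyshev_min_proportion(2)
bound_3 = chebyshev_min_proportion(3)
observed_2 = observed_proportion(factory_c, 2)
print(bound_2, round(bound_3, 4), observed_2)
```

The observed proportion can exceed the bound; the theorem only guarantees the minimum.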
Where,
$d_1 = \bar{X}_1 - \bar{X}_{12}$
$d_2 = \bar{X}_2 - \bar{X}_{12}$
$\bar{X}_{12}$ is the combined mean
50
Population & Sample Difference
Sample SD – $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
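The difference between the two divisors can be seen with Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n − 1. The Factory B wages are treated first as a whole population and then as a sample, purely for illustration.

```python
from statistics import pstdev, stdev

wages_b = [97, 105, 102, 103, 93]

population_sd = pstdev(wages_b)  # divisor N
sample_sd = stdev(wages_b)       # divisor n - 1

print(round(population_sd, 4), round(sample_sd, 4))
```

The sample SD is always the larger of the two, and the gap shrinks as n grows.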
52
Lecture 5
Data Description
(Numerical Measures)
Skewness
Why Skewness –
Two distributions can have the same mean and standard deviation and yet
differ in their overall appearance
❑ Any measure of skewness indicates the difference between the manner in
which items are distributed in a particular distribution and the manner in which
they are distributed in a symmetrical (normal) distribution [R4: PP 338]
54
Measures of Skewness - Absolute
$Sk = \text{Mean} - \text{Mode}$
$Sk = Q_3 + Q_1 - 2\,\text{Median}$
Decision –
✓ Positive difference means positively skewed distribution
and vice versa
55
Measures of Skewness - Relative
The Karl Pearson’s coefficient of skewness –
$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
57
Measures of Skewness - Relative
The Bowley’s coefficient of skewness –
$SK_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
✓ Based on quartiles
✓ Its numerical value lies between -1 and +1
58
Measures of Skewness - Relative
The Kelly’s coefficient of skewness –
$SK_K = \frac{D_1 + D_9 - 2\,\text{Median}}{D_9 - D_1}$
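The three relative measures can be sketched in Python as below. The data set is hypothetical (a small right-skewed series). Quartiles and deciles come from `statistics.quantiles` with its default method, which is an assumption: textbook conventions for locating quartiles differ slightly, so hand-computed values may not match exactly.

```python
from statistics import mean, median, mode, pstdev, quantiles

def pearson_skewness(data):
    """(Mean - Mode) / Standard Deviation."""
    return (mean(data) - mode(data)) / pstdev(data)

def bowley_skewness(data):
    """(Q3 + Q1 - 2 Median) / (Q3 - Q1)."""
    q1, _, q3 = quantiles(data, n=4)
    return (q3 + q1 - 2 * median(data)) / (q3 - q1)

def kelly_skewness(data):
    """(D1 + D9 - 2 Median) / (D9 - D1)."""
    deciles = quantiles(data, n=10)
    d1, d9 = deciles[0], deciles[-1]
    return (d1 + d9 - 2 * median(data)) / (d9 - d1)

# Hypothetical right-skewed data: a long tail towards high values
data = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]

sk_p = pearson_skewness(data)
sk_b = bowley_skewness(data)
sk_k = kelly_skewness(data)
print(round(sk_p, 3), round(sk_b, 3), round(sk_k, 3))
```

All three coefficients come out positive for this series, in line with the decision rule that a positive difference means a positively skewed distribution.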
59
Moments
Why Moments –
The first four moments suffice to show the basic
characteristics – location, scattered-ness, asymmetry and
peaked-ness
60
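Moments – about Mean
Individual Series (with $x = X - \bar{X}$)
$\mu_1 = \frac{\sum (X - \bar{X})}{N} = \frac{\sum x}{N}$
$\mu_2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{\sum x^2}{N}$
$\mu_3 = \frac{\sum (X - \bar{X})^3}{N} = \frac{\sum x^3}{N}$
$\mu_4 = \frac{\sum (X - \bar{X})^4}{N} = \frac{\sum x^4}{N}$
Frequency Distribution
$\mu_1 = \frac{\sum f (X - \bar{X})}{N} = \frac{\sum f x}{N}$
$\mu_2 = \frac{\sum f (X - \bar{X})^2}{N} = \frac{\sum f x^2}{N}$
$\mu_3 = \frac{\sum f (X - \bar{X})^3}{N} = \frac{\sum f x^3}{N}$
$\mu_4 = \frac{\sum f (X - \bar{X})^4}{N} = \frac{\sum f x^4}{N}$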
61
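Moments – about Arbitrary Origin
Individual Series
$\mu_1' = \frac{\sum (X - A)}{N}$
$\mu_2' = \frac{\sum (X - A)^2}{N}$
$\mu_3' = \frac{\sum (X - A)^3}{N}$
$\mu_4' = \frac{\sum (X - A)^4}{N}$
Frequency Distribution
$\mu_1' = \frac{\sum f (X - A)}{N}$
$\mu_2' = \frac{\sum f (X - A)^2}{N}$
$\mu_3' = \frac{\sum f (X - A)^3}{N}$
$\mu_4' = \frac{\sum f (X - A)^4}{N}$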
62
Moments – Conversion
$\mu_1 = \mu_1' - \mu_1' = 0$
$\mu_2 = \mu_2' - (\mu_1')^2$
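The conversion for the second moment can be verified numerically with a small Python sketch. The Factory B wages and the arbitrary origin A = 95 are illustrative choices; the function name is an assumption.

```python
def moment_about(data, origin, r):
    """r-th moment of an individual series about a given origin."""
    return sum((x - origin) ** r for x in data) / len(data)

data = [97, 105, 102, 103, 93]  # Factory B wages
actual_mean = sum(data) / len(data)
A = 95                          # arbitrary origin (assumed)

# Moments about the arbitrary origin
mu1_prime = moment_about(data, A, 1)
mu2_prime = moment_about(data, A, 2)

# Moments about the mean
mu1 = moment_about(data, actual_mean, 1)
mu2 = moment_about(data, actual_mean, 2)

converted_mu2 = mu2_prime - mu1_prime ** 2
print(mu1, mu2, converted_mu2)
```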
63
Moments – about Zero
$v_1 = \frac{\sum fX}{N}$; $v_2 = \frac{\sum fX^2}{N}$; $v_3 = \frac{\sum fX^3}{N}$; $v_4 = \frac{\sum fX^4}{N}$
Summary of Moments –
64
Measures of Skewness - Relative
Moment based Measure –
$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$
$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}}$
66
Kurtosis
Why Kurtosis –
To know the peaked-ness of the frequency distribution
curve
Three variants –
✓ More peaked than normal curve called Leptokurtic
✓ More flat than normal curve called Platykurtic
✓ Same as the normal curve is called Mesokurtic
67
Measures of Kurtosis
Moment based Measure –
$\beta_2 = \frac{\mu_4}{\mu_2^2}$
$\gamma_2 = \beta_2 - 3$
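The moment-based measures of skewness and kurtosis can be computed together, as in this minimal Python sketch. The data set is a hypothetical symmetric series, so the skewness measures should be zero; the function names are assumptions.

```python
def central_moment(data, r):
    """r-th moment about the mean for an individual series."""
    m = sum(data) / len(data)
    return sum((x - m) ** r for x in data) / len(data)

def shape_measures(data):
    """Moment-based skewness (beta1, gamma1) and kurtosis (beta2, gamma2)."""
    mu2 = central_moment(data, 2)
    mu3 = central_moment(data, 3)
    mu4 = central_moment(data, 4)
    beta2 = mu4 / mu2 ** 2
    return {
        "beta1": mu3 ** 2 / mu2 ** 3,
        "gamma1": mu3 / mu2 ** 1.5,
        "beta2": beta2,
        "gamma2": beta2 - 3,
    }

symmetric = [1, 2, 3, 4, 5]  # hypothetical symmetric data
result = shape_measures(symmetric)
print(result)
```

A negative $\gamma_2$, as here, indicates a curve flatter than the normal curve (platykurtic).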
69
Sampling Design
70
Sampling – An Introduction
❑ Process of learning about the population on the basis of sample
drawn from it
Why Sampling –
✓ Difficult to reach entire population
✓ Financial constraints
✓ Time constraints
✓ Sometimes sampling is sufficient even if funds and time are
available
✓ Studying entire population is destructive sometimes
71
Laws of Sampling
❑ Law of Statistical Regularity
72
Essentials of Sampling
❑ Representativeness
✓ Random selection is the key
❑ Adequacy
✓ Size of sample should be large enough
❑ Independence
✓ Selection of one item in one draw has no influence on
probability of selection in any other draw
❑ Homogeneity
✓ Nature of Sample units remains same as in the population
73
Sampling Elements
1. Selection of a Sample
➢ Sample size
➢ Types of respondents
➢ Location of respondents
➢ Data collection method
2. Collection of Information
➢ Pilot study
➢ Final data collection
➢ Describing data
3. Making an Inference
➢ Estimation techniques to infer
74
Sampling Methods
❑ Probability Sampling
➢ Simple Random Sampling (unrestricted)
➢ Systematic Random Sampling (restricted)
➢ Stratified Random Sampling (restricted)
➢ Cluster Sampling (restricted)
❑ Non-Probability Sampling
❑ Judgement Sampling
❑ Convenience Sampling
❑ Quota Sampling
75
Probability Sampling - SRS
❑ Methods
✓ Lottery Method (With or without replacement)
✓ Random Numbers Table
❑ Merits
✓ Each unit has equal chance of selection
✓ Unbiased
✓ Easy to assess accuracy of the estimate
❑ Demerits
✓ Requires detailed information on each population unit
✓ Results are more dispersed than restricted random
sampling
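The lottery method can be imitated in Python with the standard `random` module. The population of 100 numbered units and the seed are hypothetical; `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

rng = random.Random(42)           # fixed seed so the draw can be repeated
population = list(range(1, 101))  # hypothetical units numbered 1..100

# Lottery without replacement: no unit can be drawn twice
srs_without = rng.sample(population, k=10)

# Lottery with replacement: a unit may be drawn more than once
srs_with = rng.choices(population, k=10)

print(srs_without)
print(srs_with)
```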
76
Probability Sampling – Systematic
Sampling
❑ Method
✓ Select one unit at random and the remaining units at an
evenly spaced interval (k), calculated as $k = \frac{N}{n}$
❑ Merits
✓ Useful when a list of the population is available
✓ Useful when population units are ordered
✓ Relatively simpler than SRS
❑ Demerits
✓ Chance of bias introduced by the investigator
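A minimal Python sketch of systematic sampling follows. It assumes N is an exact multiple of n and that the random start is drawn from the first k units; the population and the function name are hypothetical.

```python
import random

def systematic_sample(population, n, rng):
    """Pick a random start in the first k units, then every k-th unit (k = N // n)."""
    k = len(population) // n
    start = rng.randrange(k)  # random start among the first k positions
    return [population[start + j * k] for j in range(n)]

population = list(range(1, 101))  # hypothetical ordered list, N = 100
sample = systematic_sample(population, n=10, rng=random.Random(7))
print(sample)
```

Only the first unit is random; once it is drawn, the rest of the sample is fixed by the interval.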
77
Probability Sampling – Stratified
Sampling
❑ Method
✓ Divide the total population into mutually exclusive groups
(strata) and then use SRS for selection
❑ Properties of Good Stratified Sampling
✓ There should be marked difference between different
strata
✓ Homogeneity within each stratum
✓ Limited strata should be defined (≤ 6)
❑ Merits
✓ More representative and accurate but requires skilled
supervisors
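Stratified sampling can be sketched in Python as SRS within each stratum. The strata below are hypothetical, and proportional allocation (each stratum contributes in proportion to its size) is an assumption added for illustration; the slide itself does not specify how the sample is allocated across strata.

```python
import random

def stratified_sample(strata, n, rng):
    """SRS within each stratum, with sample sizes proportional to stratum sizes."""
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Rounded proportional share; with awkward proportions these may not sum exactly to n
        n_h = round(n * len(units) / total)
        sample[name] = rng.sample(units, n_h)
    return sample

strata = {
    "urban": [f"U{i}" for i in range(60)],  # hypothetical stratum of 60 units
    "rural": [f"R{i}" for i in range(40)],  # hypothetical stratum of 40 units
}
picked = stratified_sample(strata, n=10, rng=random.Random(0))
print(picked)
```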
78
Probability Sampling – Cluster
Sampling
❑ Method
✓ Random selection is made of primary, intermediate and
final units from a given population or stratum. Also known
as multi-stage sampling
❑ Merits
✓ SRS is used at multiple stages
✓ Flexible and covers larger area or population
❑ Demerits
✓ Less accurate than single stage random sampling having
same number of final stage units
79
Difference between Stratified and
Cluster Sampling
❑ In stratified sampling, random selection is made from each of the
strata formed from the population
❑ In cluster sampling, selection is done out of randomly created
clusters from the population
❑ Procedure –
❑ In stratified sampling, first divide the population into strata and then
select a sample from each stratum – One step random approach
80
Non-Probability Sampling - Judgement
❑ Merits
✓ Selection is on the basis of the investigator's judgement
✓ Useful in case of small population
✓ Useful in quick policy decisions
❑ Demerits
✓ Create bias
✓ No objective way to check the reliability of sampling results
81
Non-Probability Sampling -
Convenience
❑ Merits
✓ Selection is on the basis of convenience
✓ Useful for conducting pilot studies
❑ Demerits
✓ Create bias and produce unsatisfactory results
✓ No objective way to check the reliability of sampling
results
82
Non-Probability Sampling - Quota
❑ Merits
✓ Most commonly used Non-Prob. Sampling
✓ Quota is fixed on the basis of some criteria
✓ Investigator is free to choose respondent
✓ Provides satisfactory results if the interviewer is carefully
trained and follows instructions
❑ Demerits
✓ Create bias because of investigator’s judgement
✓ No objective way to check the reliability of sampling
results
83
Sample Size Determination
❑ The sample should be neither too small nor too large; it should be optimum
84
Sample Size determination
Slovin’s Formula –
85
Sample Size determination
❑ The decision to fix the sample size is based on three variables:
1. The margin of error a researcher can tolerate
$\text{Error} = Z \frac{\sigma}{\sqrt{n}}$
2. The level of confidence
• A high level of confidence is preferable (95 % or higher)
86
Sample Size determination
Solving the following equation for n:
$E = z \frac{\sigma}{\sqrt{n}}$
$n = \left(\frac{z\sigma}{E}\right)^2$
n – sample size
z – value of z corresponding to desired level of confidence
E – maximum allowed error
$\sigma$ – population SD
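The formula can be applied with a short Python sketch. The inputs are hypothetical: z = 1.96 (the usual value for 95 % confidence), a population SD of 20, and a maximum allowed error of 5. Rounding up to the next whole unit is a common convention assumed here.

```python
import math

def required_sample_size(z, sigma, error):
    """n = (z * sigma / E)^2, rounded up to the next whole unit."""
    return math.ceil((z * sigma / error) ** 2)

n_required = required_sample_size(z=1.96, sigma=20, error=5)
print(n_required)

# Halving the allowed error roughly quadruples the required sample size
n_tighter = required_sample_size(z=1.96, sigma=20, error=2.5)
print(n_tighter)
```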
87
Statistical Terminology
❑ Parameter –
✓ A characteristic of a population – Any measurable characteristic
of a population
❑ Statistic –
✓ A characteristic of a sample
❑ Estimator –
✓ It is a statistic used to infer the value of an unknown parameter
– Method of estimation
❑ Estimate –
✓ Numerical value representing the estimate of the parameter on
the basis of sample
88
Sampling Errors
❑ Errors occur at any stage of sampling while inferring about the
population
❑ Biased Errors – Arise from any kind of bias at any stage of sampling
89
Sampling Errors – Causes of Bias
❑ Fault in process of selection
❑ Fault in collection
90
Non-Sampling Errors
❑ Inadequate data specification
❑ … among others
91
Testing of Sample Reliability
❑ In case of known population characteristics, compare the
characteristics of sample and check the reliability
92
Reference
93
Module 2 – Summary
✓ Discussed the organization of raw data into some meaningful form
✓ Covered various graphical methods to present data
✓ Covered various locational measures to comment on one value
representing the entire data
✓ Covered various measures of dispersion showing the variability in
the data
✓ Covered various measures to comment on the shape of the data
✓ Covered sampling design and methods
✓ Two tutorial sheets with numerical problems are shared for practice
94
End of Module – 2
Describing Data – Graphically and Numerically
& Sampling Design
*****Happy Learning******
95