22BC285 - Ayush Jain - ITDA - Sem4
Categorical Variable
A categorical variable is a type of variable used in statistics and data analysis that represents
qualitative characteristics or attributes rather than numerical values. Categorical variables can take on
a limited, predefined set of categories or groups, and each observation or data point is assigned to one
of these categories. These categories can represent characteristics such as types, labels, or attributes.
A categorical variable has values that can be placed into a countable number of distinct groups based on
a characteristic. For a nominal categorical variable, the categories have no natural order; if the
categories have a natural order, the variable is ordinal. Categorical variables are
also called qualitative variables or attribute variables.
For example, college major is a categorical variable that can have values such as psychology, political
science, engineering, biology, etc.
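As a minimal sketch of how such a variable is summarised, the snippet below counts observations per category. The list of majors is hypothetical sample data, not taken from the assignment.

```python
from collections import Counter

# Hypothetical sample: each observation is assigned to exactly one category.
majors = ["psychology", "engineering", "biology", "engineering",
          "political science", "psychology", "engineering"]

# Categorical data are summarised by counting, not by averaging.
counts = Counter(majors)
mode_category, mode_count = counts.most_common(1)[0]

for major, count in sorted(counts.items()):
    print(f"{major:18s} {count}")
print("Most frequent category:", mode_category)
```

Only counts and the most frequent category are meaningful here; a mean of the labels is not defined.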
The methods of collecting primary data can be further divided into quantitative data collection
methods (deals with factors that can be counted) and qualitative data collection methods (deals with
factors that are not necessarily numerical in nature).
A. Observation Method
Observation method is used when the study relates to behavioural science. This method is
planned systematically. It is subject to many controls and checks. The different types of
observations are:
Structured and unstructured observation
Controlled and uncontrolled observation
Participant, non-participant and disguised observation
B. Interview Method
In this method, data are collected as verbal responses. It can be carried out in two ways:
Personal Interview – In this method, a person known as an interviewer asks questions
face to face to the other person. The personal interview can be structured or
unstructured, direct investigation, focused conversation, etc.
Telephonic Interview – In this method, an interviewer obtains information by contacting
people on the telephone and asking questions or seeking their views verbally.
C. Questionnaire Method
In this method, a set of questions is mailed to the respondents, who read, answer and
then return the questionnaire. The questions are printed in a definite order on the form.
A good survey should have the following features:
Short and simple
Should follow a logical sequence
Provide adequate space for answers
Avoid technical terms
Should have a good physical appearance (colour, quality of the paper) to attract the
attention of the respondent
D. Schedules
This method is similar to the questionnaire method, with one difference: enumerators are
specially appointed to fill in the schedules. The enumerator explains the aims and objectives of the
investigation and clears up any misunderstandings that arise. Enumerators should be
trained to perform their job with diligence and patience.
E. Experiments
Experimental methods involve manipulating one or more variables to observe the effect on
another variable. Experiments are conducted in controlled settings where researchers can control
and manipulate variables. Experiments allow for establishing cause-and-effect relationships,
precise control over variables, and replication of findings. Experimental designs require careful
planning, randomization, and control to minimize confounding variables and ensure internal
validity.
A. Published Sources
a. Government Agencies
National statistics bureaus (e.g., U.S. Census Bureau, UK Office for National Statistics)
provide published reports and datasets on demographic, economic, and social indicators.
Health departments publish statistics on diseases, mortality rates, healthcare access, and
public health interventions.
Other government agencies publish reports, white papers, and statistical bulletins on topics
such as education, labor, transportation, crime, and the environment.
b. Academic Institutions
Universities and research institutions publish academic journals, research reports, theses,
dissertations, and conference papers covering various disciplines.
Institutional repositories provide access to scholarly works produced by faculty, researchers,
and students, including published research articles and reports.
e. Media Outlets
Newspapers, magazines, television networks, and online news websites publish articles,
reports, and multimedia content covering current events, social issues, and trends.
Archives of news articles and documentaries serve as published sources of information on
historical events, social movements, and cultural phenomena.
B. Unpublished Sources
a. Academic Institutions
Universities and research institutions maintain unpublished research data, survey responses,
field notes, and raw data collected by researchers for ongoing or completed studies.
Institutional repositories may include unpublished manuscripts, working papers, technical
reports, and datasets that have not yet been formally published.
Frequency Distribution:
Class Interval    Frequency
1-7               15
8-14              12
15-21             15
22-28             10
29-35             6
36-42             2
Total             60
Class Frequency
75-89 10
90-104 11
105-119 23
120-134 26
135-149 31
150-164 23
165-179 9
180-194 9
195-209 6
210-224 2
Histogram:
7) Prove that sum of deviations from the mean is 0. Use Equation Editor.
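To prove:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \]
Since $\frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \bar{x}$, therefore $x_1 + x_2 + x_3 + \dots + x_n = n\bar{x}$. Hence,
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0 \]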
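The result can also be checked numerically. The sketch below uses exact rational arithmetic so that rounding does not hide the zero; the function name and the sample values are illustrative assumptions.

```python
from fractions import Fraction

def sum_of_deviations(values):
    """Return the sum of (x_i - mean) using exact rational arithmetic."""
    xs = [Fraction(v) for v in values]
    mean = sum(xs) / len(xs)
    return sum(x - mean for x in xs)

sample = [3, 7, 8, 12, 20]  # hypothetical data
print(sum_of_deviations(sample))
```

With floating-point numbers the same sum may come out as a very small non-zero value, which is why fractions are used here.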
8) Find the weighted arithmetic mean of first n natural numbers, the weights being the
numbers themselves.
\[ \text{Weighted Mean} = \frac{\sum wx}{\sum w} = \frac{(1 \times 1) + (2 \times 2) + (3 \times 3) + \dots + (n \times n)}{1 + 2 + 3 + \dots + n} = \frac{1^2 + 2^2 + 3^2 + \dots + n^2}{1 + 2 + 3 + \dots + n} \]
Since the sum of squares of the first n natural numbers is $\frac{n(n+1)(2n+1)}{6}$
and the sum of the first n natural numbers is $\frac{n(n+1)}{2}$,
\[ \text{Weighted Mean} = \frac{\frac{n(n+1)(2n+1)}{6}}{\frac{n(n+1)}{2}} = \frac{1}{3}(2n+1) \]
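A short check of the closed form, assuming nothing beyond the definition of the weighted mean; the function name is illustrative.

```python
from fractions import Fraction

def weighted_mean_first_n(n):
    """Weighted mean of 1..n where each number is its own weight."""
    numbers = range(1, n + 1)
    return Fraction(sum(w * x for w, x in zip(numbers, numbers)), sum(numbers))

# Compare the direct computation with (2n + 1) / 3 for a few values of n.
for n in (1, 5, 10):
    assert weighted_mean_first_n(n) == Fraction(2 * n + 1, 3)
    print(n, weighted_mean_first_n(n))
```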
9) From the following table showing the wage distribution in a certain factory, determine
a) The mean wage
b) The median wage
c) The modal wage
d) The wage limits for the middle 50% of the wage earners
e) The percentage of workers who earned between Rs. 75 and Rs. 125
f) The percentage of workers who earned more than Rs. 150 per week
g) The percentage of workers who earned less than Rs. 100 per week
a. Mean
Mean= 107.03
b. Median
N/2= 87.5
Cf= 70
f= 40
l=100
h=20
Median= 108.75
c. Mode
f1= 40
f0= 30
f2= 35
l= 100
h= 20
Mode= 113.33
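The median and mode values above can be reproduced with the usual grouped-data formulas. This is a sketch under the assumption that N = 175 (inferred from N/2 = 87.5); the function names are illustrative.

```python
def grouped_median(l, n, cf, f, h):
    """Median = l + ((n/2 - cf) / f) * h for the median class."""
    return l + (n / 2 - cf) / f * h

def grouped_mode(l, f1, f0, f2, h):
    """Mode = l + ((f1 - f0) / (2*f1 - f0 - f2)) * h for the modal class."""
    return l + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Values as listed in parts b and c; n = 175 is inferred from N/2 = 87.5.
median = grouped_median(l=100, n=175, cf=70, f=40, h=20)
mode = grouped_mode(l=100, f1=40, f0=30, f2=35, h=20)
print(round(median, 2), round(mode, 2))
```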
Calculate Q1
N/4 = 43.75
Cf= 40
F= 30
L= 80
H= 20
Q1= 82.5
Calculate Q3
3(N/4) = 131.25
Cf= 110
F= 35
L= 120
H= 20
Q3= 132.14
Frequency Table
[Figure: Less Than Ogive. y-axis: Frequency (0 to 1000); x-axis ticks from -10 to 90]
Calculation of Median
N/2= 500
cf= 398
Median Class= 39.5-49.5
l= 39.5
f= 240
h= 10
\[ \text{Median} = l + \frac{\frac{N}{2} - cf}{f} \times h \]
Median = 43.75
11) Given below is the distribution of 140 candidates, obtaining marks X or higher in an
examination. (All marks are given in whole numbers). Calculate the mean, median and
mode of the distribution.
X C.F.
10 140
20 133
30 118
40 100
50 75
60 45
70 25
80 9
90 2
100 0
Frequency Table:
X      cf (more than)   f      fx      cf
10     140              7      70      7
20     133              15     300     22
30     118              18     540     40
40     100              25     1000    65
50     75               30     1500    95
60     45               20     1200    115
70     25               16     1120    131
80     9                7      560     138
90     2                2      180     140
100    0                0      0       140
Total                   140    6470
N=140
Calculation of Mean:
\[ \text{Mean} = \frac{\sum fx}{N} \]
Mean = 46.214
Calculation of Median:
\[ \frac{N+1}{2} = \frac{140+1}{2} = 70.5 \]
The first cumulative frequency greater than 70.5 is 95, and the corresponding X value is 50.
Hence,
Median = 50
Calculation of Mode:
Highest frequency is 30 and the corresponding X value to it is 50.
Hence,
Mode = 50
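The three results can be reproduced from the "X or higher" column alone. The sketch below follows the same convention as the working above, treating each X value itself as the representative mark for its row; the variable names are illustrative.

```python
# Marks X and number of candidates scoring X or higher (from the question).
marks = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
cf_more = [140, 133, 118, 100, 75, 45, 25, 9, 2, 0]

# Frequency of each row = this "more than" cf minus the next one.
freq = [a - b for a, b in zip(cf_more, cf_more[1:] + [0])]
n = sum(freq)

mean = sum(f * x for f, x in zip(freq, marks)) / n

# Median: X of the first less-than cumulative frequency exceeding (N + 1) / 2.
running, median = 0, None
for x, f in zip(marks, freq):
    running += f
    if running > (n + 1) / 2:
        median = x
        break

# Mode: X with the highest frequency.
mode = marks[freq.index(max(freq))]
print(n, round(mean, 3), median, mode)
```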
12) The following numbers give the weights of 55 students of a class. Prepare a suitable
frequency table.
Frequency Table:
Class Interval   Frequency   Less than   Cumulative Frequency
40-50            7           50          7
50-60            7           60          14
60-70            10          70          24
70-80            16          80          40
80-90            7           90          47
90-100           4           100         51
100-110          3           110         54
110-120          1           120         55
Total            55
Histogram:
Frequency Polygon:
Less Than Ogive: [figure; x-axis: upper class limits 50 to 120]
a. 1/2
b. 3
c. 2/3
d. 3
Calculation of Mean:
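\[ \text{Mean} = \frac{\sum X}{n} \]
Here, $\sum X = 1 + 2 + 3 + \dots + n$ and the number of observations is n.
Also, the sum of the first n natural numbers is $\frac{n(n+1)}{2}$, so
\[ \text{Mean} = \frac{\frac{n(n+1)}{2}}{n} = \frac{n+1}{2} \]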
Calculation of Variance:
\[ \text{Variance} = \frac{1}{n}\sum x^2 - \left(\frac{\sum x}{n}\right)^2 = \frac{1}{n} \times \frac{n(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 = \frac{n^2 - 1}{12} \]
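Both closed forms can be verified directly for small n. This is a sketch using exact fractions; the function name is illustrative, and the variance is the population variance (divisor n).

```python
from fractions import Fraction

def mean_and_variance_first_n(n):
    """Exact mean and population variance of 1, 2, ..., n."""
    xs = range(1, n + 1)
    mean = Fraction(sum(xs), n)
    variance = Fraction(sum(x * x for x in xs), n) - mean ** 2
    return mean, variance

# Compare with (n + 1) / 2 and (n^2 - 1) / 12.
for n in (1, 4, 10):
    mean, variance = mean_and_variance_first_n(n)
    assert mean == Fraction(n + 1, 2)
    assert variance == Fraction(n * n - 1, 12)
    print(n, mean, variance)
```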
15) Find the mean and standard deviation of the following distribution
x F
2.5-7.5 12
7.5-12.5 28
12.5-17.5 65
17.5-22.5 121
22.5-27.5 175
27.5-32.5 198
32.5-37.5 176
37.5-42.5 120
42.5-47.5 66
47.5-52.5 27
52.5-57.5 9
57.5-62.5 3
Frequency Table:
Calculation of Mean:
\[ \text{Mean} = \frac{\sum fx}{N} \]
Mean = 30.005
\[ \text{Standard Deviation} = \sqrt{\frac{\sum f (x - \text{mean})^2}{\sum f}} \]
Standard Deviation = 10.009
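The two figures can be reproduced from the class table, assuming each class is represented by its mid-point and the standard deviation uses the divisor N (population form). Variable names are illustrative.

```python
from math import sqrt

# Class boundaries and frequencies from the table in question 15.
lower = [2.5 + 5 * i for i in range(12)]
freq = [12, 28, 65, 121, 175, 198, 176, 120, 66, 27, 9, 3]

mid = [lo + 2.5 for lo in lower]          # class mid-points: 5, 10, ..., 60
n = sum(freq)
mean = sum(f * x for f, x in zip(freq, mid)) / n
variance = sum(f * (x - mean) ** 2 for f, x in zip(freq, mid)) / n
sd = sqrt(variance)
print(n, round(mean, 3), round(sd, 3))
```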
16) The following data gives the arithmetic averages and standard deviations of three
groups. Calculate the arithmetic average and standard deviation of the whole group.
Sub-group   No. of men   Average wages (in Rs.)   Standard deviation (in Rs.)
A           50           61                       8
B           100          70                       9
C           120          80.5                     10
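\[ \text{Combined Mean} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{50(61) + 100(70) + 120(80.5)}{270} = 73 \]
\[ \text{Combined SD} = \sqrt{\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) + n_3(\sigma_3^2 + d_3^2)}{n_1 + n_2 + n_3}} \]
where each d is the combined mean minus the sub-group mean:
D1= 12
D2= 3
D3= -7.5
Combined SD= 11.89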
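A minimal sketch of the combined mean and combined standard deviation for the three sub-groups, using the figures from the table in the question. Variable names are illustrative.

```python
from math import sqrt

# (number of men, average wage, standard deviation) for sub-groups A, B, C.
groups = [(50, 61.0, 8.0), (100, 70.0, 9.0), (120, 80.5, 10.0)]

total_n = sum(n for n, _, _ in groups)
combined_mean = sum(n * mean for n, mean, _ in groups) / total_n

# d is the gap between the combined mean and each group mean; only d**2 is used.
combined_var = sum(
    n * (sd ** 2 + (combined_mean - mean) ** 2) for n, mean, sd in groups
) / total_n
combined_sd = sqrt(combined_var)
print(combined_mean, round(combined_sd, 2))
```

Because the deviations are squared, the sign convention chosen for d does not affect the result.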
17) Define and provide formulas for Coefficient of Variation and Coefficient of Dispersion.
What is the use of the following measures?
The Coefficient of Variation (CV) and the Coefficient of Dispersion (CD) are statistical measures
used to assess the variability or spread of a dataset relative to its mean.
\[ CV = \frac{\sigma}{\mu} \times 100\% \]
Where:
𝜎 is the standard deviation of the dataset.
μ is the mean of the dataset.
The CD indicates how much the values in a dataset deviate from the mean. A CD greater than 1
suggests that the data are more dispersed or spread out compared to the mean, while a CD less
than 1 indicates less dispersion.
Uses:
A. Coefficient of Variation (CV):
It is commonly used in fields such as finance, economics, and biology to compare the
variability of datasets with different units or scales.
It helps in assessing the risk associated with an investment portfolio by comparing the
volatility (standard deviation) to the expected return (mean).
It aids in evaluating the consistency of processes or products in manufacturing and
quality control.
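To illustrate the first use, the sketch below compares the relative variability of two data sets measured in different units. The data are hypothetical, and the population standard deviation is assumed for σ.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = (population standard deviation / mean) * 100, in percent."""
    return pstdev(values) / mean(values) * 100

# Hypothetical data in different units: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 2), round(cv_weight, 2))
```

Both sets have the same standard deviation, yet the weights have the larger CV because their mean is smaller, which is the kind of comparison the raw standard deviation cannot make across units.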
In statistics, a positively skewed (or right-skewed) distribution is a type of distribution in which most
values are clustered around the left tail of the distribution while the right tail of the distribution is
longer. The positively skewed distribution is the direct opposite of the negatively skewed distribution.
Unlike with normally distributed data where all measures of the central tendency (mean, median, and
mode) equal each other, with positively skewed data, the measures are dispersed. The general
relationship among the central tendency measures in a positively skewed distribution may be
expressed using the following inequality: Mean > Median > Mode
In contrast to a negatively skewed distribution, in which the mean lies to the left of the peak
of the distribution, in a positively skewed distribution the mean lies to the right of the
distribution's peak. However, not all skewed distributions follow these rules. You may
encounter exceptions in real data that violate them.
19) Define and illustrate through an example leptokurtic, platykurtic and mesokurtic
distributions.
Kurtosis is a statistical measure that describes the shape of the distribution of data points in a dataset
relative to the normal distribution. A normal distribution has a kurtosis of 3, and distributions with
higher kurtosis are called leptokurtic, while those with lower kurtosis are called platykurtic.
Mesokurtic distributions have kurtosis equal to 3, similar to the normal distribution.
1. Leptokurtic Distribution:
Definition: Leptokurtic distributions have a higher peak and heavier tails compared to the
normal distribution, indicating more extreme values or outliers.
Example: A distribution of stock returns during a period of high market volatility may
exhibit leptokurtic behavior due to frequent large gains or losses.
Illustration: In a leptokurtic distribution, the data points cluster tightly around the mean,
with a taller, narrower peak and heavier tails than the normal distribution. The higher peak
indicates a greater concentration of values near the mean, while the tails extend
further outward, suggesting the presence of outliers.
2. Platykurtic Distribution:
Definition: Platykurtic distributions have a flatter peak and lighter tails compared to the
normal distribution, indicating fewer extreme values or outliers.
Example: A distribution of test scores in a classroom where the majority of students perform
similarly with few very high or very low scores may exhibit platykurtic behavior.
Illustration: In a platykurtic distribution, the data points are spread out more evenly across
the range of values, resulting in a lower peak and shorter tails compared to the normal
distribution. The distribution appears flatter, with less clustering around the mean and fewer
extreme values.
3. Mesokurtic Distribution:
Definition: Mesokurtic distributions have kurtosis equal to 3, similar to the normal
distribution, indicating a moderate concentration of data points around the mean with tails
similar to the normal distribution.
Example: The heights of adult males in a population often follow a mesokurtic distribution,
with most individuals clustered around the average height and fewer outliers at the extremes.
Illustration: In a mesokurtic distribution, the shape closely resembles the normal
distribution, with a moderate peak and tails extending to the left and right. The data points are
symmetrically distributed around the mean, and the distribution displays neither excessive
peakedness nor flatness compared to the normal distribution.
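The contrast between the shapes can be sketched numerically with the moment-based (Pearson) kurtosis, for which the normal distribution gives 3. The two data sets are hypothetical and chosen only to make the difference visible.

```python
def kurtosis(values):
    """Pearson kurtosis m4 / m2**2 (equals 3 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

# Hypothetical data sets chosen only to contrast the two shapes.
flat = list(range(1, 101))                 # evenly spread values
heavy_tailed = [0] * 98 + [-10, 10]        # tight centre plus two extremes

print(round(kurtosis(flat), 2), round(kurtosis(heavy_tailed), 2))
```

The evenly spread data give a value below 3 (platykurtic), while the data with a tight centre and two extreme values give a value well above 3 (leptokurtic).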
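\[ f(x) = \frac{1}{\sqrt{8\pi}} e^{-\frac{(x-1)^2}{8}}, \quad -\infty < x < \infty \]
The general form of the normal density is
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Here, 𝜇 is the mean of the distribution and 𝜎 is the standard deviation.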
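d. Calculation of $\int_{1}^{\infty} f(x)\,dx$
Given,
\[ f(x) = \frac{1}{\sqrt{8\pi}} e^{-\frac{(x-1)^2}{8}}, \quad -\infty < x < \infty \]
so,
\[ \int_{1}^{\infty} f(x)\,dx = \int_{1}^{\infty} \frac{1}{\sqrt{8\pi}} e^{-\frac{(x-1)^2}{8}}\,dx \]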
To solve this integral, it's helpful to recognize that the provided PDF is already normalized
(integrates to 1 over the entire real line). Therefore, the integral from 1 to infinity will be the
complement of the cumulative distribution function (CDF) evaluated at 1:
\[ \int_{1}^{\infty} f(x)\,dx = 1 - \int_{-\infty}^{1} f(x)\,dx \]
Since the normal distribution is symmetric about its mean, and here the mean is 𝜇 = 1, half of the
total area lies to the left of 𝑥 = 1:
\[ \int_{-\infty}^{1} f(x)\,dx = \frac{1}{2} \]
This is the area to the left of 𝑥=1, which corresponds to the CDF at 𝑥=1. So, the integral
∞