
Statistics for Data Science - 1

1. The mean annual college fees paid by all students in a college is ₹55 lakhs. The mean
annual college fees paid by male and female students of the college are ₹40 lakhs and
₹60 lakhs respectively. Then, the percentage of male students studying in the college is

[3 marks]

(a) 60%
(b) 50%
(c) 20%
(d) 25%
(e) 30%

Let the number of male students in the college be $n_1$.
Let the number of female students in the college be $n_2$.
Given that the mean annual college fees paid by all students in the college is ₹55 lakhs,

$$\frac{\text{Sum of all male students' fees} + \text{Sum of all female students' fees}}{n_1 + n_2} = 55$$

$$\Rightarrow \text{Sum of all male students' fees} + \text{Sum of all female students' fees} = 55n_1 + 55n_2 \quad (1)$$

Also, given that the mean annual college fees paid by male students of the college is ₹40 lakhs,

$$\frac{\text{Sum of all male students' fees}}{n_1} = 40$$

$$\Rightarrow \text{Sum of all male students' fees} = 40n_1 \quad (2)$$

And the mean annual college fees paid by female students of the college is ₹60 lakhs,

$$\frac{\text{Sum of all female students' fees}}{n_2} = 60$$

$$\Rightarrow \text{Sum of all female students' fees} = 60n_2 \quad (3)$$

Using equations (2) and (3) in equation (1), we get

$$55n_1 + 55n_2 = 40n_1 + 60n_2$$

$$\Rightarrow n_2 = 3n_1$$

The percentage of male students studying in the college is

$$\frac{n_1}{n_1 + n_2} \times 100 = \frac{n_1}{n_1 + 3n_1} \times 100 = \frac{1}{4} \times 100 = 25$$
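The same answer can be checked by treating the overall mean as a weighted average of the two group means. The sketch below is only an illustration; `male_share` is a hypothetical helper name, not something defined in the question.

```python
def male_share(overall_mean, male_mean, female_mean):
    """Return the fraction p of males that solves
    p * male_mean + (1 - p) * female_mean = overall_mean."""
    return (female_mean - overall_mean) / (female_mean - male_mean)


# Fees in lakhs, taken from the question.
share = male_share(overall_mean=55, male_mean=40, female_mean=60)
print(f"Male students: {share * 100:.0f}%")
```

Solving for the fraction directly avoids introducing the head counts, since only their ratio matters.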

2. By multiplying each of the numbers 4, 5, 7, 11, 13 by 4 and then adding 7 to each of
them, we obtain a new dataset. Then, the difference between the sample variance of the
new dataset and the sample variance of the old dataset is
Answer: 225 [3 marks]

Let the sample variance of the dataset 4, 5, 7, 11, 13 be $\sigma^2$. Multiplying each of the
numbers by 4 multiplies the sample variance by $4^2$, giving $4^2\sigma^2$. Adding the same
number to every point of the dataset does not change the sample variance. Hence the value of
the sample variance of the new dataset is $16\sigma^2$.
Then the difference between the sample variance of the new dataset and the sample
variance of the old dataset is $16\sigma^2 - \sigma^2 = 15\sigma^2$.

$$\text{Mean of the dataset} = \frac{4 + 5 + 7 + 11 + 13}{5} = 8$$

Now we calculate the value of $\sigma^2$.

$$\sigma^2 = \frac{(4 - 8)^2 + (5 - 8)^2 + (7 - 8)^2 + (11 - 8)^2 + (13 - 8)^2}{5 - 1} = \frac{60}{4} = 15$$

This implies that the difference between the sample variance of the new dataset and the
sample variance of the old dataset is $15\sigma^2 = 15 \times 15 = 225$.
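The scaling rule used above can be confirmed numerically. This is a minimal sketch using the standard-library `statistics.variance`, which computes the sample variance (divisor n − 1); the variable names are chosen here for illustration.

```python
from statistics import variance

old = [4, 5, 7, 11, 13]
new = [4 * x + 7 for x in old]  # multiply by 4, then add 7

old_var = variance(old)  # sample variance of the original data
new_var = variance(new)  # sample variance after the transformation
difference = new_var - old_var

print(old_var, new_var, difference)
```

The shift by 7 has no effect on the result; only the factor 4 matters, and it enters as 4 squared.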

3. Consider various variables that describe the specifications of flats owned by a builder.
These variables include price of flat, area of flat, BHK (number of bedrooms attached
with 1 hall and 1 kitchen), furnishing (furnished, semi furnished or unfurnished), and
locality. The builder owns 400 flats whose specifications are then organised in a data
table. Based on this information, choose the correct option(s) from below. [1 mark]

(a) The number of variables in the data table is 5.


(b) The number of cases/observations in the data table is 5.
(c) Furnishing is a categorical variable.
(d) Price of flat is a numerical variable.
(e) Area of flat is a discrete numerical variable.
(f) Locality is a numerical variable.

Answer: Option (a):


Variable is the characteristic or attribute that varies across all units. Here, price of flat,
area of flat, BHK(number of bedrooms attached with 1 hall and 1 kitchen), furnishing
(furnished, semi furnished or unfurnished), and locality are the five variables.
Hence, option (a) is correct.

Option (b):
Cases/observations are the data points collected to study. Here, specifications are
collected for 400 flats, thus the number of cases/observations is 400.
Hence, option (b) is wrong.

Option (c):
Furnishing values are just names that belong to a fixed set of categories. They do not hold
numeric properties; therefore, furnishing is a categorical variable.
Hence, option (c) is correct.

Option (d):
Price of flat has numeric properties, i.e., we can do arithmetic operations on price;
therefore, price is a numerical variable.
Hence, option (d) is correct.

Option (e):
Again, area has numeric properties, that is, we can do arithmetic operations on area.
Therefore, area is a numerical variable. But area can take any real value within a given
range, hence it is a continuous variable, not a discrete one.
Hence, option (e) is wrong.

4. If the variance of a set of non-zero observations is zero, you can conclude [1 mark]

(a) that the observations have same number of positive and negative data points.
(b) that the mean (average) value is zero.
(c) that all observations are the same value.
(d) that a mistake in calculation has been made.
(e) none of the above.

Let $x_1, x_2, \ldots, x_n$ be the non-zero observations whose variance is equal to zero.
Let $\bar{x}$ be the average of this dataset, that is,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Since the variance of the data is given to be zero,

$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = 0$$

$$\Rightarrow \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0$$

A sum of non-negative terms is equal to zero only when every individual term is zero.

$$\Rightarrow (x_i - \bar{x})^2 = 0 \text{ for all } i = 1, 2, 3, \ldots, n$$

$$\Rightarrow (x_i - \bar{x}) = 0 \text{ for all } i = 1, 2, 3, \ldots, n$$

$$\Rightarrow x_i = \bar{x} \text{ for all } i = 1, 2, 3, \ldots, n$$

It implies that all the observations have the same value.

5. If first quartile (Q1) = 80 and third quartile (Q3) = 100, which of the following must
be true?
I. The median will lie in the range [80, 100].
II. The median is 90.
III. The standard deviation is at most 20. [3 marks]

(a) I only
(b) II only
(c) III only
(d) I and II.
(e) All are true.
(f) None is true.

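Consider the dataset 50, 80, 100, 100, and 150.

Q1 and Q3 of this dataset are 80 and 100 respectively.
The mean of this dataset is

$$\frac{50 + 80 + 100 + 100 + 150}{5} = 96$$

The standard deviation of this dataset is

$$\sqrt{\frac{(50 - 96)^2 + (80 - 96)^2 + (100 - 96)^2 + (100 - 96)^2 + (150 - 96)^2}{5}} = 32.62$$

The median of this dataset is 100, not 90, so statement II need not be true. The standard
deviation, 32.62, is greater than 20, so statement III need not be true.
The median is the 50th percentile, hence it always lies in $[Q_1, Q_3]$. So statement I is correct.
Hence, option (a) is correct.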
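The counterexample can be checked with Python's standard library. The sketch below verifies only the mean, median, and population standard deviation; quartiles are left out because library routines use interpolation conventions that may differ from the rule used in this course.

```python
from statistics import mean, median, pstdev

data = [50, 80, 100, 100, 150]

m = mean(data)      # arithmetic mean
med = median(data)  # middle value of the sorted data
sd = pstdev(data)   # population standard deviation (divisor n)

print(m, med, round(sd, 2))
```

A single dataset with Q1 = 80 and Q3 = 100 whose median is not 90 and whose standard deviation exceeds 20 is enough to rule out statements II and III.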

6. Suppose the correlation coefficient between two variables x and y is 0.45. What will be
the new correlation coefficient if 0.10 is added to all values of the x variable, every value
of the y variable is doubled, and the two variables are interchanged? [3 marks]

(a) 0.55
(b) 0.65
(c) 0.90
(d) 0.45
(e) 0.80

If $r$ is the correlation coefficient for the data pairs $(x_i, y_i)$, $i = 1, \ldots, n$, then the
correlation coefficient for the data pairs $(a + bx_i, c + dy_i)$, $i = 1, \ldots, n$, is also $r$,
provided that $b$ and $d$ have the same sign.
Here $a = 0.10$, $b = 1$, $c = 0$, and $d = 2$.
This implies that the correlation coefficient stays the same if 0.10 is added to all values of the
x variable and every value of the y variable is doubled.
Also, the correlation coefficient does not change its value if the variables x and y
are interchanged.
So the correlation coefficient does not change, i.e. the new correlation coefficient
is 0.45.
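The invariance can be illustrated on any small dataset. In the sketch below, the values in `xs` and `ys` are made up for illustration (their correlation is not 0.45), and `pearson` is a hand-written helper rather than a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, used only to demonstrate the property.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

r_before = pearson(xs, ys)

new_x = [x + 0.10 for x in xs]   # add 0.10 to every x
new_y = [2 * y for y in ys]      # double every y
r_after = pearson(new_y, new_x)  # variables interchanged

print(round(r_before, 6), round(r_after, 6))
```

Both printed values agree, which matches the rule that shifting, positive scaling, and swapping the variables leave the coefficient unchanged.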

7. The bar chart given in Figure Q.1 shows the shoe sizes of a group of 70 children. Based
on this information, which of the following statements is(are) true?

[3 marks]

Figure Q.1: Shoe size dataset

(a) 16 children wear a size 8 shoe.


(b) 29 children wear a shoe size less than 8.
(c) 7 is the median shoe size.
(d) 35 children wear a shoe size larger than 6.
(e) 6 is the mode shoe size.
(f) Range of the shoe size is 4.
(g) The value of first quartile (Q1 ) for shoe size is 6.

Answer: Option (a):


We know that lengths of the bars in the bar chart denote the count of corresponding
observations.
It implies that number of children who wear 8 size shoes is 16.
Hence, option (a) is correct.

Option (b):
Number of children who wear shoes of sizes less than 8 = 9+20+25=54
Hence, option (b) is incorrect.

Option (c):
Total number of children= 9+20+25+16=70
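It implies that the median shoe size is the average of the 35th and 36th shoe sizes.
Sizes 5 and 6 account for the first 9 + 20 = 29 children, and size 7 covers the 30th to the
54th child, so the 35th and 36th shoe sizes are both 7.
Therefore, median shoe size is 7.
Hence, option (c) is correct.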

Option (d):
Number of children who wear shoes of sizes larger than 6 = 25+16=41.
Hence, option (d) is incorrect.

Option (e):
Mode is the most frequent observation.
From the bar chart, it is clear that size 7 is worn by the largest number of children.
It implies that size 7 is the modal shoe size.
Hence, option (e) is incorrect.

Option (f):
Maximum shoes size= 8
Minimum shoe size = 5
⇒ Range = 8-5= 3
Hence, option (f) is wrong.
Option (g):
First quartile is the 25th percentile. As discussed in option (c), there are 70 children.

n = 70 and p = 0.25
np = 17.5
It implies that the 18th observation is the first quartile, and the 18th observation is a shoe
of size 6 (sizes 5 and 6 together cover the first 29 observations).
Hence, option (g) is correct.
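All seven options can be checked at once by expanding the bar heights into a sorted list. This is a sketch that assumes the counts 9, 20, 25, and 16 for sizes 5 to 8 used in the solution above; the first-quartile position follows the solution's rule of rounding a non-integer np up.

```python
from collections import Counter
from math import ceil
from statistics import median, mode

# Bar heights as read in the solution: shoe size -> number of children.
counts = {5: 9, 6: 20, 7: 25, 8: 16}
sizes = sorted(Counter(counts).elements())  # one entry per child

n = len(sizes)
less_than_8 = sum(c for s, c in counts.items() if s < 8)
larger_than_6 = sum(c for s, c in counts.items() if s > 6)
median_size = median(sizes)
modal_size = mode(sizes)
size_range = max(sizes) - min(sizes)

# np = 17.5 is not an integer, so take the next position (the 18th value).
q1 = sizes[ceil(n * 0.25) - 1]

print(n, less_than_8, larger_than_6, median_size, modal_size, size_range, q1)
```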

Use the following information and data given in Figure Q.2 and Figure Q.3 to answer
the questions 8, 9, and 10.
The stem and leaf plot diagrams given in Figure Q.2 and Figure Q.3 show the results of
Computational thinking(CT) and English exams conducted in a college respectively. In
Figure Q.3, x is an unknown value.

3 | 0 7
4 | 2 5 6 8
5 | 1 3 5 7 9
6 | 2 4 6 8
7 | 4 6 9
8 | 3 5 8 8

Figure Q.2: Stem and leaf plot of scores of CT paper, Key: 3|0 = 30

4 | 3 5
5 | 2 4 9 9 9
6 | 3 5 7 9
7 | 4 8
8 | 2 4 7
9 | 0 x

Figure Q.3: Stem and leaf plot of scores of English paper, Key: 4|3 = 43

8. What is the difference of the modal score of Computational thinking and English?

[1 mark]

Answer: 29
From the stem and leaf diagrams, it is clear that 88 is the score obtained by the largest
number of students in Computational thinking and 59 is the score obtained by the largest
number of students in English.
So 88 and 59 are the modes for Computational thinking and English respectively.
Therefore, the difference between the modal scores of Computational thinking and English

= 88 − 59

= 29

9. If the range of Computational thinking scores is greater than the range of English scores
by 7, then the value of x is [3 marks]
Answer: 4
From the Stem and leaf diagram,
Maximum Computational thinking score=88

Minimum Computational thinking score=30
⇒ Range for Computational thinking= 88-30= 58

Since, Range of Computational thinking scores is greater than the range of English scores
by 7
⇒ Range for English= 51
From the stem and leaf diagram,
Minimum English score=43

⇒ Range for English= Maximum English score - 43


⇒ 51= Maximum English score - 43
⇒ Maximum English score= 94

It implies that x = 4

10. What is the difference between the medians of the two scores? [1 mark]
Answer: 5.5
We know that in the stem and leaf plot, data points are arranged in the increasing order.
There are 22 scores in Computational thinking, therefore the average of the 11th and 12th
scores will be the median score for Computational thinking.

$$\text{median for Computational thinking} = \frac{59 + 62}{2} = 60.5$$

There are 18 scores in English, therefore the average of the 9th and 10th scores will be the
median score for English.

$$\text{median for English} = \frac{65 + 67}{2} = 66$$

Hence, Difference between medians= 66-60.5 =5.5
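The answers to questions 8, 9, and 10 can be cross-checked by typing the two stem and leaf plots out as lists. The sketch below assumes x = 4, the value derived in question 9, for the unknown leaf in the English plot.

```python
from statistics import median, mode

# Scores read off Figure Q.2 (Computational thinking).
ct = [30, 37, 42, 45, 46, 48, 51, 53, 55, 57, 59,
      62, 64, 66, 68, 74, 76, 79, 83, 85, 88, 88]

# Scores read off Figure Q.3 (English), with the unknown leaf x = 4.
x = 4
english = [43, 45, 52, 54, 59, 59, 59, 63, 65,
           67, 69, 74, 78, 82, 84, 87, 90, 90 + x]

mode_diff = mode(ct) - mode(english)
range_diff = (max(ct) - min(ct)) - (max(english) - min(english))
median_diff = median(english) - median(ct)

print(mode_diff, range_diff, median_diff)
```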

Use the following information and data given in Table Q.1 to answer the questions 11,
12, 13, and 14.
The placement statistics for the year 2020 of a polytechnic college that grants degrees
in Computer Science Engineering (CSE) and Information Technology (IT) is given in
Table Q.1.

Roll No  | Gender | Score Percentage | Specialisation | Placement Status | Salary (INR lakhs)
CS18B001 | M      | 85%              | CSE            | Placed           | 8.50
IT18B001 | M      | 95%              | IT             | Placed           | 4.00
IT18B002 | F      | 75%              | IT             | Placed           | 3.00
CS18B002 | F      | 78%              | CSE            | Placed           | 4.00
CS18B003 | M      | 85%              | CSE            | Not placed       |
CS18B004 | F      | 88%              | CSE            | Placed           | 5.00
IT18B003 | M      | 85%              | IT             | Not placed       |
CS18B005 | F      | 75%              | CSE            | Placed           | 7.00
CS18B006 | F      | 65%              | CSE            | Placed           | 3.00
CS18B007 | M      | 92%              | CSE            | Placed           | 14.00
CS18B008 | F      | 55%              | CSE            | Not placed       |
IT18B004 | M      | 95%              | IT             | Placed           | 3.50
CS18B009 | M      | 82%              | CSE            | Placed           | 10.00
CS18B010 | F      | 87%              | CSE            | Placed           | 4.00
IT18B005 | M      | 50%              | IT             | Not placed       |

Table Q.1: Placements dataset

11. Which of the following is (are) case(s)? [1 mark]


a. IT18B001
b. M
c. F
d. CS18B005
e. IT18B009
f. CSE
g. IT
Cases/observations are the data points collected to study.
Roll numbers are the cases/observations for the given dataset.
Hence, options (a), and (d) are correct options.

12. Which of the following is (are) numerical variable(s)? [1 mark]

a. Specialisation
b. Score Percentage
c. Placement Status
d. Salary (INR lakhs)

Solution:
Score Percentage and Salary have specific units and have numerical properties, so they
are numerical variables. Specialisation and Placement Status are categorical variables.

13. What is the population standard deviation of the salary in INR lakhs of the placed
students? (Ignore the cases of students who are not placed.) Enter the answer up to 3
decimal points accuracy. [3 marks]

Answer: 3.364, accepted range: 3.2 to 3.5


Solution:
The mean of the salary (ignoring the cases of students who did not get placed) is
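$$\bar{x} = \frac{\sum_{i=1}^{11} x_i}{11} = \frac{8.50 + 4 + 3 + 4 + 5 + 7 + 3 + 14 + 3.5 + 10 + 4}{11} = \frac{66}{11} = 6$$

Therefore, mean is INR 6 lakhs.
Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

$$\Rightarrow \sigma = \left( \frac{(8.50 - 6)^2 + (4 - 6)^2 + (3 - 6)^2 + (4 - 6)^2 + (5 - 6)^2 + (7 - 6)^2}{11} + \frac{(3 - 6)^2 + (14 - 6)^2 + (3.5 - 6)^2 + (10 - 6)^2 + (4 - 6)^2}{11} \right)^{1/2}$$

$$\Rightarrow \sigma = \sqrt{11.318}$$

⇒ σ = 3.364 lakh rupees approximately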
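The calculation can be reproduced with the standard-library `statistics.pstdev`, which uses the population divisor n. The list below holds the salaries of the 11 placed students from Table Q.1; the variable names are chosen for illustration.

```python
from statistics import mean, pstdev

# Salaries (INR lakhs) of the placed students, in table order.
salaries = [8.50, 4.00, 3.00, 4.00, 5.00, 7.00, 3.00, 14.00, 3.50, 10.00, 4.00]

avg = mean(salaries)
sigma = pstdev(salaries)  # population standard deviation

print(avg, round(sigma, 3))
```

Using `statistics.stdev` instead would divide by n − 1 and give a larger value, so the choice of function matters for this question.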

14. What is the absolute value of the point bi-serial correlation coefficient of association
between gender and salary among the students? (Ignore the cases of students who are
not placed.) Enter the answer up to 3 decimal points accuracy. [5 marks]

Answer: 0.542, accepted range: 0.52 to 0.56
Solution:
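The point bi-serial correlation coefficient ($r_{pb}$) is given by

$$r_{pb} = \frac{(\bar{Y}_0 - \bar{Y}_1)\sqrt{p_0 \times p_1}}{\sigma}$$

From the previous question we know that the population standard deviation is 3.364 lakh rupees.
Let females be encoded as 1 and males be encoded as 0.
Therefore,

$$\bar{Y}_0 = \frac{8.5 + 4 + 14 + 3.5 + 10}{5} = 8$$

and

$$\bar{Y}_1 = \frac{3 + 4 + 5 + 7 + 3 + 4}{6} = 4.33$$

$$p_0 = \frac{5}{11} = 0.454, \quad p_1 = \frac{6}{11} = 0.545$$

$$\Rightarrow r_{pb} = \frac{(\bar{Y}_0 - \bar{Y}_1)\sqrt{p_0 \times p_1}}{\sigma}$$

$$\Rightarrow r_{pb} = \frac{(8 - 4.33)\sqrt{0.454 \times 0.545}}{3.364} = 0.542$$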
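The point bi-serial formula can be evaluated directly from the two groups of salaries. The sketch below follows the encoding used in the solution (males as 0, females as 1) and computes the population standard deviation over all 11 placed students; because it does not round the intermediate values, the last digit may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import mean, pstdev

# Salaries (INR lakhs) of placed students, split by gender.
male = [8.5, 4, 14, 3.5, 10]   # encoded as 0
female = [3, 4, 5, 7, 3, 4]    # encoded as 1

combined = male + female
n = len(combined)
p0 = len(male) / n
p1 = len(female) / n
sigma = pstdev(combined)  # population standard deviation of all salaries

r_pb = (mean(male) - mean(female)) * sqrt(p0 * p1) / sigma
print(round(r_pb, 3))
```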

15.
18. Annual donations to a non-profit organisation are given in Table Q.2, where some data
is missing. The donation given by company B is 5 crore rupees more than that of company D.
How much did company D donate to the organisation? [3 marks]

Company | Donated amount in INR crores | Relative frequency
A       | 56                           | 0.28
B       |                              |
C       | 45                           | 0.225
D       |                              |
E       | 16                           | 0.08

Table Q.2: Donations dataset

Answer: 39

Since

$$\text{Relative frequency} = \frac{\text{frequency}}{\text{Total}}$$

and the relative frequency and frequency (donated amount) for C are given to be 0.225 and 45
respectively, the total amount donated to the organisation is

$$\frac{\text{frequency of C}}{\text{Relative frequency of C}} = \frac{45}{0.225} = 200$$

Let the donation given by company D be $x$.
According to the question, the donation given by company B will be $x + 5$.
Now

$$56 + x + 45 + 5 + x + 16 = 200$$

$$\Rightarrow 122 + 2x = 200$$

$$\Rightarrow x = \frac{78}{2} = 39$$

Therefore, the amount donated by company D is INR 39 crores.
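The arithmetic can be verified in a few lines. This sketch recovers the total from company C's row, solves for D, and then checks the result against the relative frequency given for company A; the variable names are illustrative.

```python
# Known rows of Table Q.2: (amount in INR crores, relative frequency).
amount_a, rel_a = 56, 0.28
amount_c, rel_c = 45, 0.225
amount_e = 16

total = amount_c / rel_c  # total = frequency / relative frequency

# B = D + 5, so A + (D + 5) + C + D + E = total.
amount_d = (total - amount_a - amount_c - amount_e - 5) / 2
amount_b = amount_d + 5

print(round(total), round(amount_d), round(amount_b))
```

Company A's row gives an independent check: 56 / 200 = 0.28, which agrees with the table.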

Use the data given in Figure Q.5 to answer the questions 19 and 20.
The histogram of runs scored by a batsman in his career is given in Figure Q.5.

Figure Q.5: Runs dataset

19. What is the approximate mean of the runs scored by the batsman? [1 mark]

Answer: 50
Each bar here represents the frequency of the runs scored by a batsman in a particular
class interval.

Class interval | Mid point (x_i) | f_i | x_i · f_i | f_i · (x_i − x̄)^2
5 − 15         | 10              | 1   | 10        | 1(10 − 50)^2
15 − 25        | 20              | 3   | 60        | 3(20 − 50)^2
25 − 35        | 30              | 4   | 120       | 4(30 − 50)^2
35 − 45        | 40              | 4   | 160       | 4(40 − 50)^2
45 − 55        | 50              | 3   | 150       | 3(50 − 50)^2
55 − 65        | 60              | 4   | 240       | 4(60 − 50)^2
65 − 75        | 70              | 1   | 70        | 1(70 − 50)^2
75 − 85        | 80              | 2   | 160       | 2(80 − 50)^2
85 − 95        | 90              | 2   | 180       | 2(90 − 50)^2
95 − 105       | 100             | 1   | 100       | 1(100 − 50)^2
Total          |                 | 25  | 1250      | 14600

Table 1

Mean:

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i \cdot f_i}{\sum_{i=1}^{10} f_i} = \frac{1250}{25} = 50$$

Therefore, approximate mean of the runs scored by the batsman is 50.

20. What is the approximate population standard deviation of the runs scored by the bats-
man? Enter the answer up to 3 decimals accuracy. Hint: Use the class marks and its
frequencies to solve for standard deviation. [5 marks]

Answer: 24.166, accepted range: 23.9 to 24.4


Variance:

$$\sigma^2 = \frac{\sum_{i=1}^{10} f_i \cdot (x_i - \bar{x})^2}{\sum_{i=1}^{10} f_i} = \frac{14600}{25} = 584$$

Now, the approximate population standard deviation of the runs scored by the batsman is

$$\sigma = \sqrt{584} = 24.166$$
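The grouped-data mean and standard deviation can be recomputed from the class marks and frequencies in Table 1. This is a sketch of the class-mark approximation described in the hint, using only the standard library.

```python
from math import sqrt

# Class marks (mid points) and frequencies from Table 1.
midpoints = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
freqs = [1, 3, 4, 4, 3, 4, 1, 2, 2, 1]

n = sum(freqs)
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Population variance: frequency-weighted squared deviations divided by n.
grouped_var = sum(f * (x - grouped_mean) ** 2 for x, f in zip(midpoints, freqs)) / n
grouped_sd = sqrt(grouped_var)

print(n, grouped_mean, grouped_var, round(grouped_sd, 3))
```

Each innings is treated as if it sat at the midpoint of its class, so both results are approximations of the true values.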

21. In a call center, there are 75 employees and the number of calls they receive varies over
the length of a day. The working hours are 9 AM to 6 PM with lunch break from 1 PM
to 2 PM. The average number of calls received from 9 AM to 1 PM by an employee is
15 per hour, and the average number of calls received by the employee from 2 PM to 6
PM is 5 per hour. Based on this data, choose the correct options from below.
[3 marks]

a. Average number of calls received by employee in working hours is 10 calls/hour.


b. The correlation coefficient of time and calls received is positive.
c. The correlation coefficient of time and calls received is negative.
d. The standard deviation of the calls received is equal to zero.
e. The slope of the trend line is negative.

The average number of calls received from 9 AM to 1 PM (working hours) by an
employee is 15 per hour.
The total number of calls received from 9 AM to 1 PM = 4 (no. of hours) × 15 (calls per
hour) = 60
Similarly, the average number of calls received from 2 PM to 6 PM (working hours) by
an employee is 5 per hour.
The total number of calls received from 2 PM to 6 PM = 4 (no. of hours) × 5 (calls per
hour) = 20
So, total number of calls received by an employee within working hours = 60 + 20 = 80
Therefore, average number of calls received by an employee in working hours = 80/8
= 10 calls per hour.
Hence, option (a) is a correct option.
From the given information we can see that the average number of calls received by an
employee decreases as time increases, and hence the trend is negative.
So, we can say that the correlation coefficient will be negative.
Hence, option (b) is incorrect and option (c) is correct.
Standard deviation will be zero only if all the values are the same. But if the averages are
different, the values are not all the same.
Hence, option (d) is incorrect.
If the correlation coefficient is negative then the slope of the trend line will also be
negative.
Hence, option (e) is a correct option.
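The signs claimed above can be illustrated with a small calculation. The question gives only block averages, so the sketch below makes a simplifying assumption that is not part of the question: every working hour in a block receives exactly that block's average number of calls.

```python
# Start of each working hour (24-hour clock); 13:00 to 14:00 is the lunch break.
hours = [9, 10, 11, 12, 14, 15, 16, 17]
# Assumed calls per hour: the block average repeated for each hour.
calls = [15, 15, 15, 15, 5, 5, 5, 5]

n = len(hours)
mean_hour = sum(hours) / n
mean_calls = sum(calls) / n

# Population covariance and variances.
cov = sum((h - mean_hour) * (c - mean_calls) for h, c in zip(hours, calls)) / n
var_hour = sum((h - mean_hour) ** 2 for h in hours) / n
var_calls = sum((c - mean_calls) ** 2 for c in calls) / n

slope = cov / var_hour  # least-squares slope of calls on time

print(mean_calls, cov, round(slope, 3), var_calls)
```

Under this assumption the covariance and slope are both negative and the variance of calls is positive, in line with options (a), (c), and (e) being correct and (d) being incorrect.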

22. The correlation was found to be r =0.86 between price (x) and demand of mobile phones
(y). Which of the following options could be true? [3 marks]

a. Given two points from the scatter plot of price and demand of mobile phones, one point has
a smaller x value and a larger y value than another point.

b. Given two points from the scatter plot of price and demand of mobile phones, one point has
a larger x value and a smaller y value than another point.
c. The covariance of price and demand of mobile phones is positive.
d. The covariance of price and demand of mobile phones is negative.

From the above figure, the y value corresponding to 3 is larger than the y value
corresponding to 4. Hence, options (a) and (b) are correct.
The formula for the correlation coefficient is given by

$$r = \frac{\mathrm{cov}(x, y)}{S_x \, S_y} \quad (1)$$

The sign of the correlation coefficient depends on $\mathrm{cov}(x, y)$, since $S_x$ and $S_y$ will be
having positive values.
The covariance is directly proportional to the sum of the products of the deviations.
When the large (small) values of x tend to be associated with large (small) values of y,
the signs of the deviations, $(x_i - \bar{x})$ and $(y_i - \bar{y})$, will also tend to be the same.
If the signs of the deviations are the same then their product will be positive.
Now, from the information given in the question, as the price (x) increases, demand (y)
should increase in order to have a positive correlation coefficient.
Hence, option (c) is correct and option (d) is incorrect.
