
Statistical Computing using R software : MA Economics Sem. - II


Objective of the course : The objective of this course is to
impart a practical understanding of the ‘R’ software for
computing measures of central tendency, dispersion,
correlation and regression coefficients. Students will
develop the skill of using the software to
generate statistical output that can be interpreted for further
statistical inference. The course gives practical, hands-on
knowledge : students are trained in computer science
laboratories and taught the ‘R’ language using case studies
and data sets.
Synopsis :

Unit – I : Measures of central tendency and dispersion


Arithmetic Mean
Geometric Mean
Harmonic Mean
Mode
Median
Partition values
Range and coefficient of range
Quartile deviation and coefficient of quartile deviation
Mean deviation
Standard deviation
Coefficient of variation
(for discrete observations, ungrouped frequency distributions and
grouped frequency distributions)
Unit – II : Probability Distributions
Probability
Binomial Distribution
Poisson Distribution
Normal Distribution
Exponential Distribution
Unit – III : Correlation and Regression
Types of correlation
Scatter Diagram
Product moment correlation coefficient
Simple Linear Regression
Regression Diagnostics by Graphical method
Multiple Linear Regression
Unit – IV : Hypothesis Testing
Large sample test (Z test)
Small sample tests (t, F and chi-square tests)
ANOVA
Unit – I :Measures of central tendency and
Dispersion
Measures :
Arithmetic Mean(A.M.)
Median
Mode
Geometric Mean
Harmonic Mean
Partition Values: Quartiles , Deciles and Percentiles
Range ,Quartile Deviation ,Standard Deviation ,C.V.
P # 1. The yield of sugarcane in India from 2006-07 to 2019-20,
in tonnes / hectare, is given below :
Year  : 2006-07 2007-08 2008-09 2009-10 2010-11 2011-12 2012-13
Yield : 69      68.9    64.6    70      70.1    71.7    68.3
Year  : 2013-14 2014-15 2015-16 2016-17 2017-18 2018-19 2019-20
Yield : 70.5    71.5    70.7    69      79.7    78.3    77.6
Find A.M., Median, Mode, G.M., H.M., Q1, Q2, Q3, D7, P65 and
M.D. from mean.
R code and interpretation
AM , Median, Mode , GM and HM
>sugpn = c(69, 68.9, 64.6,70,70.1,71.7,68.3,70.5, 71.5,70.7, 69,79.7,78.3,77.6)
> # Finding Mean :
> M = mean(sugpn) ; n = length(sugpn)
> # Finding Median :
> Med = median(sugpn)
> # Finding Mode :
> tx = table(sugpn)
> m =which( tx == max(tx)) ; stx = sort(unique(sugpn))
> MODE = stx[m ]
> lx=log10(sugpn)
> gm=10^mean(lx)
> hm=n/sum(1/sugpn)
> MD = sum(abs(sugpn - M)) / n   # mean deviation from mean
R code and interpretation
Finding Mode :
# calculate mode in r example
# mode calculation function
test <- c(1,2,3,4,5,5,5,5,3,2,3,1,1,2)
getMode <- function(x) {
keys <- unique(x)
keys[which.max(tabulate(match(x, keys)))]
}
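As a quick usage sketch, the function above can be applied to the `test` vector and to the sugarcane yields of P # 1. The vector name `sugpn` is taken from the earlier slide; note that when several values tie for the highest frequency, this function returns only the first of them.

```r
# Mode function as defined on the slide
getMode <- function(x) {
  keys <- unique(x)
  keys[which.max(tabulate(match(x, keys)))]
}

test  <- c(1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 3, 1, 1, 2)
sugpn <- c(69, 68.9, 64.6, 70, 70.1, 71.7, 68.3, 70.5,
           71.5, 70.7, 69, 79.7, 78.3, 77.6)

mode_test  <- getMode(test)    # 5 occurs four times
mode_sugpn <- getMode(sugpn)   # 69 occurs twice, all other yields once
print(c(mode_test, mode_sugpn))
```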
R code : Partition Values
>Q1=quantile(sugpn,0.25); Q3=quantile(sugpn,0.75)
> D7=quantile(sugpn,0.7)
> P65=quantile(sugpn,0.65)
> range=(max(sugpn) - min(sugpn))
> V= var(sugpn)
> sd=sqrt(V)
> CV= sd / M *100
R code : Partition Values
P # 2. Monthly sales (in ‘00 Rs.) of 10 small shops are given below :
100, 190, 210, 160, 150, 160, 190, 200, 170, 152
Compute A.M., Median, Mode, G.M., H.M., Q1, Q2, Q3, IQR, Q.D., coefficient of Q.D.,
D8, P60, Variance, S.D., C.V. and M.D. from median.
Solution :
> S = c(100 , 190 , 210 , 160 , 150 , 160 , 190 , 200 , 170 ,152 )
> M = mean(S) ; MED = median(S)
> TS = table(S)
> m = which( TS == max(TS)) ; TSX = sort(unique (S))
> MODE = TSX[ m ]
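> lx = log10(S)
> GM = 10^mean(lx)              # geometric mean via logs
> HM = length(S) / sum(1 / S)   # harmonic mean = n / sum of reciprocals
> MD = mean(abs(S - MED))       # mean deviation from median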
R code : Partition Values
> Q1 = quantile(S, 0.25) ; Q3 = quantile(S, 0.75)
> IQR = Q3 - Q1 ; QD = (Q3 - Q1)/2 ; COEFFQD = (Q3 - Q1)/(Q3 + Q1)
> D8 = quantile(S, 0.8) ; P60 = quantile(S, 0.60)
> V = var(S) ; SD = sqrt(V) ; CV = SD / M * 100
R code : For ungrouped data
P# 3. For the following frequency distribution :
Age in years : 10 20 30 40 50 60
No. of persons : 7 11 9 8 3 2
Calculate AM, Median, Mode, GM, HM, Q1, Q2, Q3, D7 and the 29th percentile
Solution :
> x = c(10 ,20,30,40,50, 60) ; f = c(7,11,9,8,3,2) ; N = sum(f)
> y = rep( x , f)
> M = mean(y) ; MED = median(y) ; ly = log10(y); GM = 10^mean(ly);HM= N/sum(f/x)
> m= which(f == max(f)); MODE = x[m]
> Q1 =quantile( y ,0.25) ; Q3=quantile(y,0.75)
>D7 = quantile( y , 0.7) ; P29=quantile(y ,0.29)
R code :
For Grouped Frequency Distribution
P# 4. The frequency distribution of weight(in gms) of mangoes of a certain variety
is given below :
Weight : 410 – 420 420 – 430 430 – 440 440 – 450 450 – 460 460 – 470 470 – 480
No. of mangoes: 14 20 42 54 45 18 7
Calculate AM , Median , Mode , GM ,HM ,Q1 , Q3 , D6 and P62
Solution :
> lb = seq(410, 470, 10) ; ub = seq(420, 480, 10) ; h = 10
> f = c(14, 20, 42, 54, 45, 18, 7) ; x = (lb + ub)/2 ; N = sum(f)
> AM = sum(f*x)/N ; GM = 10^(sum(f*log10(x))/N) ; HM = N/sum(f/x)
> lcf = cumsum(f) ; mc = min(which(lcf >= N/2))    # median class
> MED = lb[mc] + (N/2 - lcf[mc - 1]) * (h/f[mc])
> q1c = min(which(lcf >= N/4))                     # Q1 class
> Q1 = lb[q1c] + (N/4 - lcf[q1c - 1]) * (h/f[q1c])
> qc = min(which(lcf >= 3*N/4))                    # Q3 class
> Q3 = lb[qc] + (3*N/4 - lcf[qc - 1]) * (h/f[qc])
R code :
For Grouped Frequency Distribution
> dc = min(which(lcf >= 6*N/10))
> D6 = lb[dc] + (6*N/10 - lcf[dc - 1]) * (h/f[dc])
> pc = min(which(lcf >= 62*N/100))
> P62 = lb[pc] + (62*N/100 - lcf[pc - 1]) * (h/f[pc])
> moc = which(f == max(f))                         # modal class
> MODE = lb[moc] + ((f[moc] - f[moc - 1]) / (2*f[moc] - f[moc - 1] - f[moc + 1])) * h
> y =rep(x,f)
> AM = mean(y)
> V = var(y)
> SD = sqrt(V)
> CV = SD / AM * 100
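The interpolation formula used above for the median, quartiles, deciles and percentiles is the same each time, so it can be wrapped in one helper. The sketch below is only an illustration: the function name `grouped_quantile` is hypothetical (it is not part of base R), and it treats the cumulative frequency below the first class as 0 so that the formula also works when the required class is the first one.

```r
# Hypothetical helper: partition value of a grouped frequency distribution
# prob = required proportion (0.5 for median), lb = lower class boundaries,
# f = class frequencies, h = common class width
grouped_quantile <- function(prob, lb, f, h) {
  N     <- sum(f)
  lcf   <- cumsum(f)                    # less-than cumulative frequencies
  k     <- min(which(lcf >= prob * N))  # class containing the partition value
  below <- c(0, lcf)[k]                 # cumulative frequency before class k
  lb[k] + (prob * N - below) * h / f[k]
}

lb <- seq(410, 470, 10)
f  <- c(14, 20, 42, 54, 45, 18, 7)

Q1  <- grouped_quantile(0.25, lb, f, 10)
MED <- grouped_quantile(0.50, lb, f, 10)
Q3  <- grouped_quantile(0.75, lb, f, 10)
D6  <- grouped_quantile(0.60, lb, f, 10)
P62 <- grouped_quantile(0.62, lb, f, 10)
print(round(c(Q1 = Q1, MED = MED, Q3 = Q3, D6 = D6, P62 = P62), 4))
```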
R code :

P #5. The production manager of Hinton Press is interested in determining the
average time needed to photograph one printing plate. Using a stop watch and
observing the plate makers, he collects the following times (in seconds) :
20.4, 20.0, 22.2, 23.8, 21.3, 25.1, 21.2, 22.9, 28.2, 24.3, 22.0, 24.7, 25.7, 24.9, 22.7, 24.4, 24.3,
23.6, 23.2, 21.0. An average per-plate time of less than 23.0 seconds indicates
satisfactory productivity. Should the production manager be concerned ?

P # 6. A cosmetics manufacturer recently purchased a machine to fill
3-ounce cologne bottles. To test the accuracy of the machine’s volume
setting, 18 trial (sample) bottles were run. The resulting volumes (in ounces) for
the trials were as follows :
3.02, 2.89, 2.92, 2.84, 2.90, 2.97, 2.95, 2.94, 2.93, 3.01, 2.97, 2.95, 2.90, 2.94, 2.96, 2.99,
2.99, 2.97.
The company does not normally recalibrate the filling machine for this cologne if
the average volume is within 0.04 of 3.00 ounces. Should it recalibrate ?
R code :

P #5. Mean = 23.295
Interpretation : Since the average per-plate time is
23.295 seconds > 23 seconds, productivity is unsatisfactory,
so the production manager should be concerned.
P # 6. The acceptable range for the average volume of cologne (3.00 ± 0.04 ounces) :
Lower Upper
2.96 3.04
Interpretation : Since the average volume is 2.946667 ounces,
which is less than the lower limit, the filling machine should be
recalibrated.
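The two answers above can be reproduced with a few lines of R. This is a minimal sketch; the object names (`plate`, `cologne`, `concerned`, `recalibrate`) are illustrative choices, not names used in the course.

```r
# P #5 : per-plate times in seconds
plate <- c(20.4, 20.0, 22.2, 23.8, 21.3, 25.1, 21.2, 22.9, 28.2, 24.3,
           22.0, 24.7, 25.7, 24.9, 22.7, 24.4, 24.3, 23.6, 23.2, 21.0)
mean_plate <- mean(plate)
concerned  <- mean_plate >= 23.0          # TRUE means productivity is unsatisfactory

# P # 6 : fill volumes in ounces
cologne <- c(3.02, 2.89, 2.92, 2.84, 2.90, 2.97, 2.95, 2.94, 2.93,
             3.01, 2.97, 2.95, 2.90, 2.94, 2.96, 2.99, 2.99, 2.97)
mean_cologne <- mean(cologne)
recalibrate  <- abs(mean_cologne - 3.00) > 0.04   # TRUE means outside 2.96 to 3.04

print(c(mean_plate, mean_cologne))
print(c(concerned, recalibrate))
```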
R code :

P # 7. The following are the prices of shares of three companies :
MRF : 78360, 79200, 76300, 80100, 80500, 81366
Honeywell : 44200, 40100, 39400, 42200, 41500, 43124
Nestle India : 14200, 15300, 13100, 16200, 16700, 17252
Find out which company is performing well. (Use coefficient of variation.)
Company        Mean       SD         CV
MRF            79304.33   1802.323   2.273 %
Honeywell      41754      1811.092   4.338 %
Nestle India   15458.67   1579.826   10.22 %
MRF has the smallest CV, so its share price is the most consistent : MRF is performing well.


R code :

P # 8. Students’ ages in the regular day time M.B.A. program and the
evening program of a management institute described by the
following samples :
(Ages)Regular M.B.A. : 23 29 27 22 24 21 25 26 27 24
(Ages) Evening M.B.A . 27 34 30 29 28 30 34 35 28 29
If homogeneity of the class is a positive factor in learning , use
coefficient of variation method to suggest which of the two groups
will be easier to teach.
              Mean   SD         CV
Regular MBA   24.8   2.485514   10.02 %
Evening MBA   30.4   2.875181   9.46 %
The Evening MBA group is easier to teach as its CV is smaller.
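The coefficient of variation used in P # 7 and P # 8 can be computed with a one-line helper. The sketch below assumes the sample standard deviation (`sd()`, divisor n - 1), which matches the SD values in the tables; the helper name `cv` and the other object names are illustrative.

```r
# Coefficient of variation in percent (sample SD / mean * 100)
cv <- function(x) sd(x) / mean(x) * 100

# P # 8 : ages of the two M.B.A. groups
regular <- c(23, 29, 27, 22, 24, 21, 25, 26, 27, 24)
evening <- c(27, 34, 30, 29, 28, 30, 34, 35, 28, 29)
cv_regular <- cv(regular)
cv_evening <- cv(evening)

# P # 7 : share prices of the three companies
prices <- list(
  MRF         = c(78360, 79200, 76300, 80100, 80500, 81366),
  Honeywell   = c(44200, 40100, 39400, 42200, 41500, 43124),
  NestleIndia = c(14200, 15300, 13100, 16200, 16700, 17252)
)
cvs <- sapply(prices, cv)
most_consistent <- names(which.min(cvs))   # company with the smallest CV

print(round(c(cv_regular, cv_evening), 2))
print(round(cvs, 3))
print(most_consistent)
```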
Unit – II : Probability Distributions

Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Normal Distribution
Exponential Distribution
Discrete Probability Distributions
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Bernoulli Distribution
Jacob(James) Bernoulli
Born on 6th Jan. 1655
Died on 16th Aug. 1705
Swiss mathematician.
Although he studied
theology and philosophy
at university level he was
more interested in mathematics
and astronomy. 
STATISTICAL MODELS
• DISCRETE DISTRIBUTIONS
# BERNOULLI
1. p.m.f. : $P(x) = p^{x} q^{1-x}$ , x = 0 , 1 , where q = 1 – p
2. Parameter : p
3. Range :x=0,1
Binomial distribution

1. p.m.f. : $P(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$ , x = 0, 1, 2, ..., n
n : No. of trials
x : No. of successes
p : probability of success , 0 < p < 1
q : probability of failure , 0 < q < 1 , p + q = 1
2. Parameters : n and p
3. Range : x = 0, 1, 2, 3, ..., n
4. Mean = np , Variance = npq , s.d. = $\sqrt{npq}$
STATISTICAL MODELS
Conditions to use binomial distribution :
The random experiment should have only two
outcomes, viz. success and failure.
The number of trials n is finite and fixed.
The trials are independent.
The probability of success p is constant for
all the trials.
Poisson Distribution
Simeon Denis Poisson
Born : 21 June 1781 in Pithiviers, France
Died : 25 April 1840 in Sceaux (near Paris), France
His teachers Laplace and Lagrange quickly saw
his mathematical talents.


STATISTICAL MODELS
• DISCRETE DISTRIBUTIONS
# POISSON
1. p.m.f. : $P(x) = e^{-\lambda} \lambda^{x} / x!$
2. Parameter : λ
3. Range : x = 0, 1, 2, 3, ...
4. Mean E(x) = λ
5. Variance V(x) = λ , S.D. = $\sqrt{\lambda}$
# POISSON
• Applications :
• Used frequently in quality control,
reliability ,queuing theory and so on.
Problems
Combination Formula :
1. n C x = n! / [x! (n – x )!]
2. n C 0 = 1 = n C n e.g. 5 C 0 = 1 , 5 C 5 = 1
3. n C 1 = n e.g. 5 C 1 = 5
4. n C x = n C n – x e.g. 5 C 4 = 5 C 5 – 4 = 5 C 1 = 5
P#1. Let X ~ bin.(n = 5 , p = 0.3), $P(X = x) = {}^{5}C_{x}\,(0.3)^{x}(0.7)^{5-x}$
Find (i)P(X ≤ 3) (ii)P(X ≥ 2) (iii)P(X = 4) (iv)P(1 < X ≤ 4)
(i) P(X ≤ 3) = P(X=3)+P(X=2)+P(X=1)+P(X=0)
= ${}^{5}C_{3}(0.3)^{3}(0.7)^{2} + {}^{5}C_{2}(0.3)^{2}(0.7)^{3} + {}^{5}C_{1}(0.3)^{1}(0.7)^{4} + {}^{5}C_{0}(0.3)^{0}(0.7)^{5}$
= 0.1323 + 0.3087 + 0.36015 + 0.16807 = 0.96922
Problems
P#1 (contd.) X ~ bin.(n = 5 , p = 0.3)
(ii) P(X ≥ 2) = 1 – P(X < 2)
= 1 – [ P(X = 1) + P(X = 0) ]
= 1 – [ 0.36015 + 0.16807 ]
= 0.47178
(iii) P(X = 4) = ${}^{5}C_{4}(0.3)^{4}(0.7)^{1}$ = 0.02835
(iv) P(1 < X ≤ 4) = P(X = 2) + P(X = 3) + P(X = 4)
= 0.3087 + 0.1323 + 0.02835
= 0.46935
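The hand calculations for P#1 can be checked in R with `dbinom()` (point probability) and `pbinom()` (cumulative probability). A minimal sketch, with illustrative object names:

```r
n <- 5; p <- 0.3

p_le3  <- pbinom(3, n, p)                    # (i)   P(X <= 3)
p_ge2  <- 1 - pbinom(1, n, p)                # (ii)  P(X >= 2) = 1 - P(X <= 1)
p_eq4  <- dbinom(4, n, p)                    # (iii) P(X = 4)
p_2to4 <- pbinom(4, n, p) - pbinom(1, n, p)  # (iv)  P(1 < X <= 4)

print(round(c(p_le3, p_ge2, p_eq4, p_2to4), 5))
```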
Problems
P#1. Let X ~ bin.(n = 5 , P=0.3)
Find (i)P(X ≤ 3) (ii)P(X ≥ 2) (iii)P(X = 4) (iv)P(1 < X ≤ 4)
P#2. Let X ~ bin.( n = 8 , P = 0.5)
Find (i) P(X ≥ 4 ) + P(X ≤ 2) (ii)P( 4 ≤ X ≤ 7)
P#3. Let X ~ bin.(n = 8 , P = 0.3). Find K such that P(X ≤ K) = 0.2552
> k = qbinom(0.2552, 8, 0.3)   # gives k = 1
P#4. Let X ~ P( λ = 1.5 )
Find (i) P( X = 2 ) (ii)P( X <=5) (iii)P(X >=3)
P# 5. Let X ~ P( λ = 2.5) , Find (i)P(X > 2 ) (ii)P(3 ≤ X < 6)
(iii)P(X = 4) (iv) P( X ≤ 5) (v) Find k such that P( X ≤ k) = 0.8571235
P# 6. Let X ~ P(λ) such that P(X = 0) = P(X = 1) ; hence find P( X > 2).
$e^{-\lambda} = \lambda e^{-\lambda}$ gives λ = 1 , so P( X > 2 ) = 1 – P( X ≤ 2 ) = 0.0803014
Problems
P#5. X ~ P(λ = 2.5)
(i)P(X > 2 ) = 1 – P( X ≤ 2 ) = 0.4561869
(ii)P( 3 ≤ X < 6) = 0.4141658
(iii) P( X = 4) = 0.1336019
(iv) P( X ≤ 5) = 0.957979
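(v) P( X ≤ k ) = 0.8571235 , > k = qpois(0.8571235 , 2.5)
k = 4 (the smallest k with P( X ≤ k ) ≥ 0.8571235 , since P( X ≤ 3 ) = 0.7575761 and P( X ≤ 4 ) = 0.8911780)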
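The P#5 answers can be reproduced with `dpois()`, `ppois()` and `qpois()`. The sketch below uses illustrative object names; `qpois()` returns the smallest k whose cumulative probability reaches the given value.

```r
lambda <- 2.5

p_gt2  <- 1 - ppois(2, lambda)                 # (i)   P(X > 2)
p_3to5 <- ppois(5, lambda) - ppois(2, lambda)  # (ii)  P(3 <= X < 6) = P(X = 3, 4, 5)
p_eq4  <- dpois(4, lambda)                     # (iii) P(X = 4)
p_le5  <- ppois(5, lambda)                     # (iv)  P(X <= 5)
k      <- qpois(0.8571235, lambda)             # (v)   smallest k with P(X <= k) >= 0.8571235

print(round(c(p_gt2, p_3to5, p_eq4, p_le5), 7))
print(k)
```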
Problems
P#7.Let’s say that 80% of all business startups in the IT
industry report that they generate a profit in their first year.
If a sample of 10 new IT business startups is selected, find the
probability that exactly seven will generate a profit in their
first year.
Solution : X : The no. of IT startups that generate a profit
X ~ bin. (n = 10 , p = 0.80)
$P(X = x) = {}^{n}C_{x}\, p^{x} q^{n-x}$ , x = 0, 1, 2, ..., n
= 0 , otherwise
$P(X = 7) = {}^{10}C_{7}(0.8)^{7}(0.2)^{3}$ = 0.2013266
Problems
P#7. Interpretation/solution: There is a 20.13% probability
that exactly 7 of 10 IT startups will generate a profit in their
first year when the probability of profit in the first year for
each startup is 80%.
Problems
P# 8. A box of candies has many different colors in it. There is a 15%
chance of getting a pink candy. If 10 candies are selected randomly, what
is the probability of getting (i) exactly 4 pink candies (ii) no pink candies ?
X : No. of pink candies , p = 0.15 , n = 10 , q = 0.85
$P(X = x) = {}^{10}C_{x}(0.15)^{x}(0.85)^{10-x}$ , x = 0, 1, 2, ..., 10
(i) P( X = 4 ) = 0.04009571 , (ii) P( X = 0 ) = 0.1968744

P# 9. The mean and sd of a binomial distribution are 3 and √2
respectively. Find the parameters of the distribution and hence find the
probability that the variable takes
values : (i) less than or equal to 2 (ii) greater than 7
Given : mean = np = 3 , variance = npq = 2
q = npq / np = 2/3 , p = 1 – q = 1/3 , n = 3 / p = 9
Problems
P# 9. (i)P(X ≤ 2 ) = 0.3771783
(ii)P(X > 7) = P( X = 8 ) + P( X = 9) = 0.0009653
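The P# 9 steps (recover n and p from the mean and variance, then evaluate the probabilities) can be sketched in R as follows. Object names are illustrative, and `round()` is used only to remove floating-point noise when recovering the integer n.

```r
mu <- 3; v <- 2          # given mean and variance (sd = sqrt(2))

q <- v / mu              # npq / np = q
p <- 1 - q
n <- round(mu / p)       # np = mu, so n = mu / p

p_le2 <- pbinom(2, n, p)       # (i)  P(X <= 2)
p_gt7 <- 1 - pbinom(7, n, p)   # (ii) P(X > 7) = P(X = 8) + P(X = 9)

print(c(n = n, p = p, q = q))
print(c(p_le2, p_gt7))
```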
P#10. X : No. of defective P.D.’s
X ~ bin.( n = 6 , p = 0.15 )
(i) P( All are defective)
= P( X = 6 ) = 1.139063e-05 = 0.00001139063
(ii) P( All are good)
= P( X = 0 ) = 0.3771495
(iii) P(at least one is defective)
= P( X ≥ 1 ) = 1 – P( X < 1 ) = 1 – P( X = 0 ) = 0.6228505
Problems
P# 10. (iv) P(At most 2 are defective)
= P( X ≤ 2 )
= 0.9526614
P# 11. X : The no. of sales calls that result in an order
(a) X ~ bin.( n = 8 , p=0.48)
P( X = 6 ) = 0.09260025
(b)X ~ bin. ( n = 4 , p=0.48)
P( X ≤ 1) = 0.3430835
Problems
P# 10. A box contains 100 P.D.’s, 15 of which are defective. 6
are selected for inspection. Find the probability that (i) all
are defective (ii) all are good (iii) at least one is defective
(iv) at most 2 are defective
P #11. An industrial chemical that will retard the spread of
fire in paint has been developed. The local sales
representative has estimated, from past experience, that
48% of the sales calls result in an order.
a) If eight sales calls are made in a day, what is the probability of receiving
exactly six orders ?
b) If four sales calls are made before lunch, what is the probability that one
or fewer results in an order ?
probability distributions
P#12. The hawks are currently winning 0.55 of their games.
There are 5 games in the next two weeks. What is the
probability that they will win more games than they lose ?
X : The no. of games hawks win
X ~ bin.( n = 5 , p=0.55)
Required Prob. = P( X ≥ 3)
= 1 – P( X < 3)
= 1 – P( X ≤ 2 )
= 0.5931269
probability distributions
P#13. The number of hurricanes hitting the coast of Florida
annually has a Poisson distribution with a mean of 0.8.
a)what is the probability that more than two hurricanes will
hit the Florida coast in a year?
b)what is the probability that exactly one hurricane will hit
the coast of Florida in a year?
X : The no. of hurricanes hitting the coast of Florida/year
X ~ P( λ = 0.8)
(a)P( X > 2 ) = 1 – P( X ≤ 2) = 0.0474226
(b)P( X = 1 ) = 0.3594632
probability distributions
P# 14. Arrivals at a bank teller’s drive-through window are
Poisson distributed at the rate of 1.2 per minute.What is the
probability of :
(a)zero arrivals in the next minute
(b) zero arrivals in the next two minutes
X : No. of arrivals / minute
X ~ P( λ = 1.2/minute)
(a)P( X = 0) = 0.3011942
(b) X : No. of arrivals / 2 minutes , λ = 2.4/ 2 minutes
P( X = 0) = 0.09071795
probability distributions
P #15. A local electrical appliances shop has found from
experience that the demand for tube lights is distributed as
Poisson with a mean of 3 tubes per week. If the shop keeps 4
tubes during particular week, what is the probability that the
demand will exceed the supply during that week ?
X : Demand(No. of tube lights) for tube lights/week
X ~ P(λ = 3 / week)
P( X > 4) = 1 – P( X ≤ 4 )
= 0.1847368
probability distributions
P#16. Records indicate that 1.8% of the entering students at a
large state university drop out of school by midterm. What is
the probability that three or fewer students will drop out of a
random group of 300 entering students ?
n = 300 , p = 1.8/100
When n is large and p is small, the binomial distribution is well
approximated by the Poisson distribution with λ = np (the binomial mean).
np = 300 * 1.8 / 100 = 5.4 = λ
Hence X ~ P(λ = 5.4 per group of 300)
P( X ≤ 3) = 0.213291
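To see how close the Poisson approximation is here, the exact binomial probability can be computed alongside it. This is a small sketch with illustrative object names; the two results should differ only in the third decimal place.

```r
n <- 300; p <- 0.018

p_binom <- pbinom(3, n, p)     # exact binomial P(X <= 3)
p_pois  <- ppois(3, n * p)     # Poisson approximation with lambda = np = 5.4

print(c(binomial = p_binom, poisson = p_pois, difference = p_binom - p_pois))
```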
probability distributions
P#17. If 2% of electric bulbs manufactured by a certain
company are defective, find the probability that in a sample of
200 bulbs (i) less than two bulbs (ii) more than three bulbs are
defective.
n = 200 , p = 2/100
np = 200 * 2/100 = 4 = λ
X ~ P(λ = 4 per sample of 200)
(i) P( X < 2) = P( X ≤ 1) = 0.09157819
(ii) P( X > 3 ) = 1 – P(X ≤ 3) = 0.5665299
probability distributions
P#18.It is known from past experience that in a certain plant
there are on an average four industrial accidents per month.
Find the probability that in a given month there will be less
than four accidents.
X : No. of industrial accidents / month
X ~ P( λ = 4 / month)
P(X < 4 ) = P( X ≤ 3)
= 0.4334701
probability distributions
P#19. Draw a random sample of size 8 from a binomial
distribution with n = 6 and p = 1/3. Find mean and SD of
sample.
> r = rbinom(8, 6 , 1/3)
> M =mean(r) ; SD = sqrt(var(r))
P #20. Draw a random sample of size 5 from a binomial
distribution with mean = 3.2 and variance = 1.92.
Find mean and variance of sample.
np = 3.2 , npq = 1.92 , so q = 1.92/3.2 = 0.6 , p = 0.4 , n = 3.2/0.4 = 8
probability distributions
P #20. > r = rbinom(5, 8, 0.4)
> M=mean(r) ; V = var(r)
P #21. Draw a random sample of size 10 from a Poisson
distribution with parameter 1.7. Find mean and SD of
sample.
> R = rpois(10, 1.7)
> M =mean(R) ; SD = sqrt(var(R))
probability distributions
P#22. Draw a random sample of size 5 from a Poisson
distribution with mean = 0.8. Find mean and SD of sample.
> r = rpois( 5 , 0.8)
> M= mean(r) ; SD = sqrt(var(r))
Fitting of binomial distribution
P #23. The frequency distribution of number of heads
obtained in an experiment of tossing 5 coins 110 times is given
below :
No. of heads : 0 1 2 3 4 5
Frequency : 6 15 25 42 18 4
Fit a binomial distribution to the above data and find the expected frequencies.
Plot observed and expected frequencies and comment on the adequacy of model.
Fitting of binomial distribution
> x = 0:5 ; n = 5
> f=c(6,15,25,42,18,4) ; N = sum(f)
> smean = sum(x * f)/ N
> p = smean/n
> px= dbinom( x , n , p)
> px= round(px, 6)
> ef = N * px
> ef1=round(ef , 0)
> d = data.frame(x , f , px , ef1)
probability distributions
> plot(f, ef1, xlab = "Observed freq.", ylab = "Expected freq.", type = "p", pch = 16)
> abline(0,1)
probability distributions
P#24. Fit a Poisson distribution to the following data and
find expected frequencies. Plot observed and expected
frequencies and comment on adequacy of model.
No. of faults : 0 1 2 3 4 5 6 and more
No. of shifts : 4 14 23 23 18 9 9
> x = 0:6
> f = c(4 , 14 , 23 , 23 , 18 , 9 , 9 ) ; N = sum(f)
> M = sum( x * f ) / N
> px = round(dpois( x , M),6)
> ef = round( N *px , 0)
probability distributions
> px1 = 1 - sum(px)
> x = 0:7
>f = c( f , 0) ; px= c(px, px1)
> ef = c( ef , N*px1)
> data.frame(x , f , px , ef )
> plot(f, ef, xlab = "Observed freq.", ylab = "Expected freq.", type = "p", pch = 16)
> abline(0,1)
Probability, combination and permutation
Permutation : n P r = n! / (n – r )!
e.g. 5P 2 = 5! / 3! = (5 × 4 × 3 × 2) / ( 3 × 2 ) = 20
Find the values of (i) 9C3 (ii) 10C4 (iii) 8 P 2 (iv) 6 P3
Solution :
(i) > c1 = choose(9 , 3)
(ii) > c2 =choose(10 , 4)
(iii) > r1 = 1:2 ; p1 = prod(r1)*choose(8 , 2)
(iv) > r2 = 1:3 ; p2 = prod(r2)*choose(6 , 3)
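Base R has `choose()` for combinations but no built-in permutation function, which is why the solution above multiplies `choose(n, r)` by r!. An equivalent sketch uses a small helper; the name `nPr` is hypothetical, not a base R function.

```r
# Hypothetical helper: number of permutations of n objects taken r at a time
nPr <- function(n, r) factorial(n) / factorial(n - r)

c1 <- choose(9, 3)    # (i)   9C3
c2 <- choose(10, 4)   # (ii)  10C4
p1 <- nPr(8, 2)       # (iii) 8P2
p2 <- nPr(6, 3)       # (iv)  6P3

print(c(c1, c2, p1, p2))
```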
Probability
P # 1. In how many ways can committee of 4 persons be
formed from 4 teachers and 8 students such that exactly 2
teachers will be included in the committee ?
Solution :
Out of 4 teachers , 2 teachers can be selected in 4C2 ways
out of 8 students remaining 2 persons can be selected in 8C2
ways. Required no. ways = 4C2 × 8C2
R – code : > nw = choose(4 , 2) * choose(8 , 2 )
168 ways
Probability
P # 2. What is the probability of drawing 2 Kings from a
well shuffled pack of 52 playing cards ?
Solution :
n = 52C2 , m = 4C2 , required prob. = m /n = 4C2 / 52C2
R – code :
> rp = choose(4 , 2 ) / choose( 52 , 2 )
0.004524887
Probability
P # 3. A show-room has 6 Toyota and 8 Maruti cars. A sample of
4 is selected at random. What is the prob. of selecting
(i) at least 2 Toyota (ii) Exactly 3 Maruti
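X : No. of Toyota cars. X follows the hypergeometric distribution :
$P(X = x) = \dfrac{{}^{M}C_{x}\; {}^{N-M}C_{n-x}}{{}^{N}C_{n}}$
where N = population size , M = no. of items of the type of interest ,
N – M = no. of remaining items and n = sample size.
Here $P(X = x) = \dfrac{{}^{6}C_{x}\; {}^{8}C_{4-x}}{{}^{14}C_{4}}$ , x = 0, 1, 2, 3, 4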
Probability
P#3. N = 6 + 8 = 14 , n = 4 , X : No. of Toyota cars
(i) At least 2 Toyota cars = 2 or 3 or 4
R code :
> x = 2:4
> N = sum(choose(6 , x) * choose(8 , 4 - x))
> D = choose( 14 , 4)
> P = N/D
= 0.5944056
(ii) P( Exactly 3 Maruti) = P ( 1 Toyota and 3 Maruti)
= P (X = 1)
> P = choose(6 , 1) *choose(8 , 3) /choose(14 , 4 )
= 0.3356643
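The same two probabilities can be obtained directly from R's hypergeometric functions instead of building the ratio of combinations by hand. In `dhyper(x, m, n, k)`, `m` is the number of items of the type of interest (6 Toyota), `n` the number of other items (8 Maruti) and `k` the sample size (4). Object names below are illustrative.

```r
# (i) At least 2 Toyota cars in a sample of 4
p_at_least_2 <- sum(dhyper(2:4, m = 6, n = 8, k = 4))

# (ii) Exactly 3 Maruti cars, i.e. exactly 1 Toyota car
p_three_maruti <- dhyper(1, m = 6, n = 8, k = 4)

print(c(p_at_least_2, p_three_maruti))
```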
Probability
P # 4. Five cards are drawn without replacement from a pack
of 52 cards. Find the prob. that (i)No diamond card is drawn
(ii)Exactly two diamond cards are drawn (iii)at least two
diamond cards are drawn.
Probability
P # 4. X : No. of diamond cards
N = 52 , M = 13 , n = 5 , N – M = 39
$P(X = x) = \dfrac{{}^{M}C_{x}\; {}^{N-M}C_{n-x}}{{}^{N}C_{n}}$
$P(X = x) = \dfrac{{}^{13}C_{x}\; {}^{39}C_{5-x}}{{}^{52}C_{5}}$ , x = 0, 1, ..., 5
Probability
P # 4.
(i) P(X=0) = 13C0 × 39C5 / 52C5 = 0.2215336
(ii) P(X=2) = 13C2 × 39C3 / 52C5 = 0.2742797
(iii) P( X >= 2) = P(X=2)+P(X=3)+P(X=4)+P(X=5)
R – code :
> x= 2:5
> sum(choose(13,x) *choose(39, 5-x)) / choose(52,5)
0.3670468
Probability
P # 5. Four cards are drawn at random from a well shuffled
pack of 52 cards. Find the prob. that :
(i)Two cards are red and the remaining black
(ii) All cards are of different suits
(iii)All are of same suit
(iv)One is King
Solution : n = 52C4
(i) m = 26 C2 × 26 C2 P = m / n = 0.3901561
(ii) m = 13C1 × 13C1 × 13C1 × 13C1 = 13^4 , P = m / n = 0.1054982
Probability
P # 5. (iii) P(All are of same suit) = (13C4 + 13C4 + 13C4 + 13C4) / 52C4
= 0.01056423
(iv) P(One is King) = (4C1 × 48C3) / 52C4 = 0.2555508
Probability
P # 6. Out of 20 persons in a company , five are graduates.
If 3 persons are selected at random, what is the prob. that
(i)They are all graduates (ii)There is no graduate
(iii)At least two of them are graduates
Solution :
N = 20 (5 graduates and 15 non-graduates) , n = 3
(i) P(all are graduates) = 5C3 / 20C3 = 0.00877193
(ii) P(There is no graduate) = 15C3 / 20C3 = 0.3991228
(iii) P(At least two are graduates) = P(Two) + P(Three)
= (5C2 × 15C1 + 5C3 × 15C0) / 20C3 = 0.1403509
Continuous probability distributions

Exponential
Normal
EXPONENTIAL DISTRIBUTION
Definition : A continuous random variable X
is said to follow exponential distribution
if it takes values x ≥ 0 and its p.d.f.
is given by $f(x) = \lambda e^{-\lambda x}$ , x ≥ 0
= 0 , otherwise
Mean = 1 / λ , $F(x) = P(X \le x) = 1 - e^{-\lambda x}$
Another form (here λ denotes the mean) : $f(x) = \frac{1}{\lambda} e^{-x/\lambda}$ , x ≥ 0
= 0 , otherwise
Mean = λ , $F(x) = P(X \le x) = 1 - e^{-x/\lambda}$
Exponential Distribution PDF graph
EXPONENTIAL DISTRIBUTION Problems

P # 1. If X ~ Exp( mean = 2 ) , find (i)P( X ≤ 5) (ii)P( X ≥ 3)


Solution :
(i) > pexp(5 , 1/2 )
0.917915
(ii) > 1 - pexp(3 , 1/2 )
0.2231302
EXPONENTIAL DISTRIBUTION Problems
P#2.The mileage (in thousands of miles ) which car owners get
with a certain kind of tyres is a random variable having
probability density function :
$f(x) = \frac{1}{20} e^{-x/20}$ , x > 0
= 0 , otherwise
Find the probability that one of these tyres will last for
(i)at most 10000 miles
(ii)any where from 16000 to 24000 miles
X : Life of tyres in ‘000 miles
R – code : (i) P(X ≤ 10) : > pexp(10, 1/20) = 0.3934693
(ii) P(16 < X < 24) : > pexp(24 , 1/20) - pexp(16, 1/20)
= 0.1481348
EXPONENTIAL DISTRIBUTION Problems
P#3. The lifetime, in years, of a satellite placed in orbit is
given by the following pdf
$f(x) = 0.4\, e^{-0.4x}$ , x > 0
= 0 , otherwise
(a)What is the probability that this satellite is still alive after
5 years ?
(b)What is the probability that the satellite dies between
3 and 6 years from the time it is placed in orbit
X : The lifetime of satellite in years
(a) P( X > 5 ) = 1 – P( X ≤ 5 ) = 0.1353353
(b) P( 3 < X < 6 ) = F(6) – F(3) = 0.2104763
EXPONENTIAL DISTRIBUTION Problems
P# 4 The number of days ahead travelers purchase their
airline tickets can be modeled by an exponential distribution
with the average amount of time equal to 15 days.
(a) Find the probability that a traveler will purchase a ticket
fewer than 10 days in advance.
(b) How many days do half of all travelers wait?
EXPONENTIAL DISTRIBUTION Problems
P# 4 Solution :
X : No. of days in advance the traveler will purchase ticket
X ~ Exp( mean = 15 ) , $f(x) = \frac{1}{15} e^{-x/15}$ , x > 0
= 0 , otherwise
(a) P( X < 10 ) = $1 - e^{-10/15}$ = 0.486583
R – Code : > pexp(10 , 1/15)
0.4865829
(b) P( X < k ) = 0.5 ( Given)
> qexp(0.5 , 1/15)
k = 10.39721
EXPONENTIAL DISTRIBUTION Problems
P# 5. On average , a certain device lasts 10 years. Assume that
The length of time is exponentially distributed.
(a) What is the prob. that the device lasts more than 7 years ?
(b) Eighty percent of device last at most how long ?
(c) What is the prob. that the device lasts between 9 and 11
years.
EXPONENTIAL DISTRIBUTION Problems
P# 5. X : The life of the device in years.
X ~ Exp(mean = 10 years) , $f(x) = \frac{1}{10} e^{-x/10}$ , x > 0
= 0 , otherwise
(a) P(X > 7) = 1 – P( X ≤ 7)
> 1 - pexp( 7 , 1/10)
0.4965853
(b) P( X ≤ K ) = 0.80
i.e. P( X > K ) = 0.20 , K = 16.09438
(c) P( 9 < X < 11 )
= P( X < 11) – P (X < 9)
= 0.07369858
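All three parts of P# 5 can be evaluated with `pexp()` and `qexp()`, remembering that R parameterises the exponential distribution by the rate (1/mean). A minimal sketch with illustrative object names:

```r
rate <- 1 / 10    # mean life of 10 years

p_gt7   <- pexp(7, rate, lower.tail = FALSE)   # (a) P(X > 7)
k80     <- qexp(0.80, rate)                    # (b) K with P(X <= K) = 0.80
p_9to11 <- pexp(11, rate) - pexp(9, rate)      # (c) P(9 < X < 11)

print(c(p_gt7, k80, p_9to11))
```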
PROPERTIES
1 # Exponential distribution is a continuous
probability distribution with x ≥ 0 and its
p.d.f. is $f(x) = \lambda e^{-\lambda x}$
2 # c.d.f. of exponential : $F(x) = 1 - e^{-\lambda x}$
3 # Mean of the distribution is 1/λ
4 # Variance of the distribution is $1/\lambda^{2}$ and
s.d. is 1/λ
5 # Parameter of the distribution is λ
NORMAL DISTRIBUTION
Definition : A continuous random variable X
is said to follow normal distribution if it takes
values – ∞ < x < ∞ and its
probability density function is given by :
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
PROPERTIES
1# Normal distribution is a continuous
distribution with – ∞ < x < ∞ and its p.d.f.
is : $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
2 # The parameters : µ and σ , – ∞ < µ < ∞ , σ > 0
PROPERTIES Contd..
3 # Mean = µ , variance = σ2 and s.d. = σ
4 # In normal distribution
mean = median = mode
PROPERTIES Contd..
5 # The Normal curve is bell shaped and
symmetric about x = µ

PROPERTIES Contd..
6 # In normal distribution Q1 and Q3 are
equidistant from the mean : Q1 = µ – 0.675 σ , Q3 = µ + 0.675 σ
PROPERTIES Contd..
6 # (contd.) Quartile Deviation (Q.D.) = 0.675 σ and
Mean Deviation (M.D.) = $\sqrt{2/\pi}\,\sigma$
≈ 0.8 σ
7 # The measure of skewness = 0 , β2 = 3 , so the excess kurtosis (β2 – 3) = 0
8 # The first four central moments are
µ1 = 0 , µ2 = σ²
µ3 = 0 , µ4 = 3σ⁴
PROPERTIES Contd..
9 # The points of inflections of the curve are
x=µ ±σ

Area property
Total area under the normal curve = 1
= 100%
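As an illustration of the area property, the area under the standard normal curve within 1, 2 and 3 standard deviations of the mean can be computed with `pnorm()`. The helper name `area_within` is illustrative; these three areas are a standard example and are not taken from the slide itself.

```r
# Area under the standard normal curve between -k and +k
area_within <- function(k) pnorm(k) - pnorm(-k)

areas <- sapply(1:3, area_within)   # within 1, 2 and 3 standard deviations
total <- pnorm(Inf) - pnorm(-Inf)   # total area under the curve

print(round(areas, 4))
print(total)
```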
Standard Normal Variate(S.N.V.)
A normal variable with mean 0 and s.d. 1 is
known as standard normal variable (S.N.V.)
and usually it is denoted by Z.
$Z = \dfrac{X - \mu}{\sigma}$ , symbolically Z ~ N(0, 1)
p.d.f. of Z is $f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$
CENTRAL LIMIT THEOREM

This theorem was first stated by Laplace in 1812.
Statement : Let X1 , X2 , X3 , ..., Xn be independent
random variables, Xi having mean µi and
variance σi². Then, for large n (under mild regularity conditions),
Sn = X1 + X2 + X3 + ... + Xn = ∑Xi approximately follows normal
distribution with mean ∑µi and
variance ∑σi². In particular, if every Xi has mean µ and variance σ², the
sample mean x̄ approximately follows normal distribution
with mean µ and variance σ²/n.
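The theorem can be illustrated by simulation. The sketch below is an assumed example, not part of the course data: it draws 5000 samples of size 30 from an exponential population with mean 2 (variance 4) and checks that the sample means have mean close to µ = 2 and variance close to σ²/n = 4/30.

```r
set.seed(1)                    # for reproducibility
n    <- 30                     # sample size
reps <- 5000                   # number of simulated samples

# Each replicate: mean of n exponential observations with mean 2 (rate 1/2)
xbar <- replicate(reps, mean(rexp(n, rate = 1/2)))

mean_xbar <- mean(xbar)        # should be close to mu = 2
var_xbar  <- var(xbar)         # should be close to sigma^2 / n = 4/30

print(c(mean_xbar, var_xbar, 4 / 30))
# hist(xbar) would show a roughly bell-shaped histogram
```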
IMPORTANCE OF THE NORMAL DISTRIBUTION

Normal Distribution plays a very important
role in Statistical theory and, in particular,
in Sampling Theory. It has been found that :
1# Data obtained from Psychological,
Physical and Biological measurements
approximately follow Normal distribution.
IMPORTANCE OF THE NORMAL
DISTRIBUTION
2# Distributions like Binomial, Poisson, etc.,
can be approximated by it. If the number of
trials n is very large and neither p
nor q is very small, then Binomial
distribution tends to Normal distribution. If
the parameter λ is very large ( i.e., as
λ → ∞ ), then Poisson distribution tends to
normal distribution.
IMPORTANCE OF THE NORMAL
DISTRIBUTION
3# For large samples, many statistics ( e.g. sample mean, sample
S.D. etc.) approximately follow Normal distribution and as
such they can be studied with the help of normal curve.
4# Normal curve is used to find confidence limits of the
population parameters.
5# Normal distribution is largely applied in statistical
quality control ( S.Q.C.) in industry for finding control
limits.
IMPORTANCE OF THE NORMAL
DISTRIBUTION
6# The theory of errors of observations in
physical measurements is based on
Normal distribution.
Problems of Normal distribution
P #1. X ~ N(µ = 5 , σ = 2 )
Find (i)P( X > 4) (ii)P(X ≤ 3) (iii)P(10 < X < 15)
Solution : (i) > 1 - pnorm(4 , 5 , 2 )
0.6914625
(ii) > pnorm(3 , 5 , 2)
0.1586553
(iii) > pnorm(15 , 5 , 2) - pnorm(10 , 5 , 2)
0.006209379
Problems of Normal distribution
P #1. X ~ N(µ = 5 , σ = 2 )
Find (i)P( X > 4) using the standard normal table
(i) P(X > 4) = P( (X – µ) / σ > (4 – µ) / σ )
= P( Z > – 0.5 )
= P( – 0.5 < Z < 0 ) + P( Z > 0 )
= 0.1915 + 0.5
= 0.6915
Problems of Normal distribution
P # 2. Let X ~ N(µ = 100 , σ² = 64 )
Find (i)P(X ≤ 110 ) (ii)P( | X – 95 | < 5 )
(iii) K such that (a) P( X ≥ K ) = 0.9 , (b) P( X < K ) = 0.01
Solution : > mu = 100 ; sd = 8
(i) > pnorm( 110 , mu , sd )
0.8943502
(ii) P( | X – 95 | < 5 ) = P( 90 < X < 100)
> pnorm(100 , mu , sd) - pnorm(90, mu,sd)
0.3943502
(iii) (a) P( X ≥ K ) = 0.9 means P( X < K ) = 0.1
> K1 = qnorm( 0.1 , mu , sd)
K = 89.74759 ≈ 90
(b) > K2 = qnorm(0.01, mu , sd)
K = 81.38922 ≈ 81
Problems of Normal distribution
P # 3. If the heights of 1000 soldiers in a regiment are
normally distributed with a mean of 172 cm. and s.d.
of 5 cm. , how many soldiers have heights greater than
180 cm.
Solution : X : Heights of soldiers in cms. N = 1000
X ~ N( µ = 172 cms. σ = 5 cms.)
P(X > 180) = 1 – P( X ≤ 180)
> 1 - pnorm(180 , 172, 5 )
0.05479929
The expected no. of soldiers with height > 180 = N × P(X > 180)
= 1000 × 0.05479929 = 54.79929 ≈ 55
Problems of Normal distribution
P # 4. The income distribution of a group of 10000 persons
was found to be normal with mean of Rs. 7500 per
month and s.d. of Rs. 500 per month. What percentage
of this group had income
(i) exceeding Rs. 6680
(ii) not more than Rs. 7000
Solution : X : Monthly income of a person in the group in Rs.
X ~ N( µ = 7500 , σ = 500)
(i) P(X > 6680) = 1 – P(X ≤ 6680)
> 1 - pnorm(6680 , 7500 , 500)
0.9494974 = 94.94974 %
(ii) P(X ≤ 7000) : > pnorm(7000, 7500, 500) = 0.1586553 = 15.86553 %
Problems of Normal distribution
P # 5. The distribution of monthly incomes of a group of 3000 factory
workers is following normal distribution with the mean equal to
Rs. 10000 and s.d. Rs 2000.
Find (i) the percentage of workers having a monthly income of
more than Rs. 12000
(ii) the number of workers having a monthly income of less than
Rs. 9000
(iii)the highest monthly income among the lowest paid 100 workers
(iv) the least monthly income among the highest paid 100 workers
Problems of Normal distribution
P # 5. X : Monthly income of a worker in Rs.
X ~ N( µ = 10000 , σ = 2000 ) , Z = (X – 10000) / 2000
(iii) Proportion = 100/3000 = 0.03333 = 3.333 %
Given P( X ≤ K ) = 0.03333
K = 6332.081
(iv) P( X > K ) = 0.03333 = 3.333 %
i.e. P( X ≤ K ) = 0.96667
K = 13667.92
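All four parts of P # 5 can be worked out with `pnorm()` and `qnorm()`. The sketch below uses illustrative object names and the exact proportion 100/3000 instead of the rounded 0.03333, so parts (iii) and (iv) may differ from the values above in the first decimal place.

```r
mu <- 10000; sigma <- 2000; N <- 3000

# (i) percentage of workers earning more than Rs. 12000
pct_gt_12000 <- 100 * pnorm(12000, mu, sigma, lower.tail = FALSE)

# (ii) expected number of workers earning less than Rs. 9000
n_lt_9000 <- N * pnorm(9000, mu, sigma)

# (iii) highest income among the lowest paid 100 workers
k_low <- qnorm(100 / N, mu, sigma)

# (iv) least income among the highest paid 100 workers
k_high <- qnorm(1 - 100 / N, mu, sigma)

print(c(pct_gt_12000, n_lt_9000, k_low, k_high))
```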
R code for probability distributions

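Distribution      R code    Arguments
Binomial          binom     (x, n, p)
Poisson           pois      (x, λ)
Hypergeometric    hyper     (x, M, N – M, n)
Uniform           unif      (x, a, b)
Exponential       exp       (x, 1/mean)
Normal            norm      (x, µ, σ)
Prefix the code with d (p.m.f. / p.d.f.), p (c.d.f.), q (quantile) or r (random sample),
e.g. dbinom, pnorm, qexp, rpois.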
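A short sketch of how the four prefixes combine with the codes in the table; the particular distributions and values are chosen only for illustration.

```r
d <- dbinom(2, size = 5, prob = 0.3)   # d : P(X = 2) for bin(5, 0.3)
p <- pbinom(2, size = 5, prob = 0.3)   # p : P(X <= 2) for bin(5, 0.3)
q <- qnorm(0.975, mean = 0, sd = 1)    # q : z with P(Z <= z) = 0.975

set.seed(42)                           # for a reproducible random sample
r <- rpois(5, lambda = 1.7)            # r : random sample of size 5 from P(1.7)

print(c(d, p, q))
print(r)
```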


Correlation & Regression
• Simple correlation & regression
• Finding of predicted values & residual values
• Interpretation of r² & adjusted r²
• Multiple regression
• ANOVA in regression
• t test in regression
• Multi-collinearity
Unit –III
Correlation and Regression
Correlation is the relationship that exists between two
or more variables.
If two or more variables are related to each other in such a way
that a change in one is accompanied by a corresponding increase
or decrease in the other, then the variables are
said to be correlated.
Examples : 1. Relation between heights and weights
2. Relation between dose of insulin and blood sugar.
3. Income and expenditure
4. Cigarette smoking and lung cancer.
Types of correlation

Positive
Negative
Zero
Types of correlation
• Positive correlation : If both the variables are varying in
the same direction then the correlation is said to be
positive.
• Negative correlation : If both the variables are varying in
opposite directions then the correlation is said to be
negative.
• Zero correlation : If there is no relationship between two
variables then the correlation is said to be zero.
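The three types can be illustrated with cor(), which returns the product moment correlation coefficient. The small vectors below are made up for this sketch and are not data from the course. Note that this coefficient measures linear association only, so the third pair gives r = 0 even though y is an exact (non-linear) function of x.

```r
# Hypothetical data chosen only to illustrate the sign of r
x <- c(-2, -1, 0, 1, 2)

r_pos  <- cor(x, 2 * x + 1)   # y rises with x : positive correlation
r_neg  <- cor(x, -3 * x)      # y falls as x rises : negative correlation
r_zero <- cor(x, x^2)         # no linear relationship : zero correlation

print(c(r_pos, r_neg, r_zero))
```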
Regression Analysis
Definition: Regression takes its name from studies made by
Sir Francis Galton. He compared the heights of persons to the
heights of their parents. His major conclusion was that the
offspring of unusually tall persons tend to be shorter than
their parents while children of unusually short parents tend to
be taller than their parents.
Regression Analysis
In a sense, the successive generations of offspring from tall
persons "regress" downward towards the average height of the
population, while the reverse is true for short
families. But the distribution of height for the total
population continues to have the same variability from
generation to generation.
Regression Analysis
By regression (linear) we mean the average relationship between
two variables, and this relationship is used to estimate or
predict the most likely values of one variable for specified
values of the other variable. Regression is used to describe a
cause and effect relation between variables.
E.g. Demand (cause) and supply (effect), Age (cause) and
height (effect).
Regression Analysis
One of the variables is called the independent or explanatory
variable and the
other is called the dependent or explained variable.
There are two regression equations (y on x and x on y).
The regression model of y on x is : y = β0 + β1 x + e
where y is the dependent variable, x is the independent variable
and e is the residual (or error term), which follows a normal
distribution with mean E(e) = 0 and
variance V(e) = σ²
Regression
ei = yi – (β0 + β1 xi)
∑ ei² is minimized using the derivative method, and two
equations are obtained which are known as the "normal
equations". By solving them we get the unknown
values of β0 and β1. The equations are :
• ∑y = n β0 + β1 ∑x ------------- (1)
• ∑xy = β0 ∑x + β1 ∑x² ------------- (2)
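The two normal equations form a 2 × 2 linear system in β0 and β1, so they can be solved directly in R and compared with the output of lm(). The sketch below uses a small made-up data set (x and y are hypothetical values, not from the course).

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Normal equations written as A %*% c(b0, b1) = b
#   sum(y)   = n * b0      + b1 * sum(x)
#   sum(x*y) = b0 * sum(x) + b1 * sum(x^2)
A <- matrix(c(n,      sum(x),
              sum(x), sum(x^2)), nrow = 2, byrow = TRUE)
b <- c(sum(y), sum(x * y))

beta <- solve(A, b)    # beta[1] = intercept b0, beta[2] = slope b1

# The same estimates from R's built-in least squares fit
fit <- lm(y ~ x)

print(beta)
print(coef(fit))
```

Both routes should give the same intercept and slope, since lm() also minimizes ∑ ei².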

 
Regression
This process of minimizing ∑ ei² is known as "Ordinary
Least Squares" (OLS).
• Assumptions of OLS :
• E(ei) = 0
• Var.(ei) = E(ei²) = σ²
= constant (Homoscedasticity)
• If Var.(ei) is not constant across observations, the errors
are heteroscedastic and the assumption is violated.
Multiple Regression
• In simple regression analysis only one
independent variable is included, and the value
of the dependent variable is predicted through the appropriate
regression line, e.g. Sales (Y) and advertising
expenditure (X).
• Then the simple linear regression equation can be :
• Y = β0 + β1 X
Multiple Regression
If we take sales as a function of advertising expenditure,
then we can predict sales for a given advertising
expenditure using the regression line of sales (Y) on
advertising expenditure (X). If R² = 0.80, this means
that (1 – 0.80) × 100 % = 20 % of the variation in sales could be
due to the influence of other variables (factors) besides
advertising expenditure.
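In R the value of R² is available from summary() of a fitted lm object. The sketch below uses made-up sales and advertising figures (adv and sales are hypothetical, not data from the course) and computes the explained and unexplained percentages of variation.

```r
# Hypothetical advertising expenditure (X) and sales (Y)
adv   <- c(10, 12, 15, 18, 20, 25)
sales <- c(42, 50, 55, 68, 70, 88)

fit <- lm(sales ~ adv)

r2 <- summary(fit)$r.squared        # proportion of variation explained
explained_pct   <- 100 * r2
unexplained_pct <- 100 * (1 - r2)   # left to other factors

print(c(explained_pct, unexplained_pct))
```

For simple regression R² is the square of the product moment correlation coefficient, so r2 equals cor(adv, sales)^2.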
Multiple Regression
For instance , per capita income in the concerned trading
area could also have an influence on sales. Then results
of the simple regression model might be improved by
adding per capita income as an explanatory
(independent) variable. This extension of the simple
regression technique i.e. the use of two or more
independent variables, is known as multiple regression
analysis. The multiple regression model is
Y = β0 + β1 X1 + β2 X2
Multiple Regression
Sometimes there is interrelation between many
variables and the value of one variable may be
influenced by many others, e.g. the yield of a crop per
acre, say (Y), depends upon quality of seed (X1), fertility
of soil (X2), fertilizer used (X3), irrigation
facility (X4), weather conditions (X5) and so on. The
study of the joint effect of a group of variables upon a
variable not included in that group is called Multiple
Regression.
7 assumptions of OLS
1.The regression model is linear in the coefficients and the error
term
2. The error term has a population mean of zero
3. All independent variables are uncorrelated with the error term
4. Observations of the error term are uncorrelated with each other
5. The error term has a constant variance (no heteroscedasticity)
6. No independent variable is a perfect linear function of other
explanatory variables
7. The error term is normally distributed
Multicollinearity in Regression
Multicollinearity occurs when independent variables in
a regression model are correlated. This correlation is a
problem because independent variables should be
independent. If the degree of correlation between variables is
high enough, it can cause problems when you fit the model
and interpret the results.
Why is Multicollinearity a Potential
Problem?
• A key goal of regression analysis is to isolate the relationship
between each independent variable and the dependent
variable. The interpretation of a regression coefficient is that it
represents the mean change in the dependent variable for each
1 unit change in an independent variable when you hold all of
the other independent variables constant. That last portion is
crucial for our discussion about multicollinearity.
• The idea is that you can change the value of one independent
variable and not the others. However, when independent
variables are correlated, it indicates that changes in one
variable are associated with shifts in another variable.
Multicollinearity in Regression
• The stronger the correlation, the more difficult it is to change
one variable without changing another. It becomes difficult for
the model to estimate the relationship between each
independent variable and the dependent
variable independently because the independent variables tend
to change in unison.
Why Multi-Collinearity is a problem?
• When independent variables are highly correlated, a change in
one variable would cause a change in another, and so the model
results fluctuate significantly. The model results will be
unstable and vary a lot given a small change in the data or
model. This will create the following problems:
1. It would be hard for you to choose the list of significant
variables for the model if the model gives you different
results every time.
2. Coefficient estimates would not be stable and it would be
hard for you to interpret the model. In other words, you cannot
tell the scale of changes to the output if one of your predicting
factors changes by 1 unit.
3. The unstable nature of the model may cause overfitting. If you
apply the model to another sample of data, the accuracy will drop
significantly compared to the accuracy of your training dataset.
How to fix Multi-Collinearity issue?

1. Variable Selection
• The most straight-forward method is to remove some variables
that are highly correlated to others and leave the more
significant ones in the set.
2. Variable Transformation
• The second method is to transform some of the variables to
make them less correlated but still maintain their feature.
3. Principal Component Analysis
• Principal Component Analysis (PCA) is commonly used to
reduce the dimension of data by decomposing data into a
number of independent factors. It has many applications like
simplifying model calculation by reducing the number of
predicting factors.

Under perfect correlation among the explanatory variables,
the OLS estimators cannot be determined (they are indeterminate).
YOUTUBE LINK(in Hindi) :
https://www.youtube.com/watch?v=03RBbjUTp-c
Multiple Regression Problem
GDP = c(8461.5, 11924.2, 9375.7, 7758.8, 6072.9, 5491.7, 5237.2, 4586.9, 4556.1, 4850.4, 5872.2, 7060.5, 8402.4)

IN = c(1389, 1040.7, 627.3, 581.4, 526.3, 316.7, 584.8, 557, 704.4, 956.9, 1080.6, 1332, 1600.9)

TOURIST = c(7070, 6595, 5727, 5552, 5109, 4920, 4875, 4847, 5057, 5639, 5805, 6216, 6972)

V = c(24111, 21838, 19611, 19183, 17670, 17647, 18122, 17277, 17845, 18501, 18373, 18992, 20593)
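A multiple regression on these four series can be fitted with lm(). The slide does not say which variable is the dependent one, so the sketch below assumes GDP is the response and IN, TOURIST and V are the explanatory variables. That choice, and the names dat and fit, are assumptions made here for illustration.

```r
GDP <- c(8461.5, 11924.2, 9375.7, 7758.8, 6072.9, 5491.7, 5237.2,
         4586.9, 4556.1, 4850.4, 5872.2, 7060.5, 8402.4)
IN <- c(1389, 1040.7, 627.3, 581.4, 526.3, 316.7, 584.8, 557,
        704.4, 956.9, 1080.6, 1332, 1600.9)
TOURIST <- c(7070, 6595, 5727, 5552, 5109, 4920, 4875, 4847,
             5057, 5639, 5805, 6216, 6972)
V <- c(24111, 21838, 19611, 19183, 17670, 17647, 18122, 17277,
       17845, 18501, 18373, 18992, 20593)

dat <- data.frame(GDP, IN, TOURIST, V)

# Assumed model: GDP = b0 + b1*IN + b2*TOURIST + b3*V + e
fit <- lm(GDP ~ IN + TOURIST + V, data = dat)

print(summary(fit))    # coefficients, t tests, R-squared, adjusted R-squared, F test

# Pairwise correlations among the explanatory variables:
# large values are a first warning sign of multicollinearity
print(cor(dat[, c("IN", "TOURIST", "V")]))
```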

 
Multiple Regression Problem: finding VIF
> install.packages("car")    # choose a CRAN mirror when prompted, e.g. India (http)
> library(car)
> data = mtcars[, c("disp", "hp", "wt", "drat")]    # the four predictors used below
> model = lm(mpg ~ disp + hp + wt + drat, data = mtcars)
> VIF = vif(model)
> barplot(VIF, main = "VIF values", horiz = TRUE, col = "steelblue")
> abline(v = 5, lwd = 3, lty = 2)    # dashed reference line at VIF = 5
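For numeric predictors the VIF of a variable is 1 / (1 – R²), where R² comes from regressing that variable on the remaining predictors. The sketch below computes this in base R for the same mtcars model, without the car package. The names predictors and vif_manual are chosen here for illustration.

```r
predictors <- c("disp", "hp", "wt", "drat")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # auxiliary regression of one predictor on all the others
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(vif_manual)
```

A VIF of 1 means a predictor is uncorrelated with the others. Values above the reference line at 5 used in the bar plot are commonly read as a sign of serious multicollinearity.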
For checking normality of residuals in regression in R
> res = c(0.19, 2.12, 0.992, 0.123, -0.47, -1.34, 0.678)
> qqnorm(res)
> qqline(res)
ANOVA in Regression
For simple regression :
Ho : β1 = 0 , H1 : β1 ≠ 0
ANOVA table
Source       d.f.    Sum of squares (S.S.)     Mean sum of squares (M.S.S.)    F cal.
Regression   k – 1   S.S. reg. = ∑(ŷ – ȳ)²     M.S. reg. = S.S. reg. / d.f.    M.S. reg. / M.S. res.
Residual     n – 2   S.S. res. = ∑(y – ŷ)²     M.S. res. = S.S. res. / d.f.
Total        n – 1   S.S. total = ∑(y – ȳ)²

For multiple regression with two explanatory variables :
Ho : β1 = β2 = 0 , H1 : at least one of β1 , β2 is not zero
ANOVA in Regression
For simple regression :
Ho : β1 = 0 , H1 : β1 ≠ 0
ANOVA table
Source       d.f.         Sum of squares (S.S.)    Mean sum of squares (M.S.S.)    F
Regression   k – 1 = 1    S.S. reg. = 77.770       M.S. reg. = 77.770              6.133 (cal.)
Residual     n – 2 = 6    S.S. res. = 76.106       M.S. res. = 12.68               5.98 (table)
Total        n – 1 = 7    S.S. total = 153.876

Conclusion : Since F cal. (6.133) > F table (5.98), reject Ho and conclude that
there is a significant linear relationship between the Ht. of son and the Ht. of father.
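The F ratio and the 5 % table value can be reproduced in R from the sums of squares in the table. The sketch below uses qf() for the critical value and pf() for the p-value. The variable names are chosen here for illustration.

```r
# Values taken from the ANOVA table
ss_reg <- 77.770
ss_res <- 76.106
df_reg <- 1
df_res <- 6

ms_reg <- ss_reg / df_reg
ms_res <- ss_res / df_res
f_cal  <- ms_reg / ms_res                 # calculated F

f_crit  <- qf(0.95, df_reg, df_res)       # 5 % point of F(1, 6)
p_value <- pf(f_cal, df_reg, df_res, lower.tail = FALSE)

reject_h0 <- f_cal > f_crit               # TRUE means reject Ho at the 5 % level

print(c(f_cal, f_crit, p_value))
```

The calculated F differs slightly from 6.133 in the third decimal place because the slide rounds M.S. res. to 12.68 before dividing. When the raw data are available, anova(lm(y ~ x)) prints the whole table directly.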
5 % points of F distribution
