R Studio Notes
lines(density(OYT), lty = 1)
## lty means "line type": 0 = no line, 1 = solid, 2 = dashed, and so on
### to get just the density of a distribution
d <- density(OYT)
plot(d)
# text(x, y, labels = , adj = ): adj adjusts the text position
text(h$mids, h$counts, labels = h$counts, adj = c(0.5, -0.5))
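A runnable sketch of the histogram-labelling trick above; since `OYT` in these notes comes from `retire$X1YrReturn.`, simulated returns stand in for it here:

```r
# Minimal sketch: label each histogram bar with its count.
# OYT is simulated here as a stand-in for retire$X1YrReturn.
set.seed(1)
OYT <- rnorm(100, mean = 8, sd = 4)
h <- hist(OYT, main = "1 yr returns")            # h stores $mids, $counts, $breaks
text(h$mids, h$counts, labels = h$counts, adj = c(0.5, -0.5))
```

`adj = c(0.5, -0.5)` centres each label horizontally and nudges it just above the bar.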
BOX PLOT
boxplot(OYT, horizontal = T, xlab = "1 yr returns", col = "green", range = 0)
# range = 0 makes the whiskers go to the entire range of the distribution
boxplot(x, y, z, names = c("x", "y", "z"), xlab = "r", ylab = "a",
        ylim = c(-15, 35), col = c("magenta", "green", "yellow"), range = 1.5)
boxplot(x, y, z, xlab = "a", main = "TITLE",
        col = c("A", "B", "C"), names = c("8", "9", "2"),
        ylab = "D", ylim = c(0, 10))
text(y = c(fivenum(X), fivenum(Y), fivenum(Z)),
     labels = c(fivenum(X), fivenum(Y), fivenum(Z)),
     x = c(.4, 1.5, 2.5), cex = 0.5)
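A self-contained version of the labelled boxplot above; `x`, `y`, `z` are placeholders in the notes, so simulated vectors are used here:

```r
# Sketch: three boxplots with their five-number summaries printed beside them.
# x, y, z are simulated stand-ins for the notes' placeholder vectors.
set.seed(2)
x <- rnorm(50, 5, 2); y <- rnorm(50, 6, 2); z <- rnorm(50, 4, 2)
boxplot(x, y, z, names = c("x", "y", "z"), main = "TITLE",
        xlab = "group", ylab = "value",
        col = c("magenta", "green", "yellow"))
# fivenum() returns min, lower hinge, median, upper hinge, max (5 values per group)
text(x = rep(1:3, each = 5),
     y = c(fivenum(x), fivenum(y), fivenum(z)),
     labels = round(c(fivenum(x), fivenum(y), fivenum(z)), 1), cex = 0.5)
```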
DESCRIPTIVE STATISTICS
OYT <- retire$X1YrReturn.
Mean                MeanOYT <- mean(OYT)
Median              medianOYT <- median(OYT)
Trimmed mean        trimOYT10 <- mean(OYT, trim = 0.10)
SUMMARY             summary(OYT)   # mean, median, min, max, 1st and 3rd quartile
install.packages("moments")
library(moments)    # to load the library
SKEWNESS            skewness(OYT)
KURTOSIS            kurtosis(OYT)
VARIANCE            var(OYT)
STANDARD DEVIATION  sd(OYT)
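The commands above, run end to end on simulated returns (a stand-in for `retire$X1YrReturn.`; the `moments` calls are omitted so the sketch needs no extra package):

```r
# Sketch: descriptive statistics on simulated data standing in for OYT.
set.seed(3)
OYT <- rnorm(200, mean = 10, sd = 5)
MeanOYT   <- mean(OYT)
medianOYT <- median(OYT)
trimOYT10 <- mean(OYT, trim = 0.10)  # drops the top and bottom 10% before averaging
summary(OYT)                          # min, 1st qu., median, mean, 3rd qu., max
c(variance = var(OYT), sd = sd(OYT))  # sd is the square root of the variance
```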
SAMPLING
sample of size 50 from normal distbn, mean 100, sd 15   x <- rnorm(50, mean = 100, sd = 15)
many samples of equal size from normal distribution     replicate(1000, rnorm(50))
1000 samples of size 50 each                            t(replicate(1000, rnorm(50)))
## transpose for a better view of the 50-observation samples
sample from Poisson                                     y <- rpois(50, lambda = 1.2)
sample from binomial                                    s3 <- rbinom(n = 6, size = 5, 0.62)
# n = 6 samples, size = 5 trials per experiment, p = 0.62 (probability of success)
sampling from hypergeometric                            rhyper(nn = 6, m = 3, n = 4, k = 2)
sampling from uniform distribution                      runif(10, 3, 8)
sampling from exponential distribution                  rexp(50, rate = 3)
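The sampling calls above, collected into one runnable block (parameter values taken from the notes; the uniform draw uses the notes' bounds 3 and 8):

```r
# Sketch: drawing random samples from the distributions listed above.
set.seed(4)
x  <- rnorm(50, mean = 100, sd = 15)          # normal, mean 100, sd 15
y  <- rpois(50, lambda = 1.2)                 # Poisson
s3 <- rbinom(n = 6, size = 5, prob = 0.62)    # binomial: 6 draws of 5 trials
hg <- rhyper(nn = 6, m = 3, n = 4, k = 2)     # hypergeometric
u  <- runif(10, min = 3, max = 8)             # uniform on (3, 8)
e  <- rexp(50, rate = 3)                      # exponential
```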
sampling distribution of sample mean   means1 <- replicate(1000, mean(rnorm(50)))
# each sample has a mean; 1000 means stored in the object "means1"
hist(means1)
## taking many samples of equal size from the exponential distribution
## 1000 samples of size 50 each
replicate(1000, rexp(50))
### sampling distribution of sample mean
means_exp <- replicate(1000, mean(rexp(50, rate = 3)))
# each sample has a mean; 1000 means stored in the object "means_exp"
mean(means_exp)
hist(means_exp)
sampling distribution of sample median medians_norm<-replicate(1000,median(rnorm(50)))
med<-density(medians_norm)
plot(med)
sampling distribution of sample variance   variances <- replicate(10000, var(rnorm(50)))
mean(variances)   # check E(s^2) = sigma^2 (should be near 1 here)
d <- density(variances)
plot(d)
S <- c(1:5)
means_unif <- replicate(100, mean(sample(S, 3, replace = TRUE)))
j <- density(means_unif)
plot(j)
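A runnable version of the block above, with more replications so the average of the sample means sits visibly close to the population mean of S = 1:5 (which is 3), illustrating the CLT claim in the next section:

```r
# Sketch: sampling distribution of the mean of samples of size 3 from S = 1:5.
set.seed(5)
S <- 1:5                      # discrete uniform population with mean 3
means_unif <- replicate(10000, mean(sample(S, 3, replace = TRUE)))
mean(means_unif)              # close to the population mean of 3
plot(density(means_unif))     # roughly bell-shaped despite the uniform population
```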
CONFIDENCE LEVEL AND HYPOTHESIS TESTING
confidence interval for the population mean; population distribution and sigma unknown, sample size large (CLT):
# using qnorm(p, lower.tail = T or F); lower.tail = T is the default
L <- qnorm(.025, lower.tail = T)
U <- qnorm(.025, lower.tail = F)
W <- U - L
CI <- c(L, U)
population distribution normal, sigma unknown, sample size does not matter, no need of CLT:
# (x-bar - mu)/(s/sqrt(n)) is EXACTLY t distributed
## using qt(p =, df =, lower.tail =)
L <- qt(p = .025, df = n - 1, lower.tail = T)
U <- qt(p = .025, df = n - 1, lower.tail = F)
CI2 <- c(lower1yrwith_t, upper1yrwith_t)
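A worked t-interval, end to end; simulated data stands in for the notes' `Growth_1yr$X1YrReturn.`, and the result is checked against the built-in `t.test`:

```r
# Sketch: 95% t-based CI for the mean when sigma is unknown.
# x is simulated data standing in for Growth_1yr$X1YrReturn.
set.seed(6)
x <- rnorm(30, mean = 12, sd = 4)
n <- length(x)
tcrit <- qt(p = .025, df = n - 1, lower.tail = FALSE)  # positive critical value
lower <- mean(x) - tcrit * sd(x) / sqrt(n)
upper <- mean(x) + tcrit * sd(x) / sqrt(n)
CI2 <- c(lower, upper)
t.test(x)$conf.int   # the built-in gives the same interval
```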
#### another way of getting a confidence interval: t.test (95% CI by default)
t.test(Growth_1yr$X1YrReturn.)$conf.int
t.test(Growth_1yr$X1YrReturn., conf.level = .99)$conf.int
CI for the population proportion (pi), where the sample proportion
## is p (Levine et al.) or p-hat (Devore)
n = sample size, x = number of successes, p = x / n
## using prop.test(x, n, ...); default conf level = 95%
prop.test(x, n)$conf.int
prop.test(x, n, conf.level = .90)$conf.int
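A concrete proportion CI with assumed counts (x = 40 successes out of n = 100 trials; any values can be substituted):

```r
# Sketch: CI for a population proportion via prop.test (uses a continuity
# correction by default). x and n below are assumed example values.
x <- 40; n <- 100
p_hat <- x / n                            # sample proportion
prop.test(x, n)$conf.int                  # 95% CI (default)
prop.test(x, n, conf.level = .90)$conf.int  # 90% CI (narrower)
```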
EXCEL SHORTCUT
FREQUENCY DISTRIBUTION
Write the bin sizes in the first column. In the second column, select all the cells where the frequencies should be displayed, type =FREQUENCY(range of original data, range of bin values), then press Ctrl+Shift+Enter (array formula).
PIVOT TABLE
Insert > PivotTable
Right-click a cell and change "Show Values As" to a total.
"Show Values As" % of column total or % of row total gives the percentage distribution (e.g. counts of male/female by income).
Double-click on any cell of the table to see the underlying data.
Insert > Slicer to add slicers; right-click a slicer to remove it.
DESCRIPTIVE STATISTICS
Data > Data Analysis > Descriptive Statistics > Summary statistics
Average   =AVERAGE(C24:C63)
          =TRIMMEAN(range, 0.10)  # mean after excluding 10% of the values (split between top and bottom)
          =GEOMEAN(range)         # geometric mean, e.g. growth rates in time-series data
MODE      =MODE.MULT(range)
          =MODE.SNGL(range)
          =MODE(range)
Min max =min(range)
=max(range)
Median =quartile(range,2)
=median(range)
Quartile =quartile(range,1)
Standard deviation =standardize() #for standardizing values
=stdev.s(range)
=stdev.p(range)
VARIANCE =VAR.S(range)
=VAR.P(range)
Kurtosis =kurt(range)
SKEWNESS =SKEW(range)
RANDOM SAMPLING   =RAND()                   # a random value between 0 and 1
                  =RANDBETWEEN(bottom, top) # a random integer between bottom and top
Empirical rule for a bell-shaped frequency curve: about 68% of the data lies within mean ± 1·sd, and about 95% within mean ± 2·sd.
Chebyshev's theorem (inequality): for ANY distribution, the proportion of data within k standard deviations of the mean is at least 1 - 1/k^2, for k > 1.
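A quick numeric check of both rules, done in R (as in the first half of these notes) on simulated bell-shaped data:

```r
# Sketch: empirical rule vs. Chebyshev's bound on normal data.
set.seed(7)
x <- rnorm(100000)
within_k <- function(k) mean(abs(x - mean(x)) <= k * sd(x))
within_k(1)            # about 0.68 for bell-shaped data
within_k(2)            # about 0.95
# Chebyshev guarantees at least 1 - 1/k^2 for any distribution:
within_k(2) >= 1 - 1 / 2^2
```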
PROBABILITY DISTRIBUTION
BINOMIAL  =BINOM.DIST(x, n, p, cumulative)              # TRUE for cumulative
          =BINOM.INV(n, p, alpha)                       # smallest x whose cumulative probability is >= alpha
          =BINOM.DIST.RANGE(n, p, low_range, up_range)  # P(low_range <= X <= up_range), both ends included
POISSON   =POISSON.DIST(x, mean, cumulative)  # TRUE for cumulative
BERNOULLI =BINOM.DIST(x, 1, p, FALSE)  # Bernoulli is binomial with a single trial
NORMAL.S  # standard normal: mean = 0, sd = 1
          =NORM.S.DIST(z, cumulative)  # TRUE for cumulative, FALSE for the density height
          =NORM.S.INV(alpha)
NORMAL    =NORM.DIST(x, mean, sd, cumulative)
          =NORM.INV(alpha, mean, sd)
t DISTRIBUTION  =T.DIST(x, degrees_freedom, cumulative)
                =T.INV(p, degrees_freedom)
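For cross-reference with the first half of these notes, the R equivalents of the Excel functions above (density = d*, cumulative = p*, inverse = q*):

```r
# Sketch: R counterparts of the Excel distribution functions.
dbinom(2, size = 5, prob = 0.3)   # BINOM.DIST(2, 5, 0.3, FALSE)
pbinom(2, size = 5, prob = 0.3)   # BINOM.DIST(2, 5, 0.3, TRUE)
dpois(1, lambda = 1.2)            # POISSON.DIST(1, 1.2, FALSE)
pnorm(1.96)                       # NORM.S.DIST(1.96, TRUE)
qnorm(0.025)                      # NORM.S.INV(0.025)
pt(2.0, df = 10)                  # T.DIST(2.0, 10, TRUE)
qt(0.95, df = 10)                 # T.INV(0.95, 10)
```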
SAMPLING
SAMPLING  Alt+A, Y, 2   or   Data > Data Analysis > Random Number Generation
Number of variables = number of different samples; number of random numbers = sample size
CENTRAL LIMIT THEOREM  The sample mean of samples of size n is approximately normally distributed.
Mean of all sample means = population mean
Variance of all sample means = population variance / n
Mean of all sample variances = population variance
CONFIDENCE INTERVAL  confidence level = 1 − α, critical value z_(α/2)
U = x-bar + z_(α/2)·σ/√n,  L = x-bar − z_(α/2)·σ/√n,  RANGE = U − L = 2·z_(α/2)·σ/√n
SAMPLE SIZE determination (to achieve margin of error e at level α):  n = (z_(α/2)·σ / e)^2, rounded up
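A worked sample-size computation in R; the values σ = 15, e = 2, and α = 0.05 are assumed for illustration:

```r
# Sketch: required sample size n = (z_(alpha/2) * sigma / e)^2, rounded up.
sigma <- 15; e <- 2; alpha <- 0.05        # assumed example values
z <- qnorm(alpha / 2, lower.tail = FALSE) # z_(alpha/2), about 1.96
n <- ceiling((z * sigma / e)^2)           # round UP so the margin of error is at most e
n
```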
Very Important: Please write your roll number and the details mentioned above (name of the course etc.) in the first set of rows of your worksheet, and name your worksheet as your roll number.
The data in the attached EXCEL sheet contains information related to cars that were part of the inventory of a used car dealership on 31/03/2020. The variables included are year (of manufacturing), price ($), mileage (the number of miles the car has travelled), and fuel economy (mpg: miles per gallon).
1. Add “Age” (in years in 2020) in a new column.
2. Visualise price and mileage on an appropriate graph with proper labels and titles. What is your inference
from this graph?
3. Find the correlation between price and mileage.
4. Put an appropriate formula/function in cell I3 to get the price of a car for any car number that is typed
in H3, i.e. if we type a car number (say Car-23) in H3 we should get its price in I3.
5. Import this data in R and use R commands to find the coefficient of skewness for the distribution of the
variable “Price”
6. Compare the distributions of prices of cars made in 2000 with those made in 2014 using boxplots. Label your charts
properly.
Write the functions used in Excel and the commands used in R in your answer scripts.