
Workshop

Introduction to R-Program
R/R Studio-program installation

• The latest copy of R can be downloaded from CRAN:
https://cloud.r-project.org/
• RStudio is an integrated development environment for R programming. Download and install it from
http://www.rstudio.com/download.
R Packages

• R packages can also be downloaded from this
site or, alternatively, they can be obtained via
R once R itself has been installed.

• The library() function is used to load libraries, or
groups of functions and data sets that are not
included in the base R distribution.
Importing Data

data=read.csv(file.choose(),header=TRUE)
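Because file.choose() opens an interactive dialog, the import step is hard to repeat in a script. A minimal sketch, assuming a small made-up table written to a temporary file (the object name scores and the columns id and score are hypothetical), shows the same read.csv() call with an explicit path:

```r
# Write a small, made-up table to a temporary CSV file (illustration only).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)), tmp, row.names = FALSE)

# Read it back; header = TRUE treats the first row as column names.
scores <- read.csv(tmp, header = TRUE)
str(scores)    # structure of the imported data frame
head(scores)   # first rows
```

Replacing tmp with the path to your own file gives a non-interactive version of the import above.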
Datasets in R
• There are some built-in datasets in R. These
datasets are stored as data frames. To see the list
of datasets, type
• data()
• To open the dataset called trees, simply type
• data(trees)
• You can access single variables in a data frame by
using the $ operator.
• trees$Height
• sum(trees$Height) # sum of just these values, 2356
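As a short sketch of working with a built-in data frame, the following uses only the trees dataset; the variable names total_height and mean_height are introduced here for illustration.

```r
data(trees)                               # built-in data frame with Girth, Height, Volume
names(trees)                              # column names
head(trees$Height)                        # first six heights
total_height <- sum(trees$Height)         # same sum as above
mean_height <- mean(trees[["Height"]])    # [[ ]] is another way to pick one column
total_height
mean_height
```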
c() Function

• The c() function, together with the assignment operator <-, is a useful command in R for
entering small data sets. This function combines terms together.
• To enter a small set of dice rolls into an R session, we type
• diceroll <- c(2,5,1,6,5,5,4,1)
• diceroll
• [1] 2 5 1 6 5 5 4 1
c() Function
• All variables or objects created in R are stored in
the workspace.
• To see what variables are in the workspace, you
can use the function ls() to list them.
• If we define a new variable as a simple function of
the variable diceroll, it will be added to the
workspace:
• newdiceroll <- diceroll/2 # divide every element by two
• newdiceroll
• [1] 1.0 2.5 0.5 3.0 2.5 2.5 2.0 0.5
The Workspace

• You can add a comment to a command line by
beginning it with the # character.
• To remove objects from the workspace, use the rm()
function:
• rm(newdiceroll) # this was a silly variable anyway
• ls()
• [1] "diceroll"
help()

• Help
• There is text help available from within R using
the function help() or the ? character typed
before a command.
• For example, suppose you would like to learn
more about the function log() in R.
• help(log)
• ?log
R as a calculator

• Basic Math
• One of the simplest (but very useful) ways to use R is as a
powerful number cruncher.
• Examples
– 2+3
– [1] 5
– 3/2
– [1] 1.5
– 2^3
– [1] 8
– 4^2 -3*2
– [1] 10
– (56-14)/6 - 4*7*10/(5^2-5) # this is more complicated
– [1] -7
R as a calculator

• Other standard functions that are found on
most calculators are available in R:
Arithmetic Operators
Logical Operators
Arithmetic Operators
• Example
– sqrt(2)
– [1] 1.414214
– abs(2-4)
– [1] 2
– cos(4*pi)
– [1] 1
– log(0) # not defined
– [1] -Inf
– factorial(6) # 6!
– [1] 720
– choose(52,5) # this is 52!/(47!*5!)
– [1] 2598960
Vector Arithmetic Operators
• Vectors can be manipulated in a similar manner to
scalars by using the same functions introduced in the
last section.
– x <- c(1,2,3,4)
– y <- c(5,6,7,8)
– x*y
– [1] 5 12 21 32
– y/x
– [1] 5.000000 3.000000 2.333333 2.000000
– y-x
– [1] 4 4 4 4
– x^y
– [1] 1 64 2187 65536
Arithmetic Operators

• Other useful functions that pertain to vectors
include:
Arithmetic Operators
• Some examples using these functions:
– s <-c(1,1,3,4,7,11)
– length(s)
– [1] 6
– sum(s) # 1+1+3+4+7+11
– [1] 27
– prod(s) # 1*1*3*4*7*11
– [1] 924
– cumsum(s)
– [1] 1 2 5 9 16 27
– diff(s) # 1-1, 3-1, 4-3, 7-4, 11-7
– [1] 0 2 1 3 4
– diff(s, lag = 2) # 3-1, 4-1, 7-3, 11-4
– [1] 2 3 4 7
Matrix Operations
• Matrix Operations
• Among the many powerful features of R is its ability
to perform matrix operations. You can create matrix
objects from vectors of numbers using the matrix()
command.
• a <- c(1,2,3,4,5,6,7,8)
• A <- matrix(a,nrow=2,ncol=4, byrow=FALSE) # a is different from A
• Note that we could have left off the
byrow=FALSE argument, since this is the
default value.
• A <- matrix(a,2,4)
Matrix Operations

• Example
– a <- c(1:10)
– A <- matrix(a, nrow = 5, ncol = 2) # fill in by column
– B <- matrix(a, nrow = 5, ncol = 2, byrow = TRUE) # fill in by row
– C <- matrix(a, nrow = 2, ncol = 5, byrow = TRUE)
Matrix Operations

• Matrix operations (multiplication, transpose,
etc.) can easily be performed in R using a few
simple functions, such as t(), %*%, det() and solve().
Matrix Operations

• Using the matrices A, B, and C just created, we
can carry out some linear algebra calculations using
these functions.
– t(C) # this is the same as A
– B%*%C
– D <- C%*%B
– det(D)
– solve(D) # this is the inverse of D
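The matrix commands can be checked with a short, self-contained sketch. It rebuilds A, B and C as defined above; the names same_as_A and D_inv are introduced here only for illustration.

```r
a <- 1:10
A <- matrix(a, nrow = 5, ncol = 2)                 # filled by column
B <- matrix(a, nrow = 5, ncol = 2, byrow = TRUE)   # filled by row
C <- matrix(a, nrow = 2, ncol = 5, byrow = TRUE)   # filled by row

dim(B %*% C)                  # (5 x 2) times (2 x 5) gives a 5 x 5 matrix
D <- C %*% B                  # (2 x 5) times (5 x 2) gives a 2 x 2 matrix
same_as_A <- all(t(C) == A)   # TRUE: transposing C recovers A
D_inv <- solve(D)             # inverse of D (exists because det(D) is not 0)
round(D %*% D_inv, 10)        # identity matrix, up to rounding error
```

Note that %*% is matrix multiplication, whereas * multiplies element by element.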
Exercises

• Use R to compute the following:


Exercises-Answer
• 1. abs(2^3-3^2)
– [1] 1
• 2. exp(exp(1))
– [1] 15.15426
• 3. (2*3)^8+log(7.5)-cos(pi/sqrt(2))
– [1] 1679619
• 4.
– a=c(1,2,3,2,2,1,6,4,4,7,2,5)
– b=c(1,3,5,2,0,1,3,4,2,4,7,3,1,5,1,2)
– A=matrix(a, nrow=3, ncol=4, byrow = TRUE)
– B=matrix(b, nrow = 4, ncol=4, byrow = TRUE)
– A
– B
– A%*%solve(B)
– B%*%t(A)
• 5. prod(2,5,6,7)*prod(-1,3,-1,-1) # -1260
Graphs in R
• The plot() Function

• The most common function used to graph anything in R is
the plot() function. This is a generic function that can be
used for scatterplots, time-series plots, function graphs,
etc.

– plot(x, y)
– x=c(2,3,4,5,7,8,9,1)
– y=c(3,4,5,2,5,8,7,2)
– plot(x,y)
– data(trees)
– plot(trees$Height, trees$Volume) # columns are reached through the data frame
Graphs in R

• The curve() Function

• To graph a continuous function over a range of
values, the curve() function can be used.

– curve(sin(x), from = 0, to = 2*pi)


Graphs in R
• Additional Features on Graphs
Summarizing Data

• R includes several functions for computing
sample statistics for both numerical (both
continuous and discrete) and categorical data.
Summarizing Data
• Example
• Let's consider the dataset mtcars in R, which contains measurements on 11
aspects of automobile design and performance for 32 automobiles
(1973-74 models).

– data(mtcars) # load in dataset
– attach(mtcars) # add mtcars to search path
– mtcars
– mean(hp) # 146.6875
– var(mpg) # 36.3241
– quantile(qsec, probs = c(.20, .80)) # 20th and 80th percentiles (16.7, 19.3)
– cor(wt,mpg) # not surprising that this is negative
– For the discrete variables, we can get summary counts:
– table(cyl)
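attach() is convenient, but attached columns can be masked by other objects of the same name. As an alternative sketch (not the approach used on the slide), the same summaries can be computed with $ and with(); the result names below are introduced only for illustration.

```r
data(mtcars)
mean_hp <- mean(mtcars$hp)                                   # mean horsepower
var_mpg <- var(mtcars$mpg)                                   # sample variance of mpg
qs <- with(mtcars, quantile(qsec, probs = c(0.20, 0.80)))    # 20th and 80th percentiles
r_wt_mpg <- with(mtcars, cor(wt, mpg))                       # negative: heavier cars get fewer mpg
cyl_counts <- table(mtcars$cyl)                              # counts per number of cylinders
cyl_counts / sum(cyl_counts)                                 # relative frequencies
```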
Graphical Summaries

• For discrete or categorical data, we can display
the information given in a table command in a
picture using the barplot() function.

• barplot(table(cyl)/length(cyl)) # use relative frequencies on the y-axis
Graphical Summaries

• hist()
• This function will plot a histogram that is typically used
to display continuous-type data. As an example,
consider the faithful dataset in R, which is a famous
dataset that exhibits natural bimodality. The variable
eruptions gives the duration of the eruption (in
minutes) and waiting is the time between eruptions for
the Old Faithful geyser:
– data(faithful)
– attach(faithful)
– hist(eruptions, main = "Old Faithful data", prob =T)
Graphical Summaries

• We can give the picture a slightly different
look by changing the number of bins

– hist(eruptions, main = "Old Faithful data", prob = T, breaks = 18)
Graphical Summaries

• boxplot()
• This function will construct boxplots.
For the two variables in the Old Faithful
dataset:
– boxplot(faithful) # same as boxplot(eruptions, waiting)
• Thus, the waiting time for an eruption is
generally much larger and has higher
variability than the actual eruption time.
Exercises

• Use the stackloss dataset that is available
from R.

• Compute the mean, variance, and 5 number
summary of the variable stack.loss.

• Create a histogram and boxplot for the
variable stack.loss.
Exercises-Answer

• data(stackloss)
• attach(stackloss) # make the column stack.loss available by name
• mean(stack.loss)
• fivenum(stack.loss)
• var(stack.loss)

• hist(stack.loss)
• boxplot(stack.loss)
Statistical Analysis Welead

• Make sure you have a good data set:

1. First describe and present your data, e.g. frequency
distributions in tables or charts.
2. Calculate basic statistics where possible, e.g. means and
standard deviations, quintiles etc.
3. Start to interpret your data – what might it mean?
4. Select specific items for closer attention (based on your
research hypotheses).
5. Select and carry out the right kind of test.
6. Interpret your findings in terms of significance levels.
7. Modify and repeat the analysis if necessary.
The Hypothesis

• A hypothesis is a statement relating to an
observation that may be true but for which a
proof (or disproof) has not been found.

• Null hypothesis (H0)
– Opposite of desired result.

• Alternative hypothesis (H1)
– Opposite of the null hypothesis.
Hypothesis Testing

• Hypothesis testing is a procedure, based on
sample evidence, used to determine whether

– the hypothesis is a reasonable statement and
should not be rejected,
or
– unreasonable and should be rejected.
Statistical Comparison Tests

– One-sample T-test.
– Two sample independent T-test.
– Paired sample T-test.
– ANOVA.

One group            Two groups                    3 or more groups
One Sample T-Test    Independent Samples T-Test    One-Way ANOVA
                     Paired Sample T-Test
One-sample T-test

• Compare the sample mean with a known
value, when the variance of the population is
unknown.

• The R function t.test() can be used to perform
both one and two sample t-tests on vectors of
data.
One-sample T-test

• The function contains a variety of options and can be
called as follows

• t.test(x, y = NULL, alternative = c("two.sided", "less",
"greater"), mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95)

– x is a numeric vector of data values.
– y is an optional numeric vector of data values.
– If y is excluded, the function performs a one-sample t-test
on the data contained in x.
– If it is included, it performs a two-sample t-test using both
x and y.
One-sample T-test

• mu provides a number indicating the true value
of the mean (or difference in means if you are
performing a two sample test) under the null
hypothesis.
• The option alternative is a character string
specifying the alternative hypothesis, and must
be one of the following: "two.sided" (which is the
default), "greater" or "less" depending on
whether the alternative hypothesis is that the
mean is different than, greater than or less than
mu, respectively.
One-sample T-test

• The option var.equal is a logical variable indicating
whether or not to assume the two variances as being
equal when performing a two-sample t-test.

• If TRUE, then the pooled variance is used to estimate
the variance; otherwise the Welch (or Satterthwaite)
approximation to the degrees of freedom is used. If
you leave this option out it defaults to FALSE.

• Finally, the option conf.level determines the confidence
level of the reported confidence interval for µ in the one-sample
case and µ1 - µ2 in the two-sample case.
One-sample T-test

• Example

• t.test(x, alternative = "less", mu = 10)

• performs a one sample t-test on the data contained in x where the null
hypothesis is that µ = 10 and the alternative is that µ < 10.
One-sample T-test

• Example
An outbreak of Salmonella-related illness was
attributed to ice cream produced at a certain
factory. Scientists measured the level of Salmonella
in 9 randomly sampled batches of ice cream. The
levels (in MPN/g) were:
(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418).
Is there evidence that the mean level of Salmonella
in the ice cream is greater than 0.3 MPN/g?
One-sample T-test

• Let µ be the mean level of Salmonella in all batches of ice
cream.
• Here the hypothesis of interest can be expressed as:
– H0: µ = 0.3
– Ha: µ > 0.3

• Hence, we will need to include the options
alternative="greater", mu=0.3. Below is the relevant R-
code:
– x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
– t.test(x, alternative="greater", mu=0.3)
One-sample T-test

• From the output we see that the p-value =
0.029. Hence, there is moderately strong
evidence that the mean Salmonella level in
the ice cream is above 0.3 MPN/g.
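To see where the reported p-value comes from, the test statistic can be computed by hand as t = (sample mean - mu0) / (s / sqrt(n)) and compared with the t.test() result. This is a sketch using the Salmonella data from the example; the names mu0, t_manual, p_manual and res are introduced here for illustration.

```r
x <- c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
mu0 <- 0.3                                   # value of the mean under H0
n <- length(x)

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))            # t statistic by hand
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)   # upper tail for "greater"

res <- t.test(x, alternative = "greater", mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))       # the two t values agree
c(manual = p_manual, t.test = res$p.value)                 # the two p-values agree
```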
Two-sample Independent T-test

• Compare the means of two groups under the assumption that both samples
are random, independent, and normally distributed with unknown but
equal variances.
Two-sample T-test

• Example
• 6 subjects were given a drug (treatment
group) and an additional 6 subjects a placebo
(control group). Their reaction time to a
stimulus was measured (in ms). We want to
perform a two-sample t-test for comparing
the means of the treatment and control
groups.
Two-sample T-test

• Let µ1 be the mean of the untreated (control) population
and µ2 the mean of the population taking the medicine. Here the
hypothesis of interest can be expressed as
– H0: µ1 - µ2 = 0
– Ha: µ1 - µ2 < 0

• We will need to include the data for the control
group in x and the data for the treatment group in y.
• We will also need to include the options
alternative="less", mu=0.
• Finally, we need to decide whether or not the standard
deviations are the same in both groups.
Two-sample T-tests

• Below is the relevant R-code when assuming
equal standard deviation.

– Control = c(91, 87, 99, 77, 88, 91)
– Treat = c(101, 110, 103, 93, 99, 104)
– t.test(Control,Treat,alternative="less", var.equal=TRUE)
Two-sample t-tests

The output
Two-sample t-tests

• Below is the relevant R-code when not
assuming equal standard deviation.

• t.test(Control,Treat,alternative="less")
Two-sample t-tests

• Here the pooled t-test and the Welch t-test
give roughly the same results (p-value =
0.00313 and 0.00339, respectively).
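The pooled test can be reproduced by hand, which makes the role of var.equal visible. The sketch below passes Control as x and Treat as y, as in the code above; sp2, t_pooled, p_pooled, pooled and welch are names introduced for illustration.

```r
Control <- c(91, 87, 99, 77, 88, 91)
Treat <- c(101, 110, 103, 93, 99, 104)
n1 <- length(Control)
n2 <- length(Treat)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(Control) + (n2 - 1) * var(Treat)) / (n1 + n2 - 2)
t_pooled <- (mean(Control) - mean(Treat)) / sqrt(sp2 * (1 / n1 + 1 / n2))
p_pooled <- pt(t_pooled, df = n1 + n2 - 2)   # lower tail for alternative = "less"

pooled <- t.test(Control, Treat, alternative = "less", var.equal = TRUE)
welch <- t.test(Control, Treat, alternative = "less")   # var.equal defaults to FALSE
c(manual = p_pooled, pooled = pooled$p.value, welch = welch$p.value)
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
```

The Welch version uses fewer (non-integer) degrees of freedom than the pooled version, which is why its p-value is slightly larger here.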
Paired sample T-test

• There are many experimental settings where
each subject in the study is in both the treatment
and control group.
• For example, in a matched pairs design, subjects
are matched in pairs and different treatments
are given to each subject in the pair.
• The outcomes are thereafter compared pair-wise.
Alternatively, one can measure each subject
twice, before and after a treatment.
• In either of these situations we can’t use two-sample
t-tests since the independence
assumption is not valid.
Paired sample T-test

• Compare the means of two sets of paired
samples, taken from two populations with
unknown variance.

• The option paired indicates whether or not
you want a paired t-test (TRUE = yes and
FALSE = no). If you leave this option out it
defaults to FALSE.
Paired sample T-test

• Example
• A study was performed to test whether cars get
better mileage on premium gas than on regular gas.
Each of 10 cars was first filled with either regular or
premium gas, decided by a coin toss, and the mileage
for that tank was recorded. The mileage was
recorded again for the same cars using the other kind
of gasoline. We use a paired t-test to determine
whether cars get significantly better mileage with
premium gas.
Paired Sample T-test

• Below is the relevant R-code

– reg = c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
– prem = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
– t.test(prem,reg,alternative="greater", paired=TRUE)
Paired sample T-test

• The output

The results show that the t-statistic is equal to 4.47
and the p-value is 0.00075. Since the p-value is very
low, we reject the null hypothesis. There is strong
evidence of a mean increase in gas mileage between
regular and premium gasoline.
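A paired t-test is the same as a one-sample t-test on the within-pair differences. The sketch below checks this with the mileage data; the names d, paired and one_sample are introduced here for illustration.

```r
reg <- c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
prem <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)

d <- prem - reg                                   # one difference per car
paired <- t.test(prem, reg, alternative = "greater", paired = TRUE)
one_sample <- t.test(d, alternative = "greater", mu = 0)

c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```

Both calls give the same t-statistic and p-value, because only the differences d enter the calculation.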
Analysis of Variance (ANOVA)

• The T-test is limited to comparing two sets of data;
to compare many groups at once you need analysis
of variance (ANOVA).
• The test statistic is an F test with k-1 and N-k degrees
of freedom, where k is the number of groups and N is the total number of subjects.
• A p-value < 0.05 for this test indicates evidence
to reject the null hypothesis in favor of the
alternative hypothesis. In other words, there is
evidence that at least one pair of means is not
equal.
Analysis of Variance

• The hypotheses for the comparison of
independent groups are:
• H0: µ1 = µ2 = ... = µk (the means of all the groups are
equal).
• Ha: µi ≠ µj for at least one pair i, j (the means of two or more
groups are not equal).
– Reject the null if at least one population has a mean
that differs from the others.
Analysis of Variance

• Assumptions:
• Subjects are randomly assigned to one of k
groups.
• The distribution of the means by group is
normal with equal variances.
• Sample sizes between groups do not have to
be equal, but large differences in sample sizes
by group may affect the outcome of the
multiple comparisons tests.
Analysis of Variance

• In the ANOVA table

– Sources of variation. The analysis of variance requires the estimation
of two variances: between groups and within groups.
– SS. Sum of squared deviations.
– df. Degrees of freedom.
– MS. Mean square of deviations (variance estimates), which is equal to
SS/df.
– F. The ratio of the two variance estimates (between-group MS over within-group MS); under the null hypothesis it follows an F distribution.
– P-value. This is the value that answers your question. We are interested to
know whether there is some sort of relationship.
– ANOVA assumes by default that there is no relationship.
– As a general rule, a p-value greater than 0.05 means the ANOVA
assumption of no relationship cannot be rejected.
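The entries of the ANOVA table can be rebuilt step by step. The sketch below uses R's built-in InsectSprays data (6 sprays, 12 counts each) and compares the hand-computed F with the one reported by aov(); all object names are introduced here for illustration.

```r
data(InsectSprays)
y <- InsectSprays$count
g <- InsectSprays$spray
k <- nlevels(g)                       # number of groups
N <- length(y)                        # total number of observations

group_means <- tapply(y, g, mean)
group_n <- tapply(y, g, length)

ss_between <- sum(group_n * (group_means - mean(y))^2)   # SS between groups
ss_within <- sum((y - ave(y, g))^2)                      # SS within groups
ms_between <- ss_between / (k - 1)                       # MS = SS / df
ms_within <- ss_within / (N - k)
F_manual <- ms_between / ms_within                       # F = ratio of the two MS
p_manual <- pf(F_manual, k - 1, N - k, lower.tail = FALSE)

tab <- summary(aov(count ~ spray, data = InsectSprays))[[1]]
c(manual = F_manual, aov = tab[["F value"]][1])
```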
Example

We’re going to use a data set called InsectSprays. 6 different
insect sprays (1 Independent Variable with 6 levels) were tested
to see if there was a difference in the number of insects found in
the field after each spraying (Dependent Variable).
data(InsectSprays)
attach(InsectSprays)
str(InsectSprays)
boxplot(count ~ spray)
oneway.test(count~spray)
Example

• Default is equal variances (i.e. homogeneity of variance) not
assumed – i.e. Welch’s correction applied.
• oneway.test() corrects for non-homogeneity, but doesn’t give
much information – i.e. just F, p-value and dfs for numerator
and denominator – no MS etc.
• aov.out = aov(count ~ spray, data=InsectSprays)
• summary(aov.out)
• The "select if" command or the tapply() function can be used
to get standard deviations and sample sizes for each group.
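As a sketch of the tapply() route, the group means, standard deviations and sample sizes for InsectSprays can be collected in one small table; group_mean, group_sd and group_n are names introduced here for illustration.

```r
data(InsectSprays)
group_mean <- with(InsectSprays, tapply(count, spray, mean))
group_sd <- with(InsectSprays, tapply(count, spray, sd))
group_n <- with(InsectSprays, tapply(count, spray, length))

# One row per spray: mean, standard deviation and sample size
round(cbind(mean = group_mean, sd = group_sd, n = group_n), 2)
```

Clearly different standard deviations across groups are one reason to prefer the Welch-corrected oneway.test() over the equal-variance aov().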
Example

– mean_group1 = tapply(count,spray,mean)
– mean_group1

• Post Hoc tests
• Tukey HSD (Honestly Significant Difference) is
default in R
– TukeyHSD(aov.out)
Example2

• The data set contains information on 76
people who undertook one of three diets
(referred to as diet A, B and C). There is
background information such as age, gender,
and height. The aim of the study was to see
which diet was best for losing weight.
Example2

– diet = read.csv(file.choose(),header=TRUE) # file.choose() works on every platform
– attach(diet)
– aov.out= aov(loss_weight~Diet,data=diet)
– aov.out
– summary(aov.out)

– mean_group= tapply(loss_weight,Diet,mean)
– mean_group
Shapiro-Wilk Normality test

• shapiro.test()
• NULL hypothesis: the samples came from a Normal distribution.
• If the p-value <= 0.05, then you would reject the NULL hypothesis.

• A p-value > 0.05 implies that the distribution of the data is not
significantly different from a normal distribution. In other words, we can
assume normality.

– shapiro.test(loss_weight)
– hist(loss_weight, probability=T, breaks = 15, main="Histogram of normal data", xlab="Approximately normally distributed data")
– lines(density(loss_weight))
Normality test

• Draw the qq-plot of the normally distributed
data using pch=19 to produce solid circles.
– qqnorm(loss_weight, main="QQ plot of normal data", pch=19)
• Add a reference line through the first and third quartiles to help assess how
closely the scatter fits the line.
– qqline(loss_weight)
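Since loss_weight comes from a file that is not part of R, the behaviour of shapiro.test() can be illustrated on simulated data instead. This is a sketch under that assumption: normal_x and skewed_x are made-up samples, not the diet data.

```r
set.seed(1)                                     # make the simulation repeatable
normal_x <- rnorm(100, mean = 4, sd = 2.5)      # drawn from a normal distribution
skewed_x <- rexp(100, rate = 1)                 # drawn from a strongly skewed distribution

sw_normal <- shapiro.test(normal_x)
sw_skewed <- shapiro.test(skewed_x)

sw_normal$p.value    # a large p-value gives no evidence against normality
sw_skewed$p.value    # a small p-value leads to rejecting normality
```

Calling qqnorm() and qqline() on each sample shows the same contrast graphically: points from the skewed sample bend away from the reference line.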
Normality test

• What if the data is not normally distributed?

• Transform the dependent variable (repeating the
normality checks on the transformed data): Common
transformations include taking the log or square root
of the dependent variable.
• Use a non-parametric test: Non-parametric tests are
often called distribution free tests and can be used
instead of their parametric equivalent.
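Both options can be sketched with data already used in this workshop. The choice of tests below is an assumption for illustration: kruskal.test() as the rank-based counterpart of one-way ANOVA, and wilcox.test() with paired = TRUE as the counterpart of the paired t-test.

```r
data(InsectSprays)

# Option 1: transform the dependent variable, then repeat the ANOVA
sqrt_fit <- aov(sqrt(count) ~ spray, data = InsectSprays)
summary(sqrt_fit)

# Option 2: a rank-based test in place of one-way ANOVA
kw <- kruskal.test(count ~ spray, data = InsectSprays)
kw$p.value

# Rank-based counterpart of the paired t-test (mileage data from earlier)
reg <- c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
prem <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
# exact = FALSE uses the normal approximation, since the differences contain ties
wsr <- wilcox.test(prem, reg, alternative = "greater", paired = TRUE, exact = FALSE)
wsr$p.value
```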
Non-parametric tests
Commands for non-parametric tests in R
