
Frequency Tables

This discussion of frequency tables restricts itself to the 1-dimensional case, which in theory
each of us learnt in primary school and could calculate by hand, given time. But frequency
tables are such a common tool, useful for all sorts of data validation and exploratory data
analysis jobs, that finding a nice implementation can prove a time-and-sanity-saving investment
over a lifetime of counting how many things are of which type.

Whenever you have a limited number of different values in a variable, you can get a quick
summary of the data by calculating a frequency table. A frequency table is a table that reports
the number of occurrences of every unique value in the variable. In R, you use the table()
function for that.

CREATING A TABLE IN R
You can tabulate, for example, the number of cars with a manual and an automatic gearbox. In
mtcars the variable am is coded 0 = automatic and 1 = manual, so we label the levels first:

> amtable <- table(factor(mtcars$am, labels = c("auto", "manual")))
> amtable

  auto manual 
    19     13 

This outcome tells you that your data contains 19 cars with an automatic gearbox and 13 with a
manual gearbox.

The Example Data Set – mtcars


Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to
explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6
observations of just the car model and the gear. We can use the subset() function to restrict the data set
to show just the row names and “gear”.
> head(subset(mtcars, select = 'gear'))
                  gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

What are the possible values of “gear”? Let’s use the factor() function to find out.
> factor(mtcars$gear)
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5

The cars in this data set have either 3, 4 or 5 forward gears. How many cars are there for each
number of forward gears?
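As an aside, the distinct values can also be extracted programmatically instead of being read off the printed factor; a small base-R sketch:

```r
# levels() returns the distinct values; nlevels() counts them
g <- factor(mtcars$gear)
levels(g)    # "3" "4" "5"
nlevels(g)   # 3
```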
Why the table() function does not work well
The table() function in Base R does give the counts of a categorical variable, but the output is not a data
frame – it’s a table, and it’s not easily accessible like a data frame.
> w = table(mtcars$gear)
> w

 3  4  5 
15 12  5 

> class(w)
[1] "table"

You can convert this to a data frame, but the result does not retain the variable name “gear” in the
corresponding column name.
> t = as.data.frame(w)
> t
  Var1 Freq
1    3   15
2    4   12
3    5    5

You can correct this problem with the names() function.


> names(t)[1] = 'gear'
> t
  gear Freq
1    3   15
2    4   12
3    5    5
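If you prefer, the conversion and the renaming can be collapsed into a single step with setNames() (a base-R sketch):

```r
# convert the table to a data frame and name both columns at once
w <- table(mtcars$gear)
t <- setNames(as.data.frame(w), c("gear", "Freq"))
t
#   gear Freq
# 1    3   15
# 2    4   12
# 3    5    5
```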

count() to the Rescue! (With Compliments to the “plyr” Package)


Thankfully, there is an easier way – it’s the count() function in the “plyr” package. If you don’t already
have the “plyr” package, install it first by running the command
install.packages('plyr')

Then load the package, and the count() function will be ready for use.
> library(plyr)
> count(mtcars, 'gear')
  gear freq
1    3   15
2    4   12
3    5    5

> y = count(mtcars, 'gear')
> y
  gear freq
1    3   15
2    4   12
3    5    5

> class(y)
[1] "data.frame"

As the class() function confirms, this output is indeed a data frame!

Frequency Distribution of Quantitative Data
The frequency distribution of a data variable is a summary of the data occurrence in a collection
of non-overlapping categories.
Example
In the data set faithful, the frequency distribution of the eruptions variable is the summary of
eruptions according to some classification of the eruption durations.
Problem
Find the frequency distribution of the eruption durations in faithful.
Solution

The solution consists of the following steps:

1. We first find the range of eruption durations with the range function. It shows that the
observed eruptions are between 1.6 and 5.1 minutes in duration.
> duration = faithful$eruptions
> range(duration)
[1] 1.6 5.1

2. Break the range into non-overlapping sub-intervals by defining a sequence of equal distance
break points. If we round the endpoints of the interval [1.6, 5.1] to the closest half-integers,
we come up with the interval [1.5, 5.5]. Hence we set the break points to be the half-integer
sequence { 1.5, 2.0, 2.5, ... }.
> breaks = seq(1.5, 5.5, by=0.5) # half-integer sequence
> breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

3. Classify the eruption durations according to the half-unit-length sub-intervals with cut. As
the intervals are to be closed on the left, and open on the right, we set the right argument
as FALSE.
> duration.cut = cut(duration, breaks, right=FALSE)

4. Compute the frequency of eruptions in each sub-interval with the table function.
> duration.freq = table(duration.cut)

Answer

The frequency distribution of the eruption duration is:


> duration.freq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
51 41 5 7 30 73 61
[5,5.5)
4

Enhanced Solution
We apply the cbind function to print the result in column format.
> cbind(duration.freq)
duration.freq
[1.5,2) 51
[2,2.5) 41
[2.5,3) 5
[3,3.5) 7
[3.5,4) 30
[4,4.5) 73
[4.5,5) 61
[5,5.5) 4

Note
Per R documentation, you are advised to use the hist function to find the frequency distribution for
performance reasons.
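As a sketch of that suggestion, hist() will return the same counts without drawing anything when plot=FALSE is set:

```r
duration <- faithful$eruptions
breaks <- seq(1.5, 5.5, by=0.5)
# right=FALSE keeps the intervals closed on the left, as with cut() above
h <- hist(duration, breaks=breaks, right=FALSE, plot=FALSE)
h$counts   # 51 41 5 7 30 73 61 4, matching duration.freq
```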
Exercise
1. Find the frequency distribution of the eruption waiting periods in faithful.
2. Find programmatically the duration sub-interval that has the most eruptions.
Histogram
A histogram consists of parallel vertical bars that graphically show the frequency distribution of a
quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars
showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, # apply the hist function
+ right=FALSE) # intervals closed on the left

Answer

The histogram of the eruption durations is:

Enhanced Solution
To colorize the histogram, we select a color palette and set it in the col argument of hist. In addition,
we update the titles for readability.
> colors = c("red", "yellow", "green", "violet", "orange",
+ "blue", "pink", "cyan")
> hist(duration, # apply the hist function
+ right=FALSE, # intervals closed on the left
+ col=colors, # set the color palette
+ main="Old Faithful Eruptions", # the main title
+ xlab="Duration minutes") # x-axis label
Exercise
Find the histogram of the eruption waiting period in faithful.

Other Ways

Two other ways to attach the frequency of each value to the data as a per-row column:

1
dat <- data.frame(Treenumber=c(3,3,3,3,6,6,6,9,10,11,11,12,12,12))
dat$Freq <- table(dat$Treenumber)[as.character(dat$Treenumber)]

2
library(dplyr)
dat %>%
  add_count(Treenumber)

Relative Frequency Distribution of Quantitative Data
The relative frequency distribution of a data variable is a summary of the frequency proportion
in a collection of non-overlapping categories.

The relationship of frequency and relative frequency is:

    Relative Frequency = Frequency / Sample Size

Example
In the data set faithful, the relative frequency distribution of the eruptions variable shows the
frequency proportion of the eruptions according to a duration classification.
Problem
Find the relative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)

Then we find the sample size of faithful with the nrow function, and divide the frequency distribution
with it. As a result, the relative frequency distribution is:
> duration.relfreq = duration.freq / nrow(faithful)

Answer

The relative frequency distribution of the eruptions variable is:


> duration.relfreq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5)
0.187500 0.150735 0.018382 0.025735 0.110294 0.268382
[4.5,5) [5,5.5)
0.224265 0.014706

Enhanced Solution
We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=1)
> duration.relfreq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.19 0.15 0.02 0.03 0.11 0.27 0.22
[5,5.5)
0.01
> options(old) # restore the old option

We then apply the cbind function to print both the frequency distribution and relative frequency
distribution in parallel columns.
> old = options(digits=1)
> cbind(duration.freq, duration.relfreq)
duration.freq duration.relfreq
[1.5,2) 51 0.19
[2,2.5) 41 0.15
[2.5,3) 5 0.02
[3,3.5) 7 0.03
[3.5,4) 30 0.11
[4,4.5) 73 0.27
[4.5,5) 61 0.22
[5,5.5) 4 0.01
> options(old) # restore the old option

Exercise
Find the relative frequency distribution of the eruption waiting periods in faithful.
Cumulative Frequency Distribution
The cumulative frequency distribution of a quantitative variable is a summary of data frequency
below a given level.
Example
In the data set faithful, the cumulative frequency distribution of the eruptions variable shows
the total number of eruptions whose durations are less than or equal to a set of chosen levels.
Problem
Find the cumulative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)

We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)

Answer

The cumulative distribution of the eruption duration is:


> duration.cumfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
51 92 97 104 134 207 268
[5,5.5)
272

Enhanced Solution
We apply the cbind function to print the result in column format.
> cbind(duration.cumfreq)
duration.cumfreq
[1.5,2) 51
[2,2.5) 92
[2.5,3) 97
[3,3.5) 104
[3.5,4) 134
[4,4.5) 207
[4.5,5) 268
[5,5.5) 272
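As a quick sanity check, cumsum() and diff() are inverses of one another, so the per-interval counts can be recovered from the cumulative ones:

```r
duration <- faithful$eruptions
breaks <- seq(1.5, 5.5, by=0.5)
duration.freq <- table(cut(duration, breaks, right=FALSE))
duration.cumfreq <- cumsum(duration.freq)
# the first cumulative value plus the successive differences gives back the counts
all(c(duration.cumfreq[1], diff(duration.cumfreq)) == duration.freq)
# [1] TRUE
```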

Exercise
Find the cumulative frequency distribution of the eruption waiting periods in faithful.

Cumulative Frequency Graph


A cumulative frequency graph or ogive of a quantitative variable is a curve graphically showing
the cumulative frequency distribution.
Example
In the data set faithful, a point in the cumulative frequency graph of the eruptions variable shows
the total number of eruptions whose durations are less than or equal to a given level.
Problem
Find the cumulative frequency graph of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Check the previous
tutorial on Frequency Distribution for details.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)

We then compute its cumulative frequency with cumsum, add a starting zero element, and plot the
graph.
> cumfreq0 = c(0, cumsum(duration.freq))
> plot(breaks, cumfreq0, # plot the data
+ main="Old Faithful Eruptions", # main title
+ xlab="Duration minutes", # x−axis label
+ ylab="Cumulative eruptions") # y−axis label
> lines(breaks, cumfreq0) # join the points

Answer

The cumulative frequency graph of the eruption durations is:

Exercise
Find the cumulative frequency graph of the eruption waiting periods in faithful.

Cumulative Relative Frequency Distribution
The cumulative relative frequency distribution of a quantitative variable is a summary of
frequency proportion below a given level.

The relationship between cumulative frequency and cumulative relative frequency is:

    Cumulative Relative Frequency = Cumulative Frequency / Sample Size
Example
In the data set faithful, the cumulative relative frequency distribution of the eruptions variable shows
the frequency proportion of eruptions whose durations are less than or equal to a set of chosen
levels.
Problem
Find the cumulative relative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)

We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)

Then we find the sample size of faithful with the nrow function, and divide the cumulative frequency
distribution with it. As a result, the cumulative relative frequency distribution is:
> duration.cumrelfreq = duration.cumfreq / nrow(faithful)

Answer

The cumulative relative frequency distribution of the eruption variable is:


> duration.cumrelfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.18750 0.33824 0.35662 0.38235 0.49265 0.76103 0.98529
[5,5.5)
1.00000

Enhanced Solution
We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=2)
> duration.cumrelfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.19 0.34 0.36 0.38 0.49 0.76 0.99
[5,5.5)
1.00
> options(old) # restore the old option

We then apply the cbind function to print both the cumulative frequency distribution and relative
cumulative frequency distribution in parallel columns.
> old = options(digits=2)
> cbind(duration.cumfreq, duration.cumrelfreq)
duration.cumfreq duration.cumrelfreq
[1.5,2) 51 0.19
[2,2.5) 92 0.34
[2.5,3) 97 0.36
[3,3.5) 104 0.38
[3.5,4) 134 0.49
[4,4.5) 207 0.76
[4.5,5) 268 0.99
[5,5.5) 272 1.00
> options(old)

Exercise
Find the cumulative relative frequency distribution of the eruption waiting periods in faithful.

Cumulative Relative Frequency Graph


A cumulative relative frequency graph of a quantitative variable is a curve graphically showing
the cumulative relative frequency distribution.
Example
In the data set faithful, a point in the cumulative relative frequency graph of the eruptions variable
shows the frequency proportion of eruptions whose durations are less than or equal to a given level.
Problem
Find the cumulative relative frequency graph of the eruption durations in faithful.
Solution
We first find the cumulative relative frequency distribution of the eruption durations as follows.
Check the previous tutorial on Cumulative Relative Frequency Distribution for details.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
> duration.cumfreq = cumsum(duration.freq)
> duration.cumrelfreq = duration.cumfreq / nrow(faithful)

We then plot it along with the starting zero element.


> cumrelfreq0 = c(0, duration.cumrelfreq)
> plot(breaks, cumrelfreq0,
+ main="Old Faithful Eruptions", # main title
+ xlab="Duration minutes",
+ ylab="Cumulative eruption proportion")
> lines(breaks, cumrelfreq0) # join the points

Answer

The cumulative relative frequency graph of the eruption duration is:

Alternative Solution
We create an interpolate function Fn with the built-in function ecdf. Then we plot Fn right away.
There is no need to compute the cumulative frequency distribution a priori.
> Fn = ecdf(duration)
> plot(Fn,
+ main="Old Faithful Eruptions",
+ xlab="Duration minutes",
+ ylab="Cumulative eruption proportion")
Exercise
Find the cumulative relative frequency graph of the eruption waiting periods in faithful.

Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is one
observation. I would like to know how many observations (e.g. people) are of which type
(e.g. demographic – here a category between A and E inclusive):

TYPE  PERSON ID
E     1
E     2
B     3
B     4
B     5
B     6
C     7
I want to be able to say things like: “4 of my records are of type E”, or “10% of my records
are of type A”. The dataset I will use in my below example is similar to the above table, only
with more records, including some with a blank (missing) type.
What would I like my 1-dimensional frequency table tool to do in an ideal world?

1. Provide a count of how many observations are in which category.


2. Show the percentages or proportions of the total observations that each category represents.
3. Be able to sort by count, so I see the most popular options at the top – but only when I want to, as
sometimes the order of data is meaningful for other reasons.
4. Show a cumulative %, sorted by count, so I can see quickly that, for example, the top 3 options
make up 80% of the data – useful for some swift Pareto analysis and the like.
5. Deal with missing data transparently. It is often important to know how many of your
observations are “missing”. Other times, you might only care about the statistics derived from
those which are not missing.
6. If an external library, then be on CRAN or some other well supported network so I can be
reasonably confident the library won’t vanish, given how often I want to use it.
7. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great
if I could pipe the results directly into ggplot, dplyr, or whatever for some quick plots and
manipulations. Other times, if working interactively, I’d like to be able to see the key results at a
glance, without having to use further coding.
8. Work with “kable” from the knitr package, or similar table output tools. I often use R
Markdown and would like the ability to show the frequency table output in a reasonably presentable
manner.
9. Have a sensible set of defaults (aka facilitate my laziness).
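For concreteness, here is a hand-rolled sketch of items 1–5 using base R only; the toy vector and column names are mine, not from any package:

```r
x <- c("E", "E", "B", "B", "B", "B", "C", NA)     # toy data shaped like the example table
tab <- as.data.frame(table(x, useNA = "ifany"))   # item 5: keep missing values visible
names(tab) <- c("type", "n")
tab <- tab[order(-tab$n), ]                       # item 3: sort by count
tab$percent <- 100 * tab$n / sum(tab$n)           # item 2: percentage of total
tab$cum_percent <- cumsum(tab$percent)            # item 4: cumulative %
tab
```

This works, but writing it out every time is exactly the chore the packages below aim to remove.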

So what options come by default with base R?

Most famous, perhaps, is the “table” command.

table(data$Type)

A super simple way to count up the number of records by type. But it doesn’t show
percentages or any sort of cumulative total. By default it hasn’t highlighted that there are some
records with missing data, though it does have a useNA parameter that will show them if
desired.

table(data$Type, useNA = "ifany")

The output also isn’t tidy and doesn’t work well with Knitr.

The table command can be wrapped in the prop.table command to show proportions.
prop.table(table(data$Type))

But you’d need to run both commands to understand the count and percentages, and the
latter inherits many of the limitations from the former.
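A minimal way to see both at once is to bind the two results into a single data frame; a sketch using mtcars, since the post’s `data` object isn’t reproduced here:

```r
counts <- table(mtcars$gear)
data.frame(value = names(counts),
           n     = as.vector(counts),
           prop  = as.vector(prop.table(counts)))
#   value  n    prop
# 1     3 15 0.46875
# 2     4 12 0.37500
# 3     5  5 0.15625
```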

Install required packages


epitools

wants <- c("epitools")
has   <- wants %in% rownames(installed.packages())
if(any(!has)) install.packages(wants[!has])

Category frequencies for one variable

Absolute frequencies
set.seed(123)
(myLetters <- sample(LETTERS[1:5], 12, replace=TRUE))
 [1] "B" "D" "C" "E" "E" "A" "C" "E" "C" "C" "E" "C"

(tab <- table(myLetters))
myLetters
A B C D E 
1 1 5 1 4 

names(tab)
[1] "A" "B" "C" "D" "E"

tab["B"]
B 
1 

barplot(tab, main="Counts")

(Cumulative) relative frequencies


(relFreq <- prop.table(tab))
myLetters
      A       B       C       D       E 
0.08333 0.08333 0.41667 0.08333 0.33333 

cumsum(relFreq)
      A       B       C       D       E 
0.08333 0.16667 0.58333 0.66667 1.00000 

Counting non-existent categories


letFac <- factor(myLetters, levels=c(LETTERS[1:5], "Q"))
letFac
 [1] B D C E E A C E C C E C
Levels: A B C D E Q

table(letFac)
letFac
A B C D E Q 
1 1 5 1 4 0 

Counting runs
(vec <- rep(rep(c("f", "m"), 3), c(1, 3, 2, 4, 1, 2)))
 [1] "f" "m" "m" "m" "f" "f" "m" "m" "m" "m" "f" "m" "m"

(res <- rle(vec))
Run Length Encoding
  lengths: int [1:6] 1 3 2 4 1 2
  values : chr [1:6] "f" "m" "f" "m" "f" "m"

length(res$lengths)
[1] 6

inverse.rle(res)
 [1] "f" "m" "m" "m" "f" "f" "m" "m" "m" "m" "f" "m" "m"
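A common follow-up question is which value has the longest run; rle() makes that a one-liner:

```r
vec <- rep(rep(c("f", "m"), 3), c(1, 3, 2, 4, 1, 2))
res <- rle(vec)
res$values[which.max(res$lengths)]   # "m" has the longest run
max(res$lengths)                     # of length 4
```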

Contingency tables for two or more variables


Absolute frequencies using table()

N <- 10
(sex <- factor(sample(c("f", "m"), N, replace=TRUE)))
 [1] m m f m f f f m m m
Levels: f m

(work <- factor(sample(c("home", "office"), N, replace=TRUE)))
 [1] office office office office office office home home office office
Levels: home office

(cTab <- table(sex, work))
   work
sex home office
  f    1      3
  m    1      5

summary(cTab)
Number of cases in table: 10 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 0.1, df = 1, p-value = 0.7
	Chi-squared approximation may be incorrect

barplot(cTab, beside=TRUE, legend.text=rownames(cTab), ylab="absolute frequency")


Using xtabs()

counts <- sample(0:5, N, replace=TRUE)
(persons <- data.frame(sex, work, counts))
   sex   work counts
1    m office      4
2    m office      4
3    f office      0
4    m office      2
5    f office      4
6    f office      1
7    f   home      1
8    m   home      1
9    m office      0
10   m office      2

xtabs(~ sex + work, data=persons)
   work
sex home office
  f    1      3
  m    1      5

xtabs(counts ~ sex + work, data=persons)
   work
sex home office
  f    1      5
  m    1     12

Marginal sums and means


apply(cTab, MARGIN=1, FUN=sum)
f m 
4 6 

colMeans(cTab)
  home office 
     1      4 

addmargins(cTab, c(1, 2), FUN=mean)
Margins computed over dimensions
in the following order:
1: sex
2: work
      work
sex    home office mean
  f     1.0    3.0  2.0
  m     1.0    5.0  3.0
  mean  1.0    4.0  2.5

Relative frequencies
(relFreq <- prop.table(cTab))
   work
sex home office
  f  0.1    0.3
  m  0.1    0.5

Conditional relative frequencies


prop.table(cTab, 1)
   work
sex   home office
  f 0.2500 0.7500
  m 0.1667 0.8333

prop.table(cTab, 2)
   work
sex  home office
  f 0.500  0.375
  m 0.500  0.625

Flat contingency tables for more than two variables
(group <- factor(sample(c("A", "B"), 10, replace=TRUE)))
 [1] A A A A A A A B A A
Levels: A B

ftable(work, sex, group, row.vars="work", col.vars=c("sex", "group"))
       sex   f     m  
       group A B   A B
work                  
home         1 0   0 1
office       3 0   5 0

Recovering the original data from contingency tables


Individual-level data frame

library(epitools)
expand.table(cTab)
   sex   work
1    f   home
2    f office
3    f office
4    f office
5    m   home
6    m office
7    m office
8    m office
9    m office
10   m office

Group-level data frame

as.data.frame(cTab, stringsAsFactors=TRUE)
  sex   work Freq
1   f   home    1
2   m   home    1
3   f office    3
4   m office    5

Percentile rank
(vec <- round(rnorm(10), 2))
 [1]  0.84  0.15 -1.14  1.25  0.43 -0.30  0.90  0.88  0.82  0.69

Fn <- ecdf(vec)
Fn(vec)
 [1] 0.7 0.3 0.1 1.0 0.4 0.2 0.9 0.8 0.6 0.5

100 * Fn(0.1)
[1] 20

Fn(sort(vec))
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

knots(Fn)
 [1] -1.14 -0.30  0.15  0.43  0.69  0.82  0.84  0.88  0.90  1.25

plot(Fn, main="cumulative frequencies")


So what’s available outside of base R? I tested 5 options, although there are, of course,
countless more. In no particular order:

- tabyl, from the janitor package
- tab1, from the epiDisplay package
- freq, from the summarytools package
- CrossTable, from the gmodels package
- freq, from the questionr package
Because I am fussy, I managed to find some slight personal niggle with all of them, so it’s
hard to pick an overall personal winner for all circumstances. Several came very close. I
would recommend looking at any of the janitor, summarytools and questionr package
functions outlined below if you have similar requirements and tastes to me.
tabyl, from the janitor package
Documentation
library(janitor)

tabyl(data$Type, sort = TRUE)

This is a pretty good start! By default, it shows counts, percents, and percent of non-missing
data. It can optionally sort in order of frequency. The output is tidy, and works with kable
just fine. The only thing really missing is a cumulative percentage option. But it’s a great
improvement over base table.

I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not
really something I can criticise anyone else for.

tab1, from the epiDisplay package


Documentation
library(epiDisplay)

tab1(data$Type, sort.group = "decreasing", cum.percent = TRUE)


This one is pretty fully featured. It even (optionally) generates a visual frequency chart as
output. It shows the frequencies, proportions and cumulative proportions both with and without
missing data. It can sort in order of frequency, and has a totals row so you know how many
observations you have in all.

However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the
column names it assigns, although one can certainly claim that’s pure personal preference.

A greater issue may be that the cumulative columns don’t seem to work as I would expect
when the table is sorted, as in the above example. The first entry in the table is “E”,
because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as
you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.

As far as I can tell, the function is working out the cumulative frequencies before sorting the
table – so as category E is the last category in the data file it has calculated that by the time
you reach the end of category E you have 100% of the non-missing data in hand. I can’t
envisage a situation where you would want this behaviour, but I’m open to correction if
anyone can.
freq, from the summarytools package
Documentation
library(summarytools)

summarytools::freq(data$Type, order = "freq")

This looks pretty great. Has all the variations of counts, percents and missing-data output I
want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable
in the console, and works well with Knitr. In fact it has some further nice formatting options
that I wasn’t particularly looking for.

It is pretty much tidy, although it has a minor niggle in that the output always includes the
total row. It’s often important to know your totals, but if you’re piping the output to other
tools or charts, you may have to use another command to filter that row out each time, as there
doesn’t seem to be an obvious way to prevent it being included with the rest of the dataset when
running it directly.

Update 2018-04-28: thanks to Roland in the comments below for pointing out that a new feature to
disable the totals display has been added: set the “totals” parameter to FALSE, and the totals row
won’t show up, potentially making it easier to pass on for further analysis.
CrossTable, from the gmodels library
Documentation
library(gmodels)

CrossTable(data$Type)

The results are displayed in a horizontal format, a bit like the base “table”. Here though,
the proportions are clearly shown, albeit not with a cumulative version. It doesn’t highlight
that there are missing values, and isn’t “tidy”. You can get it to display a vertical version
(add the parameter max.width = 1) which is visually distinctive, but untidy in the usual
R tidyverse sense.
It’s not a great tool for my particular requirements here, but most likely this is because, as
you may guess from the command name, it’s not particularly designed for 1-way frequency
tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually
accessible way to see counts, proportions and even run hypothesis tests.

freq, from the questionr package


Documentation
library(questionr)

questionr::freq(data$Type, cum = TRUE, sort = "dec", total = TRUE)


Counts, percentages, cumulative percentages, missing values data, yes, all here! The table
can optionally be sorted in descending frequency, and works well with kable.

It is mostly tidy, but also has an annoyance in that the category values themselves (A–E) are
row labels rather than a standalone column. This means you may have to pop them into a
new column for best use in any downstream tidy tools. That’s easy enough with
e.g. dplyr’s add_rownames command. But that is another processing step to remember, which
is not a huge selling point.
There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if
you plan to pass the data onwards in a way where you don’t want to risk double-counting
your totals. There’s an “exclude” parameter if you want to remove any particular categories
from analysis before performing the calculations as well as a couple of extra formatting
options that might be handy.

Stem-and-Leaf Plot
A stem-and-leaf plot of a quantitative variable is a textual graph that classifies data items
according to their most significant numeric digits. In addition, we often merge each alternating row
with its next row in order to simplify the graph for readability.
Example
In the data set faithful, a stem-and-leaf plot of the eruptions variable identifies durations with the
same two most significant digits, and queues them up in rows.
Problem
Find the stem-and-leaf plot of the eruption durations in faithful.
Solution
We apply the stem function to compute the stem-and-leaf plot of eruptions.
Answer

The stem-and-leaf plot of the eruption durations is


> duration = faithful$eruptions
> stem(duration)

The decimal point is 1 digit(s) to the left of the |

16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370
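How much merging happens can be influenced through stem()'s scale argument, which controls the plot length; a sketch (the exact output depends on the data):

```r
duration <- faithful$eruptions
stem(duration, scale = 2)   # a larger scale spreads the leaves over more stems
```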

Exercise
Find the stem-and-leaf plot of the eruption waiting periods in faithful.

Scatter Plot
A scatter plot pairs up values of two quantitative variables in a data set and displays them as
geometric points inside a Cartesian diagram.
Example
In the data set faithful, we pair up the eruptions and waiting values in the same observation as (x,y)
coordinates. Then we plot the points in the Cartesian plane. Here is a preview of the eruption data
value pairs with the help of the cbind function.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
duration waiting
[1,] 3.600 79
[2,] 1.800 54
[3,] 3.333 74
[4,] 2.283 62
[5,] 4.533 85
[6,] 2.883 55

Problem
Find the scatter plot of the eruption durations and waiting intervals in faithful. Does it reveal any
relationship between the variables?
Solution
We apply the plot function to compute the scatter plot of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x−axis label
+ ylab="Time waited") # y−axis label

Answer
The scatter plot of the eruption durations and waiting intervals is as follows. It reveals a positive
linear relationship between them.
Enhanced Solution
We can generate a linear regression model of the two variables with the lm function, and then draw
a trend line with abline.
> abline(lm(waiting ~ duration))
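To quantify the strength of that linear relationship, we can also compute the correlation coefficient (covered later under Numerical Measures):

```r
duration <- faithful$eruptions   # the eruption durations
waiting  <- faithful$waiting     # the waiting interval
cor(duration, waiting)           # about 0.90, a strong positive association
```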

Numerical Measures

We explain how to compute various statistical measures in R with examples. The tutorials are based
on the previously discussed built-in data set faithful.
Mean
Median
Quartile
Percentile
Range
Interquartile Range
Box Plot
Variance
Standard Deviation
Covariance
Correlation Coefficient
Central Moment
Skewness
Kurtosis

Mean
The mean of an observation variable is a numerical measure of the central location of the data
values. It is the sum of its data values divided by the data count.
Hence, for a data sample of size n, its sample mean is defined as follows:

    x̄ = ( x₁ + x₂ + ⋯ + xₙ ) / n

Similarly, for a data population of size N, the population mean is:

    μ = ( x₁ + x₂ + ⋯ + x_N ) / N
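The sample-mean formula translates directly into base R; as a sketch, the following reproduces what the mean function computes:

```r
duration <- faithful$eruptions
sum(duration) / length(duration)   # sum of the values divided by the data count
mean(duration)                     # the built-in equivalent gives the same value
```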

Problem
Find the mean eruption duration in the data set faithful.
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878

Answer

The mean eruption duration is 3.4878 minutes.

Exercise
Find the mean eruption waiting periods in faithful.

Median
The median of an observation variable is the value at the middle when the data is sorted in
ascending order. It is an ordinal measure of the central location of the data values.
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4
Answer

The median of the eruption duration is 4 minutes.

Exercise
Find the median of the eruption waiting periods in faithful.

Quartile
There are several quartiles of an observation variable. The first quartile, or lower quartile, is
the value that cuts off the first 25% of the data when it is sorted in ascending order. The second
quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper
quartile, is the value that cuts off the first 75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000

Answer

The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543
minutes respectively.

Exercise
Find the quartiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of quartiles. Details can be found in the R
documentation via help(quantile).

Percentile
The nth percentile of an observation variable is the value that cuts off the first n percent of the data
values when it is sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage
ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330

Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330
minutes respectively.

Range
The range of an observation variable is the difference between its largest and smallest data values. It is a
measure of how far apart the entire data spreads in value.
Problem
Find the range of the eruption duration in the data set faithful.
Solution
We apply the max and min functions to compute the largest and smallest values of eruptions, then take
the difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # apply the max and min functions
[1] 3.5

Answer

The range of the eruption duration is 3.5 minutes.

Exercise
Find the range of the eruption waiting periods in faithful.

Exercise
Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of percentiles. Details can be found in the R
documentation via help(quantile).

Interquartile Range
The interquartile range of an observation variable is the difference between its upper and lower
quartiles. It is a measure of how far apart the middle portion of the data spreads in value.

Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> duration = faithful$eruptions # the eruption durations
> IQR(duration) # apply the IQR function
[1] 2.2915

Answer

The interquartile range of eruption duration is 2.2915 minutes.
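Equivalently, the interquartile range is the difference of the quartiles returned by quantile; this reproduces the IQR result above:

```r
duration <- faithful$eruptions            # the eruption durations
q <- quantile(duration, c(0.25, 0.75))    # lower and upper quartiles
print(unname(q[2] - q[1]))                # 2.2915, same as IQR(duration)
```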

Exercise
Find the interquartile range of eruption waiting periods in faithful.

Box Plot
The box plot of an observation variable is a graphical representation based on its quartiles, as well
as its smallest and largest values. It attempts to provide a visual shape of the data distribution.
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot

Answer

The box plot of the eruption duration is:


Exercise
Find the box plot of the eruption waiting periods in faithful.

Boxplots give a picture of how the data in a data set is distributed. A
boxplot splits the data at its quartiles, showing the minimum, first
quartile, median, third quartile and maximum of the data set. Boxplots are
also useful for comparing the distribution of data across data sets, by
drawing a boxplot for each of them.

Boxplots are created in R by using the boxplot() function.

Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −

 x is a vector or a formula.

 data is the data frame.

 notch is a logical value. Set as TRUE to draw a notch.

 varwidth is a logical value. Set as TRUE to draw the width of each box
proportional to the sample size.
 names are the group labels which will be printed under each boxplot.

 main is used to give a title to the graph.

Example
We use the data set "mtcars" available in the R environment to create a
basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.

input <- mtcars[,c('mpg','cyl')]

print(head(input))

When we execute the above code, it produces the following result −


mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6

Creating the Boxplot


The below script will create a boxplot graph for the relation between mpg
(miles per gallon) and cyl (number of cylinders).

# Give the chart file a name.

png(file = "boxplot.png")

# Plot the chart.

boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",

ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.

dev.off()

When we execute the above code, it produces the following result −


Boxplot with Notch
We can draw boxplot with notch to find out how the medians of different
data groups match with each other.

The below script will create a boxplot graph with notch for each of the data
group.

# Give the chart file a name.

png(file = "boxplot_with_notch.png")

# Plot the chart.

boxplot(mpg ~ cyl, data = mtcars,

xlab = "Number of Cylinders",

ylab = "Miles Per Gallon",


main = "Mileage Data",

notch = TRUE,

varwidth = TRUE,

col = c("green","yellow","purple"),

names = c("High","Medium","Low")
)

# Save the file.

dev.off()

When we execute the above code, it produces the following result −

Mean
It is calculated by taking the sum of the values and dividing by the
number of values in a data series.
The function mean() is used to calculate this in R.

Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used −

 x is the input vector.

 trim is used to drop some observations from both ends of the sorted vector.

 na.rm is used to remove the missing values from the input vector.

Example
# Create a vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.

result.mean <- mean(x)

print(result.mean)

When we execute the above code, it produces the following result −


[1] 8.22

Applying Trim Option


When the trim parameter is supplied, the values in the vector are sorted and
then the required number of observations is dropped from each end before
calculating the mean.

When trim = 0.3, 3 values from each end will be dropped from the
calculations to find mean.

In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and
the values removed from the vector for calculating mean are (−21,−5,2)
from left and (12,18,54) from right.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.

result.mean <- mean(x,trim = 0.3)

print(result.mean)

When we execute the above code, it produces the following result −


[1] 5.55
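We can verify this by hand: with the three smallest and three largest values removed, the mean of the remaining four values is indeed 5.55.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
kept <- sort(x)[4:7]        # the middle four values: 3, 4.2, 7, 8
print(mean(kept))           # 5.55, matching mean(x, trim = 0.3)
```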

Applying NA Option
If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation, use na.rm = TRUE, which
means remove the NA values.

# Create a vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)

# Find mean.

result.mean <- mean(x)

print(result.mean)

# Find mean dropping NA values.

result.mean <- mean(x,na.rm = TRUE)

print(result.mean)

When we execute the above code, it produces the following result −


[1] NA
[1] 8.22

Median
The middle-most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)

Following is the description of the parameters used −

 x is the input vector.

 na.rm is used to remove the missing values from the input vector.

Example
# Create the vector.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.

median.result <- median(x)

print(median.result)

When we execute the above code, it produces the following result −


[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of
data. Unlike the mean and median, the mode can be found for both numeric
and character data.

R does not have a standard in-built function to calculate the mode, so we
create a user-defined function to calculate the mode of a data set in R. This
function takes the vector as input and gives the mode value as output.

Example
# Create the function.

getmode <- function(v) {

uniqv <- unique(v)

uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.

v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.

result <- getmode(v)

print(result)

# Create the vector with characters.

charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.

result <- getmode(charv)

print(result)

When we execute the above code, it produces the following result −


[1] 2
[1] "it"

Variance
The variance is a numerical measure of how the data values are dispersed around the mean. In
particular, the sample variance is defined as:

s² = [(x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²] / (n − 1)

Similarly, the population variance is defined in terms of the population mean μ and population
size N:

σ² = [(x1 − μ)² + (x2 − μ)² + ... + (xN − μ)²] / N

Problem
Find the variance of the eruption duration in the data set faithful.
Solution
We apply the var function to compute the variance of eruptions.
> duration = faithful$eruptions # the eruption durations
> var(duration) # apply the var function
[1] 1.3027

Answer

The variance of the eruption duration is 1.3027.

Exercise
Find the variance of the eruption waiting periods in faithful.

Standard Deviation
The standard deviation of an observation variable is the square root of its variance.
Problem
Find the standard deviation of the eruption duration in the data set faithful.
Solution
We apply the sd function to compute the standard deviation of eruptions.
> duration = faithful$eruptions # the eruption durations
> sd(duration) # apply the sd function
[1] 1.1414

Answer

The standard deviation of the eruption duration is 1.1414.
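Since the variance of the eruption duration was found to be 1.3027, the standard deviation can also be obtained by taking its square root:

```r
duration <- faithful$eruptions     # the eruption durations
print(sqrt(var(duration)))         # 1.141371, identical to sd(duration)
```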

Exercise
Find the standard deviation of the eruption waiting periods in faithful.

Covariance
The covariance of two variables x and y in a data set measures how the two are linearly related. A
positive covariance would indicate a positive linear relationship between the variables, and a
negative covariance would indicate the opposite.
The sample covariance is defined in terms of the sample means x̄ and ȳ as:

sxy = [(x1 − x̄)(y1 − ȳ) + ... + (xn − x̄)(yn − ȳ)] / (n − 1)

Similarly, the population covariance is defined in terms of the population means μx and μy as:

σxy = [(x1 − μx)(y1 − μy) + ... + (xN − μx)(yN − μy)] / N

Problem
Find the covariance of eruption duration and waiting time in the data set faithful. Observe if there is
any linear relationship between the two variables.
Solution
We apply the cov function to compute the covariance of eruptions and waiting.
> duration = faithful$eruptions # eruption durations
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
Answer

The covariance of eruption duration and waiting time is about 14. It indicates a positive linear
relationship between the two variables.

Correlation Coefficient
The correlation coefficient of two variables in a data set equals their covariance divided by the
product of their individual standard deviations. It is a normalized measurement of how the two are
linearly related.
Formally, the sample correlation coefficient is defined by the following formula, where sx
and sy are the sample standard deviations, and sxy is the sample covariance:

rxy = sxy / (sx · sy)

Similarly, the population correlation coefficient is defined as follows, where σx and σy are the
population standard deviations, and σxy is the population covariance:

ρxy = σxy / (σx · σy)

If the correlation coefficient is close to 1, the variables are positively linearly related and the
scatter plot falls almost along a straight line with positive slope. If it is close to -1, the variables
are negatively linearly related and the scatter plot falls almost along a straight line with negative
slope. If it is close to zero, there is only a weak linear relationship between the
variables.
Problem
Find the correlation coefficient of eruption duration and waiting time in the data set faithful. Observe
if there is any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation coefficient of eruptions and waiting.
> duration = faithful$eruptions # eruption durations
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081

Answer

The correlation coefficient of eruption duration and waiting time is 0.90081. Since it is rather close
to 1, we can conclude that the variables are positively linearly related.
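As a check on the definition above, the same value can be recovered from the covariance and the two standard deviations:

```r
duration <- faithful$eruptions     # the eruption durations
waiting  <- faithful$waiting       # the waiting periods
r <- cov(duration, waiting) / (sd(duration) * sd(waiting))
print(r)                           # 0.9008112, identical to cor(duration, waiting)
```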

R - Linear Regression
Regression analysis is a widely used statistical tool for establishing a
relationship model between two variables. One of these variables, called the
predictor variable, has its values gathered through experiments. The other
variable, called the response variable, has its values derived from the
predictor variable.

In linear regression these two variables are related through an equation in
which the exponent (power) of both variables is 1. Mathematically, a linear
relationship represents a straight line when plotted as a graph. A non-linear
relationship, where the exponent of a variable is not equal to 1, creates a
curve.

The general mathematical equation for a linear regression is −


y = ax + b

Following is the description of the parameters used −

 y is the response variable.

 x is the predictor variable.

 a and b are constants which are called the coefficients.

Steps to Establish a Regression


A simple example of regression is predicting the weight of a person when the
height is known. To do this we need the relationship between the height and
weight of a person.

The steps to create the relationship are −

 Carry out the experiment of gathering a sample of observed values of height


and corresponding weight.

 Create a relationship model using the lm() function in R.

 Find the coefficients from the model created and create the mathematical
equation using these.

 Get a summary of the relationship model to know the average error in
prediction, also called residuals.

 To predict the weight of new persons, use the predict() function in R.

Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the
response variable.

Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)

Following is the description of the parameters used −

 formula is a symbol presenting the relation between x and y.

 data is the data frame containing the variables in the formula.

Create Relationship Model & get the Coefficients


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746

Get the Summary of the Relationship


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)


print(summary(relation))

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)

Following is the description of the parameters used −

 object is the model which has already been created using the lm() function.

 newdata is the data frame containing the new values for the predictor variable.

Predict the weight of new persons


# The predictor vector.

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.

a <- data.frame(x = 170)

result <- predict(relation,a)

print(result)

When we execute the above code, it produces the following result −


1
76.22869

Visualize the Regression Graphically


# Create the predictor and response variable.

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y~x)

# Give the chart file a name.

png(file = "linearregression.png")

# Plot the chart.

plot(y,x,col = "blue",main = "Height & Weight Regression",

abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Save the file.

dev.off()

When we execute the above code, it produces the following result −


R - Time Series Analysis
Time series is a series of data points in which each data point is associated
with a timestamp. A simple example is the price of a stock in the stock
market at different points of time on a given day. Another example is the
amount of rainfall in a region at different months of the year. R language
uses many functions to create, manipulate and plot the time series data.
The data for a time series is stored in an R object called a time-series
object. It is an R data object just like a vector or data frame.

The time series object is created by using the ts() function.

Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)

Following is the description of the parameters used −


 data is a vector or matrix containing the values used in the time series.

 start specifies the start time for the first observation in time series.

 end specifies the end time for the last observation in time series.

 frequency specifies the number of observations per unit time.

Except for the parameter "data", all other parameters are optional.

Example
Consider the monthly rainfall at a place, starting from January 2012.
We create an R time series object for a period of 12 months and plot it.

# Get the data points in form of a R vector.

rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

# Convert it to a time series object.

rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.

print(rainfall.timeseries)

# Give the chart file a name.

png(file = "rainfall.png")

# Plot a graph of the time series.

plot(rainfall.timeseries)

# Save the file.

dev.off()

When we execute the above code, it produces the following result and chart

Jan Feb Mar Apr May Jun Jul Aug Sep
2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0

The Time series chart −

Different Time Intervals


The value of the frequency parameter in the ts() function decides the time
intervals at which the data points are measured. A value of 12 indicates
that the time series is for 12 months. Other values and their meanings are
as below −

 frequency = 12 pegs the data points for every month of a year.

 frequency = 4 pegs the data points for every quarter of a year.

 frequency = 6 pegs the data points for every 10 minutes of an hour.

 frequency = 24*6 pegs the data points for every 10 minutes of a day.
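For example, a quarterly series uses frequency = 4; the sales figures below are made up for illustration and start in the second quarter of 2012.

```r
# Six quarters of illustrative sales figures.
sales <- c(120, 150, 170, 160, 130, 155)

# Quarterly time series starting at 2012 Q2.
sales.timeseries <- ts(sales, start = c(2012, 2), frequency = 4)

print(sales.timeseries)    # labelled Qtr2 2012 through Qtr3 2013
```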
Multiple Time Series
We can plot multiple time series in one chart by combining both the series
into a matrix.

# Get the data points in form of a R vector.

rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

rainfall2 <-

c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)

# Convert them to a matrix.

combined.rainfall <- matrix(c(rainfall1,rainfall2),nrow = 12)

# Convert it to a time series object.

rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)

# Print the timeseries data.

print(rainfall.timeseries)

# Give the chart file a name.

png(file = "rainfall_combined.png")

# Plot a graph of the time series.

plot(rainfall.timeseries, main = "Multiple Time Series")

# Save the file.

dev.off()

When we execute the above code, it produces the following result and chart

Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8

The Multiple Time series chart −

R - Nonlinear Least Square


When modeling real world data for regression analysis, we observe that it is
rarely the case that the equation of the model is a linear equation giving a
linear graph. Most of the time, the equation of the model of real world data
involves mathematical functions of higher degree like an exponent of 3 or a
sin function. In such a scenario, the plot of the model gives a curve rather
than a line. The goal of both linear and non-linear regression is to adjust
the values of the model's parameters to find the line or curve that comes
closest to your data. On finding these values we will be able to estimate the
response variable with good accuracy.

In Least Square regression, we establish a regression model in which the


sum of the squares of the vertical distances of different points from the
regression curve is minimized. We generally start with a defined model and
assume some values for the coefficients. We then apply the nls() function
of R to get the more accurate values along with the confidence intervals.

Syntax
The basic syntax for creating a nonlinear least square test in R is −
nls(formula, data, start)

Following is the description of the parameters used −

 formula is a nonlinear model formula including variables and parameters.

 data is a data frame used to evaluate the variables in the formula.

 start is a named list or named numeric vector of starting estimates.

Example
We will consider a nonlinear model with assumed initial values for its
coefficients. Next we will look at the confidence intervals of these
estimates, so that we can judge how well the fitted values fit the
model.

So let's consider the below equation for this purpose −


a = b1*x^2+b2

Let's assume the initial coefficients to be 1 and 3 and fit these values into
nls() function.

xvalues <- c(1.6,2.1,2,2.23,3.71,3.25,3.4,3.86,1.19,2.21)

yvalues <- c(5.19,7.43,6.94,8.11,18.75,14.88,16.06,19.12,3.21,7.58)


# Give the chart file a name.

png(file = "nls.png")

# Plot these values.

plot(xvalues,yvalues)

# Take the assumed values and fit into the model.

model <- nls(yvalues ~ b1*xvalues^2+b2,start = list(b1 = 1,b2 = 3))

# Plot the chart with new data by fitting it to a prediction from 100 data points.

new.data <- data.frame(xvalues = seq(min(xvalues),max(xvalues),len = 100))

lines(new.data$xvalues,predict(model,newdata = new.data))

# Save the file.

dev.off()

# Get the sum of the squared residuals.

print(sum(resid(model)^2))

# Get the confidence intervals on the chosen values of the coefficients.

print(confint(model))

When we execute the above code, it produces the following result −


[1] 1.081935
Waiting for profiling to be done...
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
From the confidence intervals we can conclude that b1 lies close to the
assumed value of 1 (roughly 1.14 to 1.25), while b2 is closer to 2 than to
the assumed 3.

R - Multiple Regression
Multiple regression is an extension of linear regression to relationships
between more than two variables. In simple linear regression we have one
predictor and one response variable, but in multiple regression we have
more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is −


y = a + b1x1 + b2x2 +...bnxn

Following is the description of the parameters used −

 y is the response variable.

 a, b1, b2...bn are the coefficients.


 x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model
determines the value of the coefficients using the input data. Next we can
predict the value of the response variable for a given set of predictor
variables using these coefficients.

lm() Function
This function creates the relationship model between the predictor and the
response variable.

Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)

Following is the description of the parameters used −

 formula is a symbol presenting the relation between the response variable and
predictor variables.

 data is the data frame containing the variables in the formula.

Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a
comparison between different car models in terms of mileage per gallon
(mpg), cylinder displacement("disp"), horse power("hp"), weight of the
car("wt") and some more parameters.

The goal of the model is to establish the relationship between "mpg" as a


response variable with "disp","hp" and "wt" as predictor variables. We
create a subset of these variables from the mtcars data set for this purpose.

input <- mtcars[,c("mpg","disp","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result −


mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460

Create Relationship Model & get the Coefficients


input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.

model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.

print(model)

# Get the Intercept and coefficients as vector elements.

cat("# # # # The Coefficient Values # # # ","\n")

a <- coef(model)[1]

print(a)

Xdisp <- coef(model)[2]

Xhp <- coef(model)[3]

Xwt <- coef(model)[4]

print(Xdisp)

print(Xhp)

print(Xwt)

When we execute the above code, it produces the following result −


Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

# # # # The Coefficient Values # # #


(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891

Create Equation for Regression Model


Based on the above intercept and coefficient values, we create the
mathematical equation.
Y = a+Xdisp.x1+Xhp.x2+Xwt.x3
or
Y = 37.105+(-0.000937)*x1+(-0.0311)*x2+(-3.8008)*x3

Apply Equation for predicting New Values


We can use the regression equation created above to predict the mileage
when a new set of values for displacement, horse power and weight is
provided.

For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is

Y = 37.105+(-0.000937)*221+(-0.0311)*102+(-3.8008)*2.91 ≈ 22.66
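The hand calculation can be checked with the predict() function, which uses the unrounded coefficients:

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Refit the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Predict mileage for disp = 221, hp = 102, wt = 2.91.
newcar <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, newcar))    # about 22.66
```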

R - Logistic Regression
Logistic regression is a regression model in which the response variable
(dependent variable) has categorical values such as True/False or 0/1. It
models the probability of a binary response as the value of the response
variable based on the mathematical equation relating it to the predictor
variables.

The general mathematical equation for logistic regression is −


y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

Following is the description of the parameters used −

 y is the response variable.


 x1, x2, x3... are the predictor variables.

 a and b1, b2, b3... are the coefficients, which are numeric constants.

The function used to create the regression model is the glm() function.

Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)

Following is the description of the parameters used −

 formula is the symbol presenting the relationship between the variables.

 data is the data set giving the values of these variables.

 family is an R object used to specify the details of the model. Its value is binomial for
logistic regression.

Example
The in-built data set "mtcars" describes different models of a car with their
various engine specifications. In "mtcars" data set, the transmission mode
(automatic or manual) is described by the column am which is a binary
value (0 or 1). We can create a logistic regression model between the
columns "am" and 3 other columns - hp, wt and cyl.

# Select some columns from mtcars.

input <- mtcars[,c("am","cyl","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result −


am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
Create Regression Model
We use the glm() function to create the regression model and get its
summary for analysis.

input <- mtcars[,c("am","cyl","hp","wt")]

am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))

When we execute the above code, it produces the following result −


Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 43.2297 on 31 degrees of freedom


Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8

Conclusion
In the summary, the p-value in the last column is more than 0.05 for the
variables "cyl" and "hp", so we consider them statistically insignificant in
explaining the value of the variable "am". Only weight (wt) has a significant
impact on "am" in this regression model.
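The fitted model can also be used with predict() to estimate the probability of a manual transmission for a new car; the predictor values below are made up for illustration.

```r
input <- mtcars[, c("am", "cyl", "hp", "wt")]

# Refit the logistic regression model.
am.data <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# Probability that a light 4-cylinder car has a manual gearbox (am = 1).
newcar <- data.frame(cyl = 4, hp = 110, wt = 2.5)
print(predict(am.data, newcar, type = "response"))
```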
