Introduction to statistics using R - Session 1

Descriptive statistics - R Intermissions 1 and 2


Author: Chloé Warret Rodrigues
2023-03-08
In this file, you will find some code to summarize and describe your data with numbers (we will get into graphs in a few sessions). Some of the code used is more advanced than you need at this stage, so I will signal the code you can ignore with an “Advanced”/“End of Advanced” tag. However, if you are interested in understanding these extra code bits, they will be annotated like the rest.

With that said, let’s begin. First, install some packages

#install.packages("dplyr")
#install.packages("tidyr")
#install.packages("ggpubr")

N.B.: I’ve used a # in front of the install.packages code because these packages are already installed on my computer. If I ran it, it would just ask me to restart R because the packages are loaded.

RStudio tip: Note that, alternatively, you can easily install a package in RStudio by going to the Packages tab of the bottom-right pane, clicking Install, and typing the names of the packages you need.

And load the newly installed packages in R: the library command tells R that we are loading the packages, so that their various functions become readily available for us to use. Base R has many basic commands to describe, plot and analyze your data, but some very smart people have created packages with functions that can make your life easier, apply less common calculations, or apply new statistical methods that are being developed.

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

library(tidyr)
library(ggpubr)

## Loading required package: ggplot2

Measuring the central tendency of a data set


Now, we’re going to play in R with the different parameters we talked about that describe the location, or central tendency, of your data. We will see three different examples of data with different shapes and behaviors. So, our first step will be to generate data.

A. Continuous normally distributed variables


We’ll start easy: continuous normal data. Let’s pretend we’ve measured the wing span of 15,000 birds of prey.

💪 Advanced
Here, we imagine a population (i.e., a specific collection of objects of interest) whose true average wingspan is 95.2 cm, with a standard deviation of 20.7 cm. We’ll take a (huge) sample (i.e., a subset of the population) of 15,000 birds.

N.B.: When the sample consists of the whole population, it is termed a census.

set.seed(12) #allows replication


df<- as.data.frame(rnorm(n = 15000, mean = 95.2, sd = 20.7)) #create a dataframe
colnames(df)[1] <- "wing_span" #rename the column "wing span"

🍀 Useful tip: set.seed is a function that ensures reproducibility. Whenever your code involves randomness (like randomly generating data), using the same seed number will generate the same random values at each run.
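You can convince yourself quickly with a tiny sketch (run these two lines twice; you get the same three numbers both times):

set.seed(12)
rnorm(3) #same three values every time the seed is reset to 12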

Then, we told R that we want to create a data frame, df, with the as.data.frame function (because with just one column of data, R would by default create a vector), containing 15,000 data points drawn from a normal distribution (rnorm) centered on 95.2 (the population mean) with a standard deviation of 20.7.

The last line tells R that we want the unique column of df to be named “wing_span”.
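If you want to double-check what you just created, here are a couple of optional inspection commands (a quick sketch; output not shown):

str(df) #one column, 15,000 numeric values
head(df$wing_span) #first few simulated wing spans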

End of Advanced 💪
1. The mean and median
Let’s check what our data look like. We will add the mean and the median to the plot.

ggdensity(df, x = "wing_span",
add = "mean", rug = F)

## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.

## Warning: `geom_vline()`: Ignoring `data` because `xintercept` was provided.

ggdensity(df, x = "wing_span",
add = "median", rug = F)

## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.


## `geom_vline()`: Ignoring `data` because `xintercept` was provided.

As I said above, we will get into proper plotting later. For now, we asked for a density plot of the data (which shows how they are distributed). In the first command, we added the mean, and in the second, the median. Note how the mean and median are nearly identical.

Now, let’s get the numerical values of the mean and the median. Base R conveniently has very easy functions to get them.

mean(df$wing_span)

## [1] 95.19335

median(df$wing_span)

## [1] 95.10304

See how the mean and median are almost the same when we are close to a perfect normal distribution?

2. The geometric mean


Here, there is no base function, so we’ll simply apply the formula I showed you in the slides.

gm<-exp(mean(log(df$wing_span)))
print(gm)

## [1] 92.77353

gm

## [1] 92.77353

We created an object (gm) to store the quantity of interest, here our geometric mean. We also used two bits of code to display its value: the print function, and simply typing the name of the object. There is almost always more than one way to get R to do something, but we’ll come back to that in another session.

Notice how the geometric mean is a bit smaller than the arithmetic mean. That’s a useful feature when your distribution is right-skewed, because the geometric mean is closer to the more likely values your variable can take.

You could also calculate your geometric mean by hand, but only if you have a small data set. See below:

vect<-c(10000,12000,11000,11500,9000,12500,10500,11000,13000,12500)
N=length(vect)
prod<-prod(vect)
GM=prod^(1/N)
GM1 = prod(vect)^(1/(length(vect)))
gm<-exp(mean(log(vect)))

Step 1: we created a short vector. As I said above, this method would not work with our 15,000-observation data set, because the running product overflows (see the check below). Step 2: 🍀 length is a useful function that returns the number of samples in your data set. We store that information in an object called N. Notice how you can use either = or <- to define an R object. Step 3: we calculate the product of all values and store it in an object called “prod”. Step 4: we calculate the geometric mean. GM1 is an alternative way of writing the formula that skips creating N and prod. gm uses the same method as previously to calculate the geometric mean: you can verify that they give exactly the same result.
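Here is that overflow in action (a quick sketch you can run yourself; output not shown):

prod(df$wing_span) #returns Inf: the product of 15,000 values around 95 exceeds the largest representable double, so prod(x)^(1/N) breaks down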

3. The harmonic mean


Reminder: the harmonic mean is the reciprocal of the mean of the reciprocals, n/(1/x1 + 1/x2 + … + 1/xn), and that would be particularly painful to calculate by hand with n = 15,000!

So, here is the magic of R: you get access to functions, either in base R or in packages, that will do the painful work for you. We’ll use the package psych.

#install.packages("psych")
library(psych)

## Warning: package 'psych' was built under R version 4.2.3

##
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':


##
## %+%, alpha

N.B.: I’ve used a # in front of the install.packages code because this package is already installed on my computer. If I ran it, it would just ask me to restart R because the package is loaded.

🍀 Useful tip: If you want R to ignore a line of code, just put a # at the beginning of the line. If your code spans multiple lines, you have to add # at the beginning of each line.

harmonic.mean(df$wing_span)

## [1] 90.10761

print(hm<- harmonic.mean(df$wing_span))

## [1] 90.10761

Here, we used the function harmonic.mean from the package psych, stored the result in an object called hm, and in the same line of code asked R to display it in the R console (bottom-left pane).
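If you prefer not to load a package, you can also apply the reciprocal formula directly in base R (a minimal sketch; it should give the same value as harmonic.mean above):

1/mean(1/df$wing_span) #reciprocal of the mean of the reciprocals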

Before we switch to a different type of distribution (we’ll produce data with a different shape), let’s check how the size of the data
set affects the estimates of central tendency.

set.seed(213) # for reproducible example


x<- rnorm(n = 15, mean = 95.2, sd = 20.7)
mean(x)

## [1] 88.83817

median(x)

## [1] 97.21167

mean(x)-median(x)

## [1] -8.373495

set.seed(12)
x1<- rnorm(n = 300, mean = 95.2, sd = 20.7)
mean(x1)

## [1] 94.84418

median(x1)

## [1] 94.10482

mean(x1)-median(x1)

## [1] 0.7393615

set.seed(12)
x2<- rnorm(n = 150000, mean = 95.2, sd = 20.7)
mean(x2)

## [1] 95.22789

median(x2)

## [1] 95.23198

mean(x2)-median(x2)

## [1] -0.004088828

We simulated three data sets (the vectors x, x1 and x2) representing the exact same population (a normal distribution with mean = 95.2 and sd = 20.7). In the first case, we collected only 15 samples, while we got 300 samples in the second case and 150,000 samples in the third.

Notice also how the sample size affects the median and mean: the smaller the sample size, the further we are from the true population mean, and the further apart the mean and the median are. So, the smaller the sample size, the harder it will be to determine from what distribution (shape, mean, SD) the data came.
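If you want to repeat that comparison more compactly, here is a sketch using sapply (same sample sizes as above, but because the draws now happen one after another from a single seed, the exact numbers will differ from the runs shown earlier):

set.seed(12)
sapply(c(15, 300, 150000), function(n) {
  x <- rnorm(n, mean = 95.2, sd = 20.7)
  c(n = n, mean = mean(x), median = median(x), diff = mean(x) - median(x))
})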

B. Continuous right-skewed variables


Now, we’ll complicate things a bit. Let’s pretend we’ve measured marten home-range areas. Home-range area data sets are often right-skewed.

💪 Advanced
set.seed(123)
df2<-as.data.frame(rlnorm(1500, log(10), log(2.5)))
colnames(df2)[1] <- "log_area"

Like before, we simulate a data set, this time with 1,500 entries, but our data follow a lognormal distribution, i.e., the data are right-skewed.

End of Advanced 💪
1. The mean and median
Like before, we will make a density plot of our data and get the numerical values of our mean and median.

ggdensity(df2, x = "log_area",
add = "mean", rug = F)

## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.

## Warning: `geom_vline()`: Ignoring `data` because `xintercept` was provided.

ggdensity(df2, x = "log_area",
add = "median", rug = F)

## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.


## `geom_vline()`: Ignoring `data` because `xintercept` was provided.

mean(df2$log_area)

## [1] 15.48841

median(df2$log_area)

## [1] 10.2114

See now how the mean and median differ when the distribution is skewed.
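As a side note, you can check that for this right-skewed (lognormal) example the geometric mean falls close to the median and well below the arithmetic mean (a quick sketch; exact values depend on the seed used above):

exp(mean(log(df2$log_area))) #geometric mean of the skewed variable
median(df2$log_area)
mean(df2$log_area)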

C. Discrete variables
Count data are an example of discrete data. Let’s pretend here that we are counting the number of some parasite on slides.

💪 Advanced
We will generate a vector to store our data and display a summary table showing how many occurrences of each count we have.

y <- rpois(n = 500, lambda = 4)


print(tb<-table(y))

## y
## 0 1 2 3 4 5 6 7 8 9 10 12
## 11 33 72 108 97 65 59 32 15 5 2 1

End of Advanced 💪
Let’s plot our data. (We need to summarize the data to make a barplot, hence we use the function table, which produces the frequency of each count.)

1. The mean and median

barplot(table(y))

mean(y)

## [1] 3.978

median(y)

## [1] 4

2. The mode
There is no function to get the mode in base R. So, we will need to find other ways to get it: we can create our own function (advanced) or simply
use a function from a package.

💪 Advanced
Mode <- function(x) {
  ux <- unique(x) #the distinct values of x
  ux[which.max(tabulate(match(x, ux)))] #return the distinct value with the highest count
}

Mode(y)

## [1] 3

Here, we created our own function to get the mode. We call that function, wait for it… “Mode”.

The function unique isolates each value our variable can take. The function which.max returns the location of the first maximum value in a vector. The tabulate function counts the number of occurrences of each integer value in a vector, also counting the “missing” values (for example, if you have 1, 2, 3, 5, tabulate will also indicate that 4 appears 0 times). The match function returns the index positions of the first matching elements of the first vector in the second vector.

So what did we do here? We first isolated each count value that appears in x (x being the generic name of the vector on which the function will be applied, which we will replace with the name of our own vector), in the order they appear in x, and called the result ux. We then counted, in x, the number of occurrences of each unique count value (stored in ux; that’s why we need to match x and ux). In the same line, we asked which unique value has the maximum frequency.
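To see what match and tabulate are doing, here is a tiny toy example (a sketch with a made-up vector v):

v <- c(2, 5, 2, 7, 5, 2)
uv <- unique(v) #2 5 7
match(v, uv) #1 2 1 3 2 1: the position of each value of v in uv
tabulate(match(v, uv)) #3 2 1: the counts of 2, 5 and 7
uv[which.max(tabulate(match(v, uv)))] #2, the mode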

Finally, now that we have defined our function, we apply it to y. And it tells us that 3 is the most common value!

Now, let’s complicate things even more! We’ll play with bimodal distributions.
First, let’s create a bimodal data set based on lognormal (so continuous) distributions. We will first define every element we need to generate the data.

mu1 <- log(1) #first mean


mu2 <- log(1000) #second mean, needs to be way different
sig1 <- log(3) #sd of the first peak
sig2 <- log(3) #sd of the second peak
cpct <- 0.5 #that's just a probability value used to allocate values with a Bernoulli trial

Now we’ll create the data using a function with all the parameters defined above, plus n (the sample size). We will call our function bimodalDistFunc.

bimodalDistFunc <- function (n, cpct, mu1, mu2, sig1, sig2) {
  y0 <- rlnorm(n, meanlog = mu1, sdlog = sig1) #draw n values from the first lognormal distribution
  y1 <- rlnorm(n, meanlog = mu2, sdlog = sig2) #draw n values from the second lognormal distribution
  flag <- rbinom(n, size = 1, prob = cpct) #here is the Bernoulli trial: one coin flip per sample, with probability 0.5
  y <- y0*(1 - flag) + y1*flag #so that about half of n belongs to y0 and half to y1
}

And now we can apply the function to create a bimodal data set we will call bimodalnorm.

bimodalnorm <- bimodalDistFunc(n=10000,cpct,mu1,mu2, sig1,sig2)


hist(log(bimodalnorm))

We created 10,000 data points split between the two modes, and plotted a histogram of these data (on the log scale).

Just for fun (and so that you see it also works with discrete data), let’s create a bimodal data set based on two Poisson distributions.

lambda1 <- 5
lambda2 <- 20
cpct <- 0.5 #probability of 0.5

The Poisson distribution is a simple one (we’ll see that later): the only parameter we need to define it is lambda, because in this distribution the variance is equal to the mean. So we define our two lambdas and, like before, the probability value used to decide how to assign each randomly generated value to one of the two Poisson distributions (here, each value has an equal chance, 0.5, of being assigned to each distribution).
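A quick sanity check of that mean-equals-variance property (a sketch with a fresh throwaway Poisson sample z, so your exact numbers will differ slightly):

z <- rpois(10000, lambda = 5)
mean(z) #close to 5
var(z) #also close to 5: for a Poisson, variance = mean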

bimodalDistFunc2 <- function (n,cpct, lambda1, lambda2) {


y0 <- rpois(n,lambda = lambda1)
y1 <- rpois(n,lambda = lambda2)

flag <- rbinom(n,size=1,prob=cpct)


y <- y0*(1 - flag) + y1*flag
}

bimodalpois <- bimodalDistFunc2(n=10000,cpct, lambda1, lambda2)


table(bimodalpois)

## bimodalpois
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 25 174 425 676 908 867 730 499 336 192 124 118 116 138 206 262 335 376 414 472
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 40
## 426 392 394 332 242 232 177 118 98 77 51 29 18 9 6 3 1 1 1

barplot(table(bimodalpois))

And just like for the double lognormal, we create a double Poisson. We generate 10,000 data points that should be split roughly equally between the two distributions, and we summarize the frequency of our values using table. Finally, we visualize it with a barplot.

End of Advanced 💪
Ok, now that we have our fake data sets, unimodal (y) and the two bimodal ones (bimodalnorm and bimodalpois), we’ll get the mode(s) of each of them.

I showed you how to get a mode with a home-made function in the “advanced” section, but we can simply use packages too 😉
First, the unimodal data set.

library(modeest) #install it if not done

## Registered S3 method overwritten by 'rmutil':


## method from
## plot.residuals psych

mlv(y)

## [1] 3

Now, the bimodal data sets. Here, we will need packages specifically capable of dealing with bimodality. The mlv function won’t work, because it’s designed for unimodal data sets:

mlv(bimodalnorm)

## Warning: argument 'method' is missing. Data are supposed to be continuous.


## Default method 'shorth' is used

## [1] 1.63779

See, it returns only one mode. So, we’ll now load packages that can properly deal with multimodal data.

#install.packages("biosurvey", "diptest", "LaplacesDemon", "mousetrap")


library(biosurvey)
library(diptest)
library(LaplacesDemon)

##
## Attaching package: 'LaplacesDemon'

## The following object is masked _by_ '.GlobalEnv':


##
## Mode

## The following objects are masked from 'package:psych':


##
## logit, tr

library(mousetrap)

## Welcome to mousetrap 3.2.1!

## Summary of recent changes: http://pascalkieslich.github.io/mousetrap/news/

## Forum for questions: https://forum.cogsci.nl/index.php?p=/categories/mousetrap

First, let’s use find_modes from biosurvey, and locmodes from multimode.

bp.d<-density(bimodalpois)
find_modes(bp.d)

## mode density
## 1 4.56094 0.0796240867
## 2 19.21653 0.0430809388
## 3 39.77121 0.0000354922

library(multimode)
locmodes(bimodalnorm, mod0 = 2) #here, we indicated that we expect 2 modes

## Warning in locmodes(bimodalnorm, mod0 = 2): If the density function has an


## unbounded support, artificial modes may have been created in the tails

##
## Estimated location
## Modes: 683.5343 67805.39
## Antimode: 51562.42
##
## Estimated value of the density
## Modes: 6.363672e-05 7.958462e-09
## Antimode: 2.496266e-09
##
## Critical bandwidth: 6077.793

You can also test whether your data really is multimodal. We’ll use the count data.

dip.test(bimodalpois)

##
## Hartigans' dip test for unimodality / multimodality
##
## data: bimodalpois
## D = 0.056787, p-value < 2.2e-16
## alternative hypothesis: non-unimodal, i.e., at least bimodal

is.unimodal(bimodalpois)

## [1] FALSE

is.bimodal(bimodalpois)

## [1] TRUE

bimodality_coefficient(bimodalpois)

## [1] 0.6259467

The package mousetrap flags the data as bimodal because it returns a coefficient > 0.55. So all the packages agree: our data truly are bimodal.

Now, we’ll determine the modes, with package LaplacesDemon.

Modes(bimodalpois)

## $modes
## [1] 4.56094 19.21653
##
## $mode.dens
## [1] 0.07962409 0.04308094
##
## $size
## [1] 0.5060007 0.4939993

Measuring the variability of your data


1. The range
If you remember, the range simply indicates the boundaries of your data, i.e., the most extreme values, min and max. R has a range function that provides min and max at once.

min(df$wing_span)

## [1] 15.70996

max(df$wing_span)

## [1] 164.0964

range(df$wing_span)

## [1] 15.70996 164.09644

2. The variance
The variance is the mean of the squared deviations of the observations from their arithmetic mean. It is rarely reported on its own, probably because it does not have the same unit as the original observations. However, many statistical tests use the variance in their computations. And yes, R has a very convenient var function.

var(df$wing_span)

## [1] 430.0878
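If you want to verify what var computes, here is a quick by-hand check (a sketch): R’s var divides by n - 1, i.e., it returns the sample variance.

n <- length(df$wing_span)
sum((df$wing_span - mean(df$wing_span))^2) / (n - 1) #same value as var(df$wing_span)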

3. The standard deviation


The standard deviation (abbreviated SD) is the square root of the variance. You can calculate it as such; otherwise, R has an sd function.

sqrt(var(df$wing_span))

## [1] 20.73856

#Or the direct function


sd(df$wing_span)

## [1] 20.73856

4. The standard error


The standard error of the mean (abbreviated SE or SEM) is obtained by dividing the standard deviation by the square root of the sample size. The SD is used to indicate how scattered the data are: it measures the dispersion of a data set around its mean. The SE, on the other hand, indicates the uncertainty around the estimate of the mean, and so reflects how well your sample is likely to represent the true population.

Unfortunately, there is no base R function to calculate it (but some packages, like plotrix, have one). However, it’s really easy to simply apply the formula.

n<-length(df$wing_span) #define sample size


sd(df$wing_span)/sqrt(n)

## [1] 0.1693296

You can even create your own function easily, if you have many data sets and don’t want to re-write the whole formula each time.

std.error <- function(x) sd(x)/sqrt(length(x))


std.error(df$wing_span)

## [1] 0.1693296
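As mentioned above, the plotrix package has a ready-made standard error function. A sketch, assuming plotrix is installed (the :: notation avoids any clash with the std.error function we just defined ourselves):

#install.packages("plotrix")
plotrix::std.error(df$wing_span) #should return the same value as our custom function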

5. The confidence interval of the mean


The confidence interval (abbreviated CI) of the mean is derived from the standard error. You can decide how accurate you want to be, the classic threshold being the 95% CI, which means that you are 95% confident that the true mean of the population is included in the computed interval. But you can choose a 99% CI, or even a 75% CI. The higher the confidence level you choose, the wider your CI will be for a given set of data. The sample size also affects the CI: as your data set becomes smaller, the multiplier becomes larger, and thus your CI becomes wider.
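You can see that multiplier grow as the sample size shrinks with a one-line sketch (the 95% level is held fixed; the sample sizes are arbitrary examples):

qt(p = 0.975, df = c(5, 15, 100, 15000) - 1) #t multipliers for n = 5, 15, 100 and 15,000; they shrink toward 1.96 as n grows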

The CI is always symmetrical around the mean, and you can use it for direct comparison of data sets.

a<-0.05 #alpha for a 95% CI
ddf<-length(df$wing_span)-1 #degrees of freedom = n - 1
t.s<-abs(qt(p=a/2, df=ddf)) #critical t value (the multiplier)
margin_error <- t.s * std.error(df$wing_span) #now compute the margin of error

We first defined an alpha threshold: the classic 95% CI requires an alpha of 0.05, because 1 - 0.05 = 0.95 (the confidence level you want).

Then, we defined the degrees of freedom (ddf), which is your sample size n (hence we’re using the length function we met earlier) minus 1.

We then computed the t-score with qt, which returns the critical value (quantile) of Student’s t distribution for a given probability and the degrees of freedom we just defined. We used a probability of alpha/2 because the interval is two-sided: half of the alpha goes into each tail. We took the absolute value (function abs) to avoid a negative t-score, which would invert the upper and lower bounds of your CI.

Finally, we defined the margin of error, which represents the amount of sampling error.

And now we can calculate the upper and lower bound of our CI

lower_bound <- mean(df$wing_span) - margin_error


upper_bound <- mean(df$wing_span) + margin_error
ci<-c(lower_bound, upper_bound)
print(ci)

## [1] 94.86145 95.52526

Remember that we defined a mean = 95.2 as the true mean of our population.

Now, let’s compute the 99% CI:

a1<-0.01
ddf<-length(df$wing_span)-1
t.s1<-abs(qt(p=a1/2, df=ddf))
margin_error <- t.s1 * std.error(df$wing_span)
lower_bound <- mean(df$wing_span) - margin_error # Calculate the lower bound
upper_bound <- mean(df$wing_span) + margin_error # Calculate the upper bound
ci<-c(lower_bound, upper_bound)
print(ci)

## [1] 94.75713 95.62957

The new alpha threshold must be 0.01, because 1 - 0.01 = 0.99; the df doesn’t change because we’re using the same data set; and we adjust the t-score with our new alpha, and the margin of error with our new t-score.

Note how the new 99% CI is a bit wider than the 95% CI.

So, now, let’s compute the 75% CI.

a25<-0.25 #because 1-0.25 = 0.75


ddf<-length(df$wing_span)-1
t.s25<-abs(qt(p=a25/2, df=ddf))
margin_error <- t.s25 * std.error(df$wing_span)
lower_bound <- mean(df$wing_span) - margin_error
upper_bound <- mean(df$wing_span) + margin_error
ci<-c(lower_bound, upper_bound)
print(ci)

## [1] 94.99856 95.38815

As you probably guessed, this 75% CI is narrower than the one obtained with an alpha threshold of 0.01.

So, now that we’ve dissected the CI with a bunch of lines of code, let’s do it the easy way! 😉
You can simply use the t.test function as a one-sample t-test, and it will provide the CI at the level you specify.

t.test(df$wing_span)

##
## One Sample t-test
##
## data: df$wing_span
## t = 562.18, df = 14999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 94.86145 95.52526
## sample estimates:
## mean of x
## 95.19335

t.test(df$wing_span)$conf.int

## [1] 94.86145 95.52526


## attr(,"conf.level")
## [1] 0.95

The first line provides the full output, and the second specifically extracts the CI.

You can check that this method is strictly equivalent to the calculations we did above, as we will find the exact same results:

t.test(df$wing_span, conf.level = 0.99)$conf.int

## [1] 94.75713 95.62957


## attr(,"conf.level")
## [1] 0.99

t.test(df$wing_span, conf.level = 0.75)$conf.int

## [1] 94.99856 95.38815


## attr(,"conf.level")
## [1] 0.75

Note that in our example the changes are relatively small, because we have a lot of data. But see what happens with a small data set. To illustrate how sample size affects the CI, and the difference between alpha values, we’ll sub-sample our wing span data set (we will take the first 15 values).

small.df<-as.data.frame(df[1:15,]) #we need to ask for a dataframe: with only one column left, R would otherwise return a vector
colnames(small.df)[1] <- "wing_span" #we have to provide the column name again

We apply the t.test function to this subset of the data set.

t.test(small.df$wing_span)$conf.int

## [1] 74.84452 94.40347


## attr(,"conf.level")
## [1] 0.95

t.test(small.df$wing_span, conf.level = 0.99)$conf.int

## [1] 71.05064 98.19735


## attr(,"conf.level")
## [1] 0.99

t.test(small.df$wing_span, conf.level = 0.75)$conf.int

## [1] 79.15178 90.09621


## attr(,"conf.level")
## [1] 0.75

First, you can see that we obtained a much wider interval with 15 samples than with the 15,000 data points. And the 99% CI is much wider than the 95% CI, whereas with n = 15,000 the increase was not that large.

Finally, you can also see for yourself that, based on a sample size of 15 and a lower confidence level, we have missed the true mean of the population, which we had set at 95.2. So here you have it: you are only 75% confident that the interval [79.15, 90.10] contains the true population mean.

The take-home message is that whenever you have a small sample size, it is good practice to use a high confidence level (at least the 95% CI).

Skewness and kurtosis


Just to be complete, let’s see how to get these two measures in R.

The kurtosis of a normal distribution is 3. So, if kurtosis < 3, your data is platykurtic, and thus tends to produce fewer and less extreme outliers than the normal distribution. If kurtosis > 3, your data is leptokurtic, and thus tends to produce more outliers than the normal distribution.

We’ll use the package moments, because there is no base R function to calculate these parameters.

library(moments)

##
## Attaching package: 'moments'

## The following object is masked from 'package:modeest':


##
## skewness

We’ll use our two data sets, wing span and area, and we’ll also create a left-skewed data set.

ls<-as.data.frame(rbeta(15000,6,2)) #the left-skewed data set

skewness(df$wing_span)

## [1] -0.008557821

skewness(df2$log_area)

## [1] 4.264622

skewness(ls)

## rbeta(15000, 6, 2)
## -0.7042779

For wing span, the skewness is almost 0, so the distribution is symmetric, consistent with a normal distribution. The skewness of area is about 4.3 (>> 0), meaning the data are strongly right-skewed, and that of ls is < 0, meaning it is left-skewed.

Now, the kurtosis.

kurtosis(df$wing_span)

## [1] 2.930967

kurtosis(df2$log_area)

## [1] 32.80239

kurtosis(ls)

## rbeta(15000, 6, 2)
## 3.119814

Here, the kurtosis of wing span is almost 3, consistent with a normal distribution, but that of area is 32.80, meaning it has a heavy tail (far more extreme values than a normal distribution). The kurtosis of ls is about 3.12, not that far from 3, meaning it has only a few more extreme values in its tails than a normal distribution.

Remember, skewness mostly makes sense to calculate if you have a large data set (n>30).

Describing categorical data


As usual, let’s first generate a data set. Here, let’s pretend we have observations of color morphs, size category and sex of leopard cats (Prionailurus bengalensis), associated with habitat.

Color, sex and habitat are nominal data, and size category is ordinal data.

💪 Advanced
It is pretty easy to simulate a data set with correlated continuous variables, but not quite as easy with nominal data. So, we’ll MacGyver it: it’s a pretty ugly bit of code and I’m sure there are better ways to do it, but it will do the job, and that’s all we need.

The idea is to obtain n = 500 rows of data with habitat, color morph, size category and sex, with color morph depending on habitat, and both size category and sex independent. We could do it all by hand, but let’s give the data set some random aspect. We’ll generate 5,000 rows of data with the associations we need, and at the end we’ll randomly sample 500 rows out of these 5,000, each with an equal probability of being selected. So, although the rows are selected randomly, the equal selection probability means that if we have 50% forest with 80% red morph and 20% grey morph in the initial data set, these proportions will be similar in the resulting final data set.

library(ggcorrplot)
library(dplyr)
library(tidyr)

f<- rep("forest", 2500)


e<-rep("edge", 1000)
a<-rep("agricultural", 1500)

f.col<-as.data.frame(cbind(f, c(rep("red", 2000), rep("grey", 500))))


colnames(f.col)[1] <- "hab"
e.col<-as.data.frame(cbind(e, c(rep("red", 300), rep("grey", 650), rep("tawny", 50))))
colnames(e.col)[1] <- "hab"
a.col<- as.data.frame(cbind(a, c(rep("grey", 450), rep("tawny", 1050))))
colnames(a.col)[1] <- "hab"

h.col<-rbind(f.col, e.col, a.col)

Here we have the first part of our data set, with columns habitat and color. The rep function replicates elements of vectors and lists. We first created a different vector for each habitat, because we need to match them with a specific proportion of color morphs. For example, we want forest to be 50% of all habitats, and we want a strong association of the red morph with forest: I chose 80% red and 20% grey.

We need to have the same column names in each data frame in order to rbind them (i.e., bind the different data sets row-wise) at the end.

Now, we’ll randomly sample 500 rows from this data set and add sex and size. The last line converts all character columns to factors.

set.seed(123)
df<-h.col[sample(nrow(h.col), 500), ]
colnames(df)[2] <- "color"

set.seed(1)
sex<-sample(rep(c("F", "M"), 250))
size<-as.factor(sample(rep(c("small", "medium", "large"), 167)))
size<- head(size, -1)

df<-cbind(df, sex, size)


df <- as.data.frame(unclass(df),
stringsAsFactors = TRUE)

End of Advanced 💪
1. Frequency
Frequency is the most common way to summarize categorical data. We will basically summarize data counts in contingency tables, which compute counts for each category of the variables. This is useful, notably, to check whether your data is balanced. There are many ways to do this, but I will show you two: the simple xtabs function, and dplyr (an awesome package for data wrangling).

xtabs(~ color, data = df)

## color
## grey red tawny
## 152 237 111

xtabs(~ hab + color, data = df)

## color
## hab grey red tawny
## agricultural 33 0 104
## edge 67 33 7
## forest 52 204 0

xtabs(~ hab + sex, data = df)

## sex
## hab F M
## agricultural 69 68
## edge 53 54
## forest 128 128

xtabs(~ color + size, data = df)

## size
## color large medium small
## grey 37 58 57
## red 88 81 68
## tawny 42 28 41

tab<-xtabs(~ hab + color +sex, data = df)


ftable(tab)

## sex F M
## hab color
## agricultural grey 18 15
## red 0 0
## tawny 51 53
## edge grey 31 36
## red 21 12
## tawny 1 6
## forest grey 21 31
## red 107 97
## tawny 0 0

tab2<-xtabs(~ sex + hab + color, data = df)


ftable(tab2)

## color grey red tawny


## sex hab
## F agricultural 18 0 51
## edge 31 21 1
## forest 21 107 0
## M agricultural 15 0 53
## edge 36 12 6
## forest 31 97 0

In the last couple of lines, we produced a 3-D table, which can be pretty obnoxious to look at. ftable() reorganizes the results in a more readable way. You can also change the order of the variables.

And same thing with dplyr.

df %>%
count(hab, color)

## hab color n
## 1 agricultural grey 33
## 2 agricultural tawny 104
## 3 edge grey 67
## 4 edge red 33
## 5 edge tawny 7
## 6 forest grey 52
## 7 forest red 204

df %>%
count(hab, color, sex)

## hab color sex n


## 1 agricultural grey F 18
## 2 agricultural grey M 15
## 3 agricultural tawny F 51
## 4 agricultural tawny M 53
## 5 edge grey F 31
## 6 edge grey M 36
## 7 edge red F 21
## 8 edge red M 12
## 9 edge tawny F 1
## 10 edge tawny M 6
## 11 forest grey F 21
## 12 forest grey M 31
## 13 forest red F 107
## 14 forest red M 97

df %>%
count(hab, color, sex, size)

## hab color sex size n


## 1 agricultural grey F large 3
## 2 agricultural grey F medium 4
## 3 agricultural grey F small 11
## 4 agricultural grey M large 5
## 5 agricultural grey M medium 4
## 6 agricultural grey M small 6
## 7 agricultural tawny F large 16
## 8 agricultural tawny F medium 18
## 9 agricultural tawny F small 17
## 10 agricultural tawny M large 24
## 11 agricultural tawny M medium 9
## 12 agricultural tawny M small 20
## 13 edge grey F large 7
## 14 edge grey F medium 16
## 15 edge grey F small 8
## 16 edge grey M large 9
## 17 edge grey M medium 16
## 18 edge grey M small 11
## 19 edge red F large 11
## 20 edge red F medium 2
## 21 edge red F small 8
## 22 edge red M large 8
## 23 edge red M medium 1
## 24 edge red M small 3
## 25 edge tawny F small 1
## 26 edge tawny M large 2
## 27 edge tawny M medium 1
## 28 edge tawny M small 3
## 29 forest grey F large 7
## 30 forest grey F medium 7
## 31 forest grey F small 7
## 32 forest grey M large 6
## 33 forest grey M medium 11
## 34 forest grey M small 14
## 35 forest red F large 33
## 36 forest red F medium 42
## 37 forest red F small 32
## 38 forest red M large 36
## 39 forest red M medium 36
## 40 forest red M small 25

dplyr uses this pipe syntax and has a plethora of very convenient, relatively intuitive functions. We’ll see more of this package soon. As you can see here, the output is straightforward: each variable is in a column, and each row gives the count of a unique combination.

2. Proportions
Another relatively common way to summarize and present categorical data is to produce contingency tables with the proportions of each level or level combination. To get the tables of proportions, we can simply feed our xtabs tables from above to the prop.table function.

prop.table(xtabs(~ color, data = df))

## color
## grey red tawny
## 0.304 0.474 0.222

prop.table(xtabs(~ hab + color, data = df))

## color
## hab grey red tawny
## agricultural 0.066 0.000 0.208
## edge 0.134 0.066 0.014
## forest 0.104 0.408 0.000

prop.table(xtabs(~ color + sex, data = df))

## sex
## color F M
## grey 0.140 0.164
## red 0.256 0.218
## tawny 0.104 0.118

prop.table(xtabs(~ size + sex, data = df))

## sex
## size F M
## large 0.154 0.180
## medium 0.178 0.156
## small 0.168 0.164

ftable(round(prop.table(tab), 2))

## sex F M
## hab color
## agricultural grey 0.04 0.03
## red 0.00 0.00
## tawny 0.10 0.11
## edge grey 0.06 0.07
## red 0.04 0.02
## tawny 0.00 0.01
## forest grey 0.04 0.06
## red 0.21 0.19
## tawny 0.00 0.00

ftable(round(prop.table(tab2), 2))

## color grey red tawny


## sex hab
## F agricultural 0.04 0.00 0.10
## edge 0.06 0.04 0.00
## forest 0.04 0.21 0.00
## M agricultural 0.03 0.00 0.11
## edge 0.07 0.02 0.01
## forest 0.06 0.19 0.00

🍀 If you don’t want R to print a bazillion decimals, you can use the round function and set the number of decimals you’d like.
And the same thing with dplyr.

df %>%
count(hab, color) %>%
mutate(prop = n / sum(n))

## hab color n prop


## 1 agricultural grey 33 0.066
## 2 agricultural tawny 104 0.208
## 3 edge grey 67 0.134
## 4 edge red 33 0.066
## 5 edge tawny 7 0.014
## 6 forest grey 52 0.104
## 7 forest red 204 0.408

df %>%
count(hab, color, sex) %>%
mutate(prop = n / sum(n))

## hab color sex n prop


## 1 agricultural grey F 18 0.036
## 2 agricultural grey M 15 0.030
## 3 agricultural tawny F 51 0.102
## 4 agricultural tawny M 53 0.106
## 5 edge grey F 31 0.062
## 6 edge grey M 36 0.072
## 7 edge red F 21 0.042
## 8 edge red M 12 0.024
## 9 edge tawny F 1 0.002
## 10 edge tawny M 6 0.012
## 11 forest grey F 21 0.042
## 12 forest grey M 31 0.062
## 13 forest red F 107 0.214
## 14 forest red M 97 0.194

df %>%
count(hab, color, sex, size) %>%
mutate(prop = n / sum(n))

## hab color sex size n prop


## 1 agricultural grey F large 3 0.006
## 2 agricultural grey F medium 4 0.008
## 3 agricultural grey F small 11 0.022
## 4 agricultural grey M large 5 0.010
## 5 agricultural grey M medium 4 0.008
## 6 agricultural grey M small 6 0.012
## 7 agricultural tawny F large 16 0.032
## 8 agricultural tawny F medium 18 0.036
## 9 agricultural tawny F small 17 0.034
## 10 agricultural tawny M large 24 0.048
## 11 agricultural tawny M medium 9 0.018
## 12 agricultural tawny M small 20 0.040
## 13 edge grey F large 7 0.014
## 14 edge grey F medium 16 0.032
## 15 edge grey F small 8 0.016
## 16 edge grey M large 9 0.018
## 17 edge grey M medium 16 0.032
## 18 edge grey M small 11 0.022
## 19 edge red F large 11 0.022
## 20 edge red F medium 2 0.004
## 21 edge red F small 8 0.016
## 22 edge red M large 8 0.016
## 23 edge red M medium 1 0.002
## 24 edge red M small 3 0.006
## 25 edge tawny F small 1 0.002
## 26 edge tawny M large 2 0.004
## 27 edge tawny M medium 1 0.002
## 28 edge tawny M small 3 0.006
## 29 forest grey F large 7 0.014
## 30 forest grey F medium 7 0.014
## 31 forest grey F small 7 0.014
## 32 forest grey M large 6 0.012
## 33 forest grey M medium 11 0.022
## 34 forest grey M small 14 0.028
## 35 forest red F large 33 0.066
## 36 forest red F medium 42 0.084
## 37 forest red F small 32 0.064
## 38 forest red M large 36 0.072
## 39 forest red M medium 36 0.072
## 40 forest red M small 25 0.050

So, a very similar table to the count one, just with proportions instead.

3. Marginals
Marginals are the total counts or percentages across the columns or rows of a contingency table. I don’t see them provided very often, but you may need them, so let’s take our earlier example of fur color counts and create a table with it; then we’ll ask R to give us the marginals.

tab1<- xtabs(~ hab + color, data = df)


tab1

## color
## hab grey red tawny
## agricultural 33 0 104
## edge 67 33 7
## forest 52 204 0

margin.table(tab1, 1)

## hab
## agricultural edge forest
## 137 107 256

margin.table(tab1, 2)

## color
## grey red tawny
## 152 237 111

On the first line, we asked R to give us the marginals for the rows (margin code 1), and on the second line the marginals for the columns (margin code 2).

It goes the same way with the proportions.

prop1<- prop.table(xtabs(~ color + sex, data = df))


prop1

## sex
## color F M
## grey 0.140 0.164
## red 0.256 0.218
## tawny 0.104 0.118

margin.table(prop1, 1)

## color
## grey red tawny
## 0.304 0.474 0.222

margin.table(prop1, 2)

## sex
## F M
## 0.5 0.5

For proportions, you can also define the margin argument in the prop.table() function directly.

xtabs(~ color + sex, data = df)

## sex
## color F M
## grey 70 82
## red 128 109
## tawny 52 59

prop.table(xtabs(~ color + sex, data = df), margin = 1)

## sex
## color F M
## grey 0.4605263 0.5394737
## red 0.5400844 0.4599156
## tawny 0.4684685 0.5315315

The sum of values in the first row of the original table is: 70+82 = 152, in the second row 128 + 109 = 237, and in the third row 52 + 59 = 111.

The output shows each individual value as a proportion of the row sum. For example: cell [1, 1] = 70/152 = 0.4605, cell [1, 2] = 82/152 = 0.5395,
cell [3, 1] = 52/111 = 0.4684, etc.
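You can check those row totals directly (a quick sketch):

rowSums(xtabs(~ color + sex, data = df)) #152, 237 and 111: the denominators used when margin = 1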

And it works the same way for margin = 2.

prop.table(xtabs(~ color + sex, data = df), margin = 2)

## sex
## color F M
## grey 0.280 0.328
## red 0.512 0.436
## tawny 0.208 0.236

The output shows each individual value as a proportion of the column sum. For example, total for Females: 70+128+52 = 250, so cell [1,1] =
70/250 = 0.28, etc…

4. Median and mode


The median can only be obtained for ordinal data, and works the same way as for quantitative data. You just need to tell R that your factor is ordered.

library(forcats)
df$size <- factor(df$size, levels=c('small', 'medium', 'large')) #set the level order
df<- df %>%
  mutate(size,
         factor(size, ordered = TRUE)) #add a copy of size stored as an ordered factor; the expression becomes its (awkward) column name
library(missMethods) #loaded so that median() works on the ordered factor
median(df$`factor(size, ordered = TRUE)`)

## [1] medium
## Levels: small < medium < large

Getting the mode of categorical data works exactly the same way as for quantitative data.

Mode(df$hab)

## [1] forest
## Levels: agricultural edge forest

Mode(df$color)

## [1] red
## Levels: grey red tawny

Easy! However, always check the frequency tables, because even when there is no mode, the function may still find one.

Mode(df$size)

## [1] large
## Levels: small medium large

The result indicates that large is the mode, but it is not actually more frequent than medium or small.
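You can confirm this with a quick frequency check (a sketch; the exact counts depend on the random sampling above, but the three levels occur almost equally often):

table(df$size) #no level clearly dominates, so there is no real mode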

5. Variation ratio
The variation ratio is the proportion of observations that deviate from the mode.

#install.packages("foreign")
library(foreign)

prop.table(xtabs(~ size, data = df))

## size
## small medium large
## 0.332 0.334 0.334

prop.table(xtabs(~ hab, data = df))

## hab
## agricultural edge forest
## 0.274 0.214 0.512

tab<-table(df$size)
tabh<-table(df$hab)

#Determine Variation Ratio


VR<-1-max(tab)/sum(tab)
VR

## [1] 0.666

VRh<-1-max(tabh)/sum(tabh)
VRh

## [1] 0.488

Here, we checked with xtabs whether size and habitat were unimodal. Because size is non-modal (i.e., there is no mode, because we generated the three levels with equal probability) and habitat is unimodal, the variation ratio (VR) is pretty straightforward to calculate.

Here, for size, we find that 66.6% of the data (about 2/3) differs from the most frequent level (logical, since each level represents about 33.3%). But this is a particular case, so we could instead report that the VR = 100%. There isn’t much published about how to deal with this issue, but based on the definition it should be 100%: since there is no mode, 100% of the data does not belong to the mode.

For habitat, we find that 48.8% of the data differs from the most common value (forest, which represents 51.2% of all habitats).

Now let’s create a small data set with bimodal categorical data.

Let’s imagine, we have observations of butterfly colors per site.

butterfly<-data.frame(site = c(rep("site1", 5),
                               rep("site2", 5),
                               rep("site3", 5)),
                      color = sample(c(rep("yellow", 5),
                                       rep("blue", 5),
                                       rep("orange", 3),
                                       rep("purple", 2))))

For the second column, sample randomizes the order of the colors.

So both yellow and blue appear 5 times in the data frame, meaning that both are modes. Let’s get the VR.

tab.b<-table(butterfly$color)
tab.b

##
## blue orange purple yellow
## 5 3 2 5

#Determine maximum frequency and how often it occurs


maxFreq <- max(tab.b)
maxCount <- sum(tab.b==maxFreq)

#Determine the Variation Ratio


VR.b<-1-maxCount*maxFreq/sum(tab.b)
VR.b

## [1] 0.3333333

As expected, the VR indicates that 1/3 of the data differs from the modes. As you can see, the code to get the VR is similar to the previous one, except that we had to account for the fact that the mode groups two levels of color; hence, we multiplied the maximum frequency by the number of modes (in this case 2, but the same code works if you have more than two modes).

6. Coefficient of unalikeability
Unalikeability is a measure of variation for categorical variables: it measures the probability of drawing two non-equal values at random. The smaller the coefficient, the less variation you have.

#install.packages("remotes")
#remotes::install_github("raredd/ragree")
library(ragree)
unalike(df$hab)

## [1] 0.616984

unalike(df$sex)

## [1] 0.5

unalike(df$color)

## [1] 0.633624

unalike(butterfly$color)

## [1] 0.72
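For intuition, the coefficient can also be computed by hand as one minus the sum of the squared level proportions (a sketch; you can verify it reproduces the unalike value for df$hab above):

p <- prop.table(table(df$hab)) #proportion of each habitat level
1 - sum(p^2) #should match unalike(df$hab)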
