
Exploratory Data Analysis

Part 3: Using R to Explore a Single Variable

Topic 02 - EDA ST1131 1 / 33


1 Boxplots

2 R for a Categorical Variable
    Frequency Table
    Bar Plot
    Pie Chart

3 R for a Quantitative Variable
    Numerical Summaries
    Graphical Summaries



A Five-Number Summary of the Sample

Definition 1 (Five-Number Summary)


The five-number summary of a dataset consists of the minimum, lower
quartile, median, upper quartile and the maximum.

It gives a good indication of the center and variability of a dataset.

It reduces huge datasets to just five numbers.
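
A minimal sketch of computing a five-number summary in R. The vector `x` is made-up illustration data, not one of the course datasets. Note that `fivenum()` uses Tukey's hinges, while `quantile()` uses a different default interpolation rule, so the two sets of quartiles can differ slightly.

```r
# Made-up sample for illustration only
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 15)

# Five-number summary via quantile(): min, Q1, median, Q3, max
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)

# Tukey's version; the quartiles ("hinges") may differ slightly from quantile()
print(fivenum(x))
```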



Boxplots

Boxplots are a visual representation of the five numbers in the Five Number
Summary.

They identify the min, max, median, lower and upper quartiles, and point
out the outliers (if any).

An outlier is an observation that is very different from the majority of the data.



Outlier

Definition 2 (Outliers)
An observation is an outlier if it is smaller than Q1 − 1.5 × IQR or larger than
Q3 + 1.5 × IQR.
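
The rule in Definition 2 can be sketched directly in R. The data and the names `lower_fence` and `upper_fence` are assumptions made for this illustration. The quartiles here come from `quantile()`; R's `boxplot()` uses hinges instead, so its cutoffs can differ slightly on some datasets.

```r
# Made-up sample for illustration only; 40 is deliberately extreme
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1          # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```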



Ingredients of a Boxplot

On the left are the elements of a boxplot.
The distance from the “max whisker reach” to the third quartile is exactly
1.5 × IQR. The whisker itself is drawn only as far as the most extreme
observation within that reach.



Boxplots Versus Histograms

A boxplot does not portray certain features of a distribution, such as
distinct mounds and possible gaps in the data.

If a distribution is indeed unimodal, then a boxplot might give an indication
of whether the distribution is skewed.

However, boxplots are useful for identifying potential outliers.
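
To illustrate the first point, here is a small sketch using simulated data (the seed, sample sizes and cluster centers are arbitrary choices for this example). The histogram shows two separate mounds, while the boxplot of the same values shows a single box.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible
# Simulated bimodal sample: two well-separated normal clusters
x <- c(rnorm(100, mean = 10, sd = 1), rnorm(100, mean = 20, sd = 1))

par(mfrow = c(1, 2))                          # two plots side by side
h <- hist(x, main = "Histogram", xlab = "x")  # the two mounds are visible
boxplot(x, main = "Boxplot")                  # one box; the mounds are hidden
par(mfrow = c(1, 1))                          # restore the default layout
```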



What to Say

When presented with a boxplot, be sure to do the following:


Report the median as accurately as you can.

If there are outliers, mention how many there are, and on which side of the
median they are.

If you are presented with more than one boxplot in a figure, try to compare
their medians and inter-quartile ranges.
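
When comparing boxplots, the medians and inter-quartile ranges can also be computed rather than read off the plot. A minimal sketch, assuming two hypothetical groups "A" and "B" with made-up scores:

```r
# Made-up scores for two hypothetical groups, for illustration only
scores <- c(12, 15, 15, 18, 20, 22, 25,    # group A
            10, 11, 19, 24, 28, 30, 35)    # group B
group  <- rep(c("A", "B"), each = 7)

# Side-by-side boxplots, one box per group
boxplot(scores ~ group, ylab = "Score", main = "Scores by group")

# Numbers to quote when comparing the boxes
medians <- tapply(scores, group, median)
iqrs    <- tapply(scores, group, IQR)
print(medians)
print(iqrs)
```

In this made-up example, group B has both the larger median and the larger inter-quartile range, so its scores are higher on the whole and more spread out.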



What we will learn

Create a frequency table for a single categorical variable

Label the categories if needed

Plot a bar plot (bar chart)

Plot a pie chart



Variable Gender

Consider the variable “Gender” from the lung cancer dataset.


> lung = read.csv("C:\\Data\\lung_cancer.csv", sep = ",", header = TRUE)
> #Smoke: 1 = yes; 0 = no
> #Gender: 1 = male; 0 = female
> #Cancer: 1 = yes; 0 = no.
> attach(lung)
> Gender
[1] 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1
> length(Gender)
[1] 34



Frequency Table of Gender
Frequency table:
> table(Gender)
Gender
 0  1
16 18
Table of proportions:
> prop.table(table(Gender))
Gender
        0         1
0.4705882 0.5294118
Table of percentages:
> prop.table(table(Gender))*100
Gender
       0        1
47.05882 52.94118
Label the Categories
We don’t want to see “0” and “1” in the output because they can be confusing.
We want to see what they actually mean: “Female” and “Male”.
Create a new variable “gender” to replace the original variable “Gender”:
> gender <- ifelse(Gender=="0","Female","Male") # for every observati
> # if 0 then label as Female, else label as Male
> table(gender)
gender
Female   Male
    16     18
Still use the original variable “Gender”, but replace its labels:
> Gender <- ifelse(Gender=="0","Female","Male")
> table(Gender)
Gender
Female   Male
    16     18
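
As an alternative to `ifelse()`, categories can be labeled with `factor()`, which maps each code to a label in one call. This is a sketch on a made-up vector `codes`, not the lung cancer data; the name `gender_f` is chosen only for this example.

```r
# Made-up 0/1 codes for illustration (not the lung cancer data)
codes <- c(1, 0, 1, 1, 0, 0, 1, 0)

# factor() pairs each level with a label, in the order given
gender_f <- factor(codes, levels = c(0, 1), labels = c("Female", "Male"))

freq <- table(gender_f)
print(freq)
print(round(prop.table(freq) * 100, 1))  # percentages, rounded to 1 decimal
```

One reason to prefer this form is that the level order is stated explicitly, so tables and plots list the categories in a chosen order rather than alphabetically.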


Bar Plot for Gender
The simplest bar plot:
> barplot(table(gender))



With a title, colors and axis labels:
> barplot(table(gender), ylab = "Frequency", xlab = "Gender",
+         col = c(2,5), main = "Bar plot of Gender")
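
A bar plot can also show percentages instead of frequencies. The sketch below uses a made-up vector `g` (not the lung cancer data) and relies on the fact that `barplot()` returns the x-coordinates of the bar midpoints, which lets `text()` place a value above each bar.

```r
# Made-up labels for illustration (not the lung cancer data)
g <- c("Male", "Female", "Male", "Male", "Female",
       "Female", "Male", "Female", "Male", "Male")

pct <- prop.table(table(g)) * 100     # percentage in each category

# barplot() returns the x-coordinates of the bar midpoints
mids <- barplot(pct, ylab = "Percentage", xlab = "Gender",
                col = c(2, 5), ylim = c(0, 100),
                main = "Bar plot of Gender (percentages)")

# Write each percentage just above its bar
text(mids, as.numeric(pct), labels = paste0(round(pct, 1), "%"), pos = 3)
```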



Pie Chart for Gender
The command is:
> pie(table(gender), col = c(2,5), main = "Pie chart of Gender")
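
Pie slices are easier to read when each label carries its percentage. A minimal sketch with made-up counts (the names `counts` and `slices` are chosen for this example only):

```r
# Made-up counts for illustration (not the lung cancer data)
counts <- c(Female = 4, Male = 6)

# Build slice labels of the form "Female (40%)"
pct    <- round(counts / sum(counts) * 100)
slices <- paste0(names(counts), " (", pct, "%)")

pie(counts, labels = slices, col = c(2, 5), main = "Pie chart of Gender")
```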



Try It Yourself

Get the frequency table for the variable Cancer, then the table of
proportions or percentages for this variable.

Label the categories as “No” and “Yes” instead of “0” and “1”.

Plot a bar plot and a pie chart for this variable, with a different color for
each category and a title for the plot.

Suppose a variable “Income” has 3 categories labeled by numbers: 1 = low,
2 = medium and 3 = high. How would you change the category labels from
numbers (1, 2, 3) to strings (“low”, “medium”, “high”, respectively)?



What we will learn

Get the numerical summaries of a quantitative variable.

Plot a histogram

Plot a boxplot



Some Basic Numerical Summaries
The minimum, maximum, mean, median, three quartiles, IQR, range, variance and
standard deviation are the basic numerical summaries of a quantitative variable.
We consider the variable “mark” from the dataset of midterm marks of 98 students
(midterm_marks).
Read dataset into R:
> data<- read.csv("C:/Data/midterm_marks")
> mark<- data[,2]
> # data will have 2 columns: first column is the index and
> # second column is the marks.

Summary command
> summary(mark)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.50   12.12   18.25   17.50   24.00   28.00
Some Basic Numerical Summaries

A list of basic numerical summaries:


> summaries = c(min(mark),max(mark),mean(mark), median(mark),
+ quantile(mark, 0.3), IQR(mark), range(mark), var(mark), sd(mark))
> summaries
                                              30%
 0.500000 28.000000 17.500000 18.250000 13.500000 11.875000  0.500000

28.000000 53.804124  7.335129
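
Note that `quantile(mark, 0.3)` in the call above returns the 30th percentile, and `range(mark)` returns two values (the minimum and the maximum). The sketch below shows one way to get the three quartiles and a labeled result. It uses a made-up vector `m`, not the midterm dataset, and reporting the range as the single number max − min is a choice made for this example.

```r
# Made-up marks for illustration (not the midterm dataset)
m <- c(5, 8, 12, 14, 17, 18, 21, 24, 26, 28)

# The three quartiles in one call
quartiles <- quantile(m, probs = c(0.25, 0.5, 0.75))

# Naming each entry makes the printed vector self-explanatory
summaries <- c(min = min(m), max = max(m), mean = mean(m),
               quartiles, IQR = IQR(m),
               range = diff(range(m)),  # range as a single number: max - min
               var = var(m), sd = sd(m))
print(round(summaries, 3))
```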



Forming Histograms

We once again consider the variable of midterm marks.


The simplest histogram:
> hist(mark)
> # this produces a histogram with grey bars showing frequencies

Histogram on the probability (density) scale, with more information added:


> # hist(mark, prob = TRUE, col = 2, xlab = "Midterm Marks", ylab = "
> # main = "Histogram of the Midterm Marks")
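
The difference between the frequency and probability scales can be checked numerically. This sketch uses a made-up vector `m` (not the midterm dataset) with class intervals chosen for the example. The axis label "Density" is this sketch's own choice.

```r
# Made-up marks for illustration (not the midterm dataset)
m <- c(5, 8, 12, 14, 17, 18, 21, 24, 26, 28)

# plot = FALSE returns the histogram object without drawing it
h <- hist(m, breaks = seq(0, 30, by = 5), plot = FALSE)
print(h$counts)    # frequency in each class interval
print(h$density)   # bar heights used when prob = TRUE

# On the density scale, the bar areas (height x width) sum to 1
print(sum(h$density * diff(h$breaks)))

# Draw the same histogram on the density scale
plot(h, freq = FALSE, col = 2, xlab = "Marks", ylab = "Density",
     main = "Histogram of made-up marks")
```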



The Histograms



Forming Boxplots

Basic information to sketch a boxplot:


> boxplot.stats(mark)

The simplest boxplot:


> boxplot(mark)

Boxplot with more information:


> boxplot(mark, ylab = "Midterm marks",
+         main = "Boxplot of midterm marks", col = 5)
> abline(h = median(mark)) # to add a line at the median value
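
To show what `boxplot.stats()` returns, here is a sketch on a made-up vector `m` (not the midterm dataset) that contains one deliberately extreme value. The component names `stats`, `n` and `out` are part of the list that `boxplot.stats()` returns.

```r
# Made-up marks for illustration; 60 is deliberately extreme
m <- c(5, 8, 12, 14, 17, 18, 21, 24, 26, 60)

bs <- boxplot.stats(m)
print(bs$stats)  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bs$n)      # number of non-missing observations
print(bs$out)    # points beyond the whiskers, drawn individually by boxplot()

# The corresponding plot, with a dashed line at the median
boxplot(m, ylab = "Marks", main = "Boxplot of made-up marks", col = 5)
abline(h = median(m), lty = 2)
```

Here the upper whisker stops at 26, the largest observation within 1.5 × IQR of the upper hinge, and 60 is reported as an outlier.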



The Boxplots



Try It Yourself

Let’s use the variable BMI as given in the code file.


Get the numerical summaries for this variable.

Form a histogram (by frequency and then by probability) with color and a title.

Form a boxplot and identify whether this variable has any outliers.

Comment on whether the distribution of BMI is symmetric.



THANK YOU!
