
Iris Dataset

Allan Lao
2023-09-26

Iris Dataset in R
The iris dataset is a built-in dataset in R that contains measurements of 4 attributes (in centimeters) for 150 flowers: 50 from each of 3
species.

To explore the dataset, we can describe it statistically or visualize it with charts.

Load the Iris Dataset


Since the iris dataset ships with R, no download or import is needed; data(iris) loads it into the workspace.

data(iris)

Explore the Structure of the dataset


The first step is to examine the data structure to determine the number of rows, the number of columns and their data types. The order in which
you look at these is up to the analyst.

Structure
The structure of the dataset

str(iris)

## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

str() shows the structure of the dataset: the number of observations (records), the number of variables, and the data type of each variable. The
iris dataset has 150 rows and 5 columns. Note that the Species variable is a Factor, which is how R stores categorical data.
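As a small cross-check of what str() reports, the sketch below pulls out the class of each column and the levels of the factor. The variable names col_classes and species_levels are illustrative only and are not part of the original walkthrough.

```r
# Class of every column: four numeric columns and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# A factor stores categories as integer codes plus a set of labels (levels)
species_levels <- levels(iris$Species)
print(species_levels)
```

The levels appear in the same order as the integer codes shown in the str() output, so code 1 corresponds to the first level.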

The dimension

dim(iris)

## [1] 150 5

The names of the columns

names(iris)

## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
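The same information can be pulled apart into single values, which is handy when it is needed later in a script. This is a minimal sketch; n_rows, n_cols and sepal_cols are illustrative names.

```r
n_rows <- nrow(iris)   # number of observations
n_cols <- ncol(iris)   # number of variables
cat("rows:", n_rows, "columns:", n_cols, "\n")

# Column names are plain strings, so they can be searched by pattern
sepal_cols <- grep("^Sepal", names(iris), value = TRUE)
print(sepal_cols)
```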

To take a glimpse at the first 4 rows:

head(iris,4)

    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           <dbl>       <dbl>        <dbl>       <dbl> <fct>
1            5.1         3.5          1.4         0.2 setosa
2            4.9         3.0          1.4         0.2 setosa
3            4.7         3.2          1.3         0.2 setosa
4            4.6         3.1          1.5         0.2 setosa

Optionally, you may also check the last 6 records (6 is the default for both head() and tail()):

tail(iris)

    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           <dbl>       <dbl>        <dbl>       <dbl> <fct>
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

Describe the Iris Dataset using Statistical tools


Now, let's use some statistics to describe the dataset.

The descriptive statistics summary

summary(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##

For each of the numeric variables we can see the following information:

Min: The minimum value.
1st Qu: The value of the first quartile (25th percentile).
Median: The median value.
Mean: The mean value.
3rd Qu: The value of the third quartile (75th percentile).
Max: The maximum value.
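The same figures can be computed one variable at a time. The sketch below does so for Sepal.Length with quantile() and mean(); x and five_points are illustrative names.

```r
x <- iris$Sepal.Length

# Minimum, quartiles and maximum in one call
five_points <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_points)

# The mean is computed separately
print(mean(x))
```

The values should agree with the Sepal.Length column of the summary() output above.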

For the only categorical variable in the dataset (Species) we see a frequency count of each value:

setosa: This species occurs 50 times.
versicolor: This species occurs 50 times.
virginica: This species occurs 50 times.
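For a categorical variable, the frequency counts can also be obtained directly with table(). A minimal sketch, with species_counts and species_share as illustrative names:

```r
# Frequency of each level of the factor
species_counts <- table(iris$Species)
print(species_counts)

# The same counts expressed as a share of all records
species_share <- prop.table(species_counts)
print(round(species_share, 3))
```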

Visualize the Iris Dataset


The plot() function is the generic function for plotting R objects.

plot(iris)

Plotting the entire dataset produces a scatterplot matrix that gives a glimpse of the relationships between its variables. For example, the panel
directly below the Sepal.Length label shows Sepal.Width on the y-axis and Sepal.Length on the x-axis.
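The scatterplot matrix is easier to read when the points are colored by species. The sketch below is one possible way to do that with pairs(), restricted to the four numeric columns; numeric_cols and species_code are illustrative names.

```r
# Keep only the four measurement columns
numeric_cols <- iris[, 1:4]

# Integer codes 1, 2, 3 for the three species, used as color indices
species_code <- as.integer(iris$Species)

pairs(numeric_cols,
      col = species_code,
      pch = 19,
      main = "Iris scatterplot matrix by species")
```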

Plot quantitative variables

plot(iris$Sepal.Length) #Quantitative

Plot 2 quantitative variables

plot(iris$Sepal.Width, iris$Sepal.Length,
col=factor(iris$Species),
main='Sepal Length vs Width',
xlab='Sepal Width',
ylab='Sepal Length',
pch=19)

legend(x = "topleft",
legend = levels(iris$Species),
col = seq_along(levels(iris$Species)), # same color codes that col=factor(...) gives the points
pch = 19, # same symbol as the points
text.font = 4,
text.col = "blue")


Plotting a Factor variable


The plot() function automatically detects the type of variable and chooses an appropriate chart by default. For a factor, it draws a bar chart of the counts per level.

plot(iris$Species)
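To make that default explicit, the bar chart can also be built by hand from the counts, which gives more control over labels and colors. This is a sketch; species_counts and bar_midpoints are illustrative names, and the titles are placeholders.

```r
# Count the records per species, then draw the bars explicitly
species_counts <- table(iris$Species)

bar_midpoints <- barplot(species_counts,
                         col = "steelblue",
                         main = "Records per Species",
                         xlab = "Species",
                         ylab = "Count")
```

barplot() returns the x positions of the bar midpoints, which can be used to add labels with text() if needed.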

Next, we will use a histogram to see how the data is spread across a range of values, in this case the distribution of Sepal Length.

hist(iris$Sepal.Length,
col='steelblue',
main='Histogram',
xlab='Length',
ylab='Frequency')
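The bins behind the histogram can be inspected without drawing anything, and the number of bins can be adjusted. A minimal sketch, where h is an illustrative name and breaks = 20 is an arbitrary choice:

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$breaks)   # bin boundaries
print(h$counts)   # number of flowers in each bin

# Redraw with finer bins; breaks is a suggestion that R may round
hist(iris$Sepal.Length,
     breaks = 20,
     col = 'steelblue',
     main = 'Histogram of Sepal Length (finer bins)',
     xlab = 'Length',
     ylab = 'Frequency')
```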

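A box plot shows the five-number summary: the minimum, the 25th percentile, the median, the 75th percentile and the maximum. It is thus
useful for visualizing the spread of the data and drawing inferences from it. Note that by default R's boxplot() extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond that as individual outlier points.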

Using boxplot(), we can compare the distribution of sepal length across species.

boxplot(Sepal.Length~Species,
data=iris,
main='Sepal Length by Species',
xlab='Species',
ylab='Sepal Length',
col='steelblue',
border='black')
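The numbers behind each box can be listed per species. The sketch below uses tapply() with fivenum(); five_by_species is an illustrative name. fivenum() returns Tukey's hinges, which can differ slightly from the quartiles reported by quantile().

```r
# Five-number summary of Sepal.Length for each species
five_by_species <- tapply(iris$Sepal.Length, iris$Species, fivenum)
print(five_by_species)

# Median only, for a quick comparison across species
median_by_species <- tapply(iris$Sepal.Length, iris$Species, median)
print(median_by_species)
```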
