Class Notes - DataCamp - Visualization in Higher Dimensions


Visualization in higher dimensions


Ex.

Choose measures for center and spread


Consider the density plots shown here. What are the most appropriate measures to
describe their centers and spreads? In this exercise, you'll select the measures and
then calculate them.

Instructions.

Using the shapes of the density plots, calculate the most appropriate measures of
center and spread for the following:

● The distribution of life expectancy in the countries of the Americas. Note you'll
need to apply a filter here.
● The distribution of country populations across the entire gap2007 dataset.

Answer.
# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

Important!
Like mean and standard deviation, median and IQR measure the central tendency and
spread, respectively, but are robust to outliers and non-normal data.
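
As a quick illustration (toy numbers, not from the course data): one extreme value drags the mean and inflates the standard deviation, but barely moves the median and IQR.

# Toy vector, then the same vector with one extreme outlier
x <- c(48, 50, 52, 53, 55)
mean(x); sd(x)            # 51.6 and about 2.7
median(x); IQR(x)         # 52 and 3
x_out <- c(x, 500)
mean(x_out); sd(x_out)    # both jump dramatically
median(x_out); IQR(x_out) # nearly unchanged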

Ex.

Calculate spread measures


Let's extend the powerful group_by() and summarize() syntax to measures of spread. If
you're unsure whether you're working with symmetric or skewed distributions, it's a
good idea to consider a robust measure like IQR in addition to the usual measures of
variance or standard deviation.

Instructions.

The gap2007 dataset that you created in an earlier exercise is available in your
workspace.

● For each continent in gap2007, summarize life expectancies using the sd(), the
IQR(), and the count of countries, n(). No need to name the new columns
produced here. The n() function within your summarize() call does not take any
arguments.
● Graphically compare the spread of these distributions by constructing overlaid
density plots of life expectancy broken down by continent.

Answer.
# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Ex.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Ex.

3 variable plot
Faceting is a valuable technique for looking at several conditional distributions at the
same time. If the faceted distributions are laid out in a grid, you can consider the
association between a variable and two others, one on the rows of the grid and the
other on the columns.

Instructions.

common_cyl, which you created to contain only cars with 4, 6, or 8 cylinders, is available in your workspace.

● Using common_cyl, create a histogram of hwy_mpg.


● Grid-facet the plot rowwise by ncyl and columnwise by suv.
● Add a title to your plot to indicate what variables are being faceted on.

Answer.
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("Mileage by suv and ncyl")

Exploring numerical data


Ex.

Plot selection
Consider two other columns in the cars dataset: city_mpg and width. Which is the
most appropriate plot for displaying the important features of their distributions?
Remember, both density plots and box plots display the central tendency and spread
of the data, but the box plot is more robust to outliers.

Instructions.

Use density plots or box plots to construct the following visualizations. For each
variable, try both plots and submit the one that is better at capturing the important
structure.

● Display the distribution of city_mpg.


● Display the distribution of width.

Answer.
# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

# Create plot of width
cars %>%
  ggplot(aes(x = width)) +
  geom_density()

Ex.
# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

Ex.
# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

Ex.
# Create hist of horsepwr
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower")

# Create hist of horsepwr for affordable cars
cars %>%
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for cars under $25,000")

Ex.
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

Ex.

Faceted histogram
In this chapter, you'll be working with the cars dataset, which records characteristics
of all of the new models of cars for sale in the US in a certain year. You will
investigate the distribution of mileage across a categorical variable, but before you get
there, you'll want to familiarize yourself with the dataset.

Instructions.

The cars dataset has been loaded in your workspace.

● Load the ggplot2 package.
● View the size of the data and the variable types using str().
● Plot a histogram of city_mpg faceted by suv, a logical variable indicating
whether the car is an SUV or not.

# Load package
library(ggplot2)

# Learn data structure
str(cars)

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)

Important!
In this exercise, you faceted by the suv variable, but it's important to note that you
can facet a plot by any categorical variable using facet_wrap(). Nice job!
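
For instance, the same histogram could be faceted by the number of cylinders instead (a quick sketch, using the ncyl column that appears in other exercises in these notes):

# Facet the same histogram by a different categorical variable
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ ncyl)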

Distribution of one variable


Ex.
# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
  geom_bar(fill = "chartreuse") +
  theme(axis.text.x = element_text(angle = 90))

# Alternative solution to finding levels
# lev <- unlist(select(arrange(cnt, desc(n)), flavor))

Ex.

Marginal barchart
If you are interested in the distribution of alignment of all superheroes, it makes
sense to construct a barchart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some
sensible ordering. Superheroes that are "Neutral" show an alignment between "Good"
and "Bad", so it makes sense to put that bar in the middle.

Instructions.
● Reorder the levels of align using the factor() function so that printing them
reads "Bad", "Neutral", then "Good".
● Create a barchart of counts of the align variable.

Answer.
# Change the order of the levels in align
comics$align <- factor(comics$align,
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) +
  geom_bar()

Ex.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
  geom_bar() +
  facet_wrap(~ gender)

Counts vs. proportions

Ex.

Counts vs. proportions (2)


Bar charts can tell dramatically different stories depending on whether they represent
counts or proportions and, if proportions, what the proportions are conditioned on.
To demonstrate this difference, you'll construct two barcharts in this exercise: one of
counts and one of proportions.

Instructions.
● Create a stacked barchart of gender counts with align on the x-axis.
● Create a stacked barchart of gender proportions with align on the x-axis by
setting the position argument to geom_bar() equal to "fill".

Answer.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "fill") +
  ylab("proportion")

Important!

Excellent work! By adding position = "fill" to geom_bar(), you are saying you want the
bars to fill the entire height of the plotting window, thus displaying proportions and
not raw counts.

Ex.

Side-by-side barcharts
While a contingency table represents the counts numerically, it's often more useful to
represent them graphically.

Here you'll construct two side-by-side barcharts of the comics data. This shows that
there can often be two or more options for presenting the same data. Passing the
argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not
stacked) barchart.

Instructions.
● Load the ggplot2 package.
● Create a side-by-side barchart with align on the x-axis and gender as the fill
aesthetic.
● Create another side-by-side barchart with gender on the x-axis and align as the
fill aesthetic. Rotate the axis labels 90 degrees to help readability.

Answer.
# Load ggplot2
library(ggplot2)

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Ex.
Instructions.
● Load the dplyr package.
● Print tab to find out which level of align has the fewest total entries.
● Use filter() to filter out all rows of comics with that level, then drop the unused
level with droplevels(). Save the simplified dataset as comics_filtered.

Answer.
# Load dplyr
library(dplyr)

# Print tab
tab

# Remove align level
comics_filtered <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

# See the result
comics_filtered

Recode a variable

Exploring categorical data

Ex.
# Print the first rows of the data
comics

# Check levels of align
levels(comics$align)

# Check the levels of gender
levels(comics$gender)

# Create a 2-way contingency table
table(comics$gender, comics$align)

Console.

          Bad Good Neutral Reformed Criminals
  Female 1573 2490     836                  1
  Male   7561 4809    1799                  2
  Other    32   17      17                  0

29/07/2020
Variables in the data

Principles of experimental design

Explanatory variables are conditions you can impose on the experimental units (light, noise, ...).

Blocking variables are characteristics that the experimental units come with that you would
like to control (gender, experience in R, ...).

Response variable: the outcome measured on each experimental unit.

In random sampling, we use stratifying to control for a variable. In random assignment,
we use blocking to achieve the same goal.
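
A minimal sketch of blocking during random assignment (the subjects data frame and its gender column are hypothetical here):

# Randomly assign a balanced treatment within each block
library(dplyr)
set.seed(1)
subjects %>%
  group_by(gender) %>%   # block on gender
  mutate(treatment = sample(rep(c("control", "treatment"), length.out = n())))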

Stratified sample in R
Ex.
# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  count(region)

Ex.
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)

# Count states by region
states_srs %>%
  count(region)

23/07/2020
Data in R
Ex.
ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")

Ex.
# Load packages
library(dplyr)

# Count number of male and female applicants admitted
ucb_admit %>%
  count(Gender, Admit)

Important! Passing several arguments to count() gives you the number of rows for each
combination of those arguments.
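
For example, with the built-in mtcars data:

# One row per (cyl, am) combination, with a count column n
library(dplyr)
mtcars %>%
  count(cyl, am)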

Ex.

Fast parsing with fasttime


The fasttime package provides a single function fastPOSIXct(), designed to read in
datetimes formatted according to ISO 8601. Because it only reads in one format, and
doesn't have to guess a format, it is really fast!

You'll see how fast in this exercise by comparing how quickly it reads in the dates from
the Auckland hourly weather data (over 17,000 dates) to lubridate's ymd_hms().

To compare run times you'll use the microbenchmark() function from the package of
the same name. You pass in as many arguments as you want, each being an
expression to time.

Instructions.

We've loaded the datetimes from the Auckland hourly data as strings into the vector
dates.

● Examine the structure of dates to verify it is a string and in the ISO 8601
format.
● Parse dates with fasttime and pipe to str() to verify fastPOSIXct parses them
correctly.
● Now to compare timing, call microbenchmark where the first argument uses
ymd_hms() to parse dates and the second uses fastPOSIXct().

Answer.
library(microbenchmark)
library(fasttime)

# Examine structure of dates
str(dates)

# Use fastPOSIXct() to parse dates
fastPOSIXct(dates) %>% str()

# Compare speed of fastPOSIXct() to ymd_hms()
microbenchmark(
  ymd_hms = ymd_hms(dates),
  fasttime = fastPOSIXct(dates),
  times = 20)

Ex.

Times without dates


For this entire course, if you've ever had a time, it's always had an accompanying
date, i.e. a datetime. But sometimes you just have a time without a date.

If you find yourself in this situation, the hms package provides an hms class of object
for holding times without dates, and the best place to start would be with as.hms().

In fact, you've already seen an object of the hms class, but I didn't point it out to you.
Take a look in this exercise.

● Use read_csv() to read in "akl_weather_hourly_2016.csv". readr knows about the
hms class, so if it comes across something that looks like a time it will use it.
● In this case the time column has been parsed as a time without a date. Take a
look at the structure of the time column to verify it has the class hms.
● hms objects print like times should. Take a look by examining the head of the
time column.
● You can use hms objects in plots too. Create a plot with time on the x-axis,
temperature on the y-axis, with lines grouped by date.

Answer.
# Import auckland hourly data
akl_hourly <- read_csv("akl_weather_hourly_2016.csv")

# Examine structure of time column
str(akl_hourly$time)

# Examine head of time column
head(akl_hourly$time)

# A plot using just time
ggplot(akl_hourly, aes(x = time, y = temperature)) +
  geom_line(aes(group = make_date(year, month, mday)), alpha = 0.2)

Ex.

Setting the timezone


If you import a datetime and it has the wrong timezone, you can set it with force_tz().
Pass in the datetime as the first argument and the appropriate timezone to the tzone
argument. Remember the timezone needs to be one from OlsonNames().

I wanted to watch New Zealand in the Women's World Cup Soccer games in 2015,
but the times listed on the FIFA website were all in times local to the venues. In this
exercise you'll help me set the timezones, then in the next exercise you'll help me
figure out what time I needed to tune in to watch them.

Instructions.

I've put the times as listed on the FIFA website for games 2 and 3 in the group stage for New
Zealand in your code.

● Game 2 was played in Edmonton. Use force_tz() to set the timezone of game 2 to
"America/Edmonton".
● Game 3 was played in Winnipeg. Use force_tz() to set the timezone of game 3 to
"America/Winnipeg".
● Find out how long the team had to rest between the two games, by using as.period()
on the interval between game2_local and game3_local.

Answer.
# Game2: CAN vs NZL in Edmonton
game2 <- mdy_hm("June 11 2015 19:00")

# Game3: CHN vs NZL in Winnipeg
game3 <- mdy_hm("June 15 2015 18:30")

# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game2_local

# Set the timezone to "America/Winnipeg"
game3_local <- force_tz(game3, tzone = "America/Winnipeg")
game3_local

# Rest period between the two games, as a period
as.period(game2_local %--% game3_local)

21/07/2020

Intervals
Ex.

Converting to durations and periods


Intervals are the most specific way to represent a span of time since they retain
information about the exact start and end moments. They can be converted to
periods and durations exactly: it's possible to calculate both the exact number of
seconds elapsed between the start and end date, as well as the perceived change in
clock time.

To do so you use the as.period() and as.duration() functions, passing in an interval as
the only argument.

Try them out to get better representations of the lengths of the monarchs' reigns.

Instructions.
● Create new columns for duration and period that convert reign into the
appropriate object.
● Examine the name, duration and period columns.

Answer.
# New columns for duration and period
monarchs <- monarchs %>%
  mutate(
    duration = as.duration(reign),
    period = as.period(reign))

# Examine results
monarchs %>%
  select(name, duration, period)

Ex.

Comparing intervals and datetimes

A common task with intervals is to ask if a certain time is inside the interval or
whether it overlaps with another interval.

The operator %within% tests if the datetime (or interval) on the left hand side is
within the interval of the right hand side. For example, if y2001 is the interval covering
the year 2001,

y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")

Then ymd("2001-03-30") %within% y2001 will return TRUE and ymd("2002-03-30") %within% y2001 will return FALSE.

int_overlaps() performs a similar test, but will return TRUE if two intervals overlap at
all.
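
A quick sketch of both tests:

library(lubridate)
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001    # TRUE
mid_2001_2002 <- ymd("2001-06-01") %--% ymd("2002-06-01")
int_overlaps(y2001, mid_2001_2002)  # TRUE: both cover June-December 2001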

Practice to find out which monarchs saw Halley's comet around 1066.

Instructions.

We've put halleys, a data set describing appearances of Halley's comet, in your
workspace.

● Print halleys to examine the data. perihelion_date is the date the comet is
closest to the Sun. start_date and end_date are the range of dates the comet is
visible from Earth.
● Create a new column, visible, that is an interval from start_date to end_date.
● You'll work with one appearance: extract the 14th row of halleys.
● Filter monarchs to those where halleys_1066$perihelion_date is within reign.
● Filter monarchs to those where halleys_1066$visible overlaps reign.

Answer.
# Print halleys
halleys

# New column for interval from start to end date
halleys <- halleys %>%
  mutate(visible = start_date %--% end_date)

# The visitation of 1066
halleys_1066 <- halleys[14, ]

# Monarchs in power on perihelion date
monarchs %>%
  filter(halleys_1066$perihelion_date %within% reign) %>%
  select(name, from, to, dominion)

# Monarchs whose reign overlaps visible time
monarchs %>%
  filter(int_overlaps(halleys_1066$visible, reign)) %>%
  select(name, from, to, dominion)

Ex.

Examining intervals. Reigns of kings and queens
You can create an interval by using the operator %--% with two datetimes. For
example ymd("2001-01-01") %--% ymd("2001-12-31") creates an interval for the year of
2001.

Once you have an interval you can find out certain properties like its start, end and
length with int_start(), int_end() and int_length() respectively.

Practice by exploring the reigns of kings and queens of Britain (and its historical
dominions).
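
A quick sketch of those helpers:

library(lubridate)
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
int_start(y2001)   # "2001-01-01 UTC"
int_end(y2001)     # "2001-12-31 UTC"
int_length(y2001)  # the length in seconds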

Instructions.

We've put the data monarchs in your workspace.

● Print monarchs to take a look at the data.
● Create a new column called reign that is an interval between from and to.
● Create another new column, length, that is the interval length of reign. The rest
of the pipeline we've filled in for you; it arranges by decreasing length and
selects the name, length and dominion columns.

Answer.
# Print monarchs
monarchs

# Create an interval for reign
monarchs <- monarchs %>%
  mutate(reign = from %--% to)

# Find the length of reign, and arrange
monarchs %>%
  mutate(length = int_length(reign)) %>%
  arrange(desc(length)) %>%
  select(name, length, dominion)

Time spans.

Ex.

Generating sequences of datetimes


By combining addition and multiplication with sequences you can generate
sequences of datetimes. For example, you can generate a sequence of periods from 1
day up to 10 days with,

1:10 * days(1)

Then by adding this sequence to a specific datetime, you can construct a sequence of
datetimes from 1 day up to 10 days into the future

today() + 1:10 * days(1)

You had a meeting this morning at 8am and you'd like to have that meeting at the
same time and day every two weeks for a year. Generate the meeting times in this
exercise.

Instructions.
● Create today_8am by adding a period of 8 hours to today().
● Create a sequence of periods from one period of two weeks, up to 26 periods
of two weeks.
● Add every_two_weeks to today_8am.

Answer.
# Add a period of 8 hours to today
today_8am <- today() + hours(8)

# Sequence of two weeks from 1 to 26
every_two_weeks <- 1:26 * weeks(2)

# Create datetime for every two weeks for a year
every_two_weeks + today_8am

Ex.

Arithmetic with timespans


You can add and subtract timespans to create different length timespans, and even
multiply them by numbers. For example, to create a duration of three days and three
hours you could do: ddays(3) + dhours(3), or 3*ddays(1) + 3*dhours(1) or even
3*(ddays(1) + dhours(1)).

There was an eclipse over North America on 2017-08-21 at 18:26:40. It's possible to
predict the next eclipse with similar geometry by calculating the time and date one
Saros in the future. A Saros is a length of time that corresponds to 223 Synodic
months, a Synodic month being the period of the Moon's phases, a duration of 29
days, 12 hours, 44 minutes and 3 seconds.

Do just that in this exercise!

Instructions.
● Create a duration corresponding to one Synodic Month: 29 days, 12 hours, 44
minutes and 3 seconds.
● Create a duration corresponding to one Saros by multiplying synodic by 223.
● Add saros to eclipse_2017 to predict the next eclipse.

# Time of North American Eclipse 2017
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")

# Duration of 29 days, 12 hours, 44 mins and 3 secs
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# 223 synodic months
saros <- 223 * synodic

# Add saros to eclipse_2017
eclipse_2017 + saros

Ex.

> # Add a period of one week to mon_2pm
> mon_2pm <- dmy_hm("27 Aug 2018 14:00")
> mon_2pm + weeks()
[1] "2018-09-03 14:00:00 UTC"

> # Add a duration of 81 hours to tue_9am
> tue_9am <- dmy_hm("28 Aug 2018 9:00")
> tue_9am + dhours(81)
[1] "2018-08-31 18:00:00 UTC"

> # Subtract a period of five years from today()
> today() - years(5)
[1] "2015-07-21"

> # Subtract a duration of five years from today()
> today() - dyears(5)
[1] "2015-07-23"

11/07/2020

Time spans.
Ex.

Taking differences of datetimes


Ex.

We've put code to define three times in your script - noon on March 11th, March
12th, and March 13th in 2017 in the US Pacific timezone.

● Find the difference in time between mar_13 and mar_12 in seconds. This should
match your intuition.
● Now, find the difference in time between mar_12 and mar_11 in seconds.
Surprised?

Answer.
# Three dates
mar_11 <- ymd_hms("2017-03-11 12:00:00",
                  tz = "America/Los_Angeles")
mar_12 <- ymd_hms("2017-03-12 12:00:00",
                  tz = "America/Los_Angeles")
mar_13 <- ymd_hms("2017-03-13 12:00:00",
                  tz = "America/Los_Angeles")

# Difference between mar_13 and mar_12 in seconds
difftime(mar_13, mar_12, units = "secs")

# Difference between mar_12 and mar_11 in seconds
difftime(mar_12, mar_11, units = "secs")


Important!
Good work. Why would a day only have 82800 seconds? At 2am on Mar 12th 2017,
Daylight Savings started in the Pacific timezone. That means a whole hour of seconds
gets skipped between noon on the 11th and noon on the 12th.

Ex.

How long has it been?
To get finer control over a difference between datetimes use the base function
difftime(). For example instead of time1 - time2, you use difftime(time1, time2).

difftime() takes an argument units which specifies the units for the difference. Your
options are "secs", "mins", "hours", "days", or "weeks".

To practice you'll find the time since the first man stepped on the moon. You'll also
see the lubridate functions today() and now() which when called with no arguments
return the current date and time in your system's timezone.

Instructions.
● Apollo 11 landed on July 20, 1969. Use difftime() to find the number of days
between today() and date_landing.
● Neil Armstrong stepped onto the surface at 02:56:15 UTC. Use difftime() to
find the number of seconds between now() and moment_step.

Answer.
# The date of landing and moment of step
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")

# How many days since the first man on the moon?
difftime(today(), date_landing, units = "days")

# How many seconds since the first man on the moon?
difftime(now(), moment_step, units = "secs")

Rounding datetimes

Ex.

Rounding with the weather data


When is rounding useful? In a lot of the same situations extracting date components
is useful. The advantage of rounding over extracting is that it maintains the context of
the unit. For example, extracting the hour gives you the hour the datetime occurred,
but you lose the day that hour occurred on (unless you extract that too); on the other
hand, rounding to the nearest hour maintains the day, month and year.

As an example you'll explore how many observations per hour there really are in the
hourly Auckland weather data.

Instructions.
● Create a new column called day_hour that is datetime rounded down to the
nearest hour.
● Use count() on day_hour to count how many observations there are in each
hour. What looks like the most common value?
● Extend the pipeline, so that after counting, you filter for observations where n
is not equal to 2.

Answer.
# Create day_hour, datetime rounded down to hour
akl_hourly <- akl_hourly %>%
  mutate(
    day_hour = floor_date(datetime, unit = "hour")
  )

# Count observations per hour
akl_hourly %>%
  count(day_hour)

# Find day_hours with n != 2
akl_hourly %>%
  count(day_hour) %>%
  filter(n != 2) %>%
  arrange(desc(n))

Ex.

Practice rounding
As you saw in the video, round_date() rounds a date to the nearest value, floor_date()
rounds down, and ceiling_date() rounds up.

All three take a unit argument which specifies the resolution of rounding. You can
specify "second", "minute", "hour", "day", "week", "month", "bimonth", "quarter",
"halfyear", or "year". Or, you can specify any multiple of those units, e.g. "5 years", "3
minutes" etc.

Try them out with the release datetime of R 3.4.1.

Instructions.

● Choose the right function and units to round r_3_4_1 down to the nearest day.
● Choose the right function and units to round r_3_4_1 to the nearest 5 minutes.
● Choose the right function and units to round r_3_4_1 up to the nearest week.
● Find the time elapsed on the day of release at the time of release by
subtracting r_3_4_1 rounded down to the day from r_3_4_1.

Answer.
r_3_4_1 <- ymd_hms("2017-06-30 07:04:11")

# Round down to day
floor_date(r_3_4_1, unit = "day")

# Round to nearest 5 minutes
round_date(r_3_4_1, unit = "5 minutes")

# Round up to week
ceiling_date(r_3_4_1, unit = "week")

# Subtract r_3_4_1 rounded down to day
r_3_4_1 - floor_date(r_3_4_1, unit = "day")

Extracting parts of a datetime


Ex.

Extracting for filtering and summarizing


Another reason to extract components is to help with filtering observations or
creating summaries. For example, if you are only interested in observations made on
weekdays (i.e. not on weekends) you could extract the weekdays then filter out
weekends, e.g. wday(date) %in% 2:6.
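
A quick sketch of that filter:

library(lubridate)
dates <- ymd("2016-01-01") + days(0:6)   # one week of dates
wday(dates, label = TRUE)                # Fri, Sat, Sun, Mon, ...
dates[wday(dates) %in% 2:6]              # keep only weekdays (Sunday is 1 by default)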

In the last exercise you saw that January, February and March were great times to
visit Auckland for warm temperatures, but will you need a raincoat?

In this exercise you'll find out! You'll use the hourly data to calculate how many days
in each month there was any rain during the day.

Instructions.
● Create new columns for the hour and month of the observation from datetime.
Make sure you label the month.
● Filter to just daytime observations, where the hour is greater than or equal to 8
and less than or equal to 22.
● Group the observations first by month, then by date, and summarise by using
any() on the rainy column. This results in one value per day.
● Summarise again by summing any_rain. This results in one value per month.

Answer.
# Create new columns hour, month and rainy
akl_hourly <- akl_hourly %>%
  mutate(
    hour = hour(datetime),
    month = month(datetime, label = TRUE),
    rainy = weather == "Precipitation"
  )

# Filter for hours between 8am and 10pm (inclusive)
akl_day <- akl_hourly %>%
  filter(hour >= 8, hour <= 22)

# Summarise for each date if there is any rain
rainy_days <- akl_day %>%
  group_by(month, date) %>%
  summarise(
    any_rain = any(rainy)
  )

# Summarise for each month, the number of days with rain
rainy_days %>%
  summarise(
    days_rainy = sum(any_rain)
  )

CONSOLE
# A tibble: 12 x 2
   month days_rainy
   <ord>      <int>
 1 Jan           15
 2 Feb           13
 3 Mar           12
 4 Apr           15
 5 May           21
 6 Jun           19
 7 Jul           22
 8 Aug           16
 9 Sep           25
10 Oct           20
11 Nov           19
12 Dec           11

Ex.

Extracting for plotting


Extracting components from a datetime is particularly useful when exploring data.
Earlier in the chapter you imported daily data for weather in Auckland, and created a
time series plot of ten years of daily maximum temperature. While that plot gives you
a good overview of the whole ten years, it's hard to see the annual pattern.

In this exercise you'll use components of the dates to help explore the pattern of
maximum temperature over the year. The first step is to create some new columns to
hold the extracted pieces, then you'll use them in a couple of plots.

Instructions.

● Use mutate() to create three new columns: year, yday and month that
respectively hold the same components of the date column. Don't forget to
label the months with their names.
● Create a plot of yday on the x-axis, max_temp on the y-axis, where lines are
grouped by year. Each year is a line on this plot, with the x-axis running from Jan 1
to Dec 31.
● To take an alternate look, create a ridgeline plot (formerly known as a joyplot)
with max_temp on the x-axis, month on the y-axis, using geom_density_ridges()
from the ggridges package.

Answer.
library(ggplot2)
library(dplyr)
library(ggridges)

# Add columns for year, yday and month
akl_daily <- akl_daily %>%
  mutate(
    year = year(date),
    yday = yday(date),
    month = month(date, label = TRUE))

# Plot max_temp by yday for all years
ggplot(akl_daily, aes(x = yday, y = max_temp)) +
  geom_line(aes(group = year), alpha = 0.5)

# Examine distribution of max_temp by month
ggplot(akl_daily, aes(x = max_temp, y = month, height = ..density..)) +
  geom_density_ridges(stat = "density")

Ex.

Adding useful labels


In the previous exercise you found the month of releases:

head(month(release_time))

and received numeric months in return. Sometimes it's nicer (especially for plotting or
tables) to have named months. Both the month() and wday() (day of the week)
functions have additional arguments label and abbr to achieve just that. Set label =
TRUE to have the output labelled with month (or weekday) names, and abbr = FALSE
for those names to be written in full rather than abbreviated.

For example, try running:

head(month(release_time, label = TRUE, abbr = FALSE))

Practice by examining the popular days of the week for R releases.

library(ggplot2)

# Use wday() to tabulate release by day of the week
wday(releases$datetime) %>% table()

# Add label = TRUE to make table more readable
wday(releases$datetime, label = TRUE) %>% table()

# Create column wday to hold week days
releases$wday <- wday(releases$datetime, label = TRUE)

# Plot barchart of weekday by type of release
ggplot(releases, aes(wday)) +
  geom_bar() +
  facet_wrap(~ type, ncol = 1, scales = "free_y")

Ex.

As you saw in the video, components of a datetime can be extracted by lubridate
functions with the same name, like year(), month(), day(), hour(), minute() and
second(). They all work the same way: just pass in a datetime or vector of datetimes.

There are also a few useful functions that return other aspects of a datetime like if it
occurs in the morning am(), during daylight savings dst(), in a leap_year(), or which
quarter() or semester() it occurs in.

Try them out by exploring the release times of R versions using the data from
Chapter 1.

Instructions.

We've put release_time, the datetime column of the releases dataset from Chapter 1, in
your workspace.

● Examine the head() of release_time to verify this is a vector of datetimes.
● Extract the month from release_time and examine the first few with head().
● To see which months have most releases, extract the month then pipe to
table().
● Repeat, to see which years have the most releases.
● Do releases happen in the morning (UTC)? Find out if the hour of a release is
less than 12 and summarise with mean().
● Alternatively use am() to find out how often releases happen in the morning.

Answer.
# Examine the head() of release_time
head(release_time)

# Examine the head() of the months of release_time
head(month(release_time))

# Extract the month of releases
month(release_time) %>% table()

# Extract the year of releases
year(release_time) %>% table()

# How often is the hour before 12 (noon)?
mean(hour(release_time) < 12)

# How often is the release in am?
mean(am(release_time))
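
The other helpers mentioned above work the same way; a quick sketch (outputs depend on the data):

# More component helpers applied to the same vector
leap_year(release_time) %>% table()   # releases in leap years vs not
quarter(release_time) %>% table()     # releases per quarter
head(dst(release_time))               # daylight saving flag (timezone dependent)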

Import hourly weather data


The hourly data is a little different. The date information is spread over three columns
year, month and mday, so you'll need to use make_date() to combine them.

Then the time information is in a separate column again, time. It's quite common to
find date and time split across different variables. One way to construct the datetimes
is to paste the date and time together and then parse them. You'll do that in this
exercise.

Instructions.
● Import the hourly data, "akl_weather_hourly_2016.csv" with read_csv(), then print
akl_hourly_raw to confirm the date is spread over year, month and mday.
● Using mutate() create the column date using make_date().
● We've pasted together the date and time columns. Create datetime by parsing the
datetime_string column.
● Take a look at the date, time and datetime columns to verify they match up.
● Take a look at the data by plotting datetime on the x-axis and temperature on the y-axis.

Answer.
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)

# Import "akl_weather_hourly_2016.csv"
akl_hourly_raw <- read_csv("akl_weather_hourly_2016.csv")

# Print akl_hourly_raw
akl_hourly_raw

# Use make_date() to combine year, month and mday
akl_hourly <- akl_hourly_raw %>%
  mutate(date = make_date(year = year, month = month, day = mday))

# Parse datetime_string
akl_hourly <- akl_hourly %>%
  mutate(
    datetime_string = paste(date, time, sep = "T"),
    datetime = ymd_hms(datetime_string)
  )

# Print date, time and datetime columns of akl_hourly
akl_hourly %>% select(date, time, datetime)

# Plot to check work
ggplot(akl_hourly, aes(x = datetime, y = temperature)) +
  geom_line()

10/07/2020
Weather in Auckland

08/07/2020
Dates
Ex.

Datetimes behave nicely too


Just like Date objects, you can plot and do math with POSIXct objects.

As an example, in this exercise you'll see how quickly people download new versions
of R, by examining the download logs from the RStudio CRAN mirror.

R 3.2.0 was released at "2015-04-16 07:13:33", so cran-logs_2015-04-17.csv contains a
random sample of downloads on the 16th, 17th and 18th.

Instructions.
● Use read_csv() to import cran-logs_2015-04-17.csv.
● Print logs to see the information we have on each download.
● Store the R 3.2.0 release time as a POSIXct object.
● Find out when the first request for 3.2.0 was made by filtering for values in the
datetime column that are greater than release_time.
● Finally see how downloads increase by creating histograms of download time
for 3.2.0 and the previous version 3.1.3. We've provided most of the code, you
just need to specify the x aesthetic to be the datetime column.

Answer.
# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("cran-logs_2015-04-17.csv")

# Print logs
logs

# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# When is the first download of 3.2.0?
logs %>%
  filter(datetime > release_time,
         r_version == "3.2.0")

# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time))) +
  facet_wrap(~ r_version, ncol = 1)

Ex.

Getting datetimes into R


Just like dates without times, if you want R to recognize a string as a datetime you
need to convert it, although now you use as.POSIXct(). as.POSIXct() expects strings
to be in the format YYYY-MM-DD HH:MM:SS.

The only tricky thing is that times will be interpreted in local time based on your
machine's set up. You can check your timezone with Sys.timezone(). If you want the
time to be interpreted in a different timezone, you just set the tz argument of
as.POSIXct(). You'll learn more about time zones in Chapter 4.

In this exercise you'll input a couple of datetimes by hand and then see that
read_csv() also handles datetimes automatically in a lot of cases.

Answer.
# Use as.POSIXct to enter the datetime
as.POSIXct("2010-10-01 12:12:00")

# Use as.POSIXct again but set the timezone to "America/Los_Angeles"
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")

# Use readr to import rversions.csv
releases <- read_csv("rversions.csv")

# Examine structure of datetime column
str(releases$datetime)

Console.
> # Use as.POSIXct to enter the datetime
> as.POSIXct("2010-10-01 12:12:00")
[1] "2010-10-01 12:12:00 UTC"

> # Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
> as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
[1] "2010-10-01 12:12:00 PDT"

> # Use readr to import rversions.csv
> releases <- read_csv("rversions.csv")
Parsed with column specification:
cols(
  major = col_integer(),
  minor = col_integer(),
  patch = col_integer(),
  date = col_date(format = ""),
  datetime = col_datetime(format = ""),
  time = col_time(format = ""),
  type = col_character()
)

> # Examine structure of datetime column
> str(releases$datetime)
 POSIXct[1:105], format: "1997-12-04 08:47:58" "1997-12-21 13:09:22" "1998-01-10 00:31:55" ...

Ex. Console.
> # Find the largest date
> last_release_date <- max(releases$date)

> # Filter row for last release
> last_release <- filter(releases, date == last_release_date)

> # Print last_release
> last_release
# A tibble: 1 x 7
  major minor patch date       datetime            time     type
  <int> <int> <int> <date>     <dttm>              <time>   <chr>
1     3     4     1 2017-06-30 2017-06-30 07:04:11 07:04:11 patch

> # How long since last release?
> Sys.Date() - last_release_date
Time difference of 1104 days

Ex.
library(ggplot2)

# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))

# Limit the axis to between 2010-01-01 and 2014-01-01
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))

# Specify breaks every ten years and labels with "%Y"
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")

03/07/2020

Ex.

Generalizations
Now that you've done all the steps necessary to make our mosaic plot, you can wrap all the
steps into a single function that we can use to examine any two variables of interest in our
data frame (or in any other data frame for that matter). For example, we can use it to
examine the Vocab data frame we saw earlier in this course.

You've seen all the code in our function, so there shouldn't be anything surprising there.
Notice that the function takes multiple arguments, such as the data frame of interest and the
variables that you want to create the mosaic plot for. None of the arguments have default
values, so you'll have to specify all three if you want the mosaicGG() function to work.

Start by going through the code and see if you understand the function's implementation.
# Load all packages
library(ggplot2)
library(reshape2)
library(dplyr)
library(ggthemes)
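
The function body itself isn't reproduced in these notes. A minimal sketch, assembled from the geom_rect() and chi-squared residual steps worked through in the exercises below (not necessarily DataCamp's exact implementation), might look like this:

mosaicGG <- function(data, X, FILL) {
  # Contingency table of X (x axis) by FILL, as a data frame
  DF <- as.data.frame.matrix(table(data[[X]], data[[FILL]]))

  # Box widths proportional to group sizes
  DF$groupSum <- rowSums(DF)
  DF$xmax <- cumsum(DF$groupSum)
  DF$xmin <- DF$xmax - DF$groupSum
  DF$X <- row.names(DF)
  DF$groupSum <- NULL

  # Long format, then box heights as within-group proportions
  DF_melted <- melt(DF, id.vars = c("X", "xmin", "xmax"), variable.name = "FILL")
  DF_melted <- DF_melted %>%
    group_by(X) %>%
    mutate(ymax = cumsum(value / sum(value)),
           ymin = ymax - value / sum(value))

  # Pearson residuals from a chi-squared test drive the fill colour
  results <- chisq.test(table(data[[FILL]], data[[X]]))
  resid <- melt(results$residuals)
  names(resid) <- c("FILL", "X", "residual")
  DF_all <- merge(DF_melted, resid)

  ggplot(DF_all, aes(ymin = ymin, ymax = ymax,
                     xmin = xmin, xmax = xmax, fill = residual)) +
    geom_rect(colour = "white") +
    scale_fill_gradient2() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    theme_tufte()
}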

Instructions:
● Print mosaicGG and read its contents.
● Calling mosaicGG(adult, "SRAGE_P","RBMI") will result in the plot you've been
working on so far. Try this out. This gives you a mosaic plot where BMI is
described by age.
● Test out another combination of variables in the adult data frame: Poverty
(POVLL) described by Age (SRAGE_P).
● Try the function on other datasets we've worked with throughout this course:
  ● mtcars dataset: am described by cyl
  ● Vocab dataset: vocabulary described by education.

Answer:
# Script generalized into a function
mosaicGG

# BMI described by age (as previously seen)
mosaicGG(adult, X = "SRAGE_P", FILL = "RBMI")

# Poverty described by age
mosaicGG(adult, X = "SRAGE_P", FILL = "POVLL")

# mtcars: am described by cyl
mosaicGG(mtcars, "cyl", "am")

# Vocab: vocabulary described by education
library(carData)
mosaicGG(Vocab, "education", "vocabulary")

Ex.

Adding text
Since we're not coloring according to BMI, we have to add group (and x axis) labels
manually. Our goal is the plot in the viewer.

For this we'll use the label aesthetic inside geom_text(). The actual labels are found in
the FILL (BMI category) and X (age) columns in the DF_all data frame. (Additional
attributes have been set inside geom_text() in the exercise for you).

The labels will be added to the right (BMI category) and top (age) inner edges of the
plot. (We could have also added margin text, but that is a more advanced topic that
we'll encounter in the third course. This will be a suitable solution for the moment.)

The first two commands show how we got the four positions for the y axis labels.
First, we got the position of the maximum xmax values, i.e. at the very right end,
stored as index. We want to calculate the half difference between each pair of ymax
and ymin (e.g. (ymax - ymin)/2) at these index positions, then add this value to the ymin
value. These positions are stored in the variable yposn.

We'll begin with the plot thus far, stored as object p. In the sample code, %+% DF_all
refreshes the plot's dataset with the extra columns.

# Plot so far
p

# Position for labels on y axis (don't change)
index <- DF_all$xmax == max(DF_all$xmax)
DF_all$yposn <- DF_all$ymin[index] + (DF_all$ymax[index] - DF_all$ymin[index])/2

# Plot 1: geom_text for BMI (i.e. the fill axis)
p1 <- p %+% DF_all +
  geom_text(aes(x = max(xmax),
                y = yposn,
                label = FILL),
            size = 3, hjust = 1,
            show.legend = FALSE)

p1

# Plot 2: Position for labels on x axis
DF_all$xposn <- DF_all$xmin + (DF_all$xmax - DF_all$xmin)/2

# geom_text for ages (i.e. the x axis)
p1 %+% DF_all +
  geom_text(aes(x = xposn, label = X),
            y = 1, angle = 90,
            size = 3, hjust = 1,
            show.legend = FALSE)

Ex.

Adding statistics
In the previous exercise we generated a plot where each individual bar was plotted
separately using rectangles (shown in the viewer). This means we have access to each
piece and we can apply different fill parameters.

So let's make some new parameters. To get the Pearson residuals, we'll use the
chisq.test() function.

The data frames adult and DF_melted, as well as the object BMI_fill that you created
throughout this chapter, are all still available. The reshape2 package is already loaded.

● Use the adult$RBMI (corresponding to FILL) and adult$SRAGE_P
(corresponding to X) columns inside the table() function that's inside the
chisq.test() function. Store the result as results.
● The residuals can be accessed through results$residuals. Apply the melt()
function on them with no further arguments. Store the resulting data frame as
resid.
● Change the names of resid to c("FILL", "X", "residual"). This is so that we have
a consistent naming convention similar to how we called our variables in the
previous exercises.
● The data frame from the previous exercise, DF_melted, is already available. Use
the merge() function to bring the two data frames together. Store the result as
DF_all.
● Adapt the code in the ggplot command to use DF_all instead of DF_melted.
Also, map residual onto fill instead of FILL.

# Perform chi.sq test (RBMI and SRAGE_P)
results <- chisq.test(table(adult$RBMI, adult$SRAGE_P))

# Melt results$residuals and store as resid
resid <- melt(results$residuals)

# Change names of resid
names(resid) <- c("FILL", "X", "residual")

# Merge the two datasets
DF_all <- merge(DF_melted, resid)

# Update plot command
library(ggthemes)
ggplot(DF_all, aes(ymin = ymin,
                   ymax = ymax,
                   xmin = xmin,
                   xmax = xmax,
                   fill = residual)) +
  geom_rect() +
  scale_fill_gradient2() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_tufte()

Ex.

Marimekko/Mosaic Plot
In the previous exercise we looked at different ways of showing the frequency
distribution within each BMI category. This is all well and good, but the absolute
number of each age group also has an influence on whether we consider something
over-represented or not. Here, we will proceed to change the widths of the bars to
show us something about the n in each group.

This will get a bit more involved, because the aim is not to draw bars, but rather
rectangles, for which we can control the widths. You may have already realized that
bars are simply rectangles; we just don't have easy access to the xmin and xmax
aesthetics, but in geom_rect() we do! Likewise, we also have access to ymin and ymax.
So we're going to draw a box for every one of our 268 distinct groups of BMI
category and age.

The clean adult dataset, as well as BMI_fill, are already available. Instead of running
apply() like in the previous exercise, the contingency table has already been
transformed to a data frame using as.data.frame.matrix().

# The initial contingency table
DF <- as.data.frame.matrix(table(adult$SRAGE_P, adult$RBMI))

# Create groupSum, xmax and xmin columns
DF$groupSum <- rowSums(DF)
DF$xmax <- cumsum(DF$groupSum)
DF$xmin <- DF$xmax - DF$groupSum
# The groupSum column needs to be removed; don't remove this line
DF$groupSum <- NULL

# Copy row names to variable X
DF$X <- row.names(DF)

# Melt the dataset
library(reshape2)
DF_melted <- melt(DF, id.vars = c("X", "xmin", "xmax"), variable.name = "FILL")

# dplyr call to calculate ymin and ymax - don't change
library(dplyr)
DF_melted <- DF_melted %>%
  group_by(X) %>%
  mutate(ymax = cumsum(value/sum(value)),
         ymin = ymax - value/sum(value))

# Plot rectangles - don't change
library(ggthemes)
ggplot(DF_melted, aes(ymin = ymin,
                      ymax = ymax,
                      xmin = xmin,
                      xmax = xmax,
                      fill = FILL)) +
  geom_rect(colour = "white") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  BMI_fill +
  theme_tufte()
