Class Notes - DataCamp - Visualization in Higher Dimensions
Instructions.
Using the shapes of the density plots, calculate the most appropriate measures of
center and spread for the following:
● The distribution of life expectancy in the countries of the Americas. Note you'll
need to apply a filter here.
● The distribution of country populations across the entire gap2007 dataset.
Answer.
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
Ex.
Instructions.
The gap2007 dataset that you created in an earlier exercise is available in your
workspace.
● For each continent in gap2007, summarize life expectancies using the sd(), the
IQR(), and the count of countries, n(). No need to name the new columns
produced here. The n() function within your summarize() call does not take any
arguments.
● Graphically compare the spread of these distributions by constructing overlaid
density plots of life expectancy broken down by continent.
Answer.
# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
# Overlaid density plots of lifeExp by continent
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)
Ex.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
3 variable plot
Faceting is a valuable technique for looking at several conditional distributions at the
same time. If the faceted distributions are laid out in a grid, you can consider the
association between a variable and two others, one on the rows of the grid and the
other on the columns.
Instructions.
Answer.
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("hwy_mpg by ncyl and suv")
Instructions.
Use density plots or box plots to construct the following visualizations. For each
variable, try both plots and submit the one that is better at capturing the important
structure.
Answer.
# Create plot of city_mpg
cars %>%
ggplot(aes(x=1, y=city_mpg)) +
geom_boxplot()
Ex.
# Create hist of horsepwr
cars %>%
ggplot(aes(x=horsepwr)) +
geom_histogram() +
ggtitle("Distribution of horsepwr")
Ex.
Faceted histogram
In this chapter, you'll be working with the cars dataset, which records characteristics
on all of the new models of cars for sale in the US in a certain year. You will
investigate the distribution of mileage across a categorical variable, but before you get
there, you'll want to familiarize yourself with the dataset.
Instructions.
# Load package
library(ggplot2)
Important!
In this exercise, you faceted by the suv variable, but it's important to note that you
can facet a plot by any categorical variable using facet_wrap(). Nice job!
Ex.
Marginal barchart
If you are interested in the distribution of alignment of all superheroes, it makes
sense to construct a barchart for just that single variable.
You can improve the interpretability of the plot, though, by implementing some
sensible ordering. Superheroes that are "Neutral" show an alignment between "Good"
and "Bad", so it makes sense to put that bar in the middle.
Instructions.
● Reorder the levels of align using the factor() function so that printing them
reads "Bad", "Neutral", then "Good".
● Create a barchart of counts of the align variable.
Answer.
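A sketch of this answer based on the bullets above, assuming the course's comics data frame is loaded:

```r
library(ggplot2)

# Reorder the levels so printing them reads "Bad", "Neutral", "Good"
comics$align <- factor(comics$align,
                       levels = c("Bad", "Neutral", "Good"))

# Barchart of counts of align
ggplot(comics, aes(x = align)) +
  geom_bar()
```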
Ex.
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
Ex.
Instructions.
● Create a stacked barchart of gender counts with align on the x-axis.
● Create a stacked barchart of gender proportions with align on the x-axis by
setting the position argument to geom_bar() equal to "fill".
Answer.
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
Important!
Ex.
Side-by-side barcharts
While a contingency table represents the counts numerically, it's often more useful to
represent them graphically.
Here you'll construct two side-by-side barcharts of the comics data. This shows that
there can often be two or more options for presenting the same data. Passing the
argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not
stacked) barchart.
Instructions.
● Load the ggplot2 package.
● Create a side-by-side barchart with align on the x-axis and gender as the fill
aesthetic.
● Create another side-by-side barchart with gender on the x-axis and align as the
fill aesthetic. Rotate the axis labels 90 degrees to help readability.
Answer.
# Load ggplot2
library(ggplot2)
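The dodge barcharts themselves might look like this (a sketch following the bullets above; assumes the comics data frame is loaded):

```r
# Side-by-side barchart of gender within align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "dodge")

# Swap the roles and rotate the x-axis labels 90 degrees for readability
ggplot(comics, aes(x = gender, fill = align)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))
```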
Ex.
Instructions.
● Load the dplyr package.
● Print tab to find out which level of align has the fewest total entries.
● Use filter() to filter out all rows of comics with that level, then drop the unused
level with droplevels(). Save the simplified dataset as comics_filtered.
Answer.
# Load dplyr
library(dplyr)
# Print tab
tab
Recode a variable
Ex.
# Print the first rows of the data
comics
29/07/2020
Variables in the data
Explanatory variables are conditions you can impose on the experimental units (e.g. light, noise).
Stratified sample in R
Ex.
# Stratified sample
states_str <- us_regions %>%
group_by(region) %>%
sample_n(size = 2)
Ex.
# Simple random sample: states_srs
states_srs <- us_regions %>%
sample_n(size = 8)
Ex.
# Load packages
library(dplyr)
Ex.
You'll see how fast it is in this exercise by comparing how quickly it reads in the dates
from the Auckland hourly weather data (over 17,000 dates) to lubridate's ymd_hms().
To compare run times you'll use the microbenchmark() function from the package of
the same name. You pass in as many arguments as you want, each being an
expression to time.
Instructions.
We've loaded the datetimes from the Auckland hourly data as strings into the vector
dates.
● Examine the structure of dates to verify it is a string and in the ISO 8601
format.
● Parse dates with fasttime and pipe to str() to verify fastPOSIXct parses them
correctly.
● Now to compare timing, call microbenchmark where the first argument uses
ymd_hms() to parse dates and the second uses fastPOSIXct().
Answer.
library(microbenchmark)
library(fasttime)
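A sketch of the comparison, assuming the dates vector of ISO 8601 strings is in the workspace (the times = 20 setting is just a reasonable choice, not from the course):

```r
# Verify dates is a character vector in ISO 8601 format
str(dates)

# Parse with fasttime and check the result is POSIXct
str(fastPOSIXct(dates))

# Time both parsers
microbenchmark(
  ymd_hms = ymd_hms(dates),
  fasttime = fastPOSIXct(dates),
  times = 20)
```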
If you find yourself in this situation, the hms package provides an hms class of object
for holding times without dates, and the best place to start would be with as.hms().
In fact, you've already seen an object of the hms class, but I didn't point it out to you.
Take a look in this exercise.
Answer.
# Import auckland hourly data
akl_hourly <- read_csv("akl_weather_hourly_2016.csv")
I wanted to watch New Zealand in the Women's World Cup Soccer games in 2015,
but the times listed on the FIFA website were all in times local to the venues. In this
exercise you'll help me set the timezones, then in the next exercise you'll help me
figure out what time I needed to tune in to watch them.
Instructions.
I've put the times as listed on the FIFA website for games 2 and 3 in the group stage for New
Zealand in your code.
● Game 2 was played in Edmonton. Use force_tz() to set the timezone of game 2 to
"America/Edmonton".
● Game 3 was played in Winnipeg. Use force_tz() to set the timezone of game 3 to
"America/Winnipeg".
● Find out how long the team had to rest between the two games, by using as.period()
on the interval between game2_local and game3_local.
Answer.
game2_local
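The full answer might look like this; the kickoff times below are placeholders (the real ones came from the FIFA listings), so treat them as hypothetical:

```r
library(lubridate)

# Hypothetical local kickoff times (placeholders, not the FIFA listings)
game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")

# Force the correct local timezones
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")

# Rest between the two games
as.period(interval(game2_local, game3_local))
```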
Intervals
Ex.
Try them out to get better representations of the length of the monarchs' reigns.
Instructions.
● Create new columns for duration and period that convert reign into the
appropriate object.
● Examine the name, duration and period columns.
Answer.
# New columns for duration and period
monarchs <- monarchs %>%
mutate(
duration = as.duration(reign),
period = as.period(reign))
# Examine results
monarchs %>%
select(name, duration, period)
Ex.
The operator %within% tests if the datetime (or interval) on the left hand side is
within the interval on the right hand side. For example, if y2001 is the interval covering
the year 2001, then x %within% y2001 is TRUE whenever the datetime x falls inside 2001.
int_overlaps() performs a similar test, but will return true if two intervals overlap at
all.
Practice to find out which monarchs saw Halley's comet around 1066.
Instructions.
We've put halleys a data set describing appearances of Halley's comet in your
workspace.
● Print halleys to examine the date. perihelion_date is the date the Comet is
closest to the Sun. start_date and end_date are the range of dates the comet is
visible from Earth.
● Create a new column, visible, that is an interval from start_date to end_date.
● You'll work with one appearance, extract the 14th row of halleys.
● Filter monarchs to those where halleys_1066$perihelion_date is within reign.
● Filter monarchs to those where halleys_1066$visible overlaps reign.
Answer.
# Print halleys
halleys
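The remaining steps might look like this (a sketch; assumes the halleys and monarchs data frames from the course workspace):

```r
# Interval over which each appearance was visible from Earth
halleys <- halleys %>%
  mutate(visible = interval(start_date, end_date))

# The appearance around 1066 is the 14th row
halleys_1066 <- halleys[14, ]

# Monarchs reigning when the comet was at perihelion
monarchs %>%
  filter(halleys_1066$perihelion_date %within% reign)

# Monarchs whose reign overlapped the visible window at all
monarchs %>%
  filter(int_overlaps(halleys_1066$visible, reign))
```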
Ex.
Once you have an interval you can find out certain properties like its start, end and
length with int_start(), int_end() and int_length() respectively.
Practice by exploring the reigns of kings and queens of Britain (and its historical
dominions).
Instructions.
Answer.
# Print monarchs
monarchs
Time spans.
Ex.
1:10 * days(1)
Then by adding this sequence to a specific datetime, you can construct a sequence of
datetimes from 1 day up to 10 days into the future
You had a meeting this morning at 8am and you'd like to have that meeting at the
same time and day every two weeks for a year. Generate the meeting times in this
exercise.
Instructions.
● Create today_8am() by adding a period of 8 hours to today()
Answer.
# Add a period of 8 hours to today
today_8am <- today() + hours(8)
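The rest of the meetings can then be generated by adding multiples of a two-week period (a sketch, continuing from today_8am above):

```r
# Same time every two weeks for a year
meetings <- today_8am + weeks(seq(0, 52, by = 2))
meetings
```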
Ex.
There was an eclipse over North America on 2017-08-21 at 18:26:40. It's possible to
predict the next eclipse with similar geometry by calculating the time and date one
Saros in the future. A Saros is a length of time that corresponds to 223 Synodic
months, a Synodic month being the period of the Moon's phases, a duration of 29
days, 12 hours, 44 minutes and 3 seconds.
Instructions.
● Create a duration corresponding to one Synodic Month: 29 days, 12 hours, 44
minutes and 3 seconds.
● Create a duration corresponding to one Saros by multiplying synodic by 223.
● Add saros to eclipse_2017 to predict the next eclipse.
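The bullets above can be sketched as (the eclipse datetime comes from the text; treating it as UTC is an assumption):

```r
library(lubridate)

# One Synodic month as an exact duration
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# One Saros is 223 Synodic months
saros <- 223 * synodic

# Add a Saros to the 2017 eclipse to predict the next similar one
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
eclipse_2017 + saros
```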
Ex.
11/07/2020
Time spans.
Ex.
● Find the difference in time between mar_13 and mar_12 in seconds. This should
match your intuition.
● Now, find the difference in time between mar_12 and mar_11 in seconds.
Surprised?
Answer.
# Three dates
mar_11 <- ymd_hms("2017-03-11 12:00:00",
tz = "America/Los_Angeles")
mar_12 <- ymd_hms("2017-03-12 12:00:00",
tz = "America/Los_Angeles")
mar_13 <- ymd_hms("2017-03-13 12:00:00",
tz = "America/Los_Angeles")
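The two differences can then be computed directly (continuing from the dates above):

```r
# A normal day: 86400 seconds
difftime(mar_13, mar_12, units = "secs")

# Only 82800 seconds: DST started on Mar 12th, skipping an hour
difftime(mar_12, mar_11, units = "secs")
```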
Important!
Good work. Why would a day only have 82800 seconds? At 2am on Mar 12th 2017,
Daylight Savings started in the Pacific timezone. That means a whole hour of seconds
gets skipped between noon on the 11th and noon on the 12th.
Ex.
difftime() takes an argument units which specifies the units for the difference. Your
options are "secs", "mins", "hours", "days", or "weeks".
To practice you'll find the time since the first man stepped on the moon. You'll also
see the lubridate functions today() and now() which when called with no arguments
return the current date and time in your system's timezone.
Instructions.
● Apollo 11 landed on July 20, 1969. Use difftime() to find the number of days
between today() and date_landing.
● Neil Armstrong stepped onto the surface at 02:56:15 UTC. Use difftime() to
find the number of seconds between now() and moment_step.
Answer.
# The date of landing and moment of step
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")
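The differences themselves follow directly (continuing from the objects above):

```r
# Days since the landing
difftime(today(), date_landing, units = "days")

# Seconds since the first step
difftime(now(), moment_step, units = "secs")
```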
Rounding datetimes
Ex.
As an example you'll explore how many observations per hour there really are in the
hourly Auckland weather data.
Instructions.
● Create a new column called day_hour that is datetime rounded down to the
nearest hour.
● Use count() on day_hour to count how many observations there are in each
hour. What looks like the most common value?
● Extend the pipeline, so that after counting, you filter for observations where n
is not equal to 2.
Answer.
# Create day_hour, datetime rounded down to hour
akl_hourly <- akl_hourly %>%
mutate(
day_hour = floor_date(datetime, unit = "hour")
)
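The counting steps can then be added to the pipeline (continuing from akl_hourly above):

```r
# Count observations per hour, then keep only hours
# that don't have exactly 2 observations
akl_hourly %>%
  count(day_hour) %>%
  filter(n != 2)
```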
Ex.
Practice rounding
As you saw in the video, round_date() rounds a date to the nearest value, floor_date()
rounds down, and ceiling_date() rounds up.
All three take a unit argument which specifies the resolution of rounding. You can
specify "second", "minute", "hour", "day", "week", "month", "bimonth", "quarter",
"halfyear", or "year". Or, you can specify any multiple of those units, e.g. "5 years", "3
minutes" etc.
Instructions.
● Choose the right function and units to round r_3_4_1 down to the nearest day.
● Choose the right function and units to round r_3_4_1 to the nearest 5 minutes.
● Choose the right function and units to round r_3_4_1 up to the nearest week.
● Find the time elapsed on the day of release at the time of release by
subtracting r_3_4_1 rounded down to the day from r_3_4_1.
Answer.
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")
# Round up to week
ceiling_date(r_3_4_1, unit = "week")
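The remaining bullets follow the same pattern (continuing from r_3_4_1 above):

```r
# Round down to the nearest day
floor_date(r_3_4_1, unit = "day")

# Round to the nearest 5 minutes
round_date(r_3_4_1, unit = "5 minutes")

# Time elapsed on the day of release, at the time of release
r_3_4_1 - floor_date(r_3_4_1, unit = "day")
```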
In the last exercise you saw that January, February and March were great times to
visit Auckland for warm temperatures, but will you need a raincoat?
Instructions.
● Create new columns for the hour and month of the observation from datetime.
Make sure you label the month.
● Filter to just daytime observations, where the hour is greater than or equal to 8
and less than or equal to 22.
● Group the observations first by month, then by date, and summarise using
any() on the rainy column. This results in one value per day.
● Summarise again by summing any_rain. This results in one value per month.
Answer.
# Create new columns hour, month and rainy
akl_hourly <- akl_hourly %>%
mutate(
hour = hour(datetime),
month = month(datetime, label = TRUE),
rainy = weather == "Precipitation"
)
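The two-step summarise can then be sketched as follows (assumes akl_hourly also has a date column from the earlier import exercise):

```r
# Rainy days per month, counting daytime hours only
akl_hourly %>%
  filter(hour >= 8, hour <= 22) %>%
  group_by(month, date) %>%
  summarise(any_rain = any(rainy)) %>%   # one value per day
  summarise(days_rainy = sum(any_rain))  # one value per month
```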
CONSOLE
# A tibble: 12 x 2
month days_rainy
<ord> <int>
1 Jan 15
2 Feb 13
3 Mar 12
Ex.
In this exercise you'll use components of the dates to help explore the pattern of
maximum temperature over the year. The first step is to create some new columns to
hold the extracted pieces, then you'll use them in a couple of plots.
Instructions.
● Use mutate() to create three new columns: year, yday and month that
respectively hold the same components of the date column. Don't forget to
label the months with their names.
● Create a plot of yday on the x-axis and max_temp on the y-axis where lines are
grouped by year. Each year is a line on this plot, with the x-axis running from Jan 1
to Dec 31.
● To take an alternate look, create a ridgeline plot (formerly known as a joyplot)
with max_temp on the x-axis and month on the y-axis, using geom_density_ridges()
from the ggridges package.
Answer.
library(ggplot2)
library(dplyr)
library(ggridges)
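A sketch of the full answer; the data frame name akl_daily and its date and max_temp columns are assumptions based on the instructions:

```r
library(lubridate)

# Extract year, day of year and labelled month from date
akl_daily <- akl_daily %>%
  mutate(year = year(date),
         yday = yday(date),
         month = month(date, label = TRUE))

# One line per year, x-axis running Jan 1 to Dec 31
ggplot(akl_daily, aes(x = yday, y = max_temp)) +
  geom_line(aes(group = year), alpha = 0.5)

# Ridgeline plot of max_temp by month
ggplot(akl_daily, aes(x = max_temp, y = month)) +
  geom_density_ridges()
```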
Ex.
You've used month() before and received numeric months in return. Sometimes it's nicer (especially for plotting or
tables) to have named months. Both the month() and wday() (day of the week)
functions have additional arguments label and abbr to achieve just that. Set label =
TRUE to have the output labelled with month (or weekday) names, and abbr = FALSE
for those names to be written in full rather than abbreviated.
library(ggplot2)
Ex.
There are also a few useful functions that return other aspects of a datetime like if it
occurs in the morning am(), during daylight savings dst(), in a leap_year(), or which
quarter() or semester() it occurs in.
Instructions.
We've put release_time, the datetime column of the releases dataset from Chapter 1, in
your workspace.
Answer.
# Examine the head() of release_time
head(release_time)
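Those extra accessors can be tried directly on the column; a sketch (which summaries the exercise actually asked for isn't recorded here):

```r
# Fraction of releases made in the morning
mean(am(release_time))

# Fraction of releases made during daylight saving time
mean(dst(release_time))
```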
Instructions.
● Import the hourly data, "akl_weather_hourly_2016.csv" with read_csv(), then print
akl_hourly_raw to confirm the date is spread over year, month and mday.
● Using mutate() create the column date using make_date().
● We've pasted together the date and time columns. Create datetime by parsing the
datetime_string column.
● Take a look at the date, time and datetime columns to verify they match up.
● Take a look at the data by plotting datetime on the x-axis and temperature on the y-axis.
Answer.
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
# Import "akl_weather_hourly_2016.csv", then print akl_hourly_raw
akl_hourly_raw <- read_csv("akl_weather_hourly_2016.csv")
akl_hourly_raw
# Build date from year, month and mday, then parse datetime_string
akl_hourly <- akl_hourly_raw %>%
  mutate(date = make_date(year = year, month = month, day = mday),
         datetime = ymd_hms(datetime_string))
# Plot temperature over time
ggplot(akl_hourly, aes(x = datetime, y = temperature)) +
  geom_line()
10/07/2020
Weather in Auckland
08/07/2020
Dates
Ex.
Instructions.
● Use read_csv() to import cran-logs_2015-04-17.csv.
● Print logs to see the information we have on each download.
● Store the R 3.2.0 release time as a POSIXct object.
● Find out when the first request for 3.2.0 was made by filtering for values in the
datetime column that are greater than release_time.
● Finally see how downloads increase by creating histograms of download time
for 3.2.0 and the previous version 3.1.3. We've provided most of the code, you
just need to specify the x aesthetic to be the datetime column.
Answer.
# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("cran-logs_2015-04-17.csv")
# Print logs
logs
The only tricky thing is that times will be interpreted in local time based on your
machine's set up. You can check your timezone with Sys.timezone(). If you want the
time to be interpreted in a different timezone, you just set the tz argument of
as.POSIXct(). You'll learn more about time zones in Chapter 4.
In this exercise you'll input a couple of datetimes by hand and then see that
read_csv() also handles datetimes automatically in a lot of cases.
Answer.
# Use as.POSIXct to enter the datetime
as.POSIXct("2010-10-01 12:12:00")
Console.
> # Use as.POSIXct to enter the datetime
> as.POSIXct("2010-10-01 12:12:00")
[1] "2010-10-01 12:12:00 UTC"
Ex. Console.
> # Find the largest date
> last_release_date <- max(releases$date)
Ex.
library(ggplot2)
Ex.
Generalizations
Now that you've done all the steps necessary to make our mosaic plot, you can wrap all the
steps into a single function that we can use to examine any two variables of interest in our
data frame (or in any other data frame for that matter). For example, we can use it to
examine the Vocab data frame we saw earlier in this course.
You've seen all the code in our function, so there shouldn't be anything surprising there.
Notice that the function takes multiple arguments, such as the data frame of interest and the
variables that you want to create the mosaic plot for. None of the arguments have default
values, so you'll have to specify all three if you want the mosaicGG() function to work.
Start by going through the code and see if you understand the function's implementation.
# Load all packages
library(ggplot2)
library(reshape2)
library(dplyr)
library(ggthemes)
Instructions:
● Print mosaicGG and read its contents.
● Calling mosaicGG(adult, "SRAGE_P","RBMI") will result in the plot you've been
working on so far. Try this out. This gives you a mosaic plot where BMI is
described by age.
● Test out another combination of variables in the adult data frame: Poverty
(POVLL) described by Age (SRAGE_P).
● Try the function on other datasets we've worked with throughout this course:
● mtcars dataset: am described by cyl
● Vocab dataset: vocabulary described by education.
Answer:
# Script generalized into a function
mosaicGG
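Calling the function might then look like this; the argument order (data, then x variable, then fill variable) is an assumption based on the instructions above:

```r
# BMI described by age
mosaicGG(adult, "SRAGE_P", "RBMI")

# Poverty described by age
mosaicGG(adult, "SRAGE_P", "POVLL")

# mtcars dataset: am described by cyl
mosaicGG(mtcars, "cyl", "am")

# Vocab dataset: vocabulary described by education
mosaicGG(Vocab, "education", "vocabulary")
```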
Adding text
Since we're not coloring according to BMI, we have to add group (and x axis) labels
manually. Our goal is the plot in the viewer.
For this we'll use the label aesthetic inside geom_text(). The actual labels are found in
the FILL (BMI category) and X (age) columns in the DF_all data frame. (Additional
attributes have been set inside geom_text() in the exercise for you).
The labels will be added to the right (BMI category) and top (age) inner edges of the
plot. (We could have also added margin text, but that is a more advanced topic that
we'll encounter in the third course. This will be a suitable solution for the moment.)
The first two commands show how we got the four positions for the y axis labels.
First, we got the position of the maximum xmax values, i.e. at the very right end,
stored as index. We want to calculate the half difference between each pair of ymax
and ymin (e.g. (ymax - ymin)/2) at these index positions, then add this value to the ymin
value. These positions are stored in the variable yposn.
We'll begin with the plot thus far, stored as object p. In the sample code, %+% DF_all
refreshes the plot's dataset with the extra columns.
# Plot so far
p
p1
Adding statistics
In the previous exercise we generated a plot where each individual bar was plotted
separately using rectangles (shown in the viewer). This means we have access to each
piece and we can apply different fill parameters.
So let's make some new parameters. To get the Pearson residuals, we'll use the
chisq.test() function.
The data frames adult and DF_melted, as well as the object BMI_fill that you created
throughout this chapter, are all still available. The reshape2 package is already loaded.
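A sketch of computing the residuals, assuming the adult columns used throughout the chapter:

```r
# Chi-squared test on the age-by-BMI contingency table
results <- chisq.test(table(adult$SRAGE_P, adult$RBMI))

# Pearson residuals, melted to long format for plotting
resid <- melt(results$residuals)
```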
Ex.
Marimekko/Mosaic Plot
In the previous exercise we looked at different ways of showing the frequency
distribution within each BMI category. This is all well and good, but the absolute
number of each age group also influences whether we consider something
over-represented or not. Here, we will proceed to change the widths of the bars to
show us something about the n in each group.
The clean adult dataset, as well as BMI_fill, are already available. Instead of running
apply() like in the previous exercise, the contingency table has already been
transformed to a data frame using as.data.frame.matrix().