Download as pdf or txt
Download as pdf or txt
You are on page 1of 2

Exercises - 1

1.
Use the flights dataset in package nycflights13 for the following exercises.
1. Find all flights that:
a. Had an arrival delay of two or more hours
b. Flew to Houston (IAH or HOU)
c. Were operated by United, American, or Delta
d. Departed in summer (July, August, and September)
e. Arrived more than two hours late, but didn’t leave late
f. Were delayed by at least an hour, but made up over 30minutes in flight
g. Departed between midnight and 6 a.m. (inclusive)
2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to
simplify the code needed to answer the previous challenges?
3. How many flights have a missing dep_time? What other variables are missing? What
might these rows represent?
4. How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
5. Sort flights to find the most delayed flights. Find the flights that left earliest.
6. Sort flights to find the fastest flights.
7. Which flights traveled the longest? Which traveled the shortest?
8. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute
with because they’re not really continuous numbers. Convert them to a more
convenient representation of number of minutes since midnight.
9. Find the 10 most delayed flights using a ranking function. How do you want to handle
ties? Carefully read the documentation for min_rank().
10. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of
cancelled flights related to the average delay?
11. For each plane, count the number of flights before the first delay of greater than 1 hour.

2
Use the flights, planes, airports, airlines and weather datasets in package nycflights13 for the
following exercises.
1. Add a surrogate key to flights.
2. Compute the average delay by destination, then join on the airports data frame so you
can show the spatial distribution of delays. Here’s an easy way to draw a map of the
United States:
airports%>%
semi_join(flights,c("faa"="dest"))%>%
ggplot(aes(lon,lat))+borders("state")+
geom_point()+coord_quickmap()
(Don’t worry if you don’t understand what semi_join() does—you’ll learn about it
next.)You might want to use the size or color of the points to display the average delay
for each airport.
3. Add the location of the origin and destination (i.e., the lat and lon) to flights.
4. Is there a relationship between the age of a plane and its delays?
5. What weather conditions make it more likely to see a delay?
6. What does it mean for a flight to have a missing tailnum? What do the tail numbers that
don’t have a matching record in planeshave in common? (Hint: one variable explains
approximately 90% of the problems.)
7. Filter flights to only show flights with planes that have flown atleast 100 flights.
8. Combine fueleconomy::vehicles and fueleconomy::common to find only the records for
the most common models.
9. You might expect that there’s an implicit relationship between plane and airline,
because each plane is flown by a single airline. Confirm or reject this hypothesis using
the tools you’ve learned in the preceding section.

You might also like