
Intro to Geospatial Data and Maps in R

Julie Wood and Libby Benson


September, 2017

Overview
To try this tutorial out yourself, you will need a copy of R installed on your computer, with the latest
version available at http://cran.r-project.org/. We also recommend using RStudio, available
at http://www.rstudio.com.
Here are some other helpful resources we came across while putting together this tutorial:

 https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf
 https://www.zevross.com/blog/2016/01/13/tips-for-reading-spatial-files-into-r-with-rgdal/
 http://zevross.com/blog/2015/10/14/manipulating-and-mapping-us-census-data-in-r-using-the-acs-tigris-and-leaflet-packages-3/
 https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf

The goals of this tutorial are to provide introductory skills on how to read in, manipulate/edit, and plot
geospatial data in R. Geospatial data are used in diverse fields such as history, biology,
business, tech, public health, sociology, psychology, and many more.
Important considerations for using geospatial data include defining:

 topic (e.g., frequency of flu cases in Pennsylvania by county)


 unit of analysis (e.g., individuals, families, households, institutions, products)
 geographic unit (e.g., cities, counties, states, countries, continents)
 time unit (e.g., minutes, seconds, hours, days, months, years, decades)

We will present examples of geospatial data at multiple levels of analysis including county-level flu
data within Pennsylvania, as well as individual-level data for a single person on a single day.

Outline
 Section 0: Load Packages & Set Working Directory
 Section 1: Reading in Geospatial Data
 Section 2: Basic Geospatial Data Formatting
 Section 3: Fetching a Map
 Section 4: Plotting on a Static Map
 Section 5: Plotting Dynamic Maps
Section 0: Load Packages & Set Working Directory
Check whether all packages installed (install them if not), and then load all packages:

list.of.packages <- c("maps","mapdata","maptools","rgdal","ggmap","leaflet","tigris",
                      "sp","ggplot2","plyr","animation","gridExtra","psych","rstudioapi",
                      "data.table")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

Note: We don’t use all of these packages in the tutorial, but these were the most common packages
we came across while writing the tutorial, and therefore may be helpful to dig into in your own work :)

library(maps)
library(mapdata)
library(maptools)
library(rgdal)
library(ggmap)
library(leaflet)
library(tigris)
library(sp)
library(ggplot2)
library(plyr)
library(animation)
library(gridExtra)
library(psych)
library(rstudioapi)
library(data.table)

Set the working directory to the source file location (or choose a different directory):

current_path <- getSourceEditorContext()$path
setwd(dirname(current_path))
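
Note that getSourceEditorContext() only works when the script is open in the RStudio source editor. If you are not using RStudio, a minimal alternative is to set the working directory by hand; the path below is just a placeholder:

# Manual alternative (placeholder path -- replace with the folder holding the tutorial files)
# setwd("~/Documents/geospatial_tutorial")
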
Section 1: Reading in Geospatial Data
Below are data from a single individual, collected during the Women’s March on the National Mall in
Washington, DC, using a passive data collection app called “Moves” (https://moves-app.com/).
The variables in the dataframe are:

 Date
 Name
 Start
 End
 Duration
 Latitude
 Longitude

These geospatial data are formatted as a standard .csv. Other spatial data formats include GPX,
ESRI Shapefile, GeoJSON, and many more. Using the rgdal package, we can take a look at all the
different drivers (i.e., spatial data formats) that can be used:

ogrDrivers()$name
## [1] "AeronavFAA" "ARCGEN" "AVCBin" "AVCE00"
## [5] "BNA" "CSV" "DGN" "DXF"
## [9] "EDIGEO" "ESRI Shapefile" "Geoconcept" "GeoJSON"
## [13] "Geomedia" "GeoRSS" "GML" "GPKG"
## [17] "GPSBabel" "GPSTrackMaker" "GPX" "HTF"
## [21] "Idrisi" "JML" "KML" "MapInfo File"
## [25] "Memory" "MSSQLSpatial" "ODBC" "ODS"
## [29] "OGR_GMT" "OGR_PDS" "OGR_SDTS" "OGR_VRT"
## [33] "OpenAir" "OpenFileGDB" "OSM" "PCIDSK"
## [37] "PDF" "PGDUMP" "PGeo" "REC"
## [41] "S57" "SEGUKOOA" "SEGY" "Selafin"
## [45] "SQLite" "SUA" "SVG" "SXF"
## [49] "TIGER" "UK .NTF" "VFK" "Walk"
## [53] "WAsP" "XLSX" "XPlane"
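
For example, if your data arrived as an ESRI Shapefile rather than a .csv, it could be read with readOGR() from the rgdal package. This is a sketch with a hypothetical file name: dsn points to the folder and layer is the file name without the .shp extension.

# Hypothetical example: read a shapefile named "pa_counties.shp" from the working directory
# (substitute your own shapefile name)
# pa_shp <- readOGR(dsn = ".", layer = "pa_counties")
# summary(pa_shp)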

Further, there are many places on the internet, as well as other R packages, where geospatial data
can be obtained. For example, the cdcfluview package allows you to retrieve US flu season data
from the CDC. The US Census Bureau also has a wealth of data available for download
(https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml).
For now, we will read in the moves data:

moves <- read.csv("MovesData_WM.csv",header=TRUE)


We will also look at another dataset pertaining to the number of flu cases by county in Pennsylvania,
retrieved from:
http://www.health.pa.gov/My%20Health/Diseases%20and%20Conditions/I-L/Pages/20162017-Influenza-Season.aspx#.WbiFPMiGPIU

fludat <- read.csv("PA_fludata_2017.csv", header=TRUE)
fludat$County <- tolower(fludat$County)
head(fludat)
## County FluCases
## 1 adams 1623
## 2 allegheny 5001
## 3 armstrong 312
## 4 beaver 1234
## 5 bedford 234
## 6 berks 3031
str(fludat)
## 'data.frame': 67 obs. of 2 variables:
## $ County : chr "adams" "allegheny" "armstrong" "beaver" ...
## $ FluCases: int 1623 5001 312 1234 234 3031 1036 815 2227 921 ...

Section 2: Basic Geospatial Data Formatting


For some research questions, the geospatial data may have an indicator of duration. In
the moves data, there is a duration variable indicating the number of seconds that were spent in that
location.

head(moves)
## Date Name
## 1 1/21/2017 Place in Penn Quarter, Washington
## 2 1/21/2017 Place in Federal Triangle, Washington
## 3 1/21/2017 Place in The Mall, Washington
## 4 1/21/2017 Place in The Mall, Washington
## 5 1/21/2017 Place in The Mall, Washington
## 6 1/21/2017 Place in The Mall, Washington
## Start End Duration Latitude
## 1 2017-01-21T08:49:27-05:00 2017-01-21T08:55:14-05:00 347 38.89821
## 2 2017-01-21T09:09:41-05:00 2017-01-21T09:11:41-05:00 120 38.89372
## 3 2017-01-21T09:22:10-05:00 2017-01-21T09:29:11-05:00 421 38.89126
## 4 2017-01-21T09:34:49-05:00 2017-01-21T09:50:34-05:00 945 38.88987
## 5 2017-01-21T09:50:35-05:00 2017-01-21T10:03:02-05:00 747 38.89070
## 6 2017-01-21T10:03:03-05:00 2017-01-21T11:28:17-05:00 5114 38.88939
## Longitude
## 1 -77.02807
## 2 -77.02371
## 3 -77.01739
## 4 -77.01638
## 5 -77.01578
## 6 -77.01737

In the event that there is not a duration variable, we would first need to format the start and end
times into date-time objects, and then we could calculate a duration variable ourselves:

# First remove the T between the date and time
moves$Start <- gsub("T", " ", moves$Start)
moves$End <- gsub("T", " ", moves$End)

# Also remove the -05:00 UTC offset from the date and time
moves$Start <- gsub("-05:00", "", moves$Start)
moves$End <- gsub("-05:00", "", moves$End)

# Format as date-time (POSIXct) objects
moves$Start <- as.POSIXct(x = moves$Start, format = "%Y-%m-%d %H:%M:%S", tz = "EST")
moves$End <- as.POSIXct(x = moves$End, format = "%Y-%m-%d %H:%M:%S", tz = "EST")

# Create duration variables in seconds, minutes, and hours
moves$duration_sec <- difftime(time1 = moves$End, time2 = moves$Start, units = "secs")
moves$duration_min <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "mins"), 0)
moves$duration_hours <- round(difftime(time1 = moves$End, time2 = moves$Start, units = "hours"), 2)
# Look at first and last few rows
head(moves)
## Date Name Start
## 1 1/21/2017 Place in Penn Quarter, Washington 2017-01-21 08:49:27
## 2 1/21/2017 Place in Federal Triangle, Washington 2017-01-21 09:09:41
## 3 1/21/2017 Place in The Mall, Washington 2017-01-21 09:22:10
## 4 1/21/2017 Place in The Mall, Washington 2017-01-21 09:34:49
## 5 1/21/2017 Place in The Mall, Washington 2017-01-21 09:50:35
## 6 1/21/2017 Place in The Mall, Washington 2017-01-21 10:03:03
## End Duration Latitude Longitude duration_sec
## 1 2017-01-21 08:55:14 347 38.89821 -77.02807 347 secs
## 2 2017-01-21 09:11:41 120 38.89372 -77.02371 120 secs
## 3 2017-01-21 09:29:11 421 38.89126 -77.01739 421 secs
## 4 2017-01-21 09:50:34 945 38.88987 -77.01638 945 secs
## 5 2017-01-21 10:03:02 747 38.89070 -77.01578 747 secs
## 6 2017-01-21 11:28:17 5114 38.88939 -77.01737 5114 secs
## duration_min duration_hours
## 1 6 mins 0.10 hours
## 2 2 mins 0.03 hours
## 3 7 mins 0.12 hours
## 4 16 mins 0.26 hours
## 5 12 mins 0.21 hours
## 6 85 mins 1.42 hours
tail(moves)
## Date Name Start
## 17 1/21/2017 Place in The Mall, Washington 2017-01-21 15:50:11
## 18 1/21/2017 Place in The Mall, Washington 2017-01-21 16:11:43
## 19 1/21/2017 Place in The Mall, Washington 2017-01-21 16:37:16
## 20 1/21/2017 Place in The Mall, Washington 2017-01-21 17:06:04
## 21 1/21/2017 Place in Penn Quarter, Washington 2017-01-21 17:37:22
## 22 1/21/2017 Place in The Mall, Washington 2017-01-21 18:38:26
## End Duration Latitude Longitude duration_sec
## 17 2017-01-21 16:11:42 1291 38.88991 -77.03206 1291 secs
## 18 2017-01-21 16:26:47 904 38.89191 -77.03274 904 secs
## 19 2017-01-21 16:57:56 1240 38.89378 -77.03630 1240 secs
## 20 2017-01-21 17:31:01 1497 38.89530 -77.03533 1497 secs
## 21 2017-01-21 18:21:55 2673 38.89640 -77.03189 2673 secs
## 22 2017-01-21 18:50:17 711 38.88868 -77.03406 711 secs
## duration_min duration_hours
## 17 22 mins 0.36 hours
## 18 15 mins 0.25 hours
## 19 21 mins 0.34 hours
## 20 25 mins 0.42 hours
## 21 45 mins 0.74 hours
## 22 12 mins 0.20 hours

Another possibility is to format the dataset so that the rows are on a more uniform time metric, such
as seconds, minutes, or hours. Below we demonstrate how to transform the moves data given above
into a dataset that can be indexed by minutes or hours:

# Create a sequential observation variable
moves$obs <- seq(1:length(moves$Date))

# Create expanded data frame: one row per minute spent at each location
moves2 <- data.frame(rep(moves$obs, moves$duration_min),
                     rep(moves$Latitude, moves$duration_min),
                     rep(moves$Longitude, moves$duration_min))
names(moves2) <- c("obs","latitude","longitude")

# Re-insert and expand the minutes variable into a sequential indicator for each "obs"
moves3 <- ddply(moves2, "obs", mutate, minutes = seq(1:length(obs)))

# Divide the sequential minutes variable by 10 (for plot scaling purposes)
moves3$minutes_scaled <- moves3$minutes/10

# Check the range of the scaled minutes variable
describe(moves3$minutes_scaled) # range = 0.1:8.5
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 487 2.21 2.14 1.3 1.86 1.33 0.1 8.5 8.4 1.24 0.52 0.1
# For later plotting purposes, add in a ghost row (all zeros, with the maximum scaled minutes) to help with size scaling
moves3 <- rbind(c(0,0,0,85,8.5),moves3)

# Create a numbered sequence variable for each row
moves3$seq_num <- seq(1:length(moves3$obs))

# Look at first and last few rows
head(moves3)
## obs latitude longitude minutes minutes_scaled seq_num
## 1 0 0.00000 0.00000 85 8.5 1
## 2 1 38.89821 -77.02807 1 0.1 2
## 3 1 38.89821 -77.02807 2 0.2 3
## 4 1 38.89821 -77.02807 3 0.3 4
## 5 1 38.89821 -77.02807 4 0.4 5
## 6 1 38.89821 -77.02807 5 0.5 6
tail(moves3)
## obs latitude longitude minutes minutes_scaled seq_num
## 483 22 38.88868 -77.03406 7 0.7 483
## 484 22 38.88868 -77.03406 8 0.8 484
## 485 22 38.88868 -77.03406 9 0.9 485
## 486 22 38.88868 -77.03406 10 1.0 486
## 487 22 38.88868 -77.03406 11 1.1 487
## 488 22 38.88868 -77.03406 12 1.2 488
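
As a quick sanity check (not in the original analysis), the expanded data frame should contain one row per minute plus the ghost row, so its row count should equal the total minutes in moves plus one:

# One row per minute at each location, plus the ghost row
nrow(moves3)                              # 488
sum(as.numeric(moves$duration_min)) + 1   # should also be 488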

Section 3: Fetching a Map


ggplot2/maps Package
We can import data from the maps package into a data frame that is easily usable with ggplot.
Here, we use map_data to pull county information for Pennsylvania:

PAcounties <- map_data("county", region = "pennsylvania")
head(PAcounties)
## long lat group order region subregion
## 1 -77.44670 39.96954 1 1 pennsylvania adams
## 2 -77.42952 39.98672 1 2 pennsylvania adams
## 3 -77.37222 40.00391 1 3 pennsylvania adams
## 4 -77.32065 40.01537 1 4 pennsylvania adams
## 5 -77.23471 40.02683 1 5 pennsylvania adams
## 6 -77.18887 40.03256 1 6 pennsylvania adams

PAcounties contains the following information:

 long: longitude
 lat: latitude
 group: a grouping variable for different regions, which is especially useful for plotting adjacent points (within a group)
 order: each element within a group is assigned an order number, which tells ggplot the order in which to connect the dots within groups
 region: describes what each group is (e.g., state name)
 subregion: describes what each sub-group is (e.g., counties in PA)
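
To see how group and order drive the drawing, here is a quick sketch (not part of the original tutorial) that plots just the bare county outlines, before any flu data are attached:

# Draw only the county outlines; each polygon is traced by connecting the
# points within a group in their given order
ggplot(PAcounties, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = NA, color = "grey40") +
  coord_map()
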
Next, we need to use the information we pulled from map_data to tie the county information
in fludat to geographic signifiers (latitude and longitude).

# Set up data: flu cases, county, state names
flu_map <- data.frame(state_names = unique(PAcounties$region),
                      county_names = unique(PAcounties$subregion),
                      flucases = fludat$FluCases)
# Use a key to match flu_map to county latitude and longitude
library(data.table)
PAcounties <- data.table(PAcounties)
setkey(PAcounties,region,subregion)
flu_map <- data.table(flu_map)
setkey(flu_map,state_names,county_names)
map.df <- PAcounties[flu_map]
head(map.df)
## long lat group order region subregion flucases
## 1: -77.44670 39.96954 1 1 pennsylvania adams 1623
## 2: -77.42952 39.98672 1 2 pennsylvania adams 1623
## 3: -77.37222 40.00391 1 3 pennsylvania adams 1623
## 4: -77.32065 40.01537 1 4 pennsylvania adams 1623
## 5: -77.23471 40.02683 1 5 pennsylvania adams 1623
## 6: -77.18887 40.03256 1 6 pennsylvania adams 1623
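
A quick optional check: if every county name matched during the key-based join, there should be no missing flu counts in map.df.

# Any NA values here would indicate county names that failed to match
sum(is.na(map.df$flucases))
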
Now, we have latitude, longitude, county names, region (state), and flu cases. We’re ready to map!
Use ggplot, grouping at the county level, to fill in our flu case data:

ggplot(map.df, aes(x = long, y = lat, group = subregion, fill = flucases)) +
  geom_polygon(color="white") +
  coord_map()
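
As an optional variation (not in the original tutorial), a continuous fill scale with explicit endpoints can make differences between counties easier to see:

# Same map with a light-to-dark fill scale and a labeled legend
ggplot(map.df, aes(x = long, y = lat, group = subregion, fill = flucases)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred", name = "Flu cases") +
  coord_map()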

ggmap Package
The get_map() function in the ggmap package is a wrapper that can query the Google Maps, OpenStreetMap,
Stamen Maps, or Naver Map servers.

 For the location argument, you can input an address, a longitude/latitude pair, or a
left/bottom/right/top bounding box (a bounding-box sketch appears at the end of this section).
 For the zoom argument, you input an integer, where 3 is at the continent level and 21 is at the
building level.
 For the maptype argument, you can specify a character string such as terrain,
terrain-background, satellite, roadmap, hybrid, toner, watercolor, terrain-labels, and so forth
(see ?get_map).

?ggmap
map1 <- get_map(location = "State College",
maptype = "terrain",
source = "google",
crop = FALSE,
zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=State+
College&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=fa
lse
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?ad
dress=State%20College&sensor=false
plot_map1 <- ggmap(map1) +
xlab("Longitude") + ylab("Latitude") +
theme(legend.position="none")
plot_map1

map2 <- get_map(location = "Pennsylvania",
                maptype = "watercolor",
                source = "google",
                crop = FALSE,
                zoom = 5)

plot_map2 <- ggmap(map2) +
  xlab("Longitude") + ylab("Latitude") +
  theme(legend.position="none")
plot_map2
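
The bounding-box option mentioned above is not shown in the examples; here is a hedged sketch using approximate coordinates around the National Mall, with the stamen source (which works directly with bounding boxes):

# Approximate left/bottom/right/top box around the National Mall (rough values)
mall_bbox <- c(left = -77.06, bottom = 38.87, right = -77.00, top = 38.92)
map_bbox <- get_map(location = mall_bbox, source = "stamen", maptype = "toner-lite")
ggmap(map_bbox) + xlab("Longitude") + ylab("Latitude")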

Section 4: Plotting on a Static Map


First, let’s plot the Women’s March data on a map of the National Mall, fetched using get_map() as in
the previous section.

summary(moves3[,c("longitude","latitude")])
## longitude latitude
## Min. :-77.04 Min. : 0.00
## 1st Qu.:-77.03 1st Qu.:38.89
## Median :-77.02 Median :38.89
## Mean :-76.87 Mean :38.81
## 3rd Qu.:-77.02 3rd Qu.:38.89
## Max. : 0.00 Max. :38.90
map3 <- get_map(location = c(-77.025,38.895),
maptype = "terrain",
source = "google",
crop = FALSE,
zoom = 15)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=38.895
,-77.025&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=f
alse
plot_map3 <- ggmap(map3) +
  geom_point(data = moves, aes(x = Longitude, y = Latitude)) +
  geom_path(data = moves, aes(x = Longitude, y = Latitude), size = 1, lineend = "round") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Plot of Women's March Time Series Data") +
  theme(legend.position="none",
        plot.title = element_text(hjust = 0.5))

plot_map3
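
As a variation (not part of the original tutorial), the point size can be mapped to the minutes spent at each stop so that longer stops stand out:

# Scale point size by minutes spent at each location (duration_min is a difftime,
# so convert it to numeric for the size aesthetic)
plot_map3_sized <- ggmap(map3) +
  geom_path(data = moves, aes(x = Longitude, y = Latitude), size = 1, lineend = "round") +
  geom_point(data = moves, aes(x = Longitude, y = Latitude, size = as.numeric(duration_min)),
             color = "darkblue", alpha = 0.7) +
  xlab("Longitude") + ylab("Latitude") +
  theme(legend.position = "none")
plot_map3_sized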

Section 5: Plotting Dynamic Maps


To create the dynamic map, we use the saveGIF() function in the animation package. The GIF
displayed below plots each stop in the march with a point. To visualize the amount of time spent at
each location, the dots grow in proportion to the duration spent at that location relative to other
locations. The movement is depicted by black lines connecting each location. The “growing dots”
depicting duration are achieved by subsetting the data iteratively: the first plot in the GIF uses only
the first row of data, the second plot uses the first two rows, and so on, up to the 488th plot, which
uses all 488 rows. Due to knit time (this particular GIF takes ~10 minutes to render), the code chunk
below is not evaluated, but you are welcome to download the data and run this code on your own
machine to see the resulting GIF!

# Create color palette
set.seed(1989)
palette = sample(rainbow(length(unique(moves3$obs))))

# Create gif
saveGIF({
  seq_list <- unique(moves3$seq_num) # sets how many plots will be tied together (equal to number of rows)
  # looping through plots
  for(i in 1:length(seq_list)){
    print(paste(c("Now at", seq_list[i])))
    dat <- subset(moves3, seq_num <= seq_list[i])
    print(
      ggmap(map3) + # map3 is the National Mall map fetched in Section 4
        scale_size(range=c(0,8.5)) +
        scale_colour_manual(values=palette) +
        geom_point(data = dat, aes(x = longitude, y = latitude, size=minutes_scaled, colour=factor(obs))) +
        geom_path(data = dat, aes(x = longitude, y = latitude), size = .5, lineend = "round") +
        xlab("Longitude") + ylab("Latitude") +
        theme(legend.position="none")
    )
  }}, movie.name="moves2.gif",
  interval = .001,
  ani.width = 600,
  ani.height = 600)
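
If the ~10-minute render time is a problem, one option (a sketch using the same variable names as above, also not evaluated here) is to thin the sequence and plot only every 10th minute-level frame:

# Keep every 10th frame to roughly cut rendering time by a factor of ten
seq_list_thin <- unique(moves3$seq_num)[c(TRUE, rep(FALSE, 9))]
saveGIF({
  for(i in 1:length(seq_list_thin)){
    dat <- subset(moves3, seq_num <= seq_list_thin[i])
    print(
      ggmap(map3) +
        scale_size(range=c(0,8.5)) +
        scale_colour_manual(values=palette) +
        geom_point(data = dat, aes(x = longitude, y = latitude, size=minutes_scaled, colour=factor(obs))) +
        geom_path(data = dat, aes(x = longitude, y = latitude), size = .5, lineend = "round") +
        theme(legend.position="none")
    )
  }}, movie.name="moves_quick.gif",
  interval = .1,
  ani.width = 600,
  ani.height = 600)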

Questions, Suggestions, Sharing


If you have questions or suggestions on material to add to this tutorial, please contact Libby
at leb237@psu.edu. We are happy for this tutorial to be shared, but rather than sharing it directly,
please link to its page on the QuantDev
website: https://quantdev.ssri.psu.edu/tutorials/intro-geospatial-data-and-maps-r
There are also many other cool tutorials provided by the QuantDev group
at: https://quantdev.ssri.psu.edu/tutorials/
