Week 6

Introduction to Data Science
STAT240
Dr. David C. Stenning

02/14/2024 - Week 6
Announcements
Lab assignments are being marked as fast as we can manage.
(You should have received your marks for Labs 1 and 2, with 3 to
come soon and 4 prior to the midterm.) In the meantime,
solutions are posted on Canvas that can be used for
comparison.
A sample midterm (taken from the STAT 240 Spring 2023
midterm exam) is posted on Canvas. The instructions page will
be very similar to the one you will receive during your midterm.
The final instructions page will be sent via a Canvas
announcement before Mon Feb 27, 2024.
Midterm exam
In-tutorial midterm exam 02/26, 02/27, & 02/28, covering
material through Week 6. It is to be done individually during your
scheduled tutorial section.
There will be no lecture on 02/28, to give you a bit of a break
after taking the exam.
You will use a lab computer to complete the midterm. IDs will be
checked, so please bring your SFU ID and a matching government
ID for comparison.
The exam is closed book. You cannot use the internet during the
exam, except to upload the exam to Crowdmark.
You can use RStudio's "help" tab. You cannot use previous labs
or slides or notes.
If you're required to use functions or code such as SQL queries
(besides basic commands such as mean or max), a description
of the function or code will be provided (see Sample Midterm).
Linear regression: Theory
Why do we care about the sum of squared errors when we fit a
line?
One reason is statistical. Imagine there was a "true" value of α
and β and that the Yi values were generated by the following
process:
Set Zi = βXi + α
Set Yi = Zi + εi, where ε1, …, εn are independent random
variables following a normal distribution with mean 0 and
standard deviation σ
Under this generation process, the OLS solution α̂, β̂ is the
maximum likelihood estimate for α and β
In nature, many variables can be modeled well as independent
and normally distributed random variables
Assume we have input X = (X1, …, Xn) and output
Y = (Y1, …, Yn) (example: Xi is month index, Yi is demand
for product A)
We assume that the "true" demand is f(Xi) = βXi + α
The "observed" demand is a noisy version of the "true" demand:
Yi = βXi + α + εi
Here εi ∼ N(0, σ) ⇒ Yi − f(Xi) ∼ N(0, σ)
In regression, the quantities εi = Yi − f(Xi) are called
residuals
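The log likelihood of parameters α, β given X and Y is:
$$
\log \prod_{i=1}^{n} \Pr(Y_i \mid X_i, \alpha, \beta)
= \sum_{i=1}^{n} \log\left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{f(X_i) - Y_i}{\sigma} \right)^{2} \right) \right]
= -\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left( f(X_i) - Y_i \right)^{2} + K
$$
where K = −n log(σ√(2π)) does not depend on α or β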
The log likelihood is maximized when the sum of squared errors
is minimized
Therefore, finding regression parameters that minimize the sum
of squared errors is the same as assuming the residuals are
normally distributed (with equal standard deviation) and finding
the maximum likelihood solution for the parameters
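The equivalence can be checked numerically. The following is a minimal sketch and not part of the original slides: it simulates data from the process described above, maximizes the normal log likelihood with optim, and compares the result with lm. The seed, sample size, parameter values and the helper name neg_log_lik are all assumptions chosen for illustration.

```r
set.seed(1)                      # arbitrary seed
n = 50
x = seq(1, 10, length.out = n)
y = 1.5 * x + 2 + rnorm(n)       # assumed values: beta = 1.5, alpha = 2, sigma = 1

# Negative log likelihood of par = c(alpha, beta) under normal errors.
# sigma is held at 1: it does not affect where the optimum in alpha, beta lies.
neg_log_lik = function(par) {
  -sum(dnorm(y, mean = par[2] * x + par[1], sd = 1, log = TRUE))
}

mle = optim(c(0, 0), neg_log_lik, method = "BFGS",
            control = list(reltol = 1e-10))
ols = coef(lm(y ~ x))

# The two rows should agree up to the optimizer's tolerance
print(rbind(mle = mle$par, ols = unname(ols)))
```

The numerical optimizer only approximates the maximum, so small differences in the last digits are expected.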
Linear regression: Example
Consider the following simulated dataset:
set.seed(20230212)
x = 1:5; alpha = 2; beta = 1
epsilon = rnorm(5)   # independent standard normal noise
y = beta * x + alpha + epsilon
par(mar = c(5,5,1,1))   # par(mar = ...) sets the plot margins
plot(x, y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
Linear regression: Example
We have the following OLS solution:
model = lm(y ~ x, data = data.frame(x = x, y = y))
alpha0 = summary(model)$coefficients[1,1]   # estimated intercept
beta0 = summary(model)$coefficients[2,1]    # estimated slope
par(mar = c(5,5,1,1))
plot(x, y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
abline(alpha0, beta0)   # add the fitted line to the scatter plot
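For a single predictor, the OLS solution also has a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the two means. A short check against lm, regenerating the simulated data from the example (the names alpha_hat and beta_hat are introduced here for illustration only):

```r
set.seed(20230212)
x = 1:5
y = 1 * x + 2 + rnorm(5)

# Closed-form least-squares estimates for one predictor
beta_hat  = cov(x, y) / var(x)
alpha_hat = mean(y) - beta_hat * mean(x)

model = lm(y ~ x)
print(c(alpha_hat, beta_hat))
print(unname(coef(model)))   # intercept first, then slope
```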
Data acquisition
To develop products, portfolios and projects using data science,
we need data!
Download data from public repositories
Use private data (e.g., from your company, or from a scientific
collaboration)
Use APIs (interact with an app or service)

Data acquisition: STAT240
UCI Machine Learning Repository
UNdata: A world of information
NASDAQ API
NASDAQ API
API: Application Programming Interface
Interact with an app, service or data source
Agnostic to programming language used (the language
"interfaces" with the service by calling commands)
Usually done over the internet
An application programming interface (API) is a way for
two or more computer programs to communicate with
each other
— Wikipedia
Example: NASDAQ
For example, in the NASDAQ API, a command is run by going to
a URL (uniform resource locator). The content of the "website"
at the URL is the requested data.
Question: What does rdiff mean here?

API documentation
Often, APIs exist so that the service is used (either for free, or
through subscription). So, they are often well documented
Intermission
Retrieving URLs in R
Let's use the NASDAQ API. This call gets the quarterly
percentage change in AAPL stock between 1985 and 1997,
closing prices only
library(httr)
# Query parameters: date range, ascending order, column 4 (closing
# price), quarterly sampling, and the "rdiff" transformation
url = paste0(
  'https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?',
  'start_date=1985-05-01&end_date=1997-07-01&order=asc',
  '&column_index=4&collapse=quarterly&transformation=rdiff'
)
data = GET(url)
print(data)
Response [https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?start_date=1985-05-01&end_date=1997-07-01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff]
Date: 2024-02-14 08:56
Status: 200
Content-Type: application/json; charset=utf-8
Size: 2.55 kB
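The query string in the call above is a list of name=value pairs joined by "&". As a small sketch using base R only, it can be assembled from a named list, which makes individual parameters easier to change. The variable names base, params and query are introduced here for illustration, and no request is sent.

```r
base = "https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json"
params = list(
  start_date     = "1985-05-01",
  end_date       = "1997-07-01",
  order          = "asc",
  column_index   = 4,
  collapse       = "quarterly",
  transformation = "rdiff"
)

# Join each name with its value, then join the pairs with "&"
query = paste(names(params), unlist(params), sep = "=", collapse = "&")
url = paste0(base, "?", query)
print(url)
```

None of these values contain spaces or reserved characters; values that do would need escaping first, for example with utils::URLencode(value, reserved = TRUE).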
We parse the JSON response, and explore the result to find out
how to index the data in the list
print(substr(as.character(data),1,50))
[1] "{\"dataset\":{\"id\":9775409,\"dataset_code\":\"AAPL\",\"da"
library(rjson)
parsed = fromJSON(as.character(data))
print(parsed$dataset$data[[1]])
[[1]]
[1] "1985-06-30"
[[2]]
[1] 18
Once we understand how to index into the list, we can convert to
a dataframe:
library(knitr)   # provides kable
n = length(parsed$dataset$data)
dates = rep("", n); values = rep(NA, n)
for (i in seq_len(n)) {
  dates[i] = parsed$dataset$data[[i]][[1]]    # first entry: date
  values[i] = parsed$dataset$data[[i]][[2]]   # second entry: value
}
df = data.frame(date = dates, value = values)
kable(df[1:3,]) # Show first 3 rows as a table in rendered output
date value
1985-06-30 18.00
1985-09-30 15.75
1985-12-31 22.00
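The same conversion can be written without an explicit index loop. The sketch below is one possible alternative, not the slides' method: it uses a hand-built list with the same shape as parsed$dataset$data (the three rows are copied from the table above) so that it runs without a network call, and the name rows is introduced for illustration.

```r
# Stand-in for parsed$dataset$data: a list of (date, value) pairs
rows = list(
  list("1985-06-30", 18.00),
  list("1985-09-30", 15.75),
  list("1985-12-31", 22.00)
)

# vapply checks that every extracted element has the declared type and length
dates  = vapply(rows, function(r) r[[1]], character(1))
values = vapply(rows, function(r) as.numeric(r[[2]]), numeric(1))

df = data.frame(date = as.Date(dates), value = values)
print(df)
```

Converting the date column with as.Date is optional, but it lets later code sort or plot by date directly.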
R packages for APIs
While APIs are programming language agnostic, many APIs
have libraries in a variety of languages (these may construct
URLs and parse results). NASDAQ API R package: Quandl
library(Quandl)
# This call gets the quarterly percentage change in AAPL stock
# between 1985 and 1997, closing prices only
result = Quandl("WIKI/AAPL", transformation = "rdiff",
                start_date = "1985-05-01", end_date = "1997-07-01",
                column_index = 4, order = "asc", collapse = "quarterly")
kable(result[1:3,])
Date Close
1985-06-30 18.00
1985-09-30 15.75
1985-12-31 22.00
Example: Google Maps
Example: Apple Pay
Example: Discord
Example: eBird
eBird is a database of bird sightings maintained by the Cornell
Lab of Ornithology at Cornell University, Ithaca
~70m complete "checklists" by citizen scientists (recreational
and professional), >1b bird observations
eBird has an API, and the API has an R package: rebird
eBird API
Working from the eBird manual, we develop code to explore bird
counts in the provinces and territories of Canada. Regions in the
eBird database are sorted by "subnational1" and "subnational2"
regions. For provinces/territories in Canada, we find (in the
manual) the function ebirdsubregionlist:
library(rebird)
# An API key must be obtained. Set API_KEY = "your key".
subregions = ebirdsubregionlist(regionType = "subnational1",
                                parentRegionCode = "CA", key = API_KEY)
New names:
• value -> value...5
eBird API
print(subregions)
# A tibble: 13 × 2
code name
<chr> <chr>
1 CA-AB Alberta
2 CA-BC British Columbia
3 CA-MB Manitoba
4 CA-NB New Brunswick
5 CA-NL Newfoundland and Labrador
6 CA-NT Northwest Territories
7 CA-NS Nova Scotia
8 CA-NU Nunavut
9 CA-ON Ontario
10 CA-PE Prince Edward Island
11 CA-QC Quebec
12 CA-SK Saskatchewan
13 CA-YT Yukon Territory
"Tibble" to vector
Converting "A tibble: 13 × 2" to a vector:
as.vector(subregions[,1])
# A tibble: 13 × 1
code
<chr>
1 CA-AB
2 CA-BC
3 CA-MB
4 CA-NB
5 CA-NL
6 CA-NT
7 CA-NS
8 CA-NU
9 CA-ON
10 CA-PE
11 CA-QC
12 CA-SK
13 CA-YT
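Note that the printed result above is still a one-column tibble rather than a plain vector. One way to obtain a character vector of region codes is to extract the column with [[ or $. In the sketch below a small data.frame stands in for the tibble so that the code runs without extra packages ([[ and $ return the column itself for tibbles as well), and the name sr is an assumption chosen to match the variable used in the loop later in these slides.

```r
# Stand-in for the subregions tibble (first three rows only)
subregions = data.frame(
  code = c("CA-AB", "CA-BC", "CA-MB"),
  name = c("Alberta", "British Columbia", "Manitoba"),
  stringsAsFactors = FALSE
)

sr = subregions[["code"]]    # equivalently: subregions$code
print(is.character(sr))
print(sr)
```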
eBird API: Species list
Again, from eBird API:
species = ebirdregionspecies("CA", key = API_KEY)
sp = as.vector(species)$speciesCode
print(sp[1:10])
[1] "emu1" "bbwduc" "fuwduc" "bahgoo" "empgoo" "snogoo" "rosgoo"
[8] "sxrgoo1" "gragoo" "swagoo1"
eBird API: By region
We now loop over species and regions. The endpoint
"obs/%s/recent/%s" (region code, then species code) returns
observations of a species in a region in the last 30 days. I didn't
see a matching function in the R library, so I call the API manually
n = length(sr); m = length(sp)
recent = matrix(NA, n, m)
for (j in 1:m) {
  for (i in 1:n) {
    if (is.na(recent[i, j])) {
      url = sprintf('https://api.ebird.org/v2/data/obs/%s/recent/%s',
                    sr[i], sp[j])
      result = content(GET(url, add_headers(
        'X-eBirdApiToken' = API_KEY
      )))
      ...
      for (k in 1:length(result)) {
        ...
        recent[i, j] = recent[i, j] + howMany
        ...
(Too long to show in full. We'll use a pre-processed version.)
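To illustrate the two main steps of the loop (building the URL and adding up counts) without calling the API, here is a small sketch. It is not the elided code: the list result is a hand-made stand-in for a parsed response, and the assumption that the howMany field may be missing from some records is mine.

```r
# Step 1: fill the URL template with a region code and a species code
url = sprintf('https://api.ebird.org/v2/data/obs/%s/recent/%s',
              "CA-BC", "houspa")
print(url)

# Step 2: add up the counts in a (hypothetical) parsed response,
# with one list element per observation
result = list(
  list(speciesCode = "houspa", howMany = 3),
  list(speciesCode = "houspa"),               # no count reported (assumed case)
  list(speciesCode = "houspa", howMany = 5)
)

total = 0
for (obs in result) {
  if (!is.null(obs$howMany)) {   # skip records without a count
    total = total + obs$howMany
  }
}
print(total)
```

Starting the running total at 0 rather than NA matters here, because NA plus any number is still NA in R.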

eBird API: Pre-processed dataset
ebird30 = read.csv("ebird30.csv")
print(dim(ebird30))
[1] 883 14
print(colnames(ebird30))
[1] "Code" "CA.AB" "CA.BC" "CA.MB" "CA.NB" "CA.NL" "CA.NT" "CA.NS" "CA.NU"
[10] "CA.ON" "CA.PE" "CA.QC" "CA.SK" "CA.YT"
names = read.csv("names.csv")
print(dim(names))
[1] 883 3
print(colnames(names))
[1] "Code" "Common.Name" "Scientific.Name"

Drawing a map of Canada
In this week's archive is a script canada.R with a function canada
that makes a heatmap for Canada:
source('canada.R')
heat = list("CA.MB" = 0, "CA.BC" = 1, "CA.AB" = 0, "CA.SK" = 0,
            "CA.ON" = 0.5, "CA.QC" = 0, "CA.NL" = 0, "CA.NS" = 0,
            "CA.NB" = 0, "CA.PE" = 0, "CA.NT" = 0, "CA.NU" = 0,
            "CA.YT" = 0)
title = 'Test'
Drawing a map of Canada
canada(title, heat)
Heatmaps for birds
This can be combined with the eBird data to provide data
visualizations:
heat = list()
rownames(ebird30) = ebird30$Code   # allow row lookup by species code
for (i in 2:14) {
  heat[[colnames(ebird30)[i]]] = ebird30["houspa", colnames(ebird30)[i]]
}
Question: Why is the index set above specified as i in 2:14?

Heatmaps for birds
canada("House sparrow count", heat)
Reading
Munzert Section 14.1
