
Introduction to Data Science

STAT240

Dr. David C. Stenning


02/14/2024 - Week 6
Announcements
Lab assignments are being marked as fast as we can manage.
(You should have received your marks for Labs 1 and 2, with 3 to
come soon and 4 prior to the midterm.) In the meantime,
solutions are posted on Canvas that can be used for
comparison.

A sample midterm (taken from the STAT 240 Spring 2023
midterm exam) is posted on Canvas. The instructions page will
be very similar to the one you will receive during your midterm.
The final instructions page will be sent via a Canvas
announcement before Mon Feb 27, 2024.
Midterm exam
In-tutorial midterm exam 02/26, 02/27, & 02/28, covering
material through Week 6. It is to be done individually during your
scheduled tutorial section.

There will be no lecture on 02/28, to give you a bit of a break
after taking the exam.

You will use a lab computer to complete the midterm. IDs will be
checked, so please bring your SFU ID and a matching government ID
for comparison.

The exam is closed book. You cannot use the internet during the
exam, except to upload the exam to Crowdmark.

You can use RStudio's "help" tab. You cannot use previous labs
or slides or notes.

If you're required to use functions or code such as SQL queries
(besides basic commands such as mean or max), a description
of the function or code will be provided (see Sample Midterm).
Linear regression: Theory
Why do we care about the sum of squared errors when we fit a
line?

One reason is statistical. Imagine there was a "true" value of α
and β and that the Y_i values were generated by the following
process:

Set Z_i = βX_i + α

Set Y_i = Z_i + ε_i, where ε_1, …, ε_n are independent
random variables following a normal distribution with mean
0 and standard deviation σ

Under this generation process, the OLS solution α̂, β̂ is the
maximum likelihood estimate for α and β

In nature, many variables can be modeled well as independent
and normally distributed random variables
Linear regression: Theory
Assume we have input X = (X_1, …, X_n) and output
Y = (Y_1, …, Y_n) (example: X_i is month index, Y_i is demand
for product A)

We assume that the "true" demand is f(X_i) = βX_i + α

The "observed" demand is a noisy version of the "true" demand:
Y_i = βX_i + α + ε_i

Here ε_i ∼ N(0, σ) ⇒ Y_i − f(X_i) ∼ N(0, σ)

In regression, the quantities ε_i = Y_i − f(X_i) are called
residuals
Linear regression: Theory
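The log likelihood of parameters α, β given X and Y is:

\[
\begin{aligned}
\log \prod_{i=1}^{n} \Pr(Y_i \mid X_i, \alpha, \beta)
&= \sum_{i=1}^{n} \log \left[ \frac{1}{\sigma\sqrt{2\pi}}
   \exp\left( -\frac{1}{2} \left( \frac{f(X_i) - Y_i}{\sigma} \right)^{2} \right) \right] \\
&\propto -\sum_{i=1}^{n} \left( f(X_i) - Y_i \right)^{2} + K
\end{aligned}
\]
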

The log likelihood is maximized when the sum of squared errors
is minimized

Therefore, finding regression parameters that minimize the sum
of squared errors is the same as assuming the residuals are
normally distributed (with equal standard deviation) and finding
the maximum likelihood solution for the parameters
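
As a rough numerical check of this argument, the sketch below simulates data from the assumed process, verifies that the normal log likelihood equals −SSE/(2σ²) plus a constant, and compares the OLS fit with a direct numerical maximization of the log likelihood. The sample size, the fixed value sigma = 1, and the helper names sse and loglik are assumptions made for illustration; they are not part of the course material.

```r
set.seed(1)
n = 20; x = 1:n; alpha = 2; beta = 1; sigma = 1
y = beta * x + alpha + rnorm(n, sd = sigma)

# Sum of squared errors and normal log likelihood for a candidate (a, b)
sse = function(a, b) sum((b * x + a - y)^2)
loglik = function(a, b) sum(dnorm(y, mean = b * x + a, sd = sigma, log = TRUE))

# The log likelihood equals -SSE / (2 sigma^2) plus a constant K
K = -n * log(sigma * sqrt(2 * pi))
identity_holds = isTRUE(all.equal(loglik(0.5, 1.5),
                                  -sse(0.5, 1.5) / (2 * sigma^2) + K))

# OLS estimates versus numerical maximum likelihood estimates
ols = unname(coef(lm(y ~ x)))
mle = optim(c(0, 0), function(p) -loglik(p[1], p[2]),
            method = "BFGS", control = list(reltol = 1e-12))$par

print(identity_holds)
print(rbind(ols = ols, mle = mle))
```

The two rows of the printed matrix should agree up to the tolerance of the optimizer, since both minimize the same sum of squared errors.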
Linear regression: Example
Consider the following simulated dataset:

set.seed(20230212); x = 1:5; alpha = 2; beta = 1
epsilon = rnorm(5)
y = beta * x + alpha + epsilon
par(mar = c(5,5,1,1))  # par(mar = ...) sets the plot margins
plot(x, y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
Linear regression: Example
We have the following OLS solution:

model = lm(y ~ x, data = data.frame(x = x, y = y))
alpha0 = summary(model)$coefficients[1,1]
beta0 = summary(model)$coefficients[2,1]
par(mar = c(5,5,1,1))
plot(x, y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
abline(alpha0, beta0)
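
For a single predictor, the OLS estimates also have a well-known closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The sketch below recomputes the simulated data with the same seed and checks the closed form against lm(). The names alpha_hat, beta_hat and fit are chosen here for illustration.

```r
set.seed(20230212); x = 1:5; alpha = 2; beta = 1
y = beta * x + alpha + rnorm(5)

# Closed-form OLS estimates for simple linear regression
beta_hat = cov(x, y) / var(x)
alpha_hat = mean(y) - beta_hat * mean(x)

# Compare with the coefficients reported by lm()
fit = coef(lm(y ~ x))
print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(fit)
```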
Data acquisition
To develop products, portfolios and projects using data science,
we need data!

Download data from public repositories

Use private data (e.g., from your company, or from a scientific
collaboration)

Use APIs (interact with an app or service)


Data acquisition: STAT240
UCI Machine Learning Repository
Data acquisition: STAT240
UNdata: A world of information
Data acquisition: STAT240
NASDAQ API
API: Application Programming Interface
Interact with an app, service or data source

Agnostic to programming language used (the language
"interfaces" with the service by calling commands)

Usually done over the internet

An application programming interface (API) is a way for
two or more computer programs to communicate with
each other

— Wikipedia
Example: NASDAQ
For example, in the NASDAQ API, a command is run by going to
a URL (uniform resource locator). The content of the "website"
at the URL is the requested data.

Question: What does rdiff mean here?


API documentation
Often, APIs exist so that the service is used (either for free, or
through subscription). So, they are often well documented
Intermission
Retrieving URLs in R
Let's use the NASDAQ API. This call gets the quarterly
percentage change in AAPL stock between 1985 and 1997,
closing prices only

library(httr)
# The URL is split across lines for readability and joined with paste0()
url = paste0('https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?',
             'start_date=1985-05-01&end_date=1997-07-01&order=asc',
             '&column_index=4&collapse=quarterly&transformation=rdiff')
data = GET(url)
print(data)

Response [https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?
start_date=1985-05-01&end_date=1997-07-
01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff]
Date: 2024-02-14 08:56
Status: 200
Content-Type: application/json; charset=utf-8
Size: 2.55 kB
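
The part of the URL after the "?" is a list of name=value pairs joined by "&". As a minimal sketch, the same URL can be assembled from a named list of parameters in base R. The variable names base_url, params and query are chosen here for illustration, and the sketch assumes none of the values need URL encoding.

```r
base_url = "https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json"
params = list(
  start_date = "1985-05-01",
  end_date = "1997-07-01",
  order = "asc",
  column_index = 4,
  collapse = "quarterly",
  transformation = "rdiff"
)

# Join each name=value pair with "&" and attach to the base URL after "?"
query = paste(names(params), unlist(params), sep = "=", collapse = "&")
url = paste0(base_url, "?", query)
print(url)
```

Keeping the parameters in a list makes it easy to change one setting (for example the date range) without editing a long string by hand.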
Retrieving URLs in R
We parse the JSON response, and explore the result to find out
how to index the data in the resulting list

print(substr(as.character(data),1,50))

[1] "{\"dataset\":{\"id\":9775409,\"dataset_code\":\"AAPL\",\"da"

library(rjson)
parsed = fromJSON(as.character(data))
print(parsed$dataset$data[[1]])

[[1]]
[1] "1985-06-30"

[[2]]
[1] 18
Retrieving URLs in R
Once we understand how to index into the list, we can convert to
a dataframe:

n = length(parsed$dataset$data)
dates = rep("", n); values = rep(NA, n)
for (i in 1:n) {
dates[i] = parsed$dataset$data[[i]][[1]]
values[i] = parsed$dataset$data[[i]][[2]]
}
df = data.frame(date = dates, value = values)
knitr::kable(df[1:3,]) # Show first 3 rows as a table in rendered output

date value
1985-06-30 18.00
1985-09-30 15.75
1985-12-31 22.00
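
The same conversion can be sketched without an explicit loop. The example below does not call the API: it builds by hand a small nested list with the same shape as parsed$dataset$data, using the three rows shown in the table, and then extracts each field with vapply(). Converting the date strings with as.Date() is an extra step added here, not something the original code does.

```r
# Hand-built stand-in for the parsed API response (no network call)
parsed = list(dataset = list(data = list(
  list("1985-06-30", 18),
  list("1985-09-30", 15.75),
  list("1985-12-31", 22)
)))

rows = parsed$dataset$data
df = data.frame(
  date = as.Date(vapply(rows, function(r) r[[1]], character(1))),
  value = vapply(rows, function(r) r[[2]], numeric(1))
)
print(df)
```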
R packages for APIs
While APIs are programming language agnostic, many APIs
have libraries in a variety of languages (these may construct
URLs and parse results). NASDAQ API R package: Quandl

library(Quandl)
# This call gets the quarterly percentage change in AAPL stock
# between 1985 and 1997, closing prices only
result = Quandl("WIKI/AAPL", transformation = "rdiff",
                start_date = "1985-05-01", end_date = "1997-07-01",
                column_index = 4, order = "asc", collapse = "quarterly")
kable(result[1:3,])

Date Close
1985-06-30 18.00
1985-09-30 15.75
1985-12-31 22.00
Example: Google maps
Example: Apple pay
Example: Discord
Example: eBird
eBird is a database of bird sightings maintained by the
Cornell Lab of Ornithology at Cornell University, Ithaca

~70m complete "checklists" by citizen scientists
(recreational and professional), >1b bird observations

eBird has an API, and the API has an R package: rebird
eBird API
Working from the eBird manual, we develop code to explore bird
counts in the provinces and territories of Canada. Regions in the
eBird database are sorted by "subnational1" and "subnational2"
regions. For provinces/territories in Canada, we find (in the
manual) the function ebirdsubregionlist:

library(rebird)
#API key must be obtained. Set API_KEY = "your key".
subregions = ebirdsubregionlist(regionType = "subnational1",
parentRegionCode = "CA", key = API_KEY)

New names:
• value -> value...5
• value -> value...6
• value -> value...7
• value -> value...8
• value -> value...9
eBird API
print(subregions)

# A tibble: 13 × 2
code name
<chr> <chr>
1 CA-AB Alberta
2 CA-BC British Columbia
3 CA-MB Manitoba
4 CA-NB New Brunswick
5 CA-NL Newfoundland and Labrador
6 CA-NT Northwest Territories
7 CA-NS Nova Scotia
8 CA-NU Nunavut
9 CA-ON Ontario
10 CA-PE Prince Edward Island
11 CA-QC Quebec
12 CA-SK Saskatchewan
13 CA-YT Yukon Territory
"Tibble" to vector
Converting "A tibble: 13 × 2" to a vector:

as.vector(subregions[,1])

# A tibble: 13 × 1
code
<chr>
1 CA-AB
2 CA-BC
3 CA-MB
4 CA-NB
5 CA-NL
6 CA-NT
7 CA-NS
8 CA-NU
9 CA-ON
10 CA-PE
11 CA-QC
12 CA-SK
13 CA-YT
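
Note that the printed result above is still a one-column tibble rather than a plain vector. As a minimal sketch, a column can be pulled out as a character vector with `$` or `[[`. A base data.frame stands in for the tibble here so that the example runs without an API key, and the name sr is an assumption: it matches the variable used in the later loop, but the original assignment is not shown in the slides.

```r
# Stand-in for the first rows of the subregions table
subregions = data.frame(
  code = c("CA-AB", "CA-BC", "CA-MB"),
  name = c("Alberta", "British Columbia", "Manitoba"),
  stringsAsFactors = FALSE
)

# Both forms return the column itself, not a one-column table
sr = subregions$code
sr_alt = subregions[["code"]]
print(is.vector(sr))
print(sr)
```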
eBird API: Species list
Again, from eBird API:

species = ebirdregionspecies("CA", key = API_KEY)
sp = as.vector(species)$speciesCode
print(sp[1:10])

 [1] "emu1"    "bbwduc"  "fuwduc"  "bahgoo"  "empgoo"  "snogoo"  "rosgoo"
 [8] "sxrgoo1" "gragoo"  "swagoo1"
eBird API: By region
We now loop over species and regions. The endpoint "obs/%s/recent/%s"
returns observations of a species in a region in the last 30 days. I
didn't see the function in the R library, so I call the API manually

n = length(sr) ; m = length(sp)
recent = matrix(NA, n, m)
for (j in 1:m) {
  for (i in 1:n) {
    if (is.na(recent[i, j])) {
      url = sprintf('https://api.ebird.org/v2/data/obs/%s/recent/%s',
                    sr[i], sp[j])
      result = content(GET(url, add_headers(
        'X-eBirdApiToken' = API_KEY
      )))
      ...
      for (k in 1:length(result)) {
        ...
        recent[i, j] = recent[i, j] + howMany
        ...

(Too long to show in full. We'll use a pre-processed version.)
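
Two pieces of this loop can be sketched offline, without an API key: building one URL per region and species pair, and totalling the howMany field of a parsed response. This is not the elided code. The helper count_birds, the table calls, and the example response are hypothetical, and treating an observation with no howMany field as a count of 0 is an assumption.

```r
sr = c("CA-AB", "CA-BC"); sp = c("emu1", "houspa")

# One row per (region, species) pair, with the URL that would be requested
calls = expand.grid(region = sr, species = sp, stringsAsFactors = FALSE)
calls$url = sprintf('https://api.ebird.org/v2/data/obs/%s/recent/%s',
                    calls$region, calls$species)
print(calls$url)

# Hypothetical helper: total count over a parsed list of observations
count_birds = function(result) {
  counts = vapply(result, function(obs) {
    if (is.null(obs$howMany)) 0 else as.numeric(obs$howMany)
  }, numeric(1))
  sum(counts)
}

# Made-up response with the same shape as a parsed JSON list
example = list(
  list(speciesCode = "houspa", howMany = 3),
  list(speciesCode = "houspa"),
  list(speciesCode = "houspa", howMany = 5)
)
print(count_birds(example))
```

Using vapply() with numeric(1) means an empty response gives a total of 0 instead of an error.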


eBird API: Pre-processed dataset
ebird30 = read.csv("ebird30.csv")
print(dim(ebird30))

[1] 883 14

print(colnames(ebird30))

 [1] "Code"  "CA.AB" "CA.BC" "CA.MB" "CA.NB" "CA.NL" "CA.NT" "CA.NS" "CA.NU"
[10] "CA.ON" "CA.PE" "CA.QC" "CA.SK" "CA.YT"

names = read.csv("names.csv")
print(dim(names))

[1] 883 3

print(colnames(names))

[1] "Code" "Common.Name" "Scientific.Name"
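
Both tables share a Code column, so one plausible next step is to join them so that each row of counts carries a readable species name. The sketch below uses two tiny stand-in data frames with placeholder counts (not values from ebird30.csv), and the names names_df and labelled are chosen here for illustration.

```r
# Stand-ins for the two CSV files; the counts are placeholders
ebird30 = data.frame(Code = c("emu1", "houspa"),
                     CA.AB = c(0, 10), CA.BC = c(0, 20))
names_df = data.frame(
  Code = c("houspa", "emu1"),
  Common.Name = c("House Sparrow", "Emu"),
  Scientific.Name = c("Passer domesticus", "Dromaius novaehollandiae")
)

# Join on the shared species code; row order in the inputs does not matter
labelled = merge(names_df, ebird30, by = "Code")
print(labelled)
```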


Drawing a map of Canada
In this week's archive is a script canada.R with a function canada
that makes a heatmap for Canada:

source('canada.R')
heat = list("CA.MB" = 0, "CA.BC" = 1, "CA.AB" = 0, "CA.SK" = 0,
"CA.ON" = 0.5, "CA.QC" = 0, "CA.NL" = 0, "CA.NS" = 0, "CA.NB" =
0, "CA.PE" = 0, "CA.NT" = 0, "CA.NU" = 0, "CA.YT" = 0)
title = 'Test'
Drawing a map of Canada
canada(title, heat)
Heatmaps for birds
This can be combined with the eBird data to provide data
visualizations:

heat = list()
rownames(ebird30) = ebird30$Code
for (i in 2:14) {
heat[[colnames(ebird30)[i]]] = ebird30["houspa",
colnames(ebird30)[i]]
}

Question: Why is the index set above specified as i in 2:14?


Heatmaps for birds
canada("House sparrow count", heat)
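
The named list passed to canada can also be built without a loop. The sketch below uses a small stand-in data frame with placeholder counts and only three region columns. It selects the row for one species and converts it to a list with as.list(). The variable regions is an assumption added for illustration, and the sketch does not call canada itself.

```r
# Stand-in for ebird30 with placeholder counts
ebird30 = data.frame(Code = c("emu1", "houspa"),
                     CA.AB = c(0, 10), CA.BC = c(0, 20), CA.MB = c(0, 5))
rownames(ebird30) = ebird30$Code

# Every column except the species code is a region
regions = setdiff(colnames(ebird30), "Code")

# One named list entry per region for the chosen species
heat = as.list(ebird30["houspa", regions])
str(heat)
```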
Reading
Munzert Section 14.1
