
Getting started with R

Sebastiano Manzan

ECO 4051 | Spring 2018

1 / 71
Why R?

I R is becoming increasingly popular in economics and finance

I It is open source, simple to use, and offers numerous packages (12,126 as of
today) contributed by a large community of users, covering every aspect of
statistical modeling and data analysis

I If you can have an excellent free product, why pay for an excellent
expensive one (e.g., Matlab, SAS)?

I Learning a programming language is a useful skill in the current labor
market, where companies are increasingly interested in extracting information
(intelligence) from the large datasets they collect about their business

I Trends in the industry (just two examples):

I Microsoft bought Revolution Analytics and renamed its distribution
Microsoft R (a version of R optimized to work on multiple cores)
I IBM bought SPSS and many data/analytics providers (e.g., the
Weather Channel), and sponsors Cognitive Class.ai, which offers free
online courses on data science and machine learning

2 / 71
Outline of the course

1. Getting started with R


2. Linear Regression Model
3. Time series models
4. Volatility modeling
5. High-frequency data
6. Measuring Financial Risk

3 / 71
Let's get started with R

Figure 1: R for Windows

4 / 71
Let's get started with RStudio

Figure 2: RStudio
5 / 71
How R works

I R creates and works with objects that contain data

I Objects can have different structures, such as:

I data frame: a table where each column represents a variable and
each row a different observation (different time period or unit); a
variable can be numerical or a string (similar to an Excel spreadsheet)
I matrix: same as a data frame, but all variables/columns have to be of
the same type (typically all numbers)
I list: an object of objects; each element of the list can be, e.g., a data
frame, a matrix, or a vector (similar to a set of Excel spreadsheets)
I Function: in R we can create functions that take a set of arguments and
perform a set of operations on a data object; e.g., mean(x, na.rm=T)

I Package: a group of functions with a specific purpose (e.g., ggplot2)


I install a package: install.packages("ggplot2") (only done once)
I use the package: library(ggplot2) or require(ggplot2)
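
A minimal sketch of these structures (names and values are made up):

df <- data.frame(ticker = c("MMM", "ABT"), price = c(165.4, 38.2)) # data frame
m <- matrix(1:4, nrow = 2)                 # matrix: all entries of one type
l <- list(frame = df, mat = m, vec = 1:3)  # list: an object of objects
mean(c(1, 2, NA), na.rm = TRUE)            # function call, returns 1.5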

6 / 71
Loading data in R

I It is convenient to start an R session by setting the working directory where
the data/files are stored; for example:
I setwd('/Users/username/Baruch/ECO4051/') in Mac/Unix
I setwd('c:/Baruch/ECO4051/') in Windows
I Two ways to load a dataset in R:
1. import the data from a local file
2. import the data from an online resource (e.g., Yahoo Finance, FRED,
Google Finance, Quandl)

7 / 71
Base function read.csv()

I You can load a file from RStudio via Tools -> Import Dataset, where you are
given the options From Text File or From Web URL
I Otherwise, you can type a few lines of code (table from Wikipedia):

splist <- read.csv("List_SP500.csv")


head(splist,10)

Ticker.symbol Security Address.of.Headquarters Date.first.added


1 MMM 3M Company St. Paul, Minnesota
2 ABT Abbott Laboratories North Chicago, Illinois 1964-03-31
3 ABBV AbbVie Inc. North Chicago, Illinois 2012-12-31
4 ACN Accenture plc Dublin, Ireland 2011-07-06
5 ATVI Activision Blizzard Santa Monica, California 2015-08-31
6 AYI Acuity Brands Inc Atlanta, Georgia 2016-05-03
7 ADBE Adobe Systems Inc San Jose, California 1997-05-05
8 AMD Advanced Micro Devices Inc Sunnyvale, California 2017-03-20
9 AAP Advance Auto Parts Roanoke, Virginia 2015-07-09
10 AES AES Corp Arlington, Virginia

I The commands head(x, n) and tail(x, n) show the first and last n
observations of x

8 / 71
Data types

I The str() command can be used to evaluate the object structure and the
data types:

str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : Factor w/ 505 levels "A","AAL","AAP",..: 314 7 5 8 52 58 9 33 3 17 ...
$ Security : Factor w/ 505 levels "3M Company","A.O. Smith Corp",..: 1 3 4 5 6 7 8 10 9 11
$ Address.of.Headquarters: Factor w/ 256 levels "Akron, Ohio",..: 222 159 159 66 210 8 204 226 195 5 ...
$ Date.first.added : Factor w/ 303 levels "","1964-03-31",..: 1 2 213 195 252 273 79 292 249 1 ...

I Each variable in the data frame splist has a type that can be:
I numeric: (or double) is used for decimal values
I integer: for integer values
I character: for strings of characters
I Date: for dates
I factor: represents a type of variable (either numeric, integer, or
character) that categorizes the values in a small (relative to the sample
size) set of categories (or levels)
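
A quick way to check these types on made-up values is the class() function:

class(1.5)                         # "numeric"
class(2L)                          # "integer"
class("ABT")                       # "character"
class(as.Date("1964-03-31"))       # "Date"
class(factor(c("NY", "CA", "NY"))) # "factor" with two levels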

9 / 71
I The read.csv() function has the annoying feature that any string is
interpreted as a factor
I This can be switched off by adding the argument stringsAsFactors = FALSE

splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)


str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date.first.added : chr "" "1964-03-31" "2012-12-31" "2011-07-06" ...

I The ticker symbol, security name, and address are all correctly interpreted
as chr
I The Date.first.added column is also imported as a string, but we would like
it to be of type Date

10 / 71
I The code below defines the column/variable Date.first.added as
a date with the command as.Date()

splist$Date.first.added <- as.Date(splist$Date.first.added, format="%Y-%m-%d")


str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date.first.added : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...

I The $ sign is used to extract a variable/column from a data frame;
splist$Date.first.added extracts Date.first.added from the data
frame object splist
I The as.Date() function converts splist$Date.first.added from character
to Date; the role of the argument format="%Y-%m-%d" is to specify the
format of the date being read

11 / 71
read_csv() from readr package
I In addition to the base read.csv() function, there are other packages that
provide functions to read data
I There are two problems with the read.csv() function:
I type guessing (in particular dates)
I reading speed
I The function read_csv() from the readr package tries to solve both problems
(more on speed later):
library(readr)
splist <- read_csv("List_SP500.csv")
str(splist, max.level=1)

Classes 'tbl_df', 'tbl' and 'data.frame': 505 obs. of 4 variables:


$ Ticker symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address of Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date first added : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
- attr(*, "spec")=List of 2
..- attr(*, "class")= chr "col_spec"

I Notice:
I the classes tbl (tibble) and tbl_df mark a type of data frame specific
to this package
I Date first added is defined as a Date, which saves us a line of code
12 / 71
I The file GSPC.csv represents daily data for the S&P 500 Index from January
1985 downloaded from Yahoo Finance
I Below is a comparison of read.csv() and read_csv()

index <- read.csv("GSPC.csv", stringsAsFactors = FALSE)

'data.frame': 8237 obs. of 7 variables:


$ Date : chr "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
$ GSPC.Open : num 167 165 165 164 164 ...
$ GSPC.High : num 167 166 165 165 165 ...
$ GSPC.Low : num 165 164 163 164 164 ...
$ GSPC.Close : num 165 165 164 164 164 ...
$ GSPC.Volume : num 67820000 88880000 77480000 86190000 92110000 ...
$ GSPC.Adjusted: num 165 165 164 164 164 ...

index <- read_csv("GSPC.csv")

Classes 'tbl_df', 'tbl' and 'data.frame': 8237 obs. of 7 variables:


$ Date : Date, format: "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
$ GSPC.Open : num 167 165 165 164 164 ...
$ GSPC.High : num 167 166 165 165 165 ...
$ GSPC.Low : num 165 164 163 164 164 ...
$ GSPC.Close : num 165 165 164 164 164 ...
$ GSPC.Volume : num 67820000 88880000 77480000 86190000 92110000 ...
$ GSPC.Adjusted: num 165 165 164 164 164 ...
- attr(*, "spec")=List of 2
..- attr(*, "class")= chr "col_spec"

13 / 71
Saving data files

I In addition to reading/importing files, we might also need to save data files

I This can be done with the base function write.csv() (see help(write.csv)
for the arguments)

index <- read_csv("GSPC.csv")


write.csv(index, file = "myfile.csv", row.names = FALSE)

I The index object is saved to a file called myfile.csv in the working directory
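
The readr package provides an analogous (and faster) write_csv(); a minimal
sketch, assuming index is the data frame loaded above:

library(readr)
write_csv(index, "myfile.csv") # write_csv() never writes row names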

14 / 71
Plotting the data . . .

I Visualization is an essential part of data analysis


I It is useful to guide the analysis and to communicate our results
I It is typically easier to process and understand a graph than a table
of numbers
I The base function in R to plot data is plot(), which takes as arguments:
I x, y: the variables to plot in the x-axis and y-axis
I type: p for points, l for lines, b for both (and more)
I xlim, ylim: the range of the axes
I xlab, ylab: the labels of the axes
I main: string to use for title
I col: color of the point and/or line
I pch: type of point to use

15 / 71
I The code below produces a time series plot of the S&P 500 Index:
I The column index$Date is defined of class Date and used as the x-axis
I The column index$GSPC.Adjusted is used as the y-axis
I The left plot uses the default settings, the plot on the right has been
customized

# LEFT PLOT
plot(index$Date, index$GSPC.Adjusted)
# RIGHT PLOT
plot(index$Date, index$GSPC.Adjusted, type="l", xlab="", ylab="S&P 500 Index", xaxt="n", yaxt="n")
ticks <- seq(index$Date[1], index$Date[nrow(index)], by="year")
axis(1, at=ticks, labels=ticks, cex.axis=0.9, col="orange", col.axis="blue")
axis(2, at=seq(0, 2000, 500), labels=seq(0,2000,500), col.ticks=3,cex.axis=0.75,col.axis="purple")
axis(4, at=seq(0, 2000, 500), labels=seq(0,2000,500), col.ticks=3, cex.axis=0.75,col.axis="purple")
(Figure: time series plots of index$GSPC.Adjusted against index$Date; the left
panel uses the default settings, the right panel the customized axes)

16 / 71
Time series objects

I A variable that is observed over time is called a time series (e.g., stock
prices, real GDP, inflation)
I There are several packages that provide an infrastructure to define an object
as a time series object
I I will mostly use the xts package (which is loaded by the quantmod package
for quantitative finance; other options are ts and zoo)
I To define an object as a time series we use the command xts(), which takes
two arguments:
I a data frame
I a vector of dates (of class Date), passed to order.by

library(xts)
index.xts <- xts(subset(index, select=-Date), order.by=index$Date)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-02 167.20 167.20 165.19 165.37 67820000 165.37
1985-01-03 165.37 166.11 164.38 164.57 88880000 164.57
1985-01-04 164.55 164.55 163.36 163.68 77480000 163.68
1985-01-07 163.68 164.71 163.68 164.24 86190000 164.24

17 / 71
I The xts package provides functions to extract information from a time
series object:

start(index.xts) # start date


end(index.xts) # end date
periodicity(index.xts) # periodicity/frequency (daily, weekly, monthly)

[1] "1985-01-02"
[1] "2017-09-01"
Daily periodicity from 1985-01-02 to 2017-09-01

I There are also functions to aggregate the observations from high frequency
(e.g., daily) to lower frequency (e.g., weekly/monthly/quarterly)
I to.weekly() returns an OHLC bar for each interval: the first open, the
highest high, the lowest low, the last close, and the summed volume

index.weekly <- to.weekly(index.xts)

index.xts.Open index.xts.High index.xts.Low index.xts.Close index.xts.Volume index.xts.Adjusted


1985-01-04 167.20 167.20 163.36 163.68 234180000 163.68
1985-01-11 163.68 168.72 163.68 167.91 509830000 167.91
1985-01-18 167.91 171.94 167.58 171.32 634000000 171.32
1985-01-25 171.32 178.16 171.31 177.35 749100000 177.35

18 / 71
I Functions apply.weekly() and apply.monthly() are used when the goal is
to apply a function to each week/month in the sample
I In the examples below I apply these functions to subsample the first() and
last() day of the week and to calculate the mean() of the week (notice that
first() keeps the first day's full row, while to.weekly() aggregates within
the week)

index.weekly <- apply.weekly(index.xts, "first")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 167.20 167.20 165.19 165.37 67820000 165.37
1985-01-11 163.68 164.71 163.68 164.24 86190000 164.24

index.weekly <- apply.weekly(index.xts, "last")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 164.55 164.55 163.36 163.68 77480000 163.68
1985-01-11 168.31 168.72 167.58 167.91 107600000 167.91

index.weekly <- apply.weekly(index.xts, "mean")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 165.71 165.95 164.31 164.54 78060000 164.54
1985-01-11 165.08 166.38 164.83 165.93 101966000 165.93

19 / 71
Plotting time series data

I Once the object is defined as xts, plotting is handled by plot.xts(), which
has the advantages that:
I the x-axis is by default defined to be time
I dates are rendered more easily through the grid
I Using plot() on an xts object automatically calls plot.xts()

GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)


plot(Ad(GSPC))

(Figure: line plot of Ad(GSPC), 1985-2016)

20 / 71
Subsetting
I The xts package provides its own syntax to subset the time series object
I Below are some examples:

par(mfrow=c(2,3)) # organize the plots in 2 rows and 3 columns


plot(index.xts['2007'])
plot(index.xts['2007/2009'])
plot(index.xts['/2007'])
plot(index.xts['2007/'])
plot(index.xts['2007-03-21/2008-02-12'])
plot(index.xts[.indexwday(index.xts)==3])

index.xts["2007"] index.xts["2007/2009"] index.xts["/2007"]

1600

1000 1400
1500

1200

600
1400

800

200
Jan 03 Apr 02 Jul 02 Oct 01 Dec 31 Jan 03 Jan 02 Jan 02 Dec 31 Jan 02 Jan 02 Jan 03 Jan 03 Jan 03
2007 2007 2007 2007 2007 2007 2008 2009 2009 1985 1990 1995 2000 2005

index.xts["2007/"] index.xts["2007−03−21/2008−02−12"] index.xts[.indexwday(index.xts) == 3]

2500
1550
2000

1500
1450
1000

1350

500

Jan 03 Jan 02 Jan 03 Jan 02 Jan 02 Jan 03 Mar 21 Jun 01 Aug 01 Oct 01 Dec 03 Feb 12 Jan 03 Jan 02 Jan 08 Jan 08 Jan 07 Jan 07
2007 2009 2011 2013 2015 2017 2007 2007 2007 2007 2007 2008 1985 1992 1998 2004 2010 2016

21 / 71
getSymbols() from the quantmod package

I There are several packages that provide functions to download economic and
financial data by only specifying the ticker and time period (and frequency
in some functions)
I I will discuss only the getSymbols() function from the quantmod package,
which will be used in this class
I Features of getSymbols():
I Sources: Yahoo Finance, Google Finance, OANDA (fx rates), FRED
(argument src)
I Download multiple tickers in one call
I Select the time period (with arguments from and to)
I By default the output is an xts object for each ticker specified
I Yahoo Finance: downloads open, high, low, close, volume, and
adjusted close at the daily frequency
I You can convert to weekly or monthly using to.weekly()/to.monthly()
or apply.weekly()/apply.monthly()

22 / 71
getSymbols() with one ticker
I By default, the function creates an xts object with the name of the ticker
(minus the ^ part if you are downloading an index)
I When downloading only one ticker, setting auto.assign=FALSE allows you to
assign the output to an object that you name (in the example below, data)

library(quantmod)
getSymbols("^GSPC", src = "yahoo", from = "1990-01-01")

[1] "GSPC"

tail(GSPC, 2)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1
2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5

data <- getSymbols("^GSPC", src = "yahoo", from = "1990-01-01", auto.assign = FALSE)


tail(data, 2)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1
2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5

23 / 71
Multiple tickers

I A vector of symbols can be passed to getSymbols() to download data for
multiple assets
I The function will create an xts object for each ticker (named after the
ticker; the auto.assign option does not work for more than one ticker)
I When you download more than 5 symbols you will see the message "pausing 1
second between requests for more than 5 symbols"

library(quantmod)
getSymbols(c("^GSPC","^DJI"), src="yahoo", from="1990-01-01")
periodicity(GSPC)
periodicity(DJI)

[1] "GSPC" "DJI"


Daily periodicity from 1990-01-02 to 2018-01-24
Daily periodicity from 1990-01-02 to 2018-01-24

24 / 71
I getSymbols() also allows you to specify an environment (argument env)
I Think of the environment as a folder in the R global environment where the
objects will be stored
I Steps:
I create a new environment with new.env() command (called myenv
below)
I call the getSymbols() function and set the env= argument to the new
environment you created
I the ls() command below lists the objects in the new environment
myenv

splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)


myenv <- new.env()
getSymbols(splist$Ticker.symbol[1:10], env=myenv, from="2010-01-01", src="yahoo")

[1] "MMM" "ABT" "ABBV" "ACN" "ATVI" "AYI" "ADBE" "AMD" "AAP" "AES"

ls(myenv)

[1] "AAP" "ABBV" "ABT" "ACN" "ADBE" "AES" "AMD" "ATVI" "AYI" "MMM"

25 / 71
quantmod functionalities

I The object created contains all the information, but for our analysis we
might need only some of the columns
I The package provides functions to extract the open price Op(), the closing
price Cl(), the highest intra-day price Hi(), the lowest Lo(), the volume
Vo(), and the adjusted closing price Ad()
I OpCl() calculates the open-to-close daily return, ClCl() for the close-to-close
return, and LoHi() for the low-to-high difference (also called the intra-day
range)

data.new <- merge(Ad(GSPC),Ad(DJI))

GSPC.Adjusted DJI.Adjusted
1990-01-02 359.69 2810.1
1990-01-03 358.76 2809.7
1990-01-04 355.67 2796.1
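
For example, a quick sketch of the return extractors (assuming GSPC is still in
memory from the earlier download):

oc <- OpCl(GSPC)       # (Cl - Op)/Op, the open-to-close return
cc <- ClCl(GSPC)       # (Cl - lag(Cl))/lag(Cl), the close-to-close return
head(merge(oc, cc), 3)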

26 / 71
Oanda
I Daily exchange rates for a wide range of currency pairs
I Limit of 2,000 days per request
I The oanda.currencies dataset gives you the symbols for 191 currencies

getSymbols(c("USD/EUR", "USD/JPY"), src="oanda")

[1] "USDEUR" "USDJPY"

par(mfrow=c(1,2))
plot(USDEUR)
plot(USDJPY)

(Figure: USDEUR and USDJPY exchange rates, Jul 2017 to Jan 2018)

27 / 71
FRED

I Federal Reserve Economic Data (FRED) can be used to download
macroeconomic time series for the US economy as well as international ones
I Visit FRED to find out the symbol of the variable(s) you are interested in
downloading
I Below are some examples:

library(quantmod)
macrodata <- getSymbols(c("UNRATE","CPIAUCSL","GDPC1"), src="FRED")
macrodata <- merge(UNRATE, CPIAUCSL, GDPC1)
par(mfrow=c(1,3))
plot(UNRATE); plot(CPIAUCSL); plot(GDPC1)

(Figure: UNRATE, CPIAUCSL, and GDPC1, 1947/48 to 2017)

28 / 71
I Exchange rates are also available in FRED (no restriction on the time
period)

# DEXUSEU: U.S. Dollars to One Euro


# DEXJPUS: Japanese Yen to One U.S. Dollar
macrodata <- getSymbols(c("DEXUSEU","DEXJPUS"), src="FRED", from="1975-01-01")
par(mfrow=c(1,2))
plot(DEXUSEU)
plot(DEXJPUS)

(Figure: DEXUSEU since 1999 and DEXJPUS since 1971)

29 / 71
Quandl

I Quandl works as an aggregator of open and subscription databases


I The macroeconomic variables retrieved from FRED can also be obtained
from Quandl:

library(Quandl)
macrodata <- Quandl(c("FRED/UNRATE", ’FRED/CPIAUCSL’, "FRED/GDPC1"),
start_date="1950-01-02", type="xts")
head(macrodata)

FRED.UNRATE - Value FRED.CPIAUCSL - Value FRED.GDPC1 - Value


1950-02-01 6.4 23.61 NA
1950-03-01 6.3 23.64 NA
1950-04-01 5.8 23.65 2147.6
1950-05-01 5.5 23.77 NA
1950-06-01 5.4 23.88 NA
1950-07-01 5.0 24.07 2230.4

30 / 71
I Quandl has many more datasets, e.g. commodity spot and futures prices

oil.spot <- Quandl("COM/WLD_CRUDE_WTI", type="xts")


coffee.spot <- Quandl("COM/COFFEE_BRZL", type="xts")
sp.futures <- Quandl("CHRIS/CME_SP1", type="xts")
gold.futures <- Quandl("CHRIS/CME_GC3", type="xts")
par(mfrow=c(1,4))
plot(oil.spot); plot(coffee.spot); plot(sp.futures$Settle); plot(gold.futures$Settle)

(Figure: oil.spot, coffee.spot, sp.futures$Settle, and gold.futures$Settle)

31 / 71
Reading large files

I Files imported in R are stored in the memory of the program

I The physical limit to the file size that can be imported is determined by the
RAM of your machine (2, 4, 6GB)

I Reading large files can be time consuming when using the base functions

I Two packages are available to help with this task:


I readr (function read_csv())
I data.table (function fread())
I I will perform a speed comparison for the three functions using a common
benchmark

32 / 71
I The benchmark for the comparison is obtained from the Center for Research
in Security Prices (CRSP) at the University of Chicago. The variables in the
dataset are:
I PERMNO: identification number for each company
I date: date in format 2015/12/31
I EXCHCD: exchange code
I TICKER: company ticker
I COMNAM: company name
I CUSIP: another identification number for the security
I DLRET: delisting return
I PRC: price
I RET: return
I SHROUT: shares outstanding
I ALTPRC: alternative price
I The observations are all companies listed on the NYSE, NASDAQ, and
AMEX from January 1985 until December 2016 at the monthly frequency,
for a total of 3,627,236 observations and 16 variables. The size of the file is
328MB

33 / 71
read_csv() from readr package
I The command Sys.time() reads the current time, which is assigned to
start.time
I The time to perform the operation is calculated as the difference between
Sys.time() and start.time

start.time <- Sys.time()


crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE)
end.csv <- Sys.time() - start.time

Time difference of 39.204 secs

I and for the read_csv() function?

library(readr)
start.time <- Sys.time()
crsp <- read_csv("crsp_eco4051_jan2017.csv")
end_csv <- Sys.time() - start.time

Time difference of 5.5403 secs

I read_csv() is 7.1 times faster than read.csv()


34 / 71
fread() from data.table package

I Another function to read data fast is fread()

I The arguments:
I data.table: whether the output should be a data.table (TRUE) or a
regular data frame (FALSE)
I showProgress: whether partial info about the percentage loaded
should be printed (TRUE/FALSE)

start.time <- Sys.time()


crsp <- data.table::fread("crsp_eco4051_jan2017.csv",
data.table=FALSE,
showProgress = FALSE)
end.fread <- Sys.time() - start.time

Time difference of 3.1622 secs

I fread is 12 times faster than read.csv() and 1.8 times faster than
read_csv()
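
As an aside, base R wraps this start/end pattern in the system.time() function,
which prints the elapsed time directly; a sketch:

system.time(crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE))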

35 / 71
Create returns

I Very often we need to create new variables that are transformations of


existing variables
I One example is to calculate the return of an asset, that is:
1. Simple return: Rt = (Pt − Pt−1)/Pt−1
2. Logarithmic return: rt = log(Pt) − log(Pt−1)
I In macro this transformation is typically called the growth rate
I The transformation is easily done in R using the lag() and diff() functions

GSPC <- getSymbols("^GSPC", from="1990-01-01", auto.assign = FALSE)


GSPC$ret.simple <- 100 * (Ad(GSPC) - lag(Ad(GSPC), 1)) / lag(Ad(GSPC),1)
GSPC$ret.log <- 100 * (log(Ad(GSPC)) - lag(log(Ad(GSPC)), 1))
GSPC$ret.simple <- 100 * diff(Ad(GSPC)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * diff(log(Ad(GSPC)))
head(GSPC)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted ret.simple ret.log


1990-01-02 353.40 359.69 351.98 359.69 162070000 359.69 NA NA
1990-01-03 359.69 360.59 357.89 358.76 192330000 358.76 -0.25855 -0.25889
1990-01-04 358.76 358.76 352.89 355.67 177000000 355.67 -0.86130 -0.86503
1990-01-05 355.67 355.67 351.35 352.20 158530000 352.20 -0.97562 -0.98041
1990-01-08 352.20 354.24 350.54 353.79 140110000 353.79 0.45145 0.45043
1990-01-09 353.83 354.17 349.61 349.62 155210000 349.62 -1.17867 -1.18567
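
As a quick sanity check of the two formulas (with made-up prices):

P <- c(100, 101)
100 * (P[2] - P[1]) / P[1]    # simple return: 1
100 * (log(P[2]) - log(P[1])) # log return: ~0.995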

36 / 71
I If we want to transform all the columns of a data frame or xts object we can
simply do the operation on the object rather than the variable

# "^GSPC" = S&P 500 Index, "^N225" = Nikkei 225, "^STOXX50E" = EURO STOXX 50
data <- getSymbols(c("^GSPC", "^N225", "^STOXX50E"), from="2000-01-01")
price <- merge(Ad(GSPC), Ad(N225), Ad(STOXX50E))
ret <- 100 * diff(log(price))
tail(ret, 5)

GSPC.Adjusted N225.Adjusted STOXX50E.Adjusted


2018-01-19 0.437565 0.187892 NA
2018-01-22 0.803436 0.034728 0.44324
2018-01-23 0.217200 1.284195 0.19107
2018-01-24 -0.056013 -0.763018 -0.79476
2018-01-25 NA -1.139636 NA

37 / 71
Elegant graphics: ggplot2 package

I The base plotting functions are easy to use and convenient for quick plotting
I However, they lack elegance and it is difficult to produce high-quality
graphics
I The package ggplot2 offers an alternative set of functions to make graphs

I There are two ways of producing plots with ggplot2:


1. qplot() is a wrapper function similar to plot() that uses underlying
ggplot2 plotting functions
2. Using the grammar of graphics, which is composed of:
I ggplot(): creating the graph and specifying the data frame that
contains the variables to plot
I geom_xxx(): the type of plot that is needed; point, line,
histogram, boxplot, etc (see list here)
I aes(): the x and y variables to plot
I theme(): the overall look of the plot; theme_bw(),
theme_classic(), theme_dark() etc.

38 / 71
I ggplot2 does not recognize the time series properties of xts objects and we
have to specify the x axis
I The time(GSPC) command is used to extract the date associated with each
observation

GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)


names(GSPC) <- sub("GSPC.", "", names(GSPC))
library(ggplot2)
qplot(time(GSPC), GSPC$Close, geom="line")

(Figure: qplot line chart of GSPC$Close against time(GSPC))

39 / 71
I ggplot2 interacts best with data frames
I We can extract a data frame from the xts object by:
I creating a new variable that represents the Date
I use coredata() to extract the data from the xts object

GSPC.df <- data.frame(Date = time(GSPC), coredata(GSPC))


qplot(Date, Close, data = GSPC.df, geom = "line")

(Figure: qplot line chart of Close against Date from GSPC.df)

40 / 71
I We can produce the same graph using the ggplot2 grammar as in the
example below
I Notice that we can assign a ggplot to an object (called myplot) that can be
used later and altered by only changing specific aspects

myplot <- ggplot(GSPC.df) + geom_line(aes(x = Date, y = Close))


myplot

(Figure: the same line chart produced with ggplot() + geom_line())

41 / 71
I The plots can be customized with themes, line colors and types, label names,
etc.
I par(mfrow=c(2,2)) does not work with ggplot2, so we use the
grid.arrange() function from the gridExtra package
plot1 <-
ggplot(GSPC.df, aes(Date, Adjusted)) + geom_line(color="darkgreen")
plot2 <-
plot1 + theme_bw()
plot3 <-
plot2 + theme_classic() + labs(x="", y="Index", title="S&P 500")
plot4 <-
plot3 + geom_line(color="darkorange") + geom_smooth(method="lm") +
theme_dark() + labs(subtitle="Period: 1985/2016", caption="Source: Yahoo")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

(Figure: plot1 with the default theme, plot2 with theme_bw, plot3 with
theme_classic and title "S&P 500", plot4 with theme_dark, a fitted trend line,
subtitle "Period: 1985/2016", and caption "Source: Yahoo")

42 / 71
I A scatter plot of two variables can be easily produced in ggplot2

data <- getSymbols(c("^GSPC", "^N225"), from="1990-01-01")


price <- merge(Ad(to.monthly(GSPC)), Ad(to.monthly(N225)))
ret <- 100 * diff(log(price))
GN.df <- data.frame(Date=time(price), Year = year(time(price)), coredata(merge(price,ret)))
names(GN.df) <- c("Date","Year", "SP", "NIK", "SPret","NIKret")

plot1 <- ggplot(GN.df, aes(NIKret, SPret)) + geom_point() +


geom_vline(xintercept = 0) + geom_hline(yintercept = 0)
plot2 <- plot1 + geom_smooth(method="lm", se=FALSE) + theme_bw() +
labs(x="NIKKEI", y="SP500")
grid.arrange(plot1, plot2, ncol=2)

(Figure: scatter plots of SPret against NIKret; the right panel adds a regression
line and the labels NIKKEI/SP500)

43 / 71
I The strength of ggplot2 is that it makes it easier to produce sophisticated graphics
I For example: if we want the dots in the scatter plot to depend on a
variable (e.g., Year), this can be done easily by adding the argument
color=Year to the aesthetics

GN.df$Year <- year(time(price))


plot1 <- ggplot(GN.df, aes(NIKret, SPret, color=Year)) + geom_point() +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
plot2 <- ggplot(GN.df, aes(NIKret, SPret, color=factor(Year))) + geom_point() +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
grid.arrange(plot1, plot2, ncol=2)

(Figure: scatter plots colored by Year; a continuous color scale on the left, a
discrete factor(Year) legend on the right)

44 / 71
Boxplot

I A boxplot provides a graphical representation of the distribution of the data

I Box-whisker plot: the box represents the 25%, 50%, and 75% quantiles, and
the whiskers extend to the most extreme values within 1.5 times the
interquartile range of the box; points beyond the whiskers are outliers
I In the graph below the boxplot is plotted for each year separately

ggplot(GN.df, aes(factor(Year), SPret)) + geom_boxplot() +


theme(axis.text.x = element_text(angle = 90))

(Figure: boxplots of SPret by factor(Year), 1990-2018)

45 / 71
Summary statistics
I The function summary() provides a few summary statistics of the
distribution of a data object
I Of course, the variable needs to be of type numeric

summary(GSPC$ret.simple)

Index ret.simple
Min. :1990-01-02 Min. :-9.0350
1st Qu.:1996-12-26 1st Qu.:-0.4399
Median :2004-01-07 Median : 0.0530
Mean :2004-01-07 Mean : 0.0353
3rd Qu.:2011-01-13 3rd Qu.: 0.5541
Max. :2018-01-24 Max. :11.5800
NA's :1

summary(GSPC$ret.log)

Index ret.log
Min. :1990-01-02 Min. :-9.4695
1st Qu.:1996-12-26 1st Qu.:-0.4409
Median :2004-01-07 Median : 0.0530
Mean :2004-01-07 Mean : 0.0292
3rd Qu.:2011-01-13 3rd Qu.: 0.5525
Max. :2018-01-24 Max. :10.9572
NA's :1

46 / 71
I Package fBasics provides the basicStats() function with a more
comprehensive set of descriptive statistics compared to summary()

fBasics::basicStats(GSPC$ret.log)

ret.log
nobs 7072.000000
NAs 1.000000
Minimum -9.469512
Maximum 10.957197
1. Quartile -0.440919
3. Quartile 0.552538
Mean 0.029210
Median 0.052970
Sum 206.545022
SE Mean 0.013172
LCL Mean 0.003390
UCL Mean 0.055031
Variance 1.226761
Stdev 1.107592
Skewness -0.252791
Kurtosis 9.004323

47 / 71
Covariance and correlation between two or more assets

I When we have several variables or assets the first question that arises in the
analysis is whether they co-move
I Dependence is measured using the covariance and the correlation

I Do the S&P 500 and NIKKEI move togheter?

Ret <- subset(GN.df, select=c("SPret","NIKret"))


cov(Ret, use="complete.obs")

SPret NIKret
SPret 16.960 13.446
NIKret 13.446 38.731

cor(Ret, use="complete.obs")

SPret NIKret
SPret 1.00000 0.52463
NIKret 0.52463 1.00000
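
As a check, the correlation is just the covariance scaled by the two standard
deviations:

13.446 / sqrt(16.960 * 38.731) # = 0.52463, matching cor()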

48 / 71
Plotting the data distribution

I A very useful tool to explore the distribution of the data is the histogram,
which represents an estimator of the underlying (population) distribution of
the data. It is useful for assessing (visually) characteristics of the data
such as normality, fat tails, and asymmetry

hist(GSPC$ret.log, breaks=50, xlab="", main="") # base function


qplot(ret.log, data=GSPC, geom="histogram", bins=50) # ggplot function

(Figure: histograms of ret.log with 50 bins; base hist() on the left, qplot() on
the right)

49 / 71
I We can overlay a non-parametric density estimate on the histogram: a
smooth line that goes through the histogram bars

# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="",prob=TRUE)
lines(density(GSPC$ret.log,na.rm=TRUE),col=2,lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
geom_density(color="red", size=1.2) +
theme_bw()

(Figure: histograms of ret.log with a kernel density estimate overlaid in red)

50 / 71
I Or compare the histogram to a distribution (e.g., normal)

# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="",prob=TRUE)
curve(dnorm(x, mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
from=-10, to=10, add=TRUE, col="red",lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
stat_function(fun = dnorm, colour = "red",
args = list(mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)), size=1.2) +
theme_bw()

(Figure: histograms of ret.log with a fitted normal density overlaid in red)

51 / 71
Dates and times in R

I We already used the command as.Date() to define a string to be of type


Date
I The default format of a date in R is 2011-07-17
I If in your dataset a date is specified in a different way, you need to help R
read it by specifying the format= argument
I The syntax of the format is: %d for the numerical day, %a and %A for the
abbreviated/unabbreviated weekday, %m for the numerical month, %b and %B for
the abbreviated/unabbreviated month, %y and %Y for the 2/4 digit year

as.Date("2011-07-17") # default, no need to specify format


as.Date("July 17, 2011", format="%B %d,%Y")
as.Date("Monday July 17, 2011", format="%A %B %d,%Y")
as.Date("17072011", format="%d%m%Y")
as.Date("11@17#07", format="%y@%d#%m")

[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"

52 / 71
I One operation we might want to do with dates is to calculate the difference
between two dates
I This can be done by subtracting two dates or using the difftime() function,
which also allows specifying the unit of time

date1 <- as.Date("July 17, 2011", format="%B %d,%Y")


date2 <- Sys.Date()
date2 - date1
difftime(date2, date1, units="secs")
difftime(date2, date1, units="days")
difftime(date2, date1, units="weeks")

Time difference of 2384 days


Time difference of 205977600 secs
Time difference of 2384 days
Time difference of 340.57 weeks

53 / 71
Time

I In addition to the date, we might need to specify the time of the day
I This is useful when dealing with intra-day data, such as the FX tick data
shown below

data.hf <- data.table::fread("USDJPY-2016-12.csv",
                             col.names=c("Pair","Date","Bid","Ask"),
                             colClasses=c("character","character","numeric","numeric"),
                             data.table=FALSE, showProgress = FALSE)

Pair Date Bid Ask


1 USD/JPY 20161201 00:00:00.041 114.68 114.69
2 USD/JPY 20161201 00:00:00.042 114.68 114.69
3 USD/JPY 20161201 00:00:00.186 114.68 114.69
4 USD/JPY 20161201 00:00:00.188 114.68 114.69
5 USD/JPY 20161201 00:00:00.189 114.69 114.70
6 USD/JPY 20161201 00:00:00.223 114.69 114.70
7 USD/JPY 20161201 00:00:00.343 114.69 114.70
8 USD/JPY 20161201 00:00:00.347 114.69 114.70
9 USD/JPY 20161201 00:00:00.403 114.69 114.69
10 USD/JPY 20161201 00:00:00.415 114.69 114.69

54 / 71
I To work with time we can use two functions:
I strptime()
I as.POSIXlt()
I Both require specifying the format of the date and time parts
I The format of the time is: %H hour (out of 24), %M minute, %S seconds, and
%OS fractional seconds

strptime("20161201 01:00", format="%Y%m%d %H:%M")


strptime("20161201 00:00:01", format="%Y%m%d %H:%M:%S")
strptime("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
##
date1 <- as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
date2 <- strptime("20161201 01:15:00.041", format="%Y%m%d %H:%M:%OS")
date2 - date1
difftime(date2, date1, unit="secs")

[1] "2016-12-01 01:00:00 EST"


[1] "2016-12-01 00:00:01 EST"
[1] "2016-12-01 00:00:00 EST"
[1] "2016-12-01 00:00:00 EST"
Time difference of 1.25 hours
Time difference of 4500 secs

55 / 71
lubridate package

I This package makes it easier to define dates through dedicated functions
that do the as.Date() conversion and the format specification in one step
I These functions are:
I ymd: for dates in the format year, month, day
I dmy: dates with day, month, year format
I mdy: when the format is month, day, year
I ymd_hm: in addition to the date the time is provided in hour and
minute (the date part can be changed to other formats)
I ymd_hms: the time format is hour, minute, and seconds

library(lubridate)
ymd("20110717")
ymd("2011/07/17")
ymd_hm("20110717 01:00")
ydm_hms("20111707 00:00:00.041")

[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17 01:00:00 UTC"
[1] "2011-07-17 00:00:00 UTC"

56 / 71
I The package provides functions that make it easy to extract the year,
month, day, day of the week/month/year, minute, second, etc.

mydate <- ydm_hms("20111707 00:00:00.041")


year(mydate)
month(mydate)
day(mydate)
wday(mydate, label=T, abbr=FALSE)
minute(mydate)
second(mydate)

[1] 2011
[1] 7
[1] 17
[1] Sunday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
[1] 0
[1] 0.041

57 / 71
dplyr package

I The package dplyr provides functions to manipulate data frames


I Data analysis requires a lot of data manipulation and this package has made
the task significantly easier
I dplyr has 5 main verbs:
I mutate: to create new variables
I select: to select columns of the data frame
I filter: to select rows based on a criterion
I group_by()/summarize: uses a function to summarize
columns/variables into one value per group
I arrange: to order a data frame based on one or more variables (see
the sketch below)
I dplyr is typically used with the %>% piping operator (read "then")
I %>% is useful to write compact code when we are not interested in
using or storing the intermediate results
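
A minimal sketch of arrange() combined with %>%, assuming the GSPC.df data
frame created in the ggplot2 slides is still in memory:

library(dplyr)
GSPC.df %>%
  arrange(desc(Volume)) %>% # sort rows by trading volume, largest first
  select(Date, Volume) %>%
  head(3)                   # the three highest-volume days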

58 / 71
mutate and select

I mutate() to create new variables


I select() to select existing variables

library(dplyr)
library(lubridate)

GSPC.df <- mutate(GSPC.df, range = 100 * log(High/Low),


ret.c2c = 100 * log(Adjusted / lag(Adjusted)),
year = year(Date),
month = month(Date),
wday = wday(Date, label=T, abbr=F))
tail(GSPC.df, 2)

Date Open High Low Close Volume Adjusted range ret.c2c year month wday
8334 2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1 0.41073 0.217200 2018 1 Tuesday
8335 2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5 0.99195 -0.056013 2018 1 Wednesday
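
The second output below keeps only some of the columns; it is presumably
produced by a select() call along these lines:

select(GSPC.df, Date, year, month, wday, range, ret.c2c) %>% tail(2)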

Date year month wday range ret.c2c


8334 2018-01-23 2018 1 Tuesday 0.41073 0.217200
8335 2018-01-24 2018 1 Wednesday 0.99195 -0.056013

59 / 71
filter()

I The filter() function is used to select specific rows of the data frame

filter(GSPC.df, wday == "Tuesday") %>% head(7)

Date Open High Low Close Volume Adjusted range ret.c2c year month wday
1 1985-01-08 164.24 164.59 163.91 163.99 92110000 163.99 0.41400 -0.152332 1985 1 Tuesday
2 1985-01-15 170.51 171.82 170.40 170.81 155300000 170.81 0.82988 0.175790 1985 1 Tuesday
3 1985-01-22 175.23 176.63 175.14 175.48 174800000 175.48 0.84715 0.142568 1985 1 Tuesday
4 1985-01-29 177.40 179.19 176.58 179.18 115700000 179.18 1.46727 0.998381 1985 1 Tuesday
5 1985-02-05 180.35 181.53 180.07 180.61 143900000 180.61 0.80753 0.144058 1985 2 Tuesday
6 1985-02-12 180.51 180.75 179.45 180.56 111100000 180.56 0.72182 0.027697 1985 2 Tuesday
7 1985-02-19 181.60 181.61 180.95 181.33 90400000 181.33 0.36408 -0.148791 1985 2 Tuesday

60 / 71
group_by()/summarize()

I A common operation in data analysis is to group observations based on a


certain characteristic and apply a certain function to each group
I Example: calculate the average/min/max return by day of the week

GSPC.df %>% group_by(wday) %>%


summarize(AV.RET = mean(ret.c2c, na.rm=T),
MIN.RET = min(ret.c2c, na.rm=T),
MAX.RET = max(ret.c2c, na.rm=T))

# A tibble: 5 x 4
wday AV.RET MIN.RET MAX.RET
<ord> <dbl> <dbl> <dbl>
1 Monday 0.010565 -22.8997 10.9572
2 Tuesday 0.066759 -5.9108 10.2457
3 Wednesday 0.054635 -9.4695 8.7089
4 Thursday 0.016346 -7.9224 6.6923
5 Friday 0.019671 -7.0082 6.1328

61 / 71
I The grouping can also be done on two variables, for example month and year

GSPC.df %>% group_by(month, year) %>%


summarize(AV.RET = mean(ret.c2c, na.rm=T),
MIN.RET = min(ret.c2c, na.rm=T),
MAX.RET = max(ret.c2c, na.rm=T))

# A tibble: 397 x 5
# Groups: month [?]
month year AV.RET MIN.RET MAX.RET
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1985 0.393875 -0.54228 2.2566
2 1 1986 0.010744 -2.76472 1.4843
3 1 1987 0.589429 -1.40073 2.3024
4 1 1988 0.198181 -7.00824 3.5231
5 1 1989 0.327143 -0.87157 1.4854
6 1 1990 -0.324090 -2.61989 1.8710
7 1 1991 0.184905 -1.74726 3.6642
8 1 1992 -0.091477 -1.11960 1.4615
9 1 1993 0.035106 -0.87605 0.8903
10 1 1994 0.152304 -0.58097 1.1363
# ... with 387 more rows

62 / 71
Does volatility of the S&P 500 vary over time?

I dplyr functions can be very useful for writing relatively lengthy operations
in a compact and readable manner
I Assume we want to calculate the average volatility by year, using the
intra-day range as a proxy for volatility

GSPC.df %>% mutate(range = 100 * log(High/Low),


year = year(Date)) %>%
group_by(year) %>%
summarize(av.range = mean(range, na.rm=T)) %>%
ggplot(., aes(year, av.range)) + geom_line(color="steelblue4", size=1.3) +
theme_bw() + labs(x="", y="")

(Figure: line plot of the average intra-day range (av.range) by year)

63 / 71
Creating functions in R

I The advantage of a programming language is that you have the flexibility to
write your own functions. This can be useful when:
1. No package provides pre-programmed functions to perform the
analysis you want to conduct

2. The task is very complex and you prefer to break it down into smaller
tasks that make the code easier to read, interpret, and test

3. Once you write a function, you can use it again in future analyses

64 / 71
I A function is a set of operations applied to some data

I The syntax is as follows:

myfunction <- function(inputs)


{
#
# operations
#

return(output)
}

I The function() can include several arguments

I The return() is the output of the function and can include only one object,
although that object might contain several elements (see the sketch below)
I To call the function, you type the name of the function in R with the
appropriate arguments: myfunction()
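
For instance, a sketch of a function with a second (optional) argument that
returns several elements bundled in a list (the name mystats is arbitrary):

mystats <- function(Y, na.rm = TRUE)
{
  if (na.rm) Y <- na.omit(Y)
  return(list(mean = mean(Y), sd = sd(Y), n = length(Y)))
}
mystats(GSPC$ret.log) # a list with three elements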

65 / 71
A function to calculate the sample average

I R provides the mean() function to calculate the sample average

mean(GSPC$ret.log, na.rm=T)

[1] 0.02921

I As an illustration, let’s write a function that calculates the sample average:


R̄ = (R1 + R2 + · · · + RT)/T

I In this case the set of operations to perform is quite simple:


1. sum the values of the time series
2. divide by the total number of observations

66 / 71
I Below is the code that defines a new function called mymean()

# Y is the input, Ybar the output of the function


mymean <- function(Y)
{
Y = na.omit(Y)
Ybar <- sum(Y) / length(Y)
return(Ybar)
}

I Let’s compare the results:

mean(GSPC$ret.log, na.rm=T)

[1] 0.02921

mymean(GSPC$ret.log)

[1] 0.02921

67 / 71
Loops in R

I Loops are a useful tool when you want to perform the same set of operations
on several time series or datasets

I A common loop is the for loop which has the following syntax:

for (i in 1:N)
{
# write your commands here
}

I The loop iterates through the values of i from 1 to N


I Example:

for (i in 1:3)
{
print(i)
}

[1] 1
[1] 2
[1] 3

68 / 71
mysum() function using a for loop

mysum <- function(Y)


{
Y = na.omit(Y)
N = length(Y) # define N as the number of elements of Y
sumY = 0 # initialize the variable that will store the sum of Y

for (i in 1:N)
{
sumY = sumY + as.numeric(Y[i]) # current sum is equal to previous sum
} # plus the i-th value of Y
return(sumY) # as.numeric(): makes sure to transform
} # from other classes to a number

mysum(GSPC$ret.log)

[1] 206.55

sum(GSPC$ret.log, na.rm=T)

[1] 206.55

69 / 71
A simulation exercise

I Simulations are used to evaluate some quantities (e.g., the price of an option
or an estimator) based on a large number of samples generated from a certain
distribution
I The recipe works as follows:
1. generate random values from a model
2. calculate the quantity of interest
3. repeat 1 and 2 many times

I Example: we want to evaluate whether the distribution of the sample mean is
N(µ, σ²/N), where µ is the population mean, σ² the population variance,
and N the sample size

I In the example on the following slide we generate data from a normal
distribution with mean 0 and standard deviation 2

70 / 71
S = 5000 # set the number of simulations
N = 1000 # set the length of the sample
mu = 0 # population mean
sigma = 2 # population standard deviation

Ybar = vector("numeric", S) # create an empty vector of S elements
                            # to store the sample mean of each simulation
for (i in 1:S)
{
  Y = rnorm(N, mu, sigma) # generate a sample of length N
  Ybar[i] = mean(Y)       # store the sample mean
}
ggplot(data = data.frame(Ybar=Ybar), aes(Ybar)) +
geom_histogram(aes(y = ..density..), color="red", fill="lightsalmon", bins = 40) +
stat_function(fun = dnorm, args = list(mean = 0, sd = sigma/sqrt(N)), color="seagreen",size=1.2) +
theme_bw() + xlim(c(-0.3, 0.3))

(Figure: histogram of Ybar with the N(0, σ²/N) density overlaid)

71 / 71
