
Getting started with R

Sebastiano Manzan

ECO 4051 | Spring 2018

1 / 71
Why R?

I R is becoming increasingly popular in economics and finance

I It is open source, simple to use, and offers numerous packages (12,126 as of
today) contributed by a large community of users, covering every aspect of
statistical modeling and data analysis

I If you can have an excellent free product, why pay for an excellent
expensive one (e.g., Matlab, SAS)?

I Learning a programming language is a useful skill in the current labor
market, where companies are increasingly interested in extracting information
(intelligence) from the large datasets they collect about their business

I Trends in the industry (just two examples):

I Microsoft bought Revolution Analytics and renamed its distribution
Microsoft R (a version of R optimized to work on multiple cores)
I IBM bought SPSS and many data/analytics providers (e.g., the
Weather Channel), and sponsors Cognitive Class.ai, which offers free
online courses on data science and machine learning

2 / 71
Outline of the course

1. Getting started with R


2. Linear Regression Model
3. Time series models
4. Volatility modeling
5. High-frequency data
6. Measuring Financial Risk

3 / 71
Let's get started with R

Figure 1: R for Windows

4 / 71
Let's get started with RStudio

Figure 2: RStudio
5 / 71
How R works

I R creates and works with objects that contain data

I Objects can have different structures, such as:

I data frame: a table where each column represents a variable and
each row a different observation (different time period or unit); a
variable can be numerical or a string (similar to an Excel spreadsheet)
I matrix: same as a data frame, but all variables/columns have to be of
the same type (typically all numbers)
I list: an object of objects; each element of the list can be, e.g., a data
frame, a matrix, or a vector (similar to a set of Excel spreadsheets)
I Function: in R we can create functions that take a set of arguments and
perform a set of operations on a data object; e.g., mean(x, na.rm=T)

I Package: a group of functions with a specific purpose (e.g., ggplot2)


I install a package: install.packages("ggplot2") (only done once)
I use the package: library(ggplot2) or require(ggplot2)
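
A minimal sketch of these structures (names and values are made up):

df <- data.frame(ticker = c("MMM", "ABT"), price = c(165.4, 38.2)) # data frame
m <- matrix(1:4, nrow = 2)                 # matrix: all entries of one type
l <- list(frame = df, mat = m, vec = 1:3)  # list: an object of objects
mean(c(1, 2, NA), na.rm = TRUE)            # function call, returns 1.5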

6 / 71
Loading data in R

I It is convenient to start an R session by setting the working directory where
the data/files are stored; for example:
I setwd('/Users/username/Baruch/ECO4051/') in Mac/Unix
I setwd('c:/Baruch/ECO4051/') in Windows
I Two ways to load a dataset in R:
1. import the data from a local file
2. import the data from an online resource (e.g., Yahoo Finance, FRED,
Google Finance, Quandl)

7 / 71
Base function read.csv()

I You can load a file from RStudio via Tools -> Import Dataset, where you are
given the options From Text File or From Web URL
I Otherwise, you can type a few lines of code (table from Wikipedia):

splist <- read.csv("List_SP500.csv")


head(splist,10)

Ticker.symbol Security Address.of.Headquarters Date.first.added


1 MMM 3M Company St. Paul, Minnesota
2 ABT Abbott Laboratories North Chicago, Illinois 1964-03-31
3 ABBV AbbVie Inc. North Chicago, Illinois 2012-12-31
4 ACN Accenture plc Dublin, Ireland 2011-07-06
5 ATVI Activision Blizzard Santa Monica, California 2015-08-31
6 AYI Acuity Brands Inc Atlanta, Georgia 2016-05-03
7 ADBE Adobe Systems Inc San Jose, California 1997-05-05
8 AMD Advanced Micro Devices Inc Sunnyvale, California 2017-03-20
9 AAP Advance Auto Parts Roanoke, Virginia 2015-07-09
10 AES AES Corp Arlington, Virginia

I The commands head(x, n) and tail(x, n) show the first and last n
observations of x

8 / 71
Data types

I The str() command can be used to evaluate the object structure and the
data types:

str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : Factor w/ 505 levels "A","AAL","AAP",..: 314 7 5 8 52 58 9 33 3 17 ...
$ Security : Factor w/ 505 levels "3M Company","A.O. Smith Corp",..: 1 3 4 5 6 7 8 10 9 11
$ Address.of.Headquarters: Factor w/ 256 levels "Akron, Ohio",..: 222 159 159 66 210 8 204 226 195 5 ...
$ Date.first.added : Factor w/ 303 levels "","1964-03-31",..: 1 2 213 195 252 273 79 292 249 1 ...

I Each variable in the data frame splist has a type that can be:
I numeric: (or double) is used for decimal values
I integer: for integer values
I character: for strings of characters
I Date: for dates
I factor: represents a type of variable (either numeric, integer, or
character) that categorizes the values in a small (relative to the sample
size) set of categories (or levels)
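
A quick way to check these types on made-up values is the class() function:

class(1.5)                         # "numeric"
class(2L)                          # "integer"
class("ABT")                       # "character"
class(as.Date("1964-03-31"))       # "Date"
class(factor(c("NY", "CA", "NY"))) # "factor" with two levels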

9 / 71
I The read.csv() function has the annoying feature that any string is
interpreted as a factor
I This can be switched off by adding the argument stringsAsFactors = FALSE

splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)


str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date.first.added : chr "" "1964-03-31" "2012-12-31" "2011-07-06" ...

I The ticker symbol, security name, and address are all correctly interpreted
as chr
I The Date.first.added column is also imported as a string, but we would like
it to be of type Date

10 / 71
I The code below defines the column/variable Date.first.added as
a date with the command as.Date()

splist$Date.first.added <- as.Date(splist$Date.first.added, format="%Y-%m-%d")


str(splist)

'data.frame': 505 obs. of 4 variables:


$ Ticker.symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address.of.Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date.first.added : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...

I The $ sign is used to extract a variable/column from a data frame;
splist$Date.first.added extracts Date.first.added from the data
frame object splist
I The as.Date() function converts splist$Date.first.added from character
to Date; the role of the argument format="%Y-%m-%d" is to specify the
format of the date being read

11 / 71
read_csv() from readr package
I In addition to the base read.csv() function, there are other packages that
provide functions to read data
I There are two problems with the read.csv() function:
I type guessing (in particular dates)
I reading speed
I The function read_csv() from the readr package tries to solve both problems
(more on speed later):
library(readr)
splist <- read_csv("List_SP500.csv")
str(splist, max.level=1)

Classes 'tbl_df', 'tbl' and 'data.frame': 505 obs. of 4 variables:


$ Ticker symbol : chr "MMM" "ABT" "ABBV" "ACN" ...
$ Security : chr "3M Company" "Abbott Laboratories" "AbbVie Inc." "Accenture plc" ...
$ Address of Headquarters: chr "St. Paul, Minnesota" "North Chicago, Illinois" "North Chicago, Illinois"
$ Date first added : Date, format: NA "1964-03-31" "2012-12-31" "2011-07-06" ...
- attr(*, "spec")=List of 2
..- attr(*, "class")= chr "col_spec"

I Notice:
I the classes tbl (tibble) and tbl_df mark a type of data frame specific
to this package
I Date first added is defined as a Date, which saves us a line of code
12 / 71
I The file GSPC.csv represents daily data for the S&P 500 Index from January
1985 downloaded from Yahoo Finance
I Below is a comparison of read.csv() and read_csv()

index <- read.csv("GSPC.csv", stringsAsFactors = FALSE)

'data.frame': 8237 obs. of 7 variables:


$ Date : chr "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
$ GSPC.Open : num 167 165 165 164 164 ...
$ GSPC.High : num 167 166 165 165 165 ...
$ GSPC.Low : num 165 164 163 164 164 ...
$ GSPC.Close : num 165 165 164 164 164 ...
$ GSPC.Volume : num 67820000 88880000 77480000 86190000 92110000 ...
$ GSPC.Adjusted: num 165 165 164 164 164 ...

index <- read_csv("GSPC.csv")

Classes 'tbl_df', 'tbl' and 'data.frame': 8237 obs. of 7 variables:


$ Date : Date, format: "1985-01-02" "1985-01-03" "1985-01-04" "1985-01-07" ...
$ GSPC.Open : num 167 165 165 164 164 ...
$ GSPC.High : num 167 166 165 165 165 ...
$ GSPC.Low : num 165 164 163 164 164 ...
$ GSPC.Close : num 165 165 164 164 164 ...
$ GSPC.Volume : num 67820000 88880000 77480000 86190000 92110000 ...
$ GSPC.Adjusted: num 165 165 164 164 164 ...
- attr(*, "spec")=List of 2
..- attr(*, "class")= chr "col_spec"

13 / 71
Saving data files

I In addition to reading/importing files, we might also need to save data files

I This can be done with the base function write.csv() (see help(write.csv)
for the arguments)

index <- read_csv("GSPC.csv")


write.csv(index, file = "myfile.csv", row.names = FALSE)

I The index object is saved to a file called myfile.csv in the working directory
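
The readr package provides an analogous (and faster) write_csv(); a minimal
sketch, assuming index is the data frame loaded above:

library(readr)
write_csv(index, "myfile.csv") # write_csv() never writes row names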

14 / 71
Plotting the data . . .

I Visualization is an essential part of data analysis


I It is useful to guide the analysis and to communicate our results
I It is typically easier to process and understand a graph than a table
of numbers
I The base function in R to plot data is plot(), which takes as arguments:
I x, y: the variables to plot in the x-axis and y-axis
I type: p for points, l for lines, b for both (and more)
I xlim, ylim: the range of the axes
I xlab, ylab: the labels of the axes
I main: string to use for title
I col: color of the point and/or line
I pch: type of point to use

15 / 71
I The code below produces a time series plot of the S&P 500 Index:
I The column index$Date is defined of class Date and used as the x-axis
I The column index$GSPC.Adjusted is used as the y-axis
I The left plot uses the default settings, the plot on the right has been
customized

# LEFT PLOT
plot(index$Date, index$GSPC.Adjusted)
# RIGHT PLOT
plot(index$Date, index$GSPC.Adjusted, type="l", xlab="", ylab="S&P 500 Index", xaxt="n", yaxt="n")
ticks <- seq(index$Date[1], index$Date[nrow(index)], by="year")
axis(1, at=ticks, labels=ticks, cex.axis=0.9, col="orange", col.axis="blue")
axis(2, at=seq(0, 2000, 500), labels=seq(0,2000,500), col.ticks=3,cex.axis=0.75,col.axis="purple")
axis(4, at=seq(0, 2000, 500), labels=seq(0,2000,500), col.ticks=3, cex.axis=0.75,col.axis="purple")
(Figure: time series plots of index$GSPC.Adjusted against index$Date; the left
panel uses the default settings, the right panel the customized axes)

16 / 71
Time series objects

I A variable that is observed over time is called a time series (e.g., stock
prices, real GDP, inflation)
I There are several packages that provide an infrastructure to define an object
as a time series object
I I will mostly use the xts package (which is loaded by the quantmod package
for quantitative finance; other options are ts and zoo)
I To define an object as a time series we use the command xts(), which takes
two arguments:
I a data frame
I a vector of dates (of class Date), passed to order.by

library(xts)
index.xts <- xts(subset(index, select=-Date), order.by=index$Date)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-02 167.20 167.20 165.19 165.37 67820000 165.37
1985-01-03 165.37 166.11 164.38 164.57 88880000 164.57
1985-01-04 164.55 164.55 163.36 163.68 77480000 163.68
1985-01-07 163.68 164.71 163.68 164.24 86190000 164.24

17 / 71
I The xts package provides functions to extract information from a time
series object:

start(index.xts) # start date


end(index.xts) # end date
periodicity(index.xts) # periodicity/frequency (daily, weekly, monthly)

[1] "1985-01-02"
[1] "2017-09-01"
Daily periodicity from 1985-01-02 to 2017-09-01

I There are also functions to aggregate the observations from high frequency
(e.g., daily) to lower frequency (e.g., weekly/monthly/quarterly)
I to.weekly() returns an OHLC bar for each interval: the first open, the
highest high, the lowest low, the last close, and the summed volume

index.weekly <- to.weekly(index.xts)

index.xts.Open index.xts.High index.xts.Low index.xts.Close index.xts.Volume index.xts.Adjusted


1985-01-04 167.20 167.20 163.36 163.68 234180000 163.68
1985-01-11 163.68 168.72 163.68 167.91 509830000 167.91
1985-01-18 167.91 171.94 167.58 171.32 634000000 171.32
1985-01-25 171.32 178.16 171.31 177.35 749100000 177.35

18 / 71
I Functions apply.weekly() and apply.monthly() are used when the goal is
to apply a function to each week/month in the sample
I In the examples below I apply these functions to subsample the first() and
last() day of the week and to calculate the mean() of the week (notice that
first() keeps the first day's full row, while to.weekly() aggregates within
the week)

index.weekly <- apply.weekly(index.xts, "first")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 167.20 167.20 165.19 165.37 67820000 165.37
1985-01-11 163.68 164.71 163.68 164.24 86190000 164.24

index.weekly <- apply.weekly(index.xts, "last")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 164.55 164.55 163.36 163.68 77480000 163.68
1985-01-11 168.31 168.72 167.58 167.91 107600000 167.91

index.weekly <- apply.weekly(index.xts, "mean")

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


1985-01-04 165.71 165.95 164.31 164.54 78060000 164.54
1985-01-11 165.08 166.38 164.83 165.93 101966000 165.93

19 / 71
Plotting time series data

I Once the object is defined as xts, plotting is handled by plot.xts(), which
has the advantages that:
I the x-axis is by default defined to be time
I dates are rendered more easily through the grid
I Using plot() on an xts object automatically calls plot.xts()

GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)


plot(Ad(GSPC))

(Figure: line plot of Ad(GSPC), 1985-2016)

20 / 71
Subsetting
I The xts package provides its own syntax to subset the time series object
I Below are some examples:

par(mfrow=c(2,3)) # organize the plots in 2 rows and 3 columns


plot(index.xts['2007'])
plot(index.xts['2007/2009'])
plot(index.xts['/2007'])
plot(index.xts['2007/'])
plot(index.xts['2007-03-21/2008-02-12'])
plot(index.xts[.indexwday(index.xts)==3])

index.xts["2007"] index.xts["2007/2009"] index.xts["/2007"]

1600

1000 1400
1500

1200

600
1400

800

200
Jan 03 Apr 02 Jul 02 Oct 01 Dec 31 Jan 03 Jan 02 Jan 02 Dec 31 Jan 02 Jan 02 Jan 03 Jan 03 Jan 03
2007 2007 2007 2007 2007 2007 2008 2009 2009 1985 1990 1995 2000 2005

index.xts["2007/"] index.xts["2007−03−21/2008−02−12"] index.xts[.indexwday(index.xts) == 3]

2500
1550
2000

1500
1450
1000

1350

500

Jan 03 Jan 02 Jan 03 Jan 02 Jan 02 Jan 03 Mar 21 Jun 01 Aug 01 Oct 01 Dec 03 Feb 12 Jan 03 Jan 02 Jan 08 Jan 08 Jan 07 Jan 07
2007 2009 2011 2013 2015 2017 2007 2007 2007 2007 2007 2008 1985 1992 1998 2004 2010 2016

21 / 71
getSymbols() from the quantmod package

I There are several packages that provide functions to download economic and
financial data by only specifying the ticker and time period (and frequency
in some functions)
I I will discuss only the getSymbols() function from the quantmod package,
which will be used in this class
I Features of getSymbols():
I Sources: Yahoo Finance, Google Finance, OANDA (fx rates), FRED
(argument src)
I Download multiple tickers in one call
I Select the time period (with arguments from and to)
I By default the output is an xts object for each ticker specified
I Yahoo Finance: downloads open, high, low, close, volume, and
adjusted close at the daily frequency
I You can convert to weekly or monthly using to.weekly()/to.monthly()
or apply.weekly()/apply.monthly()

22 / 71
getSymbols() with one ticker
I By default, the function creates an xts object with the name of the ticker
(minus the ^ part if you are downloading an index)
I When downloading only one ticker, setting auto.assign=FALSE allows you to
assign the output to an object that you name (in the example below, data)

library(quantmod)
getSymbols("^GSPC", src = "yahoo", from = "1990-01-01")

[1] "GSPC"

tail(GSPC, 2)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1
2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5

data <- getSymbols("^GSPC", src = "yahoo", from = "1990-01-01", auto.assign = FALSE)


tail(data, 2)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted


2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1
2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5

23 / 71
Multiple tickers

I A vector of symbols can be passed to getSymbols() to download data for
multiple assets
I The function will create an xts object for each ticker (named after the
ticker; the auto.assign option does not work for more than one ticker)
I When you download more than 5 symbols you will see the message "pausing 1
second between requests for more than 5 symbols"

library(quantmod)
getSymbols(c("^GSPC","^DJI"), src="yahoo", from="1990-01-01")
periodicity(GSPC)
periodicity(DJI)

[1] "GSPC" "DJI"


Daily periodicity from 1990-01-02 to 2018-01-24
Daily periodicity from 1990-01-02 to 2018-01-24

24 / 71
I getSymbols() also allows you to specify an environment (argument env)
I Think of the environment as a folder in the R global environment where the
objects will be stored
I Steps:
I create a new environment with new.env() command (called myenv
below)
I call the getSymbols() function and set the env= argument to the new
environment you created
I the ls() command below lists the objects in the new environment
myenv

splist <- read.csv("List_SP500.csv", stringsAsFactors = FALSE)


myenv <- new.env()
getSymbols(splist$Ticker.symbol[1:10], env=myenv, from="2010-01-01", src="yahoo")

[1] "MMM" "ABT" "ABBV" "ACN" "ATVI" "AYI" "ADBE" "AMD" "AAP" "AES"

ls(myenv)

[1] "AAP" "ABBV" "ABT" "ACN" "ADBE" "AES" "AMD" "ATVI" "AYI" "MMM"

25 / 71
quantmod functionalities

I The object created contains all the information, but for our analysis we
might need only some of the columns
I The package provides functions to extract the open price Op(), the closing
price Cl(), the highest intra-day price Hi(), the lowest Lo(), the volume
Vo(), and the adjusted closing price Ad()
I OpCl() calculates the open-to-close daily return, ClCl() for the close-to-close
return, and LoHi() for the low-to-high difference (also called the intra-day
range)

data.new <- merge(Ad(GSPC),Ad(DJI))

GSPC.Adjusted DJI.Adjusted
1990-01-02 359.69 2810.1
1990-01-03 358.76 2809.7
1990-01-04 355.67 2796.1
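
For example, a quick sketch of the return extractors (assuming GSPC is still in
memory from the earlier download):

oc <- OpCl(GSPC)       # (Cl - Op)/Op, the open-to-close return
cc <- ClCl(GSPC)       # (Cl - lag(Cl))/lag(Cl), the close-to-close return
head(merge(oc, cc), 3)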

26 / 71
Oanda
I Daily exchange rates for a wide range of currency pairs
I Limit of 2,000 days per request
I The oanda.currencies dataset gives you the symbols for 191 currencies

getSymbols(c("USD/EUR", "USD/JPY"), src="oanda")

[1] "USDEUR" "USDJPY"

par(mfrow=c(1,2))
plot(USDEUR)
plot(USDJPY)

(Figure: USDEUR and USDJPY exchange rates, Jul 2017 to Jan 2018)

27 / 71
FRED

I Federal Reserve Economic Data (FRED) can be used to download
macroeconomic time series for the US economy as well as international ones
I Visit FRED to find out the symbol of the variable(s) you are interested in
downloading
I Below are some examples:

library(quantmod)
macrodata <- getSymbols(c("UNRATE","CPIAUCSL","GDPC1"), src="FRED")
macrodata <- merge(UNRATE, CPIAUCSL, GDPC1)
par(mfrow=c(1,3))
plot(UNRATE); plot(CPIAUCSL); plot(GDPC1)

(Figure: UNRATE, CPIAUCSL, and GDPC1, 1947/48 to 2017)

28 / 71
I Exchange rates are also available in FRED (no restriction on the time
period)

# DEXUSEU: U.S. Dollars to One Euro


# DEXJPUS: Japanese Yen to One U.S. Dollar
macrodata <- getSymbols(c("DEXUSEU","DEXJPUS"), src="FRED", from="1975-01-01")
par(mfrow=c(1,2))
plot(DEXUSEU)
plot(DEXJPUS)

(Figure: DEXUSEU since 1999 and DEXJPUS since 1971)

29 / 71
Quandl

I Quandl works as an aggregator of open and subscription databases


I The macroeconomic variables retrieved from FRED can also be obtained
from Quandl:

library(Quandl)
macrodata <- Quandl(c("FRED/UNRATE", ’FRED/CPIAUCSL’, "FRED/GDPC1"),
start_date="1950-01-02", type="xts")
head(macrodata)

FRED.UNRATE - Value FRED.CPIAUCSL - Value FRED.GDPC1 - Value


1950-02-01 6.4 23.61 NA
1950-03-01 6.3 23.64 NA
1950-04-01 5.8 23.65 2147.6
1950-05-01 5.5 23.77 NA
1950-06-01 5.4 23.88 NA
1950-07-01 5.0 24.07 2230.4

30 / 71
I Quandl has many more datasets, e.g. commodity spot and futures prices

oil.spot <- Quandl("COM/WLD_CRUDE_WTI", type="xts")


coffee.spot <- Quandl("COM/COFFEE_BRZL", type="xts")
sp.futures <- Quandl("CHRIS/CME_SP1", type="xts")
gold.futures <- Quandl("CHRIS/CME_GC3", type="xts")
par(mfrow=c(1,4))
plot(oil.spot); plot(coffee.spot); plot(sp.futures$Settle); plot(gold.futures$Settle)

(Figure: oil.spot, coffee.spot, sp.futures$Settle, and gold.futures$Settle)

31 / 71
Reading large files

I Files imported in R are stored in the memory of the program

I The physical limit to the file size that can be imported is determined by the
RAM of your machine (2, 4, 6GB)

I Reading large files can be time consuming when using the base functions

I Two packages are available to help with this task:


I readr (function read_csv())
I data.table (function fread())
I I will perform a speed comparison for the three functions using a common
benchmark

32 / 71
I The benchmark for the comparison is obtained from the Center for Research
in Security Prices (CRSP) at the University of Chicago. The variables in the
dataset are:
I PERMNO: identification number for each company
I date: date in format 2015/12/31
I EXCHCD: exchange code
I TICKER: company ticker
I COMNAM: company name
I CUSIP: another identification number for the security
I DLRET: delisting return
I PRC: price
I RET: return
I SHROUT: shares outstanding
I ALTPRC: alternative price
I The observations are all companies listed on the NYSE, NASDAQ, and
AMEX from January 1985 until December 2016 at the monthly frequency,
for a total of 3,627,236 observations and 16 variables. The size of the file is
328MB

33 / 71
read_csv() from readr package
I The command Sys.time() reads the current time, which is assigned to
start.time
I The time to perform the operation is calculated as the difference between
Sys.time() and start.time

start.time <- Sys.time()


crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE)
end.csv <- Sys.time() - start.time

Time difference of 39.204 secs

I and for the read_csv() function?

library(readr)
start.time <- Sys.time()
crsp <- read_csv("crsp_eco4051_jan2017.csv")
end_csv <- Sys.time() - start.time

Time difference of 5.5403 secs

I read_csv() is 7.1 times faster than read.csv()


34 / 71
fread() from data.table package

I Another function to read data fast is fread()

I The arguments:
I data.table: whether the output should be a data.table (TRUE) or a
regular data frame (FALSE)
I showProgress: whether partial info about the percentage loaded
should be printed (TRUE/FALSE)

start.time <- Sys.time()


crsp <- data.table::fread("crsp_eco4051_jan2017.csv",
data.table=FALSE,
showProgress = FALSE)
end.fread <- Sys.time() - start.time

Time difference of 3.1622 secs

I fread is 12 times faster than read.csv() and 1.8 times faster than
read_csv()
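
As an aside, base R wraps this start/end pattern in the system.time() function,
which prints the elapsed time directly; a sketch:

system.time(crsp <- read.csv("crsp_eco4051_jan2017.csv", stringsAsFactors = FALSE))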

35 / 71
Create returns

I Very often we need to create new variables that are transformations of


existing variables
I One example is to calculate the return of an asset, that is:
1. Simple return: Rt = (Pt − Pt−1)/Pt−1
2. Logarithmic return: rt = log(Pt) − log(Pt−1)
I In macro this transformation is typically called the growth rate
I The transformation is easily done in R using the lag() and diff() functions

GSPC <- getSymbols("^GSPC", from="1990-01-01", auto.assign = FALSE)


GSPC$ret.simple <- 100 * (Ad(GSPC) - lag(Ad(GSPC), 1)) / lag(Ad(GSPC),1)
GSPC$ret.log <- 100 * (log(Ad(GSPC)) - lag(log(Ad(GSPC)), 1))
GSPC$ret.simple <- 100 * diff(Ad(GSPC)) / lag(Ad(GSPC), 1)
GSPC$ret.log <- 100 * diff(log(Ad(GSPC)))
head(GSPC)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted ret.simple ret.log


1990-01-02 353.40 359.69 351.98 359.69 162070000 359.69 NA NA
1990-01-03 359.69 360.59 357.89 358.76 192330000 358.76 -0.25855 -0.25889
1990-01-04 358.76 358.76 352.89 355.67 177000000 355.67 -0.86130 -0.86503
1990-01-05 355.67 355.67 351.35 352.20 158530000 352.20 -0.97562 -0.98041
1990-01-08 352.20 354.24 350.54 353.79 140110000 353.79 0.45145 0.45043
1990-01-09 353.83 354.17 349.61 349.62 155210000 349.62 -1.17867 -1.18567
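
As a quick sanity check of the two formulas (with made-up prices):

P <- c(100, 101)
100 * (P[2] - P[1]) / P[1]    # simple return: 1
100 * (log(P[2]) - log(P[1])) # log return: ~0.995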

36 / 71
I If we want to transform all the columns of a data frame or xts object we can
simply do the operation on the object rather than the variable

# "^GSPC" = S&P 500 Index, "^N225" = Nikkei 225, "^STOXX50E" = EURO STOXX 50
data <- getSymbols(c("^GSPC", "^N225", "^STOXX50E"), from="2000-01-01")
price <- merge(Ad(GSPC), Ad(N225), Ad(STOXX50E))
ret <- 100 * diff(log(price))
tail(ret, 5)

GSPC.Adjusted N225.Adjusted STOXX50E.Adjusted


2018-01-19 0.437565 0.187892 NA
2018-01-22 0.803436 0.034728 0.44324
2018-01-23 0.217200 1.284195 0.19107
2018-01-24 -0.056013 -0.763018 -0.79476
2018-01-25 NA -1.139636 NA

37 / 71
Elegant graphics: ggplot2 package

I The base plotting functions are easy to use and convenient for quick plotting
I However, they lack elegance and it is difficult to produce high-quality
graphics
I The package ggplot2 offers an alternative set of functions to make graphs

I There are two ways of producing plots with ggplot2:


1. qplot() is a wrapper function similar to plot() that uses underlying
ggplot2 plotting functions
2. Using the grammar of graphics, which is composed of:
I ggplot(): creating the graph and specifying the data frame that
contains the variables to plot
I geom_xxx(): the type of plot that is needed; point, line,
histogram, boxplot, etc (see list here)
I aes(): the x and y variables to plot
I theme(): the overall look of the plot; theme_bw(),
theme_classic(), theme_dark() etc.

38 / 71
I ggplot2 does not recognize the time series properties of xts objects and we
have to specify the x axis
I The time(GSPC) command is used to extract the date associated with each
observation

GSPC <- getSymbols("^GSPC", from="1985-01-01", auto.assign=FALSE)


names(GSPC) <- sub("GSPC.", "", names(GSPC))
library(ggplot2)
qplot(time(GSPC), GSPC$Close, geom="line")

(Figure: qplot line chart of GSPC$Close against time(GSPC))

39 / 71
I ggplot2 interacts best with data frames
I We can extract a data frame from the xts object by:
I creating a new variable that represents the Date
I use coredata() to extract the data from the xts object

GSPC.df <- data.frame(Date = time(GSPC), coredata(GSPC))


qplot(Date, Close, data = GSPC.df, geom = "line")

(Figure: qplot line chart of Close against Date from GSPC.df)

40 / 71
I We can produce the same graph using the ggplot2 grammar as in the
example below
I Notice that we can assign a ggplot to an object (called myplot) that can be
used later and altered by only changing specific aspects

myplot <- ggplot(GSPC.df) + geom_line(aes(x = Date, y = Close))


myplot

(Figure: the same line chart produced with ggplot() + geom_line())

41 / 71
I The plots can be customized with themes, line colors and types, label names,
etc.
I par(mfrow=c(2,2)) does not work with ggplot2, so we use the
grid.arrange() function from the gridExtra package
plot1 <-
ggplot(GSPC.df, aes(Date, Adjusted)) + geom_line(color="darkgreen")
plot2 <-
plot1 + theme_bw()
plot3 <-
plot2 + theme_classic() + labs(x="", y="Index", title="S&P 500")
plot4 <-
plot3 + geom_line(color="darkorange") + geom_smooth(method="lm") +
theme_dark() + labs(subtitle="Period: 1985/2016", caption="Source: Yahoo")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

(Figure: plot1 with the default theme, plot2 with theme_bw, plot3 with
theme_classic and title "S&P 500", plot4 with theme_dark, a fitted trend line,
subtitle "Period: 1985/2016", and caption "Source: Yahoo")

42 / 71
I A scatter plot of two variables can be easily produced in ggplot2

data <- getSymbols(c("^GSPC", "^N225"), from="1990-01-01")


price <- merge(Ad(to.monthly(GSPC)), Ad(to.monthly(N225)))
ret <- 100 * diff(log(price))
GN.df <- data.frame(Date=time(price), Year = year(time(price)), coredata(merge(price,ret)))
names(GN.df) <- c("Date","Year", "SP", "NIK", "SPret","NIKret")

plot1 <- ggplot(GN.df, aes(NIKret, SPret)) + geom_point() +


geom_vline(xintercept = 0) + geom_hline(yintercept = 0)
plot2 <- plot1 + geom_smooth(method="lm", se=FALSE) + theme_bw() +
labs(x="NIKKEI", y="SP500")
grid.arrange(plot1, plot2, ncol=2)

(Figure: scatter plots of SPret against NIKret; the right panel adds a regression
line and the labels NIKKEI/SP500)

43 / 71
I The strength of ggplot2 is that it makes it easier to produce sophisticated graphics
I For example: if we want the dots in the scatter plot to depend on a
variable (e.g., Year), this can be done easily by adding the argument
color=Year to the aesthetics

GN.df$Year <- year(time(price))


plot1 <- ggplot(GN.df, aes(NIKret, SPret, color=Year)) + geom_point() +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
plot2 <- ggplot(GN.df, aes(NIKret, SPret, color=factor(Year))) + geom_point() +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) + theme_bw() + labs(x="", y="")
grid.arrange(plot1, plot2, ncol=2)

(Figure: scatter plots colored by Year; a continuous color scale on the left, a
discrete factor(Year) legend on the right)

44 / 71
Boxplot

I A boxplot provides a graphical representation of the distribution of the data

I Box-whisker plot: the box represents the 25%, 50%, and 75% quantiles, and
the whiskers extend to the most extreme values within 1.5 times the
interquartile range of the box; points beyond the whiskers are outliers
I In the graph below the boxplot is plotted for each year separately

ggplot(GN.df, aes(factor(Year), SPret)) + geom_boxplot() +


theme(axis.text.x = element_text(angle = 90))

(Figure: boxplots of SPret by factor(Year), 1990-2018)

45 / 71
Summary statistics
I The function summary() provides a few summary statistics of the
distribution of a data object
I Of course, the variable needs to be of type numeric

summary(GSPC$ret.simple)

Index ret.simple
Min. :1990-01-02 Min. :-9.0350
1st Qu.:1996-12-26 1st Qu.:-0.4399
Median :2004-01-07 Median : 0.0530
Mean :2004-01-07 Mean : 0.0353
3rd Qu.:2011-01-13 3rd Qu.: 0.5541
Max. :2018-01-24 Max. :11.5800
NA's :1

summary(GSPC$ret.log)

Index ret.log
Min. :1990-01-02 Min. :-9.4695
1st Qu.:1996-12-26 1st Qu.:-0.4409
Median :2004-01-07 Median : 0.0530
Mean :2004-01-07 Mean : 0.0292
3rd Qu.:2011-01-13 3rd Qu.: 0.5525
Max. :2018-01-24 Max. :10.9572
NA's :1

46 / 71
I Package fBasics provides the basicStats() function with a more
comprehensive set of descriptive statistics compared to summary()

fBasics::basicStats(GSPC$ret.log)

ret.log
nobs 7072.000000
NAs 1.000000
Minimum -9.469512
Maximum 10.957197
1. Quartile -0.440919
3. Quartile 0.552538
Mean 0.029210
Median 0.052970
Sum 206.545022
SE Mean 0.013172
LCL Mean 0.003390
UCL Mean 0.055031
Variance 1.226761
Stdev 1.107592
Skewness -0.252791
Kurtosis 9.004323

47 / 71
Covariance and correlation between two or more assets

I When we have several variables or assets the first question that arises in the
analysis is whether they co-move
I Dependence is measured using the covariance and the correlation

I Do the S&P 500 and NIKKEI move togheter?

Ret <- subset(GN.df, select=c("SPret","NIKret"))


cov(Ret, use="complete.obs")

SPret NIKret
SPret 16.960 13.446
NIKret 13.446 38.731

cor(Ret, use="complete.obs")

SPret NIKret
SPret 1.00000 0.52463
NIKret 0.52463 1.00000
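
As a check, the correlation is just the covariance scaled by the two standard
deviations:

13.446 / sqrt(16.960 * 38.731) # = 0.52463, matching cor()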

48 / 71
Plotting the data distribution

I A very useful tool to explore the distribution of the data is the histogram,
which represents an estimator of the underlying (population) distribution of
the data. It is useful for assessing (visually) characteristics of the data
such as normality, fat tails, and asymmetry

hist(GSPC$ret.log, breaks=50, xlab="", main="") # base function


qplot(ret.log, data=GSPC, geom="histogram", bins=50) # ggplot function

(Figure: histograms of ret.log with 50 bins; base hist() on the left, qplot() on
the right)

49 / 71
I We can overlay a non-parametric density estimate on the histogram: a
smooth line that goes through the histogram bars

# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="",prob=TRUE)
lines(density(GSPC$ret.log,na.rm=TRUE),col=2,lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
geom_density(color="red", size=1.2) +
theme_bw()

(Figure: histograms of ret.log with a kernel density estimate overlaid in red)

50 / 71
I Or compare the histogram to a distribution (e.g., normal)

# base function
hist(GSPC$ret.log, breaks=50, main="", xlab="Return", ylab="",prob=TRUE)
curve(dnorm(x, mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)),
from=-10, to=10, add=TRUE, col="red",lwd=2)
box()
# ggplot function
ggplot(GSPC, aes(ret.log)) +
geom_histogram(aes(y = ..density..), bins=50, color="black", fill="white") +
stat_function(fun = dnorm, colour = "red",
args = list(mean(GSPC$ret.log, na.rm=T), sd(GSPC$ret.log, na.rm=T)), size=1.2) +
theme_bw()

(Figure: histograms of ret.log with a fitted normal density overlaid in red)

51 / 71
Dates and times in R

I We already used the command as.Date() to define a string to be of type


Date
I The default format of a date in R is 2011-07-17
I If in your dataset a date is specified in a different way, you need to help R
read it by specifying the format= argument
I The syntax of the format is: %d for the numerical day, %a and %A for the
abbreviated/unabbreviated weekday, %m for the numerical month, %b and %B for
the abbreviated/unabbreviated month, %y and %Y for the 2/4 digit year

as.Date("2011-07-17") # default, no need to specify format


as.Date("July 17, 2011", format="%B %d,%Y")
as.Date("Monday July 17, 2011", format="%A %B %d,%Y")
as.Date("17072011", format="%d%m%Y")
as.Date("11@17#07", format="%y@%d#%m")

[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17"

52 / 71
I One operation we might want to do with dates is to calculate the difference
between two dates
I This can be done by subtracting two dates or using the difftime() function,
which also allows specifying the unit of time

date1 <- as.Date("July 17, 2011", format="%B %d,%Y")


date2 <- Sys.Date()
date2 - date1
difftime(date2, date1, units="secs")
difftime(date2, date1, units="days")
difftime(date2, date1, units="weeks")

Time difference of 2384 days


Time difference of 205977600 secs
Time difference of 2384 days
Time difference of 340.57 weeks

53 / 71
Time

I In addition to the date, we might need to specify the time of the day
I This is useful when dealing with intra-day data, such as the FX tick data
shown below

data.hf <- data.table::fread("USDJPY-2016-12.csv",
                             col.names=c("Pair","Date","Bid","Ask"),
                             colClasses=c("character","character","numeric","numeric"),
                             data.table=FALSE, showProgress = FALSE)

Pair Date Bid Ask


1 USD/JPY 20161201 00:00:00.041 114.68 114.69
2 USD/JPY 20161201 00:00:00.042 114.68 114.69
3 USD/JPY 20161201 00:00:00.186 114.68 114.69
4 USD/JPY 20161201 00:00:00.188 114.68 114.69
5 USD/JPY 20161201 00:00:00.189 114.69 114.70
6 USD/JPY 20161201 00:00:00.223 114.69 114.70
7 USD/JPY 20161201 00:00:00.343 114.69 114.70
8 USD/JPY 20161201 00:00:00.347 114.69 114.70
9 USD/JPY 20161201 00:00:00.403 114.69 114.69
10 USD/JPY 20161201 00:00:00.415 114.69 114.69

54 / 71
I To work with time we can use two functions:
I strptime()
I as.POSIXlt()
I Both require specifying the format of the date and time parts
I The format of the time is: %H hour (out of 24), %M minute, %S seconds, and
%OS fractional seconds

strptime("20161201 01:00", format="%Y%m%d %H:%M")


strptime("20161201 00:00:01", format="%Y%m%d %H:%M:%S")
strptime("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
##
date1 <- as.POSIXlt("20161201 00:00:00.041", format="%Y%m%d %H:%M:%OS")
date2 <- strptime("20161201 01:15:00.041", format="%Y%m%d %H:%M:%OS")
date2 - date1
difftime(date2, date1, unit="secs")

[1] "2016-12-01 01:00:00 EST"


[1] "2016-12-01 00:00:01 EST"
[1] "2016-12-01 00:00:00 EST"
[1] "2016-12-01 00:00:00 EST"
Time difference of 1.25 hours
Time difference of 4500 secs

55 / 71
lubridate package

I This package makes it easier to define dates through dedicated functions
that do the as.Date() conversion and the format specification in one step
I These functions are:
I ymd: for dates in the format year, month, day
I dmy: dates with day, month, year format
I mdy: when the format is month, day, year
I ymd_hm: in addition to the date the time is provided in hour and
minute (the date part can be changed to other formats)
I ymd_hms: the time format is hour, minute, and seconds

library(lubridate)
ymd("20110717")
ymd("2011/07/17")
ymd_hm("20110717 01:00")
ydm_hms("20111707 00:00:00.041")

[1] "2011-07-17"
[1] "2011-07-17"
[1] "2011-07-17 01:00:00 UTC"
[1] "2011-07-17 00:00:00 UTC"

56 / 71
I The package provides functions that make it easy to extract the year,
month, day, day of the week/month/year, minute, second, etc.

mydate <- ydm_hms("20111707 00:00:00.041")


year(mydate)
month(mydate)
day(mydate)
wday(mydate, label=T, abbr=FALSE)
minute(mydate)
second(mydate)

[1] 2011
[1] 7
[1] 17
[1] Sunday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
[1] 0
[1] 0.041

57 / 71
dplyr package

I The package dplyr provides functions to manipulate data frames


I Data analysis requires a lot of data manipulation and this package has made
the task significantly easier
I dplyr has 5 main verbs:
I mutate: to create new variables
I select: to select columns of the data frame
I filter: to select rows based on a criterion
I group_by()/summarize: uses a function to summarize
columns/variables into one value per group
I arrange: to order a data frame based on one or more variables (see
the sketch below)
I dplyr is typically used with the %>% piping operator (read "then")
I %>% is useful to write compact code when we are not interested in
using or storing the intermediate results
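
A minimal sketch of arrange() combined with %>%, assuming the GSPC.df data
frame created in the ggplot2 slides is still in memory:

library(dplyr)
GSPC.df %>%
  arrange(desc(Volume)) %>% # sort rows by trading volume, largest first
  select(Date, Volume) %>%
  head(3)                   # the three highest-volume days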

58 / 71
mutate and select

I mutate() to create new variables


I select() to select existing variables

library(dplyr)
library(lubridate)

GSPC.df <- mutate(GSPC.df, range = 100 * log(High/Low),


ret.c2c = 100 * log(Adjusted / lag(Adjusted)),
year = year(Date),
month = month(Date),
wday = wday(Date, label=T, abbr=F))
tail(GSPC.df, 2)

Date Open High Low Close Volume Adjusted range ret.c2c year month wday
8334 2018-01-23 2835.1 2842.2 2830.6 2839.1 3519650000 2839.1 0.41073 0.217200 2018 1 Tuesday
8335 2018-01-24 2845.4 2853.0 2824.8 2837.5 4014070000 2837.5 0.99195 -0.056013 2018 1 Wednesday
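
The second output below keeps only some of the columns; it is presumably
produced by a select() call along these lines:

select(GSPC.df, Date, year, month, wday, range, ret.c2c) %>% tail(2)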

Date year month wday range ret.c2c


8334 2018-01-23 2018 1 Tuesday 0.41073 0.217200
8335 2018-01-24 2018 1 Wednesday 0.99195 -0.056013

59 / 71
filter()

I The filter() function is used to select specific rows of the data frame

filter(GSPC.df, wday == "Tuesday") %>% head(7)

Date Open High Low Close Volume Adjusted range ret.c2c year month wday
1 1985-01-08 164.24 164.59 163.91 163.99 92110000 163.99 0.41400 -0.152332 1985 1 Tuesday
2 1985-01-15 170.51 171.82 170.40 170.81 155300000 170.81 0.82988 0.175790 1985 1 Tuesday
3 1985-01-22 175.23 176.63 175.14 175.48 174800000 175.48 0.84715 0.142568 1985 1 Tuesday
4 1985-01-29 177.40 179.19 176.58 179.18 115700000 179.18 1.46727 0.998381 1985 1 Tuesday
5 1985-02-05 180.35 181.53 180.07 180.61 143900000 180.61 0.80753 0.144058 1985 2 Tuesday
6 1985-02-12 180.51 180.75 179.45 180.56 111100000 180.56 0.72182 0.027697 1985 2 Tuesday
7 1985-02-19 181.60 181.61 180.95 181.33 90400000 181.33 0.36408 -0.148791 1985 2 Tuesday

60 / 71
group_by()/summarize()

I A common operation in data analysis is to group observations based on a


certain characteristic and apply a certain function to each group
I Example: calculate the average/min/max return by day of the week

GSPC.df %>% group_by(wday) %>%


summarize(AV.RET = mean(ret.c2c, na.rm=T),
MIN.RET = min(ret.c2c, na.rm=T),
MAX.RET = max(ret.c2c, na.rm=T))

# A tibble: 5 x 4
wday AV.RET MIN.RET MAX.RET
<ord> <dbl> <dbl> <dbl>
1 Monday 0.010565 -22.8997 10.9572
2 Tuesday 0.066759 -5.9108 10.2457
3 Wednesday 0.054635 -9.4695 8.7089
4 Thursday 0.016346 -7.9224 6.6923
5 Friday 0.019671 -7.0082 6.1328

61 / 71
I The grouping can also be done on two variables, for example month and year

GSPC.df %>% group_by(month, year) %>%


summarize(AV.RET = mean(ret.c2c, na.rm=T),
MIN.RET = min(ret.c2c, na.rm=T),
MAX.RET = max(ret.c2c, na.rm=T))

# A tibble: 397 x 5
# Groups: month [?]
month year AV.RET MIN.RET MAX.RET
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1985 0.393875 -0.54228 2.2566
2 1 1986 0.010744 -2.76472 1.4843
3 1 1987 0.589429 -1.40073 2.3024
4 1 1988 0.198181 -7.00824 3.5231
5 1 1989 0.327143 -0.87157 1.4854
6 1 1990 -0.324090 -2.61989 1.8710
7 1 1991 0.184905 -1.74726 3.6642
8 1 1992 -0.091477 -1.11960 1.4615
9 1 1993 0.035106 -0.87605 0.8903
10 1 1994 0.152304 -0.58097 1.1363
# ... with 387 more rows

62 / 71
Does volatility of the S&P 500 vary over time?

I dplyr functions can be very useful for writing relatively lengthy operations
in a compact and readable manner
I Assume we want to calculate the average volatility by year, using the
intra-day range as a proxy for volatility

GSPC.df %>% mutate(range = 100 * log(High/Low),


year = year(Date)) %>%
group_by(year) %>%
summarize(av.range = mean(range, na.rm=T)) %>%
ggplot(., aes(year, av.range)) + geom_line(color="steelblue4", size=1.3) +
theme_bw() + labs(x="", y="")

(Figure: line plot of the average intra-day range (av.range) by year)

63 / 71
Creating functions in R

I The advantage of a programming language is that you have the flexibility to
write your own functions. This can be useful when:
1. No package provides pre-programmed functions to perform the
analysis you want to conduct

2. The task is very complex and you prefer to break it down into smaller
tasks that make the code easier to read, interpret, and test

3. Once you write a function, you can use it again in future analyses

64 / 71
I A function is a set of operations applied to some data

I The syntax is as follows:

myfunction <- function(inputs)


{
#
# operations
#

return(output)
}

I The function() can include several arguments

I The return() is the output of the function and can include only one object,
although that object might contain several elements (see the sketch below)
I To call the function, you type the name of the function in R with the
appropriate arguments: myfunction()
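
For instance, a sketch of a function with a second (optional) argument that
returns several elements bundled in a list (the name mystats is arbitrary):

mystats <- function(Y, na.rm = TRUE)
{
  if (na.rm) Y <- na.omit(Y)
  return(list(mean = mean(Y), sd = sd(Y), n = length(Y)))
}
mystats(GSPC$ret.log) # a list with three elements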

65 / 71
A function to calculate the sample average

I R provides the mean() function to calculate the sample average

mean(GSPC$ret.log, na.rm=T)

[1] 0.02921

I As an illustration, let’s write a function that calculates the sample average:


R̄ = (R1 + R2 + · · · + RT)/T

I In this case the set of operations to perform is quite simple:


1. sum the values of the time series
2. divide by the total number of observations

66 / 71
I Below is the code that defines a new function called mymean()

# Y is the input, Ybar the output of the function


mymean <- function(Y)
{
Y = na.omit(Y)
Ybar <- sum(Y) / length(Y)
return(Ybar)
}

I Let’s compare the results:

mean(GSPC$ret.log, na.rm=T)

[1] 0.02921

mymean(GSPC$ret.log)

[1] 0.02921

67 / 71
Loops in R

I Loops are a useful tool when you want to perform the same set of operations
on several time series or datasets

I A common loop is the for loop which has the following syntax:

for (i in 1:N)
{
# write your commands here
}

I The loop iterates through the values of i from 1 to N


I Example:

for (i in 1:3)
{
print(i)
}

[1] 1
[1] 2
[1] 3

68 / 71
mysum() function using a for loop

mysum <- function(Y)


{
Y = na.omit(Y)
N = length(Y) # define N as the number of elements of Y
sumY = 0 # initialize the variable that will store the sum of Y

for (i in 1:N)
{
sumY = sumY + as.numeric(Y[i]) # current sum is equal to previous sum
} # plus the i-th value of Y
return(sumY) # as.numeric(): makes sure to transform
} # from other classes to a number

mysum(GSPC$ret.log)

[1] 206.55

sum(GSPC$ret.log, na.rm=T)

[1] 206.55

69 / 71
A simulation exercise

I Simulations are used to evaluate some quantities (e.g., the price of an option
or an estimator) based on a large number of samples generated from a certain
distribution
I The recipe works as follows:
1. generate random values from a model
2. calculate the quantity of interest
3. repeat 1 and 2 many times

I Example: we want to evaluate whether the distribution of the sample mean is
N(µ, σ²/N), where µ is the population mean, σ² the population variance,
and N the sample size

I In the example on the following slide we generate data from a normal
distribution with mean 0 and standard deviation 2

70 / 71
S = 5000 # set the number of simulations
N = 1000 # set the length of the sample
mu = 0 # population mean
sigma = 2 # population standard deviation

Ybar = vector("numeric", S) # create an empty vector of S elements
                            # to store the sample mean of each simulation
for (i in 1:S)
{
  Y = rnorm(N, mu, sigma) # generate a sample of length N
  Ybar[i] = mean(Y)       # store the sample mean
}
ggplot(data = data.frame(Ybar=Ybar), aes(Ybar)) +
geom_histogram(aes(y = ..density..), color="red", fill="lightsalmon", bins = 40) +
stat_function(fun = dnorm, args = list(mean = 0, sd = sigma/sqrt(N)), color="seagreen",size=1.2) +
theme_bw() + xlim(c(-0.3, 0.3))

(Figure: histogram of Ybar with the N(0, σ²/N) density overlaid)

71 / 71
