Data Analytics Btma 636 Day 2

Data Analytics I
Day 2
Introductory Remarks
· We’re very excited to see you sharing a bit about yourselves and getting to
know each other on the #social channel.
· Across everyone in the classes, there is a wide diversity of prior knowledge,
work experiences, and future goals.
· It makes the class both exciting and challenging to teach (in the sense that
we’re balancing different prior knowledge, interests, and goals).
· Make sure your Discord screen name on our server contains your first
name and last name (or last initial).
· We want to make sure everyone’s names are displayed so that the TAs can
go through the class rosters easily.
· In Week 3, we will be removing the names of students who are not on any
class roster (we want to make sure students who have withdrawn by Week
3 do not have access to the lecture recordings).
· If your name is accidentally removed (this may happen if you are using a
nickname not on the class roster), then rejoin the class Discord and let us
know your official name so that we know that you prefer to use your
nickname.
· To get credit for assignments, you need to both submit the D2L quiz
associated with the assignment and submit your work (.R script file or
.Rmd file) to the Dropbox folder.
· Follow the HW naming conventions specified in the assignment.
· “HW[#]_[firstName]_[lastName]” is the naming convention. You can save
your file either as an R script file (which will automatically have a file
extension of .R) or as an R Markdown (which will have file extension .Rmd).
· Make sure to submit your work in a single unzipped file (in other words,
not in multiple files and not in a zipped file).
· In general, follow the instructions very, very carefully.
· Lecture Quizzes will be posted in the D2L Quizzes section every few weeks
(roughly every two weeks), and you will have at least a week to complete
them. They will typically be released on Thursday and due on Friday the
following week.
· They will primarily cover the last two weeks’ worth of lectures, but material
from prior weeks can appear in them as well.
· Each time you attempt the quiz, you may get a different set of questions
(since there is a pool of questions for each quiz).
· As discussed before, these quizzes will be asynchronous, open-book
quizzes in which you can learn from your mistakes. You can work in pairs
or groups of three, but you cannot post anything on Discord.
· Your score will be your best attempt before the deadline.
· Once live Zoom sessions begin, one of the biggest challenges for us will be
helping you in class.
· As discussed in the Welcome Video, in prior years, I could just walk up to
students and help them while redirecting everyone else’s attention to the
projector screen. In an online setting, it is much more difficult to do this, so
we need you to be patient with us (especially when multiple students are
requesting help at the same time).
· This will be our first time offering this class online, so we will be learning
what works/doesn’t work in terms of helping students in an online setting.
· I tested what happens when you record a Zoom meeting that has participants
in it.
· Even if participants have their webcams on, when the host is in
screen-sharing mode and recording the screen, participants’ webcams will not be
recorded (at least in my default setting).
· If I record the meeting when screen-sharing mode is turned off, then
participants will be recorded. For class, we will be in screen-sharing mode
the entire time for lectures (so that you can see the lecture slide that your
instructor is on), so feel free to have your webcam on if you want to.
· If you decide to have your webcam on (optional), then make sure that you
are in an environment that will not distract your peers. Have your
microphone off unless you have a question or are answering a question
from your instructor.
· You should not be streaming lectures on Twitch or anywhere else for the
public to view (lecture recordings will be posted as password-protected
Yuja videos to protect your privacy).
· It should go without saying that non-academic misconduct consequences
apply if you harass or abuse others on Zoom, Discord, etc.
Outline of Today’s Class
· For-Loops
· Finance Analytics
· Logical Operators/Vectors
· Coding Tips and Debugging Tips
· Useful Functions
Programming Basics: The For-Loop
· How would you write a for-loop to replicate the output of this set of
commands?
x <- numeric(0)
x[1] <- 1
x[2] <- 1
x[3] <- (x[1] + x[2])^2
x[4] <- (x[2] + x[3])^2
x[5] <- (x[3] + x[4])^2
x[6] <- (x[4] + x[5])^2
x[7] <- (x[5] + x[6])^2
x[8] <- (x[6] + x[7])^2
x[9] <- (x[7] + x[8])^2
x[10] <- (x[8] + x[9])^2
· When you see a pattern in the copy/pasted line of code, the thing that gets
modified will somehow relate to the indexing variable (i in this case).
x <- numeric(0)
x[1] <- 1
x[2] <- 1
for(i in 3:10){
x[i] <- (x[i-2] + x[i-1])^2
}
· In each copy/pasted line of code, three numbers change: the index you’ll
assign to (left of the assignment operator) and the other two indices on
the right of the assignment operator. Is there a relationship among those
numbers? That pattern is expressed in the for-loop.
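· As a minimal sketch, the same recurrence can be wrapped in a function so that the loop bound is not hard-coded. The function name build_sequence and its parameter n are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: builds the first n terms of the sequence defined by
# x[1] = x[2] = 1 and x[i] = (x[i-2] + x[i-1])^2 for i >= 3.
build_sequence <- function(n) {
  stopifnot(n >= 3)
  x <- numeric(n)   # preallocate a vector of n zeros
  x[1] <- 1
  x[2] <- 1
  for (i in 3:n) {
    # The two indices on the right are always i - 2 and i - 1
    x[i] <- (x[i - 2] + x[i - 1])^2
  }
  x
}

x <- build_sequence(10)
x[1:6]
```

· Calling build_sequence(10) should reproduce the ten values from the copy/pasted version, and changing the argument extends the sequence without editing the loop.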
· The for-loop is not necessarily the fastest, most efficient way of doing
something.
· For example, suppose that you wanted to create a vector whose elements
are the differences of consecutive squares of whole numbers. So the first
entry is $1^2 - 0^2$, the second entry is $2^2 - 1^2$, the third entry is $3^2 - 2^2$,
up to the millionth entry being $1000000^2 - 999999^2$.
· How would you do this with a for-loop, and how would you do this without
a for-loop?
· With a for-loop:
x <- numeric(0)
for(i in 1:1000000){
x[i] <- (i)^2 - (i-1)^2
}
· Without a for-loop:
x <- (1:1000000)^2 - (0:999999)^2
· We can compute the elapsed time (how long it takes to run the code) using
the system.time() function.
system.time(x <- (1:1000000)^2 - (0:999999)^2)
##    user  system elapsed
##    0.00    0.02    0.06
rm(x) # Removes x from the environment
system.time({
  x <- numeric(0)
  for(i in 1:1000000){
    x[i] <- (i)^2 - (i-1)^2
  }
})
##    user  system elapsed
##    0.54    0.03    2.89
· Vectorizing your code (removing for-loops by applying functions to
vectors/data frames/matrices directly) typically will speed up the run-time
of the code.
· R and other languages meant for working with numbers are vectorized for
computational efficiency.
· What that means is that it is (sometimes slightly, sometimes dramatically)
faster for the computer to run the code if it’s been vectorized (by removing
for-loops and replacing them with functions applied to vectors).
· Aside: Vectorizing is not the same as parallel computing. A vectorized
function still does its work on a single core by default, but the loop over
the elements runs inside R’s compiled internal code instead of being
interpreted one iteration at a time. In the for-loop above, R also has to
enlarge the vector x on every iteration, which adds to the run-time. Running
computations on several cores at once requires additional tools (such as
the parallel package).
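· A sketch for checking that loop-based and vectorized versions agree, and for timing them yourself. The function names grow_loop, prealloc_loop, and vectorized are hypothetical, and n is reduced to 100000 here (an assumption) so that it runs quickly. Timings depend on your machine and R version, so no expected timings are shown.

```r
n <- 100000

# Version 1: for-loop that grows the vector one element at a time
grow_loop <- function(n) {
  x <- numeric(0)
  for (i in 1:n) x[i] <- i^2 - (i - 1)^2
  x
}

# Version 2: for-loop that fills a preallocated vector
prealloc_loop <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i^2 - (i - 1)^2
  x
}

# Version 3: arithmetic applied to whole vectors at once
vectorized <- function(n) (1:n)^2 - (0:(n - 1))^2

# All three versions produce the same values
stopifnot(isTRUE(all.equal(grow_loop(n), vectorized(n))))
stopifnot(isTRUE(all.equal(prealloc_loop(n), vectorized(n))))

# Compare the elapsed times
system.time(grow_loop(n))
system.time(prealloc_loop(n))
system.time(vectorized(n))
```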
· Think: What do you think this will do?
example_data <- data.frame(x = numeric(), y = numeric());
for(i in 1:100){
example_data[i, 1] <- i;
example_data[i, 2] <- i^2;
}
plot(example_data, xlab = 'x', ylab = 'y')
· Think: Can you recreate the plot without using a for-loop?
· Again, whenever you can get rid of a for-loop, then do so.
plot(x = 1:100, y = (1:100)^2, xlab = 'x', ylab = 'y')
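· If you still want the values stored in a data frame (as in the for-loop version), one possible sketch is to build both columns directly from vectors:

```r
# Build both columns directly from vectors; no loop needed
example_data <- data.frame(x = 1:100, y = (1:100)^2)

head(example_data, 3)   # look at the first three rows
plot(example_data, xlab = 'x', ylab = 'y')
```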
ggplot
· Base R’s plotting library is not very good. Instead, we will want to use
better graphics packages (ggplot2 or plotly) or export the data to Power BI
or Tableau.
· Let’s install the ggplot2 package.
install.packages('ggplot2')
· After you install a package, if you want to use it within the current
working session, then you have to load it using the library() function.
· Every time you close RStudio or restart R, the working session will be
restarted so that it clears the memory of your global environment data
and packages you’ve loaded.
library(ggplot2)
· Now that we’ve loaded the library (also known as package), we can use
functions within that package.
· Let’s use the qplot() function to ‘quickly plot’ something.
qplot(x = 1:100, y = (1:100)^2, xlab = 'x', ylab = 'y')
· You can also fit a curve through the points by using the geom argument of
the qplot() function.
qplot(x = 1:100, y = (1:100)^2, xlab = 'x', ylab = 'y',
      geom = "smooth")
· When the scatterplot points do not all fall neatly onto the smooth curve
fitting the data points, you may want to see both the points as well as the
fitted curve.
· Then you would do: geom = c("point", "smooth"). See below.
qplot(x = 1:100, y = (1:100)^2 + rnorm(n = 100, mean = 0, sd = 500),
      xlab = 'x', ylab = 'y', geom = c("point", "smooth"))
· Aside/Foreshadowing: Note that the blue curve you see upon running the
code is not the same curve as the previous slide. It will be slightly different.
In this case, R builds a (local) polynomial regression model to best fit the
data points. We will see polynomial regression later in the semester.
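· For comparison, here is a sketch of the same points-plus-curve plot written with the main ggplot() function, where each geom is added as its own layer. The data frame name noisy_data and the fixed seed are assumptions made for this example. Note that qplot() is marked as deprecated in ggplot2 version 3.4.0 and later, so the layered form is worth knowing.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the random noise is reproducible
noisy_data <- data.frame(x = 1:100,
                         y = (1:100)^2 + rnorm(n = 100, mean = 0, sd = 500))

p <- ggplot(data = noisy_data, aes(x = x, y = y)) +
  geom_point() +    # scatterplot layer
  geom_smooth() +   # fitted-curve layer
  labs(x = 'x', y = 'y')
print(p)
```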
Historical Stock Prices
· For your first homework, you will examine stock price data using a
package called ‘quantmod’, which was developed by quant enthusiasts
Josh Ulrich and Jeffrey Ryan (along with other contributors).
· Since R is an open-source language, anyone around the world can write
packages and make them freely available for anyone else to use and
contribute to.
· Later in the semester, you will learn how to scrape data without relying on
packages like quantmod.
Packages
· Related to the open-source nature of R, the upside is that there are a ton
of packages for almost any setting you would care to apply analytics to.
· The downside is that packages depend on one another, and updating a
package (or your version of R) may break the functionality of other
packages. You can typically diagnose this by Googling the error message.
· Before we install the package, let’s make sure we all have the same version
of R.
R.Version() # Version 3.6.2 for Windows
· To get another version of R for Windows, go here:
https://cran.r-project.org/bin/windows/base/old/
· For Macs, you can download older versions of R here:
https://cran.r-project.org/bin/macosx/old/index-old.html
· For Macs, you can download the newest version of R here:
https://cloud.r-project.org/bin/macosx/
· Caveat: I do not have a Mac and have not tested whether the packages
used in this slide deck are compatible with the most up-to-date version of
R for Macs.
· If there is some incompatibility issue (for example, an issue installing one
of the packages we use in this slide deck), then try downloading the
previous release of R for Macs. Students last year who had Macs were able
to download the packages.
· Type this into the Script Editor and run it:
install.packages('quantmod')
· The quantmod package gets its data from a variety of different data
sources, including Yahoo Finance, Oanda, the U.S. Federal Reserve, and
financial research platform Tiingo.
· Now you’ve downloaded this package in R. Once you download a package,
you won’t have to download it again (unless you delete it or uninstall R).
· However, in order to be able to use the functions in the package, you will
need to load the package into the Global Environment. To do this, type this
into the Script Editor and run it.
library(quantmod)
· You have to load the package every time you start a new session in R
(every time you close RStudio). Make sure you include
‘library(packageName)’ in your Script Editor file so that you remember
what packages you’re using.
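· One possible pattern for the top of a script is sketched below: install a package only if it is missing, then load it. The helper name load_or_install is hypothetical (it is not part of base R or of the course materials); requireNamespace(), install.packages(), and library() are standard R functions.

```r
# Hypothetical helper: installs a package only if it is missing,
# then loads it into the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Usage at the top of a script (commented out so that nothing is
# installed unless you choose to run it):
# load_or_install('quantmod')
# load_or_install('ggplot2')
```

· Keeping these calls at the top of the file also documents which packages the script depends on.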
· All packages come with help files that are accessible both online and
directly in the Help Window.
· In the Script Editor, run the following:
??quantmod
· You can also access the help files online in pdf format if you Google ‘R
quantmod package’
· Let’s take a look at Tesla’s stock price over time since August 1, 2017
(retrieving all stock prices on or after this date).
· We can do that using the getSymbols() function from the quantmod
package.
getSymbols(Symbols = 'TSLA', src = "yahoo", from = "2017-08-01")
# The default source for getSymbols() is Yahoo Finance.
# If you left out src = "yahoo", you'd get the same data
· This automatically creates an object called TSLA in your Environment,
which is an extensible time-series (xts) object in R (a data structure
designed specifically for time series data).
· A time series is a sequence of values over time, typically in which future
values of the time series depend on past values.
· We won’t cover time series analysis in this class (it is outside the scope of
the class), but today and in the homework assignment, you will just plot
some stock prices over time (i.e., a time series).
· For those of you who are interested in time series analysis or R for finance,
you can check out: https://www.datacamp.com/tracks/time-series-with-r
and https://www.datacamp.com/tracks/quantitative-analyst-with-r
· Note that the way the getSymbols() function was built by the authors, once
you run the getSymbols() function for a company, then it will automatically
store the data for that company in your Environment.
· Most functions won’t be like that, but the authors preferred to have it this
way since it’s more convenient.
· However, they knew that not everyone has the same preference, so they
gave users the flexibility to store the output of getSymbols() manually.
· If we wanted to store the output of getSymbols() manually, we could set
the env parameter to NULL (so that the output is not automatically saved
in the Environment).
· This allows us to name the object ourselves.
· We could also convert the output to a data frame, instead of keeping it as
an xts object, using the as.data.frame() function.
TSLA.stock.charts <- as.data.frame(getSymbols(Symbols = 'TSLA',
    src = "yahoo", from = "2017-08-01", env = NULL))
· Note that the env = NULL tells getSymbols() not to automatically store the
dataset as TSLA (which it would do by default).
· There are different data structures and classes of objects in R.
· The xts (extensible time-series) class was designed to work better with
time series data. There are functions that work only with extensible time-
series objects but not with regular data frames. Since we will not be
working with time series, we will not need to know much more about the
xts class.
· Just note that even if two data sets look identical, you may not be able to
apply the same function on them if they are of different classes.
TSLA.stock.charts.xts <- getSymbols(Symbols = 'TSLA',
    src = "yahoo", from = "2017-08-01", env = NULL)
str(TSLA.stock.charts.xts) # Look at the structure of this object
str(TSLA.stock.charts) # Look at the structure of this object
· If you viewed the objects (by clicking them in the Environment or by using
the View() function), they look identical.
View(TSLA.stock.charts.xts)
View(TSLA.stock.charts)
· The View() function (make sure you have the V capitalized!) does the same
thing as clicking a dataset in the Environment window.
· Using the View() function in your code lets your teammates know that they
should take a look at the data.
· However, functions designed to work with xts will not work (or work the
same way) for data frames.
· For example, try the index() function.
index(TSLA.stock.charts.xts)
index(TSLA.stock.charts)
· For data frames, the index() function will tell you the row number of the
observation (which line of the data frame that observation was recorded
in). For xts objects, the index() function will tell you the date of that
observation.
· Note that when we ran the getSymbols() function initially (without setting
the env parameter to NULL), it automatically created an xts object for the
data (TSLA). So TSLA.stock.charts.xts is an identical copy of TSLA.
· In the next few slides, we will plot the closing price of TSLA over time.
To get the date of the observation for each price record, we will do:
index(TSLA)
· Suppose we wanted to plot the closing price over time. First, we’ll define a
data frame.
· We want to have the ticker.date column be a date data type, so we use the
as.Date() function.
· Also, Closing.price should be a number, so we use the as.numeric()
function.
TSLA.data <- data.frame(ticker.date = as.Date(index(TSLA)),
                        Closing.price = as.numeric(TSLA$TSLA.Close))
· Note: This is how you can define a new data frame using data from
previously defined objects.
· Now that we have our data frame, we can plot it.
· We will use the qplot() function (quick plot) from the ggplot2 package.
qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price)
· Maybe it would be better as a line plot.
· We can do that by changing a parameter of qplot().
qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price,
      geom = 'line')
· You can change the x-axis and y-axis labels.
qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price,
      geom = 'line') +
  labs(x = 'Date', y = 'Closing Price')
· You can add and center the title.
qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price,
      geom = 'line') +
  labs(x = 'Date', y = 'Closing Price') +
  ggtitle("TSLA Stock Price") +
  theme(plot.title = element_text(hjust = 0.5))
# Note that hjust is the horizontal adjustment factor
· For ggplot, left-adjusted is the default for titles (hjust = 0).
· Center-adjusted is hjust = 0.5.
· To right-align the title, you would use hjust = 1.
qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price,
      geom = 'line') +
  labs(x = 'Date', y = 'Closing Price') +
  ggtitle("TSLA Stock Price") +
  theme(plot.title = element_text(hjust = 1))
· In ggplot, we can change the axis labels and title programmatically by
adding layers to the plot.
p <- qplot(x = TSLA.data$ticker.date, y = TSLA.data$Closing.price,
           geom = 'line') +
  labs(x = 'Year', y = 'Closing price')
print(p) # print(p), or just p, will display p in the Plot Viewer
· The main feature of ggplot is that we can add more layers on to
previous plot layers. Instead of trying to build everything in one shot, you
can build the plot piece by piece.
p <- p +
  ggtitle("TSLA Stock Price") +
  theme(plot.title = element_text(hjust = 0.5))
p # print(p), or just p, will display p in the Plot Viewer
· Here, we add a title layer and a theme layer.
· By default, ggplot titles are left-aligned.
· The theme() function allows us to center the title.
· We could also use the main ggplot() function, which allows us to create
more refined plots than the qplot() function.
ggplot(data = TSLA.data,
       aes(x = ticker.date, y = Closing.price)) +
  geom_line() +
  labs(x = 'Year', y = 'Closing price') +
  ggtitle("TSLA Stock Price") +
  theme(plot.title = element_text(hjust = 0.5))
· This is saying we will take the TSLA.data data frame and use the
ticker.date column as the x-axis and the Closing.price column as the y-axis
(aesthetics can also include colors, labels, etc.).
· We want a line plot, so we add geom_line() to the plot.
· We then label the axes and title as we want.
· There’s no need to learn everything about ggplot now; you can learn the
syntax by studying the examples I provide (as well as ggplot examples
and tutorials online).
· For those of you who prefer Power BI, we can run the R code in Power BI
to get the TSLA stock data, and then plot it.
· I will show you this in a separate video (which will be posted to the folder:
Content > Supplementary Materials > Power BI).
· The advantage of coding it programmatically is that everyone on your
team can know how you created the plot and what customizations you
made to it.
· Going back to R, create the same plot but with Amazon (AMZN instead of
TSLA).
· Going back to Tesla, suppose that we were day traders and wanted to
examine the daily return of Tesla.
· We could use the quantmod package to easily get the historical stock price
data.
· We could then get metrics on the performance of that stock, and we could
even test different trading strategies on the historical data to compare the
performance of different trading strategies.
· We won’t compare trading strategies in this class, since this class is not a
finance-focused class. However, we will do some basic exploratory data
analysis with the stock price data in HW 1 (ignoring considerations such as
tax implications, dividends, etc. in the analysis).
## ticker.date Closing.price
## 1 2017-08-01 63.914
## 2 2017-08-02 65.178
## 3 2017-08-03 69.418
## 4 2017-08-04 71.382
## 5 2017-08-07 71.034
## 6 2017-08-08 73.044
· The definition of daily return is
$\text{daily return} = \frac{\text{today's value} - \text{yesterday's value}}{\text{yesterday's value}}$
· Looking at the data, what is the daily return between the first and second
day?
· $\frac{65.178 - 63.914}{63.914} \approx 0.0198 = 1.98\%$
· The daily return is positive, meaning that the price increased between the
first and second day. If you bought $100 worth of it on the first day and
sold it the second, then you would gain $1.98 (not accounting for tax
implications).
· What about between the second and third days?
· $\frac{69.418 - 65.178}{65.178} \approx 0.06505263 \approx 6.51\%$
· The daily return is again positive, meaning that the price increased
between the second and third days.
· What is the daily return between the third and fourth days?
· $\text{return}[1] = \frac{65.178 - 63.914}{63.914}$
· $\text{return}[2] = \frac{69.418 - 65.178}{65.178}$
· $\text{return}[3] = \frac{71.382 - 69.418}{69.418}$
· Do you see the general structure/pattern?
· If N is the total number of rows of the data frame, then what we want to
create looks something like this:
· $\text{return}[1] = \frac{65.178 - 63.914}{63.914}$
· $\text{return}[2] = \frac{69.418 - 65.178}{65.178}$
· …
· $\text{return}[N-1] = \frac{\text{price}[N] - \text{price}[N-1]}{\text{price}[N-1]}$
· How could we create an entire column of daily return values?
· One approach you might be thinking of is to use a for-loop.
· Again though, for computational efficiency, if you can avoid using
for-loops, you should do so.
· In particular, here we can create this daily return column by doing
calculations directly on the Closing Price column.
N <- nrow(TSLA.data) # Gives number of rows of the data frame
# Starting at the second entry, this gives the closing price for
# consecutive days until the last day.
todays_price <- TSLA.data$Closing.price[2:N]
# Starting at the first entry, this gives the closing price for
# consecutive days until the 2nd to last day.
# Note the parentheses: 1:N - 1 would be read as (1:N) - 1.
yesterdays_price <- TSLA.data$Closing.price[1:(N - 1)]
TSLA_dailyReturn <- (todays_price - yesterdays_price)/yesterdays_price
· The math operations used when defining TSLA_dailyReturn are ‘entrywise
operations.’ This means that when R computes the ith entry of
TSLA_dailyReturn, R takes the difference between the ith entries in
todays_price and yesterdays_price and divides that by the ith entry in
yesterdays_price.
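· A small worked sketch of the entrywise calculation, using the first six TSLA closing prices shown earlier. The variable names price and daily_return are chosen for this example. It also shows why the parentheses in 1:(N - 1) matter: the colon operator is evaluated before the subtraction.

```r
# The first six TSLA closing prices shown earlier
price <- c(63.914, 65.178, 69.418, 71.382, 71.034, 73.044)
N <- length(price)

todays_price     <- price[2:N]        # entries 2, 3, ..., N
yesterdays_price <- price[1:(N - 1)]  # entries 1, 2, ..., N - 1

# Entrywise: the ith return uses the ith entry of each vector
daily_return <- (todays_price - yesterdays_price) / yesterdays_price
round(daily_return, 4)

# Operator precedence: ':' is evaluated before '-'
1:N - 1     # same as (1:N) - 1, which is 0 1 2 3 4 5
1:(N - 1)   # 1 2 3 4 5
```

· Indexing with 0:(N - 1) happens to return the same entries as 1:(N - 1), because R silently ignores an index of 0, but writing the parentheses makes the intent explicit.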
# To add it to TSLA.data, we can append NA for the first entry
TSLA.data$daily.return <- c(NA,
    (todays_price - yesterdays_price)/yesterdays_price)
· Note: If you want to add a column to a pre-existing data frame, you have
to make sure that you have the same number of entries in the column as
there are rows in the data frame (otherwise, there would be an error).
· Now we can study these daily returns.
· For example, we can look at statistical measures such as the mean and
standard deviation.
mean(TSLA.data$daily.return, na.rm = TRUE)
## [1] 0.003087012
sd(TSLA.data$daily.return, na.rm = TRUE)
## [1] 0.04096016
· Note that the na.rm parameter should be set to TRUE since the first entry
was NA.
· Alternatively, you could have done:
mean(na.omit(TSLA.data$daily.return))
## [1] 0.003087012
sd(na.omit(TSLA.data$daily.return))
## [1] 0.04096016
· The na.omit() function returns all rows of a data frame in which there are
no NA values for that row (if there is a row with an NA value in some
column, then that row will be removed by the na.omit() function). Using
the na.omit() function on a vector means that we remove all NA values
from the vector.
· Alternatively, you could have done:
mean(TSLA.data$daily.return[2:N])
## [1] 0.003087012
sd(TSLA.data$daily.return[2:N])
## [1] 0.04096016
· Since you know that the first entry is NA (you defined the column that
way), you can tell R to ignore that entry in the computation.
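· The three approaches can be compared side by side on a small vector. The values below are made up for illustration.

```r
returns <- c(NA, 0.02, -0.01, 0.03)   # made-up values for illustration

mean(returns)                      # NA: one missing value makes the result NA
mean(returns, na.rm = TRUE)        # drops NA values before averaging
mean(na.omit(returns))             # removes NA values first, then averages
mean(returns[2:length(returns)])   # skips the first entry by position
```

· The last version only works when you know exactly where the NA values are; na.rm = TRUE and na.omit() handle NA values in any position.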
· We can visualize these daily return values as well.
qplot(TSLA.data$daily.return) +
  xlab('daily return') + ylab('count') +
  ggtitle("Histogram of TSLA Daily Returns") +
  theme(plot.title = element_text(hjust = 0.5))
· Problem: Create a data frame AMZN.data that has Amazon’s daily returns
since August 1, 2017.
tech.symbol <- 'AMZN'
getSymbols(Symbols = tech.symbol, src = "yahoo", from = "2017-08-01")
## [1] "AMZN"
AMZN.data <- data.frame(ticker.date = as.Date(index(AMZN)),
                        Closing.price = as.numeric(AMZN$AMZN.Close))
N <- nrow(AMZN.data)
todays_price <- AMZN.data$Closing.price[2:N]
# Parentheses matter: 1:N-1 would be read as (1:N) - 1
yesterdays_price <- AMZN.data$Closing.price[1:(N - 1)]
AMZN.data$daily.return <- c(NA,
    (todays_price - yesterdays_price)/yesterdays_price)
· Problem: Using qplot(), create a scatterplot of the daily return values of
TSLA (on the x-axis) and AMZN (on the y-axis) to see how correlated those
daily returns are. Would you say these daily returns are positively
correlated, negatively correlated, or uncorrelated?
qplot(x = TSLA.data$daily.return, y = AMZN.data$daily.return) +
  labs(x = 'TSLA daily return', y = 'AMZN daily return')
· Note that this illustrates how you can create scatterplots of data from two
datasets.
· We can quantify the correlation using the cor() function.
cor(na.omit(TSLA.data$daily.return), na.omit(AMZN.data$daily.return))
## [1] 0.3839377
· Note: The cor() function does not have an na.rm argument. We have to
remove the NA value in the first entry of the vectors, since leaving it in
would result in NA for the output of cor(). We can use the na.omit()
function to handle this.
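· Applying na.omit() to each vector separately is safe here because both vectors have their only NA in the same (first) position. When NA values can fall in different positions, a sketch like the following keeps the pairs matched up. It uses the use argument of cor(); the vectors a and b are made-up values for illustration.

```r
# Made-up return vectors with missing values in different positions
a <- c(NA, 0.01, 0.02, -0.01, 0.03)
b <- c(0.02, NA, 0.01, -0.02, 0.04)

# Keeps only the positions where both a and b are observed
cor(a, b, use = "complete.obs")

# Equivalent by hand: build the shared mask, then subset both vectors
keep <- !is.na(a) & !is.na(b)
cor(a[keep], b[keep])

# Caution: na.omit(a) and na.omit(b) would drop different positions here,
# so the remaining entries would no longer be matched day by day.
```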
· We could also have selected the entries we knew were not NA values.
cor(TSLA.data$daily.return[2:N], AMZN.data$daily.return[2:N])
## [1] 0.3839377
· We will more formally define what the correlation of two random variables
is later in the semester (when we review probability/statistics
fundamentals).
Logical Operators
· Recall that ! is the NOT operator, & is the AND operator, | is the OR
operator.
· To give a sense of AND, here is HW 1 pdf’s release condition prior to
September 12. If you do not meet all of the conditions, then you cannot
see the HW 1 pdf in the Homework folder.
· To give a sense of OR, here is Textbook Quiz #2’s release condition. You will
be able to see Textbook Quiz #2 if you either get 100% on Textbook Quiz
#1 or you attempted the quiz twice.
· Note that this is not an exclusive OR, meaning that if you attempted the
quiz twice and got 100% on that second attempt, you would not be
excluded from seeing Quiz #2.
· The xor() function is the EXCLUSIVE OR function. It is asking, “Is
exactly one of these two conditions true?” If both conditions are TRUE (or
both are FALSE), then xor() returns a FALSE value.
xor(1 + 1 == 2, 2 + 2 == 4)
## [1] FALSE
xor(1 + 1 == 200, 2 + 2 == 4)
## [1] TRUE
xor(1 + 1 == 200, 2 + 2 == 400)
## [1] FALSE
(1 + 1 == 2) | (2 + 2 == 4)
## [1] TRUE
xor(1 + 1 == 2, 2 + 2 == 4)
## [1] FALSE
· Caution: Students sometimes confuse ‘or’ with ‘exclusive or,’ since in
plain language people typically mean ‘xor’ when they say ‘or.’ Be careful,
since in programming/logic, there is a difference.
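· A truth table makes the difference concrete. This sketch lists all four combinations of two conditions (the names A and B are chosen for this example) and shows where | and xor() disagree.

```r
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# One row per combination of A and B
data.frame(A = A, B = B, A_or_B = A | B, A_xor_B = xor(A, B))
```

· The two columns differ only in the first row, where both A and B are TRUE.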
· The xor() function is associative (like addition). This means that if A, B, and
C are logical conditions, then (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C).
xor(xor(1 + 1 == 2, 2 + 2 == 5), 3 + 3 == 18)
## [1] TRUE
· Here, A is ‘1 + 1 == 2’, B is ‘2 + 2 == 5’, and C is ‘3 + 3 == 18.’
· If you have a bunch of conditions, we can see later how you can automate
the writing of the xor() command using the paste() and eval() functions.
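· As a different technique from the paste()/eval() approach mentioned above, base R’s Reduce() function can chain xor() over a vector of conditions. This is only a sketch; the vector name conditions is chosen for this example.

```r
conditions <- c(1 + 1 == 2, 2 + 2 == 5, 3 + 3 == 18)

# Reduce() applies xor() from left to right: xor(xor(c1, c2), c3)
Reduce(xor, conditions)

# Because xor is associative, folding from the right gives the same answer
Reduce(xor, conditions, right = TRUE)
```

· Keep in mind that a chain of xor() calls is TRUE when an odd number of the conditions are TRUE, which is not the same as “exactly one is TRUE” once there are three or more conditions.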
· The outcome of any comparison is either TRUE or FALSE.
· When a vector is compared to a constant, the comparison is done
entrywise.
(1:7) > 4
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
· TRUE corresponds to 1 and FALSE corresponds to 0.
· Therefore, if you want to count how many instances in a vector satisfy a
certain condition, you can sum the outcome of a logical vector.
sum((1:100) > 90)
## [1] 10
· If you wanted to understand what fraction of elements in a vector satisfied
a certain condition, how would you do that?
· You can sum the logical vector, and then divide by the total number of
elements in the vector.
sum((1:100) > 90)/length(1:100)
## [1] 0.1
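· Since the mean of a vector is its sum divided by its length, the same fraction can be obtained in one step with mean(). A short sketch:

```r
x <- 1:100

sum(x > 90) / length(x)   # count of TRUE values divided by the length
mean(x > 90)              # the mean of a logical vector is the same fraction
```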
Logical Vectors
· TRUE/FALSE values are encoded as 1’s and 0’s.
· A key concept of the logic of TRUE/FALSE statements is that AND is
associated with multiplying TRUE/FALSE values.
x <- 1:10 # A vector with entries/values 1 to 10
(x > 3) # Creates a logical vector
(x < 6) # Creates another logical vector
· Using arithmetic operations, how can you create a logical vector (a vector
of 0’s and 1’s) that shows ‘x is greater than 3 AND x is less than 6’?
(x > 3) * (x < 6)
## [1] 0 0 0 1 1 0 0 0 0 0
· Note that this is equivalent to the following:
((x > 3) & (x < 6))
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
· This is answering the question, “Which numbers are both greater than 3
AND less than 6?”
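· The same arithmetic idea extends to OR and NOT. The sketch below (variable names chosen for this example) checks each arithmetic version against the corresponding logical operator.

```r
x <- 1:10
a <- x > 3
b <- x < 6

and_arith <- a * b         # 1 only where both are TRUE
or_arith  <- (a + b) > 0   # TRUE where at least one is TRUE
not_arith <- 1 - a         # flips 1s and 0s

# Each arithmetic version agrees with the logical operator
stopifnot(all(and_arith == (a & b)))
stopifnot(all(or_arith == (a | b)))
stopifnot(all(not_arith == !a))
```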
· However, note that there is a difference between a single & and a double
&&.
· The double && answers the question, “Is it the case that the first entries in
these vectors satisfy both of the conditions?”.
· The answer to that question is either TRUE or FALSE.
((x > 3) && (x < 6))
((x > 3) & (x < 6))[1] # Same outcome as &&
· A single & creates a vector of TRUE/FALSE values while a double &&
returns only a single TRUE/FALSE value.
· Just use & unless you only care about the first entries. Note that this
first-entry behavior applies to the R version used in class (3.6.2); as of
R 4.3.0, && and || raise an error if either side has length greater than one.
· Similarly for || instead of |.
· The double || answers the question, “Is it the case that the first element of
the vector x is either greater than 3 OR less than 6?”.
· The answer is either TRUE or FALSE.
((x > 3) || (x < 6))
((x > 3) | (x < 6))[1] # Same outcome as ||
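· If what you want is a single TRUE/FALSE answer about a whole vector, any() and all() are usually the clearer tools. The sketch below also shows how to test only the first entry by indexing explicitly, which runs the same way in every R version.

```r
x <- 1:10

# Elementwise: one TRUE/FALSE per entry of x
x > 3 & x < 6

# Single answers about the whole vector
any(x > 3 & x < 6)   # is at least one entry greater than 3 and less than 6?
all(x > 3 | x < 6)   # does every entry satisfy at least one condition?

# Testing only the first entry: index explicitly, then use && or ||
(x[1] > 3) && (x[1] < 6)
(x[1] > 3) || (x[1] < 6)
```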
· Being able to create logical vectors makes computations easier. For
example, suppose you were a professor deciding grade thresholds at the
end of the semester. Let’s suppose that the raw final scores are given
below as grades.vec.
# Setting the seed makes sure we all get the same answers when we
# use a random number generator
set.seed(1)
# Creates a vector of random numbers that are Normally distributed
grades.vec <- rnorm(n = 30, mean = 79, sd = 6)
· Using the grade cutoffs specified in the course outline for this class, what
would the class average GPA be?
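# Each term is (grade points) * (logical vector); a score that falls
# outside every band listed here contributes 0
GPA.vec <- 4*(grades.vec >= 94.0) +
  3.7*(grades.vec >= 91.0 & grades.vec < 94.0) +
  3*(grades.vec >= 84.0 & grades.vec < 87.5) +
  2*(grades.vec >= 72.5 & grades.vec < 76.5) +
  1*(grades.vec >= 59.5 & grades.vec < 64.0)
mean(GPA.vec)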
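· A small sketch of the same idea on a toy example. The scores and cutoffs below are hypothetical (they are not the cutoffs in the course outline). It also shows one possible alternative, the base R function findInterval(), which looks up the band for each score.

```r
# Hypothetical scores and cutoffs, for illustration only
scores <- c(95, 85, 60)

# Logical-vector version: each score falls in at most one band
gpa_logical <- 4 * (scores >= 90) +
  3 * (scores >= 80 & scores < 90) +
  2 * (scores >= 70 & scores < 80)

# Lookup version: findInterval() counts how many cutoffs each score has reached
cutoffs <- c(70, 80, 90)
points <- c(0, 2, 3, 4)   # points for reaching 0, 1, 2, or 3 of the cutoffs
gpa_lookup <- points[findInterval(scores, cutoffs) + 1]

class_average <- mean(gpa_logical)
```

· Both versions give the same grade points here. The lookup version can be easier to maintain when there are many bands, since each cutoff is written only once.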
100/144
· For-Loops

· Useful Functions
101/144
Programming Tip
· In everyday language, write down your process of thinking about how to
solve the problem at a high-level. Do not go into details; start with the big
picture.
· Make sure the logic is solid before trying to implement it in R (or any
programming language).
· “If I can figure out how to track the value of my investment of a single stock
over time, then I can track the value of a portfolio (which is just a weighted
average of the individual stocks).”
· “Once I can track the value of a portfolio over time, then I can figure out
the daily return of the portfolio over time.”
102/144
Programming Tip
· Your initial thought process may not be perfect (it may miss technical
details and complications that come up when implementing your
approach).
· That is OK, since you can always update your approach and think of new
ones. Thinking of the solution approach at a high level prevents you from
getting stuck for a long time.
103/144
Problem-Solving Tip
· Simplify the problem.
· This is especially true once you start to implement your high-level solution
approach.
· Once you figure out how to solve a simpler problem, it may help you see
how to approach a larger or more complex problem.
· For your projects (large problems), you will likely have to break each
problem into many smaller sub-problems and figure out how to solve
those pieces before figuring out how to solve the larger problem.
104/144
Problem-Solving Tip
· Example: Suppose that you wanted to store the max daily return over the
last month for the top dozen cryptocurrencies by market capitalization.
· “I want to store the symbol and max daily return over the last month for
each of the top dozen coins by market cap into a data frame using a
for-loop.”
· “Instead of doing that, can I figure out how to use a for-loop to store just
the symbols for the top dozen coins (BTC, ETH, etc.)?”
105/144
The crypto package
· Let’s practice this with an example.
· Cryptocurrency is a digital asset that uses cryptography to verify
transactions.
· Cryptocurrencies can be traded and exchanged, and current and historical
prices for different cryptocurrencies can be seen on sites such as
CoinMarketCap.
· There is an R package, crypto, that scrapes CoinMarketCap data, created by
cryptocurrency enthusiast Jesse Vent.
install.packages('crypto') # Only needs to be run once
library(crypto)
# The crypto_list() function creates a data frame sorted by market cap
crypto.df <- crypto_list()
View(crypto.df)
106/144
Debugging Tip
· When you have an error in a for-loop (or while-loop or if-then statement),
then ‘step’ into the code.
· Suppose that you wanted to create a data frame (called simplified_problem)
which has the symbols of the top dozen cryptocurrencies.
· What’s wrong with the code below?
simplified_problem <- data.frame(symbol = character(),
                                 stringsAsFactors = FALSE)
crypto.df <- crypto_list()
for(i in 1:12){
  coin.symbol <- crypto.df[i, 1]
  simplified_problem[1, 1] <- coin.symbol
}
107/144
Debugging Tip
· When you have an error in a for-loop (or while loop or if-then statement),
then ‘step’ into the code.
· You can ‘step into the loop’ by copy/pasting everything in the for-loop
repeatedly as you increase the index i.
108/144
Debugging Tip
i <- 1
coin.symbol <- crypto.df[i, 1]
simplified_problem[1, 1] <- coin.symbol
View(simplified_problem) # See what's going on
109/144
Debugging Tip
i <- 2
coin.symbol <- crypto.df[i, 1]
simplified_problem[1, 1] <- coin.symbol
View(simplified_problem) # See what's going on
· Do you see the problem now?
110/144
Debugging Tip
· That is what ‘stepping into the loop’ means.
· The for-loop is essentially copy/pasting everything in the braces over and
over again as you’re increasing the index.
· We can see what it is doing by manually copy/pasting the code and
increasing the index ourselves.
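· Stepping through the loop shows that every pass writes to row 1, so each symbol overwrites the previous one. A sketch of one possible fix is below. It uses a small made-up data frame as a stand-in for the crypto_list() output (an assumption, so that it runs without the crypto package or an internet connection).

```r
# Hypothetical stand-in for crypto_list(); the real output has more rows and columns
crypto.df <- data.frame(symbol = c("BTC", "ETH", "XRP", "LTC"),
                        stringsAsFactors = FALSE)

simplified_problem <- data.frame(symbol = character(),
                                 stringsAsFactors = FALSE)

for (i in 1:nrow(crypto.df)) {
  coin.symbol <- crypto.df[i, 1]
  simplified_problem[i, 1] <- coin.symbol   # write to row i, not row 1
}
```

· After the loop, simplified_problem has one row per coin instead of a single row holding the last symbol.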
111/144
Debugging Tip
· Sometimes, the issue with the loop is at some different index, like i = 100.
If that is the case, see what is happening at i = 99 and i = 100.
· You can use a for-loop for i from 1 to 98, and then manually step into the
loop for i = 99 and i = 100.
· You can also use the try() function to pinpoint the troublemaker.
112/144
Debugging Tip
· The try() function is helpful for debugging loops and if-then statements.
for(i in 1:12){
coin <- crypto.df[i, 1]
coin_charts <- crypto_timeseries(coin)
}
· The for-loop could break down if the CoinMarketCap site is down.

· If/when that happens, then you might want to figure out which line of code
is causing the issue.
113/144
Debugging Tip
· Suppose that there was some bug in the for-loop. The first time there’s a
hiccup, the for-loop would just stop and return an error.
for(i in 1:12){
  coin <- crypto.df[i, 1]
  coin_charts <- crypto_timeseries(coin)
}
· However, with try(), you can catch all the indices that are causing
problems.
114/144
Debugging Tip
· One way to find the trouble spots would be the following:
for(i in 1:12){
  coin <- crypto.df[i, 1]
  # Allows us to catch errors in the crypto_timeseries() function
  errorcatch <- try(coin_charts <- crypto_timeseries(coin))
  if("try-error" %in% class(errorcatch)){
    print(i)
  }
}
· “Try this line of code (storing the crypto_timeseries() output into
coin_charts). If there’s an error, then print the index i.”
115/144
Debugging Tip
· One way to continue the loop and skip over troublemaker indices is the
following:
for(i in 1:12){
  coin <- crypto.df[i, 1]
  # Allows us to catch errors in the crypto_timeseries() function
  errorcatch <- try(coin_charts <- crypto_timeseries(coin))
  if(!("try-error" %in% class(errorcatch))){ # Note the ! operator
    # Commands to run when there is no error
    # Insert commands to run here
  }
}
· “Try this line of code. If there’s no error, then run the given commands.”
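· A self-contained sketch of both patterns (recording the troublemaker indices and continuing past them). The function fetch_prices() and the symbols below are hypothetical stand-ins for a download that sometimes fails; they are not part of the crypto package.

```r
# Hypothetical stand-in for a download that fails for some inputs
fetch_prices <- function(symbol) {
  if (symbol %in% c("BAD1", "BAD2")) {
    stop("no data for ", symbol)
  }
  data.frame(symbol = symbol, close = 100)
}

symbols <- c("BTC", "BAD1", "ETH", "BAD2")
failed <- integer(0)   # indices that caused an error
results <- list()      # output for the indices that worked

for (i in seq_along(symbols)) {
  # silent = TRUE keeps the error message from being printed
  errorcatch <- try(fetch_prices(symbols[i]), silent = TRUE)
  if (inherits(errorcatch, "try-error")) {
    failed <- c(failed, i)                   # remember the troublemaker and move on
  } else {
    results[[symbols[i]]] <- errorcatch      # commands to run when there is no error
  }
}
```

· Here inherits(errorcatch, "try-error") plays the same role as "try-error" %in% class(errorcatch). The loop finishes all four iterations, and failed holds the indices that need a closer look.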
116/144
Debugging Tip
· Another approach is to comment out large, independent chunks of the
code to see which chunk is causing the issue.
· For example, if you had four large chunks of code that do not depend on
each other, then comment out two of them and run all of your code at
once (Ctrl + A, then Ctrl + Enter). If there is an issue, then the issue must be
in one of the chunks that you just ran.
· This ‘binary search’ approach can be helpful when you have large,
independent blocks of code in your file.
117/144
A Difference Between <- and =
· Note that in errorcatch, we had the line:
errorcatch <- try(coin_charts <- crypto_timeseries(coin))
· What this says is: “If the expression ‘coin_charts <- crypto_timeseries(coin)’
works, then store the output of the crypto_timeseries() function to an object
coin_charts and store this output in errorcatch as well. If it doesn’t work,
then store the error message in errorcatch.”
118/144
· Note the difference between the two lines of code.
i <- 1
errorcatch <- try(coin_charts <- crypto_timeseries(coin))
rm(list = ls()) # Removes everything in the Environment

i <- 1
errorcatch <- try(coin_charts = crypto_timeseries(coin))
· What happened?
119/144
· Within a function, the = sign specifies the value of an argument of the
function.
log(x = 1, base = 10)
· Here, ‘x’ and ‘base’ are called arguments of the function, and the values of
‘x’ and ‘base’ are specified in the code above.
· I will use the terms arguments and parameters interchangeably. These
are also known as the inputs of the function.
120/144
· So when we tried the following:
i <- 1
errorcatch <- try(coin_charts = crypto_timeseries(coin))
· We’ll get an error because try() interprets coin_charts as the name of an
argument of the try() function, and try() has no argument with that name.
121/144
· If we want to continue using =, we will have to surround the expression
with parentheses:
rm(list = ls()) # Removes everything listed in the environment.

i <- 1
errorcatch <- try((coin_charts = crypto_timeseries(coin)))
· Now this works since try() will not interpret coin_charts as an argument of
the try() function anymore.
122/144
· Another example of the difference between <- and = is given below:
log(x = 100, base = 10)
· Compare that with:
log(x <- 100, base <- 10)
· Did you notice a difference?
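· Both calls return the same value, but the version with <- also creates objects called x and base as a side effect. The sketch below makes this visible by running each call inside a scratch environment (demo_env, a name made up for this illustration) and listing what exists afterwards.

```r
demo_env <- new.env()   # scratch environment, so your own objects are untouched

# With =, the names x and base only label the arguments of log()
result_equals <- evalq(log(x = 100, base = 10), demo_env)
objects_after_equals <- ls(demo_env)

# With <-, R first creates the objects x and base, then passes their values to log()
result_arrow <- evalq(log(x <- 100, base <- 10), demo_env)
objects_after_arrow <- ls(demo_env)
```

· After the first call the scratch environment is still empty. After the second call it contains base and x. If you run the second call directly in the Console, those two objects appear in your Environment pane.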
123/144
The Great Escape
· Sometimes when you are typing in code, you forget a closing parenthesis
or brace.
· R will think you are still writing code for the for-loop or for the
command inside the parentheses.
· For example, run the following:
errorcatch <- try((coin_charts = crypto_timeseries(coin))
· Now press Enter a few times, and you can see the + symbol in the Console.
· This means that R is waiting for you to finish that line of code.
· You can either close the parentheses (to run it) or press Escape to get out
of it (abandon that line of code).
124/144
· For-Loops
· Useful Functions
125/144
Useful Functions
· paste(), assign(), merge()
· If you want to store the stock price data for each of the stocks in a
portfolio as you’re going through the for-loop, you can do so with the
assign() and paste() functions.
126/144
The paste() function
· The paste() function pastes/concatenates/glues together strings of text.
?paste # See the help file
library(lubridate)
todays_wday <- wday(today(), label = TRUE, abbr = FALSE)
paste('Hi, today is ', todays_wday)
· Note that the default separator is a single space (sep = " "), so
there’s a space in between the different chunks of text. To see this, type
?paste.
· paste0() behaves like paste() with sep = "" (so that there’s no space
between the chunks of text).
127/144
· The paste() function pastes/concatenates/glues together strings of text.
paste0('Hi, today is ', todays_wday)

paste('Hi, today is ', todays_wday, sep = "") # Same result
· There’s no fundamental difference between paste0() and paste(), except
that paste() allows you to choose different separators while paste0() uses
no separator.
128/144
· You can change the separator between chunks by changing the sep
parameter/argument of the function.
local.part <- 'firstName.lastName'

domain <- 'ucalgary.ca'
email <- paste(local.part, domain, sep = '@')
129/144
· You can also paste more than two chunks together.
paste0('Hi, today is ', todays_wday, '.')
· We have that last bit with the period so that it pastes the period at the end
of the printed text.
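· A short recap sketch of the paste() variants above. To keep the output the same on any day, it uses a fixed weekday name instead of calling lubridate (an assumption made only for this illustration).

```r
weekday_name <- 'Monday'   # fixed stand-in for wday(today(), label = TRUE, abbr = FALSE)

# Default sep = " " adds a space, on top of the one already inside the first string
with_default_sep <- paste('Hi, today is ', weekday_name)

# No separator, and three chunks glued together
with_paste0 <- paste0('Hi, today is ', weekday_name, '.')

# A custom separator
with_custom_sep <- paste('firstName.lastName', 'ucalgary.ca', sep = '@')
```

· Printing with_default_sep shows two spaces before the weekday, which is why paste0() (or sep = "") is handy when the text already contains its own spacing.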
130/144
The assign() function
· The assign() function can be used in a for-loop or a function to create new
objects systematically.
· The assign() function combined with the paste() or paste0() function is very
useful when you want to store data in separate data frames or objects
with a specified naming structure.
· To give an example, let’s first take a look at a function in the crypto
package.
131/144
The crypto package
· One useful function is the crypto_history() function, which gives the price
data in USD for a specified cryptocurrency (similar to quantmod).
# You can specify either the symbol or the full name

BTC.charts <- crypto_history('BTC')
ETH.charts <- crypto_history('Ethereum')
132/144
The assign() function
· The assign() function can be used in a for-loop or a function to create new
objects systematically.
for(i in 1:5){
  coin <- crypto.df[i, 1]
  coin_charts <- crypto_history(coin)
  # Creates an object named [coin].charts, for example BTC.charts
  assign(paste0(coin, '.charts'), coin_charts)
}
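· A self-contained sketch of the assign() and paste0() pattern. The symbols and prices below are made up and stand in for crypto_history() output (an assumption, so that it runs without the crypto package). It also uses the base R function get(), which looks up an object by its name.

```r
# Hypothetical symbols and made-up prices, standing in for crypto_history() output
symbols <- c('BTC', 'ETH', 'LTC')

for (i in seq_along(symbols)) {
  coin <- symbols[i]
  coin_charts <- data.frame(symbol = coin, close = 100 * i)
  assign(paste0(coin, '.charts'), coin_charts)   # creates BTC.charts, ETH.charts, LTC.charts
}

# get() retrieves an object by name, so the same paste0() pattern finds it again
eth_close <- get(paste0('ETH', '.charts'))$close
```

· After the loop, BTC.charts, ETH.charts, and LTC.charts all exist as separate data frames in the Environment.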
133/144
merge() and cbind()
· The functions merge() and cbind() can be used to add new columns to data
frames.
· There is a difference between the two.
· cbind() requires columns to have the same number of rows. If they do not
have the same number, there will generally be an error (a shorter vector is
recycled only when its length divides the number of rows evenly).
· On the other hand, merge() works fine even when columns don’t have the
same number of rows.
· By default, when you use merge() to merge by some column, it only
includes rows that are in common between both of the data frames.
134/144
merge() and cbind()
· The following example illustrates this.
example1 <- data.frame(col1 = 1:10, col2 = 101:110)

example2 <- data.frame(col1 = 1:10, col3 = LETTERS[1:10])
merged.df <- merge(example1, example2, by = 'col1')

cbind.df <- cbind(example1, col3 = example2$col3)
· The merge() function merges the second data frame (labeled y in the help
file) with the first data frame (labeled x in the help file).
· Here, the output will be the same either way you do it.
135/144
merge() and cbind()
· However, if one data frame has fewer rows, then cbind() will break down.
example2.v2 <- data.frame(col1 = 1:7, col3 = LETTERS[1:7])
cbind.df <- cbind(example1, col3 = example2.v2$col3) # Error: 10 rows vs. 7 rows
· This is because a data frame has a very particular structure. It is a list in
which all items are vectors with the same length.
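· A sketch that ties this back to try(): wrapping the cbind() call lets the script keep running so we can confirm that the call failed, and then check the “list of equal-length vectors” structure directly. The names attempt, cbind_failed, and column_lengths are made up for illustration.

```r
example1 <- data.frame(col1 = 1:10, col2 = 101:110)
example2.v2 <- data.frame(col1 = 1:7, col3 = LETTERS[1:7])

# 10 rows cannot be combined with a column of 7 entries
attempt <- try(cbind(example1, col3 = example2.v2$col3), silent = TRUE)
cbind_failed <- inherits(attempt, "try-error")

# A data frame is a list whose items (the columns) all have the same length
is_a_list <- is.list(example1)
column_lengths <- sapply(example1, length)
```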
136/144
merge() and cbind()
· On the other hand, merge() will be able to handle it. By default, it tosses
those extra rows that do not have common entries of the “by” column.

merged.df <- merge(example1, example2.v2, by = 'col1')
137/144
merge() and cbind()
· If you wanted to include everything and just put NA whenever one data
frame doesn’t have that entry, then you can change the ‘all’ parameter of
the merge() function (or change the all.x and all.y parameters separately).

merged.df2 <- merge(example1, example2.v2, by = 'col1', all = TRUE)
138/144
merge() and cbind()
· In general, if you want to merge data frames by some columns, you will
want to be on the safe side and use the merge() function unless you are
sure that the rows of the two data frames line up.
· Otherwise, even if they have the same number of rows, you may not get
what you want (as illustrated below).

example2.v3 <- data.frame(col1 = c(1:7, 18:20),
col3 = c(LETTERS[1:7], LETTERS[18:20]))
cbinded.df <- cbind(example1, col3 = example2.v3$col3)
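· A sketch of what goes wrong here: cbind() runs without any error or warning, but it pairs rows purely by position. The check below looks at row 8. The names row8_col1, row8_col3, and keys_match are made up for illustration.

```r
example1 <- data.frame(col1 = 1:10, col2 = 101:110)
example2.v3 <- data.frame(col1 = c(1:7, 18:20),
                          col3 = c(LETTERS[1:7], LETTERS[18:20]))
cbinded.df <- cbind(example1, col3 = example2.v3$col3)

# Row 8 pairs col1 = 8 (from example1) with the letter that belongs to col1 = 18
row8_col1 <- cbinded.df$col1[8]
row8_col3 <- as.character(cbinded.df$col3[8])

# The 'key' columns of the two data frames disagree in that row
keys_match <- example1$col1[8] == example2.v3$col1[8]
```

· Because nothing signals the mismatch, this kind of bug is easy to miss, which is the reason for preferring merge() when the rows are matched by a key column.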
139/144
merge() and cbind()
· Observe what happens if we set all.x = FALSE or all.y = FALSE. Also note
that all.x = FALSE and all.y = FALSE is the default so that there will not be
NA values.
merged.df3 <- merge(example1, example2.v3)

View(merged.df3)
140/144
merge() and cbind()
merged.df3.x <- merge(example1, example2.v3, by = 'col1',
                      all.x = TRUE, all.y = FALSE)
# Set all.x = TRUE if you want to
# have all rows of the 1st dataset with respect to the 'by' column(s)
View(merged.df3.x)
141/144
merge() and cbind()
merged.df3.y <- merge(example1, example2.v3, by = 'col1',
                      all.x = FALSE, all.y = TRUE)
# Set all.y = TRUE if you want to
# have all rows of the 2nd dataset with respect to the 'by' column(s)
View(merged.df3.y)
· Notice now that it won’t have 8, 9, or 10 in col1 because those values were
not in example2.v3$col1.
142/144
merge() and cbind()
merged.df3.all <- merge(example1, example2.v3, by = 'col1', all = TRUE)
# Set all = TRUE if you want to
# have all rows of either dataset with respect to the 'by' column(s)
View(merged.df3.all)
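· A recap sketch that compares the four merge() variants side by side by counting rows, and then checks where the NA values land when all = TRUE. The object names on the left-hand side are made up for illustration.

```r
example1 <- data.frame(col1 = 1:10, col2 = 101:110)
example2.v3 <- data.frame(col1 = c(1:7, 18:20),
                          col3 = c(LETTERS[1:7], LETTERS[18:20]))

# Number of rows kept by each variant
rows_default <- nrow(merge(example1, example2.v3, by = 'col1'))               # common keys only
rows_all_x <- nrow(merge(example1, example2.v3, by = 'col1', all.x = TRUE))   # every row of example1
rows_all_y <- nrow(merge(example1, example2.v3, by = 'col1', all.y = TRUE))   # every row of example2.v3

# With all = TRUE, unmatched keys get NA in the columns from the other data frame
merged_all <- merge(example1, example2.v3, by = 'col1', all = TRUE)
missing_col2 <- sort(merged_all$col1[is.na(merged_all$col2)])   # keys only in example2.v3
missing_col3 <- sort(merged_all$col1[is.na(merged_all$col3)])   # keys only in example1
```

· The keys 1 to 7 are common to both data frames, 8 to 10 appear only in example1, and 18 to 20 appear only in example2.v3, which accounts for the row counts of 7, 10, 10, and 13.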
143/144
· For-Loops
· Useful Functions
144/144
