Intro To Data Science Lecture 4


9/6/2022

Introduction to Data Science for Civil Engineers
Lecture 2b. Data Exploration and Management
Fall 2022

OUTLINE
 1. Data Exploration
 2. Data Management
 3. Data Engineering and Shaping
 4. More about data.table
 5. More about dplyr

1. DATA EXPLORATION

 Potential problems in the data
  Missing values
  Invalid values and outliers
  Wide or narrow data ranges
  Data units
 Functions of descriptive statistics may reveal many data problems
 (e.g., summary() in R)
 Data visualization by graphs also helps reveal data problems
  In R, graphing packages include ggplot2, WVPlots, ggpubr,
 ggstatsplot, lattice, and others.

POTENTIAL PROBLEMS IN THE DATA

 Missing values
  Represented by NA in R
  Many algorithms in R will quietly drop rows with missing values
  A variable (column in a data frame) with many missing values
 should be treated carefully:
  Exclude it from your model
  Convert it to some meaningful value
  Repair it with data imputation
  Average values of available data that are in the same category
  More sophisticated methods that treat the missing data as a
 function of other variables whose data are available
 (regression models, neural network models)
  Investigate the reasons the data are missing
  The R package vtreat may automatically treat missing values
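A minimal sketch of these ideas in base R, using the built-in airquality data set; mean imputation here stands in for the "convert it to some meaningful value" option and is only one of many possible treatments:

```r
# Count the missing values in each column of a data frame.
sapply(airquality, function(col) sum(is.na(col)))

# Simple mean imputation for one column: replace each NA in Ozone
# with the average of the available (non-missing) values.
aq <- airquality
aq$Ozone[is.na(aq$Ozone)] <- mean(aq$Ozone, na.rm = TRUE)
sum(is.na(aq$Ozone))   # the Ozone column now has no missing values
```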

POTENTIAL PROBLEMS IN THE DATA

 Invalid values and outliers
  Invalid values: wrong sign, wrong data type, etc.
  e.g., age = -2 years
  Outliers: data points falling far outside their reasonable range
  Some invalid values or outliers have special meanings
  E.g., 0 or -9999 are used to represent missing, truncated, or
 censored data (sentinel values)
  Check the data dictionary or documentation for an explanation
  Handling invalid values and outliers
  Convert them to NA
  Treat them as a new category

POTENTIAL PROBLEMS IN THE DATA

 Data range
  Data that range over several orders of magnitude can be a
 problem for some modeling methods.
  Data that lie in a very narrow range may not be a good predictor
 in a model (the variance of the estimated parameter may be large).
  Whether the data range is too narrow or too wide depends on the
 problem domain, the data units, and other factors
  The coefficient of variation (ratio of standard deviation to
 mean) may be used to evaluate the data range
 Data units
  Units should be consistent
  Numerical values of a variable change significantly when the
 unit changes; data precision also changes.
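A short sketch of these two checks; the sentinel value -9999 and the sample measurements are assumed for illustration:

```r
# Convert a sentinel value to NA, then evaluate the data range with
# the coefficient of variation (standard deviation / mean).
x <- c(12.1, 14.3, -9999, 13.8, 12.9, -9999, 15.2)  # hypothetical data
x[x == -9999] <- NA                                 # sentinel -> NA

cv <- sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
cv   # a small ratio suggests a narrow data range
```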

2. DATA MANAGEMENT

 Value Transformation
  Normalization (rescaling)
  Centering and scaling
  Can be done by the function scale() in R
  Recommended procedure for PCA and deep learning
  Log transformation
  Nonnegative values whose distribution is skewed
  Values that range over several orders of magnitude
  When the process from which the data are generated is
 multiplicative instead of additive, log transformation of the
 process data can make modeling easier
 The caret and recipes R packages both include many more
 high-level functions for data preprocessing and normalization.

3. DATA ENGINEERING AND SHAPING

 Data Wrangling
  Data Selection
  Data Transformation
  Data Reshaping
 Some data wrangling tools in R
  Base R (package base in R)
  Package data.table
  Used for fast and memory-efficient data manipulation
  Uses "reference semantics" in that changes are made directly in
 a shared data structure
  For more details see help(data.table, package="data.table")
  https://github.com/eddelbuettel/gsir-te
  Package dplyr
  Data manipulation through sequences of SQL-like operators
  Part of the tidyverse (a collection of R packages designed for
 data science)
  https://github.com/saghirb/Getting-Started-in-R
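A minimal sketch of the two transformations above; the sample vector is assumed, and log1p is used so that a zero value would also be handled:

```r
# Centering and scaling with scale(): the result has mean 0 and sd 1.
x <- c(1, 10, 100, 1000, 10000)   # spans several orders of magnitude
x_scaled <- scale(x)
c(mean(x_scaled), sd(x_scaled))   # approximately 0 and 1

# A log transformation compresses the wide range while preserving order.
x_log <- log1p(x)                 # log(1 + x)
range(x_log)
```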

3.1 DATA SELECTION

 Select some columns and some rows from a data frame
 Using base R

# iris data comes pre-installed with R and is part of the "datasets" package
summary(iris)
head(iris)
columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris$Petal.Length > 2
iris_base <- iris[rows_we_want, columns_we_want, drop = FALSE]
head(iris_base)  # display the beginning of the data set
tail(iris_base)  # display the end of the data set

 Using package data.table

library("data.table")
iris_data.table <- as.data.table(iris)
columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris_data.table$Petal.Length > 2
iris_data.table <- iris_data.table[rows_we_want, ..columns_we_want]
head(iris_data.table)

  Note: ".." indicates that "columns_we_want" isn't itself the name
 of a column but a variable referring to names of columns.
  More info can be found by the command
 vignette("datatable-intro", package = "data.table")

3.1 DATA SELECTION

 Using package dplyr

library("dplyr")
iris_dplyr <- iris %>%
  select(.,
         Petal.Length, Petal.Width, Species) %>%
  filter(.,
         Petal.Length > 2)
head(iris_dplyr)

  Note: Traditionally dplyr steps are chained with the pipe
 operator "%>%".
  Version 4.1.0 of R, released in May 2021, added the native pipe
 operator "|>"
  A cheat sheet for dplyr can be found at
 https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
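The same selection can be sketched with the native pipe, assuming R >= 4.1.0 and that dplyr is installed; note that, unlike %>%, the native |> does not support the "." placeholder:

```r
# Native-pipe version of the dplyr selection above (requires R >= 4.1.0).
library("dplyr")

iris_native <- iris |>
  select(Petal.Length, Petal.Width, Species) |>
  filter(Petal.Length > 2)
head(iris_native)
```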

3.1 DATA SELECTION

 Remove records with incomplete data
 Using base R

library('ggplot2')   # load ggplot2 package
data(msleep)         # load msleep data from the ggplot2 package
str(msleep)
summary(msleep)

clean_base_1 <- msleep[complete.cases(msleep), , drop = FALSE]
summary(clean_base_1)
nrow(clean_base_1)

clean_base_2 = na.omit(msleep)
nrow(clean_base_2)

  complete.cases() returns a vector with one entry for each row of
 the data frame, which is TRUE if the row has no missing entries.
  na.omit() performs the whole task in one step.

3.1 DATA SELECTION

 Remove records with incomplete data
 Using package data.table

library("data.table")
msleep_data.table <- as.data.table(msleep)
clean_data.table = msleep_data.table[complete.cases(msleep_data.table), ]
nrow(clean_data.table)

 Using package dplyr

library("dplyr")
clean_dplyr <- msleep %>%
  filter(., complete.cases(.))
nrow(clean_dplyr)

3.1 DATA SELECTION

 Order rows
 Using base R

> library("wrapr")
> purchases <- wrapr::build_frame(
   "day", "hour", "n_purchase" |
   1, 9, 5 |
   2, 9, 3 |
   2, 11, 5 |
   1, 13, 1 |
   2, 13, 3 |
   1, 14, 1 )
> purchases
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
4   1   13          1
5   2   13          3
6   1   14          1
> order_ind <- order(purchases$day, purchases$n_purchase)
> order_ind
[1] 4 6 1 2 5 3
> purchases[order_ind, , drop=FALSE]
  day hour n_purchase
4   1   13          1
6   1   14          1
1   1    9          5
2   2    9          3
5   2   13          3
3   2   11          5

  The base function order() returns a permutation which rearranges
 its first argument (a sequence of vectors) into ascending or
 descending order. In the case of ties in the first vector, values
 in the second are used to break the ties.

3.1 DATA SELECTION

 Example of separating ordered rows into groups and running
 calculations on each group

order_index <- with(purchases, order(day, hour))
purchases_ordered <- purchases[order_index, , drop = FALSE]

data_list <- split(purchases_ordered, purchases_ordered$day)
data_list <- lapply(
  data_list,
  function(di) {
    di$running_total <- cumsum(di$n_purchase)
    di
  })

purchases_ordered <- do.call(base::rbind, data_list)

  The base function with() executes the code in its second argument
 as if the columns of the first argument were variables.
  The base function split(x, f) divides the data in the vector x
 into the groups defined by f.
  The base function lapply(X, FUN) returns a list of the same
 length as X, each element of which is the result of applying FUN
 to the corresponding element of X.
  The base function do.call() constructs and executes a function
 call from a name or a function and a list of arguments to be
 passed to it.

3.1 DATA SELECTION

 Order rows
 Using package data.table

library("data.table")
DT_purchases <- as.data.table(purchases)
order_cols <- c("day", "hour")
setorderv(DT_purchases, order_cols)

  setorderv() re-orders data in place and takes a list of ordering
 column names to specify the order.

 Using package dplyr

library("dplyr")
res <- purchases %>%
  arrange(., day, hour)

3.2 DATA TRANSFORMATION

 (i). Basic data transformation
  Adding a new column or deleting a column using the assignment
 operator

> library('datasets')
> head(airquality)  # airquality is a data set in package 'datasets'
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
> myaqdata <- airquality
> # add a new column of data
> myaqdata$newcol <- paste(myaqdata$Month, myaqdata$Day, sep="-")
> head(myaqdata)  # display the beginning of the data set
  Ozone Solar.R Wind Temp Month Day newcol
1    41     190  7.4   67     5   1    5-1
2    36     118  8.0   72     5   2    5-2
3    12     149 12.6   74     5   3    5-3
> myaqdata$newcol <- NULL  # delete a column
> head(myaqdata)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
> tail(myaqdata)  # display the end of the data set

3.2 DATA TRANSFORMATION

 Changing columns and selecting rows using the transform() and
 subset() functions
  transform(mydata, ...): ... are tagged vector expressions, which
 are evaluated in the data frame mydata. The tags are matched
 against names(mydata); for those that match, the values replace
 the corresponding variables in mydata, and the others are
 appended to mydata.
  subset(x, argu, select) returns a subset of x. argu selects rows,
 select selects columns.

> myaqdata <- transform(myaqdata, newcol=paste(myaqdata$Month, myaqdata$Day, sep="."))
> head(myaqdata)
  Ozone Solar.R Wind Temp Month Day newcol
1    41     190  7.4   67     5   1    5.1
2    36     118  8.0   72     5   2    5.2
3    12     149 12.6   74     5   3    5.3

> myaqdata <- subset(airquality, !is.na(Ozone), select=c("Ozone","Month"))
> head(myaqdata)
  Ozone Month
1    41     5
2    36     5
3    12     5

3.2 DATA TRANSFORMATION

 Replacing a missing value (NA) with the most recent non-NA value
 prior to it
  Use the na.locf() function from the package zoo. (locf: Last
 Observation Carried Forward)

> install.packages("zoo")
> library("zoo")
> myaqdata <- airquality
> head(myaqdata)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> myaqdata$Ozone <- na.locf(myaqdata$Ozone, na.rm=FALSE)
> head(myaqdata)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    18      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

  Note: always use na.rm = FALSE with na.locf(); otherwise it may
 delete the initial NA values from your data.

3.2 DATA TRANSFORMATION

 (ii). Aggregating transformations
  Combining multiple rows or columns
 (1) Creating a vector of data with no missing values from the
 first non-missing values of multiple vectors (known as coalesce
 in SQL) using wrapr::coalesce(left, right) or simply
 left %?% right (Note: %?% is the infix operator for coalesce)

> library("wrapr")
> data <- wrapr::build_frame(
+   "time", "s1", "s2", "s3" |
+   1L, NA, -0.7, 0.8 |
+   2L, NA,  0.2, NA  |
+   3L, NA,  NA,  0.9 |
+   4L, NA,  NA,  NA  )
> data
  time s1   s2  s3
1    1 NA -0.7 0.8
2    2 NA  0.2  NA
3    3 NA   NA 0.9
4    4 NA   NA  NA
> data$complete <- data$s1 %?% data$s2 %?% data$s3 %?% 0.0
> data
  time s1   s2  s3 complete
1    1 NA -0.7 0.8     -0.7
2    2 NA  0.2  NA      0.2
3    3 NA   NA 0.9      0.9
4    4 NA   NA  NA      0.0

3.2 DATA TRANSFORMATION

 (2) Combining multiple rows into summary rows using base functions
 such as tapply(), aggregate(), tabulate(), table(). (not very
 convenient)
  tapply(X, INDEX, FUN) applies the function FUN to the groups of X
 (a vector-like R object) defined by INDEX.

> purchases
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
4   1   13          1
5   2   13          3
6   1   14          1
> x1 <- tapply(purchases$hour, purchases$day, sum)
> x1
 1  2
36 33
> x2 <- tapply(purchases$n_purchase, purchases$day, sum)
> dayno <- factor(purchases$day)
> sumpur <- cbind(day=as.numeric(levels(dayno)), sum_hour=x1, sum_n_purchase=x2)
> sumpur
  day sum_hour sum_n_purchase
1   1       36              7
2   2       33             11

3.2 DATA TRANSFORMATION

  aggregate() splits the data into subsets, computes summary
 statistics for each, and returns the result in a convenient form.

> aggregate(purchases, list(purchases$day), sum)[, c(1,3,4)]
  Group.1 hour n_purchase
1       1   36          7
2       2   33         11

  tabulate(bin, nbins) takes the integer-valued vector bin and
 counts the frequencies of the integers 1, 2, ..., nbins.

> tabulate(c(-2,0,2,3,3,5), nbins = 3)
[1] 0 1 2
> tabulate(c(2,3,3,5), nbins = 10)
[1] 0 1 2 0 1 0 0 0 0 0

  table() creates a contingency table.

> with(airquality, table(cut(Temp, quantile(Temp)), Month))
         Month
           5  6  7  8  9
  (56,72] 24  3  0  1 10
  (72,79]  5 15  2  9 10
  (79,85]  1  7 19  7  5
  (85,97]  0  5 10 14  5

3.2 DATA TRANSFORMATION

 (2) Combining multiple rows into summary rows using package
 data.table

> library("data.table")
> purchases.table <- as.data.table(purchases)
> purchases.table <- purchases.table[,
+   .(hour=sum(hour),
+     n_purchase=sum(n_purchase)),
+   by = .(day)]
> purchases.table
   day hour n_purchase
1:   1   36          7
2:   2   33         11

3.2 DATA TRANSFORMATION

 (2) Combining multiple rows into summary rows using package dplyr

> library("dplyr")
> purchases_summary <- purchases %>% group_by(., day) %>%
+   summarize(.,
+     hour = sum(hour),
+     n_purchase = sum(n_purchase)) %>%
+   ungroup(.)
`summarise()` ungrouping output (override with `.groups` argument)
> purchases_summary
# A tibble: 2 x 3
    day  hour n_purchase
  <dbl> <dbl>      <dbl>
1     1    36          7
2     2    33         11

3.2 DATA TRANSFORMATION

 (iii). Multi-Table Data Transform
 Split a data frame into multiple data frames
  By the split(x, f) function: divides the data in x into the
 groups defined by f.

> temp <- split(purchases, purchases$day)
> temp
$`1`
  day hour n_purchase
1   1    9          5
4   1   13          1
6   1   14          1

$`2`
  day hour n_purchase
2   2    9          3
3   2   11          5
5   2   13          3

  By using a logical vector as the row index

> purchases1 <- purchases[purchases$day==1,]
> purchases2 <- purchases[purchases$day==2,]

3.2 DATA TRANSFORMATION

 Combining multiple data frames by rows using the base R function
 rbind()

> purchases <- rbind(purchases1, purchases2)
> purchases
  day hour n_purchase
1   1    9          5
4   1   13          1
6   1   14          1
2   2    9          3
3   2   11          5
5   2   13          3

 Combining multiple data frames by columns using the base R
 function cbind()

> temp <- cbind(purchases1, purchases2)
> temp
  day hour n_purchase day hour n_purchase
1   1    9          5   2    9          3
4   1   13          1   2   11          5
6   1   14          1   2   13          3

3.2 DATA TRANSFORMATION

 Joining tables by matching rows based on key values
  Left Join: keeps rows from the left table and adds columns from
 matching rows in the right table.
  Using the base R function merge() with argument all.x=TRUE. Note
 that merge() merges two data frames by common columns or row
 names, or does other versions of database join operations.

> capacityTable
  RoadID capacity
1     r1     9.99
2     r3    19.99
3     r4     5.49
4     r5    24.49
> demandTable
  RoadID demand
1     r1     10
2     r2     43
3     r3     55
4     r4      8
> merge(capacityTable, demandTable, by = "RoadID", all.x=TRUE)
  RoadID capacity demand
1     r1     9.99     10
2     r3    19.99     55
3     r4     5.49      8
4     r5    24.49     NA

3.2 DATA TRANSFORMATION

 Joining tables by matching rows based on key values
  Right Join: keeps rows from the right table and adds columns from
 matching rows in the left table.
  Using the base R function merge() with argument all.y=TRUE.

> merge(capacityTable, demandTable, by = "RoadID", all.y=TRUE)
  RoadID capacity demand
1     r1     9.99     10
2     r2       NA     43
3     r3    19.99     55
4     r4     5.49      8

3.2 DATA TRANSFORMATION

 Joining tables by matching rows based on key values
  Inner Join: keeps rows where the key exists in both tables. This
 produces an intersection of the two tables.
  Using the base R function merge()

> merge(capacityTable, demandTable, by = "RoadID")
  RoadID capacity demand
1     r1     9.99     10
2     r3    19.99     55
3     r4     5.49      8

3.2 DATA TRANSFORMATION

 Joining tables by matching rows based on key values
  Full Join: keeps rows for all key values. Note that the two
 tables have equal importance here.
  Using the base R function merge()

> merge(capacityTable, demandTable, by = "RoadID", all=TRUE)
  RoadID capacity demand
1     r1     9.99     10
2     r2       NA     43
3     r3    19.99     55
4     r4     5.49      8
5     r5    24.49     NA

3.3 DATA RESHAPING

 Data Reshaping: moving data between rows and columns (pivoting
 and un-pivoting)
 (i). Moving data from wide to tall form

3.3 DATA RESHAPING

 (i). Moving data from wide to tall form
  Method (a): using the gather() function in the tidyr package.

> install.packages("tidyr")
> library("tidyr")
> y <- data.frame(index = c(1,2,3), info = c("a","a","c"),
+                 meas1 = 4:6, meas2 = 7:9, meas3 = 11:13)
> y
  index info meas1 meas2 meas3
1     1    a     4     7    11
2     2    a     5     8    12
3     3    c     6     9    13
> y_long <- gather(y, key=meastype, value=meas, meas1, meas2, meas3)
# more columns can be added behind meas3 if needed
> y_long
  index info meastype meas
1     1    a    meas1    4
2     2    a    meas1    5
3     3    c    meas1    6
4     1    a    meas2    7
5     2    a    meas2    8
6     3    c    meas2    9
7     1    a    meas3   11
8     2    a    meas3   12
9     3    c    meas3   13

3.3 DATA RESHAPING

 (i). Moving data from wide to tall form
  Method (b): using the unpivot_to_blocks() function in the cdata
 package.

> install.packages("cdata")
> library("cdata")
> y_long2 <- unpivot_to_blocks(y, nameForNewKeyColumn="meastype",
+                              nameForNewValueColumn="meas",
+                              columnsToTakeFrom=c("meas1","meas2","meas3"))
> y_long2
  index info meastype meas
1     1    a    meas1    4
2     1    a    meas2    7
3     1    a    meas3   11
4     2    a    meas1    5
5     2    a    meas2    8
6     2    a    meas3   12
7     3    c    meas1    6
8     3    c    meas2    9
9     3    c    meas3   13

  Note: more information can be found at
 https://win-vector.com/2018/10/21/faceted-graphs-with-cdata-and-ggplot2/

3.3 DATA RESHAPING

 (i). Moving data from wide to tall form
  Method (c): using the melt.data.table() function in the
 data.table package.

> install.packages("data.table")
> library("data.table")
> y_long3 <- melt.data.table(as.data.table(y), id.vars = NULL,
+                            measure.vars = c("meas1","meas2","meas3"),
+                            variable.name = "meastype",
+                            value.name = "meas")
> y_long3
   index info meastype meas
1:     1    a    meas1    4
2:     2    a    meas1    5
3:     3    c    meas1    6
4:     1    a    meas2    7
5:     2    a    meas2    8
6:     3    c    meas2    9
7:     1    a    meas3   11
8:     2    a    meas3   12
9:     3    c    meas3   13

  Note: measure.vars specifies which columns values are to be taken
 from. variable.name is the new key column and value.name is the
 new value column.

3.3 DATA RESHAPING

 (ii). Moving data from tall to wide form

3.3 DATA RESHAPING

 (ii). Moving data from tall to wide form
  Method (a): using the spread() function in the tidyr package.

> library("tidyr")
> y_long
  index info meastype meas
1     1    a    meas1    4
2     2    a    meas1    5
3     3    c    meas1    6
4     1    a    meas2    7
5     2    a    meas2    8
6     3    c    meas2    9
7     1    a    meas3   11
8     2    a    meas3   12
9     3    c    meas3   13
>
> y_wide1 <- spread(y_long, key=meastype, value=meas)
>
> y_wide1
  index info meas1 meas2 meas3
1     1    a     4     7    11
2     2    a     5     8    12
3     3    c     6     9    13

3.3 DATA RESHAPING

 (ii). Moving data from tall to wide form
  Method (b): using the pivot_to_rowrecs() function in the cdata
 package.

> library("cdata")
>
> y_wide2 <- pivot_to_rowrecs(y_long,
+                             columnToTakeKeysFrom = "meastype",
+                             columnToTakeValuesFrom = "meas",
+                             rowKeyColumns = "index")
> y_wide2
  index info meas1 meas2 meas3
1     1    a     4     7    11
2     2    a     5     8    12
3     3    c     6     9    13

3.3 DATA RESHAPING

 (ii). Moving data from tall to wide form
  Method (c): using the dcast.data.table() function in the
 data.table package.

> library("data.table")
> y_wide3 <- dcast.data.table(as.data.table(y_long),
+                             index ~ meastype, value.var = "meas")
> y_wide3
   index meas1 meas2 meas3
1:     1     4     7    11
2:     2     5     8    12
3:     3     6     9    13

  Note: the "left ~ right" formula specifies a result matrix whose
 rows are identified by left and whose columns are identified by
 right. The value.var argument tells how to populate the cells of
 this matrix.

4. MORE ABOUT DATA.TABLE

 Package data.table can be used to manipulate datasets.
 data.table operations can be viewed as dt[i, j, by], where i can
 select rows, j is used to select, summarize, or mutate columns,
 and by is the grouping operator.
 Use j to select columns: add a new column (variable) or modify an
 existing one.

> purchases
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
4   1   13          1
5   2   13          3
6   1   14          1
> pcs <- as.data.table(purchases[1:3,])  # convert data.frame to data.table
> pcs
   day hour n_purchase
1:   1    9          5
2:   2    9          3
3:   2   11          5
> pcs[, minute := hour*60]  # add one new column
> pcs
   day hour n_purchase minute
1:   1    9          5    540
2:   2    9          3    540
3:   2   11          5    660
> pcs[, day := paste0("Day_", day)]  # modify an existing column
> pcs
     day hour n_purchase minute
1: Day_1    9          5    540
2: Day_2    9          3    540
3: Day_2   11          5    660

4. MORE ABOUT DATA.TABLE

 Use j to select columns: keep, drop, or reorder variables

> pcs[, .(day, hour, n_purchase)]
     day hour n_purchase
1: Day_1    9          5
2: Day_2    9          3
3: Day_2   11          5
> pcs[order(hour),]
     day hour n_purchase minute
1: Day_1    9          5    540
2: Day_2    9          3    540
3: Day_2   11          5    660

 Use j to summarize

> pcs[, .(TotalHourPerDay=sum(hour), Mean=mean(n_purchase)), by=.(day)]
     day TotalHourPerDay Mean
1: Day_1               9    5
2: Day_2              20    4

  Note: in addition to sum and mean, there are other functions like
 sd, median, min, and max.

4. MORE ABOUT DATA.TABLE

 Use i to select rows: keep or drop observations (rows)

> pcs
     day hour n_purchase minute
1: Day_1    9          5    540
2: Day_2    9          3    540
3: Day_2   11          5    660
> pcs[hour==9,]
     day hour n_purchase minute
1: Day_1    9          5    540
2: Day_2    9          3    540
> pcs[hour==9 & n_purchase>3,]
     day hour n_purchase minute
1: Day_1    9          5    540

  Note: for comparing values in vectors use < (less than), >
 (greater than), <= (less than or equal to), >= (greater than or
 equal to), == (equal to), and != (not equal to).
  These can be combined logically using & (and) and | (or).

4. MORE ABOUT DATA.TABLE

 Use setnames() to name or rename variables whilst keeping all
 variables.

> DT <- data.table(a = 1, b = 2, d = 3)
> DT
   a b d
1: 1 2 3
> old <- c("a", "b", "c", "d")
> new <- c("A", "B", "C", "D")
>
> setnames(DT, old, new, skip_absent = TRUE)
> DT
   A B D
1: 1 2 3
> pcs
     day hour n_purchase minute
1: Day_1    9          5    540
2: Day_2    9          3    540
3: Day_2   11          5    660
> setnames(pcs, c("Day","Hour","N","Min"))  # change all column names
> pcs
     Day Hour N Min
1: Day_1    9 5 540
2: Day_2    9 3 540
3: Day_2   11 5 660
> setnames(pcs, "Day", "Date")  # change "Day" to "Date"
> pcs
    Date Hour N Min
1: Day_1    9 5 540
2: Day_2    9 3 540
3: Day_2   11 5 660

4. MORE ABOUT DATA.TABLE

 Use setkey() to sort a data.table and mark it as sorted with an
 attribute sorted. The sorted columns are the key. The key can be
 any number of columns. The columns are always sorted in ascending
 order.

> pcs
    Date Hour N Min
1: Day_1    9 5 540
2: Day_2    9 3 540
3: Day_2   11 5 660
> setkey(pcs, N, Hour)  # order rows based on N and Hour
> pcs
    Date Hour N Min
1: Day_2    9 3 540
2: Day_1    9 5 540
3: Day_2   11 5 660

4. MORE ABOUT DATA.TABLE

 Chaining: several sets of commands may be put together with
 square brackets to do multiple data wrangling steps at once.

> purchases
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
4   1   13          1
5   2   13          3
6   1   14          1
> pcs <- as.data.table(purchases)
> pcs
   day hour n_purchase
1:   1    9          5
2:   2    9          3
3:   2   11          5
4:   1   13          1
5:   2   13          3
6:   1   14          1
> pcs[, minute := hour*60][n_purchase>1,][, .(MeanMin=mean(minute),
+     TotalPurchase=sum(n_purchase)), by=.(day,hour)]
   day hour MeanMin TotalPurchase
1:   1    9     540             5
2:   2    9     540             3
3:   2   11     660             5
4:   2   13     780             3

5. MORE ABOUT DPLYR

 Package dplyr is part of the tidyverse, a collection of packages
 designed for data science.
 Some functions to wrangle data include mutate(), select(),
 rename(), filter(), and arrange().
 Use mutate() to add a new variable (column) or modify an existing
 column.

> library("dplyr")
> pcs <- mutate(purchases, minute=hour*60)
> pcs[1:3,]
  day hour n_purchase minute
1   1    9          5    540
2   2    9          3    540
3   2   11          5    660
> library("stringr")  # str_c() comes from the stringr package
> pcs2 <- mutate(purchases, day=str_c("day ", day))
> pcs2
     day hour n_purchase
1  day 1    9          5
2  day 2    9          3
3  day 2   11          5
4  day 1   13          1
5  day 2   13          3
6  day 1   14          1

5. MORE ABOUT DPLYR

 Use select() to select columns: keep, drop, or reorder variables

> pcs <- pcs[1:3,]
> select(pcs, -minute)  # drop the "minute" column
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
> select(pcs, day, hour, n_purchase)  # keep three columns
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5

5. MORE ABOUT DPLYR

 Use filter() to select rows: keep or drop observations

> pcs
  day hour n_purchase minute
1   1    9          5    540
2   2    9          3    540
3   2   11          5    660
> filter(pcs, hour==9)
  day hour n_purchase minute
1   1    9          5    540
2   2    9          3    540
> filter(pcs, hour==9 & n_purchase > 3)
  day hour n_purchase minute
1   1    9          5    540

5. MORE ABOUT DPLYR

 Use group_by() to convert a tibble (a modern reimagining of the
 data frame) into a grouped tibble where operations can be
 performed "by group". (ungroup() removes grouping.)
 Use summarize() to create a new tibble (data frame), which has one
 (or more) rows for each combination of grouping variables. It will
 contain one column for each grouping variable and one column for
 each of the summary statistics that you have specified.

> pcs4 <- group_by(purchases, day, n_purchase)
> summarize(pcs4, N=n(), Mean=mean(hour))
# A tibble: 4 x 4
# Groups:   day [2]
    day n_purchase     N  Mean
  <dbl>      <dbl> <int> <dbl>
1     1          1     2  13.5
2     1          5     1   9
3     2          3     2  11
4     2          5     1  11

  Note: other functions that can be included in summarize() are
 sd(), max(), min(), median(), etc. (see help("summarize") for
 more details)

5. MORE ABOUT DPLYR

 Use arrange() to change the order of rows (observations)

> pcs
  day hour n_purchase minute
1   1    9          5    540
2   2    9          3    540
3   2   11          5    660
> arrange(pcs, n_purchase, day)
  day hour n_purchase minute
1   2    9          3    540
2   1    9          5    540
3   2   11          5    660
> arrange(pcs, desc(n_purchase))
  day hour n_purchase minute
1   1    9          5    540
2   2   11          5    660
3   2    9          3    540

  Note: desc() orders rows in descending order.

5. MORE ABOUT DPLYR

 Use rename() to rename variables whilst keeping all variables

> pcs
  day hour n_purchase minute
1   1    9          5    540
2   2    9          3    540
3   2   11          5    660
> rename(pcs, Date=day, total = n_purchase)
  Date hour total minute
1    1    9     5    540
2    2    9     3    540
3    2   11     5    660

5. MORE ABOUT DPLYR

 Chaining: multiple data wrangling steps may be done in one step
 with the pipe operator, %>%, which can be read as "then".

> pcs <- purchases
> purchases
  day hour n_purchase
1   1    9          5
2   2    9          3
3   2   11          5
4   1   13          1
5   2   13          3
6   1   14          1
> pcs3 <- pcs %>%
    mutate(minute=hour*60) %>%
    filter(n_purchase>1) %>%
    select(day, n_purchase, minute) %>%
    arrange(n_purchase, day)
> pcs3
  day n_purchase minute
1   2          3    540
2   2          3    780
3   1          5    540
4   2          5    660

# In practice, the format is as follows:
> pcs3 <- pcs %>%
    mutate(., minute=hour*60) %>%
    filter(., n_purchase>1) %>%
    select(., day, n_purchase, minute) %>%
    arrange(., n_purchase, day)

  Note: the pipe operator, %>%, replaces the dots (.) with whatever
 is returned from the code preceding it.
