dplyr Package in R: Alka Vaidya


What is dplyr in R?
• dplyr is a powerful R package to manipulate, clean and summarize tabular data. In short, it makes data exploration and data manipulation in R easy and fast.
• The package "dplyr" comprises many functions that perform the most common data manipulation operations, such as filtering rows, selecting specific columns, sorting data, adding or deleting columns, and aggregating data. Some operations overlap with base R, but the dplyr functions are generally faster than base R.
• To install and load:
install.packages("dplyr")
library(dplyr)
Some Important Functions in dplyr

dplyr Function   Description                     Equivalent SQL
select()         Select columns (variables)      SELECT
filter()         Filter (subset) rows            WHERE
group_by()       Group the data                  GROUP BY
summarise()      Summarise (or aggregate) data   -
arrange()        Sort the data                   ORDER BY
join()           Join data frames (tables)       JOIN
mutate()         Create new variables            COLUMN ALIAS
Read the data…
• The data contains income (synthetic, not real) generated by 51 states from the year 2002 to 2015
> mydata = read.csv("sampledata.csv")
> dim(mydata)
• To select random rows, use sample_n()
> sample_n(mydata, 3)
• To select a random fraction of rows, use sample_frac()
> sample_frac(mydata, 0.1)
> x1 = distinct(mydata) # to display unique rows; this data has no duplicates
select( ) Function
• select(data, ....) # syntax, where data : Data Frame
mydata2 = select(mydata, State, Y2008) # select the State and Y2008 columns
mydata2 = select(mydata, State:Y2008) # select columns from State up to Y2008
mydata2 = select(mydata, Index, State, Y2012:Y2015) # select Index, State, and the given range of years
mydata2 = select(mydata, -Index, -Y2002, -Y2003) # drop the listed columns
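As a quick self-contained sketch, the select() variants behave like this on a small made-up data frame (toy values, not the course's sampledata.csv):

```r
library(dplyr)

# Hypothetical mini version of the income data
mydata <- data.frame(
  Index = c("A", "A", "C"),
  State = c("Alabama", "Alaska", "California"),
  Y2002 = c(1.0e6, 2.0e6, 3.0e6),
  Y2003 = c(1.1e6, 2.1e6, 3.1e6)
)

s1 <- select(mydata, State, Y2003)   # named columns only
s2 <- select(mydata, State:Y2003)    # a contiguous range of columns
s3 <- select(mydata, -Index, -Y2002) # everything except the listed columns
```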
Pipe Operator %>%
• It lets you chain operations, much like writing sub-queries in SQL.
• All the functions in the dplyr package can be used without the pipe operator.
• However, the pipe operator lets you wrap multiple functions together with %>%, e.g.
m2 = mydata %>% select(Index, State)
m2 = mydata %>% select(Index, State) %>% sample_n(10)
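A minimal sketch of the equivalence between a nested call and a piped chain (toy data frame assumed, not the course file):

```r
library(dplyr)

df <- data.frame(Index = c("A", "C", "A"),
                 State = c("Alabama", "California", "Alaska"))

# Nested call and piped call select the same columns
a <- sample_n(select(df, Index, State), 2)
b <- df %>% select(Index, State) %>% sample_n(2)
# both are 2-row data frames with columns Index and State
```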
filter( ) Function
• filter() syntax : filter(data, ....) # data : Data Frame; .... : logical condition(s)
mydata2 = filter(mydata, Index == "A")
mydata2 = filter(mydata, Index %in% c("A", "C"))
mydata2 = filter(mydata, Index %in% c("A", "C") | Y2002 >= 1300000)
mydata2 = filter(mydata, Index %in% c("A", "C") & Y2002 >= 1300000)
mydata2 = filter(mydata, !Index %in% c("A", "C"))
• You can combine select and filter with the pipe operator %>%
mydata %>% select(Index, State, Y2005) %>% filter(Y2005 > 1880000)
summarise( ), summarise_at() Functions
• summarise(data, ....) # data : Data Frame; .... : summary functions such as mean, median, etc.
summarise(mydata, mn15 = mean(Y2015), med15 = median(Y2015))
summarise_at(mydata, vars(Y2011, Y2012), list(mean = mean, median = median)) # funs() is deprecated in current dplyr; use list()
• The base-R summary() function calculates summary statistics for all the columns in a data frame
summary(mydata)
arrange() Function
• arrange(data_frame, variable(s)_to_sort)
• OR
• data_frame %>% arrange(variable(s)_to_sort)
arrange(mydata, Index, Y2011) OR arrange(mydata, desc(Index), Y2011)
arrange(cust, Balance, IdNo) OR arrange(cust, desc(Balance), IdNo)
group_by() Function
• group_by(data, variables)
• OR
• data %>% group_by(variables)
> t = mydata %>% group_by(Index) %>% summarise_at(vars(Y2011:Y2012), list(mean = mean, median = median))
• Practice at home with CUST (note that the order of steps matters):
c = cust %>% group_by(Age_group) %>% summarise_at(vars(Balance), list(min = min, max = max)) %>% filter(Gender == "F") # fails: Gender no longer exists after summarise
c = cust %>% filter(Gender == "F") %>% group_by(Age_group) %>% summarise_at(vars(Balance), list(min = min, max = max)) # correct order
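A self-contained sketch of the correct filter-then-group-then-summarise order, using a toy cust-like data frame (made-up values):

```r
library(dplyr)

cust <- data.frame(
  Gender    = c("F", "F", "M", "F"),
  Age_group = c("20-30", "20-30", "20-30", "30-40"),
  Balance   = c(100, 300, 200, 500)
)

res <- cust %>%
  filter(Gender == "F") %>%   # filter first, while the Gender column still exists
  group_by(Age_group) %>%
  summarise(min = min(Balance), max = max(Balance))
res
```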
mutate() Function
• Creates a new variable (a mutated variable)
• data_frame %>% mutate(expression(s))
md = mydata %>% mutate(chng = Y2015/Y2014)
md$chng
mydata %>% filter(min_rank(desc(Y2015)) == 1) # rank 1 in 2015 (Missouri), but prints all years
mydata %>% filter(min_rank(desc(Y2015)) == 1) %>% select(State, Y2015)
• If you want the top row within each group, use group_by
mydata %>% group_by(Index) %>% filter(min_rank(desc(Y2015)) == 1) %>% select(Index, State, Y2015) # group_by must come before the other steps
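A small sketch of the grouped top-1 pattern on made-up data (not the course file):

```r
library(dplyr)

mydata <- data.frame(
  Index = c("A", "A", "C", "C"),
  State = c("Alabama", "Alaska", "California", "Colorado"),
  Y2015 = c(5, 9, 7, 3)
)

top_per_group <- mydata %>%
  group_by(Index) %>%
  filter(min_rank(desc(Y2015)) == 1) %>%  # keep the highest-Y2015 row in each Index group
  select(Index, State, Y2015)             # the grouping column Index is kept automatically
top_per_group  # Alaska for group "A", California for group "C"
```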
Data Pre-processing
• Very important aspect of analytics and machine learning
• Analogy – Recipe may be good, but ingredients?
• Let’s understand the process with Bank Loan file in Excel Format
• library(xlsx)
• df= read.xlsx("Bank_Loan.xlsx",sheetIndex = 2)
• head(df), str(df),dim(df)
• With too many variables, there may be a multidimensionality problem and the algorithm may take time. Hence, we need to remove irrelevant variables.
df = df[-1] # removes the 1st column, but don't overwrite df like this; better assign to a copy: df1 = df[2:14] will also do
summary(df)
range(df$age)
0: 67 ………..0 is a problem
• You may want to see how well your data is distributed
hist(df$Age, main="Age Histogram", xlab="Age", border="blue", col="green")
• As you can see, from 25 onwards the data is well distributed; after the age of 65, maybe the bank is not eager, or people are not applying
• Before 20 there are no cases of personal loan; however, a few cases can be seen at age 0, which could be 'noise' in the data. So, the noise needs to be removed.
plot(sort(df$Age)) # shows the discreteness of the data, i.e. even if the data is continuous, there are smaller groups within it
• Treatment of '0' age -
• In case you want to see the IDs of the age-0 people:
library(dplyr) # ??dplyr will search R's help pages for dplyr
df %>% select(Age, ID) %>% filter(Age == 0)
df %>% group_by(Age) %>% count() # gives the count of people at each age
• We will convert them to NA using the dplyr function na_if()
df$Age = na_if(df$Age, 0)
• Alternatively, without dplyr, you can replace 0's with NAs in multiple columns at once:
df[, c(1,3:7)][df[, c(1,3:7)] == 0] = NA # we will not do this here. It takes all rows of the 1st and 3rd-7th columns, compares them with 0, and replaces matches with NA
df$Age # difficult to locate the NAs…
df[is.na(df$Age), 1] # print IDs where Age is NA
sum(is.na(df$Age))
• If you want to see how many replacements were made in each column…
for (i in 1:14) { print(sum(is.na(df[, i]))) }
# 0 4 0 0 0 0 0 …, as we have replaced values in only one column
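The per-column NA count from the loop above can also be written as one vectorized base-R call; a sketch on a toy data frame (hypothetical values standing in for the bank-loan data):

```r
# Toy data frame with one NA in the Age column
df <- data.frame(ID = 1:4,
                 Age = c(25, NA, 40, 31),
                 Income = c(50, 60, 70, 80))

cnt <- colSums(is.na(df))  # NA count per column, named by column
cnt
```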
• So, what are we going to do with the NAs? We may want to omit the bad ingredients:
newdata = na.omit(df)
newdata
dim(newdata) # so the new dimension is 4996x14
• However, the problem was only with 'Age', so do we want to omit the entire record for one bad column?
• Instead of dropping the entire row, we may want to replace the missing data with the median or mean of that column
median(df$Age) # returns NA, because of the NA values
median(df$Age, na.rm = TRUE) # na.rm stands for "remove NA"; it calculates the median after removing the NAs
• Now, we want to replace those NA values with the median
df$Age[is.na(df$Age)] = median(df$Age, na.rm = TRUE)
'Experience' Column
hist(df$Experience, main="Experience Histogram", xlab="Experience", ylab="Count", col="red", border="blue")
• There are values with negative experience
df[df$Experience < 0, 3] # shows the rows where Experience is < 0
OR
df %>% select(ID, Experience) %>% filter(Experience < 0)
df %>% group_by(Experience) %>% count() # gives the count of people at each experience level
• Replace the negative values with NAs
df[df < 0] = NA
table(df$Experience) # 4948 valid values in total; the NAs are not shown
sum(is.na(df)) # 52
• So, either you omit those records…
df1 = na.omit(df)
dim(df1) # 4948x14
• Or replace them with the column mean
df$Experience[is.na(df$Experience)] = mean(df$Experience, na.rm = TRUE)
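The whole negative-to-NA-to-mean pipeline above can be sketched on a toy column (hypothetical values, not the bank data):

```r
exp_col <- c(5, -1, 10, -3, 8)        # toy Experience values; negatives are noise

exp_col[exp_col < 0] <- NA            # step 1: flag invalid values as NA
fill <- mean(exp_col, na.rm = TRUE)   # step 2: mean of the valid values (5, 10, 8)
exp_col[is.na(exp_col)] <- fill       # step 3: impute the NAs with that mean
exp_col
```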
Let's consider Character Data
lapply(df, class) # take the function 'class' and apply it to every column of df
# lapply(df[3], sum) … would add up the Income column
str(df) # note that ZIP.Code is character
unique(df$ZIP.Code)
library(stringr) # we are going to use a few string functions, e.g. str_trim
df$ZIP.Code = str_trim(df$ZIP.Code) # trim the extra white spaces on both sides
df$ZIP.Code = str_trim(df$ZIP.Code, side = "left") # trim left-side white spaces only
• To remove white spaces within the column values:
df$ZIP.Code = str_replace_all(df$ZIP.Code, pattern = " ", repl = "")
# commas are still pending
df$ZIP.Code = gsub(",", "", df$ZIP.Code)
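A compact sketch of this cleaning step end-to-end, using stringr and base R on toy messy ZIP strings (made-up values):

```r
library(stringr)

zip <- c(" 91,107", "90 089 ", " 92,121")  # toy ZIP codes with spaces and commas

zip <- str_trim(zip)                       # strip leading/trailing spaces
zip <- str_replace_all(zip, " ", "")       # remove internal spaces
zip <- gsub(",", "", zip)                  # remove commas
zip  # "91107" "90089" "92121"
```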
Alternatively,
• It's still a hard way to do it… so there is a library called 'Plant Genetic Resources duplicates', or PGRdup. It offers a number of functions for cleaning string-type data.
install.packages("PGRdup")
library(PGRdup)
DataClean(df$ZIP.Code, fix.comma = TRUE)
• Some data can be redundant
• Finding the correlation:
plot(df$Age, df$Experience, xlab="Age", ylab="Exper", col="green")
• There is high correlation between age and experience
• Similarly, there may be negative correlation, e.g. as the weight of a car increases, its mileage decreases: plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="miles per gallon", col="red")
• If the height of a tree increases, its girth would also increase… but not necessarily in every case
• But with new data, how would you come to know which two attributes are related? For that, you can compute the correlation with cor().
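As a sketch of finding related attributes numerically, base R's cor() computes pairwise correlations; shown here on the built-in mtcars data, which the slides already use:

```r
# Correlation between car weight and mileage in the built-in mtcars data
r <- cor(mtcars$wt, mtcars$mpg)   # strongly negative: heavier cars get fewer mpg
r

# Pairwise correlation matrix for a few columns at once
round(cor(mtcars[, c("wt", "mpg", "hp")]), 2)
```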
