R


Data frames are primary data structures in R, capable of storing 2-dimensional
data. What follows here is the essential knowledge you need to have when dealing
with data frames.

By now, you already learned quite some things in R. Data structures such as
vectors, matrices and lists have no secrets for you anymore. However, R is a
statistical programming language, and in statistics you'll often be working with
data sets. Such data sets are typically comprised of observations, or instances.
All these observations have some variables associated with them. You can have for
example, a data set of 5 people. Each person is an instance, and the properties
about these people, such as for example their name, their age and whether they have
children are the variables. How could you store such information in R? In a matrix?
Not really, because the name would be a character and the age would be a numeric,
and a matrix can only hold elements of a single type. In a list maybe? This could work, because you can put
practically anything in a list. You could create a list of lists, where each
sublist is a person, with a name, an age and so on. However, the structure of such
a list is not really useful to work with. What if you want to know all the ages for
example? You'd have to write a lot of R code just to get what you want. But what
data structure could we use then?

Meet the data frame. It's the fundamental data structure to store typical data
sets. It's pretty similar to a matrix, because it also has rows and columns. Also
for data frames, the rows correspond to the observations, the persons in our
example, while the columns correspond to the variables, or the properties of each
of these persons. The big difference with matrices is that a data frame can contain
elements of different types. One column can contain characters, another one numeric
and yet another one logical. That's exactly what we need to store our persons'
information in the dataset, right? We could have a column for the name, which is
character, one for the age, which is numeric, and one logical column to denote
whether the person has children.

There still is a restriction on the data types, though. Elements in the same column
should be of the same type. That's not really a problem, because in one column, the
age column for example, you'll always want a numeric, because an age is always a
number, regardless of the observation.

So, for the practical part now: creating a data frame. In most cases, you don't
create a data frame yourself. Instead, you typically import data from another
source. This could be a CSV file or a relational database, but the data could also
come from other software packages like Excel or SPSS.

Of course, R provides ways to manually create data frames as well. You use the
data.frame() function for this. To create our people data frame that has 5
observations and 3 variables, we'll have to pass the data.frame() function 3 vectors
that are all of length five. The vectors you pass correspond to the columns. Let's
create these three vectors first: name, age and child.

Now, calling the data frame function is simple:
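
The code shown at this point is not preserved in these notes. A minimal sketch of what it might look like follows; the values for the five people are made up for illustration, and only the three column names come from the text.

```r
# Hypothetical values for five people; only the column names come from the text
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Each vector becomes one column; column names are taken from the variable names
df <- data.frame(name, age, child)
print(df)
```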

The printout of the data frame already shows very clearly that we're dealing with a
data set. Notice how the data.frame() function inferred the names of the columns from
the variable names you passed it. To specify the names explicitly, you can use the
same techniques as for vectors and lists. You can use the names() function, ... , or
use equals signs inside the data.frame() function to name the data frame columns
right away.
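
Both naming techniques could look roughly like this. The data values and the capitalised column names are illustrative assumptions, not taken from the original.

```r
# Hypothetical values; the capitalised column names are illustrative
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Option 1: rename the columns afterwards with names()
df1 <- data.frame(name, age, child)
names(df1) <- c("Name", "Age", "Child")

# Option 2: name the columns directly with equals signs
df2 <- data.frame(Name = name, Age = age, Child = child)

# Both approaches end up with the same column names
print(identical(names(df1), names(df2)))
```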

Like in matrices, it's also possible to name the rows of the data frame, but that's
generally not a good idea so I won't go into detail on that here.

Before you head over to some exercises, let me briefly discuss the structure of a
data frame some more.

If you look at this structure, ..., there are two things you can see here: First,
the printout looks suspiciously similar to that of a list. That's because, under
the hood, the data frame actually is a list. In this case, it's a list with three
elements, corresponding to each of the columns in the data frame. Each list element
is a vector of length 5, corresponding to the number of observations. A requirement
that is not present for lists is that the length of the vectors you put in the list
has to be equal. If you try to create a data frame with 3 vectors that are not all
of the same length, you'll get an error, unless the shorter vectors can be recycled
a whole number of times.
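
As a small sketch of both points, using made-up values for the people data frame: str() shows the list-like structure, and a vector of the wrong length makes data.frame() fail.

```r
# Hypothetical values for the people data frame
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
df <- data.frame(name, age, child)

# str() prints a compact, list-like summary: one line per column
str(df)

# A data frame is a list under the hood
print(is.list(df))

# Four ages for five names: data.frame() refuses to build the frame
result <- tryCatch(
  data.frame(name, age = age[1:4], child),
  error = function(e) "error"
)
print(result)
```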

Second, the name column, which you expect to be a character vector, is actually a
factor. That's because R versions before 4.0.0 store strings as factors by default.
To suppress this behaviour, you can set the stringsAsFactors argument of the
data.frame() function to FALSE, which is the default from R 4.0.0 onwards.

Now, the name column actually contains characters.
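
A short sketch of the difference, with hypothetical values. The argument is passed explicitly so that the result does not depend on the R version in use.

```r
name <- c("Anne", "Pete", "Frank")   # hypothetical values
age  <- c(28, 30, 21)

# Passing the argument explicitly gives the same result on every R version
as_factor <- data.frame(name, age, stringsAsFactors = TRUE)
as_char   <- data.frame(name, age, stringsAsFactors = FALSE)

print(class(as_factor$name))
print(class(as_char$name))
```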

With this new knowledge, you're ready for some first exercises on this extremely
useful and powerful data structure.

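data_handle_1 <- function() {
  # Build a small employee data frame from three vectors of equal length
  name <- c("A", "B", "C", "D", "E")
  age <- c(10, 11, 12, 13, 14)
  department <- c("AA", "BB", "CC", "DD", "EE")
  empdetails <- data.frame(name, age, department)
  print(empdetails)
  str(empdetails)  # str() prints its summary itself, so no print() is needed

  # Structure and summary statistics of the built-in mtcars data set
  str(mtcars)
  print(summary(mtcars))

  # Columns 1, 2 and 10 (mpg, cyl and gear)
  print(mtcars[, c(1, 2, 10)])

  # Rows sorted by ascending mpg
  print(mtcars[order(mtcars$mpg), ])

  # Cars with mpg above 30
  print(subset(mtcars, mpg > 30))

  # Cars with mpg above 30, sorted by descending mpg, then descending hp
  print(subset(mtcars[order(mtcars$mpg, mtcars$hp, decreasing = TRUE), ], mpg > 30))
  #print(mtcars[order(mtcars$mpg, decreasing = TRUE),])
}

data_handle_1()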
----------------------------------------

The R package data.table provides an enhanced version of data.frame that allows
you to perform very fast data manipulations.

This package is used in fields such as finance and genomics, and is
especially useful when working with large data sets (e.g. 1GB to 100GB in RAM).

In this module we will learn how to use this highly useful package.

library(data.table)
data_table <- function() {
  # Keep only the mtcars rows with mpg > 15 and convert them to a data.table
  mt <- data.table(subset(mtcars, mpg > 15))
  #print(mt)

  # Average hp and wt for each combination of cyl and carb
  mt[, .(AvgHp = mean(hp), AvgWt = mean(wt)), by = .(cyl, carb)]
}

print(data_table())

   cyl carb AvgHp   AvgWt
1:   6    4 116.5 3.09375
2:   4    1  77.4 2.15100
3:   6    1 107.5 3.33750
4:   8    2 162.5 3.56000
5:   4    2  87.0 2.39800
6:   8    3 180.0 3.86000
7:   8    4 264.0 3.17000
8:   6    6 175.0 2.77000

Using lapply and sapply


lapply and sapply are functions which loop over a list, apply a function to every
element of the list and return the results. lapply can be used on data frames,
lists or vectors and the output returned is a list (thus the l in the function
name). Watch this video to know more!

sapply works similarly to lapply, but it tries to simplify the output to the most
elementary data structure that is possible. In effect, sapply is a 'wrapper'
function for lapply.
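
The difference in return type can be seen in a small sketch. The scores list is a made-up example.

```r
# Hypothetical scores for two subjects
scores <- list(maths = c(70, 80, 90), science = c(71, 81, 91))

# lapply always returns a list, one element per input element
as_list <- lapply(scores, mean)
print(as_list)

# sapply simplifies the same result to a named numeric vector
as_vector <- sapply(scores, mean)
print(as_vector)
```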

Watch this video to understand how these functions work!

Using apply

The apply function operates on arrays and applies a base R function or a
user-defined function to the rows or columns of the array. The second argument
selects the margin: 1 for rows, 2 for columns. This function returns a
vector, array or list of values. Example:

m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
Returns: [1] 6 7 8 9 10 11 12 13 14 15

tapply applies a function to each cell of a ragged array, that is, to each (non-empty)
group of values given by a unique combination of the levels of certain
factors. Example: Applying tapply on the famous iris data.

tapply(iris$Petal.Length, iris$Species, mean)
Returns:
    setosa versicolor  virginica
     1.462      4.260      5.552

mapply is a multivariate version of sapply. mapply applies a function to the first
elements of each argument, then to the second elements, and so on. Example:

l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
mapply(sum, l1$a, l1$b, l2$c, l2$d)
Returns: [1] 64 68 72 76 80 84 88 92 96 100

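loops <- function() {
  # lapply: mean of each subject, returned as a list
  maths <- c(70, 75, 80, 85, 90)
  science <- c(71, 76, 81, 86, 91)
  english <- c(73, 78, 83, 88, 93)
  list1 <- list(maths, science, english)
  print(lapply(list1, mean))

  # sapply: mean of each vector, simplified to a numeric vector
  s1 <- c(70, 80, 90)
  s2 <- c(71, 81, 91)
  s3 <- c(72, 82, 92)
  s4 <- c(73, 83, 93)
  list2 <- list(s1, s2, s3, s4)
  print(sapply(list2, mean))

  # apply: column means of a 10 x 5 matrix
  m <- matrix(c(1:50), nrow = 10, ncol = 5)
  print(apply(m, 2, mean))

  # tapply: mean mpg for each combination of cyl and am
  print(tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean))

  # mapply: element-wise products of x and y
  x <- c(1:5)
  y <- c(6:10)
  print(mapply(prod, x, y))
}

loops()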

Test case 0
Your Output (stdout)
[[1]]
[1] 80
[[2]]
[1] 81
[[3]]
[1] 83
[1] 80 81 82 83
[1] 5.5 15.5 25.5 35.5 45.5
       0        1
4 22.900 28.07500
6 19.125 20.56667
8 15.050 15.40000

Importing data into R is an essential step before starting with data exploration or
data analysis. R offers a number of ways to import data from a variety of data
sources and data file types.

R has functions and packages that support importing many file types, as well as
data from relational databases and statistical analysis software.

In this module we will go over some of these functions and packages.

Importing and Exporting csv Files


The read.csv() and write.csv() functions in R are frequently used to read data from
and write data to .csv files.

In this video, you will learn how to import and export .csv data files in R. At the
same time, some of the most common problems that you can face when loading csv
files into R will also be addressed.
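
A minimal round trip could look like the sketch below. The data frame and the temporary file are assumptions made so that the example is self-contained.

```r
# Hypothetical data written to a temporary file so the example is self-contained
people <- data.frame(name = c("A", "B", "C"), age = c(10, 11, 12))
path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(people, path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character vectors
imported <- read.csv(path, stringsAsFactors = FALSE)
print(imported)
```

Forgetting row.names = FALSE when exporting is a common source of an unexpected extra first column when the file is read back in.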

Importing Text Files


Many times, data is stored as a tab-delimited text file or a .txt file.

To import such data, you can use one of the easiest and most general options to
import your file to R: the read.table() function.

Watch this video to learn how to use the read.table() function to import data from
text files.
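
As a rough sketch, assuming a small whitespace-separated file with a header line (created here in a temporary location), read.table() could be used like this:

```r
# Hypothetical whitespace-separated file, created so the example can run
path <- tempfile(fileext = ".txt")
writeLines(c("name age", "A 10", "B 11", "C 12"), path)

# header = TRUE: the first line holds the column names
# sep = "" (the default) splits on any whitespace
people <- read.table(path, header = TRUE, sep = "", stringsAsFactors = FALSE)
print(people)
```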

Importing Tab Delimited Files


To read data from tab-delimited files, you can use the function read.delim().

read.delim() is a wrapper function for read.table() with default argument values
that are convenient when reading in tab-separated data.

The read.delim() function reads a file into a data frame. By default, the fields
are separated by tabs; a comma or any other delimiter can be specified with the
parameter "sep=".

If the parameter "header=TRUE" is set (the default for read.delim()), the first row
is treated as the column names.

Syntax: read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill
= TRUE, comment.char = "", ...)

colClasses specifies the types of the columns that are read.

nrows specifies the maximum number of rows to read.
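
A short sketch of these two arguments, assuming a small tab-separated file that is created on the fly:

```r
# Hypothetical tab-separated file, created so the example can run
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tscore", "1\tA\t9.5", "2\tB\t8.0", "3\tC\t7.25"), path)

# colClasses fixes the type of each column; nrows limits how many rows are read
first_two <- read.delim(
  path,
  colClasses = c("integer", "character", "numeric"),
  nrows = 2
)
str(first_two)
```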

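rvest for web scraping data

library(rvest)

motherlink <- read_html("url of target page")

# Select the nodes to extract; the CSS selector is a placeholder
forecasthtml <- html_nodes(motherlink, "css selector of target nodes")

forecasttext <- html_text(forecasthtml)

forecasttext
paste(forecasttext, collapse = " ")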

Introduction to dplyr
dplyr is a package for data manipulation. It provides some simple and useful
functions that come in handy when performing exploratory data analysis and
manipulation.

dplyr is great for the following:

- Data exploration
- Data transformation
- Fast operations on data frames
- Working with data stored in databases

This module will provide a basic overview of some of the most useful functions in
the package.

Have you ever had a data set that you were sure contained simple insights that you
just couldn’t access? I’m not talking about the sorts of information that you can
discover with machine learning or modeling algorithms. Just simple things like new
variables, summary statistics, group differences, and so on.

Most data sets contain more information than they display and dplyr is a package
that can help you access that information.

dplyr introduces a grammar of data manipulation. Five simple functions that you can
use to reveal new variables, new observations, and new ways to describe your data.
You can also use these functions to subset your data and do group wise operations.

And dplyr is fast. Very fast. The key pieces of dplyr are written in C++, which
means you get the speed of C with the ease of R.

This course will show you how to use dplyr like an expert. You'll learn to use
dplyr's grammar of data manipulation to solve any data related task you can think
of. Along the way, you’ll learn how to think about and manipulate the structure of
data. You'll also learn to use dplyr's tbl structure and the piping operator -- two
features that can save you tons of time.

You'll even learn how to use dplyr to access data stored in a database, which
provides an easy way to work with data that is too big to fit in R all at once.

With dplyr, R is literally faster, bigger, and better.

I'll be your guide through the dplyr package. My name is Garrett Grolemund and I'm
a Data Scientist at RStudio. I work closely with Hadley Wickham, the author of
dplyr, and I spend much of my time teaching people how to use the wonderful tools
that Hadley makes, as well as the tools that my other colleagues at RStudio make.

I've asked Hadley to join us at the end of the course. So, if you work your way
through all of the exercises, you'll have a chance to hear Hadley's own thoughts on
the dplyr package.

But before we can do any of that, you'll need to set up your R Session to use
dplyr. Let's get started!

Getting Started with dplyr


Before you get started with dplyr, install and load the package dplyr.

To explore and learn various functions in dplyr, use the built in hflights dataset.
This dataset comes from US Bureau of Transportation Statistics and contains details
of 227,496 flights that departed from Houston in 2011.

Run the following commands to install the package and dataset:

install.packages("dplyr")
install.packages("hflights")
library(dplyr)
library(hflights)

Introduction to the tbl


Though dplyr can work with data frames as is, while dealing with large data, it's
worthwhile to convert them to a data structure "tbl".

A tbl is a wrapper around a data frame that won't accidentally print a lot of data
to the screen.

Watch this video to know more about this data structure.

tbl_df(mtcars)

glimpse(mtcars)

as.data.frame(mtcars)
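
The three calls above can be tried out roughly as follows. Note that tbl_df() is deprecated in recent dplyr releases in favour of as_tibble(), so this sketch uses as_tibble() instead; the variable names are illustrative.

```r
library(dplyr)

# as_tibble() is the current replacement for tbl_df()
cars_tbl <- as_tibble(mtcars)

# Printing a tbl shows only the first rows and the columns that fit the screen
print(cars_tbl)

# glimpse() shows every column on its own line, with its type
glimpse(cars_tbl)

# Convert back to a plain data frame when needed
cars_df <- as.data.frame(cars_tbl)
print(class(cars_df))
```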

#install.packages("tidyverse")
#library(tidyverse)
library(dplyr)
library(data.table)
dplyrs <- function() {
  # Read the flights data from a local CSV file
  hflights <- read.csv("hflights.csv")

  print(dim(hflights))
  print(head(hflights))
  print(tail(hflights))
  print(head(hflights, n = 20))
  glimpse(hflights)

  hflights1 <- hflights[1:50, ] # the exercise mentions five instead of fifty

  tbl <- as.data.table(hflights1, keep.rownames = TRUE)
  carriers <- tbl$UniqueCarrier
  print(carriers)

  # Named vector mapping carrier codes to carrier names
  abrCarrier <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue",
    "CO" = "Continental", "DL" = "Delta", "OO" = "SkyWest", "UA" = "United",
    "US" = "US_Airways", "WN" = "Southwest", "EV" = "Atlantic_Southeast",
    "F9" = "Frontier", "FL" = "AirTran", "MQ" = "American_Eagle",
    "XE" = "ExpressJet", "YV" = "Mesa")

  # Alternative approaches, left commented out:
  #abrCarrier_df <- as.data.frame(sapply(strsplit(abrCarrier, "="), rbind), stringsAsFactors=TRUE)
  #names(abrCarrier_df) <- abrCarrier_df[1,]
  #abrCarrier_df <- abrCarrier_df[-1,]
  #print(head(abrCarrier_df, n = 15))
  #abrCarrier_df <- enframe(abrCarrier, name = "UniqueCarrier", value = "Carrier")
  #hflights2 <- left_join(hflights, abrCarrier_df, by = "UniqueCarrier")
  #print(head(hflights2, n = 10))

  # Look up each code by name; as.character() guards against UniqueCarrier
  # having been read as a factor
  hflights$Carrier <- abrCarrier[as.character(hflights$UniqueCarrier)]
  print(head(hflights, n = 10))
}

dplyrs()

There are 5 basic verbs in dplyr. The command structure is similar for all dplyr
verbs, and each verb returns a data frame.

dplyr does not modify the original data frame and does not preserve row names.

Over the next set of cards, you will learn how to use each of these verbs:

select, summarise, arrange, mutate, filter

The filter function returns all the rows that satisfy a particular condition.

- A comma or "&" is used to separate multiple conditions that must all hold
- The "|" symbol can be used for an OR condition
- The %in% operator is used to match any of several values in a condition

Example: The following command returns all flights that flew in the month of
January, after Jan 15th.

filter(hflights, DayofMonth>15, Month==1)
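
The three ways of combining conditions can be compared in a small sketch. It uses the built-in mtcars data instead of hflights so that it runs without extra files; the thresholds are arbitrary.

```r
library(dplyr)

# Comma (or &): both conditions must hold
both <- filter(mtcars, cyl == 4, mpg > 30)

# |: either condition may hold
either <- filter(mtcars, cyl == 4 | mpg > 30)

# %in%: cyl matches any of the listed values
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

print(nrow(both))
print(nrow(either))
print(nrow(four_or_six))
```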

Like SELECT in SQL, the function select is used to choose specific columns of a
data frame.

Example: select(hflights, FlightNum, DepTime, ArrTime, UniqueCarrier)

A colon can be used to select all columns between two specific columns.

Example: select(hflights, Year : DayofMonth)

Besides selecting existing columns, it’s often useful to add new columns that are
functions of existing columns. The mutate() function will help with this.

For example: Let's add a new column called Distance_Km, which converts the distance
in the hflights dataset from miles to kilometers (1 mile is about 1.609 km), and
store the data in a new data frame hflights1.

hflights1 <- mutate(hflights, Distance_Km = Distance * 1.609)

arrange()
The arrange() function takes a data frame, and a set of column names to order by as
input and reorders the rows in the data frame.

If more than one column name is provided, each additional column will be used to
break ties in the values of preceding columns.

Example: arrange(hflights, Year, Month, DayofMonth)

summarise()
The summarise function summarises multiple values into a single value. It is useful
when used in conjunction with the other functions in the dplyr package.

Example: summarise(hflights, mean(ArrDelay, na.rm = TRUE))

Here, na.rm = TRUE removes all NA values before calculating the mean; without it,
a single NA in ArrDelay would make the result NA.

group_by()
The group_by function groups data by one or more variables.

In the example here, group_by groups the data based on the Month, and
then the summarise function calculates the mean temperature in each month.

Example: summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))

Pipe Operator
The pipe operator in R, represented by %>% can be used to chain code together.

It is very useful when you perform several operations on data, and you do not want
to save the output into a variable at each intermediate step.

It makes the code simpler to write and maintain.

Understanding Pipe Operator


Assume that you want to find the average speed of each flight in the hflights
dataset and list the flight numbers in descending order of their speed.

Using the pipe operator, you can achieve this by:

hflights %>%
  mutate(speed = Distance / AirTime) %>%
  group_by(FlightNum) %>%
  summarise(avgspd = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avgspd))

This is equivalent to the multiple variables approach:

mflights <- mutate(hflights, speed = Distance / AirTime)
sflights <- summarise(group_by(mflights, FlightNum), avgspd = mean(speed, na.rm = TRUE))
arrange(sflights, desc(avgspd))

or the nested approach:

arrange(summarise(group_by(mutate(hflights, speed = Distance / AirTime), FlightNum), avgspd = mean(speed, na.rm = TRUE)), desc(avgspd))

library(dplyr)
library(data.table)
dplyr_verbs <- function() {
  hflights <- as.data.frame(read.csv("hflights.csv"))

  # Flights with flight number 428 or 460
  print(filter(hflights, FlightNum %in% c(428, 460)))

  # First 20 rows only
  hflights1 <- head(hflights, n = 20)

  # FlightNum plus every column whose name contains "Time", such as ArrTime
  print(select(hflights1, FlightNum, contains("Time")))

  # Calculate the new field velocity
  print(mutate(hflights1, velocity = Distance * 60 / AirTime))

  # Arrange in descending order of ArrDelay
  print(arrange(hflights1, desc(ArrDelay)))

  # Group by carrier, then count total flights and total cancellations;
  # n() counts the rows in each group
  hflights %>%
    group_by(UniqueCarrier) %>%
    summarise(n_flights = n(), n_canc = sum(Cancelled)) %>%
    print()

  # Average ArrDelay on a monthly basis
  hflights %>%
    group_by(Month) %>%
    summarise(mean(ArrDelay, na.rm = TRUE)) %>%
    print()
}

dplyr_verbs()

dplyr contains joining functions that enable you to combine two data frames into
one. These functions mimic database joins.

In this module we will focus on understanding how joins work in dplyr.


Now, before we get into our last chapter on joins, I want to take a moment and take
you back to the first chapter, on writing basic SQL statements, and to the slide
where we started. By now you know about selection: how you can select any
particular row that lies in any particular column by using the SELECT clause and
the WHERE clause. There was also the concept of a join, and I told you that joining
means joining two columns that lie in two different tables. As far as I recall, I
also told you that if you want to join your table one with your table two, you need
a primary key; without a primary key and a foreign key you cannot join your two
tables. So in this lecture we are going to learn what a primary key and a foreign
key are.

First, the primary key. A primary key is an attribute that can uniquely identify
all the attributes of a given row, and it cannot be null. For example, consider
this employee table, where we have three columns: employee number, employee name
and department number. Now tell me, which column could be the primary key? Which
attribute can uniquely identify all the rows? Is it the name, is it the employee
number, or is it the department number?

Let me give you a hint. All of you must have a social security number or identity
card number that is given by your country, and all of you have a name, an address,
a date of birth and a gender. Suppose a person lives in New York City, America; his
name is Alex, his date of birth (DOB) is 14 August 1947, and his address is New
York, America. Now there is one more person with the same name, Alex, who has the
same date of birth, 14 August 1947, and the same address: he also lives in New
York, America. How can the government tell these two persons apart when it has to
issue them a passport, because they both want to take a vacation and travel? They
have the same DOB, the same address and the same name, so it will be tricky for a
government department to issue them passports that they can use to travel all
around the world.

So, instead of putting a mark on their skin, the government issues them social
security numbers, and their social security numbers are different. The first
person, with a social security number of 549649982, can easily get a passport from
the government department, and the second person, with the same name, DOB and
address, can also get a passport with his social security number of 923923469. Now
there is no problem for the government department, nor for these persons when in
the future they have to present themselves in a situation like this, because they
have a social security number that can easily and uniquely identify each person
regardless of their same name, same DOB and same address.

This social security number that uniquely identifies these people is called a
primary key, and no one else in their country has that social security number. So
the social security number is an attribute that can uniquely identify all the other
attributes, meaning all their information: their name, their DOB, their address,
their father's name and so on. A primary key is exactly like your social security
number: it uniquely identifies all of your information.

So let's come back to our slide. The primary key is an attribute that can uniquely
identify all the attribute values in a given row, and it cannot be null. For
example, we have the employee table here, where you can see the name, the
department number and the employee number. Now you tell me which column could be
the primary key. Right, the employee number. Why? Because the employee number is
like a social security number. It is possible that two persons with the same name
and the same department are in the same organization, but their employee numbers
will be different, because one employee number can be assigned to one person only.
So we can apply a primary key on the employee number column, and we will say that
it cannot be null: whether or not an employee has a department number, they must at
least have an employee number. By looking at 7839 you can easily tell the name of
employee 7839, which is King, and you can also tell his address, department number,
date of birth and so on, because this employee number belongs to only one person.

Then we have the foreign key. A foreign key is an attribute in one table whose
value must either match the primary key in another table or be null. Suppose you
have two tables: the first one is the employee table and the second is the
department table. If I ask you to tell me the department name of my employee King,
how can you do that? You have to join the employee and department tables, because
the department name is in the department table. Now look at these tables and tell
me, what is the common column in both of them? Is it the employee number? No,
because there is no employee number column in my department table. Is it the name?
No, because there is no department name in my employee table. Is it the department
number? Yes, because there is a department number column in my department table and
in my employee table as well. So this department number column that lies in both
tables can help me combine them. In one of the tables I have to make the department
number a primary key, and in the other table the department number a foreign key,
because the foreign key is an attribute in one table whose value must match the
primary key in another table.

Look at the employee table, where we have the foreign key on the department number
column. Look at the last entry, 7369, with the name Smith and the department number
20. We made this department number 20 a foreign key, and this 20 matches the
primary key of my department table. So now if a person asks me what the department
name of my employee Smith is, I can easily tell him that Smith works in the
Research department and his location is Dallas. Why? Because I combined these two
tables using the primary key and the foreign key. So a foreign key is an attribute,
like 20, whose value matches the primary key, department number 20, in my
department table.

Now you have an idea about primary keys and foreign keys, and we are good to jump
into our last section on joins, where we learn how you can join two tables to get
the combined information.
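
The employee and department example from the lecture can be reproduced in R with base merge(). This is only a sketch: the employee numbers and Smith's department (20, Research, Dallas) come from the lecture, while King's department number and the Accounting row are assumed for illustration, as are all column names.

```r
# Employee table: emp_no is the primary key, dept_no is the foreign key
employee <- data.frame(
  emp_no  = c(7839, 7369),
  name    = c("King", "Smith"),
  dept_no = c(10, 20)          # King's department number is an assumption
)

# Department table: dept_no is the primary key
department <- data.frame(
  dept_no   = c(10, 20),
  dept_name = c("Accounting", "Research"),   # "Accounting" is an assumption
  location  = c("New York", "Dallas")        # "New York" is an assumption
)

# A primary key must be unique and must not be missing
print(!anyDuplicated(employee$emp_no) && !anyNA(employee$emp_no))

# Join the two tables on the shared dept_no column
joined <- merge(employee, department, by = "dept_no")
print(joined[joined$name == "Smith", c("name", "dept_name", "location")])
```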

How are we doing, everybody? This is that R nerd back at you with another very nice
video. What we're going to cover today is how to do SQL joins within R, and we're
going to do this using the dplyr package. The one I'm going to load here is
tidyverse; this package actually comes with dplyr and ggplot and a lot of other
really good packages, so I just use this one a lot.

What we're going to do is make a couple of data frames, and I'm going to use a
little bit of randomness, so I'm going to set a seed so that you can follow along
and make these data frames as well. So set.seed(2018), and we'll do data frame 1,
df1. Inside of it we're going to do a customer ID, and we're going to make this a
vector from 1 to 10, so our customer IDs are 1 through 10. Then we're also going to
put in a product, and for this one we're going to use our little bit of randomness:
we're going to do a sample. The first vector we make is kind of like the urn, the
classic urn that we're going to take a sample from, then 10, and replace is equal
to TRUE. So this is saying we have an urn with a toaster, a TV and a dishwasher,
we're going to take 10 pulls out of it, and we're going to replace: once we pull an
item out we put it back in. If we run just this chunk, we see it pulled 10 random
draws out of that.

OK, we'll make our second data frame. For this one we'll do a customer ID, and
we're actually going to do a sample here as well: df1$customer_id, and we're going
to do 5 draws from this urn. What this is saying is that we're going to take a
sample of the possible customer IDs from data frame 1 that we just made, so given
an urn that goes from 1 to 10, we're going to pull 5 out of there. We don't want to
put replace equal to TRUE, because we just want each one once: when it's gone, it's
gone. Then we'll put what state they're from, so we'll do another sample here. In
this little urn we'll have New York and California, we're going to do five pulls
(we have 5 customer IDs here, so we want it to be the same for our data frame), and
we're going to set replace equal to TRUE on this one. Then all this we're going to
wrap in a tbl, which is something that dplyr uses; it's pretty much just a data
frame, but it has a lot of nice properties. So tbl_df, and we'll wrap this whole
thing in the tbl.

All right, I'll run this whole thing, from the seed all the way down. Shoot, I
messed up, I didn't close this one here. Let's run the whole thing with the seed
one more time. OK, so we have df1: customer ID 1 through 10 and then TV, TV,
toaster, toaster, TV, toaster, TV, toaster, dishwasher, TV. And then data frame 2,
which has customer IDs 4, 5, 6, 7, 8 (what are the odds of that?) and state
California, New York, California, California, California.

OK, so let's do some joins. In all these joins we're going to use df1 as our left
table and df2 as our right table. The first join we'll do is an inner join. What
this one does is return only what is in both data frames, by whatever key we give
it. So I'll say df1, and we're going to pipe it (that's percent, greater than,
percent as a pipe, %>%) into the inner_join function, and we're going to give it
data frame 2. Then you're supposed to tell it what you want to merge by, because
there could be multiple columns that are the same and you might want to merge by
just the one, so say by is equal to customer ID. When we join there, we only keep
the customer IDs that are in both of our data frames, which is 4, 5, 6, 7, 8, and
then we get the product from the left and the state from the right.

OK, a left join. A left join returns everything in the left and the rows with
matching keys in the right. This is honestly pretty similar: I take data frame 1,
pipe it into a left_join, give it data frame 2, and say by is equal to customer ID
again. Now we keep everything from the left: customer ID 1 through 10 and all the
products on the left. Then for state we pull over what was in the right data frame
and match it up. We had 4, 5, 6, 7 and 8 in our right data frame, so it fills in
the states for those.

OK, a right join returns everything in the right and the rows with matching keys in
the left. Again really similar; this is what's really nice about dplyr, it's pretty
simple. Take data frame 1, pipe it into a right_join with data frame 2, by is equal
to customer ID. Now we keep the customer IDs and the states that were in our right
table, and we match up by our key the products that were in our left table.

OK, outer join. This one returns all rows from both tables and joins matching keys
in the right and left. This one's just a hair different: we take data frame 1, and
instead of an outer join, which is the SQL terminology, we do a full_join with df2,
by is equal to customer ID. Now this one looks like the left join; it's actually
the exact same output. But if we had a customer ID 11 in our right table that
wasn't in our left, it would be down there at the bottom, and we'd be missing the
product, because we don't have it in our left table, and we'd have whatever state
it was in the right table. In this case we don't have anything extra in the right
table, so it doesn't show anything extra.

Finally, the last one, which I actually like a lot: the anti join. I find all sorts
of weird reasons to use this. It returns all rows in the left that do not have
matching keys in the right. So it looks at our first table and asks, hey, which of
these are not in the right, and then returns those. So say df1, pipe it to an
anti_join with df2, by is equal to customer ID. We had 4, 5, 6, 7, 8 as the
customer IDs in the right table, and so it returns what was not in the right table.
Pretty cool.

Anyway, I hope this video was helpful for you. If it was, make sure to press that
like button so other people can see it, and make sure to subscribe for the best R
content that is available. Have a great day, and thanks for your time.
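
The joins walked through in the video can be sketched as follows. The on-screen code is not part of these notes, so this is a reconstruction under assumptions: the two tables are typed in directly with the values read out in the video rather than sampled (sample() results for a given seed can differ between R versions), and the column names customer_id, product and state are guesses at the names used.

```r
library(dplyr)

# Values as read out in the video, typed in directly instead of sampled
df1 <- tibble(
  customer_id = 1:10,
  product = c("TV", "TV", "toaster", "toaster", "TV",
              "toaster", "TV", "toaster", "dishwasher", "TV")
)
df2 <- tibble(
  customer_id = 4:8,
  state = c("California", "New York", "California", "California", "California")
)

inner <- df1 %>% inner_join(df2, by = "customer_id")  # keys present in both
left  <- df1 %>% left_join(df2, by = "customer_id")   # everything from df1
right <- df1 %>% right_join(df2, by = "customer_id")  # everything from df2
full  <- df1 %>% full_join(df2, by = "customer_id")   # everything from both
anti  <- df1 %>% anti_join(df2, by = "customer_id")   # df1 rows with no match

print(left)
print(anti)
```

In the left and full joins, customers without a row in df2 get NA in the state column.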

Filtering Joins

Filtering joins filter observations from one table based on whether they match an
observation in the other table.

There are two types of filtering joins. For example, consider two datasets x and y.

semi_join()

This returns all rows from x where there are matching values in y.
Unlike an inner join, a semi join never duplicates rows of x.

anti_join()

This returns all rows from x where there are no matching values in y.
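
A small sketch contrasts the two filtering joins with an inner join. The tables x and y and their contents are made up for illustration.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), value = c("a", "b", "c"))
# id 2 appears twice in y
y <- tibble(id = c(2, 2, 4), other = c("p", "q", "r"))

inner <- inner_join(x, y, by = "id")  # row for id 2 is duplicated, y columns added
semi  <- semi_join(x, y, by = "id")   # row for id 2 appears once, only x columns
anti  <- anti_join(x, y, by = "id")   # ids 1 and 3, which have no match in y

print(inner)
print(semi)
print(anti)
```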

The set operations in dplyr provide efficient versions for data frames and tables.
They override the set functions provided in base R.

For other objects, the default methods call the base versions.

Set Operations Usage


Set operations combine the observations in two datasets x and y, as if they were
set elements:

intersect(x, y, ...)
union(x, y, ...)
union_all(x, y, ...)
setdiff(x, y, ...)

Create 2 data frames, first and second, using mtcars rows 1 to 6 and 6 to 15
respectively, then carry out the above functions.

library(dplyr)
sets <- function() {
  # Row 6 of mtcars is shared by both data frames
  first <- mtcars[1:6, ]
  second <- mtcars[6:15, ]

  print(intersect(first, second))  # rows present in both
  print(union(first, second))      # all distinct rows
  print(union_all(first, second))  # all rows, duplicates kept
  print(setdiff(first, second))    # rows of first that are not in second
}

sets()

Cut, Pretty and Range


Explains how to bin / bucket data using the "cut", "pretty" and "range" functions
in R.

Binning is also used to convert a continuous variable to a categorical variable.
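
How the three functions fit together can be sketched on the mpg column of the built-in mtcars data. The choice of column is an assumption made for illustration.

```r
mpg <- mtcars$mpg

# range() gives the minimum and maximum of the data
print(range(mpg))

# pretty() proposes evenly spaced, round break points covering that range
breaks <- pretty(mpg)
print(breaks)

# cut() assigns each value to the interval (bin) it falls into
bins <- cut(mpg, breaks)
print(head(bins))
```

The result of cut() is a factor, so the continuous mpg values have become a categorical variable.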

Contingency Tables with 1 Variable


Watch this video to learn how to create "contingency tables" or "frequency tables",
which tabulate the frequency of occurrence of each value, and how to cross-tabulate.

Contingency Tables with 2 Variables


Continuing with the previous video, here you will learn how to deal with two
variables using the "table()" function.
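
As a brief sketch on the built-in mtcars data (the choice of the cyl and am columns is an assumption for illustration), one-variable and two-variable tables look like this:

```r
# One variable: how many cars have each number of cylinders
one_way <- table(mtcars$cyl)
print(one_way)

# Two variables: cylinders (rows) against transmission type am (columns)
two_way <- table(cyl = mtcars$cyl, am = mtcars$am)
print(two_way)
```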

Apply the functions table, cut and pretty to mtcars to create a contingency table
of the mpg of the listed cars, and print it:

table(cut(mtcars$mpg, pretty(mtcars$mpg)))

Your Output (stdout)

(10,15] (15,20] (20,25] (25,30] (30,35]
      6      12       8       2       4
