What follows here is the essential knowledge you need when dealing with data
frames.
By now, you have already learned quite a few things in R. Data structures such as
vectors, matrices and lists hold no secrets for you anymore. However, R is a
statistical programming language, and in statistics you'll often be working with
data sets. Such data sets typically consist of observations, or instances.
All these observations have some variables associated with them. You can have, for
example, a data set of 5 people. Each person is an instance, and the properties
of these people, such as their name, their age and whether they have
children, are the variables. How could you store such information in R? In a matrix?
Not really, because the name would be a character and the age would be a numeric;
these don't fit together in a matrix. In a list maybe? This could work, because you
can put practically anything in a list. You could create a list of lists, where each
sublist is a person, with a name, an age and so on. However, the structure of such
a list is not really convenient to work with. What if you want to know all the ages,
for example? You'd have to write a lot of R code just to get what you want. But what
data structure could we use then?
Meet the data frame. It's the fundamental data structure to store typical data
sets. It's pretty similar to a matrix, because it also has rows and columns. Also
for data frames, the rows correspond to the observations, the persons in our
example, while the columns correspond to the variables, or the properties of each
of these persons. The big difference with matrices is that a data frame can contain
elements of different types. One column can contain characters, another one numeric
and yet another one logical. That's exactly what we need to store our persons'
information in the dataset, right? We could have a column for the name, which is
character, one for the age, which is numeric, and one logical column to denote
whether the person has children.
There still is a restriction on the data types, though. Elements in the same column
should be of the same type. That's not really a problem, because in one column, the
age column for example, you'll always want a numeric, because an age is always a
number, regardless of the observation.
So, for the practical part now: creating a data frame. In most cases, you don't
create a data frame yourself. Instead, you typically import data from another
source. This could be a CSV file or a relational database, but the data could also
come from other software packages like Excel or SPSS.
Of course, R provides ways to manually create data frames as well. You use the
data.frame() function for this. To create our people data frame that has 5
observations and 3 variables, we'll have to pass the data.frame() function 3 vectors
that are all of length 5. The vectors you pass correspond to the columns. Let's
create these three vectors first: name, age and child.
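A minimal sketch of this step could look as follows. The five names, ages and child flags are made-up illustration values, not data taken from the course.

```r
# Illustrative values only: five people, three variables.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Each vector becomes one column; the column names are taken
# from the names of the variables passed in.
df <- data.frame(name, age, child)
print(df)
```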
The printout of the data frame already shows very clearly that we're dealing with a
data set. Notice how the data.frame() function inferred the names of the columns from
the variable names you passed it. To specify the names explicitly, you can use the
same techniques as for vectors and lists. You can use the names() function, ... , or
use equals signs inside the data.frame() function to name the data frame columns
right away.
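Both naming techniques can be sketched like this. The values and the capitalised column names are illustrative assumptions.

```r
# Illustrative values only.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Option 1: build the data frame first, then rename its columns with names().
df <- data.frame(name, age, child)
names(df) <- c("Name", "Age", "Child")

# Option 2: name the columns directly with = inside data.frame().
df2 <- data.frame(Name = name, Age = age, Child = child)

# Both routes give the same column names.
print(names(df))
print(names(df2))
```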
As with matrices, it's also possible to name the rows of the data frame, but that's
generally not a good idea, so I won't go into detail on that here.
Before you head over to some exercises, let me briefly discuss the structure of a
data frame some more.
If you look at this structure, ..., there are two things you can see here: First,
the printout looks suspiciously similar to that of a list. That's because, under
the hood, the data frame actually is a list. In this case, it's a list with three
elements, corresponding to each of the columns in the data frame. Each list element
is a vector of length 5, corresponding to the number of observations. A requirement
that is not present for lists is that the vectors you put in a data frame must have
matching lengths. If you try to create a data frame with 3 vectors whose lengths
don't match, you'll generally get an error; the one exception is that R recycles a
shorter vector when its length divides the longest length evenly.
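Second, the name column, which you expect to be a character vector, is actually a
factor. That's because, in R versions before 4.0.0, data.frame() stored strings as
factors by default. To suppress this behaviour, you can set the stringsAsFactors
argument of the data.frame() function to FALSE. Since R 4.0.0, FALSE is the default,
so strings stay character vectors unless you ask for factors.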
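The two observations above can be checked with a short sketch. The data values are illustrative, and stringsAsFactors is set explicitly so that the result does not depend on the R version.

```r
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Strings kept as character vectors.
df_chr <- data.frame(name, age, child, stringsAsFactors = FALSE)
# Strings converted to factors (the default before R 4.0.0).
df_fct <- data.frame(name, age, child, stringsAsFactors = TRUE)

str(df_chr)               # compact overview: one line per column
print(is.list(df_chr))    # a data frame is a list of columns
print(class(df_chr$name)) # "character"
print(class(df_fct$name)) # "factor"

# Lengths 5 and 3 cannot be recycled to match, so data.frame() signals an error.
result <- tryCatch(
  data.frame(a = 1:5, b = 1:3),
  error = function(e) "error"
)
print(result)
```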
With this new knowledge, you're ready for some first exercises on this extremely
useful and powerful data structure.
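data_handle_1 <- function() {
  str(mtcars)                         # str() prints its own output
  print(summary(mtcars))              # per-column summary statistics
  print(mtcars[, c(1, 2, 10)])        # columns 1, 2 and 10
  print(mtcars[order(mtcars$mpg), ])  # rows sorted by ascending mpg
}
data_handle_1()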
----------------------------------------
The R package data.table provides an enhanced version of data.frame that allows
very fast data manipulation.
This package is used in fields such as finance and genomics, and is
especially useful when working with large data sets (e.g. 1GB to 100GB in RAM).
In this module we will learn how to use this highly useful package.
library(data.table)
data_table <- function(){
sapply works similarly to lapply, but it tries to simplify the output to the
simplest data structure possible, such as a vector or matrix instead of a list.
In effect, sapply is a wrapper function around lapply.
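The difference between the two can be seen in a small sketch. The score vectors are illustrative values.

```r
# Three illustrative score vectors collected in a list.
scores <- list(c(70, 80, 90), c(71, 81, 91), c(72, 82, 92))

as_list   <- lapply(scores, mean)  # always returns a list, one element per input
as_vector <- sapply(scores, mean)  # simplified to a plain numeric vector

print(as_list)
print(as_vector)
```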
Using apply
The apply function operates on arrays and applies a base R or user-defined
function to the rows or columns of the array. It returns a vector, array or list
of values. Example:
print(lapply(list1, mean))
s1 <- c(70,80,90)
s2 <- c(71,81,91)
s3 <- c(72,82,92)
s4 <- c(73,83,93)
list2 = list(s1,s2,s3,s4)
print(sapply(list2, mean))
print(apply(m, 2, mean))
x <- c(1:5)
y <- c(6:10)
print(mapply(prod, x, y))
loops()
Output:
[[1]]
[1] 80
[[2]]
[1] 81
[[3]]
[1] 83
[1] 80 81 82 83
[1] 5.5 15.5 25.5 35.5 45.5
0 1
4 22.900 28.07500
6 19.125 20.56667
8 15.050 15.40000
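The apply and mapply calls can be sketched as a self-contained example. The excerpt does not show how the matrix m was built, so the definition below is an assumption chosen for illustration.

```r
# Assumed example matrix: 10 rows, 5 columns, filled column by column with 1..50.
m <- matrix(1:50, nrow = 10)

col_means <- apply(m, 2, mean)  # MARGIN = 2: one value per column
row_sums  <- apply(m, 1, sum)   # MARGIN = 1: one value per row

# mapply() walks over x and y in parallel: prod(1, 6), prod(2, 7), ...
x <- 1:5
y <- 6:10
products <- mapply(prod, x, y)

print(col_means)
print(row_sums)
print(products)
```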
Importing data into R is an essential step before starting with data exploration or
data analysis. R offers functions and packages for importing data from many
sources: plain-text and other file types, relational databases and statistical
analysis software.
In this video, you will learn how to import and export .csv data files in R. At the
same time, some of the most common problems that you can face when loading csv
files into R will also be addressed.
To import such data, you can use one of the easiest and most general options for
importing a file into R: the read.table() function.
Watch this video to learn how to use the read.table() function to import data from
text files.
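A minimal round trip through a CSV file might look like this. The data and the temporary file are illustrative assumptions, used so that the sketch does not depend on any file on disk.

```r
# Write a small data frame to a temporary CSV file.
people <- data.frame(name = c("Anne", "Pete"), age = c(28, 30))
path <- tempfile(fileext = ".csv")
write.csv(people, path, row.names = FALSE)

# read.table() needs the header flag and the separator spelled out for CSV input.
imported <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# read.csv() is a wrapper around read.table() with header = TRUE and sep = "," preset.
imported2 <- read.csv(path, stringsAsFactors = FALSE)

print(imported)
unlink(path)  # remove the temporary file
```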
The read.delim() function reads a delimited text file into a data frame. By
default, fields are separated by tabs; any other delimiter can be specified with
the parameter "sep=".
If the parameter "header=TRUE" (the default for read.delim()), then the first row
will be treated as the column names.
Syntax: read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill
= TRUE, comment.char = "", ...)
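As a small sketch of read.delim() with its defaults, the example below writes a tab-separated file to a temporary location and reads it back. The file contents are made up for illustration.

```r
# Create a tab-separated file with a header row.
path <- tempfile(fileext = ".txt")
writeLines(c("name\tage", "Anne\t28", "Pete\t30"), path)

# Defaults: header = TRUE and sep = "\t", so no extra arguments are needed.
tab_data <- read.delim(path)
print(tab_data)

unlink(path)  # remove the temporary file
```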
library(rvest)
forecasttext
paste(forecasttext, collapse = " ")
Introduction to dplyr
dplyr is a package for data manipulation. It provides some simple and useful
functions that come in handy when performing exploratory data analysis and
manipulation.
- Data exploration
- Data transformation
- Fast operations on data frames
- Working with data stored in databases
This module will provide a basic overview of some of the most useful functions in
the package.
Have you ever had a data set that you were sure contained simple insights that you
just couldn’t access? I’m not talking about the sorts of information that you can
discover with machine learning or modeling algorithms. Just simple things like new
variables, summary statistics, group differences, and so on.
Most data sets contain more information than they display and dplyr is a package
that can help you access that information.
dplyr introduces a grammar of data manipulation. Five simple functions that you can
use to reveal new variables, new observations, and new ways to describe your data.
You can also use these functions to subset your data and do group wise operations.
And dplyr is fast. Very fast. The key pieces of dplyr are written in C++, which
means you get the speed of C with the ease of R.
This course will show you how to use dplyr like an expert. You'll learn to use
dplyr's grammar of data manipulation to solve any data related task you can think
of. Along the way, you’ll learn how to think about and manipulate the structure of
data. You'll also learn to use dplyr's tbl structure and the piping operator -- two
features that can save you tons of time.
You'll even learn how to use dplyr to access data stored in a database, which
provides an easy way to work with data that is too big to fit in R all at once.
I'll be your guide through the dplyr package. My name is Garrett Grolemund and I'm
a Data Scientist at RStudio. I work closely with Hadley Wickham, the author of
dplyr, and I spend much of my time teaching people how to use the wonderful tools
that Hadley makes, as well as the tools that my other colleagues at RStudio make.
I've asked Hadley to join us at the end of the course. So, if you work your way
through all of the exercises, you'll have a chance to hear Hadley's own thoughts on
the dplyr package.
But before we can do any of that, you'll need to set up your R Session to use
dplyr. Let's get started!
To explore and learn various functions in dplyr, use the hflights dataset from the
hflights package. This dataset comes from the US Bureau of Transportation
Statistics and contains details of 227,496 flights that departed from Houston in
2011.
install.packages("dplyr")
install.packages("hflights")
library(dplyr)
library(hflights)
Tables
A tbl is a wrapper around a data frame that won't accidentally print a lot of data
to the screen.
as_tibble(mtcars)      # tbl_df() is deprecated since dplyr 1.0.0
glimpse(mtcars)
as.data.frame(mtcars)
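#install.packages("tidyverse")
#library(tidyverse)
library(dplyr)
library(data.table)
library(hflights)   # provides the hflights data set used below
dplyrs <- function() {
  print(dim(hflights))
  print(head(hflights))
  print(tail(hflights))
  print(head(hflights, n = 20))
  glimpse(hflights)
  print(carriers)   # carriers is not defined in this excerpt; define it before calling dplyrs()
  #print(head(hflights2, n = 10))
}
dplyrs()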
There are 5 basic verbs in dplyr. The command structure for all dplyr verbs is
similar, and each returns a data frame.
dplyr does not modify the original data frame and does not maintain row numbers.
Over the next set of cards, you will learn how to use each of these verbs:
select(), filter(), arrange(), mutate() and summarise().
The filter function returns all the rows that satisfy a particular condition.
Like SELECT in SQL, the function select is used to choose specific columns of a
data frame.
A colon can be used to select all columns between two specific columns.
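A short sketch of filter() and select() follows. To stay self-contained it uses a small made-up data frame that borrows a few hflights column names; the rows are not real flight data.

```r
library(dplyr)

# Made-up rows with hflights-style column names (illustration only).
flights <- data.frame(
  FlightNum = c(428, 428, 1121, 1121, 55),
  Month     = c(1, 1, 2, 2, 3),
  Distance  = c(224, 224, 1379, 1379, 696),
  AirTime   = c(40, 45, 180, NA, 100)
)

# filter(): keep only the rows that satisfy the condition.
long_flights <- filter(flights, Distance > 500)

# select(): keep only the named columns.
some_cols <- select(flights, FlightNum, Distance)

# A colon selects every column from Month through AirTime.
col_range <- select(flights, Month:AirTime)

print(long_flights)
print(names(col_range))
```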
Besides selecting existing columns, it’s often useful to add new columns that are
functions of existing columns. The mutate() function will help with this.
For Example: Let's add a new column called Distance_Km which will convert distance
in hflights dataset from miles to kilometers and store the data in new data frame
hflights1.
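The conversion described above could be sketched as follows. A tiny made-up data frame stands in for hflights so that the example runs without the hflights package; only the Distance column name is taken from the text.

```r
library(dplyr)

# Made-up stand-in for hflights (illustration only); Distance is in miles.
flights <- data.frame(
  FlightNum = c(428, 1121, 55),
  Distance  = c(224, 1379, 696)
)

# 1 mile = 1.609344 km. mutate() returns a new data frame; flights is unchanged.
hflights1 <- mutate(flights, Distance_Km = Distance * 1.609344)

print(hflights1)
```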
arrange()
The arrange() function takes a data frame, and a set of column names to order by as
input and reorders the rows in the data frame.
If more than one column name is provided, each additional column will be used to
break ties in the values of preceding columns.
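Tie-breaking with a second column can be sketched on a small made-up data frame (illustrative values, hflights-style column names).

```r
library(dplyr)

# Made-up rows (illustration only).
flights <- data.frame(
  Month    = c(2, 1, 2, 1, 3),
  Distance = c(1379, 224, 696, 500, 100)
)

# Sort by Month first; within the same Month, sort by Distance, longest first.
arranged <- arrange(flights, Month, desc(Distance))

print(arranged)
```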
summarise()
The summarise function summarises multiple values into a single value. It is useful
when used in conjunction with the other functions in the dplyr package.
Here, na.rm = TRUE removes all NA values before the mean is calculated. Without
it, a single NA in the column would make the result NA.
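The effect of na.rm can be sketched with a made-up AirTime column that contains one missing value (illustrative data, not from hflights).

```r
library(dplyr)

# Made-up air times in minutes, with one missing value.
flights <- data.frame(AirTime = c(40, 45, 180, NA, 100))

# Without na.rm the missing value propagates into the mean.
with_na <- summarise(flights, mean_air = mean(AirTime))

# With na.rm = TRUE the NA is dropped before averaging.
without_na <- summarise(flights, mean_air = mean(AirTime, na.rm = TRUE))

print(with_na)
print(without_na)
```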
group_by()
The group_by function groups data by one or more variables.
In the example mentioned here, group_by will group the data based on the Month, and
then the summarise function calculates the mean temperature in each month.
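The excerpt does not name the data set behind this example. As an assumption, the sketch below uses R's built-in airquality data, which has Month and Temp columns.

```r
library(dplyr)

# airquality ships with R; using it here is an assumption for illustration.
by_month <- group_by(airquality, Month)

# One output row per month, holding that month's mean temperature.
monthly <- summarise(by_month, mean_temp = mean(Temp, na.rm = TRUE))

print(monthly)
```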
Pipe Operator
The pipe operator in R, represented by %>% can be used to chain code together.
It is very useful when you perform several operations on data, and you do not want
to save the output into a variable at each intermediate step.
hflights %>%
mutate(speed=Distance/AirTime) %>%
group_by (FlightNum) %>%
summarise(avgspd=mean(speed, na.rm = TRUE)) %>%
arrange(desc(avgspd))
This is equivalent to: Multiple Variables Approach
mflights<-mutate(hflights, speed=Distance/AirTime)
sflights<-summarise(group_by(mflights, FlightNum), avgspd=mean(speed, na.rm =
TRUE))
arrange(sflights, desc(avgspd))
OR Nested Approach
arrange(summarise(group_by(mutate( hflights,
speed=Distance/AirTime),FlightNum),avgspd=mean(speed, na.rm = TRUE)),desc(avgspd))
library(dplyr)
library(data.table)
dplyr_verbs <- function() {
  # exercise body not shown in this excerpt
}
dplyr_verbs()
dplyr contains joining functions that enable you to combine two data frames into
one. These functions mimic database joins.
How are we doing, everybody? This is that R nerd back at you with another video.
What we're going to cover today is how to do SQL joins within R, and we're going to
do this using the dplyr package. The one I'm going to load here is tidyverse. This
package comes with dplyr, ggplot and a lot of other really good packages, so I use
it a lot.

We're going to make a couple of data frames, and I'm going to use a little bit of
randomness, so I'll set a seed so that you can follow along and make the same data
frames: set.seed(2018). Then we make data frame 1, df1. Inside it we put a customer
ID, a vector from 1 to 10, so our customer IDs are 1 through 10. We also put in a
product, and for this one we use our bit of randomness with sample(). The first
vector we pass is like the urn, the thing we're going to sample from: the classic
urn. Then 10, and replace = TRUE. So this says we have an urn with a toaster, a TV
and a dishwasher, and we take 10 pulls out of it with replacement: once we pull
something out, we put it back in. If we run just this chunk, we see it made 10
random draws.

Now we make our second data frame, df2. The customer ID is again a sample, this
time with 5 draws from the customer ID column of df1. In other words, we take the
urn that goes from 1 to 10 and pull 5 IDs out of it. We don't set replace = TRUE
here, because we want each ID only once; when it's gone, it's gone. Then we put in
what state they're from with another sample. This little urn holds New York and
California, we do five pulls so that it matches the five customer IDs in the data
frame, and we set replace = TRUE on this one. All of this we wrap in a tbl, which
is something dplyr uses; it's pretty much just a data frame, but it has a lot of
nice properties. So tbl_df(), and we wrap the whole thing in it.

On the first run I had not closed one of the parentheses, so let's run the whole
thing again, starting from the seed. Now we have df1 with customer IDs 1 through 10
and products TV, TV, toaster, toaster, TV, toaster, TV, toaster, dishwasher, TV, and
df2 with customer IDs 4, 5, 6, 7, 8 (what are the odds of that?) and states
California, New York, California, California, California.

Let's do some joins. In all these joins we use df1 as our left table and df2 as our
right table. The first join is an inner join. It returns only what is in both data
frames, matched by whatever key we give it. So I take df1 and pipe it, using %>% as
the pipe, into the inner_join function and give it df2. Then you're supposed to
tell it what you want to merge by, because there could be multiple columns that are
the same and you might want to merge by just one of them, so we set by equal to the
customer ID. When we join, we keep only the customer IDs that are in both data
frames, which are 4, 5, 6, 7, 8, and we get the product from the left and the state
from the right.

Next, a left join. A left join returns everything in the left and the rows with
matching keys in the right. This is honestly pretty similar: take df1, pipe it into
left_join, give it df2, and set by equal to the customer ID again. Now we keep
everything from the left: customer IDs 1 through 10 and all the products. For
state, we pull over what was in the right data frame and match it up, so IDs 4, 5,
6, 7 and 8 get their states filled in.

A right join returns everything in the right and the rows with matching keys in the
left. Again really similar, which is what's nice about dplyr: it's pretty simple.
Take df1, pipe it into right_join with df2, by equal to the customer ID. Now we keep
the customer IDs and the states that were in our right table, and match up by key
the products from our left table.

An outer join returns all rows from both tables and joins matching keys in the
right and left. This one is just a hair different: instead of an outer join, which
is the SQL terminology, we do a full_join with df2, by equal to the customer ID.
Here the result looks exactly like the left join. But if we had, say, a customer ID
11 in our right table that wasn't in our left, it would show up at the bottom, with
the product missing because it isn't in our left table, and with whatever state it
had in the right table. In this case there is nothing extra in the right table, so
nothing extra shows up.

Finally, the last one, which I actually like a lot: the anti join. I find all sorts
of odd reasons to use it. It returns all rows in the left that do not have matching
keys in the right. So it looks at our first table, asks which of these are not in
the right, and returns those. We take df1 and pipe it into anti_join with df2, by
equal to the customer ID. The right table had customer IDs 4, 5, 6, 7, 8, so it
returns the rows whose IDs were not in the right table. Pretty cool.

Anyway, I hope this video was helpful for you. If it was, make sure to press the
like button so other people can see it, and make sure to subscribe for the best R
content available. Have a great day, and thanks for your time.
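The steps described in the video can be sketched as follows. The column names customer_id, product and state are assumed spellings, plain data frames are used instead of the tbl_df() wrapper, and only dplyr is loaded rather than the whole tidyverse. The sampled values depend on the R version's random number generator, so they may differ from those read out in the video.

```r
library(dplyr)

set.seed(2018)

# Left table: ten customers, each with a randomly drawn product.
df1 <- data.frame(
  customer_id = 1:10,
  product = sample(c("toaster", "TV", "dishwasher"), 10, replace = TRUE),
  stringsAsFactors = FALSE
)

# Right table: five distinct customers drawn from df1, each with a state.
df2 <- data.frame(
  customer_id = sample(df1$customer_id, 5),
  state = sample(c("New York", "California"), 5, replace = TRUE),
  stringsAsFactors = FALSE
)

inner <- df1 %>% inner_join(df2, by = "customer_id")  # keys present in both tables
left  <- df1 %>% left_join(df2, by = "customer_id")   # all of df1, state where matched
right <- df1 %>% right_join(df2, by = "customer_id")  # all of df2, product where matched
full  <- df1 %>% full_join(df2, by = "customer_id")   # all rows from both tables
anti  <- df1 %>% anti_join(df2, by = "customer_id")   # df1 rows with no match in df2

print(inner)
print(anti)
```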
Filtering Joins
Filtering joins filter observations from one table based on whether they match an
observation in the other table. They never add columns from the other table.
There are two types of filtering joins. For example, consider two datasets x and y.
semi_join()
This returns all rows from x where there are matching values in y.
semi_join() never duplicates rows of x, even if a row of x matches several rows of y.
anti_join()
This returns all rows from x where there are no matching values in y.
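A small sketch shows how the filtering joins differ from an inner join when y has repeated keys. The two data frames are made up for illustration.

```r
library(dplyr)

# Made-up tables; id 2 appears twice in y.
x <- data.frame(id = c(1, 2, 3, 4), value = c("a", "b", "c", "d"))
y <- data.frame(id = c(2, 2, 4, 5))

semi  <- semi_join(x, y, by = "id")   # rows of x with a match in y, each kept once
inner <- inner_join(x, y, by = "id")  # for comparison: id 2 appears twice here
anti  <- anti_join(x, y, by = "id")   # rows of x with no match in y

print(semi)
print(anti)
```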
dplyr provides versions of the set operations that work on data frames and
tables, treating each row as one element. They mask the set functions of the same
name in base R.
intersect(x, y, ...)
union(x, y, ...)
union_all(x, y, ...)
setdiff(x, y, ...)
print(intersect(first, second))
print(union(first, second))
print(union_all(first, second))
print(setdiff(first, second))
sets()
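The excerpt does not show how first and second are defined. Assuming two small data frames with the same columns, the four set operations can be sketched like this.

```r
library(dplyr)

# Assumed inputs for illustration: same columns, two rows in common.
first  <- data.frame(id = 1:3, grp = c("a", "b", "c"))
second <- data.frame(id = 2:4, grp = c("b", "c", "d"))

common     <- intersect(first, second)  # rows present in both
either     <- union(first, second)      # rows in either, duplicates removed
stacked    <- union_all(first, second)  # rows in either, duplicates kept
only_first <- setdiff(first, second)    # rows in first but not in second

print(common)
print(only_first)
```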