Useful R Packages
Garrett Grolemund
January 10, 2018 17:49
Recommended Packages
Many useful R functions come in packages, free libraries of code written by R's active user community. To install an R package, open an R
session and type install.packages("packagename") at the command line.
R will download the package from CRAN, so you'll need to be connected to the internet. Once you have a package installed, you can make
its contents available to use in your current R session by running library("packagename").
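As a minimal sketch of those two steps (dplyr is used here only as a stand-in for any package name):

```r
# Install a package from CRAN (requires an internet connection)
install.packages("dplyr")

# Load the installed package into the current R session
library("dplyr")
```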
There are thousands of helpful R packages for you to use, but navigating them all can be a challenge. To help you out, we've compiled this
guide to some of the best. We've used each of these and found them to be outstanding – we've even written some of them. But you don't
have to take our word for it: these packages are also among the most downloaded R packages.
To load data
RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages are a good place to start. Choose the
package that fits your type of database.
XLConnect, xlsx - These packages help you read and write Microsoft Excel files from R. You can also just export your spreadsheets from
Excel as .csv's.
foreign - Want to read a SAS data set into R? Or an SPSS data set? Foreign provides functions that help you load data files from other
programs into R.
R can handle plain text files – no package required. Just use the functions read.csv, read.table, and read.fwf. If you have even more exotic
data, consult the CRAN guide to data import and export.
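For example, reading a plain-text CSV needs only base R. This small sketch writes a throwaway file first so it runs anywhere (the data values are invented for illustration):

```r
# Write a tiny CSV to a temporary file, then read it back with base R
tf <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,90", "bob,85"), tf)

dat <- read.csv(tf)  # a data frame with 2 rows and 2 columns
print(dat)
```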
To manipulate data
dplyr - Essential shortcuts for subsetting, summarizing, rearranging, and joining together data sets. dplyr is our go-to package for fast data
manipulation.
tidyr - Tools for changing the layout of your data sets. Use the gather and spread functions to convert your data into the tidy format, the
layout R likes best.
stringr - Easy to learn tools for regular expressions and character strings.
lubridate - Tools that make working with dates and times easier.
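A short sketch of how two of these manipulation packages fit together (assumes dplyr and tidyr are installed; the data and column names are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Wide data: one column per year
wide <- data.frame(country = c("A", "B"), y2019 = c(1, 3), y2020 = c(2, 4))

# tidyr::gather converts wide data into the long, tidy layout
long <- gather(wide, key = "year", value = "value", -country)

# dplyr verbs then summarize by group
long %>% group_by(country) %>% summarize(total = sum(value))
```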
To visualize data
ggplot2 - R's famous package for making beautiful graphics. ggplot2 lets you use the grammar of graphics to build layered, customizable plots.
ggvis - Interactive, web based graphics built with the grammar of graphics.
htmlwidgets - A fast way to build interactive (javascript based) visualizations with R. Packages that implement htmlwidgets include:
leaflet (maps)
dygraphs (time series)
DT (tables)
DiagrammeR (diagrams)
networkD3 (network graphs)
threejs (3D scatterplots and globes).
googleVis - Lets you use Google Chart tools to visualize data in R. Google Chart tools grew out of Trendalyzer, the graphing software from
the Gapminder Foundation that Hans Rosling made famous in his TED talk.
To model data
car - car's Anova function is popular for making type II and type III ANOVA tables.
To report results
shiny - Easily make interactive web apps with R. A perfect way to explore data and share findings with non-programmers.
R Markdown - The perfect workflow for reproducible reporting. Write R code in your markdown reports. When you run render, R
Markdown will replace the code with its results and then export your report as an HTML, PDF, or MS Word document, or an HTML or PDF
slideshow. The result? Automated reporting. R Markdown is integrated straight into RStudio.
xtable - The xtable function takes an R object (like a data frame) and returns the LaTeX or HTML code you need to paste a pretty version of
the object into your documents. Copy and paste, or pair up with R Markdown.
ggmap - Download street maps straight from Google maps and use them as a background in your ggplots.
xts - Very flexible tools for manipulating time series data sets.
quantmod - Tools for downloading financial data, plotting common charts, and doing technical analysis.
data.table - An alternative way to organize data sets for very, very fast operations. Useful for big data.
parallel - Use parallel processing in R to speed up your code or to crunch large data sets.
testthat - testthat provides an easy way to write unit tests for your code projects.
roxygen2 - A quick way to document your R packages. roxygen2 turns inline code comments into documentation pages and builds a
package namespace.
You can also read about the entire package development process online in Hadley Wickham's R Packages book.
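The R Markdown workflow described above boils down to a single render call (a sketch; it assumes the rmarkdown package is installed and that report.Rmd is a hypothetical file in your working directory):

```r
# Run the R code in report.Rmd, replace it with its results,
# and export the report as an HTML document
rmarkdown::render("report.Rmd", output_format = "html_document")
```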
This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.
ip <- as.data.frame(installed.packages()[, c(1, 3:4)])  # keep Package, Version, Priority columns
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]  # drop base and recommended packages
print(ip, row.names=FALSE)
Example output
Package Version
bitops 1.0-6
BradleyTerry2 1.0-6
brew 1.0-6
brglm 0.5-9
car 2.0-25
caret 6.0-47
coin 1.0-24
colorspace 1.2-6
crayon 1.2.1
devtools 1.8.0
dichromat 2.0-0
digest 0.6.8
earth 4.4.0
evaluate 0.7
[..snip..]
This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R
packages after an R upgrade.
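A sketch of that reinstall step (run on the upgraded R installation; requires an internet connection; ip is rebuilt here so the snippet runs standalone):

```r
# Rebuild the list of user-installed packages, as in the snippet above
ip <- as.data.frame(installed.packages()[, c(1, 3:4)])
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]

# Reinstall every user-installed package recorded in the first column
install.packages(as.character(ip$Package))
```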
Usage
installed.packages(lib.loc = NULL, priority = NULL,
noCache = FALSE, fields = NULL,
subarch = .Platform$r_arch, ...)
Arguments
lib.loc character vector describing the location of R library trees to search through, or NULL for all known trees (see .libPaths).
priority character vector or NULL (default). If non-null, used to select packages; "high" is equivalent to c("base", "recommended"). To
select all packages without an assigned priority use priority = "NA".
noCache Do not use cached information, nor cache it.
fields a character vector giving the fields to extract from each package's DESCRIPTION file in addition to the default ones,
or NULL (default). Unavailable fields result in NA values.
subarch character string or NULL. If non-null and non-empty, used to select packages which are installed for that sub-architecture.
... allows unused arguments to be passed down from other functions.
Details
installed.packages scans the ‘DESCRIPTION’ files of each package found along lib.loc and returns a matrix of package names,
library paths and version numbers.
The information found is cached (by library) for the R session and specified fields argument, and updated only if the top-level library
directory has been altered, for example by installing or removing a package. If the cached information becomes confused, it can be
avoided by specifying noCache = TRUE.
Value
A matrix with one row per package, row names the package names and column names
(currently) "Package", "LibPath", "Version", "Priority", "Depends", "Imports", "LinkingTo", "Suggests", "Enhances", "OS_type", "License"
and "Built" (the R version the package was built under). Additional columns can be specified using the fields argument.
Note
This needs to read several files per installed package, which will be slow on Windows and on some network-mounted file systems.
It will be slow when thousands of packages are installed, so do not use it to find out if a named package is installed
(use find.package or system.file), nor to find out if a package is usable (call requireNamespace or require and check the return value), nor to find
details of a small number of packages (use packageDescription).
See Also
update.packages, install.packages, INSTALL, REMOVE.
Examples
## confine search to .Library for speed
str(ip <- installed.packages(.Library, priority = "high"))
ip[, c(1,3:5)]
plic <- installed.packages(.Library, priority = "high", fields = "License")
## what licenses are there:
table( plic[, "License"] )
Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The
table below shows my favorite go-to packages for these three tasks (plus a few miscellaneous ones tossed in). The package names in
the table are clickable if you want more information. To find out more about a package once you've installed it, type help(package =
"packagename") in your R console (of course substituting the actual package name).
devtools (package development, package installation) - While devtools is aimed at helping you create your own R packages, it's also essential if you want to easily install other packages from GitHub. Install it! Requires Rtools on Windows and Xcode on a Mac. On CRAN. Sample use: install_github("rstudio/leaflet"). Author: Hadley Wickham & others.
remotes (package installation) - If all you want is to install packages from GitHub, devtools may be a bit of a heavyweight. remotes will easily install from GitHub as well as Bitbucket and some others. On CRAN. (ghit is another option, but is GitHub-only.) Sample use: remotes::install_github("mangothecat/franc"). Author: Gabor Csardi & others.
installr (misc) - Windows only: update your installed version of R from within R. On CRAN. Sample use: updateR(). Author: Tal Galili & others.
reinstallr (misc) - Seeks to find packages that had previously been installed on your system and need to be re-installed after upgrading R. On CRAN. Sample use: reinstallr(). Author: Calli Gross.
readxl (data import) - Fast way to read Excel files in R, without dependencies such as Java. On CRAN. Sample use: read_excel("my-spreadsheet.xls", sheet = 1). Author: Hadley Wickham.
googlesheets (data import, data export) - Easily read data into R from Google Sheets. On CRAN. Sample use: mysheet <- gs_title("Google Spreadsheet Title"); mydata <- gs_read(mysheet, ws = "WorksheetTitle"). Author: Jennifer Bryan.
readr (data import) - Base R handles most of these functions, but if you have huge files, this is a speedy and standardized way to read tabular files such as CSVs into R data frames, as well as plain text files into character strings with read_file. On CRAN. Sample use: read_csv("myfile.csv"). Author: Hadley Wickham.
rio (data import, data export) - rio has a good idea: pull a lot of separate data-reading packages into one, so you just need to remember 2 functions: import and export. On CRAN. Sample use: import("myfile"). Author: Thomas J. Leeper & others.
Hmisc (data analysis) - There are a number of useful functions in here. Two of my favorites: describe, a more robust summary function, and Cs, which creates a vector of quoted character strings from unquoted comma-separated text. Cs(so, it, goes) creates c("so", "it", "goes"). On CRAN. Sample use: describe(mydf); Cs(so, it, goes). Author: Frank E Harrell Jr & others.
datapasta (data import) - Data copy and paste: meet reproducible research. If you've copied data from the Web, a spreadsheet, or another source into your clipboard, datapasta lets you paste it into R as an R object, with the code to reproduce it. It includes RStudio add-ins as well as command-line functions for transposing data, turning it into markdown format, and more. On CRAN. Sample use: df_paste() to create a data frame, vector_paste() to create a vector. Author: Miles McBain.
sqldf (data wrangling, data analysis) - Do you know a great SQL query you'd use if your R data frame were in a SQL database? Run SQL queries on your data frame with sqldf. On CRAN. Sample use: sqldf("select * from mydf where mycol > 4"). Author: G. Grothendieck.
jsonlite (data import, data wrangling) - Parse JSON within R or turn R data frames into JSON. On CRAN. Sample use: myjson <- toJSON(mydf, pretty=TRUE); mydf2 <- fromJSON(myjson). Author: Jeroen Ooms & others.
XML (data import, data wrangling) - Many functions for elegantly dealing with XML and HTML, such as readHTMLTable. On CRAN. Sample use: mytables <- readHTMLTable(myurl). Author: Duncan Temple Lang.
httr (data import, data wrangling) - An R interface to HTTP protocols; useful for pulling data from APIs. See the httr quickstart guide. On CRAN. Sample use: r <- GET("http://httpbin.org/get"); content(r, "text"). Author: Hadley Wickham.
quantmod (data import, data visualization, data analysis) - Even if you're not interested in analyzing and graphing financial investment data, quantmod has easy-to-use functions for importing economic as well as financial data from sources like the Federal Reserve. On CRAN. Sample use: getSymbols("AITINO", src="FRED"). Author: Jeffrey A. Ryan.
tidyquant (data import, data visualization, data analysis) - Another financial package that's useful for importing, analyzing and visualizing data, integrating aspects of other popular finance packages as well as tidyverse tools. With thorough documentation. On CRAN. Sample use: aapl_key_ratios <- tq_get("AAPL", get = "key.ratios"). Author: Matt Dancho.
rvest (data import, web scraping) - Web scraping: extract data from HTML pages. Inspired by Python's Beautiful Soup. Works well with SelectorGadget. On CRAN. Sample use: see the package vignette. Author: Hadley Wickham.
dplyr (data wrangling, data analysis) - The essential data-munging R package when working with data frames. Especially useful for operating on data by categories. On CRAN. Sample use: see the intro vignette. Author: Hadley Wickham.
purrr (data wrangling) - purrr is a relatively new package aimed at replacing plyr and some base R apply functions. It's more complex to learn but also has more functionality. On CRAN. Sample use: map_df(mylist, myfunction). More: Charlotte Wickham's purrr tutorial video, the purrr cheat sheet PDF download. Author: Hadley Wickham.
reshape2 (data wrangling) - Change data row and column formats from "wide" to "long"; turn variables into column names or column names into variables, and more. The tidyr package is a newer, more focused option, but I still use reshape2. On CRAN. Sample use: see my tutorial. Author: Hadley Wickham.
tidyr (data wrangling) - While I still prefer reshape2 for general re-arranging, tidyr won me over with specialized functions like fill (fill in missing columns from data above) and replace_na. On CRAN. Sample use: see examples in this blog post. Author: Hadley Wickham.
splitstackshape (data wrangling) - It's rare that I'd recommend a package that hasn't been updated in years, but the cSplit() function solves a rather complex shaping problem in an astonishingly easy way. If you have a data frame column with one or more comma-separated values (think a survey question with "select all that apply"), this is worth an install if you want to separate each item into its own new data frame row. On CRAN. Sample use: cSplit(mydata, "multi_val_column", sep = ",", direction = "long"). Author: Ananda Mahto.
magrittr (data wrangling) - This package gave us the %>% symbol for chaining R operations, but it's got other useful operators such as %<>% for mutating a data frame in place and . as a placeholder for the original object being operated upon. On CRAN. Sample use: mydf %<>% mutate(newcol = myfun(colname)). Author: Stefan Milton Bache & Hadley Wickham.
validate (data wrangling) - Intuitive data validation based on rules you can define, save and re-use. On CRAN. Sample use: see the introductory vignette. Author: Mark van der Loo & Edwin de Jonge.
testthat (programming) - Package that makes it easy to write unit tests for your R code. On CRAN. Sample use: see the testing chapter of Hadley Wickham's book on R packages. Author: Hadley Wickham.
data.table (data wrangling, data analysis) - Popular package for heavy-duty data wrangling. While I typically prefer dplyr, data.table has [...]. Sample use: see this useful tutorial. Author: Matt Dowle & others.
stringr (data wrangling) - Numerous functions for text manipulation. Some are similar to existing base R functions but in a more standard format, including working with regular expressions. Some of my favorites: str_pad and str_trim. On CRAN. Sample use: str_pad(myzipcodevector, 5, "left", "0"). Author: Hadley Wickham.
lubridate (data wrangling) - Everything you ever wanted to do with date arithmetic, although understanding & using the available functionality can be somewhat complex. On CRAN. Sample use: mdy("05/06/2015") + months(1); more examples in the package vignette. Author: Garrett Grolemund, Hadley Wickham & others.
zoo (data wrangling, data analysis) - Robust package with a slew of functions for dealing with time series data; I like the handy rollmean function with its align=right and fill=NA options for calculating moving averages. On CRAN. Sample use: rollmean(mydf, 7). Author: Achim Zeileis & others.
editR (data display) - Interactive editor for R Markdown documents. Note that R Markdown Notebooks are another useful way to generate Markdown interactively. editR is on GitHub. Sample use: editR("path/to/myfile.Rmd"). Author: Simon Garnier.
knitr (data display) - Add R to a markdown document and easily generate reports in HTML, Word and other formats. A must-have if you're interested in reproducible research and automating the journey from data analysis to report creation. On CRAN. Sample use: see the Minimal Examples page. Author: Yihui Xie & others.
officer (data display) - Import and edit Microsoft Word and PowerPoint documents, making it easy to add R-generated analysis and visualizations to existing as well as new reports and presentations. On CRAN. Sample use: my_doc <- read_docx() %>% body_add_img(src = myplot); the package website has many more examples. Author: David Gohel.
listviewer (data display, data wrangling) - While RStudio has since added a list-viewing option, this HTML widget still offers an elegant way to view complex nested lists within R. GitHub timelyportfolio/listviewer. Sample use: jsonedit(mylist). Author: Kent Russell.
DT (data display) - Create a sortable, searchable table in one line of code with this R interface to the jQuery DataTables plug-in. GitHub rstudio/DT. Sample use: datatable(mydf). Author: RStudio.
ggplot2 (data visualization) - Powerful, flexible and well-thought-out dataviz package following 'grammar of graphics' syntax to create static graphics, but be prepared for a steep learning curve. On CRAN. Sample use: qplot(factor(myfactor), data=mydf, geom="bar", fill=factor(myfactor)); see my searchable ggplot2 cheat sheet and time-saving code snippets. Author: Hadley Wickham.
patchwork (data visualization) - Easily combine ggplot2 plots and keep the new, merged plot a ggplot2 object. plot_layout() adds the ability to set columns, rows, and relative sizes of each component graphic. On GitHub. Sample use: plot1 + plot2 + plot_layout(ncol=1). Author: Thomas Lin Pedersen.
ggiraph (data visualization) - Make ggplot2 plots interactive with this extension's new geom functions such as geom_bar_interactive and arguments for tooltips and JavaScript onclicks. On CRAN. Sample use: g <- ggplot(mpg, aes(x = displ, y = cty, color = drv)); my_gg <- g + geom_point_interactive(aes(tooltip = model), size = 2); ggiraph(code = print(my_gg), width = .7). Author: David Gohel.
dygraphs (data visualization) - Create HTML/JavaScript graphs of time series - a one-line command if your data is an xts object. On CRAN. Sample use: dygraph(myxtsobject). Author: JJ Allaire & RStudio.
googleVis (data visualization) - Tap into the Google Charts API using R. On CRAN. Sample use: mychart <- gvisColumnChart(mydata); plot(mychart). Numerous examples here. Author: Markus Gesmann & others.
metricsgraphics (data visualization) - R interface to the metricsgraphics JavaScript library for bare-bones line, scatterplot and bar charts. GitHub hrbrmstr/metricsgraphics. Sample use: see the package intro. Author: Bob Rudis.
RColorBrewer (data visualization) - Not a designer? RColorBrewer helps you select color palettes for your visualizations. On CRAN. Sample use: see Jennifer Bryan's tutorial. Author: Erich Neuwirth.
sf (mapping, data wrangling) - This package makes it much easier to do GIS work in R. Simple features protocols make geospatial data look a lot like regular data frames, while various functions allow for analysis such as determining whether points are in a polygon. A GIS game-changer for R. On CRAN. Sample use: see the package vignettes, starting with the introduction, Simple Features for R. Author: Edzer Pebesma & others.
leaflet (mapping) - Map data using the Leaflet JavaScript library within R. GitHub rstudio/leaflet. Sample use: see my tutorial. Author: RStudio.
ggmap (mapping) - Although I don't use this package often for its main purpose of pulling down background map tiles, it's my go-to for geocoding up to 2,500 addresses with the Google Maps API with its geocode and mutate_geocode functions. On CRAN. Sample use: geocode("492 Old Connecticut Path, Framingham, MA"). Author: David Kahle & Hadley Wickham.
tmap & tmaptools (mapping) - These packages offer an easy way to read in shape files and join data files with geographic info, as well as do some exploratory mapping. Recent functionality adds support for simple features, interactive maps and creating leaflet objects. Plus, tmaptools::palette_explorer() is a great tool for picking ColorBrewer palettes. On CRAN. Sample use: see the package vignette or my mapping in R tutorial. Author: Martijn Tennekes.
mapsapi (mapping, data wrangling) - This interface to the Google Maps Direction and Distance Matrix APIs lets you analyze and map distances and driving routes. On CRAN. Sample use: google_directions(origin = c(my_longitude, my_latitude), destination = c(my_address), alternatives = TRUE); also see the vignette. Author: Michael Dorman.
tidycensus (mapping, data wrangling) - Want to analyze and map U.S. Census Bureau data from 5-year American Community Surveys or 10-year censuses? This makes it easy to download numerical and geospatial info in R-ready format. On CRAN. Sample use: see Basic usage of tidycensus. Author: Kyle E. Walker.
glue (data wrangling) - The main function, also called glue, evaluates variables and R expressions within a quoted string, as long as they're enclosed by {} braces. This makes for an elegant paste() replacement. On CRAN. Sample use: glue("Today is {Sys.Date()}"). Author: Jim Hester.
rga (Web analytics) - Use Google Analytics with R. GitHub skardhamar/rga. Sample use: see the package README file and my tutorial. Author: Bror Skardhamar.
RSiteCatalyst (Web analytics) - Use Adobe Analytics with R. GitHub randyzwitch/RSiteCatalyst. Sample use: see the intro video. Author: Randy Zwitch.
roxygen2 (package development) - Useful tools for documenting functions within R packages. On CRAN. Sample use: see this short, easy-to-read blog post on writing R packages. Author: Hadley Wickham & others.
shiny (data visualization) - Turn R data into interactive Web applications. I've seen some nice (if sometimes sluggish) apps and it's got many enthusiasts. On CRAN. Sample use: see the tutorial. Author: RStudio.
flexdashboard (data visualization) - If Shiny is too complex and involved for your needs, this package offers a simpler (if somewhat less robust) solution based on R Markdown. On CRAN. Sample use: more info in Using flexdashboard. Author: JJ Allaire, RStudio & others.
openxlsx (misc) - If you need to write to an Excel file as well as read, this package is easy to use. On CRAN. Sample use: write.xlsx(mydf, "myfile.xlsx"). Author: Alexander Walker.
gmodels (data wrangling, data analysis) - There are several functions for modeling data here, but the one I use, CrossTable, simply creates cross-tabs with loads of options -- totals, proportions and several statistical tests. On CRAN. Sample use: CrossTable(myxvector, myyvector, prop.t=FALSE, prop.chisq = FALSE). Author: Gregory R. Warnes.
janitor (data wrangling, data analysis) - Basic data cleaning made easy, such as finding duplicates by multiple columns, making R-friendly column names and removing empty columns. It also has some nice tabulating tools, like adding a total row, as well as generating tables with percentages and easy crosstabs. On CRAN. Sample use: tabyl(mydf, sort = TRUE) %>% adorn_totals("row"). Author: Samuel Firke.
car (data wrangling) - car's recode function makes it easy to bin continuous numerical data into categories or factors. While base R's cut accomplishes the same task, I find recode's syntax to be more intuitive - just remember to put the entire recoding formula within double quotation marks. dplyr's case_when() function is another option worth considering. On CRAN. Sample use: recode(x, "1:3='Low'; 4:7='Mid'; 8:hi='High'"). Author: John Fox & others.
rcdimple (data visualization) - R interface to the dimple JavaScript library with numerous customization options. A good choice for JavaScript bar charts, among others. GitHub timelyportfolio/rcdimple. Sample use: dimple(mtcars, mpg ~ cyl, type = "bar"). Author: Kent Russell.
foreach (data wrangling) - Efficient - and intuitive if you come from another programming language - for loops in R. On CRAN. Sample use: foreach(i=1:3) %do% sqrt(i); also see The Wonders of foreach. Author: Revolution Analytics, Steve Weston.
scales (data wrangling) - While this package has many more sophisticated ways to help you format data for graphing, it's worth a download just for the comma(), percent() and dollar() functions. On CRAN. Sample use: comma(mynumvec). Author: Hadley Wickham.
plotly (data visualization) - R interface to the Plotly JavaScript library, which was open-sourced in late 2015. Basic graphs have a distinctive look which may not be for everyone, but it's full-featured, relatively easy to learn (especially if you know ggplot2) and includes a ggplotly() function to turn graphs created with ggplot2 interactive. On CRAN. Sample use: d <- diamonds[sample(nrow(diamonds), 1000), ]; plot_ly(d, x = carat, y = price, text = paste("Clarity: ", clarity), mode = "markers", color = carat, size = carat). Author: Carson Sievert & others.
highcharter (data visualization) - R wrapper for the robust and well-documented Highcharts JavaScript library, one of my favorite choices for presentation-quality interactive graphics. The package uses ggplot2-like syntax, including options for handling both long and wide data, and comes with plenty of examples. Note that a paid Highcharts license is needed to use this for commercial or government work (it's free for personal and non-profit projects). On CRAN. Sample use: hchart(mydf, "charttype", hcaes(x = xcol, y = ycol, group = groupbycol)). Author: Joshua Kunst & others.
profvis (programming) - Is your R code sluggish? This package gives you a visual representation of your code line by line so you can find the speed bottlenecks. On CRAN. Sample use: profvis({ your code here }). Author: Winston Chang & others.
tidytext (text mining) - Elegant implementation of text mining functions using Hadley Wickham's "tidy data" principles. On CRAN. Sample use: see tidytextmining.com for numerous examples. Author: Julia Silge & David Robinson.
diffobj (data analysis) - Base R's identical() function tells you whether or not two objects are the same, but if they're not, it won't tell you why. diffobj gives you a visual representation of how two R objects differ. On CRAN. Sample use: diffObj(x,y). Author: Brodie Gaslam & Michael B. Allen.
prophet (forecasting) - I don't do much forecasting analysis, but if I did, I'd start with this package. On CRAN. Sample use: see the Quick start guide. Author: Sean Taylor & Ben Letham at Facebook.
feather (data import, data export) - This binary data-file format can be read by both Python and R, making data interchange easier between the two languages. It's also built for I/O speed. On CRAN. Sample use: write_feather(mydf, "myfile"). Author: Wes McKinney & Hadley Wickham.
fst (data import, data export) - Another alternative for binary file storage (R-only), fst was built for fast storage and retrieval, with access speeds above 1 GB/sec. It also offers compression that doesn't slow data access too much, as well as the ability to import a specific range of rows (by row number). On CRAN. Sample use: write.fst(mydf, "myfile.fst", 100). Author: Mark Klik.
googleAuthR (data import) - If you want to use data from a Google API in an R project and there's not yet a specific package for that API, this is the place to turn for authenticating. On CRAN. Sample use: see examples on the package website and this gist for use with Google Calendars. Author: Mark Edmondson.
here (misc) - This package has one function with a single, useful purpose: find your project's working directory. Surprisingly helpful if you want your code to run on more than one system. On CRAN. Sample use: my_project_directory <- here(). Author: Kirill Müller.
pacman (misc) - This package is another that aims to solve one problem, and solve it well: package installation. The main function will load a package that's already installed or install it first if it's not available. While this is certainly possible to do with base R's require() and an if statement, p_load() is so much more elegant for CRAN packages, or p_load_gh() for GitHub. Other useful options include p_temp(), which allows [...] Sample use: p_load(dplyr, here, tidycensus). Author: Tyler Rinker.
cloudyR project (data import, data export) - This is a collection of packages aimed at making it easier for R to work with cloud platforms such as Amazon Web Services, Google and Travis-CI. Some are already on CRAN, some can be found on GitHub. Sample use: see the list of packages. Author: Various.
To install a package from CRAN, use the command install.packages("packagename") -- of course substituting the actual package name
for packagename and putting it in quotation marks. Package names, like pretty much everything else in R, are case sensitive.
To install from GitHub, it's easiest to use the install_github function from the devtools package, using the
format devtools::install_github("githubaccountname/packagename"). That means you first want to install the devtools package on your
system with install.packages("devtools"). Note that devtools sometimes needs some extra non-R software on your system -- more
specifically, an Rtools download for Windows or Xcode for OS X. There's more information about devtools here.
In order to use a package's functions during your R session, you need to do one of two things. One option is to load it into your R session
with library("packagename") or require("packagename"). The other is to call the function including the package name, like
this: packagename::functionname(). Package names, like pretty much everything else in R, are case sensitive.
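A quick illustration of that second option, using only base R so there is nothing extra to install:

```r
# Call a function with its package prefix, without library() or require()
x <- stats::median(c(1, 2, 3))
print(x)  # 2
```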
List of useful packages (libraries) for Data Analysis in R
Introduction
R offers multiple packages for performing data analysis. Apart from providing an awesome interface for statistical analysis, the next best thing
about R is the endless support it gets from developers and data science maestros from all over the world. The current count of downloadable
packages on CRAN stands close to 7,000!
Beyond some of the popular packages such as caret, ggplot2, dplyr and lattice, there exist many more libraries which go unnoticed but
prove to be very handy at certain stages of analysis. So, we created a comprehensive list of all packages in R.
1. Mapped use of each of these libraries to the stage they generally get used at – Pre-Modeling, Modeling and Post-Modeling.
2. Created a handy infographic with the most commonly used libraries. Analysts can just print this out and keep handy for reference. The
graphic is displayed below:
Here is a complete guide to powerful R packages, categorized into the various stages of the data analysis process. Download Here.
AWESOME R
Here is a list of packages that I have used and found to be very useful and powerful. Some of these packages are often used by
Kagglers; a few of them played a key role in getting a top 10 ranking in Kaggle competitions.
sqldf [We use it for selecting from data frames using SQL]
data.table [Very well known extension of data.frame]
foreach [Useful for those who want to use a foreach looping construct in R]
Matrix [Mainly useful for working with sparse and dense matrix classes and methods]
forecast [For easy forecasting of time series]
plyr [The best tools for splitting, applying and combining data]
stringr [Really helpful for string manipulation]
Database connection packages: RPostgreSQL, RMongo, RODBC, RSQLite
lubridate [Data scientists mainly use it for easy time and date manipulation]
ggplot2 [One of the most famous and strongest packages for data visualization and exploratory data analysis]
qcc [Mainly used for statistical quality control and QC charts]
reshape2 [You can use this package to restructure data very easily]
randomForest [A very well known package in the data science community for building random forest predictive models]
gbm [Provides Gradient Boosting Machines]
e1071 [One of the best packages I have ever used; mainly used for building Support Vector Machines]
caret [Mainly useful for Classification and Regression Training]
glmnet [Provides Lasso and Elastic-Net Regularized Generalized Linear Models]
tau [Very good for text analysis utilities]
SOAR [If you want memory management in R by delayed assignments, this is the package you are looking for]
doMC [A foreach parallel adaptor for the multicore package]
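As one sketch of how a package from this list slots into a modeling workflow (randomForest shown; assumes the package is installed, and uses R's built-in iris data):

```r
library(randomForest)

set.seed(42)  # for reproducible trees

# Fit a random forest classifier predicting species from the other columns
model <- randomForest(Species ~ ., data = iris, ntree = 100)

print(model)  # shows the out-of-bag error estimate and confusion matrix
```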
The CRAN Package repository features 6778 active packages. Which of these should you know? Here is an analysis of the daily download logs of the
CRAN mirror from Jan-May 2015. See a link to full data at the bottom of the post.
They are also rated and reviewed by users on Crantastic.org, a crowdsourced solution. We also present the Crantastic ratings of a few packages here,
but only to show that they are gaining popularity. However, some of these packages have too few user ratings to base an analysis on,
so for those we omitted the ratings.
All the top 20 (Jan-May 2015) above are covered in computerworld.com's top R package ranking for April.
For completeness, here is data on 135 R package downloads, from Jan to May 2015.
Did we miss your favorites? Light up this space and contribute to the R community by letting us know which R packages you use!
Bio: Bhavya Geethika is pursuing a masters in Management Information Systems at University of Illinois at Chicago. Her areas of interests
include Statistics & Data Mining for Business, Machine learning and Data-Driven Marketing.
By Ashish
Although there is an abundance of such data in both print and electronic format, it is mostly either buried deep in voluminous books or scattered across long threaded conversations.
I think it will be appropriate to “cluster” all such useful packages as used in two popular data mining languages R and Python in a single thread.
Is data exploration your objective? The pandas package in Python is very powerful and extremely flexible, but it is equally challenging to learn. Similarly, the dplyr package
in R can be used for the same purpose.
Is data visualization your objective? If so, then in R, ggplot2 is an excellent package for data visualization. Similarly, you can use ggplot for Python graphics.
And finally, just as the CRAN-R project is a single repository for R packages, the Anaconda distribution for Python has a similar package management system.
Knowing how to USE the top 10 data mining algorithms in R is even more awesome.
That’s when you can slap a big ol’ “S” on your chest…
Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3
separate panels in this survey paper.
UPDATE 18-Jun-2015: Thanks to Albert for creating the image above!
UPDATE 22-Jun-2015: Thanks to Ulf for the fantastic feedback which I’ve included below.
Getting Started
C5.0
k-means
Support Vector Machines
Apriori
EM
PageRank
AdaBoost
kNN
Naive Bayes
CART
You Can Totally Do This!
Getting Started
First, what is R?
R is both a language and environment for statistical computing and graphics. It’s a powerful suite of software for data manipulation,
calculation and graphical display.
1. R has a fantastic community of bloggers, mailing lists, forums, a Stack Overflow tag and that’s just for starters.
2. The real kicker is R's awesome repository of packages over at CRAN. A package includes reusable R code, the documentation that describes how
to use it and even sample data.
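Getting any CRAN package onto your machine takes two calls; for example, with dplyr (one of the packages used later in this post):

```r
install.packages("dplyr")  # download the package from CRAN (internet required)
library(dplyr)             # make its functions available in the current session
```

install.packages() only has to run once per machine; library() runs once per R session.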
It’s a great environment for manipulating data, but if you’re on the fence between R and Python, lots of folks have compared them.
1. Install R
2. Install RStudio
The next step is to couple R with knitr…
Using knitr to learn data mining is an odd pairing, but it’s also incredibly powerful.
1. It’s a perfect match for learning R. I’m not sure if anyone else is doing this, but knitr lets you experiment and see a reproducible document of what
you’ve learned and accomplished. What better way to learn, teach and grow?
2. Yihui (the author of knitr) is super on top of maintaining, enhancing and making knitr awesome.
3. knitr is light-weight and comes with RStudio!
Don’t wait!
1. In RStudio, create a new R Markdown document by clicking File > New File > R Markdown…
2. Set the Title to a meaningful name.
3. Click OK.
4. Delete the text after the second set of --- .
5. Click Knit HTML .
Your R Markdown code should look like this:
```
---
title: "Your Title"
output: html_document
---
```
After “knitting” your document, you should see something like this in the Viewer pane:
A few prerequisites
You’ll be installing these package pre-reqs:
adabag
arules
C50
dplyr
e1071
igraph
mclust
One final package prerequisite is printr, which is currently experimental (but I think is fantastic!). We're including it here to generate better
tables from the code below:

```r
install.packages(c('adabag', 'arules', 'C50', 'dplyr', 'e1071', 'igraph', 'mclust'))
install.packages('printr',
                 type = 'source',
                 repos = c('http://yihui.name/xran', 'http://cran.rstudio.com'))
```
C5.0
Wait… what happened to C4.5? C5.0 is the successor to C4.5 (one of the original top 10 algorithms). The author of C4.5/C5.0 claims the
successor is faster, more accurate and more robust.
Ok, so what are we doing? We’re going to train C5.0 to recognize 3 different species of irises. Once C5.0 is trained, we’ll test it with some
data it hasn’t seen before to see how accurately it “learned” the characteristics of each species.
Hang on, what’s iris? The iris dataset comes with R by default. It contains 150 rows of iris observations. Each iris observation consists of 5
columns: Sepal.Length , Sepal.Width , Petal.Length , Petal.Width and Species .
Although we know the species for every iris, you’re going to divide this dataset into training data and test data.
Here’s the deal:
```{r}
library(C50)
library(printr)
```
In knitr, R code is surrounded by triple backticks. This tells knitr that the text between the triple backticks is R code
and should be executed.
Hit the Knit HTML button, and you’ll have a newly generated document with the code you just added.
Note: The packages are loaded within the context of the document being “knitted” together. They will not stay loaded after knitting
completes.
Sweet! Packages are being loaded, what’s next? Now we need to divide our data into training data and test data. C5.0 is a classifier, so
you’ll be teaching it how to classify the different species of irises using the training data.
And the test data? That’s what you use to test whether C5.0 is classifying correctly.
This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

The first line takes a random 100-row sample of the indices 1 through 150. That's what sample() does. This sample is stored in train.indeces .
The second line selects some rows (specifically the 100 you sampled) and all columns (leaving the part after the comma empty means you want all
columns). This partial dataset is stored in iris.train .
Remember, iris consists of rows and columns. Using the square brackets, you can select all rows, some rows, all columns or some
columns.
The third line selects the rows not in the 100 you sampled, and all columns. This is stored in iris.test .
Hit the Knit HTML button, and now you’ve divided your dataset!
How can you train C5.0? This is the most algorithmically complex part, but it will only take you one line of R code.
```{r}
model <- C5.0(Species ~ ., data = iris.train)
```

1. You're using the C5.0 function from the C50 package to create a model. Remember, a model is something that describes how observed data is
generated.
2. You're telling C5.0 to use iris.train as its training data.
3. Finally, you're telling C5.0 that the Species column depends on the other columns (Sepal.Width, Petal.Width, etc.). The tilde means "depends on" and
the period means all the other columns. So, you'd say something like "Species depends on all the other column data."
Hit the Knit HTML button, and now you’ve trained C5.0 with just one line of code!
How can you test the C5.0 model? Evaluating a predictive model can get really complicated. Lots of techniques are available for very
sophisticated validation: part 1, part 2a/b, part 3 and part 4.
What’s cross-validation? Cross-validation is usually done in multiple rounds. You’re just going to do one round of training on part of the
dataset followed by testing on the remaining dataset.
How can you cross-validate? Add this to the bottom of your knitr document:
```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

The predict() function takes your model, the test data and one parameter that tells it to guess the class (in this case, the model predicts the
species).
It then attempts to predict the species based on the other data columns, and stores the results in results .
How to check the results? A quick way to check the results is to use a confusion matrix.
So… what’s a confusion matrix? Also known as a contingency table, a confusion matrix allows us to visually compare the predicted
species vs. the actual species.
Here’s an example:
The rows represent the predicted species, and the columns represent the actual species from the iris dataset.
Starting from the setosa row, you would read this as:
21 iris observations were predicted to be setosa when they were actually setosa .
14 iris observations were predicted to be versicolor when they were actually versicolor .
1 iris observation was predicted to be versicolor when it was actually virginica .
14 iris observations were predicted to be virginica when they were actually virginica .
```{r}
table(results, iris.test$Species)
```
Hit the Knit HTML button, and now you see the 4 things woven together:
1. You’ve divided this iris dataset into training and testing data.
2. You’ve created a model after training C5.0 to predict the species using the training data.
3. You’ve tested your model with the testing data.
4. Finally, you’ve evaluated your model using a confusion matrix.
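The confusion matrix also gives you an overall accuracy number: correct predictions sit on the diagonal, so dividing the diagonal sum by the total count yields the accuracy. A minimal base-R sketch, using a matrix that mirrors the example read-out above:

```r
# Confusion matrix from a run like the example above:
# rows = predicted species, columns = actual species
cm <- matrix(c(21,  0,  0,
                0, 14,  1,
                0,  0, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("setosa", "versicolor", "virginica"),
                             actual    = c("setosa", "versicolor", "virginica")))

# correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 49 of 50 test irises correct: 0.98
```

Your own matrix comes from table(results, iris.test$Species), so in practice you'd apply the same sum(diag(...)) / sum(...) idea to that table directly.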
Don’t sit back just yet — you’ve nailed classification, now checkout clustering…
k-means
What are we doing? As you probably recall from my previous post, k-means is a cluster analysis technique. Using k-means, we’re looking
to form groups (a.k.a. clusters) around data that “look similar.”
We don’t know which data belongs to which group — we don’t even know the number of groups, but k-means can help.
```{r}
library(stats)
library(printr)
```
Hit the Knit HTML button, and you’ll have tested the code that imports the required libraries.
Okay, what’s next? Now we use k-means! With a single line of R code, we can apply the k-means algorithm.
This code removes the Species column from the iris dataset, then uses k-means to create 3 clusters:

```{r}
model <- kmeans(x = subset(iris, select = -Species), centers = 3)
```
Two things are happening in this single line:
1. The subset() function is used to remove the Species column from the iris dataset. It’s no fun if we know the Species before clustering, right?
2. Then kmeans() is applied to the iris dataset (w/ Species removed), and we tell it to create 3 clusters.
Hit the Knit HTML button, and you’ll have a newly generated document for kmeans.
How can you test the k-means clusters? Since we started with the known species from the iris dataset, it's straightforward to test how
accurate k-means clustering is.
```{r}
table(model$cluster, iris$Species)
```
Hit the Knit HTML button to generate your own confusion matrix.
What do the results tell us? The k-means results aren’t great, and your results will probably be slightly different.
What are the numbers along the side? The numbers along the side are the cluster numbers. Since we removed the Species column, k-
means has no idea what to name the clusters, so it numbers them.
What does the matrix tell us? Here’s a potential interpretation of the matrix:
k-means picked up really well on the characteristics for setosa in cluster 2. Out of 50 setosa irises, k-means grouped together all 50.
k-means had a tough time with versicolor and virginica , since they are being grouped into both clusters 1 and 3. Cluster 1
favors versicolor and cluster 3 strongly favors virginica .
An interesting investigation would be to try clustering the data into 2 clusters rather than 3. You could easily experiment with the centers parameter
in kmeans() to see if that would work better.
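That experiment takes just one line: the same kmeans() call with centers = 2 instead of 3. A sketch (the seed value is arbitrary, chosen only so the run is reproducible):

```r
set.seed(42)  # k-means starts from random centers; fix the seed to reproduce
model2 <- kmeans(x = subset(iris, select = -Species), centers = 2)

# compare the 2 clusters against the known species
table(model2$cluster, iris$Species)
```

With 2 clusters you'd expect setosa to stay cleanly separated while versicolor and virginica collapse into one group; cluster numbering is arbitrary, so your labels may differ.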
Does this data mining stuff work? k-means didn’t do great in this instance. Unfortunately, no algorithm will be able to cluster or classify in
every case.
Using this iris dataset, k-means could be used to cluster setosa and possibly virginica . With data mining, model testing/validation is super
important, but we’re not going to be able to cover it in this post. Perhaps a future one…
With C5.0 and k-means under your belt, let’s tackle a tougher one…
Support Vector Machines

```{r}
library(e1071)
library(printr)
```
Hit the Knit HTML button to ensure importing the required libraries is working.
Loading of libraries is good, what’s next? SVM is trained just like C5.0, so we need a training and test set, just like before.
This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
This should look familiar. To keep things consistent, it’s the same code we used to create training and testing data.
Hit the Knit HTML button, and you’ve divided your dataset.
How can you train SVM? Like C5.0, this is the most algorithmically complex part, but it will take us only a single line of R code.
```{r}
model <- svm(Species ~ ., data = iris.train)
```

How do you test the SVM model? Use predict() with the test data, just like before:

```{r}
results <- predict(object = model, newdata = iris.test)
```
What do the results tell us? To get a better understanding of the results, generate a confusion matrix for the results.
Add this to the bottom of your knitr document, and hit Knit HTML :
```{r}
table(results, iris.test$Species)
```
Apriori
What are we doing? We’re going to use Apriori to mine a dataset of census income in order to discover related items in the survey data.
As you probably recall from my previous post, these related items are called itemsets. The relationship among items are called association
rules.
```{r}
library(arules)
library(printr)
data("Adult")
```
Hit the Knit HTML button to import the required libraries and dataset.
Loading of libraries and data is working, what’s next? Now we use Apriori!
```{r}
rules <- apriori(Adult,
  parameter = list(support = 0.4, confidence = 0.7),
  appearance = list(rhs = c("race=White", "sex=Male"), default = "lhs"))
```

This single function call does a ton of things, so let's break it down…
The first argument tells apriori() that you'll be working on the Adult dataset, and the association rules are stored into rules .
The parameter argument gives apriori() a few parameters it needs to filter the generated rules.
As you probably recall, support is the percentage of records in the dataset that contain the related items. Here you're saying we want at
least 40% support.
Confidence is the conditional probability of some item given you have certain other items in your itemset. You're using 70% confidence
here.
…you'll need to experiment with support and confidence to filter for interesting rules/itemsets.
The appearance argument tells apriori() certain characteristics to look for in the association rules.
Association rules look like this: {United States} => {White, Male} . You'd read this as "When I see United States, I will also see White, Male."
There's a left-hand side (lhs) to the rule and a right-hand side (rhs).
All this argument indicates is that we want to see race=White and sex=Male on the right-hand side. The left-hand side can remain
the default (which means anything goes).
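A concrete way to feel out those thresholds is to rerun apriori() at different support levels and compare how many rules survive. A sketch (the 0.5 and 0.3 support values are arbitrary illustration points, not recommendations):

```r
library(arules)
data("Adult")

# same confidence, two different support thresholds
strict <- apriori(Adult, parameter = list(support = 0.5, confidence = 0.7))
loose  <- apriori(Adult, parameter = list(support = 0.3, confidence = 0.7))

length(strict)  # fewer rules survive the higher support threshold
length(loose)   # lowering support lets many more rules through
```

Tightening support shrinks the rule set toward only the most common itemsets, which is usually where you want to start exploring.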
Hit the Knit HTML button, and generate the association rules!
How can you look at the rules? Looking at the association rules takes only a few lines of R code.
```{r}
rules.sorted <- sort(rules, by = "lift")
top5.rules <- head(rules.sorted, 5)
as(top5.rules, "data.frame")
```
What’s lift? Lift tells you how strongly associated the left-hand and right-hand sides are associated with each other. The higher the lift
value, the stronger the association.
Bottom line is: It’s another measure to help you filter the large number of rules/itemsets.
The last line takes the top 5 rules and converts them into a data frame so that we can view them.
What do these rules tell us? When you’re dealing with a large amount of data, Apriori gives you an alternate view of the data for
exploration.
A word of caution:
Although these rules are based on the data and systematically generated, I get the feeling that there’s a bit of an art to selecting support,
confidence and even lift. Depending on what you select for these values, you may get rules that will really help you make decisions…
Alternatively…
But I have to admit, Apriori gives some interesting insights on a data set I didn’t know much about.
EM
What are we doing? Within the context of data mining, expectation maximization (EM) is used for clustering (like k-means!). We’re going to
cluster the irises using the EM algorithm. I found EM to be one of the more difficult to understand conceptually.
Despite being difficult to understand, R takes care of all the heavy lifting with one of its CRAN packages: mclust .
```{r}
library(mclust)
library(printr)
```
Hit the Knit HTML button to make sure you can import the required libraries.
This code removes the Species column from the iris dataset, then uses Mclust to create clusters:

```{r}
model <- Mclust(subset(iris, select = -Species))
```
This should look familiar. We’re removing the Species column (just like before), except this time we’re using Mclust() to do the clustering.
Hit the Knit HTML button, and you’ve generated a cluster model using EM.
How are Mclust and EM related? Mclust() uses the EM algorithm under the hood. In a nutshell, Mclust() tunes a set of models using EM
and then selects the one with the lowest BIC.
Hang on… what’s BIC? BIC stands for Bayesian Information Criterion. In a nutshell, given a few models, BIC is an index which measures
both the explanatory power and the simplicity of a model.
The simpler the model and the more data it can explain… the lower the BIC. The model with lowest BIC is the winner.
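You can see the model comparison Mclust performed for yourself: plotting the fitted object with what = "BIC" (a real option of mclust's plot method) draws the criterion for every candidate model family and cluster count, and summary() names the winner. A sketch:

```r
library(mclust)

model <- Mclust(subset(iris, select = -Species))
plot(model, what = "BIC")  # criterion curve per model family and cluster count
summary(model)             # which model Mclust selected, and its cluster sizes
```

This is a nice sanity check before trusting the clustering: if several models score almost identically, the "winner" is less decisive than it looks.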
How can you test the EM clusters? You can test whether clustering was effective using the same approach as in k-means.
Add this code to the bottom of your knitr document:
```{r}
table(model$classification, iris$Species)
```
Hit the Knit HTML button to generate your own confusion matrix.
What are the numbers along the left-hand side? Just like k-means clustering, the algorithm has no idea what the cluster names are. EM
found 2 clusters, so it numbered them accordingly.
Why only 2 clusters? Using the model selected by Mclust , this algorithm effectively segmented setosa from the other 2 species.
Like k-means, EM had trouble distinguishing between versicolor and virginica . While k-means had some success with virginica and to
a lesser degree versicolor , k-means made the effort to form a 3rd cluster because we told it to form 3 clusters.
A neat feature of the mclust package is that it can plot the model. Here you can investigate the clustering plots:

```{r}
plot(model)
```
PageRank
What are we doing? We’re going to use the PageRank algorithm to determine the relative importance of objects in a graph.
What’s a graph? Within the context of mathematics, a graph is a set of objects where some of the objects are connected by links. In the
context of PageRank, the set of objects are web pages and the links are hyperlinks to other web pages.
```{r}
library(igraph)
library(dplyr)
library(printr)
```
Hit the Knit HTML button to try importing the required libraries.
Alright, what’s next? Using the PageRank algorithm, the aim is to discover the relative importance of objects. Like k-means, Apriori and
EM, we’re not going to train PageRank.
This code generates a random directed graph with 10 objects:

```{r}
# edge probability 1/4 is illustrative; your graph will differ run to run
g <- random.graph.game(n = 10, p.or.m = 1/4, directed = TRUE)
```
What’s a directed link? In graphs, you can have 2 kinds of links: directed and undirected. Directed links are single directional.
For example, a web page hyperlinking to another web page is one-way. Unless the 2nd web page hyperlinks back to the 1st page, the link
doesn’t go both ways.
What does this graph look like? Seeing a graph is so much better than describing a graph.
```{r}
plot(g)
```
How can I apply PageRank to this graph? With a single line of R code, you can apply PageRank to the graph you just generated.
```{r}
pr <- page.rank(g)$vector
```
The single line of R code applies the PageRank algorithm and retrieves the vector of PageRanks for the 10 objects in the graph. You can
think of a vector as a list, so we’re just retrieving a list of PageRanks.
How can I view the PageRanks? R already did the heavy lifting in order to calculate the PageRank of each object.
To view the PageRanks, add this code to the bottom of your knitr document:
This code outputs the PageRank for each object:

```{r}
df <- data.frame(Object = 1:10, PageRank = pr)
arrange(df, desc(PageRank))
```

Hit the Knit HTML button, and you'll see the PageRanks.
The first line creates a data frame with 2 columns: Object and PageRank. You can think of a data frame as a table with rows and columns.
The Object column contains the numbers 1 through 10. The PageRank column contains the PageRanks.
Each row in the data frame represents an object with its object number and PageRank.
The second line uses dplyr's arrange() to sort the data frame from highest PageRank to lowest PageRank.
Looking back at the original graph, this seems to be accurate. Object 8 is linked to by 3 other objects: 3, 2 and 6. Object 3 is linked to by just
object 9.
Remember, the number of objects linking to object 8 is just one component of PageRank…
…the relative importance of the objects linking to object 8 also factor into object 8’s PageRank.
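PageRank is easier to sanity-check on a graph you build by hand than on a random one. A sketch using igraph's graph.formula() (the edge list here is made up purely for illustration): vertex D is linked from three vertices, so it should rank highest.

```r
library(igraph)

# a small hand-built directed graph: '-+' points the edge to the right
g <- graph.formula(A -+ D, B -+ D, C -+ D, D -+ E, E -+ A)

pr <- page.rank(g)$vector
sort(pr, decreasing = TRUE)  # D, linked from A, B and C, comes out on top
```

Note that E ranks well too despite having only one inbound link, because that link comes from the highly ranked D; that's the "relative importance of the linkers" effect described above.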
AdaBoost
What are we doing? Like C5.0 and SVM, we’re going to train AdaBoost to recognize 3 different species of irises. Then we’ll perform a
similar test on the test dataset to see how accurately it “learned” the different iris species.
```{r}
library(adabag)
library(printr)
```
Hit the Knit HTML button to test import of the required libraries.
Loading libraries is working, what’s next? Just like before, we need a training and test set.
This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
No change here. To keep things focused on the algorithms, it’s the same code we used to create training and testing data.
Hit the Knit HTML button, and you now have a training and test dataset.
How can you train AdaBoost? Just like before, this will take us only a single line of R code.
```{r}
model <- boosting(Species ~ ., data = iris.train)
```
Why does AdaBoost take longer to train? R’s default number of iterations is 100. You can modify the number of iterations using
the mfinal parameter.
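If training feels slow, you can trade some accuracy for speed by lowering the number of boosting rounds. A sketch (mfinal is a real boosting() parameter; the value 10 is just an illustrative choice):

```r
library(adabag)

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]

# 10 boosting rounds instead of the default 100: much faster to train
model.fast <- boosting(Species ~ ., data = iris.train, mfinal = 10)
```

Fewer rounds means fewer weak learners in the ensemble, which is fine for experimenting but may cost accuracy on harder datasets.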
As you’ll recall from AdaBoost in plain English, AdaBoost is trained in rounds (a.k.a. iterations).
What are the weak learners? boosting() is an implementation of AdaBoost.M1. This variation uses 3 weak learners: FindAttrTest,
FindDecRule and C4.5.
How do you test the model? Let’s use precisely the same code as before.
```{r}
results <- predict(object = model, newdata = iris.test)
```
What do the results tell us? To get a better understanding of the results, output the confusion matrix for the results.
Add this to the bottom of your knitr document, and hit Knit HTML :
```{r}
results$confusion
```
It made only 2 mistakes, misclassifying 2 virginica irises as versicolor . A single test run isn't a lot to thoroughly evaluate AdaBoost on
this dataset, but this is definitely a good sign.
kNN

```{r}
library(class)
library(printr)
```
Hit the Knit HTML button to try importing the required libraries.
All set, what’s next? Despite that you’re not “training” kNN, the algorithm still requires a training set to base its just-in-time calculations on.
We’ll need a training and test set.
This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
Hit the Knit HTML button, and you now have a training and test dataset.
How do you use kNN? In just a single call, you’ll be initializing kNN with the training dataset and testing with the test dataset.
This code initializes kNN with the training data. In addition, it does a test with the testing data:

```{r}
results <- knn(train = subset(iris.train, select = -Species),
               test = subset(iris.test, select = -Species),
               cl = iris.train$Species)
```
What do the results tell us? To get a better understanding of the results, output the confusion matrix for the results.
Add this to the bottom of your knitr document, and hit Knit HTML :
```{r}
table(results, iris.test$Species)
```
Not as great as some of the other results we’ve been seeing, but pretty decent since we’re using the default kNN settings.
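The main default worth revisiting is k, the number of neighbors, which knn() sets to 1. Voting over a few more neighbors often smooths out noise. A sketch (k = 5 is just an illustrative value, not a tuned one):

```r
library(class)

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

# classify each test iris by majority vote among its 5 nearest neighbors
results <- knn(train = subset(iris.train, select = -Species),
               test  = subset(iris.test, select = -Species),
               cl = iris.train$Species, k = 5)
table(results, iris.test$Species)
```

Trying a handful of k values and comparing confusion matrices is the quickest way to see whether the default was holding kNN back.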
Naive Bayes
What are we doing? We’re going to use the Naive Bayes algorithm to recognize 3 different species of irises.
```{r}
library(e1071)
library(printr)
```
Hit the Knit HTML button to try importing the required libraries.
This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
Hit the Knit HTML button, and you now have a training and test dataset.
How can you train Naive Bayes? Just like before, this will take us only a single line of R code.
```{r}
model <- naiveBayes(x = subset(iris.train, select = -Species),
                    y = iris.train$Species)
```

1. The data to train Naive Bayes is stored in iris.train , but we remove the Species column. This is x .
2. The Species column of iris.train is what we want to predict. This is y .
How do you test the model? We’ll use precisely the same code as before.
```{r}
results <- predict(object = model, newdata = iris.test)
```
Hit the Knit HTML button to test your Naive Bayes model.
What do the results tell us? Let’s output the confusion matrix for the results.
Add this to the bottom of your knitr document, and hit Knit HTML :
```{r}
table(results, iris.test$Species)
```
It made 2 mistakes misclassifying 2 virginica irises as versicolor , and one more mistake misclassifying a versicolor as virginica .
CART
What are we doing? In this final algorithm, we're going to use CART to recognize 3 different species of irises.
```{r}
library(rpart)
library(printr)
```
Hit the Knit HTML button to see if importing the required libraries works.
What’s next? CART is a decision tree classifier just like C5.0. We’ll need a training and test set.
```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
Hit the Knit HTML button, and you now have a training and test dataset.
How can you train CART? Once again, R encapsulates the complexity of CART nicely so this will take us only a single line of R code.
```{r}
model <- rpart(Species ~ ., data = iris.train)
```
How do you test the model? We’ll be using predict() again to test our model.
```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```
What do the results tell us? The confusion matrix for the results is generated in the same way.
Add this to the bottom of your knitr document, and hit Knit HTML :
```{r}
table(results, iris.test$Species)
```
The default CART model didn’t do awesome compared to C5.0. However, this is a single test run. Performing more test runs with different
samples would be a much more reliable metric.
In this particular case, CART misclassified 1 virginica iris as versicolor and made 4 mistakes misclassifying versicolor irises as virginica .
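Those repeated test runs are easy to script: resample, retrain, and average the accuracy. A minimal sketch with rpart (the run count of 10 is arbitrary):

```r
library(rpart)

accuracies <- replicate(10, {
  train.indeces <- sample(1:nrow(iris), 100)
  iris.train <- iris[train.indeces, ]
  iris.test <- iris[-train.indeces, ]
  model <- rpart(Species ~ ., data = iris.train)
  results <- predict(object = model, newdata = iris.test, type = "class")
  mean(results == iris.test$Species)  # fraction classified correctly
})
mean(accuracies)  # average accuracy across the 10 runs
```

Averaging over many random train/test splits smooths out the luck of any single sample, and is a small step toward the proper cross-validation mentioned earlier.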