
Quick list of useful R packages

Garrett Grolemund
January 10, 2018 17:49

Recommended Packages
Many useful R functions come in packages, free libraries of code written by R's active user community. To install an R package, open an R
session and type at the command line

install.packages("<the package's name>")

R will download the package from CRAN, so you'll need to be connected to the internet. Once you have a package installed, you can make
its contents available to use in your current R session by running

library("<the package's name>")

There are thousands of helpful R packages for you to use, but navigating them all can be a challenge. To help you out, we've compiled this
guide to some of the best. We've used each of these and found them to be outstanding – we've even written some of them. But you don't
have to take our word for it: these packages are also among the most downloaded R packages.

To load data
RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages are a good place to start. Choose the
package that fits your type of database.

XLConnect, xlsx - These packages help you read and write Microsoft Excel files from R. You can also just export your spreadsheets from
Excel as .csv files.

foreign - Want to read a SAS data set into R? Or an SPSS data set? Foreign provides functions that help you load data files from other
programs into R.

R can handle plain text files – no package required. Just use the functions read.csv, read.table, and read.fwf. If you have even more exotic
data, consult the CRAN guide to data import and export.
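For instance, a minimal sketch, assuming a comma-separated file named mydata.csv in your working directory (the file name is hypothetical):

# read.csv comes with base R -- no package needed
mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE)
str(mydata)   # inspect the columns that were read in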

To manipulate data
dplyr - Essential shortcuts for subsetting, summarizing, rearranging, and joining together data sets. dplyr is our go-to package for fast data
manipulation.

tidyr - Tools for changing the layout of your data sets. Use the gather and spread functions to convert your data into the tidy format, the
layout R likes best.

stringr - Easy to learn tools for regular expressions and character strings.

lubridate - Tools that make working with dates and times easier.
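As a rough illustration of how these fit together, here is a hedged sketch combining dplyr and tidyr on the built-in airquality data set; the column names are real, but the analysis itself is just an example:

library(dplyr)
library(tidyr)

airquality %>%
  filter(!is.na(Ozone)) %>%              # dplyr: subset rows
  group_by(Month) %>%                    # dplyr: summarize by group
  summarise(mean_ozone = mean(Ozone)) %>%
  spread(Month, mean_ozone)              # tidyr: reshape to a wide layout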

To visualize data
ggplot2 - R's famous package for making beautiful graphics. ggplot2 lets you use the grammar of graphics to build layered, customizable plots.
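A minimal ggplot2 sketch with the built-in mtcars data (the aesthetic choices here are just for illustration):

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                         # one layer: points
  labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")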

ggvis - Interactive, web based graphics built with the grammar of graphics.

rgl - Interactive 3D visualizations with R

htmlwidgets - A fast way to build interactive (JavaScript-based) visualizations with R. Packages that implement htmlwidgets include:

 leaflet (maps)
 dygraphs (time series)
 DT (tables)
 DiagrammeR (diagrams)
 networkD3 (network graphs)
 threejs (3D scatterplots and globes).

googleVis - Lets you use Google Chart Tools to visualize data in R. Google Chart Tools used to be called Gapminder, the graphing software
Hans Rosling made famous in his TED talk.

To model data
car - car's Anova function is popular for making type II and type III ANOVA tables.

mgcv - Generalized Additive Models

lme4/nlme - Linear and Non-linear mixed effects models

randomForest - Random forest methods from machine learning


multcomp - Tools for multiple comparison testing

vcd - Visualization tools and tests for categorical data

glmnet - Lasso and elastic-net regression methods with cross validation

survival - Tools for survival analysis

caret - Tools for training regression and classification models
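As an example of the kind of workflow these support, here is a hedged sketch of fitting and cross-validating a model with caret on the built-in iris data; the tuning choices are arbitrary, and method = "rf" assumes the randomForest package is installed:

library(caret)

set.seed(42)
fit <- train(Species ~ ., data = iris,
             method = "rf",                                        # random forest via randomForest
             trControl = trainControl(method = "cv", number = 5))  # 5-fold cross-validation
fit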

To report results
shiny - Easily make interactive, web apps with R. A perfect way to explore data and share findings with non-programmers.

R Markdown - The perfect workflow for reproducible reporting. Write R code in your markdown reports. When you run render, R
Markdown will replace the code with its results and then export your report as an HTML, PDF, or MS Word document, or an HTML or PDF
slideshow. The result? Automated reporting. R Markdown is integrated straight into RStudio.

xtable - The xtable function takes an R object (like a data frame) and returns the LaTeX or HTML code you need to paste a pretty version of
the object into your documents. Copy and paste, or pair up with R Markdown.
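For example, a minimal xtable sketch (LaTeX output by default; pass type = "html" for HTML):

library(xtable)

tab <- xtable(head(mtcars[, 1:4]), caption = "First rows of mtcars")
print(tab)                    # LaTeX output
print(tab, type = "html")     # HTML output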

For Spatial data


sp, maptools - Tools for loading and using spatial data including shapefiles.

maps - Easy to use map polygons for plots.

ggmap - Download street maps straight from Google Maps and use them as a background in your ggplots.

For Time Series and Financial data


zoo - Provides the most popular format for saving time series objects in R.

xts - Very flexible tools for manipulating time series data sets.

quantmod - Tools for downloading financial data, plotting common charts, and doing technical analysis.
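A small sketch showing the zoo side of this workflow; the series here is simulated rather than downloaded:

library(zoo)

# a daily series of 30 random values, stored as a zoo object
z <- zoo(rnorm(30), seq(as.Date("2018-01-01"), by = "day", length.out = 30))

# 7-day moving average, aligned to the right-hand end of each window
rollmean(z, k = 7, align = "right", fill = NA)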

To write high performance R code


Rcpp - Write R functions that call C++ code for lightning fast speed.

data.table - An alternative way to organize data sets for very, very fast operations. Useful for big data.

parallel - Use parallel processing in R to speed up your code or to crunch large data sets.
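A minimal sketch with the parallel package that ships with R; the function being applied is just a stand-in:

library(parallel)

cl <- makeCluster(2)                          # start two worker processes
res <- parLapply(cl, 1:10, function(i) i^2)   # apply a function across the workers
stopCluster(cl)                               # always shut the cluster down
unlist(res)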

To work with the web


XML - Read and create XML documents with R

jsonlite - Read and create JSON data tables with R

httr - A set of useful tools for working with http connections
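A hedged sketch combining httr and jsonlite against the httpbin.org test service (requires an internet connection):

library(httr)
library(jsonlite)

r <- GET("http://httpbin.org/get")                            # make an HTTP request
stop_for_status(r)                                            # fail loudly on HTTP errors
parsed <- fromJSON(content(r, as = "text", encoding = "UTF-8"))  # parse the JSON body into R objects
str(parsed)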

To write your own R packages


devtools - An essential suite of tools for turning your code into an R package.

testthat - testthat provides an easy way to write unit tests for your code projects.

roxygen2 - A quick way to document your R packages. roxygen2 turns inline code comments into documentation pages and builds a
package namespace.

You can also read about the entire package development process online in Hadley Wickham's R Packages book

List of user-installed R packages and their versions


By Andrew Z

This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.

ip <- as.data.frame(installed.packages()[, c(1, 3:4)])   # keep Package, Version, Priority
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]          # drop base and recommended packages
print(ip, row.names = FALSE)
Example output

Package Version
bitops 1.0-6
BradleyTerry2 1.0-6
brew 1.0-6
brglm 0.5-9
car 2.0-25
caret 6.0-47
coin 1.0-24
colorspace 1.2-6
crayon 1.2.1
devtools 1.8.0
dichromat 2.0-0
digest 0.6.8
earth 4.4.0
evaluate 0.7
[..snip..]

Tested with R 3.2.0.

This is a small step towards managing package versions: for a better solution, see the checkpoint package. You could also use the first column to reinstall user-installed R
packages after an R upgrade.
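A hedged sketch of that reinstall idea, assuming ip was built with the command above and saved before the upgrade:

# before upgrading: save the list of user-installed packages
saveRDS(ip, "installed_packages.rds")

# after upgrading: read it back and reinstall everything in one call
ip <- readRDS("installed_packages.rds")
install.packages(as.character(ip$Package))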

Find Installed Packages


Description
Find (or retrieve) details of all packages installed in the specified libraries.

Usage
installed.packages(lib.loc = NULL, priority = NULL,
                   noCache = FALSE, fields = NULL,
                   subarch = .Platform$r_arch, ...)
Arguments
lib.loc   character vector describing the location of R library trees to search through, or NULL for all known trees (see .libPaths).
priority  character vector or NULL (default). If non-null, used to select packages; "high" is equivalent to c("base", "recommended"). To select all packages without an assigned priority use priority = "NA".
noCache   Do not use cached information, nor cache it.
fields    a character vector giving the fields to extract from each package's DESCRIPTION file in addition to the default ones, or NULL (default). Unavailable fields result in NA values.
subarch   character string or NULL. If non-null and non-empty, used to select packages which are installed for that sub-architecture.
...       allows unused arguments to be passed down from other functions.

Details
installed.packages scans the ‘DESCRIPTION’ files of each package found along lib.loc and returns a matrix of package names,
library paths and version numbers.

The information found is cached (by library) for the R session and specified fields argument, and updated only if the top-level library
directory has been altered, for example by installing or removing a package. If the cached information becomes confused, it can be
avoided by specifying noCache = TRUE.

Value
A matrix with one row per package, row names the package names and column names
(currently) "Package", "LibPath", "Version", "Priority", "Depends", "Imports", "LinkingTo", "Suggests", "Enhances", "OS_type", "License"
and "Built" (the R version the package was built under). Additional columns can be specified using the fields argument.

Note
This needs to read several files per installed package, which will be slow on Windows and on some network-mounted file systems.

It will be slow when thousands of packages are installed, so do not use it to find out if a named package is installed
(use find.package or system.file) nor to find out if a package is usable (call requireNamespace or require and check the return value) nor to find
details of a small number of packages (use packageDescription).
See Also
update.packages, install.packages, INSTALL, REMOVE.

Examples
## confine search to .Library for speed
str(ip <- installed.packages(.Library, priority = "high"))
ip[, c(1,3:5)]
plic <- installed.packages(.Library, priority = "high", fields = "License")
## what licenses are there:
table( plic[, "License"] )

Here are my go-to R packages -- in a handy searchable table.


One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines --
analyzing everything from weather or financial data to the human genome -- not to mention analyzing computer security-breach data.


Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The
table below shows my favorite go-to packages for these three tasks (plus a few miscellaneous ones tossed in). The package names in
the table are clickable if you want more information. To find out more about a package once you've installed it, type help(package =
"packagename") in your R console (of course substituting the actual package name).

My favorite R packages for data visualization and munging

Each entry below lists the package, its category, a short description, a sample use, and the author.

devtools (package development, package installation) - While devtools is aimed at helping you create your own R packages, it's also essential if you want to easily install other packages from GitHub. Install it! Requires Rtools on Windows and XCode on a Mac. On CRAN. Sample use: install_github("rstudio/leaflet"). Author: Hadley Wickham & others.

remotes (package installation) - If all you want is to install packages from GitHub, devtools may be a bit of a heavyweight. remotes will easily install from GitHub as well as Bitbucket and some others. On CRAN. (ghit is another option, but is GitHub-only.) Sample use: remotes::install_github("mangothecat/franc"). Author: Gabor Csardi & others.

installr (misc) - Windows only: Update your installed version of R from within R. On CRAN. Sample use: updateR(). Author: Tal Galili & others.

reinstallr (misc) - Seeks to find packages that had previously been installed on your system and need to be re-installed after upgrading R. CRAN. Sample use: reinstallr(). Author: Calli Gross.

readxl (data import) - Fast way to read Excel files in R, without dependencies such as Java. CRAN. Sample use: read_excel("my-spreadsheet.xls", sheet = 1). Author: Hadley Wickham.

googlesheets (data import, data export) - Easily read data into R from Google Sheets. CRAN. Sample use: mysheet <- gs_title("Google Spreadsheet Title"); mydata <- gs_read(mysheet, ws = "WorksheetTitle"). Author: Jennifer Bryan.

readr (data import) - Base R handles most of these functions; but if you have huge files, this is a speedy and standardized way to read tabular files such as CSVs into R data frames, as well as plain text files into character strings with read_file. CRAN. Sample use: read_csv("myfile.csv"). Author: Hadley Wickham.

rio (data import, data export) - rio has a good idea: pull a lot of separate data-reading packages into one, so you just need to remember two functions: import and export. CRAN. Sample use: import("myfile"). Author: Thomas J. Leeper & others.
Hmisc (data analysis) - There are a number of useful functions in here. Two of my favorites: describe, a more robust summary function, and Cs, which creates a vector of quoted character strings from unquoted comma-separated text. Cs(so, it, goes) creates c("so", "it", "goes"). CRAN. Sample use: describe(mydf); Cs(so, it, goes). Author: Frank E Harrell Jr & others.

datapasta (data import) - Data copy and paste: meet reproducible research. If you've copied data from the Web, a spreadsheet, or another source into your clipboard, datapasta lets you paste it into R as an R object, with the code to reproduce it. It includes RStudio add-ins as well as command-line functions for transposing data, turning it into markdown format, and more. CRAN. Sample use: df_paste() to create a data frame, vector_paste() to create a vector. Author: Miles McBain.

sqldf (data wrangling, data analysis) - Do you know a great SQL query you'd use if your R data frame were in a SQL database? Run SQL queries on your data frame with sqldf. CRAN. Sample use: sqldf("select * from mydf where mycol > 4"). Author: G. Grothendieck.

jsonlite (data import, data wrangling) - Parse JSON within R or turn R data frames into JSON. CRAN. Sample use: myjson <- toJSON(mydf, pretty=TRUE); mydf2 <- fromJSON(myjson). Author: Jeroen Ooms & others.

XML (data import, data wrangling) - Many functions for elegantly dealing with XML and HTML, such as readHTMLTable. CRAN. Sample use: mytables <- readHTMLTable(myurl). Author: Duncan Temple Lang.

httr (data import, data wrangling) - An R interface to HTTP protocols; useful for pulling data from APIs. See the httr quickstart guide. CRAN. Sample use: r <- GET("http://httpbin.org/get"); content(r, "text"). Author: Hadley Wickham.
quantmod (data import, data visualization, data analysis) - Even if you're not interested in analyzing and graphing financial investment data, quantmod has easy-to-use functions for importing economic as well as financial data from sources like the Federal Reserve. CRAN. Sample use: getSymbols("AITINO", src="FRED"). Author: Jeffrey A. Ryan.

tidyquant (data import, data visualization, data analysis) - Another financial package that's useful for importing, analyzing and visualizing data, integrating aspects of other popular finance packages as well as tidyverse tools. With thorough documentation. CRAN. Sample use: aapl_key_ratios <- tq_get("AAPL", get = "key.ratios"). Author: Matt Dancho.

rvest (data import, web scraping) - Web scraping: extract data from HTML pages. Inspired by Python's Beautiful Soup. Works well with SelectorGadget. CRAN. Sample use: see the package vignette. Author: Hadley Wickham.

dplyr (data wrangling, data analysis) - The essential data-munging R package when working with data frames. Especially useful for operating on data by categories. CRAN. Sample use: see the intro vignette. Author: Hadley Wickham.

purrr (data wrangling) - purrr is a relatively new package aimed at replacing plyr and some base R apply functions. It's more complex to learn but also has more functionality. CRAN. Sample use: map_df(mylist, myfunction); more in Charlotte Wickham's purrr tutorial video and the purrr cheat sheet PDF download. Author: Hadley Wickham.

reshape2 (data wrangling) - Change data row and column formats from "wide" to "long"; turn variables into column names or column names into variables and more. The tidyr package is a newer, more focused option, but I still use reshape2. CRAN. Sample use: see my tutorial. Author: Hadley Wickham.
tidyr (data wrangling) - While I still prefer reshape2 for general re-arranging, tidyr won me over with specialized functions like fill (fill in missing columns from data above) and replace_na. CRAN. Sample use: see examples in this blog post. Author: Hadley Wickham.

splitstackshape (data wrangling) - It's rare that I'd recommend a package that hasn't been updated in years, but the cSplit() function solves a rather complex shaping problem in an astonishingly easy way. If you have a data frame column with one or more comma-separated values (think a survey question with "select all that apply"), this is worth an install if you want to separate each item into its own new data frame row. CRAN. Sample use: cSplit(mydata, "multi_val_column", sep = ",", direction = "long"). Author: Ananda Mahto.

magrittr (data wrangling) - This package gave us the %>% symbol for chaining R operations, but it's got other useful operators such as %<>% for mutating a data frame in place and . as a placeholder for the original object being operated upon. CRAN. Sample use: mydf %<>% mutate(newcol = myfun(colname)). Author: Stefan Milton Bache & Hadley Wickham.

validate (data wrangling) - Intuitive data validation based on rules you can define, save and re-use. CRAN. Sample use: see the introductory vignette. Author: Mark van der Loo & Edwin de Jonge.

testthat (programming) - Package that makes it easy to write unit tests for your R code. CRAN. Sample use: see the testing chapter of Hadley Wickham's book on R packages. Author: Hadley Wickham.

data.table (data wrangling, data analysis) - Popular package for heavy-duty data wrangling. While I typically prefer dplyr, data.table has many fans for its speed with large data sets. CRAN. Sample use: useful tutorial. Author: Matt Dowle & others.

stringr (data wrangling) - Numerous functions for text manipulation. Some are similar to existing base R functions but in a more standard format, including working with regular expressions. Some of my favorites: str_pad and str_trim. CRAN. Sample use: str_pad(myzipcodevector, 5, "left", "0"). Author: Hadley Wickham.

lubridate (data wrangling) - Everything you ever wanted to do with date arithmetic, although understanding and using the available functionality can be somewhat complex. CRAN. Sample use: mdy("05/06/2015") + months(1); more examples in the package vignette. Author: Garrett Grolemund, Hadley Wickham & others.

zoo (data wrangling, data analysis) - Robust package with a slew of functions for dealing with time series data; I like the handy rollmean function with its align=right and fill=NA options for calculating moving averages. CRAN. Sample use: rollmean(mydf, 7). Author: Achim Zeileis & others.

editR (data display) - Interactive editor for R Markdown documents. Note that R Markdown Notebooks are another useful way to generate Markdown interactively. editR is on GitHub. Sample use: editR("path/to/myfile.Rmd"). Author: Simon Garnier.

knitr (data display) - Add R to a markdown document and easily generate reports in HTML, Word and other formats. A must-have if you're interested in reproducible research and automating the journey from data analysis to report creation. CRAN. Sample use: see the Minimal Examples page. Author: Yihui Xie & others.
officer (data display) - Import and edit Microsoft Word and PowerPoint documents, making it easy to add R-generated analysis and visualizations to existing as well as new reports and presentations. CRAN. Sample use: my_doc <- read_docx() %>% body_add_img(src = myplot); the package website has many more examples. Author: David Gohel.

listviewer (data display, data wrangling) - While RStudio has since added a list-viewing option, this HTML widget still offers an elegant way to view complex nested lists within R. GitHub timelyportfolio/listviewer. Sample use: jsonedit(mylist). Author: Kent Russell.

DT (data display) - Create a sortable, searchable table in one line of code with this R interface to the jQuery DataTables plug-in. GitHub rstudio/DT. Sample use: datatable(mydf). Author: RStudio.

ggplot2 (data visualization) - Powerful, flexible and well-thought-out dataviz package following "grammar of graphics" syntax to create static graphics, but be prepared for a steep learning curve. CRAN. Sample use: qplot(factor(myfactor), data=mydf, geom="bar", fill=factor(myfactor)); see my searchable ggplot2 cheat sheet and time-saving code snippets. Author: Hadley Wickham.

patchwork (data visualization) - Easily combine ggplot2 plots and keep the new, merged plot a ggplot2 object. plot_layout() adds the ability to set columns, rows, and relative sizes of each component graphic. GitHub. Sample use: plot1 + plot2 + plot_layout(ncol=1). Author: Thomas Lin Pedersen.

ggiraph (data visualization) - Make ggplot2 plots interactive with this extension's new geom functions such as geom_bar_interactive and arguments for tooltips and JavaScript onclicks. CRAN. Sample use: g <- ggplot(mpg, aes(x = displ, y = cty, color = drv)); my_gg <- g + geom_point_interactive(aes(tooltip = model), size = 2); ggiraph(code = print(my_gg), width = .7). Author: David Gohel.
dygraphs (data visualization) - Create HTML/JavaScript graphs of time series - a one-line command if your data is an xts object. CRAN. Sample use: dygraph(myxtsobject). Author: JJ Allaire & RStudio.

googleVis (data visualization) - Tap into the Google Charts API using R. CRAN. Sample use: mychart <- gvisColumnChart(mydata); plot(mychart); numerous examples here. Author: Markus Gesmann & others.

metricsgraphics (data visualization) - R interface to the metricsgraphics JavaScript library for bare-bones line, scatterplot and bar charts. GitHub hrbrmstr/metricsgraphics. Sample use: see package intro. Author: Bob Rudis.

RColorBrewer (data visualization) - Not a designer? RColorBrewer helps you select color palettes for your visualizations. CRAN. Sample use: see Jennifer Bryan's tutorial. Author: Erich Neuwirth.

sf (mapping, data wrangling) - This package makes it much easier to do GIS work in R. Simple features protocols make geospatial data look a lot like regular data frames, while various functions allow for analysis such as determining whether points are in a polygon. A GIS game-changer for R. CRAN. Sample use: see the package vignettes, starting with the introduction, Simple Features for R. Author: Edzer Pebesma & others.

leaflet (mapping) - Map data using the Leaflet JavaScript library within R. GitHub rstudio/leaflet. Sample use: see my tutorial. Author: RStudio.

ggmap (mapping) - Although I don't use this package often for its main purpose of pulling down background map tiles, it's my go-to for geocoding up to 2,500 addresses with the Google Maps API with its geocode and mutate_geocode functions. CRAN. Sample use: geocode("492 Old Connecticut Path, Framingham, MA"). Author: David Kahle & Hadley Wickham.
tmap & tmaptools (mapping) - These packages offer an easy way to read in shape files and join data files with geographic info, as well as do some exploratory mapping. Recent functionality adds support for simple features, interactive maps and creating leaflet objects. Plus, tmaptools::palette_explorer() is a great tool for picking ColorBrewer palettes. CRAN. Sample use: see the package vignette or my mapping in R tutorial. Author: Martijn Tennekes.

mapsapi (mapping, data wrangling) - This interface to the Google Maps Direction and Distance Matrix APIs lets you analyze and map distances and driving routes. CRAN. Sample use: google_directions(origin = c(my_longitude, my_latitude), destination = c(my_address), alternatives = TRUE); also see the vignette. Author: Michael Dorman.

tidycensus (mapping, data wrangling) - Want to analyze and map U.S. Census Bureau data from 5-year American Community Surveys or 10-year censuses? This makes it easy to download numerical and geospatial info in R-ready format. CRAN. Sample use: see Basic usage of tidycensus. Author: Kyle E. Walker.

glue (data wrangling) - The main function, also called glue, evaluates variables and R expressions within a quoted string, as long as they're enclosed by {} braces. This makes for an elegant paste() replacement. CRAN. Sample use: glue("Today is {Sys.Date()}"). Author: Jim Hester.

rga (Web analytics) - Use Google Analytics with R. GitHub skardhamar/rga. Sample use: see the package README file and my tutorial. Author: Bror Skardhamar.

RSiteCatalyst (Web analytics) - Use Adobe Analytics with R. GitHub randyzwitch/RSiteCatalyst. Sample use: see the intro video. Author: Randy Zwitch.
roxygen2 (package development) - Useful tools for documenting functions within R packages. CRAN. Sample use: see this short, easy-to-read blog post on writing R packages. Author: Hadley Wickham & others.

shiny (data visualization) - Turn R data into interactive Web applications. I've seen some nice (if sometimes sluggish) apps and it's got many enthusiasts. CRAN. Sample use: see the tutorial. Author: RStudio.

flexdashboard (data visualization) - If Shiny is too complex and involved for your needs, this package offers a simpler (if somewhat less robust) solution based on R Markdown. CRAN. Sample use: more info in Using flexdashboard. Author: JJ Allaire, RStudio & others.

openxlsx (misc) - If you need to write to an Excel file as well as read, this package is easy to use. CRAN. Sample use: write.xlsx(mydf, "myfile.xlsx"). Author: Alexander Walker.

gmodels (data wrangling, data analysis) - There are several functions for modeling data here, but the one I use, CrossTable, simply creates cross-tabs with loads of options -- totals, proportions and several statistical tests. CRAN. Sample use: CrossTable(myxvector, myyvector, prop.t=FALSE, prop.chisq = FALSE). Author: Gregory R. Warnes.

janitor (data wrangling, data analysis) - Basic data cleaning made easy, such as finding duplicates by multiple columns, making R-friendly column names and removing empty columns. It also has some nice tabulating tools, like adding a total row, as well as generating tables with percentages and easy crosstabs. CRAN. Sample use: tabyl(mydf, sort = TRUE) %>% adorn_totals("row"). Author: Samuel Firke.
car (data wrangling) - car's recode function makes it easy to bin continuous numerical data into categories or factors. While base R's cut accomplishes the same task, I find recode's syntax to be more intuitive - just remember to put the entire recoding formula within double quotation marks. dplyr's case_when() function is another option worth considering. CRAN. Sample use: recode(x, "1:3='Low'; 4:7='Mid'; 8:hi='High'"). Author: John Fox & others.

rcdimple (data visualization) - R interface to the dimple JavaScript library with numerous customization options. Good choice for JavaScript bar charts, among others. GitHub timelyportfolio/rcdimple. Sample use: dimple(mtcars, mpg ~ cyl, type = "bar"). Author: Kent Russell.

foreach (data wrangling) - Efficient - and intuitive if you come from another programming language - for loops in R. CRAN. Sample use: foreach(i=1:3) %do% sqrt(i); also see The Wonders of foreach. Author: Revolution Analytics, Steve Weston.

scales (data wrangling) - While this package has many more sophisticated ways to help you format data for graphing, it's worth a download just for the comma(), percent() and dollar() functions. CRAN. Sample use: comma(mynumvec). Author: Hadley Wickham.

plotly (data visualization) - R interface to the Plotly JavaScript library, which was open-sourced in late 2015. Basic graphs have a distinctive look which may not be for everyone, but it's full-featured, relatively easy to learn (especially if you know ggplot2) and includes a ggplotly() function to make graphs created with ggplot2 interactive. CRAN. Sample use: d <- diamonds[sample(nrow(diamonds), 1000), ]; plot_ly(d, x = carat, y = price, text = paste("Clarity: ", clarity), mode = "markers", color = carat, size = carat). Author: Carson Sievert & others.
highcharter (data visualization) - R wrapper for the robust and well documented Highcharts JavaScript library, one of my favorite choices for presentation-quality interactive graphics. The package uses ggplot2-like syntax, including options for handling both long and wide data, and comes with plenty of examples. Note that a paid Highcharts license is needed to use this for commercial or government work (it's free for personal and non-profit projects). CRAN. Sample use: hchart(mydf, "charttype", hcaes(x = xcol, y = ycol, group = groupbycol)). Author: Joshua Kunst & others.

profvis (programming) - Is your R code sluggish? This package gives you a visual representation of your code line by line so you can find the speed bottlenecks. CRAN. Sample use: profvis({ your code here }). Author: Winston Chang & others.

tidytext (text mining) - Elegant implementation of text-mining functions using Hadley Wickham's "tidy data" principles. CRAN. Sample use: see tidytextmining.com for numerous examples. Author: Julia Silge & David Robinson.

diffobj (data analysis) - Base R's identical() function tells you whether or not two objects are the same; but if they're not, it won't tell you why. diffobj gives you a visual representation of how two R objects differ. CRAN. Sample use: diffObj(x, y). Author: Brodie Gaslam & Michael B. Allen.

prophet (forecasting) - I don't do much forecasting analysis; but if I did, I'd start with this package. CRAN. Sample use: see the Quick start guide. Author: Sean Taylor & Ben Letham at Facebook.
feather (data import, data export) - This binary data-file format can be read by both Python and R, making data interchange between the two languages easier. It's also built for I/O speed. CRAN. Sample use: write_feather(mydf, "myfile"). Author: Wes McKinney & Hadley Wickham.

fst (data import, data export) - Another alternative for binary file storage (R-only), fst was built for fast storage and retrieval, with access speeds above 1 GB/sec. It also offers compression that doesn't slow data access too much, as well as the ability to import a specific range of rows (by row number). CRAN. Sample use: write.fst(mydf, "myfile.fst", 100). Author: Mark Klik.

googleAuthR (data import) - If you want to use data from a Google API in an R project and there's not yet a specific package for that API, this is the place to turn for authenticating. CRAN. Sample use: see examples on the package website and this gist for use with Google Calendars. Author: Mark Edmondson.

here (misc) - This package has one function with a single, useful purpose: find your project's working directory. Surprisingly helpful if you want your code to run on more than one system. CRAN. Sample use: my_project_directory <- here(). Author: Kirill Müller.

pacman (misc) - This package is another that aims to solve one problem, and solve it well: package installation. The main function will load a package that's already installed, or install it first if it's not available. While this is certainly possible to do with base R's require() and an if statement, p_load() is so much more elegant for CRAN packages, and p_load_gh() for GitHub. Other useful options include p_temp(), which allows for a temporary, this-session-only package installation. CRAN. Sample use: p_load(dplyr, here, tidycensus). Author: Tyler Rinker.

cloudyR project (data import, data export) - This is a collection of packages aimed at making it easier for R to work with cloud platforms such as Amazon Web Services, Google and Travis-CI. Some are already on CRAN, some can be found on GitHub. Sample use: see the list of packages. Author: Various.

A few important points for newbies:

To install a package from CRAN, use the command install.packages("packagename") -- of course substituting the actual package name
for packagename and putting it in quotation marks. Package names, like pretty much everything else in R, are case sensitive.

To install from GitHub, it's easiest to use the install_github function from the devtools package, using the
format devtools::install_github("githubaccountname/packagename"). That means you first want to install the devtools package on your
system with install.packages("devtools"). Note that devtools sometimes needs some extra non-R software on your system -- more
specifically, an Rtools download for Windows or Xcode for OS X. There's more information about devtools here.

In order to use a package's function during your R session, you need to do one of two things. One option is to load it into your R session
with library("packagename") or require("packagename"). The other is to call the function including the package name, like
this: packagename::functionname(). Package names, like pretty much everything else in R, are case sensitive.
List of useful packages (libraries) for Data Analysis in R

Introduction
R offers multiple packages for performing data analysis. Apart from providing an awesome interface for statistical analysis, the next best thing
about R is the endless support it gets from developers and data science maestros from all over the world. The current count of downloadable
packages from CRAN stands close to 7,000!

Beyond some of the popular packages such as caret, ggplot2, dplyr, and lattice, there exist many more libraries which remain unnoticed but
prove to be very handy at certain stages of analysis. So, we created a comprehensive list of all packages in R.

In order to make the guide more useful, we further did 2 things:

1. Mapped use of each of these libraries to the stage they generally get used at – Pre-Modeling, Modeling and Post-Modeling.
2. Created a handy infographic with the most commonly used libraries. Analysts can just print this out and keep it handy for reference. (The infographic itself is not reproduced here.)
Here is a complete guide to powerful R packages, categorized into the various stages of the data analysis process. Download Here.

AWESOME R

A curated list of awesome R packages and tools. Inspired by awesome-machine-learning.


For better navigation, see https://awesome-r.com for Top 50 CRAN downloaded packages or repos with 400+ stars.

2017
 prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
 tidyverse - Easily install and load packages from the tidyverse
 purrr - A functional programming toolkit for R
 hrbrthemes - Opinionated, typographic-centric ggplot2 themes and theme components
 xaringan - Create HTML5 slides with R Markdown and the JavaScript library remark.js
 blogdown - Create Blogs and Websites with R Markdown
 glue - Glue strings to data in R. Small, fast, dependency free interpreted string literals.
 covr - Test coverage reports for R
 lintr - Static Code Analysis for R
 reprex - Render bits of R code for sharing, e.g., on GitHub or StackOverflow.
 reticulate - R Interface to Python
 tensorflow - TensorFlow for R
 utf8 - Manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.
INTEGRATED DEVELOPMENT ENVIRONMENTS
Integrated Development Environment
 RStudio - A powerful and productive user interface for R. Works great on Windows, Mac, and Linux.
 Emacs + ESS - Emacs Speaks Statistics is an add-on package for emacs text editors.
 Sublime Text + R-Box - Add-on package for Sublime Text 2/3.
 TextMate + r.tmblundle - Add-on package for TextMate 1/2.
 StatET - An Eclipse based IDE for R.
 Revolution R Enterprise - Revolution R is offered free to academic users; the commercial software focuses on big data and large-scale multiprocessor
functionality.
 R Commander - A package that provides a basic graphical user interface.
 IRkernel - R kernel for Jupyter.
 Deducer - A Menu driven data analysis GUI with a spreadsheet like data editor.
 Radiant - A platform-independent browser-based interface for business analytics in R, based on Shiny.
 Vim-R - Vim plugin for R.
 Nvim-R - Neovim plugin for R.
 JASP - A complete package for both Bayesian and Frequentist methods that is familiar to users of SPSS.
 Bio7 - An IDE that contains tools for model creation, scientific image analysis and statistical analysis for ecological modelling.
 RTVS - R Tools for Visual Studio.
 Rice - A modern R console with syntax highlighting.
SYNTAX
Packages change the way you use R.
 magrittr - Let’s pipe it.
 pipeR - Multi-paradigm Pipeline Implementation.
 lambda.r - Functional programming and simple pattern matching in R.
 purrr - A FP package for R in the spirit of underscore.js.
DATA MANIPULATION
Packages for cooking data.
 dplyr - Fast data frames manipulation and database query.
 data.table - Fast data manipulation in a short and flexible syntax.
 reshape2 - Flexible rearrange, reshape and aggregate data.
 readr - A fast and friendly way to read tabular data into R.
 haven - Improved methods to import SPSS, Stata and SAS files in R.
 tidyr - Easily tidy data with spread and gather functions.
 broom - Convert statistical analysis objects into tidy data frames.
 rlist - A toolbox for non-tabular data manipulation with lists.
 jsonlite - A robust and quick way to parse JSON files in R.
 ff - Data structures designed to store large datasets.
 lubridate - A set of functions to work with dates and times.
 stringi - ICU based string processing package.
 stringr - Consistent API for string processing, built on top of stringi.
 bigmemory - Shared memory and memory-mapped matrices. The big* packages provide additional tools including linear models (biglm) and Random Forests
(bigrf).
 fuzzyjoin - Join tables together on inexact matching.
 tidyverse - Easily install and load packages from the tidyverse.
GRAPHIC DISPLAYS
Packages for showing data.
 ggplot2 - An implementation of the Grammar of Graphics.
 ggfortify - A unified interface to ggplot2 for plotting results from popular statistical packages with one line of code.
 ggrepel - Repel overlapping text labels away from each other.
 ggalt - Extra Coordinate Systems, Geoms and Statistical Transformations for ggplot2.
 ggtree - Visualization and annotation of phylogenetic tree.
 ggtech - ggplot2 tech themes and scales
 ggplot2 Extensions - Showcases of ggplot2 extensions.
 lattice - A powerful and elegant high-level data visualization system.
 corrplot - A graphical display of a correlation matrix or general matrix. It also contains some algorithms to do matrix reordering.
 rgl - 3D visualization device system for R.
 Cairo - R graphics device using cairo graphics library for creating high-quality display output.
 extrafont - Tools for using fonts in R graphics.
 showtext - Enable R graphics device to show text using system fonts.
 animation - A simple way to produce animated graphics in R, using ImageMagick.
 gganimate - Create easy animations with ggplot2.
 misc3d - Powerful functions to deal with 3d plots, isosurfaces, etc.
 xkcd - Use xkcd style in graphs.
 imager - An image processing package based on CImg library to work with images and display them.
 hrbrthemes - Opinionated, typographic-centric ggplot2 themes and theme components.
 waffle - Make waffle (square pie) charts in R
HTML WIDGETS
Packages for interactive visualizations.
 d3heatmap - Interactive heatmaps with D3.
 DataTables - Displays R matrices or data frames as interactive HTML tables.
 DiagrammeR - Create JS graph diagrams and flowcharts in R.
 dygraphs - Charting time-series data in R.
 formattable - Formattable Data Structures.
 ggvis - Interactive grammar of graphics for R.
 Leaflet - One of the most popular JavaScript libraries for interactive maps.
 MetricsGraphics - Enables easy creation of D3 scatterplots, line charts, and histograms.
 networkD3 - D3 JavaScript Network Graphs from R.
 scatterD3 - Interactive scatterplots with D3.
 plotly - Interactive ggplot2 and Shiny plotting with plot.ly.
 rCharts - Interactive JS Charts from R.
 rbokeh - R Interface to Bokeh.
 threejs - Interactive 3D scatter plots and globes.
 timevis - Create fully interactive timeline visualizations.
 visNetwork - Using vis.js library for network visualization.
 wordcloud2 - R interface to wordcloud2.js.
 highcharter - R wrapper for highcharts based on htmlwidgets
REPRODUCIBLE RESEARCH
Packages for literate programming.
 knitr - Easy dynamic report generation in R.
 xtable - Export tables to LaTeX or HTML.
 rapport - An R templating system.
 rmarkdown - Dynamic documents for R.
 slidify - Generate reproducible html5 slides from R markdown.
 Sweave - A package designed to write LaTeX reports using R.
 texreg - Formatting statistical models in LaTex and HTML.
 checkpoint - Install packages from snapshots on the checkpoint server.
 brew - Pre-compute data to enhance your report templates. Can be combined with knitr.
 ReporteRs - An R package to generate Microsoft Word, Microsoft PowerPoint and HTML reports.
 bookdown - Authoring Books with R Markdown.
 ezknitr - Avoid the typical working directory pain when using ‘knitr’
WEB TECHNOLOGIES AND SERVICES
Packages to surf the web.
 Web Technologies List - Information about how to use R and the world wide web together.
 shiny - Easy interactive web applications with R. See also awesome-rshiny
 shinyjs - Easily improve the user interaction and user experience in your Shiny apps in seconds.
 RCurl - General network (HTTP/FTP/…) client interface for R.
 httr - User-friendly RCurl wrapper.
 httpuv - HTTP and WebSocket server library.
 XML - Tools for parsing and generating XML within R.
 rvest - Simple web scraping for R, using CSS selector or XPath syntax.
 OpenCPU - HTTP API for R.
 Rfacebook - Access to Facebook API via R.
 RSiteCatalyst - R client library for the Adobe Analytics.
 plumber - A library to expose existing R code as web API.
PARALLEL COMPUTING
Packages for parallel computing.
 parallel - Since release 2.14.0, R includes the parallel package, incorporating (slightly revised) copies of the multicore and snow packages.
 Rmpi - Rmpi provides an interface (wrapper) to MPI APIs. It also provides an interactive R slave environment.
 foreach - Executing the loop in parallel.
 future - A minimal, efficient, cross-platform unified Future API for parallel and distributed processing in R; designed for beginners as well as advanced developers.
 SparkR - R frontend for Spark.
 DistributedR - A scalable high-performance platform from HP Vertica Analytics Team.
 ddR - Provides distributed data structures and simplifies distributed computing in R.
 sparklyr - R interface for Apache Spark from RStudio.
 batchtools - High performance computing with LSF, TORQUE, Slurm, OpenLava, SGE and Docker Swarm.
HIGH PERFORMANCE
Packages for making R faster.
 Rcpp - Rcpp provides a powerful API on top of R, making it possible to write R functions that run much faster.
 Rcpp11 - Rcpp11 is a complete redesign of Rcpp, targeting C++11.
 compiler - Speed up your R code using the JIT compiler.
LANGUAGE API
Packages for other languages.
 rJava - Low-level R to Java interface.
 jvmr - Integration of R, Java, and Scala.
 rJython - R interface to Python via Jython.
 rPython - Package allowing R to call Python.
 runr - Run Julia and Bash from R.
 RJulia - R package Call Julia.
 JuliaCall - Seamless Integration Between R and Julia.
 RinRuby - a Ruby library that integrates the R interpreter in Ruby.
 R.matlab - Read and write of MAT files together with R-to-MATLAB connectivity.
 RcppOctave - Seamless Interface to Octave and Matlab.
 RSPerl - A bidirectional interface for calling R from Perl and Perl from R.
 V8 - Embedded JavaScript Engine.
 htmlwidgets - Bring the best of JavaScript data visualization to R.
 rpy2 - Python interface for R.
DATABASE MANAGEMENT
Packages for managing data.
 RODBC - ODBC database access for R.
 DBI - Defines a common interface between the R and database management systems.
 elastic - Wrapper for the Elasticsearch HTTP API
 mongolite - Streaming Mongo Client for R
 RMariaDB - An R interface to MariaDB (a replacement for the old RMySQL package)
 RMySQL - R interface to the MySQL database.
 ROracle - OCI based Oracle database interface for R.
 RPostgreSQL - R interface to the PostgreSQL database system.
 RSQLite - SQLite interface for R
 RJDBC - Provides access to databases through the JDBC interface.
 rmongodb - R driver for MongoDB.
 rredis - Redis client for R.
 RCassandra - Direct interface (not Java) to the most basic functionality of Apache Cassandra.
 RHive - R extension facilitating distributed computing via Apache Hive.
 RNeo4j - Neo4j graph database driver.
 rpostgis - R interface to PostGIS database and get spatial objects in R.
MACHINE LEARNING
Packages for making R cleverer.
 AnomalyDetection - AnomalyDetection R package from Twitter.
 ahaz - Regularization for semiparametric additive hazards regression.
 arules - Mining Association Rules and Frequent Itemsets
 bigrf - Big Random Forests: Classification and Regression Forests for Large Data Sets
 bigRR - Generalized Ridge Regression (with special advantage for p >> n cases)
 bmrm - Bundle Methods for Regularized Risk Minimization Package
 Boruta - A wrapper algorithm for all-relevant feature selection
 BreakoutDetection - Breakout Detection via Robust E-Statistics from Twitter.
 bst - Gradient Boosting
 CausalImpact - Causal inference using Bayesian structural time-series models.
 C50 - C5.0 Decision Trees and Rule-Based Models
 caret - Classification and Regression Training
 Clever Algorithms For Machine Learning
 CORElearn - Classification, regression, feature evaluation and ordinal evaluation
 CoxBoost - Cox models by likelihood based boosting for a single survival endpoint or competing risks
 Cubist - Rule- and Instance-Based Regression Modeling
 e1071 - Misc Functions of the Department of Statistics (e1071), TU Wien
 earth - Multivariate Adaptive Regression Spline Models
 elasticnet - Elastic-Net for Sparse Estimation and Sparse PCA
 ElemStatLearn - Data sets, functions and examples from the book: “The Elements of Statistical Learning, Data Mining, Inference, and Prediction” by Trevor Hastie,
Robert Tibshirani and Jerome Friedman
 evtree - Evolutionary Learning of Globally Optimal Trees
 forecast - Timeseries forecasting using ARIMA, ETS, STLM, TBATS, and neural network models
 forecastHybrid - Automatic ensemble and cross validation of ARIMA, ETS, STLM, TBATS, and neural network models from the “forecast” package
 prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
 FSelector - A feature selection framework, based on subset-search or feature ranking approaches.
 frbs - Fuzzy Rule-based Systems for Classification and Regression Tasks
 GAMBoost - Generalized linear and additive models by likelihood based boosting
 gamboostLSS - Boosting Methods for GAMLSS
 gbm - Generalized Boosted Regression Models
 glmnet - Lasso and elastic-net regularized generalized linear models
 glmpath - L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model
 GMMBoost - Likelihood-based Boosting for Generalized mixed models
 grplasso - Fitting user specified models with Group Lasso penalty
 grpreg - Regularization paths for regression models with grouped covariates
 h2o - Deeplearning, Random forests, GBM, KMeans, PCA, GLM
 hda - Heteroscedastic Discriminant Analysis
 ipred - Improved Predictors
 kernlab - kernlab: Kernel-based Machine Learning Lab
 klaR - Classification and visualization
 kohonen - Supervised and Unsupervised Self-Organising Maps.
 lars - Least Angle Regression, Lasso and Forward Stagewise
 lasso2 - L1 constrained estimation aka ‘lasso’
 LiblineaR - Linear Predictive Models Based On The Liblinear C/C++ Library
 lme4 - Mixed-effects models
 LogicReg - Logic Regression
 maptree - Mapping, pruning, and graphing tree models
 mboost - Model-Based Boosting
 Machine Learning For Hackers
 mlr - Extensible framework for classification, regression, survival analysis and clustering
 mvpart - Multivariate partitioning
 MXNet - MXNet brings flexible and efficient GPU computing and state-of-art deep learning to R.
 ncvreg - Regularization paths for SCAD- and MCP-penalized regression models
 nnet - Feed-forward Neural Networks and Multinomial Log-Linear Models
 oblique.tree - Oblique Trees for Classification Data
 pamr - Pam: prediction analysis for microarrays
 party - A Laboratory for Recursive Partytioning
 partykit - A Toolkit for Recursive Partytioning
 penalized - L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model
 penalizedLDA - Penalized classification using Fisher’s linear discriminant
 penalizedSVM - Feature Selection SVM using penalty functions
 quantregForest - quantregForest: Quantile Regression Forests
 randomForest - randomForest: Breiman and Cutler’s random forests for classification and regression.
 randomForestSRC - randomForestSRC: Random Forests for Survival, Regression and Classification (RF-SRC).
 rattle - Graphical user interface for data mining in R.
 rda - Shrunken Centroids Regularized Discriminant Analysis
 rdetools - Relevant Dimension Estimation (RDE) in Feature Spaces
 REEMtree - Regression Trees with Random Effects for Longitudinal (Panel) Data
 relaxo - Relaxed Lasso
 rgenoud - R version of GENetic Optimization Using Derivatives
 rgp - R genetic programming framework
 Rmalschains - Continuous Optimization using Memetic Algorithms with Local Search Chains (MA-LS-Chains) in R
 rminer - Simpler use of data mining methods (e.g. NN and SVM) in classification and regression
 ROCR - Visualizing the performance of scoring classifiers
 RoughSets - Data Analysis Using Rough Set and Fuzzy Rough Set Theories
 rpart - Recursive Partitioning and Regression Trees
 RPMM - Recursively Partitioned Mixture Model
 RSNNS - Neural Networks in R using the Stuttgart Neural Network Simulator (SNNS)
 Rsomoclu - Parallel implementation of self-organizing maps.
 RWeka - R/Weka interface
 RXshrink - RXshrink: Maximum Likelihood Shrinkage via Generalized Ridge or Least Angle Regression
 sda - Shrinkage Discriminant Analysis and CAT Score Variable Selection
 SDDA - Stepwise Diagonal Discriminant Analysis
 SuperLearner and subsemble - Multi-algorithm ensemble learning packages.
 svmpath - svmpath: the SVM Path algorithm
 tgp - Bayesian treed Gaussian process models
 tree - Classification and regression trees
 varSelRF - Variable selection using random forests
 xgboost - eXtreme Gradient Boosting Tree model, well known for its speed and performance.
NATURAL LANGUAGE PROCESSING
Packages for Natural Language Processing.
 text2vec - Fast Text Mining Framework for Vectorization and Word Embeddings.
 tm - A comprehensive text mining framework for R.
 openNLP - Apache OpenNLP Tools Interface.
 koRpus - An R Package for Text Analysis.
 zipfR - Statistical models for word frequency distributions.
 NLP - Basic functions for Natural Language Processing.
 LDAvis - Interactive visualization of topic models.
 topicmodels - Topic modeling interface to the C code developed by David M. Blei for Topic Modeling (Latent Dirichlet Allocation (LDA) and Correlated
Topics Models (CTM)).
 syuzhet - Extracts sentiment from text using three different sentiment dictionaries.
 SnowballC - Snowball stemmers based on the C libstemmer UTF-8 library.
 quanteda - R functions for Quantitative Analysis of Textual Data.
 Topic Models Resources - Topic Models learning and R related resources.
 NLP for Chinese - NLP related resources in R (in Chinese).
 MonkeyLearn - R package for text analysis with MonkeyLearn.
 tidytext - Implementing Hadley Wickham's tidy data principles for text mining.
 utf8 - Manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.
BAYESIAN
Packages for Bayesian Inference.
 coda - Output analysis and diagnostics for MCMC.
 mcmc - Markov Chain Monte Carlo.
 MCMCpack - Markov chain Monte Carlo (MCMC) Package.
 R2WinBUGS - Running WinBUGS and OpenBUGS from R / S-PLUS.
 BRugs - R interface to the OpenBUGS MCMC software.
 rjags - R interface to the JAGS MCMC library.
 rstan - R interface to the Stan MCMC software.
OPTIMIZATION
Packages for Optimization.
 lpSolve - Interface to Lp_solve to Solve Linear/Integer Programs.
 minqa - Derivative-free optimization algorithms by quadratic approximation.
 nloptr - NLopt is a free/open-source library for nonlinear optimization.
 ompr - Model mixed integer linear programs in an algebraic way directly in R.
 Rglpk - R/GNU Linear Programming Kit Interface
 ROI - The R Optimization Infrastructure (‘ROI’) is a sophisticated framework for handling optimization problems in R.
FINANCE
Packages for dealing with money.
 quantmod - Quantitative Financial Modelling & Trading Framework for R.
 TTR - Functions and data to construct technical trading rules with R.
 PerformanceAnalytics - Econometric tools for performance and risk analysis.
 zoo - S3 Infrastructure for Regular and Irregular Time Series.
 xts - eXtensible Time Series.
 tseries - Time series analysis and computational finance.
 fAssets - Analysing and Modelling Financial Assets.
BIOINFORMATICS
Packages for processing biological datasets.
 Bioconductor - Tools for the analysis and comprehension of high-throughput genomic data.
 genetics - Classes and methods for handling genetic data.
 gap - An integrated package for genetic data analysis of both population and family data.
 ape - Analyses of Phylogenetics and Evolution.
 pheatmap - Pretty heatmaps made easy.
NETWORK ANALYSIS
Packages to construct, analyze and visualize network data.
 Network Analysis List - Network Analysis related resources.
 igraph - A collection of network analysis tools.
 network - Basic tools to manipulate relational data in R.
 sna - Basic network measures and visualization tools.
 netdiffuseR - Tools for Analysis of Network Diffusion.
 networkDynamic - Support for dynamic, (inter)temporal networks.
 ndtv - Tools to construct animated visualizations of dynamic network data in various formats.
 statnet - The project behind many R network analysis packages.
 ergm - Exponential random graph models in R.
 latentnet - Latent position and cluster models for network objects.
 tnet - Network measures for weighted, two-mode and longitudinal networks.
 rgexf - Export network objects from R to GEXF, for manipulation with network software like Gephi or Sigma.
 visNetwork - Using vis.js library for network visualization.
SPATIAL
Packages to explore the earth.
 CRAN Task View: Analysis of Spatial Data- Spatial Analysis related resources.
 Leaflet - One of the most popular JavaScript libraries for interactive maps.
 ggmap - Plotting maps in R with ggplot2.
 REmap - R interface to the JavaScript library ECharts for interactive map data visualization.
 sp - Classes and Methods for Spatial Data.
 rgeos - Interface to Geometry Engine - Open Source
 rgdal - Bindings for the Geospatial Data Abstraction Library
 maptools - Tools for Reading and Handling Spatial Objects
 gstat - Spatial and spatio-temporal geostatistical modelling, prediction and simulation.
 spacetime - R classes and methods for spatio-temporal data.
 RColorBrewer - Provides color schemes for maps
 spatstat - Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests
 spdep - Spatial Dependence: Weighting Schemes, Statistics and Models
 tigris - Download and use Census TIGER/Line shapefiles in R
R DEVELOPMENT
Packages for packages.
 Package Development List - R packages to improve package development.
 devtools - Tools to make an R developer’s life easier.
 testthat - An R package to make testing fun.
 R6 - simpler, faster, lighter-weight alternative to R’s built-in classes.
 pryr - Make it easier to understand what’s going on in R.
 roxygen - Describe your functions in comments next to their definitions.
 lineprof - Visualise line profiling results in R.
 packrat - Make your R projects more isolated, portable, and reproducible.
 installr - Functions for installing software from within R (for Windows).
 import - An import mechanism for R.
 modules - An alternative (Python style) module system for R.
 Rocker - R configurations for Docker.
 RStudio Addins - List of RStudio addins.
 drat - Creation and use of R repositories on GitHub or other repos.
 covr - Test coverage for your R package and (optionally) upload the results to coveralls or codecov.
 lintr - Static code analysis for R to enforce code style.
 staticdocs - Generate static html documentation for an R package.
LOGGING
Packages for Logging
 futile.logger - A logging package in R similar to log4j
 log4r - A log4j derivative for R
 logging - A logging package emulating the python logging package.
DATA PACKAGES
Handy Data Packages
 engsoccerdata - English and European soccer results 1871-2016.
 gapminder - Excerpt from the Gapminder dataset (data about countries throughout the past 50 years).
OTHER TOOLS
Handy Tools for R
 git2r - Gives you programmatic access to Git repositories from R.
OTHER INTERPRETERS
Alternative R engines.
 CXXR - Refactorising R into C++.
 fastR - FastR is an implementation of the R Language in Java atop Truffle and Graal.
 pqR - a “pretty quick” implementation of R
 renjin - a JVM-based interpreter for R.
 rho - Refactor the interpreter of the R language into a fully-compatible, efficient, VM for R.
 riposte - a fast interpreter and JIT for R.
 TERR - TIBCO Enterprise Runtime for R.
LEARNING R
Packages for Learning R.
 swirl - An interactive R tutorial directly in your R console.
 DataScienceR - a list of R tutorials for Data Science, NLP and Machine Learning.
RESOURCES
Where to discover new R-esources.
WEBSITES
 R-project - The R Project for Statistical Computing.
 R Weekly - Weekly updates about R and Data Science. R Weekly is openly developed on GitHub.
 R Bloggers - There are people scattered across the Web who blog about R. This is simply an aggregator of many of those feeds.
 DataCamp - Learn R data analytics online.
 Quick-R - An excellent quick reference.
 Advanced R - An online version of the Advanced R book.
 Efficient R Programming - An online home of the O’Reilly book: Efficient R Programming.
 CRAN Task Views - Task Views for CRAN packages.
 The R Programming Wikibook - A collaborative handbook for R.
 R-users - A job board for R users (and the people who are looking to hire them)
 R Cookbook - A problem-oriented website that supports the R Graphics Cookbook.
 tryR - A quick course for getting started with R.
 RDocumentation - Search through all CRAN, Bioconductor, Github packages and their archives with RDocumentation.
BOOKS
 R Books List - List of R Books.
 The Art of R Programming - It’s a good resource for systematically learning fundamentals such as types of objects, control statements, variable scope, classes and
debugging in R.
 Free Books - CRAN Contributed Documentation in many languages.
 R Cookbook - A quick and simple introduction to conducting many common statistical tasks with R.
 Books written as part of the Johns Hopkins Data Science Specialization:
 Exploratory Data Analysis with R - Basic analytical skills for all sorts of data in R.
 R Programming for Data Science - More advanced data analysis that relies on R programming.
 Report Writing for Data Science in R - R-based methods for reproducible research and report generation.
 R Packages - A book (in paper and website formats) on writing R packages.
 R in Action - This book aims at all levels of users, with sections for beginning, intermediate and advanced R ranging from “Exploring R data structures” to running
regressions and conducting factor analyses.
 Use R! - This series of inexpensive, focused books from Springer is aimed at practitioners. Each book covers the use of R in a particular
subject area, such as Bayesian networks, ggplot2 or Rcpp.
 R for SAS and SPSS users - An excellent resource for users already familiar with SAS or SPSS.
 An Introduction to R - A very good introductory text on R, also covers some advanced topics.
 Introduction to Statistical Learning with Application in R - A simplified and “operational” version of The Elements of Statistical Learning. Free softcopy provided
by its authors.
 The R Inferno - Patrick Burns gives insight into R’s ins and outs along with its quirks!
 R for Data Science - Free book from RStudio developers with emphasis on data science workflow.
 Learning R Programming - Learning R as a programming language from basics to advanced topics.
 Data Munging with R - A gentle introduction to data processing and programming in R, for beginners moving beyond spreadsheets.
PODCASTS
 Not So Standard Deviations - The Data Science Podcast.
 @Roger Peng and @Hilary Parker.
 R World News - R World News helps you keep up with happenings within the R community.
 @Bob Rudis and @Jay Jacobs.
 The R-Podcast - Giving practical advice on how to use R.
 @Eric Nantz.
 R Talk - News and discussions of statistical software and language R.
 @Oliver Keyes, @Jasmine Dumas, @Ted Hart and @Mikhail Popov.
 R Weekly - Weekly news updates about the R community.
REFERENCE CARDS
 R Reference Card 2.0 - Material from R for Beginners by permission of Emmanuel Paradis (Version 2 by Matt Baggott).
 Regression Analysis Refcard - R Reference Card for Regression Analysis.
 Reference Card for ESS - Reference Card for ESS.
 R Markdown Cheat sheet - Quick reference guide for writing reports with R Markdown.
 Shiny Cheat sheet - Quick reference guide for building Shiny apps.
 ggplot2 Cheat sheet - Quick reference guide for data visualisation with ggplot2.
 devtools Cheat sheet - Quick reference guide to package development in R.
MOOCS
Massive open online courses.
 The Analytics Edge - Hands-on introduction to data analysis with R from MITx.
 Johns Hopkins University Data Science Specialization - 9 courses including: Introduction to R, literate analysis tools, Shiny and some more.
 HarvardX Biomedical Data Science - Introduction to R for the Life Sciences.
 Explore Statistics with R - Covers introduction, data handling and statistical analysis in R.
LISTS
Great resources for learning domain knowledge.
 Books - List of R Books.
 DataScienceR - a list of R tutorials for Data Science, NLP and Machine Learning.
 ggplot2 Extensions - Showcases of ggplot2 extensions.
 Natural Language Processing - NLP related resources in R. @Chinese
 Network Analysis - Network Analysis related resources.
 Open Data - Using R to obtain, parse, manipulate, create, and share open data.
 Posts - Great R blog posts or Rticles.
 Package Development - R packages to improve package development.
 R Project Conferences - Information about useR! Conferences and DSC Conferences.
 RStartHere - A guide to some of the most useful R packages, organized by workflow.
 RStudio Addins - List of RStudio addins.
 Topic Models - Topic Models learning and R related resources.
 Web Technologies - Information about how to use R and the world wide web together.
OTHER AWESOME LISTS
 awesome-awesomeness
 lists
 awesome-rshiny

Here is a list of packages that I have used and found to be very useful and powerful. Some of them are frequently used by
Kagglers, and a few played a key role in earning top-10 rankings in Kaggle competitions.

 sqldf [We use it for selecting from data frames using SQL]
 data.table [Famous as an extension of data.frame; see the short sketch after this list]
 foreach [Useful if you want a foreach looping construct in R]
 Matrix [Mainly useful for working with sparse and dense matrix classes and methods]
 forecast [For easy forecasting of time series]
 plyr [The best tools for splitting, applying and combining data]
 stringr [Really helpful for string manipulation]
 Database connection packages: RPostgreSQL, RMongo, RODBC, RSQLite
 lubridate [Data scientists mainly use it for easy date and time manipulation]
 ggplot2 [One of the most famous and powerful packages for data visualization and exploratory data analysis]
 qcc [Mainly used for statistical quality control and QC charts]
 reshape2 [Makes data restructuring very easy]
 randomForest [A very well-known package in the data science community for building random forest predictive models]
 gbm [Provides gradient boosting machines]
 e1071 [One of the best packages I have ever used; mainly used for building support vector machines]
 caret [Mainly useful for classification and regression training]
 glmnet [Provides lasso and elastic-net regularized generalized linear models]
 tau [Very good text analysis utilities]
 SOAR [If you want memory management in R via delayed assignments, this is the package you are looking for]
 doMC [A foreach parallel adaptor for the multicore package]
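
As a quick, self-contained taste of the first two entries, here is a tiny sketch using R's built-in iris data (the example is illustrative and not tied to any particular competition):

library(sqldf)
library(data.table)

# sqldf: query a data frame with ordinary SQL
sqldf("select Species, count(*) as n from iris group by Species")

# data.table: the same per-species count in data.table syntax
dt <- as.data.table(iris)
dt[, .N, by = Species]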
The CRAN Package repository features 6778 active packages. Which of these should you know? Here is an analysis of the daily download logs of the
CRAN mirror from Jan-May 2015. See a link to full data at the bottom of the post.

Most of these R packages are favorites of Kagglers, are endorsed by many authors, rank highly when measured by how many other packages depend on them, and have been mentioned on Quora and on various R blogs.

They are also rated and reviewed by users on Crantastic.org. We include the Crantastic rating for a few packages simply as an indicator of popularity; where a package has too few user ratings to be meaningful, we omit the rating.

Let us explore the list based on the number of downloads!

1. Rcpp Seamless R and C++ Integration (693288 downloads, 3.2/5 by 10 users)
2. ggplot2 An Implementation of the Grammar of Graphics (598484 downloads, 4.0/5 by 82 users)
3. stringr Simple, Consistent Wrappers for Common String Operations (543434 downloads, 4.1/5 by 18 users)
4. plyr Tools for Splitting, Applying and Combining Data (523220 downloads, 3.8/5 by 65 users)
5. digest Create Cryptographic Hash Digests of R Objects (521344 downloads)
6. reshape2 Flexibly Reshape Data: A Reboot of the Reshape Package (483065 downloads, 4.1/5 by 18 users)
7. colorspace Color Space Manipulation (476304 downloads, 4.0/5 by 2 users)
8. RColorBrewer ColorBrewer Palettes (453858 downloads, 4.0/5 by 17 users)
9. manipulate Interactive Plots for RStudio (395232 downloads)
10. scales Scale Functions for Visualization (394389 downloads)
11. labeling Axis Labeling (373374 downloads)
12. proto Prototype Object-Based Programming (369096 downloads)
13. munsell Munsell Colour System (368949 downloads)
14. gtable Arrange Grobs in Tables (364015 downloads)
15. dichromat Color Schemes for Dichromats (362562 downloads)
16. mime Map Filenames to MIME Types (352780 downloads)
17. RCurl General Network (HTTP/FTP/...) Client Interface for R (340530 downloads, 4.2/5 by 11 users)
18. bitops Bitwise Operations (322743 downloads)
19. zoo S3 Infrastructure for Regular and Irregular Time Series (Z's Ordered Observations) (302052 downloads, 3.8/5 by 11 users)
20. knitr A General-Purpose Package for Dynamic Report Generation in R (295528 downloads)

All of the top 20 (Jan-May 2015) above are covered in computerworld.com's Top R package ranking for April.
For completeness, here is data on 135 R package downloads from Jan to May 2015.
Did we miss your favorites? Light up this space and contribute to the R community by letting us know which R packages you use!

Bio: Bhavya Geethika is pursuing a master's in Management Information Systems at the University of Illinois at Chicago. Her areas of interest
include statistics and data mining for business, machine learning, and data-driven marketing.

Packages for data mining algorithms in R and Python


May 11, 2015

By Ashish

Although there is an abundance of such information in both print and electronic form, it is mostly either buried deep in voluminous books or scattered across long threaded conversations.
I think it is appropriate to "cluster" the useful packages from two popular data mining languages, R and Python, in a single thread.

1. Hierarchical Clustering Methods

2. For hierarchical clustering methods, use the cluster package in R. An example implementation is posted in this thread. In the same package you can find methods
for the clues, clara, clarans, diana and ClustOfVar algorithms.
3. BIRCH methods - the R package has been removed from the CRAN repository. You can either use the earlier versions found here or modify the code. For
Python, you can use scikit-learn.
4. Agglomerative clustering - the R function is agnes, found in the cluster package.
5. Expectation-Maximization algorithm - the R package is EMCluster.
6. K-modes - available in R; for classical k-means, see kernlab and flexclust.
7. Clustering and cluster validation in R - package fpc; RANN for k-nearest neighbors.
8. For clustering mixed-type datasets, the R package is Cluster Ensembles.
9. In Python, text processing tasks can be handled by the Natural Language Toolkit (NLTK), a mature, well-documented package for NLP; TextBlob is a simpler
alternative, and spaCy is a brand new alternative focused on performance. The R package for text processing is tm.
10. CRAN Task View - contains a list of packages that can be used for finding groups in data and modelling unobserved cross-sectional heterogeneity. This is one place
where you can find both the function name and its description.
Is data cleaning your objective? If your focus is on data cleaning, also known as data munging, then Python is more powerful in my experience because it is backed
by regular expressions.

Is data exploration your objective? The pandas package in Python is very powerful and extremely flexible, but it is equally challenging to learn. Similarly, the dplyr package
in R can be used for the same purpose.
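
For instance, a minimal dplyr sketch on R's built-in iris data (assuming dplyr is installed) looks like this:

library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length))   # mean petal length per species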

Is data visualization your objective? If so, then in R, ggplot2 is an excellent package for data visualization. Similarly, you can use ggplot for Python graphics.

And finally, just as the CRAN project is a single repository for R packages, the Anaconda distribution for Python has a similar package management system.

Top 10 data mining algorithms in plain R


Knowing the top 10 most influential data mining algorithms is awesome.

Knowing how to USE the top 10 data mining algorithms in R is even more awesome.

That’s when you can slap a big ol’ “S” on your chest…

…because you’ll be unstoppable!

Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3
separate panels in this survey paper.

By the end of this post…


You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away.

UPDATE 18-Jun-2015: Thanks to Albert for creating the image above!

UPDATE 22-Jun-2015: Thanks to Ulf for the fantastic feedback which I’ve included below.

 Getting Started
 C5.0
 k-means
 Support Vector Machines
 Apriori
 EM
 PageRank
 AdaBoost
 kNN
 Naive Bayes
 CART
 You Can Totally Do This!

Getting Started
First, what is R?
R is both a language and environment for statistical computing and graphics. It’s a powerful suite of software for data manipulation,
calculation and graphical display.

R has 2 key selling points:

1. R has a fantastic community of bloggers, mailing lists, forums, a Stack Overflow tag and that’s just for starters.
2. The real kicker is R's awesome repository of packages over at CRAN. A package includes reusable R code, documentation that describes how
to use it and even sample data.

It’s a great environment for manipulating data, but if you’re on the fence between R and Python, lots of folks have compared them.

For this post, do 2 things right now:

1. Install R
2. Install RStudio
The next step is to couple R with knitr…

Okay, what’s knitr?


knitr (pronounced nit-ter) weaves together plain text (like you’re reading) with R code into a single document. In the words of the author, it’s
“elegant, flexible and fast!”

You’re probably wondering…

What does this have to do with data mining?

Using knitr to learn data mining is an odd pairing, but it’s also incredibly powerful.

Here’s 3 reasons why:

1. It’s a perfect match for learning R. I’m not sure if anyone else is doing this, but knitr lets you experiment and see a reproducible document of what
you’ve learned and accomplished. What better way to learn, teach and grow?
2. Yihui (the author of knitr) is super on top of maintaining, enhancing and making knitr awesome.
3. knitr is light-weight and comes with RStudio!

Don’t wait!

Follow these 5 steps to create your first knitr document:

1. In RStudio, create a new R Markdown document by clicking File > New File > R Markdown…
2. Set the Title to a meaningful name.
3. Click OK.
4. Delete the text after the second set of --- .
5. Click Knit HTML .
Your R Markdown code should look like this:

---
title: "Your Title"
output: html_document
---

After “knitting” your document, you should see something like this in the Viewer pane:

Congratulations! You’ve coded your first knitr document!

A few prerequisites
You’ll be installing these package pre-reqs:
 adabag
 arules
 C50
 dplyr
 e1071
 igraph
 mclust

One final package pre-req is printr, which is currently experimental (but I think it's fantastic!). We're including it here to generate better
tables from the code below.

In your RStudio console window, copy and paste these 2 commands:

install.packages(c("adabag", "arules", "C50", "dplyr",
                   "e1071", "igraph", "mclust"))
install.packages(
  'printr',
  type = 'source',
  repos = c('http://yihui.name/xran', 'http://cran.rstudio.com'))

Then press Enter .

Now let’s get started data mining!

C5.0
Wait… what happened to C4.5? C5.0 is the successor to C4.5 (one of the original top 10 algorithms). The author of C4.5/C5.0 claims the
successor is faster, more accurate and more robust.

Ok, so what are we doing? We’re going to train C5.0 to recognize 3 different species of irises. Once C5.0 is trained, we’ll test it with some
data it hasn’t seen before to see how accurately it “learned” the characteristics of each species.

Hang on, what’s iris? The iris dataset comes with R by default. It contains 150 rows of iris observations. Each iris observation consists of 5
columns: Sepal.Length , Sepal.Width , Petal.Length , Petal.Width and Species .
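
If you want a quick look at the data before going further, you can run these two base R functions in the console:

str(iris)    # structure of the dataset: 150 observations of 5 variables
head(iris)   # the first 6 rows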

Although we know the species for every iris, you’re going to divide this dataset into training data and test data.
Here’s the deal:

Each row in the dataset is numbered 1 through 150.

How do we start? Create a new knitr document, and title it C50 .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(C50)
library(printr)
```

Can you see how plain text is weaved in with R code?

In knitr, the R code is surrounded by triple backticks: ```{r} at the beginning and ``` at the end. This tells knitr that the text between the triple backticks is R code
and should be executed.

Hit the Knit HTML button, and you’ll have a newly generated document with the code you just added.

Note: The packages are loaded within the context of the document being “knitted” together. They will not stay loaded after knitting
completes.

Sweet! Packages are being loaded, what’s next? Now we need to divide our data into training data and test data. C5.0 is a classifier, so
you’ll be teaching it how to classify the different species of irises using the training data.

And the test data? That’s what you use to test whether C5.0 is classifying correctly.

Add this to the bottom of your knitr document:

This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

The first line takes a random 100-row sample of the numbers 1 through 150. That's what sample() does. This sample is stored in train.indeces .

The second line selects some rows (specifically, the 100 you sampled) and all columns (leaving the part after the comma empty means you want all
columns). This partial dataset is stored in iris.train .

Remember, iris consists of rows and columns. Using the square brackets, you can select all rows, some rows, all columns or some
columns.

The third line selects some rows (specifically, the rows not in the 100 you sampled) and all columns. This is stored in iris.test .
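
For example, these console one-liners illustrate the kinds of selections the square brackets allow (all base R, nothing extra to install):

iris[1:3, ]          # rows 1 through 3, all columns
iris[, "Species"]    # all rows, just the Species column
iris[-(1:100), ]     # every row except rows 1 through 100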

Hit the Knit HTML button, and now you’ve divided your dataset!

How can you train C5.0? This is the most algorithmically complex part, but it will only take you one line of R code.

Check this out:

This code trains a model based on the training data:

```{r}
model <- C5.0(Species ~ ., data = iris.train)
```

Add the above code to your knitr document.

In this single line of R code you’re doing 3 things:

1. You’re using the C5.0 function from the C50 package to create a model. Remember, a model is something that describes how observed data is
generated.
2. You’re telling C5.0 to use iris.train .
3. Finally, you’re telling C5.0 that the Species column depends on the other columns (Sepal.Width, Petal.Height, etc.). The tilde means “depends” and
the period means all the other columns. So, you’d say something like “Species depends on all the other column data.”

Hit the Knit HTML button, and now you’ve trained C5.0 with just one line of code!

How can you test the C5.0 model? Evaluating a predictive model can get really complicated. Lots of techniques are available for very
sophisticated validation: part 1, part 2a/b, part 3 and part 4.

One of the simplest approaches is: cross-validation.

What’s cross-validation? Cross-validation is usually done in multiple rounds. You’re just going to do one round of training on part of the
dataset followed by testing on the remaining dataset.

How can you cross-validate? Add this to the bottom of your knitr document:

This code tests the model using the test data:

```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

The predict() function takes your model, the test data and one parameter that tells it to predict the class (in this case, the species).

Then it attempts to predict the species based on the other data columns and stores the results in results .

How to check the results? A quick way to check the results is to use a confusion matrix.

So… what’s a confusion matrix? Also known as a contingency table, a confusion matrix allows us to visually compare the predicted
species vs. the actual species.

Here’s an example:
The rows represent the predicted species, and the columns represent the actual species from the iris dataset.

Starting from the setosa row, you would read this as:

 21 iris observations were predicted to be setosa when they were actually setosa .
 14 iris observations were predicted to be versicolor when they were actually versicolor .
 1 iris observation was predicted to be versicolor when it was actually virginica .
 14 iris observations were predicted to be virginica when they were actually virginica .

How can I create a confusion matrix? Again, this is a one-liner:

This code generates a confusion matrix for the results:

```{r}
table(results, iris.test$Species)
```

Hit the Knit HTML button, and now you see the 4 things weaved together:

1. You’ve divided this iris dataset into training and testing data.
2. You’ve created a model after training C5.0 to predict the species using the training data.
3. You’ve tested your model with the testing data.
4. Finally, you’ve evaluated your model using a confusion matrix.

Don’t sit back just yet — you’ve nailed classification, now checkout clustering…
k-means
What are we doing? As you probably recall from my previous post, k-means is a cluster analysis technique. Using k-means, we’re looking
to form groups (a.k.a. clusters) around data that “look similar.”

The problem k-means solves is:

We don’t know which data belongs to which group — we don’t even know the number of groups, but k-means can help.

How do we start? Create a new knitr document, and title it kmeans .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(stats)
library(printr)
```

Hit the Knit HTML button, and you’ll have tested the code that imports the required libraries.

Okay, what’s next? Now we use k-means! With a single line of R code, we can apply the k-means algorithm.

Add this to the bottom of your knitr document:

This code removes the Species column from the iris dataset.
Then it uses k-means to create 3 clusters:

```{r}
model <- kmeans(x = subset(iris, select = -Species), centers = 3)
```
2 things are happening in this single line:

1. The subset() function is used to remove the Species column from the iris dataset. It’s no fun if we know the Species before clustering, right?
2. Then kmeans() is applied to the iris dataset (w/ Species removed), and we tell it to create 3 clusters.

Hit the Knit HTML button, and you’ll have a newly generated document for kmeans.

How can you test the k-means clusters? Since we started with the known species from the iris dataset, it's straightforward to test how
accurate the k-means clustering is.

Add this code to the bottom of your knitr document:

This code generates a confusion matrix for the results:

```{r}
table(model$cluster, iris$Species)
```

Hit the Knit HTML button to generate your own confusion matrix.

What do the results tell us? The k-means results aren’t great, and your results will probably be slightly different.

Here’s what mine looks like:

What are the numbers along the side? The numbers along the side are the cluster numbers. Since we removed the Species column,
k-means has no idea what to name the clusters, so it numbers them.
What does the matrix tell us? Here’s a potential interpretation of the matrix:

 k-means picked up really well on the characteristics for setosa in cluster 2. Out of 50 setosa irises, k-means grouped together all 50.
 k-means had a tough time with versicolor and virginica , since they are split across clusters 1 and 3. Cluster 1
favors versicolor and cluster 3 strongly favors virginica .
 An interesting investigation would be to try clustering the data into 2 clusters rather than 3. You could easily experiment with the centers parameter
in kmeans() to see if that would work better, as in the short sketch after this list.
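
For instance, this console sketch re-runs the clustering with 2 centers and checks the result (same data and column names as above):

model2 <- kmeans(x = subset(iris, select = -Species), centers = 2)
table(model2$cluster, iris$Species)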

Does this data mining stuff work? k-means didn’t do great in this instance. Unfortunately, no algorithm will be able to cluster or classify in
every case.

Using this iris dataset, k-means could be used to cluster setosa and possibly virginica . With data mining, model testing/validation is super
important, but we're not going to be able to cover it in this post. Perhaps a future one…

With C5.0 and k-means under your belt, let’s tackle a tougher one…

Support Vector Machines


What are we doing? Just like we did with C5.0, we’re going to train SVM to recognize 3 different species of irises. Then we’ll perform a
similar test to see how accurately it “learned” the different species.

How do we start? Create a new knitr document, and title it svm .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(e1071)
library(printr)
```

Hit the Knit HTML button to ensure importing the required libraries is working.
Loading of libraries is good, what’s next? SVM is trained just like C5.0, so we need a training and test set, just like before.

Add this to the bottom of your knitr document:

This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

This should look familiar. To keep things consistent, it’s the same code we used to create training and testing data.

Hit the Knit HTML button, and you’ve divided your dataset.

How can you train SVM? Like C5.0, this is the most algorithmically complex part, but it will take us only a single line of R code.

Add this code to the bottom of your knitr document:

This code trains a model based on the training data:

```{r}
model <- svm(Species ~ ., data = iris.train)
```

As in the C5.0 code, this code tells svm 2 things:

1. Species depends on the other columns.


2. The data to train SVM is stored in iris.train .

Hit the Knit HTML button to train SVM.


Let’s test the model! We’re going to use precisely the same code as before.

Add this to your knitr document:

This code tests the model using the test data:

```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

Hit the Knit HTML button to test your SVM model.

What do the results tell us? To get a better understanding of the results, generate a confusion matrix for the results.

Add this to the bottom of your knitr document, and hit Knit HTML :

This code generates a confusion matrix for the results:

```{r}
table(results, iris.test$Species)
```

Here’s what mine looks like:

The first thing that jumps out at me is SVM kicked butt!


The only mistake it made was misclassifying a virginica iris as versicolor . Although the testing so far hasn’t been very thorough, based
on the test runs so far… SVM and C5.0 seem to do about the same on this dataset, and both do better than k-means.

Now for my favorite…

Apriori
What are we doing? We’re going to use Apriori to mine a dataset of census income in order to discover related items in the survey data.

As you probably recall from my previous post, these related items are called itemsets. The relationship among items are called association
rules.

How do we start? Create a new knitr document, and title it apriori .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r, warning = FALSE, message = FALSE}
library(arules)
library(printr)
data("Adult")
```

Hit the Knit HTML button to import the required libraries and dataset.

Loading of libraries and data is working, what’s next? Now we use Apriori!

Add this to the bottom of your knitr document:

This code generates the association rules from the dataset:

```{r}
rules <- apriori(Adult,
                 parameter = list(support = 0.4, confidence = 0.7),
                 appearance = list(rhs = c("race=White", "sex=Male"), default = "lhs"))
```

This single function call does a ton of things, so let’s break it down…

The first line tells apriori() that you'll be working on the Adult dataset and that the generated association rules should be stored in rules .

The parameter argument gives apriori() a few thresholds it needs to filter the generated rules.

As you probably recall, support is the percentage of records in the dataset that contain the related items. Here you’re saying we want at
least 40% support.

Confidence is the conditional probability of some item given you have certain other items in your itemset. You’re using 70% confidence
here.

In truth, Apriori generates a ton of itemsets and rules…

…you’ll need to experiment with support and confidence to filter for interesting rules/itemsets.

The appearance argument tells apriori() certain characteristics to look for in the association rules.

Association rules look like this {United States} => {White, Male} . You’d read this as “When I see United States, I will also see White, Male.”

There’s a left-hand side (lhs) to the rule and a right-hand side (rhs).

All this line indicates is that we want to see race=White and sex=Male on the right-hand side. The left-hand side can remain
the default (which means anything goes).

Hit the Knit HTML button, and generate the association rules!

How can you look at the rules? Looking at the association rules takes only a few lines of R code.

Add this code to the bottom of your knitr document:


This code gives us a view of the rules:

```{r}
rules.sorted <- sort(rules, by = "lift")
top5.rules <- head(rules.sorted, 5)
as(top5.rules, "data.frame")
```

The first line sorts the rules by lift.

What’s lift? Lift tells you how strongly associated the left-hand and right-hand sides are associated with each other. The higher the lift
value, the stronger the association.

Bottom line is: It’s another measure to help you filter the large number of rules/itemsets.

The second line grabs the top 5 rules based on lift.

The third line takes the top 5 rules and converts them into a data frame so that we can view them.

Here’s the top 5 rules:

What do these rules tell us? When you’re dealing with a large amount of data, Apriori gives you an alternate view of the data for
exploration.

Check out these findings:


 In the 1st rule… When we see Husband we are virtually guaranteed to see Male . Nothing surprising with this revelation. What would be interesting

is to understand why it’s not 100%.


 In the 2nd rule… It’s basically the same as the 1st rule, except we’re dealing with civilian spouses . No surprises here.
 In the 3rd and 4th rules… When we see civilian spouse , we have a high chance of seeing Male and White . This is interesting, because it
potentially tells us something about the data. Why isn’t a similar rule for Female showing up? Why aren’t rules for other races showing up in the top
rules?
 In the 5th rule… When we see US , we tend to see White . This seems to fit with my expectation, but it could also point to the way the data was
collected. Did we expect the data set to have more race=White ?

A word of caution:

Although these rules are based on the data and systematically generated, I get the feeling that there’s a bit of an art to selecting support,
confidence and even lift. Depending on what you select for these values, you may get rules that will really help you make decisions…

Alternatively…

You may get rules that mislead you.

But I have to admit, Apriori gives some interesting insights on a data set I didn’t know much about.

Here’s comes the toughest algorithm (at least for me)…

EM
What are we doing? Within the context of data mining, expectation maximization (EM) is used for clustering (like k-means!). We’re going to
cluster the irises using the EM algorithm. I found EM to be one of the more difficult to understand conceptually.

Here’s some great news though:

Despite being difficult to understand, R takes care of all the heavy lifting with one of its CRAN packages: mclust .

How do we start? Create a new knitr document, and title it em .

Add this code to the bottom of your knitr document:


This code loads the required packages:

```{r}
library(mclust)
library(printr)
```

Hit the Knit HTML button to make sure you can import the required libraries.

All set, what’s next? Now we use EM!

Add this to the bottom of your knitr document:

This code removes the Species column from the iris dataset.
Then it uses Mclust to create clusters:

```{r}
model <- Mclust(subset(iris, select = -Species))
```

This should look familiar. We’re removing the Species column (just like before), except this time we’re using Mclust() to do the clustering.

Hit the Knit HTML button, and you’ve generated a cluster model using EM.

How are Mclust and EM related? Mclust() uses the EM algorithm under the hood. In a nutshell, Mclust() tunes a set of models using EM
and then selects the one with the lowest BIC.

Hang on… what’s BIC? BIC stands for Bayesian Information Criterion. In a nutshell, given a few models, BIC is an index which measures
both the explanatory power and the simplicity of a model.

The simpler the model and the more data it can explain… the lower the BIC. The model with lowest BIC is the winner.
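
If you're curious which model Mclust settled on, these two console lines (both part of the mclust package) should show you the selected model and the BIC values it compared:

summary(model)              # the selected model and number of clusters
plot(model, what = "BIC")   # BIC for each candidate model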

How can you test the EM clusters? You can test whether clustering was effective using the same approach as in k-means.
Add this code to the bottom of your knitr document:

This code generates a confusion matrix for the results:

```{r}
table(model$classification, iris$Species)
```

Hit the Knit HTML button to generate your own confusion matrix.

Here’s what mine looks like:

What are the numbers along the left-hand side? Just like k-means clustering, the algorithm has no idea what the cluster names are. EM
found 2 clusters, so it numbered them accordingly.

Why only 2 clusters? Using the model selected by Mclust , this algorithm effectively segmented setosa from the other 2 species.

Like k-means, EM had trouble distinguishing between versicolor and virginica . While k-means had some success with virginica and to
a lesser degree versicolor , k-means made the effort to form a 3rd cluster because we told it to form 3 clusters.

A neat feature of the mclust package is to plot the model. Here you can investigate the clustering plots:

# Use this to plot the EM model
plot(model)

Check out the next algorithm, which is used by Google…

PageRank
What are we doing? We’re going to use the PageRank algorithm to determine the relative importance of objects in a graph.

What’s a graph? Within the context of mathematics, a graph is a set of objects where some of the objects are connected by links. In the
context of PageRank, the set of objects are web pages and the links are hyperlinks to other web pages.

Here’s an example of a graph:

How do we start? Create a new knitr document, and title it pagerank .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(igraph)
library(dplyr)
library(printr)
```
Hit the Knit HTML button to try importing the required libraries.

Alright, what’s next? Using the PageRank algorithm, the aim is to discover the relative importance of objects. Like k-means, Apriori and
EM, we’re not going to train PageRank.

Instead, let’s generate a random graph to do our analysis:

This code generates a random directed graph with 10 objects:

```{r}
g <- random.graph.game(n = 10, p.or.m = 1/4, directed = TRUE)
```

This single line of R code tells random.graph.game() 4 things:

1. Generate a graph with 10 objects.


2. There’s a 1 in 4 chance of a link being drawn between 2 objects.
3. Use directed links.
4. Store the graph in g .

What’s a directed link? In graphs, you can have 2 kinds of links: directed and undirected. Directed links are single directional.

For example, a web page hyperlinking to another web page is one-way. Unless the 2nd web page hyperlinks back to the 1st page, the link
doesn’t go both ways.

Undirected links go both ways and are bidirectional.

What does this graph look like? Seeing a graph is so much better than describing a graph.

Add this code to the bottom of your knitr document:

Here's what the graph looks like:

```{r}
plot(g)
```

Here’s what mine looks like:

How can I apply PageRank to this graph? With a single line of R code, you can apply PageRank to the graph you just generated.

Add this code to the bottom of your knitr document:

This code calculates the PageRank for each object:

```{r}
pr <- page.rank(g)$vector
```

The single line of R code applies the PageRank algorithm and retrieves the vector of PageRanks for the 10 objects in the graph. You can
think of a vector as a list, so we’re just retrieving a list of PageRanks.

How can I view the PageRanks? R already did the heavy lifting in order to calculate the PageRank of each object.

To view the PageRanks, add this code to the bottom of your knitr document:
This code outputs the PageRank for each object:

```{r}
df <- data.frame(Object = 1:10, PageRank = pr)
arrange(df, desc(PageRank))
```

Hit the Knit HTML button, and you’ll see the PageRanks.

The first line creates a data frame with 2 columns: Object and PageRank. You can think of a data frame as a table with rows and columns.

The Object column contains the numbers 1 through 10. The PageRank column contains the PageRanks.

Bottom line is:

Each row in the data frame represents an object with its object number and PageRank.

The second line sorts the data frame from highest PageRank to lowest PageRank.

Here’s what my data frame looks like:


What does this table mean? This table tells you the relative importance of each object in the graph.

2 things are clear:

1. Object 8 is the most relevant with a PageRank of 0.18.


2. Object 3 is the least relevant with a PageRank of 0.04.

Looking back at the original graph, this seems to be accurate. Object 8 is linked to by 3 other objects: 3, 2 and 6. Object 3 is linked to by just
object 9.

Remember, the number of objects linking to object 8 is just one component of PageRank…

…the relative importance of the objects linking to object 8 also factor into object 8’s PageRank.

Ready for a boost?

AdaBoost
What are we doing? Like C5.0 and SVM, we’re going to train AdaBoost to recognize 3 different species of irises. Then we’ll perform a
similar test on the test dataset to see how accurately it “learned” the different iris species.

How do we start? Create a new knitr document, and title it adaboost .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(adabag)
library(printr)
```

Hit the Knit HTML button to test import of the required libraries.

Loading libraries is working, what’s next? Just like before, we need a training and test set.

Add this to the bottom of your knitr document:

This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

No change here. To keep things focused on the algorithms, it’s the same code we used to create training and testing data.

Hit the Knit HTML button, and you now have a training and test dataset.
How can you train AdaBoost? Just like before, this will take us only a single line of R code.

Add this code to the bottom of your knitr document:

This code trains a model based on the training data:

```{r}
model <- boosting(Species ~ ., data = iris.train)
```

As before, this code tells adaboost 2 things:

1. Species depends on the other columns.


2. The data to train AdaBoost is stored in iris.train .

Hit the Knit HTML button to train AdaBoost.

Why does AdaBoost take longer to train? R’s default number of iterations is 100. You can modify the number of iterations using
the mfinal parameter.
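
For example, this variation trains noticeably faster by lowering the number of boosting iterations to 10 (mfinal is a documented argument of boosting()):

model <- boosting(Species ~ ., data = iris.train, mfinal = 10)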

As you’ll recall from AdaBoost in plain English, AdaBoost is trained in rounds (a.k.a. iterations).

You might be wondering:

If AdaBoost consists of an ensemble of weak learners…

What are the weak learners? boosting() is an implementation of AdaBoost.M1. This variation uses 3 weak learners: FindAttrTest,
FindDecRule and C4.5.

How do you test the model? Let’s use precisely the same code as before.

Add this to your knitr document:

This code tests the model using the test data:

```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

Hit the Knit HTML button to test your AdaBoost model.

What do the results tell us? To get a better understanding of the results, output the confusion matrix for the results.

Add this to the bottom of your knitr document, and hit Knit HTML :

This code generates a confusion matrix for the results:

```{r}
results$confusion
```

Here’s what mine looks like:

AdaBoost did well!

It only made 2 mistakes misclassifying 2 virginica irises as versicolor . A single test run isn’t a lot to thoroughly evaluate AdaBoost on
this dataset, but this is definitely a good sign.

The next algorithm might be lazy, but it’s definitely no slouch…


kNN
What are we doing? We’re going to use the kNN algorithm to recognize 3 different species of irises. As you’ll recall from my previous post,
kNN is a lazy learner and isn’t “trained” with the goal of producing a model for prediction.

Instead, kNN does a just-in-time calculation to classify new data points.

How do we start? Create a new knitr document, and title it knn .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(class)
library(printr)
```

Hit the Knit HTML button to try importing the required libraries.

All set, what’s next? Despite that you’re not “training” kNN, the algorithm still requires a training set to base its just-in-time calculations on.
We’ll need a training and test set.

Add this to the bottom of your knitr document:

This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

Hit the Knit HTML button, and you now have a training and test dataset.

How do you use kNN? In just a single call, you’ll be initializing kNN with the training dataset and testing with the test dataset.

Add this code to the bottom of your knitr document:

This code initializes kNN with the training data.
In addition, it does a test with the testing data.

```{r}
results <- knn(train = subset(iris.train, select = -Species),
               test = subset(iris.test, select = -Species),
               cl = iris.train$Species)
```

This code tells kNN 3 things:

1. The training dataset with the Species removed.


2. The test dataset with the Species removed.
3. The Species of the training dataset is specified by cl , which stands for class.

Hit the Knit HTML button to get the prediction result.

What do the results tell us? To get a better understanding of the results, output the confusion matrix for the results.

Add this to the bottom of your knitr document, and hit Knit HTML :

This code generates a confusion matrix for the results:

```{r}
table(results, iris.test$Species)
```

Here’s what mine looks like:

kNN made a few mistakes:

 It misclassified 5 virginica irises as versicolor .


 In addition, it misclassified 2 versicolor irises as virginica .

Not as great as some of the other results we’ve been seeing, but pretty decent since we’re using the default kNN settings.
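
If you'd like to see whether using more neighbors helps, knn() takes a k argument (the default is 1). For example:

results5 <- knn(train = subset(iris.train, select = -Species),
                test = subset(iris.test, select = -Species),
                cl = iris.train$Species,
                k = 5)
table(results5, iris.test$Species)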

The next algorithm may be naive, but don’t underestimate it…

Naive Bayes
What are we doing? We’re going to use the Naive Bayes algorithm to recognize 3 different species of irises.

How do we start? Create a new knitr document, and title it nb.

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(e1071)
library(printr)
```

Hit the Knit HTML button to try importing the required libraries.

Cool, what’s next? We’ll need a training and test set.

Add this to the bottom of your knitr document:

This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

Hit the Knit HTML button, and you now have a training and test dataset.

How can you train Naive Bayes? Just like before, this will take us only a single line of R code.

Add this code to the bottom of your knitr document:

This code trains a model based on the training data:

```{r}
model <- naiveBayes(x = subset(iris.train, select = -Species), y = iris.train$Species)
```

This code tells naiveBayes 2 things:

1. The data to train Naive Bayes is stored in iris.train , but we remove the Species column. This is x .
2. The Species for iris.train . This is y .

Hit the Knit HTML button to train Naive Bayes.

How do you test the model? We’ll use precisely the same code as before.

Add this to your knitr document:

This code tests the model using the test data:

```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

Hit the Knit HTML button to test your Naive Bayes model.

What do the results tell us? Let’s output the confusion matrix for the results.

Add this to the bottom of your knitr document, and hit Knit HTML :

This code generates a confusion matrix for the results:

```{r}
table(results, iris.test$Species)
```

Here’s what mine looks like:


Naive Bayes did quite well!

It made 2 mistakes misclassifying virginica irises as versicolor , and one more misclassifying a versicolor as virginica .

Last, but not least…

CART
What are we doing? In this final algorithm, we're going to use CART to recognize 3 different species of irises.

How do we start? Create a new knitr document, and title it cart .

Add this code to the bottom of your knitr document:

This code loads the required packages:

```{r}
library(rpart)
library(printr)
```

Hit the Knit HTML button to see if importing the required libraries works.

What’s next? CART is a decision tree classifier just like C5.0. We’ll need a training and test set.

Add this to the bottom of your knitr document:


This code takes a sample of 100 rows from the iris dataset:

```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```

Hit the Knit HTML button, and you now have a training and test dataset.

How can you train CART? Once again, R encapsulates the complexity of CART nicely so this will take us only a single line of R code.

Add this code to the bottom of your knitr document:

This code trains a model based on the training data:

```{r}
model <- rpart(Species ~ ., data = iris.train)
```

This code tells CART 2 things:

1. Species depends on the other columns.


2. The data to train CART is stored in iris.train .

Hit the Knit HTML button to train CART.

How do you test the model? We’ll be using predict() again to test our model.

Add this to your knitr document:

This code tests the model using the test data:

```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```

Hit the Knit HTML button to test your CART model.

What do the results tell us? The confusion matrix for the results is generated in the same way.

Add this to the bottom of your knitr document, and hit Knit HTML :

This code generates a confusion matrix for the results:

```{r}
table(results, iris.test$Species)
```

Here’s what mine looks like:

The default CART model didn’t do awesome compared to C5.0. However, this is a single test run. Performing more test runs with different
samples would be a much more reliable metric.

In this particular case, CART misclassified 1 virginica iris as versicolor and made 4 mistakes misclassifying versicolor as virginica .
