Data manipulation tricks: Even better in R!

by Sharon Machlis, Online Managing Editor, Computerworld

After recently covering a session on data munging with Excel, I wanted to see how those tasks could be accomplished in R. Surely anything you can do in a spreadsheet should be doable in a platform designed for heavy-duty statistical analysis! You can download the Excel Magic PDF and sample data spreadsheet and then follow along. (The original Excel tips come from MaryJo Webster, senior data reporter with Digital First Media.)

(New to R? You can get up and running with our Beginner's Guide to R series.)

If you want to follow along, you'll first want to load data from her sample spreadsheet into R. There are several ways to do this, including: You can save each sheet to CSV and load it with R's read.csv() function. You can copy data in a spreadsheet and read.table() your clipboard, using slightly different techniques for Windows and Mac. Or, you can install and load the xlsx package within R and read data directly from Excel.

Note: If a package I reference is not already installed on your system, you'll need to install it first by using R's install.packages() function. Here's how to install the xlsx package:

    install.packages("xlsx", dependencies=TRUE)

Note that this package can be a little finicky in Windows due to Java issues.

You only need to install a package once on a system. However, in order to use it, you need to load it in each session. Here's how to load the xlsx package with library():

    library(xlsx)

If you downloaded the spreadsheet to run R code on that sample data, set your working R directory to whatever directory holds the spreadsheet. Replace DIRECTORY with your actual directory (capitalization matters):

    setwd("DIRECTORY")

If you are using the RStudio IDE for R, you can create a new RStudio project in the directory with the spreadsheet, and then automatically be switched to your working directory each time you load that project. See more about projects in RStudio.

Finally: Let's start coding!

Dates: Extract month, day and year from each date in a column
We'll start by parsing a single example date: 4/3/04. If you load in an Excel spreadsheet with dates, your dates may already be R date objects. If you've pulled in a CSV file, though, they may just be character strings. If your date is just a text string, we'd first need to turn that text into a date object and store it in a variable. The package lubridate is helpful for date parsing. Make sure to run

    install.packages("lubridate", dependencies=TRUE)

if lubridate is not already installed on your system. Then we'll load lubridate with library(lubridate) and use lubridate's mdy() function to let R know that the date format is month/day/year and not, say, the European day/month/year. We can then use lubridate functions such as year() and month() to parse the date, similar to functions in Excel:

    library(lubridate)
    ## Warning: package 'lubridate' was built under R version 3.0.3

    mydate <- mdy("4/3/04")
    #get year
    year(mydate)
    ## [1] 2004
    #get month
    month(mydate) #as number
    ## [1] 4
    month(mydate, label=TRUE) #as name of month
    ## [1] Apr
    ## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
    #get day
    day(mydate)
    ## [1] 3
    #day of week as number
    wday(mydate)
    ## [1] 7
    #day of week as name of day
    wday(mydate, label=TRUE)
    ## [1] Sat
    ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
    #number of the week
    week(mydate)
    ## [1] 14
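If lubridate isn't available, the same pieces can be pulled out with base R alone. The following is a minimal sketch, not part of the original tips: it assumes the date arrives as text in month/day/two-digit-year form, and the variable names are made up for illustration.

```r
# Parse the example date with base R only (no packages needed).
# %y reads a two-digit year; R maps 00-68 to 2000-2068.
mydate_base <- as.Date("4/3/04", format = "%m/%d/%y")

year_num  <- as.integer(format(mydate_base, "%Y"))  # four-digit year
month_num <- as.integer(format(mydate_base, "%m"))  # month as a number
day_num   <- as.integer(format(mydate_base, "%d"))  # day of the month

# %u numbers weekdays 1 (Monday) through 7 (Sunday), independent of the
# system language. Note lubridate's wday() instead starts at Sunday = 1.
iso_weekday <- as.integer(format(mydate_base, "%u"))

print(c(year = year_num, month = month_num, day = day_num,
        iso_weekday = iso_weekday))
```

The format() approach returns text, which is why each piece is wrapped in as.integer() before being used as a number.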

Calculating ages and other date arithmetic

The lone tricky thing about doing date arithmetic in R is making sure you've got your data in the correct date format for the package and function you decide to use. The eeptools package has an extremely handy and elegant age_calc() function. It requires R Date objects as input, which are easy to create with base R's as.Date() function. When using as.Date(), you just need to remember to tell R the format of your character string, such as %m/%d/%y for mm/dd/yy and %m/%d/%Y for mm/dd/yyyy. There's a list of how to describe some common date formats at Quick-R.

In this test, we'll calculate how many days of summer there are between Memorial Day (May 26) and Labor Day (Sept. 1) in 2014. Remember to install eeptools with install.packages("eeptools") if it's not already on your system, then load it with library(eeptools):

    library(eeptools)
    ## Loading required package: ggplot2
    ## Loading required package: MASS
    ## Warning: package 'MASS' was built under R version 3.0.3
    ## Loading required namespace: car

    MemorialDay <- as.Date("5/26/2014", format="%m/%d/%Y") #Create date object for Memorial Day
    LaborDay <- as.Date("9/1/2014", format="%m/%d/%Y") #Create date object for Labor Day
    #The difference between the two dates in units of days:
    summerdays <- age_calc(MemorialDay, LaborDay, units="days")
    summerdays #This variable is an object of class difftime
    ## Time difference of 98 days
    #To see the number of days as an integer, use as.integer()
    as.integer(summerdays)
    ## [1] 98

Calculating ages, as MaryJo did in her Excel sheet, is even easier with age_calc(), because if no second date is given, the function defaults to the current system date. So, you don't even have to explicitly state you want today's date to calculate someone's age as of today, as you need to do in Excel.

    dob <- as.Date("2/4/1982", format="%m/%d/%Y") #Create a test date of birth as a date object
    #Find today's age as of when I wrote this script and saved the results to an html file:
    age <- age_calc(dob, units="years")
    #Round off the age to whole years with the floor() function
    wholeyears <- floor(age)

    #Now let's try this with an entire column of birth dates from MaryJo's spreadsheet.
    #Read in data from the ExcelTricks2014 Dates worksheet using xlsx package
    library(xlsx)
    ## Loading required package: rJava
    ## Loading required package: xlsxjars
    testdates <- read.xlsx("ExcelTricks2014.xlsx", sheetName="Dates")
    #What does the structure of that testdates object look like?
    str(testdates)
    ## 'data.frame': 58 obs. of 9 variables:
    ##  $ Player.: Factor w/ 58 levels "Adrian Awasom",..: 47 33 14 2 40 49 30 55 23 36 ...
    ##  $ Pos.   : Factor w/ 11 levels "Center ","Defensive Back ",..: 7 7 7 8 8 8 8 8 11 11 ...
    ##  $ Status.: Factor w/ 2 levels "Active ","Out ": 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ Ht.    : Factor w/ 12 levels "5 10 ","5 11 ",..: 6 8 6 5 1 7 8 5 5 4 ...
    ##  $ Wt.    : num 215 220 229 217 191 240 258 237 190 204 ...
    ##  $ DOB.   : Date, format: "1985-07-02" "1986-11-14" ...
    ##  $ DATEDIF: logi NA NA NA NA NA NA ...
    ##  $ YEAR   : logi NA NA NA NA NA NA ...
    ##  $ WEEKDAY: logi NA NA NA NA NA NA ...

    #Excellent, the DOB column was already read in as date objects!
    #We want her DATEDIF column to have the ages:
    testdates$DATEDIF <- round(age_calc(testdates$DOB., units="years"))
    #While we're at it, let's add year and weekday columns
    testdates$YEAR <- year(testdates$DOB.)
    testdates$WEEKDAY <- wday(testdates$DOB., label=TRUE)
    #We can add week numbers per MaryJo's discussion of seeking patterns in the data
    testdates$WEEKNUMs <- strftime(testdates$DOB., format="%W")
    table(testdates$WEEKDAY, testdates$WEEKNUMs)

    ##         00 02 03 04 05 06 07 08 11 12 13 14 15 17 18 19 21 23 24 25 26 27
    ##   Sun    0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  1  0  0  1  0  0  0
    ##   Mon    0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
    ##   Tues   1  0  1  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  2  0
    ##   Wed    0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  1  0  0  0
    ##   Thurs  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0
    ##   Fri    0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  1
    ##   Sat    0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  1  0  0  0  0
    ##
    ##         28 29 30 31 32 33 34 35 37 38 39 40 41 43 44 45 46 50 51 52
    ##   Sun    0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0
    ##   Mon    0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
    ##   Tues   1  0  0  0  0  0  1  0  0  0  0  0  0  1  1  0  0  1  1  0
    ##   Wed    0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  1  0  0  0  0
    ##   Thurs  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0  1  1  0  2  1
    ##   Fri    0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
    ##   Sat    0  1  0  0  2  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0

    #or just a frequency table for days of the week:
    table(testdates$WEEKDAY)
    ##
    ##   Sun   Mon  Tues   Wed Thurs   Fri   Sat
    ##     8     3    13     7    11     6    10

    #And let's save that to a new spreadsheet
    write.xlsx(testdates, "ExcelToR.xlsx", sheetName = "Dates")

A final note about dates: The lubridate mdy() function creates an object of the class POSIXct. If you are familiar with Unix (or some other programming languages that handle POSIX dates), you can probably already guess what that means: POSIXct stores the date as the number of seconds since January 1, 1970. If you try to print this object, it will show in R as a human-readable date such as 2004-04-03 UTC, but don't be fooled: It's not actually an R Date object. So, R date arithmetic functions that require an object of class Date may not work, because they're being handed the wrong type of object. (Newer lubridate releases return Date objects from these functions by default, so check class(mydate) on your system.)

lubridate's mdy() and ymd() functions parse most date-like character strings into POSIXct objects, not R Date objects. You can turn POSIXct objects into R Date objects with as.Date():

    mydate <- mdy("2/28/14")
    mydateAsDate <- as.Date(mydate)

Does it annoy you that R takes two steps (or one more complex single step) to do something that Excel does in a single function? Well, that's the beauty of using a scripting language: If you don't want to repeat multiple lines of code, write your own function to simplify it. Here's one way to create a function called myDateFunc() that combines the lubridate mdy() and base R's as.Date() into a simpler single line of code:

    myDateFunc <- function(dateliketext){
      #Reminder: This requires text to be in some month-date-year format
      require(lubridate) #Load the lubridate package
      thedate <- mdy(dateliketext) #Create a POSIXct object from the date-like string
      thedate <- as.Date(thedate) #Turn the POSIXct object into a Date object
      thedate #Return the Date object
    }

Voila! Now if we want to create a date object from February 28, 2014, we just run a single line of code using the new function:

    mynewdate <- myDateFunc("February 28, 2014")
    #See what that mynewdate object looks like:
    print(mynewdate)
    ## [1] "2014-02-28"
    #Check the class of mynewdate
    class(mynewdate)
    ## [1] "Date"

You can put that function in a separate file (mynewdate.R, for example) and then add the code

    source("mynewdate.R")

to your script file. That tells the script file to run all the code in the mynewdate.R file. (This assumes that the file is in your working directory. If not, just include the full path to the file, such as C:/Rscripts/mynewdate.R.)
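For comparison, here is a small base R sketch of the same date arithmetic, plus a check of the Date versus POSIXct distinction. It is an illustration under simple assumptions (dates as month/day/year text, UTC for the timestamp), not a replacement for age_calc(); the variable names are invented.

```r
# Days of summer using plain Date subtraction.
memorial_day <- as.Date("5/26/2014", format = "%m/%d/%Y")
labor_day    <- as.Date("9/1/2014",  format = "%m/%d/%Y")

# Subtracting two Date objects gives a difftime; as.integer() drops the units.
summer_days <- as.integer(labor_day - memorial_day)

# A POSIXct value prints like a date but stores seconds since 1970-01-01.
stamp <- as.POSIXct("2014-02-28", tz = "UTC")
stamp_as_date <- as.Date(stamp)  # convert to a true Date object

print(summer_days)
print(class(stamp))
print(class(stamp_as_date))
```

Checking class() before doing arithmetic is a cheap way to catch the "looks like a date but isn't a Date" problem.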

Text functions: Search and substring extraction

R's substr() function performs the same task as Excel's LEFT and MID, using the syntax substr(thestring, start, stop) where start and stop are integers. So, substr("Computerworld", 1, 8) would return "Computer": It slices the string starting at position 1 and stopping at position 8.

Searching is much more robust in R than in Excel, in part because R can use powerful regular expressions. In addition, there are many ways to handle and process strings in R.

One demonstrated Excel task was to extract a two-letter state abbreviation from a city and state when there's no comma separating them but you know the state is always the last two letters of the character string, using LEFT and MID. We can use the same technique to find the last two letters of "New York NY" by finding the length of the string with nchar() and the position one character before the end of the string with nchar() - 1, like so:

    mytext <- "New York NY"
    substr(mytext, nchar(mytext) - 1, nchar(mytext))
    ## [1] "NY"
    #Get the rest of the string before the space and two-letter state abbreviation:
    mytext <- "New York NY"
    substr(mytext, 1, nchar(mytext)-3)
    ## [1] "New York"

As you probably guessed, nchar() returns the number of characters in a character string, including spaces.

If you are familiar with regular expressions, you can also search for a more complex pattern than "last two characters," such as "all characters except a space and the last two letters." Base R handles regular expressions, but I find the stringr package to be more convenient for some text operations, including matching regular expressions with str_match():

    library(stringr)
    #This pattern says the first group in parentheses is "everything up until a space
    #and two capital letters."
    #The second group in parentheses is "two capital letters."
    mypattern <- "(.*?) ([A-Z]{2})"
    parsed <- str_match(mytext, mypattern)

    #The first column of the parsed object contains the entire match. The second column
    #is the first group, that is, the match just within the first parentheses,
    #which in this case is the city.
    #The third column is the match within the second parentheses, in this case the state.
    parsed
    ##      [,1]          [,2]       [,3]
    ## [1,] "New York NY" "New York" "NY"

To perform this task on the sample spreadsheet, we can read in data from the CityState worksheet and then populate the blank CITY and STATE columns:

    cities <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "CityState")
    parsed <- str_match(cities$CITY.STATE, mypattern)
    #The second column of parsed has all the matches of the first group,
    #in this case, everything before space and 2 capital letters
    cities$CITY <- parsed[,2]
    cities$STATE <- parsed[,3]
    #Append this as a sheet to our new ExcelToR.xlsx spreadsheet
    write.xlsx(cities, "ExcelToR.xlsx", sheetName = "CityState", append = TRUE)
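The same city/state split can be sketched in base R with sub() and backreferences. This is an illustrative alternative, not the method used above; the sample strings other than "New York NY" are made up, and the pattern assumes every value ends in a space plus exactly two capital letters.

```r
# Hypothetical sample values; only the first comes from the example above.
city_state <- c("New York NY", "St. Paul MN", "Salt Lake City UT")

# ^ and $ anchor the pattern to the whole string.
# Group 1 is everything before the final space; group 2 is the two capitals.
pattern <- "^(.*) ([A-Z]{2})$"

city  <- sub(pattern, "\\1", city_state)  # keep only group 1
state <- sub(pattern, "\\2", city_state)  # keep only group 2

print(data.frame(city_state, city, state))
```

Anchoring with $ means a stray pair of capitals earlier in the string can't be mistaken for the state.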

Text functions: Search and replace

For Excel's SUBSTITUTE, which replaces old text with new text, there is base R's

    gsub("pattern to search for", "pattern to replace it with", CharacterString)

and stringr's

    str_replace_all(CharacterString, "pattern to search for", "pattern to replace it with")

Pick a syntax and structure you like, and off you go. To create a new column in a data frame named df that removes "PUBLIC SCHOOL DISTRICT" from a SchoolDistricts column, you could run stringr's str_replace_all() function on the SchoolDistricts column:

    df$SchoolDistrictsEdited <- str_replace_all(df$SchoolDistricts, "PUBLIC SCHOOL DISTRICT", "")
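One detail worth seeing in action: both gsub() and str_replace_all() treat the search text as a regular expression by default. The sketch below uses base R's fixed = TRUE to search for literal text instead, and trimws() to drop the space left behind. The district names are invented for illustration.

```r
# Hypothetical district names for illustration only.
districts <- c("ANOKA PUBLIC SCHOOL DISTRICT",
               "EDINA PUBLIC SCHOOL DISTRICT",
               "ST. PAUL SCHOOLS")

# fixed = TRUE matches the search text literally, not as a regex.
edited <- gsub("PUBLIC SCHOOL DISTRICT", "", districts, fixed = TRUE)
edited <- trimws(edited)  # remove the trailing space left by the deletion

# Why fixed matters: as a regex, "." means "any character".
dot_as_regex   <- gsub(".", "", "ST. PAUL")                # removes everything
dot_as_literal <- gsub(".", "", "ST. PAUL", fixed = TRUE)  # removes only the period

print(edited)
print(dot_as_regex)
print(dot_as_literal)
```

For plain find-and-replace jobs like SUBSTITUTE, literal matching avoids surprises when the search text contains characters such as ".", "$" or "(".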

Misc text functions

For Excel's EXACT to see if two strings are identical, base R has identical().

For Excel's LEN(text) to get the length of a string, base R has nchar() and stringr has str_length().

For Excel's REPT(text, number) to repeat a text string a certain number of times, the stringr package has str_dup().

To capitalize the first letter of each word, the help file for the existing toupper() function suggests writing and loading your own function, here called titleCase:

    titleCase <- function(x) {
      s <- strsplit(x, " ")[[1]]
      paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
    }
    titleCase("hello there, world!")
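Because titleCase() takes the first element of strsplit(), it handles one string at a time. A possible vectorized variant, sketched here with a hypothetical name title_case_all(), uses gsub() with perl = TRUE, where "\\U" in the replacement upper-cases what follows it. It assumes words are separated by whitespace and start with a plain lowercase a-z letter.

```r
# Hypothetical helper: capitalize the first letter of every word
# in each element of a character vector.
title_case_all <- function(x) {
  # Group 1: start of string or a whitespace character (kept as-is).
  # Group 2: the lowercase letter that follows (upper-cased via \\U).
  gsub("(^|\\s)([a-z])", "\\1\\U\\2", x, perl = TRUE)
}

result <- title_case_all(c("hello there, world!", "data manipulation tricks"))
print(result)
```

This version can be applied directly to a whole data frame column, for example df$name <- title_case_all(df$name) for some character column name.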

Using a wildcard search

Because R supports regular expressions, wildcard searching is a bit simpler than Excel's somewhat convoluted

    =IF(ISERROR(SEARCH("Texas",B4,1)>0)=FALSE,"X","")

which adds a column marking with an X all rows where one column includes "Texas". In R, you can just use an if-else statement such as:

    ifelse(str_detect(mycolumn, "Texas"), "X", "")

However, there isn't always a need to add a column to do this, since you can easily filter a data frame by searching for a string within a column. Here's some code to find all rows that include the phrase "WESTERN DISTRICT" from the BasicIF tab:

    #Note we need to tell R to start reading on row 5 here
    #because rows 1-4 are not part of the table
    df <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "BasicIF", stringsAsFactors = FALSE, startRow=5, header=TRUE)
    #Now we just want rows where the SUBDEPT includes the phrase "WESTERN DISTRICT"
    justWestern <- subset(df, str_detect(SUBDEPT, "WESTERN DISTRICT"))
    #Check the first 20 rows & first 5 columns
    head(justWestern[,1:5], n=20)
    ##      LASTNAME  FIRSTNAME DEPT SUBDEPT                  YRS.EXP
    ## 9    ANDERSON  NEIL      SPPD WESTERN DISTRICT - SOUTH 3
    ## 10   ANDERSON  ALLEN     SPPD WESTERN DISTRICT-NORTH   5
    ## 15   ANDERSON  STEVE     SPPD WESTERN DISTRICT-NORTH   18
    ## 16   ANDERSON  ERIC      SPPD WESTERN DISTRICT         20
    ## 17   ARNOLD    THOMAS    SPPD WESTERN DISTRICT - SOUTH 14
    ## 27   BAILEY    SARA      SPPD WESTERN DISTRICT-NORTH   20
    ## 33   BARABAS   MICHAEL   SPPD WESTERN DISTRICT-NORTH   19
    ## 40   BAUMHOFER AMY       SPPD WESTERN DISTRICT - SOUTH 15
    ## 50   BENNETT   CONSTANCE SPPD WESTERN DISTRICT         4
    ## 51   BENNETT   BRUCE     SPPD WESTERN DISTRICT-NORTH   13
    ## 55   BITNEY    TERRANCE  SPPD WESTERN DISTRICT         6
    ## 60   BOERGER   DARRYL    SPPD WESTERN DISTRICT - SOUTH 4
    ## 62   BOHN      TIM       SPPD WESTERN DISTRICT - SOUTH 11
    ## 69   BOYLE     JEFFERY   SPPD WESTERN DISTRICT-NORTH   20
    ## 76   BRODT     MARY      SPPD WESTERN DISTRICT - SOUTH 1
    ## 80   BROWN     ANTHONY   SPPD WESTERN DISTRICT-NORTH   17
    ## 96   CARTER    MICHAEL   SPPD WESTERN DISTRICT         16
    ## 104  CHERRY    LYNETTE   SPPD WESTERN DISTRICT - SOUTH 12
    ## 111  CLEVELAND ...       ...  ...                      1
    ## 115  CONROY    ...       ...  ...                      16
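The flag-or-filter choice described above can also be sketched with base R's grepl(), which returns TRUE or FALSE for each element. The small data frame here is invented for illustration; the column names are assumptions, not part of the sample spreadsheet.

```r
# Hypothetical data for illustration.
offices <- data.frame(
  name     = c("Ann", "Bob", "Cy"),
  location = c("Austin, Texas", "Albany, New York", "North Texas office"),
  stringsAsFactors = FALSE
)

# grepl() tests each value for the search text; fixed = TRUE means literal text.
has_texas <- grepl("Texas", offices$location, fixed = TRUE)

# Option 1: add a marker column, as the Excel formula does.
offices$flag <- ifelse(has_texas, "X", "")

# Option 2: skip the marker and keep only the matching rows.
just_texas <- offices[has_texas, ]

print(offices)
print(just_texas)
```

Storing the logical vector once (has_texas) lets the same test drive both the marker column and the filter.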

If statements
Excel's basic IF statement is similar to R's ifelse(): Both use the format (logical test, result if true, result if false). This code can determine whether a home or visiting team won a game based on points scored by each, using the sample spreadsheet's "More BasicIF" worksheet:

    scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF")
    str(scores)
    ## 'data.frame': 256 obs. of 8 variables:
    ##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
    ##  $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ Visit.Team : Factor w/ 32 levels "ARI","ATL","BAL",..: 22 1 10 14 3 13 15 18 19 26 ...
    ##  $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
    ##  $ Home.Team  : Factor w/ 32 levels "ARI","ATL","BAL",..: 32 11 7 8 25 17 5 12 4 16 ...
    ##  $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
    ##  $ Winner     : logi NA NA NA NA NA NA ...
    ##  $ WinTeam    : logi NA NA NA NA NA NA ...

    #The team names are coming in as "factors" and not characters.
    #We'll re-import the data, this time adding stringsAsFactors = FALSE
    #to the function arguments:
    scores <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "More BasicIF", stringsAsFactors = FALSE)
    str(scores)
    ## 'data.frame': 256 obs. of 8 variables:
    ##  $ Date       : Date, format: "2003-09-04" "2003-09-07" ...
    ##  $ WeekNum    : num 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ Visit.Team : chr "NYJ" "ARI" "DEN" "IND" ...
    ##  $ Visit.Score: num 13 24 30 9 15 21 23 30 0 14 ...
    ##  $ Home.Team  : chr "WAS" "DET" "CIN" "CLE" ...
    ##  $ Home.Score : num 16 42 10 6 34 20 24 25 31 27 ...
    ##  $ Winner     : logi NA NA NA NA NA NA ...
    ##  $ WinTeam    : logi NA NA NA NA NA NA ...

    #That's better.
    #Note that if there's a space in a column name, R converts it to a period
    #Use an ifelse statement to find whether the home or visiting team had more points
    #and thus won the game
    scores$Winner <- ifelse(scores$Home.Score > scores$Visit.Score, "Home", "Visitor")
    #Find out which team had more points
    scores$WinTeam <- ifelse(scores$Home.Score > scores$Visit.Score, scores$Home.Team, scores$Visit.Team)
    #Save to our new spreadsheet
    write.xlsx(scores, "ExcelToR.xlsx", sheetName = "MoreBasicIF", append = TRUE)

As with Excel IF, R ifelse statements can be nested.
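As a sketch of what nesting looks like, the example below adds a third outcome for tied games, which the two-way test above would label "Visitor". The scores are invented; the column names mirror the worksheet's.

```r
# Hypothetical scores for illustration: a home win, a visitor win and a tie.
games <- data.frame(Home.Score = c(16, 10, 20), Visit.Score = c(13, 30, 20))

# The "result if false" of the outer ifelse() is itself another ifelse().
games$Winner <- ifelse(games$Home.Score > games$Visit.Score, "Home",
                ifelse(games$Home.Score < games$Visit.Score, "Visitor", "Tie"))

print(games)
```

Each extra condition adds one more level of nesting, so beyond three or four outcomes a lookup table or cut() tends to be easier to read.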

Deal with data where column headers are rows within the data
Look at the Copy Down tab on the ExcelTricks2014.xlsx spreadsheet, and you'll see the problem: There's a single row with the name of a team, the players on that team, the name of a second team, a list of players on that team and so on. This interspersing of categories and values means that if you do any sorting or aggregating of that column, you'll no longer know which player is on what team. What's needed is a way to add a new column identifying which team each player is on.

I'm sure there's a more elegant "R way" to do this, but here I'll use a simple for loop instead. For loops are discouraged in R, with vectorized functions preferred. However, those of us with experience in languages where loops are common do find them a handy go-to.

    #Read player data into R
    players <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Copy Down", stringsAsFactors = FALSE)
    #See the structure of the data
    str(players)
    ## 'data.frame': 451 obs. of 3 variables:
    ##  $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
    ##  $ Position: chr NA "DB" "DB" "DT" ...
    ##  $ NA.     : chr NA NA "" "" ...

    #Not sure what that NA. column is about, but we can get rid of it by setting it to NULL
    players$NA. <- NULL
    str(players)
    ## 'data.frame': 451 obs. of 2 variables:
    ##  $ Name    : chr "Arizona Cardinals" "Starks, Duane" "Stone, Michael" "Ransom, Derrick" ...
    ##  $ Position: chr NA "DB" "DB" "DT" ...

    #That's better
    #To see if a value is missing in R, use the is.na() function.
    #Here we'll create a new column called Team.
    #If there's no value in the Position column, we'll use the value of the players$Name column.
    #If there is a value in the Position column,
    #we'll use the value of the Team column one row higher.
    players$Team <- NA #Create the empty column before filling it row by row
    for(i in seq_len(nrow(players))){
      players$Team[i] <- ifelse(is.na(players$Position[i]), players$Name[i], players$Team[i-1])
    }
    #We can delete rows with the team names by using the handy na.omit(dataframe) function;
    #that will eliminate all rows in a data frame that have at least one missing value.
    players <- na.omit(players)
    #Here we can add the reformatted data to our new spreadsheet.
    #Don't forget append=TRUE or the spreadsheet will be overwritten.
    write.xlsx(players, "ExcelToR.xlsx", sheetName = "CopyDownTeams", append=TRUE)
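One possible vectorized take on the same copy-down task, offered as a sketch rather than the definitive "R way": count how many team rows have appeared so far with cumsum(), then use that count to pick the team name. It assumes the first row is a team row and that team rows are exactly the ones with a missing Position. "Example Team" and "Doe, Pat" are made-up values.

```r
# Small stand-in for the Copy Down sheet; the last two rows are hypothetical.
players_demo <- data.frame(
  Name     = c("Arizona Cardinals", "Starks, Duane", "Stone, Michael",
               "Example Team", "Doe, Pat"),
  Position = c(NA, "DB", "DB", NA, "QB"),
  stringsAsFactors = FALSE
)

is_team_row <- is.na(players_demo$Position)   # TRUE on team-name rows
team_names  <- players_demo$Name[is_team_row] # the team names, in order

# cumsum() turns TRUE/FALSE into a running count: 1 1 1 2 2.
# Indexing team_names by that count copies each team name down its block.
players_demo$Team <- team_names[cumsum(is_team_row)]

# Drop the team-name rows now that every player carries a team.
players_demo <- players_demo[!is_team_row, ]

print(players_demo)
```

No loop is needed because the running count already says which team block each row belongs to.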


Functions by groups: SUMIF and COUNTIF equivalents

This is one of many areas where R shines over Excel: grouping items for any purpose, not just subtotals or counts. Assuming all the data is in rows 2 to 424, the team name is in column c and the salaries are in column e, the Excel tip was to use

    =sumif(c2:c424, "Dallas Mavericks", e2:e424)

to get just the Mavericks' total and

    =sumif(Salaries!c2:Salaries!c424, a3, Salaries!$e2:Salaries!e$424)

to get subtotals by all teams where team names are in column a of another worksheet. I prefer something that's not hardcoded with total row numbers and that's more easily reproducible on slightly different data.

In R, there are numerous ways to apply functions to a data set by group. My current favorite is the relatively new dplyr package for R because of its consistent and (to me) fairly human-readable functions.

Since salary data isn't included in the sample spreadsheet, I'm going to load a short table of top 40 salaries from ESPN using the incredibly handy readHTMLTable() function in R's XML package. Note that I'm just scraping the top 40 and not all the salaries to save time. Also note that it is indeed possible to scrape and clean data from the Web using R :-).

    #Load in data from table 1 at ESPN with XML package's readHTMLTable()
    library(XML)
    url <-
    salaries <- readHTMLTable(url, stringsAsFactors = FALSE, which=1, header=TRUE)
    #which=1 above means load the first table on the page
    str(salaries)
    ## 'data.frame': 43 obs. of 4 variables:
    ##  $ RK    : chr "1" "2" "3" "4" ...
    ##  $ NAME  : chr "Kobe Bryant, SG" "Dirk Nowitzki, PF" "Amar'e Stoudemire, PF" "Joe Johnson, SG" ...
    ##  $ TEAM  : chr "Los Angeles Lakers" "Dallas Mavericks" "New York Knicks" "Brooklyn Nets" ...
    ##  $ SALARY: chr "$30,453,805" "$22,721,381" "$21,679,893" "$21,466,718" ...

    #It's necessary to remove dollar signs and commas
    #to turn SALARY character strings into integers for R
    #Removes dollar sign:
    salaries$SALARY <- str_replace_all(salaries$SALARY, "\\$", "")

    #A handy decomma() function in the eeptools package removes commas and turns the
    #numerical character strings into numbers
    salaries$SALARY <- decomma(salaries$SALARY)
    ## Warning: NAs introduced by coercion
    #Rows that don't contain numbers will appear in R as NA; we can remove those rows with na.omit()
    salaries <- na.omit(salaries)
    #Now that we have the data, time to sum and count top salaries by team,
    #and let's add mean and median for good measure:
    library(dplyr)

    #Probably self-explanatory:
    #Create a new variable salaries_grouped_by_team that uses dplyr's group_by() function
    #to group the salaries data by the TEAM column
    salaries_grouped_by_team <- group_by(salaries, TEAM)
    #This creates new columns in a new variable, summaries_by_team, with summaries by group:
    summaries_by_team <- summarise(salaries_grouped_by_team, sums = sum(SALARY), count = n(), average = mean(SALARY), median = median(SALARY))
    #Finally, the arrange() function sorts by sums descending:
    arrange(summaries_by_team, desc(sums))
    ## Source: local data frame [22 x 5]
    ##
    ##    TEAM                   sums     count average  median
    ## 1  Brooklyn Nets          67699917 4     16924979 16899732
    ## 2  New York Knicks        57169384 3     19056461 21388953
    ## 3  Miami Heat             56808000 3     18936000 19067500
    ## 4  Los Angeles Lakers     49739655 2     24869828 24869828
    ## 5  Oklahoma City Thunder  44876533 3     14958844 14693906
    ## 6  Golden State Warriors  40746632 3     13582211 13878000
    ## 7  Los Angeles Clippers   35109931 2     17554966 17554966
    ## 8  Houston Rockets        34214428 2     17107214 17107214
    ## 9  Memphis Grizzlies      33098856 2     16549428 16549428
    ## 10 Chicago Bulls          32932688 2     16466344 16466344
    ## 11 Minnesota Timberwolves 26793906 2     13396953 13396953
    ## 12 Charlotte Bobcats      26700000 2     13350000 13350000
    ## 13 Dallas Mavericks       22721381 1     22721381 22721381
    ## 14 Toronto Raptors        17888932 1     17888932 17888932
    ## 15 Portland Trail Blazers 14878000 1     14878000 14878000
    ## 16 Phoenix Suns           14487500 1     14487500 14487500
    ## 17 Indiana Pacers         14283844 1     14283844 14283844
    ## 18 New Orleans Pelicans   14283844 1     14283844 14283844
    ## 19 Cleveland Cavaliers    14275000 1     14275000 14275000
    ## 20 Detroit Pistons        13500000 1     13500000 13500000
    ## 21 Washington Wizards     13000000 1     13000000 13000000
    ## 22 San Antonio Spurs      12500000 1     12500000 12500000

    #If we want just the Dallas Mavericks and Miami Heat data,
    #pick one from several syntax options that you like best:
    mydata <- subset(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
    #Or
    mydata <- summaries_by_team[summaries_by_team$TEAM=="Dallas Mavericks" | summaries_by_team$TEAM=="Miami Heat",]
    #Or dplyr's filter()
    mydata <- filter(summaries_by_team, TEAM=="Dallas Mavericks" | TEAM=="Miami Heat")
    mydata
    ## Source: local data frame [2 x 5]
    ##
    ##   TEAM             sums     count average  median
    ## 1 Dallas Mavericks 22721381 1     22721381 22721381
    ## 2 Miami Heat       56808000 3     18936000 19067500
    #Likewise you can easily see which teams have at least 3 players in this list
    subset(summaries_by_team, count >= 3)
    ## Source: local data frame [5 x 5]
    ##
    ##    TEAM                  sums     count average  median
    ## 1  Brooklyn Nets         67699917 4     16924979 16899732
    ## 7  Golden State Warriors 40746632 3     13582211 13878000
    ## 13 Miami Heat            56808000 3     18936000 19067500
    ## 16 New York Knicks       57169384 3     19056461 21388953
    ## 17 Oklahoma City Thunder 44876533 3     14958844 14693906

R also has round() and rank() functions. rank() gives the numerical rank by whatever column you want, while a stackexchange thread suggested this easy function for percentile rank:

    perc.rank <- function(x) trunc(rank(x))/length(x)
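For readers without dplyr or eeptools installed, the SUMIF and COUNTIF equivalents can be sketched in base R. This is an illustration under stated assumptions: the first two salary strings appear in the ESPN output above, while the other two rows are invented, and the cleaning step assumes every value is a dollar figure with commas.

```r
# Tiny stand-in for the scraped table; rows 3 and 4 are hypothetical.
salaries_demo <- data.frame(
  TEAM   = c("Brooklyn Nets", "Dallas Mavericks", "Brooklyn Nets", "Miami Heat"),
  SALARY = c("$21,466,718", "$22,721,381", "$1,000,000", "$19,067,500"),
  stringsAsFactors = FALSE
)

# Strip dollar signs and commas in one pass, then convert text to numbers.
salaries_demo$SALARY <- as.numeric(gsub("[$,]", "", salaries_demo$SALARY))

# SUMIF for one team: sum the salaries where TEAM matches.
mavs_total <- sum(salaries_demo$SALARY[salaries_demo$TEAM == "Dallas Mavericks"])

# SUMIF for every team at once: tapply() applies sum() within each TEAM.
team_sums <- tapply(salaries_demo$SALARY, salaries_demo$TEAM, sum)

# COUNTIF for every team: table() counts rows per TEAM.
team_counts <- table(salaries_demo$TEAM)

print(mavs_total)
print(team_sums)
print(team_counts)
```

Swapping sum for mean or median in the tapply() call gives the other per-team summaries without any extra bookkeeping.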

Lookup tables
I confess: I have indeed used combinations of VLOOKUP, INDEX and MATCH in Excel to look up the value of a key on one worksheet to insert a related value in another. However, in general I'm not a fan of trying to use Excel as a relational database unless there's a good reason for keeping my data in Excel (such as I'm sharing a spreadsheet with colleagues who don't use MySQL or R).

With several different robust lookup options, R is a much better tool than Excel for using lookup tables. One choice: You can run SQL commands on a data frame with the sqldf package, much like running SQL queries on a relational database. Another option: The data.table package, which comes highly recommended for its speed with large data sets, creates index keys for data frames and offers many join options. Finally, there are several R functions that offer SQL-like joins, such as dplyr's inner join and left join options and base R's merge() function. (Some options require the common column in each table to have the same name.) You can read more about all these options in this stackoverflow thread.

In the Excel Tricks example, there is a Lookups table with a fipscty column that holds a numerical code for each county. She wants to add the county name to this worksheet (a separate table, Lookup2, has a list of all the codes and county names).

I'll use dplyr's left_join() to accomplish this task (several other techniques will work well too). Why a left join? That's a SQL database term which means join two tables by one or more common columns, keeping all the rows in the left table (here, left means the first one mentioned in the join statement) and adding whatever matches there are from the right table.
Here's the code:

    #Read in data from spreadsheet
    Lookups <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookups", startRow = 2, stringsAsFactors = FALSE)
    Lookup2 <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Lookup2", startRow = 2, stringsAsFactors = FALSE)
    #I am going to rename the first column in the lookup table to match the name in the Lookups table
    names(Lookup2)[1] <- "fipscty"
    #One line of code adds the county name from Lookup2 to the Lookups table
    Lookups <- left_join(Lookups, Lookup2, by="fipscty")
    #Check our results
    head(Lookups)
    ##   fipstate fipscty Tot.Employ An.Payroll Num.Estab. n1_4 n5_9 n10_19
    ## 1       27     085      16223     522051        987  495  231    126
    ## 2       27     135       7721     215072        414  240   84     43
    ## 3       27     129       5517     133031        556  337   99     76
    ## 4       27     127       4742     113020        572  348  115     60
    ## 5       27     125        835      18330        109   74   17      8
    ## 6       27     143       3050      71132        399  258   76     36
    ##   n20_49 n50_99 n100_249 n250_499 n500_999 n1000 COUNTY.NAME   County
    ## 1     86     30       13        4        0     2          NA   McLeod
    ## 2     34      9        1        1        0     2          NA   Roseau
    ## 3     29      8        5        1        1     0          NA Renville
    ## 4     32     10        7        0        0     0          NA  Redwood
    ## 5      7      2        1        0        0     0          NA Red Lake
    ## 6     20      5        3        1        0     0          NA   Sibley


    #Get rid of the blank COUNTY.NAME column
    Lookups$COUNTY.NAME <- NULL

Note that left_join() will work regardless of where the columns are located within the tables, unlike VLOOKUP in Excel.
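The same lookup can be sketched with base R's merge(), mentioned earlier as another join option. Setting all.x = TRUE keeps every row of the first table, which makes it behave like a left join. The two small tables below are stand-ins: the first two codes and the three county names come from the output above, while code "999" is invented to show what an unmatched key looks like.

```r
# Stand-in for the Lookups sheet; "999" is a hypothetical code with no match.
employment <- data.frame(
  fipscty    = c("085", "135", "999"),
  Tot.Employ = c(16223, 7721, 10),
  stringsAsFactors = FALSE
)

# Stand-in for the Lookup2 sheet of codes and county names.
counties <- data.frame(
  fipscty = c("085", "135", "129"),
  County  = c("McLeod", "Roseau", "Renville"),
  stringsAsFactors = FALSE
)

# all.x = TRUE: keep all rows of employment, filling County with NA
# where the code has no match in counties.
joined <- merge(employment, counties, by = "fipscty", all.x = TRUE)

print(joined)
```

One difference to be aware of: merge() sorts the result by the key column by default, whereas left_join() keeps the row order of the left table.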

Copying down a date versus date sequence

In Excel, if you click and drag a date down a column, Excel will assume you want to increment the date by 1 each row. So you need a special technique to copy the same date down a column.

In R, if you want a column to all be one date, you just assign it, such as:

    df$mycolumn <- as.Date("2014-03-21")

With the code above, every row in the df data frame will have the date value 2014-03-21 in the mycolumn column.

But what if you want the default Excel behavior in R: adding one day with each row down a column? Use the seq() function. For instance, to get a date sequence of 15 days incremented by 1 day:

    seq(as.Date("2014-03-21"), by="day", length.out=15)
    ##  [1] "2014-03-21" "2014-03-22" "2014-03-23" "2014-03-24" "2014-03-25"
    ##  [6] "2014-03-26" "2014-03-27" "2014-03-28" "2014-03-29" "2014-03-30"
    ## [11] "2014-03-31" "2014-04-01" "2014-04-02" "2014-04-03" "2014-04-04"

If you want to do this for a data frame column, you have to tell seq() how many items your column needs. You can do this by changing the hard-coded number 15 from the example above to the number of rows in your data frame using the nrow() function:

    df$mycolumn <- seq(as.Date("2014-03-21"), by="day", length.out=nrow(df))

You can create date sequences by week, month, quarters and years as well.
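A short sketch of those other step sizes, using the same start date. The variable names are made up; the by values are the ones base R's seq() accepts for Date objects.

```r
start_date <- as.Date("2014-03-21")

# Same date repeated, like copying a date down a column.
same_date <- rep(start_date, 3)

# Sequences stepping by week, month and quarter.
by_week    <- seq(start_date, by = "week",    length.out = 3)
by_month   <- seq(start_date, by = "month",   length.out = 3)
by_quarter <- seq(start_date, by = "quarter", length.out = 2)

print(same_date)
print(by_week)
print(by_month)
print(by_quarter)
```

Stepping by month or quarter keeps the day of the month fixed, which is usually what a reporting schedule needs.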

Using column names

In Excel, you need to explicitly create names in a spreadsheet in order to use column names in formulas. In R, you can use either the column name or its numerical index position.
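A minimal sketch of both styles of column reference, using an invented data frame (the names df_demo, team and salary are illustrative only):

```r
# Hypothetical data frame for illustration.
df_demo <- data.frame(team = c("A", "B", "C"), salary = c(10, 20, 30),
                      stringsAsFactors = FALSE)

by_dollar <- df_demo$salary       # column by name with $
by_name   <- df_demo[["salary"]]  # column by name as a string
by_index  <- df_demo[[2]]         # column by numerical position

# Column names work directly in calculations, no named ranges required.
total_salary <- sum(df_demo$salary)
df_demo$doubled <- df_demo$salary * 2

print(total_salary)
print(df_demo)
```

Referring to columns by name tends to be safer than by position, since the code keeps working if columns are later reordered.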

Reshaping data
The sample spreadsheet features an example of Affordable Health Care premium data where each plan's row had age group data across multiple columns. The desired format was to have one plan price per age group per row, not many age groups in a row. This means the data needs to be reshaped. In R lingo, we want to reshape the data frame from wide to long.

Webster demonstrated a very useful free add-in for Excel from Tableau to perform this kind of reshaping. To use the Tableau reshaping add-in for Excel, all the columns you want to be moved down from being column headers must be on the right side of your spreadsheet; and all the columns you want to keep as column headers must be on the left. In addition, you need to manually open the sheet and click on the correct cell: fine if you're working on a one-time project, but less ideal if this is data you process frequently (or if you want others to be able to easily reproduce and check your work). With an R script, the columns can be in any order and a script that's written once can be run from a batch file.

Please see my detailed explanation of Reshaping: Wide to long (and back) in R for a full run-through of this type of reshaping. But in brief, you want to use the reshape2 package and tell it which column headers you want to move down so they're no longer separate columns. In other words, if a data frame had column headers for young, middle age and old with a price for each but you wanted only one price per row, you'd want to move those three column headers into one new variable column, perhaps called something like age group.

To go from wide to long you use reshape2's melt() function and tell melt either which columns you want to move into a new variable column or which columns you want to stay as ID variables and not move. In this sample data, there are far fewer ID variables that don't need to move than there are column variables that do need to move, so I'll specify the id variables. In addition, we have the option of naming what we want the variable column and value column to be called, which I'll do below:

    library(reshape2)
    widedata <- read.xlsx("ExcelTricks2014.xlsx", sheetName = "Reshaper", header = TRUE)
    #id.vars are the columns to keep as column headers.
    #There's then no need to identify all the age group column headers that are moving
    #from being column headers to being part of a new agegroup column.
    #premium is the value column.
    reshaped <- melt(widedata,
      id.vars = c("Company", "PlanName", "Metal", "RatingArea", "RateAreatxt"),
      variable.name = "agegroup", value.name = "premium")
    #Check results
    head(reshaped)
    ##      Company           PlanName  Metal RatingArea RateAreatxt agegroup
    ## 1 All Savers Lowest cost Silver Silver          3   Colorado3   X0.20.
    ## 2 All Savers Lowest cost Silver Silver          7   Colorado7   X0.20.
    ## 3 All Savers Lowest cost Silver Silver          8   Colorado8   X0.20.
    ## 4 All Savers Lowest cost Silver Silver          9   Colorado9   X0.20.
    ## 5 All Savers Lowest cost Silver Silver         10  Colorado10   X0.20.
    ## 6      Cigna Lowest cost Silver Silver          3   Colorado3   X0.20.
    ##   premium
    ## 1   189.1
    ## 2   186.5
    ## 3   196.8
    ## 4   245.8
    ## 5   242.7
    ## 6   158.1

    #There's an X in front of all the age groups because R columns can't start with a number
    #and those columns in the spreadsheet all started with numbers.
    #In addition the - in 0-20 was turned into a period because - is not a legal character
    #for an R data frame column name.
    #If that bothers us, we can use some search-and-replace strategies we learned above
    #to remove X from the age groups and return the - to 0-20:
    reshaped$agegroup <- str_replace_all(reshaped$agegroup, "X", "")
    reshaped$agegroup <- str_replace_all(reshaped$agegroup, "0.20.", "0-20")
    #We can see how many unique values of reshaped$agegroup there are with unique()
    unique(reshaped$agegroup)
    ##  [1] "0-20"          "21"            "22"            "23"
    ##  [5] "24"            "25"            "26"            "27"
    ##  [9] "28"            "29"            "30"            "31"
    ## [13] "32"            "33"            "34"            "35"
    ## [17] "36"            "37"            "38"            "39"
    ## [21] "40"            "41"            "42"            "43"
    ## [25] "44"            "45"            "46"            "47"
    ## [29] "48"            "49"            "50"            "51"
    ## [33] "52"            "53"            "54"            "55"
    ## [37] "56"            "57"            "58"            "59"
    ## [41] "60"            "61"            "62"            "63"
    ## [45] "64.and.other."

    #I'll change "64.and.other." to "64+"
    reshaped$agegroup <- str_replace_all(reshaped$agegroup, "64.and.other.", "64+")
    #Check unique values again
    unique(reshaped$agegroup)
    ##  [1] "0-20" "21"   "22"   "23"   "24"   "25"   "26"   "27"   "28"   "29"
    ## [11] "30"   "31"   "32"   "33"   "34"   "35"   "36"   "37"   "38"   "39"
    ## [21] "40"   "41"   "42"   "43"   "44"   "45"   "46"   "47"   "48"   "49"
    ## [31] "50"   "51"   "52"   "53"   "54"   "55"   "56"   "57"   "58"   "59"
    ## [41] "60"   "61"   "62"   "63"   "64+"
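To make the wide-to-long idea concrete without reshape2, here is a hand-rolled base R sketch on a tiny invented table. It is meant to show what melt() is doing conceptually, not to replace it; the plan names and prices are made up, and the column names young and old echo the example in the text.

```r
# Hypothetical wide table: one row per plan, one column per age group.
wide <- data.frame(
  PlanName = c("Plan A", "Plan B"),
  young    = c(100, 110),
  old      = c(300, 330),
  stringsAsFactors = FALSE
)

id_col       <- "PlanName"                   # column that stays put
measure_cols <- setdiff(names(wide), id_col) # column headers to move down

# Stack the measure columns on top of each other, repeating the ID values
# once per moved column and recording which header each value came from.
long <- data.frame(
  PlanName = rep(wide[[id_col]], times = length(measure_cols)),
  agegroup = rep(measure_cols, each = nrow(wide)),
  premium  = unlist(wide[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```

Two plans times two age groups gives four rows, each holding a single premium, which is the one-price-per-row layout described above.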

For lots more on using R, see The Beginner's Guide to R and 4 Data Wrangling Tasks for Advanced Beginners.

Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter at sharon000, on Google or by subscribing to her RSS feeds: articles and blogs.


