Download as pdf or txt
Download as pdf or txt
You are on page 1of 17

Lab Name: To perform basic data management tasks in modern programming language named R.

Course title: Soft Computing and Data mining Lab Total Marks: ___20_________
Practical No. 3 Date of experiment performed: ____________
Course teacher/Lab Instructor: Engr. Muhammad Usman Date of marking: ____________
Student Name:__________________________
Registration no.__________________________

Marking Evaluation Sheet

Knowledge components Domain Taxonomy Contribution Max. Obtained


level marks marks

1. Student is aware with


requirement and use of Imitation (P1) 3
apparatus involved in
experiment.
2. Student has conducted the Psychomotor 70%
experiment by practicing the Manipulate (P2) 11
hands-on skills as per
instructions.
3. Student has achieved required -
Precision (P3)
accuracy in performance.

4. Student is aware of discipline &


safety rules to follow them rules Receiving (A1) 2
Affective
during experiment.
20%
5. Student has responded well and
Respond (A2) 2
contributed affectively in
respective lab activity.
6. Student understands use of Understand.
modern programming languages Cognitive 10% 2
and software environment for (C2)
Data Mining (DM)
Total 20

Normalize
marks out of 5
(5)

Signed by Course teacher/ Lab Instructor


EXPERIMENT # 3
To perform basic data management tasks in modern programming language R.

PRE LAB TASK

Objective:
1. To be familiar with basic data management
2. To be familiar with basic data management functions of modern programming language
R including creating, recoding and renaming variables, sorting, merging and subsetting
variables and handling dates, missing values and coercion.
3. To know how to use basic data management functions of modern programming
language named R.
Theory:

1. Basic Data management:


Data provided to statisticians and data scientists is raw and unorganized data that needs to be
arranged for analysis. In R Unfortunately, getting your data in the rectangular arrangement of a
matrix or data frame is only the first step in preparing it for analysis. The fact is described in second
edition of “R in Action (pp.71-73). manning.” by Kabacoff, R. I. (2010)1 as “In my own work, as much
as 60% of the time I spend on data analysis is focused on preparing the data for analysis. I’ll go out
a limb and say that the same is probably true in one form or another for most real-world data
analysts.” Lets took a working example to understand the fact as provided by same stated author.
1.1 Example:
One of the topics that I study in my current job is how men and women differ in the ways they lead
their organizations. Typical questions might be
 Do men and women in management positions differ in the degree to which they defer to
superiors?
 Does this vary from country to country, or are these gender differences universal?
One way to address these questions is to have bosses in multiple countries rate their managers on
deferential behavior, using questions like the following:

The resulting data might resemble that in table 4.1. Each row represents the ratings given to a
manager by his or her boss.

Here, each manager is rated by their boss on five statements (q1 to q5) related to deference to
authority. For example, manager 1 is a 32-year-old male working in the US and is rated
1
https://scholar.google.com.pk/scholar?hl=en&as_sdt=0%2C5&q=r+in+action+by+robert+kabacoff&btnG=#
deferential by his boss, whereas manager 5 is a female of unknown age (99 probably indicates
that the information is missing) working in the UK and is rated low on deferential behavior.
The Date column captures when the ratings were made. Although a dataset might have dozens
of variables and thousands of observations, we’ve included only 10 columns and 5 rows to
simplify the examples. Additionally, we’ve limited the number of items pertaining to the
managers’ deferential behavior to five. In a real-world study, you’d probably use 10–20 such
items to improve the reliability and validity of the results.
In order to address the questions of interest, you must first deal with several data management
issues. Here’s a partial list:
 The five ratings (q1 to q5) need to be combined, yielding a single mean deferential
score from each manager.
 In surveys, respondents often skip questions. For example, the boss rating manager 4
skipped questions 4 and 5. You need a method of handling incomplete data. You also
need to recode values like 99 for age to missing.
 There may be hundreds of variables in a dataset, but you may only be interested in a
few. To simplify matters, you’ll want to create a new dataset with only the variables of
interest.
 Past research suggests that leadership behavior may change as a function of the
manager’s age. To examine this, you may want to recode the current values of age into
a new categorical age grouping (for example, young, middle-aged, elder).
 Leadership behavior may change over time. You might want to focus on deferential
behavior during the recent global financial crisis. To do so, you may want to limit the
study to data gathered during a specific period of time (say, January 1, 2009 to
December 31, 2009).

We’ll work through each of these issues in this lab, as well as other basic data management
tasks; creating, recoding and renaming variables, sorting, merging and subsetting variables and
handling dates, missing values and coercion, using modern programing language R.

2. Basic data management Functions in R:


Basic data management functions of modern programming language R include creating,
recoding and renaming variables, sorting, merging and subsetting variables and handling dates,
missing values and coercion. These functions are discussed below along with details of how to
use these functions in R.
2.1 Creating Variable:
In a typical research project, you’ll need to create new variables and transform existing ones.
We are familiar with creating variables. Transforming existing variables is accomplished with
statements of the form
variable <- expression
A wide array of operators and functions can be included in the expression portion of
the statement. Table 4.2 lists R’s arithmetic operators. Arithmetic operators are used
when developing formulas.
Let’s say you have a data frame named mydata, with variables x1 and x2, and you want to create
a new variable sumx that adds these two variables and a new variable called meanx that
averages the two variables. If you use the code
sumx <- x1 + x2
meanx <- (x1 + x2)/2
you’ll get an error, because R doesn’t know that x1 and x2 are from the data frame mydata. If
you use this code instead
sumx <- mydata$x1 + mydata$x2
meanx <- (mydata$x1 + mydata$x2)/2
the statements will succeed but you’ll end up with a data frame (mydata) and two separate
vectors (sumx and meanx). This probably isn’t the result you want. Ultimately, you want to
incorporate new variables into the original data frame. The following listing 4.2 provides three
separate ways to accomplish this goal.

2.2 Recording Variables:


Recoding involves creating new values of a variable conditional on the existing values of
the same and/or other variables. For example, you may want to
 Change a continuous variable into a set of categories
 Replace miscoded values with correct values
 Create a pass/fail variable based on a set of cutoff scores
To recode data, you can use one or more of R’s logical operators (see table 4.3). Logical
operators are expressions that return TRUE or FALSE.
Let’s say you want to recode the ages of the managers in the leadership dataset from the
continuous variable age to the categorical variable agecat (Young, Middle Aged, Elder). First,
you must recode the value 99 for age to indicate that the value is missing using code such as

leadership$age[leadership$age == 99] <- NA

The statement variable[condition] <- expression will only make the assignment when condition
is TRUE.
Once missing values for age have been specified, you can then use the following code to
create the agecat variable:

leadership$agecat[leadership$age > 75] <- "Elder"


leadership$agecat[leadership$age >= 55 &
leadership$age <= 75] <- "Middle Aged"
leadership$agecat[leadership$age < 55] <- "Young"

You include the data-frame names in leadership$agecat to ensure that the new variable
is saved back to the data frame.
2.3 Renaming Variables:
If you’re not happy with your variable names, you can change them interactively or
programmatically. Let’s say you want to change the variable manager to managerID and date
to testDate. You can use the following statement to invoke an interactive editor:
fix(leadership)
Then you click the variable names and rename them in the dialogs that are presented (see figure
4.1).
Programmatically, you can rename variables via the names() function. For example, consider
the data table 4.1 in Example 1.1, stored using R as below

For example, this statement


names(leadership)[2] <- "testDate"
renames date to testDate as demonstrated in the following code:
> names(leadership)
[1] "manager" "date" "country" "gender" "age" "q1" "q2"
[8] "q3" "q4" "q5"
> names(leadership)[2] <- "testDate"
> leadership
manager testDate country gender age q1 q2 q3 q4 q5
1 1 10/24/08 US M 32 54555
2 2 10/28/08 US F 45 35255
3 3 10/1/08 UK F 25 35552
4 4 10/12/08 UK M 39 3 3 4 NA NA
5 5 5/1/09 UK F 99 22121
2.4 Naming Vectors:
Sometimes it is useful to name the entries of a vector when defining a vector. For example
when defining vector of country codes, we can use the names to connect the two:
>codes <- c(italy = 380, canada = 124, egypt = 818)
codes
#> italy canada egypt
#> 380 124 818
We can also assign names using the names functions:
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country
codes
#> italy canada egypt
#> 380 124 818
2.5 Missing values:
In a project of any size, data is likely to be incomplete because of missed questions, faulty
equipment, or improperly coded data. In R, missing values are represented by the symbol NA
(not available). Unlike programs such as SAS, R uses the same missing value symbol for
character and numeric data.
R provides a number of functions for identifying observations that contain missing values. The
function is.na() allows you to test for the presence of missing values.
Assume that you have this vector:
y <- c(1, 2, 3, NA)
then function returns c(FALSE, FALSE, FALSE, TRUE):
is.na(y)
Notice how the is.na() function works on an object. It returns an object of the same size, with
the entries replaced by TRUE if the element is a missing value or FALSE if the element isn’t
a missing value.
2.5.1 Recording missing values:
As demonstrated in section 2.3, you can use assignments to recode values to missing.
In the leadership example, missing age values are coded as 99. Before analyzing this
dataset, you must let R know that the value 99 means missing in this case (otherwise,
the mean age for this sample of bosses will be way off!). You can accomplish this by
recoding the variable:
leadership$age[leadership$age == 99] <- NA

Any value of age that’s equal to 99 is changed to NA. Be sure that any missing data is
properly coded as missing before you analyze the data, or the results will be
meaningless.
2.5.2 Excluding missing values from analysis:
Once you’ve identified missing values, you need to eliminate them in some way
before analyzing your data further. The reason is that arithmetic expressions and
functions that contain missing values yield missing values.
Luckily, most numeric functions have an na.rm=TRUE option that removes missing
values prior to calculations and applies the function to the remaining values:
x <- c(1, 2, NA, 3)
y <- sum(x, na.rm=TRUE)
2.6 Date values:
Dates are typically entered into R as character strings and then translated into date variables
that are stored numerically. The function as.Date() is used to make this translation. The syntax
is as.Date(x,"input_format"), where x is the character data and input_format gives the
appropriate format for reading the date (see table 4.4).
The default format for inputting dates is yyyy-mm-dd. The statement

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

converts the character data to dates using this default format. In contrast,

strDates <- c("01/05/1965", "08/16/1975")


dates <- as.Date(strDates, "%m/%d/%Y")

reads the data using a mm/dd/yyyy format.

In the leadership dataset, date is coded as a character variable in mm/dd/yy format.


Therefore:
myformat <- "%m/%d/%y"
leadership$date <- as.Date(leadership$date, myformat)

uses the specified format to read the character variable and replace it in the data frame as a date
variable. Once the variable is in date format, you can analyze and plot the dates using the wide
range of analytic techniques.
To learn more about converting character data to dates, look at help(as.Date) and
help(strftime). To learn more about formatting dates and times, see help(ISOdatetime). The
lubridate package contains a number of functions that simplify working with dates, including
functions to identify and parse date-time data, extract date-time components (for example,
years, months, days, and so on), and perform arithmetic calculations on date-times. If you need
to do complex calculations with dates, the
timeDate package can also help.
2.7 Sorting Data:
Sometimes, viewing a dataset in a sorted order can tell you quite a bit about the data. For
example, which managers are most deferential? To sort a data frame in R, you use the order()
function.
By default, the sorting order is ascending. Prepend the sorting variable with a minus sign to
indicate descending order. The following examples illustrate sorting with the leadership data
frame.
The statement
newdata <- leadership[order(leadership$age),]
creates a new dataset containing rows sorted from youngest manager to oldest manager.
The statement
attach(leadership)
newdata <- leadership[order(gender, age),]
detach(leadership)
sorts the rows into female followed by male, and youngest to oldest within each gender.
Finally,
attach(leadership)
newdata <-leadership[order(gender, -age),]
detach(leadership)
sorts the rows by gender, and then from oldest to youngest manager within each gender.

In this example, you create a list with four components: a string, a numeric vector, a matrix,
and a character vector. You can combine any number of objects and save them as a list. You
can also specify elements of the list by indicating a component number or a name within double
brackets.
In this example, mylist[[2]] and mylist[["ages"]] both refer to the same four-element numeric
vector. For named components, mylist$ages would also work. Lists are important R structures
for two reasons. First, they allow you to organize and recall disparate information in a simple
way. Second, the results of many R functions return lists. It’s up to the analyst to pull out the
components that are needed.
2.8 Sorting Data:
If your data exists in multiple locations, you’ll need to combine it before moving forward. This
section shows you how to add columns (variables) and rows (observations) to a data frame.

2.8.1 Adding columns to a data frame:


To merge two data frames (datasets) horizontally, you use the merge() function. In
most cases, two data frames are joined by one or more common key variables (that
is, an inner join). For example,
total <- merge(dataframeA, dataframeB, by="ID")
merges dataframeA and dataframeB by ID. Similarly,
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
merges the two data frames by ID and Country. Horizontal joins like this are
typically used to add variables to a data frame.

If you’re joining two matrices or data frames horizontally and don’t need to specify
a common key, you can use the cbind() function:
total <- cbind(A, B)
This function horizontally concatenates objects A and B. For the function to work
properly, each object must have the same number of rows and be sorted in the same
order.
2.8.2 Adding rows to a data frame:

To join two data frames (datasets) vertically, use the rbind() function:
total <- rbind(dataframeA, dataframeB)
The two data frames must have the same variables, but they don’t have to be in the
same order. If dataframeA has variables that dataframeB doesn’t, then before joining
them, do one of the following:
 Delete the extra variables in dataframeA.
 Create the additional variables in dataframeB, and set them to NA (missing).
Vertical concatenation is typically used to add observations to a data frame.
2.9 Subsetting Data:
R has powerful indexing features for accessing the elements of an object. These features can
be used to select and exclude variables, observations, or both beside subsetting single vectors
of dataset using logical operations, which, match and %in% functions.
2.9.1 Selecting variables:
It’s a common practice to create a new dataset from a limited number of variables
chosen from a larger dataset. We know saw that the elements of a data frame are
accessed using the notation dataframe[row indices, column indices]. You can use
this to select variables. For example,
newdata <- leadership[, c(6:10)]
selects variables q1, q2, q3, q4, and q5 from the leadership data frame and saves
them to the data frame newdata. Leaving the row indices blank (,) selects all the
rows by default.
The statements
myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <-leadership[myvars]
accomplish the same variable selection. Here, variable names (in quotes) are
entered as column indices, thereby selecting the same columns.
Finally, you could use
myvars <- paste("q", 1:5, sep="")
newdata <- leadership[myvars]
This example uses the paste() function to create the same character vector as in
the previous example.

2.9.2 Selecting Observations:

Selecting or excluding observations (rows) is typically a key aspect of successful


data preparation and analysis. Several examples are given in the following listing.
Each of these examples provides the row indices and leaves the column indices blank
(therefore, choosing all columns). Let’s break down the line of code at B in order to
understand it:
1. The logical comparison leadership$gender=="M" produces the vector
c(TRUE, FALSE, FALSE, TRUE, FALSE).
2. The logical comparison leadership$age > 30 produces the vector c(TRUE, TRUE,
FALSE, TRUE, TRUE).
3. The logical comparison c(TRUE, FALSE, FALSE, TRUE, FALSE) & c(TRUE,
TRUE, FALSE, TRUE, TRUE) produces the vector c(TRUE, FALSE, FALSE,
TRUE, FALSE).
4. leadership[c(TRUE, FALSE, FALSE, TRUE, FALSE),] selects the first and fourth
observations from the data frame (when the row index is TRUE, the row is included;
when it’s FALSE, the row is excluded). This meets the selection criteria (men over
30).

2.9.3 Subsetting single vectors of dataset:


Subsetting of singles vectors of a dataset can performed using logical operators
given section 2.2 in Table 4.3, which, match and %in% functions. Following syntax
is used for each of these function operators. Consider the dataset named ‘murders’
in dslabs linrary with following detatils.
>str(murders)
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
1. If we want to compare murder rate against different states of above data
we can use logical operator as follow.
>murder_rate<-murders$total/Murders&population *1000000
>ind<-murders$state < 0.71
>murders$state[ind]

2. Suppose we want to look up California’s murder rate. For this type of


operation, it is convenient to convert vectors of logicals into indexes
instead of keeping long vectors of logicals. The function which tells us
which entries of a logical vector are TRUE. So we can type:
>ind <- which(murders$state == "California")
>murder_rate[ind]

3. If instead of just one state we want to find out the murder rates for
several states, say New York, Florida, and Texas, we can use the
function match. This function tells us which indexes of a second vector
match each of the entries of a first vector:
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
4. If rather than an index we want a logical that tells us whether or not each
element of a first vector is in a second, we can use the function %in%.
Let’s imagine you are not sure if Boston, Dakota, and Washington are
states. You can find out like this:
c("Boston", "Dakota", "Washington") %in% murders$state
#> [1] FALSE FALSE TRUE
LAB SESSION

Lab Task:
1. To perform basic data management tasks in modern programming language R.

Apparatus:
 Laptop
 R

Experimental Procedure:

1. How to Setup R:

1. Start-up the Microsoft Windows.


2. Open the website http://cran.r-project.org or use Pin drive to access software folder
named R-3.6.2-win.exe
3. Double click on the software folder and double click on ‘R-3.6.2-win.exe’ file and run
the setup.
4. Press next until you reach the window which ask for the key.
5. Finally chose Finish and close the installation.

2. Get started with R:

1. Start R by double-click on the R icon on your desktop. It will open following windows
in your PC as shown in image.

Fig. 1. R Startup GUI window

2. Install a package named dslabs


> install.packages("dslabs")
3. Load the example dataset stored in library named dslabs with name murders
> library(dslabs)
> data(murders)
4. Examine the dataset in detail for overview
>str(murders)
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
OR
>head(murders)
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
5. Rank the states with least to most gun murders.
>sort(murders$total)
#>[1] 2 4 5 5 7 8 11 12 12 16 19 21 22
#> [14] 27 32 36 38 53 63 65 67 84 93 93 97 97
#> [27] 99 111 116 118 120 135 142 207 219 232 246 250 286
#> [40] 293 310 321 351 364 376 413 457 517 669 805 1257
6. Order the states with their total murders.
ind <- order(murders$total)
murders$abb[ind]
#> [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT"
#> [14] "WV" "NE" "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT"
"WI"
#> [27] "DC" "OK" "KY" "MA" "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC"
#> [40] "MD" "OH" "MO" "LA" "IL" "GA" "MI" "PA" "NY" "FL" "TX" "CA"
Get the class and length of all variables
7. Select the state with max number of murders
>max(murders$total)
#> [1] 1257
>i_max <- which.max(murders$total)
>murders$state[i_max]
#> [1] "California"
>levels(murders$region)
8. Select the state with 0.71 or less murder rate:
>murder_rate<-murders$total/Murders&population *1000000
>ind<-murders$state < 0.71
>murders$state[ind]
#> [1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
#> [5] "Vermont"
9. Access the murder rate of state Mississippi from dataset.
ind <- which(murders$state == "Mississippi")
murder_rate[ind]
#> [1] 3.37
10. Access the murder rate of states New York, Flrodia and Texas.
>ind <- match(c("New York", "Florida", "Texas"), murders$state)
>murder_rate[ind]
#> [1] 2.67 3.40 3.20
Extra Credit Points:
(Follow Similar procedure as well as using PRE-LAB TASK Session data complete the tasks
provided to you as Exercise)

EXPERIMENT DOMAIN:

Domains Psychomotor (70%) Affective (20%) Cognitive


(10%)
Attributes Realization of Conducting Data Data Discipline Individual Understa
Experiment Experiment Collection Analysis Participation nd
(Receiving)
(Awareness) (Act) (Use (Perform) (Respond/
Instrument) Contribute)
Taxonomy P1 P2 P2 P2 A1 A2 C2
Level
Marks 3 5 3 3 3 1 2
distribution
LAB REPORT
Prepare the Lab Report as below:
TITLE:

OBJECTIVE:

APPARATUS:

PROCEDURE:
(Note: Use all steps you studied in LAB SESSION of this tab to write procedure and to
complete the experiment)
DISCUSSION:

Q1.: Why Basic data management is necessary for data scientists?

________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Q2.: List the basic data management function name that are commonly used in R?
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
_______________

Conclusion /Summary
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Domains Psychomotor (70%) Affective (20%) Cognitive


(10%)
Attributes Realization of Conducting Data Data Discipline Individual Understa
Experiment Experiment Collection Analysis Participation nd
(Receiving)
(Awareness) (Act) (Use (Perform) (Respond/
Instrument) Contribute)
Taxonomy P1 P2 P2 P2 A1 A2 C2
Level
Marks 3 5 3 3 2 2 2
distribution
Obtained
Marks

You might also like