Training Manual For Data Analytics Using R

TRAINING MANUAL
for
Data Analytics With R

---------------------------------------------------------------
Prepared by Mr Majaliwa John

Assistant Lecturer
Eastern Africa Statistical Training Centre

(EASTC)
Table of Contents
Module 1: introduction to statistics, data and data analytics
1.0 Statistics and Data concepts
1.1 What is statistics?
1.2 Branches of statistics
1.2.1 Descriptive Statistics
1.2.2 Inferential Statistics
1.3 Confidence and Significance Levels
1.4 Major sources of statistics
1.4.1 Primary sources
1.4.2 Secondary Sources
1.5 Measures of Central Tendency
1.6 Measures of Dispersion
1.7 Data (raw information)
1.8 Variable
1.8.1 Qualitative variable
1.8.2 Quantitative variable
1.9 BIG DATA: What is it?
1.9.1 Characteristics of Big Data
1.9.2 Big Data Sources
1.10 HOW Big data is ACTUALLY USED
1.11 Big Data Challenges
1.12 Software for big data
1.13 Data analytics (DA)
1.14 Tools for data analytics
1.15 Types of data analytics applications
1.16 Stages of Data analysis
Module 2 : Getting started with R
2.1 R Software Definition
2.2 R Studio Definition
2.3 Why R?
2.4 Installation of R and RStudio
2.5 Install and Load Packages
2.6 Rstudio interface
2.7 DATA OPERATORS IN R
2.8 R Comments
2.9 Data Types in R
2.10 Variables in R
2.11 R objects
2.12 WORKING DIRECTORY IN R
2.13 Getting data into Rstudio
Module 3 : Data Management and Manipulation
3.1 What is data management
3.2 Activities involved in data Management
3.3 How to perform data management in R
Module 4 : Descriptive statistics
4.1 What is descriptive statistics?
4.2 How to perform Descriptive statistics in R
Module 5 : Inferential statistics
5.1 What is inferential statistics
5.2 Types of Inferential Statistics
5.2.1 Hypothesis Testing
5.2.2 CORRELATION
5.2.3 How to perform correlation in R
5.2.4 Chi-Square Test
5.2.5 How to Perform a Chi square in R
5.2.6 The t-Test
5.2.7 How to perform t - tests in R
5.2.8 One-Way ANOVA
5.2.9 How to perform one way anova in R
5.2.10 REGRESSION ANALYSIS
5.2.11 HOW to perform linear regressions in R
References
List of Appendixes
Appendix 1: 1_Getting Started WIth R.R
Appendix 2: 2_Data Management.R
Appendix 3: 3_Descriptive Statistics.R
Appendix 4: 4_correlation and chi-square.R
Appendix 5: 5_t – tests and anova.R
Appendix 6: 6_Linear models.R
Module 1: introduction to statistics, data and data analytics
1.0 Statistics and Data concepts
1.1 What is statistics?
Statistics is a way of getting information from data. It is both an art and a science of deciding
which data are appropriate to collect, how to collect them efficiently, and how to use them to give
information (answer questions and make decisions). It turns numbers into information and
helps in making decisions when there is uncertainty. It is a problem-solving process that seeks
answers to questions through data. It is therefore a tool for creating understanding from a set
of numbers.
1.2 Branches of statistics
The two main branches of statistics are descriptive statistics and inferential statistics.
1.2.1 Descriptive Statistics
These are methods of organizing, summarizing, and presenting data in a convenient and
informative way. They include graphical techniques and numerical techniques. The
actual method used depends on what information we would like to extract: are we interested in
measure(s) of central location, measure(s) of variability (dispersion), or both?
1.2.2 Inferential Statistics
Descriptive statistics describe the data set being analyzed, but do not allow us to draw
conclusions or make inferences beyond those data. Hence we need another branch of
statistics: inferential statistics.
Inferential statistics is also a set of methods, but it is used to draw conclusions or inferences
about characteristics of populations based on data from a sample. Statistical inference is the
process of making an estimate, prediction, or decision about a population based on a sample.
However, such conclusions and estimates are not always going to be correct. For this reason, we
build into the statistical inference “measures of reliability,” namely the confidence level and
the significance level.
1.3 Confidence and Significance Levels
The confidence level is the proportion of times that an estimating procedure will be correct. For
example, a confidence level of 95% means that estimates based on this form of statistical
inference will be correct 95% of the time.
When the purpose of the statistical inference is to draw a conclusion about a population, the
significance level measures how frequently the conclusion will be wrong in the long run. For
example, a 5% significance level means that, in the long run, this type of conclusion will be
wrong 5% of the time.
If we use α (Greek letter “alpha”) to represent significance, then our confidence level is 1 – α.
This relationship can also be stated as:
Confidence Level + Significance Level = 1
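The relationship above can be tried out in R. The following is a minimal sketch, assuming a small made-up sample of heights (the values are illustrative only and do not come from this manual). It uses the base R function t.test() to build a confidence interval for the mean at confidence level 1 – α.

```r
# Illustrative sample (hypothetical heights in cm)
heights <- c(160, 165, 158, 172, 169, 175, 163, 170)

alpha <- 0.05              # significance level
conf_level <- 1 - alpha    # confidence level

# t.test() returns, among other things, a confidence interval for the mean
result <- t.test(heights, conf.level = conf_level)
ci <- result$conf.int
print(ci)

# The confidence level used is stored as an attribute of the interval
attr(ci, "conf.level")
```

Changing alpha to 0.01 gives a 99% interval, which is wider: being correct more often comes at the price of a less precise estimate.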
1.4 Major sources of statistics
1.4.1 Primary sources.
Data may be collected specifically for the purpose at hand. Such data are known as primary data.
The facts and figures about a population collected in censuses and surveys are
primary data. The great advantage of such data is that exactly the information wanted is obtained.
1.4.2 Secondary Sources.
Often data are taken from the reports of other institutions and organizations; such data are
referred to as secondary data. For example, industrial production figures may be taken from
reports published by the industries.
1.5 Measures of Central Tendency
The term measure of central tendency refers to values that are computed in an attempt to
characterize the central part of a frequency distribution. The arithmetic mean,
median and mode are the most frequently used.
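The three measures can be computed in R as sketched below, using a made-up vector of scores (illustrative values only). The helper stat_mode() is a hypothetical name introduced for this example: base R has mean() and median(), but no built-in function for the statistical mode (the built-in mode() reports how an object is stored, not the most frequent value).

```r
# Illustrative scores (hypothetical values)
scores <- c(4, 7, 7, 8, 10, 12, 7, 9)

mean(scores)      # arithmetic mean
median(scores)    # middle value of the sorted data

# Count how often each value occurs and keep the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(scores)
```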
1.6 Measures of Dispersion
Variance - the mean of all squared deviations from the mean. A deviation is the amount by which
a score differs from the mean of the distribution, that is, how far the score is from the
mean.
Standard Deviation - a measure of the dispersion or variation in a distribution, equal to the
square root of the arithmetic mean of the squares of the deviations from the arithmetic mean. The
further the values lie from the average, the larger the standard deviation.
Range - the difference between the lowest and highest values. The range tells you something about
how spread out the data are. Data with large ranges tend to be more spread out.
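These measures can be sketched in R as follows, again with made-up scores. One point to be aware of: the definition above divides by the number of values n, whereas the base R functions var() and sd() divide by n - 1 (the sample versions), so the two results differ slightly.

```r
# Illustrative scores (hypothetical values)
scores <- c(4, 7, 7, 8, 10, 12, 7, 9)
n <- length(scores)

# Variance as defined above: mean of the squared deviations from the mean
pop_var <- mean((scores - mean(scores))^2)
pop_sd  <- sqrt(pop_var)

# var() and sd() divide by n - 1 instead of n (sample variance and sample SD)
sample_var <- var(scores)
sample_sd  <- sd(scores)

# range() returns the lowest and highest values; diff() gives their difference
data_range <- diff(range(scores))

print(c(pop_var = pop_var, sample_var = sample_var, range = data_range))
```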
1.7 Data (raw information)
• Data are facts that become useful information when organized in a meaningful way or when
entered into a computer.
• Data are also defined as information organized for analysis.
• Data can be qualitative or quantitative in nature.
1.8 Variable
• A variable is a characteristic being studied. Examples: ages of people, heights of children,
educational level, etc.
• There are two types of variables, namely qualitative and quantitative.
1.8.1 Qualitative variable
This is a variable whose values are categories and are identified simply by noting their presence.
For example: the color of an object, the sex of an individual, etc.
1.8.2 Quantitative variable

This consists of numerical values. For example: weight of coffee, height of individuals, volume of
sales, etc. There are two types of quantitative variables, namely continuous variables and
discrete variables.
Continuous variable – exists if there are no breaks in the possible values. For example:
distance, weight, height, etc.
Discrete variable – exists if there are breaks between successive possible values. For
example: number of cows, number of people, number of bags of coffee, etc.
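As a rough illustration of how these variable types are commonly represented in R (the variable names and values below are made up for this sketch), a qualitative variable can be stored as a factor, a continuous variable as a numeric vector, and a discrete variable as an integer vector.

```r
# Qualitative variable: stored as a factor (a fixed set of categories)
sex <- factor(c("Female", "Male", "Female", "Male"))

# Continuous quantitative variable: numeric, decimals allowed
weight_kg <- c(61.5, 72.3, 58.0, 80.2)

# Discrete quantitative variable: integer counts (the L suffix marks integers)
number_of_cows <- c(3L, 0L, 12L, 5L)

class(sex)
class(weight_kg)
class(number_of_cows)
levels(sex)          # the categories of the qualitative variable
```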
1.9 BIG DATA: What is it?
“Big data is defined as large amount of data which requires new technologies and architectures
so that it becomes possible to extract value from it…”
1.9.1 Characteristics of Big Data
• High-Volume: Huge amount of data is being generated
• Velocity: Data is being generated at a very high speed
• Variety: different forms and types of data are being produced
• Veracity: The produced data is messy (not clean)
• Variability: Data flow is not constant
1.9.2 Big Data Sources
• People Data
- Social media, online surveys, blogs
• Machine Data
- “Smart devices”, such as satellites
• Organization Data
- Medical records, commercial transactions and utility payments
1.10 HOW Big data is ACTUALLY USED
• As a source of real-time statistics
• Understanding and targeting customers better
• Improving health
• Improving security and law enforcement
• The applications of big data are endless.
1.11 Big Data Challenges
• Privacy and security
- Personal information about a person, when combined with large external data sets,
can lead to the inference of new private facts about that person.
- Big data used by law enforcement increases the chance that certain tagged
people will suffer adverse consequences without the ability to fight back, or
even knowing that they are being discriminated against.
• Data access and sharing of information
- If data are to be used to make accurate and timely decisions, they must be
available in an accurate, complete and timely manner.
• Storage and processing issues
- Many companies struggle to store the large amounts of data they produce.
- Processing a large amount of data also takes a lot of time.
• Knowledge gap
- Many companies have few or no employees with the skills required to extract
value from big data.
• Lack of competent instructors
- Data science is a new and challenging field, which makes it difficult for academic
institutions to build the capacity of their academic staff.
1.12 Software for big data
Traditional software such as Excel and SPSS could not cope with analyzing very large amounts of
data, which triggered the development of more powerful software for big data, such as R and
Python. In this course, R is the software that we will learn.
1.13 Data analytics (DA)
Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain. The purpose of data analytics is to extract
useful information from data and to make decisions based on the analysis. A simple everyday
example: whenever we make a decision, we think about what happened last time or about what
is likely to happen if we choose a particular option.
1.14 Tools for data analytics
Increasingly, data analytics is done with the aid of specialized systems and software. Data
analyst tools is a term used to describe the software and applications that data analysts use
to develop and perform analytical processes that help companies make better, informed
business decisions while decreasing costs and increasing profits. Examples of
statistical and analytical tools include SPSS, Stata, MATLAB, Minitab, R, Python, SAS,
Microsoft Excel and many more. A range of software tools is available,
and each offers something slightly different to the user. What you choose will depend on a range
of factors, including your research question, knowledge of statistics, experience of coding and
availability of the software (free or commercial). For this course we will use R.
1.15 Types of data analytics applications
At a high level, data analytics methodologies include exploratory data analysis (EDA) and
confirmatory data analysis (CDA). EDA aims to find patterns and relationships in data, while
CDA applies statistical techniques to determine whether hypotheses about a data set are true or
false.
Data analytics can also be separated into quantitative data analysis and qualitative data
analysis. The former involves the analysis of numerical data with quantifiable variables. These
variables can be compared or measured statistically. The qualitative approach is more
interpretive -- it focuses on understanding the content of non-numerical data like text, images,
audio and video, as well as common phrases, themes and points of view.
Advanced types of data analytics include data mining, which involves sorting through large data
sets to identify trends, patterns and relationships. Another is predictive analytics, which seeks to
predict customer behavior, equipment failures and other future business scenarios and events.
Machine learning can also be used for data analytics, by running automated algorithms to churn
through data sets more quickly than data scientists can do via conventional analytical modeling.
Big data analytics applies data mining, predictive analytics and machine learning tools to data
sets that can include a mix of structured, unstructured and semi-structured data. Text mining
provides a means of analyzing documents, emails and other text-based content.
1.16 Stages of Data analysis
Like any scientific discipline, data analysis follows a rigorous step-by-step process. Each stage
requires different skills and know-how. The stages are: defining the question, collecting
the data, cleaning the data, analyzing the data and sharing the results. Consider the figure below.
1. Step one: Defining the question
The first step in any data analysis process is to define your objective. Defining your objective
means coming up with a hypothesis and figuring out how to test it. For instance, your
organization’s senior management might pose an issue, such as: “Why are we losing
customers?” or “Which factors are negatively impacting the customer experience?” or better
yet: “How can we boost customer retention while minimizing costs?”
Now that you have defined a problem, you need to determine which sources of data will best help
you solve it.
2. Step two: Collecting the data
Once you have established your objective, you will need to create a strategy for collecting
and aggregating the appropriate data. A key part of this is determining which data you need.
This might be quantitative (numeric) data, e.g. sales figures, or qualitative (descriptive) data,
such as customer reviews.
3. Step three: Cleaning the data
Once you have collected your data, the next step is to get it ready for analysis. This means
cleaning, or ‘scrubbing’ it, and is crucial in making sure that you are working with high-quality
data. Key data cleaning tasks include:
• Removing major errors, duplicates, and outliers: all of which are inevitable problems
when aggregating data from numerous sources.
• Removing unwanted data points: extracting irrelevant observations that have no
bearing on your intended analysis.
• Bringing structure to your data: general ‘housekeeping’, i.e. fixing typos or layout
issues, which will help you map and manipulate your data more easily.
• Filling in major gaps: as you are tidying up, you might notice that important data are
missing. Once you have identified gaps, you can go about filling them.
A good data analyst will spend around 70-90% of their time cleaning their data. This might
sound excessive. But focusing on the wrong data points (or analyzing erroneous data) will
severely impact your results.
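The cleaning tasks listed above can be sketched in base R as follows. The data frame raw, its column names and its values are all made up for this illustration, and filling a gap with the median is only one possible choice (an assumption of this sketch, not a rule). Outlier handling is left out to keep the example short.

```r
# Hypothetical raw data with typical problems
raw <- data.frame(
  id      = c(1, 2, 2, 3, 4),
  region  = c("North", "south ", "south ", "NORTH", "East"),
  sales   = c(120, 95, 95, NA, 140),
  comment = c("ok", "", "", "late", "ok"),
  stringsAsFactors = FALSE
)

# 1. Remove duplicates: keep only the first copy of each repeated row
clean <- raw[!duplicated(raw), ]

# 2. Remove unwanted data: drop a column that is not needed for the analysis
clean$comment <- NULL

# 3. Bring structure: trim stray spaces and use consistent capitalisation
clean$region <- trimws(clean$region)
clean$region <- paste0(toupper(substr(clean$region, 1, 1)),
                       tolower(substring(clean$region, 2)))

# 4. Fill gaps: replace missing sales with the median of the observed values
clean$sales[is.na(clean$sales)] <- median(clean$sales, na.rm = TRUE)

print(clean)
```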
4. Step four: Analyzing the data
Depending on what insights you are hoping to gain from your data, you may apply different
types of data analysis: descriptive analysis, which identifies what has already
happened; diagnostic analysis, which focuses on understanding why something has
happened; predictive analysis, which allows you to identify future trends based on historical
data; or prescriptive analysis, which allows you to make recommendations for the future.
5. Step five: Sharing your results
After you have finished carrying out your analyses, you have your insights. The final step of the
data analytics process is to share these insights with the wider world (or at least with your
organization’s stakeholders!). It involves interpreting the outcomes, and presenting them in a
manner that is digestible for all types of audiences. Since you will often present information to
decision-makers, it is very important that the insights you present are 100% clear and
unambiguous. For this reason, data analysts commonly use reports, dashboards, and interactive
visualizations to support their findings.
Module 2 : Getting started with R
2.1 R Software Definition
R is a programming language and software environment for statistical analysis, graphical
representation and reporting. It was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Core Team. It is
freely available under the GNU General Public License, and pre-compiled binary versions are
provided for various operating systems such as Linux, Windows and Mac. The language was
named R after the first letter of the first names of its two authors (Ross Ihaka
and Robert Gentleman).
R and its libraries implement a wide range of statistical and graphical techniques, including
descriptive statistics, inferential statistics, and regression analysis. Another strength of R is
that it can produce publication-quality graphs and charts, and can use packages like ggplot2 for
advanced graphs.
2.2 R Studio Definition
RStudio is a free and open-source integrated development environment (IDE) for R. It includes a
console, a syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging and workspace management. RStudio is available in open source and
commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to
RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
An IDE is software that offers comprehensive facilities, such as a code editor, compiler and
debugger, to help developers write programs. RStudio assists developers in writing R scripts by
providing all the required tools in one software package. The RStudio interface is organized so
that the user can clearly view graphs, data tables, R code, and output all at the same time. It
also offers an import wizard that allows users to import CSV, Excel, SAS (*.sas7bdat),
SPSS (*.sav), and Stata (*.dta) files into R without having to write the code to do so.
You can use R without using RStudio, but you can't use RStudio without using R, so R comes
first.
JJ Allaire, creator of the programming language ColdFusion, founded RStudio. Hadley Wickham
is the Chief Scientist at RStudio.
2.3 Why R?
There are many programming languages available for data science, like R, Python, SAS, Java,
and more. There are many data science software packages to learn, such as SPSS Statistics, SPSS
Modeler, SAS Enterprise Miner, Tableau, RapidMiner, Weka, GATE, and more. Learning R for
statistics is recommended because it was developed for statistics in the first place. R
programming is very strong in statistics, so it is ideal for data exploration or data understanding
using descriptive statistics, inferential statistics, regression analysis, and data visualizations. R is
also ideal for modeling because you can use statistical learning like regressions for predictive
analytics. R also has packages for data mining, text mining, and machine learning. In addition,
R is free and has a large community that provides support. It has many packages (libraries of
functions) that can be used to solve different problems.
2.4 Installation of R and RStudio
(You must install R before you can install RStudio)

In order to run R scripts, you must first install the R programming command line application
(R base). You can download it from www.r-project.org/.
Figure 3: website for downloading R programming command line application (R base)
Click download and select the CRAN mirror closest to you, then download the version of R
that suits your operating system.
Figure 4: downloading R programming command line application (R base) for different operating
systems
To install the software, double-click the downloaded setup file and follow the instructions of the
installer to install the R programming command line application, as seen below:
Figure 5: Installation of R
After the R programming command line application is installed, you can start it, as seen below:
Figure 1: R programming command line application (R base) graphical interface
RStudio is the most popular IDE for the R programming language. RStudio helps you write R
programming code more easily and more productively. To download and install RStudio, visit
www.rstudio.com/, as seen below:
Figure 6: The RStudio IDE website
After downloading the RStudio installer or setup file, double-click the file to install the RStudio
IDE, as seen below:
Figure 7: Installation of the RStudio IDE
After installing the RStudio IDE, you can run the RStudio IDE software, as seen below:
Figure 8: The RStudio IDE interface

Before running the script, you need to select the R programming command line application
version to use. Click Tools ➤ Global Options, as seen below:
Figure 9: The RStudio IDE’s Tools menu
Click the Change button to select the R version, as seen below:
Figure 10: RStudio IDE options
2.5 Install and Load Packages
Packages already included in the initial installation may not have functions that you need. In R,
you can easily install and load additional packages provided by other users.
1. To install packages:
You can either use the install.packages() function
install.packages("foreign") # install 'foreign' package
or
click Tools > Install Packages. Write the package name in the dialog, then click Install.
2. To Load Packages
Once you have installed a package, you need to load it so that it becomes available to use. Simply
use the library() function
library(foreign) # load 'foreign' package
PACKAGES TO INSTALL FOR THIS COURSE
readr, dplyr, ggplot2, Rcmdr, openxlsx, lubridate, WDI and Data360r
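One way to check which of the course packages still need to be installed is sketched below. The variable names are made up for this example, and the last package name is written in lowercase ("data360r") on the assumption that this is how it is listed on CRAN; adjust the spelling if install.packages() reports that the package is not available. The install line is commented out because it needs an internet connection.

```r
# Packages listed for this course
course_packages <- c("readr", "dplyr", "ggplot2", "Rcmdr",
                     "openxlsx", "lubridate", "WDI", "data360r")

# Which of them are not yet installed on this computer?
installed <- rownames(installed.packages())
missing_packages <- setdiff(course_packages, installed)
print(missing_packages)

# Uncomment to install whatever is missing (requires an internet connection):
# install.packages(missing_packages)

# Once installed, load each package you need with its own library() call, e.g.:
# library(dplyr)
```

Installing is done once per computer, while loading with library() has to be repeated in every new R session.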
2.6 Rstudio interface
There are four RStudio windows (also called panes). However, your windows might be in a
different order than those in the figure below. If you’d like, you can change the order of the
windows under RStudio preferences.
Figure 2: Rstudio IDE interface (windows)
1. Source - Your notepad for code
The source pane is where you create and edit R scripts, your collections of code. Don’t worry,
R scripts are just text files with the “.R” extension.
There are many ways to send your code from the Source to the Console. The slowest way is to
copy and paste. A faster way is to highlight the code you wish to evaluate and click the “Run”
button at the top right of the Source pane, or press “Control + Enter”.
While the Console forms the workhorse of R, operating solely in the Console is very
cumbersome. Instead of typing your commands in the Console each time you run R, we will
create a script. A script is a list of R commands that is saved as a text file and then
submitted to R line by line. Scripts are what make R so useful, as they allow easy
reproducibility. Scripts are saved with a .R extension and can be read by most text editors (e.g.,
Notepad in Windows). To create a new script you can select File > New File > R Script at the
top of RStudio. You should see the Console window go to the bottom and a new, empty
window appear above the Console.
This new window is where your open scripts will appear; each script that you have
loaded is marked by a tab at the top of the window. Any time you change a script, the name on its
tab turns red and an * appears to indicate that it has been edited since it was last saved.
2. Console: R’s Heart

The console is the heart of R. Here is where R actually evaluates code. At the beginning of the
console you’ll see the character >. The > symbol indicates the current line in the Console, with
the pointer at this line indicated by a blinking vertical bar. This is a prompt that tells you
that R is ready for new code. You can type code directly into the console after the prompt and
get an immediate response. For example, if you type 1+1 into the console and press enter, you’ll
see that R immediately gives an output of 2.
You cannot have multiple commands on the same line unless you separate the commands in the
single line with a semicolon (;). Try typing 2+2;3+3 in the Console and view the output. In fact, if
you were to run R and not RStudio, this Console is the only window you would see. This is
where all of the input, calculations and output are contained. You can think of this as the
“calculator” for RStudio.
However, most of the time you should be using the Source rather than the Console. The
reason for this is straightforward: if you type code into the console, it won’t be saved (though
you can look back on your command History). And if you make a mistake in typing code into the
console, you have to re-type everything all over again. Instead, it’s better to write all your code
in the Source. When you are ready to execute some code, you can then send (“Run”) it to the
console.
3. Environment/History
The Environment tab of this panel shows you the names of all the data objects (like vectors,
matrices, and data frames) that you’ve defined in your current R session. You can also see
information like the number of observations and variables in data objects. The History tab of this
panel simply shows you a history of all the code you’ve previously evaluated in the Console. To
be honest, I never look at this.
4. Files / Plots / Packages / Help
• Files - The Files panel gives you access to the file directory on your hard drive. One nice
feature of the “Files” panel is that you can use it to set your working directory.
• Plots - The Plots panel (no big surprise) shows all your plots. There are buttons for
opening the plot in a separate window and exporting the plot as a pdf or jpeg (though you
can also do this with code using the pdf() or jpeg() functions).
• Packages - Shows a list of all the R packages installed on your hard drive and indicates
whether or not they are currently loaded.
• Help - Help menu for R functions. You can either type the name of a function in the
search window, or type ? followed by the name of the function in the Console. Examples:
?hist # How does the histogram function work?
?t.test # What about a t-test?
2.7 DATA OPERATORS IN R
The following are different operators in R programming.

i. Arithmetic Operators
ii. Assignment Operators
iii. Relational Operators
iv. Logical Operators
Arithmetic Operators
Description                            Operator   Example   Result
Adding two or more operands            +          3+2       5
Subtracting two or more operands       -          3-2       1
Multiplying two or more operands       *          3*2       6
Dividing two or more operands          /          3/2       1.5
Exponentiation of an operand           ^          3^2       9
Remainder of the division              %%         5%%2      1
Assignment Operators
• The assignment operators are used to assign values, characters or any other type of data to
an object.
• The operator can be either a left or a right assignment operator.
• There are five assignment operators.
Type               Operator   Example
Left assignment    =          a = 5
Left assignment    <-         b <- 5
Left assignment    <<-        d <<- 5
Right assignment   ->         5 -> e
Right assignment   ->>        5 ->> f
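As a brief illustration of the assignment operators, the sketch below assigns values in both
directions and shows how <<- differs from <- inside a function. The names counter and add_one
are invented for this example and are not part of the course scripts.

```r
a = 5        # left assignment with =
b <- 5       # left assignment with <- (the most common style in R code)
5 -> e       # right assignment: the value is on the left, the name on the right

counter <- 0
add_one <- function() {
  # <<- changes the variable that was defined outside the function;
  # a plain <- here would only create a temporary local copy
  counter <<- counter + 1
}
add_one()
add_one()
print(counter)   # 2
```

In everyday scripts <- is usually enough; <<- and ->> are mainly needed when a function has to
change a variable that lives outside it.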
Relational Operators
Description                                                       Operator   Example   Result
TRUE if the left operand is greater than the right                >          3 > 2     TRUE
TRUE if the left operand is less than the right                   <          2 < 3     TRUE
TRUE if the right operand is equal to the left                    ==         2 == 3    FALSE
TRUE if the right operand is not equal to the left                !=         2 != 3    TRUE
TRUE if the left operand is greater than or equal to the right    >=         3 >= 2    TRUE
TRUE if the left operand is less than or equal to the right       <=         2 <= 3    TRUE
Logical Operators
The three most commonly used logical operators are AND, OR and NOT (negation).
Description   Operator   Example        Result
AND           &          TRUE & TRUE    TRUE
OR            |          TRUE | FALSE   TRUE
NOT           !          !FALSE         TRUE
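The operators in this section can be tried directly in a script. The following is a minimal
sketch with invented values (x, y and ages are example names only); the comments show the value
each expression returns.

```r
x <- 7
y <- 2

# Arithmetic operators
x + y     # 9
x / y     # 3.5
x ^ y     # 49
x %% y    # 1 (remainder of 7 divided by 2)

# Relational operators return TRUE or FALSE
x > y     # TRUE
x == y    # FALSE
x != y    # TRUE

# Logical operators combine TRUE/FALSE values
(x > 5) & (y > 5)   # FALSE: only the first condition holds
(x > 5) | (y > 5)   # TRUE: at least one condition holds
!(x > 5)            # FALSE

# The same operators work element by element on vectors
ages <- c(15, 22, 37, 64)
in_range <- ages >= 18 & ages < 60
in_range            # FALSE TRUE TRUE FALSE
```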
2.8 R Comments
Comments can be used to explain R code and to make it more readable. They can also be used
to prevent execution when testing alternative code. Comments start with a # symbol. When
executing R code, R ignores everything from the # symbol to the end of the line. For example,
placing # in front of a line of code causes that line to be skipped during execution, whereas
explanatory text written without # is treated as code and causes an error. Example:
“# This is a comment to explain my code” will be ignored, but “This is a comment to explain
my code” will cause an error.
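A short sketch of the three usual uses of comments (a full-line comment, a comment after code,
and a line that is “commented out” while testing); the variable name total is invented for the
example.

```r
# This whole line is a comment and is ignored by R
total <- 3 + 2   # a comment can also follow code on the same line

# The next line is commented out, so it is not executed:
# total <- total * 100

print(total)     # 5
```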
Data Types and Variables

2.9 Data Types in R
There are three basic data types, namely numeric, character and logical.
• You can use the function class() to see what type a variable is.
• You can also use the functions is.numeric(), is.character() and is.logical() to check whether
a variable is numeric, character or logical, respectively.
• If you want to change the type of a variable to another one, use the as.* functions,
including as.numeric(), as.character(), as.logical(), etc.
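The functions named above can be tried as follows. This is a minimal sketch with invented
values (height, name and passed are example names only).

```r
height <- 1.75          # numeric
name   <- "Asha"        # character
passed <- TRUE          # logical

class(height)           # "numeric"
class(name)             # "character"
class(passed)           # "logical"

is.numeric(height)      # TRUE
is.character(height)    # FALSE

# Converting between types with the as.* functions
as.character(height)    # "1.75"
as.numeric("42")        # 42
as.numeric(passed)      # 1 (TRUE becomes 1, FALSE becomes 0)

# Text that is not a number cannot be converted and becomes NA (R also gives a warning,
# which is silenced here)
suppressWarnings(as.numeric("abc"))   # NA
```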
2.10 Variables in R
When programming in any language, you need variables to store information. In R, a variable is
a name that refers to a reserved memory location holding a value. This means that when you
create a variable, you reserve some space in memory.
Variable Names
A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume).
Rules for naming R variables are:
i. A variable name must start with a letter or a period (.) and can be a combination of
letters, digits, periods (.) and underscores (_). If it starts with a period, the period
cannot be followed by a digit.
ii. A variable name cannot start with a number or an underscore (_).
iii. Variable names are case-sensitive (age, Age and AGE are three different variables).
iv. Reserved words cannot be used as variable names (TRUE, FALSE, NULL, if, ...).
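The naming rules can be checked in R itself. The sketch below uses the base function
make.names(), which returns a name unchanged only when it is already valid, so comparing a name
with its make.names() version is a quick validity check. The candidate names are invented for
illustration.

```r
# Valid names
age          <- 30
total_volume <- 12.5
.hidden      <- "starts with a period followed by a letter"
car.name2    <- "Toyota"

# Case sensitivity: age and Age are two different variables
Age <- 31
age == Age        # FALSE

# Check a set of candidate names against the rules
candidates <- c("age", "2nd_try", "_score", ".2way", "TRUE", "total_volume")
is_valid <- candidates == make.names(candidates)
is_valid          # TRUE FALSE FALSE FALSE FALSE TRUE
```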
Variables that hold a single value (string, number, boolean, etc.) are called simple variables,
while variables that hold collections of values, such as the vectors, lists and data frames
described next, are referred to here as objects.
2.11 R objects
There are many types of R objects. In this course we are going to learn how to create the
most frequently used ones: vectors, matrices, factors, lists and data frames. We will then look
at data frames in more depth, because they are the objects we will work with most.
VECTORS
Vector: a combination of elements (e.g. numbers, words), usually created using c(), seq() or rep().
NB: All elements in a vector must be of the same data type, namely numeric, character or
logical.
How to create vectors in R
No.  Method                 Example
1    c(a, b, ...)           x = c(1, 5, 9)
2    a:b                    y = 1:5
3    seq(from, to, by)      z = seq(from = 0, to = 6, by = 2)
4    rep(x, times, each)    q = rep(c(7, 8), times = 2, each = 2)
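To see what each method produces, the four examples can be run and inspected. The lines on
mixing types and on indexing are additions for illustration and go slightly beyond the table.

```r
x <- c(1, 5, 9)
y <- 1:5
z <- seq(from = 0, to = 6, by = 2)
q <- rep(c(7, 8), times = 2, each = 2)

z            # 0 2 4 6
q            # 7 7 8 8 7 7 8 8 (each value repeated twice, then the whole pattern twice)
length(q)    # 8

# Mixing types in c() converts every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed) # "character"

# Elements are accessed by position or by condition with square brackets
x[2]         # 5
y[y > 3]     # 4 5
```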
MATRIX
A matrix is a two-dimensional (r × c) object with r rows and c columns. All elements in a
matrix must be of the same data type.
How to create matrices in R
We use the matrix() function to create a matrix in R.
General form: matrix(vector, nrow=r, ncol=c, byrow=TRUE)
Example: x_ma = matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=TRUE)
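The effect of byrow is easiest to see by building the same values both ways. In this sketch
x_col is an invented name for the column-filled version.

```r
# Fill row by row
x_ma <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, byrow = TRUE)
x_ma
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# Without byrow = TRUE the values fill the matrix column by column
x_col <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3)

dim(x_ma)      # 3 3
x_ma[2, 3]     # 6 (row 2, column 3)
x_col[2, 3]    # 8
x_ma[1, ]      # 1 2 3 (the whole first row)
```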
LIST
A list is a set of objects. Each element in a list can be a vector, a matrix, a dataframe or
another list. A list allows you to gather a variety of (possibly unrelated) objects under one
name. We create a list using the list() function.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
a <- c(1, 5, 9)              # a numeric vector to store in the list
y <- matrix(1:4, nrow = 2)   # a matrix to store in the list
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of combining two lists into a single list of 10 elements
list1 <- list(1,2,3,4,5)
list2 <- list(6,7,8,9,10)
v <- c(list1,list2)
DATAFRAME
A dataframe is a two-dimensional (r × c) object (like a matrix). All values within a column of
a dataframe must be of the same data type, but the data type may vary from column to column:
for example, one column may be numeric and another character. Regression and other statistical
functions usually use dataframes. We use as.data.frame() to convert matrices to dataframes.
How to create a dataframe in R
We use data.frame(). For example, data.frame(d,e,f), where d, e and f are vectors of the same
length.
For example:
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
mydata
We use the names() function to give variable names to a data frame.
For example:
names(mydata) <- c("ID","Color","Passed")
mydata
HOW TO ACCESS VARIABLES (ELEMENTS) OF A DATA FRAME

There are a variety of ways to identify the elements of a data frame.
1. Using brackets [] or [[]]
# access columns 2 and 3 of the data frame called mydata
mydata[2:3]
# access the first column as a vector
mydata[[1]]
# access columns ID and Color of the data frame called mydata
mydata[c("ID","Color")]
2. Using $ notation: name_dataframe$componentName
# access variable Color in the data frame called mydata
mydata$Color
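Single and double brackets return different kinds of objects, which matters when the result is
passed to another function. The following sketch rebuilds the small mydata example (with
stringsAsFactors = FALSE so that it behaves the same in older R versions) and compares the two.

```r
mydata <- data.frame(
  ID     = c(1, 2, 3, 4),
  Color  = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

class(mydata["Color"])      # "data.frame": single brackets keep the data frame structure
class(mydata[["Color"]])    # "character": double brackets extract the column itself
identical(mydata[["Color"]], mydata$Color)   # TRUE

# Rows and columns together: dataframe[rows, columns]
mydata[2, "Color"]              # "white"
mydata[mydata$Passed, "ID"]     # 1 2 3 (IDs of the rows where Passed is TRUE)
```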
3. Using the functions attach() and detach()
attach(name_dataframe)
……….
detach(name_dataframe)
With the function attach() it is possible to access the components of a data frame without
writing the name of the data frame each time. The attach() function adds the data frame to the
search path.
In this way the components of the data frame become temporarily available as variables under
their component names until the detach() function is called.
The detach() function removes the data frame from the search path, so the components
are no longer visible by their names alone. They can then be accessed only with the $ notation
(or brackets), for example station$wind for a component wind of a data frame named station.
NOTE:
After the function attach() is called, operations on the attached components do not affect the
original data frame, only a copy of its components. To modify the data frame itself it is
necessary to use the $ notation or [[ ]]. For example, to multiply the first column of a data
frame by 100:
dataframe[[1]] <- dataframe[[1]] * 100

For example:
x <- 1:3; y <- 3:1; z <- 0
df <- data.frame(x,y,z)
df
rm(x,y,z)
attach(df)
z <- x+y            # creates a new variable z in the workspace; df$z is still 0
detach(df)
df                  # unchanged
df$z <- df$x+df$y   # modifies the data frame itself
df
FACTOR
Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a
vector of integers in the range [1...k] (where k is the number of unique values in the nominal
variable), and an internal vector of character strings (the original values) mapped to these
integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
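The internal coding described above can be inspected with levels() and as.integer(). The second
half of this sketch, which sets the level order by hand, is an addition for illustration;
gender2 is an invented name.

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

levels(gender)               # "female" "male" (alphabetical order)
table(gender)                # female 30, male 20
as.integer(gender)[1:3]      # 2 2 2: the first entries are "male", stored as code 2

# The order of the levels can be set explicitly
gender2 <- factor(gender, levels = c("male", "female"))
levels(gender2)              # "male" "female"
as.integer(gender2)[1:3]     # 1 1 1
```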
2.12 WORKING DIRECTORY IN R
The working directory in R is the folder in which you are working. It is where you store the
files of your project so that you can load them, and where your R objects will be saved. In
other words, the working directory is the default location where R looks for files you want to
load and where it puts any files you save.
Get working directory
To check the directory of your R session, call the function getwd(), which returns the
current working directory path as a string. This is the folder where your files will be saved
by default.
# Find the path of your working directory
getwd()
Set working directory

To change the working directory in R, call the setwd() function, specifying as its argument
the path of the new working directory folder.
# Set the path of your working directory
setwd("My\\Path")
setwd("My/Path") # Equivalent
Changing working directory in RStudio
To change the working directory from the RStudio menu, go to Session → Set Working Directory →
Choose Directory and select the folder you want to use as your working directory. See the
figure below.
List files of the working directory
Once you have set up your working directory, you may want to know which files are inside it.
For that purpose, call the dir() or the list.files() function, as illustrated in the following
example.
dir()
list.files() # Equivalent
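The functions in this section can be combined into one short session. The sketch below works in
a temporary folder so that it does not touch your own files; the folder name wd_demo and the
file name note.txt are invented for the example.

```r
old_wd <- getwd()                 # remember where we started

demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                   # make it the working directory
writeLines("hello", "note.txt")   # a relative path is saved inside the working directory
files_here <- list.files()        # now includes "note.txt"

setwd(old_wd)                     # go back to the original working directory
```

Building paths with file.path() avoids having to choose between \\ and / by hand.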
2.13 Getting data into RStudio
To import a dataset into R through RStudio, click on File → Import Dataset → and select the
file format to import from. See the figure below.
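The same import can also be written as code, which keeps the step reproducible in a script. This
is a minimal sketch using the base function read.csv(); the file students.csv and its contents
are invented and are created in a temporary folder only so that the example is self-contained.

```r
# Create a small CSV file to import (invented data)
csv_path <- file.path(tempdir(), "students.csv")
writeLines(c("ID,Color,Passed",
             "1,red,TRUE",
             "2,white,TRUE",
             "3,red,FALSE"), csv_path)

# Import it; header = TRUE says the first line holds the variable names
students <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(students)    # structure: 3 observations of 3 variables
head(students)   # first rows of the data
```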
Exporting data from R
Steps to Export a DataFrame to CSV in R
Step 1: Create a DataFrame
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
mydata
Step 2: Use write.csv to Export the DataFrame
write.csv(mydata,"majaliwaaa.csv")
Step 3: Run the code to Export the DataFrame to CSV
Finally, run the code in R and a new CSV file will be created at the location you specified or,
if you gave only a file name, in the working directory. The data in that file should match the
data in the DataFrame created in R.
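One way to confirm that the export worked is to read the file back and compare it with the
original. The sketch below also uses the optional argument row.names = FALSE, which leaves out
the extra column of row numbers that write.csv() adds by default; the file name
mydata_export.csv is invented for the example.

```r
mydata <- data.frame(ID = c(1, 2, 3, 4),
                     Color = c("red", "white", "red", NA),
                     Passed = c(TRUE, TRUE, TRUE, FALSE),
                     stringsAsFactors = FALSE)

out_path <- file.path(tempdir(), "mydata_export.csv")

# Export without the column of row numbers
write.csv(mydata, out_path, row.names = FALSE)

# Read the file back and compare it with the original data
check <- read.csv(out_path, stringsAsFactors = FALSE)
names(check)                            # "ID" "Color" "Passed"
nrow(check)                             # 4
identical(check$Color, mydata$Color)    # TRUE (the missing value is kept as NA)
```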
2.14 The Cheat Sheet of Frequently Used Functions in R

No.  Command                                               Description / example
1    getwd()                                               displays the current working directory
2    data()                                                lists R's built-in datasets
3    data("datasetName")                                   loads a built-in dataset into memory
4    View(dataframeName)                                   displays a data frame, for example a built-in dataset
5    attach(dataframeName)                                 attaches the data frame to the search path
6    detach(dataframeName)                                 detaches the data frame from the search path
7    summary(variablename)                                 provides summary statistics of a variable
8    summary(dataframeName)                                provides descriptive statistics for all variables
9    mean(variablename)                                    provides the mean of the variable
10   median(variablename)                                  provides the median value of the variable
11   sd(variablename)                                      provides the standard deviation of the variable
12   hist(variablename)                                    draws a histogram (if the data frame is attached)
13   boxplot(variablename)                                 draws a box plot (if the data frame is attached)
14   min(dataframeName$variablename)                       provides the minimum value of a variable if the data frame is not attached
15   max(variablename)                                     provides the maximum value of a variable if the data frame is attached
16   read.table("majaliwa.txt", header=T)                  imports data from a text file
17   head(dataframeName)                                   shows the first six observations of an object
18   tail(dataframeName)                                   shows the last six observations of an object
19   rm(varname)                                           removes one object
20   rm(list=ls())                                         removes all objects at once
21   Ctrl + L                                              clears the console
22   t.test(variablename, mu=90)                           one-sample t-test with the default 95% confidence level
23   ls(dataframeName)                                     lists the variable names of a data frame
24   ls() or objects()                                     displays the names of objects in the current environment (workspace)
25   dir()                                                 shows what is in the working directory
26   save(crime, file = "mhola.RData")                     saves an R object to a file
27   write.table(mtcars, file = "mauu.txt", sep = "\t")    exports data in tab-delimited format
28   write.csv2(npk, file = "nyau.csv")                    exports data in semicolon-separated format (comma as the decimal mark)
29   write.csv(mtcars, file = "mtcarcsv.csv")              exports data in comma-separated format
30   c(a, b, ...)                                          c(1, 5, 9)
31   a:b                                                   1:5
32   seq(from, to, by, length.out)                         seq(from = 0, to = 6, by = 2)
33   rep(x, times, each, length.out)                       rep(c(7, 8), times = 2, each = 2)
34   matrix(vector, nrow=r, ncol=c, byrow=TRUE)            matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=TRUE)
35   data.frame(d,e,f)                                     d, e and f are vectors of the same length
36   names(mydata) = c("ID","Color","Passed")              naming the variables of a data frame
37   list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)   creating a list object with list()
38   gender <- c(rep("male",20), rep("female", 30))        gender <- factor(gender) # tell R that gender is a factor
** How to access elements of objects is not covered in this cheat sheet.
Module 3 : Data Management and Manipulation
3.1 What is data management
In short, data management is the process of operating on datasets so that the analyst can
access the data in different ways and can change it, by assessing and editing it, to suit the
purpose and the formats that the analyst needs.
3.2 Activities involved in data Management
The activities that can be done during data management include the following:
• Generating new variables in an existing dataset
• Generating new variables in a different dataset
• Renaming variables
• Coding categorical variables
• Recoding categorical variables
• Assigning value labels to categorical variables
• Changing numeric variables into categorical variables
• Data cleaning
• Combining datasets
• Subsetting datasets
3.3 How to perform data management in R
This practical part is covered by Appendix 2: 2_Data Management.R
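As a rough preview of what such a script can contain, the sketch below carries out several of the
activities listed in 3.2 using base R only. It is not the content of Appendix 2: the data frames
survey and regions, and all their values, are invented for illustration.

```r
# Hypothetical survey data, invented for illustration only
survey <- data.frame(
  id  = 1:6,
  sex = c(1, 2, 2, 1, 2, 1),
  age = c(23, 35, 47, 52, 18, 64)
)

# Generating a new variable in the existing dataset
survey$age_months <- survey$age * 12

# Renaming a variable
names(survey)[names(survey) == "sex"] <- "gender"

# Assigning value labels to a coded categorical variable
survey$gender <- factor(survey$gender, levels = c(1, 2), labels = c("male", "female"))

# Changing a numeric variable into a categorical one
survey$age_group <- cut(survey$age, breaks = c(0, 30, 50, Inf),
                        labels = c("young", "middle", "older"))

# Subsetting: keep respondents aged 30 or more, and only some of the variables
adults <- subset(survey, age >= 30, select = c(id, gender, age_group))

# Combining datasets that share a common key
regions <- data.frame(id = 1:6, region = c("N", "S", "N", "E", "W", "S"))
combined <- merge(survey, regions, by = "id")
```

Packages such as dplyr, which is on the install list for this course, offer the same operations
with a different syntax.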
Module 4 : Descriptive statistics
4.1 What is descriptive statistics?
Descriptive statistics, in short, help describe and understand the features of a specific data
set by giving short summaries of the sample and its measures.
Descriptive statistics consist of three basic categories of measures: measures of central
tendency, measures of variability (or spread), and frequency distributions. Measures of central
tendency describe the center of the data set (mean, median, mode). Measures of variability
describe the dispersion of the data set (variance, standard deviation). Frequency
distributions describe how often values occur within the data set (counts).
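Each of the three categories has a direct counterpart in base R. The sketch below uses the
built-in mtcars dataset purely as convenient example data; it is not taken from Appendix 3.

```r
mpg <- mtcars$mpg            # miles per gallon for 32 cars (built-in dataset)

# Measures of central tendency
mean(mpg)
median(mpg)                  # 19.2

# Measures of variability
var(mpg)
sd(mpg)
range(mpg)                   # 10.4 33.9

# Frequency distribution of a categorical variable (number of cylinders)
cyl_counts <- table(mtcars$cyl)
cyl_counts                   # 4 cylinders: 11 cars, 6: 7 cars, 8: 14 cars

# R has no built-in function for the statistical mode;
# the most frequent value can be read from the frequency table
most_common <- names(cyl_counts)[which.max(cyl_counts)]
most_common                  # "8"
```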
4.2 How to perform Descriptive statistics in R

This practical part is covered by appendix 3: 3_Descriptive Statistics.R
Module 5 : Inferential statistics
5.1 What is inferential statistics

Inferential statistics can be defined as a field of statistics that uses analytical tools for
drawing conclusions about a population by analysing sample data obtained from that
population. The goal of inferential statistics is to make generalizations about a population. In
inferential statistics, a statistic is taken from the sample data (e.g., the sample mean) and is
used to make inferences about the population parameter (e.g., the population mean).
5.2 Types of Inferential Statistics

Inferential statistics can be classified into hypothesis testing and regression analysis.
5.2.1 Hypothesis Testing
Hypothesis testing is a type of inferential statistics that is used to test assumptions and draw
conclusions about the population from the available sample data. It involves setting up a null
hypothesis and an alternative hypothesis followed by conducting a statistical test of significance.
The null and alternative are always claims about the population. That’s because the goal of
hypothesis testing is to make inferences about a population based on a sample. Often, we infer
whether there’s an effect in the population by looking at differences between groups or
relationships between variables in the sample.
What is a null hypothesis?
The null hypothesis is the claim that there is no effect in the population. If the sample provides
enough evidence against the claim that there’s no effect in the population (p ≤ α), then we can
reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.
When you incorrectly reject the null hypothesis, it is called a type I error. When you incorrectly
fail to reject it, it is a type II error.
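The decision rule (reject the null hypothesis when p ≤ α) can be written out in a few lines. This
is a minimal sketch: the choice of α = 0.05, the built-in mtcars data and the hypothesised mean
of 20 are assumptions made only for the example.

```r
alpha <- 0.05                  # significance level, chosen before looking at the data

# H0: the population mean of mpg is 20;  Ha: it is not 20
result  <- t.test(mtcars$mpg, mu = 20)
p_value <- result$p.value

if (p_value <= alpha) {
  decision <- "reject H0"
} else {
  decision <- "fail to reject H0"
}
decision    # "fail to reject H0": the sample mean (about 20.09) is very close to 20
```

Failing to reject the null hypothesis does not prove that it is true; it only means that the
sample does not provide enough evidence against it.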
What is an alternative hypothesis?
The alternative hypothesis (Ha) is the other answer to your research question. It claims that
there’s an effect in the population. In other words, it is the claim that you expect or hope will be
true.
Differences between Null and alternative hypothesis
For the purpose of this course we will focus on correlation, chi-square, t-tests and analysis of
variance (one way anova).
5.2.2 CORRELATION
Pearson correlation coefficient: the measure of the strength of the linear relationship
between two variables. In a sample it is denoted by r and is by design constrained to -1 ≤ r ≤ 1.
Furthermore:
• Positive values denote positive linear correlation;
• Negative values denote negative linear correlation;
• A value of 0 denotes no linear correlation;
• The closer the value is to 1 or -1, the stronger the linear correlation.
Correlation is an effect size, so we can verbally describe the strength of the correlation using
the guide that Evans (1996) suggests for the absolute value of r:
• 0.00-0.19 “very weak”
• 0.20-0.39 “weak”
• 0.40-0.59 “moderate”
• 0.60-0.79 “strong”
• 0.80-1.00 “very strong”
The direction of the relationship depends on the sign of the coefficient: if it is positive (+),
there is a positive linear relationship between the two variables, and if it is negative (-),
there is a negative linear relationship between the two variables.
Hypothesis
H0: There is no linear relationship between the two variables
Ha: There is a linear relationship between the two variables
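The coefficient and its test are available in base R through cor() and cor.test(). The sketch
below uses two variables from the built-in mtcars dataset as example data; it is not the content
of Appendix 4.

```r
# Pearson correlation between car weight (wt) and fuel efficiency (mpg)
r <- cor(mtcars$wt, mtcars$mpg)
r                         # about -0.87: a "very strong" negative linear relationship

# Test of H0: there is no linear relationship between the two variables
test <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
test$estimate             # the same r
test$p.value              # far below 0.05, so H0 is rejected
```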
5.2.3 How to perform correlation in R
This practical part is covered by appendix 4: 4_correlation and chi-square.R
5.2.4 Chi-Square Test

Chi-square test: an inferential statistics technique designed to test for significant relationships
between two variables organized in a bivariate table.
Stating Alternative and Null Hypotheses
Null hypothesis
The null hypothesis (H0) states that no association exists between the two cross-tabulated
variables in the population, and therefore the variables are statistically independent.
Alternative hypothesis
The alternative hypothesis (H1) proposes that the two variables are related in the population.
Simply
H0: There is no association between the two variables.
H1: The two variables are related in the population.
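In R the test is applied to a cross-tabulation of the two variables. The following is a minimal
sketch using two categorical variables from the built-in mtcars dataset (number of cylinders and
transmission type) as example data; α = 0.05 is an assumption of the example.

```r
# Bivariate table: number of cylinders by transmission type (0 = automatic, 1 = manual)
tab <- table(mtcars$cyl, mtcars$am)
tab

# Some expected counts are small in this table, so R warns that the chi-square
# approximation may be inaccurate; the warning is silenced here
result <- suppressWarnings(chisq.test(tab))
result$statistic          # X-squared, about 8.74
result$parameter          # degrees of freedom: 2
result$p.value            # about 0.013

alpha <- 0.05
result$p.value <= alpha   # TRUE: reject H0 of no association
```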
5.2.5 How to Perform a Chi square in R

This practical part is covered by appendix 4: 4_correlation and chi-square.R
5.2.6 The t-Test

What is a t-test?
A t-test (also known as Student's t-test) is a tool for evaluating the means of one or two
populations using hypothesis testing. A t-test may be used to evaluate whether a single group
differs from a known value (a one-sample t-test), whether two groups differ from each other (an
independent two-sample t-test), or whether there is a significant difference in paired
measurements (a paired, or dependent samples t-test).
Types of t-tests
There are three t-tests to compare means: a one-sample t-test, a two-sample t-test and a paired
t-test. The table below summarizes the characteristics of each and provides guidance on how to
choose the correct test.
                       One-sample t-test        Two-sample t-test         Paired t-test
Number of variables    one                      two                       two
Purpose of test        Decide if the            Decide if the             Decide if the
                       population mean is       population means for      difference between
                       equal to a specific      two different groups      paired measurements
                       value or not             are equal or not          for a population is
                                                                          zero or not
Example: test if...    Mean heart rate of a     Mean heart rates for      Mean difference in
                       group of people is       two groups of people      heart rate for a group
                       equal to 65 or not       are the same or not       of people before and
                                                                          after exercise is zero
                                                                          or not
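All three tests are run with the same base R function, t.test(); only the arguments change. The
sketch below uses the built-in datasets mtcars and sleep as example data (there is no heart-rate
dataset in base R), and the hypothesised mean of 20 is an assumption of the example.

```r
# One-sample: is the mean mpg equal to 20?
one <- t.test(mtcars$mpg, mu = 20)

# Two-sample: do automatic (am = 0) and manual (am = 1) cars have the same mean mpg?
# By default R does not assume equal variances in the two groups (Welch's t-test)
two <- t.test(mpg ~ am, data = mtcars)

# Paired: the sleep data record extra hours of sleep for the same 10 patients under two drugs
drug1  <- sleep$extra[sleep$group == 1]
drug2  <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# Compare the three p-values with the chosen significance level
c(one = one$p.value, two = two$p.value, paired = paired$p.value)
```

With α = 0.05, the one-sample test fails to reject its null hypothesis here, while the
two-sample and paired tests both reject theirs.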
5.2.7 How to perform t - tests in R

This practical part is covered by appendix 5: 5_t – tests and anova.R
5.2.8 One-Way ANOVA

What is one-way ANOVA?
One-way analysis of variance (ANOVA) is a statistical method for testing for differences in the
means of three or more groups.
How is one-way ANOVA used?
One-way ANOVA is typically used when you have a single independent variable, or factor, and
your goal is to investigate if variations, or different levels of that factor have a measurable effect
on a dependent variable. One-way ANOVA is a statistical method to test the null hypothesis
(H0) that three or more population means are equal vs. the alternative hypothesis (Ha) that at
least one mean is different.
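In R a one-way ANOVA can be fitted with aov(). The sketch below uses the built-in PlantGrowth
dataset (plant weight under a control and two treatments, so three groups) as example data; it is
not the content of Appendix 5.

```r
# H0: the three group means are equal;  Ha: at least one mean is different
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Extract the p-value of the F test from the summary table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value                # about 0.016
p_value <= 0.05        # TRUE: reject H0
```

A significant F test says only that at least one mean differs; it does not say which one.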
5.2.9 How to perform one way anova in R
This practical part is covered by appendix 5: 5_t – tests and anova.R
5.2.10 REGRESSION ANALYSIS
Regression analysis is used to quantify how one variable changes with respect to another
variable. There are many types of regression available, such as simple linear, multiple linear,
logistic, and ordinal regression. The most commonly used regression in inferential statistics is
linear regression. Linear regression estimates the effect of a unit change in the independent
variable on the dependent variable.
LINEAR REGRESSION
Linear regression is a common Statistical Data Analysis technique. It is used to determine the
extent to which there is a linear relationship between a dependent variable and one or more
independent variables. There are two types of linear regression, simple linear regression and
multiple linear regression.
In simple linear regression a single independent variable is used to predict the value of a
dependent variable.
In multiple linear regression two or more independent variables are used to predict the value of
a dependent variable. The difference between the two is the number of independent variables. In
both cases there is only a single dependent variable.
LINEAR REGRESSION INTERPRETATION

In the regression output we can see that at the top we have the ANOVA table and below it a
table of coefficients.
• The output reports the number of observations, the F test, the probability of the F test,
R-squared, the coefficients, and the t test and its probability for each coefficient.
• We interpret the output in the following order:
1. we start with the F test,
2. we progress to the R-squared,
3. then to the t tests of the individual coefficients,
4. and then to the coefficients themselves.
NB: For the F test, what interests us is its probability (p-value).
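The four quantities can be pulled out of a fitted model in the same order. This is a minimal
sketch of a simple linear regression with lm() on the built-in mtcars dataset; it is not the
content of Appendix 6, and the layout of R's own printed summary differs from the output
described above.

```r
fit <- lm(mpg ~ wt, data = mtcars)     # simple linear regression: mpg explained by weight
s   <- summary(fit)

# 1. F test: overall significance of the model
f <- s$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# 2. R-squared: share of the variation in mpg explained by the model
r_squared <- s$r.squared               # about 0.75

# 3. and 4. t tests and estimates of the individual coefficients
coef_table <- s$coefficients           # columns: Estimate, Std. Error, t value, Pr(>|t|)
slope   <- coef_table["wt", "Estimate"]   # about -5.34
slope_p <- coef_table["wt", "Pr(>|t|)"]
```

Here the slope means that each additional unit of wt (1000 lbs) is associated with about 5.34
fewer miles per gallon. A multiple linear regression only needs more terms in the formula, for
example lm(mpg ~ wt + hp, data = mtcars).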
5.2.11 HOW to perform linear regressions in R

This practical part is covered by appendix 6: 6_Linear models.R
List of Appendices
Appendix 1: 1_Getting Started With R.R
Appendix 2: 2_Data Management.R
Appendix 3: 3_Descriptive Statistics.R
Appendix 4: 4_correlation and chi-square.R
Appendix 5: 5_t – tests and anova.R
Appendix 6: 6_Linear models.R
Note: These appendices are R script files containing the R code used to produce the outputs.