Training Manual For Data Analytics Using R
Table of Contents
Module 1: introduction to statistics, data and data analytics
1.0 Statistics and Data concepts
1.1 What is statistics?
1.2 Branches of statistics
1.2.1 Descriptive Statistics
1.2.2 Inferential Statistics
1.3 Confidence and Significance Levels
1.4 Major sources of statistics
1.4.1 Primary sources
1.4.2 Secondary Sources
1.5 Measures of Central Tendency
1.6 Measures of Dispersion
1.7 Data (raw information)
1.8 Variable
1.8.1 Qualitative variable
1.8.2 Quantitative variable
1.9 BIG DATA: What is it?
1.9.1 Characteristics of Big Data
1.9.2 Big Data Sources
1.10 HOW Big data is ACTUALLY USED
1.11 Big Data Challenges
1.12 Softwares for big data
1.13 Data analytics (DA)
1.14 Tools for data analytics
1.15 Types of data analytics applications
1.16 Stages of Data analysis
Module 2 : Getting started with R
2.1 R Software Definition
2.2 R Studio Definition
2.3 Why R?
2.4 Installation of R and RStudio
2.5 Install and Load Packages
2.6 Rstudio interface
2.7 DATA OPERATORS IN R
2.8 R Comments
2.9 Data Types in R
2.10 Variables in R
2.11 R objects
2.12 WORKING DIRECTORY IN R
2.13 Getting data into Rstudio
Module 3 : Data Management and Manipulation
3.1 What is data management
3.2 Activities involved in data Management
3.3 How to perform data management in R
Module 4 : Descriptive statistics
4.1 What is descriptive statistics?
4.2 How to perform Descriptive statistics in R
Module 5 : Inferential statistics
5.1 What is inferential statistics
5.2 Types of Inferential Statistics
5.2.1 Hypothesis Testing
5.2.2 CORRELATION
5.2.3 How to perform correlation in R
5.2.4 Chi-Square Test
5.2.5 How to Perform a Chi square in R
5.2.6 The t-Test
5.2.7 How to perform t - tests in R
5.2.8 One-Way ANOVA
5.2.9 How to perform one way anova in R
5.2.10 REGRESSION ANALYSIS
5.2.11 HOW to perform linear regressions in R
References
List of Appendixes
Appendix 1: 1_Getting Started WIth R.R
Appendix 2: 2_Data Management.R
Appendix 3: 3_Descriptive Statistics.R
Appendix 4: 4_correlation and chi-square.R
Appendix 5: 5_t – tests and anova.R
Appendix 6: 6_Linear models.R
Module 1: introduction to statistics, data and data analytics
1.0 Statistics and Data concepts
1.1 What is statistics?
Statistics is a way of getting information from data. It is both an art and a science of deciding
which data are appropriate to collect, how to collect them efficiently, and how to use them to
answer questions and make decisions. It turns numbers into information and supports
decision-making under uncertainty. It is a problem-solving process that seeks answers to
questions through data, and is therefore a tool for building understanding from a set of
numbers.
1.2 Branches of statistics
The two main branches of statistics are descriptive statistics and inferential statistics.
1.2.1 Descriptive Statistics
These are methods of organizing, summarizing, and presenting data in a convenient and
informative way. They include graphical techniques and numerical techniques. The method
actually used depends on what information we would like to extract: are we interested in
measures of central location, measures of variability (dispersion), or both?
1.2.2 Inferential Statistics
Descriptive statistics describe the data set that is being analyzed, but they do not allow us to
draw conclusions or make inferences beyond that data set. Hence we need another branch of
statistics, which is inferential statistics.
Inferential statistics is also a set of methods, but it is used to draw conclusions or inferences
about characteristics of populations based on data from a sample. Statistical Inference is the
process of making an estimate, prediction, or decision about a population based on a sample.
However, such conclusions and estimates are not always going to be correct. For this reason, we
build “measures of reliability” into statistical inference, namely the confidence level and the
significance level.
1.3 Confidence and Significance Levels
The confidence level is the proportion of times that an estimating procedure will be correct. For
example, a confidence level of 95% means that estimates based on this form of statistical
inference will be correct 95% of the time.
When the purpose of the statistical inference is to draw a conclusion about a population, the
significance level measures how frequently the conclusion will be wrong in the long run. For
example, a 5% significance level means that, in the long run, this type of conclusion will be
wrong 5% of the time.
If we use α (Greek letter “alpha”) to represent significance, then our confidence level is 1 – α.
This relationship can also be stated as:
Confidence Level + Significance Level = 1
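The relationship above can be sketched in R. This is a minimal illustration, assuming a significance level of 0.05 and a small made-up sample named scores; neither comes from the manual. The base R function t.test() reports a confidence interval for the population mean at the chosen confidence level.

```r
# Significance level (assumed value, for illustration only)
alpha <- 0.05

# Confidence level = 1 - significance level
confidence_level <- 1 - alpha

# Made-up sample of 10 measurements (not real data)
scores <- c(52, 48, 55, 60, 47, 51, 58, 49, 53, 50)

# t.test() returns, among other things, a confidence interval
# for the population mean at the requested confidence level
result <- t.test(scores, conf.level = confidence_level)

confidence_level   # 0.95
result$conf.int    # lower and upper bounds of the interval
```

Choosing a smaller alpha (for example 0.01) gives a higher confidence level and a wider interval.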
1.4 Major sources of statistics
1.4.1 Primary sources
Data may be collected specifically for the purpose at hand. Such data are known as primary data.
Facts and figures about the population collected in censuses and surveys are primary data. The
great advantage of such data is that exactly the information wanted is obtained.
1.4.2 Secondary Sources
Often data are taken from the reports of other institutions and organizations; such data are
referred to as secondary data. For example, industrial production figures may be taken from the
reports of the industries.
1.5 Measures of Central Tendency
The term measure of central tendency refers to values that are computed in an attempt to
characterize the central part of a frequency distribution. The arithmetic mean, median and
mode are the most frequently used.
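As a small sketch, the three measures can be computed in R on a made-up vector x. The functions mean() and median() are built in. Base R has no built-in function for the statistical mode (the built-in mode() reports the storage type instead), so stat_mode() below is a hypothetical helper written for this illustration.

```r
# Made-up values (not from the manual)
x <- c(2, 4, 4, 5, 7, 9, 4)

mean_x   <- mean(x)     # arithmetic mean: 35 / 7 = 5
median_x <- median(x)   # middle value of the sorted data: 4

# Hypothetical helper: the most frequent value(s) in a numeric vector
stat_mode <- function(v) {
  counts <- table(v)                                 # frequency of each value
  as.numeric(names(counts)[counts == max(counts)])   # value(s) with the top count
}

mode_x <- stat_mode(x)  # 4 occurs three times, so the mode is 4
```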
1.6 Measures of Dispersion
Variance - the mean of all squared deviations from the mean. A deviation is the amount by which
a score differs from the mean of the distribution, that is, how far the score is from the
mean.
Standard Deviation - a measure of the dispersion or variation in a distribution, equal to the
square root of the arithmetic mean of the squares of the deviations from the arithmetic mean. The
further the values lie from the average, the larger the standard deviation.
Range - the difference between the lowest and highest values. The range tells you something about
how spread out the data are. Data with large ranges tend to be more spread out.
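The definitions above can be sketched in R on a made-up vector x. One point to be aware of: the built-in var() and sd() divide by n - 1 (the sample versions), whereas the definition above divides by n, so the two results differ slightly. The variable names are illustrative.

```r
# Made-up values (not from the manual)
x <- c(2, 4, 4, 5, 7, 9, 4)

# Variance exactly as defined above: mean of the squared deviations (divides by n)
pop_var <- mean((x - mean(x))^2)   # 32 / 7

# Built-in sample variance and standard deviation (divide by n - 1)
sample_var <- var(x)               # 32 / 6
sample_sd  <- sd(x)                # square root of the sample variance

# Range: difference between the highest and lowest values
value_range <- diff(range(x))      # 9 - 2 = 7
```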
1.7 Data (raw information)
Data are facts that become useful information when organized in a meaningful way or when
entered into a computer.
Data is also defined as information organized for analysis.
Data can be qualitative or quantitative in nature.
1.8 Variable
Discrete variable – a variable whose possible values have breaks (gaps) between successive
values. For example: number of cows, number of people, number of bags of coffee, etc.
1.9 BIG DATA: What is it?
“Big data is defined as large amount of data which requires new technologies and architectures
so that it becomes possible to extract value from it…”
1.9.1 Characteristics of Big Data
High-Volume: Huge amount of data is being generated
Velocity: Data is being generated at a very high speed
Variety: different forms and types of data are being produced
Veracity: The produced data is messy (not clean)
Variability: Data flow is not constant
1.9.2 Big Data Sources
People Data
- Social media, online survey, Blogs
Machine Data
- “Smart devices”, such as satellites.
Organization Data
- Medical records, Commercial Transactions and Utility payments.
1.10 HOW Big data is ACTUALLY USED
1.12 Softwares for big data
Traditional software such as Excel and SPSS struggles to analyze very large amounts of data,
which has led to the use of more powerful software for big data, such as R and Python. In this
course, R is the software that we will learn.
1.13 Data analytics (DA)
Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain. The purpose of data analytics is to extract
useful information from data and to make decisions based on that analysis. A simple
everyday example: whenever we make a decision, we think about what happened last time or
what is likely to happen if we choose a particular option.
1.14 Tools for data analytics
Increasingly, data analytics is done with the aid of specialized systems and software. Data
analysis tools are the software and applications that data analysts use to develop and perform
analytical processes that help companies make better-informed business decisions while
decreasing costs and increasing profits. Examples of statistical analysis tools include SPSS,
Stata, MATLAB, Minitab, R, Python, SAS, Microsoft Excel and many more. A range of different
software tools is available, and each offers something slightly different to the user. What you
choose will depend on a range of factors, including your research question, knowledge of
statistics, experience of coding and availability of the software (free or commercial). For this
course we will use R.
1.15 Types of data analytics applications
At a high level, data analytics methodologies include exploratory data analysis (EDA) and
confirmatory data analysis (CDA). EDA aims to find patterns and relationships in data, while
CDA applies statistical techniques to determine whether hypotheses about a data set are true or
false.
Data analytics can also be separated into quantitative data analysis and qualitative data
analysis. The former involves the analysis of numerical data with quantifiable variables. These
variables can be compared or measured statistically. The qualitative approach is more
interpretive -- it focuses on understanding the content of non-numerical data like text, images,
audio and video, as well as common phrases, themes and points of view.
Advanced types of data analytics include data mining, which involves sorting through large data
sets to identify trends, patterns and relationships. Another is predictive analytics, which seeks to
predict customer behavior, equipment failures and other future business scenarios and events.
Machine learning can also be used for data analytics, by running automated algorithms to churn
through data sets more quickly than data scientists can do via conventional analytical modeling.
Big data analytics applies data mining, predictive analytics and machine learning tools to data
sets that can include a mix of structured, unstructured and semi-structured data. Text mining
provides a means of analyzing documents, emails and other text-based content.
1.16 Stages of Data analysis
Like any scientific discipline, data analysis follows a rigorous step-by-step process. Each stage
requires different skills and know-how. The stages are: defining the question, collecting
the data, cleaning the data, analyzing the data and sharing the results. Consider the figure below.
1. Step one: Defining the question
The first step in any data analysis process is to define your objective. Defining your objective
means coming up with a hypothesis and figuring out how to test it. For instance, your
organization’s senior management might pose an issue, such as: “Why are we losing
customers?” or “Which factors are negatively impacting the customer experience?” or better
yet: “How can we boost customer retention while minimizing costs?”
Now that you have defined a problem, you need to determine which sources of data will best
help you solve it.
2. Step two: Collecting the data
Once you have established your objective, you will need to create a strategy for collecting
and aggregating the appropriate data. A key part of this is determining which data you need.
This might be quantitative (numeric) data, e.g. sales figures, or qualitative (descriptive) data,
such as customer reviews.
3. Step three: Cleaning the data
Once you have collected your data, the next step is to get it ready for analysis. This means
cleaning, or ‘scrubbing’ it, and is crucial in making sure that you are working with high-quality
data. Key data cleaning tasks include:
- Removing major errors, duplicates, and outliers, all of which are inevitable problems
when aggregating data from numerous sources.
- Removing unwanted data points: dropping irrelevant observations that have no
bearing on your intended analysis.
- Bringing structure to your data: general ‘housekeeping’, i.e. fixing typos or layout
issues, which will help you map and manipulate your data more easily.
- Filling in major gaps: as you are tidying up, you might notice that important data are
missing. Once you have identified gaps, you can go about filling them.
Data analysts commonly spend most of their time, by some estimates 70-90%, cleaning
their data. This might sound excessive, but focusing on the wrong data points (or analyzing
erroneous data) will severely impact your results.
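The cleaning tasks listed above can be sketched in base R. This is only an illustration on a tiny made-up data frame named raw; the column names, the decision to drop rows with a missing region, and the choice to fill missing sales with the median are assumptions for the example, not recommendations from the manual.

```r
# Made-up raw data with a duplicate row, untidy text and missing values
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  region = c("north", "south", "south", "East ", NA),
  sales  = c(100, 250, 250, NA, 90),
  stringsAsFactors = FALSE
)

# Remove exact duplicate rows
clean <- raw[!duplicated(raw), ]

# Housekeeping: trim stray spaces and use consistent lower case
clean$region <- tolower(trimws(clean$region))

# Remove unwanted data points: here, rows with no region recorded
clean <- clean[!is.na(clean$region), ]

# Fill a gap: replace missing sales with the median of the known values
clean$sales[is.na(clean$sales)] <- median(clean$sales, na.rm = TRUE)

clean
```

Whether a gap should be filled, and with what value, depends on the analysis; the median is used here only because it is simple.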
4. Step four: Analyzing the data
Depending on the insights you hope to gain from your data, you may apply different
types of data analysis: descriptive analysis, which identifies what has already
happened; diagnostic analysis, which focuses on understanding why something
happened; predictive analysis, which identifies future trends based on historical
data; or prescriptive analysis, which allows you to make recommendations for the future.
5. Step five: Sharing your results
After you have finished carrying out your analyses, you have your insights. The final step of the
data analytics process is to share these insights with the wider world (or at least with your
organization’s stakeholders!). It involves interpreting the outcomes, and presenting them in a
manner that is digestible for all types of audiences. Since you will often present information to
decision-makers, it is very important that the insights you present are 100% clear and
unambiguous. For this reason, data analysts commonly use reports, dashboards, and interactive
visualizations to support their findings.
Module 2 : Getting started with R
2.2 R Studio Definition
RStudio is a free and open-source integrated development environment (IDE) for R, a
programming language for statistical computing and graphics. An IDE is software that bundles
comprehensive facilities such as a code editor, compiler, and debugger to help developers write
code. RStudio assists developers in writing R scripts by providing all the required tools in
one software package. The RStudio interface is organized so that the user can clearly view graphs,
data tables, R code, and output all at the same time. It also offers an import wizard
that allows users to import CSV, Excel, SAS (*.sas7bdat), SPSS (*.sav), and Stata (*.dta) files
into R without having to write the code to do so.
You can use R without using RStudio, but you can't use RStudio without using R, so R comes
first.
JJ Allaire, creator of the programming language ColdFusion, founded RStudio. Hadley Wickham
is the Chief Scientist at RStudio.
2.3 Why R?
There are many programming languages available for data science, like R, Python, SAS, Java,
and more. There are many data science software packages to learn, such as SPSS Statistics, SPSS
Modeler, SAS Enterprise Miner, Tableau, RapidMiner, Weka, GATE, and more. Learning R for
statistics is recommended because it was developed for statistics in the first place. R
programming is very strong in statistics, so it is ideal for data exploration or data understanding
using descriptive statistics, inferential statistics, regression analysis, and data visualizations. R is
also ideal for modeling because you can use statistical learning like regressions for predictive
analytics. R also has some packages for data mining, text mining, and machine learning. R also
has large community support and is free. It has many packages (libraries of functions) that
can be used to solve different problems.
2.4 Installation of R and RStudio
Figure 3: website for downloading R programming command line application (R base)
Click download and select the CRAN mirror closest to you, then download the R programming
application that suits your operating system.
To install the software, double-click the downloaded setup file and follow the instructions of the
installer to install the R programming command line application, as seen below:
Figure 5: Installation of R
After the R programming command line application is installed, you can start it, as seen below:
Figure 1: R programming command line application (R base) graphical interface
RStudio is the most popular IDE for the R programming language. RStudio helps you write R
programming code more easily and more productively. To download and install RStudio, visit
www.rstudio.com/, as seen below:
After downloading the RStudio installer or setup file, double-click the file to install the RStudio
IDE, as seen below:
Figure 7: Installation of the RStudio IDE
After installing the RStudio IDE, you can run the RStudio IDE software, as seen below:
Figure 9: The RStudio IDE’s Tools menu
Click the Change button to select the R version, as seen below:
Figure 10: RStudio IDE options
2.5 Install and Load Packages
The packages included in the initial installation may not have all the functions that you need.
In R, you can easily install and load additional packages provided by other users.
1. To install packages:
You can either use the install.packages() function
install.packages("foreign") # install 'foreign' package
or
click Tools > Install Packages. Write the package name in the dialog, then click Install.
2. To Load Packages
Once you have installed the package, you need to load it so that it becomes available to use.
Simply use the library() function
library(foreign) # load 'foreign' package
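A common pattern is to install a package only when it is not already present, and then load it. The sketch below shows one way to do this; load_or_install() is a hypothetical helper name invented for this example, while requireNamespace(), install.packages() and library() are standard R functions.

```r
# Hypothetical helper (not part of R or of this manual):
# install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # needs an internet connection
  }
  library(pkg, character.only = TRUE)   # pkg is a string, hence character.only
}

load_or_install("foreign")
```

Installing is needed only once per computer, whereas loading with library() is needed in every new R session.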
2.6 Rstudio interface
There are four RStudio windows (also called panes). However, your windows might be in a
different order than those in the figure below. If you’d like, you can change the order of the
windows under RStudio preferences.
1. Source
The Source pane is where you create and edit R scripts - your collections of code. Don’t worry,
R scripts are just text files with the “.R” extension.
There are many ways to send your code from the Source to the Console. The slowest way is to
copy and paste. A faster way is to highlight the code you wish to evaluate and click the “Run”
button at the top right of the Source pane, or press “Control + Enter”.
While the Console forms the workhorse of R, operating solely in the Console is very
cumbersome. Instead of typing your commands in the Console each time you run R, we will
create a script. A script is a list of R commands that is saved as a text file and then
submitted to R line by line. Scripts are what make R so useful, as they allow easy
reproducibility. Scripts are saved with a .R extension and can be read by most text editors (e.g.,
Notepad in Windows). To create a new script you can select File > New File > R Script at the
top of RStudio. You should see the Console window move to the bottom and a new, empty
window appear above the Console.
This new window is where your open scripts will appear; each script that you have
loaded is marked by a tab at the top of the window. Whenever you change a script, its name in
the tab turns red and an asterisk (*) appears to indicate that it has been edited since it was
last saved.
2. Console
The Console is where R evaluates code. Most of the time, however, you should be using the
Source rather than the Console. The reason for this is straightforward: if you type code into the
Console, it won’t be saved (though you can look back on your command History). And if you make
a mistake in typing code into the Console, you have to re-type everything all over again.
Instead, it’s better to write all your code in the Source. When you are ready to execute some
code, you can then send it to the Console with “Run”.
3. Environment/History
The Environment tab of this panel shows you the names of all the data objects (like vectors,
matrices, and data frames) that you’ve defined in your current R session. You can also see
information like the number of observations (rows) and variables (columns) in data objects. The
History tab of this panel simply shows you a history of all the code you’ve previously evaluated
in the Console. In practice, this tab is rarely needed.
4. Files / Plots / Packages / Help
Files - The files panel gives you access to the file directory on your hard drive. One nice
feature of the “Files” panel is that you can use it to set your working directory.
Plots - The Plots panel (no big surprise) shows all your plots. There are buttons for
opening the plot in a separate window and exporting the plot as a pdf or jpeg (though you
can also do this with code using the pdf() or jpeg() functions).
Packages - Shows a list of all the R packages installed on your hard drive and indicates
whether or not they are currently loaded.
Help - Help menu for R functions. You can either type the name of a function in the
search window, or use code: type ? followed by the name of the function. Examples:
?hist # How does the histogram function work?
?t.test # What about a t-test?
2.7 DATA OPERATORS IN R
Arithmetic Operators
Description                          Operator   Example
Adding two or more operands          +          3 + 2 = 5
Subtracting two or more operands     -          3 - 2 = 1
Multiplying two or more operands     *          3 * 2 = 6
Dividing two or more operands        /          3 / 2 = 1.5
Exponentiation of an operand         ^          3 ^ 2 = 9
Remainder of a division              %%         5 %% 2 = 1
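The arithmetic operators in the table can be tried directly in the Console. A short sketch, with results stored in illustrative variable names:

```r
add_result   <- 3 + 2    # 5
sub_result   <- 3 - 2    # 1
mul_result   <- 3 * 2    # 6
div_result   <- 3 / 2    # 1.5
pow_result   <- 3 ^ 2    # 9
rem_result   <- 5 %% 2   # 1 (remainder after dividing 5 by 2)
```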
Assignment Operators
The assignment operators are used to assign values, characters or any other type of data to
an object.
An assignment operator is either a left or a right assignment operator.
There are five assignment operators.
Description        Operator   Example
Left Assignment    =          a = 5
                   <-         b <- 5
                   <<-        d <<- 5
Right Assignment   ->         5 -> e
                   ->>        5 ->> f
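A brief sketch of the assignment operators in use. The operators themselves are standard R; the names counter and increment() are made up for this example. The <<- and ->> operators assign in an enclosing environment, which is why <<- is shown inside a function here.

```r
a = 5        # left assignment with =
b <- 5       # left assignment with <- (the form most R users prefer)
5 -> e       # right assignment

# <<- looks for the variable outside the function and changes it there
counter <- 0
increment <- function() {
  counter <<- counter + 1
}
increment()
increment()
counter      # 2
```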
Relational Operators
Logical Operators
Three symbols are commonly used as logical operators, for AND, OR and NOT (negation).
Description   Operator   Example
AND           &          TRUE & FALSE gives FALSE
OR            |          TRUE | FALSE gives TRUE
NOT           !          !FALSE == TRUE
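A short sketch of R's relational operators (==, !=, <, >, <=, >=) together with the logical operators, using two made-up values a and b. Each comparison returns TRUE or FALSE.

```r
a <- 5
b <- 3

# Relational operators compare two values
is_equal      <- a == b    # FALSE
is_different  <- a != b    # TRUE
is_greater    <- a > b     # TRUE
is_less_equal <- a <= b    # FALSE

# Logical operators combine or negate TRUE/FALSE values
both_true   <- (a > 2) & (b > 4)   # FALSE, because b > 4 is FALSE
either_true <- (a > 2) | (b > 4)   # TRUE, because a > 2 is TRUE
negated     <- !(a > 2)            # FALSE
```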
2.8 R Comments
Comments can be used to explain R code and to make it more readable. They can also be used
to prevent execution when testing alternative code. A comment starts with the # symbol. When
executing code, R ignores everything from the # symbol to the end of the line. For example,
placing # in front of a line of code causes that line to be skipped during execution. The line
“# This is a comment to explain my code” is ignored, whereas the same text without the #
(“This is a comment to explain my code”) causes an error, because R tries to run it as code.
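A minimal sketch of the three typical uses of comments: a full-line explanation, a note at the end of a line of code, and temporarily disabling a line. The variable x is made up for the example.

```r
# This is a comment to explain my code; R ignores this whole line
x <- 10        # a comment can also follow code on the same line

# The next line is "commented out", so it is not executed:
# x <- 99

x              # still 10
```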
2.9 Data Types in R
The three basic data types used in this course are numeric, character and logical.
It’s possible to use the function class() to see what type a variable is.
You can also use the functions is.numeric(), is.character(), is.logical() to check whether
a variable is numeric, character or logical, respectively.
If you want to change the type of a variable to another one, use the as.* functions,
including: as.numeric(), as.character(), as.logical(), etc.
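A short sketch of checking and converting data types, using made-up variables. All functions shown are base R. Converting text that is not a number gives NA (a missing value) together with a warning, which is suppressed here to keep the output clean.

```r
age        <- 25         # numeric
name       <- "Amina"    # character (made-up value)
is_student <- TRUE       # logical

class(age)               # "numeric"
class(name)              # "character"
class(is_student)        # "logical"

is.numeric(age)          # TRUE
is.character(age)        # FALSE

age_text <- as.character(age)      # "25"
pi_value <- as.numeric("3.14")     # 3.14
flag     <- as.logical("TRUE")     # TRUE

# Text that cannot be read as a number becomes NA
not_a_number <- suppressWarnings(as.numeric("abc"))   # NA
```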
2.10 Variables in R
When programming in any language, you use variables to store information. In R, a variable is
a name that refers to a value stored in memory. This means that when you create a variable,
some space in memory is reserved for its value.
Variable Names
A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume)
i. A variable name must start with a letter or a period (.) and can be a combination of
letters, digits, periods (.) and underscores (_). If it starts with a period (.), the period
cannot be followed by a digit.
ii. A variable name cannot start with a number or underscore (_)
iii. Variable names are case-sensitive (age, Age and AGE are three different variables)
iv. Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
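The naming rules above can be checked in R itself. In this sketch, is_valid_name() is a hypothetical helper written for the illustration; it relies on the base R function make.names(), which rewrites any text into a syntactically valid name, so a name is valid exactly when make.names() leaves it unchanged.

```r
# Hypothetical helper: TRUE if the text is already a valid variable name
is_valid_name <- function(name) {
  make.names(name) == name
}

is_valid_name("total_volume")   # TRUE
is_valid_name(".rate")          # TRUE  (period followed by a letter)
is_valid_name("2nd_value")      # FALSE (starts with a number)
is_valid_name("_x")             # FALSE (starts with an underscore)
is_valid_name(".2way")          # FALSE (period followed by a digit)
is_valid_name("TRUE")           # FALSE (reserved word)

# Names are case-sensitive: these are two different variables
age <- 30
Age <- 31
```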
Variables that can hold one value (string, number, boolean etc) are called simple variables and
variables that can hold pairs of variables and values are called objects.
2.11 R objects
There are many types of R objects. In this course we will learn how to create the most
frequently used ones: vectors, matrices, factors, lists and data frames. We will then look
more closely at data frames, because they are the objects we will work with most.
VECTORS
Vector: a combination of elements (e.g. numbers, words), usually created using c(), seq(), or rep()
NB: All elements in a vector must be of the same data type, namely numeric, character or
logical.
How to create vectors in R
   Method                 Example
1  c(a, b, ...)           x = c(1, 5, 9)
2  a:b                    y = 1:5
3  seq(from, to, by)      z = seq(from = 0, to = 6, by = 2)
4  rep(x, times, each)    q = rep(c(7, 8), times = 2, each = 2)
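Running the four examples from the table shows what each method produces. The last two lines are an extra illustration (not from the table) of what happens when data types are mixed: R converts every element to a single common type.

```r
x <- c(1, 5, 9)                          # 1 5 9
y <- 1:5                                 # 1 2 3 4 5
z <- seq(from = 0, to = 6, by = 2)       # 0 2 4 6
q <- rep(c(7, 8), times = 2, each = 2)   # 7 7 8 8 7 7 8 8

# Mixing types: everything is converted to character
mixed <- c(1, "a", TRUE)                 # "1" "a" "TRUE"
class(mixed)                             # "character"
```

In rep(), each is applied first (7 7 8 8) and the result is then repeated times times.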
MATRIX
A matrix is a two-dimensional (r × c) object with r rows and c columns. All elements in a
matrix must be of the same data type.
How to create matrices in R
We use the matrix() function to create a matrix in R.
Syntax:  matrix(vector, nrow = r, ncol = c, byrow = TRUE)
Example: x_ma = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, byrow = TRUE)
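A short sketch of what the byrow argument does, using the same nine numbers as the example above. The name by_col is made up for the comparison. With byrow = TRUE the values fill the matrix row by row; with the default (byrow = FALSE) they fill it column by column.

```r
# Filled row by row
x_ma <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# Same values filled column by column (the default)
by_col <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3)

dim(x_ma)       # 3 3  (rows, columns)
x_ma[2, 3]      # 6    (row 2, column 3)
by_col[2, 3]    # 8
```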
LIST
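A list is a collection of objects. Each element in a list can be a vector, a matrix, a data frame
or another list. A list allows you to gather a variety of (possibly unrelated) objects under one
name. We create a list using the list() function.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
a <- c(1, 5, 9)               # illustrative numeric vector (assumed; not defined in the original)
y <- matrix(1:6, nrow = 2)    # illustrative matrix (assumed; not defined in the original)
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)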
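Once a list exists, its components can be retrieved by name or by position. The sketch below builds a small list of its own (the values are illustrative) and shows the common access forms.

```r
w <- list(name = "Fred",
          mynumbers = c(1, 5, 9),
          mymatrix = matrix(1:4, nrow = 2),
          age = 5.3)

print(names(w))              # component names
print(w$name)                # access by name with $
print(w[["mynumbers"]][2])   # [[ ]] returns the component itself: 5
age_only <- w["age"]         # single [ ] returns a list of length 1
print(class(age_only))       # "list"
```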
DATAFRAME
mydata$X1
3. using the functions attach() and detach()
attach(name_dataframe)
……….
detach()
With the function attach() it is possible to access the components of a data frame without typing the
name of the data frame each time. The attach() function adds the data frame to the search path.
In this way the components of the data frame become temporarily available as variables under
their component names until the detach() function is called.
detach() removes the data frame from the search path, so the components are no longer visible
by their names alone. They can then be accessed again only with the $ notation:
station$wind
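The behaviour of attach() and detach() can be checked with a small data frame. This is a sketch with made-up column names and values; it also shows that a variable created while the data frame is attached is stored in the workspace, not in the data frame.

```r
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# Remove any workspace objects that would hide the columns x and y
rm(list = intersect(c("x", "y", "z"), ls()))

attach(df)
z <- x + y                       # x and y are found through the search path
detach(df)

in_df <- "z" %in% names(df)      # FALSE: df itself was not changed
print(in_df)

df$z <- df$x + df$y              # the $ notation does modify the data frame
print(df)
```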
NOTE:
After the function attach() is called, operations on the components of the data frame do
not affect the original data frame, only a copy of its components. In order to modify the data
frame it is necessary to use the "$" notation or [[ ]]. For example, to create or modify a column
of a data frame, use the format below.
> df
> rm(x,y,z)
> attach(df)
> z<-x+y
> detach(df)
> df
> df$z <- df$x + df$y
> df
FACTOR
Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a
vector of integers in the range 1 to k (where k is the number of unique values in the nominal
variable), and an internal vector of character strings (the original values) mapped to these
integers.
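The integer coding described above can be inspected directly. A minimal sketch, using an illustrative vector of colours:

```r
colour <- c("red", "white", "red", "blue")
colour_f <- factor(colour)

print(levels(colour_f))       # the k unique values, sorted: "blue" "red" "white"
print(as.integer(colour_f))   # the integer codes: 2 3 2 1
print(table(colour_f))        # frequency of each level
```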
The working directory in R is the folder in which your R session is currently working. It is the
default location where R will look for files you want to load and where it will put any files you
save, so it is the place to store the files of your project.
Get working directory
To check the working directory of your R session, call getwd(), which returns the current
working directory path as a string. This is the folder where your files will be saved by
default.
getwd()
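Set working directory
To change the working directory, pass the path of the folder to setwd(). On Windows, write the
path with doubled backslashes or with forward slashes.
setwd("My\\Path")
setwd("My/Path") # Equivalent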
To set the working directory from the RStudio menu, go to Session → Set Working Directory →
Choose Directory and select the folder you want to use as your working directory. See the figure
below.
Once you set up your working directory, you may want to know which files are inside it. For that
purpose just call the dir() or the list.files() functions as illustrated in the following example.
dir()
list.files() # Equivalent
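The functions in this section can be combined as in the sketch below. It assumes a throwaway folder (the session's temporary directory) and a hypothetical file name, and restores the original working directory at the end.

```r
old_wd <- getwd()                 # remember where we started
setwd(tempdir())                  # switch to a temporary folder

file.create("example.txt")        # hypothetical file, created only for the demo
found <- "example.txt" %in% list.files()
print(found)                      # TRUE

file.remove("example.txt")        # clean up
setwd(old_wd)                     # go back to the original folder
```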
2.13 Getting data into RStudio
To import a dataset into R through RStudio, click File → Import Dataset and select the file
format to import from. See the figure below.
Exporting data from R
Steps to Export a DataFrame to CSV in R
Step 1: Create a DataFrame
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
mydata
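Only the first step survives in this section. A plausible next step, shown here as a sketch and not as the manual's own code, is to write the data frame to a CSV file with write.csv() and read it back to check the result. The output path in the temporary directory is an assumption made for illustration.

```r
mydata <- data.frame(ID = c(1, 2, 3, 4),
                     Color = c("red", "white", "red", NA),
                     Passed = c(TRUE, TRUE, TRUE, FALSE))

# Hypothetical output location; replace with a path in your working directory
out_file <- file.path(tempdir(), "mydata.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(mydata, file = out_file, row.names = FALSE)

# Read the file back to confirm what was exported
check <- read.csv(out_file)
str(check)
```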
20  rm(list=ls())                                        remove all objects at once
21  Ctrl + L                                             clear the console
22  t.test(x, mu=90)                                     one-sample t-test of a numeric vector x against a mean of 90 (default confidence level 95%)
23  ls(dataframeName)                                    list the variable names of a data frame
24  ls() or objects()                                    display the names of objects in the current environment (workspace)
25  dir()                                                list the files in the working directory
26  save(crime, file = "mhola.RData")                    save as R object
27  write.table(mtcars, file = "mauu.txt", sep = "\t")   export data in tab-delimited format
28  write.csv2(npk, file = "nyau.csv")                   export data in semicolon-separated format (comma as decimal mark)
29  write.csv(mtcars, file = "mtcarcsv.csv")             export data in comma-separated format
30  c(a, b, ...)                                         c(1, 5, 9)
31  a:b                                                  1:5
32  seq(from, to, by, length.out)                        seq(from = 0, to = 6, by = 2)
33  rep(x, times, each, length.out)                      rep(c(7, 8), times = 2, each = 2)
34  matrix(vector, nrow=r, ncol=c, byrow=TRUE)           matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=TRUE)
3.1 What is data management
In short, data management is the process of operating on datasets so that the analyst can
access them in different ways and can assess, edit and reshape the data to suit the purpose
and format that the analysis requires.
The activities that can be done during data management include the following:
Generating new variables in existing data set
Generating new variables into a different dataset
Renaming variables
Coding categorical variables
Recoding Categorical variables
Assigning value labels to categorical variables
Changing numeric variables into categorical
Data cleaning
Combining datasets
Subsetting datasets
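Several of the activities listed above can be sketched in base R as follows. The survey and regions data frames, their column names and their values are made up for illustration, and the age bands are an arbitrary choice.

```r
survey <- data.frame(id = 1:5,
                     age = c(23, 35, 47, 59, 31),
                     sex = c(1, 2, 2, 1, 2))

# Generating a new variable in an existing dataset
survey$age_months <- survey$age * 12

# Renaming a variable
names(survey)[names(survey) == "sex"] <- "gender"

# Assigning value labels to a categorical variable
survey$gender <- factor(survey$gender, levels = c(1, 2),
                        labels = c("Male", "Female"))

# Changing a numeric variable into a categorical one
survey$age_group <- cut(survey$age, breaks = c(0, 30, 50, Inf),
                        labels = c("Under 30", "30-49", "50+"), right = FALSE)

# Subsetting a dataset
women <- subset(survey, gender == "Female")

# Combining datasets on a shared key
regions <- data.frame(id = 1:5,
                      region = c("North", "South", "East", "West", "North"))
combined <- merge(survey, regions, by = "id")
print(combined)
```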
4.1 What is descriptive statistics?
Descriptive statistics summarize and describe the characteristics of a specific data set. They
give short summaries of the sample and of its measures so that the main features of the data can
be understood.
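As an illustration, the summaries below use the mtcars dataset that ships with R. The choice of dataset and variables is an assumption made for this sketch.

```r
mpg <- mtcars$mpg

# Five-number summary plus the mean
print(summary(mpg))

# Measures of central tendency and dispersion
mpg_mean <- mean(mpg)
mpg_median <- median(mpg)
mpg_sd <- sd(mpg)
print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd))

# Frequency table for a categorical variable
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```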
population. The goal of inferential statistics is to make generalizations about a population. In
inferential statistics, a statistic is taken from the sample data (e.g., the sample mean) and used to
make inferences about the population parameter (e.g., the population mean).
For the purpose of this course we will focus on correlation, chi-square, t-tests and one-way
analysis of variance (one-way ANOVA).
5.2.2 CORRELATION
Pearson correlation coefficient: a measure of the strength of the linear relationship
between two variables. In a sample it is denoted by r and is by design constrained to -1 ≤ r ≤ 1.
Furthermore:
Positive values denote positive linear correlation;
Negative values denote negative linear correlation;
A value of 0 denotes no linear correlation;
The closer the value is to 1 or -1, the stronger the linear correlation.
Correlation is an effect size and so we can verbally describe the strength of the correlation using
the guide that Evans (1996) suggests for the absolute value of r:
0.00-0.19 “very weak”
0.20-0.39 “weak”
0.40-0.59 “moderate”
0.60-0.79 “strong”
0.80-1.00 “very strong”
The direction of the relationship depends on the sign of the coefficient, if it is (+) we say there is
positive linear relationship between the two variables and if it is (-) we say there is a negative
linear relationship between the two variables.
Hypothesis
H0 : There is no linear relationship between the two variables
Ha : There is a linear relationship between the two variables
5.2.3 How to perform correlation in R
This practical part is covered by appendix 4: 4_correlation and chi-square.R
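For readers without the appendix to hand, the sketch below shows one way to run a Pearson correlation test in base R. It is not the appendix script: the mtcars variables are an illustrative choice, and the labelling step simply applies the Evans (1996) bands quoted above.

```r
# Test H0: no linear relationship between fuel consumption and weight
result <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
print(result)

r <- unname(result$estimate)    # sample correlation coefficient
p_value <- result$p.value       # compare with the chosen significance level

# Verbal description of |r| using the Evans (1996) guide
strength <- cut(abs(r),
                breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                labels = c("very weak", "weak", "moderate", "strong", "very strong"),
                right = FALSE, include.lowest = TRUE)
print(as.character(strength))
```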
Alternative hypothesis
The alternative hypothesis (H1) proposes that the two variables are related in the population.
In short:
H0: There is no association between the two variables.
H1: The two variables are related in the population.
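These hypotheses about association between two categorical variables are the ones tested by a chi-square test of independence. A minimal sketch follows, assuming two categorical variables from the built-in mtcars dataset (transmission type and engine shape); the variable choice is illustrative.

```r
# Cross-tabulate the two categorical variables
tab <- table(transmission = mtcars$am, engine = mtcars$vs)
print(tab)

# Chi-square test of independence
# (for a 2 x 2 table R applies Yates' continuity correction by default)
test <- chisq.test(tab)
print(test)

# Expected counts under H0; they should not be too small for the test to be reliable
print(test$expected)
```

If the p-value is below the chosen significance level, H0 is rejected in favour of H1.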
differs from a known value (a one-sample t-test), whether two groups differ from each other (an
independent two-sample t-test), or whether there is a significant difference in paired
measurements (a paired, or dependent samples t-test).
Types of t-tests
There are three t-tests to compare means: a one-sample t-test, a two-sample t-test and a paired
t-test. The table below summarizes the characteristics of each and provides guidance on how to
choose the correct test.
                      One-sample t-test       Two-sample t-test       Paired t-test
Number of variables   one                     two                     two
Purpose of test       Decide if the           Decide if the           Decide if the
                      population mean is      population means for    difference between
                      equal to a specific     two different groups    paired measurements
                      value or not            are equal or not        for a population is
                                                                      zero or not
Example: test if...   Mean heart rate of a    Mean heart rates for    Mean difference in
                      group of people is      two groups of people    heart rate for a group
                      equal to 65 or not      are the same or not     of people before and
                                                                      after exercise is zero
                                                                      or not
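The three tests in the table can each be run with t.test(). The sketch below uses the built-in sleep dataset (extra hours of sleep under two drugs, measured on the same ten people) purely for illustration; the hypothesised mean of 0 in the one-sample test is an assumption.

```r
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# One-sample t-test: is the mean extra sleep equal to 0?
one_sample <- t.test(sleep$extra, mu = 0)

# Two-sample t-test: are the two group means equal?
# (treats the groups as independent; Welch's test is R's default)
two_sample <- t.test(drug1, drug2)

# Paired t-test: is the mean within-person difference zero?
paired <- t.test(drug1, drug2, paired = TRUE)

print(one_sample)
print(two_sample)
print(paired)
```

The same measurements give different conclusions depending on whether the pairing is taken into account, which is why choosing the correct test matters.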
5.2.10 REGRESSION ANALYSIS
Regression analysis is used to quantify how one variable changes with respect to another
variable. There are many types of regression, such as simple linear, multiple linear,
logistic, and ordinal regression. The most commonly used regression in inferential statistics is
linear regression, which estimates the effect of a unit change in the independent variable
on the dependent variable.
LINEAR REGRESSION
Linear regression is a common statistical data analysis technique. It is used to determine the
extent to which there is a linear relationship between a dependent variable and one or more
independent variables. There are two types of linear regression: simple linear regression and
multiple linear regression.
In simple linear regression a single independent variable is used to predict the value of a
dependent variable.
In multiple linear regression two or more independent variables are used to predict the value of
a dependent variable. The difference between the two is the number of independent variables. In
both cases there is only a single dependent variable.
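Both forms can be fitted with lm(). The sketch below is illustrative only: it assumes the built-in mtcars dataset, with fuel consumption (mpg) as the dependent variable and weight (wt) and horsepower (hp) as independent variables.

```r
# Simple linear regression: one independent variable
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: two independent variables
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Each slope is the estimated change in mpg for a one-unit change
# in that variable (holding the others fixed in the multiple model)
print(coef(simple_fit))
print(coef(multiple_fit))

# Proportion of variance in mpg explained by each model
r2_simple <- summary(simple_fit)$r.squared
r2_multiple <- summary(multiple_fit)$r.squared
print(c(simple = r2_simple, multiple = r2_multiple))
```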
References
List of Appendices
Note: These appendices are R script files containing the R code used to produce the outputs.