Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 103

R BASIC

SESSION 2 - 3

UNIVERSITAS BINA NUSANTARA

SUBJECT MATTER EXPERT


Sub-Topics
• Running R code
• Variables and functions
• Descriptive statistics
• Plotting numerical data
• Plotting categorical data
• Descriptive statistics
Classifying Variables by Type
Variables

Numerical Variables
Categorical Variables
(Quantitative)
(Qualitative)

Continuous
Discrete Variables Variables

Bina Nusantara University 4


Measurement Scales
1. Nominal

2. Ordinal

Bina Nusantara University 5


Measurement Scales
3. Interval
(does not involve a true zero point)

4. Ratio
(involves a true zero point, as in height, weight, age)

Bina Nusantara University 6


Measurement and Scales

Bina Nusantara University 7


Data sources are classified as being either primary
sources or secondary sources.
Sources of data fall into one of four categories:
 Data distributed by an organization or an individual
 A designed experiment
 A survey
 An observational study

Organizations
and individuals
that collect and
publish data
typically use that
data as a primary
source and then
let others use it
as a secondary
source.

Bina Nusantara University 8


A population contains all the items A sample contains only a portion of a population
or individuals of interest that you of interest. You analyze a sample to estimate
seek to study characteristics of an entire population.

Bina Nusantara University 9


Types of Sampling Methods
3 main reasons for selecting a sample:
• Selecting a sample is less time-consuming than selecting every
item in the population.
• Selecting a sample is less costly than selecting every item in the
population.
• An analysis of a sample is less cumbersome and more practical
than an analysis of the entire population

Bina Nusantara University 10


Bina Nusantara University 11
Summary Table

A summary table tallies the set of individual values as frequencies or


percentages for each category. A summary table helps you see the
differences among the categories by displaying the frequency,
amount, or percentage of items in a set of categories in a separate
column.

Bina Nusantara University 12


Bar Chart
A bar chart visualizes a categorical variable as a series of bars, with
each bar representing the tallies for a single category. In a bar chart, the
length of each bar represents either the frequency or percentage of
values for a category and each bar is separated by space, called a gap.

Bina Nusantara University 13


Pie Chart
A pie chart and a doughnut chart both use parts of a circle to represent the tallies
of each category of a categorical variable. The size of each part, or slice, varies
according to the percentage in each category. To represent a category as a slice,
you multiply the percentage of the whole that the category represents by 360 (the
number of degrees in a circle) to get the size of the slice.

Bina Nusantara University 14


Doughnut chart?

Bina Nusantara University 15


Doughnut Chart
Doughnut charts can visualize two as well as a single categorical variable. When
visualizing two variables, the doughnut chart appears as two concentric rings,
one inside the other, each ring containing the categories of one variable.

Bina Nusantara University 16


Pareto Chart
In a Pareto diagram, the categorized responses are plotted in
descending order, according to their frequencies, and are combined
with a cumulative percentage line on the same chart. The Pareto
diagram can identify situations in which the Pareto principle occurs.

Bina Nusantara University 17


Histogram
A histogram visualizes data as a vertical bar chart in which each bar
represents a class interval from a frequency or percentage
distribution. In a histogram, you display the numerical variable along
the horizontal (X) axis and use the vertical (Y) axis to represent either
the frequency or the percentage of values per class interval. There are
never any gaps between adjacent bars in a histogram.

Bina Nusantara University 18


Polygon
When using a categorical variable to divide the data of a numerical variable into two
or more groups, you visualize data by constructing a percentage polygon.

Bina Nusantara University 19


Multidimensional Chart Types
Multidimensional data elements are those with two or more
dimensions. This category is home to many of the most common
types of charts and graphs.
Temporal Visualizations
Temporal visualizations are similar to one-dimensional linear
visualizations but differ because they have a start and finish
time and items that may overlap each other.
Temporal Visualizations
Temporal visualizations are similar to one-dimensional linear
visualizations but differ because they have a start and finish
time and items that may overlap each other.
Temporal Visualizations
Temporal visualizations are similar to one-dimensional linear
visualizations but differ because they have a start and finish
time and items that may overlap each other.
2D Area Visualizations
2D area types of data visualization are usually geospatial,
meaning that they relate to the relative position of things on
the earth’s surface.
2D Area Visualizations
2D area types of data visualization are usually geospatial,
meaning that they relate to the relative position of things on
the earth’s surface.
2D Area Visualizations
2D area types of data visualization are usually geospatial,
meaning that they relate to the relative position of things on
the earth’s surface.
Hierarchical Charts
Hierarchical data sets are orderings of groups in
which larger groups encompass sets of smaller
groups.
Hierarchical Charts
Hierarchical data sets are orderings of groups in
which larger groups encompass sets of smaller
groups.
Network Visualizations
Network data visualizations show how data sets
are related to one another within a network.
Network Visualizations
Network data visualizations show how data sets
are related to one another within a network.
Network Visualizations
Network data visualizations show how data sets
are related to one another within a network.
R Code
• Everything that we do in R revolves
around code. The code will contain
instructions for how the computer
should treat, analyse and manipulate
data. Thus each line of code tells R to
do something: compute a mean value,
create a plot, sort a dataset, or
something else.
32
R Code
• To create a new script file in RStudio, press
Ctrl+Shift+N on your keyboard, or select File
> New File > R Script in the menu.
• You can then start writing your code in the
Script panel. For instance, try the following:

33
R Code
• To run the entire script do one of the
following:
– Press the Source button in the upper right
corner of the Script panel.
– Press Ctrl+Shift+Enter on your keyboard.
– Press Ctrl+Alt+Enter on your keyboard to run
the code without printing the code and its
output in the Console.

34
R Code
• To run a part of the script, first select the
lines you wish to run, e.g. by highlighting
them using your mouse. Then do one of the
following:
– Press the Run button at the upper right corner
of the Script panel.
– Press Ctrl+Enter on your keyboard (this is how I
usually do it!).

To save your script, click the Save icon, choose File > Save
in the menu or press Ctrl+S. R script files should have the
file extension .R, e.g. My first R script.R.
35
Glimpse of Previous Session

36
Glimpse of Previous Session

37
Types and Structures
• R has six basic data types. For most people, it suffices
to know about the first three in the list below:
• numeric: numbers like 1 and 16.823 (sometimes also called
double).
• logical: true/false values (boolean): either TRUE or FALSE.
• character: text, e.g. "a", "Hello! I'm Ada." and
"name@domain.com".
• integer: integer numbers, denoted in R by the letter L: 1L, 55L.
• complex: complex numbers, like 2+3i. Rarely used in statistical
work.
• raw: used to hold raw bytes. Don’t fret if you don’t know what
that means.

38
Try this on your code
# Memasukkan data bertipe logical TRUE, T, FALSE atau F (vektor dengan class logical)
(x.log = c(TRUE, T, FALSE, F, T, T, FALSE)) # data akan langsung ditampilkan
mode(x.log)
storage.mode(x.log)
typeof(x.log)

# Memasukkan data dengan tipe yang berbeda (numeric dan logical)


(z = c(5,TRUE,3,7, F)) # TRUE dipaksa menjadi numeric 1 dan F menjadi 0
mode(z)

# Operasi aritmatika pada vector


xa = c(1, 5, 10, 100); xb = c(2, 20, 25, -2)
xa * 5
xa / 5
xa^2
xa + 50
xa * xb
39
Try this on your code

40
Try this on your code

41
Try this on your code

42
Types and Structures
• To check what type of object a variable is, you can use
the class function:

• What happens if we use class on a vector?


Variables and Functions
Storing data
Storing data is to store data in an object (varible), for
example:
The code
x <- 4.
is used to assign the value 4 to the variable x. It is read as
“assign 4 to x”. The <- part is made by writing a less than sign
(<) and a hyphen (-) with no space between them
The following two lines of code will compute
𝑥 + 1 = 4 + 1 = 5 and 𝑥 + 𝑥 = 4 + 4 = 8:

45
Storing data
Exercise. Do the following using R:
1. Compute the sum 924 + 124 and assign the result to a
variable named a.
2. Compute 𝑎 ⋅ 𝑎.
It is often preferable to give your variables more
informative names. Compare the following two code
chunks:.

46
Storing data

• Some characters, including spaces, -, +, *, :, =, ! and


$ are not allowed in variable names.
• Another great way to improve the readability of your
code is to use comments. A comment is a piece of
text, marked by #, that is ignored by R.
• Comments can be placed on separate lines or at the
end of a line of code.

47
ASSIGNMENT (A)

Please explain the central tendency in statistics from that data!


(Remember that if you want to find median, you have to rank the data first)

48
Storing data- ASSIGNMENT (B)

In the Script panel in RStudio, you can comment and uncomment (i.e. remove the #
symbol) a row by pressing Ctrl+Shift+C on your keyboard.
49
Storing data- ASSIGNMENT (B)
Answer the following questions:
1. What happens if you use an invalid character in a variable
name? Try e.g. the following:

2. What happens if you put R code as a comment?


E.g.:

50
Storing data- ASSIGNMENT (B)

3. What happens if you remove a line break and replace it by


a semicolon ;? E.g.::

4. What happens if you do two assignments on the


same line? E.g.:

51
Answer it in .pdf file only!
Link is prohibited
Submit in Teams Group
Due date: 10 Oct 2023 (23.59 WIB)

52
Vectors and data frames
• Almost invariably, you’ll deal with more than one figure at
a time in your analyses. For instance, we may have a list
of the ages of customers at a bookstore:
28, 48, 47, 71, 22, 80, 48, 30, 31
• We could store each observation in a separate variable:

53
Vectors and data frames
• A much better solution is to store the entire list in just one
variable. In R, such a list is called a vector. We can create a
vector using the following code, where c stands for combine:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 3)
• The elements of a vector must be of the same type (all
numeric, all character, all logical, etc). All elements of the age
vector are numeric
• So for instance, if we wish to express the ages in months rather
than years, we can convert all ages to months using:

54
Vectors and data frames
• In the case of our bookstore customers, we also have
information about the amount of money they spent on
their last purchase:
20, 59, 2, 12, 22, 160, 34, 34, 29
• First, let’s store this data in a vector:

• We can combine the two vectors into a data frame as follows:

• The combination of two vectors like this forms an


object with class data.frame. Vector types can be
different (e.g. numeric and character)
55
Vectors and data frames
• If you type bookstore into the Console, it will show a
simply formatted table with the values of the two
vectors (and row numbers):

56
Functions
• A function is a ready-made set of instructions - code -
that tells R to do something. There are thousands of
functions in R.
• The code for doing this follows the pattern
function_name(variable_name). As a first example,
consider the function mean, which computes the mean
of a variable:

57
Functions
• Some functions take more than one variable as input,
and may also have additional arguments (or parameters).
• One such example is cor, which computes the correlation
between two variables:

By default, cor uses the Pearson correlation formula,


which is known to be sensitive to outliers.

58
Coefficient of Correlation

Bina Nusantara University 59


Positive Correlation Negative Correlation

Image sources:
Zero Correlation https://
numberdyslexia.com/
common-examples-of-
Bina Nusantara University positive-negative-and-zero-
60
Bina Nusantara University 61
Let’s find these!

62
Let’s find these!

63
Let’s find these!

64
Functions

• How can we know what arguments to pass to a function?


• We can look at the documentation, i.e. help file, for a function that
we are interested in. This is done by typing ?function_name in the
Console panel, or doing a web search for R function_name. To view
the documentation for the cor function, type: ?cor

65
Functions
The documentation for R functions all follow the same pattern:
• Description: a short (and sometimes quite technical) description of what the function
does.
• Usage: an abstract example of how the function is used in R code.
• Arguments: a list and description of the input arguments for the function.
• Details: further details about how the function works.
• Value: information about the output from the function.
• Note: additional comments from the function’s author (not always included).
• References: references to papers or books related to the function (not always
included).
• See Also: a list of related functions.
• Examples: practical (and sometimes less practical) examples of how to use the
function.

66
Packages
• Packages are collections of functions and datasets that add
new features to R.
• More than 17,000 packages and counting.
• There are R packages for just about anything that you could
possibly want to do.
• All packages have been contributed by the R community -
that is, by users like you and me.
• To install the package from CRAN, you can either select Tools
> Install packages in the RStudio menu
• Example:

67
Packages
• After you’ve installed the package, you’re still not
finished quite yet.
• The package may have been installed, but its
functions and datasets won’t be available until you
load it.
• It is done with a single short line of code using the
library function, that I recommend putting at the top
of your script file:

68
Descriptive statistics
Descriptive statistics
• We will study two datasets that are shipped with the
ggplot2 package:
– diamonds: describing the prices of more than 50,000 cut
diamonds.
– msleep: describing the sleep times of 83 mammals.
• These two datasets are automatically loaded as data
frames when you load ggplot2:

70
Descriptive statistics
• To have a first look at it, type the following in the
Console panel:

• To view all 83 rows and all 11 variables, use:

• You’ll notice that some cells have the value NA instead


of a proper value. NA stands for Not Available, and is a
placeholder used by R to point out missing data. In this
case, it means that the value is unknown for theanimal.

71
Descriptive statistics
• To have a first look at it, type the following in the
Console panel:

• To view all 83 rows and all 11 variables, use:

• You’ll notice that some cells have the value NA instead


of a proper value. NA stands for Not Available, and is a
placeholder used by R to point out missing data. In this
case, it means that the value is unknown for theanimal.

72
Descriptive statistics
• To find information about the data frame containing
the data, some useful functions are:

• The documentation for msleep gives a short


description of the data and its variables. Read it to
learn a bit more about the variables:

73
Numerical data
• A convenient way to get some descriptive statistics
giving a summary of each variable is to use the
summary function:

• For the numerical variable sleep_rem, we have the


following:

74
Numerical data
• This tells us that the mean of sleep_rem is 1.875, that
smallest value is 0.100 and that the largest is 6.600. The
1st quartile is 0.900, the median is 1.500 and the third
quartile is 2.400. Finally, there are 22 animals for which
there are no values (missing data - represented by NA).

• Let’s say that we want to compute some descriptive statistics for


the sleep_total variable. To access a vector inside a data frame, we
use a dollar sign: data_frame_name$vector_name.
• So to access the sleep_total vector in the msleep data frame, we
write:

75
Numerical data
• Some examples of functions that can be used to compute
descriptive statistics for this vector are:

To see how many animals sleep for more than 8 hours a day, we can
use the following:

76
Numerical data

• Now, let’s try to compute the mean value for the length of REM
sleep for the animals:

returns the answer NA.


• In order to ignore NA:s in the computation, we set na.rm =
TRUE in the function call:

• Compute the correlation between sleep_total and sleep_rem:

77
Numerical data

• A quick look at the documentation (?cor), tells us that the


argument used to ignore NA values has a different name for cor
- it’s not na.rm but use.
• We set use = "complete.obs" to compute the correlation
using only observations with complete data (i.e. no missing
values):

78
Categorical data
• Some of the variables, like vore (feeding behaviour) and
conservation (conservation status) are categorical rather than
numerical.
• For categorical variables (often called factors in R), we can
instead create a table showing the frequencies of different
categories using table:

• To instead show the proportion of different categories, we can


apply proportions

79
Categorical data

• Data values that are only able to discriminate and/or sort, such
as gender, clothing size (S, M, L) are classified as catagorical
data. In R data like this is included in the factor class
• The table function can also be used to construct a cross table
that shows the counts for different combinations of two
categorical variables:

80
Categorical data

81
Plotting Numerical Data
Plotting numerical data
• There are several different approaches to creating
plots with R.
• We will mainly focus on creating plots using the
ggplot2 package, which allows us to create good-
looking plots using the so-called grammar of graphics.
• The three key components to grammar of graphics
plots are:
 Data: the observations in your dataset,
 Aesthetics: mappings from the data to visual
properties (like axes and sizes of geometric objects),
 Geoms: geometric objects, e.g. lines, representing
what you see in the plot.
83
Our first plot
• Using base R, we simply do a call to the plot function

• The code for doing this using ggplot2 is more verbose:

The code consists of three parts:


 Data: given by the first argument in the call to ggplot: msleep
 Aesthetics: given by the second argument in the ggplot call: aes,
where we map sleep_total to the x-axis and sleep_rem to the y-axis.
 Geoms: given by geom_point, meaning that the observations will be
represented by points.

84
Our first plot

85
Colours, shapes and axis labels
• it’s usually a good idea to change the label for the x-
axis from the variable name “sleep_total” to
something like “Total sleep time (h)”. This is done by
using the + sign again, adding a call to xlab to the plot:

• To change the colour of the points, you can set the colour in
geom_point:

86
Colours, shapes and axis labels
• Alternatively, you may want to use the colours of the
point to separate different categories.
• This is done by adding a colour argument to aes, since
you are now mapping a data variable to a visual
property.

What happens if we use a continuous variable, such as the sleep


cycle length sleep_cycle to set the colour?

87
Axis limits and scales
• Assume that we wish to study the relationship
between animals’ brain sizes and their total sleep
time. We create a satterplot using:

• There are two animals with brains that are much


heavier than the rest (African elephant and Asian
elephant). These outliers distort the plot, making it
difficult to spot any patterns.
• We can try changing the x-axis to only go from 0 to 1.5
by adding xlim to the plot, to see if that improves it:

88
Axis limits and scales
• We can try changing the x-axis to only go from 0 to 1.5
by adding xlim to the plot, to see if that improves it:

This is slightly better, but we still have a lot of points clustered


near the y-axis, and some animals are now missing from the
plot.

89
Axis limits and scales
• Another option is to resacle the x-axis by applying a
log transform to the brainweights, which we can do
directly in aes:

• A third option that mitigates this is to add


scale_x_log10 to the plot, which changes the scale of
the x-axis to a log10 scale.

90
Comparing groups
• In our plot of animal brain weight versus total sleep
time, we may wish to separate the different feeding
behaviours (omnivores, carnivores, etc.) in the msleep
data using facetting instead of different coloured
points. In ggplot2 we do this by adding a call to
facet_wrap to the plot:

91
Boxplots
• Another option for comparing groups is boxplots
(also called box-and-whiskers plots).

92
Boxplots
• The boxes visualise important descriptive statistics for
the different groups, similar to what we got using
summary:
 Median: the thick black line inside the box.
 First quartile: the bottom of the box.
 Third quartile: the top of the box.
 Minimum: the end of the line (“whisker”) that extends from the bottom of the
box.
 Maximum: the end of the line that extends from the top of the box.
 Outliers: observations that deviate too much12 from the rest are shown as
separate points. These outliers are not included in the computation of the
median, quartiles and the extremes.

93
Boxplots

94
Histograms
• The ggplot2 code for histograms follows the same
pattern as other plots, while the base R code uses the
hist function:

• Data: given by the first argument in the call to ggplot: msleep


• Aesthetics: given by the second argument in the ggplot call:
aes, where we map sleep_total to the x-axis.
• Geoms: given by geom_histogram, meaning that the data will
be visualised by a histogram.
95
Histograms

96
Plotting Categorical Data
Bar charts
• When visualising categorical data, we typically try to
show the counts, i.e. the number of observations, for
each category. The most common plot for this type of
data is the bar chart.
• The code for creating them is:

98
Bar charts

99
Bar charts
• To create a stacked bar chart using ggplot2, we use
map all groups to the same value on the x-axis and
then map the different groups to different colours.
This can be done as follows:

100
Saving your plot
• When you create a ggplot2 plot, you can save it as a
plot object in R:

• To plot a saved plot object, just write its name:

• If you like, you can add things to the plot, just as before:

• To save your plot object as an image file, use


ggsave

101
THANK YOU
R PROGRAMMING
UNIVERSITAS BINA NUSANTARA

Februay 2023
References
• Måns Thulin. (2021). Modern Statistics with
R, From wrangling and exploring data to
inference and predictive modelling. Eos
Chasma Press. Uppsala, Sweden. ISBN:
978-91-527-0151-5

103

You might also like