Chapter 3 - DESCRIPTIVE ANALYSIS
CONTENT
• Introduction to the R Programming Language
• Key Ideas of R
• A Taste of R
• Summarizing the Data
• Summarizing Numeric Variables
• Summarizing Categorical Variables
• Exploring Relationships between Numeric Variables
• Plotting the Data
• The Grammar of Plots
• Using Categorical Variables as Facets
• Grouping and Summarizing Data by Categorical Variables
Section I: Introduction to R
Introduction to R Programming Language
R is a powerful tool that was built from the outset for productivity in data
analysis tasks.
Idea 2 : Functions
This states that each time we pass the same input to the function, we get the same answer. Second, functions do not
All data analysis is a sequence of functions applied one after another, each producing a specific, meaningful
transformation of the input data. In devising these functions, we split the analysis into a sequence of small steps,
each independent of the others. Changes to the internal logic of any one function do not change the nature of our analysis.
Data: There is only one way to store data, namely the data frame. A data frame is essentially a
spreadsheet: a structure with the variables of interest arranged in columns
and the samples on which the variables have been measured arranged in rows.
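As a minimal sketch of this layout, the following builds a small data frame in base R. The name data_df matches the one used later in these slides, but the values here are made up for illustration only.

```r
# Hypothetical toy data: three variables (columns) measured on five samples (rows).
data_df <- data.frame(
  age       = c(30, 33, 35, 59, 41),
  education = c("primary", "secondary", "tertiary", "secondary", "unknown"),
  balance   = c(1787, 4789, 1350, 0, 747)
)

str(data_df)   # one line per column: name, type and first few values
nrow(data_df)  # number of samples (rows)
ncol(data_df)  # number of variables (columns)
```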
Functions and Composition: For commonly used statistical functions such as mean()
and median(), we do not need to load any library. These are available as soon as
you start R. The pipe operator (%>%) provides an elegant way to organize nested
computations.
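To illustrate what the pipe buys us, here is a small sketch comparing a nested call with the equivalent piped form. The vector ages is a made-up example, and the pipe is assumed to come from the magrittr package (dplyr also makes it available).

```r
library(magrittr)  # provides %>%

ages <- c(30, 33, 35, 59, 41)  # hypothetical values

# Nested form: has to be read from the inside out.
nested <- round(mean(ages), 1)

# Piped form: reads left to right, one step at a time.
piped <- ages %>% mean() %>% round(1)

identical(nested, piped)  # TRUE: both forms compute the same value
```

Each step in the piped form is a function applied to the result of the previous step, which matches the "sequence of functions" view of analysis described earlier.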
Section II: Summarizing Data
Summarizing the Data
Jumping to the modeling stage without understanding the dependence between the
Once the data is read, a great way to peek into the data is to use the glimpse() function. It
The output which is generated consists of numeric variables and categorical variables.
Summarizing (Contd.)
After applying the glimpse() function, we apply the skim() function from the skimr package.
It summarizes the categorical variables in the data set by providing the number of unique
values of these variables and the numeric variables by providing the mean, standard
The glimpse function provides a good sense of what the data looks like, i.e., what kinds of
variables exist and the skim function provides an intuition for the exact range and
need a sense of the variability, and here the standard deviation (sd) and quantiles (quantile) come into consideration.
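The two overview functions described above can be tried on a toy data frame as follows. This is only a sketch: it assumes the dplyr and skimr packages are installed, and the data values are invented.

```r
library(dplyr)  # glimpse()
library(skimr)  # skim()

# Hypothetical toy data with one numeric and one categorical variable.
data_df <- data.frame(
  age       = c(30, 33, 35, 59, 41),
  education = c("primary", "secondary", "tertiary", "secondary", "unknown")
)

glimpse(data_df)  # column names, types and the first few values
skim(data_df)     # per-variable summary, grouped by variable type
```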
For instance,
> data_df %>% pull(age) %>% mean()
[1] 40.93621
Here we are asking R to pull out the age variable from the data and compute the mean of this variable.
> data_df %>% pull(age) %>% median()
[1] 39
Here we are asking R to pull out the age variable from the data and compute the median of this variable.
Summarizing (Contd..)
> data_df %>% pull(age) %>% sd()
[1] 10.61876
Here we are asking R to pull out the age variable from the data and compute the standard deviation of this variable.
One final important way to understand numeric variables is to look at quantiles. This is accomplished using the quantile
function in R.
For instance,
> data_df %>% pull(age) %>% quantile(probs = c(0.25, 0.5, 0.75))
25% 50% 75%
 33  39  48
Summarizing (Contd..)
Here, we are asking R to pull the age column and return the value of age for which 25%, 50% and 75% of the data lie below
this value. Looking at the output, we conclude that 50% of ages are below 39 (i.e., 39 is the median) and 75% of ages are
below 48.
• For instance,
> data_df %>% pull(age) %>% quantile(probs = 0.99)
99%
 71
Here, the output indicates that 99% of ages lie below 71.
The quantile function can be used to diagnose whether model specification assumptions are violated for specific variables.
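The numeric summaries in this section can be reproduced on a small made-up data frame, which makes it easy to check the results by hand. The values below are hypothetical and are not taken from the bank data set; dplyr is assumed for pull() and the pipe.

```r
library(dplyr)  # pull() and %>%

# Hypothetical ages, evenly spaced so the summaries are easy to verify.
data_df <- data.frame(age = c(25, 30, 35, 40, 45, 50, 55, 60, 65))

mean_age   <- data_df %>% pull(age) %>% mean()    # 45
median_age <- data_df %>% pull(age) %>% median()  # 45
sd_age     <- data_df %>% pull(age) %>% sd()      # spread around the mean

# Quartiles: the ages below which 25%, 50% and 75% of the data lie.
q <- data_df %>% pull(age) %>% quantile(probs = c(0.25, 0.5, 0.75))
q
```

With R's default quantile method, the three quartiles of this toy vector are 35, 45 and 55.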
Summarizing Categorical Variables
To summarize categorical variables, we use the unique function.
For Instance,
> data_df %>% pull(education) %>% unique()
[1] "primary" "secondary" "tertiary" "unknown"
The output indicates that education has four unique values – primary, secondary, tertiary and unknown.
We also want to know how many samples fall into each of these categories, which can be done with the table()
function.
> data_df %>% pull(education) %>% table()
primary secondary tertiary unknown
   6851     23202    13301    1857
Summarizing (Contd..)
The output indicates that secondary and tertiary categories of education are dominant in the
sample. We could also look at the relative proportions of each category in the sample by passing
this output to the prop.table() function.
This output indicates that about half of the sample has education of the secondary category and
about one-third has education of the tertiary category.
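The proportions described here can be recomputed directly from the counts printed by table() earlier in this section. The sketch below re-enters those four counts by hand as a named vector; in the original pipeline, the equivalent step would be to pass the table() output on to prop.table().

```r
# Counts copied from the table() output shown earlier in this section.
counts <- c(primary = 6851, secondary = 23202, tertiary = 13301, unknown = 1857)

# prop.table() divides each count by the total, giving relative proportions.
shares <- prop.table(counts)
round(shares, 2)
```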
Section III: Exploring Relationships
Exploring Relationships between Numeric Variables
A key problem that can derail model development is correlation
between predictors. To compute correlation, we use the cor
function.
For instance, we first select the variables from the data frame
between which we wish to measure the correlation (i.e., age, balance
and duration) and then compute the correlation between these
variables.
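A minimal sketch of the select-then-correlate step is shown below. The column names follow the text, but the data are randomly generated for illustration, so the resulting correlations carry no meaning; dplyr is assumed for select() and the pipe.

```r
library(dplyr)  # select() and %>%

set.seed(42)
# Hypothetical data: random values standing in for the bank data set.
data_df <- data.frame(
  age       = sample(18:90, 100, replace = TRUE),
  balance   = round(rnorm(100, mean = 1400, sd = 3000)),
  duration  = sample(0:1500, 100, replace = TRUE),
  education = sample(c("primary", "secondary", "tertiary"), 100, replace = TRUE)
)

# Keep only the numeric variables of interest, then compute pairwise correlations.
cor_mat <- data_df %>%
  select(age, balance, duration) %>%
  cor()

cor_mat  # 3 x 3 matrix with 1s on the diagonal
```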
Numeric relationships (Contd..)
• One way to confirm our suspicion is to conduct a correlation test using the cor.test function.
• The built-in function xtabs (read the ‘x’ as ‘cross’), which builds cross-tabulations, also comes into consideration here.
• For Instance,
Numeric Relationships (Contd..)
The tilde sign (~) is the hallmark of the formula notation in R. This
notation tells R that we wish to explore the relationship between the variables
mentioned in the formula.
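The formula notation can be sketched with the two functions mentioned in this section, both of which accept a one-sided formula and a data argument. The data below are randomly generated placeholders, and the choice of variables (age and duration for the test, education and y for the cross-tabulation) is an assumption made for illustration.

```r
set.seed(42)
# Hypothetical data standing in for the bank data set.
data_df <- data.frame(
  age       = sample(18:90, 100, replace = TRUE),
  duration  = sample(0:1500, 100, replace = TRUE),
  education = sample(c("primary", "secondary", "tertiary"), 100, replace = TRUE),
  y         = sample(c("no", "yes"), 100, replace = TRUE)
)

# Correlation test between two numeric variables, written as a formula.
ct <- cor.test(~ age + duration, data = data_df)
ct$estimate  # sample correlation
ct$p.value   # p-value for the null hypothesis of zero correlation

# Cross-tabulation of two categorical variables, also written as a formula.
xt <- xtabs(~ education + y, data = data_df)
xt
```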
Section IV: Creating Plots
Plotting the Data
Although numerical summaries are informative and should be conducted
before any modeling begins, visual summaries are often more insightful.
Plotting the data therefore helps us understand it better.
A properly designed plot can bring out the nuances in the variables and their
interrelationship.
Plots also provide the analyst with more degrees of freedom to represent
information as opposed to presenting information in a table.
The Grammar of Plots
The Grammar of Graphics breaks down the process of plotting into a sequence
starting with the input noun ‘data’ and culminating in the output noun ‘graphic’.
The grammar of graphics framework provides a powerful anchor to reason about the
plotting process and produce insightful plots.
We now look at this framework in greater detail by building a histogram to describe
the age of customers targeted by the bank in its direct marketing campaign.
Grammar of Graphics (Contd..)
All plots begin with a call to the ggplot function, with the input data frame as the argument that tells R where to
look for variables.
To actually compute the statistics and plot the histogram on the blank canvas, we call the geom_histogram
function. The prefix ‘geom_’ before the function signals our intention to map the data into a specific geometry.
Passing X and Y axis labels, a title and a subtitle to the labs function adds them to the plot.
Finally, adding a single line, a call to the function theme_classic(), applies a built-in ‘classic’ theme,
replacing the grey background with a white one and removing the grid lines.
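The build-up described above can be sketched layer by layer. The ages are randomly generated stand-ins for the bank data, and the bin width and label texts are arbitrary choices made for this example; ggplot2 is assumed to be installed.

```r
library(ggplot2)

set.seed(42)
# Hypothetical ages standing in for the customers in the campaign data.
data_df <- data.frame(age = sample(18:90, 500, replace = TRUE))

p <- ggplot(data_df, aes(x = age)) +   # canvas: where to look for variables
  geom_histogram(binwidth = 5) +       # compute bin counts and draw the bars
  labs(                                # axis labels, title and subtitle
    x = "Age (years)",
    y = "Number of customers",
    title = "Age of targeted customers",
    subtitle = "Simulated data for illustration"
  ) +
  theme_classic()                      # white background, no grid lines

p
```

Each `+` adds one layer or setting, so any of the later lines can be dropped to see the intermediate plots.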
Plotting Numeric Variables and their Relationships
Another plot which is used to gauge the distribution of numeric variables is the
density plot which can be generated using the geom_density function.
Plots of Numeric Variables (Contd.)
For instance,
The objective of making a density plot here is to see if there are any observable
differences in the ages of customers who subscribed to the term deposit at the end
of the campaign compared to those who did not.
Since this division is encoded in the data as the final observed outcome y, we
pass y as an input to the fill parameter of the aesthetic, instructing R to plot the
two groups demarcated by y in two different colours.
The alpha argument of the geom_density function sets the opacity of the two
curves, so that both remain visible where they overlap.
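A sketch of such a density plot follows. The outcome column y and the age column follow the text, but the values are simulated and the alpha value of 0.5 is an arbitrary choice.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data: simulated ages and a simulated yes/no campaign outcome.
data_df <- data.frame(
  age = sample(18:90, 500, replace = TRUE),
  y   = sample(c("no", "yes"), 500, replace = TRUE)
)

p <- ggplot(data_df, aes(x = age, fill = y)) +  # one fill colour per outcome
  geom_density(alpha = 0.5) +                   # semi-transparent overlapping curves
  labs(x = "Age (years)", y = "Density", fill = "Subscribed")

p
```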
Using Categorical Variables as Facets
In cases where a categorical variable exhibits only a few levels (i.e.,
four or fewer), it is often convenient to divide the plot into subplots, with
each subplot focusing on one level of the categorical variable. In other
words, we can use the levels of categorical variables as facets and ask
R to treat each level as a facet using the facet_wrap function.
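As a sketch of faceting, the following draws one age histogram per education level. The data are simulated, and education is chosen as the facet variable here only because it has four levels in the slides.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data: simulated ages and education levels.
data_df <- data.frame(
  age       = sample(18:90, 500, replace = TRUE),
  education = sample(c("primary", "secondary", "tertiary", "unknown"),
                     500, replace = TRUE)
)

p <- ggplot(data_df, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ education)  # one subplot per level of education

p
```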
Grouping and Summarizing Data by Categorical Variables
For this plot we introduce a new geometry: the bar plot, which
represents the number of occurrences of the levels of a categorical
variable. There is only one new parameter passed to the
geom_bar function we use to make a bar plot: stat = ‘identity’.
The value ‘identity’ tells the function that we have already completed the
count process and it need not count the occurrences of the categorical
variable levels.
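The group-then-plot workflow can be sketched as below. The data are simulated, the summary column name n is a choice made for this example, and dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Hypothetical data: simulated education levels.
data_df <- data.frame(
  education = sample(c("primary", "secondary", "tertiary", "unknown"),
                     500, replace = TRUE)
)

# Step 1: group by the categorical variable and count the rows in each group.
counts_df <- data_df %>%
  group_by(education) %>%
  summarize(n = n())

# Step 2: plot the precomputed counts; stat = "identity" uses n as the bar height.
p <- ggplot(counts_df, aes(x = education, y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Education", y = "Number of customers")

p
```

In ggplot2, geom_col() is a shorthand for geom_bar(stat = "identity") and could be used in step 2 instead.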
THANK YOU