
Module – 06

Data Analysis ToolPak for Statistics
Table of Contents

1. Activation of Data Analysis ToolPak Add-in
2. Descriptive Statistics
3. Histogram
4. Scatter Plot and Correlation Coefficient
5. Linear Regression
6. Time Series Analysis
Module Description

In this module, you will learn about the Data Analysis ToolPak for statistics in detail.

Learning Objectives

By the end of this module, you will be able to:

• Understand the activation of Data Analysis ToolPak add-in


• Explain Descriptive Statistics
• Define Histogram
• Describe Scatter Plots and Correlation Coefficient
• Define Linear Regression
• Explain Time Series analysis
1. Activation of Data Analysis ToolPak Add-in
Let us discuss the activation of the Data Analysis ToolPak add-in. This add-in provides the
statistical analysis tools used throughout this module.
➢ The user can open the Options dialog from the File tab, then choose Add-ins, select
Excel Add-ins in the Manage box, click Go, and check Analysis ToolPak. You will get the
pop-up shown in the figure below.

Figure 6.1

➢ When the user enables the add-in, the Data Analysis command will appear on the Data
tab in Excel.

2. Descriptive Statistics
➢ Descriptive Statistics will provide all the basic information about the dataset.
➢ These details will assist the user in understanding how the data is structured and
provide a general idea of data distribution.
Figure 6.2

➢ This table (Figure 6.2) contains the standard error, which is the standard deviation
divided by the square root of the total number of records, which is 200:

Standard Error = Standard Deviation / √(Total number of records)

➢ The standard error, median, mode, standard deviation, sample variance, kurtosis,
skewness, range, minimum, maximum, sum, and count all give information about the
dataset and help the user understand how the data is shaped.
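As a sketch, the core of the ToolPak's Descriptive Statistics output can be reproduced with Python's standard library. The field names below mirror the ToolPak table, and the sample data is invented for illustration:

```python
import math
import statistics

def descriptive_statistics(values):
    """Reproduce the core fields of the ToolPak's Descriptive Statistics table."""
    n = len(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return {
        "mean": statistics.mean(values),
        "standard_error": stdev / math.sqrt(n),  # SD / sqrt(n)
        "median": statistics.median(values),
        "standard_deviation": stdev,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }

# Invented sample data:
print(descriptive_statistics([10, 20, 30, 40, 50]))
```

The `statistics` module handles the sample (n − 1 denominator) variants of variance and standard deviation, matching what the ToolPak reports.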

3. Histogram
The purpose of the histogram is to understand the data distribution and frequency. The
histogram shows data in bar format.

Figure 6.3

For example, in the figure above, you have a numerical field, the price of the house, and you
want to find out which price range contains the most houses. The chart shows that most of the
house price values are around 250,000.
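A minimal sketch of the counting behind a histogram, using only the standard library. The bin edges and prices below are invented; following the ToolPak's convention, each bin counts values up to and including its upper edge, with a final "More" slot for anything beyond the last edge:

```python
from bisect import bisect_left

def histogram_counts(values, bin_edges):
    """Count values per bin: bin i holds values greater than the previous
    edge and less than or equal to bin_edges[i]; the last slot is 'More'."""
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        counts[bisect_left(bin_edges, v)] += 1
    return counts

# Invented house prices and bin edges:
prices = [180000, 210000, 240000, 250000, 255000, 260000, 310000]
edges = [200000, 250000, 300000, 350000]
print(histogram_counts(prices, edges))
```

The largest count lands in the bin ending at 250,000, matching the kind of reading described above.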

4. Scatter Plot and Correlation Coefficient


Scatter Plot
➢ It is used to compare two different numerical data columns.
➢ It helps understand the correlation between the columns.

Figure 6.4 (scatter plot; x-axis: Area of the House)

For example, in this particular chart (Figure 6.4), you have the area of the house and the
property price; analyzing the data shows whether there is any correlation between them. As
the area of the house increases, the price of the property rises. So there is linearity, a pattern,
and a correlation. A scatter plot therefore helps in understanding the correlation between
columns, or the pattern in the data.

Correlation Coefficient
➢ Correlation coefficient values vary from -1 to +1.
➢ If the two variables are positively correlated, the values will be more towards +1. For
example, as the house's area increases, the house's price will also increase.
➢ If the two variables are negatively correlated, the values will be more toward -1. For
example, as the distance between the house and the city increases, the house price
decreases. This indicates a negative correlation.
➢ If there is no relation between two variables, the values are more towards 0. For
example, let's say the kid's age and the house price; there is no correlation between the
kids' age and the house price.

The formula for the correlation coefficient is given below:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² )

For example, consider xᵢ as the area of the house; then yᵢ is the house price, x̄ is the average
area, and ȳ is the average house price. The numerator sums the products of the deviations,
and the denominator is the square root of the product Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)².
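The correlation coefficient formula translates directly into a short Python sketch; the area and price data below are invented for illustration:

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                            sum((y - y_bar) ** 2 for y in ys))
    return numerator / denominator

# Invented data: area of the house vs. property price.
area = [1000, 1500, 2000, 2500]
price = [150000, 220000, 310000, 380000]
print(correlation_coefficient(area, price))  # close to +1: strong positive
```

A perfectly linear relationship yields exactly +1 (or -1 when the slope is negative), and unrelated columns yield a value near 0, as described above.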

Figure 6.5

In the table above, you can see the area of the property, the number of bedrooms, the
distance from the municipality, and the price of the property. The resulting matrix is known as
a correlation coefficient matrix. As the area increases, the property price increases, with a high
value of 0.95. The correlation between the number of rooms and the property area is only
0.06, which is weak; and there is effectively no correlation for distance from the municipality,
because its value is close to zero or slightly negative.

5. Linear Regression
➢ Linear regression is used to predict a variable's value based on another variable's value.
➢ The variable being predicted is called the dependent variable. Using the independent
variable(s), the user can predict the dependent variable.
➢ Mathematically, the independent variable is denoted x, and the dependent variable
is y.
➢ When there is one x and one y, this type of regression is known as simple linear
regression.
➢ When you have multiple x's and one y, it is known as multiple linear regression.
➢ Mathematically, you have a dataset D = {(xᵢ, yᵢ)}, i = 1, …, n, where xᵢ and yᵢ are
real-valued numbers.

For example, suppose you want to predict the property's price, the variable y, a real-valued
number, based on various features such as the area of the house, the distance from the city,
and the number of rooms. The dependent variable is the house price, and the independent
variables are the area, the distance from the city, and the number of rooms.

The key assumption is that the relationship in the data should be linear, either positively or
negatively.
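As a sketch, a simple linear regression line can be fitted with the closed-form least-squares formulas; the area and price data below are invented:

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = b0 + b1 * x (simple linear regression)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Invented example: predict house price (y) from house area (x).
area = [1000, 1500, 2000, 2500]
price = [150000, 225000, 300000, 375000]
b0, b1 = fit_simple_linear_regression(area, price)
print(b0, b1)          # intercept and slope
print(b0 + b1 * 1800)  # predicted price for a house of area 1800
```

The ToolPak's Regression tool reports these same intercept and slope coefficients (plus diagnostics); this is only the core calculation for the one-variable case.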
Assumptions of Linear Regression - Linearity
➢ There should be linearity between the independent and dependent variables (It can be
positive or negative).
➢ Linear Regression will not work well when there is no linearity in the variables.

Figure 6.6

For example, as the house area increases, the house price increases; or as the distance from
the city increases, the house price decreases, as you can see in the scatterplots above. This
linearity is critical. However, if there is no linearity, as in the third chart showing a child's age
versus house price, there is no correlation, so that variable is not useful.
Assumptions of Linear Regression – No Outliers
➢ There should not be any outliers. If the outliers are present, it will impact the overall
predictions.

Figure 6.7

➢ The fitted line should pass through the majority of the points.
➢ The orange line represents the expected fit, but because there is an outlier on the
lower side, the line tilts toward that outlier and the red line results instead.
➢ This red line will make many wrong predictions for the rest of the data. So, when there
are outliers, remove them first and then create the model.
Assumptions of Linear Regression – No Multicollinearity
➢ There should not be any multicollinearity.
➢ Multicollinearity occurs when two independent variables are strongly correlated with
each other; when it is present, the estimated coefficients can change arbitrarily.

Figure 6.8

Example: Consider Figure 6.8.


As experience increases, salary increases; as age increases, salary increases. But as age
increases, experience also increases, so age and experience, both independent variables, are
correlated with each other, which is a problem. In that case, remove one of the variables. In
this example, it is preferable to use experience to predict the salary and drop 'Age' from the
predictors. This is a critical concept.
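A simple way to screen for multicollinearity is to check the pairwise correlations between the independent variables and flag pairs above a threshold; one column of each flagged pair can then be dropped. This is a minimal sketch (the predictor data and the 0.9 threshold are invented for illustration; more thorough diagnostics such as variance inflation factors exist):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                    sum((y - y_bar) ** 2 for y in ys))
    return num / den

def flag_multicollinear(columns, threshold=0.9):
    """Return pairs of column names whose pairwise correlation
    exceeds the threshold in absolute value."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(pearson(columns[names[i]], columns[names[j]])) > threshold:
                flagged.append((names[i], names[j]))
    return flagged

# Invented predictors: age moves in lockstep with experience.
predictors = {
    "age":        [25, 30, 35, 40, 45],
    "experience": [2, 7, 12, 17, 22],
    "rooms":      [3, 2, 4, 3, 5],
}
print(flag_multicollinear(predictors))
```

Here only the age/experience pair is flagged, matching the example above: drop one of the two before fitting the model.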

Adjusted R-Squared Value


It helps check the accuracy of the model. The R² value has a tendency to always increase as
variables are added; to correct for that behavior, the adjusted R² value is used:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

The adjusted R² value is denoted R̄² (R squared with a bar).

➢ n is the number of records on which you are training the model.
➢ k is the number of independent variables you use while creating the model.
➢ As k increases, the denominator (n − k − 1) decreases, so the fraction
(n − 1)/(n − k − 1) increases; this fraction is multiplied by (1 − R²), so the subtracted
term grows and the adjusted R² value decreases.
➢ k is therefore known as a penalizing factor: as it increases, the adjusted R² value
decreases.
➢ R² tends to always increase; whether you add a relevant variable or an irrelevant one
to predict y, R² will increase.
➢ That is the reason the adjusted R² value is used.
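The adjusted R² formula translates directly into code; a minimal sketch (the R², n, and k values below are invented for illustration):

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of training records and k is the number
    of independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# With the same R^2, a larger k (more variables) lowers the adjusted
# value, so k acts as a penalizing factor.
print(adjusted_r_squared(0.90, n=200, k=3))
print(adjusted_r_squared(0.90, n=200, k=20))  # lower than with k=3
```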

6. Time Series Analysis


➢ Time series analysis is one of the most important concepts in business.
➢ It allows the analyst to forecast, for example, how much revenue will come in next
month, or how many passengers a touring company will have in the upcoming
months.
➢ It helps you predict the values over a period of time; there are various components
available in time series:
• Trend
• Seasonality
• Cyclical
• Irregular
➢ Trend: It refers to how things are progressing, whether the trend is upward or
downward.
➢ Seasonality: Sales increase in a particular season. Example: increased sales in the
month of September.
➢ Cyclical: This component is not regular; it happens only occasionally, over periods
longer than a season.
➢ Trend and seasonality are the only components considered in the prediction model.
➢ The model does not consider the cyclical component in the prediction.
➢ Excel performs the prediction using the FORECAST.ETS function, which is based on
the Exponential Triple Smoothing (ETS) algorithm internally.
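Excel's FORECAST.ETS is based on triple exponential smoothing, which tracks level, trend, and seasonality. As a much-simplified sketch of the underlying idea, single exponential smoothing blends each new observation with the previous smoothed value; the monthly revenue series below is invented:

```python
def exponential_smoothing(series, alpha):
    """Single exponential smoothing:
    smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1].
    The last smoothed value serves as a one-step-ahead forecast."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Invented monthly revenue; alpha controls how fast old data is forgotten.
revenue = [100, 120, 110, 130, 125, 140]
smoothed = exponential_smoothing(revenue, alpha=0.5)
print(smoothed[-1])  # simple forecast for next month
```

Triple exponential smoothing extends this recursion with two more smoothed components (trend and seasonality), which is what lets FORECAST.ETS capture the trend and seasonality components described above.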
