Data Analysis ToolPak for Statistics
In this module, you will learn about the Data Analysis ToolPak for statistics in detail.
Learning Objectives
Figure 6.1
1. Enabling the Data Analysis ToolPak
➢ When the user enables the option, the Data Analysis command will appear on the Data tab in
Excel.
2. Descriptive Statistics
➢ Descriptive Statistics will provide all the basic information about the dataset.
➢ These details will assist the user in understanding how the data is structured and
provide a general idea of data distribution.
Figure 6.2
➢ This table (Figure 6.2) contains the standard error, which is the standard deviation
divided by the square root of the total number of records, which here is 200.

Standard Error = Standard Deviation / √(Total number of records)
➢ The standard error, median, mode, standard deviation, sample variance, kurtosis,
skewness, range, minimum, maximum, sum, and count all give information about the
dataset and help you understand how the data is shaped.
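As a rough sketch in Python, the same summary measures the ToolPak's Descriptive Statistics tool reports can be computed directly. The sample below is hypothetical (the module's 200-record dataset is not reproduced here), so the values are for illustration only:

```python
import math
import statistics

# Hypothetical sample of house prices (the module's dataset has 200 records).
prices = [240000, 255000, 248000, 260000, 252000, 246000, 251000, 249000]

n = len(prices)
mean = statistics.mean(prices)
stdev = statistics.stdev(prices)      # sample standard deviation
std_error = stdev / math.sqrt(n)      # standard error of the mean

print("Mean:", mean)
print("Median:", statistics.median(prices))
print("Standard deviation:", round(stdev, 2))
print("Standard error:", round(std_error, 2))
```

The standard error line mirrors the formula above: the sample standard deviation divided by the square root of the record count.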
3. Histogram
The purpose of the histogram is to understand the data distribution and frequency. The
histogram shows data in bar format.
Figure 6.3
For example,
In the figure above, you have a numerical field: the price of the house. You want to find
out which price range contains the most houses. The chart shows that most of the house
price values are around 250000.
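The binning a histogram performs can be sketched with a few lines of Python. The prices and the bin width of 50000 below are hypothetical, not taken from Figure 6.3:

```python
from collections import Counter

# Hypothetical house prices; most fall near 250000.
prices = [210000, 245000, 250000, 252000, 255000, 248000, 310000, 251000]

bin_width = 50000
# Map each price to the lower edge of its bin and count how many land in each.
bins = Counter((p // bin_width) * bin_width for p in prices)

for lower in sorted(bins):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bins[lower]}")
```

The tallest row of `#` marks corresponds to the tallest bar in the ToolPak's histogram chart.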
4. Scatterplot
For example, in this chart (Figure 6.4), you have the area of the house and the
property price; analyze the data to see whether there is any correlation between them. As the
area of the house increases, the price of the property rises. So there is linearity, a pattern, and
correlation. A scatterplot therefore helps you understand the correlation between columns or the
pattern in the data.
Correlation Coefficient
➢ Correlation coefficient values vary from -1 to +1.
➢ If the two variables are positively correlated, the values will be more towards +1. For
example, as the house's area increases, the house's price will also increase.
➢ If the two variables are negatively correlated, the values will be more toward -1. For
example, as the distance between the house and the city increases, the house price
decreases. This indicates a negative correlation.
➢ If there is no relation between two variables, the values are more toward 0. For
example, consider a kid's age and the house price; there is no correlation between
them.
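The three cases above can be demonstrated with a small Python sketch of the Pearson correlation coefficient. All the data below is hypothetical and chosen only to illustrate the positive, negative, and near-zero cases:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; always lies between -1 and +1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

area    = [1000, 1200, 1500, 1800, 2000]            # hypothetical sq ft
price   = [200000, 240000, 290000, 350000, 400000]  # rises with area
kid_age = [7, 3, 9, 2, 7]                           # unrelated to price

print(pearson(area, price))                   # near +1: positive correlation
print(pearson(area, list(reversed(price))))   # near -1: negative correlation
print(pearson(kid_age, price))                # near 0: no correlation
```

The reversed price list simulates a variable that falls as area rises, such as distance from the city.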
Figure 6.5
In the table above, you can see the area of the property, the number of bedrooms, the distance
from the municipality, and the price of the property. The resulting matrix is known as a
correlation coefficient matrix. As the area increases, the property price increases, with a high
value of 0.95. The correlation between the number of rooms and the property area is only 0.06,
which is close to zero, so there is little relationship there. Likewise, the distance from the
municipality shows no correlation with price, because its value is close to zero or slightly
negative.
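A correlation coefficient matrix like the one in Figure 6.5 can be built by computing the Pearson coefficient for every pair of columns. The columns below only mimic the figure's layout; the numbers are hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical columns mirroring Figure 6.5's layout.
data = {
    "area":     [1000, 1200, 1500, 1800, 2000],
    "distance": [12, 3, 9, 5, 7],
    "price":    [210000, 240000, 280000, 330000, 420000],
}

cols = list(data)
# Every cell (a, b) holds the correlation between columns a and b.
matrix = {a: {b: round(pearson(data[a], data[b]), 2) for b in cols} for a in cols}
for a in cols:
    print(a, matrix[a])
```

The diagonal is always 1.0 (every column correlates perfectly with itself), and the matrix is symmetric, just like the ToolPak's output.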
5. Linear Regression
➢ Linear regression is used to predict a variable's value based on another variable's value.
➢ The variable being predicted is called the dependent variable. Using the
independent variables, the user can predict the dependent variable.
➢ Mathematically, it is said that the independent variable is x, and the dependent variable
is y.
➢ When there is one x and one y, it is known as simple linear
regression.
➢ When you have multiple Xs and one y, it is known as multiple linear regression.
➢ Mathematically, you have a data set D = {(xᵢ, yᵢ) : i = 1, …, n}, where each xᵢ and yᵢ
is a real-valued number.
For example, suppose you want to predict the property's price, the variable y, which is a
real-valued number, based on various features: the area of the house, the distance from the
city, the number of rooms, and so on. The dependent variable is the house price, and the
independent variables are the area, the distance from the city, and the number of rooms.
The assumption is that the relationship in the data should be linear, either positively
or negatively.
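The line-fitting that the ToolPak's Regression tool performs can be sketched with ordinary least squares. The house data below is hypothetical and deliberately made perfectly linear so the fitted slope and intercept are easy to check:

```python
# Least-squares fit of y = b0 + b1 * x (simple linear regression).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

area  = [1000, 1200, 1500, 1800, 2000]            # independent variable x
price = [200000, 240000, 300000, 360000, 400000]  # dependent variable y

b0, b1 = fit_line(area, price)
print("intercept:", b0, "slope:", b1)
print("predicted price for 1600 sq ft:", b0 + b1 * 1600)
```

Because the sample data follows price = 200 × area exactly, the fit recovers a slope of 200 and an intercept of 0; real data would scatter around the line instead.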
Assumptions of Linear Regression - Linearity
➢ There should be linearity between the independent and dependent variables (It can be
positive or negative).
➢ Linear Regression will not work well when there is no linearity in the variables.
Figure 6.6
For example, as the house area increases, the house price increases; or as the distance from the
city increases, the house price decreases, as you can see in the scatterplots above. So, this
linearity is critical. However, if there is no linearity, as in the third chart showing a child's
age against house price, there is no correlation, so that variable is not useful for prediction.
Assumptions of Linear Regression – No Outliers
➢ There should not be any outliers. If the outliers are present, it will impact the overall
predictions.
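One common heuristic for spotting outliers (not described in this module, but widely used) is to flag values whose z-score, the distance from the mean in standard deviations, exceeds a threshold. A minimal sketch with hypothetical prices:

```python
import statistics

# Hypothetical prices; the last value is an obvious outlier.
prices = [250000, 255000, 248000, 252000, 249000, 251000, 900000]

mean = statistics.mean(prices)
stdev = statistics.stdev(prices)

# Flag any point more than 2 standard deviations from the mean.
outliers = [p for p in prices if abs(p - mean) / stdev > 2]
print(outliers)
```

Removing or investigating such points before running regression helps keep the fitted line from being dragged toward them.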
Figure 6.7
Figure 6.8
In the formula above, the adjusted R-squared value is denoted R̄² (R-squared with a bar over
it):

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)
➢ n is the number of records on which you are training the model.
➢ k is the number of independent variables you use while creating the model.
As k increases, the denominator (n − k − 1) decreases, so the fraction increases; this
fraction is multiplied by (1 − R²), so the amount subtracted from 1 grows.
➢ k is also known as a penalizing factor: as k increases, the adjusted R² value
will decrease.
➢ R² always tends to increase; whether you add a relevant variable or an irrelevant one
to predict your y, R² will still go up (or at least not decrease).
➢ So that's the reason the adjusted R2 value is used.
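The penalty can be seen numerically with a small sketch of the adjusted R² formula above. The n, R², and k values below are illustrative, not from the module's dataset:

```python
# Adjusted R-squared penalizes extra independent variables (k).
def adjusted_r2(r2, n, k):
    """n = number of records, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n, r2 = 200, 0.90
print(adjusted_r2(r2, n, k=2))    # slightly below 0.90
print(adjusted_r2(r2, n, k=10))   # more variables, lower adjusted value
```

With the same raw R², the model using ten variables scores a lower adjusted R² than the model using two, which is exactly why the adjusted value is preferred when comparing models.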