
Descriptive analytics (statistics)

DESCRIPTIVE ANALYTICS

Descriptive analytics is the first stage of data analysis. It provides a summary of historical data that may indicate the need for additional data pre-processing to better prepare the data for predictive modelling. For example, a variable that is highly skewed may need to be normalized to produce a more accurate model.

DESCRIPTIVE ANALYTICS

Descriptive analytics summarizes a data set, which can be either a representation of the entire population or just a sample.

Organizations performing predictive analytics may use their own data; therefore, they have access to the entire population, whereas marketing analytics often works with sample data (survey responses, tweets, purchased data, etc.).

DESCRIPTIVE ANALYTICS

Descriptive statistics are broken down into measures of central tendency and measures of variability and shape.

Measures of central tendency include the mean, median, and mode.

Measures of variability include the standard deviation, variance, and range; measures of shape include skewness and kurtosis.
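As a quick illustration, the sketch below computes each of these measures in Python with pandas and SciPy; the wait-time values are hypothetical.

```python
import pandas as pd
from scipy import stats

# hypothetical sample of hospital wait times (minutes)
waits = pd.Series([12, 15, 14, 15, 90, 13, 15, 16, 14, 15])

print("mean:    ", waits.mean())
print("median:  ", waits.median())
print("mode:    ", waits.mode().iloc[0])
print("std dev: ", waits.std())            # sample standard deviation
print("variance:", waits.var())            # sample variance
print("range:   ", waits.max() - waits.min())
print("skewness:", stats.skew(waits))
print("kurtosis:", stats.kurtosis(waits, fisher=False))  # raw (non-excess) kurtosis
```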

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The central tendency is the extent to which the values of a numerical variable group around a typical or central value.

The mean is the most common measure of central tendency. It is the sum of all the values divided by the number of values. Although it is the preferred measure, extreme values or outliers affect it.

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The median is another measure of central tendency. It is the middle value in an ordered set of observations, and it is less sensitive to outliers or extreme values than the mean.

If the number of observations is odd, the median is simply the middle number. If the number of observations is even, the median is the average of the two values on either side of the center.
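A small illustration of this difference, using a hypothetical ordered sample with one extreme value:

```python
import statistics

values = [10, 12, 13, 14, 200]    # one extreme value (200)

print(statistics.mean(values))    # 49.8 -- pulled up by the outlier
print(statistics.median(values))  # 13   -- unaffected by the outlier
```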

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The third measure of central tendency is the mode. The mode is the value that occurs most often.

Unless extreme values or outliers exist, the mean is generally the preferred measure.

The median is used when the data is skewed, when there are a small number of observations, or when working with ordinal data.

The mode is rarely used. The only situation in which the mode would be preferred is when describing categorical or class variables.

VARIANCE AND DISTRIBUTION

Measures of variation provide information on the spread or variability of the data set.

Some of the measures of variation include the range, the sample variance, the sample standard deviation, and the coefficient of variation.

The range is the easiest measure to calculate. It is the difference between the highest and the lowest value.

VARIANCE AND DISTRIBUTION

The range does not provide any information about the distribution of the data. The range is also sensitive to outliers.

A common measure of variation is the sample variance. The sample variance is the sum of the squared deviations of each observation from the mean, divided by n − 1.
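In standard notation, with $\bar{x}$ the sample mean and $n$ the number of observations:

$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 $$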

VARIANCE AND DISTRIBUTION

The standard deviation is the square root of the variance and is in the same units of measurement as the original data.

The more the data is spread out, the greater the range, variance, and standard deviation. The more the data is concentrated, the smaller the range, variance, and standard deviation.

THE SHAPE OF THE DISTRIBUTION – Skewness
Skewness measures the extent to which a variable's distribution is not symmetrical; it reflects the relative size of the two tails. A distribution can be left-skewed, symmetric, or right-skewed. In a left-skewed distribution, the mean is less than the median, and the skewness value is negative.

For a symmetrical distribution such as the normal, the mean and the median are equal and the skewness value is zero. For a right-skewed distribution, the mean is greater than the median, the peak is on the left, and the skewness value is positive.

THE SHAPE OF THE DISTRIBUTION – Kurtosis
Kurtosis measures how peaked the curve of the distribution is. In
other words, how sharply the curve rises approaching the center of
the distribution. It measures the amount of probability in the tails.

A mesokurtic distribution is a normal bell-shaped distribution, and the


kurtosis value is equal to 3. A leptokurtic has a sharper peak than a
bell shape; it has heavier tails than a normal distribution. A leptokurtic
distribution has a kurtosis value is greater 3. A platykurtic distribution
is flatter and has smaller tails. The kurtosis value is less than 3.

THE SHAPE OF THE DISTRIBUTION – Kurtosis

An absolute kurtosis value >7.1 is considered a substantial departure from normality. Many statistical software packages provide a statistic titled "excess kurtosis," which is calculated by subtracting 3 from the kurtosis value. An excess kurtosis value of zero matches that of a normal distribution.
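For reference, SciPy's `scipy.stats.kurtosis` returns excess kurtosis by default (`fisher=True`); pass `fisher=False` to get the raw value centered at 3. A quick check on normally distributed random data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(size=100_000)

print(stats.kurtosis(sample))                # excess kurtosis, ~0 for normal data
print(stats.kurtosis(sample, fisher=False))  # raw kurtosis, ~3 for normal data
print(stats.skew(sample))                    # ~0 for symmetric data
```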

THE SHAPE OF THE DISTRIBUTION

When performing predictive modelling, the distribution of the variables should be evaluated.

If a variable is highly skewed, a small percentage of the data points may have a great deal of influence on the predictive model.

In addition, the number of variables available to predict the target variable can vary greatly. To counteract the impact of skewness or kurtosis, the variable can be transformed.

THE SHAPE OF THE DISTRIBUTION

There are three strategies to overcome these problems.

• Use a transformation function to stabilize the variance.

• Use a binning transformation, which divides the variable values into groups to appropriately weight each range.

• Use both a transformation function and a binning transformation. This combination can result in a better-fitting model.

THE SHAPE OF THE DISTRIBUTION

Transformations are more effective when there is a relatively wide range of values as opposed to a relatively small range of values (for example, total sales for companies can range significantly by the company's size).

The logarithmic function is the most widely used transformation method to deal with skewed data. The log transformation is used to transform skewed data to follow an approximately normal distribution.
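A minimal sketch of the idea in Python, using a hypothetical right-skewed sales variable; `np.log1p` (log of 1 + x) is a common choice because it also handles zero values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=10, sigma=1.5, size=10_000)  # hypothetical skewed sales data

log_sales = np.log1p(sales)   # log transformation: compresses the long right tail

print(stats.skew(sales))      # strongly positive (right-skewed)
print(stats.skew(log_sales))  # close to 0 (approximately symmetric)
```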

COVARIANCE AND CORRELATION

It is important to investigate the relationship of the input variables to one another and to the target variable. The input variables should be independent of one another.

If the input variables are too closely related (i.e., correlated) to one another, multicollinearity can occur, which affects the accuracy of the model. Both the covariance and the correlation describe how two variables are related.

The covariance indicates whether two variables are positively or inversely related; it measures the extent to which two random variables change in tandem.

COVARIANCE AND CORRELATION

The covariance is only concerned with the direction of the relationship.

If the covariance between x (input variable) and y (target variable) is greater than zero, then the two variables tend to move in the same direction; that is, if x increases, then y will also tend to increase.

Conversely, if the covariance is less than zero, x and y will tend to move in opposing directions. If the covariance of x and y is equal to zero, the two variables are uncorrelated (there is no linear relationship between them).

COVARIANCE AND CORRELATION

A major weakness of covariance is the inability to determine the relative strength of the relationship from the size of the covariance.

The covariance value is expressed in the product of the units of the two variables and is not a standardized measure, so it is difficult to gauge the degree to which the variables move together.

COVARIANCE AND CORRELATION

The correlation does measure the relative strength of the relationship. It standardizes the measure so that two variables can be compared.

The correlation ranges from −1 to +1. A correlation value of −1 indicates a perfect inverse relationship, +1 indicates a perfect positive relationship, and 0 indicates that no linear correlation exists.
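The sketch below illustrates the difference on hypothetical data: rescaling a variable changes its covariance with y but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)    # hypothetical linearly related target

print(np.cov(x, y)[0, 1])             # covariance: depends on the variables' scales
print(np.corrcoef(x, y)[0, 1])        # correlation: scale-free, between -1 and +1

x_scaled = 100 * x                    # rescale x (e.g., a change of units)
print(np.cov(x_scaled, y)[0, 1])      # covariance grows ~100x
print(np.corrcoef(x_scaled, y)[0, 1]) # correlation is unchanged
```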

COVARIANCE AND CORRELATION

The r-square, or coefficient of determination, is equal to the percentage of the variation in the target variable that is explained by the input variable. R-square ranges from 0 to 1 (0% to 100%).

For example, an r-square of 0.25 indicates that 25% of the variation in the target variable (y) is explained by the input variable (x).

VARIABLE CLUSTERING

Variable clustering identifies the correlations and covariances between the input variables and creates groups, or clusters, of similar variables. Variables within a cluster are highly correlated with one another, while variables in different clusters are relatively uncorrelated.

A few representative variables that are fairly independent of one another can then be selected, one from each cluster. The representative variables are used as input variables and the other input variables are rejected.
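The slides describe the idea at a tool-agnostic level; as a rough stand-in, the sketch below clusters variables hierarchically in Python using 1 − |correlation| as the distance. The data, column names, and cutoff threshold are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(2, 200))   # two hidden drivers

# hypothetical inputs: x0-x2 share one driver, x3-x4 share the other
X = pd.DataFrame({
    "x0": base_a + 0.1 * rng.normal(size=200),
    "x1": base_a + 0.1 * rng.normal(size=200),
    "x2": base_a + 0.1 * rng.normal(size=200),
    "x3": base_b + 0.1 * rng.normal(size=200),
    "x4": base_b + 0.1 * rng.normal(size=200),
})

dist = 1 - X.corr().abs()                    # highly correlated variables -> small distance
Z = linkage(squareform(dist.values, checks=False), method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")
print(dict(zip(X.columns, clusters)))        # two clusters: {x0, x1, x2} and {x3, x4}
```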

PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is another variable reduction strategy.

It is used when there are several redundant variables, or variables that are correlated with one another and may be measuring the same construct.

Principal component analysis mathematically manipulates the input variables and develops a smaller number of artificial variables (called principal components). These components are then used in the subsequent nodes.
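As an illustration outside any particular tool, a minimal PCA in Python with scikit-learn, on hypothetical inputs (PCA is scale-sensitive, so standardizing first is usual practice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # hypothetical input variables

X_std = StandardScaler().fit_transform(X)  # standardize before PCA
pca = PCA(n_components=3)                  # keep 3 artificial variables
components = pca.fit_transform(X_std)      # the principal components

print(components.shape)                    # (500, 3)
print(pca.explained_variance_ratio_)       # share of variance each component explains
```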

HYPOTHESIS TESTING

A hypothesis is a supposition or observation regarding the results of sample data. It is the second step in the scientific method. In hypothesis testing, information is known about the sample. The purpose of hypothesis testing is to see if the hypothesis can be extended to describe the population.

For example, the average of a sample of hospital wait times is 15 min. The hypothesis may state that the average wait time for the hospital population is 15 min. The null hypothesis states the supposition to be tested. The alternative hypothesis is the opposite of the null hypothesis.
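Continuing the wait-time example, a one-sample t-test checks whether the population mean could plausibly be 15 minutes; the sample values below are hypothetical.

```python
from scipy import stats

waits = [14, 16, 15, 17, 13, 15, 16, 14, 18, 15]  # hypothetical sample (minutes)

# H0: population mean wait time = 15 min; H1: it differs from 15 min
t_stat, p_value = stats.ttest_1samp(waits, popmean=15)
print(t_stat, p_value)

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```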

HYPOTHESIS TESTING

In hypothesis testing, there are two types of errors that can occur: a type I and a type II error.

A type I error occurs when you reject a true null hypothesis. This is like finding an innocent person guilty; it is a false alarm. The probability of a type I error is referred to as alpha, and it is called the level of significance of the test.

A type II error, whose probability is called beta, is the failure to reject a false null hypothesis. It is equivalent to letting a guilty person go.

ANALYSIS OF VARIANCE (ANOVA)

ANOVA is used to test the difference among group means. It can be broken into two categories: one-way ANOVA and two-way ANOVA.

In one-way ANOVA, the means of three or more groups are compared to see whether at least one of them differs from the others.

Two-way ANOVA examines the effect of two inputs on the target variable. In other words, it examines the interaction between two input variables on the target variable.
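A minimal one-way ANOVA in Python with SciPy, comparing three hypothetical groups:

```python
from scipy import stats

# hypothetical wait times (minutes) at three clinics
clinic_a = [14, 16, 15, 17, 13]
clinic_b = [18, 19, 17, 20, 18]
clinic_c = [15, 14, 16, 15, 17]

# H0: all three group means are equal; H1: at least one differs
f_stat, p_value = stats.f_oneway(clinic_a, clinic_b, clinic_c)
print(f_stat, p_value)
```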

CHI SQUARE

The chi-square test is used to determine if two categorical (class) variables are independent. The null hypothesis is that the two categorical variables are independent. The alternative is that they are not independent (i.e., they are dependent).

The chi-square test statistic approximately follows a chi-square distribution with (r − 1) × (c − 1) degrees of freedom, where r is the number of levels for one categorical variable and c is the number of levels for the other categorical variable.

CHI SQUARE

If the chi-square statistic is greater than the critical chi-square value (there is a table defining this), then the null hypothesis is rejected.

Table: https://www.statology.org/how-to-read-chi-square-distribution-table/
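In Python, `scipy.stats.chi2_contingency` runs this test directly from a contingency table of observed counts; the table below is hypothetical.

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table: rows = gender, columns = product preference
observed = np.array([[30, 45, 25],
                     [35, 30, 35]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value)
print(dof)   # (r - 1) * (c - 1) = 1 * 2 = 2 degrees of freedom
```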

FIT STATISTICS

To measure the accuracy of a data model, fit statistics such as the misclassification rate, the Receiver Operating Characteristic (ROC) curve, and the average squared error are often evaluated.

A fit statistic is a measure used to compare multiple models. A misclassification occurs when the predicted target value is not equal to the actual target value. A low misclassification rate is more desirable and indicates a better-fitting model.

FIT STATISTICS

The ROC curve is a plot of the true positive rate against the false positive rate at various classification thresholds. The true positive rate, or sensitivity, is the proportion of actual positives that were correctly identified.
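A short sketch with scikit-learn, using hypothetical actual labels and predicted probabilities:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # hypothetical actual target values
y_prob = [0.2, 0.4, 0.8, 0.6, 0.3, 0.9, 0.55, 0.7]   # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # classify at a 0.5 threshold

misclassification_rate = 1 - accuracy_score(y_true, y_pred)
print(misclassification_rate)            # lower is better

print(roc_auc_score(y_true, y_prob))     # area under the ROC curve; 1.0 is perfect
```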

STOCHASTIC MODELS

A stochastic model is a mathematical method that accounts for random variation in one or more inputs. The random variation is usually based on past observations.

A stochastic model is best described by comparing it to a deterministic one. A deterministic model generally has only one set of output values, based upon the parameter values and the initial conditions.

STOCHASTIC MODELS

A stochastic model will use the same set of parameter values and initial conditions but adds some random variation to the model, resulting in a set of different outputs.

The random variation is generated from observations of historical data over a period of time. The set of outputs is usually generated through the use of many simulations with random variations in the inputs.
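A hedged sketch of the deterministic vs. stochastic contrast in Python, using a hypothetical revenue projection; the 5% mean growth rate and 2% standard deviation are made-up parameters standing in for estimates from historical data.

```python
import numpy as np

initial_revenue = 1_000_000
years = 5

# deterministic model: one fixed growth rate -> exactly one output
deterministic = initial_revenue * (1 + 0.05) ** years
print(deterministic)

# stochastic model: the growth rate varies randomly each year
rng = np.random.default_rng(7)
n_simulations = 10_000
growth = rng.normal(loc=0.05, scale=0.02, size=(n_simulations, years))
stochastic = initial_revenue * np.prod(1 + growth, axis=1)  # one output per simulation

print(stochastic.mean(), np.percentile(stochastic, [5, 95]))  # distribution of outcomes
```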

Thank You for your attention.

Q&A
