
Descriptive analytics (statistics)

DESCRIPTIVE ANALYTICS

Descriptive analytics is the first stage of data analysis. It provides a summary of historical data that may indicate the need for additional data pre-processing to better prepare the data for predictive modelling. For example, a variable that is highly skewed may need to be normalized to produce a more accurate model.

DESCRIPTIVE ANALYTICS

Descriptive analytics summarizes a data set, which can be either a representation of the entire population or just a sample.

Organizations performing predictive analytics may use their own data; therefore, they have access to the entire population, whereas marketing analytics often works with sample data (survey responses, tweets, purchased data, etc.).

DESCRIPTIVE ANALYTICS

Descriptive statistics are broken down into measures of central tendency and measures of variability and shape.

Measures of central tendency include the mean, median, and mode.

Measures of variability include the standard deviation, variance, and range; measures of shape include skewness and kurtosis.
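As a quick illustration, the sketch below computes each of these measures in Python with pandas and SciPy; the wait-time values are hypothetical.

```python
import pandas as pd
from scipy import stats

# hypothetical sample of hospital wait times (minutes)
waits = pd.Series([12, 15, 14, 15, 90, 13, 15, 16, 14, 15])

print("mean:    ", waits.mean())
print("median:  ", waits.median())
print("mode:    ", waits.mode().iloc[0])
print("std dev: ", waits.std())            # sample standard deviation
print("variance:", waits.var())            # sample variance
print("range:   ", waits.max() - waits.min())
print("skewness:", stats.skew(waits))
print("kurtosis:", stats.kurtosis(waits, fisher=False))  # raw (non-excess) kurtosis
```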

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The central tendency is the extent to which the values of a numerical variable group around a typical or central value.

The mean is the most common measure of central tendency. It is the sum of all the values divided by the number of values. Although it is the preferred measure, extreme values or outliers affect it.

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The median is another measure of central tendency. It is the middle value in an ordered set of observations, and it is less sensitive to outliers or extreme values than the mean.

If the number of observations is odd, the median is simply the middle number. If the number of observations is even, the median is the average of the two values on either side of the center.
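A small illustration of this difference, using a hypothetical ordered sample with one extreme value:

```python
import statistics

values = [10, 12, 13, 14, 200]    # one extreme value (200)

print(statistics.mean(values))    # 49.8 -- pulled up by the outlier
print(statistics.median(values))  # 13   -- unaffected by the outlier
```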

THE ROLE OF THE MEAN, MEDIAN, AND MODE

The third measure of central tendency is the mode. The mode is the value that occurs most often.

Unless extreme values or outliers exist, the mean is generally the preferred measure.

The median is used when the data is skewed, when there are a small number of observations, or when working with ordinal data.

The mode is rarely used. The only situation in which the mode would be preferred is when describing categorical or class variables.

VARIANCE AND DISTRIBUTION

Measures of variation provide information on the spread or variability of the data set.

Some of the measures of variation include the range, the sample variance, the sample standard deviation, and the coefficient of variation.

The range is the easiest measure to calculate. It is the difference between the highest and the lowest value.

VARIANCE AND DISTRIBUTION

The range does not provide any information about the distribution of the data. The range is also sensitive to outliers.

A common measure of variation is the sample variance. The sample variance is the sum of the squared deviations of each observation from the mean, divided by n − 1.
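In standard notation, with $\bar{x}$ the sample mean and $n$ the number of observations:

$$ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 $$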

VARIANCE AND DISTRIBUTION

The standard deviation is the square root of the variance and is in the same units of measurement as the original data.

The more the data is spread out, the greater the range, variance, and standard deviation. The more the data is concentrated, the smaller the range, variance, and standard deviation.

THE SHAPE OF THE DISTRIBUTION – Skewness
Skewness measures the extent to which a variable's distribution is not symmetrical; it reflects the relative size of the two tails. A distribution can be left-skewed, symmetric, or right-skewed. In a left-skewed distribution, the mean is less than the median, and the skewness value is negative.

For a symmetrical distribution such as the normal, the mean and the median are equal and the skewness value is zero. For a right-skewed distribution, the mean is greater than the median, the peak is on the left, and the skewness value is positive.

THE SHAPE OF THE DISTRIBUTION – Kurtosis
Kurtosis measures how peaked the curve of the distribution is. In
other words, how sharply the curve rises approaching the center of
the distribution. It measures the amount of probability in the tails.

A mesokurtic distribution is a normal bell-shaped distribution, and the


kurtosis value is equal to 3. A leptokurtic has a sharper peak than a
bell shape; it has heavier tails than a normal distribution. A leptokurtic
distribution has a kurtosis value is greater 3. A platykurtic distribution
is flatter and has smaller tails. The kurtosis value is less than 3.

THE SHAPE OF THE DISTRIBUTION – Kurtosis

An absolute kurtosis value >7.1 is considered a substantial departure from normality. Many statistical software packages provide a statistic titled "excess kurtosis," which is calculated by subtracting 3 from the kurtosis value. An excess kurtosis value of zero matches that of a normal distribution.
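For reference, SciPy's `scipy.stats.kurtosis` returns excess kurtosis by default (`fisher=True`); pass `fisher=False` to get the raw value centered at 3. A quick check on normally distributed random data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(size=100_000)

print(stats.kurtosis(sample))                # excess kurtosis, ~0 for normal data
print(stats.kurtosis(sample, fisher=False))  # raw kurtosis, ~3 for normal data
print(stats.skew(sample))                    # ~0 for symmetric data
```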

THE SHAPE OF THE DISTRIBUTION

When performing predictive modelling, the distribution of the variables should be evaluated.

If a variable is highly skewed, a small percentage of the data points may have a great deal of influence on the predictive model.

In addition, the number of variables available to predict the target variable can vary greatly. To counteract the impact of skewness or kurtosis, the variable can be transformed.

THE SHAPE OF THE DISTRIBUTION

There are three strategies to overcome these problems.

• Use a transformation function to stabilize the variance.

• Use a binning transformation, which divides the variable values into groups to appropriately weight each range.

• Use both a transformation function and a binning transformation. This combination can result in a better-fitting model.

THE SHAPE OF THE DISTRIBUTION

Transformations are more effective when there is a relatively wide range of values as opposed to a relatively small range of values (for example, total sales for companies can range significantly by the company's size).

The logarithmic function is the most widely used transformation method to deal with skewed data. The log transformation is used to transform skewed data to follow an approximately normal distribution.
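A minimal sketch of the idea in Python, using a hypothetical right-skewed sales variable; `np.log1p` (log of 1 + x) is a common choice because it also handles zero values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=10, sigma=1.5, size=10_000)  # hypothetical skewed sales data

log_sales = np.log1p(sales)   # log transformation: compresses the long right tail

print(stats.skew(sales))      # strongly positive (right-skewed)
print(stats.skew(log_sales))  # close to 0 (approximately symmetric)
```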

COVARIANCE AND CORRELATION

It is important to investigate the relationship of the input variables to one another and to the target variable. The input variables should be independent of one another.

If the input variables are too closely related (i.e., correlated) to one another, multicollinearity can occur, which affects the accuracy of the model. Both the covariance and the correlation describe how two variables are related.

The covariance indicates whether two variables are positively or inversely related; it measures the extent to which two random variables change in tandem.

COVARIANCE AND CORRELATION

The covariance is only concerned with the direction of the relationship.

If the covariance between x (input variable) and y (target variable) is greater than zero, then the two variables tend to move in the same direction; that is, if x increases, then y will also tend to increase.

Conversely, if the covariance is less than zero, x and y will tend to move in opposing directions. If the covariance of x and y is equal to zero, the two variables are uncorrelated (there is no linear relationship between them).

COVARIANCE AND CORRELATION

A major weakness of covariance is the inability to determine the relative strength of the relationship from the size of the covariance.

The covariance value is expressed in the product of the units of the two variables and is not a standardized measure, so it is difficult to gauge the degree to which the variables move together.

COVARIANCE AND CORRELATION

The correlation does measure the relative strength of the relationship. It standardizes the measure so that two variables can be compared.

The correlation ranges from −1 to +1. A correlation value of −1 indicates a perfect inverse relationship, +1 indicates a perfect positive relationship, and 0 indicates that no linear correlation exists.
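The sketch below illustrates the difference on hypothetical data: rescaling a variable changes its covariance with y but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)    # hypothetical linearly related target

print(np.cov(x, y)[0, 1])             # covariance: depends on the variables' scales
print(np.corrcoef(x, y)[0, 1])        # correlation: scale-free, between -1 and +1

x_scaled = 100 * x                    # rescale x (e.g., a change of units)
print(np.cov(x_scaled, y)[0, 1])      # covariance grows ~100x
print(np.corrcoef(x_scaled, y)[0, 1]) # correlation is unchanged
```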

COVARIANCE AND CORRELATION

The r-square, or coefficient of determination, is equal to the percentage of the variation in the target variable that is explained by the input variable. R-square ranges from 0 to 1 (0% to 100%).

For example, an r-square of 0.25 indicates that 25% of the variation in the target variable (y) is explained by the input variable (x).

VARIABLE CLUSTERING

Variable clustering identifies the correlations and covariances between the input variables and creates groups, or clusters, of similar variables. Variables within a cluster are highly correlated with one another, while variables in different clusters are relatively uncorrelated.

A few representative variables that are fairly independent of one another can then be selected, one from each cluster. The representative variables are used as input variables and the other input variables are rejected.
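The slides describe the idea at a tool-agnostic level; as a rough stand-in, the sketch below clusters variables hierarchically in Python using 1 − |correlation| as the distance. The data, column names, and cutoff threshold are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=(2, 200))   # two hidden drivers

# hypothetical inputs: x0-x2 share one driver, x3-x4 share the other
X = pd.DataFrame({
    "x0": base_a + 0.1 * rng.normal(size=200),
    "x1": base_a + 0.1 * rng.normal(size=200),
    "x2": base_a + 0.1 * rng.normal(size=200),
    "x3": base_b + 0.1 * rng.normal(size=200),
    "x4": base_b + 0.1 * rng.normal(size=200),
})

dist = 1 - X.corr().abs()                    # highly correlated variables -> small distance
Z = linkage(squareform(dist.values, checks=False), method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")
print(dict(zip(X.columns, clusters)))        # two clusters: {x0, x1, x2} and {x3, x4}
```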

PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is another variable reduction strategy.

It is used when there are several redundant variables, or variables that are correlated with one another and may be measuring the same construct.

Principal component analysis mathematically manipulates the input variables and develops a smaller number of artificial variables (called principal components). These components are then used in the subsequent nodes.
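As an illustration outside any particular tool, a minimal PCA in Python with scikit-learn, on hypothetical inputs (PCA is scale-sensitive, so standardizing first is usual practice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # hypothetical input variables

X_std = StandardScaler().fit_transform(X)  # standardize before PCA
pca = PCA(n_components=3)                  # keep 3 artificial variables
components = pca.fit_transform(X_std)      # the principal components

print(components.shape)                    # (500, 3)
print(pca.explained_variance_ratio_)       # share of variance each component explains
```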

HYPOTHESIS TESTING

A hypothesis is a supposition or observation regarding the results of sample data. It is the second step in the scientific method. In hypothesis testing, information is known about the sample. The purpose of hypothesis testing is to see if the hypothesis can be extended to describe the population.

For example, the average of a sample of hospital wait times is 15 min. The hypothesis may state that the average wait time for the hospital population is 15 min. The null hypothesis states the supposition to be tested. The alternative hypothesis is the opposite of the null hypothesis.
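Continuing the wait-time example, a one-sample t-test checks whether the population mean could plausibly be 15 minutes; the sample values below are hypothetical.

```python
from scipy import stats

waits = [14, 16, 15, 17, 13, 15, 16, 14, 18, 15]  # hypothetical sample (minutes)

# H0: population mean wait time = 15 min; H1: it differs from 15 min
t_stat, p_value = stats.ttest_1samp(waits, popmean=15)
print(t_stat, p_value)

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```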

HYPOTHESIS TESTING

In hypothesis testing, there are two types of errors that can occur: a type I and a type II error.

A type I error occurs when you reject a true null hypothesis. This is like finding an innocent person guilty; it is a false alarm. The probability of a type I error is referred to as alpha, and it is called the level of significance of the test.

A type II error, whose probability is called beta, is the failure to reject a false null hypothesis. It is equivalent to letting a guilty person go.

ANALYSIS OF VARIANCE (ANOVA)

ANOVA is used to test the difference among group means. It can be broken into two categories: one-way ANOVA and two-way ANOVA.

In one-way ANOVA, the means of three or more groups are compared to see whether at least one of them differs from the others.

Two-way ANOVA examines the effect of two inputs on the target variable. In other words, it examines the interaction between two input variables on the target variable.
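A minimal one-way ANOVA in Python with SciPy, comparing three hypothetical groups:

```python
from scipy import stats

# hypothetical wait times (minutes) at three clinics
clinic_a = [14, 16, 15, 17, 13]
clinic_b = [18, 19, 17, 20, 18]
clinic_c = [15, 14, 16, 15, 17]

# H0: all three group means are equal; H1: at least one differs
f_stat, p_value = stats.f_oneway(clinic_a, clinic_b, clinic_c)
print(f_stat, p_value)
```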

CHI SQUARE

The chi-square test is used to determine if two categorical (class) variables are independent. The null hypothesis is that the two categorical variables are independent. The alternative is that they are not independent (i.e., they are dependent).

The chi-square test statistic approximately follows a chi-square distribution with (r − 1) × (c − 1) degrees of freedom, where r is the number of levels for one categorical variable and c is the number of levels for the other categorical variable.

CHI SQUARE

If the chi-square statistic is greater than the critical chi-square value (there is a table defining this), then the null hypothesis is rejected.

Table: https://www.statology.org/how-to-read-chi-square-distribution-table/
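In Python, `scipy.stats.chi2_contingency` runs this test directly from a contingency table of observed counts; the table below is hypothetical.

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table: rows = gender, columns = product preference
observed = np.array([[30, 45, 25],
                     [35, 30, 35]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value)
print(dof)   # (r - 1) * (c - 1) = 1 * 2 = 2 degrees of freedom
```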

FIT STATISTICS

To measure the accuracy of a data model, fit statistics such as the misclassification rate, the Receiver Operating Characteristic (ROC) curve, and the average squared error are often evaluated.

A fit statistic is a measure used to compare multiple models. A misclassification occurs when the predicted target value is not equal to the actual target value. A low misclassification rate is more desirable and indicates a better-fitting model.

FIT STATISTICS

The ROC curve is a plot of the true positive rate against the false positive rate at various classification thresholds. The true positive rate, or sensitivity, is the proportion of actual positives that were correctly identified.
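A short sketch with scikit-learn, using hypothetical actual labels and predicted probabilities:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # hypothetical actual target values
y_prob = [0.2, 0.4, 0.8, 0.6, 0.3, 0.9, 0.55, 0.7]   # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # classify at a 0.5 threshold

misclassification_rate = 1 - accuracy_score(y_true, y_pred)
print(misclassification_rate)            # lower is better

print(roc_auc_score(y_true, y_prob))     # area under the ROC curve; 1.0 is perfect
```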

STOCHASTIC MODELS

A stochastic model is a mathematical method that accounts for random variation in one or more inputs. The random variation is usually based on past observations.

A stochastic model is best described by comparing it to a deterministic one. A deterministic model generally has only one set of output values, based upon the parameter values and the initial conditions.

STOCHASTIC MODELS

A stochastic model will use the same set of parameter values and initial conditions but adds some random variation to the model, resulting in a set of different outputs.

The random variation is generated from observations of historical data over a period of time. The set of outputs is usually generated through the use of many simulations with random variations in the inputs.
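A hedged sketch of the deterministic vs. stochastic contrast in Python, using a hypothetical revenue projection; the 5% mean growth rate and 2% standard deviation are made-up parameters standing in for estimates from historical data.

```python
import numpy as np

initial_revenue = 1_000_000
years = 5

# deterministic model: one fixed growth rate -> exactly one output
deterministic = initial_revenue * (1 + 0.05) ** years
print(deterministic)

# stochastic model: the growth rate varies randomly each year
rng = np.random.default_rng(7)
n_simulations = 10_000
growth = rng.normal(loc=0.05, scale=0.02, size=(n_simulations, years))
stochastic = initial_revenue * np.prod(1 + growth, axis=1)  # one output per simulation

print(stochastic.mean(), np.percentile(stochastic, [5, 95]))  # distribution of outcomes
```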

Thank You for your attention.

Q&A
