Article Review 1 Eng


Statistics and Data Analytics

Statistics for Data Science


Table of Contents
Introduction
What is Descriptive Statistics?
Types of data
Levels of Measurement
Measure of Central Tendency
Which is the best measure?
Measures of variability
Measure of Asymmetry
What is a Probability Distribution?
Normal Distribution
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Inferential Analysis
Use Case
References

Introduction

Statistics is a big part of a Data Scientist's daily work. Each time you start an
analysis, your first step, before applying fancy algorithms and making predictions,
is to do some exploratory data analysis (EDA) and try to read and understand the
data by applying statistical techniques. This first pass over the data lets you
understand what type of distribution the data presents.

At the end of this brief introduction, we will use real data (English Premier League
player measurements and a loan-installment use case) to make sense of these concepts.

What is Descriptive Statistics?

Descriptive statistics is the analysis of data that helps to describe, show or
summarize information in a meaningful way, such that whoever is looking at it can
detect relevant patterns. Examples of descriptive statistics are the mean,
median, mode, standard deviation, range, etc.

When looking at data, the first step of your statistical analysis will be to determine
whether the dataset you're dealing with is a population or a sample.

A population is the collection of all items of interest in your study and it is generally
denoted with the capital letter N. The calculated values when analyzing a
population are known as parameters. On the other hand, a sample is a subset of a
population and it’s usually denoted by the letter n. The values calculated when using
a sample are known as statistics.

Populations are hard to define and analyze in real life. It is easy to miss values when
studying a population, which will bias the analysis, and analyzing the whole
population is very expensive and time-consuming. Therefore, you normally hear
about samples. In contrast to a population, a sample is not expected to account for
all the data and is easier to analyze, since its smaller size makes the analysis less
time-consuming, less costly and less prone to error. A sample must be both random
and representative of the population. With a sample, you can make inferences about
the population.

Examples of sampling techniques are:
● Simple Random Sampling
○ Simple random sampling is like putting all the elements of a population into
a hat and picking names without any particular order or reason. Every
individual in the group has an equal chance of being chosen. Imagine you
have a class of students, and you want to select a few for a survey. If you use
simple random sampling, you give each student a number, put those
numbers in a hat, and pick out the ones you need. It's completely random,
like drawing names out of a hat.
● Stratified Random Sampling
○ Stratified random sampling is a bit like organizing your class into groups
based on something important, like their grades. Instead of picking
randomly from everyone, you first divide the students into these groups,
called strata, and then randomly select from each group. So, if you're doing a
survey in your class, you might first divide the students by grades, like A, B,
and C, and then randomly choose a few from each grade. This helps make
sure you have a good mix from each important subgroup.
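As a rough sketch of both techniques, assuming a hypothetical pandas DataFrame of students with a grade column (the data below is made up purely for illustration):

```python
import pandas as pd

# Hypothetical class roster (made-up data): 30 students with a grade column
students = pd.DataFrame({
    "name": [f"student_{i}" for i in range(1, 31)],
    "grade": ["A"] * 10 + ["B"] * 12 + ["C"] * 8,
})

# Simple random sampling: every student has the same chance of being picked
simple_sample = students.sample(n=6, random_state=42)

# Stratified random sampling: pick 2 students at random from each grade (stratum)
stratified_sample = students.groupby("grade").sample(n=2, random_state=42)

print(simple_sample)
print(stratified_sample)
```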

Types of data

In a dataset, data can be either Categorical or Numerical. Categorical data describes
groups or categories such as car brands, gender, age groups, names, etc. On the
other hand, numerical data, just as the name reveals, represents numbers. Within
this category, you can have Discrete and Continuous numbers.

● Discrete - data which can only take certain values; there is a fixed set of
possible values. For example, age in whole years, the number of cars in a
street, the number of fingers.
● Continuous - data which can take any real or fractional value within a
certain range, without restrictions (e.g. weight, the balance in a bank
account, the value spent on a purchase, the grade on an exam, foot size).
Within categorical data, there are also Nominal and Ordinal types:
● Nominal Data:
○ Nominal data are categories without any inherent order or
ranking. They represent different groups or labels, but there is
no implied order among them.
○ Example: Colors of cars (red, blue, green). Each color is distinct,
but there is no inherent order or ranking among them.
● Ordinal Data:
○ Ordinal data, on the other hand, have categories with a specific
order or rank. The intervals between the categories are not
necessarily uniform, but there is a clear sequence.
○ Example: Educational attainment levels (high school diploma,
bachelor's degree, master's degree). Here, there is an order
from less to more education, but the difference between having
a high school diploma and a bachelor's degree may not be the
same as the difference between a bachelor's and a master's
degree.
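A small sketch of how this distinction can be expressed with pandas categorical types (the example values are made up for illustration):

```python
import pandas as pd

# Nominal: car colors have no inherent order
colors = pd.Series(["red", "blue", "green", "blue"], dtype="category")

# Ordinal: education levels have a meaningful order
education = pd.Series(
    ["bachelor", "high school", "master", "bachelor"],
    dtype=pd.CategoricalDtype(
        categories=["high school", "bachelor", "master"], ordered=True
    ),
)

print(colors.cat.ordered)     # False -> nominal, no ranking implied
print(education.cat.ordered)  # True  -> ordinal, ranking is meaningful
print(education.min())        # "high school": min/max only make sense for ordered data
```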

Figure 1. Data Types

Levels of Measurement

Data can have two levels of measurement: Qualitative and Quantitative.

1. Qualitative data is information that characterizes attributes in the data but
does not measure them. It can be divided into two types: Nominal or
Ordinal.
● Nominal: values that are not numbers and cannot be put in any order.
Example: gender
● Ordinal: consists of groups and categories that follow a strict order.
Example: grades (e.g. Bad, Satisfactory, Good)
2. Quantitative data measures attributes in the data. It can be divided
into two groups: Interval and Ratio.

● Interval: represented by numbers, without a true zero. In this case,
the zero point is arbitrary rather than meaningful.
● Ratio: represented by numbers and has a true zero.
Whether quantitative data is regarded as interval or ratio depends on the context
in which we use it. For example, think about temperature. Saying it is 0 °C
or 0 °F does not mean there is no temperature, since neither is a true zero. The absolute zero
temperature in Celsius is −273.15 °C, whereas in Fahrenheit it is −459.67 °F. Therefore,
in this case, temperature has to be considered Interval data, since the zero
point is arbitrary. However, if you analyze temperature in kelvins, absolute
zero is 0 K, so you can now say the temperature value is Ratio data
since it has a true zero.

Measure of Central Tendency

The measure of central tendency refers to the idea that there is one number that best
summarizes the entire dataset. The most popular measures are the mean, median and mode.
1. Mean
This is considered the most reliable measure of central tendency for making
inferences about a population from a single sample.
The symbol μ is used for the population mean, whereas x̅ is used for the
sample mean.

Figure 2. Mean Formula
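In symbols, the population mean μ (a parameter) and the sample mean x̅ (a statistic) are defined as:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```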


We can find the mean by summing all the values and then dividing the
sum by the number of values. As already said, it's the most common
measure of central tendency, but it has the downside of being easily affected
by outliers. Sometimes, due to outliers, the mean might not be enough to
draw conclusions.

2. Median
The median is the midpoint, or the "middle" value, of your dataset once it is
sorted in ascending order. It is also known as the 50th percentile. Because the
median is far less affected by outliers than the mean, it is usually a good idea to
calculate it as well.

Figure 3. How to calculate median

3. Mode
The mode shows us the value that occurs most often. It can be used for
numerical as well as categorical variables. If no value appears more than
once, we say there is no mode.
Suppose you have the following exam scores for a class of students:
● 85, 92, 75, 88, 92, 92, 75, 82, 88, 92
To find the mode:
● Count the frequency of each value:

○ 75 appears twice
○ 82 appears once
○ 85 appears once
○ 88 appears twice
○ 92 appears four times
So the mode is 92.
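A quick way to check this in Python, using the exam scores above:

```python
from collections import Counter
from statistics import multimode

scores = [85, 92, 75, 88, 92, 92, 75, 82, 88, 92]

# Frequency of each value, most common first
print(Counter(scores))    # Counter({92: 4, 75: 2, 88: 2, 85: 1, 82: 1})

# Most frequent value(s); a list, since a dataset can have more than one mode
print(multimode(scores))  # [92]
```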

Which is the best measure?

The measures should be used together rather than independently. There is no single
best measure, and relying on only one is not advisable. Moreover, in a normal
distribution, these measures all fall at the same midpoint: the mean, mode and
median are all equal!

Measures of variability
The measure of variability refers to the idea of measuring the dispersion of our data
around the mean value. The best-known measures of variability are the range,
interquartile range (IQR), variance and standard deviation.
1. Range
The range is the most obvious measure of dispersion and describes the
difference between the largest and the smallest points in your data.

For example, if the smallest value in the data is 12 and the largest is 99, the
range is 99 − 12 = 87.
2. Interquartile range (IQR)

The IQR is a measure of variability between the upper (75th percentile) and lower
(25th percentile) quartiles. The data is sorted into ascending order and divided into
four quarters.

Figure 4. How to calculate the Interquartile range

Where:
- Q1 : Quartile 1 of the data
- Q2 : Quartile 2 of the data
- Q3 : Quartile 3 of the data

While the range measures the full spread of values in the dataset, the
interquartile range measures the interval in which the middle half of the values
lie.
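A minimal sketch of both measures in Python, reusing the 12 and 99 endpoints from the range example above and filling in made-up values in between:

```python
import numpy as np

data = np.array([12, 18, 25, 33, 41, 47, 55, 68, 74, 99])  # illustrative values

data_range = data.max() - data.min()       # 99 - 12 = 87
q1, q3 = np.percentile(data, [25, 75])     # lower and upper quartiles
iqr = q3 - q1                              # spread of the middle half of the data

print(data_range, q1, q3, iqr)
```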

3. Variance
The variance, as well as the standard deviation, is a more refined way of
measuring how much the data disperses around the mean value of the dataset.

Figure 5. How to calculate variance
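In symbols, the variance described below (the mean of the squared deviations) is:

```latex
\mathrm{Var} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

(For a sample, the sum is often divided by n − 1 instead of n; this is known as Bessel's correction.)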

Where:
- n : the number of data points
- Xi : the individual data points
- Xavg : the average of the data points

The variance is found by computing the difference between every data point and
the mean, squaring that value and summing for all available data points. In the end,
the variance is calculated by dividing the sum by the total number of available
points.
Squaring the differences has two main purposes:
1. Dispersion becomes non-negative: by squaring the subtraction we
ensure there are no negative values, so positive and negative
deviations cannot cancel each other out.
2. It amplifies the effect of large differences.
The problem with variance is that, because of the squaring, it is not in
the same unit of measurement as the original data. This is why the
standard deviation is used more often: it is expressed in the original unit.
Squared dollars mean nothing in practice.
3. Standard Deviation
Usually standard deviation is much more meaningful than variance. It
is the preferred measure of variability as it is directly interpretable.
Standard deviation is basically the square root of our variance.

Figure 6. How to calculate standard deviation

Where:
- n : the number of data points
- Xi : the individual data points
- Xavg : the average of the data points

Standard deviation is best used when the data has a unimodal shape. In a normal
distribution, approximately 34% of data points fall between the mean and one
standard deviation above it. Since a normal distribution is symmetrical, about 68.2%
of data points fall within one standard deviation of the mean. Around 95% of points
fall within two standard deviations of the mean, and 99.7% fall within three
standard deviations.

Figure 7. Three sigma of standard deviation

With the Z-score, you can check how many standard deviations below (or above)
the mean a specific data point is.
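A short sketch of both ideas on simulated, roughly normal data (the numbers below are randomly generated, not taken from any dataset in this article):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=100_000)  # simulated normal data

mean = sample.mean()
std = sample.std()

# Z-score: how many standard deviations a point lies from the mean
z_scores = (sample - mean) / std

# Empirical rule check: fraction of points within 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    frac = np.mean(np.abs(z_scores) < k)
    print(f"within {k} sd: {frac:.3f}")   # roughly 0.683, 0.954, 0.997
```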

Measure of Asymmetry
1. Modality
The modality of a distribution is determined by the number of peaks the data
presents. Most distributions are unimodal, which means they have only one
frequently occurring value clustered at a single peak, while a bimodal distribution
has two frequently occurring values.

Figure 8. Modality type

2. Skewness
It is the most common tool to measure asymmetry. Skewness indicates on which
side of the distribution the long tail, and hence the outliers, lies: if the data is
left skewed, the outliers are to the left. Moreover, when the mean is higher
than the median we have a right skew; if it's lower, we have a left skew.

Figure 9. Skewness type

Measures of asymmetry are the link between Central Tendency Measures and
Probability theory which will ultimately allow us to obtain a more accurate
knowledge of the data we are working with.
Impact of Skewness on Mean, Median, and Mode:
● Mean:
○ In a right-skewed distribution (positive skewness), the mean
will be larger than the median due to higher values in the right
tail.
○ In a left-skewed distribution (negative skewness), the mean will
be smaller than the median due to lower values in the left tail.
● Median:
○ The median is less sensitive to extreme values and is not
significantly influenced by skewness. It represents the middle
value when the data is sorted.

● Mode:
○ If the distribution is skewed, the mode (most frequently
occurring value) may not align with the mean or median. The
mode tends to move toward the longer tail of the distribution.
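A minimal sketch of these effects using pandas on made-up, right-skewed values (something income-like, with a long right tail):

```python
import pandas as pd

# Hypothetical right-skewed data: a few large values stretch the right tail
values = pd.Series([20, 22, 23, 25, 26, 28, 30, 35, 90, 120])

print(values.mean())    # pulled toward the long right tail
print(values.median())  # middle value, much less affected by the large values
print(values.skew())    # positive -> right-skewed
```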
Now, Let’s take a look at the EPL 2014–2015 Player Heights and Weights dataset,
which shows information about English Premier League player’s height, weight and
age as well as the name, number, position and team.

Table 1. EPL 2014–2015 Player Heights and Weights

Starting simple, what do you think is the relationship between height and weight
for football players? You probably assumed that the taller the player is, the heavier
he will be. So, we expect to see a positive relationship between height and weight.

Figure 10. Correlation Plot

And indeed, taller players tend to be heavier, with some
exceptions.
3. Covariance
Covariance is a measure that indicates how two variables are related. A
positive covariance means the variables are positively related, while a
negative covariance means the variables are inversely related. The formula
for calculating covariance of sample data is shown below.

Figure 11. Covariance Formula

x = the independent variable
y = the dependent variable
n = number of data points in the sample
x̅ = the mean of the independent variable x
ȳ = the mean of the dependent variable y

Nevertheless, with covariance we have an issue… with the units. If you calculated
the covariance by hand you might have noticed it, but with pandas it is easy to
miss. Take a look again at the formula. We've chosen two variables: Height
measured in centimeters (cm) and Weight measured in kilograms (kg). Notice the
numerator, where for each data point you subtract the mean of the respective
variable and then multiply the two differences. In the end, our value for the
covariance will be 34.43 cm·kg. This is not very informative! First of all, it seems
that our covariance depends on the magnitude of our variables. If we had used
imperial units for height and weight, the covariance would return a different
value and could deceive us about the strength of the relationship.
So, covariance shows us to what extent these variables change together,
which is useful, but it depends on the magnitude of the variables themselves,
which generally is not what we want. A better question than "How
do our variables relate?" is "How strong is the relationship between our variables?".
For that, correlation is the best answer.
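A small sketch of the difference, using made-up height and weight values standing in for the EPL data and pandas' cov and corr methods:

```python
import pandas as pd

# Hypothetical height (cm) and weight (kg) values, invented for illustration
players = pd.DataFrame({
    "height_cm": [170, 175, 180, 185, 190, 195],
    "weight_kg": [65, 70, 74, 80, 85, 92],
})

# Covariance: in cm·kg, so its size depends on the units chosen
print(players["height_cm"].cov(players["weight_kg"]))

# Correlation: unitless, always between -1 and 1, so it measures strength directly
print(players["height_cm"].corr(players["weight_kg"]))
```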

What is a Probability Distribution?

“A probability distribution is a mathematical function that, stated in simplest terms,
can be thought of as providing the probability of occurrence of different possible
outcomes in an experiment”. — Wikipedia

Another way to think about it is to see a distribution as a function that shows the
possible values for a variable and how often they occur. It is a common mistake to
believe that the distribution is the graph when in fact it’s the “rule” that determines
how values are positioned in relation to each other.

Here you have a map of the relationships between the different distributions out there,
with many of them following naturally from the Bernoulli distribution. Each distribution
is illustrated by an example of its probability density function (PDF), which we'll see later.

Figure 12. Probability distribution

We will first focus our attention on the most widely used distribution, the
Normal Distribution, for the following reasons:
● It approximates a wide variety of random variables;
● Distributions of sample means with large enough sample sizes can be
approximated by a normal distribution (the Central Limit Theorem);
● All computable statistics are elegant;
● It is heavily used in regression analysis;
● Decisions based on normal distribution insights have a good track
record.

Normal Distribution
Also known as the Gaussian Distribution or the bell curve, it is a continuous probability
distribution, and it's the most common distribution you'll find. A distribution of a
dataset shows the frequency at which possible values occur. It has the
following notation:

Figure 13. Notation of Normal distribution
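Written out, the notation is the usual one:

```latex
X \sim \mathcal{N}(\mu,\ \sigma^{2})
```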

With N standing for normal, ~ read as "is distributed as", μ being the mean and σ²
the variance. The normal distribution is symmetrical and its median, mean and mode
are equal, thus it has no skewness.

Univariate Analysis

Univariate analysis is the most fundamental approach in statistical data analysis.
It is employed when the dataset involves only a single variable and does not address
cause-and-effect relationships.

Here is an example of Univariate analysis:


In a classroom survey, the researcher might focus on tallying the number of boys
and girls. In this scenario, the data would consist of a single variable, specifically the
quantity of each gender. Another example is understanding the distribution and
characteristics of monthly sales data for a retail chain. Through such a
univariate analysis, you can gain insights into the distribution, trends, and potential
outliers in monthly sales data. This information is valuable for making informed
decisions, identifying areas for improvement, and optimizing strategies for
individual stores within the retail chain.

The primary aim of Univariate analysis is to succinctly depict the data and identify
patterns within it. This is achieved by examining metrics such as mean, median,
mode, dispersion, variance, range, standard deviation, and so on.
Univariate analysis employs various descriptive methods, including:
● Frequency Distribution Tables
● Histograms
● Frequency Polygons
● Pie Charts
● Bar Charts
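As a sketch of the metrics listed above, a minimal univariate summary of hypothetical monthly sales data in pandas (values invented for illustration, including one outlier):

```python
import pandas as pd

# Hypothetical monthly sales for one store (made-up values, including one outlier)
sales = pd.Series(
    [120, 135, 150, 140, 160, 155, 170, 165, 180, 175, 190, 400],
    name="monthly_sales",
)

print(sales.describe())  # count, mean, std, min, quartiles, max
print(sales.skew())      # the 400 outlier produces a strong right skew

# A frequency-distribution-style view: bin the values as a simple histogram substitute
print(pd.cut(sales, bins=4).value_counts().sort_index())
```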

Figure 14. Example of Univariate Analysis

Bivariate Analysis

Bivariate analysis involves a slightly more analytical approach compared to
univariate analysis. It is employed when a dataset comprises two variables, and
researchers seek to draw comparisons between them.

Here is a straightforward example of bivariate analysis:


In a classroom survey, the researcher may examine the ratio of students who scored
above 85%, considering their genders. In this case, there are two variables: gender
(X, the independent variable) and result (Y, the dependent variable). Bivariate
analysis measures the correlations between these two variables.

Another example of bivariate analysis is investigating how variations in advertising
spending relate to changes in sales revenue. By conducting this bivariate analysis,
you can gain a deeper understanding of the relationship between advertising
spending and sales revenue. This information can inform marketing strategies,
budget allocations, and decision-making processes for optimizing the company's
overall performance.

Bivariate analysis is carried out through:


● Correlation coefficients
○ Correlation coefficient is a statistical measure that quantifies
the strength and direction of the relationship between two
variables. It ranges from -1 to 1:
○ A correlation coefficient of 1 indicates a perfect positive linear
relationship.

○ A correlation coefficient of -1 indicates a perfect negative linear
relationship.
○ A correlation coefficient of 0 suggests no linear relationship.
○ For example, if we have data on hours of study and exam
scores, a positive correlation coefficient would imply that as
hours of study increase, exam scores also tend to increase.
Conversely, a negative correlation coefficient would suggest
that as one variable increases, the other tends to decrease.
● Regression analysis
○ Regression analysis is a statistical method used to model the
relationship between a dependent variable and one or more
independent variables. It helps us understand how changes in
the independent variables are associated with changes in the
dependent variable.
○ In simple terms, if we take the same example of hours of study
and exam scores, regression analysis would allow us to create a
mathematical formula (a regression equation) that predicts
exam scores based on the number of hours studied. The
equation might tell us how much, on average, an additional
hour of study is associated with an increase or decrease in
exam scores.
○ Regression analysis is widely used for prediction, understanding
cause-and-effect relationships, and making informed decisions
based on the relationships between variables.
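A minimal sketch of both tools on made-up study-hours and exam-score data, using only NumPy:

```python
import numpy as np

# Hypothetical study hours and exam scores (illustrative data)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([52, 55, 61, 64, 70, 74, 78, 85])

# Correlation coefficient: strength and direction of the linear relationship
r = np.corrcoef(hours, scores)[0, 1]

# Simple linear regression: scores ~ slope * hours + intercept
slope, intercept = np.polyfit(hours, scores, deg=1)

print(f"r = {r:.2f}")                              # close to 1 -> strong positive relationship
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```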

Multivariate Analysis

Multivariate analysis represents a more intricate statistical technique, applied when
there are more than two variables in a given dataset.

Consider this example of multivariate analysis:


A doctor has gathered data on cholesterol levels, blood pressure, and weight, along
with information on the subjects' eating habits (such as weekly consumption of red
meat, fish, dairy products, and chocolate). The objective is to explore the
relationships among the three health measures and the subjects' dietary habits. In
such a scenario, employing multivariate analysis becomes essential to comprehend
the interrelationships among each variable.

Commonly utilized techniques in multivariate analysis include:


● Factor Analysis
○ Factor analysis is like finding hidden patterns or common
factors in a set of data. Imagine you have a lot of test scores for
students. Factor analysis helps you see if these scores are
influenced by common factors, like intelligence or study habits,
rather than just looking at each score separately.
● Cluster Analysis
○ Cluster analysis is like grouping similar things together. If you
have a bunch of fruits, it helps you see which ones are similar
and could belong to the same group. For example, apples and
oranges might be in one cluster because they are similar fruits.
● Variance Analysis

○ Variance analysis is about understanding differences. If you
planned to spend a certain amount of money but actually spent
more or less, variance analysis helps figure out why. It looks at
the differences (variances) between what was planned and
what actually happened.
● Discriminant Analysis
○ Discriminant analysis is like finding the features that make
things different. If you have two or more groups (like different
types of animals), discriminant analysis helps you identify which
characteristics discriminate or set them apart from each other.
● Principal Component Analysis
○ Principal component analysis is about simplifying complex data.
Imagine you have a lot of information about students, like
grades, study time, and test scores. Principal component
analysis helps you find the most important things that explain
most of the differences among students, making it easier to
understand the data.
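A small sketch of principal component analysis with scikit-learn on simulated student data (the variables and their relationships below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated student data: grades, weekly study hours and test scores,
# all driven by a common underlying factor (study hours)
rng = np.random.default_rng(1)
study_hours = rng.normal(10, 3, size=200)
grades = 0.5 * study_hours + rng.normal(0, 1, size=200)
test_scores = 5 * study_hours + rng.normal(0, 5, size=200)
X = np.column_stack([grades, study_hours, test_scores])

# Reduce the three correlated variables to two principal components
# (in practice, variables are often standardized before PCA)
pca = PCA(n_components=2)
components = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # share of total variance captured by each component
print(components.shape)               # (200, 2)
```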

Inferential Analysis

Inferential analysis involves drawing conclusions, making predictions, or
generalizing findings from a sample to a larger population. Unlike descriptive
statistics, which aim to summarize and present information about a dataset,
inferential statistics use sample data to make inferences or predictions about a
population.

The process of inferential analysis typically involves hypothesis testing, estimation,
and drawing conclusions based on probability theory. Researchers use inferential
statistics to make judgments about the characteristics of a population, determine
the significance of relationships between variables, or make predictions about
future observations.

Common techniques in inferential analysis include:


● Hypothesis Testing: Evaluating hypotheses about population
parameters based on sample data to determine if there is enough
evidence to support or reject a claim.
● Confidence Intervals: Estimating a range of values within which a
population parameter is likely to fall, along with a level of confidence.
● Analysis of Variance (ANOVA): Comparing means across multiple
groups to assess whether there are significant differences.
● Chi-square tests: Assessing the association between categorical
variables.

Inferential analysis is crucial in scientific research, social sciences, economics, and
various other fields where researchers seek to make broader predictions or
conclusions beyond the specific data they have collected.

Let's consider a hypothetical example of inferential analysis in the context of
educational research:
Scenario: Examining the Impact of a Teaching Method on Student Performance

Suppose a group of researchers is interested in understanding whether a new
teaching method enhances student performance in mathematics compared to the
traditional teaching method.
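A minimal sketch of how such a comparison could be tested, using simulated scores and a two-sample t-test from SciPy (the data and group means below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated exam scores: new teaching method vs. traditional method (made-up data)
new_method = rng.normal(loc=75, scale=8, size=40)
traditional = rng.normal(loc=70, scale=8, size=40)

# Two-sample t-test: is the difference between the group means statistically significant?
t_stat, p_value = stats.ttest_ind(new_method, traditional)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) would lead us to reject the null hypothesis
# that both methods produce the same mean score.
```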

Use Case

Background and Problem Statement:


You are a data scientist at ID/X Partners, and you are currently carrying out
exploratory data analysis (EDA) to learn more about the data that will be used. This
is done so that no wrong steps are taken later when building a Machine Learning
model.

The installment data you are processing is as follows:


55,000, 50,000, 90,000, 95,000, 70,000, 85,000, 80,000, 75,000, 60,000, 65,000

One of the steps in your EDA is descriptive statistics. Calculate the mean (average),
median and range!

Solution:
Formula:
- Mean: the sum of all values divided by the number of values

- Median: sort the data, then find the middle value (or the average of the two
middle values)

- Range: maximum value − minimum value

Result:
Mean (average):
(50,000 + 55,000 + 60,000 + 65,000 + 70,000 + 75,000 + 80,000 + 85,000 + 90,000 + 95,000) / 10 = 72,500
Median (middle value):
(70,000 + 75,000) / 2 = 72,500
Mode (most frequent value): there is no mode in this example; all installment amounts are unique.
Range:
95,000 − 50,000 = 45,000
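These results can be verified with a few lines of Python:

```python
import statistics

installments = [55_000, 50_000, 90_000, 95_000, 70_000,
                85_000, 80_000, 75_000, 60_000, 65_000]

mean = statistics.mean(installments)                  # 72,500
median = statistics.median(installments)              # (70,000 + 75,000) / 2 = 72,500
value_range = max(installments) - min(installments)   # 95,000 - 50,000 = 45,000

print(mean, median, value_range)
```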

References

https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-6c246ed2468d

https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-3087b80eb1c6

https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-7bf596237ac6

https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-a67a3199dcd4

https://medium.com/diogo-menezes-borges/introduction-to-statistics-for-data-science-16a188a400ca

https://www.kdnuggets.com/2020/02/probability-distributions-data-science.html

https://www.yourdatateacher.com/2021/04/16/the-most-used-probability-distributions-in-data-science/

https://towardsdatascience.com/statistical-significance-hypothesis-testing-the-normal-curve-and-p-values-93274fa32687

https://towardsdatascience.com/statistical-significance-in-action-84a4f47b51ba

https://hotcubator.com.au/research/what-is-univariate-bivariate-and-multivariate-analysis/

https://sciencing.com/similarities-of-univariate-multivariate-statistical-analysis-12549543.html

https://thecleverprogrammer.com/2021/01/13/univariate-and-multivariate-for-data-science/
