Unit 1 TE Honours
For example, the sum of the data set (2, 3, 4, 5, 6) is 20, so the mean is 4 (20/5). The mode of a data set is the value that appears most often, and the median is the value situated in the middle of the ordered data set: it separates the higher half of the values from the lower half. There are also less common types of descriptive statistics that are still very important.
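The small calculation above can be checked with Python's standard `statistics` module (a quick illustrative sketch, not part of the original text):

```python
import statistics

data = [2, 3, 4, 5, 6]

print(sum(data))               # 20
print(statistics.mean(data))   # 4
print(statistics.median(data)) # 4

# Every value in `data` occurs exactly once, so it has no meaningful mode;
# a data set like [2, 3, 3, 5] has a clear mode of 3.
print(statistics.mode([2, 3, 3, 5]))  # 3
```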
Inferential statistics are among the most useful tools for making educated
predictions about how a set of data will scale when applied to a larger
population of subjects. These statistics help set a benchmark for hypothesis
testing, as well as a general idea of where specific parameters will land
when scaled to a larger data set, such as the larger set’s mean.
The Importance of Statistics in Machine Learning
Geeta Kakrani
Jul 8, 2023
Introduction:
Machine learning has revolutionized numerous industries, enabling
computers to analyze vast amounts of data and make accurate
predictions or decisions. While machine learning algorithms and
models are at the core of this technology, it is crucial to recognize the
pivotal role that statistics plays in driving the success and
effectiveness of machine learning. In this blog post, we will delve
into the significance of statistics in machine learning and explore its
various applications and benefits.
Introduction
A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data. As such, measures of
central tendency are sometimes called measures of central location. They are also
classed as summary statistics. The mean (often called the average) is most likely the
measure of central tendency that you are most familiar with, but there are others,
such as the median and the mode.
The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate
to use than others. In the following sections, we will look at the mean, mode and
median, and learn how to calculate them and under what conditions they are most
appropriate to be used.
Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use is
most often with continuous data (see our Types of Variable guide for data types).
The mean is equal to the sum of all the values in the data set divided by the number
of values in the data set. So, if we have n values in a data set and they have
values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced "x
bar"), is:
x̄ = (x1 + x2 + ... + xn) / n
This formula is usually written in a slightly different manner using the Greek capital
letter, Σ, pronounced "sigma", which means "sum of...":
x̄ = (Σx) / n
You may have noticed that the above formula refers to the sample mean. So, why
have we called it a sample mean? This is because, in statistics, samples and
populations have very different meanings and these differences are very important,
even if, in the case of the mean, they are calculated in the same way. To
acknowledge that we are calculating the population mean and not the sample mean,
we use the Greek lower case letter "mu", denoted as μ:
μ = (Σx) / N
where N is the number of values in the population.
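The sample mean formula can be sketched directly in Python (an illustrative snippet; the data values are made up for the example):

```python
def mean(values):
    """x-bar = (sum of the values) / (number of values)."""
    return sum(values) / len(values)

sample = [4, 8, 15, 16, 23, 42]
print(mean(sample))  # 18.0
```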
The mean is essentially a model of your data set. It is the value that is most
common. You will notice, however, that the mean is not often one of the actual
values that you have observed in your data set. However, one of its important
properties is that it minimises error in the prediction of any one value in your data
set. That is, it is the value that produces the lowest amount of error from all other
values in the data set.
An important property of the mean is that it includes every value in your data set as
part of the calculation. In addition, the mean is the only measure of central tendency
where the sum of the deviations of each value from the mean is always zero.
The mean has one main disadvantage: it is particularly susceptible to the influence
of outliers. These are values that are unusual compared to the rest of the data set by
being especially small or large in numerical value. For example, consider the wages
of staff at a factory below:
Staff:   1    2    3    4    5    6    7    8    9    10
Salary:  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data
suggests that this mean value might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries in the $12k to 18k range.
The mean is being skewed by the two large salaries. Therefore, in this situation, we
would like to have a better measure of central tendency. As we will find out later,
taking the median would be a better measure of central tendency in this situation.
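Using the salary figures from the table (in thousands of dollars), a quick Python sketch shows the effect of the two outliers on the mean versus the median:

```python
import statistics

# Salaries of the ten staff, in thousands of dollars (from the table above).
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# The two large salaries (90k, 95k) drag the mean well above the typical wage,
# while the median stays inside the 12k-18k range where most workers sit.
print(statistics.mean(salaries))    # 30.7
print(statistics.median(salaries))  # 15.5
```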
Another time when we usually prefer the median over the mean (or mode) is when
our data is skewed (i.e., the frequency distribution for our data is skewed). If we
consider the normal distribution - as this is the most frequently assessed in statistics
- when the data is perfectly normal, the mean, median and mode are identical.
Moreover, they all represent the most typical value in the data set. However, as the
data becomes skewed the mean loses its ability to provide the best central location
for the data because the skewed data is dragging it away from the typical value.
However, the median best retains this position and is not as strongly influenced by
the skewed values. This is explained in more detail in the skewed distribution section
later in this guide.
Median
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to
calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56. It is the
middle mark because there are 5 scores before it and 5 scores after it. This works
fine when you have an odd number of scores, but what happens when you have an
even number of scores? What if you had only 10 scores? Well, you simply have to
take the middle two scores and average the result. So, if we look at the example
below:
65 55 89 56 35 14 56 55 87 45
14 35 45 55 55 56 56 65 87 89
Now we have to take the 5th and 6th scores in our data set and average them to
get a median of 55.5.
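Both cases above (an odd and an even number of scores) can be verified with the standard library (an illustrative sketch):

```python
import statistics

# The two data sets from the worked examples above.
odd = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

# median() sorts internally; with an even count it averages the middle two.
print(statistics.median(odd))   # 56
print(statistics.median(even))  # 55.5
```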
Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it
is represented by the highest bar. You can, therefore, sometimes consider
the mode as being the most popular option. An example of a mode is presented
below:
Normally, the mode is used for categorical data where we wish to know which is the
most common category, as illustrated below:
We can see above that the most common form of transport, in this particular data
set, is the bus. However, one of the problems with the mode is that it is not unique,
so it leaves us with problems when we have two or more values that share the
highest frequency, such as below:
We are now stuck as to which mode best describes the central tendency of the data.
This is particularly problematic when we have continuous data because we are
unlikely to have any one value that is more frequent than another. For example,
consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it
that we will find two or more people with exactly the same weight (e.g., 67.4 kg)?
The answer is: probably very unlikely. Many people might be close, but with such a
small sample (30 people) and a large range of possible weights, you are unlikely to
find two people with exactly the same weight to the nearest 0.1 kg. This is why
the mode is very rarely used with continuous data.
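The mode and the non-uniqueness problem can be illustrated in Python. The data values below are hypothetical stand-ins for the missing charts (the transport counts are assumptions, not taken from the original figures):

```python
import statistics

# Hypothetical categorical data: "bus" is the most common category.
transport = ["bus", "car", "bus", "walk", "bus", "car"]
print(statistics.mode(transport))  # bus

# With two values sharing the highest frequency the mode is not unique;
# multimode() returns every value that shares the highest count.
tied = [3, 3, 7, 7, 9]
print(statistics.multimode(tied))  # [3, 7]
```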
Another problem with the mode is that it will not provide us with a very good measure
of central tendency when the most common mark is far away from the rest of the
data in the data set, as depicted in the diagram below:
In the above diagram the mode has a value of 2. We can clearly see, however, that
the mode is not representative of the data, which is mostly concentrated around the
20 to 30 value range. To use the mode to describe the central tendency of this data
set would be misleading.
We often test whether our data is normally distributed because this is a common
assumption underlying many statistical tests. An example of a normally distributed
set of data is presented below:
When you have a normally distributed sample you can legitimately use either the
mean or the median as your measure of central tendency. In fact, in any symmetrical
distribution the mean, median and mode are equal. However, in this situation, the
mean is widely preferred as the best measure of central tendency because it is the
measure that includes all the values in the data set for its calculation, and any
change in any of the scores will affect the value of the mean. This is not the case
with the median or mode.
However, when our data is skewed, for example, as with the right-skewed data set
below:
We find that the mean is being dragged in the direction of the skew. In these situations,
the median is generally considered to be the best representative of the central
location of the data. The more skewed the distribution, the greater the difference
between the median and mean, and the greater emphasis should be placed on using
the median as opposed to the mean. A classic example of the above right-skewed
distribution is income (salary), where higher-earners provide a false representation of
the typical income if expressed as a mean and not a median.
If you expected to be dealing with a normal distribution, but tests of normality show
that the data is non-normal, it is customary to use the median instead of the mean.
However, this is more a rule of thumb than a strict guideline. Sometimes, researchers wish to report the
mean of a skewed distribution if the median and mean are not appreciably different
(a subjective assessment), and if it allows easier comparisons to previous research
to be made.
Link https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-
median.php
Let’s learn about the measure of dispersion in statistics, its types, formulas, and
examples in detail.
Dispersion in Statistics
Dispersion in statistics is a way to describe how spread out or scattered the data is
around an average value. It helps to understand if the data points are close together or
far apart.
Dispersion shows the variability or consistency in a set of data. There are different
measures of dispersion like range, variance, and standard deviation.
Measure of Dispersion in Statistics
Measures of dispersion quantify the scattering of the data: they tell us how the
values are spread out within the data set. In other words, these measures capture
the variation between the different values of the data.
Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types:
Absolute Measure of Dispersion
Relative Measure of Dispersion
Each of these types can be further divided into the various specific measures
described below.
Let’s learn about them in detail.
Absolute Measure of Dispersion
The measures of dispersion that are measured and expressed in the units of data
themselves are called Absolute Measure of Dispersion. For example – Meters,
Dollars, Kg, etc.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in the
distribution.
Mean Deviation: It is the arithmetic mean of the absolute differences between the
values and their mean.
Standard Deviation: It is the square root of the arithmetic average of the square of
the deviations measured from the mean.
Variance: It is defined as the average of the square deviation from the mean of the
given data set.
Quartile Deviation: It is defined as half of the difference between the third quartile
and the first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is
called the Interquartile Range. Its formula is given as Q3 – Q1.
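The absolute measures listed above can be sketched with Python's standard `statistics` module (an illustrative example, not part of the original text; note that `quantiles()` uses the "exclusive" method by default, so the quartiles may differ slightly from hand-computed textbook values):

```python
import statistics

data = [10, 20, 15, 0, 100]

rng = max(data) - min(data)                            # range: R = L - S
mu = statistics.mean(data)                             # mean = 29
mean_dev = sum(abs(x - mu) for x in data) / len(data)  # mean deviation
var = statistics.pvariance(data)                       # population variance
std = statistics.pstdev(data)                          # population standard deviation
q1, _, q3 = statistics.quantiles(data, n=4)            # first and third quartiles
iqr = q3 - q1                                          # interquartile range
quartile_dev = iqr / 2                                 # quartile deviation

print(rng)       # 100
print(mean_dev)  # 28.4
```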
Relative Measure of Dispersion
We use relative measures of dispersion to measure the two quantities that have
different units to get a better idea about the scattering of the data.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the highest
and lowest value in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.
Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to the
value of the central point of the data set.
Coefficient of Quartile Deviation: It is defined as the ratio of the difference between
the third quartile and the first quartile to the sum of the third and first quartiles.
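Two of these relative measures can be sketched in Python (an illustrative snippet reusing a data set from the range examples later in this guide):

```python
import statistics

data = [20, 24, 31, 17, 45, 39, 51, 61]

# Coefficient of range: (L - S) / (L + S), a unit-free ratio.
highest, lowest = max(data), min(data)
coeff_range = (highest - lowest) / (highest + lowest)

# Coefficient of variation: standard deviation over mean, expressed in %.
mu = statistics.mean(data)  # mean = 36
cv = statistics.pstdev(data) / mu * 100

print(round(coeff_range, 3))  # 0.564
```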
Range of Data Set
The range is the difference between the largest and the smallest values in the
distribution.
Thus, it can be written as
R=L–S
where,
L is the largest value in the Distribution
S is the smallest value in the Distribution
A higher value of range implies higher variation in the data set.
One drawback of this measure is that it takes into account only the maximum and
the minimum values, which might not always properly indicate how the values of the
distribution are scattered.
Example: Find the range of the data set 10, 20, 15, 0, 100.
Solution:
Smallest Value in the data = 0
Largest Value in the data = 100
Thus, the range of the data set is,
R = 100 – 0
R = 100
Note: Range cannot be calculated for the open-ended frequency distributions. Open-
ended frequency distributions are those distributions in which either the lower limit of
the lowest class or the higher limit of the highest class is not defined.
Range for Ungrouped Data
To find the range for the ungrouped data set, first we have to find the smallest and the
largest value of the data set by observing. The difference between them gives the
range of ungrouped data.
We can understand this with the help of the following example:
Example: Find out the range for the following observations, 20, 24, 31, 17, 45, 39,
51, 61.
Solution:
Largest Value = 61
Smallest Value = 17
Thus, the range of the data set is
Range = 61 – 17 = 44
Range for Grouped Data
The range of the grouped data set is found by studying the following example,
Example: Find out the range for the following frequency distribution table for
the marks scored by class 10 students.
Marks Interval    Number of Students
0-10              5
10-20             8
20-30             15
30-40             9
Solution:
For Largest Value: Taking the higher limit of Highest Class = 40
For Smallest Value: Taking the lower limit of Lowest Class = 0
Range = 40 – 0 = 40
Thus, the range of the given data set is 40.
Mean Deviation
Mean deviation measures the deviation of the observations from the mean of the
distribution.
Since the average is the central value of the data, some deviations might be positive
and some might be negative. If they are added like that, their sum will not reveal much
as they tend to cancel each other’s effect.
For example :
Let us consider this set of data : -5, 10, 25
Mean = (-5 + 10 + 25)/3 = 10
Now a deviation from the mean for different values is,
(-5 -10) = -15
(10 – 10) = 0
(25 – 10) = 15
Now adding the deviations, shows that there is zero deviation from the mean which is
incorrect. Thus, to counter this problem only the absolute values of the difference are
taken while calculating the mean deviation.
Mean Deviation Formula:
M.D. = (Σ|x − μ|) / n
where μ is the mean of the data and n is the number of values.
Mean Deviation for Ungrouped Data
For calculating the mean deviation for ungrouped data, the following steps must be
followed:
Step 1: Calculate the arithmetic mean for all the values of the dataset.
Step 2: Calculate the difference between each value of the dataset and the mean. Only
absolute values of the differences will be considered. |d|
Step 3: Calculate the arithmetic mean of these deviations using the formula,
M.D. = (Σ|d|) / n
This can be explained using the example.
Example: Calculate the mean deviation for the given ungrouped data, 2, 4, 6, 8,
10
Solution:
Mean (μ) = (2+4+6+8+10)/5
μ = 6
Absolute deviations |d|: |2−6| = 4, |4−6| = 2, |6−6| = 0, |8−6| = 2, |10−6| = 4
M.D. = (4+2+0+2+4)/5
⇒ M.D. = 12/5 = 2.4
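The three steps above can be sketched as a small Python function (illustrative):

```python
def mean_deviation(values):
    """M.D. = (sum of |x - mu|) / n, the mean of absolute deviations from the mean."""
    mu = sum(values) / len(values)                     # Step 1: arithmetic mean
    deviations = [abs(x - mu) for x in values]         # Step 2: absolute deviations |d|
    return sum(deviations) / len(values)               # Step 3: mean of the deviations

print(mean_deviation([2, 4, 6, 8, 10]))  # 2.4
```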
Chi-Square (χ2) Statistic: What It Is, Examples, How and When to Use
the Test
By
ADAM HAYES
VIKKI VELASQUEZ
For these tests, degrees of freedom are used to determine if a certain null
hypothesis can be rejected based on the total number of variables and
samples within the experiment. As with any statistic, the larger the sample
size, the more reliable the results.
KEY TAKEAWAYS
A χ2 test for independence can tell us how likely it is that random chance
can explain any observed difference between the actual frequencies in the
data and the theoretical expectations.
Goodness-of-Fit
χ2 provides a way to test how well a sample of data matches the (known or
assumed) characteristics of the larger population that the sample is
intended to represent. This is known as goodness of fit.
If the sample data do not fit the expected properties of the population that
we are interested in, then we would not want to use this sample to draw
conclusions about the larger population.
An Example
For example, consider an imaginary coin with exactly a 50/50 chance of
landing heads or tails and a real coin that you toss 100 times. If this coin is
fair, then it will also have an equal probability of landing on either side, and
the expected result of tossing the coin 100 times is that heads will come up
50 times and tails will come up 50 times.
In this case, χ2 can tell us how well the actual results of 100 coin flips
compare to the theoretical model that a fair coin will give 50/50 results.
The actual tosses could come up 50/50, or 60/40, or even 90/10. The farther
away the actual results of the 100 tosses are from 50/50, the worse this set
of tosses fits the theoretical expectation of 50/50, and the more likely we
might conclude that this coin is not actually a fair coin.
A chi-square test is appropriate for this when the data being analyzed are
from a random sample, and when the variable in question is a categorical
variable.
In addition, the chi-square test cannot establish whether one variable has
a causal relationship with another. It can only establish whether two
variables are related.
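The coin example can be sketched numerically. This is an illustration assuming an observed count of 60 heads and 40 tails; the 3.841 cutoff is the standard 5% critical value for one degree of freedom:

```python
# Observed heads/tails in 100 tosses vs. the 50/50 expectation of a fair coin.
observed = [60, 40]
expected = [50, 50]

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)  # 4.0

# With 1 degree of freedom the 5% critical value is about 3.841;
# 4.0 > 3.841 suggests the coin may not be fair.
```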