Skewed Data: A Problem To Your Statistical Model
towardsdatascience.com/skewed-data-a-problem-to-your-statistical-model-9a6b5bb74e37
Rajat Sharma
Jul 30, 2019
This article will help you understand what skewed data is and how it can affect the insights
you want to draw from your statistical model.
Data is called skewed when its distribution curve appears distorted, with a longer tail either
to the left or to the right. In a normal distribution the graph is symmetric, meaning that the
left side of the curve mirrors the right side around the center. For example, below is the
Height Distribution graph.
Blue is for Male and Pink is for Female
Here you can see that the blue curve is symmetric about 69 and the pink curve is symmetric
about 64. This means that most males have a height close to 69 and most females have a height
close to 64, while very few males have a height near 75 or 63 and very few females have a
height near 68 or 58.
Credit: theschoolrun.com
In a normal distribution, the mean, median and mode are approximately equal. All three are
measures of the center of the data. The skewness of the data can be determined by how these
quantities relate to one another.
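As a rough illustration of how these quantities relate, the sketch below compares the mean and median of a symmetric sample and a right-skewed sample, and also computes a moment-based skewness coefficient. The samples are simulated (hypothetical data, not the data behind the figures), and the helper name `sample_skewness` is our own.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: the average of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical samples: one symmetric (Gaussian), one right-skewed (log-normal).
symmetric = [rng.gauss(69, 3) for _ in range(10_000)]
right_skewed = [rng.lognormvariate(3, 0.8) for _ in range(10_000)]

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    mean = statistics.fmean(data)
    median = statistics.median(data)
    skew = sample_skewness(data)
    print(f"{name}: mean={mean:.2f} median={median:.2f} skewness={skew:.2f}")
```

For the symmetric sample the mean and median should nearly coincide and the skewness should be close to zero. For the right-skewed sample the mean is pulled above the median and the skewness is clearly positive.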
Here you can see the positions of all three measures on the plot. So, you will find that:
The first and second relations always hold for a right-skewed distribution, but the third may
not always hold. Below is a real-life example.
You can clearly see that this is right-skewed data, with its tail on the positive side of the
distribution. The distribution shows that most people have incomes near 20K dollars/year, and
that the number of people with higher incomes falls off steeply as income increases.
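To see where the three measures of center fall for an income-like distribution, here is a minimal sketch on simulated incomes. The log-normal shape and its parameters are assumptions chosen only so that the peak falls near 20K; this is not the data behind the plot. Because the values are continuous, the mode is estimated as the center of the fullest histogram bin, and the bin width of 2,000 is also an arbitrary choice.

```python
import math
import random
import statistics
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical incomes in dollars/year, drawn from a log-normal distribution.
incomes = [rng.lognormvariate(math.log(30_000), 0.6) for _ in range(50_000)]

# Estimate the mode as the center of the fullest histogram bin.
bin_width = 2_000
bin_counts = Counter(int(v // bin_width) for v in incomes)
fullest_bin = max(bin_counts, key=bin_counts.get)
mode_estimate = (fullest_bin + 0.5) * bin_width

median_income = statistics.median(incomes)
mean_income = statistics.fmean(incomes)

print(f"mode   ~ {mode_estimate:,.0f}")
print(f"median ~ {median_income:,.0f}")
print(f"mean   ~ {mean_income:,.0f}")
```

In this simulated sample the estimated mode sits below the median, and the median sits below the mean, because the long right tail pulls the mean upward the most.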
Here you can see the positions of the three measures on the plot. So, you will find that:
The first and second relations always hold for a left-skewed distribution, but the third may
not always hold. Below is a real-life example.
Here the distribution shows that most people die at an age close to 90, with the tail
stretching toward younger ages.
Effects of skewness
Real-life distributions are usually skewed. If there is too much skewness in the data, many
statistical models do not work well. But why?
In skewed data, the tail region may act as an outlier for the statistical model, and we know
that outliers adversely affect a model's performance, particularly for models that are not
robust to them. Some statistical models, such as tree-based models, are robust to outliers, but
relying on them limits the possibility of trying other models. So there is a need to transform
the skewed data so that it is close enough to a Gaussian (normal) distribution. This will allow
us to try a larger number of statistical models.
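One way to make the "tail acts as an outlier" idea concrete is to count how many points a common outlier rule flags in a Gaussian sample versus a right-skewed one. The sketch below uses Tukey's 1.5 * IQR fences, which is our choice of rule for illustration and not something the article prescribes. Both samples are simulated, and `iqr_outlier_fraction` is a hypothetical helper name.

```python
import random
import statistics


def iqr_outlier_fraction(values):
    """Fraction of points outside the 1.5 * IQR fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = sum(1 for v in values if v < low or v > high)
    return flagged / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical samples: standard Gaussian versus right-skewed log-normal.
gaussian = [rng.gauss(0, 1) for _ in range(20_000)]
skewed = [rng.lognormvariate(0, 1) for _ in range(20_000)]

gaussian_fraction = iqr_outlier_fraction(gaussian)
skewed_fraction = iqr_outlier_fraction(skewed)

print(f"Gaussian sample flagged: {gaussian_fraction:.2%}")
print(f"Skewed sample flagged:   {skewed_fraction:.2%}")
```

The skewed sample should have a noticeably larger share of flagged points, all of them in the long right tail. Those are the values that an outlier-sensitive model would end up treating as extreme.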
Credit: imgur.com
Log transformation
A log transformation can help bring a very skewed distribution closer to a Gaussian one. After
a log transformation, patterns in the data are easier to see.
In the above image, you can clearly see the patterns after applying the log transformation.
Before the transformation there were too many outliers, which may affect our model's
performance.
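A minimal sketch of the transformation itself is shown below, again on simulated right-skewed values (hypothetical data, not the data in the image). It measures skewness before and after taking the natural log, using a moment-based skewness helper of our own. Note that `math.log` requires strictly positive inputs.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: the average of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical right-skewed, strictly positive values (log-normal).
incomes = [rng.lognormvariate(math.log(30_000), 0.7) for _ in range(10_000)]

# Apply the log transformation to every value.
log_incomes = [math.log(v) for v in incomes]

skew_before = sample_skewness(incomes)
skew_after = sample_skewness(log_incomes)

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

Here the skewness drops to roughly zero because the log of log-normal data is exactly Gaussian. Real data will rarely be this tidy, so it is worth checking the skewness again after transforming. If the data contains zeros, `math.log1p` (the log of 1 + x) is a common substitute.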
Conclusion
Skewed data may harm our results. To use skewed data, we can apply a log transformation over
the whole set of values to uncover patterns in the data and make it usable for the statistical
model.