Abacus – Break the Modelling Taboo
A company wants to launch a new variant of its existing line of fruit juices. It wants to carry out a survey analysis and arrive at a meaningful conclusion about which variant to choose.
The sales director of a company knows that something is wrong with one of its successful products, but hasn't yet carried out any market research or data analysis. How does he reach a conclusion, and what does he conclude?
These situations are indicative enough to conclude that data analysis is the lifeline of any business. Whether one wants to arrive at marketing decisions or fine-tune a new product launch strategy, data analysis is the key. Rather than asking what is important about data analysis, one should ask what is not.
Merely analysing data isn't sufficient for making a decision; how one interprets the analysed data matters more. Thus, data analysis is not a decision-making system but a decision-supporting system.
Let us delve into the world of credit cards, with which most of us are fairly familiar. The following example will help us understand the importance of data analysis.
In credit cards, there are six buckets. If a customer misses his due date, he moves from bucket 0 to bucket 1. If he goes 30 days past the due date, he moves into bucket 2; similarly, he moves into bucket 3 at 60 days past due, and so on up to bucket 6. After 180 days past due, the account is written off.
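The bucket transitions described above can be sketched as a small function. The function name is my own, and the exact boundaries between buckets 3 and 6 are an assumption, interpolated from the 30-day spacing stated in the text:

```python
# Hypothetical sketch of the delinquency-bucket logic described above.
# The 30-day spacing between buckets follows the text; the function name
# and signature are illustrative, not an industry API.

def delinquency_bucket(days_past_due: int) -> str:
    """Map days past due to a collections bucket (0-6) or write-off."""
    if days_past_due <= 0:
        return "bucket 0"        # current: no payment missed
    if days_past_due >= 180:
        return "written-off"     # 180+ days past due
    # 1-29 dpd -> bucket 1, 30-59 -> bucket 2, ..., 150-179 -> bucket 6
    return f"bucket {days_past_due // 30 + 1}"

print(delinquency_bucket(15))   # bucket 1
print(delinquency_bucket(45))   # bucket 2
print(delinquency_bucket(200))  # written-off
```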
A customer moving from bucket 0 (the current bucket) to bucket 1 may be of two types: he might have missed the payment because of low willingness or limited ability to pay, or he might have missed it despite having both the willingness and the ability to pay. It may simply be that he forgot; for example, a person who travels a lot might be one such account.
In this case, typical data analysis would focus on separating the customers who have wilfully defaulted on the payment from those who have not.
While data analysis has emerged as a separate field altogether, one of the pivotal aspects of the industry is building predictive models. In this blog series, we will take you through the process of building a statistical model.
Data Preparation
Let's look at a snapshot of a real-life business problem captured in the form of a typical dataset.
Most of us would wonder how to start decoding these variables. There is a commonly used method through which such a dataset starts making sense: first, identify the predictor (input) and target (output) variables; next, identify the data type and category of each variable.
Let’s digress a bit from the above dataset and understand this step more clearly by taking an
example.
Example: Suppose we want to predict whether students will play cricket or not (refer to the table below). Here, you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
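The table itself is not reproduced here, so the column names below are assumed stand-ins. A minimal pandas sketch of this identification step might look like:

```python
# Illustrative only: these column names are assumed placeholders for the
# "play cricket" table mentioned in the text, not the actual dataset.
import pandas as pd

df = pd.DataFrame({
    "Student_ID":   [1, 2, 3, 4],            # identifier, not a predictor
    "Gender":       ["M", "F", "M", "F"],    # categorical predictor
    "Height_cm":    [150, 148, 160, 152],    # continuous predictor
    "Class":        ["IX", "X", "IX", "X"],  # categorical predictor
    "Play_Cricket": [1, 0, 1, 0],            # target (output) variable
})

target = "Play_Cricket"
predictors = [c for c in df.columns if c not in (target, "Student_ID")]

# Split predictors by data type: object columns are categorical,
# numeric columns are continuous (or discrete counts).
categorical = df[predictors].select_dtypes(include="object").columns.tolist()
numeric = df[predictors].select_dtypes(include="number").columns.tolist()

print(predictors)   # ['Gender', 'Height_cm', 'Class']
print(categorical)  # ['Gender', 'Class']
print(numeric)      # ['Height_cm']
```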
Once we have some grip over the above-mentioned concepts, let's come back to our original dataset and try to comprehend it.
Outlier Treatment:
What is an Outlier?
Most of us would have encountered this in our PS courses, but with the Summers and the relative chill around Confluence, let's do a recap. "Outlier" is a commonly used term among analysts and data scientists, as outliers need close attention, else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find that the average annual income of our customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million, much higher than the rest of the population. These two observations will be seen as outliers.
Outliers can arise for several reasons.
Data Entry Errors: Human errors during data collection, recording, or entry can cause outliers. For example, the annual income of a customer is $100,000, but the data entry operator accidentally adds an extra zero, making it $1,000,000, ten times higher. Evidently, this will be an outlier when compared with the rest of the population.
Measurement Error: This is the most common source of outliers. It occurs when the measurement instrument turns out to be faulty. For example, suppose there are 10 weighing machines, of which 9 are correct and 1 is faulty. Weights measured on the faulty machine will be higher or lower than those of the rest of the group, and can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example, in a 100 m sprint with 7 runners, one runner missed the 'Go' call and started late, which made his run time longer than the other runners'. His total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume, and only a fraction report the actual value. Here, the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple
sources. It is possible that some manipulation or extraction errors may lead to outliers in the
dataset.
Sampling Error: For instance, suppose we have to measure the height of athletes, and by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Natural Outlier: When an outlier is not artificial (i.e., not due to error), it is a natural outlier. For instance, in a renowned insurance company, the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, this was not due to any error. Hence, whenever we perform any data mining activity with advisors, we need to treat this segment separately.
Why do outliers deserve this attention? Consider their impact:
- They increase the error variance and reduce the power of statistical tests.
- If non-randomly distributed, they can decrease normality.
- They can bias or influence estimates that may be of substantive interest.
- They can violate the basic assumptions of regression, ANOVA, and other statistical models.
The above bullets might shake off some of the dust gathered around the PS concepts. To understand the impact more deeply, let's take an example and check what happens to a dataset with and without outliers.
Example:
As you can see, the dataset with the outlier has a significantly different mean and standard deviation. In the first scenario, the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.
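The underlying numbers are not shown here, so the sample values below are assumed, chosen so that the means match the 5.45 and 30 quoted above:

```python
# Illustrative dataset (assumed values): the means reproduce the 5.45
# and 30 quoted in the text once a single extreme value is added.
from statistics import mean, pstdev

data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10]   # no outlier
with_outlier = data + [300]                  # one extreme value added

print(round(mean(data), 2))            # 5.45
print(round(mean(with_outlier), 2))    # 30.0
print(round(pstdev(data), 2))          # 2.74
print(round(pstdev(with_outlier), 2))  # 81.45
```

A single value shifts the mean by a factor of more than five and inflates the standard deviation roughly thirtyfold, which is exactly why outliers distort downstream estimates.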
Let's summarise what we have discussed so far about outliers.
Once we are aware of the techniques for outlier treatment, it would be a good idea to detail each of the techniques. That is exactly what has been done below.
Percentile Approach:
A value is identified as an outlier if it exceeds the 99th percentile of the variable by some factor, or if it falls below the 1st percentile by some factor. The factor is determined after considering the variable's distribution and the business case. The outlier is then capped at a certain value above the P99 value, or floored at a factor below the P1 value. The factor for capping/flooring is again obtained by studying the distribution of the variable and accounting for any special business considerations.
Data arranged in ascending or descending order can be divided into 100 equal parts by 99 values. These values are called percentiles and are denoted P1, P2, ..., P99.
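A minimal sketch of this capping and flooring step, assuming positive-valued data and an illustrative factor of 1.1 (in practice, the factor would come from the distribution and the business case):

```python
# Sketch of P1/P99 capping-flooring. The factor 1.1 and the assumption
# that the data are positive are illustrative choices, not fixed rules.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=10, size=1000)
x[:3] = [400, 350, -200]            # inject artificial outliers

p1, p99 = np.percentile(x, [1, 99])
factor = 1.1                         # illustrative; set per distribution / business case
upper_cap = p99 * factor             # cap: a value above P99
lower_floor = p1 / factor            # floor: a value below P1 (assumes p1 > 0)

treated = np.clip(x, lower_floor, upper_cap)
print(treated.max() < x.max())  # True: the extreme highs were capped
```

`np.clip` preserves the ordering of the untouched values, so only the extremes are pulled back into range.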
To detect outliers with this approach, the curve between P95 and P99 is extrapolated beyond P99 by a factor "x", and any values lying outside this extended curve are identified as outliers. Similarly, the curve between the 5th and 1st percentiles is extended towards the minimum value by some factor, and values lying below this extended curve are outliers. Any value beyond the extended curve is treated by an appropriate function that maintains the monotonicity of the values but brings them into an acceptable range.
Sigma Approach:
With the sigma approach, a value is identified as an outlier if it lies more than "x" times sigma away from the mean, where x is an integer and sigma is the standard deviation of the variable. The outlier is then capped or floored at a distance of "y" times sigma from the mean, where y is equal to or greater than x and is determined by the practitioner.
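A minimal sketch of the sigma rule, with x = y = 3 as illustrative practitioner choices:

```python
# Sketch of the mean +/- x*sigma rule. x = y = 3 are common illustrative
# values; the text leaves both to the practitioner.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=5, size=1000)
x[0] = 500                                   # one extreme value

mu, sigma = x.mean(), x.std()
x_detect, y_cap = 3, 3                       # y >= x, per the rule above
is_outlier = np.abs(x - mu) > x_detect * sigma
treated = np.clip(x, mu - y_cap * sigma, mu + y_cap * sigma)

print(bool(is_outlier[0]))  # True: the injected value is flagged
```

Note that the injected outlier itself inflates sigma before detection; in practice, one may prefer robust spread estimates (e.g. based on the median), but the plain rule above is what the text describes.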
We hope you enjoyed the above concepts. We will continue with the next steps in the coming days.
Till then, happy reading!
Ciao!!