Abacus – Break the Modelling Taboo
A company wants to launch a new variant of its existing line of fruit juices. It wants to carry out a survey analysis and arrive at a meaningful conclusion about which variant to choose.
The sales director of a company knows that something is wrong with one of its successful products, but hasn't yet carried out any market research or data analysis. How does he reach a conclusion, and what does he conclude?
These situations are indicative enough to conclude that data analysis is the lifeline of any business. Whether one wants to arrive at marketing decisions or fine-tune a new product launch strategy, data analysis is the key. Rather than asking what is important about data analysis, one should ask what is not.
Merely analysing data isn't sufficient for making a decision; how one interprets the analysed data matters more. Thus, data analysis is not a decision-making system but a decision-supporting system.
Let us delve into the world of credit cards, with which most of us are fairly familiar. The following example will help us understand the importance of data analysis.
In credit cards, there are six buckets. If a customer misses his due date, he moves from bucket 0 to bucket 1. If he goes 30 days past the due date, he moves into bucket 2; similarly, he moves into bucket 3 at 60 days past due, and so on up to bucket 6. After 180 days past due, the account is written off.
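The bucket transitions described above can be sketched as a small function. The function name is my own, and the exact boundaries between buckets 3 and 6 are an assumption, interpolated from the 30-day spacing stated in the text:

```python
# Hypothetical sketch of the delinquency-bucket logic described above.
# The 30-day spacing between buckets follows the text; the function name
# and signature are illustrative, not an industry API.

def delinquency_bucket(days_past_due: int) -> str:
    """Map days past due to a collections bucket (0-6) or write-off."""
    if days_past_due <= 0:
        return "bucket 0"        # current: no payment missed
    if days_past_due >= 180:
        return "written-off"     # 180+ days past due
    # 1-29 dpd -> bucket 1, 30-59 -> bucket 2, ..., 150-179 -> bucket 6
    return f"bucket {days_past_due // 30 + 1}"

print(delinquency_bucket(15))   # bucket 1
print(delinquency_bucket(45))   # bucket 2
print(delinquency_bucket(200))  # written-off
```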
A customer moving from bucket 0 (the current bucket) to bucket 1 may be of two types: he might have missed the payment because of low willingness or limited ability to pay, or he might have missed it despite having both the willingness and the ability to pay. It may simply be that he forgot; for example, a person who travels a lot might be one such account.
In this case, typical data analysis would focus on separating the customers who have wilfully defaulted on the payment from those who have not.
While data analysis has emerged as a separate field altogether, one of the pivotal aspects of the industry is building predictive models. In this blog series, we will take you through the process of building a statistical model.
Data Preparation
Let's look at a snapshot of a real-life business problem captured in the form of a typical dataset.
Most of us would wonder how to start decoding these variables. There is a commonly used method through which such a dataset starts making sense: first, identify the predictor (input) and target (output) variables; next, identify the data type and category of each variable.
Let’s digress a bit from the above dataset and understand this step more clearly by taking an
example.
Example: Suppose we want to predict whether students will play cricket or not (refer to the table below). Here, you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
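The table itself is not reproduced here, so the column names below are assumed stand-ins. A minimal pandas sketch of this identification step might look like:

```python
# Illustrative only: these column names are assumed placeholders for the
# "play cricket" table mentioned in the text, not the actual dataset.
import pandas as pd

df = pd.DataFrame({
    "Student_ID":   [1, 2, 3, 4],            # identifier, not a predictor
    "Gender":       ["M", "F", "M", "F"],    # categorical predictor
    "Height_cm":    [150, 148, 160, 152],    # continuous predictor
    "Class":        ["IX", "X", "IX", "X"],  # categorical predictor
    "Play_Cricket": [1, 0, 1, 0],            # target (output) variable
})

target = "Play_Cricket"
predictors = [c for c in df.columns if c not in (target, "Student_ID")]

# Split predictors by data type: object columns are categorical,
# numeric columns are continuous (or discrete counts).
categorical = df[predictors].select_dtypes(include="object").columns.tolist()
numeric = df[predictors].select_dtypes(include="number").columns.tolist()

print(predictors)   # ['Gender', 'Height_cm', 'Class']
print(categorical)  # ['Gender', 'Class']
print(numeric)      # ['Height_cm']
```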
Once we have some grip over the above-mentioned concepts, let's come back to our original dataset and try to comprehend it.
Outlier Treatment:
What is an Outlier?
Most of us would have encountered this in our PS courses, but with the Summers and the relative chill around Confluence, let's do a recap. "Outlier" is a commonly used term among analysts and data scientists, as outliers need close attention, else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find that the average annual income of our customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million, much higher than the rest of the population. These two observations will be seen as outliers.
Outliers can arise for several reasons.
Data Entry Errors: Human errors during data collection, recording, or entry can cause outliers. For example, the annual income of a customer is $100,000, but the data entry operator accidentally adds an extra zero, making it $1,000,000, ten times higher. Evidently, this will be an outlier when compared with the rest of the population.
Measurement Error: This is the most common source of outliers. It occurs when the measurement instrument turns out to be faulty. For example, suppose there are 10 weighing machines, of which 9 are correct and 1 is faulty. Weights measured on the faulty machine will be higher or lower than those of the rest of the group, and can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example, in a 100 m sprint with 7 runners, one runner missed the 'Go' call and started late, which made his run time longer than the other runners'. His total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume, and only a fraction report the actual value. Here, the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple
sources. It is possible that some manipulation or extraction errors may lead to outliers in the
dataset.
Sampling Error: For instance, suppose we have to measure the height of athletes, and by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Natural Outlier: When an outlier is not artificial (i.e., not due to error), it is a natural outlier. For instance, in a renowned insurance company, the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, this was not due to any error. Hence, whenever we perform any data mining activity with advisors, we need to treat this segment separately.
Why do outliers deserve this attention? Consider their impact:
- They increase the error variance and reduce the power of statistical tests.
- If non-randomly distributed, they can decrease normality.
- They can bias or influence estimates that may be of substantive interest.
- They can violate the basic assumptions of regression, ANOVA, and other statistical models.
The above bullets might shake off some of the dust gathered around the PS concepts. To understand the impact more deeply, let's take an example and check what happens to a dataset with and without outliers.
Example:
As you can see, the dataset with the outlier has a significantly different mean and standard deviation. In the first scenario, the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.
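The underlying numbers are not shown here, so the sample values below are assumed, chosen so that the means match the 5.45 and 30 quoted above:

```python
# Illustrative dataset (assumed values): the means reproduce the 5.45
# and 30 quoted in the text once a single extreme value is added.
from statistics import mean, pstdev

data = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10]   # no outlier
with_outlier = data + [300]                  # one extreme value added

print(round(mean(data), 2))            # 5.45
print(round(mean(with_outlier), 2))    # 30.0
print(round(pstdev(data), 2))          # 2.74
print(round(pstdev(with_outlier), 2))  # 81.45
```

A single value shifts the mean by a factor of more than five and inflates the standard deviation roughly thirtyfold, which is exactly why outliers distort downstream estimates.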
Let's summarise what we have discussed so far about outliers.
Once we are aware of the techniques for outlier treatment, it would be a good idea to detail each of the techniques. That is exactly what has been done below.
Percentile Approach:
A value is identified as an outlier if it exceeds the 99th percentile of the variable by some factor, or if it falls below the 1st percentile by some factor. The factor is determined after considering the variable's distribution and the business case. The outlier is then capped at a certain value above the P99 value, or floored at a factor below the P1 value. The factor for capping/flooring is again obtained by studying the distribution of the variable and accounting for any special business considerations.
Data arranged in ascending or descending order can be divided into 100 equal parts by 99 values. These values are called percentiles and are denoted P1, P2, ..., P99.
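A minimal sketch of this capping and flooring step, assuming positive-valued data and an illustrative factor of 1.1 (in practice, the factor would come from the distribution and the business case):

```python
# Sketch of P1/P99 capping-flooring. The factor 1.1 and the assumption
# that the data are positive are illustrative choices, not fixed rules.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=10, size=1000)
x[:3] = [400, 350, -200]            # inject artificial outliers

p1, p99 = np.percentile(x, [1, 99])
factor = 1.1                         # illustrative; set per distribution / business case
upper_cap = p99 * factor             # cap: a value above P99
lower_floor = p1 / factor            # floor: a value below P1 (assumes p1 > 0)

treated = np.clip(x, lower_floor, upper_cap)
print(treated.max() < x.max())  # True: the extreme highs were capped
```

`np.clip` preserves the ordering of the untouched values, so only the extremes are pulled back into range.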
To detect outliers with this approach, the curve between P95 and P99 is extrapolated beyond P99 by a factor "x", and any values lying outside this extended curve are identified as outliers. Similarly, the curve between the 5th and 1st percentiles is extended towards the minimum value by some factor, and values lying below this extended curve are outliers. Any value beyond the extended curve is treated by an appropriate function that maintains the monotonicity of the values but brings them into an acceptable range.
Sigma Approach:
With the sigma approach, a value is identified as an outlier if it lies more than "x" times sigma away from the mean, where x is an integer and sigma is the standard deviation of the variable. The outlier is then capped or floored at a distance of "y" times sigma from the mean, where y is equal to or greater than x and is determined by the practitioner.
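A minimal sketch of the sigma rule, with x = y = 3 as illustrative practitioner choices:

```python
# Sketch of the mean +/- x*sigma rule. x = y = 3 are common illustrative
# values; the text leaves both to the practitioner.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=5, size=1000)
x[0] = 500                                   # one extreme value

mu, sigma = x.mean(), x.std()
x_detect, y_cap = 3, 3                       # y >= x, per the rule above
is_outlier = np.abs(x - mu) > x_detect * sigma
treated = np.clip(x, mu - y_cap * sigma, mu + y_cap * sigma)

print(bool(is_outlier[0]))  # True: the injected value is flagged
```

Note that the injected outlier itself inflates sigma before detection; in practice, one may prefer robust spread estimates (e.g. based on the median), but the plain rule above is what the text describes.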
We hope you enjoyed the above concepts. We will continue with the next steps in the coming days.
Till then, happy reading!
Ciao!!