Employee Attrition Prediction
Submitted by
SWETHA NIHARIKA
(RA1711003011022)
Certified that this report on Employee Attrition Prediction is a record of industrial
training undertaken at Smart Bridge, located at Jubilee Hills, Hyderabad, during the
period 3rd June 2019 to 28th June 2019.
I hereby declare that the submitted presentation report titled “Employee Attrition” is a
record of the industrial training programme which I underwent at the company Smart
Bridge, Hyderabad, at the end of the fourth semester, during the period 3rd June 2019 to
28th June 2019.
1  Report Preparation   50
2  Presentation         25
   Total               100
TABLE OF CONTENTS
TRAINING SCHEDULE
SUMMARY
INTRODUCTION ABOUT THE INDUSTRY
Introduction:
Business domain
Data understanding
Modelling
1. Application type
This is a classification project, since the variable to be predicted is
binary (attrition or not).
The goal here is to model the probability of attrition, conditioned on
the employee features.
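A minimal sketch of this setup, assuming scikit-learn and a tiny made-up sample (the two features and their values below are illustrative, not drawn from the real data set):

```python
# Model P(attrition | features) with logistic regression.
from sklearn.linear_model import LogisticRegression

# Toy features per employee: [years at company, job satisfaction 1-4]
X = [[1, 1], [2, 1], [10, 4], [8, 3], [3, 2], [12, 4]]
y = [1, 1, 0, 0, 1, 0]  # 1 = attrition, 0 = stayed

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives [P(stay), P(attrition)] per employee;
# keep the attrition column.
probs = model.predict_proba([[2, 1], [11, 4]])[:, 1]
print(probs)
```

The classifier outputs a conditional probability rather than a hard label, which is what allows the ROC and confusion-matrix analysis later in the report.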
2. Data set
The data set used in this study contains quantitative and qualitative
information about a sample of employees at the company. The data
set covers about 1,500 employees; for each one, around 35 personal,
professional and socio-economic attributes are selected as the
input variables.
More specifically, the variables of this example are:
age
business_travel
daily_rate
department
distance_from_home
education
education_field
employee_count
employee_number
environment_satisfaction
gender
hourly_rate
job_involvement
job_level
job_role
job_satisfaction
marital_status
monthly_income
monthly_rate
number_companies_worked
over_18
overtime
percent_salary_hike
performance_rating
relationship_satisfaction
standard_hours
stock_option_level
total_working_years
training_times_last_year
work_life_balance
years_at_company
years_in_current_role
years_since_last_promotion
years_with_current_manager
Attrition: whether the employee stays with the company or leaves
(loyal or attrition); this is the target variable.
Training the model also requires a loss index and an optimization
algorithm.
The closer the area under the curve is to 1, the better the classifier. In this
case the area takes the value 0.836, which confirms what we saw
before in the ROC chart: the model predicts attrition with
good accuracy.
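For reference, an AUC value such as the 0.836 quoted above is computed from the model's predicted scores and the true labels. A minimal sketch, assuming scikit-learn; the labels and scores below are illustrative only, not the report's actual predictions:

```python
# Compute the area under the ROC curve from true labels and
# predicted scores (higher score = more likely attrition).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # AUC for this toy example
```

An AUC of 0.5 corresponds to random guessing, so 0.836 indicates the model ranks attriting employees above loyal ones most of the time.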
                   Predicted positive   Predicted negative
Real positive      316 (15.8%)          96 (4.8%)
Real negative      325 (16.3%)          1263 (63.1%)
The following binary classification tests are calculated from the
values of the confusion matrix.
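These tests can be reproduced directly from the confusion-matrix counts above (TP = 316, FN = 96, FP = 325, TN = 1263):

```python
# Binary classification tests derived from the confusion matrix.
tp, fn, fp, tn = 316, 96, 325, 1263

accuracy = (tp + tn) / (tp + fn + fp + tn)    # fraction of correct predictions
precision = tp / (tp + fp)                    # of predicted positives, how many are real
recall = tp / (tp + fn)                       # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f1, 3))
```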
Introduction
Employee turnover refers to the percentage of workers who leave an
organization and are replaced by new employees. It is very costly for
organizations: costs include, but are not limited to, separation,
vacancy, recruitment, training and replacement. On average,
organizations invest between four weeks and three months in training
new employees. This investment is lost if the new employee decides
to leave within the first year. Furthermore, organizations
such as consulting firms would suffer a deterioration
in customer satisfaction due to regular changes in Account
Reps and/or consultants, which would lead to loss of business with
clients.
In this post, we’ll work on simulated HR data from Kaggle to build a
classifier that helps us predict which employees are more
likely to leave, given some attributes. Such a classifier would help an
organization predict employee turnover and be pro-active in
addressing this costly problem. We’ll restrict ourselves to the most
common classifiers: Random Forest, Gradient Boosting Trees, K-
Nearest Neighbours, Logistic Regression and Support Vector
Machine.
The data has 14,999 examples (samples). Below are the features and
the definitions of each one:
satisfaction_level: Level of satisfaction {0–1}.
last_evaluation: Time since last performance evaluation (in
years).
number_project: Number of projects completed while at work.
average_montly_hours: Average monthly hours at the workplace.
time_spend_company: Number of years spent in the company.
work_accident: Whether the employee had a workplace accident.
left: Whether the employee left the workplace or not {0, 1}.
promotion_last_5years: Whether the employee was promoted in
the last five years.
sales: Department the employee works for.
salary: Relative level of salary {low, medium, high}.
Data Pre-processing
Let’s take a look at the data (check whether there are missing values,
and the data type of each feature):
Data overview
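A sketch of this overview step with pandas. A tiny hand-made frame stands in for the real 14,999-row Kaggle file (which would normally be loaded with `pd.read_csv`); the column names and values below are illustrative:

```python
# Inspect a data set for missing values and feature dtypes.
import pandas as pd

df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],  # float feature
    "number_project": [2, 5, 7],               # integer feature
    "salary": ["low", "medium", "high"],       # categorical feature
})

print(df.isnull().sum())  # count of missing values per column
print(df.dtypes)          # data type of each feature
```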
Since there are no missing values, we do not have to do any
imputation. However, some data pre-processing is needed:
1. Change the sales feature's name to department.
2. Convert salary into an ordinal categorical feature, since there is
an intrinsic order between low, medium and high.
3. Create dummy features from the department feature and drop the first
one, to avoid a linear dependency that some learning algorithms
may struggle with.
The data is now ready to be used for modelling. The final number of
features is now 17.
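The three steps above can be sketched as follows, again on a small stand-in frame rather than the full data set:

```python
# Pre-processing: rename, ordinal-encode, dummy-encode.
import pandas as pd

df = pd.DataFrame({
    "sales": ["sales", "technical", "support", "sales"],
    "salary": ["low", "medium", "high", "low"],
    "left": [1, 0, 0, 1],
})

# 1. Change the mislabelled sales column to department.
df = df.rename(columns={"sales": "department"})

# 2. Encode salary as an ordinal feature (low < medium < high).
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# 3. Dummy features from department, dropping the first level to
#    avoid linear dependence among the dummy columns.
df = pd.get_dummies(df, columns=["department"], drop_first=True)

print(list(df.columns))
```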
Modelling
Let’s first take a look at the proportion of each class to see if we’re
dealing with balanced or imbalanced data, since each one has its own
set of tools to be used when fitting classifiers.
Class counts
As the graph shows, we have an imbalanced dataset. As a result, when
we fit classifiers on such a dataset, we should compare models using
metrics other than accuracy, such as the f1-score or AUC (area
under the ROC curve). Moreover, class imbalance influences a learning
algorithm during training: the decision rule becomes biased towards
the majority class, because the algorithm implicitly learns a model
that optimizes its predictions for the majority class in the dataset.
There are three ways to deal with this issue:
1. Assign a larger penalty to wrong predictions on the minority
class.
2. Up-sample the minority class or down-sample the majority
class.
3. Generate synthetic training examples of the minority class (for
example with SMOTE).
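The first two remedies can be sketched with scikit-learn on toy data (the real pipeline would of course fit on the HR features; the arrays below are illustrative):

```python
# Two ways to handle class imbalance: class weights and up-sampling.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X = np.array([[i] for i in range(20)])
y = np.array([1] * 4 + [0] * 16)  # 4 minority vs 16 majority examples

# (1) Larger penalty for minority-class errors via class weights.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# (2) Up-sample the minority class to match the majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=16, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # both classes now equally represented
```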
LEARNING AFTER TRAINING:
We all learn best when we have examples to follow, friends to
share our successes with, buddies to learn from, and mentors in
our midst. Social learning connects learners to one another and
to their trainers, so that they can discuss and share stories. In-person
meetings, chat groups, forums, and videos of trainees sharing
their stories hosted on the Intranet are effective ways to
incorporate social learning into the learning process. This social
aspect of the learning process increases motivation and facilitates
a smooth transfer of knowledge.
Create opportunities for practice.