Employee Attrition Prediction
Submitted by
SWETHA NIHARIKA
(RA1711003011022)
Certified that this report on Employee Attrition Prediction is a record of industrial
training undertaken at Smart Bridge, located at Jubilee Hills, Hyderabad, during the
period 3rd June 2019 to 28th June 2019.
I hereby declare that the submitted presentation report titled “Employee Attrition” is a
record of the industrial training programme which I underwent at the company Smart
Bridge, Hyderabad, at the end of the fourth semester, during the period 3rd June 2019 to
28th June 2019.
1  Report Preparation   50
2  Presentation         25
   Total               100
TABLE OF CONTENTS
TRAINING SCHEDULE
SUMMARY
INTRODUCTION ABOUT THE INDUSTRY
Introduction:
Business domain
Data understanding
Modelling
1. Application type
This is a classification project, since the variable to be predicted is
binary (attrition or not).
The goal here is to model the probability of attrition, conditioned on
the employee features.
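A minimal sketch of this setup, assuming scikit-learn and a tiny made-up sample (the two features and their values below are illustrative, not drawn from the real data set):

```python
# Model P(attrition | features) with logistic regression.
from sklearn.linear_model import LogisticRegression

# Toy features per employee: [years at company, job satisfaction 1-4]
X = [[1, 1], [2, 1], [10, 4], [8, 3], [3, 2], [12, 4]]
y = [1, 1, 0, 0, 1, 0]  # 1 = attrition, 0 = stayed

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives [P(stay), P(attrition)] per employee;
# keep the attrition column.
probs = model.predict_proba([[2, 1], [11, 4]])[:, 1]
print(probs)
```

The classifier outputs a conditional probability rather than a hard label, which is what allows the ROC and confusion-matrix analysis later in the report.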
2. Data set
The data set used in this study contains quantitative and qualitative
information about a sample of employees at the company. The data
set covers about 1,500 employees; for each one, around 35 personal,
professional and socio-economic attributes are selected as the
input variables.
More specifically, the variables of this example are:
age
business_travel
daily_rate
department
distance_from_home
education
education_field
employee_count
employee_number
environment_satisfaction
gender
hourly_rate
job_involvement
job_level
job_role
job_satisfaction
marital_status
monthly_income
monthly_rate
number_companies_worked
over_18
overtime
percent_salary_hike
performance_rating
relationship_satisfaction
standard_hours
stock_option_level
total_working_years
training_times_last_year
work_life_balance
years_at_company
years_in_current_role
years_since_last_promotion
years_with_current_manager
Attrition: whether the employee stays with the company or leaves
(loyal or attrition); this is the target variable.
Training the model also requires a loss index and an optimization
algorithm.
The closer the area under the curve is to 1, the better the classifier. In this
case the area takes the value 0.836, which confirms what we saw
before in the ROC chart: the model predicts attrition with
good accuracy.
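For reference, an AUC value such as the 0.836 quoted above is computed from the model's predicted scores and the true labels. A minimal sketch, assuming scikit-learn; the labels and scores below are illustrative only, not the report's actual predictions:

```python
# Compute the area under the ROC curve from true labels and
# predicted scores (higher score = more likely attrition).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # AUC for this toy example
```

An AUC of 0.5 corresponds to random guessing, so 0.836 indicates the model ranks attriting employees above loyal ones most of the time.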
                   Predicted positive   Predicted negative
Real positive      316 (15.8%)          96 (4.8%)
Real negative      325 (16.3%)          1263 (63.1%)
The following binary classification tests are calculated from the
values of the confusion matrix.
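These tests can be reproduced directly from the confusion-matrix counts above (TP = 316, FN = 96, FP = 325, TN = 1263):

```python
# Binary classification tests derived from the confusion matrix.
tp, fn, fp, tn = 316, 96, 325, 1263

accuracy = (tp + tn) / (tp + fn + fp + tn)    # fraction of correct predictions
precision = tp / (tp + fp)                    # of predicted positives, how many are real
recall = tp / (tp + fn)                       # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f1, 3))
```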
Introduction
Employee turnover refers to the percentage of workers who leave an
organization and are replaced by new employees. It is very costly for
organizations: costs include, but are not limited to, separation,
vacancy, recruitment, training and replacement. On average,
organizations invest between four weeks and three months in training
new employees. This investment is lost if the new employee decides
to leave within the first year. Furthermore, organizations
such as consulting firms would suffer a deterioration
in customer satisfaction due to regular changes in Account
Reps and/or consultants, which would lead to loss of business with
clients.
In this post, we’ll work on simulated HR data from Kaggle to build a
classifier that helps us predict which employees are more
likely to leave, given some attributes. Such a classifier would help an
organization predict employee turnover and be pro-active in
addressing this costly problem. We’ll restrict ourselves to the most
common classifiers: Random Forest, Gradient Boosting Trees, K-
Nearest Neighbours, Logistic Regression and Support Vector
Machine.
The data has 14,999 examples (samples). Below are the features and
the definitions of each one:
satisfaction_level: Level of satisfaction {0–1}.
last_evaluation: Time since last performance evaluation (in
years).
number_project: Number of projects completed while at work.
average_montly_hours: Average monthly hours at the workplace.
time_spend_company: Number of years spent in the company.
work_accident: Whether the employee had a workplace accident.
left: Whether the employee left the workplace or not {0, 1}.
promotion_last_5years: Whether the employee was promoted in
the last five years.
sales: Department the employee works for.
salary: Relative level of salary {low, medium, high}.
Data Pre-processing
Let’s take a look at the data (check whether there are missing values,
and the data type of each feature):
Data overview
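A sketch of this overview step with pandas. A tiny hand-made frame stands in for the real 14,999-row Kaggle file (which would normally be loaded with `pd.read_csv`); the column names and values below are illustrative:

```python
# Inspect a data set for missing values and feature dtypes.
import pandas as pd

df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],  # float feature
    "number_project": [2, 5, 7],               # integer feature
    "salary": ["low", "medium", "high"],       # categorical feature
})

print(df.isnull().sum())  # count of missing values per column
print(df.dtypes)          # data type of each feature
```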
Since there are no missing values, we do not have to do any
imputation. However, some data pre-processing is needed:
1. Change the sales feature's name to department.
2. Convert salary into an ordinal categorical feature, since there is
an intrinsic order between low, medium and high.
3. Create dummy features from the department feature and drop the first
one, to avoid a linear dependency that some learning algorithms
may struggle with.
The data is now ready to be used for modelling. The final number of
features is now 17.
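The three steps above can be sketched as follows, again on a small stand-in frame rather than the full data set:

```python
# Pre-processing: rename, ordinal-encode, dummy-encode.
import pandas as pd

df = pd.DataFrame({
    "sales": ["sales", "technical", "support", "sales"],
    "salary": ["low", "medium", "high", "low"],
    "left": [1, 0, 0, 1],
})

# 1. Change the mislabelled sales column to department.
df = df.rename(columns={"sales": "department"})

# 2. Encode salary as an ordinal feature (low < medium < high).
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# 3. Dummy features from department, dropping the first level to
#    avoid linear dependence among the dummy columns.
df = pd.get_dummies(df, columns=["department"], drop_first=True)

print(list(df.columns))
```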
Modelling
Let’s first take a look at the proportion of each class to see if we’re
dealing with balanced or imbalanced data, since each one has its own
set of tools to be used when fitting classifiers.
Class counts
As the graph shows, we have an imbalanced dataset. As a result, when
we fit classifiers on such a dataset, we should compare models using
metrics other than accuracy, such as the f1-score or AUC (area
under the ROC curve). Moreover, class imbalance influences a learning
algorithm during training: the decision rule becomes biased towards
the majority class, because the algorithm implicitly learns a model
that optimizes its predictions for the majority class in the dataset.
There are three ways to deal with this issue:
1. Assign a larger penalty to wrong predictions on the minority
class.
2. Up-sample the minority class or down-sample the majority
class.
3. Generate synthetic training examples of the minority class (for
example with SMOTE).
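The first two remedies can be sketched with scikit-learn on toy data (the real pipeline would of course fit on the HR features; the arrays below are illustrative):

```python
# Two ways to handle class imbalance: class weights and up-sampling.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X = np.array([[i] for i in range(20)])
y = np.array([1] * 4 + [0] * 16)  # 4 minority vs 16 majority examples

# (1) Larger penalty for minority-class errors via class weights.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# (2) Up-sample the minority class to match the majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=16, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # both classes now equally represented
```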
LEARNING AFTER TRAINING:
We all learn best when we have examples to follow, friends to
share our successes with, buddies to learn from, and mentors in
our midst. Social learning connects learners to one another and
to their trainers, so that they can discuss and share stories. In-person
meetings, chat groups, forums, and videos of trainees sharing
their stories hosted on the Intranet are effective ways to
incorporate social learning into the learning process. This social
aspect of the learning process increases motivation and facilitates
a smooth transfer of knowledge.
Create opportunities for practice.