
Rochester Institute of Technology

RIT Digital Institutional Repository

Theses

5-16-2023

Study to Improve the Employee Experiences and Reducing the Employee Attrition
Mahra Essa Mohammed Belarai Alfalasi
mea8986@rit.edu

Follow this and additional works at: https://repository.rit.edu/theses

Recommended Citation
Alfalasi, Mahra Essa Mohammed Belarai, "Study to Improve the Employee Experiences and Reducing the
Employee Attrition" (2023). Thesis. Rochester Institute of Technology. Accessed from

This Master's Project is brought to you for free and open access by the RIT Libraries. For more information, please
contact repository@rit.edu.
Study to Improve the Employee Experiences and
Reducing the Employee Attrition

by

Mahra Essa Mohammed Belarai Alfalasi

A Capstone Submitted in Partial Fulfilment of the Requirements for the

Degree of Master of Science in Professional Studies:

Data Analytics

Department of Graduate Programs & Research

Rochester Institute of Technology


RIT Dubai
16th May 2023

RIT
Master of Science in Professional Studies:
Data Analytics

Graduate Capstone Approval


Student Name: Mahra Essa Mohammed Belarai Alfalasi
Graduate Capstone Title: Study to Improve the Employee Experiences
and Reducing the Employee Attrition

Graduate Capstone Committee:

Name: Dr. Sanjay Modak Date: 15th, May 2023


Chair of committee

Name: Dr. Hammou Messatfa Date: 15th, May 2023


Member of committee

Acknowledgments

I feel honored to express my sincere gratitude to Dr. Hammou Messatfa, my thesis supervisor, for
his invaluable guidance and unwavering support during my research. His insightful feedback,
constructive criticism, and mentorship have been crucial in shaping my work and refining my
skills. By reviewing my thesis and providing constructive feedback, he has helped me improve my
research and strengthen my arguments.

I would like to express my sincere gratitude to Dr. Sanjay, the head of the department, for
approving the fascinating subject of my thesis on employee attrition. His support and
encouragement have given me the opportunity to explore this important topic and delve deeper
into its complexities.

Finally, I am deeply grateful for the constant support of my family and friends, who have been a
continuous source of inspiration and motivation throughout my academic journey. Their
unwavering belief in me has given me the courage to overcome challenges and pursue my
academic goals. I am truly thankful for their love and encouragement, which have played an
instrumental role in my success.

Abstract

Employee attrition is a significant issue for virtually every company today, regardless of changes
in the external environment. Attrition is defined as the gradual reduction in the number of
employees due to retirement, death, and resignations (Marais, 2022). It occurs when a
well-trained, well-adjusted, talented person leaves the company for any reason, leaving a gap
in the workplace (BasuMallick, 2021) that is extremely difficult for HR staff to fill.
Minimizing turnover is a major concern for today's managers, and modern HR managers approach it
in several ways, because an employee's decision to leave is motivated by several factors
(Charaba, 2022). It is the company's responsibility to recognize employee turnover, as it
has a significant impact on processes. Retaining employees reduces the need to recruit
replacements, improves stability, and reduces the time spent on training. In this paper, a
detailed interpretive model is used to identify the causes of turnover. The study examines the
correlation among several factors to determine which have the greatest impact on employee
turnover, with the aim of addressing this major problem and predicting when an employee will
leave. Various models, including Logistic Regression, PCA, Random Trees, Neural Networks, and
LSVM, are tested to identify the most accurate model for predicting employee attrition and to
determine its suitability for implementation.

Keywords:

Attrition, Retirement, Resignations, Stability, Turnover, Machine Learning, Model, PCA,
Random Trees, Neural Networks, LSVM, Logistic Regression.

Table of Contents

Acknowledgments
Abstract
List of Figures
List of Tables
Chapter 1
1.1 Background
1.2 Statement of problem
1.3 Project goals
1.4 Methodology
1.5 Limitations of the Study
Chapter 2 – Literature Review
2.1 Literature Review Main Takeaways
Chapter 3 – Project Description
3.1 Data Understanding
3.2 Data Transformation
Chapter 4 – Project Analysis
4.1 Modeling
Chapter 5 – Conclusion
5.1 Conclusion
5.2 Recommendations
5.3 Future Work
Bibliography

List of Figures

Figure 3.1.1 Data Insights


Figure 3.1.2 Data Overview
Figure 3.1.3 Attrition Distribution
Figure 3.1.4 Attrition Percentage
Figure 3.1.5 Education Distribution
Figure 3.1.6 Attrition distributed on age field
Figure 3.1.7 Attrition distributed on gender
Figure 3.1.8 Attrition distribution on age, daily rate, distance from home, monthly rate, years at
company worked and monthly income.
Figure 3.1.9 Attrition distribution on education, environment satisfaction, stock option level, work
life balance, and years at company.
Figure 3.1.10 Chi-square of attrition and Marital Status
Figure 3.1.11 Chi-square of attrition and years in current role.
Figure 3.1.12 Chi-square of attrition and job role.
Figure 3.1.13 Chi-square of attrition and job satisfaction.
Figure 3.1.14 Performance rating distribution on Business Travel and Department.
Figure 3.1.15 Anomaly distribution
Figure 3.2.1 Employee attrition dataset’s Principal Component Analysis.
Figure 3.2.2 PCA values implemented in the fields of employee attrition dataset.
Figure 4.1. Employee Attrition Feature selection.

List of Tables

Table 4.1.2 Model Scenarios Summary

Chapter 1
1.1 Background
The key to any company's success lies in attracting and retaining top talent. The job of an HR
analyst is to find out which factors keep employees in the company and which cause others to
leave. Knowing this, it is important to determine which factors can be changed to prevent the
loss of talented people. Watson Analytics assists in this by providing, through Kaggle, a dataset
on past and current employees that contains various data points about each employee, including
whether they are still with the same company or have moved to another one. It is important to
understand how these data points affect turnover.
The dataset attributes describe several employee factors, such as job role, job satisfaction,
environmental satisfaction, performance rating, relationship satisfaction, and work-life balance.
The satisfaction and rating fields are scored from 1 to 4, ranging from low to very high.
Education is scored from 1 to 5, ranging from below college to doctorate. Other attributes
indicate the employee's gender, role, turnover, and age.

1.2 Statement of problem

Employee attrition is a critical and timely issue for all companies. Companies and their HR
departments will benefit greatly from being able to predict with a high degree of certainty which
employees are most likely to leave. By understanding why and when employees are most likely to
leave, HR can take steps to improve retention and potentially plan ahead for new hires. Therefore,
I am interested in exploring what factors may lead to employee turnover, whether age, income,
and satisfaction with the work environment have an impact, or whether other factors have a greater
impact on employee turnover.

1.3 Project goals
1. To increase employees' working life span at the company.
2. To understand what causes the employee dissatisfaction that leads to attrition.
3. To be aware of how attrition can be decreased.
4. To analyze the key factors behind the employee attrition rate and plan actions.
5. To retain employees who might be considering leaving and to enhance their experience.

1.4 Methodology
This project follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) because
its implementation results in a structured, well-organized project. The approach provides
flexibility, a long-term planning framework, and repeatability; it is well suited to new
data-driven projects, whether simple or complex, because it does not require extensive
training. It consists of six phases. The first, Business Understanding, establishes the
goals and scope of the project. This is the initial and most important step of the framework:
it gives a clear idea of the business, which includes learning and exploring the pain points
in order to address the problem. This is followed by data collection and Data Understanding,
which clarify the relationships and correlations between the attributes of the data set. In
Data Preparation, outliers and missing values are removed to obtain clean, prepared data.
Modeling involves investigating which techniques fit the selected data set: understanding
the scope of the problem, selecting appropriate models to solve it, and choosing the most
accurate techniques to predict the target. The modeling is carried out in SPSS, a collection
of software programs for analyzing scientific data in the social sciences, which provides a
fast, visual modeling environment for models ranging from the simplest to the most complex.
In the Evaluation phase, various statistical measures are used to assess the models,
including the ROC curve, root mean square error, precision, and the confusion matrix. This
determines whether the model answers all of the questions posed or only a portion of them.
The evaluation phase can also be used to review overall progress and determine whether
everything is on track or whether the analysis and business goals need to be revisited. The
final phase, Deployment, takes place after a thorough analysis of the data set and is about
putting the selected model and its findings to practical use in the business.
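
As a minimal sketch of the evaluation measures named above, the snippet below derives a confusion matrix, accuracy, precision, recall, and root mean square error from a handful of predictions. The labels are made-up toy values rather than results from this study, and the helper name `confusion_counts` is hypothetical.

```python
import math

# Illustrative only: toy labels, not data or results from this study.
# 1 = employee left (attrition "Yes"), 0 = employee stayed ("No").

def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted leavers who really left
recall = tp / (tp + fn)              # share of real leavers the model caught
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print("confusion matrix (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
print("rmse:", round(rmse, 3))
```

For an attrition problem, recall is often the measure to watch, since a missed leaver (a false negative) is usually costlier than a false alarm.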

1.5 Limitations of the Study

The accuracy of predicting employee attrition may be limited by the data available on Kaggle, as
it may not include all the features relevant to a specific context. Additionally, using data
collected by others gives limited control over the data collection process, which can affect the
quality of the data and limit the ability to address potential sources of bias. Furthermore,
not having full ownership of or control over the data may limit the ability to use it for future
research.

Chapter 2 – Literature Review
Some studies examined the traits and behaviors of employees that contributed to their
decision to leave or stay with the company. Several other researchers used machine learning (ML)
algorithms to forecast staff turnover, but only a small number of studies used XAI
approaches to explain the ML models' predictions of employee attrition. One study set out to
identify the key factors that influence employee turnover and to forecast it. After training,
the model was evaluated on a real IBM analytics dataset with 35 features and about 1,500 samples.
On this dataset, the Gaussian Naive Bayes classifier delivered the best results, with a
recall rate of 0.54 and a false negative rate of 4.5 percent of all observations. Another line
of work brings together a number of elements, including social, cultural, economic,
professional, and interpersonal ones, and compares six alternative machine learning algorithms.
Its findings indicate that the Random Forest algorithm is the most accurate in predicting staff
attrition, with a highest forecast accuracy of 85.12%.
A study by (Bhardwaj, Shikha & Singh, Ashutosh, 2017) investigated what influences the
attrition rate in the industrial sector. As part of a descriptive study, employees from a
manufacturing unit (63 engineers and 12 non-engineers) were interviewed, either in person or
using a series of prepared questions, to determine those factors. According to the study,
the boss, salary, and stress are the main variables influencing the attrition rate. Simply
finding, evaluating, providing, and placing the appropriate personnel at the appropriate time
and location does not guarantee that a business will grow; the organization will grow and
succeed if the right talent is retained for the right amount of time. Based on its findings,
the study concludes that modern manufacturing facilities take the reduction of attrition a
little too lightly. To strengthen the employee-employer relationship, there should be clear and
open lines of communication. The pay structure should be updated, and appropriate raises should
be provided and maintained. Put simply, attrition occurs when the workforce shrinks due to a
variety of avoidable factors. Employee attrition can be caused by a lack of trust in the
company's leadership and market value, a hostile work environment, a lack of professional growth
opportunities, and other factors (Contributors, 2020). In addition, there are many reasons
why an employee might leave their current position.
Some of the main reasons for employee turnover are a lack of professional development
opportunities, a lack of employee engagement, missing or poor employee benefits and annual
compensation, disagreements with colleagues or management, unclear company goals or
direction, and employees feeling that their honest feedback or thoughts are not considered
(Charaba, 2022). Furthermore, a study conducted by (Lavanya, D. B. L. 2017) addresses the
inevitable but controllable turnover of software personnel. One hundred respondents were
selected by simple random sampling to complete a structured questionnaire. Efficiency values
for attrition were calculated using data analysis in SPSS version 20. Numerous statistical
methods were used, including factor analysis, correlation analysis, t-test, chi-square test,
one-way ANOVA, and multiple regression. Multiple regressions were performed to examine the
effects of staff turnover, as the correlation analysis showed a significant relationship.
The results showed no statistically significant difference in the dimensions of the
parameters used to predict employee turnover. According to a chi-square test, there is a
significant relationship between employee job search and turnover rate.

Predicting employee turnover can help save money, because the time and expense of finding,
hiring, and training new employees make high turnover rates expensive. By using AI to identify
which employees are likely to leave, companies can take proactive steps to retain them and
reduce the costs associated with turnover. In addition, employees are more inclined to stay
with a company if they feel valued and supported, so companies can improve the work environment
and increase employee retention by using AI to identify the factors that influence employee
satisfaction.

Organizations rely heavily on their employees as valuable resources, making it essential to
identify individuals who may leave. Employee attrition can occur for various reasons, which
makes it challenging to predict accurately. To address this issue, one paper explores using
machine learning models to forecast attrition with high accuracy. The study used the IBM
attrition dataset to train and evaluate several models, including Decision Tree, Random Forest
Regressor, Logistic Regressor, Adaboost Model, and Gradient Boosting Classifier. The goal of
accurately detecting attrition is to improve retention strategies and increase employee
satisfaction. Ultimately, the study emphasizes the importance of leveraging machine learning to
predict employee attrition accurately and to benefit organizations by developing better
retention strategies (Qutub, Aseel & Al-Mehmadi, Asmaa & Al-Hssan, Munirah & Aljohani, Ruyan &
Alghamdi, Hanan, 2021).

This article explores how machine learning techniques can be employed to forecast employee
turnover in companies.
The objective of this research is to pinpoint the key factors that lead to an employee's departure
and to anticipate whether a specific worker is likely to quit. The piece stresses the importance of
using unbiased data analysis to inform HR management decisions and highlights the advantages of
such an approach for enhancing the quality and competence of the workforce
(Williammorath, 2021).

Another study explores how employee turnover can be classified using various machine learning
algorithms, namely Support Vector Classification, Decision Tree Classifier, AdaBoost Classifier,
Random Forest Classifier, Extra Trees Classifier, Logistic Regression, and Gradient Boosting
Classifiers, with reference to 16 characteristics of workers. The random forest model produced
the best results among all models and can be further used for real-world prediction purposes
(Liao, 2023).

Every organization faces the challenge of employee attrition, which directly affects its growth
through the loss of talented and skilled employees. Employees are valuable resources, and their
departure can create dependency issues in critical positions. There can be various reasons
behind an employee's decision to leave, and it is the organization's responsibility to identify
the causes and implications of attrition. Retaining employees can reduce the burden of hiring
and training new candidates, improve stability, and increase productivity. An advanced
interpretation model can be used to examine the causes of employee attrition and so address
this problem. To explain the variables affecting employee turnover, this article compares the
performance of two Explainable AI (XAI) models, namely the Local Interpretable Model-Agnostic
Explainer (LIME) and the Shapley Additive eXplainer (SHAP). These models extract rational
conclusions from the data, which can assist management in reducing the risk of staff attrition
(Sekaran, Karthik & Sundaramurthy, Shanmugam, 2022).

2.1 Literature Review Main Takeaways

Takeaways retrieved from each reference:

Ref 1 Qutub, Aseel & Al-Mehmadi, Asmaa & Al-Hssan, Munirah & Aljohani, Ruyan & Alghamdi, Hanan.
(2021). Prediction of Employee Attrition Using Machine Learning and Ensemble.

1. It is essential to recognize employees as an asset to any organization, and it is crucial to
identify employees who might leave the organization in order to avoid the cost associated with
their departure.

2. Employee attrition can result from several causes, and it is imperative to understand the
underlying factors that may lead to it.

3. The IBM attrition dataset is used in this study to create machine learning models capable
of accurately predicting employee attrition.

4. The machine learning models employed in this research comprise Decision Tree, Random
Forest Regressor, Logistic Regressor, Adaboost Model, and Gradient Boosting Classifier models.

5. The goal is to identify attrition precisely, to assist companies in enhancing retention
strategies for critical employees and improving employee satisfaction. Ensemble learning and
stochastic gradient descent are used in this study to boost the accuracy of the models.

Ref 2 Williammorath. (2021, May 24). Employee attrition eda walkthrough. Kaggle. Retrieved November 21,
2022, from https://www.kaggle.com/code/williammorath/employee-attrition-eda-walkthrough

1. Artificial intelligence is increasingly being used to support decision-making in various areas of
organizations, including human resources management.

2. The quality and skills of workers are a growth feature and competitive advantage for
companies, and understanding employee attrition can help organizations address this issue.

3. Objective data analysis can be used to identify the main causes of employee attrition and
predict whether a particular employee is likely to depart the company.

4. The study used a dataset provided by IBM analytics, which included 35 features and about
1500 samples.

5. The Gaussian Naïve Bayes classifier produced the best results for predicting employee attrition
in the dataset, with a recall rate of 0.54 and a false negative rate of 4.5% of total observations.

Ref 3 Hong, W. (2006). A Comparative Test of Two Employee Turnover Prediction Models. The International
Journal of Management, 24, 216.

1. Precise models to predict employee turnover are essential for detecting unanticipated turnover
and providing managers with sufficient time to address related management issues.

2. While the logit and probit models have effectively solved nonlinear classifying and regression
problems, their feasibility in predicting voluntary turnover has not received adequate research.

3. To demonstrate high prediction capabilities, the article uses a numerical example with
voluntary turnover data from a motor marketing enterprise located in central Taiwan.

4. The logit and probit models represent a promising option for predicting employee turnover in
human resource management.

5. According to the article, the use of machine learning techniques such as logit and probit models
can assist organizations in making informed decisions and taking proactive measures to prevent
employee turnover.
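
To make the difference between the two models concrete, the following minimal sketch shows the link functions that turn a linear score into a probability of leaving: the logistic function for logit and the standard normal cumulative distribution for probit. The scores are hypothetical values chosen for illustration and are not taken from the cited article.

```python
import math

def logit_probability(z):
    """Logit model: logistic function applied to the linear score z."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_probability(z):
    """Probit model: standard normal CDF evaluated at the linear score z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical linear scores (intercept plus weighted predictors) for three employees.
for z in (-2.0, 0.0, 1.5):
    print(f"score={z:+.1f}  logit={logit_probability(z):.3f}  "
          f"probit={probit_probability(z):.3f}")
```

Both functions map a score of zero to a probability of 0.5 and differ mainly in how quickly they approach 0 and 1 in the tails.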

Ref 4 Liao, C. (2023). Employee turnover prediction using machine learning models. International Conference
on Mechatronics Engineering and Artificial Intelligence.

1. The study focuses on how different machine learning algorithms can be used to categorize
employee turnover based on 16 characteristics of workers.

2. The Employee Turnover dataset by E. Babushkin was used as the source of information for the
study.

3. Seven classification models were developed and compared, including naive Bayes, random
forest, logistic regression, support vector machines, and XGBoost.

4. The experiments conducted in the study validated the effectiveness of the machine learning
models.

5. The random forest model produced the best results among all the models tested, and the
findings suggest that it can be used for real-world prediction purposes in the context of
employee turnover.

Ref 5 Sekaran, Karthik & Sundaramurthy, Shanmugam. (2022). Interpreting the Factors of Employee Attrition
using Explainable AI. 10.1109/DASA54658.2022.9765067.

1. Employee attrition is a significant challenge for organizations, and it is crucial to identify and
address it.

2. Each employee represents a valuable resource, and their departure can have significant
implications for an organization's processes.

3. Retention of employees is essential as it reduces the burden of hiring new candidates, increases
stability, and saves time on training.

4. Sophisticated interpretation models, such as the Local Interpretable Model-Agnostic Explainer
(LIME) and the Shapley Additive eXplainer (SHAP), can be used to identify the reasons for
employee attrition.

5. These models can provide logical insights from data that can assist management authorities in
mitigating the risk of employee attrition.

Chapter 3 - Project Description
3.1 Data Understanding
Displayed below is a screenshot from SPSS that provides an overview of the HR employee
attrition dataset. The fields are categorized according to their measurement type, such as
continuous or nominal. A continuous field is a quantitative numerical variable that can take
infinitely many values between any two values, while a nominal field labels categories without
implying a numerical value. The overview also includes the minimum, maximum, mean (average),
and standard deviation of each field. Additionally, skewness is reported as a measure of the
asymmetry of a distribution: if the data points are not distributed symmetrically to the left
and right of the center, as they would be on a bell curve, skewness is present. Finally, the
number of valid records for each field is displayed, which is 1470.
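
The summary statistics described above can be reproduced outside SPSS. The sketch below is only an illustration: the income values are hypothetical, and the skewness formula is the adjusted Fisher-Pearson sample skewness, which is assumed here rather than confirmed as the exact variant SPSS reports.

```python
import statistics

def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (assumed variant, see lead-in)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)

# Hypothetical monthly incomes, not taken from the dataset.
# Two large values create a long right tail, so skewness should be positive.
incomes = [2000, 2200, 2500, 2700, 3000, 3200, 9000, 15000]

print("valid records:", len(incomes))
print("min:", min(incomes), "max:", max(incomes))
print("mean:", statistics.mean(incomes))
print("std dev:", round(statistics.stdev(incomes), 2))
print("skewness:", round(sample_skewness(incomes), 3))
```

A value near zero indicates a roughly symmetric field, a positive value indicates a long right tail, and a negative value a long left tail.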

Figure 3.1.1 Data Insights


Figure 3.1.2 Data Overview

The diagram provides an overview of the data. No missing values are reported, and "impute
missing" is set to "never." However, outliers were identified in several variables, including
"16" in TotalWorkExperience, "11" in MAH_1, and "24" in YearsAtCompany. These values were
identified as outliers using the Mahalanobis distance, which takes the correlations between
variables into account and measures how far an observation lies from the mean of the data, in
units comparable to standard deviations. The Mahalanobis distance is a useful technique for
detecting multivariate anomalies, that is, unusual combinations of two or more variables
(Stephanie, 2020).
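
To illustrate the idea, the sketch below computes the Mahalanobis distance for a two-variable case using only the standard library. The data points and the function name `mahalanobis_2d` are hypothetical; the study itself relied on SPSS, whose settings and thresholds are not reproduced here.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Squared distance d^T S^-1 d, using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical (total working years, years at company) pairs.
# The last record breaks the usual pattern: few working years, many company years.
points = [(5, 3), (8, 5), (10, 7), (12, 9), (15, 11), (20, 15), (6, 4), (4, 14)]

distances = mahalanobis_2d(points)
for point, d in zip(points, distances):
    print(point, round(d, 2))
```

Neither value of the last record is extreme on its own, yet it receives the largest distance because the combination contradicts the correlation between the two variables. This is what distinguishes a multivariate anomaly from a univariate outlier.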

Figure 3.1.3 Attrition distribution

The figure illustrates the distribution of attrition in the sample: 141 of the 1,470
employees (approximately 10%) chose to leave the company, while 1,329 employees
(approximately 90%) decided to stay.

Figure 3.1.4 Attrition Percentage

The figure displays the percentage distribution of attrition among the employees, with 9.6%
leaving the company, while 90.4% chose to remain employed.

Data Partition

Data partitioning is crucial because it allows a model to be trained on one portion of the data
and evaluated on a separate portion that it has not seen, which gives a more realistic estimate
of how it will perform on new records. For the employee attrition dataset, the data is
partitioned into 70% for training and 30% for testing.
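
A 70/30 partition can be sketched as follows. This is an illustration only: the study used SPSS for partitioning, and the function name, the fixed seed, and the use of row indices in place of real records are assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices standing in for the 1470 employee records.
employee_rows = list(range(1470))
train, test = train_test_split(employee_rows)

print("training rows:", len(train))
print("testing rows:", len(test))
```

Shuffling before the cut matters because a file sorted by, for example, department or hire date would otherwise place systematically different employees in the two sets.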

Data Balance

The employee attrition dataset was imbalanced, with 90.4% ‘No’ and 9.6% ‘Yes’ values, as
shown below.

To improve data quality and modeling accuracy, the dataset was balanced to 50.99% ‘No’ and
49.01% ‘Yes’.
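
The text does not state which balancing method was applied in SPSS. As one possible approach, the sketch below assumes random oversampling of the minority class, so it yields an exact 50/50 split rather than the 50.99% and 49.01% reported above. The function name `oversample_minority` is hypothetical.

```python
import random

def oversample_minority(labels, seed=0):
    """Return row indices in which minority rows are duplicated at random
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    yes_rows = [i for i, label in enumerate(labels) if label == "Yes"]
    no_rows = [i for i, label in enumerate(labels) if label == "No"]
    minority, majority = sorted((yes_rows, no_rows), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Class counts as reported in the text: 141 'Yes' and 1329 'No'.
labels = ["Yes"] * 141 + ["No"] * 1329
print("share of 'Yes' before:", round(100 * labels.count("Yes") / len(labels), 1), "%")

balanced_rows = oversample_minority(labels)
balanced_labels = [labels[i] for i in balanced_rows]
print("'Yes' after:", balanced_labels.count("Yes"), "'No' after:", balanced_labels.count("No"))
```

Balancing is normally applied to the training partition only, so that the testing partition still reflects the real proportion of leavers.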

Outliers

The distribution of the outlier flag is shown below: 11 records are flagged as outliers
(Outlier = 1).

Using the SPSS Select tool, records that meet the condition Outlier = 1 are discarded.

After the condition is applied, all records with an outlier value of 1 have been discarded, as
shown below.

Visualizations of the Employee Attrition

Figure 3.1.5 Education distribution

This figure shows the education fields of the employees. Life Sciences has the highest number
of employees, followed by Medical, and Human Resources has the fewest.

Figure 3.1.6 Attrition distributed on age field

This figure shows that employees who stayed fall mostly in the age range of 31 to 43, with an
average of 35, whereas employees who left fall mostly in the range of 28 to 37, with an
average of 32. Outliers are present in the group that left.

Figure 3.1.7 Attrition distributed on gender

This figure shows the distribution of attrition by gender. Most of the departures are among
men, which is consistent with most of the employees being male.

Figure 3.1.8 Attrition distribution on age, daily rate, distance from home, monthly rate, years at company worked
and monthly income.

The boxplots compare several employee attributes, namely age, daily rate, distance from home,
monthly rate, years at the company, and monthly income, between employees who left and those
who stayed. Employees who chose to stay with the company tend to be older than those who
left. Outliers were identified in daily rate, years at the company, and monthly income.
Distance from home tends to be higher among employees who left, indicating that it had a
greater impact on them. Lastly, the monthly rate is distributed similarly for employees who
chose to leave and those who decided to stay with the company.
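
Boxplots conventionally mark a value as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to hypothetical tenure values. Whether the SPSS plots above use exactly this multiplier and quartile method is an assumption.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical years-at-company values, not taken from the dataset.
years_at_company = [1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 10, 33]

print("outliers:", boxplot_outliers(years_at_company))
```

Values flagged this way are candidates for review rather than automatic errors: a 33-year tenure is unusual but can be perfectly valid.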

Figure 3.1.9 Attrition distribution on education, environment satisfaction, stock option level, work life balance, and
years at company.

The boxplots present several important observations. Education and work-life balance are
distributed similarly for employees who left the company and those who stayed. Environment
satisfaction, by contrast, shows a significant difference, with employees who had higher
satisfaction levels staying with the company and those with lower satisfaction leaving. In
addition, employees who had spent more years with the company were more likely to remain than
newer employees, with several outliers indicating exceptions. Lastly, for stock option level,
the plot indicates that no employees left the company.

Figure 3.1.10 Chi-square of attrition and Marital Status.

The chi-square test assesses how well observed data conform to the frequencies expected under a
model, here the independence of two categorical variables. With a significance probability of less
than 0.05 (Hayes, 2023), we can conclude that marital status is significantly associated with
attrition. Specifically, of the employees who left the company, 64% were single, 26% were married,
and 10% were divorced. This suggests that employees without family responsibilities, such as
singles, are more likely to leave the company.
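
As a rough illustration of the test described above, the following sketch runs a chi-square test of
independence with SciPy. The contingency counts are hypothetical and are not taken from the study's
dataset; the study itself ran the test in SPSS.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, for illustration only (not the study's data).
# Rows: Single, Married, Divorced. Columns: Left, Stayed.
observed = np.array([
    [60, 140],
    [30, 270],
    [10, 90],
])

# chi2_contingency compares observed counts with the counts expected
# if marital status and attrition were independent.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

is_significant = p_value < 0.05
print(f"chi-square = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.3g}")
print("Significant association" if is_significant else "No significant association")
```

A p-value below 0.05 would lead to rejecting independence, which is the criterion applied in the
figures of this section.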

Figure 3.1.11 Chi-square of attrition and years in current role.

Based on the chi-square analysis of attrition and years in current role, it can be inferred that
employees who have recently joined the company have a higher attrition rate of 31.91% compared
to those who have stayed with a rate of 14.9%. However, employees who have stayed for more
than 10 years in their current role remained at the company. Furthermore, there were only around
five individuals who chose attrition among those who stayed in their current role for at least 1 year
to 13 years.

Figure 3.1.12 Chi-square of attrition and job role.

Based on the chi-square analysis of job role and attrition, we can determine that employees with
a laboratory background had the highest attrition rate, reaching 29.78%. Sales executives had the
second-highest attrition rate at 25.53%. Conversely, Sales executives had a 21.8% retention rate,
followed by Researchers at 20.16%.

Figure 3.1.13 Chi-square of attrition and job satisfaction.

Based on the chi-square analysis of attrition and job satisfaction, it can be inferred that
employees who have higher job satisfaction (30% and above) tend to remain with the company.
Conversely, those who choose attrition have a job satisfaction percentage of 36.17%, indicating
that they are dissatisfied with their job.

Figure 3.1.14 Performance rating distribution on Business Travel and Department.

Based on the performance rating scale, with 3.0 representing Business Travel, it can be observed
that most employees who chose attrition were those who rarely traveled. Additionally, according
to the performance rating scale with 4.0 representing the department, it can be noted that most
employees who chose attrition were from the sales department.

Anomaly detection
Finding anomalies is a significant problem that has been researched for decades. A variety of
techniques have been developed and are used to identify abnormalities for different purposes.
Anomaly detection is the task of finding patterns in data that do not correspond to expected
behavior (Nassif, Talib, Nasir, & Dakalbab, 2021).

Figure 3.1.15 Anomaly distribution


In this figure, about 1,400 records are flagged as False (not anomalous) and around 14 records are
flagged as True (anomalous), which means that anomalies make up only a small fraction of the
dataset.
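
The text does not state which anomaly detection method produced these flags. As one possible
illustration only, the sketch below flags records by their squared Mahalanobis distance from the
data mean, using a chi-square cut-off. The synthetic data, the injected outliers, and the 0.999
cut-off are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2

# Synthetic stand-in data: 200 records with 3 numeric fields (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:3] += 8.0  # inject three obvious outliers for the demonstration

# Squared Mahalanobis distance of every record from the sample mean.
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under approximate normality, d2 follows a chi-square distribution with
# as many degrees of freedom as there are fields.
threshold = chi2.ppf(0.999, df=X.shape[1])
is_anomaly = d2 > threshold

print(f"True (anomalous): {is_anomaly.sum()}, False: {(~is_anomaly).sum()}")
```

As in the figure, a flag of this kind typically marks only a small share of records as True.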

3.2 Data Transformation
Data transformation means putting data into the proper form before analysis can be done on it. It
is a process that involves transforming, cleaning, and arranging data into a format that can be
used for analysis to assist decision-making. Data transformation was applied to four fields (years
in current role, years since last promotion, years with current manager, and number of companies
worked) using different binning methods, which are discussed in detail in the section below.

Binning is a technique that assigns values to groups in order to reduce the number of distinct
values. Binning helps minimize the impact of outliers or extremely high values on the model.
Because the numerical values may be converted to frequency-distributed bins using quantiles,
binning also helps reduce model bias. As a result, outliers have little influence on the model
after training (Press, 2022).

This method helps to simplify and reduce noise in large datasets and is especially useful when
dealing with data that has a wide range of values. After binning, the numerical values are replaced
with categorical values representing the corresponding bins. Binning is commonly used in data
exploration, visualization, and analysis, as well as in machine learning and modeling.

Continuous variable values are discretized into bins (groups or buckets) using the binning
(grouping or bucketing) method. From a modeling perspective, the binning strategy may be able to
address typical data issues, including handling missing values, the existence of outliers and
statistical noise, and data scaling. The binning process is also a helpful interpretable tool for
model simplification while enhancing understanding of the nonlinear relationship between a
variable and a given target. Data transformations can then be carried out using the resulting
bins. Binning methods are often used in machine learning applications (Navas-Palencia, 2020).

Fixed Width Binning: Unlike optimal binning algorithms, which seek to identify the best cut-points
or thresholds for segmenting a continuous variable into discrete bins by reducing information loss
and enhancing the predictive potential of the resulting bins, fixed width binning uses intervals
of equal size.

A continuous variable's range is divided into equal-sized intervals or bins, each of which has the
same width or range of values. This is known as equal width binning. Data points that fall within
each bin are aggregated or classified in accordance with the number of bins or bin width that has
been specified by the analyst or data scientist.
1. Num Companies Worked had 9 values before data binning.

2. After applying fixed width binning, the field is transformed into 2 values, as distributed
below.
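
The study performed this step in SPSS. As a minimal sketch of the same idea in Python, `pandas.cut`
with an integer `bins` argument splits the observed range into intervals of equal width. The sample
values and the labels "low" and "high" are hypothetical.

```python
import pandas as pd

# Hypothetical field with 9 distinct values (0 to 8), for illustration only.
num_companies = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 7, 8], name="NumCompaniesWorked")

# bins=2 divides the observed range [0, 8] into two intervals of equal width,
# so values up to 4 fall in the first bin and values above 4 in the second.
binned = pd.cut(num_companies, bins=2, labels=["low", "high"])

print(binned.value_counts().sort_index())
```

Because the cut-points depend only on the minimum and maximum, the bins can hold very different
numbers of records.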

Tile Binning
Tile binning in data transformation is a technique used to transform continuous numerical data into
categorical or discrete data by dividing the data range into fixed intervals or "tiles".
In tile binning, the data range is divided into a specified number of intervals or bins, and each data
point is then assigned to the corresponding bin based on its value. The bin boundaries can be either
evenly spaced or non-uniformly spaced based on domain knowledge or data distribution.

Tile binning can be useful for simplifying data and reducing the complexity of models or
algorithms that require discrete data inputs. It can also be helpful in data visualization and
exploratory data analysis by grouping continuous data into meaningful categories that can be more
easily understood and compared.

1. Years Since Last Promotion had 15 values before data binning.

2. After applying Tile Binning, the 15 values are transformed into 5 values, as shown below.
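
A minimal sketch of tile binning in Python follows. It assumes that tile binning is understood as
equal-count (quantile) binning, which is how the tiles option of the SPSS Modeler Binning node
behaves; `pandas.qcut` is used as a stand-in, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical field with 15 distinct values (0 to 14), for illustration only.
years_since_promotion = pd.Series(range(15), name="YearsSinceLastPromotion")

# q=5 requests five tiles (quintiles), each holding roughly the same
# number of records; the labels 1 to 5 identify the tile.
tiles = pd.qcut(years_since_promotion, q=5, labels=[1, 2, 3, 4, 5])

print(tiles.value_counts().sort_index())
```

In contrast to fixed width binning, the cut-points here follow the data distribution, so each tile
receives a similar number of records.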

Optimal binning: To decrease data noise or variability and, in some situations, potentially
enhance model performance, continuous variables are generally discretized into discrete bins or
intervals.

The Years in current role field is made up of 18 values and needs to be transformed through data
binning in SPSS. The results of applying optimal binning are shown below:

1. The Years in current role field had 18 values before data binning.

2. After the optimal binning method was applied, the data is transformed from 18 values to 2
values, with a detailed distribution of values as shown below.

3. After data binning, we can view the result of optimal binning. Reducing the field to 2 values
may also increase the prediction models' accuracy by lowering noise or nonlinearity.

1. The Years with current manager field had 17 values before data binning.

2. After the optimal binning method was applied, the data is transformed from 17 values to 2
values, with a detailed distribution of values as shown below.

3. After data binning, we can view the result of optimal binning on the Years with current manager
field. Reducing the field to 2 values may also increase the prediction models' accuracy by
lowering noise or nonlinearity.
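
The idea of a supervised cut-point can be sketched as follows. This is not the SPSS optimal binning
algorithm; as a stand-in, a depth-one decision tree from scikit-learn picks the single threshold
that best separates the target, which yields 2 bins. The data and the rule that generates the
target are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical field with 18 distinct values (0 to 17) and a made-up target
# in which only employees with at most 2 years in role left the company.
years_in_role = np.arange(0, 18)
left_company = (years_in_role <= 2).astype(int)

# A tree of depth 1 makes exactly one split, i.e. one supervised cut-point.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(years_in_role.reshape(-1, 1), left_company)

cut_point = tree.tree_.threshold[0]
binned = np.where(years_in_role <= cut_point, "bin_1", "bin_2")

print(f"cut-point: {cut_point}")
```

Unlike fixed width or tile binning, the cut-point here depends on the target variable, which is
what makes the binning supervised.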

PCA: Principal component analysis is a statistical method used to reduce a dataset's dimensions
while preserving as much of its original variance as feasible. It works by finding patterns in the
data and generating a new set of variables called principal components, which are linear
combinations of the original variables.

In general, Principal Component Analysis (PCA) aims to find a lower-dimensional representation
onto which high-dimensional data can be projected. It serves as a useful tool for reducing the
dimensions of multidimensional data while retaining a significant portion of the information. PCA
achieves this by evaluating the variance of each attribute, as attributes with high variance often
exhibit distinct differences between classes, leading to a reduction in dimensionality. By
employing a feature extraction technique, PCA retains the important variables while discarding
the less relevant ones.

Dimension Reduction is a machine learning (ML) or statistical technique for reducing the number
of random variables in a problem by obtaining a set of principal variables. This process may be
carried out using a variety of methods that simplify the modeling of complex problems, eliminate
redundancy, and reduce the chance of the model overfitting (Contributor, 2018).

Figure 3.2.1 Employee attrition dataset’s Principal Component Analysis.

Because Principal Component Analysis (PCA) helps with dimension reduction and indicates which
components are important to use, we can determine that the components up to component 11 have
high significance and are recommended for use.

Figure 3.2.2 PCA values implemented in the fields of employee attrition dataset.

After applying Principal Component Analysis, the 11 PCA factors are added to the dataset as fields
that will be included in the modeling phase.
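
A minimal sketch of this step with scikit-learn is shown below. The study used SPSS, so this is an
illustration rather than the procedure actually followed. The input matrix is synthetic, and only
the choice of 11 components comes from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric fields of the dataset (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))

# PCA is sensitive to scale, so the fields are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Keep 11 components, matching the number retained in the study.
pca = PCA(n_components=11)
components = pca.fit_transform(X_scaled)

# Share of the total variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Variance retained by 11 components: {cumulative[-1]:.1%}")
```

The columns of `components` play the role of the PCA factor fields that are appended to the dataset
for modeling.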

Chapter 4 - Project Analysis
Feature selection is a technique utilized in machine learning and data analysis to choose a subset
of pertinent characteristics or variables from a larger set present in a dataset. Its primary objective
is to isolate the most critical and enlightening features while eliminating irrelevant, redundant, and
noisy ones to obtain the highest performing subset of the original features without modification.
The selection of features is significant for various reasons, including reducing the dataset's
dimensionality, enhancing model interpretability, decreasing overfitting, and improving model
performance and accuracy by concentrating on the most informative qualities. The selection
process should consider the dataset's specific properties and modeling objectives, and the impact
of feature selection on model performance should be evaluated to ensure an effective strategy is
adopted.

Figure 4.1.1 Employee Attrition Feature selection

This figure shows the 39 fields considered most important in affecting attrition. A value of 1
indicates the fields with the strongest influence on attrition, and fields with values slightly
below 1 also had a significant impact on the attrition rate, as shown in the figure. In addition,
a value below 0.95 indicates that a field is marginal and did not have a great impact on
attrition, such as Employee Number.
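
One way such importance values can be obtained is sketched below, under the assumption that
importance is defined as 1 minus the p-value of a chi-square test between each field and the
target, which is how the SPSS Modeler Feature Selection node is documented to rank categorical
fields. The small dataset is hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: OverTime is strongly related to Attrition, Gender is not.
df = pd.DataFrame({
    "Attrition": ["Yes"] * 30 + ["No"] * 10 + ["Yes"] * 5 + ["No"] * 55,
    "OverTime": ["Yes"] * 40 + ["No"] * 60,
    "Gender": ["M", "F"] * 50,
})

importance = {}
for field in ["OverTime", "Gender"]:
    table = pd.crosstab(df[field], df["Attrition"])
    p_value = chi2_contingency(table)[1]
    importance[field] = 1.0 - p_value  # assumed definition of importance

for field, value in importance.items():
    label = "important" if value > 0.95 else "marginal or unimportant"
    print(f"{field}: importance = {value:.3f} ({label})")
```

With this definition, a value close to 1 corresponds to a very small p-value, that is, a strong
association with attrition.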

Model evaluation is an important stage in data science that enables us to assess how effectively a
machine learning model works on new, unseen data. Depending on the nature of the problem and the
data, a variety of assessment measures can be applied.

Here are a few typical data science assessment metrics:

Accuracy: This standard metric for classification issues assesses the proportion of accurate
predictions the model makes.

Precision: Out of all positive predictions made by the model, this statistic calculates the percentage
of true positive predictions (i.e., cases accurately labeled as positive).

Recall: Out of all real positive examples in the data, this statistic counts the percentage of genuine
positive forecasts.

F1 score: This is the harmonic mean of precision and recall, which offers a single statistic that
evenly balances both measurements.

ROC curve and AUC: They provide a visual representation of the trade-off between sensitivity
and specificity at various classification thresholds and are used to assess binary classification
algorithms.

Mean squared error (MSE): This metric evaluates the average squared difference between the
predicted and actual values and is frequently used for regression problems.

R-squared: this metric quantifies the percentage of variability in the dependent variable that can
be attributed to the independent variables included in the model.

Overall, the choice of evaluation metrics depends on the specific problem and the goals of the
analysis, and it is important to choose the right metric(s) to ensure that the model performs well
on new, unseen data.
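
The classification metrics listed above can be computed directly from the counts of true and false
positives and negatives. The following sketch uses a small set of hypothetical labels, with 1
standing for attrition; the function name is introduced here for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example the accuracy is 0.75, while precision, recall, and F1 are each 2/3, which shows
that accuracy alone can look better than the model's performance on the positive class.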

Implemented Binary Classification Assessment (ROC & AUC)

The graphical representation of a binary classification model's performance is commonly known
as the "ROC," or receiver operating characteristic. This statistic is widely used in machine learning
and data analysis.

The ROC curve displays the true positive rate (TPR) and false positive rate (FPR) at various
classification thresholds. The TPR measures the percentage of actual positive instances correctly
classified as positive, while the FPR measures the percentage of actual negative cases misclassified
as positive.

The ROC curve plots TPR against FPR, with each point representing a different classification
threshold. The ideal ROC curve hugs the upper left corner of the graph, indicating high TPR and
low FPR across all possible thresholds.

One widely used metric for evaluating the effectiveness of a binary classification model is the area
under the ROC curve (AUC). The AUC ranges from 0 to 1, with a value of 1 indicating perfect
classification and a value of 0.5 indicating random guessing.

In conclusion, the ROC curve and AUC are valuable tools for assessing a binary classification
model's performance and comparing it with other models. They can help in understanding the
trade-offs between sensitivity (TPR) and specificity (1-FPR) and determining the optimal
classification threshold for the model.
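
As a small illustration of these definitions, the sketch below computes the ROC points and the AUC
with scikit-learn for four hypothetical records. The labels and scores are made up for the example.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels (1 = attrition) and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR = {f:.2f}, TPR = {t:.2f}")
print(f"AUC = {auc:.2f}")
```

The AUC here is 0.75: of the four possible pairs of one positive and one negative record, the
positive record receives the higher score in three.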

4.1 Modeling

To determine which model had the highest accuracy and AUC, various modeling techniques, such
as LSVM, Logistic Regression, Neural Network, Random Trees, and more, were tested on
different numbers of fields. The results of these tests, along with the details of the different
scenarios, will be presented below.

Four different scenarios were tested. In Scenario 1, modeling was performed using 33 features that
were selected based on PCA. In Scenario 2, nine fields were chosen based on feature selection
from the top 10 randomly selected features. In Scenario 3, 38 fields were selected using a feature
selection algorithm. Finally, in Scenario 4, 18 fields were chosen randomly.

Scenario 1: The first model, built with LSVM, included 33 fields and achieved an accuracy of
93.27% and an AUC of 0.95.

Scenario 2: The second model, built with Random Trees, included 9 fields and achieved an accuracy
of 86.09% and an AUC of 0.848.

Scenario 3: The third model, built with LSVM, included 38 fields after feature selection and
achieved an accuracy of 95.06% and an AUC of 0.97.

Scenario 4: The fourth model, built with Random Trees, included 18 fields and achieved an accuracy
of 85.42% and an AUC of 0.76.

Chosen model
The LSVM model implemented in scenario three, which exhibits the highest accuracy, will be
utilized throughout this study.

Scenario     Model Name       Overall Accuracy (%)   No. of fields
Scenario 1   LSVM 1           93.27                  33 fields
Scenario 2   Random Trees 1   86.09                  9 fields
Scenario 3   LSVM 1           95.06                  38 fields
Scenario 4   Random Trees 1   85.46                  18 fields

Table 4.1.2 Model Scenarios Summary
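
A comparison of this kind can be sketched in Python as follows. The study built its models in
SPSS, so scikit-learn's `LinearSVC` and `RandomForestClassifier` are used here only as approximate
counterparts of LSVM and Random Trees, and the data is synthetic. The numbers this sketch produces
are unrelated to the results in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced binary data standing in for the attrition dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Linear SVM": make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000)),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs a continuous score: class probabilities if available,
    # otherwise the signed distance to the decision boundary.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, scores),
    }

for name, r in results.items():
    print(f"{name}: accuracy = {r['accuracy']:.3f}, AUC = {r['auc']:.3f}")
```

Evaluating every model on the same held-out split keeps the accuracy and AUC values comparable
across scenarios.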

Chapter 5 - Conclusion
5.1 Conclusion
The aim of this research was to predict employee attrition using machine learning methods.
Various techniques, including Logistic Regression, Neural Network, LSVM, Random Trees, and
XGBoost Linear, were applied to analyze the data. In scenario 1, LSVM was utilized, while Random
Trees was the optimal model in scenario 2. In scenario 3, LSVM was employed with different
fields, and in scenario 4, Random Trees was utilized. The LSVM model in scenario 3 was found
to be the most accurate in forecasting employee attrition. The performance of each model was
evaluated, and the results showed that LSVM was the most effective model. The study uncovered
that overtime, job role, years in current role, years with current manager, stock option level,
marital status, total working hours, and years at the company were the most critical factors in
forecasting employee attrition. LSVM in scenario 3 outperformed the other models, with an accuracy
rate of around 95%, indicating its potential as an effective tool for predicting employee attrition.

The practical implications of this study are significant for organizations seeking to enhance their
employee retention strategies. By identifying the factors that contribute to employee attrition,
organizations can take proactive measures to address them. The research recommends that
addressing employee satisfaction, job involvement, and work-life balance could be effective
approaches to reducing employee attrition rates.

However, it is important to note that there were limitations to this study. We did not have
ownership of the data used in this study, which may have limited the analysis that we were able to
perform. Furthermore, while we identified several factors that contributed to employee attrition,
there may be other variables that we did not consider due to the nature of the data.

In conclusion, this study demonstrates the potential of machine learning techniques, specifically
LSVM, for predicting employee attrition. Our findings provide valuable insights that can inform
the development of employee retention strategies. However, it is important to consider the
limitations of this study and the need for further research to validate our findings.

5.2 Recommendations
To enhance the analysis of employee attrition, consider implementing the following strategies:

1. Ensure high-quality data: Verify the accuracy and completeness of the dataset through data
cleaning and validation techniques.

2. Include additional variables: Incorporate variables such as employee satisfaction, job
engagement, and work-life balance to identify additional factors that could impact attrition rates.

3. Analyze subgroups: Examine attrition rates for different subgroups of employees, such as
different departments or job positions, to identify patterns or trends.

4. Conduct a cost-benefit analysis: Evaluate the potential costs of implementing attrition
prevention strategies compared to the cost of employee turnover to inform decision-making.

5.3 Future Work

1. Consider external factors: Identify external factors, such as economic changes or industry
trends, that could affect attrition rates and include them in the analysis.

2. Investigate causes of attrition: Conduct surveys or interviews with former employees to
understand the reasons for their departure, which can be used to develop targeted prevention
strategies.

Bibliography

1. BasuMallick, C. (2021, March 11). What is employee attrition? definition, attrition rate, factors, and reduction
best practices. Spiceworks It Security. Retrieved December 7, 2022, from
https://www.spiceworks.com/hr/engagement-retention/articles/what-is-attrition-complete-guide/

2. Bhardwaj, Shikha & Singh, Ashutosh. (2017). Factors affecting employee attrition among engineers and
non-engineers in manufacturing industry. Business & IT. VII. 26-34. 10.14311/bit.2017.02.04.

3. Chaid. Statistics Solutions. (2021, June 30). Retrieved December 17, 2022, from
https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/chaid/

4. Charaba, C. (2022, June 28). Employee retention: The real cost of losing an employee. Personalized Benefits
Administration Software. Retrieved December 2, 2022, from
https://www.peoplekeep.com/blog/employee-retention-the-real-cost-of-losing-an-employee

5. Contributors, T. F. (2020, May 14). The Ultimate Guide on Employee attrition. Techfunnel. Retrieved
November 22, 2022, from
https://www.techfunnel.com/hr-tech/employee-attrition/#:~:text=To%20define%20attrition%20in%20simple%20terms%2C%20it%20is,is%20not%20conducive%2C%20absence%20of%20professional%20growth%2C%20etc.

6. Contributor, T. T. (2018, November 17). What is dimensionality reduction?: Definition from TechTarget.
WhatIs.com. Retrieved April 27, 2023, from
https://www.techtarget.com/whatis/definition/dimensionality-reduction#:~:text=Dimensionality%20reduction%20is%20a%20machine,a%20set%20of%20principal%20variables.

7. E. G. P. Wijayarathna, V. M. I. Senevirathna, and G. S. Walgampaya, "Predicting employee attrition using
machine learning techniques," in 2020 IEEE 10th International Conference on Information and Automation for
Sustainability (ICIAfS), 2020, pp. 1-6. doi: 10.1109/ICIAFS49334.2020.9359422

8. Feature selection: A review and comparative study - e3s-conferences.org. (n.d.). Retrieved April 27, 2023,
from https://www.e3s-conferences.org/articles/e3sconf/pdf/2022/18/e3sconf_icies2022_01046.pdf

9. Hayes, A. (2023, January 13). Chi-Square (Χ2) statistic: What it is, examples, how and when to use the test.
Investopedia. Retrieved April 3, 2023, from
https://www.investopedia.com/terms/c/chi-square-statistic.asp

10. Hong, W. (2006). A Comparative Test of Two Employee Turnover Prediction Models. The International
Journal of Management, 24, 216.

11. Implementation of decision tree using C5.0 algorithm in preference and ... (n.d.). Retrieved December 17,
2022, from https://iopscience.iop.org/article/10.1088/1742-6596/1882/1/012132

12. Karamizadeh, Sasan & Abdullah, Shahidan & Manaf, Azizah & Zamani, Mazdak & Hooman, Alireza.
(2013). An Overview of Principal Component Analysis. Journal of Signal and Information Processing.
10.4236/jsip.2013.43B031.

13. Lavanya, D. B. L. (2017). A Study on Employee Attrition: Inevitable yet Manageable. Retrieved December
7, 2022, from https://www.ijbmi.org/papers/Vol(6)9/Version-1/F0609013850.pdf

14. Liao, C. (2023). Employee turnover prediction using machine learning models. International Conference on
Mechatronics Engineering and Artificial Intelligence.

15. Marais, A. (2022, August 28). Employee attrition: Types, rate & reduction practices. Empuls. Retrieved
December 14, 2022, from https://blog.empuls.io/employee-attrition/

16. M. P. Debono, “Are organisations doing enough to retain their talent?” The Importance of Employee
Retention, 2018.

17. Navas-Palencia, Guillermo. (2020). Optimal binning: mathematical programming formulation.

18. N. Bhardwaj, S. Suri, and S. Jain, "Prediction of employee attrition using machine learning techniques," in
2018 International Conference on Advances in Computing, Communication Control and Networking
(ICACCCN), 2018, pp. 276-280. doi: 10.1109/ICACCCN.2018.8685379

19. (PDF) data preparation - researchgate. (n.d.). Retrieved April 15, 2023, from
https://www.researchgate.net/publication/316113863_Data_Preparation

20. P. G. Pujar and P. J. Jadhav, "A predictive approach to employee attrition using machine learning techniques,"
in 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), 2018,
pp. 208-212. doi: 10.1109/ICIRCA.2018.8863859

21. Press, S. (2022, December 9). SAP data preparation: Normalization and Binning. Learn SAP from the
Experts. Retrieved April 11, 2023, from
https://blog.sap-press.com/data-preparation-in-sap-normalization-and-binning

22. Principal component analysis - javatpoint. (n.d.). Retrieved April 12, 2023, from
https://www.javatpoint.com/principal-component-analysis

23. Qutub, Aseel & Al-Mehmadi, Asmaa & Al-Hssan, Munirah & Aljohani, Ruyan & Alghamdi, Hanan. (2021).
Prediction of Employee Attrition Using Machine Learning and Ensemble Methods. International Journal of
Machine Learning and Computing. 11. 110-114. 10.18178/ijmlc.2021.11.2.1022.

24. Sekaran, Karthik & Sundaramurthy, Shanmugam. (2022). Interpreting the Factors of Employee Attrition
using Explainable AI. 10.1109/DASA54658.2022.9765067.

25. Salil Choudhary. Academia.edu. (n.d.). Retrieved November 21, 2022, from
https://independent.academia.edu/ChoudharySalil

26. Stephanie. (2020, September 22). Mahalanobis distance: Simple definition, examples. Statistics How To.
Retrieved April 4, 2023, from https://www.statisticshowto.com/mahalanobis-distance/

27. SPSS software. IBM. (n.d.). Retrieved December 16, 2022, from https://www.ibm.com/sa-en/spss

28. R. T. Md Zabirul Islam, M. A. Hossain, and M. R. Islam, "A machine learning approach to predicting
employee attrition," in 2019 4th International Conference on Electrical and Electronics Engineering (ICEEE),
2019, pp. 1-5. doi: 10.1109/ICEEE2019.8724233

29. Team, G. L. (2023, January 16). 4 types of data - nominal, ordinal, discrete and continuous. Great Learning
Blog: Free Resources what Matters to shape your Career! Retrieved March 8, 2023, from
https://www.mygreatlearning.com/blog/types-of-data/

30. Using the CRISP-DM framework for data driven projects. Coforge. (n.d.). Retrieved November 7, 2022, from
https://www.coforge.com/blog/using-the-crisp-dm-framework-for-data-driven-projects

31. What is logistic regression? Statistics Solutions. (2022, June 14). Retrieved December 15, 2022, from
https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/

32. What is attrition in hr? HiBob. (2022, November 14). Retrieved December 7, 2022, from
https://www.hibob.com/hr-glossary/attrition/

33. Williammorath. (2021, May 24). Employee attrition eda walkthrough. Kaggle. Retrieved November 21, 2022,
from https://www.kaggle.com/code/williammorath/employee-attrition-eda-walkthrough

34. Y. Fadhloun, N. Hmina, and N. Rami, "Machine learning approach for employee attrition prediction," in 2020
International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2020, pp. 1-5.
doi: 10.1109/ATSIP48719.2020.9263429

35. Zhao, Y., Hryniewicki, M.K., Cheng, F., Fu, B., & Zhu, X. (2018). Employee Turnover Prediction with
Machine Learning: A Reliable Approach. Intelligent Systems with Applications.
