
BRIDGING THE GAP IN TECH VACANCIES IN SINGAPORE

Case Study on Glassdoor Salary Estimate and Company Reviews for the Singapore Market

CS 610: Applied Machine Learning-G2

Group 1:
Joshua Tan Kai Wen
Koh Chin Weng
Muhamad Ameer Noor
Sruthi Basani
Victoria Anne Neo Li Xian
Table of Contents

I. Business Problems
II. Approach
III. Data
IV. Exploratory Data Analysis
V. Model & Analysis
Salary Prediction Modelling
Model 1 - Ridge Regression
Model 2 - Lasso Regression
Model 3 - Gradient Booster
Model 4 - Random Forest
Model 5 - Random Forest - Text Only
Topic Modelling
Latent Dirichlet Allocation (LDA)
BERTopic
LDA vs BERTopic
BERTopic – General
BERTopic – Data roles
BERTopic – Developer roles
BERTopic – Leadership roles
BERTopic – Engineering roles
VI. Conclusion and Limitations
Conclusion
Limitations
References
Appendix
Table of Figures
Figure 1 Salary Prediction Models Summary
Figure 2 Metadata
Figure 3 Salary Prediction Models Summary
Figure 4 salary estimate distribution
Figure 5 skills which pay the most
Figure 6 job education vs salary distribution
Figure 7 salary distribution for each role
Figure 8 work experience vs estimate salary
Figure 9 seniority title vs salary estimate
Figure 10 company size vs salary
Figure 11 company sector vs salary
Figure 12 scatter plot of company rating vs the salary
Figure 13 Feature Importance of Ridge Regression Model
Figure 14 Highest Coefficients of Ridge Regression Model
Figure 15 Feature Importance of Lasso Regression Model
Figure 16 Highest Coefficients of Lasso Regression Model
Figure 17 Feature Performance of Gradient Boosting Regressor
Figure 18 Feature Performance of Random Forest Regressor
Figure 19 SHAP Value of Random Forest Regressor
Figure 20 LDA Topic 1
Figure 21 LDA Topic 2
Figure 22 BERTopic Topic(s) 4 and 0
I. Business Problems
A study by Self et al. (2022) showed that, despite unanimously positive reviews about an organization, a
lower advertised salary led to lower job pursuit intention (JPI) and perception of fit. The study also
found that salary has a higher influence on job seekers’ perceptions than online employee-generated
reviews.
As businesses embrace new cutting-edge technologies, a skills mismatch appears inevitable, not just in
Singapore, the focus of this study, but globally, given the shortage of the necessary skills (Tay, 2023).
The Ministry of Manpower’s job vacancies report for 2022 also highlighted a robust demand for skilled
workers in tech (Manpower Research & Statistics Department, 2023). To meet the demand for these skills,
employers are increasingly seeking talent who can perform different roles, and are offering higher
salaries to attract them.
Glassdoor, with its mission to help people find jobs, is a widely used platform for writing and reading
reviews about companies. It describes itself as “a thriving community for workplace conversations,
driven by a simple mission: helping people everywhere find jobs and companies they love”
(Glassdoor, 2023), and it attracts 55 million users per month globally (Smith, 2023).
Salary has also been shown to be one of the key influences on an individual’s employer ratings (employer
branding) in Glassdoor reviews, with individuals giving better ratings the more they are compensated
(Hammami et al., 2020). However, salary is not the only determining factor when looking for a job, as
non-monetary incentives and the cultural aspects of the employer matter as well.
Our study therefore aims to assist individuals in their next job hunt by helping them predict their salary
based on information from job descriptions and from the companies that listed them on Glassdoor.
In addition, because the tech job market is increasingly competitive for talent, the study aims to help
employers identify key words via word clouds that show how to improve their employer branding and
attract job seekers from a non-monetary perspective.

II. Approach
To address the first business problem, we built prediction models to estimate salary. Two regression
models and two tree-based models were built on numerical and categorical data extracted from job
descriptions. For the regression models, we chose Lasso and Ridge regression to mitigate the curse of
dimensionality, as our data contains several categorical features that expand to thousands of columns
when encoded. The tree-based models were chosen because salaries might have outliers and non-linear
relationships with certain features (e.g., salary might grow exponentially with seniority). All of the
models were tuned using Bayesian Optimization, which is more efficient than alternatives such as random
search because it uses previous results to decide where to move next in the hyperparameter space
(Nguyen, 2019; Turner, et al., 2021). Additionally, we explored a third tree-based model that used the
raw text of the job descriptions, to discover any potentially unextracted features. The results are
intended to help employees better estimate their salary. Should the model be deployed on a suitable
platform, it could help users decide which skills are worth pursuing to improve their market worth and
help job seekers negotiate the salary they deserve.
To uncover insights into what employees seek in employers, we employed Latent Dirichlet Allocation (LDA)
and BERTopic. LDA is a probabilistic generative model: it assumes that each document is a mixture of
various topics and that each word within a document is generated from one of those topics, then works
backward from the observed documents to infer the topics and their word probabilities. BERTopic, on the
other hand, leverages pre-trained BERT (Bidirectional Encoder Representations from Transformers) models
to obtain vector representations of the text data. With the text transformed into numerical
representations, BERTopic clusters similar documents using the HDBSCAN algorithm, which groups together
documents that share similar characteristics or topics. These results allow us to identify common themes,
keywords, or topics in the text data, providing valuable information about the preferences and
priorities of job seekers that companies can use to improve their employer branding.
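
The mixture assumption behind LDA can be written compactly in standard textbook notation. The symbols below are introduced here for illustration and do not appear in the project itself: \theta_{d,k} is the share of topic k in document d, \phi_{k,w} is the probability of word w under topic k, and K is the number of topics.

```latex
\[
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
\;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
\]
```

Fitting LDA amounts to estimating \theta and \phi from the observed word counts, which is the "working backward" step described above.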

III. Data
Since no readily available data set covers the salaries of people working in tech, or employee reviews
specifically for tech employees, in Singapore, we scraped the data from Glassdoor.

Figure 1 Salary Prediction Models Summary

The metadata of the dataset that we use in this project are as follows.

Figure 2 Metadata

Data cleaning: The data scraped from Glassdoor was not as clean as typical Kaggle datasets, so we
performed two cleaning steps. First, the XPath used for the company column also captured the company’s
star rating, so we separated the two. Second, the majority of job titles contained extra words beyond the
title itself, such as the team’s name, a year, salary information, or phrases like ‘urgently hiring’ and
‘looking for’. We therefore created a new column that holds only the job title and groups similar roles
together; for example, data science specialist and data scientist are both labelled data scientist.
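
The title clean-up could look roughly like the sketch below. The noise patterns and the synonym map are illustrative assumptions; the report does not list the rules that were actually used.

```python
import re

# Hypothetical noise patterns; the project's real rules are not given in the report.
NOISE_PATTERNS = [
    r"\$?\d[\d,]*k?\s*-\s*\$?\d[\d,]*k?",  # salary ranges such as "5k - 7k"
    r"\b(19|20)\d{2}\b",                   # year mentions such as "2023"
    r"\burgently hiring\b",
    r"\blooking for\b",
]

# Hypothetical synonym map that groups similar roles under one label.
TITLE_MAP = {"data science specialist": "data scientist"}


def normalize_title(raw_title: str) -> str:
    """Strip noise from a scraped job title and map it to a canonical role."""
    title = raw_title.lower()
    for pattern in NOISE_PATTERNS:
        title = re.sub(pattern, " ", title)
    # Drop remaining punctuation and digits, then collapse whitespace.
    title = re.sub(r"[^a-z\s]", " ", title)
    title = " ".join(title.split())
    return TITLE_MAP.get(title, title)
```

For example, `normalize_title("Data Science Specialist (Urgently Hiring) 5k - 7k")` yields `"data scientist"`. Note that dropping every non-letter character would also mangle titles such as "C++ developer", so a real rule set would need exceptions.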
Missing Data Pattern: While scraping the data, we found fewer than 50 rows where the estimate_salary
value was not captured. We dropped these rows, since they had no salary information.
Feature extraction: We believed that the raw fields scraped from Glassdoor might not be predictive
enough, so we extracted the following features from the job description and job title columns:
• job_education: the education qualification required for the job
• job_experience: years of work experience required for the job
• seniority_title: the seniority title associated with the job, such as junior or senior
• skill set: a list of skills, tools, cloud technologies, programming languages, and certifications
extracted from the job descriptions and represented as binary columns in the final data frame
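
A minimal sketch of this kind of extraction is shown below, assuming simple keyword and regular-expression matching. The vocabularies and the function name are hypothetical; the report does not describe the exact extraction logic.

```python
import re

# Hypothetical vocabularies; the project's actual skill and seniority lists are not given.
SKILLS = ["python", "sql", "java", "scala", "aws"]
SENIORITY = ["junior", "senior", "lead", "principal"]


def extract_features(job_title: str, job_description: str) -> dict:
    """Derive job_experience, seniority_title and binary skill flags from raw text."""
    title = job_title.lower()
    text = job_description.lower()
    features = {}

    # job_experience: first "<n> years" or "<n>+ years" mention, else None.
    match = re.search(r"(\d+)\s*\+?\s*years?", text)
    features["job_experience"] = int(match.group(1)) if match else None

    # seniority_title: first seniority keyword found in the title, else None.
    features["seniority_title"] = next(
        (level for level in SENIORITY if re.search(rf"\b{level}\b", title)), None
    )

    # One binary column per skill, matched as a whole token so "java" does not hit "javascript".
    for skill in SKILLS:
        pattern = rf"(?<![a-z0-9]){re.escape(skill)}(?![a-z0-9])"
        features[f"skill_{skill}"] = int(re.search(pattern, text) is not None)
    return features
```

Matching whole tokens matters here: with a plain substring check, a description that only mentions JavaScript would wrongly set the Java flag.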

IV. Exploratory Data Analysis


We did an exploratory analysis to check the relationship between our dependent variable (salary) and some
of our key independent variables. Figure 7 (Appendix) shows that job titles with notably higher average
salaries include Solution Architect, Analytics Manager, and Quantitative Analyst. Figure 6 (Appendix)
shows that higher education generally relates to higher salary; nevertheless, MBA and Doctorate
requirements in tech jobs notably have a lower average salary than Masters. Figure 11 (Appendix) shows
that tech employees in the finance sector earn the most, while those in the customer services sector are
the least paid. The word cloud in Figure 5 (Appendix) shows that programming language skills such as
Python, SQL, Java, and Scala, along with cloud and big data skills, are rewarded more than others.
Finally, Figure 12 (Appendix) shows that companies with a good star rating (4 and above) tend to pay well.

V. Model & Analysis


Salary Prediction Modelling
Model                     R2      RMSE    Features
Ridge Regression          0.663   31677   Numeric & Categorical
Lasso Regression          0.651   32564   Numeric & Categorical
Gradient Boosting         0.618   34027   Numeric & Categorical
Random Forest Regressor   0.567   36261   Numeric & Categorical
Random Forest Regressor   0.474   38696   Text
Figure 3 Salary Prediction Models Summary
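
For readers unfamiliar with the two metrics in the summary table, the sketch below shows how R2 and RMSE are conventionally computed. The function names are chosen here for illustration, and the numbers in the usage note are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in salary units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

With hypothetical salaries `[60000, 90000, 120000]` and predictions `[70000, 90000, 110000]`, `r_squared` is 8/9 and `rmse` is about 8165. A higher R2 and a lower RMSE both indicate a better fit.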

In our salary prediction project, we employed five different models. The models were chosen for their
ability to handle the numerical and categorical features extracted from job descriptions, or to handle
text data directly. We aimed to give individuals from diverse backgrounds a comprehensive understanding
of their potential salary estimates.
Based on the overall results, the two best-performing models were Ridge Regression and Lasso Regression.
We were aware of the risk posed by the curse of dimensionality, since some of the categorical features
have numerous classes, resulting in about 2,000 total dimensions after encoding. These models were
selected specifically to address the challenge of high dimensionality and avoid overfitting. By
penalizing less important features, we observed improvements in model performance. Although this
approach slightly compromised our initial business case, since many potential features are shrunk or
ignored, it enhanced the models' overall performance.
Overall, our analysis demonstrated that the choice of model and feature engineering techniques greatly
influenced the predictive performance for salary estimation. By leveraging various regression and
ensemble models, we were able to uncover meaningful insights regarding the factors that contribute to
salary variations. The most interesting finding was that skill sets such as cloud-related skills, Flink,
Zookeeping, Knime, Talend, D3 Java, and Jupyter Notebook can boost a job seeker’s salary estimate. The
feature importances, SHAP values, and coefficients are shown in Figures 13 to 19 in the Appendix.

Model 1 - Ridge Regression:


Ridge Regression is a modified linear regression model that uses an L2 penalty (the sum of squares of the
weights). This model was chosen for its ability to handle correlated predictors and improve
generalization, as shown in studies such as Cule & De Iorio (2013) and Schreiber-Gregory (2018). It was
tuned using Bayesian Optimization with 10 iterations, which resulted in an alpha value of 5.14814. The
Ridge Regression model achieved an R2 of 66.32% and an RMSE of 31,677.22.
Feature importance analysis revealed that experience, education, and seniority title had the highest
influence on salary. Interestingly, the Apache Flink skill also appeared among the top features.
Examining the coefficients yields further insight: cloud-related skills, Talend, D3 Java, and
Jupyter Notebook had the highest positive coefficients.

Model 2 - Lasso Regression:


Similar to Ridge, Lasso Regression modifies the linear regression model by introducing a penalty, which in
this case is L1 (the sum of absolute values of the weights). It was tuned using Bayesian Optimization,
which resulted in an alpha value of 9.4498. The model yielded an R2 of 65.07% and an RMSE of 32,564.95.
Notably, the model indicated that skills such as Zookeeping and Knime were worth acquiring, and that
cloud-related skills and company ratings had a positive influence on salary.
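
For reference, the penalised objectives behind Models 1 and 2 can be written side by side in standard textbook form. The notation is introduced here for illustration: y_i is the salary of listing i, x_i its encoded feature vector, \beta the weights, and \alpha the tuned penalty strength. Libraries differ in how they scale the squared-error term, so the reported alpha values are only meaningful relative to the implementation that was used.

```latex
\[
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta_0,\,\beta}\;
\sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \alpha \sum_{j=1}^{p} \beta_j^{2}
\]
\[
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta_0,\,\beta}\;
\sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
\]
```

The L2 penalty shrinks all weights smoothly toward zero, whereas the L1 penalty can set some weights exactly to zero, which is why Lasso also acts as a feature selector on the roughly 2,000 encoded columns.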

Model 3 - Gradient Booster:


We employed the Gradient Boosting model, which is an ensemble learning method based on decision tree
regressors with an in-built correction mechanism. It iteratively builds decision trees and corrects errors
made by previous trees, resulting in a stronger predictive model (Shi, et al., 2022). The model was tuned
using Bayesian Optimization, yielding an R2 of 61.77% and an RMSE of 34,027.43. Feature importance
analysis revealed that company, job experience, education level, and seniority were the most influential
factors in determining salary.
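
The iterative error-correction idea can be illustrated with a toy version that boosts one-split trees (stumps) on a single numeric feature. This is a teaching sketch under simplified assumptions, not the library implementation used in the project; all names and the sample data in the usage note are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split on xs that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        # Each new stump targets the errors left by the model so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)
```

Fitting this on made-up data such as years of experience `[0, 1, 2, 3, 5, 8, 10]` against salaries `[40, 45, 52, 60, 80, 110, 130]` (in thousands) gives a lower training error than always predicting the mean salary, because every round removes part of the remaining residual.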

Model 4 - Random Forest:
We also utilized the Random Forest model, which aggregates the predictions of multiple decision trees to
create a robust regression model. By reducing variance and overfitting, this ensemble approach improves
accuracy and generalization capabilities. Tuned with Bayesian Optimization, the Random Forest model
achieved an R2 of 56.68% and an RMSE of 36,261.91. Feature importance analysis highlighted the impact
of job experience, education, and seniority on salary, with negative influences observed for lower
experience and education levels.

Model 5 - Random Forest - Text Only:


To explore whether important features remained unextracted from the original job description text, we
trained a Random Forest model exclusively on the text. The text was preprocessed using tokenization and
count vectorization. This approach resulted in an R2 of 47.44% and an RMSE of 38,696. The performance
was lower than that of the other models, which suggests that the extracted features capture more of what
affects salary than the raw text does.
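
Tokenization followed by count vectorization turns each job description into a row of word counts. The sketch below shows the idea in plain Python; the tokenizer rule and function names are assumptions, and the project most likely relied on a library vectorizer instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens (keeps + and # for C++, C#)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def count_vectorize(documents):
    """Return the sorted vocabulary and one row of token counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[index[token]] = count
        matrix.append(row)
    return vocabulary, matrix
```

Each column of the resulting matrix becomes one input feature for the text-only model, so the feature space grows with the vocabulary rather than with a curated skill list.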

Topic Modelling
Today's businesses operate in a fast-paced environment, and employers are continuously working to find
and keep top personnel. Glassdoor, a website where workers can anonymously share their thoughts
and recommendations about their employers, is one useful tool that offers an inside glimpse into the
employee experience. Companies can learn important information about their overall performance and
pinpoint areas for development by carefully studying these reviews. We scraped review data for the
top 50 paying companies in Singapore's tech industry. Data was collected methodically
across a variety of positions and divided into four main categories: data, developer, leadership, and
engineer. This categorisation made it possible to conduct a targeted investigation and gather distinct
viewpoints from various organisational job functions.
A surprising result from the analysis was the use of the term "good" in the "cons" section of reviews
with both 1-star and 5-star ratings. This could seem counterintuitive at first. On closer inspection, it
was discovered that the term "good" was frequently used in an ironic way in the cons section. This
finding demonstrated the necessity of going beyond a preliminary study to gain a deeper understanding
of employee reviews.
Topic modelling methods were used to obtain more detailed information. The evaluations were grouped
into several areas by using natural language processing techniques, revealing underlying trends that might
not be immediately obvious. Topic modelling is an unsupervised method because it does not rely on labels
or categories that have already been established. Instead, it makes use of the text's inherent structure
and patterns to spot and study word clusters that constitute coherent topics. The hidden themes and
subjects that arise from the data are revealed by topic modelling, which examines the word co-occurrence
and word patterns across texts. We will be evaluating Latent Dirichlet Allocation and BERTopic with regard
to topic modelling.

Latent Dirichlet Allocation (LDA)


Latent Dirichlet Allocation (LDA) is a popular unsupervised topic modelling method. The LDA embedding
space is a probabilistic space where each document is represented by a distribution over topics and each
topic is represented by a distribution over words. LDA embeddings capture the underlying topics in the
text data and provide a way to measure the relevance or similarity between documents based on their
topic distributions. Documents with similar topics are expected to have similar embeddings.
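
The idea that documents with similar topic mixtures sit close together can be illustrated with a distance between topic distributions. The sketch below uses the Hellinger distance, which is one common choice for comparing probability distributions; the report does not say which measure, if any, was used, and the three review distributions in the usage note are invented.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic distributions (0 = identical, 1 = disjoint)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)
```

For hypothetical reviews with topic mixtures `[0.8, 0.1, 0.1]`, `[0.7, 0.2, 0.1]`, and `[0.1, 0.1, 0.8]`, the first two are much closer to each other than either is to the third, matching the intuition that they discuss the same dominant topic.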

For example, Figure 20 (Appendix) shows the representation of Topic 1, which consists of a mixture of
words such as "good," "life," "balance," "culture," and "pay." This topic can be interpreted as
describing a good working environment that encompasses various aspects related to work-life balance,
organizational culture, and compensation. Similarly, in Figure 21 (Appendix), Topic 2 comprises a mixture
of different words like "learn," "lot," "opportunity," and "technology." These terms imply a context
where individuals have the chance to acquire knowledge, engage in various activities, and explore
cutting-edge technologies. However, the word "nil" also appears in this topic, which dilutes its overall
interpretability.
In short, while LDA is able to group the words into topics, the topics are too broad and some words do not
seem to be clearly linked to the topic. The results are shown in Figures 20 and 21.

BERTopic
BERTopic is an algorithm that combines BERT (Bidirectional Encoder Representations from Transformers)
embeddings with hierarchical density-based clustering (HDBSCAN) to identify coherent topics in text data.
BERT embeddings are dense vector representations generated by a deep transformer model pretrained on a
large corpus of text. The BERTopic embedding space is a high-dimensional vector space where each
document is represented by a dense vector capturing its semantic meaning; documents with similar
meanings are expected to have similar embeddings in this space.
In Figure 22 (Appendix), we interpreted topic 4 as good work-life balance; the words associated with it
are “life balance” and “good life”. These terms explicitly reflect the topic's focus on achieving a
positive equilibrium between work and personal life. Similarly, topic 0 in BERTopic is related to
"learning opportunities" and is characterized by terms such as "learn lot" and "opportunity learn."
These terms clearly convey the topic's emphasis on gaining knowledge and educational possibilities.
BERTopic is able to clearly differentiate topics with easily interpretable terms. This can be attributed
to its deep learning-based approach, which enables it to capture more nuanced and context-dependent
representations of words and phrases within the documents. The intertopic distance map is shown in
Figure 22.

LDA vs BERTopic
BERTopic can better capture more nuanced relationships between words and generate more accurate
topic representations because it uses BERT that captures the meaning of words based on their
surrounding context while LDA ignores word order and contextual information. BERTopic assigns
representative keywords to each topic, providing a concise summary of the topic's content. These
keywords are derived from the highest-ranking words within the topic, providing clearer and more
interpretable results compared to LDA, which assigns probabilities to words within topics.
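
As background on how the representative keywords are ranked: the BERTopic documentation describes a class-based TF-IDF (c-TF-IDF) score computed per topic. The formula below follows that description; whether the project used the default settings is an assumption, since the report does not say. Here tf_{t,c} is the frequency of term t within topic c, f_t is the frequency of t across all topics, and A is the average number of words per topic.

```latex
\[
W_{t,c} \;=\; \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{f_t}\right)
\]
```

Terms that are frequent inside one topic but rare elsewhere receive the highest scores, which is consistent with the short, distinctive keyword lists reported for each topic.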

BERTopic – General
Across all roles, employers can work on branding themselves with themes such as 1) good learning
opportunities, 2) good colleagues, 3) good work benefits, 4) good company culture, 5) good work-life
balance, 6) flexible work arrangements, and 7) a good internship experience.

BERTopic – Data roles
For data roles, themes important to employees are 1) good colleagues, 2) good work benefits, 3) good
learning opportunities, 4) good company culture, 5) good work-life balance, and 6) a good internship
experience. One area that can be further explored is topic 4, which includes terms such as “life balance”,
“mental trauma”, “great team”, “accounting firm” and “edge technology”. This mix of unrelated terms makes
the topic hard to identify, and more research should be done.

BERTopic – Developer roles


For developer roles, themes important to employees are 1) good colleagues, 2) good salary, 3) good
learning opportunities, 4) good company culture, 5) good work-life balance, and 6) good learning
opportunities. Topic 0 is the most confusing and requires more research, as it includes terms such as
“tech stack”, “new technology”, “learn lot”, “life balance”, and “depend team”.

BERTopic – Leadership roles


For leadership roles, themes important to employees are 1) good colleagues, 2) good work benefits, 3)
good opportunities to grow fast, 4) good company culture, and 5) good managers.

BERTopic – Engineering roles


For engineering roles, themes important to employees are 1) colleagues, 2) salary, 3) learning
opportunities, 4) company culture, and 5) work-life balance. Interestingly, the company ST Engineering
emerged as a topic, and the themes associated with it are learning opportunities and growth
opportunities. On the other hand, Topic 6 is confusing as it includes terms such as “flexible work”,
“work hour”, “project hand”, “normal gov structure”, and “load dependent character”. Topic 7 is also
confusing as it includes “company sea region”, “company sea”, “employee help”, “employer help” and
“employer carefully”. These two topics are not clearly interpretable and require more research.
Thus, BERTopic demonstrates its capability to capture important themes across different roles, providing
insights into the aspects valued by employees. However, there are instances where certain topics are less
clear and require additional research to fully interpret their meaning. This highlights the importance of
further investigation and analysis to gain a comprehensive understanding of the identified topics and their
significance in the context of specific roles.

VI. Conclusion and Limitations


Conclusion
For salary prediction, the models did relatively well, with the highest R2 being 66%, but they need
further improvement before being deployed in a real business case. Most of the feature importance
results might not be too insightful to job seekers, such as the finding that company, experience, and
seniority affect salary. However, the insights on cloud and data skill sets offer job seekers a good
picture of how pursuing those skills would improve their “worth” for upskilling and salary negotiation.

To identify what employees value in their workplace, we used BERTopic, which gave clearer results than
LDA, which only assigns word probabilities within topics. BERTopic captures contextual word relationships
and generates concise topic summaries by assigning representative keywords. Our project identified that
employers should prioritize branding around themes like learning opportunities, colleagues, work
benefits, company culture, work-life balance, flexible arrangements, and internship experience across all
roles. We hope this project will serve as a foundation for filling job vacancies in Singapore’s tech job
market amidst the present war for talent.

Limitations
The salary data that can be scraped from Glassdoor comes in the form of salary ranges. We took the
midpoint of each range as the value of the target variable. This might result in less accurate
estimates, especially for listings with wider ranges. This limitation is most likely the reason the
highest R2 reached only 66%.
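
Collapsing a range to a single target value could be done along the lines of the sketch below. The input formats (for example "$60K - $90K") and the function name are assumptions, since the report does not show the raw scraped strings.

```python
import re


def salary_midpoint(range_text):
    """Return the midpoint of a salary range string, or None if no range is found.

    Assumes formats like "$60K - $90K" or "$5,000 - $7,000" (hypothetical examples).
    """
    cleaned = range_text.replace(",", "")
    amounts = []
    for value, suffix in re.findall(r"(\d+(?:\.\d+)?)([kK]?)", cleaned):
        amount = float(value)
        if suffix:
            amount *= 1000  # "60K" means 60,000
        amounts.append(amount)
    if len(amounts) < 2:
        return None
    low, high = amounts[0], amounts[1]
    return (low + high) / 2
```

Two listings with ranges of 70K to 80K and 50K to 100K both map to 75,000, even though the second is far less certain. This is the information loss that the limitation describes.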
From Glassdoor’s perspective, the original inputs it receives are exact salary figures rather than
salary ranges. This implies that more granular and accurate data exists within Glassdoor. The salary
estimate modelling in this project can therefore serve as a basis for Glassdoor to rebuild the models
with better data. The models could then be deployed as a new “Know Your Worth” feature that helps job
seekers estimate their worth and simulate which upskilling options are worth pursuing, thereby
increasing the value of Glassdoor’s service.

References
Cule, E., & De Iorio, M. (2013). Ridge regression in prediction problems: Automatic choice of the ridge parameter. Genetic Epidemiology, 37(7), 704-714.
Glassdoor. (2023). Glassdoor About Us.
Hammami, A., et al. (2020). Salary perception and career prospects in audit firms.
Manpower Research & Statistics Department. (2023). Report: Job Vacancies 2022.
Nguyen, V. (2019, June). Bayesian optimization for accelerating hyper-parameter tuning. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE) (pp. 302-305). IEEE.
Schreiber-Gregory, D. N. (2018). Ridge regression and multicollinearity: An in-depth review. Model Assisted Statistics and Applications, 13(4), 359-365.
Self, T., et al. (2022). The Role of Online Reviews and Salary on Hospitality Students’ Perceptions and Intentions.
Shi, Y., Ke, G., Chen, Z., Zheng, S., & Liu, T. Y. (2022). Quantized training of gradient boosting decision trees. Advances in Neural Information Processing Systems, 35, 18822-18833.
Smith, C. (2023). Glassdoor Statistics and User Count (2023). Retrieved from https://expandedramblings.com/index.php/numbers-15-interesting-glassdoor-statistics/
Tay, H. (2023). Tech talent: Skills take time to catch up with demand. Retrieved from https://www.straitstimes.com/singapore/skills-take-time-to-catch-up-with-demand
Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., & Guyon, I. (2021, August). Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. In NeurIPS 2020 Competition and Demonstration Track (pp. 3-26). PMLR.

Appendix

Figure 4 salary estimate distribution

Figure 5 skills which pay the most

Figure 6 job education vs salary distribution

Figure 7 salary distribution for each role

Figure 8 work experience vs estimate salary

Figure 9 seniority title vs salary estimate

Figure 10 company size vs salary

Figure 11 company sector vs salary

Figure 12 scatter plot of company rating vs the salary

Figure 13 Feature Importance of Ridge Regression Model

Features                                                      Coefficients
Cloud Related (AWS, Azure, Google Cloud)                      350+
Company Related (industry, founded, revenue, sector)          272.11
Talend                                                        272
D3 Java                                                       272
Jupyter Notebook                                              272
Job Requirement Related (Education, Experience, Seniority)    272


Figure 14 Highest Coefficients of Ridge Regression Model

Figure 15 Feature Importance of Lasso Regression Model

Features              Coefficient
Azure Administrator   5067
Cloud Strategy        1861
Company_Rating        1277
Figure 16 Highest Coefficients of Lasso Regression Model

Figure 17 Feature Performance of Gradient Boosting Regressor

Figure 18 Feature Performance of Random Forest Regressor

Figure 19 SHAP Value of Random Forest Regressor

Figure 20 LDA Topic 1

Figure 21 LDA Topic 2

Figure 22 BERTopic Topic(s) 4 and 0
