
Faculty of Computing & IT

University of Sialkot

An Automated CV/Resume Analyzer Using
Machine Learning and Natural Language Processing
Algorithms

Session: BSSE Fall 2021 - 2025

Project Advisor: Ma’am Shimza


Submitted By

Sania Afzal 21101001-049

Hamid Akram 21101001-054

Fatima Iftikhar 21101001-123

Haroon Ahmad 21101001-126

© Department of Software Engineering, Faculty of Computing & IT, University of Sialkot
DECLARATION

I certify that the project titled An Automated CV/Resume Analyzer Using Machine
Learning and Natural Language Processing Algorithms was carried out by students of
the Bachelor of Science (Software Engineering) program, Faculty of Computing &
Information Technology, University of Sialkot, Pakistan, under my supervision.

__________________________________
Ma’am Shimza
Department of Software Engineering
Faculty of Computing & Information Technology
University of Sialkot, Punjab, Pakistan.

Dated: 03-June-2024

TABLE OF CONTENTS

PROBLEM STATEMENT
BACKGROUND STUDY
PROJECT MOTIVATION
PROJECT/PRODUCT SCOPE
PROJECT OVERVIEW/GOAL
HIGH-LEVEL SYSTEM COMPONENTS
PROJECT DEVELOPMENT METHODOLOGY
PROJECT PLAN
    WORK BREAKDOWN STRUCTURE
    GANTT CHART
HARDWARE AND SOFTWARE SPECIFICATION
TOOLS AND TECHNOLOGIES
BUSINESS PLAN
ETHICAL CONSIDERATION
WORK DIVISION
BUDGET (IF APPLICABLE)
REFERENCES
APPENDICES
    APPENDIX A: FYP CODE SNIPPETS OR PSEUDOCODE

List of Figures

Figure 1: High-Level System Components
Figure 2: Work Breakdown Structure
Figure 3: Gantt Chart
Figure 4: COCOMO I Calculation

List of Tables

Table 1: Literature Review Comparison
Table 2: Task Division Table
Table 3: Bill of Materials (Calculation I)

Final Year Project Proposal
ABSTRACT
This project introduces an automated CV analyzer system employing Machine Learning
(ML) and Natural Language Processing (NLP) to enhance candidate selection in modern
recruitment processes. It parses CVs using NLP to extract pertinent details like skills,
experience, and education. ML algorithms then assign weighted scores to candidates
based on their qualifications. Recruiters receive a ranked list of candidates, aiding in
efficient screening. The system employs various techniques including keyword extraction
and sentiment analysis to provide insights. It mitigates bias, enables data-driven
decisions, and offers instant feedback to candidates, potentially transforming recruitment
into a faster, more accurate, and efficient process.

INTRODUCTION
Upon completing their education, individuals typically enter the job market, often using
their Curriculum Vitae (CV) or resume as their primary representation; many also start
working before finishing their formal education. Technology has made the modern job
search more efficient, yet the abundance of applicants for each position can overwhelm
employers and make CV assessment challenging. Some companies mandate specific
formats for applicants to ease the process, but screening remains tedious and prone to
errors. Automated CV analyzers, leveraging technologies like natural language
processing and machine learning, offer a solution by improving the efficiency and
accuracy of candidate evaluation and addressing volume, inconsistency, and bias issues.

PROBLEM STATEMENT
The traditional process of manually reviewing CVs is inefficient and error-prone due to
the lack of standardization, creating challenges for HR professionals, especially in larger
organizations handling numerous applications. There is a clear need for a more efficient,
accurate, and unbiased candidate evaluation method. An automated CV analyzer can
address this need, improving recruitment speed and fairness and facilitating prompt
hiring of top talent. This solution is crucial for enterprises, recruitment agencies, and
SMEs, given the substantial market for HR software and recruitment outsourcing. The
issue is particularly acute when managing high application volumes and requires
continuous attention; addressing it is vital for enhancing efficiency, fairness, talent
acquisition speed, and overall organizational effectiveness.

BACKGROUND STUDY
Table 1: Literature Review Comparison

[1] Resume Analyzer using Machine Learning
    Research: Data extraction and resume ranking.
    Tools/Techniques: Machine learning, NLP, custom algorithms.
    Accuracy/Result: Not specified.
    Limitations: Limited data sets; potential bias.

[2] AI-Driven Resume Screening with NLP
    Research: NLP-based resume screening.
    Tools/Techniques: NLP, machine learning.
    Accuracy/Result: 80% accuracy.
    Limitations: Data set limitations.

[3] AI for Resume Parsing and Job Matching
    Research: Resume parsing, job matching.
    Tools/Techniques: NLP, deep learning.
    Accuracy/Result: High accuracy in job matching.
    Limitations: Scalability; diversity in job roles.

[4] Automated Resume Scanner using NLP
    Research: End-to-end resume screening.
    Tools/Techniques: NLP, automated ML algorithms.
    Accuracy/Result: Not specified.
    Limitations: Complexity of NLP processing.

[5] Resume Sorting using ML Algorithms and NLP
    Research: Data extraction, resume matching.
    Tools/Techniques: Machine learning, NLP.
    Accuracy/Result: 85% accuracy.
    Limitations: Data quality; processing time.

[6] Leveraging NLP for Resume Screening
    Research: Resume screening and matching.
    Tools/Techniques: NLP, machine learning.
    Accuracy/Result: Not specified.
    Limitations: Bias in data generation.

[7] Resume Sorting using ML
    Research: Resume screening and ranking.
    Tools/Techniques: Deep learning, NLP.
    Accuracy/Result: Not specified.
    Limitations: Limited sample size.

[8] Automated Resume Screening with NLP
    Research: Data extraction, screening.
    Tools/Techniques: NLP, AI techniques.
    Accuracy/Result: 78% accuracy.
    Limitations: Model interpretability.

[9] Machine Learning for Automated CV Analysis
    Research: CV analysis and ranking.
    Tools/Techniques: Machine learning, NLP.
    Accuracy/Result: Varies with data set.
    Limitations: Implementation complexity.

[10] AI-Powered CV Parsing and Matching
    Research: CV parsing, job matching.
    Tools/Techniques: NLP, AI techniques.
    Accuracy/Result: Not specified.
    Limitations: Integration challenges.

[11] Analyzing CV/Resume using Natural Language Processing and Machine Learning
    Research: Utilizes NLP and ML techniques for efficient information extraction;
    segments CVs by topic; extracts structured data from unstructured text; assesses
    data using decision tree algorithms and logistic regression.
    Tools/Techniques: NLTK for syntax analysis; PDFminer for PDF parsing; Beautiful
    Soup for HTML conversion; syntax analysis, PDF/HTML processing, decision tree
    algorithms, logistic regression.
    Accuracy/Result: 85% accuracy.
    Limitations: Study limited to engineering students' CVs, potentially not
    representative of broader demographics; the model's effectiveness may vary with
    different CV layouts and smaller samples; the limited dataset and its biases
    suggest the model may not be universally applicable across fields.

[12] A Multi-Criteria Analysis and Advanced Comparative Study of Recommendation Systems
    Research: Data collection from resumes; preprocessing through Term
    Frequency-Inverse Document Frequency (TF-IDF); clustering using KMeans;
    measuring similarity between job offers and resumes with cosine similarity.
    Tools/Techniques: Text mining; clustering algorithms; TF-IDF for feature
    extraction; KMeans for clustering; cosine similarity; implementation in Python
    (scikit-learn); the WordNet lexical database for semantic similarity.
    Accuracy/Result: Not specified.
    Limitations: Current models are incomplete and require improvements in
    recommendation accuracy; progression and automatic construction of skill
    ontologies remain open problems; an interface should be designed and implemented
    for better validation with live data sets.

[13] A CV Parser Model using Entity Extraction Process and Big Data Tools
    Research: Utilizes entity extraction processes in big data; converts unstructured
    or semi-structured CV data into a structured format; text analytics includes text
    cleaning, tokenization, and POS tagging; accurately extracts relevant data such as
    names, dates, and technical skills.
    Tools/Techniques: NLP for entity extraction and text analysis; tokenization, POS
    tagging, and semantic technologies; Hadoop MapReduce for efficient processing of
    large datasets.
    Accuracy/Result: 85-90% accuracy; ensures higher accuracy in candidate matching.
    Limitations: Applicability limited to predefined formats; may struggle with
    diverse or unconventional CV structures; ambiguous grammar can cause language
    issues; periodic updates required for new CV structures or additional languages.

[14] ResumAI: Revolutionizing Automated Resume Analysis and Recommendation with
    Multi-Model Intelligence; Potential Candidate Selection using Information
    Extraction and Skyline Queries
    Research: Entity extraction; word tokenization and normalization; word embeddings;
    classification algorithms; information extraction (IE) and skyline queries;
    processes PDF candidate resumes using NLP techniques.
    Tools/Techniques: NLP algorithms; word embedding techniques; similarity measures;
    classification algorithms; spaCy NLP module for text processing; converts PDFs to
    UTF-8 encoded text; tokenizes and recognizes named entities.
    Accuracy/Result: 98.63% accuracy.
    Limitations: The computational complexity of BERT makes it less suitable for
    real-time applications compared to TF-IDF.

[15] A Document Vectorization Approach to Resume Ranking System
    Research: Uses Python for model development, FastAPI for API design, and web
    technologies for an HRM dashboard; data processing with Pandas, NLTK, and Gensim
    for text tokenization, cleaning, and word-frequency dictionary creation.
    Tools/Techniques: Python, FastAPI, HTML, CSS, JavaScript, Pandas, NLTK, Gensim;
    Google Colab, VSCode, virtual environments via virtualenv; implementation includes
    TF-IDF, cosine similarity, and the TOPSIS algorithm.
    Accuracy/Result: Tested with 150 resumes and evaluated with up to 50,000
    synthesized resumes; showed fast response times in filtering; multiple factors
    normalized through the TOPSIS algorithm.
    Limitations: Constrained computational power; reliance on open-source platforms;
    limited dataset size; supports only CSV file formats and needs further development
    for flexibility and extensibility.

Project Motivation
While job search processes have evolved, CV/resume evaluation still relies heavily on
manual methods. Leveraging advancements in Natural Language Processing (NLP) and
Machine Learning (ML) offers a promising solution: these technologies are already
commonplace in everyday activities such as email and online shopping, which highlights
both their maturity and their potential for automating and improving candidate
selection.

Project/Product Scope
The project targets the creation of a web-based application to automate CV analysis using
Natural Language Processing (NLP) and Machine Learning (ML). It will assess and rank
CVs based on relevance across multiple domains, aiming for a comprehensive candidate
profile. However, the computational requirements may present challenges for smaller
organizations. Nonetheless, the project's success promises to greatly improve the
efficiency and speed of candidate selection processes.

PROJECT OVERVIEW/GOAL
The project aims to create an automated CV analyzer system using advanced
technologies like Natural Language Processing (NLP) and Machine Learning (ML) to
transform recruitment processes. By addressing inefficiencies and biases in traditional
CV evaluation methods, the system offers a more efficient, accurate, and fair approach to
candidate selection. Key features include comprehensive analysis, adaptability across
domains, efficiency, accuracy, and instant candidate feedback. The final product will be a
web-based application providing recruiters with ranked candidate lists based on job
requirements, leveraging NLP and ML algorithms for CV parsing and analysis. Expected
packaging involves a cloud-based web solution accessible via standard browsers, with
hardware and software components including servers, storage, programming languages
like Python, frameworks like TensorFlow, and libraries like spaCy and scikit-learn,
alongside web development technologies and database systems for efficient data
management.

HIGH-LEVEL SYSTEM COMPONENTS


The main high-level components of the proposed automated CV/resume analyzer can be
outlined as follows:
1. File Conversion Component:
• Converts CVs/resumes from formats like PDF/DOCX to HTML to preserve
formatting and font size.

2. Segmentation Component:
• Identifies segments in the HTML using headings, font size, and structure,
aided by parse-tree and font analysis.

3. Information Extraction Component:


• Utilizes pattern recognition, NLP, and ML to extract structured data from
segments, organizing it as JSON.

4. Evaluation Component:
• Applies the ID3 decision tree algorithm to classify and rank CVs, calculating
entropy and information gain to identify the best attributes for evaluation, and
compares performance with logistic regression for potential enhancements (a worked
sketch follows Figure 1 below).
5. Training Component:
• Continuously trains the system on collected data to refine decision-making and
improve accuracy, updating the training data based on feedback from decision tree
evaluations.

Each of these components plays a vital role in processing, analyzing, and evaluating CVs
to automate candidate selection in a structured, efficient manner.

Figure 1: High-Level System Components
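
As a worked illustration of the Evaluation Component, the sketch below computes entropy and information gain over a toy set of labelled CVs. The attribute names and data are hypothetical stand-ins, not values from the project's dataset; ID3 would pick the attribute with the highest gain as the root of the decision tree.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)) over each value v of attribute A."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data: skills_match predicts suitability perfectly
# (gain 1.0) while cgpa_band carries no information here (gain 0.0).
cvs = [
    {"cgpa_band": "high", "skills_match": "yes", "suitable": "yes"},
    {"cgpa_band": "low",  "skills_match": "yes", "suitable": "yes"},
    {"cgpa_band": "high", "skills_match": "no",  "suitable": "no"},
    {"cgpa_band": "low",  "skills_match": "no",  "suitable": "no"},
]
for attr in ("cgpa_band", "skills_match"):
    print(attr, round(information_gain(cvs, attr, "suitable"), 3))
```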

PROJECT DEVELOPMENT METHODOLOGY

The development methodology for analyzing CVs/resumes with Natural Language
Processing (NLP) and Machine Learning (ML) follows a structured approach segmented
into the following phases:

1. System Design:
• File Conversion: Convert CV/Resume from formats like PDF/DOCX to HTML
to retain original formatting and font size information.

• Reverse Engineering: Use the HTML code to carry information on text styling,
aiding accurate segmentation.
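
A minimal sketch of the conversion step, assuming pdfminer.six is installed; the input file name is hypothetical, and DOCX inputs would need a separate converter (e.g. LibreOffice in headless mode), which is not shown.

```python
from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def cv_pdf_to_html(pdf_path: str) -> str:
    """Convert a PDF CV to HTML, keeping font-size/style details in inline
    CSS so the segmentation phase can reason about headings."""
    buf = StringIO()
    with open(pdf_path, "rb") as f:
        extract_text_to_fp(f, buf, output_type="html", codec=None, laparams=LAParams())
    return buf.getvalue()

html = cv_pdf_to_html("sample_cv.pdf")  # hypothetical input file
```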

2. Segmentation:
• Font-Based Segmentation: Segment the CV/Resume based on font sizes and
styles identified in the HTML, since headings and key sections usually have
larger or bold fonts.
• Syntax Analysis: Utilize syntax trees and pattern recognition to identify and
finalize segments accurately.
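
A sketch of font-based segmentation over pdfminer's HTML output, which records font sizes in inline style attributes. The size threshold is a hypothetical heuristic; a real implementation would likely derive it from the document's own body-text size.

```python
import re
from bs4 import BeautifulSoup

HEADING_SIZE_PX = 12.0  # hypothetical cutoff: spans larger than body text

def find_section_headings(html: str) -> list[str]:
    """Treat spans whose inline font-size exceeds the body-text size as
    section headings (e.g. EDUCATION, EXPERIENCE, SKILLS)."""
    soup = BeautifulSoup(html, "html.parser")
    headings = []
    for span in soup.find_all("span", style=True):
        match = re.search(r"font-size:\s*([\d.]+)px", span["style"])
        if match and float(match.group(1)) > HEADING_SIZE_PX:
            text = span.get_text(strip=True)
            if text:
                headings.append(text)
    return headings
```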

3. Extracting Qualification Features:


• Information Extraction: Employ pattern recognition and named entity
recognition for simple data (e.g., email, phone number) and multinomial logistic
regression for complex data (e.g., institutions and degrees).
• Structured Data Representation: Convert the extracted information into JSON
format for easier manipulation and analysis.
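
A minimal sketch of the pattern-recognition path for the simple fields; the regular expressions are illustrative and would need hardening for real CVs, and the multinomial logistic regression path for institutions and degrees is omitted here.

```python
import json
import re

def extract_simple_fields(segment_text: str) -> str:
    """Pull easily patterned fields (email, phone) from a CV segment and
    emit them as JSON for the downstream evaluation component."""
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", segment_text)
    phone = re.search(r"\+?\d[\d\s-]{8,}\d", segment_text)
    return json.dumps({
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }, indent=2)

print(extract_simple_fields("Contact: jane@example.com, +92 300 1234567"))
```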

4. Training the System:


• Machine Learning: Use decision tree learning (ID3 algorithm) to evaluate the
CV/Resume, based on attributes like CGPA, project weight, and skills.
• Continuous Learning: Train the system with new CVs to improve decision-
making over time.

5. Evaluation and Decision Making:


• Information Gain Calculation: Calculate entropy and information gain to
prioritize attributes.
• Decision Tree Implementation: Utilize the decision tree for CV classification,
and compare results with other algorithms like logistic regression for performance
tuning.
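
A sketch of the comparison using scikit-learn, where X stands in for numeric qualification features (CGPA, project weight, matched skills) produced by the extraction phase; the tiny arrays here are hypothetical placeholders. Note that scikit-learn's tree is CART with an entropy criterion rather than textbook ID3, so it approximates the evaluation step rather than reproducing it exactly.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical feature rows: [CGPA, project weight, matched skill count]
X = [[3.8, 0.9, 7], [2.4, 0.3, 2], [3.1, 0.7, 5], [2.0, 0.2, 1],
     [3.9, 0.8, 6], [2.7, 0.4, 3], [3.5, 0.6, 5], [2.2, 0.1, 2]]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = shortlist, 0 = reject

models = {
    "decision tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=4)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```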

Overall, the methodology combines NLP and ML techniques to automate CV analysis and
ranking, providing a sophisticated and scalable solution for candidate selection.

PROJECT PLAN

WORK BREAKDOWN STRUCTURE

Figure 2: Work Breakdown Structure

GANTT CHART

Figure 3: Gantt Chart

HARDWARE AND SOFTWARE SPECIFICATION


• Machine Type: A 64-bit quad-core processor (Intel Core i5 or equivalent) with a
clock speed of at least 2.4 GHz, 8 GB RAM, and a minimum of 250 GB SSD.
• Operating System and Utilities: Any recent version of Windows, macOS, or
Linux.

TOOLS AND TECHNOLOGIES


• Front-End Tools: HTML5, CSS3, and JavaScript frameworks like React or
Angular.

• Back-End Tools: Node.js for server-side logic and MySQL for database
management.

Reasons for tool selection:


• Front-End: HTML5 and CSS3 ensure modern web standards, while
React/Angular offer efficient client-side development.
• Back-End: Node.js provides a scalable server-side environment, and MySQL is a
reliable database management system.

Needs and constraints for tool support:


• PDFminer: A Python library used for extracting text and metadata from PDF
files.
• urllib: A Python library used for downloading and opening URLs.
• Beautiful Soup: A Python library used for parsing HTML and XML documents
into a navigable tree structure.
• NLTK: A Python library used for natural language processing, including
tokenization, stemming, lemmatization, and named entity recognition.
• NumPy: A Python library used for numerical operations on arrays and matrices.
• SciPy: A Python library used for scientific computing and technical computing,
including signal and image processing.
• ID3: A decision tree algorithm used for classification tasks.
• Git: A version control system used for tracking changes in the code.
• Google Colab: A cloud-based service used for running code, storing data, and
collaborating with others.
• Kaggle: A platform for data science and machine learning, including datasets,
competitions, and APIs.
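
A small sketch tying together the NLTK pieces listed above (tokenization, POS tagging, and named entity recognition). The sample sentence is hypothetical, and the download package names can vary slightly across NLTK versions.

```python
import nltk

# One-time downloads of the tokenizer, tagger, and NE-chunker models.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Jane Doe studied Software Engineering at the University of Sialkot."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity recognition
print(tree)  # PERSON / ORGANIZATION subtrees mark names and institutions
```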

BUSINESS PLAN
1. Executive Summary:
The project aims to create a web app that uses NLP and ML to analyze CVs, helping
recruiters evaluate and rank them based on relevance. It targets large enterprises,
recruitment agencies, and SMEs to simplify their hiring process. The outcome should be
a fairer, more efficient method of candidate selection, offering ranked lists of candidates
aligned with job criteria.
2. Business Description:
The project aims to tackle inefficiencies and biases in traditional CV evaluation. Its
mission is to transform recruitment by introducing an automated CV analyzer for a fairer,

more efficient selection process. The vision is to be a top provider of smart recruitment
solutions, enhancing the experience for employers and candidates alike.
3. Market Analysis:
The target market comprises large enterprises, recruitment agencies, and SMEs dealing
with a surge in job applications. With a significant market size, there's a rising demand
for HR software and recruitment outsourcing. The project aims to meet recruiters' needs
for a precise, efficient, and unbiased candidate evaluation method. Its competitive edge
stems from employing advanced NLP and ML techniques, offering a more sophisticated
and scalable solution than traditional CV analysis methods.
4. Products or Services:
The project will provide an automated CV analyzer system to assess and rank candidates
according to their qualifications and alignment with job requirements. It will include
features such as CV parsing, information extraction, entity recognition, skill matching,
sentiment analysis, and a ranked list of candidates.
5. Marketing and Sales Strategy:
The marketing approach targets HR professionals, recruiters, and hiring managers
through online ads, social media, and industry events. Pricing will follow a subscription
model with tiers based on CV volume and features. Sales will emphasize relationship-
building, showcasing the system's value, and securing early adopters for positive reviews.
6. Operations and Management:
The project will involve a four-person team, each assigned specific roles. They'll adopt
agile project management, emphasizing regular communication and collaboration.
Management will rely on online tools, meetings, and progress reports to ensure smooth
progress.
7. Financial Projections:
The project will need funding for hardware, software, and cloud services, with financial
projections covering operating expenses, revenue, and profit margins. Funding sources
will include personal contributions, scholarships, and potential investor backing.
8. Funding and Investment:
The project will explore funding options like scholarships and grants from the university.
Additionally, the team will consider partnerships with HR tech companies or
organizations for support.
9. Risk Analysis:
Potential risks include data collection and annotation challenges, developing robust
NLP and ML models, scalability issues with handling large CV volumes, and competition
from existing tools. To mitigate these risks, the team will focus on thorough data
collection, continuous testing and refinement of models, scalable infrastructure design,
and a robust marketing and sales strategy.
10. Sustainability and Social Responsibility:
The project will prioritize ethical considerations by maintaining data privacy, fairness,
and transparency in the evaluation process. It also aims to foster diversity and inclusion
by offering equal opportunities to all candidates, irrespective of their backgrounds or
demographics.
11. Conclusion:
This proposal presents a model to extract and segment data from CVs/resumes based on
their values, acknowledging that ranking and weighting may vary with employer
preferences. Each stage of the process focuses on a specific task, such as natural
language processing or machine learning. The model takes the distinctive approach of
converting CV data into HTML code to discern values, and ultimately ranks CVs/resumes
according to the required data and employer requirements, considering past criteria.

ETHICAL CONSIDERATION
This project observes the key ethical principles of research: informed consent,
confidentiality, non-maleficence, beneficence, justice, and respect for autonomy.
Informed consent ensures participants are fully informed and consent voluntarily,
while confidentiality protects their privacy and data. Non-maleficence avoids harming
participants, while beneficence seeks to maximize benefits and minimize risks. Justice
demands fair treatment and equitable distribution, and respect for autonomy upholds
participants' right to make their own decisions.

WORK DIVISION
Table 2: Task Division Table

• Sania Afzal (21101001-049): front-end development; feasibility study.
• Hamid Akram (21101001-054): front-end development; back-end development; test case
  development.
• Fatima Iftikhar (21101001-123): back-end development; user interface design.
• Haroon Ahmad (21101001-126): front-end development; back-end development; database
  development.

BUDGET (if applicable)


Table 3: Bill of Materials (Calculation I)

Sr  Hardware & Software          Quantity     Unit Cost (PKR)  Total Cost (PKR)
1   1 TB SSD                     1            9,000            9,000
2   16 GB RAM                    2            10,000           20,000
3   Network (20 Mbps)            1            3,500            3,500
4   Cloud services               1            15,000           15,000
5   IDE, NLP, and ML libraries   Open source  0                0

Total (5 items)                               37,500           47,500

The total cost is 47,500 PKR.

Effort Calculation (basic COCOMO I, organic mode: a = 2.4, b = 1.05, c = 2.5,
d = 0.38, with an estimated size of 5 KLOC):
• Effort (person-months) = a * (KLOC)^b
• Effort = 2.4 * (5)^1.05 ≈ 13.00 person-months

Time Calculation:
• Time (months) = c * (Effort)^d
• Time = 2.5 * (13.00)^0.38 ≈ 6.62 months

Person Calculation:
• Number of persons = Effort / Time
• Number of persons = 13.00 / 6.62 ≈ 1.96 persons
Figure 4: COCOMO I Calculation
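
For reproducibility, a minimal sketch of the basic COCOMO I organic-mode arithmetic above, with the 5 KLOC size assumption made explicit:

```python
def cocomo_organic(kloc: float) -> tuple[float, float, float]:
    """Basic COCOMO I, organic mode (a=2.4, b=1.05, c=2.5, d=0.38)."""
    effort = 2.4 * kloc ** 1.05      # person-months
    time = 2.5 * effort ** 0.38      # months
    persons = effort / time
    return effort, time, persons

effort, time, persons = cocomo_organic(5)  # assumed project size: 5 KLOC
print(f"Effort ≈ {effort:.2f} PM, Time ≈ {time:.2f} months, Persons ≈ {persons:.2f}")
```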

REFERENCES
1. McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes
   text classification. In AAAI-98 Workshop on Learning for Text Categorization
   (Vol. 752, pp. 41-48).
2. Kawtar, N. (2019). In BDIoT'19: Proceedings of the 4th International Conference
   on Big Data and Internet of Things, October 3, 2019.
3. Shivratri, P., Kshirsagar, P., Mishra, R., Damania, R., & Prabhu, N. (2015).
   Resume parsing and standardization.
4. Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern
   Analysis and Machine Intelligence, 21(9), 871-883.
5. Yi, X., Allan, J., & Croft, W. B. (2007). Matching resumes and jobs based on
   relevance models. In SIGIR 2007, Amsterdam, pp. 809-810.
6. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and
   documents. In International Conference on Machine Learning, PMLR, pp. 1188-1196.
7. Westermann, F., Wei, J. S., Ringnér, M., Saal, L. H., Berthold, F., Schwab, M., &
   Khan, J. (2002). Classification and diagnostic prediction of pediatric cancers
   using gene expression profiling and artificial neural networks. GBM Annual Fall
   Meeting, Halle, 2002.
8. Sinha, A. K., Amir Khusru Akhtar, M., & Kumar, A. (2021). Resume screening using
   natural language processing and machine learning: A systematic review. In Machine
   Learning and Information Processing: Proceedings of ICMLIP 2020, pp. 207-214.
9. Orosz, G., Szántó, Z., Berkecz, P., Szabó, G., & Farkas, R. (2022). HuSpaCy: An
   industrial-strength Hungarian natural language processing toolkit. arXiv preprint
   arXiv:2201.01956.
10. Sanyal, S., Hazra, A., Ghosh, S., & Adhikary, A. (2017). Extraction of
    information from unstructured data in resumes.

Appendices:

APPENDIX A: FYP CODE SNIPPETS OR PSEUDOCODE


FYP (Final Year Project) code snippets and pseudocode are concise representations of
the programming logic or algorithms used in a final year project. Code snippets are
short sections of actual code written in a specific programming language, showcasing
practical implementation. Pseudocode, by contrast, is a high-level, human-readable
description of the logic or algorithm without strict adherence to any programming
language's syntax, making it a valuable tool for planning and explaining complex
processes. Both serve as aids for understanding, developing, and documenting the
project's technical aspects.
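
As a placeholder until implementation snippets are available, a high-level pseudocode sketch of the proposed pipeline follows; every helper function named here is hypothetical and corresponds to one of the system components described earlier.

```python
# Pseudocode: the helpers below are hypothetical and not defined here.
def analyze_cv(cv_path, job_requirements):
    html = convert_to_html(cv_path)                        # File Conversion component
    segments = segment_by_font(html)                       # Segmentation component
    profile = extract_fields(segments)                     # Information Extraction -> JSON
    score = evaluate_with_id3(profile, job_requirements)   # Evaluation component
    return {"profile": profile, "score": score}

def rank_candidates(cv_paths, job_requirements):
    results = [analyze_cv(p, job_requirements) for p in cv_paths]
    return sorted(results, key=lambda r: r["score"], reverse=True)
```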

