
Advancement of Computer Technology and its Applications

Volume 5 Issue 3

Data Set of AI Jobs

Divyam Pithawa¹, Sarthak Nahar²*, Shivam Sharma³*, Er. Nikhil Chaturvedi⁴*

¹,²,³B. Tech Student, CSE, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, India
⁴Assistant Professor, CSE, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, India
*Corresponding Author
E-Mail Id: divyampithawa@gmail.com

ABSTRACT
The automated, targeted extraction of information from websites is known as web scraping; the similar technology used by search engines is known as web crawling. Although data can be collected manually, automation is frequently faster, more efficient, and less prone to mistakes.
Online job portals collect a substantial amount of data in the form of resumes and job openings, which can be a useful source of knowledge about the characteristics of market demand.
Web scraping can be divided into three steps: the web scraper finds the required links on the internet; the data is then scraped from the source links; and finally, the data is written to a CSV file. The scraping is performed in Python.
As part of the job series of datasets, this dataset can be helpful for finding a job as an AI engineer.

Keywords:- Web Scraping, Job Portal, Dataset, AI Jobs

INTRODUCTION

The World Wide Web [1] consists of an interlinked network of information, which is presented to users through websites.

The way we share, gather, and publish data has changed substantially as a result of the World Wide Web, and the quantity of information available continues to grow. As data volume, diversity, and importance rise, business executives must focus on the most important facts.

Not all data is equally significant to consumers or corporations. Businesses that can recognize and capitalize on the crucial fraction of data that will have a meaningfully beneficial impact on user experience, resolve complicated issues, and generate new economies of scale will prosper throughout this data transformation. To fully realize the enormous potential it represents, business executives should concentrate on locating and maintaining that special, crucial piece of data (David Reinsel et al., "Data Age 2025," IDC, 2017) [2].

Web / Data Scraping

Web scraping [3], also known as data scraping, is a term used for automatically retrieving data from the internet and structuring it in a useful manner.

Choosing to manually copy and paste data from a job website would take days, as most of the data is viewed through a browser. Our aim is to automate this process with the help of a web scraper, which can scrape the job postings in mere seconds.

In contemporary web scraping, fully automated algorithms that can transform entire webpages into well-organized data sets have replaced the smaller, ad hoc, human-assisted procedures of the past, and scraping strategies adapt to a variety of settings.


Modern online scraping solutions can parse markup languages and JSON files, and also integrate computer vision analytics and natural language processing to mimic how people navigate the web (Butler 2007; Yi et al. 2003) [4].

Job information is regarded as important information in the web scraping industry. In developed nations, 51% of workers are searching for new jobs, and 58% of them do so online, according to the 2017 State of the American Workplace study by Gallup [5].

This indicates that there is a significant online job market, and being able to monitor the data can benefit a job aggregator, a business seeking to hire, or a potential employee.

OBJECTIVE
The objective of this research project is to develop an AI Jobs dataset by using web scraping to collect specific pieces of information from job websites such as TimesJobs, Indeed, and Shine, and to represent this information in a structured format. The first phase is responsible for connecting to the World Wide Web.

The second phase collects the information from the websites using web scraping techniques, and the third phase stores the collected information in a single common CSV file that contains the details from the websites.

RELATED WORK
This dataset [6] contains 732 rows and 6 columns, named S.No, Title, Company, Location, Salary, and Details. It contains data from a single website, and the size of the dataset is small.

This dataset [7] contains 24,300 rows and 8 columns, named S.No, Company, GOC, job_title, primary_ind, sector_name, revenue, and size. It is very large, and not everyone has a high-end system on which to train a model with it. It contains data from two different websites, LinkedIn and Glassdoor, but the data is split across two separate files.

METHODOLOGY
The project's process involves compiling all the data gathered from multiple sources using Python scripts that make use of the rich capabilities of the web scraping library Beautiful Soup.

Coding
The project's fundamental web scraping scripts produce the data that was scraped from the employment websites used in this example, TimesJobs, Shine, and Indeed, and saved in the dataset.

● TimesJobs Scraper
Using the Web Scraper, we have extracted Job Title, Company Name, Experience Required, Salary, Location, Job Description, and Key Skills for AI Jobs from TimesJobs.

Fig.1:- Code for Implementation of TimesJobs Scraper.
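The code in Fig. 1 is not reproduced in this text, so the following is a minimal sketch of such a scraper built with requests and Beautiful Soup. The search URL and class names are assumptions that must be checked against the live TimesJobs markup; the original figure is authoritative.

import requests
from bs4 import BeautifulSoup

# Hypothetical search URL; TimesJobs' markup may have changed since
# the paper was written.
URL = ("https://www.timesjobs.com/candidate/job-search.html"
       "?searchType=personalizedSearch&from=submit&txtKeywords=AI")

soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")

jobs = []
for card in soup.find_all("li", class_="job-bx"):  # assumed job-card class
    title = card.find("h2")
    company = card.find("h3", class_="joblist-comp-name")  # assumed class
    skills = card.find("span", class_="srp-skills")        # assumed class
    jobs.append({
        "Job Title": title.get_text(strip=True) if title else "",
        "Company Name": company.get_text(strip=True) if company else "",
        "Key Skills": skills.get_text(strip=True) if skills else "",
    })

print(f"Scraped {len(jobs)} TimesJobs postings")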


● Shine Scraper
Using the Web Scraper, we have extracted Job Title, Company Name, Experience Required,
and Location for AI Jobs from Shine.

Fig.2:- Code for Implementation of Shine Scraper.
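Fig. 2's code is likewise not reproduced here; below is a hedged sketch assuming a hypothetical URL pattern and placeholder class names. Shine paginates its results, hence the page loop.

import requests
from bs4 import BeautifulSoup

def text_of(tag):
    # Return stripped text, tolerating missing tags.
    return tag.get_text(strip=True) if tag else ""

# Hypothetical URL pattern and class names; verify against the live site.
BASE = "https://www.shine.com/job-search/artificial-intelligence-jobs-{page}"

records = []
for page in range(1, 4):  # first three result pages as a demonstration
    soup = BeautifulSoup(
        requests.get(BASE.format(page=page), timeout=30).text, "html.parser")
    for card in soup.find_all("div", class_="jobCard"):  # placeholder class
        records.append({
            "Job Title": text_of(card.find("h2")),
            "Company Name": text_of(card.find("span", class_="company")),
            "Experience Required": text_of(card.find("span", class_="experience")),
            "Location": text_of(card.find("span", class_="location")),
        })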


● Indeed Scraper
Using the Web Scraper, we have extracted the Job Title, Company Name, Location, and
Job Description for AI Jobs from Indeed.

Fig.3:- Code for Implementation of Indeed Scraper.
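Again as a sketch rather than the paper's actual code (Fig. 3): the selectors below are placeholders, and since Indeed is known to block naive scrapers, a browser-like User-Agent and a delay between requests are commonly needed.

import time
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
URL = "https://in.indeed.com/jobs?q=artificial+intelligence&start={offset}"

rows = []
for offset in (0, 10, 20):  # Indeed pages results in steps of 10
    html = requests.get(URL.format(offset=offset),
                        headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select("div.job_seen_beacon"):  # placeholder selector
        title = card.select_one("h2.jobTitle")       # placeholder selector
        company = card.select_one("span.companyName")
        rows.append({
            "Job Title": title.get_text(strip=True) if title else "",
            "Company Name": company.get_text(strip=True) if company else "",
        })
    time.sleep(2)  # be polite between page fetches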


● Creating DataFrame
A DataFrame is created that contains the details of the AI Jobs we have scraped from the websites.

Fig.4:- Creation of DataFrame.
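A minimal sketch of this step, assuming the three scraper sketches above each collected their results into a list of dictionaries (jobs, records, and rows respectively); columns a site did not provide simply come out as NaN.

import pandas as pd

# Combine the per-site lists into one table with the dataset's features.
df = pd.DataFrame(jobs + records + rows, columns=[
    "Job Title", "Company Name", "Experience Required", "Salary",
    "Location", "Job Description", "Key Skills",
])
print(df.shape)
print(df.head())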

● Exporting DataFrame to a file (.csv)


Generating a CSV file, which is the DataSet of AI Jobs, is our objective.

Fig.5:- Exporting DataFrame to a file (.csv).
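The export reduces to a single pandas call; the file name below is an assumption, as the actual code is in Fig. 5.

# index=False keeps pandas' row index out of the CSV file.
df.to_csv("ai_jobs_dataset.csv", index=False, encoding="utf-8")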


● A Preview of the DataSet Generated

Fig.6:- A Preview of DataSet of AI Jobs.


TOOLS & SOFTWARE USED

We have used the following tools and software in our project:

Python
Python is a general-purpose, high-level programming language. Its design philosophy prioritizes code readability and heavily employs indentation. Using its language constructs and object-oriented approach, programmers can create clear, logical code for both small- and large-scale projects.

In this project, we used the following Python libraries:

● BeautifulSoup
Data from XML and HTML files can be extracted using the Python library Beautiful Soup. Together with your preferred parser, it offers intuitive ways to navigate, search, and modify the parse tree, frequently saving programmers hours or even days of effort.

● Requests
The requests library is the de facto industry standard for making HTTP requests [8] in Python. It provides an easy-to-use API for HTTP operations such as GET and POST, which it performs against a web server identified by its URL.

● NumPy
NumPy is the abbreviation for Numerical Python. It is a Python library for manipulating arrays, and it also supports matrices, the Fourier transform, and linear algebra functions. NumPy was originally developed by Travis Oliphant in 2005. It is an open-source project, so one can use it without any restrictions.

● Pandas
Pandas is an open-source library primarily made for working with relational or labelled data swiftly and logically. It provides a variety of data structures and operations for handling numerical and time-series data. The NumPy library serves as its foundation. Pandas is fast and offers its users great productivity and performance.

● Re
Python's built-in re module provides regular expressions. A regex, or regular expression, is a group of characters that defines a search pattern; it is used to determine whether a text contains that pattern.
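As a hypothetical illustration (not taken from the paper's code), re makes it easy to pull structured values out of scraped free-text fields:

import re

# Sample strings of the kind a job scraper might collect.
experience = "3 - 5 yrs"
salary = "Rs 6.5 - 9.0 Lacs p.a."

years = re.findall(r"\d+", experience)          # ['3', '5']
amounts = re.findall(r"\d+(?:\.\d+)?", salary)  # ['6.5', '9.0']

if re.search(r"\byrs?\b", experience):
    low, high = map(int, years)
    print(f"Experience range: {low} to {high} years")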
Visual Studio Code
Microsoft created the source-code editor Visual Studio Code, also known as VS Code, for Windows, Linux, and macOS. Its features include debugging assistance, syntax highlighting, intelligent code completion, snippets, code refactoring, and integrated Git.

Jupyter Notebook
Jupyter Notebook is a free and open-source tool that enables you to create and share documents containing live code, equations, visuals, and text. It is maintained by the Project Jupyter team.
RESULT
A DataSet [9] of AI Jobs has been successfully generated, containing approximately 1500 entries from various job websites, in this case TimesJobs, Shine, and Indeed. It includes Job Title, Company Name, Experience Required, Salary, Location, Job Description, and Key Skills as features.


CONCLUSION

The DataSet that is generated can be used by candidates who are actively looking for a job as an AI Engineer; it can help them identify the requirements recruiters have for a particular job, which can give them a major competitive advantage. We can also analyze job trends in the market.

FUTURE SCOPE
We can increase the number of entries in the DataSet by fetching more entries from the same websites, or we can scrape more websites for AI Jobs. We can also increase the number of features in the DataSet. Machine Learning models [10] can also be built to find the important skills required by recruiters, as the sketch below illustrates.
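As a hedged first step in that direction, assuming the CSV name used earlier and comma-separated Key Skills values, even a simple frequency count reveals the most-requested skills:

from collections import Counter

import pandas as pd

# Hypothetical file name; assumes Key Skills holds comma-separated values.
df = pd.read_csv("ai_jobs_dataset.csv")

skills = Counter()
for cell in df["Key Skills"].dropna():
    skills.update(s.strip().lower() for s in str(cell).split(",") if s.strip())

# The ten skills recruiters mention most often across the postings.
for skill, count in skills.most_common(10):
    print(f"{skill}: {count}")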
ACKNOWLEDGEMENT
We are grateful to many people for their advice and help during the course of this research project. First and foremost, we would like to express our gratitude to Dr. Anand Rajavat, Head of the Department of Computer Science & Engineering at S.V.I.I.T., Indore, and Er. Nikhil Chaturvedi, the research project mentor, for their invaluable assistance and guidance. They also helped us discover new technologies by sharing their technical knowledge, and their supervision, direction, and constructive criticism inspired us.

We also want to thank our director, Dr. Anand Rajavat, from the bottom of our hearts for all of his help.

We extend our heartfelt gratitude to the teaching and non-teaching personnel of the Computer Science & Engineering department at SVVV Indore for their kind cooperation and provision of the necessary information.

We appreciate the support and helpful suggestions provided by our parents, relatives, classmates, and friends during the research project. Last but not least, we wish to thank everyone who has contributed in any manner to our efforts. Being able to learn at SHRI VAISHNAV VIDYAPEETH VISHWAVIDYALAYA in Indore has been a blessing.

REFERENCES
1. "World Wide Web." SpringerReference. doi: 10.1007/springerreference_28639.
2. Mellor, C. "Seagate sponsors re-fashioned IDC digital data-flood blather." The Register, Apr. 2017. https://www.theregister.com/2017/04/05/seagate_sponsors_refashioned_idc_digital_universe_blather/ (accessed: Jun. 2022).
3. Coble, Z. "Collecting Data with Web Scraping."
4. Zhao, B. "Web Scraping." Encyclopedia of Big Data: 951-953, 2022. doi: 10.1007/978-3-319-32010-6_483.
5. "State of the American Workplace." Gallup, 2017. https://www.gallup.com/workplace/238085/state-american-workplace-report-2017.aspx (accessed: Jul. 2022).
6. Pawar, H. "AI Jobs in India." Kaggle, 2020. https://www.kaggle.com/datasets/hrithikpawar/ai-jobs-in-india (accessed: Jun. 2022).
7. Bansal, S. "2021 DS / ML related Jobs - Linkedin, Glassdoor." Kaggle, Dec. 2021. https://www.kaggle.com/datasets/shivamb/glassdoor-jobs-data (accessed: Jun. 2022).
8. Freeman, A. (2022). "Making HTTP Requests." Pro Angular: 665-690.
9. Pluijmaekers, P., Lelli, F. (2022). "A Dataset Containing Job Descriptions Suitable for NLP and NN Processing." doi: 10.20944/preprints202206.0346.v1.


10. Chen, R.H., Chen, C. (2022). "Machine Learning Modeling." Artificial Intelligence: 119-124. doi: 10.1201/9781003214892-14.

