
In this documentation, we aim to provide a comprehensive overview of our project.
Introduction
This final project is divided into four key parts: Data Collection, Data
Visualization, Data Cleaning and Machine Learning, and Project Documentation.

I. Data Collection
We have chosen the website 'enbek.kz' as the data source for this project. This
resource provides valuable information about job vacancies, employment, and
career opportunities in the Republic of Kazakhstan.
In the first part of the project, we focused on data collection. At this stage, we
selected the 'enbek.kz' website and conducted the data collection process using web
scraping techniques. This stage plays a critical role, as the quality and reliability of
the data are of fundamental importance for subsequent analysis.
II. Data Visualization
In the second part of the project, we focus on data visualization. We utilize
various graphical tools such as histograms, bar charts, box plots, pie charts, and
others to visually represent the data's statistics. Visualization helps us not only to
better understand the data but also to identify potential trends and anomalies.
III. Data Analytics
After visualizing the data, in the third part of the project, we conduct data
cleaning and apply machine learning algorithms for a more in-depth data analysis.
This allows us to process the data, identify and rectify errors, make predictions,
uncover patterns, and extract valuable insights from the data.

Further in the documentation, more detailed descriptions of the project, the results
of the data analysis, as well as conclusions and recommendations will be
presented.

METHODS:
We conducted web scraping, systematically navigating through web pages to extract
key data about job vacancies, such as positions for teachers, doctors, and directors.
Having effectively collected over 3000 unique job vacancy records, we stored
them in a structured CSV format for ease of subsequent analysis.
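The parse-and-save step can be sketched as follows. This is a minimal illustration, not our actual scraper: the HTML snippet and the CSS classes (`div.vacancy`, `span.title`, `span.salary`) are hypothetical stand-ins, since the real enbek.kz markup must be inspected before writing selectors.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a fetched enbek.kz listing page.
SAMPLE_PAGE = """
<div class="vacancy"><span class="title">Teacher</span>
<span class="salary">150000</span></div>
<div class="vacancy"><span class="title">Doctor</span>
<span class="salary">200000</span></div>
"""

def parse_vacancies(html: str) -> list:
    """Extract one record per vacancy card from a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.vacancy"):
        rows.append({
            "title": card.select_one("span.title").get_text(strip=True),
            "salary": card.select_one("span.salary").get_text(strip=True),
        })
    return rows

def save_csv(rows, path="vacancies.csv"):
    """Append the parsed records to a structured CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "salary"])
        writer.writeheader()
        writer.writerows(rows)

rows = parse_vacancies(SAMPLE_PAGE)
save_csv(rows)
```

In the real pipeline this parser would be called in a loop over paginated search results fetched with an HTTP client.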
Utilizing Python libraries like Pandas and Matplotlib, we performed an in-depth
analysis of the collected data, identifying main trends such as the distribution of
jobs across various sectors, examining average salary levels by profession, and
analyzing the geographical distribution of job offers. Additionally, we employed
machine learning techniques, including clustering and time series analysis, to
forecast labor market trends. Finally, we visualized the results in interactive
dashboards, ensuring easy interpretation and demonstration of the data to key
stakeholders.
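A minimal sketch of the Pandas aggregations described above; the frame below uses made-up records and assumed column names (`profession`, `city`, `salary`), not our actual dataset.

```python
import pandas as pd

# Illustrative records standing in for the scraped CSV.
df = pd.DataFrame({
    "profession": ["Teacher", "Teacher", "Doctor", "Doctor", "Director"],
    "city": ["Astana", "Almaty", "Astana", "Almaty", "Astana"],
    "salary": [150000, 160000, 220000, 240000, 300000],
})

# Average salary level by profession.
avg_salary = df.groupby("profession")["salary"].mean()

# Geographical distribution of job offers.
by_city = df["city"].value_counts()
```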

VISUALIZATION
To effectively communicate the insights we've gathered from our job market data
analysis, we will employ a series of visualizations. These visual representations,
crafted using powerful libraries like matplotlib and seaborn, are designed to make
complex data more accessible and understandable. We will be showcasing various
types of visualizations, each tailored to highlight specific aspects of the data:

a. Bar Graphs: We created bar graphs representing average salaries across
various occupations, providing a clear comparison and highlighting salary
variations and trends within each profession.

b. Scatter Plots: By plotting work experience against occupational areas, we
explored the correlation between these variables. This helped us identify
patterns, outliers, and dependencies, offering insights into the relationship
between experience and profession.
c. Bar Charts: Focusing on educational levels among different professional
groups, we used bar charts to compare and contrast the educational
backgrounds required in various fields.

d. Box Plots: We utilized box plots to analyze and compare the distribution of
salaries across different educational levels, revealing insights into the impact
of education on wage patterns.
e. Pie Charts: To understand the geographical distribution of job
opportunities, we created pie charts representing the most common job
locations, helping stakeholders identify key areas for employment.

f. Line Charts: We used line charts to depict the dynamic nature of job
postings over time, providing an understanding of the ebb and flow in job
market demands.

g. Heatmaps: Our heatmaps provided a unique perspective on the number of
professions in relation to work time and job fields, offering a multifaceted
view of the job market.
Through these visualizations, we aim to provide a comprehensive and digestible
overview of our findings, making it easier for stakeholders to grasp the nuances
and dynamics of the current job market.
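As a sketch of how one of these charts might be produced with matplotlib, the bar graph of average salaries could look like the following; the aggregated figures are illustrative placeholders, not our actual results.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder averages standing in for the real aggregation.
avg_salary = pd.Series({"Teacher": 155000, "Doctor": 230000, "Director": 300000})

fig, ax = plt.subplots()
avg_salary.plot(kind="bar", ax=ax)          # one bar per occupation
ax.set_ylabel("Average salary (KZT)")
ax.set_title("Average salary by occupation")
fig.tight_layout()
fig.savefig("avg_salary.png")
```

The same pattern extends to the other chart types via `kind="box"`, `kind="pie"`, `kind="line"`, or seaborn's `heatmap`.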

Data Analytics:
1) File Import and Verification:
Initiated the analysis by importing the data file, followed by a NaN check to ensure
data completeness and reliability, crucial for accurate analysis.
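This check can be sketched in Pandas as follows; the toy frame and its column names are illustrative, not the real CSV schema.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values, standing in for the imported CSV.
df = pd.DataFrame({
    "title": ["Teacher", "Doctor", None],
    "salary": [150000, np.nan, 300000],
})

# Count NaNs per column, then drop incomplete rows.
nan_counts = df.isna().sum()
clean = df.dropna()
```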

2) Removing Duplicates from the Data:
Employed code to identify and eliminate duplicate records, preventing redundant
information from skewing the analysis.
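In Pandas this amounts to a single call; the sample rows below are illustrative.

```python
import pandas as pd

# Frame with one exact duplicate row.
df = pd.DataFrame({
    "title": ["Teacher", "Teacher", "Doctor"],
    "salary": [150000, 150000, 220000],
})

# Keep the first occurrence of each identical row.
deduped = df.drop_duplicates()
```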

3) Using LabelEncoder:
Utilized LabelEncoder to transform categorical data from text to numeric format,
catering to the requirement of numerical data for most machine learning
algorithms.
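A minimal example of this transformation; the occupation labels are placeholders. Note that `LabelEncoder` assigns integer codes in sorted order of the unique labels.

```python
from sklearn.preprocessing import LabelEncoder

occupations = ["Teacher", "Doctor", "Teacher", "Director"]

encoder = LabelEncoder()
codes = encoder.fit_transform(occupations)
# encoder.classes_ keeps the sorted original labels for inverse lookup
```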

4) Splitting the 'PRICE' Column:
Segmented the 'PRICE' column into 'Minimum Price' and 'Maximum Price' for
granular analysis of pricing trends.
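One way to sketch this split, assuming 'PRICE' holds a salary range string such as "100000-150000"; the real column format may differ.

```python
import pandas as pd

# Assumed range format "min-max"; adjust the separator to the real data.
df = pd.DataFrame({"PRICE": ["100000-150000", "200000-250000"]})

parts = df["PRICE"].str.split("-", expand=True)
df["Minimum Price"] = parts[0].astype(int)
df["Maximum Price"] = parts[1].astype(int)
df = df.drop(columns=["PRICE"])
```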

5) Using Linear Regression to Predict:
Applied a linear regression model to forecast values based on existing data,
leveraging this technique's ability to predict one variable from others.
Output: MSE of Linear Regression: 1331731004.9552906
R2 of Linear Regression: 0.7676413189175552
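The fitting and scoring workflow can be sketched as below. The features here are synthetic, so this sketch will not reproduce the MSE and R² reported above; it only shows the procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features standing in for the encoded vacancy data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
```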

6) Assessing the Accuracy of the Decision Tree Model:
Evaluated the Decision Tree model's precision in making predictions based on the
given data, focusing on its decision-making accuracy at various nodes.
Output: Decision Tree Accuracy: 0.9712041884816754
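A sketch of this evaluation on synthetic classification data (so the accuracy above will not be reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the encoded vacancy features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
```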
7) Assessing the Accuracy of the K-Nearest Neighbors (KNN) Model:
Assessed the KNN model's accuracy; this technique labels a data point based on its
proximity to the 'K' nearest data elements.
Output: Accuracy of KNN: 0.6151832460732984
8) Assessing the Accuracy of the Random Forest Model:
Analyzed the performance of the Random Forest model, a more complex ensemble
approach combining multiple decision trees for enhanced prediction accuracy.
Output: Accuracy of Random Forest: 0.7643979057591623
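Both evaluations follow the same fit-predict-score pattern, sketched here on synthetic data (the accuracies reported above will not be reproduced by this stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data in place of the encoded vacancy features.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_acc = accuracy_score(y_test, knn.predict(X_test))

forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)
rf_acc = accuracy_score(y_test, forest.predict(X_test))
```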
Conclusion:
