
2021 4th International Conference on Artificial Intelligence and Big Data

Application of Data Mining in Predicting College Graduates Employment

Shouwu He, Xiaoying Li*, Jia Chen
Campus of Nanning, Guilin University of Technology, Nanning, China
heshou5@126.com, 409057436@qq.com, 8154649@qq.com

2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD) | 978-1-6654-1515-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICAIBD51990.2021.9459039

Abstract—With the popularization and enrollment expansion of higher education in China, the employment of college students has become a focus of public attention. In this study, taking the graduates of Grade 2016 at Guilin University of Technology as an example, data mining techniques are applied to predict employment from five influencing factors. Based on the CART algorithm, the Gini indexes were calculated and a decision tree was constructed. Furthermore, the random forest algorithm is used to improve the accuracy of employment prediction. After data collection, cleaning and conversion, 496 employment records were obtained, 70% of which were taken as training samples. The constructed model was tested on the remaining samples and the accuracy reaches 81%. Finally, the features academic achievement and graduation qualification are identified as the significant factors of students' employability. The combined decision tree and random forest model provides a new, feasible method for employment forecasting that is adaptable to the employment guidance work of schools.

Keywords—data mining, employment prediction, decision tree algorithm, CART, random forest

I. INTRODUCTION

Since the expansion of college enrollment in 1999, the gross enrollment ratio of higher education in China has risen rapidly. In this context, the number of college graduates continues to grow, reaching 8.74 million in 2020 [1]. Graduate employment has gradually become a focus of both society and academia. Applying data mining technology to employment prediction and uncovering the potential patterns can therefore provide a decision-making basis for employment guidance and effectively promote graduates' employability while they are still in school.

At present, many scholars have applied data mining to college employment. In [2,3], different classification algorithms offered by WEKA, including logistic regression, Naive Bayes and decision trees, were used to forecast graduate employment; the experiments identified logistic regression as the best classifier. Decision trees have been used by many authors to analyze and predict the employability of university graduates. Liu et al. integrated the basic information, scores and employment information databases of college graduates and used the ID3 decision tree algorithm to find the main factors affecting graduates' employment units [4]. Zhang et al. established a classification decision tree based on the C5.0 algorithm and explored the factors influencing college graduates' employment direction, such as academic performance, CET-4 and CET-6 scores, place of origin, major, failed subjects, and employment region and city [5]. Tang et al. collected the employment information of graduates of traditional Chinese medicine, studied the influencing factors of employment based on the C4.5 algorithm, and further used the random forest algorithm to improve the accuracy of employment forecasting [6]. Yang extracted four attributes affecting employment units, namely major, academic performance, English competence and computer skills, and predicted the employment trend of students based on the C4.5 algorithm; the accuracy of the classification model reached 81% [7]. Zhou et al. analyzed the employment data of graduates based on the C4.5 algorithm, verified the accuracy of the model by cross-validation, and obtained a prediction model of high precision and practicability [8].

Overall, data mining methods have proven useful for studying employability problems. Judged by accuracy, logistic regression, decision trees and random forests are among the best data mining techniques for employment studies [9]. However, most existing research concentrates on the influencing factors of employment and the accurate prediction of employment trends, and considers only factors such as students' academic performance, English competence and computer skills; few studies consider factors such as students' family background and part-time involvement in college associations, although these factors affect students' employment and the prediction results to a certain extent.

This study aims to develop a data mining model for predicting students' employment and to analyze the factors influencing graduates' job-hunting. The employment-related information of graduates was collected, the CART decision tree algorithm was applied to five factors affecting employment, and the learning accuracy was further improved by the random forest algorithm. The employment prediction model constructed in this paper provides a new method for employment guidance in colleges.

II. METHODOLOGY

A. Decision Tree Algorithm

*Corresponding author: Xiaoying Li (e-mail:409057436@qq.com)

978-0-7381-3170-2/21/$31.00 ©2021 IEEE 65

Authorized licensed use limited to: University of Glasgow. Downloaded on August 17,2021 at 19:10:06 UTC from IEEE Xplore. Restrictions apply.
The decision tree is a classical machine learning method, often used to solve classification and prediction problems. A general decision tree consists of one root node and a number of internal and leaf nodes [10]. An internal node corresponds to a feature, and the sample set is divided at internal nodes according to the feature's values. Leaf nodes indicate the class to be assigned to a sample, and the path from the root node to a leaf node corresponds to a classification rule. The main decision tree algorithms are ID3, C4.5 and CART.

B. CART Algorithm

Classification And Regression Trees (CART) uses the Gini index as the impurity measure when building the decision tree. Suppose there are K categories and p_k is the proportion of the k-th class in the current sample set D (k = 1, 2, ..., K). The Gini value of the data set D is:

    Gini(D) = 1 − Σ_{k=1}^{K} p_k²    (1)

In (1), the smaller the Gini value, the lower the impurity and the better the feature.

For the sample set D, assume the discrete attribute a has V possible values {a¹, a², ..., a^V}. If a is used to divide the sample set, the splitting Gini index can be calculated as follows:

    Gini(D, a) = Σ_{v=1}^{V} (|D^v| / |D|) · Gini(D^v)    (2)

For the feature set A, the attribute a* minimizing Gini(D, a) is selected as the optimal partition attribute, namely a* = argmin_{a ∈ A} Gini(D, a).

The process of the CART algorithm is as follows.

Input: data set D, feature set A = {a₁, a₂, ..., a_m}
Output: decision tree T
Steps:
(1) For the data set D of the current node, if the number of samples is less than a certain threshold or the feature set A = ∅, return the decision subtree.
(2) Calculate Gini(D) of the sample set D, and return the decision subtree if the Gini coefficient is less than the threshold value.
(3) For the data set D, calculate Gini(D, a) for each feature a (a ∈ A).
(4) Select the feature attribute a* that minimizes Gini(D, a) and establish the root node. According to the feature attribute a*, divide the data set into two parts D1 and D2, and establish the left and right child nodes of the current node.
(5) For the left and right child nodes, recursively call steps (1)-(4) to generate the decision tree.

C. Random Forest Algorithm

Random forest is an ensemble, supervised machine learning algorithm that uses decision trees as base classifiers. In essence, random forest is an improved decision tree algorithm: through random selection of samples and feature attributes, each tree will not over-fit even if it is not pruned. The process of the random forest algorithm is as follows.

Input: data set D, the number of samples N, the total number of feature attributes M, the number of trees in the random forest L, the number of random features of each decision tree m (m ≤ M)
Output: random forest T
Steps:
(1) For i = 1, 2, ..., L:
a) Sample randomly with replacement from the data set D and select n samples (bootstrap sampling) as the training sample set of each tree.
b) In the construction of the decision tree, randomly select m attributes from the total feature set M. According to the feature selection method (such as the Gini index), calculate the feature indexes and select the best split feature.
c) Follow step b) to split the node until it can no longer be split, and return T_subi.
(2) Combine the L fully grown decision trees above into the random forest T and return T.
(3) For the samples in the test set, after the decision of each tree T_subi, adopt voting or a weighted average as the final prediction value.

III. EXPERIMENT

This study aims to find a classification model of student employment and better forecast whether graduates can find jobs. Fig. 1 explains the research procedures, including data collection, data preprocessing, model construction and model evaluation.

Fig. 1. Research procedures.

A. Data Collection

This study integrated the information of Grade 2016 graduates at the Department of Computer Application in Guilin University of Technology, including the basic information,

academic performance and employment data. Five main factors are extracted: academic achievement, scholarship, graduation qualification, family status (whether poor or not) and association membership.

A detailed description of each attribute is as follows. The scholarship state, i.e. whether the student was awarded a scholarship, is represented as "yes" or "no". The graduation qualification status is likewise "yes" or "no"; to qualify for graduation, students have to complete the required courses, and a student with many failed subjects usually finds it hard to graduate. Family background mainly refers to whether the family is poor or not. Involvement in school associations or clubs is represented as either "yes" or "no". Finally, the employment status is treated as a dichotomy, employment or unemployment, namely "yes" or "no".

B. Data Preprocessing

The discretization method of academic performance in [11] was adopted. Students' scores can be regarded as a random variable X that obeys, or approximately obeys, the normal distribution X ~ N(μ, σ²), where μ is the mean and σ is the standard deviation of X. According to the principles of mathematical statistics, the sample mean x̄ and sample standard deviation s are unbiased estimates of μ and σ, so μ and σ can be replaced by x̄ and s respectively. According to the characteristics of the sample data, academic scores are graded from A to C: A-level: X > x̄ + 0.43s; B-level: x̄ − 0.43s ≤ X ≤ x̄ + 0.43s; C-level: X < x̄ − 0.43s.

The data schema is shown in Table I. The five features, namely academic achievement, scholarship, graduation qualification, family status and association membership, are abbreviated as scoreLevel, isScholar, isGQ, isPoor and isAM respectively.

TABLE I. GRADUATE EMPLOYMENT DATA

Feature Attribute | Values | Description
scoreLevel | A, B and C | academic achievement
isScholar | yes, no | whether awarded a scholarship or not
isGQ | yes, no | whether graduation qualification is obtained or not
isPoor | yes, no | family status, whether poor or not
isAM | yes, no | association membership, students' involvement in school associations or clubs
Employment | yes, no | type of graduate employment

In the experiment, records with incomplete information were deleted, leaving a total of 496 samples, of which 70% form the training data and the remaining 30% the test data. For data conversion, the attribute value "yes" is represented by 1 and "no" by 0, and the score grades C, B and A are represented by 1, 2 and 3 respectively. The number of training samples in each category of each attribute is shown in Table II.

TABLE II. TRAINING DATA SET

Feature Attribute | Value | Employment = 0 | Employment = 1
scoreLevel | 1 | 34 | 55
scoreLevel | 2 | 29 | 103
scoreLevel | 3 | 9 | 117
isScholar | 0 | 53 | 132
isScholar | 1 | 19 | 143
isGQ | 0 | 20 | 23
isGQ | 1 | 52 | 252
isPoor | 0 | 59 | 168
isPoor | 1 | 13 | 107
isAM | 0 | 65 | 229
isAM | 1 | 7 | 46

C. Establishing the Decision Tree

1) The Gini index of the training set: In the training set, 275 students are employed and 72 are unemployed, so the Gini index of the training data can be calculated:

    Gini(D) = 1 − [(72/347)² + (275/347)²] ≈ 0.329

2) Gini value of each feature attribute:
a) scoreLevel: Because the CART algorithm adopts binary splits, according to the value of scoreLevel the data set D is split into {D1, D2} and {D3}, and the Gini values can be calculated:

    Gini(D{1,2}) = 1 − [(63/221)² + (158/221)²] ≈ 0.408
    Gini(D3) = 1 − [(9/126)² + (117/126)²] ≈ 0.133
    Gini(D, scoreLevel) = (126/347) × 0.133 + (221/347) × 0.408 ≈ 0.308

b) isScholar:

    Gini(D0) = 1 − [(53/185)² + (132/185)²] ≈ 0.409
    Gini(D1) = 1 − [(19/162)² + (143/162)²] ≈ 0.207
    Gini(D, isScholar) = (185/347) × 0.409 + (162/347) × 0.207 ≈ 0.315

c) Others: Similarly, we can calculate the Gini values of the other attributes: Gini(D, isGQ) ≈ 0.311 and Gini(D, isPoor) ≈ 0.318.
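The Gini computations used to pick the root split can be reproduced directly from the Table II counts. The following is a minimal sketch, not the paper's code; the helper names `gini` and `split_gini` are ours:

```python
# Reproduce the paper's Gini calculations from the Table II counts.
# Each count pair is (unemployed, employed) for one attribute value.

def gini(counts):
    """Gini impurity of a node given its per-class sample counts, as in (1)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(groups):
    """Weighted Gini index of a split, as in (2); groups is a list of count pairs."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

# Whole training set: 72 unemployed, 275 employed.
print(round(gini((72, 275)), 3))                    # Gini(D)

# scoreLevel <= 2 vs scoreLevel == 3: (34+29, 55+103) vs (9, 117).
print(round(split_gini([(63, 158), (9, 117)]), 3))  # Gini(D, scoreLevel)

# isScholar = 0 vs isScholar = 1.
print(round(split_gini([(53, 132), (19, 143)]), 3)) # Gini(D, isScholar)
```

Choosing the attribute with the smallest weighted Gini value reproduces the paper's selection of "scoreLevel ≤ 2" as the root split.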

Gini D, isAM  = 0.326 In addition, for students with low marks, whether to
From the above results, it can be seen that the Gini index of participate in associations becomes the main factor affecting
"scoreLevel <= 2" is the smallest, so it is taken as the root node employment. At the same time, the needy students have higher
of decision tree. In the experiment, Python language is used to employment rate, because of the employment idea --" first job
implement CART algorithm. In order to avoid over-fitting in to reduce the family burden” .
decision tree, pruning is usually carried out. The decision tree Based on the observations in experiment, It can be
by post-pruning is shown in Fig. 2. concluded that improving professional skills and changing
students’ employment concepts is conducive to enhancing the
employability of college graduates.
IV. DISCUSSION
A. Model Evaluation
After the algorithm model is established, it needs to be
evaluated to judge the quality of the model [12]. Generally,
training set is used to build the model and test set is used to
evaluate the model. In the CART model in Fig. 2, the depth of
the tree is selected as 5, the correct rate is 77.18%. The depth
of decision tree is changed from 1 to 20, and the learning curve
of the decision tree is obtained in Fig. 3.

Fig. 2. Decision tree by post-pruning.

D. Generating Classification Rules


According to the decision tree in Fig. 2, we can extract
some prediction rules of employment.
a) If scoreLevel = 3 and isGQ = 1, then employment =
yes.
b) If scoreLevel = 3 and isGQ = 0, then employment =
difficulty. Fig. 3. The depth and accuracy of decision tree.
c) If scoreLevel = 2 and isGQ = 1, then employment = In Fig. 3, when the depth of decision tree is low, the
yes. accuracy rate is higher. When the depth of the tree increases,
d) If scoreLevel = 1 and isGQ = 1 and isAM = 1, then the model is not over-fit. The accuracy rate drops rapidly and
employment = yes. tends to be stable gradually. The accuracy rate fluctuates
e) If scoreLevel = 1 and isGQ = 1 and isAM = 0, then between 77% and 79%. On the whole, the accuracy is not very
employment = difficulty. high.
f)If scoreLevel < 3 and isGQ = 0 and isAM = 1, then Furthermore, we implement random forest algorithm on the
employment = yes. same data set, based on the scikit-learn module in Python. The
g) If scoreLevel < 3 and isGQ = 0 and isAM = 0 and number of trees in random forest was adjusted from 0 to 50,
isSchorlar = 1, then employment = yes. and the accuracy was obtained in Fig. 4. As can be seen from
Fig. 4, when the number of trees is small, the accuracy
h) If scoreLevel < 3 and isGQ = 0 and isAM = 0 and
fluctuates greatly. With the increase of the number of trees, the
isSchorlar = 0 and isPoor = 1, then employment = yes.
classification accuracy is gradually stable. The accuracy
i)If scoreLevel < 3 and isGQ = 0 and isAM = 0 and fluctuates between 80% and 81%. Compared with CART
isSchorlar = 0 and isPoor = 0, then employment = difficulty. model, the accuracy of random forest is higher and the
E. Results and Analysis classification results are better.
According to the classification rules, it can be seen that the
important factors determining students’ employment are
scoreLevel and isGQ. If the students have good academic
performance, few failed subjects, and part-time work in
associations or get scholarships, they are more likely to find
employment. Students with medium academic record and not
many failing subjects can basically get jobs. However, students
with poor academic records are facing more difficulties in
finding job. It can be seen that students' professional ability is
very important for employment.
Fig. 4. Learning curve of random forest.
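The evaluation procedure described above can be sketched with scikit-learn. Since the paper's data set is not public, the sketch below substitutes synthetic binary-labelled data with five features; the split sizes mirror the paper, but every number this code produces is illustrative, not the paper's result:

```python
# Sketch of the model-evaluation procedure: a depth-varied CART-style tree
# (cf. Fig. 3) and a random forest with Table III-style parameters (cf. Fig. 4).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 496 samples, 5 features, binary employment label.
X, y = make_classification(n_samples=496, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 split as in the paper

# Test accuracy for tree depths 1..20 (the learning curve of Fig. 3).
depth_scores = {
    d: DecisionTreeClassifier(criterion="gini", max_depth=d, random_state=0)
       .fit(X_train, y_train).score(X_test, y_test)
    for d in range(1, 21)
}

# Random forest with the tuned parameter values reported in Table III.
rf = RandomForestClassifier(n_estimators=21, criterion="gini", max_depth=5,
                            min_samples_split=13, min_samples_leaf=1,
                            random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))    # test-set accuracy
print(rf.feature_importances_)     # per-feature importance (cf. Fig. 5)
```

On the real data one would plot `depth_scores` and the forest's accuracy against `n_estimators` to obtain curves analogous to Figs. 3 and 4.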

The main parameters to adjust when using the random forest algorithm are n_estimators, criterion, max_depth, min_samples_split, min_samples_leaf, and so on. The model optimized by cross-validation is listed in Table III; its accuracy finally reaches 81%.

TABLE III. THE OPTIMIZED RANDOM FOREST MODEL

Parameter | Value | Description
n_estimators | 21 | the number of trees in the forest
criterion | gini | classification criterion, Gini or entropy
max_depth | 5 | maximum depth of a tree
min_samples_split | 13 | the minimum number of samples required to split an internal node
min_samples_leaf | 1 | the minimum number of samples in a leaf

B. Analysis of Feature Importance

The feature importance in the decision tree and random forest models, that is, the influence of each attribute on employment, is shown in Fig. 5. The most important features in both tree models are scoreLevel and isGQ. The changes in the isPoor and isAM values between the two models can be understood as the influence of ensemble learning.

Fig. 5. Feature importance.

V. CONCLUSION

As the undergraduate employment situation becomes more and more severe, employment instruction is an important part of student management in colleges and universities, and applying data mining algorithms to forecast students' employment is an urgent task. Based on the CART decision tree algorithm, this paper establishes an employment prediction model and analyzes the influencing factors of employment, then further adopts the random forest algorithm to improve the learning accuracy. The experimental results show that the combination of the decision tree and random forest algorithms can effectively predict students' employment situation. The proposed model provides a new method for employment prediction and has practical application value.

However, some problems still need further research. More employment data should be collected to increase the sample size and improve the accuracy of the forecasting results. More attributes affecting employment should be considered, such as students' gender, major and social practice. In addition, the types of employment should be divided in more detail.

ACKNOWLEDGMENT

This work was supported by the Guangxi Education Department Basic Research Ability Promotion Project for young and middle-aged teachers in universities in China, under Nos. 2019KY0270 and 2018KY0240.

REFERENCES

[1] C.J. Yue, J. Xia and W.Q. Qiu, "An empirical study on graduates' employment: Based on 2019 national survey," Journal of East China Normal University (Educational Sciences), no. 4, pp. 1–17, 2020.
[2] M. T. R. and Y. Yusof, "Application of data mining in forecasting graduates employment," Journal of Engineering and Applied Sciences, vol. 12, pp. 4202–4207, 2017.
[3] K. C. Piad, M. Dumlao, M. A. Ballera and S. C. Ambat, "Predicting IT employability using data mining techniques," 2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC), Moscow, Russia, 2016, pp. 26–30.
[4] Z. Liu and Z.G. Zhao, "Analysis and calculation of high school graduate student based on data mining," Journal of Shenyang Normal University (Natural Science Edition), vol. 34, pp. 105–108, January 2016.
[5] L.Y. Zhang, F.C. Wang and Z.Y. Han, "Analysis of the influencing factors of university graduates employment based on decision tree algorithm--A case study of information college of Beijing Forestry University," Forestry Education in China, vol. 35, pp. 46–51, March 2017.
[6] Y. Tang and P. Wang, "Study on employment forecasting of graduates of traditional Chinese medicine based on C4.5 and random forest algorithm," China Medical Herald, vol. 14, pp. 166–169, August 2017.
[7] F. Yang, "Decision tree algorithm based university graduate employment trend prediction," Informatica, vol. 43, no. 4, pp. 573–579, 2019.
[8] F.J. Zhou, L.X. Xue, Z.Q. Yan and Y.X. Wen, "Research on college graduates employment prediction model based on C4.5 algorithm," J. Phys.: Conf. Ser., vol. 1453, pp. 1–6, 2020, CISAI 2019.
[9] A. Moumen, E. H. Bouchama and Y. El Bouzekri El Idirissi, "Data mining techniques for employability: Systematic literature review," 2020 IEEE 2nd International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), Kenitra, Morocco, 2020, pp. 1–5.
[10] Z.H. Zhou, Machine Learning, Beijing: Tsinghua University Press, 2016, pp. 74–79.
[11] L.X. Wang and H. Xu, "On an approach to assessment of students' academic grades," Journal of Yanbian University (Natural Science), vol. 27, pp. 304–307, December 2001.
[12] C. D. Casuat and E. D. Festijo, "Predicting Students' Employability using Machine Learning Approach," 2019 IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS), Kuala Lumpur, Malaysia, 2019, pp. 1–5.
