2nd INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND SOFTWARE ENGINEERING (ICACSE-2019)

Prediction Analysis Techniques of Data Mining: A Review

Mohini Chakarverti1, Nikhil Sharma2 and Rajiva Ranjan Divivedi3

Abstract—Data mining is a technique for extracting necessary information from the raw information held in databases. The prediction analysis techniques provided by data mining make it possible to predict future scenarios from current information. Prediction analysis is a combination of clustering and classification, and many researchers have presented techniques for it. In this review paper, techniques proposed by various authors are analyzed to understand the latest trends in prediction analysis. Prediction analysis techniques have two steps: feature extraction and classification. The classification techniques are reviewed in terms of certain parameters and compared in terms of their outcomes.
Keywords: Classification, Clustering, K-means, SVM



1Research Scholar, IEC College of Engineering and Technology, Greater Noida, India
2,3Assistant Professor, IEC College of Engineering and Technology, Greater Noida, India
E-mail: 1mchakarverti@gmail.com, 2nikhilsharma1694@gmail.com, 3rajiv.ranjan.0077@gmail.com

Electronic copy available at: https://ssrn.com/abstract=3350303

I. INTRODUCTION

Data mining is the process of analyzing information for patterns and extracting interesting knowledge. Various data mining tools are available for analyzing different types of data. Applications of data mining include decision making, market basket analysis, production control, customer retention, scientific discovery and education systems [1]. Clustering refers to grouping similar data into the same cluster and dissimilar data into different clusters. The clusters are generated by analyzing similar patterns in the input data. In biology, clustering is used to derive plant and animal taxonomies, to categorize genes with the same functionality and to gain insight into structures inherent in populations. In geology, clustering can be employed to identify similar houses and land areas in a city. Clustering can also be used to classify the documents available on the Web for information discovery. Data clustering is an unsupervised classification method that creates clusters such that objects in the same cluster are very similar to each other and objects in different clusters are distinct. In data mining, cluster analysis is considered a traditional topic that is applied for knowledge discovery. The data objects are grouped into a set of disjoint classes, which are known as clusters [2]. The similarity of objects that belong to one class is high in comparison to objects that belong to separate classes. Prediction analysis extracts information from previously existing data sets so that the patterns among them and the possible future outcomes can be determined. Prediction analysis does not state with certainty what will happen in the future.

Rather, the prediction analysis process provides forecasts and risk assessment with an acceptable level of reliability for the application. This approach thus helps in predicting future possibilities. Applied to business, predictive models analyze currently available data and historical facts so that the feedback of customers on products can be understood. Such analysis also helps in recognizing the potential risks and opportunities in the data. Several techniques, including machine learning, statistical modelling and data mining, are applied to make future business forecasts. Predictive analytics then uses the extracted information to predict trends and behavioural patterns. Predictive web analytics calculates the statistical probabilities of future events online.

Predictive analytics can be applied to any unknown event of interest, whether in the past, present or future. Predictive analytics software applications use variables that can be measured and analyzed to predict the likely behaviour of individuals. For instance, an insurance company assessing potential driving safety considers variables such as driving record, pricing, age, gender, location and type of vehicle. Predictive analytics requires a high level of expertise in statistical methods and the ability to build predictive data models. Data engineers help in gathering relevant data and preparing it for analysis. Data visualization, dashboards and reports are supported as well.
This support comes from software developers and business analysts. Clustering methods are divided into the following categories:
a. Partitioning Methods: These methods collect the samples into clusters so that the samples within a cluster are highly similar, while dissimilar samples are placed in different clusters. These methods rely completely on the distance between the samples [3].
b. Hierarchical Methods: These methods decompose a given dataset of objects hierarchically. Depending on how the decomposition is formed, they are classified as divisive or agglomerative [4]. The agglomerative technique is bottom-up: the first step is the formation of separate groups, and groups that are near to each other are then merged.
c. Density based Methods: Many techniques separate the objects into clusters on the basis of the distance amongst the objects. However, such distance-based techniques can only identify spherical shaped clusters and have difficulty obtaining clusters of arbitrary shape. Density based clustering was developed to find clusters of arbitrary shape.
d. Grid based Methods: These methods generate a grid structure by quantizing the object space into a finite number of cells. They are fast, and their processing time does not depend on the number of data objects.

A. Classification in Data Mining

Within data mining, the classification technique is used to predict group membership for data instances [5]. Prediction analysis is the process in which an outcome is predicted on the basis of current data. For example, on the basis of current weather information, a day can be predicted to be “sunny”, “rainy” or “cloudy”. Two steps are followed within this process:
a. Model Construction: Model construction describes a set of predetermined classes. The set of tuples used to construct the model is known as the training set. The model is represented as classification rules, decision trees or mathematical formulae/regression.
b. Model Usage: The second step in classification is model usage, in which the model classifies unknown data. For accuracy analysis, the model is first applied to test data [6]: the known label of each test sample is compared with the result of the classification by the model. The test set is independent of the training set.

B. SVM Classifier

The SVM classifier was proposed for regression, classification and general pattern recognition. It is considered good in comparison to other classifiers because of its high generalization performance without the need to add prior knowledge, even when the dimension of the input space is extremely high. The SVM has to identify the best classification function for distinguishing between the training data of the two classes. The metric for the best classification function can be represented geometrically [7]. For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane f(x) that passes through the middle of the two classes, separating them. Once this function is determined, a new data instance xn is classified by testing the sign of f(xn); xn belongs to the positive class if f(xn) > 0.

Because there are many such linear hyperplanes, an important objective of SVM is to determine the best function by maximizing the margin between the two classes. The margin is the amount of space, or distance, between the two classes as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance between the closest data points and a point on the hyperplane. This definition shows how to maximize the margin, so that only a few hyperplanes qualify as the solution to SVM even though many hyperplanes are available [8].

SVM can be extended to perform regression analysis, where the aim is to produce a linear function that approximates the target function. The choice of error model matters for support vector regression (SVR): the error is defined to be zero when the difference between the real and predicted values is within an epsilon amount; otherwise, the epsilon-insensitive error grows linearly. The support vectors can be learned through minimization of the Lagrangian. A benefit of support vector regression is its insensitivity to outliers. A demerit of SVM is that its computations are not efficient enough, and many solutions have been proposed for this. One is to break one big problem into a number of smaller problems, each involving only a few selected variables so that it can be optimized efficiently; the process works iteratively until all the problems are solved. Another is to treat the problem of learning an SVM as that of finding an approximate minimum enclosing ball of a set of instances.

This review paper is based on prediction analysis, which is generally done with classification techniques. Section 1 gives the introduction to prediction analysis.
Section 1 also presents various classification techniques. Section 2 gives the literature survey on prediction analysis.

II. LITERATURE REVIEW

Min Chen, et al. presented [9] a novel convolutional neural network based multimodal disease risk prediction (CNN-MDRP) algorithm. The data was gathered from a hospital and included both structured and unstructured data. Various machine learning algorithms were streamlined to make predictions about chronic disease that had spread in several regions. A prediction accuracy of 94.8% was achieved, along with a higher convergence speed in comparison to other similar enhanced algorithms.

Akhilesh Kumar Yadav, et al. presented an analysis of different analytic tools used to extract information from large datasets, such as in the medical field, where a huge amount of data is available [10]. The proposed algorithm was tested in different experiments and gave excellent results on real data sets. On a real-world problem it achieved better results than the existing simple k-means clustering algorithm.

Sanjay Chakrabotry, et al. (2014) presented an analysis of clustering tools for forecasting [11]. Weather forecasting was performed using the proposed generic methodology of incremental K-means clustering. The modelled computations make the forecasting and prediction of weather events easier. In the final section, the authors performed different experiments to check the correctness of the proposed approach.

Chew Li S., et al. (2013) presented [12] the Student Performance Analysis System (SPAS), which keeps track of the recorded results of the students of a particular university. The proposed system was designed and analyzed to predict students' performance from their results data. The rules generated by the data mining technique are used by the proposed system and provide enhanced results in predicting students' performance. A data mining classification technique classifies existing students on the basis of their grades.

Qasem A., et al. (2013) suggested that data analysis prediction [13] is an important subject for forecasting stock returns. Future data can be predicted by investigating the past. Stock market investors have used historical knowledge to predict better timing for buying or selling stocks. Amongst the different data mining techniques available, the authors used a decision tree classifier in this work.

K. Rajalakshmi, et al. (2015) presented a study [14] related to the fast-growing medical field. In this field a large amount of data is generated every single day, and handling this amount of data is not an easy task. Medical prediction systems based on medical data mining produce optimum results. The K-means algorithm was used to analyze different existing diseases. The proposed data mining based prediction system reduced cost and human effort.

Bala Sundar V., et al. (2012) examined [15] real and artificial datasets used to predict the diagnosis of heart diseases with the help of a K-means clustering technique, in order to check its accuracy. K-means clustering, which is a part of cluster analysis, partitions the observations into k clusters such that each observation belongs to the cluster with the nearest mean. The first step is a random initialization over the whole data, after which each observation is assigned to one of the k clusters. The proposed scheme of integrated clustering was tested, and its results show that the highest robustness and accuracy rate can be achieved using it.

Daljit Kaur, et al. (2013) explained [16] how data is divided using clustering: similar objects are clustered in the same group and dissimilar objects are placed in different clusters. The proposed algorithm was tested, and the results show that it reduces numerical calculation effort and complexity while remaining easy to implement. The proposed algorithm is also able to solve the dead unit problem.

Ming, J., et al. (2018) noted that multi-dimensionality and nonlinearity are the important characteristics of technical and economic data. Big data and data mining analysis approaches are used to study such data. The fluctuation pattern and influencing factors of mineral product prices are simplified [17]. A prediction model for missing geological data is established on the basis of geostatistics and artificial neural network techniques. The proposed model supports analysis and discussion of the regularity of the geological data of group boreholes. The performance results show that the proposed model is robust and has high prediction accuracy.

Sakhare, A.V., et al. (2017) presented a survey of road accident analysis techniques, which play an important role in transportation. Road accident data analysis is described using various data mining techniques. The paper also studied the k-means algorithm. A self-organizing map (SOM) is used to create and analyze the clusters [18]. This self-organizing technique uses a neural network along with an unsupervised learning method. The developed technique helps in improving the accuracy. Improving the road transportation system is important for reducing deaths and injuries. By applying the proposed approach, the reasons for accidents can be predicted and the accuracy of analysis can be improved to a greater extent in comparison to the k-means clustering algorithm.

Chauhan, C., et al. (2017) presented a review of various algorithms and techniques for crime analysis.
These algorithms and techniques help in identifying criminals. After several reviews it was seen that the ID3 algorithm was the most efficient [19]. When analyzing the experimental data, this algorithm generated highly effective classification rules. Hidden link algorithms were used to detect hidden links in networks of co-offenders, which helped in showing possible future partners in crime. With the application of Bayes theorem, the accuracy of classification techniques was improved to 90%. A forensic tool kit was used to analyze the data and the victim system in which the attacks occurred, and it also helped in generating the file. It was concluded that violent crimes were solved by applying the Criminal Investigation Analysis (CIA) tool, though with limited accuracy.

Anoopkumar M., et al. (2016) presented a comprehensive study of previous research in Educational Data Mining (EDM). Educational data is analyzed by different techniques in order to improve the academic performance of students and, in turn, the effectiveness of institutions. The study collects and categorizes the literature, identifies the preceding work and presents it to computing educators and professionals. It gives well-founded advice for the education and strengthening of weaker students within the institution [20]. These studies have provided good results in improving the pedagogical process, predicting the performance of students, comparing the accuracy of data mining algorithms and assessing the maturity of open source tools.

Lee, E., et al. (2018) proposed an international competition on game data mining. The commercial game log data was obtained from NCSOFT, one of the major game companies. The researchers made the data open so that data mining techniques could be developed and applied on the game log data. Blade & Soul, an action role-playing game from NCSOFT, was used to collect the data for the competition [21]. The data comprised around 100 GB of game logs from 10,000 players. Predicting the possibility that a player will churn is the major objective of the competition. The competition covered two periods, between which the business model changed from a monthly subscription to a free-to-play model; these helped in defining the time at which a player would churn. According to the competition results, the highly ranked competitors applied deep learning, tree boosting and linear regression techniques.
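
Several of the studies reviewed above rely on k-means clustering: a random initialization, the assignment of each observation to the cluster with the nearest mean, and the recomputation of the means. The following is a minimal illustrative sketch of that procedure in Python. It is not the implementation used by any of the cited authors; the names kmeans, nearest and squared_distance, the parameters max_iter and seed, and the handling of empty clusters are assumptions made here for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(p, centroids[c]))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Random initialization: k distinct observations become the first means.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are stable too
        labels = new_labels
        # Update step: each mean is recomputed from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
            # Assumption: an empty cluster simply keeps its previous mean.
    return labels, centroids
```

For example, kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2) places the three points near the origin in one cluster and the three points near (10, 10) in the other. Keeping the previous mean for an empty cluster is only a placeholder choice and is not the dead unit remedy of [16].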

Table 1: Comparison of various techniques

Authors | Techniques/Algorithms Used | Datasets | Attributes | Tools | Shortcoming | Results
--- | --- | --- | --- | --- | --- | ---
Min Chen, et al. | Naïve Bayesian, KNN and Decision tree | Heart Diseases | 79 | MATLAB | This classifier has high complexity. | Decision tree performs better in comparison to other classifiers.
Akhilesh Kumar Yadav, et al. | Foggy K-mean Algorithm | Lung cancer Data | 9 | WEKA | Complexity is high. | Foggy k-mean performs well as compared to K-means.
Sanjay Chakrabotry et al. | Incremental k-mean clustering Algorithm | Air pollution Data | 7 | WEKA | Accuracy is less. | The accuracy of proposed method is achieved up to 83.3 percent.
Chew Li S. et al. | BF Tree classifier | Student’s Performance | 9 | WEKA | Complexity is high which increases the execution time. | BF Tree performs well as compared to other tree classifiers.
Qasem A. et al. | Decision tree | STOCK Data Prediction | 170 | WEKA | Accuracy is less which can be increased. | C4.5 classifier performs well as compared to ID3.
K. Rajalakshmi | Medical fast growing field | Prediction based systems | 3 | Python | A large amount of data has been generated and to handle this much of large amount of data | The cost effectiveness and human effects have been reduced using proposed prediction system based data mining.
Bala Sundar | real and artificial datasets | to predict diagnosis of heart diseases | 5 | WEKA | The clusters are partitioned into k number of clusters by clustering which is the part of cluster analysis. | Show that the highest robustness, and accuracy rate can be achieved using it.
Daljit Kaur | contains similar objects has been divided using clustering | dissimilar objects | 12 | Python | algorithm is able to reduce efforts of numerical calculation and complexity | The proposed algorithm is also able to solve dead unit problem.
Ming, J | multi-dimensionality and nonlinearity the Characteristics of the technical | technical and economic data | 2 | MATLAB | The prediction model of the geological missing data is established on the basis of techniques of geo statistics and artificial neural network. | during the process of mineral development there is a loss of a lot of geological data that decreases
Sakhare | a survey of road accident analysis methods an important role played in transportation | Road Accident Data Analysis | 2 | WEKA | This paper studied the k-mean algorithm in proper manner. SOM is used to create and analyze the clusters. A self organizing technique uses the neural network along with an unsupervised learning method. | The accident reasons can be predicted and the accuracy of analysis can be improved to a greater extent in comparison to the k-means clustering algorithm by applying the proposed approach.
Chauhan, C., & Sehgal, S | Review of various algorithms and techniques which help in identifying the criminals | Criminal Data Analysis | 3 | MATLAB | Detecting the hidden links of networks of co-offenders was done using hidden link algorithms which helped in showing the possible future of crime partner. | It was concluded that the violent crimes were solved and the accuracy was limited by applying Criminal investigation analysis (CIA) tool.
Anoopkumar | different Data Mining Methods especially the mostly utilized | comprehensive survey | 5 | MATLAB | For improving the academic performances of students and then improving the effectiveness of institutions, the educational data is analyzed by different techniques. | To ameliorate the pedagogical process, presage the performance of students, provide a comparative analysis of precision of data mining algorithms and recognize the maturity of open source implements, these studies have provided good result outcomes.
Lee, E. | Commercial game log data competition framework was used for game data mining | tested on the game log data of Blade & Soul of NCSOFT | 3 | MATLAB | From one of the major game companies called NCSOFT, the commercial game log data was extracted to propose this technique. | Deep learning, tree boosting and linear regression techniques were applied as per the results achieved through the competitions amongst highly ranked competitors.

III. CONCLUSION

Prediction analysis is the data mining technique that predicts the future from current information. It combines clustering and classification: a clustering algorithm groups the data according to similarity, and a classification algorithm assigns a class to the data. In this paper, several prediction analysis algorithms are reviewed and analyzed in terms of many parameters. The literature survey covers various techniques of prediction analysis, from which a problem is formulated. The formulated problem can be solved in future to increase the accuracy of prediction analysis.

REFERENCES

[1] Abdelghani Bellaachia and Erhan Guven (2010), “Predicting Breast Cancer Survivability Using Data Mining Techniques”, Washington DC 20052, vol. 6, 2010, pp. 234-239.
[2] Oyelade, O. J, Oladipupo, O. O and Obagbuwa, I. C (2010), “Application of k-Means Clustering algorithm for prediction of Students’ Academic Performance”, International Journal of Computer Science and Information Security, vol. 7, 2010, pp. 123-128.
[3] Azhar Rauf, Mahfooz, Shah Khusro and Huma Javed (2012), “Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and Time Complexity”, Middle-East Journal of Scientific Research, vol. 12, 2012, pp. 959-963.
[4] Osamor VC, Adebiyi EF, Oyelade JO and Doumbia S (2012), “Reducing the Time Requirement of K-Means Algorithm”, PLoS ONE, vol. 7, 2012, pp. 56-62.
[5] Azhar Rauf, Sheeba, Saeed Mahfooz, Shah Khusro and Huma Javed (2012), “Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and Time Complexity”, Middle-East Journal of Scientific Research, vol. 5, 2012, pp. 959-963.
[6] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, 2009, Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), volume 3, issue 12, pp. 551-559.
[7] Chuan-Yu Chang, Chuan-Wang Chang, Yu-Meng Lin (2012), “Application of Support Vector Machine for Emotion Classification”, 2012 Sixth International Conference on Genetic and Evolutionary Computing, volume 12, issue 5, pp. 103-111.
[8] Himani Bhavsar, Mahesh H. Panchal (2012), “A Review on Support Vector Machine for Data Classification”, 2012, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 1, Issue 10.
[9] Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang (2017), “Disease Prediction by Machine Learning over Big Data from Healthcare Communities”, 2017, IEEE, vol. 15, 2017, pp. 215-227.
[10] Akhilesh Kumar Yadav, Divya Tomar and Sonali Agarwal (2014), “Clustering of Lung Cancer Data Using Foggy K-Means”, International Conference on Recent Trends in Information Technology (ICRTIT), vol. 21, 2013, pp. 121-126.
[11] Sanjay Chakrabotry, Prof. N.K Nigwani and Lop Dey (2014), “Weather Forecasting using Incremental K-means Clustering”, vol. 8, 2014, pp. 142-147.
[12] Chew Li Sa., Bt Abang Ibrahim, D.H., Dahliana Hossain, E. and bin Hossin, M. (2014), “Student performance analysis system (SPAS)”, in Information and Communication Technology for The Muslim World (ICT4M), 2014 The 5th International Conference on, vol. 15, 2014, pp. 1-6.
[13] Qasem A. Al-Radaideh, Adel Abu Assaf and Eman Alnagi (2013), “Predicting Stock Prices Using Data Mining Techniques”, The International Arab Conference on Information Technology (ACIT’2013), vol. 23, 2013, pp. 32-38.
[14] K. Rajalakshmi, Dr. S. S. Dhenakaran and N. Roobin (2015), “Comparative Analysis of K-Means Algorithm in Disease Prediction”, International Journal of Science, Engineering and Technology Research (IJSETR), Vol. 4, 2015, pp. 1023-1028.
[15] Bala Sundar V, T Devi and N Saravan (2012), “Development of a Data Clustering Algorithm for Predicting Heart”, International Journal of Computer Applications, vol. 48, 2012, pp. 423-428.
[16] Daljit Kaur and Kiran Jyot (2013), “Enhancement in the Performance of K-means Algorithm”, International Journal of Computer Science and Communication Engineering, vol. 2, 2013, pp. 724-729.
[17] Ming, J., Zhang, L., Sun, J. & Zhang, Y, “Analysis models of technical and economic data of mining enterprises based on big data analysis”, 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), IEEE, 2018.
[18] Sakhare, A. V., & Kasbe, P. S, “A review on road accident data analysis using data mining techniques”, International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), 2017.
[19] Chauhan, C., & Sehgal, S, “A review: Crime analysis using data mining techniques and algorithms”, International Conference on Computing, Communication and Automation (ICCCA), 2017.
[20] Anoopkumar M, & Rahman, A. M. J. M. Z, “A Review on Data Mining techniques and factors used in Educational Data Mining to predict student amelioration”, International Conference on Data Mining and Advanced Computing (SAPIENCE), 2016.
[21] Lee, E., Jang, Y., Yoon, D.-M., Jeon, J., Yang, S., Lee, S, Kim, K.-J, “Game Data Mining Competition on Churn Prediction and Survival Analysis using Commercial Game Log Data”, Transactions on Games, IEEE, 2018.
