A Discretization Method For Industrial Data Based On Big Data Technology
Xiang Wan*, Cheng Wang, Zhengming Tang, Haijun Sun, Shan Gao, Lei Qiao
Wuhan Second Ship Design and Research Institute, Hubei Wuhan 430064, China
*Corresponding author e-mail: wanwuhan@yeah.net
Abstract—This paper presents a parallel improvement of the traditional K-Means clustering algorithm based on the MapReduce architecture and uses the new parallelized clustering algorithm to discretize industrial big data. The new algorithm streamlines the calculation process and reduces both the computational overhead caused by data analysis and the communication cost caused by information transfer.

Keywords—Big data, industrial data, K-Means clustering algorithm

I. INTRODUCTION

Industrial data come from a wide variety of sources, so industrial parameters differ significantly from one another, sometimes by several orders of magnitude. If the differences between parameter values are not processed properly in advance, the workload of parameter identification and difference processing in subsequent data mining increases, which adversely affects the efficiency and accuracy of the mining work. Therefore, it is necessary to discretize industrial data so that they present a unified model that is conducive to data mining.

By replacing the initial data with several groups, clusters, intervals, labels, etc., data discretization reduces the differences between the initial data parameters and reduces the size of the initial data. Consequently, the efficiency of subsequent mining work is improved, meanwhile the mining ...

II. THE TRADITIONAL CLUSTERING ALGORITHM

A clustering algorithm groups data with common attributes and characteristics into a cluster. Data within the same cluster are highly homogeneous, whereas data in different clusters are highly heterogeneous, which achieves the discretization of the data. The clustering method does not set clustering parameters or discrete targets in advance; it is therefore an unsupervised learning method with good objectivity and reliability.

The K-Means algorithm is a clustering algorithm based on centroid division, in which the average value of the data in a cluster is usually used as the centroid of the cluster. In the K-Means algorithm, the initial data set $D$ is divided into $k$ clusters $C_1, C_2, \ldots, C_k$. Any data point in $D$ belongs to exactly one cluster, so for $1 \le i, j \le k$ with $i \ne j$, $C_i \subset D$ and $C_i \cap C_j = \varnothing$. The distance between data points is characterized by the Euclidean distance:

$$d(m, c_i) = \lVert m - c_i \rVert \tag{1}$$

In Equation (1), $c_i$ is the centroid of the cluster $C_i$, and $m$ is any data point in the cluster $C_i$. The clustering quality function is defined as:

$$E = \sum_{i=1}^{k} \sum_{m \in C_i} d(m, c_i)^2 \tag{2}$$
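As an illustration of Equations (1) and (2), the following is a minimal Python sketch that computes the Euclidean distance, the mean-based centroid, and the clustering quality function E for a given partition. The function names (`euclidean`, `centroid`, `quality`) and the toy data are assumptions made for this example and do not come from the paper.

```python
import math


def euclidean(m, c):
    """Euclidean distance d(m, c) between two equal-length points, as in Equation (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m, c)))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster, used as its centroid."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def quality(clusters):
    """Clustering quality E from Equation (2): the sum over all clusters of the
    squared distances between each point and the centroid of its cluster."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(euclidean(m, c) ** 2 for m in cluster)
    return total


# Hypothetical toy partition of a data set D into k = 2 disjoint clusters.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(10.0, 10.0), (14.0, 10.0)],
]
E = quality(clusters)
print(E)
```

A smaller E means the points lie closer to their cluster centroids, so comparing E across candidate partitions of the same data set gives a simple way to compare their compactness.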
Authorized licensed use limited to: Bauman Moscow State Technical University. Downloaded on September 28,2023 at 10:34:45 UTC from IEEE Xplore. Restrictions apply.
that there are a total of a nodes, each node is responsible for completing n/a tasks. The time complexity can then be expressed as T* = pqtk/a, which is inversely proportional to the number of nodes a, so the calculation efficiency is greatly improved.
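The effect of the node count on T* = pqtk/a can be checked with a short Python sketch. The symbols p, q, t and k are not defined in this passage, so they are treated here as opaque positive quantities. The function names and the numeric values are hypothetical, and the single-node baseline (a = 1) is an assumption used only to express the speedup.

```python
def parallel_time(p, q, t, k, a):
    """Evaluate T* = p*q*t*k / a for a nodes (formula taken from the text)."""
    if a < 1:
        raise ValueError("the number of nodes a must be at least 1")
    return p * q * t * k / a


def tasks_per_node(n, a):
    """Each of the a nodes handles n / a tasks, as stated in the text."""
    return n / a


# Hypothetical values chosen only to illustrate the scaling.
p, q, t, k = 1000, 5, 10, 4
single_node = parallel_time(p, q, t, k, a=1)   # assumed baseline: one node
eight_nodes = parallel_time(p, q, t, k, a=8)
speedup = single_node / eight_nodes
print(single_node, eight_nodes, speedup)
```

Under these assumptions the speedup equals the number of nodes, which is the idealized case: the formula has no term for communication or scheduling overhead.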
Fig. 2 The Flow of Parallelization Algorithm

IV. CALCULATION RESULTS AND CONCLUSIONS

A. Calculation Results

Taking industrial data as the research object, the data are discretized using the parallelization algorithm with the number of clusters set to 3 and 4. The discretization results are shown in Fig. 3 and Fig. 4.

B. Conclusions

The industrial data are discretized into 3 and 4 clusters. As the clustering results show, the data in each cluster have a high degree of similarity: densely distributed data are classified as one cluster, and sparsely distributed data at the edges are classified into one category, so the goal of data classification is basically achieved. The example verifies that the parallelization algorithm proposed in this paper can be used for the discretization of industrial big data.

ACKNOWLEDGMENTS

This work was financially supported by the Natural Science Foundation of Hubei Province (2019CFB281).