
Mining, Metallurgy & Exploration

https://doi.org/10.1007/s42461-020-00238-1

Machine Learning and Deep Learning Methods in Mining Operations: a Data-Driven SAG Mill Energy Consumption Prediction Application
Sebastian Avalos1 · Willy Kracht2,3 · Julian M. Ortiz1

Received: 10 February 2020 / Accepted: 13 May 2020


© Society for Mining, Metallurgy & Exploration Inc. 2020

Abstract
Semi-autogenous grinding mills play a critical role in the processing stage of many mining operations. They are also one
of the most intensive energy consumers of the entire process. Current energy consumption forecasting techniques base their inferences on feed ore mineralogical features, SAG mill dimensions, and operational variables. Experts recognize their capability to provide adequate guidelines but also their lack of accuracy when real-time forecasting is desired. As an
alternative, we propose the use of real-time operational variables (feed tonnage, bearing pressure, and spindle speed) to
forecast the upcoming energy consumption via machine learning and deep learning techniques. Several predictive methods
were studied: polynomial regression, k-nearest neighbor, support vector machine, multilayer perceptron, long short-term
memory, and gated recurrent units. A step-by-step workflow is presented, covering how to deal with real datasets, how to find optimum models, and how to select the final model. In particular, recurrent neural networks achieved the best forecasting metrics
in the energy consumption prediction task. The workflow has the potential of being extended to any other temporal and
multivariate mineral processing datasets.

Keywords Energy consumption · Semi-autogenous grinding mill · Machine learning · Deep learning · Mining

Correspondence: Sebastian Avalos (sebastian.avalos@queensu.ca), Willy Kracht (wkracht@uchile.cl), Julian M. Ortiz (julian.ortiz@queensu.ca)

1 The Robert M. Buchan Department of Mining, Queen's University, 25 Union Street, Kingston, ON, K7L 3N6, Canada
2 Department of Mining Engineering, Universidad de Chile, Santiago, Chile
3 Advanced Mining Technology Center, AMTC, Universidad de Chile, Santiago, Chile

1 Introduction

Current changes in the Chilean energy matrix [25], from fossil fuels to renewable energies, have impacted different industries and particularly the mining sector. Solar energy incorporation into greenfield/brownfield projects will help reduce costs and will lead to rethinking fundamental paradigms (production and processing strategies) [23]. In addition, the integration of new space-time predicting tools along with automated real-time updating models is one of the main challenges in the geometallurgical framework [1, 22]. Those contexts encourage and demand the development of tools capable of predicting the energy consumed by mining systems, comminution being the greatest energy consumer with an average close to 50% of the entire mine consumption [5].

In comminution, the semi-autogenous grinding (SAG) mill represents the largest energy consumer. Theoretical and empirical energy consumption models [17, 19, 29] base their inferences mainly on feed/product size distributions, SAG sizing, bearing pressure, feed hardness, water addition, and grinding charge level, and commonly assume steady state and isolation from up- and downstream processes. While those techniques provide adequate design guidelines, they lack accuracy when forecasting instant and different time support energy consumptions on interconnected comminution circuits where upstream/downstream bottlenecks lead SAG mills to operate below designed regimes.
One of the state-of-the-art methods combines those techniques with real-time operational data as model predictive control (SAG MPC [28]). However, this method still requires experts to properly model the SAG mill dynamics for further predictions. Lastly, when models are input-sensitive and some of those input measurements are expensive and/or time consuming, the model robustness suffers when real-time predictions are required.

In the last few years, regression methods that avoid theoretical/empirical models have gained attention when dealing with on-demand forecasting using real-time operational information. Among them, support vector machines [7], gene expression programming [14], and hybrid models combining genetic algorithms and neural networks [13] have shown promising accuracy and precision predicting power and specific energy consumption. Also, artificial neural networks, particularly recurrent neural networks, have been applied to SAG mill circuits for feedback control purposes rather than energy consumption prediction, with great results [15]. All data-driven methods are context (data availability) and representation (data workflow) sensitive. Therefore, a comparative study of their performances must be carried out prior to their industrial implementation for better decision making. This work seeks to (1) describe the nature of each regression method (Section 2), (2) show how to deal with the available data and create an adequate data workflow (Section 3), (3) analyze their performances (Section 4), and (4) draw conclusions from the previous results by ranking methods according to their performances.

2 Predictive Methods

We compare the SAG mill energy consumption prediction of several predictive methods ranging from polynomial regressions to recurrent neural networks. Their general internal structures are described in this section while their final applied structures are shown in Subsection 3.5.

2.1 Machine Learning

In machine learning, regression techniques combine statistics and algorithmic methods to train models using error metrics, such as least squares, between expected and predicted outputs. We describe the following techniques: polynomial regression, k-nearest neighbor regression, and regression using support vector machines (SVM).

In the following, let {X, Y} ∈ R^m × R be a training dataset with S pair samples {x_j, y_j} ∈ {X, Y}, X ∈ R^m being a set of independent variables (predictors) with m measured attributes and y ∈ R the corresponding set of measured dependent variables (responses).

2.1.1 Polynomial Regression

The polynomial regression approach fits a non-linear model over the pairs {x, y} by a polynomial function of degree n in the predictors: f_n(x) = λ_0 + Σ_{j=1}^{m} Σ_{p=1}^{n} λ_{jp} x_j^p, where λ_0 and λ_{jp} are the model parameters, λ_{jp} being the coefficient associated to the j-th attribute raised to the p-th power. They are found by minimizing the sum of squared errors between predicted values f_n(x) and the observed responses y via the common least-squares method, Σ_{j=1}^{S} (y_j − f_n(x_j))^2. Note that powers of attributes are combined, but attributes are not multiplied with each other.
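To make the per-attribute polynomial form concrete, the following is a minimal Python/NumPy sketch, assuming standardized inputs and using placeholder random data rather than the paper's datasets; the function names are illustrative only.

```python
import numpy as np

def poly_features(X, n):
    """Expand each attribute x_j into powers x_j^1 ... x_j^n (no cross terms between attributes)."""
    return np.hstack([X ** p for p in range(1, n + 1)])

def fit_poly_regression(X, y, n):
    """Least-squares fit of f_n(x) = lambda_0 + sum_j sum_p lambda_jp * x_j^p."""
    F = np.hstack([np.ones((X.shape[0], 1)), poly_features(X, n)])  # prepend the intercept lambda_0
    coeffs, *_ = np.linalg.lstsq(F, y, rcond=None)                  # minimizes the sum of squared errors
    return coeffs

def predict_poly(X, coeffs, n):
    F = np.hstack([np.ones((X.shape[0], 1)), poly_features(X, n)])
    return F @ coeffs

# toy usage with placeholder data (S samples, m attributes)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
coeffs = fit_poly_regression(X_train, y_train, n=2)
y_hat = predict_poly(X_train, coeffs, n=2)
```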
2.1.2 k-Nearest Neighbor Regression

Let x_t ∈ R^m be a testing sample. The k-nearest neighbor regression model computes the weighted average of the k closest neighbors to x_t as:

    f_k(x_t) = Σ_{j=1}^{k} φ(x_j, x_t) · y_j / Σ_{j=1}^{k} φ(x_j, x_t)                                        (1)

where y_j is the known value of the j-th closest neighbor and φ(x_j, x_t) is a predefined kernel function.

From the family of kernel functions [16], the radial basis function is used in the form φ(x_j, x_t) = exp(−d(x_j, x_t)/β), where β is a decay parameter and d(x_j, x_t) is the squared Euclidean distance d(x_j, x_t) = Σ_{l=1}^{m} (x_{j,l} − x_{t,l})^2, with m being the number of measured attributes, and x_{j,l} and x_{t,l} the l-th measured values of the j-th closest neighbor and the testing sample, respectively. As a decay parameter, we use half of the average distance [21] between the k closest neighbors and x_t, that is, β = (1/(2k)) Σ_{j=1}^{k} d(x_j, x_t).

The k closest neighbors meet the condition 0 ≤ d(x_1, x_t) ≤ d(x_2, x_t) ≤ ... ≤ d(x_{k−1}, x_t) ≤ d(x_k, x_t), regardless of the explicit form of the distance function. Note that the distance function operates in the original data space while the regression model uses kernel-weighted averages.
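The weighted average of Eq. 1, with the RBF kernel and β set to half the average distance to the k neighbors, can be sketched as follows; the data are random placeholders, not the SAG datasets.

```python
import numpy as np

def knn_rbf_predict(X_train, y_train, x_t, k):
    """Weighted k-NN regression (Eq. 1) with an RBF kernel on squared Euclidean distance."""
    d = np.sum((X_train - x_t) ** 2, axis=1)   # squared Euclidean distances to every training sample
    idx = np.argsort(d)[:k]                    # indices of the k closest neighbours
    d_k = d[idx]
    beta = d_k.mean() / 2.0                    # half the average distance to the k neighbours
    w = np.exp(-d_k / beta)                    # kernel weights phi(x_j, x_t)
    return np.sum(w * y_train[idx]) / np.sum(w)

# toy usage with placeholder data
rng = np.random.default_rng(1)
X_tr, y_tr = rng.normal(size=(200, 5)), rng.normal(size=200)
pred = knn_rbf_predict(X_tr, y_tr, x_t=rng.normal(size=5), k=10)
```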
2.1.3 Support Vector Regression

In 1995, Vapnik [32] identified that in machine learning there are two options for any model to approximate an unknown mapping function: (1) keep the confidence interval fixed and minimize the empirical risk, or (2) keep the value of the empirical risk fixed and minimize the confidence interval. The first approach is mostly used by neural networks, where the internal architecture implicitly defines the capacity, while the second one is implemented by support vector machines.

Support vector regression (SVR) models [6, 32] first apply a non-linear transformation over all inputs x ∈ X onto a new H-dimensional feature space using some predefined kernel mapping φ(x) ∈ R^H [10, 16]. In this new space, a linear model f(x, w) = Σ_{h=1}^{H} w_h · φ_h(x) + b is fitted to the data, where φ_h(x) is the h-th feature of the input in the new space, w_h its associated weight, and b a bias term, which is usually dropped when the input data have previously been normalized to zero mean.

The empirical risk is R(w) = (1/S) Σ_{j=1}^{S} L_ε(y_j, f(x_j, w)), where the loss function L_ε is called the ε-insensitive loss function [32] and is defined as:

    L_ε = 0                          if |y − f(x, w)| ≤ ε
    L_ε = |y − f(x, w)| − ε          otherwise                                                                  (2)

In order to control the model complexity, SVR minimizes the dual problem expression:

    (1/2) ||w||^2 + C Σ_{j=1}^{S} (ξ_j + ξ_j*)                                                                  (3)

where C is a positive real number known as the regularization constant, and ξ_j, ξ_j* are the distances of the j-th sample to the boundaries of the ε-insensitive zone when the j-th sample falls outside the zone. Equation 3 is minimized subject to:

    f(x_j, w) − y_j ≤ ε + ξ_j
    y_j − f(x_j, w) ≤ ε + ξ_j*                                                                                  (4)
    ξ_j ≥ 0, ξ_j* ≥ 0,  j = 1, ..., S

By solving the Lagrangian under the optimality constraints [30, 32], the solution of the previous dual problem, and therefore the SVR regression model, for any testing sample x_t ∈ R^m, has the form:

    f(x_t) = Σ_{i=1}^{Nsv} (α_i − α_i*) K(x_i, x_t)                                                             (5)

where K(x_i, x_t) is a kernel function, α_i, α_i* ∈ [0, C] are the Lagrange multipliers, and Nsv is the number of support vectors, i.e., the samples of X at which (α_i − α_i*) ≠ 0. We used K(x_i, x_t) = Σ_{h=1}^{H} φ_h(x_i) φ_h(x_t) as the kernel function, with the same mapping φ(·) ∈ R^H that maps the original input space into the H-dimensional feature space, specifically the radial basis function.

The SVR model complexity is mainly driven by the values of C and ε. From a theoretical analysis, Cherkassky and Ma [2] suggest setting ε = τ σ √(ln S / S), with τ = 3, S the number of samples, and σ its standard deviation. Although Cherkassky and Ma [2] also suggest an interval where the optimum value of C may fall, we explore the best value through a sensitivity analysis.

2.2 Deep Learning

Predictive models in deep learning [9] are built from deep neural networks. Neural networks are capable of capturing non-linear relationships [33] by performing affine transformations between input vectors and weight matrices plus bias vectors, passing the results through non-linear functions. Deep learning models are parametrized by inner parameters Θ = {W, b} (W: weights, b: biases), which are tuned to minimize a loss function by gradient-based optimizers during training. The internal architecture refers to how the neurons are interconnected. Different architectures yield different performances on specific structured datasets. In this work, we explore the feedforward architecture of a multilayer perceptron and two broadly used recurrent architectures, the long short-term memory (LSTM) and the gated recurrent unit (GRU).

2.2.1 Multilayer Perceptron

The perceptron [26] consists of a single artificial neuron that receives an impulse X ∈ R^m as input and modulates it by a multiplication with a vector w (w^T ∈ R^m). A bias scalar b ∈ R is added to the resulting value, which then passes through a non-linear activation function g(·), giving a single output z ∈ R. This is expressed as z = g(w^T X + b). When more than one perceptron is connected to the input X, the first layer is generated and expressed as z = g(W^T X + b), with W^T ∈ R^{np×m}, z ∈ R^{np}, b ∈ R^{np}, and np being the number of perceptrons. Once another perceptron receives the first layer z as input, a multilayer perceptron (MLP) [12] is built. The first layer is then referred to as the first hidden layer, with a number of neurons nH equal to the number of perceptrons np fully connected to the inputs. It can be extended to more than one hidden layer, with different numbers of neurons per layer [24]. The non-linear activation function g(·) used in this work is the rectified linear unit function.

Multilayer perceptrons are trained by the back-propagation algorithm [27], which computes the derivative of the error between predictions and targets with respect to each weight and bias via the chain rule [27, 31].
2.2.2 Recurrent Neural Networks

Recurrent neural networks (RNN) have internal architectures whose main feature is the handling of sequence inputs and the learning of temporal feature dependencies. This is carried out by a recurrent hidden state that modulates the input at each time according to the previous hidden state.

Let {x_1, .., x_t, .., x_τ} be a sequence of τ elements, each one a vector x_t ∈ R^m of m raw inputs. The hidden state h_t at time t (0 ≤ t ≤ τ) is updated by:

    h_t = 0                         if t = 0
    h_t = g(h_{t−1}, x_t; Θ)        if t > 0                                                                    (6)

where Θ are the inner RNN parameters and g(·) an activation function. Here, Θ = {W, U, V, b, c} are as follows: the weight matrices W for the input x_t, U for the previous hidden state h_{t−1}, and V for the current state h_t, and the bias vectors b for the input x_t and c for the output o_t. They are related by h_t = g(W x_t + U h_{t−1} + b) and o_t = V h_t + c. The predicted value at time t is simply ŷ_t = o_t. When dealing with a single continuous output, ŷ_t ∈ R.

Inner parameters are initialized with random values drawn from a truncated Gaussian distribution. Optimum values are obtained by training the RNN with the backpropagation through time (BPTT) [34] algorithm. The Adam technique [18] is used as the optimizer. The loss function minimized during training is the mean squared error between the real (y_t) and predicted (ŷ_t) values.

When BPTT is performed, long-term dependencies are hardly captured due to vanishing or exploding gradient problems [9]. As explained by Chung et al. [4], different approaches have been tried, from new training algorithms to more sophisticated activation functions. The LSTM [11] was one of the first attempts to capture long-term dependencies. Cho et al. [3] proposed an alternative architecture, the gated recurrent unit (GRU). Their internal architectures are illustrated in Fig. 1 and subsequently described.

2.2.3 Long Short-Term Memory

The LSTM [11] uses an internal cell that performs several combinations of affine transformations, element-wise multiplications, and activation functions. The building blocks of an LSTM architecture are:

• x_t: input vector at time t. Dimension (m, 1).
• W_f, W_i, W_c, W_o: weight matrices for x_t. Dimensions (nH, m).
• h_t: hidden state at time t. Dimension (m, 1).
• U_f, U_i, U_c, U_o: weight matrices for h_{t−1}. Dimensions (nH, m).
• b_f, b_i, b_c, b_o: bias vectors. Dimensions (nH, 1).
• V: weight matrix for h_t as output. Dimension (K, m).
• c: bias vector for the output. Dimension (K, 1).

where m is the number of input variables, K is the number of desired output variables, and nH is the number of hidden units, a hyperparameter of LSTM networks. For a better understanding of the following descriptions, refer to the illustrated architecture in Fig. 1 (left). At each time t ∈ {1, ..., τ}, the LSTM receives the input x_t, the previous hidden state h_{t−1}, and the previous memory cell c_{t−1}. The forget gate f_t = σ(W_f x_t + U_f h_{t−1} + b_f) measures the information carried by x_t, deciding how much to forget. The input gate i_t = σ(W_i x_t + U_i h_{t−1} + b_i), on the other hand, decides what to learn from x_t. Both f_t and i_t use the sigmoid σ(x) = (1 + e^{−x})^{−1} as the activation function over a linear combination of x_t and h_{t−1}.

A candidate memory cell c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c) is obtained by passing the linear combination of x_t and h_{t−1} through a tanh function. The final memory cell c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t is then computed as a sum of (1) what to forget from the past memory cell, as an element-wise multiplication between f_t and c_{t−1}, and (2) what to learn from the candidate memory cell, as an element-wise multiplication between i_t and c̃_t.

Fig. 1 Schematic information flow of long short-term memory (left) and gated recurrent unit (right) cells
Table 1 Summary statistics over the collected datasets on semi-autogenous grinding

                              SAG mill 1                                  SAG mill 2
Variable                      Min   Mean     Max       St Dev   Count     Min   Mean    Max      St Dev   Count

Feed tonnage (ton/h)          0     897      2111      498      16,340    0     2070    3476     1,143    15,905
Energy consumption (kW h)     0     9423.1   12,248.0  1220.1   16,340    0     17,077  19,688   1528     15,905
Bearing pressure (psi)        0     12.3     13.7      2.3      16,340    0     14.3    18.3     3.7      15,905
Spindle speed (rpm)           0     9.2      10.7      0.7      16,340    0     9.0     10.0     0.6      15,905
Water (m3/h)                  0     319.2    499.9     116.2    16,340    0     705.6   998.4    164.3    15,905
Solid percentage (%)          0     75.7     90.4      4.8      16,340    0     76.0    98.1     4.3      15,905

The output gate o_t = σ(W_o x_t + U_o h_{t−1} + b_o), similar to i_t and f_t, is a linear combination of x_t and h_{t−1} passed through a sigmoid function. The output gate controls the amount of information passing from the current memory cell c_t to the final hidden state h_t = o_t ⊙ tanh(c_t), which is computed as an element-wise multiplication between o_t and tanh(c_t). The final output is obtained as in any other regular RNN as ŷ_t = V h_t + c.
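As a minimal sketch of one LSTM time step using the gate equations above, the following NumPy snippet assumes the hidden state has size nH (rather than the listed (m, 1) dimension) and uses random placeholder parameters; it is an illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts with keys 'f', 'i', 'c', 'o' (forget, input, candidate, output)."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])         # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])         # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])   # candidate memory cell
    c_t = f * c_prev + i * c_tilde                               # element-wise forget/learn
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])         # output gate
    h_t = o * np.tanh(c_t)                                       # new hidden state
    return h_t, c_t

# toy usage: m = 5 inputs, nH = 4 hidden units, random placeholder parameters
rng = np.random.default_rng(4)
m, nH = 5, 4
W = {k: rng.normal(size=(nH, m)) for k in 'fico'}
U = {k: rng.normal(size=(nH, nH)) for k in 'fico'}
b = {k: np.zeros(nH) for k in 'fico'}
h, c = np.zeros(nH), np.zeros(nH)
for _ in range(8):                                               # a window of tau = 8 steps
    h, c = lstm_step(rng.normal(size=m), h, c, W, U, b)
```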
2.2.4 Gated Recurrent Unit

The GRU [3] handles the input vector and the previous hidden state in a different way. Here, only the h_t vector passes from time t to time t + 1 inside the architecture, decreasing the number of internal operations and inner parameters with respect to the LSTM. In the GRU internal building blocks, the same x_t, h_t, V, and c blocks as in the LSTM are used, in addition to:

• W_r, W_z, W_h: weight matrices for x_t. Dimensions (nH, m).
• U_r, U_z, U_h: weight matrices for h_{t−1}. Dimensions (nH, m).
• b_r, b_z, b_h: bias vectors. Dimensions (nH, 1).

where again m is the number of input variables, K is the number of desired output variables, and nH is the number of hidden units. Again, for a better understanding of the following descriptions, refer to the illustrated architecture in Fig. 1 (right). At each time t ∈ {1, ..., τ}, the GRU receives the input x_t and the previous state h_{t−1}, defining an update gate z_t = σ(W_z x_t + U_z h_{t−1} + b_z) and a reset gate r_t = σ(W_r x_t + U_r h_{t−1} + b_r), both using the sigmoid (σ) as the activation function over a linear combination of x_t and h_{t−1}.

Table 2 Summary statistics over training, validation, and testing datasets on semi-autogenous grinding

SAG mill 1 Training | Validation | Testing dataset


Variable Min Mean Max St Dev Count

Feed tonnage (ton/ h) 0 |0 |0 906 | 827 | 937 2111 | 1953 | 1903 498 | 473 | 479 8600 | 3400 | 4340
Energy consumption (kW h) 0 |0 |0 9907 | 9162 | 8668 12,248 | 10,691 | 10,654 1233 | 942 | 899 8600 | 3400 | 4340
Bearing pressure (psi) 0 |0 |0 12.7 | 12.5 | 11.4 13.7 | 13.7 | 13.7 2.2 | 2.0 | 2.3 8600 | 3400 | 4340
Spindle speed (rpm) 0 |0 |0 9.2 | 9.2 | 9.0 10.3 | 10.7 | 10.6 0.7 | 0.7 | 0.6 8600 | 3400 | 4340
Water (m3 / h) 0 |0 |0 327 | 330 | 295 500 | 497 | 500 65 | 66 | 196 8600 | 3400 | 4340
Solid percentage (%) 0 | 0 | 0 77 | 76 | 73 90 | 83 | 84 4 | 5 | 4 8600 | 3400 | 4340

SAG mill 2 Training | Validation | Testing dataset


Variable Min Mean Max St Dev Count

Feed tonnage (ton/ h) 0 |0 |0 2077 | 2204 | 1986 3477 | 3448 | 3452 1136 | 1141 | 1121 7953 | 3181 | 4771
Energy consumption (kW h) 0| 0| 0 16,709 | 17,439 | 17,449 19,688 | 19,419 | 19,533 1504 | 1415 | 1492 7953 | 3181 | 4771
Bearing pressure (psi) 0 | 0| 0 13.8 | 14.8 | 14.7 18.3 | 18.3 | 18.3 3.5 | 3.7 | 3.9 7953 | 3181 | 4771
Spindle speed (rpm) 0 | 0.5 | 0 9.1 | 8.9 | 8.9 10.0 | 9.9 | 9.9 0.6 | 0.6 | 0.7 7953 | 3181 | 4771
Water (m3/h) 32 | 0 | 0 660 | 742 | 757 942 | 941 | 998 107 | 112 | 236 7953 | 3181 | 4771
Solid percentage (%) 0 | 0 | 0 77.5 | 75.8 | 73.7 86.9 | 98.1 | 81.0 3.3 | 4.2 | 4.9 7953 | 3181 | 4771

A candidate hidden state h̃_t = tanh(W_h x_t + r_t ⊙ U_h h_{t−1} + b_h) is obtained by passing the linear combination of x_t and h_{t−1} through a tanh function, where r_t decides, through an element-wise multiplication, how much past information to forget from h_{t−1}. The final hidden state h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t is a linear interpolation between the candidate hidden state h̃_t and the previous state h_{t−1}, weighted by the update gate z_t. The output (K predicted values) is obtained similarly to the LSTM case as ŷ_t = V h_t + c.

Experiments on the relevance of inner parameters in the GRU architecture [8] showed that deleting the bias vectors on r_t and z_t slightly reduces the accuracy on standard tasks. Lower accuracies were obtained when h_{t−1} was not considered by z_t and r_t. Also, Wu and King [35] determined the relevance of the reset gate r_t in contrast with z_t, again on standard tasks.
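The analogous single-step GRU update (update gate, reset gate, candidate state, interpolation) can be sketched as follows, again with placeholder shapes and random, untrained parameters.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; keys 'z' (update), 'r' (reset), 'h' (candidate)."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])               # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])               # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + r * (U['h'] @ h_prev) + b['h'])   # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde                            # linear interpolation

# toy usage with placeholder parameters (m inputs, nH hidden units)
rng = np.random.default_rng(5)
m, nH = 5, 4
W = {k: rng.normal(size=(nH, m)) for k in 'zrh'}
U = {k: rng.normal(size=(nH, nH)) for k in 'zrh'}
b = {k: np.zeros(nH) for k in 'zrh'}
h = np.zeros(nH)
for _ in range(8):
    h = gru_step(rng.normal(size=m), h, W, U, b)
```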
3 Experiments

3.1 Dataset

Two datasets corresponding to real operational information of SAG mills are available, with data points every 30 min for total periods of 340 days and 331 days, respectively. At each time t, the datasets contain feed tonnage (FT) [ton/h], energy consumption (EC) [kW h], bearing pressure (BPr) [psi], spindle speed (SSp) [rpm], feed water (Wtr) [m3/h], and solid percentage (SPe) [%]. Summary statistics of both datasets are presented in Table 1.

Each dataset is split into three sub-datasets: training, validation, and testing sets (Table 2). This is an arbitrary division and we intend to have a proportion of ~50/20/30 between training/validation/testing data. For SAG mill 1, the information is split into 179, 71, and 90 days (8600, 3400, and 4340 data points), while for SAG mill 2 the information is split into 166, 66, and 99 days (7953, 3181, and 4771 data points) for the training, validation, and testing datasets, respectively. A minimal sketch of this kind of chronological split is given below.

Note that both the validation and testing datasets contain unseen data. In other words, all predictive methods are trained using the first 50% of the historical data, validated over the following 20% of unseen historical data (without feeding the previous 50%), and then tested over the last 30% of unseen historical data (without being fed with the previous 70%).
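The chronological ~50/20/30 split can be sketched as follows; no shuffling is applied, so validation and testing remain future, unseen data. The fractions are illustrative; the paper's exact counts differ slightly because its split was made by whole days.

```python
import numpy as np

def chronological_split(data, train_frac=0.5, val_frac=0.2):
    """Split a time-ordered array into training/validation/testing blocks (~50/20/30), preserving order."""
    n = len(data)
    i_train = int(n * train_frac)
    i_val = int(n * (train_frac + val_frac))
    return data[:i_train], data[i_train:i_val], data[i_val:]

# toy usage: placeholder series of 16,340 half-hour records (the size of SAG mill 1)
series = np.arange(16_340)
train, val, test = chronological_split(series)
print(len(train), len(val), len(test))
```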
3.2 Assumptions

Information about downstream or upstream processes is not available, which implies that bottlenecks are not easily recognized. This leads to a mix of operational stages, going from steady state to under capacity and vice versa. Stationarity of all variable distributions and expert-agent passivity are assumed throughout this work during training to simplify the approach. Descriptions, limitations, and potential improvements are detailed below.

Fig. 2 Correlation matrices on training, validation, and testing sets on both SAG mills
Fig. 3 Scatter plots of energy consumption and each variable over training (black), validation (green) and testing (red) datasets on both SAG mills

Stationarity The SAG mill performance can be seen as a temporal phenomenon where ores, mineralogically characterized, coming from several geometallurgical units are combined and fed to the primary grinding circuit. This performance is reflected in the dataset, where assuming stationarity means that the entire dataset belongs to a planned combination of different geometallurgical units, with different mineralogical characterizations and with no systematic temporal variation. This assumption limits the versatility of all predictive methods since they will not be suitably trained to forecast under other combinations of geometallurgical units with different ore characteristics. It can be tackled by training the same methods under different blends of geometallurgical units in the feed or with longer datasets over the same SAG mill.

Passivity All predictive methods in this work are trained using a past dataset (training set representing 250 days of data for SAG mill 1 and 166 for SAG mill 2) and not yet in a real-time operation, which leads to assuming passivity from the expert-agent (operator or expert system) perspective. This means that the predictions of energy consumption are unknown to the expert agent, who therefore takes no further action over the future mill performance. This assumption could limit the industrial applicability since expert agents may consider the forecasting information to react under critical situations. It can be tackled by actually training models in a real-time environment or by reinforcement learning, where expert-agent actions, cause-effect situations, and environmental descriptions can be partially built to emulate the real in situ experience.

Additivity One of the aims is to forecast EC at different time intervals, which requires averaging the EC to suitably train and test models and thus induces the need for EC to be additive. In fact, the units of the energy consumptions are kW h and the discretization of the dataset is constant, so averaging adjacent ECs is consistent and maintains the units of kW h.

3.3 Problem Statement

Since the information is available every 30 min, the upcoming energy consumption at 0.5-h support, EC^(0.5h)_{t+1}, is denoted simply as EC_{t+1}.

Table 3 Normalization parameters during preprocessing and back-transformation

SAG mill 1                                                    SAG mill 2
Variable  m_var   s_var   Support    m_EC^(sh)  s_EC^(sh)     Variable  m_var   s_var   Support    m_EC^(sh)  s_EC^(sh)

FT        883     492     EC (0.5h)  9696       1205          FT        2104    1154    EC (0.5h)  16,895     1500
BPr       12.7    2.1     EC (1h)    9696       1125          BPr       14.1    3.5     EC (1h)    16,895     1390
SSp       9.2     0.7     EC (2h)    9696       1063          SSp       9.0     0.6     EC (2h)    16,895     1292
                          EC (4h)    9696       1008                                    EC (4h)    16,895     1209
                          EC (8h)    9696       960                                     EC (8h)    16,895     1137
An upcoming energy consumption at 1-h support, EC^(1h)_{t+1}, is obtained by averaging the next two energy consumptions, EC_{t+1} and EC_{t+2}. Similarly, by averaging the upcoming energy consumptions, different supports are computed. Let s be the time support in hours, which represents the average over a temporal interval of a given duration; then EC^(sh)_{t+1} is calculated as:

    EC^(sh)_{t+1} = (EC_{t+1} + ... + EC_{t+2s}) / (2s)                                                         (7)

Five different supports (sh) are considered: 0.5h, 1h, 2h, 4h, and 8h.
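Equation 7 amounts to averaging the next 2s half-hour readings. A minimal sketch, using a placeholder energy series rather than the real mill data, is:

```python
import numpy as np

def ec_support(ec, t, s_hours):
    """EC^(sh)_{t+1}: average of the next 2*s half-hour energy readings after index t (Eq. 7)."""
    n = int(2 * s_hours)                                 # number of future readings to average
    return ec[t + 1 : t + 1 + n].mean()

# toy usage over a placeholder half-hourly EC series (stand-in for kW h readings)
rng = np.random.default_rng(6)
ec = rng.normal(loc=9400, scale=1200, size=1000)
for s in [0.5, 1, 2, 4, 8]:
    print(s, ec_support(ec, t=100, s_hours=s))
```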
The input attributes from the available dataset are selected by analyzing the correlation coefficient matrix (Fig. 2) and the scatter plots in Fig. 3 over the training, validation, and testing sets. EC shows some linear relation with BPr and SSp but not with FT, Wtr, and SPe. The influence of SSp is known from phenomenological analysis and has been captured in semi-empirical models [20, 29]. From a phenomenological point of view, the bearing pressure BPr is associated with the mill weight, considering the grinding media and ore. The heavier the mill, the more energy is required for its rotation. Hence, a relationship is expected between EC and BPr. So far, both BPr and SSp are considered input variables.

In semi-empirical models, the variable related to the feed tonnage is usually found in terms of a certain interval of granulometric material size, commonly F80 in reference to the 80% passing size of the feed. Hence, we incorporate FT as another input variable although it represents the entire size distribution. Note that FT has many points with zero value, which are kept as they may represent actual gaps in the feed. With respect to the feed water Wtr and solid percentage SPe, we assume that these factors are captured by the bearing pressure BPr. Therefore, Wtr and SPe are not considered in the prediction.

From the previous analysis and assumptions, at each time t, the considered input variables are FT_t, BPr_t, and SSp_t, while SPe_t and Wtr_t are left out. To account for trends, and since FT and SSp are operational decisions, the differences FT_{t+1} − FT_t and SSp_{t+1} − SSp_t are also considered as inputs. Therefore, the dataset of predictors and output {X, Y} ∈ R^5 × R, at each time support sh, has samples {x_t, y_t} ∈ {X, Y} with x_t = (FT_t, BPr_t, SSp_t, FT_{t+1} − FT_t, SSp_{t+1} − SSp_t) and y_t = EC^(sh)_{t+1}.
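To make the construction of these predictor/response pairs concrete, the following sketch reuses the same averaging as Eq. 7 and builds the five-component input vectors from placeholder series; the function name and the synthetic data are illustrative only.

```python
import numpy as np

def build_samples(ft, bpr, ssp, ec, s_hours):
    """x_t = (FT_t, BPr_t, SSp_t, FT_{t+1}-FT_t, SSp_{t+1}-SSp_t); y_t = EC^(sh)_{t+1} (Eq. 7)."""
    n_ahead = int(2 * s_hours)                       # future half-hour readings averaged into y_t
    X, y = [], []
    for t in range(len(ec) - n_ahead - 1):
        X.append([ft[t], bpr[t], ssp[t], ft[t + 1] - ft[t], ssp[t + 1] - ssp[t]])
        y.append(ec[t + 1 : t + 1 + n_ahead].mean())
    return np.array(X), np.array(y)

# toy usage with placeholder half-hourly series
rng = np.random.default_rng(9)
n = 500
ft, bpr, ssp = rng.normal(900, 500, n), rng.normal(12.7, 2.1, n), rng.normal(9.2, 0.7, n)
ec = rng.normal(9400, 1200, n)
X, y = build_samples(ft, bpr, ssp, ec, s_hours=1)    # 1-h support: average of the next two readings
print(X.shape, y.shape)
```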

Fig. 4 Machine learning results on SAG mill 1 validation dataset. Relative root mean squared errors (left) and correlation coefficients (right)
Fig. 5 Machine learning results on SAG mill 2 validation dataset. Relative root mean squared errors (left) and correlation coefficients (right)

We also tried several other combinations of input variables, but all led to results of lower quality.

3.4 Preprocessing Dataset

As usual when dealing with multivariate raw datasets, preprocessing the information is required. The reason lies in the models themselves, which require the input to fall in certain regions (e.g., normalization), to be coded in categories (e.g., one-hot encoding), or to avoid collinearity (e.g., principal component analysis), among others. In this work, we normalize the entire raw datasets using the training information, so the mean and standard deviation are derived from the training dataset and applied to both the training and testing datasets.

Let x_t^(var) represent one of the input SAG operational variables (var) at time t; its normalized expression is x_t^(var) = (var_t − m_var) / s_var, where m_var and s_var represent the mean and standard deviation of var. Let ŷ_t be the normalized energy consumption prediction; it is back-transformed to obtain the expected EC_t^(sh) as EC_t^(sh) = ŷ_t · s_EC^(sh) + m_EC^(sh). Table 3 contains the parameters used to perform the normalization and back-transformation. Note that all m_EC^(sh) (mean) and s_EC^(sh) (standard deviation) values are computed over the training dataset and not over the testing dataset, since testing data are not known a priori.

Table 4 Machine learning best methods and their corresponding final models and performances at each time support, on both SAG mills

                  SAG mill 1                                              SAG mill 2
Machine learning  EC (0.5h)  EC (1h)  EC (2h)   EC (4h)   EC (8h)         EC (0.5h)  EC (1h)  EC (2h)   EC (4h)   EC (8h)

Method            PR         PR       SVR       SVR       SVR             PR         PR       SVR       SVR       SVR
Model             n: 2       n: 2     C: 0.5    C: 0.5    C: 0.005        n: 2       n: 2     C: 0.5    C: 0.1    C: 0.005
rRMSE (%)         8.63       9.21     9.39      9.14      8.83            5.87       6.18     6.01      5.62      5.43
Corr. Coeff.      0.94       0.84     0.73      0.66      0.56            0.87       0.77     0.70      0.66      0.54
Note that we normalize the first three attributes of x_t (FT_t, BPr_t, and SSp_t), while for the last two attributes the differences between the original values, FT_{t+1} − FT_t and SSp_{t+1} − SSp_t, are replaced by the differences between the normalized values of FT and SSp. The output y_t has also been normalized.
yt has also been normalized. 8500
exploring the results with a regularized parameter C in
the set [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50,
3.5 Structures of Predictive Methods 100, 500, 1000, 5000, 10000].
(sh)
Multilayer perceptron: The multilayer perceptron models
The architectures of each model used to predict ECt , are made by two hidden layers, both with the same
more precisely yt , were described in Section 2. Each number of hidden nodes nH . The first one connected
method has one or more hyperparameters to be optimized to to the input xt ∈ R5 (five input variables), the second
achieve the best performance over the training dataset. The one fully connected to the first layer, and the output
following is a description of the model architectures used in computed with one neuron connected to the nH nodes
this work and the sensitivity ranges explored to obtained the of the second layer. The sigmoid function is used as the
best hyperparameters for each method. activation function. The best MLP model is found by
varying the number of hidden nodes in the set nH = [4,
Polynomial regression: Six nth degrees polynomials were 8, 12, ..., 592, 596, 600]. All MLP models are trained
explored, with n = [1, 2, 3, 4, 5, 6]. with an early stopping criterion meaning on average
k-Nearest neighbor regression: The radial basis function 150 epochs (number of passes over the training dataset)
and the Euclidean squared distance are used as the during training.
kernel function and metric distance, respectively. We Recurrent neural networks: As only the energy consump-
explored the optimum model varying the number of tion is predicted, one single output is desired (K =
nearest neighbors k in the range [1, 2, 3, ... , 98, 99, 100]. 1). A temporal window of the previous 4 h is used

Fig. 6 Deep learning results on SAG mill 1 validation dataset. Relative root mean squared errors (left) and correlation coefficients (right)
Fig. 7 Deep learning results on SAG mill 2 validation dataset. Relative root mean squared errors (left) and correlation coefficients (right)

The optimum model is found by varying the number of hidden units in the set nH = [4, 8, 12, ..., 592, 596, 600]. Similar to the MLP, all LSTM and GRU models were trained with an early stopping criterion, meaning on average 60 epochs during training.
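The recurrent setup just described — a window of τ = 8 half-hour steps of the five predictors feeding an LSTM with nH hidden units and one output, trained with Adam on mean squared error and early stopping — could look like the following Keras sketch. The paper does not state its software stack, and nH = 280 is just one of the optimum values reported later (Table 5); the synthetic arrays are placeholders.

```python
import numpy as np
from tensorflow import keras

tau, n_features, n_hidden = 8, 5, 280   # 4-h window, five predictors, one candidate nH value

model = keras.Sequential([
    keras.layers.Input(shape=(tau, n_features)),
    keras.layers.LSTM(n_hidden),                 # recurrent layer over the temporal window
    keras.layers.Dense(1),                       # single normalized EC^(sh)_{t+1} output (K = 1)
])
model.compile(optimizer=keras.optimizers.Adam(), loss="mse")   # Adam + mean squared error, as in Section 2

# placeholder training arrays shaped (samples, tau, features); early stopping on validation loss
X_tr, y_tr = np.random.randn(256, tau, n_features), np.random.randn(256)
X_va, y_va = np.random.randn(64, tau, n_features), np.random.randn(64)
stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=60, callbacks=[stop], verbose=0)
```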

3.6 Performance Assessment

For every method and each model, the training, validation, and testing stages work as follows. Models are trained over the training dataset: machine learning methods minimize the least-squares error and deep learning methods minimize their loss function with an early stopping criterion. Once models are trained, they are compared using the validation dataset. The best method and its respective optimum model are selected, for each SAG mill and each time support. Once selected, they are tested over the testing dataset, reporting the energy consumption metrics.

4 Results and Discussion

This section presents the results achieved by all methods and for every model configuration in order to find the best model of each method, for each SAG dataset and at each time support. All results are computed over the corresponding validation dataset.

Table 5 Best deep learning methods and their corresponding final models and performances at each time support, on both SAG mills

                 SAG mill 1                                         SAG mill 2
Deep learning    EC (0.5h)  EC (1h)  EC (2h)  EC (4h)  EC (8h)      EC (0.5h)  EC (1h)  EC (2h)  EC (4h)  EC (8h)

Method           LSTM       LSTM     LSTM     LSTM     LSTM         LSTM       LSTM     LSTM     LSTM     LSTM
Model (nH)       280        212      240      260      516          596        376      576      260      488
rRMSE (%)        5.61       6.54     6.70     6.90     6.87         3.71       4.62     4.86     4.76     4.69
Corr. Coeff.     0.94       0.83     0.76     0.70     0.66         0.89       0.78     0.68     0.61     0.53
Performance metrics are computed to derive the best model, and the actual predictions are illustrated and discussed.

Let y, ŷ, and N be the real values, the predicted values, and the number of testing samples, respectively. The performance metrics used to compare models are the relative root mean squared error (rRMSE) and the Pearson correlation coefficient:

    rRMSE = sqrt( (1/N) Σ_{t=1}^{N} (y_t − ŷ_t)^2 ) / ȳ

    Corr. Coeff. = Σ_{t=1}^{N} (y_t − ȳ)(ŷ_t − ȳ̂) / sqrt( Σ_{t=1}^{N} (y_t − ȳ)^2 · Σ_{t=1}^{N} (ŷ_t − ȳ̂)^2 )     (8)

with ȳ = (1/N) Σ_{t=1}^{N} y_t and analogously ȳ̂ for the predictions. The chosen model must trade off having the lowest rRMSE (close to 0) against the highest correlation coefficient (close to 1).
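These two metrics are straightforward to compute; a minimal sketch with placeholder arrays is:

```python
import numpy as np

def rrmse(y_true, y_pred):
    """Relative root mean squared error: RMSE divided by the mean of the real values."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)

def corr_coeff(y_true, y_pred):
    """Pearson correlation coefficient between real and predicted values (Eq. 8)."""
    yt, yp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return np.sum(yt * yp) / np.sqrt(np.sum(yt ** 2) * np.sum(yp ** 2))

# toy usage with placeholder arrays
rng = np.random.default_rng(8)
y = rng.normal(9400, 1200, 500)
y_hat = y + rng.normal(0, 300, 500)
print(f"rRMSE = {100 * rrmse(y, y_hat):.2f} %, corr = {corr_coeff(y, y_hat):.3f}")
```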
4.1 Machine Learning Methods

We begin by analyzing the performance results of the machine learning methods on SAG mill 1 (Fig. 4) and SAG mill 2 (Fig. 5) over the validation datasets. Each method behaves similarly on both SAG mills in terms of its performance variability during the sensitivity analysis, but with better metrics over the SAG mill 2 dataset. In particular, the correlation coefficient decays steadily as the time support increases, regardless of the method used, while no similar pattern is shown by the rRMSE.

Polynomial regressions achieve their best performances when models have two or three degrees; their quality drops drastically at higher degrees. In the case of the k-nearest neighbor regression results, the rRMSE fluctuates from 12 to 14% and from 6 to 8% for SAG mill 1 and SAG mill 2, respectively. The support vector regressions have a convex behavior when increasing their regularization parameter, on both SAG mills, while their correlation coefficient has a concave shape. We decided to select the best models by minimizing the rRMSE.

Fig. 8 SAG mill 1. Prediction of energy consumption at different supports, from 0.5 (top) to 8 h (bottom). Daily graphs (left) of real (black circles)
and predicted consumptions (red dots) along with their scatter plots (right) and correlation coefficients
The best machine learning methods and their corresponding optimal models, for each time support and SAG validation dataset, are summarized in Table 4. We first note that k-nearest neighbor regression was not the best on any combination. On the other hand, a simple quadratic polynomial regression was optimal on both SAG mills for time supports 0.5h and 1h. As the time support increases, the support vector regression shows better performances on both SAG mills, at the 2-, 4-, and 8-h time supports. Being able to select one method over another at each time support, on two different datasets, is one of the reasons why these methods should be compared again whenever the dataset is modified or when applying these techniques on a different operational dataset is desired.

4.2 Deep Learning Methods

The performances of the deep learning methods on the first and second SAG mills are shown in Figs. 6 and 7, respectively. Comparing them, we appreciate that the metrics in SAG mill 1 fluctuate more when the model complexity increases, compared with the flatter behavior in SAG mill 2. Results in SAG mill 2 are better than those in SAG mill 1, but on both, recurrent neural networks perform better than multilayer perceptrons. In particular, LSTM and GRU models are not significantly different, beyond the latter having more stable metrics despite its complexity.

The rRMSE and correlation coefficient of the multilayer perceptron models are stable between 4 and 200 hidden units, but beyond that these metrics present strong fluctuations. The rRMSE values fluctuate between 7.5 and 11.0% in SAG mill 1, and between 5.0 and 6.5% in SAG mill 2, considering all time supports. The recurrent neural network methods, LSTM and GRU, have similar behaviors. In SAG mill 1, the rRMSE curves have convex shapes, responding to the known principle of underfitting and overfitting the training dataset when models have low and high complexities, respectively. The optimum numbers of hidden units that deliver the minimum rRMSE differ across time supports, on both methods. In SAG mill 2, the models do not exhibit convex behavior but rather more linear trends that slightly decrease the rRMSE while increasing the number of hidden units.

Fig. 9 SAG mill 2. Prediction of energy consumption at different supports, from 0.5 (top) to 8 h (bottom). Daily graphs (left) of real (black circles)
and predicted consumptions (red dots) along with their scatter plots (right) and correlation coefficients
This is shown by the three deep learning methods, but slightly more visibly by the LSTM. Interestingly, the correlation coefficients are more or less stable after 150 hidden units for all methods in SAG mill 2.

The best deep learning methods and their corresponding optimal models at each time support and SAG validation dataset are summarized in Table 5. We note that the multilayer perceptron achieved great results, but the recurrent neural networks were able to outperform it on both SAG mills and at each time support. This can be attributed mainly to the LSTM and GRU nature of being easily fed with a temporal window of information and keeping an internal state of the current process. The rRMSEs of the optimum models in SAG mill 1 and SAG mill 2 are close to 6.0% and 4.5%, respectively. The correlation coefficients at EC (0.5h) are close to 0.9, decaying to around 0.60 for EC (8h).

Fig. 10 Histograms of differences between real and predicted energy consumption on both SAG mills and all time supports, using the best models
on testing datasets
4.3 Best Models on Testing Datasets

Deep learning methods outperform machine learning methods, at every time support, on both SAG mill datasets, with lower optimum rRMSE values and similar correlation coefficients. Over the testing datasets, the real and predicted EC^(sh) are compared and illustrated, at each time support, by showing the daily graph predictions along with their scatter plots and correlation coefficients in Figs. 8 and 9 for the first and second SAG mills, respectively. Additionally, the differences between the real and predicted ECs are plotted as histograms in Fig. 10, always using the best method and model found in the validation stage.

We observe that for SAG mill 1 there is a slight overestimation of the energy that will be consumed on all time supports, while local and global trends are still captured. In fact, the percentage relative error between real and predicted energy consumptions is close to −5.8% at most time supports. On the other hand, results over SAG mill 2 are considerably better, with predictions following the local and global trends of the real values and with a percentage relative error close to 0.05% at most time supports (Fig. 10).

Lastly, the decays in correlation and precision may come from the phenomenon itself, where short-term decisions have a huge impact on the SAG mill performance. Also, the normal residence time of the ore feed material is no longer than 10 to 15 min, so forecasting farther ahead than half an hour can only capture trends, not local variations.

5 Conclusion

This work explored the best technique, among machine learning and deep learning methods, to forecast the energy consumption of two different SAG mills. We exposed the normal workflow of creating machine and deep learning models when dealing with raw operational datasets. In particular, the upcoming energy consumptions were predicted by accounting for low-cost and easy-to-acquire operational information (feed tonnage, bearing pressure, and spindle speed). The relevance of performing sensitivities over each model parameter was illustrated to define the optimal model setting. Moreover, we pushed those methods further by testing them on forecasting the upcoming energy consumptions at different time supports, from half an hour to 8 h.

Deep learning methods, in particular recurrent neural networks, outperform machine learning techniques such as k-nearest neighbor regression and support vector regression on both SAG datasets. In terms of time support, great results were obtained at 0.5h support on both SAG mills. Indeed, the EC shows a correlation of 0.86 and 0.84 between real and predicted values using LSTM models on SAG mill 1 and SAG mill 2, respectively. Result quality drops when increasing the time support to 8h, giving correlation coefficients of 0.65 and 0.61 on SAG mill 1 and SAG mill 2, respectively.

Lastly, the presented workflow for modelling key mining operational variables can be extended to any other temporal and multivariate mineral processing datasets. We would like to emphasize that similar comparisons and sensitivities must be carried out to establish the optimum model, since no method can be guaranteed to be the best in every context.

Funding Information The authors received funding provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference numbers RGPIN-2017-04200 and RGPAS-2017-507956, and the Chilean National Commission for Scientific and Technological Research (CONICYT), through CONICYT/PIA Project AFB180004 and the CONICYT/FONDAP Project 15110019.

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. van den Boogaart K, Tolosana-Delgado R (2018) Predictive geometallurgy: an interdisciplinary key challenge for mathematical geosciences. In: Handbook of Mathematical Geosciences, pp 673–686. Springer
2. Cherkassky V, Ma Y (2004) Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks 17(1):113–126
3. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv:1409.1259
4. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
5. Cochilco (2013) Actualización de Información sobre el Consumo de Energía asociado a la Minería del Cobre al año 2012. Tech. rep., COCHILCO
6. Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20(3):273–297
7. Curilem M, Acuña G, Cubillos F, Vyhmeister E (2011) Neural networks and support vector machine models applied to energy consumption optimization in semiautogenous grinding. Chemical Engineering Transactions 25:761–766
8. Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, pp 1597–1600
9. Goodfellow I, Bengio Y, Courville A, Bengio Y (2016) Deep learning, vol 1. MIT Press, Cambridge
10. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intelligent Systems and their Applications 13(4):18–28
11. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780
12. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366
13. Hoseinian FS, Abdollahzadeh A, Rezai B (2018) Semi-autogenous mill power prediction by a hybrid neural genetic algorithm. Journal of Central South University 25(1):151–158
14. Hoseinian FS, Faradonbeh RS, Abdollahzadeh A, Rezai B, Soltani-Mohammadi S (2017) Semi-autogenous mill power model development using gene expression programming. Powder Technology 308:61–69
15. Inapakurthi RK, Miriyala SS, Mitra K (2020) Recurrent neural networks based modelling of industrial grinding operation. Chemical Engineering Science, 115585
16. Izenman AJ (2008) Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, 1st edn. Springer
17. Jnr WV, Morrell S (1995) The development of a dynamic model for autogenous and semi-autogenous grinding. Minerals Engineering 8(11):1285–1297
18. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
19. Morrell S (2004a) A new autogenous and semi-autogenous mill model for scale-up, design and optimisation. Minerals Engineering 17(3):437–445
20. Morrell S (2004b) Predicting the specific energy of autogenous and semi-autogenous mills from small diameter drill core samples. Minerals Engineering 17(3):447–451
21. Navot A, Shpigelman L, Tishby N, Vaadia E (2006) Nearest neighbor based feature selection for regression and its application to neural activity. In: Advances in Neural Information Processing Systems, pp 996–1002
22. Ortiz J, Kracht W, Townley B, Lois P, Cardenas E, Miranda R, Alvarez M (2015) Workflows in geometallurgical prediction: challenges and outlook. In: 17th Annual Conference of the International Association for Mathematical Geosciences, IAMG
23. Pamparana G, Kracht W, Haas J, Díaz-Ferrán G, Palma-Behnke R, Román R (2017) Integrating photovoltaic solar energy and a battery energy storage system to operate a semi-autogenous grinding mill. Journal of Cleaner Production 165:273–280
24. Ramchoun H, Idrissi MAJ, Ghanou Y, Ettaouil M (2016) Multilayer perceptron: architecture optimization and training. IJIMAI 4(1):26–30
25. Román-Collado R, Ordoñez M, Mundaca L (2018) Has electricity turned green or black in Chile? A structural decomposition analysis of energy consumption. Energy 162:282–298
26. Rosenblatt F (1961) Principles of neurodynamics, perceptrons and the theory of brain mechanisms (No. VG-1196-G-8). Cornell Aeronautical Lab Inc, Buffalo, NY
27. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
28. Salazar J-L, Valdés-González H, Vyhmeister E, Cubillos F (2014) Model predictive control of semiautogenous mills (SAG). Minerals Engineering 64:92–96
29. Silva M, Casali A (2015) Modelling SAG milling power and specific energy consumption including the feed percentage of intermediate size particles. Minerals Engineering 70:156–161
30. Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Statistics and Computing 14(3):199–222
31. Van Ooyen A, Nienhuis B (1992) Improving the convergence of the back-propagation algorithm. Neural Networks 5(3):465–471
32. Vapnik V (1995) The nature of statistical learning theory. Springer-Verlag, New York
33. Warner B, Misra M (1996) Understanding neural networks as statistical tools. The American Statistician 50(4):284–293
34. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78(10):1550–1560
35. Wu Z, King S (2016) Investigating gated recurrent neural networks for speech synthesis. arXiv:1601.02539

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
