Factor Analysis in Fault Diagnostics Using Random Forest: Nagdev Amruthnath
popular techniques, one of the main challenges of this method is the need for high domain knowledge and experience. In most cases, with hundreds of machines in a manufacturing facility, this method becomes almost impractical and expensive. Due to this issue, supervised machine learning techniques are mostly used to diagnose faults in machines. Some of the most commonly used classification models for fault diagnosis are multi-class SVM, k-nearest neighbor, neural networks, and decision trees. In a changing environment such as manufacturing, if the classification models are not trained on all states of the machine, a new state of the machine (not part of the trained model) will be misclassified to a known state. Hence, unsupervised learning techniques using clustering have become more popular for fault state detection. Some of the commonly used techniques are the Gaussian finite mixture model [5], self-organizing maps, hierarchical clustering, and density-based clustering.

Factor analysis is a technique that involves identifying the significant factors for a particular group (or cluster). It has been widely used in classification problems such as customer segmentation, where the critical factors that affect each customer's decision are identified in order to achieve specific goals. The concept has also been widely used in other problems such as regression and dimensionality reduction. As in customer segmentation, in maintenance it is important to identify the key features that contribute to a specific fault or failure mode of the machine. These features are used to study the root cause of a particular problem in the machine, and the necessary design changes can then be made to eliminate the problem completely. In other instances, when a fault is detected, these features can be used to verify the state of the machine.

The machinery operates in a high-temperature environment. Vibration sensors are mounted on the X-axis and Y-axis. The vibration data is collected as a time series at a sample rate of 2048 Hz, and this data is collected every 5 minutes for 5 months continuously. The time signals are converted to frequency signals, and features are extracted in both domains. Since the operating frequency of the machinery was known to be 26.1 Hz, the frequency-domain features were collected around this band. In total, 36 vibration features and the ambient temperature around the machinery were obtained.

In most clustering models, the number of clusters to be formed is user-defined. The most commonly used techniques for finding the optimal number of clusters are AIC, BIC, the within sum of squares (WSS), the gap statistic, and the silhouette width method. As the size of the data increases, the AIC and BIC methods fail to provide the optimal number of clusters [8]. In such cases, WSS and silhouette width are calculated for k clusters. Using the elbow method on WSS, the optimal number of clusters can be identified. For silhouette width, the k that provides the maximum separation is considered the optimal number of clusters. In this research, both the WSS and silhouette techniques are performed to identify the optimal number of clusters, as shown in the corresponding figures.
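The WSS/silhouette selection of k can be sketched in a few lines. This is a minimal illustration only: the 1-D toy data, the naive kmeans_1d routine, and the candidate range of k are made-up assumptions for the sketch, not the models or the 36 vibration features used in this research.

```python
def kmeans_1d(data, k, iters=20):
    # deterministic init: evenly spaced points from the sorted data (k >= 2)
    s = sorted(data)
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        # recompute each center as the mean of its group
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in data]
    return centers, labels

def wss(data, centers, labels):
    # within-cluster sum of squares: smaller means tighter clusters
    return sum((x - centers[c]) ** 2 for x, c in zip(data, labels))

def silhouette(data, labels):
    # average silhouette width: (b - a) / max(a, b) per point
    scores = []
    for i, x in enumerate(data):
        same = [abs(x - y) for j, y in enumerate(data)
                if j != i and labels[j] == labels[i]]
        if not same:                       # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)          # mean distance inside own cluster
        b = min(                           # mean distance to nearest other cluster
            sum(abs(x - y) for j, y in enumerate(data) if labels[j] == c)
            / sum(1 for l in labels if l == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two well-separated groups, so the silhouette should peak at k = 2
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
sil = {k: silhouette(data, kmeans_1d(data, k)[1]) for k in (2, 3, 4)}
best_k = max(sil, key=sil.get)
print(best_k)  # 2
```

WSS always decreases as k grows, which is why the elbow has to be read off the curve; the silhouette gives a single maximizing k directly, which is why both are computed here.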
Figure 3: GMM cluster results for six clusters
Figure 7: Fan imbalance created on 22-Sep
in normal condition, as shown in Figure 8. The dominant cluster during this period was cluster 3.

From the above spectrum analysis, the modes of each of the clusters were diagnosed. Based on this information, some inferences can be drawn from the cluster plot generated using GMM. The conclusions are as follows:

- The GMM model was capable of diagnosing the machine repair states. This is a clear indication of its robustness in identifying changes in the process or environment.
- An imbalance state of the fan is observed from the beginning of the data collection, as seen in Figure 3. Although an assumption was made during spectrum analysis when creating the baseline, the clustering technique was capable of identifying the imbalance state.
- The clustering technique was capable of detecting the machine powered-off state as well.

From the above results, we can conclude that by using clustering and spectrum analysis, we can overcome some of the many challenges of supervised classification methods. Some of the advantages of the above technique are as follows:

- There is no requirement to train the model with all the states of the machine.
- The above procedure can be implemented in a shorter period; hence, the benefit of CBM can be realized faster.
- There is no need to retrain the model when a new state of the machine is identified.

III. FACTOR ANALYSIS FOR CLUSTERED DATA

In maintenance, upon detecting and diagnosing faults, it is important to identify the features that are specific to a particular cluster. The factors contributing to a specific state of the machine are used in studying the root cause of the problem and potentially eliminating it. They are also used in validating the cluster results. In this paper, we discuss a supervised learning technique called random forest, which is used to identify the important features that are specific to a particular fault of the machine (or cluster).

Random forest is an ensemble learning technique that is used in both regression and classification problems. In a regular decision tree model, a single decision tree is built; in a random forest, a number of decision trees are built, and the number of trees is usually user-defined. In the ensemble process, a vote from each decision tree is used in deciding the final class. In this technique, a sample of the data drawn with replacement is used for building each decision tree, along with a subset of the variables. This sampling and subsetting are performed at random; hence, the technique is called a random forest. The algorithm for random forest is given as follows [9]:

1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables. (Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors, and finally a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data, and finally a majority vote is taken for prediction.

An estimate of the error rate can be obtained, based on the training data, as follows [9]:

1. At each bootstrap iteration, predict the data not in the bootstrap sample using the tree grown with the bootstrap sample.
2. Aggregate the OOB (out-of-bag) predictions, calculate the error rate, and call it the OOB estimate of the error rate.

Variable importance in the random forest is defined based on the interaction with other variables. Random forest estimates the significance of a variable based on how much the prediction error increases when the data for that variable is permuted while the other variables are left unchanged. The calculations for variable importance are carried out one tree at a time as the random forest is constructed. Today, random forest is used in various applications such as banking, retail, the stock market, medicine, and image analysis. Some of the main advantages of this technique are as follows:

1. The same algorithm can be used for both classification and regression problems.
2. There is no issue of overfitting when this algorithm is used for either classification or regression.
3. The random forest can also be used for identifying important variables in the data while building the models.
4. It can handle large datasets efficiently without variable deletion.

In this research, the clusters are used as the response variable, and the feature data is used as the predictor variables. A total of 500 trees are generated using the random forest technique. The accuracy of the different models was considered to choose the best model, and the optimal model was chosen with mtry = 19. The summary of the resampling results across tuning parameters is shown in Table 2.

Table 2: Resampling results across tuning parameters

  mtry   Accuracy   Kappa
     2     0.8709   0.8451
    19     0.8853   0.8623
    37     0.8806   0.8567
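The random forest construction from [9] can be sketched in a few lines. This is an illustrative toy, not the 500-tree model used in this research: trees are reduced to single-split stumps to keep the sketch short, and fit_stump, random_forest, the two-class data, and all parameter values are assumptions made for the example. It also aggregates out-of-bag (OOB) predictions to estimate the error rate, as in [9].

```python
import random
from collections import Counter

def fit_stump(X, y, feats):
    # grow a one-split "tree": best threshold among the sampled predictors
    best = None
    for f in feats:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            acc = (sum(v == lmaj for v in left)
                   + sum(v == rmaj for v in right)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, f, t, lmaj, rmaj)
    if best is None:                    # degenerate bootstrap: no valid split
        maj = Counter(y).most_common(1)[0][0]
        return lambda row: maj
    _, f, t, lmaj, rmaj = best
    return lambda row: lmaj if row[f] <= t else rmaj

def random_forest(X, y, ntree=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees, oob_votes = [], [[] for _ in range(n)]
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1: bootstrap sample
        feats = rng.sample(range(p), mtry)           # step 2: mtry predictors
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
        for i in set(range(n)) - set(idx):           # rows left out of the bag
            oob_votes[i].append(trees[-1](X[i]))
    def predict(row):                                # step 3: majority vote
        return Counter(tree(row) for tree in trees).most_common(1)[0][0]
    voted = [(i, v) for i, v in enumerate(oob_votes) if v]
    oob_err = sum(Counter(v).most_common(1)[0][0] != y[i]
                  for i, v in voted) / len(voted)
    return predict, oob_err

# toy data: two classes, both features track the class
X = [[0.1, 10], [0.2, 12], [0.3, 11], [0.4, 13],
     [0.9, 30], [1.0, 32], [1.1, 31], [1.2, 33]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
predict, oob_err = random_forest(X, y, ntree=25, mtry=1)
print(predict([0.15, 11]), predict([1.05, 31]), oob_err)
```

Setting mtry = p would turn this into plain bagging, matching the special case noted in step 2.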
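The permutation-based variable importance used to rank features per cluster can be sketched as follows. The predict function, toy data, and repeat count here are hypothetical stand-ins; a real random forest computes the permuted error one tree at a time over that tree's own OOB rows, which this simplified sketch does not do.

```python
import random

def permutation_importance(predict, X, y, repeats=10, seed=0):
    rng = random.Random(seed)
    # baseline accuracy on the unpermuted data
    base = sum(predict(r) == t for r, t in zip(X, y)) / len(y)
    scores = {}
    for f in range(len(X[0])):
        drop = 0.0
        for _ in range(repeats):
            col = [r[f] for r in X]
            rng.shuffle(col)                     # break feature/target link
            Xp = [r[:f] + [v] + r[f + 1:] for r, v in zip(X, col)]
            acc = sum(predict(r) == t for r, t in zip(Xp, y)) / len(y)
            drop += base - acc                   # accuracy lost = importance
        scores[f] = drop / repeats
    return scores

# hypothetical model that only looks at feature 0, so permuting feature 0
# should hurt accuracy while permuting feature 1 should not
predict = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.1, 7], [0.3, 1], [0.7, 4], [0.9, 2], [0.2, 9], [0.8, 5]]
y = [0, 0, 1, 1, 0, 1]
imp = permutation_importance(predict, X, y)
print(imp)
```

A feature whose permutation leaves the error unchanged, like feature 1 here, is the analogue of SdYAxisF showing no significance for cluster 4.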
Figure 9: Optimal model selection using Random Forest

After identifying the best model, the important variables for every cluster group were identified. The results are shown in Figure 10.

In Figure 10, we can observe the importance of all the features for all six clusters. From these results, we can see that all the features have some amount of significance for clusters 1, 2, 3, 5, and 6, whereas for cluster 4 the SdYAxisF feature had no significance. We can also observe that clusters 2, 3, 4, 5, and 6 have ambient temperature as the most significant variable, while in cluster 1, MeanXAxisT was the most significant variable. From the above analysis, we can also observe that the features have different levels of significance in different clusters. This observation strongly supports the conclusion that the clusters are unique, with different characteristics.

IV. CONCLUSION

In rotating machinery, vibration analysis is one of the most sought-after techniques for condition-based monitoring. In a highly dynamic environment such as manufacturing, unsupervised machine learning techniques such as clustering are used to group the data into clusters. These individual clusters represent a state of the machine. The mode of each state can be diagnosed using frequency spectrum analysis. The
V. REFERENCES