Failure Detection and Prediction: Computer Science and Engineering
Akash Pandey
akashpandeyssm71@gamil.com
Computer Science And Engineering
I. Abstract
In today's world we need quick and exact results, which can be achieved with distributed systems. The cloud plays a vital role in the performance of these systems, as almost all of them have some connection with it. The cloud provides a platform from which a system can fetch data and execute quickly while the distributed systems keep working. It would be unreasonable, however, to expect any technology to provide such a high level of flexibility without any maintenance. The cloud can face failures, and since it works at a very high level, a failure will hamper the work accordingly. These failures can be seen in different platforms, or layers, of the cloud, such as data centers, which are among the most highly affected. Failures can exist at different levels and for different reasons; for example, we can face failures in the cloud virtual processors. This hampers our processing power, and we fail to achieve the target for which we brought the cloud into the picture. It is therefore necessary to keep these failures in mind and to be careful about them, because they can degrade both speed and results. Taking into consideration parameters such as CPU and memory, we can add a layer for the detection of faults and train it with these parameters.

I. INTRODUCTION
Cloud is a technology that stands out because of the features it provides to customers. The main feature is pay per use, which gives customers great flexibility; another is adaptability. It provides automation for tasks and self-service: most things are placed in the hands of the customers, who arrange them according to their needs. There are a number of cloud deployment models (private, public, community and hybrid), and the chances of failure might differ depending on the model.

So, we must have some failure recovery methods to tackle these situations. More than tackling any failure, knowing its origin matters: the failure might come up again and again in the future, so we need to get to the source and fix it. Sometimes it is impossible to keep an eye on these sources. Here we need methods and technologies that can work separately on detecting these failures, and sometimes even on predicting them before they occur. This saves a lot of time in resolving failure situations. Prediction methods can help in the early prediction of the sources or situations where failures might occur. We can analyse these scenarios for different types of failures and work on methods for tackling them as well. Finally, the cloud came into the picture to help get work done quickly and with accurate precision. If we have errors or failures in it, they will hamper the business and the services we provide to customers. For example, data that is supposed to be fetched in a few seconds might take a few minutes or hours, which defeats the purpose of using the cloud. Such bad performance goes against the Service Level Agreements: the cloud provider is supposed to look after a number of features and their performance, such as pay per use, abstraction and scalability, which together determine the Quality of Service.

For any service provider, the availability and response time of the cloud matter a lot. The response time also depends on the way any fault that occurs is handled or avoided. This is a very big challenge for the service provider, as it can lead to a large monetary loss for the organization. Since time matters a lot, we need a system which can bring the processes back to the normal state. It is a better option to predict the failure first instead of waiting for one. But failure detection is challenging work because of the complex nature of the cloud, a complexity that is not visible to customers because of abstraction. Troubleshooting done at the physical level in data centers is much easier than troubleshooting in the cloud, which can be time consuming and complex. Detection at the very beginning of a failure is the best possible way to handle it in the simplest and quickest manner. This technique of early detection is called Failure Prediction.

Log files and other performance logs play a vital role in predicting failures in the cloud. Monitoring data about interaction messages, software logs, traffic, system virtualization information and more is collected. This collected data is not a small file but a very big chunk of data, which makes it really challenging to deduce information from it. Here the provider needs to take the right steps to bring the needed, fruitful insights out of the data. Fruitful insights here means analysis results that can help predict failures and also help in troubleshooting. The approach that can be followed here is statistical analysis. Data labeled as a failure scenario or a non-failure scenario is fed to supervised learning models, and these models learn a generalised way to predict the case of failure depending on the given scenario. These datasets are called training datasets; every point in them is used to train the models, and every point denotes attributes such as log messages and traffic, together with the label of that scenario as failure or no-failure. Here we will use two modules: the first one uses log messages and other interaction messages with other components in the cloud, and the other module will be used at the time of prediction. The prediction made by these modules will be failure or no-failure; in case of failure, the processes to overcome it will be launched.
[Figure: Failure Prediction, Failure Detection, Fault managing Process]

However, the labeled dataset is not always available in real-world cloud computing systems.
[Figure: Message Logs, Messages Dictionary, Pattern Generation]

Suppose we have a message M. We first need to split the sentence into words. We then take as many variables as there are ranks, for example a1, a2, a3, a4, a5, ..., an.

P(M) = probability of occurrence of M
P(O | M) = probability of the outcome O when M is given

P(O | M) gives the result. This is a classification case, so we can use a Naïve Bayes classifier to classify the outcomes as failure or non-failure:

P(a | b) = P(b | a) P(a) / P(b)

Here P(M) can be calculated with the help of the dictionary: the words in a message will not lie in only one rank, and we can calculate the probability on the basis of the number of ranks in which the words are found.

V. PREDICTING FAILURE.
When we get a message, we make a pattern out of it and send that pattern to the module, and the module
finds the probability for that pattern. Keeping the probability of that pattern, we then find the probability of a failure occurring. If the probability found is higher than a threshold value, it is considered a failure. Sometimes, for a better result or depending on the scenario, the previous pattern can also be taken into consideration while calculating this probability.
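The prediction step can be sketched in Python as a small Naïve Bayes classifier with a threshold. This is only an illustration under stated assumptions: it uses plain word counts with Laplace smoothing in place of the paper's dictionary of ranks, and the function names, the label strings and the default threshold of 0.5 are hypothetical choices, not values taken from the text.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_naive_bayes(samples: List[Tuple[str, str]]) -> Model:
    """Count labels and per-label words from (message, label) pairs."""
    label_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = {}
    vocabulary: Set[str] = set()
    for message, label in samples:
        words = message.lower().split()  # split the message into words
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def failure_probability(message: str, model: Model, failure_label: str = "failure") -> float:
    """P(failure | message) by Bayes' rule; assumes failure_label was seen in training."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores: Dict[str, float] = {}
    for label, count in label_counts.items():
        log_p = math.log(count / total)  # prior P(label)
        n_words = sum(word_counts[label].values())
        for word in message.lower().split():
            # Laplace-smoothed likelihood P(word | label)
            log_p += math.log((word_counts[label][word] + 1) / (n_words + len(vocabulary)))
        log_scores[label] = log_p
    # Dividing by the summed scores plays the role of P(M) in Bayes' rule.
    peak = max(log_scores.values())
    evidence = sum(math.exp(score - peak) for score in log_scores.values())
    return math.exp(log_scores[failure_label] - peak) / evidence


def predict(message: str, model: Model, threshold: float = 0.5) -> str:
    """Report a failure when the posterior exceeds the (assumed) threshold."""
    return "failure" if failure_probability(message, model) > threshold else "no-failure"
```

Working in log space avoids numerical underflow on long messages. The threshold would need to be tuned on real data, since the text only says that a threshold is used.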
VI. TROUBLE DETECTION.

This part covers the assurance-providing module, the one that detects, in real time after the prediction, whether a failure has actually occurred. We have two classes here: we can put our cloud system into one class where we detect a failure and another where we do not. The classes can be named abnormal and normal behavior. The classification is done on the basis of the performance of components, such as CPU and memory usage and system temperature. We first need data to train the module about the trouble; only then can we use it in real time. But there can be scenarios where the module itself becomes faulty, because of some component or for any other reason, and fails to store data. If we use a purely probabilistic approach to detect failures, it becomes nearly impossible, because too little data is available for the failure state. So we can use semi-supervised learning here: we find the relative probability for the major class, which is the normal class, and when the probability of a state belonging to the normal class is low, the state falls in the abnormal class.

Another aspect to consider is that we are detecting the failure on the basis of a certain set of attributes, such as CPU performance and the others mentioned earlier. We must work with the set of attributes that defines the class in the best possible way. So we can use dimensionality reduction techniques, such as PCA, to select the best attributes. Here again we work with the covariance between these attributes: if we have two attributes a and b, the covariance helps us decide which to keep in the working dataset. Since we want mutually related attributes, covariance tells us how mutually related two attributes are, and the set with high covariance is selected.
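The detection idea above can be sketched in Python as follows: a profile is fitted on normal-class samples only, and a state is flagged as abnormal when its likelihood under that profile is low. This is a sketch under assumptions, not the paper's implementation: it models each attribute as an independent Gaussian, the cut-off `min_log_likelihood` is chosen by the caller, and the attribute order (CPU, memory, temperature) is hypothetical. The `covariance` helper only computes the sample covariance between two attributes; it does not perform PCA itself.

```python
import math
from typing import List, Sequence, Tuple

Profile = Tuple[List[float], List[float]]


def fit_normal_profile(normal_samples: Sequence[Sequence[float]]) -> Profile:
    """Estimate per-attribute mean and standard deviation from normal samples only."""
    n = len(normal_samples)
    dims = len(normal_samples[0])
    means = [sum(s[d] for s in normal_samples) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        variance = sum((s[d] - means[d]) ** 2 for s in normal_samples) / (n - 1)
        stds.append(math.sqrt(variance) or 1e-9)  # guard against zero spread
    return means, stds


def normal_log_likelihood(sample: Sequence[float], profile: Profile) -> float:
    """Log density of a sample under independent Gaussians (an assumption)."""
    means, stds = profile
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))
        for x, m, s in zip(sample, means, stds)
    )


def classify(sample: Sequence[float], profile: Profile, min_log_likelihood: float) -> str:
    """A state unlikely to belong to the normal class falls in the abnormal class."""
    if normal_log_likelihood(sample, profile) >= min_log_likelihood:
        return "normal"
    return "abnormal"


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance between two attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
```

One possible way to pick the cut-off, again an assumption, is to take the lowest log-likelihood seen among the normal training samples and subtract a small margin. Because only normal data is needed for fitting, the approach still works when failure samples are scarce.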
VII. CONCLUSION.