
Failure Detection And Prediction

Akash Pandey
akashpandeyssm71@gamil.com
Computer Science And Engineering

Abstract

Today's applications need quick and exact results, which can be achieved with distributed systems. The cloud plays a vital role in the performance of these distributed systems, as nearly every one of them is connected to the cloud in some way. The cloud provides a platform from which a system can fetch data and execute work quickly while the distributed systems keep running. It would be unrealistic to expect any technology to provide such a high level of flexibility without maintenance. The cloud can fail, and because it operates at such a large scale, a failure hampers the work accordingly. These failures can appear in different platforms, or layers, of the cloud, such as data centers, which are among the components most heavily affected. Failures can exist at different levels and for different reasons; for example, the cloud's virtual processors can fail. This reduces processing power, and we then fail to achieve the very target for which the cloud was adopted. It is therefore necessary to keep these failures in mind and guard against them, because they can degrade both speed and results. Taking parameters such as CPU and memory into consideration, we can add a layer for fault detection and train it with these parameters.

I. INTRODUCTION
The cloud stands out because of the features it provides to customers. The main feature is pay per use, which gives customers great flexibility; adaptability is another. It provides automation for tasks and self-service: most things are placed in the hands of customers, who arrange them according to their needs. There are several cloud deployment models (private, public, community and hybrid), and the chance of failure may differ depending on the model.

So we must have failure recovery methods to handle these situations. Knowing the origin of a failure matters even more than handling it, because the failure may come up again and again, so we need to get to the source and fix it. Sometimes it is impossible to keep an eye on these sources. Here we need methods and technologies that work separately on detecting these failures, and sometimes even on predicting them before they occur. This saves much of the time spent resolving failure situations. Prediction methods can help identify early the sources or situations where failures might occur. We can analyse these scenarios for different types of failures and work on methods for handling them as well. Finally, the cloud came into the picture to help get work done quickly and with accurate precision. Errors or failures hamper the business and the services provided to customers. For example, data that is supposed to be fetched in a few seconds might take minutes or hours, which defeats the purpose of using the cloud. Such poor performance violates the Service Level Agreements. The cloud provider is expected to look after a number of features and their performance, such as pay per use, abstraction and scalability, which together determine the Quality of Service.

For any service provider, the availability and response time of the cloud matter a lot. Response time also depends on how a fault is handled or avoided. This is a big challenge for the service provider, as a failure leads to a significant monetary loss for the organization. Because time matters, we need a system that can bring the processes back to the normal state. It is better to predict a failure first than to wait for one. But failure detection is challenging because of the complex nature of the cloud, and this complexity is hidden from customers by abstraction. Troubleshooting at the physical level in data centers is much easier by comparison; in the cloud it can be time consuming and complex. Detecting a failure at its very beginning is the simplest and quickest way to handle it. This technique of early detection is called Failure Prediction.

Log files and other performance logs play a vital role in predicting failures in the cloud. Monitoring data about interaction messages, software logs, traffic, system virtualization information and more is collected. The collected data is very large, which makes it challenging to deduce information from it. The provider needs to take the right steps to extract useful insights from the data, meaning analysis results that help predict failures and also help in troubleshooting. The approach that can be followed here is statistical analysis. Data labeled as a failure scenario or a non-failure scenario is fed to supervised learning models, and these models learn a general way to predict failure for a given scenario. These datasets are called training datasets. Every point in them is used to train the models; each point holds attributes such as log messages and traffic, together with the failure or no-failure label of that scenario. Here we will use two modules: the first uses log messages and other interaction messages exchanged with other components in the cloud, and the other is used at the time of prediction. The prediction made by these modules is either failure or no-failure; in case of failure, the processes to overcome it are launched.
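As a rough illustration of how the two modules interact, the sketch below follows the flow described in this paper: a predicted failure only initiates the recovery processes, and the detection result then decides whether they complete or are rolled back. The names `RecoveryProcess` and `handle_timestamp`, the state labels, and the boolean inputs are hypothetical and chosen for illustration; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryProcess:
    """Hypothetical recovery task, e.g. saving cloud state or making a backup."""
    name: str
    state: str = "idle"
    history: List[str] = field(default_factory=list)

    def _move_to(self, new_state: str) -> None:
        self.state = new_state
        self.history.append(new_state)

    def initiate(self) -> None:
        self._move_to("initiated")

    def complete(self) -> None:
        self._move_to("completed")

    def roll_back(self) -> None:
        self._move_to("rolled_back")


def handle_timestamp(predicted_failure: bool,
                     detected_failure: bool,
                     processes: List[RecoveryProcess]) -> str:
    """Apply the predict / launch / detect / decide flow for one timestamp."""
    # Prediction says no failure: nothing is launched.
    if not predicted_failure:
        return "no_action"

    # Prediction says failure: recovery processes are initiated,
    # but they do not run to completion yet.
    for process in processes:
        process.initiate()

    # Detection confirms the failure: let the processes finish.
    if detected_failure:
        for process in processes:
            process.complete()
        return "recovered"

    # Detection contradicts the prediction: undo the initiated work.
    for process in processes:
        process.roll_back()
    return "rolled_back"
```

For example, `handle_timestamp(True, False, processes)` models a wrong prediction: every process is initiated and then rolled back.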
[Figure: Failure Prediction and Failure Detection both feed the Fault Managing Process]

Fig 1. Both Failure Prediction and Detection must be 1 for process to proceed.

There can be cases where the modules' prediction does not match the real scenario, that is, the prediction was wrong. In this case we check the state messages of the cloud: if it is in failure, the launched recovery processes execute; otherwise they are rolled back.

II. LITERATURE SURVEY
Author Name | Paper Name | Problems | Issues
Yukihiro Watanabe, Hiroshi Otsuka, Masataka Sonoda, Shinji Kikuchi and Yasuhide Matsumoto | Online Failure Prediction in Cloud Datacenters by Real-Time Message Pattern Learning | Various types of message logs communicated between components | In the future, accuracy can be improved by including more parameters.
Qiang Guan, Ziming Zhang and Song Fu | Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems | Assumption that the training dataset is labeled | In future, integrating two failure management approaches will lead to better dependability.
Dinh-Mao Bui, Thien Huynh and Sungyoung Lee | Fuzzy Fault Detection in IaaS Cloud Computing | Most of the existing prediction methods are supervised learning models. However, a labeled dataset is not always available in real-world cloud computing systems. | A probabilistic approach can be used to increase the ability of self-decision making for a fault tolerance system.

III. PROBLEM
Complexity is the biggest problem that CSPs (Cloud Service Providers) face while tackling these failure scenarios. The CSP needs to provide the best and most suitable fault management for its customers. To do so, it needs to be clear about each and every scenario in the cloud and its failures. The CSP needs to understand the why behind a failure, that is, the reason the failure happened. Without that understanding, it may provide a good fault management system that is still not a suitable one. The reasons behind these failures can be human error, cloud downtime, a spike in cloud demand, failure of third-party components, disk failures and many more. These failures may lead to downtime of resources, web applications and other services provided. A suitable example of such a scenario is Amazon: it was down for 45 minutes, and this led to a loss of millions of dollars. The dynamic nature of the cloud makes it challenging to use data collected at one time for failure prediction, as this dynamic nature may lead to a scenario very different from the one fed to the module during training. The CSP also needs to keep up with innovations and updates; changes made to the hardware, or software updates, can also bring new failures in different scenarios. So the module must be kept aware of the variations made in the components and of the real failures that occurred. The larger the cloud system, the more a failure affects revenue. Maintaining customer satisfaction and providing quick resolution plans in case of failure helps the CSP grow and keep a good relation with its customers, which in turn helps the CSP's business grow.

IV. SOLUTION
The solution discussed in this term paper can be divided into four steps. First, the Failure Prediction module is used to predict whether a failure will occur at a given timestamp. This prediction leads to the next step: if the prediction is Failure, the processes to resolve it are launched. These are mainly storage management processes: the cloud state is saved and other backups are made for recovery. After these processes are issued they do not execute completely but are only initiated. Their full execution depends on the result of the next stage. The third step deals with detecting whether the failure has really occurred or not. This is done by checking the log data about the state
of the cloud. The final step is checking the result given by the third step: if the third step's result is failure, the processes proceed with their work to resolve the failure; otherwise the processes are rolled back.

i. First Step to Predict the Occurrence of Failure
The first step predicts the failure as discussed above. This is done by analysing the message logs and interaction logs in the cloud systems. The module learns a generalised pattern; when this pattern matches a real-time pattern, the result is failure or non-failure depending on that generalisation. The prediction is made for a particular time and with particular message logs. As the cloud is a very dynamic technology, it can have dynamic changes in its systems or updates to its components too. These changes also need to be stored in the logs with their patterns and results, as they can bring failures or alter the working of the cloud.

ii. The first thing done in Prediction part is Message classification and Pattern generation
To predict a failure, the module first needs to be trained on the patterns of message logs in which a failure occurred and also on those in which no failure occurred, so that it can identify both. As shown in the figure below, the messages stored in the log by monitoring the cloud are first separated into types of messages. This separation can be done using a message dictionary, where messages containing a certain set of words are assigned a certain rank number.

Rank | Messages Words
1 | Interface, component failure
2 | Mail Failure, Storage Fault
3 | Segmentation Fault, Invalid Notification

[Figure: Message Logs -> Messages Dictionary -> Pattern Generation]

Suppose we have a message M. We first split the sentence into words. Now take as many variables as there are ranks, for example a1, a2, a3, a4, a5, ..., an, where n is the highest rank. For every word belonging to a particular rank type, the corresponding variable is incremented; for example, a1 becomes a1+1 if the word belongs to a Rank 1 type message. The variable with the highest value decides the rank of the message.

If max(a1, a2, a3, a4, ..., an) is a1, then the rank of the message will be 1.

iii. Information About Failure.

Every message has a priority field which shows information about the type of failure. This information helps in the next level of separation: these failures can be of type error, fault, alert and many more.

iv. Generation of Pattern.

The patterns are generated on the basis of the sequence of messages that appeared in the log. For example, if a rank 1 message occurred, then a rank 2 message, and this led to failure, the generated pattern is 1-2-failure. These patterns are represented as vectors: one vector of binary values represents whether a failure occurred or not, and another vector represents the pattern, i.e. the order in which the messages came (the ranks of the messages), for example the pattern vector [1,2] with the outcome vector [1]. So a pattern is a sequence of messages in some order, which is matched when the module is asked to make a prediction in a real scenario.

v. Outcome from Pattern.

The patterns obtained carry labels such as failure or no-failure. These outcomes must have a good covariance with the pattern vectors, meaning how well the pattern vector is able to define the outcome. This is the main thing the module needs to learn. So we need to find the covariance between the pattern vector column and the outcome column. If the covariance is high and positive, the outcome is clearly defined by the patterns. If the covariance is low or moderate, we can combine some other patterns to define the outcome properly. Here we will use Bayes' theorem, the generalized form of which is:

P(a|b) = P(b|a) P(a) / P(b)

Applied to a pattern M and an outcome O, this reads P(O|M) = P(M|O) P(O) / P(M). Here P(M) is the probability of occurrence of M, and P(O|M), the probability of the outcome O when M is given, gives the result. This is a classification case, so we can use a Naïve Bayes classifier to classify the outcomes as failure or non-failure.

P(M) can be calculated with the help of the dictionary: the words in a message will not lie in only one rank, so its probability can be calculated on the basis of the number of ranks in which it is found.

V. PREDICTING FAILURE.
When we get a message we make a pattern out of it and send that pattern to the module, and the module
finds the probability of that pattern. Keeping the probability of that pattern, we then find the probability of a failure occurring. If the probability found is higher than a threshold value, it is considered a failure. Sometimes, for a better result or depending on the scenario, the previous pattern can also be taken into consideration while calculating this probability.
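The prediction steps above (rank each message with the dictionary, form a pattern, estimate the failure probability, compare with a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary words are split from the rank table, ties between ranks go to the lowest rank, the probabilities are plain frequency estimates over a labeled history, and Bayes' theorem is applied to the whole pattern rather than through a full Naïve Bayes classifier. All function names and the `history` format are hypothetical.

```python
from collections import Counter

# Hypothetical message dictionary built from the rank table: rank -> words.
# A word such as "failure" or "fault" may appear under more than one rank.
MESSAGE_DICTIONARY = {
    1: {"interface", "component", "failure"},
    2: {"mail", "failure", "storage", "fault"},
    3: {"segmentation", "fault", "invalid", "notification"},
}


def classify_message(message):
    """Return the rank whose counter (a1..an) is highest, or None if no word matches."""
    counts = Counter()
    for word in message.lower().split():
        word = word.strip(".,:;")
        for rank, words in MESSAGE_DICTIONARY.items():
            if word in words:
                counts[rank] += 1  # a_rank becomes a_rank + 1
    if not counts:
        return None
    best = max(counts.values())
    # Assumption: on a tie, the lowest rank wins.
    return min(rank for rank, count in counts.items() if count == best)


def build_pattern(messages):
    """Pattern = ranks of the messages in the order they arrived."""
    ranks = (classify_message(m) for m in messages)
    return tuple(r for r in ranks if r is not None)


def failure_probability(pattern, history):
    """Estimate P(failure | pattern) = P(pattern | failure) P(failure) / P(pattern).

    `history` is a list of (pattern, failed) pairs with failed in {True, False}.
    """
    total = len(history)
    n_fail = sum(1 for _, failed in history if failed)
    n_pattern = sum(1 for p, _ in history if p == pattern)
    if total == 0 or n_fail == 0 or n_pattern == 0:
        return 0.0  # assumption: an unseen pattern gives no evidence of failure
    n_both = sum(1 for p, failed in history if failed and p == pattern)
    p_failure = n_fail / total
    p_pattern = n_pattern / total
    p_pattern_given_failure = n_both / n_fail
    return p_pattern_given_failure * p_failure / p_pattern


def predict_failure(pattern, history, threshold=0.5):
    """Report a failure when the estimated probability exceeds the threshold."""
    return failure_probability(pattern, history) > threshold
```

With frequency estimates the Bayes expression reduces to the share of past occurrences of the pattern that ended in failure; a real module would likely need smoothing for rare patterns, which is left out here.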

VI. TROUBLE DETECTION.

This part covers the assurance module, which detects after the prediction whether a failure has actually occurred. We have two classes here: the cloud system can be put into one class where we detect a failure and another
where we do not. The classes can be named abnormal and normal behavior. The classification is done on the basis of the performance of components, such as CPU and memory usage, system temperature, etc. We need data to train the module about the trouble first; only then can we use it in real time. But there can be scenarios where the module itself becomes faulty because of some component, or fails to store data for some other reason. A purely probabilistic approach to detecting the failure is then impractical, because too little data is available for the failure state. So we can use semi-supervised learning here: we find the relative probability for the majority class, which is the normal class; when the probability of a state belonging to the normal class is low, the state falls into the abnormal class. Another aspect to consider is that we detect the failure on the basis of a certain set of attributes, such as CPU performance and the others mentioned earlier. We must work with the set of attributes that defines the class in the best possible way. So we can use dimensionality reduction techniques, such as PCA, to select the best attributes. Here again we work with the covariance between these attributes: given two attributes a and b, the covariance helps us decide which to keep in the working dataset. As we want mutually related attributes, covariance tells how mutually related two sets are, and the set with high covariance will be selected.
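One possible reading of the normal-class approach above is sketched below: a profile is fitted on normal-state samples only, and a state is flagged as abnormal when its likelihood under that profile is low. The independent Gaussian model per attribute, the threshold rule (the lowest score seen among the normal training samples), and all function names are assumptions made for illustration; the paper only states that the probability of belonging to the normal class is used.

```python
import math
from statistics import mean, stdev


def fit_normal_profile(normal_samples):
    """Per-attribute (mean, standard deviation) learned from normal-state samples only."""
    columns = list(zip(*normal_samples))
    return [(mean(column), stdev(column)) for column in columns]


def log_likelihood(sample, profile):
    """Log-likelihood of a sample, assuming independent Gaussian attributes."""
    total = 0.0
    for value, (mu, sigma) in zip(sample, profile):
        sigma = max(sigma, 1e-6)  # guard against attributes that never vary
        total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - (value - mu) ** 2 / (2 * sigma ** 2))
    return total


def choose_threshold(normal_samples, profile):
    """Assumption: anything less likely than the least likely normal sample is abnormal."""
    return min(log_likelihood(sample, profile) for sample in normal_samples)


def is_abnormal(sample, profile, threshold):
    """Low probability of belonging to the normal class means the abnormal class."""
    return log_likelihood(sample, profile) < threshold


# Hypothetical monitoring data: (CPU usage %, memory usage %, temperature in C).
normal_samples = [(40, 55, 60), (42, 57, 61), (38, 54, 59), (41, 56, 62), (39, 55, 60)]
profile = fit_normal_profile(normal_samples)
threshold = choose_threshold(normal_samples, profile)
```

A state such as `(95, 97, 90)` lies far from the normal profile and is flagged by `is_abnormal`, while the training samples themselves are not. Attribute selection (for example with PCA) would happen before this step and is not shown.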

VII. CONCLUSION.

We conclude this paper by attempting to solve the complex problem of proactive management of failures in a large-scale cloud system. Proactive management of failures must be at the highest priority among the problems to be solved by the CSPs. Failure prediction and detection can not only increase consumer satisfaction, it can also increase sales revenue and decrease economic losses.

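VIII. REFERENCES
1. Bala, A., & Chana, I. (2015). Intelligent failure prediction models for scientific workflows. Expert Systems with Applications, 42(3), 980–989. doi: 10.1016/j.eswa.2014.09.014.
2. Fu, S. (2011). Performance Metric Selection for Autonomic Anomaly Detection on Cloud Computing Systems. 2011 IEEE Global Telecommunications Conference - GLOBECOM 2011. doi: 10.1109/glocom.2011.6134532.
3. 2020. [online] Available at: <https://www.researchgate.net/publication/224256595_Ensemble_of_Bayesian_Predictors_for_Autonomic_Failure_Management_in_Cloud_Computing> [Accessed 24 March 2020].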
