

MicroRAS: Automatic Recovery in the Absence of Historical Failure Data for Microservice Systems

Conference Paper · December 2020




All content following this page was uploaded by Li Wu on 26 November 2020.



MicroRAS: Automatic Recovery in the Absence of Historical Failure Data for
Microservice Systems

Li Wu∗†, Johan Tordsson∗‡, Alexander Acker†, Odej Kao†

∗ Elastisys AB, Umeå, Sweden, Email: {li.wu, johan.tordsson}@elastisys.com
† Distributed and Operating Systems Group, TU Berlin, Berlin, Germany, Email: {alexander.acker, odej.kao}@tu-berlin.de
‡ Department of Computing Science, Umeå University, Umeå, Sweden

Abstract—Microservices represent a popular paradigm to construct large-scale applications in many domains thanks to benefits such as scalability, flexibility, and agility. However, it is difficult to manage and operate a microservice system due to its high dynamics and complexity. In particular, the frequent updates of microservices lead to the absence of historical failure data, where the current automatic recovery methods fall short. In this paper, we propose an automatic recovery method named MicroRAS, which requires no historical failure data, to mitigate performance issues in microservice systems. MicroRAS is a model-driven method that selects the appropriate recovery action with a trade-off between the effectiveness and recovery time of actions. It estimates the effectiveness of an action in terms of its effects of recovering the pinpointed faulty service and its effects of interfering with other services. The estimation of action effects is based on a system-state model represented by an attributed graph that tracks the propagation of effects. For the experimental evaluation, several types of anomalies are injected into a microservice system based on Kubernetes, which also serves a real-world workload. The corresponding benchmarks show that the actions selected by MicroRAS can recover the faulty services by 94.7%, and reduce the interference to other services by at least 44.3% compared to baseline methods.

Keywords-automatic recovery; microservices; performance issues; cloud computing; Kubernetes

I. INTRODUCTION

Microservices architecture design is increasingly deployed in large-scale software systems, particularly in cloud-based systems [1]. The state-of-the-art literature shows that microservices-based architectures can enhance the adaptability to technological changes, improve the scalability, and more importantly, reduce the time-to-market [2]. However, microservice systems tend to be fragile due to their highly distributed nature and the large number of messages passed between services [3].

To achieve resilient microservices, solutions ranging from fault-tolerant service design and development to service resilience testing and self-healing either exist or have yet to be developed. Several investigations have focused on resiliency patterns in microservices design [4], [5] and testing [6], [7]. By contrast, less attention has been paid to automatic recovery techniques.

One of the key problems arising in automatic recovery is to determine: given a detected service performance anomaly, which action(s) should be taken to mitigate the issue? In a microservice system, the selection of recovery actions is difficult because of the following challenges: (1) delusive corrective actions: Due to the dynamics and complexity of microservice systems, the results of performance anomaly detection and root cause localization frequently include false positives. Such an incorrect analysis results in delusive corrective actions, thus reducing the probability of selecting the best possible recovery actions and increasing the risk of executing incorrect actions; (2) frequent updates: Microservices are updated frequently to meet customers' needs (e.g., Netflix updates thousands of times per day [8]). These dynamic microservices make the historical data of recovery unavailable, decreasing the precision of the existing data-driven methods and thus aggravating the difficulty of action selection; (3) a large number of metrics: Due to the large scale of microservices, the number of monitoring metrics is very high (e.g., Netflix exposes 2 million metrics [9]). It would cause significant overhead and delay if all these metrics were to be used for action selection; (4) uncertainty: The dynamics of the infrastructures and microservices introduce great uncertainty to the system, so it is hard to foresee the impact of the applied recovery actions. Therefore, to ensure the selected action is effective for a detected performance issue, it is crucial to develop a method to predict its effects without historical failure data.

In the literature (briefly surveyed in Section VI), different approaches have been proposed to recover issues in cloud, networks, and distributed systems. For example, rule-based approaches select recovery actions by matching the user-defined rules [10]. However, the rules require frequent updates following the corresponding changes in the microservices, which conflicts with the goal of automatic recovery. Case-based approaches identify recovery actions by matching previous failure cases [11]. However, their overhead and delay are high due to the numerous metrics in microservices. This also holds true for the learning-based approaches [12]. Other methods select recovery actions by analyzing the action properties [13], [14]. However, they are highly dependent on the probabilistic parameters learned from recovery history (e.g., the success rate of an action for a given failure), and can be misled by delusive corrective actions. Notably, all of the above approaches assume that historical failure data is available to learn from, which is not always true in microservice systems.

To overcome the shortcoming of requiring historical failure data in the existing work, we propose an automatic recovery selection method, Microservices Recovery Action Selection (MicroRAS), to mitigate performance issues. MicroRAS is a model-based method that can adapt to the frequent changes of microservices without requiring historical data of previous failures and can reduce the potentially destructive consequences of recovery actions by assessing their side effects. MicroRAS first models the system state with an attributed graph used to track the propagation of positive and negative effects of recovery actions. Next, it estimates the benefit (positive effects) and the risk (negative effects) associated with each action by predicting the future state into which the system would transit with the selected action. Lastly, it aggregates all these effects into an effectiveness value with fuzzy logic and selects the best possible action with a trade-off between action effectiveness and the time for the action to mitigate the issue. We evaluate our MicroRAS method by applying the selected recovery actions to mitigate different types of performance anomalies injected into a microservice system where the Sock-shop¹ microservices benchmark is deployed on Kubernetes running in Google Cloud Engine (GCE)². The results show that the actions selected by MicroRAS can mitigate the performance issues well, recovering the performance of faulty services by 94.7% and limiting the effect on other services to within 15.2%. In conclusion, our main contributions are the following:
• We propose a recovery action selection method based on real-time data collection and action properties observed during non-anomalous operation instead of historical failure data (Section IV).
• We propose an action effects estimation model to capture the positive and negative effects associated with a recovery action, which is adaptive to the anomalous context of the system (Section IV).
• We evaluate MicroRAS by mitigating different types, levels, and contexts of anomalies. The experimental results show that the actions selected by MicroRAS can mitigate the faulty service well, affecting other services 44.3% less, and are completed at least 4 times faster than baseline recovery strategies (Section V).

¹ Sock-shop - https://microservices-demo.github.io/
² Google Cloud Engine - https://cloud.google.com/compute/

II. MOTIVATING EXAMPLE

In this section, we use a concrete example to illustrate the motivation of our proposed method. For the sake of simplicity, we focus on a small part of a large-scale microservice-based application shown in Figure 1. This application consists of five microservices (MS) deployed on two hosts, where MS 1, 3 and 5 are co-located on Host 1, and MS 1, 2 and 4 are on Host 2. Requests to MS 1 are load-balanced across two replicas. The interactions among microservices are henceforth referred to as the microservices call graph. Subsequently, slower response times of MS 3 are observed and classified as a performance anomaly. The operators identify the root cause as MS 3, by manual debugging or root cause analysis tools. Meanwhile, they obtain a list of feasible recovery actions based on their expert knowledge and previous experience, the latter commonly maintained as scripts or playbooks [15]. The recovery actions can be restart service, scale-out service, restart host, etc.

[Figure 1 diagram: a microservices call graph of Microservices 1 to 5, each exposing a Web API, deployed on Host 1 and Host 2 and receiving HTTP requests from users; the legend marks the anomalous microservice, the affected interactive microservices, and the affected co-located microservices; operators hold a list of recovery actions (scale-out service, migration service, restart host, ...).]
Figure 1: Motivating example: when scaling out Microservice 3 (MS 3) to recover a performance degradation, the consequences can be: 1) MS 3 is recovered, when resources in the cluster are sufficient and recovery time is short; 2) performance of MS 1 and 2 also degrades, when the recovery time is long and the anomaly propagates from MS 3 to MS 1 and 2; and 3) performance of MS 1 and 5 (or MS 1, 2, and 5) also degrades, when the resources of Host 1 (or Host 2), where the new service instance runs, are insufficient.

Let us take scale-out service as an example of a recovery action. When MS 3 scales out, the consequences of this action can be diverse. If the recovery time (the time it takes for the action to have an effect and the microservice to recover from the performance issue) of scale-out is very short and the available resources are sufficient, the performance issue would be mitigated. However, if the recovery time is too long, the performance anomaly could propagate to upstream microservices, i.e., MS 1 and 2 in the blue box in the call graph, increasing the response times of MS 1 and 2. Even worse, as Microservice 1 is a user-facing microservice, it could cause service disruption for end-users
directly. Besides, if the available resources on the host are insufficient, the scale-out action might affect the co-located microservices, i.e., MS 1 and 5 (purple in Figure 1), or even the entire application.

To mitigate performance issues in microservice systems without causing significant downtime, it is crucial to identify the appropriate action that can recover the anomalous services but also minimize the side effects on other services and the recovery time. In this paper, we propose a model to assess the positive and negative effects of potential recovery actions, and select the best possible action with a trade-off between the effects and the recovery time.

III. OVERVIEW OF MICRORAS

To adapt to the dynamics of microservice systems, where the recovery history is not always available, we propose MicroRAS, a recovery action selection method to automatically mitigate performance issues based on real-time contextual information of the system. The main idea of our method is to understand the anomalous context of the system with a system-state model, using the data collected in real time, and then to select a recovery action by predicting the effects on the recovered system state if an action were to be applied. Recovery time is an important factor in this decision, as unattended anomalies can propagate quickly, and it is a common objective in the literature [12], [16], [17]. MicroRAS takes both the action effects and recovery time as objectives and models the selection as an optimization problem.

Before the recovery process is initiated, detection of slower response times of microservices, localization of the faulty service, and a set of feasible recovery actions (a knowledge base) are required. The methods for anomaly detection and faulty service localization have been well addressed in the literature [18]–[22], and the action knowledge base can be configured based on properties of the available recovery actions as observed in non-anomalous operations.

[Figure 2 diagram: an SLO violation triggers faulty service localization; data collection from the microservice system feeds MicroRAS, which consists of the action knowledge base, the system-state model M, action effects estimation (benefit and risk), and recovery action selection; the selected action is executed on the microservice system.]
Figure 2: The workflow of recovery action selection.

Figure 2 shows the workflow of recovery action selection. Once the selection process is triggered, MicroRAS selects the recovery action with the following steps: (1) it gathers the run-time contextual information of the system and models the system state with an attributed graph that can also track the propagation of action effects across services and hosts; (2) it predicts the benefit and risk of each potential recovery action by estimating the future state the system would transit into by applying that action; (3) it aggregates the action benefit and risk into an effectiveness value and formulates the action selection as an optimization problem with effectiveness and recovery time as the objectives. After the action execution, the observed state change caused by the action and the recovery time are used to update the action knowledge base. We remark that the runtime complexity of MicroRAS is low, as the time complexity of the operations in our method is linear in the number of services, nodes, and potential actions. Thus, it scales well with the size of the microservice system.

IV. THE MICRORAS METHOD

In this section, we describe the three key modules in MicroRAS to select the most appropriate recovery action without historical failure data, namely system-state model (Section IV-A), action effects estimation (Section IV-B), and recovery action selection (Section IV-C).

A. System-state model

To estimate the potential effects of an action on the system, awareness of the system context in different states is necessary. We build a system-state model to capture the context, including the dependencies among components in the system and the states of their resources.

In a microservice system, services inter-communicate through lightweight protocols and are deployed across multiple hosts. An action applied to one service thus influences not only the service itself but also other services, either through invocation paths or their co-located hosts. Understanding service influence is similar to the anomaly propagation problem [23]. Therefore, we model the system state with an attributed graph, which can not only show the dependencies among services and hosts but also track the propagation of action effects. In addition, we define the state of services and hosts in the system as a set of variables $SV$. As MicroRAS aims at performance issues caused by resource bottlenecks, we store in $SV$ the resource usage $sv^{RU}$ and resource allocation $sv^{RA}$, in terms of CPU, memory, etc. The major notations are summarized in Table I. We define the system-state model as follows:

System-state model: A set of system states $M$, including normal $m^N$, abnormal $m^A$, and recovered $m^R$ state, is defined using an attributed graph $G$ together with a set of system state variables $SV$, including resource usage $sv^{RU}$ and resource allocation $sv^{RA}$. Notably, the normal and abnormal states are fully observable, whereas accurate prediction of $m^R$ is key to selecting the recovery action.

Once the response time between two services is slow and classified as a performance anomaly, MicroRAS constructs an attributed graph that holds the normal $m^N$ and abnormal
$m^A$ system states, using the method proposed in our previous work [22]. In addition, the state variables $SV$ are stored in the node attributes. The data for graph construction and state variables are gathered from the run-time monitoring of hosts and services, including use of a service mesh.

Table I: Notations used in MicroRAS.
Notation                  Description
$A$                       a list of $n_a$ feasible actions, $A = \{a_i\}_{i=1}^{n_a}$
$G$                       an attributed graph
$S$                       a set of $n_s$ services, $S = \{s_i\}_{i=1}^{n_s}$
$H$                       a set of $n_h$ hosts, $H = \{h_j\}_{j=1}^{n_h}$
$s_i$                     a service with $c$ pods, $s_i = \{s_{ij}\}_{j=1}^{c}$
$s_{ij}$                  a pod of service $s_i$; the pod runs on host $h_j$
$M$                       a set of system state models, $M = \{m^i\}_{i \in \{N,A,R\}}$; each $m^i$ is represented by a unique $\{G, SV\}$
$m^N$, $m^A$, $m^R$       normal, abnormal, and recovered system state
$SV$                      a set of state variables of $n_s$ services/pods and $n_h$ hosts, $SV = \{sv_k\}_{k=1}^{n_s+n_h}$
$sv_k$                    state variables of a pod/host, $sv_k = \{sv^{RU}, sv^{RA}\}$
$sv^{RU}$                 resource usage (1 vCPU, 1 GB memory, etc.)
$sv^{RA}$                 resource allocation such as host capacity, pod limits (2 vCPU, 2 GB memory)
$E$, $E_b$, $E_r$         action effectiveness and its components: benefit, risk
$UT(s_{ij})$, $UT(h_j)$   resource utilization of pod $s_{ij}$, host $h_j$ (%)
$T$                       recovery time of an action

The attributed graph in Figure 3(a) corresponds to our motivating example in Figure 1. In Figure 3(a), the solid lines indicate service invocations and the dashed lines show which host the service runs on. For each service and host, we collect resource usage and allocation in normal and abnormal states. In particular, for a service $s_i$ which runs with multiple replicas (pods), we collect the resource data for each of the $c$ pods $s_{ij}$. In Figure 3(a), $c = 2$ for service $s_1$, and 1 for the other services.

[Figure 3 diagram: (a) attributed graph of the abnormal state $m^A$ with services $s_1$ to $s_5$ and hosts $h_1$, $h_2$; the legend distinguishes service, host, anomalous service, affected host, service invocation, service deployment, deleted edge, and added edge; (b) system state after $a$ = $s_3$ migration; (c) system state after $a$ = $s_3$ scale-out; (b) and (c) show the recovered state $m^R$ for each potential recovery action.]
Figure 3: System states prediction.

B. Action effects estimation

Based on the observed normal and abnormal system states, we estimate the effects associated with each potential recovery action by predicting the future state that the system would transit into if the action were applied.

Action effects in MicroRAS are composed of positive and negative effects. We define the positive effect as the benefit $E_b$ that the identified anomalous service would achieve in terms of service performance and the negative effect as the risk $E_r$ that the affected hosts would have in terms of resource contention, thus affecting the services that run on the hosts. Due to the uncertainty and complexity of microservice systems, it is difficult to accurately estimate the performance of a service after recovery action execution. We therefore first estimate the resource utilization $UT$ of a service and next map the estimated $UT$ into fuzzy sets of service performance using a fuzzy inference system described in Section IV-C. Hence, in order to estimate the action effects, we need to predict the recovered state, including the attributed graph and system state variables after an action execution, in order to identify the potentially affected hosts and compute the resource utilization of the faulty service (action benefit $E_b$) and affected hosts (action risk $E_r$).

To predict the recovered state of an action, we need to know not only the current system state but also the properties of the action, as actions affect the system state in different ways. Although there is a wide range of recovery actions, we only consider how the action modifies the system-state model. For a recovery action $a_i$ in the action set $A$, we include the following properties:
• Topology: Some actions change the service location, thus changing the topology of the attributed graph.
• Resource usage: Some actions change the resource consumption of the service or host. For example, restart can recover anomalies caused by memory leaks, thus reducing resource usage.
• Resource allocation: Some actions change the capacity of hosts or the resource limits of pods. Examples include scale-up and scale-out actions that increase the allocated resources of a host or a service.

Note that these properties, including the recovery time used in Section IV-C, are stored in the action knowledge base in Figure 2. All properties are obtained in non-anomalous operations and can be updated after the action is executed.

Based on the abnormal state $m^A$ and the topology property of a recovery action, MicroRAS predicts the graph changes in the recovered state $m^R$. Figures 3(b) and (c) show the recovered states after migrating and scaling out anomalous service $s_3$, where service migration removes the link between $s_3$ and $h_1$ and adds a new link between $s_3$ and $h_2$, whereas service scale-out adds a new link between $s_3$ and $h_2$. In these two recovery actions, the affected hosts are $h_1$ and $h_2$.
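
The topology prediction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary layout, the function name `predict_topology`, and the rule that the affected hosts are all hosts the service runs on before or after the action are assumptions made for illustration. Only the deployment edges (which service runs on which host) are taken from the motivating example.

```python
from copy import deepcopy

# Deployment edges of the abnormal state in the motivating example:
# service -> set of hosts its pods run on (s1 has two pods, one per host).
ABNORMAL_DEPLOY = {
    "s1": {"h1", "h2"},
    "s2": {"h2"},
    "s3": {"h1"},
    "s4": {"h2"},
    "s5": {"h1"},
}


def predict_topology(deploy, service, action, dst, src=None):
    """Predict deployment edges in the recovered state and the affected hosts.

    Assumption (not stated in the paper): the affected hosts are every host
    the service runs on before or after the action.
    """
    recovered = deepcopy(deploy)          # leave the abnormal state untouched
    before = set(recovered[service])
    if action == "migration":
        recovered[service].discard(src)   # deleted edge (service, src)
        recovered[service].add(dst)       # added edge (service, dst)
    elif action == "scale-out":
        recovered[service].add(dst)       # added edge for the new pod
    else:
        raise ValueError(f"unsupported action: {action}")
    affected = before | recovered[service]
    return recovered, affected


migrated, hosts_m = predict_topology(ABNORMAL_DEPLOY, "s3", "migration", dst="h2", src="h1")
scaled, hosts_s = predict_topology(ABNORMAL_DEPLOY, "s3", "scale-out", dst="h2")
print(sorted(migrated["s3"]), sorted(hosts_m))
print(sorted(scaled["s3"]), sorted(hosts_s))
```

Copying the state before modifying it mirrors the idea that the abnormal state stays observable while each candidate action gets its own predicted recovered state.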

After the attributed graph is predicted, MicroRAS estimates the resource utilization of the faulty service and affected hosts. When an action applies to pod $s_{ij}$ of faulty service $s_i$ or affects host $h_j$, the resource utilization in recovered state $m^R$ is defined as the ratio between resource usage and allocation:

$$UT(h_j, m^R) = \frac{sv^{RU}(h_j, m^R)}{sv^{RA}(h_j, m^R)} \qquad (1)$$

$$UT(s_{ij}, m^R) = \frac{sv^{RU}(s_{ij}, m^R)}{sv^{RA}(s_{ij}, m^R)} \qquad (2)$$

Assuming an ideally equal load balancing between pods of a service, the utilization of service $s_i$ is defined as:

$$UT(s_i, m^R) = \frac{1}{c} \sum_{j=1}^{c} UT(s_{ij}, m^R), \qquad (3)$$

where $c$ is the total number of pods. As some actions may under-provision the resource, the estimated $UT$ can be over 1, thus its range is defined as $UT > 0$.

Based on the action properties and system state variables in normal and abnormal states, MicroRAS estimates the resource usage and resource allocation of pod $s_{ij}$ and host $h_j$ in the recovered state as follows.

The future resource usage of a pod in recovered state varies with the recovery action. If the action modifies the pod, it is the configured resource usage $\Delta sv^{RU}$; if the pod is newly created by the action, it is assigned the service's normal resource usage after load balancing. Otherwise, the pod keeps the abnormal resource usage. Taking Figure 3 as an example, the resource usage of pod $s_{32}$ in action migration in Figure 3(b) is $\Delta sv^{RU}$; the resource usage of pod $s_{31}$ in action scale-out in Figure 3(c) keeps the abnormal resource usage $sv^{RU}(s_{31}, m^A)$, and pod $s_{32}$ is assigned the load-balanced normal resource usage of service $s_3$, which is $sv^{RU}(s_3, m^N)/2$. We summarize the pod resource usage in Equation 4.

$$sv^{RU}(s_{ij}, m^R) = \begin{cases} \Delta sv^{RU}, & \text{if } s_{ij} \text{ is modified,} \\ sv^{RU}(s_{ij}, m^A), & \text{if } s_{ij} \text{ is not modified,} \\ \frac{sv^{RU}(s_i, m^N)}{c}, & \text{if } s_{ij} \text{ is a new pod.} \end{cases} \qquad (4)$$

Pod future resource allocation $sv^{RA}(s_{ij}, m^R)$ depends on the pod limits and the available resources of host $h_j$ it runs on. If the available resources in $h_j$ are sufficient (exceed the pod limits), $sv^{RA}(s_{ij}, m^R)$ is equal to the pod limits, and otherwise to the available resources in $h_j$. The pod limits can be modified by the action with $\Delta sv^{RA}$ or kept as in the abnormal state. The available resource of $h_j$ is the host resource allocation $sv^{RA}(h_j, m^R)$ with a consumption of $sv^{RU}(h_j, m^A)$, where $sv^{RA}(h_j, m^R)$ can be modified by the action or remain the same as $sv^{RA}(h_j, m^A)$:

$$sv^{RA}(h_j, m^R) = \begin{cases} \Delta sv^{RA}, & \text{if } h_j \text{ is modified,} \\ sv^{RA}(h_j, m^A), & \text{otherwise.} \end{cases} \qquad (5)$$

Once the future pod resource usage (Equation 4) and host resource allocation (Equation 5) are determined, we can estimate the future resource utilization of the affected host $h_j$ where the pod $s_{ij}$ runs, as shown in Equation 6. Host resource usage $sv^{RU}(h_j, m^A)$ increases by pod resource usage $sv^{RU}(s_{ij}, m^A)$ if pod $s_{ij}$ is migrated to $h_j$, or decreases by $sv^{RU}(s_{ij}, m^A)$ if $s_{ij}$ is migrated from $h_j$ to another host.

$$UT(h_j, m^R) = \frac{sv^{RU}(h_j, m^A) \pm sv^{RU}(s_{ij}, m^A)}{sv^{RA}(h_j, m^R)} \qquad (6)$$

Future resource utilization of $h_j$ is summarized in Equation 7. For each host that service $s_i$ runs on, we calculate the resource utilization and use the maximum utilization as the risk of the action.

$$UT(h_j, m^R) = \begin{cases} \frac{\Delta sv^{RU}}{sv^{RA}(h_j, m^R)}, & \text{if } h_j \text{ is modified,} \\ \text{as per Equation 6}, & \text{if } s_{ij} \text{ is migrated.} \end{cases} \qquad (7)$$

C. Recovery action selection

After we estimate the benefit and risk associated with each recovery action in terms of resource utilization, we map these into fuzzy sets of service performance and aggregate them into a single crisp effectiveness value through a fuzzy inference system [24], illustrated in Figure 4. Finally, we formulate the selection problem as an optimization problem and select an appropriate action with a trade-off between action effectiveness and recovery time.

[Figure 4 diagram: crisp benefit and risk inputs pass through fuzzification, an inference engine backed by a knowledge base of membership functions and rules, and defuzzification, which outputs a crisp effectiveness value.]
Figure 4: The structure of the fuzzy inference to combine action risk and benefit.

To calculate the effectiveness value, the fuzzy inference system uses membership functions to determine the degree to which its inputs belong to each of the relevant fuzzy sets. For this purpose, three overlapping fuzzy sets are created. For the action risk, host resource utilization values between 0 and 70% are in the Low range, values between 50% and 80% are in the Medium range, and values above 80% are in the High range.

A membership function defines how the input value is mapped to a membership degree between 0 and 1, where 0 means the input does not belong to the given fuzzy set, and 1 means the input completely belongs to it. Similar to [14], [25], the membership functions for the three fuzzy sets in inputs are respectively an R-function, a trapezoidal function, and an L-function, as shown in Figure 5(a). The membership function used in the output is three triangular
functions, as shown in Figure 5(b). Taking the action risk 0.7 as an example, according to its membership function in Figure 5(a), it has membership degree 0.2 in the Low set, 0.7 in the Medium set, and 0 in the High set. These values are used for the fuzzy rules in the fuzzy reasoning. The fuzzy rules for the inference system are defined based on the microservice system and its administrative policy. MicroRAS uses the fuzzy rules shown in Table II.

[Figure 5 plots: (a) input function, membership degree (0 to 1) of the low, medium, and high sets over risk (0.00 to 1.25); (b) output function, membership degree (0 to 1) of the low, medium, and high sets over effectiveness (0.0 to 1.0).]
Figure 5: Fuzzy membership functions.

Table II: Fuzzy rules for action effectiveness.
Benefit   Risk     Effectiveness
Low       High     Low
Low       Medium   Low
Low       Low      Medium
Medium    High     Low
Medium    Medium   Medium
Medium    Low      Medium
High      High     Low
High      Medium   Medium
High      Low      High

Based on the inputs, some fuzzy rules are fired and integrated. The decisions are made according to the aggregation of the fired fuzzy rules. The aggregated fired fuzzy rules output a single fuzzy set, which is the input of the defuzzification procedure. We use the centroid method for defuzzification to convert the fuzzy set into a crisp effectiveness value.

After obtaining the effectiveness values of the potential recovery actions, we formulate the action selection as an optimization problem, taking the effectiveness and action recovery time as the objectives. Recovery time of an action is measured as the time between the action initiation and completion in normal status, which initially is obtained by executing the recovery action in non-production environments. Once the action is executed to actually recover an anomaly, the recovery time is updated with the time between action initiation and action taking effect.

Given a set of potential recovery actions $A$, for each recovery action $a_i \in A$, its effectiveness is $E(a_i)$ and its recovery time is $T(a_i)$. For consistency purposes, we normalize the values of effectiveness and recovery time into the range (0, 1) through Min-Max normalization. The performance of action $a_i$ is quantified with a utility function $u(a_i)$:

$$u(a_i) = w_e E(a_i) - w_t T(a_i) \qquad (8)$$

where $w_e$ and $w_t$ are user-defined weights for action effectiveness and recovery time ($w_e + w_t = 1$, $0 < w_e, w_t < 1$). By setting the weights, users can prioritize the effectiveness and recovery time. We finally select the action that has the highest utility value among the recovery actions in $A$ according to Equation 8.

V. EXPERIMENTAL EVALUATIONS

In this section, we evaluate the performance of MicroRAS through experiments on a cloud testbed. The experimental setup, evaluation results, and comparisons are presented.

A. Experimental Setup

Testbed: We evaluate MicroRAS in a testbed hosted in Google Cloud Engine (GCE)², where we create a Kubernetes cluster, run a microservices benchmark named Sock-shop¹, and deploy data collection tools. In the cluster, there is one master node and four worker nodes; three of them are dedicated to microservices and the last one to data collection. In addition, one VM outside the cluster is used for the workload generator. The detailed configurations of hardware and software are shown in Table III.

Benchmark: Sock-shop¹ is a widely used microservices benchmark that simulates an e-commerce website that sells socks. It consists of 13 microservices, which are independent and intercommunicate using REST APIs. Seven out of the 13 microservices are for the main business goals, such as frontend and backend services. In the deployment, we limit the CPU resource to 1 vCPU and memory to 1 GB for these seven key microservices. For simplicity, we set the replication factor to one for each microservice. To measure the consequences of different actions in the same anomaly scenario, we taint each microservice to a specific cluster node and reset the environment for each action.

Workload Generator: We use Locust³ to simulate concurrent users in an application. In each case, 500 users are provisioned and in total about 600 queries are generated per second to Sock-shop in normal state. The queries to different services are selected to reflect real user behavior, e.g., more requests are sent to the entry points front-end and catalogue, and fewer to the other services.

Data Collection: We collect resource-relevant metrics (e.g., CPU usage, memory usage) at the container and node levels, using cAdvisor⁴ and node-exporter⁵, and collect response times for each microservice invocation with the Istio⁶ service mesh. We use Prometheus⁷ to pull all the metrics every 5 seconds and store them in a time-series database.

³ Locust - https://locust.io/
⁴ cAdvisor - https://github.com/google/cadvisor
⁵ Node-exporter - https://github.com/prometheus/node_exporter
⁶ Istio - https://istio.io/
⁷ Prometheus - https://prometheus.io/

Faults Injection: We evaluate our MicroRAS with two different types of anomalies (CPU hog and memory leak), different levels of anomalies (stressing services and hosts), and different contexts (with total cluster resources either
sufficient or insufficient to resolve the anomaly). To inject the CPU hog and memory leak, we use stress-ng⁸, a tool that loads and stresses a computer system, to exhaust CPU and memory resources continuously. To inject performance issues into microservices, we customize the existing Sock-shop Docker images by installing the fault injection tool. The injected microservice is catalogue and the injected host is the host that catalogue runs on. In the cluster-resources-sufficient scenario, we only inject anomalies into the service or host. In the cluster-resources-insufficient scenario, we also stress the other hosts. The details of the anomaly scenarios are shown in Table IV.

⁸ stress-ng - https://kernel.ubuntu.com/~cking/stress-ng/

Table III: Hardware and software configuration of testbed.

Hardware Configuration
Component        | Master node            | Worker node (x4)       | Workload generator
Operating System | Container-Optimized OS | Container-Optimized OS | 18.04.2 LTS
vCPU(s)          | 1                      | 4                      | 6
Memory (GB)      | 3.75                   | 15                     | 12

Software Version
Kubernetes | Istio | Prometheus | Node-exporter
1.14.1     | 1.1.5 | 2.3.1      | v0.15.2

Table IV: Details of anomaly scenarios.

Anomaly type         | Context                            | Host-level | Service-level
CPU Hog (vCPU * %)   | cluster sufficient                 | 4*95       | 3*95
CPU Hog (vCPU * %)   | cluster insufficient (other hosts) | 4*80       | 4*80
Memory Leak (vm * %) |                                    | 1*73       | 2*50

In each case, we run the microservices in normal status for 2 minutes with the workload generator running. We next introduce the anomaly and let it run for 3 minutes before MicroRAS is used to select and execute a recovery action. After the action is executed, we collect another 5 minutes of data to measure the action consequences. To increase generality, we repeat each anomaly scenario 3-5 times. This produces a total of 23 experimental cases. In each anomaly scenario, we consider 6 types of recovery actions: no action, restart pod (on the same host), migrate pod (shut down and start the pod again), scale-out pod, scale-up pod, and restart host. Figure 6 gives two examples of data collected after applying scale-out pod and restart pod when the host CPU resource is insufficient: pod scale-out (Figure 6(a)) has a positive effect, while pod restart (Figure 6(b)) does not recover the anomaly.

Figure 6: Two recovery actions for service performance issue caused by insufficient host resources: (a) scale-out pod recovers the issue, but (b) restart pod has no effect. Both panels plot Response Time (p95) against Time.

Evaluation Metrics: To quantify the performance of recovery action selection, we use the following metrics:

• Recovered Percentage (RP) quantifies the positive effects of a recovery action $a$ on the anomalous service $s_a$. It is defined as the percentage of service performance recovered from the abnormal state $\mathrm{perf}(s_a, m_A)$ to the recovered state $\mathrm{perf}(s_a, m_R \mid a)$ with action $a$, relative to the abnormal deviation from the normal state $\mathrm{perf}(s_a, m_N)$:

$$RP(a) = \frac{\mathrm{perf}(s_a, m_A) - \mathrm{perf}(s_a, m_R \mid a)}{\mathrm{perf}(s_a, m_A) - \mathrm{perf}(s_a, m_N)} \tag{9}$$

• Affected Percentage (AP) quantifies the negative effects of a recovery action on the affected services $\{s_i \mid i = 1, 2, \ldots, N\}$, where $N$ is the number of affected services. AP is defined as the mean percentage of decreased performance of the affected services from the abnormal state $\mathrm{perf}(s_i, m_A)$ to the recovered state $\mathrm{perf}(s_i, m_R \mid a)$ with action $a$, relative to the normal state $\mathrm{perf}(s_i, m_N)$:

$$AP(a) = \frac{1}{N} \sum_{s_i} \frac{\mathrm{perf}(s_i, m_R \mid a) - \mathrm{perf}(s_i, m_A)}{\mathrm{perf}(s_i, m_N)} \tag{10}$$

• Recovery Time (RT) quantifies the time from initiating the mitigation action until the performance of the anomalous service and any affected services has stabilized.

B. Experimental Results

In our experiments, under normal workload, the 50th percentile (p50) of service response times is around 10 ms in normal status and ranges from 35 ms to 300 ms in abnormal status, depending on the anomaly type; the 95th percentile (p95) of response times is around 40 ms in normal status and ranges from 160 ms to 2000 ms in abnormal status. Figure 7 shows the results of our proposed recovery action selection method for mitigating different anomaly scenarios. For each anomaly scenario, the bar charts show the mean recovered percentage (RP) and affected percentage (AP) in terms of p95 and p50 of response times, and the dashed line shows the mean recovery time.

Figure 7: MicroRAS performance in terms of p95 and p50. Bars: Recovered Percentage (p95, p50) and Affected Percentage (p95, p50); dashed line: Recovery Time (s); x-axis: Anomaly Scenarios (host_cpu, service_cpu, host_cpu_cluster, service_cpu_cluster, host_memory, service_memory).

From the results, we can observe that MicroRAS can on average mitigate all the injected performance issues within 15 seconds. Furthermore, after applying the MicroRAS
selected recovery actions, the anomalous service recovers at least 0.91 of its degraded performance across all anomaly scenarios, and the degradation in the recovered state is less than 18% from normal performance, except in the host_cpu_cluster anomaly scenario. Performance in the host_cpu_cluster scenario is lower than in the others because the resources of the cluster are insufficient; any recovery action is therefore bound by the overall resource shortage and negatively affects other services.

We aggregate RP p95 and AP p95 according to the type of anomaly scenario in Table V. We can see that MicroRAS overall recovers 0.947 of the anomalous service's degraded performance and mitigates the issues in 11 seconds on average. Performance for service-level anomalies is better than for host-level anomalies. This is because service-level issues can be recovered entirely with the provided actions, whereas host-level issues can only be mitigated by most of the provided actions; further actions such as cluster scale-out are required to fix them entirely. For the same reasons, MicroRAS performs better in cluster-sufficient than in cluster-insufficient cases.

Table V: Performance of MicroRAS in different types of anomaly scenarios.

Fault Scenario       | CPU Hog | Memory Leak | Host-level | Service-level | Cluster Sufficient | Cluster Insufficient | Overall
Recovered Percentage | 1.002   | 0.99        | 0.91       | 1.007         | 1.002              | 0.883                | 0.947
Affected Percentage  | 0.137   | 0.155       | 0.178      | 0.12          | 0.137              | 0.156                | 0.152
Recovery Time (s)    | 12.175  | 9.333       | 7.917      | 13.867        | 12.175             | 11.167               | 10.913

C. Comparisons

To evaluate the performance of MicroRAS further, we compare it with three recovery strategies that require no historical data and are commonly used as baselines in the literature.

• No Action: Here, the operation team just passively observes the system without taking any action. This strategy shows the potential damage of the injected performance issues when left unattended.
• Random Selection: This strategy might be adopted when the operation team cannot determine the correct recovery action precisely but urgently needs to fix the problem. The operation team randomly selects one action from the candidates and applies it [17].
• Restart: This is a very popular recovery strategy, which can be applied at various levels. In a production environment, a significant fraction of failures can be cured by restarts [26]. We perform restarts at the host or pod level to resolve host- and service-level issues.

We compare the performance of each recovery strategy on different anomaly scenarios in terms of RP, AP, and recovery time in Figure 8. We observe that the performance issues do not recover, or even deteriorate and affect other services, if no action is taken. All the actions selected by the other three strategies improve the performance of anomalous services. Notably, MicroRAS selects the best action in all scenarios but one, where it is only slightly beaten by Restart for memory leaks at the service level (Figure 8(a)), and it has a shorter recovery time (Figure 8(c)) than the others.

Figure 8: Performance summary for different strategies, performance metrics, and anomaly scenarios. Panels: (a) Recovered Percentage, (b) Affected Percentage, (c) Recovery Time (s); scenarios: host_cpu, service_cpu, host_cpu_cluster, service_cpu_cluster, host_memory, service_memory; strategies: No Action, Random Selection, Restart, MicroRAS.

Both Restart and MicroRAS perform well in terms of RP in all types of anomaly scenarios. However, Restart has a higher risk of affecting other services and longer recovery times when the anomaly exists at the host level. Figure 9 shows the AP and the number of affected services in each anomaly case. The solid lines show the results of MicroRAS, and the dashed lines show the results of Restart. We observe that MicroRAS and Restart have a similar AP and number of affected services for service-level anomalies. However, Restart has a higher AP and affects more services for host-level anomalies. This is because, compared to service-level operations, restarting a host commonly takes longer and restarts all the services running on the host, which introduces fluctuations and uncertainty in the system.

Finally, we compare the overall performance of all strategies for all types of anomaly scenarios. Table VI shows the performance, in terms of RP, AP, and recovery time (RT), for
the four recovery strategies. We can observe that MicroRAS outperforms the other strategies overall. In particular, MicroRAS achieves a recovered percentage of 94.7%, affects other services at least 44.3% less, and is completed at least 4 times faster than other strategies.

Table VI: Overall performance of different strategies.

Metrics                    | RP    | AP    | RT (s)
No Action                  | 0.037 | 0.138 | -
Random Selection (RS)      | 0.646 | 0.273 | 40.565
Restart                    | 0.897 | 0.295 | 62.652
MicroRAS                   | 0.947 | 0.152 | 10.913
Improvement to RS (%)      | 46.6  | 44.3  | 73.1
Improvement to Restart (%) | 5.5   | 48.5  | 82.6

Figure 9: Affected percentage and number comparison. Series: MicroRAS and Restart, each with Affected Percentage (p95) and Affected Number; x-axis: anomaly case number, grouped by scenario (host_cpu, host_cpu_cluster, host_memory, service_cpu, service_memory, service_cpu_cluster).

VI. RELATED WORK

A wide variety of techniques and approaches have been proposed to mitigate problems in cloud, networks, and distributed systems [27], [28]. Some of them work on a specific recovery approach, such as reboot [29], check-pointing [30], self-adaptation [31], placement [16], etc. Others focus on general recovery strategies. We herein review the related work in general automatic recovery from the following aspects.

Rule-based approaches transfer expert knowledge into IF-THEN rules and use policies to match the rules for reacting to faults [10]. This kind of approach is easy to develop. However, formalizing the rules requires a lot of expertise and is a difficult and time-consuming task. In addition, frequent human intervention is required to revise the rules and keep them up-to-date in dynamic microservice systems.

Case-based approaches take previous failures as cases and match against them when a new fault occurs [11], [32]. This kind of approach can effectively avoid repeating past mistakes and can adapt to changes in the system. However, similar problems in microservices commonly give rise to different symptoms due to technology heterogeneity and frequent updates. Thus, case matching is difficult and error-prone to apply. Furthermore, the number of exposed metrics in microservices is very high, so computing the similarity between these metrics would cause significant overhead and delay.

Learning-based approaches use reinforcement learning or deep learning to generate recovery policies [12] or commands [33] without human intervention. This kind of approach views the system as a black box and can adapt to changes in microservices. However, it requires a large set of historical failure data to train the model, which is difficult to obtain in microservice systems. Our method can complement this approach to help recover newly updated services. Once failure data is available, a learning-based method can provide another recommendation for the action.

Model-based approaches model different aspects of a healing process, such as the properties of the fault [34] or the properties of the actions [35], or use theoretical techniques such as Markov decision theory [36]. Similar to our method, consequences of recovery actions are also considered in [13], [14], [37]. M. Fu et al. [37] define the impact of an action on service response times as caused by the increase in requests introduced by different recovery patterns. However, the impact of a performance issue in microservice systems, such as hardware failure, software bugs, etc., does not manifest in the number of requests, so their impact model is not suitable for our problem. Others [13], [14] define their models based on probabilistic parameters learned from recovery history (e.g., the prior probability of the system being in a stable state after executing an action). However, these probabilistic parameters are difficult to obtain in frequently updated microservice systems. Our MicroRAS system estimates the action consequences based on a system-state model which is built solely on data collected in real-time and action properties defined in normal status.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose a method named MicroRAS to select the best possible recovery action based on an action effectiveness assessment model in order to mitigate performance degradation in microservice systems. We estimate the positive and negative effects of each action and select the action with the best trade-off between action effects and recovery time based on data collected in real-time and knowledge obtained in normal status, without use of historical failure data. The estimation of the effects utilizes a system-state model, which is represented by an attributed graph used to track the propagation of action effects across services and hosts. The experimental results show that MicroRAS can effectively recover the anomalous services by 94.7% of their degraded performance while affecting the performance of other services at least 44.3% less. Finally, the mitigation is completed at least 4 times faster than baseline recovery strategies.

As our method considers a single-step ahead mitigation only, some performance issues cannot recover completely. In the future, we plan to investigate multi-action recovery strategies and how to schedule the recovery actions to minimize outage time. In addition, we plan to extend
our method to a hybrid one with a learning process to be used when historical failure data is available.

ACKNOWLEDGMENT

This work is part of the FogGuru project which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765452. The information and views set out in this publication are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.

REFERENCES

[1] D. Gannon et al., “Cloud-native applications,” IEEE Cloud Computing, vol. 4, no. 5, pp. 16–21, 2017.
[2] A. Balalaie et al., “Microservices architecture enables devops: Migration to a cloud-native architecture,” IEEE Software, vol. 33, no. 3, pp. 42–52, 2016.
[3] P. Jamshidi et al., “Microservices: The journey so far and challenges ahead,” IEEE Software, vol. 35, no. 3, pp. 24–35, 2018.
[4] S. Haselböck et al., “Decision guidance models for microservices: Service discovery and fault tolerance,” in ECBS ’17, 2017.
[5] G. Toffetti et al., “Self-managing cloud-native applications: Design, implementation, and experience,” Future Generation Computer Systems, vol. 72, pp. 165–179, 2017.
[6] V. Heorhiadi et al., “Gremlin: Systematic resilience testing of microservices,” in ICDCS, 2016, pp. 57–66.
[7] H. S. Gunawi et al., “Fate and destini: A framework for cloud recovery testing,” in NSDI, 2011, pp. 238–252.
[8] Why Netflix, Amazon, and Apple Care About Microservices, (accessed: 30.05.2020).
[9] J. Thalheim et al., “Sieve: Actionable insights from monitored metrics in distributed systems,” in Middleware ’17, 2017, pp. 14–27.
[10] H. Mfula et al., “Self-healing cloud services in private multi-clouds,” in HPCS, 2018, pp. 165–170.
[11] S. Montani et al., “Case-based reasoning for autonomous service failure diagnosis and remediation in software systems,” in ECCBR, 2006, pp. 489–503.
[12] Q. Zhu et al., “A reinforcement learning approach to automatic error recovery,” in DSN, 2007, pp. 729–738.
[13] S. Ossenbühl et al., “Towards automated incident handling: How to select an appropriate response against a network-based attack?” in IMF, 2015, pp. 51–67.
[14] J. Shetty et al., “Proactive cloud service assurance framework for fault remediation in cloud environment,” IJECE, vol. 10, no. 1, p. 987, 2020.
[15] B. Beyer, Site reliability engineering: How Google runs production systems. 2016.
[16] F. Díaz-Sánchez, S. Al Zahr, and M. Gagnaire, “An exact placement approach for optimizing cost and recovery time under faulty multi-cloud environments,” in 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, vol. 2, 2013, pp. 138–143.
[17] S. Huang et al., “Differentiated failure remediation with action selection for resilient computing,” in PRDC, 2015, pp. 199–208.
[18] M. Pahl and F. Aubet, “All eyes on you: Distributed multi-dimensional iot microservice anomaly detection,” in 2018 14th International Conference on Network and Service Management (CNSM), 2018, pp. 72–80.
[19] A. Samir and C. Pahl, “Dla: Detecting and localizing anomalies in containerized microservice architectures using markov models,” in 2019 7th International Conference on Future Internet of Things and Cloud (FiCloud), 2019, pp. 205–213.
[20] M. Ma et al., “Automap: Diagnose your microservice-based web applications automatically,” in Proceedings of The Web Conference 2020, New York, NY, USA: Association for Computing Machinery, 2020, pp. 246–258.
[21] O. Ibidunmoye, F. Hernández-Rodriguez, and E. Elmroth, “Performance anomaly detection and bottleneck identification,” ACM Comput. Surv., vol. 48, no. 1, 2015.
[22] L. Wu et al., “MicroRCA: Root Cause Localization of Performance Issues in Microservices,” in NOMS, 2020.
[23] J. Weng et al., “Root cause analysis of anomalies of multi-tier services in public clouds,” IEEE/ACM Transactions on Networking, vol. 26, no. 4, pp. 1646–1659, 2018.
[24] S. Frey et al., “Cloud qos scaling by fuzzy logic,” in 2014 IEEE International Conference on Cloud Engineering, 2014, pp. 343–348.
[25] H. Arabnejad et al., “A comparison of reinforcement learning techniques for fuzzy cloud auto-scaling,” in CCGRID, 2017, pp. 64–73.
[26] C. Wang et al., “Performance troubleshooting in data centers: An annotated bibliography?” SIGOPS Oper. Syst. Rev., vol. 47, no. 3, pp. 50–62, 2013.
[27] P. Garraghan, R. Yang, Z. Wen, A. Romanovsky, J. Xu, R. Buyya, and R. Ranjan, “Emergent failures: Rethinking cloud reliability at scale,” IEEE Cloud Computing, vol. 5, no. 5, pp. 12–21, 2018.
[28] I. Brandic, “Towards self-manageable cloud services,” in 2009 33rd Annual IEEE International Computer Software and Applications Conference, vol. 2, 2009, pp. 128–133.
[29] G. Candea et al., “Microreboot — a technique for cheap recovery,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design Implementation - Volume 6, ser. OSDI’04, USA, 2004, p. 3.
[30] R. Koo and S. Toueg, “Checkpointing and rollback-recovery for distributed systems,” IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23–31, 1987.
[31] V. Nallur and R. Bahsoon, “A decentralized self-adaptation mechanism for service-based applications in the cloud,” IEEE Transactions on Software Engineering, vol. 39, no. 5, pp. 591–612, 2013.
[32] G. Li et al., “A self-healing framework for qos-aware web service composition via case-based reasoning,” in Web Technologies and Applications, 2013, pp. 654–661.
[33] H. Ikeuchi et al., “Recovery command generation towards automatic recovery in ict systems by seq2seq learning,” in NOMS, 2020, pp. 1–6.
[34] Y. Dai et al., “Self-healing and hybrid diagnosis in cloud computing,” in Cloud Computing, 2009, pp. 45–56.
[35] A. Samir and C. Pahl, “Self-adaptive healing for containerized cluster architectures with hidden markov models,” in FMEC, 2019, pp. 68–73.
[36] K. R. Joshi, M. A. Hiltunen, W. H. Sanders, and R. D. Schlichting, “Automatic model-driven recovery in distributed systems,” in SRDS’05, 2005, pp. 25–36.
[37] M. Fu et al., “Runtime recovery actions selection for sporadic operations on cloud,” in ASWEC, 2015, pp. 185–194.
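
The selection rule in Eq. (8) and the evaluation metrics in Eqs. (9) and (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`min_max`, `select_action`, `recovered_percentage`, `affected_percentage`) and all numeric inputs are hypothetical, effectiveness values are taken as given rather than derived from the fuzzy inference step, `perf` is assumed to be a response time in milliseconds, and mapping identical values to 0.0 during normalization is a choice made here for the degenerate case.

```python
def min_max(values):
    """Min-Max normalize a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for the degenerate case: no spread, so map all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_action(actions, w_e=0.5):
    """Return (best_action, utilities) using u(a) = w_e*E(a) - w_t*T(a), Eq. (8).

    `actions` maps an action name to (effectiveness, recovery_time_seconds).
    The weights satisfy w_e + w_t = 1 with 0 < w_e, w_t < 1.
    """
    if not 0.0 < w_e < 1.0:
        raise ValueError("w_e must lie strictly between 0 and 1")
    w_t = 1.0 - w_e
    names = list(actions)
    eff = min_max([actions[n][0] for n in names])
    rec = min_max([actions[n][1] for n in names])
    utilities = {n: w_e * e - w_t * t for n, e, t in zip(names, eff, rec)}
    best = max(utilities, key=utilities.get)
    return best, utilities


def recovered_percentage(perf_abnormal, perf_recovered, perf_normal):
    """RP, Eq. (9): recovered share of the abnormal deviation from normal."""
    return (perf_abnormal - perf_recovered) / (perf_abnormal - perf_normal)


def affected_percentage(services):
    """AP, Eq. (10): mean of (recovered - abnormal) / normal over services.

    `services` is a list of (perf_abnormal, perf_recovered, perf_normal).
    """
    return sum((r - a) / n for a, r, n in services) / len(services)


if __name__ == "__main__":
    # Hypothetical candidates: (effectiveness, recovery time in seconds).
    candidates = {
        "scale_out_pod": (0.9, 12.0),
        "restart_pod": (0.4, 8.0),
        "restart_host": (0.8, 60.0),
    }
    best, utilities = select_action(candidates, w_e=0.5)
    print(best, utilities)
    # Hypothetical p95 response times in ms: abnormal, recovered, normal.
    print(recovered_percentage(400.0, 76.0, 40.0))
    print(affected_percentage([(40.0, 44.0, 40.0), (50.0, 50.0, 50.0)]))
```

With these made-up inputs, restart host is effective but is penalized by its long recovery time, so scale-out pod obtains the highest utility. Shifting `w_e` toward 1 puts more weight on effectiveness and less on recovery time.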
