
Smilax: Statistical Machine Learning
Autoscaler Agent for Apache FLINK

Panagiotis Giannakopoulos, Euripides G.M. Petrakis

Abstract Smilax is a statistical machine learning autoscaler agent for applications running on Apache Flink. The Smilax agent acts proactively by predicting the forthcoming workload in order to adjust the allocation of workers to the actual needs of an application ahead of time. During an online training phase, Smilax builds a model which maps the performance of the application to the minimum number of servers. During the work (optimal) phase, Smilax maintains the performance of the application within acceptable limits (i.e. defined in the form of SLAs) while minimizing the utilization of resources. The effectiveness of Smilax is assessed experimentally by running a data-intensive fraud detection application.
1 Introduction
The relationships between service providers and customers are important for achieving a high level of satisfaction and trust. In cloud computing, the service provider - customer relationship is not arbitrary but is shaped by a Service Level Agreement (SLA). The SLA specifies obligations and penalties in case of non-compliance with the agreement. There is a direct relationship between the quality of service and the amount of computing resources allocated to the client's application. Typically, the operation of the provider is assisted by software agents which monitor changes in performance and regulate the allocation of computing resources to the application either reactively (as the changes occur) or proactively (i.e. ahead of time). Reactive scaling policies are easy to implement but leave room for both over- and under-utilization of resources [8, 1]. Proactive scaling policies are capable of predicting possible SLA violations and making optimal resource allocation decisions but come with a complex implementation and require a priori knowledge of the workload for model training. The training of the model can be carried out either offline or online.
Stream processing enables a variety of new applications characterized by high data generation rates and low-latency responses. Apache

Panagiotis Giannakopoulos
School of Electrical and Computer Engineering, Technical University of Crete (TUC), Chania, Greece, e-mail: pgiannakopoulos1@isc.tuc.gr
Euripides G.M. Petrakis
School of Electrical and Computer Engineering, Technical University of Crete (TUC), Chania, Greece, e-mail: petrakis@intelligence.tuc.gr


Flink1 is a distributed processing engine for stateful computations over unbounded and bounded data streams in mission-critical applications such as fraud detection (i.e. detection of suspicious transactions), anomaly detection (i.e. detection of rare or suspicious events), rule-based alerting (i.e. identification of data which satisfy one or more rules) and many more. However, if streaming data is generated at varying speeds, Apache Flink cannot automatically and optimally adjust the utilization of its computing resources. Existing resource allocation policies for Apache Flink are all reactive or the resource scaling decisions resort to human operators who monitor the performance of the system.
Smilax is an autonomous agent which monitors and maintains the performance of Apache Flink within acceptable limits (i.e. defined in the form of SLAs) while minimizing the utilization of computing resources. During a training phase, a reactive scaler collects workload and performance information and adjusts (scales up or down) the number of servers whenever the performance limit (i.e. the SLA) is violated. At the same time, the agent explores the performance and builds a statistical machine learning model which registers the optimal mapping between workload, performance and number of servers. Model fitting takes place at run time from production data. As soon as the model is deemed stable, the agent switches to optimal mode. The model is then used for predicting the performance of the application and for making scaling decisions proactively (i.e. before SLA violations occur). The stability of the model is constantly monitored and, as soon as a model change is detected (e.g. the workload becomes unpredictable), the agent switches back to reactive mode to start collecting new data in order to build a new performance model.
Apache Flink and Smilax are deployed on Docker Swarm2, a low-footprint virtualization platform based on Docker containerization. Smilax is evaluated on a fraud detection application which runs a 6.5-hour workload that produces up to four thousand records per second. The experimental results demonstrate that Smilax makes accurate predictions and results in fewer SLA violations and better overall utilization of computing resources compared to its reactive counterpart.
Related work on autoscaling is discussed in Sec. 2. An introduction to
Flink and how an autoscaler for Flink can be designed is discussed in Sec. 3.
The Smilax solution is presented in Sec. 4, followed by experimental results in Sec. 5. Conclusions, system extensions and issues for future research are discussed in Sec. 6.
2 Related Work
In regards to proactive scaling for stream processing, the ideas are not yet mature and have not been incorporated into commercial real-time analytics platforms.

1 https://flink.apache.org/
2 https://docs.docker.com/engine/swarm/

Initial ideas for a statistical machine learning model for the scaling of resources are discussed in [4]. Arabnejad et al. [2] compare
two autoscaling approaches based on Reinforcement Learning (RL), SARSA and Q-learning. Their autoscaler dynamically resizes Web applications in order to meet quality of service requirements. Bibal Benifa and D. Dejey [3] propose the RLPAS algorithm, which applies RL using a neural network in order to reduce the time for convergence to an optimal policy.
Rossi, Nardelli and Cardellini [5] propose RL solutions for controlling the
horizontal and vertical elasticity of container-based applications in order to
cope with varying workloads. These autoscalers do not adapt their scaling
model to changes of the application’s behavior at run-time.
The following solutions are all reactive: DS2 [7] enables automatic scaling of Apache Flink applications. A controller assesses the running application at operator level in order to detect possible bottlenecks in the data-flow (i.e. operators that slow down the whole application). In contrast to Smilax, which monitors and scales applications at job level (i.e. multiple operators or tasks may execute in a job), DS2 is designed to adjust the parallelism of each operator separately in order to maintain high throughput. Autopilot3 is a proprietary solution for the Ververica platform which is designed to drive multiple high-throughput, low-latency stream processing applications on Apache Flink. There are also solutions which have been incorporated into the real-time analytics platforms of commercial cloud providers: Apache Heron4 is the stream processing engine of Twitter; Dataflow5 is a serverless autoscaling solution that supports automatic partitioning and re-balancing of input data streams to servers in the Google Cloud Platform.
3 Smilax Ecosystem
Apache Flink provides an extensive toolbox of operators for implementing transformations on data streams (e.g. filtering, updating state, aggregating). The data-flows or jobs (i.e. operations chained together) form directed
graphs (Job Graphs), that start with one or more sources and end at one
or more sinks. The Flink cluster consists of a Job Manager and a number of Task Managers (workers). The Job Manager controls the operation of the
entire cluster: schedules the workers, reacts to finished or failed tasks, load
balances the workload among Task Managers, coordinates checkpoints and
recovery from failures. The Task Managers are the machines (servers) which
execute the tasks of a workflow. A task represents a chain of one or more
operators that can be executed in a single thread or server. A task can be
executed in parallel (on separate Task Managers). Each parallel instance of
a task is a subtask. The number of subtasks running in parallel is the parallelism of that particular task.

3 https://docs.ververica.com/user_guide/application_operations/autopilot.html
4 https://incubator.apache.org/clutch/heron.html
5 https://cloud.google.com/dataflow

The number of allocated Task Managers varies over time and is regulated by the Smilax agent. The agent monitors the operation of all tasks and, depending on workload and performance, decides to change the parallelism of a task (i.e. scale up or down). Flink is particularly flexible, but making the most out of it can become a challenging task that requires in-depth understanding of its underlying architecture (especially in the case of multiple workflows with many operators executing in multiple layers). For simplicity of the discussion, the following assumptions apply in Smilax: each Task Manager (worker) runs the entire Job Graph (workflow), which means that the number of allocated workers is identical to the parallelism of the job. Rescaling actions (e.g. adding or removing a worker) modify the parallelism of all operators of a subtask at the same time. Changing the parallelism of individual operators would require that Smilax monitor each operator separately and take scaling decisions based on the performance of each individual operator, as in the case of DS2 [7]. If more than one workflow runs on the same Flink cluster, taking optimal scaling decisions for each individual workflow requires monitoring the performance of each workflow separately (i.e. a separate model must be built for each workflow).
An application receives data records (or events) from streaming sources such as Apache Kafka6. Kafka queues data from application sources like databases, sensors, mobile devices, cloud services etc. Kafka organizes data streams in topics and reads them in parallel (i.e. events are appended to more than one partition defined for that topic). The incoming workload is monitored by inspecting the Kafka topics which are the data sources of the running job. The workload represents the number of records per second the system receives. Kafka queues are empty if the system consumes (processes) the received data at a rate higher than the production rate; otherwise, the data remains in the queue (slow records). The average length of the Kafka queues is an indicator of whether the system can keep up with the data production rate. The Prometheus service7 is responsible for monitoring the running applications. Prometheus retrieves the Kafka metrics by querying the HTTP endpoint of JMX8 (i.e. Prometheus cannot connect to Kafka directly). Apache Zookeeper9 is a coordination service for the Kafka queues.
The percentage of slow records is computed as queue-length/workload. In Smilax, quality of service is represented by an SLA metric which is defined as the percentage of slow records per second that a client (e.g. application owner or user) can accept. In this work, the threshold is 90% (i.e. less than 10% of the records can remain in the queue or, equivalently, more than 90% of the records are processed instantly). Smilax collects information from Kafka

6 https://kafka.apache.org
7 https://prometheus.io
8 https://docs.oracle.com/en/java/javase/15/jmx/
9 https://zookeeper.apache.org

queues and adjusts (scales up or down) the number of workers as soon as the SLA is violated.
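The SLA check described above (slow-record percentage computed as queue-length/workload, with a 10% tolerance) can be sketched as follows; the function names and exact inputs are illustrative assumptions, not part of the Smilax code:

```python
def slow_record_percentage(queue_length: float, workload: float) -> float:
    """Fraction of incoming records left waiting in the Kafka queues.

    queue_length: average number of records pending in the topic's partitions
    workload:     incoming records per second
    """
    if workload <= 0:
        return 0.0
    return queue_length / workload

def sla_satisfied(queue_length: float, workload: float, threshold: float = 0.10) -> bool:
    # The SLA holds when less than 10% of records per second remain queued
    # (equivalently, more than 90% are processed instantly).
    return slow_record_percentage(queue_length, workload) < threshold
```

For example, a queue of 50 records against a workload of 1000 records/sec gives a 5% slow-record percentage, which satisfies the 10% tolerance.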

Fig. 1: Flink deployment with Kafka and Smilax autoscaler.


Fig. 1 shows an abstract architecture of the Smilax ecosystem. Each shape with a dotted line represents a Virtual Machine (VM) with a Docker environment installed. Boxes with solid lines within a VM represent containers running the specified services. The entire cluster runs on Docker Swarm using Docker images of Apache Flink10. The Prometheus service runs in a separate container and is responsible for monitoring the system in terms of allocated resources (i.e. number of servers), workload and performance. The containers within this Docker environment are configured to run either as the Job Manager or as Task Managers. The cluster is deployed as a Flink Session Cluster so that the lifecycle of the running job is independent from the lifetime of the cluster. In any other case (i.e. application mode or per-job mode), the cluster would shut down prior to rescaling.
Initially, all services are constrained to one single VM. New servers can be added into the same VM as well. Once the capacity of the VM is exhausted, new servers will be added in a new VM (i.e. a Swarm with two nodes). Deployment of new servers is a three-layer process. Fig. 2 illustrates the three rescaling layers. Scaling commands address the CLI of the Flink layer (top-most layer). However, the option to scale Flink using the CLI11 is temporarily disabled in version 1.11 (and older) but is expected to be supported in a future version. The only way to change the parallelism of a job is to first stop the job (taking a savepoint) and then re-run the job with the new parallelism. As a result, incoming records remain in Kafka until the job recovers. In the meantime, Smilax discards all performance metrics and no scaling action takes place.
Scaling actions issued at the Flink layer (i.e. for allocating new server nodes in containers) also address the Docker layer (Docker CLI). A number of Task

10 https://flink.apache.org/news/2020/08/20/flink-docker.html
11 https://issues.apache.org/jira/browse/FLINK-12312

Fig. 2: Resource adjustment layers.


Managers can be in hot-standby (e.g. nodes de-allocated recently due to a scale-down). If the required resources are there, the number of servers in hot-standby is reduced by the number of requested servers and these servers join the cluster again. If not enough servers are in hot-standby, additional servers are allocated by addressing the Docker layer, provided that the existing VM has the capacity to accommodate the new servers. Otherwise, the request is propagated to the VM infrastructure layer (i.e. Openstack in this work) which manages the allocation and de-allocation of VMs [1]. For scaling down, if the job has reached the minimum parallelism allowed, the operation is aborted; otherwise the operation is propagated from the Flink to the Docker layer. The de-allocated containers are put in hot-standby for future use. If their number exceeds a pre-defined limit, they are removed. Notice that Docker has no knowledge of whether a container is in use or idle. Information on which containers are active (i.e. receiving records) is obtained from Flink. Finally, if a VM is empty, it is removed as well.
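The three-layer scale-up path (hot-standby first, then new containers on the existing VM, then a new VM) can be sketched as a simple resolution function; the slot-counting model and all names here are illustrative assumptions rather than the actual Smilax implementation:

```python
def scale_up(requested: int, hot_standby: int, vm_free_slots: int):
    """Resolve a scale-up request through the three rescaling layers.

    requested:     number of Task Managers to add
    hot_standby:   idle Task Managers kept around after earlier scale-downs
    vm_free_slots: container slots still available on the existing VM(s)

    Returns (from_standby, from_docker, from_new_vm): how many workers are
    satisfied at the Flink, Docker and VM infrastructure layer respectively.
    """
    from_standby = min(requested, hot_standby)    # Flink layer: reuse hot-standby workers
    remaining = requested - from_standby
    from_docker = min(remaining, vm_free_slots)   # Docker layer: new containers on existing VMs
    from_new_vm = remaining - from_docker         # infrastructure layer: allocate a new VM
    return from_standby, from_docker, from_new_vm
```

A request that exceeds both the hot-standby pool and the free slots spills over to the infrastructure layer, mirroring how the request is propagated to Openstack in the text.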
4 Smilax
The autoscaler implements a controller which constantly switches between two states, referred to as the training (or exploration) phase and the work (or optimal) phase respectively [6].
4.1 Exploration Phase
Smilax explores the behavior of the application under varying workloads and different parallelism (i.e. number of servers). The controller takes scaling decisions reactively (i.e. as soon as the SLA is violated).
Let Wmax be the capacity of a single Task Manager (server). The capacity represents the maximum incoming record rate which a single server can process without violating the SLA. If st and wt are the number of servers and the workload respectively when the SLA is violated for the first time, then Wmax = wt/st. To obtain a better approximation, the capacity is updated each time the SLA is violated and the average Wmax value is computed. The reactive scaler checks periodically (i.e. every 15 seconds) whether a scale-up or a scale-down decision must be taken. If S and W are the number of servers and the workload when the SLA is violated, then the system needs S′ = W/Wmax servers and S′ − S servers must be added (assuming that the capacity of each server is the same). Conversely, if the number of servers is more than (W/Wmax)/0.9 (i.e. there are at least 10% more servers than needed), the cluster is over-provisioned and the autoscaler will remove the extra servers. This condition must be satisfied for 10 consecutive samples (taken every 15 seconds).
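A minimal sketch of this reactive policy, assuming the running-average capacity estimate and the thresholds described above (the class and method names are illustrative):

```python
import math

class ReactiveScaler:
    """Reactive scaling policy used during the exploration phase (sketch).

    The per-server capacity Wmax is re-estimated as the running average of
    workload/servers observed at each SLA violation.
    """
    def __init__(self):
        self.w_max_samples = []

    def _w_max(self) -> float:
        return sum(self.w_max_samples) / len(self.w_max_samples)

    def on_sla_violation(self, workload: float, servers: int) -> int:
        # Update the capacity estimate and return how many servers to add.
        self.w_max_samples.append(workload / servers)
        needed = math.ceil(workload / self._w_max())
        return max(needed - servers, 0)

    def should_scale_down(self, workload: float, servers: int) -> bool:
        # At least 10% more servers than needed -> cluster is over-provisioned.
        if not self.w_max_samples:
            return False
        return servers > (workload / self._w_max()) / 0.9
```

In the actual system, `should_scale_down` would have to hold for 10 consecutive 15-second samples before any servers are removed.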

The performance model captures the relation between workload, number of servers, and percentage of slow records. The model is described by one dependent and two independent variables: y = f(X) = f(x1, x2) where x1 is the workload, x2 is the number of servers and y is the percentage of slow records. The model is trained using non-linear regression. The degree of the polynomial is selected by applying brute-force search (i.e. the solution space is searched exhaustively). This is an iterative process during which the method tests all degrees from 1 through maxDegree by applying Algorithm 1. The computational complexity of the method increases with maxDegree. For each degree, a model is created using non-linear regression on the collected measurements. The Root Mean Square Error (RMSE) is computed by comparing the actual against the predicted values of the model and describes how concentrated the dataset is around the regression line. The selected degree is the one with the minimum RMSE value.
Algorithm 1 Computing the best-fit degree for dataset (X, y).
1: procedure FindDegree(X, y, maxDegree)
2:   minRMSE ← ∞
3:   bestDegree ← 0
4:   degree ← 1
5:   while degree ≤ maxDegree do
6:     model ← NonLinearRegression(X, y, degree)
7:     y_actual ← y
8:     y_predicted ← model.predict(X)
9:     RMSE ← MeasureRMSE(y_actual, y_predicted)
10:    if RMSE < minRMSE then
11:      minRMSE ← RMSE
12:      bestDegree ← degree
13:    degree ← degree + 1
14:  return bestDegree
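Algorithm 1 can be sketched in Python with ordinary least squares over two-variable polynomial features standing in for NonLinearRegression; the feature construction (monomials x1^i * x2^j with i + j ≤ degree) is an assumption about the model form, not a detail given in the text:

```python
import numpy as np

def polynomial_features(X, degree):
    # All monomials x1^i * x2^j with i + j <= degree (including the constant term).
    x1, x2 = X[:, 0], X[:, 1]
    cols = [x1**i * x2**j for i in range(degree + 1)
                          for j in range(degree + 1 - i)]
    return np.stack(cols, axis=1)

def find_degree(X, y, max_degree):
    """Brute-force search for the polynomial degree with minimum RMSE (Algorithm 1)."""
    best_degree, min_rmse = 0, float("inf")
    for degree in range(1, max_degree + 1):
        A = polynomial_features(X, degree)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # fit the model
        rmse = np.sqrt(np.mean((A @ coef - y) ** 2))      # MeasureRMSE
        if rmse < min_rmse:
            min_rmse, best_degree = rmse, degree
    return best_degree
```

Here X holds (workload, servers) pairs and y the observed percentage of slow records; the cost of the search grows with max_degree, as noted above.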

The stability of the model is checked periodically by applying a Bootstrapping technique [10]. This is a two-stage process: in the first stage, the method checks whether the dataset comprises enough data to create an accurate model. If the standard deviation of the predicted values is less than the model stability threshold λ, the method proceeds to the second stage, whose purpose is to assess whether the model can accurately predict the performance of the application for each parallelism [6]. The stability threshold is the maximum acceptable error of the model and is user defined (i.e. λ = 0.05 for both stages). Once this stage is completed (i.e. the model is deemed stable), the controller switches to optimal control.
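A minimal sketch of the first-stage check, assuming bootstrap resampling of the collected measurements and a threshold on the standard deviation of the model's mean prediction; the exact statistic used in [10] may differ, and all names here are illustrative:

```python
import numpy as np

def model_is_stable(predict, X, lam=0.05, n_boot=100, seed=0):
    """First-stage bootstrap stability check (illustrative sketch).

    predict: callable mapping an array of measurements to predicted values
    X:       collected measurements (one row per sample)
    lam:     stability threshold (0.05 in the paper)

    Declares the model stable when the standard deviation of its mean
    prediction across bootstrap resamples stays below lam.
    """
    rng = np.random.default_rng(seed)
    boot_means = []
    for _ in range(n_boot):
        sample = X[rng.integers(0, len(X), size=len(X))]  # resample with replacement
        boot_means.append(np.mean(predict(sample)))
    return float(np.std(boot_means)) < lam
```

A model whose predictions barely move under resampling passes the check; one whose predictions swing with the particular sample drawn does not.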
4.2 Optimal Phase
The autoscaler takes scaling decisions proactively (i.e. the model attempts to predict SLA violations before they occur). Smilax no longer uses instant metrics in order to take decisions about parallelism. Instead, near-future predictions of the workload are used to determine both the performance and the optimal parallelism of the system. This is a two-stage process: (a) in the first stage, future predictions of the workload are derived from past (i.e. recent) values by applying linear regression. Assuming that the rate will not change in the near future, the workload is predicted based on the slope of the fitted line, which represents the rate of change of the workload (i.e. whether it increases, decreases or remains steady). The output is an array of 12 values, one for every 5 seconds over the next 60 seconds. (b) For each future workload value, the performance (i.e. percentage of slow records) of the application is predicted according to the model. For each predicted value of the workload, the performance takes a value for each possible parallelism. The optimal parallelism S_target is the minimum parallelism which satisfies the SLA (i.e. the percentage of slow records per second is less than 10%). Algorithm 2 illustrates this process.
Algorithm 2 Scaling policy during optimal control
1: procedure ProactiveScaler
2:   w_future ← WorkloadPredictor()
3:   parallelismSet ← [1 . . . n_max]
4:   S_target ← n_max
5:   for n_i ∈ parallelismSet do        ▷ select optimal parallelism
6:     performancePoints ← Predict(n_i, w_future)
7:     evaluation ← CheckViolation(performancePoints)
8:     if evaluation then
9:       S_target ← n_i
10:      break
11:  S_new ← Hysteresis(S_target)
12:  scale(S_new)
13: procedure WorkloadPredictor
14:   w_past ← GetPastWorkload(10 mins)
15:   slope ← LinearRegression(w_past)
16:   w_future ← slope.predict(1 min)
17:   return w_future
18: procedure CheckViolation(performancePoints)
19:   for each point_i ∈ performancePoints do
20:     if point_i ≥ SLA then return false
21:   return true
22: procedure Hysteresis(S_target)
23:   S_old ← CurrentNumberOfTaskManagers
24:   if S_target > S_old then
25:     S_new ← S_old + α · (S_target − S_old)
26:   else if S_target < S_old then
27:     S_new ← S_old + β · (S_target − S_old)
28:   else
29:     S_new ← S_old
30:   return S_new
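The WorkloadPredictor procedure can be sketched with a linear fit over recent samples; the sample spacing and horizon follow the text (12 values, one every 5 seconds over the next 60 seconds), while the function name and signature are illustrative:

```python
import numpy as np

def predict_workload(past, sample_period=5.0, horizon=60.0, step=5.0):
    """Extrapolate the workload with a linear fit over recent samples.

    past: recent workload samples (records/sec), oldest first, taken every
    sample_period seconds. Returns one predicted value every `step` seconds
    up to `horizon` (12 values for a 60s horizon at 5s steps).
    """
    t = np.arange(len(past)) * sample_period
    slope, intercept = np.polyfit(t, past, deg=1)    # fit w = slope * t + intercept
    t_future = t[-1] + np.arange(step, horizon + step, step)
    return slope * t_future + intercept
```

For a workload growing by 10 records/sec per sample, the fit extrapolates the same growth rate over the next minute, which is exactly the "rate will not change in the near future" assumption in the text.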

To prevent rapid oscillations in parallelism values, hysteresis gains α and β are defined in the range [0, 1]. The final parallelism S_new is computed as:

S_new = S_old + α · (S_target − S_old), if S_target > S_old     (1)
S_new = S_old + β · (S_target − S_old), if S_target < S_old
Parameter α specifies how quickly the system makes the transition from S_old to S_target. A low value means that new servers are added gradually, but this could cause SLA violations (since the number of servers is less than needed); a value close to 1 means that servers are added quickly (or at once if α = 1). A value close to 1 should be preferred in order to avoid SLA violations. Parameter β specifies how quickly the system scales down from S_old to S_target. A value close to 1 causes servers to be de-allocated quickly. Neither a high nor a low value of β causes SLA violations, but a low value could cause under-utilization of resources (if more than S_target servers are retained for a long time). Henceforth, α = 0.9 and β = 0.4.
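Eq. 1 translates directly into a small function; rounding to an integer server count is an added assumption here, since the equation itself yields a real value:

```python
def hysteresis(s_target: int, s_old: int, alpha: float = 0.9, beta: float = 0.4) -> int:
    """Smooth the transition to the target parallelism (Eq. 1).

    alpha governs scale-up speed, beta scale-down speed; both in [0, 1].
    """
    if s_target > s_old:
        s_new = s_old + alpha * (s_target - s_old)    # scale up quickly (alpha near 1)
    elif s_target < s_old:
        s_new = s_old + beta * (s_target - s_old)     # scale down more cautiously
    else:
        s_new = s_old
    return round(s_new)
```

With the paper's gains, scaling up from 2 to a target of 4 moves almost all the way in one step, while scaling down from 4 to a target of 1 releases servers more gradually.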
Changes in the environment such as system updates or hardware failures could lead the application to behave in an unpredictable way. Change point detection is applied to detect whether the model is still capable of predicting the behavior of the application. Smilax computes the per-second percentage of the residuals |actualPerformance − predictedPerformance|. Residual values are collected in periods of 5 seconds. The residuals of an accurate model must be close or equal to zero. Online change detection [9] captures abrupt changes in the streaming data. The function returns a score (i.e. prediction error) which is constantly compared against a user-defined threshold (0.08 in this work). With a very low threshold, the controller will mark the performance model as inaccurate for very small deviations. Conversely, a high threshold could cause SLA violations. If the model is no longer valid (i.e. no longer capable of predicting the performance and taking scaling decisions proactively), the controller switches back to exploration mode in order to train a new model.
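A simplified stand-in for this check: the score below is just the mean absolute residual over a recent window, whereas Smilax uses the online change detection method of [9]; the names and the windowing are illustrative assumptions:

```python
import numpy as np

def change_detected(actual, predicted, threshold=0.08):
    """Flag a possible model change from recent performance residuals (sketch).

    actual/predicted: per-5-second percentages of slow records observed and
    forecast by the performance model. The residuals of an accurate model
    should be close to zero, so a large score suggests the model no longer
    describes the application's behavior.
    """
    residuals = np.abs(np.asarray(actual) - np.asarray(predicted))
    return float(np.mean(residuals)) > threshold
```

When this returns true, the controller would switch back to exploration mode and start collecting data for a new model.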
5 Experiments
The purpose of the following set of experiments is to assess the performance of the Smilax autoscaler in adjusting the resources of an Apache Flink cluster running a Click Fraud Detection application. This type of fraud occurs on the Internet in pay-per-click (PPC) online advertising. Website owners post advertisements and receive remuneration based on how many Web users click on the advertisements. Fraud occurs when a person or software imitates a legitimate user by clicking on an advertisement without having an actual interest in it. The application receives records from a Kafka topic with elements User-IP, User-ID, time-stamp and event type (e.g. "click"). The production rate of these records represents the workload of the application. The application attempts to detect fraud by searching for the following patterns every 60 seconds: (a) counts of User IDs per unique IP address, (b) counts of IP addresses per unique User ID and, (c) Click-Through Rate (CTR) per User ID.

Flink runs on a VM (8 CPUs, 16 GB RAM) on the Openstack infrastructure of TUC and starts with 1 Task Manager in a container (1 CPU with 2 GB RAM). Prometheus is deployed in a separate container (1 CPU with 1 GB RAM). Smilax runs on a second VM (4 CPUs with 8 GB RAM). A third VM runs Kafka with Zookeeper and the JMX exporter (4 CPUs with 8 GB RAM). The duration of the experiment is 6.5 hours or 1834 samples. During exploration mode, samples are taken every 15 seconds while, during optimal mode, samples are taken every 5 seconds. The workload is generated using Faban12, which runs on a fourth VM (2 CPUs with 4 GB RAM). Faban can be used for the generation of benchmarks based on real workload distributions. The workload follows a Gaussian distribution in two stages. Smilax switched to optimal control after 4 hours. This is the duration of the training (exploration) phase, during which the Stability Check applies every 2 hours (threshold λ = 0.05). Fig. 3 illustrates (a) the workload distribution in the top-most graph, (b) the predicted parallelism (i.e. number of servers allocated) in the middle and, (c) the points in time where SLA violations occurred (slow records) in the third (bottom) graph. All three graphs have the same x-axis showing time in number of samples. The accuracy of the model did not change in the optimal phase and no SLA violations occurred.

Fig. 3: Workload, predicted values of parallelism and slow records.


The resource allocation policy is not the same during the two phases. During the training phase, the autoscaler allocates as many servers as required (Sec. 4.1).

12 http://faban.org

During the optimal phase, allocation of resources is dictated by Eq. 1. Compared to a pure reactive policy, the results of this experiment in Fig. 3 reveal three important advantages of proactive over reactive scaling: (a) allocation of resources ahead of time (i.e. before SLA violations occur), (b) the reactive scaler takes greedy decisions, ending up in sub-optimal allocation of resources (i.e. in this experiment, the reactive autoscaler utilized 1.85 servers per minute compared to the proactive autoscaler which utilized 1.80 servers per minute) and, (c) fewer charges to the service provider due to SLA violations. As shown in the last (performance) graph, the reactive policy encountered SLA violations in three cases during the optimal phase, causing the respective scaling actions. Instead, the proactive policy triggered all scaling actions ahead of time and avoided all SLA violations.
Fig. 4 illustrates the performance model that was built during the training phase. It represents performance (i.e. percentage of slow records) as a function of the workload (i.e. number of input records per second) for various cases of parallelism (i.e. number of servers). The model explored the performance from 1 up to 4 servers. As expected, the percentage of slow records (and consequently of SLA violations) decreases with the number of servers. Although selecting the maximum parallelism (i.e. 4 servers) would be a safe choice, this would result in under-utilization of computing resources. The model indicates that, for a given workload, the optimal choice is to select the minimum parallelism that achieves no SLA violations.

Fig. 4: Performance model of the Fraud Detection application.


6 Conclusions
Smilax is application agnostic and can be used to support optimal scaling decisions proactively. As a proof of concept, the paper shows how Smilax supports optimal scaling in applications running on Apache Flink so that they use as few resources as possible while maintaining their performance within acceptable limits. Hence, clients are not charged for idle or under-utilized resources and providers are not penalized for violating the SLAs with their clients. Smilax builds upon early ideas by Bodik [4]. The original work has been improved on certain methodological aspects including algorithmic model construction, model validity, incorporation within a state-of-the-art streaming platform (i.e. Apache Flink) and verification in a high-impact fraud detection use case.
The assumptions regarding the Apache Flink customization have to be relaxed. A more effective autoscaler would take scaling decisions at a finer granularity (i.e. operator level). Optimizing change point detection in order to capture gradual changes of the performance (and not only abrupt changes), improving the learning method in order to handle unexpected changes of the workload (e.g. outliers), and estimating the parameters λ, α, β automatically (i.e. using machine learning) to match the peculiarities of different workloads are also important issues for future research.
References
1. M. Alexiou and E. G.M. Petrakis. Elixir: An Agent for Supporting Elasticity in Docker
Swarm. In Advanced Information Networking and Applications (AINA 2020), volume
1151, pages 1114–1125, Caserta, Italy, 4 2020.
2. H. Arabnejad, C. Pahl, P. Jamshidi, and G. Estrada. A Comparison of Reinforcement
Learning Techniques for Fuzzy Cloud Auto-Scaling. In 17th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (CCGRID 2017), pages 64–73, Madrid,
Spain, 5 2017.
3. J. V. Bibal Benifa and D. Dejey. RLPAS: Reinforcement Learning-based Proactive
Auto-Scaler for Resource Provisioning in Cloud Environment. Mobile Networks and
Applications, 24(4):1348–1363, 8 2019.
4. P. Bodik, R. Griffith, C.A. Sutton, A. Fox, M.I. Jordan, and D.A. Patterson. Statistical
Machine Learning Makes Automatic Control Practical for Internet Datacenters. In
Hot Topics in Cloud Computing (HotCloud 2009), pages 195–203, San Diego, California,
USA, 6 2009. USENIX Association.
5. F. Rossi, M. Nardelli, and V. Cardellini. Horizontal and Vertical Scaling of Container-
Based Applications Using Reinforcement Learning. In IEEE 12th International Confer-
ence on Cloud Computing (CLOUD 2019), pages 329–338, Milan, Italy, 5 2019.
6. P. Giannakopoulos. Supporting Elasticity in Flink. Technical report, ECE School,
Technical Univ. of Crete (TUC), Chania, Greece, 10 2020.
7. V. Kalavri, J. Liagouris, M. Hoffmann, D. Dimitrova, M. Forshaw, and T. Roscoe.
Three steps is all you need: Fast, accurate, automatic scaling decisions for distributed
streaming dataflows. In 13th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 2018), pages 783–798, Carlsbad, CA, 10 2018.
8. P. Sharma, L. Chaufournier, P. Shenoy, and Y. C. Tay. Containers and Virtual Machines
at Scale: A Comparative Study. In 17th Intern. Middleware Conference, pages 1:1–1:13,
12 2016.
9. J. Takeuchi and K. Yamanishi. A Unifying Framework for Detecting Outliers and
Change Points from Time Series. IEEE Trans. on Knowledge and Data Engineering,
18(4):482–492, 2006.
10. H. Yu, B. E. Chapman, A. Di Florio, E. E Eischen, D. H. Gotz, M. Jacob, and R. H. Blair.
Bootstrapping Estimates of Stability for Clusters, Observations and Model Selection.
Computational Statistics, 34(1):349–372, 2019.
