
Digital Assignment 1

Name: Hari Krishna D R and Jairam B

Registration No: 21MAI0052 and 21MAI0053

Department: SCOPE

Subject Code / Subject: CSE5002 / Operating Systems and Virtualization


A Survey on Container Live Migration
Abstract: Container virtualization is a technique for running multiple processes in an isolated
manner. Containers have gained popularity for application management and deployment because
of their lightweight environment, flexible deployment, and fine-grained sharing of resources.
Organizations use containers extensively to deploy their increasingly complex workloads,
resulting from new technologies such as online infrastructure, big data, and the Internet of
Things, in managed clusters or data centers in the private and public cloud. This isolation opens
the possibility of saving a container's entire state and restarting it later. Checkpointing is used to
perform live migration of containers: it allows the state of a running container to be saved and
restarted later on the same or a separate host, possibly through multiple dumps of the container's
state. The process is transparent to running applications and network connections. In this survey
paper, we present a taxonomy for live container migration and the problems faced during live
migration. The survey is then carried out on the basis of the proposed taxonomy, helping to
identify sustainable solutions for live migration of containers. We concentrate on container
technologies such as Docker, LXC, and OpenVZ.

Keywords: containers, migration, live migration, checkpointing, Docker, OpenVZ, LXC

I. Introduction

The two main virtualization technologies today are virtual machines (VMs) and containers.
Containers are known to boot faster than VMs and thus incur lower service downtime for the
application.

Containers are a newer technology than VMs. Containerization technologies like OpenVZ
[4][32], LXC and Docker [14] provide a different way of virtualization compared to classic VMs
due to their lightweight structure. There are key differences between the two technologies. One
is that while each VM runs its own operating system kernel on virtualized hardware, containers
share the hardware and kernel of a single host operating system. Containers encapsulate
applications with their required binaries in order to provide the application as a service.
Therefore, containers have lower virtualization cost and use fewer resources than VMs because
of their lightweight nature. Furthermore, since a container does not need its own operating
system, it uses only the resources required for the application upon container start. Both
virtualization techniques can provide increased efficiency in the utilization of resources in big
data centers, which is achieved by migrating the encapsulated service. Live migration has
become an increasingly popular topic because of its contribution to the consolidation of services.
There are many other reasons to migrate a service from a source host to a destination.

These include system maintenance (for a software or hardware update), load balancing, efficient
resource utilization, service protection from attacks through moving target defense, etc. In
addition, migration management frameworks such as OpenStack have been developed, which
can be applied for load balancing.

Live service migration is a technique that provides fast handovers during the runtime of the
application, ensuring seamless and reliable low-latency communication. Hence, it is considered
an ideal technique to improve network flexibility. The two main live migration technologies
today are VMs and containers. Several works have compared VMs and containers and shown
that containers perform better because they are lighter and boot faster. This is crucial for live
migration, since the main objective is to provide seamless and reliable communication. Live
migration is about moving application instances around without disconnecting the clients.

Live container migration [1] refers to the process of moving an application between different
physical machines or clouds without disconnecting the client. The memory, file system, and
network connectivity of the containers running on top of bare-metal hardware are transferred
from the original host machine to the destination, preserving their state with negligible
downtime. The live migration process for containers leans heavily on checkpoint/resume
strategies, benefiting from the small size of containers compared to VMs [19].

It involves moving a service from one host to another with the minimum possible downtime.
Live migration is also required for system maintenance, load balancing [8], and protecting
services from attacks through moving target defense [25]. While migrating a service, the system
should not be vulnerable to attacks.

A live migration can help with server maintenance scenarios or unbalanced load. Popular
implementation strategies include "pre-copy", where state is copied to the alternate host and then
traffic is switched, and "post-copy" [3], where initial state is copied and the remainder is "lazy
loaded". Live migration is a particular case of service migration where the service is
transparently relocated to another physical host. This means that the resulting downtime is not
detectable by the end user, and the end user does not realize that the server was relocated (e.g.,
by detecting a new IP address). Container technology has been widely adopted on various
computing platforms, such as cloud platforms, CI/CD, and DevOps. It employs layered image
management to enable agile deployment of applications and leverages cgroups and namespaces
to provide an isolated environment for each application and mitigate resource contention among
concurrently running applications.

Docker makes it convenient for developers to package the application runtime into an image and
run the application on any OS with the assistance of the Docker daemon. In a large-scale data
center, millions of Docker containers are usually managed by various orchestration tools (e.g.,
Kubernetes, Mesos, and Swarm). However, none of them fulfills the live migration requirements
for Docker containers in scenarios such as load balancing, host maintenance, and system
upgrades. Kubernetes [2] leverages the replication controller to relocate Docker containers as an
alternative solution. It can seamlessly move a stateless Docker container to another node, but
causes severe downtime for stateful Docker containers, which are increasingly popular in cloud
environments. As a result, live migration of Docker containers is a desirable and valuable
technology for resource utilization and QoS guarantees in data centers. Mature works and
migration mechanisms (e.g., pre-copy, post-copy, and logging/replay) exist for live migration of
virtual machines [7] (VMs).
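Docker's experimental checkpoint feature (backed by CRIU, discussed later in this survey) exposes the basic save/restore primitive that such live migration builds on. The sketch below drives it from Python; the container name `web` and checkpoint name `cp1` are hypothetical, and the Docker daemon must have experimental mode enabled.

```python
import subprocess

# Save the running state of a (hypothetical) container named "web" as a
# checkpoint called "cp1". Requires Docker experimental mode and CRIU.
subprocess.run(["docker", "checkpoint", "create", "web", "cp1"], check=True)

# ... the checkpoint data could be shipped to another host here ...

# Start the container again from the saved state rather than from scratch.
subprocess.run(["docker", "start", "--checkpoint", "cp1", "web"], check=True)
```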

Unlike a VM, Docker has a layered image, a shared kernel runtime, and a richer functional
management architecture, which make live migration of a Docker container more complicated.
Correspondingly, live migration of Docker containers consists of three tasks: migration of the
image, migration of the runtime, and migration of the management context. During live
migration, it is important to guarantee the integrity of these three key components, also known as
component-integrity. Besides, the scalability and downtime of the live migration are critical
metrics in a data center. As a result, an ideal live migration of Docker containers should provide
good scalability and negligible downtime in performance, while guaranteeing the
component-integrity of Docker containers in terms of functionality.
The basic live migration algorithm was first proposed by Clark et al. [9]. The hypervisor first
marks all pages as dirty; the algorithm then iteratively transfers dirty pages across the network
until the number of pages remaining to be transferred falls below a certain threshold or a
maximum number of iterations is reached. The hypervisor marks transferred pages as clean, but
since the VM keeps operating during live migration, already transferred memory pages may be
dirtied during an iteration and must then be re-transferred. At some point the VM is suspended on
the source to stop further memory writes, and the remaining pages are transferred. After all
memory contents have been transferred, the VM resumes at the destination. Performance was
measured with a hundred virtual machines migrating concurrently under standard industry
benchmarks, showing that for a variety of workloads, application downtime due to migration is
less than a second.
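To make the structure of this loop concrete, here is a toy Python model of it. The dirtying behavior (each round re-dirties a quarter of the pages just copied) is an invented stand-in for real dirty-page tracking, so only the control flow mirrors the algorithm of Clark et al.

```python
import random

def newly_dirtied(prev_dirty):
    # Toy model: while the VM keeps running, a quarter of the pages just
    # copied get written again (real systems read this from dirty tracking).
    return set(random.sample(sorted(prev_dirty), len(prev_dirty) // 4))

def pre_copy_migrate(n_pages=1024, threshold=16, max_iters=30):
    dirty = set(range(n_pages))        # round 0: every page counts as dirty
    sent = 0
    for _ in range(max_iters):
        sent += len(dirty)             # copy the current dirty set
        dirty = newly_dirtied(dirty)   # pages written during that copy
        if len(dirty) < threshold:     # small enough for stop-and-copy
            break
    sent += len(dirty)                 # VM suspended: final residual copy
    return sent

print(pre_copy_migrate(), "pages transferred, including re-sent duplicates")
```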

Method 1:

A high-performance virtual machine migration design based on Remote Direct Memory Access
(RDMA) was proposed by Huang et al. [10]. InfiniBand is an emerging interconnect offering
high performance and features such as OS bypass and RDMA. RDMA is direct memory access
from the memory of one computer into that of another without involving either one's operating
system. Using RDMA, remote memory can be read and written (modified) directly, and hardware
I/O devices can access memory without involving the OS.

Method 2:

Luo et al. [11] describe a whole-system live migration scheme that transfers the whole-system
run-time state of the virtual machine (VM), including CPU state, memory data, and local disk
storage. They propose a three-phase migration (TPM) algorithm as well as an incremental
migration (IM) algorithm, which migrates the virtual machine back to the source machine in a
very short total migration time. During the migration, all write accesses to the local disk storage
are tracked using a block bitmap. The migration downtime is around 100 milliseconds, close to
that of shared-storage migration. Using the IM algorithm, the total migration time is reduced, and
the performance overhead of recording all writes on the migrated VM is very low.
Method 3:

Bradford et al. [12] presented a system for supporting transparent, live wide-area migration of
virtual machines that use local storage for their persistent state. The approach is transparent to
the migrated VM, does not interrupt open network connections to and from the VM during
wide-area migration, guarantees consistency of the VM's local persistent state at the source and
the destination after migration, and is able to handle highly write-intensive workloads.

The main challenge of live migration [15][16] is to reduce the time during which the service is
down in the migration process. Other challenges include:

1. Slight performance degradation during migration.
2. Difficulty moving between hosts if dependent services (e.g., big data, proprietary
services) aren't available in the alternate location.
3. Loss of data integrity because of live migration attacks [26].
4. Duplication of data.
5. Restoring of data and even states.
6. System maintenance.
7. Load balancing.
8. Security.
9. Loss of data consistency.
10. Keeping downtime to a minimum.

Among these, the main challenges are:

1. Keeping downtime as low as possible.
2. Security.
3. Load balancing.

In this survey we discuss the procedure of live migration of containers, the main challenges faced
during live migration, and the solutions proposed for each challenge faced while performing live
migration of containers.
Live Migration of Containers vs Virtual Machines

In [20], the authors proposed a multi-layer framework for live migration of applications
encapsulated in a container or virtual machine. Experiments were conducted to compare the
results obtained with containers against those obtained with VMs. The framework aims to
provide good performance for the frequent migrations needed when transitioning to mobile edge
cloud architectures. A mobile edge cloud is a network architecture that provides cloud services at
the edge of a cellular network.

The study in [20] also analyzes the difference between VM and container migration.
Experimental results show that a container (LXC was used in the experiments) has significant
advantages over a VM (KVM was used in the experiments) in terms of total migration time,
application downtime, and the amount of data sent from the source node to the destination. The
main reason is that containers are lighter than VMs: the contents of a container's memory are
predominantly those of the application running in the container, whereas the contents of a VM's
memory also belong to many other processes, such as background processes that are usually
unrelated to the migrated service.
II. Live Migration and its working
a. Brief Description of how live migration is performed

Figure 1: Live Migration

Source Node - where a container is placed before live migration


Destination Node - where a container will be placed after live migration

To perform the migration, the platform freezes the container at the source node, blocking
memory, processes, the file system and network connections, and captures the state of this
container. The state is then copied to the destination node. The platform restores the state and
unfreezes the container at that node. Finally, there is a quick cleanup process at the source node.

It is pretty straightforward: you get the state, you copy the state, and you restore the state.
However, note that there is a freeze timeframe, and we have to consider it during application
architecture design, as it can be an issue for some applications.
There are two kinds of live migration solutions. One of them is pre-copy memory migration. To
migrate a container, the platform turns on memory tracking at the source node and copies
memory to the destination node in parallel with execution until the difference becomes minimal.
After that, it freezes the container, captures the rest of the state, migrates it to the destination
node, and restores and unfreezes it there.

Figure 2: Pre-Copy Memory Migration

Another solution is post-copy memory migration, in other words, lazy migration. The system
freezes the container at the source node at the beginning, captures the state of the
fastest-changing memory pages, moves that state to the destination node, restores it, and
unfreezes the container. The rest of the state is copied from the source node to the destination in
the background.

Figure 3: Post-Copy Memory Migration

b. Procedure of performing Live migration using containers

Container-type virtualization is the ability to run multiple isolated sets of processes, known as
containers, under a single kernel instance. Such isolation opens the possibility of saving the
complete state of a container (in other words, checkpointing it) and later restarting it.
Checkpointing itself is used for live migration, in particular for implementing high-availability
solutions. Here we discuss the checkpointing and restart feature for containers. The feature
allows one to checkpoint the state of a running container and restart it later on the same or a
different host, in a way that is transparent to running applications and network connections.
Checkpointing and restart are implemented as loadable kernel modules plus a set of user-space
utilities. Their efficiency has been proven on various real-world applications.

The following metrics are usually used to measure the performance of live migration:

1. Preparation Time: The time between the start of migration and the start of transferring the
VM's processor state to the target node; the VM continues to execute and dirty its memory.
2. Downtime: The time during which the migrating VM is not executing. It includes the
transfer of processor state.
3. Resume Time: The time between resuming the VM's execution at the target and the end
of migration, at which point all dependencies on the source are eliminated.
4. Pages Transferred: The total number of memory pages transferred, including duplicates,
across all of the above time periods.
5. Total Migration Time: The total time of all the above periods from start to finish. Total
time is important because it affects the release of resources on both participating nodes as
well as within the VMs.
6. Application Degradation: The extent to which migration slows down the applications
executing within the VM.
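As a small worked example of how the timing metrics relate, the sketch below derives the first three and the total from four wall-clock timestamps; the field names and numbers are illustrative rather than taken from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class MigrationTimeline:
    start: float    # migration initiated on the source node
    suspend: float  # VM/container paused after the last pre-copy round
    resume: float   # execution resumes on the target node
    end: float      # all dependencies on the source are released

    @property
    def preparation_time(self) -> float:
        return self.suspend - self.start

    @property
    def downtime(self) -> float:
        return self.resume - self.suspend

    @property
    def resume_time(self) -> float:
        return self.end - self.resume

    @property
    def total_migration_time(self) -> float:
        return self.end - self.start

t = MigrationTimeline(start=0.0, suspend=8.2, resume=8.5, end=12.0)
print(f"downtime={t.downtime:.1f}s, total={t.total_migration_time:.1f}s")
```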

Before performing migration, we should make sure all these metrics are acceptable. Since a
container is an isolated entity (meaning that all inter-process relations, such as parent-child
relationships and inter-process communications, are within the container boundaries), its
complete state can be saved into a disk file; the procedure is known as checkpointing. A container
can then be restarted from that file. The ability to checkpoint and restart a container has many
applications, such as:

• Hardware upgrade or maintenance.


• Kernel upgrade or server reboot.
c. Prerequisites and Requirements for System Checkpointing and
Restart

Checkpointing and restarting a system [4][5][29] has some prerequisites which must be supplied
by the OS on which it is implemented. First of all, a container infrastructure is required which
provides:

1. PID virtualization – to make sure that during restart the same PID can be assigned to a
process as it had before checkpointing.
2. Process group isolation – to make sure that parent-child process relationships do not lead
outside the container.
3. Network isolation and virtualization – to make sure that all networking connections are
isolated from all other containers and the host OS.
4. Resource virtualization – to be independent of hardware and able to restart the container
on a different server.

For checkpointing/restoring, the CRIU tool is used. CRIU (Checkpoint/Restore In Userspace) is a
software tool for the Linux OS. Using this tool, a running application can be frozen and
checkpointed as a set of disk files. The files can then be used to resume the application and run it
from the state at the time of the freeze. Application live migration becomes possible with this
feature. CRIU is supported and integrated with Docker, OpenVZ and LXC/LXD [23].
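As an illustration, the snippet below drives a basic CRIU checkpoint and restore from Python. The PID and image directory are hypothetical, and the exact option set (e.g., --shell-job) depends on how the target process was started, so treat this as a sketch to verify against the installed CRIU version.

```python
import subprocess

PID = 4242          # hypothetical PID of the process tree to checkpoint
IMG = "/tmp/ckpt"   # directory that will receive the image files

# Freeze the process tree rooted at PID and write its state to image files.
subprocess.run(["criu", "dump", "-t", str(PID), "-D", IMG, "--shell-job"],
               check=True)

# Later -- possibly on another host, after copying IMG over -- recreate the
# process tree from the images and let it continue where it was frozen.
subprocess.run(["criu", "restore", "-D", IMG, "--shell-job"], check=True)
```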

The main feature of the CRIU tool is that it is implemented mostly in user space instead of kernel
space. This allows the tool to support live container migration by letting users checkpoint and
restore currently running application instances. The process migration performed by the CRIU
tool consists of three main phases: checkpointing, page-server activity, and restoring.

CRIU offers the possibility to save a running process as a set of files, e.g., page maps, file
descriptors, and open sockets. In other words, CRIU walks the process tree and gathers enough
information about the associated processes to resurrect them later [24]. More specifically, at the
start of the checkpoint, the dumper process goes through the process directory under /proc and
creates a process tree structure by collecting the necessary information about the relevant
processes. Next, parasite code is injected into each task at the appropriate point so that CRIU
subroutines can execute within the address space of the related processes. The parasite code
stays connected to CRIU and accepts commands from it.

After the dump, the parasite code is removed from the task, which reverts to its original code.
CRIU releases the process and gives control back to the operating system fully. In the end, CRIU
evaluates all the gathered data and records this information to dump files. At the restoring stage,
CRIU reads the image files and resolves which resources are shared between processes. Then, by
calling the operating system function fork(), CRIU creates the processes on the destination node.
After that, CRIU arranges the necessary settings for files, namespaces, maps, private memory
areas, sockets and ownership. Finally, memory is restored to its exact location, along with timers,
credentials and threads, so that the restored processes can resume execution [3]. CRIU is only
required for containers running stateful applications; it is not needed for stateless applications,
since their memory contents and execution state are not important for container recovery. In
short, CRIU stops the running container's processes and checkpoints them as the set of image
files needed to restore the container from the stopped state. In other words, CRIU essentially
moves container state into a persistent collection of files, which simplifies transfer and recovery
[25]. The next figure shows the sequence used by CRIU for live migration of containers.

Understanding Checkpointing and Restart

The checkpointing and restart procedure is initiated from the user level, but it is mostly
implemented at the kernel level, thus providing full transparency of the checkpointing process.
Also, a kernel-level implementation does not require any special interfaces for resource
re-creation. The checkpointing procedure consists of the following three stages:

1. Freeze processes – move processes to a previously known state and disable the network.

2. Dump the container – collect and save the complete state of all the container's processes
and the container itself to a dump file.

3. Stop the container – kill all the processes and unmount the container's file system.
Figure 4: CRIU Principle Diagram
The procedure to perform restarting:

1. Restart the container – create a container with the same state as previously saved in a
dump file.
2. Restart processes – create all the processes inside the container in the frozen state, and
restore all of their resources from the dump file.
3. Resume the container – resume processes’ execution and enable the network. After that,
the container continues its normal execution.

The first step of the checkpointing procedure [4][5][6], and also the last step of the restart
procedure before processes can resume their execution, is the process freeze. The freeze is
required to make sure that processes will not change their state and that the saved process data
will be consistent. It is also easier to reconstruct frozen processes.
It is very important to save a consistent state of all the container's processes. All process
dependencies should be saved and reconstructed during restart. Dependencies include the process
hierarchy (see Figure 5), identifiers (PGID, SID, TGID, and others), and shared resources (open
files, System V IPC objects, etc.). During the restart, all such resources and identifiers should be
set correctly. Any incorrectly restored parameter can lead to a process termination, or even to a
kernel oops.

Figure 5: Process Hierarchy

As most of the resources must be restored from the process context, a special function (called a
"hook") is added on top of the stack for each process during the restart procedure. Thus, the first
function executed by a process will be that "hook," and the process itself will restore its
resources. For the container's init process, this "hook" also restores the container state, including
mount points, networking (interfaces, route tables, iptables rules, and conntracks), and System V
IPC objects; and it initiates the process tree reconstruction.
d. Live Migration

Using the checkpointing and restart feature, it is easy to implement live migration. A simple
algorithm can be implemented which does not require any special hardware like a SAN or iSCSI
storage:

1. Container’s file system synchronization. Transfer the container’s file system to the
destination server. This can be done using the rsync utility.
2. Freeze the container. Freeze all the processes and disable networking.
3. Dump the container. Collect all the resources and save them to a file on disk.
4. Second container’s file system synchronization. During the first synchronization, a
container is still running, so some files on the destination server can become outdated.
That is why, after a container is frozen and its files are not being changed, the second
synchronization is performed.
5. Copy the dump file. Transfer the dump file to the destination server.
6. Restart the container on the destination server. At this stage, we create a container on the
destination server and create processes inside it in the same state as saved in the dump
file. After this stage, the processes will be in the frozen state.
7. Resume the container. Resume the container's execution on the destination server.
8. Stop the container on the source server. Kill the container's processes and unmount its
file system.
9. Destroy the container on the source server. Remove the container's file system and config
files on the source server.
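The nine stages above can be strung together in a small driver script. In the sketch below, rsync, scp and ssh are the standard tools; `container-ctl` is a hypothetical stand-in for the runtime-specific freeze/dump/restore commands, and the paths and host name are made up.

```python
import subprocess

SRC_FS = "/var/lib/containers/ct101/"   # hypothetical container file system
DEST   = "dest-host"                    # hypothetical destination server
DUMP   = "/tmp/ct101.dump"

def run(*cmd):
    subprocess.run(cmd, check=True)

run("rsync", "-a", SRC_FS, f"{DEST}:{SRC_FS}")   # 1. first FS sync (container running)
run("container-ctl", "freeze", "ct101")          # 2. freeze (hypothetical CLI)
run("container-ctl", "dump", "ct101", DUMP)      # 3. dump state to disk
run("rsync", "-a", SRC_FS, f"{DEST}:{SRC_FS}")   # 4. second FS sync (files now stable)
run("scp", DUMP, f"{DEST}:{DUMP}")               # 5. ship the dump file
run("ssh", DEST,                                 # 6-7. restore, then resume
    f"container-ctl restore ct101 {DUMP} && container-ctl resume ct101")
run("container-ctl", "stop", "ct101")            # 8. kill source processes
run("container-ctl", "destroy", "ct101")         # 9. remove source FS and config
```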

In the above migration scheme, stages 3–6 are responsible for most of the delay in service. Let us
take a look at them again and dig a little deeper:

1. Dump time – the time needed to traverse all the processes and their resources and save
this data to a file on disk.
2. Second file system sync time – the time needed to perform the second file system
synchronization.
3. Dump file copying time – the time needed to copy the dump file over the network from
the source server to the destination server.
4. Undump time – the time needed to create a container and all its processes from a dump file.

III. Migration Optimizations

Second file system sync time and dump file copying time are responsible for about 95% of the
delay in service. That is why optimization of these stages makes sense [17]. The following
options are possible:

1. Second file system sync optimization – decrease the number of files being compared
during the second sync. This can be done with the help of a file system change tracking
mechanism.
2. Decreasing the size of a dump file:
a. Lazy migration – migration of memory after the actual migration of the container,
i.e., memory pages are transferred from the source server to the destination on
demand:
i. Request a page from swap.
ii. Resend the request to the source server.
iii. Find the page on the source server.
iv. Transfer the page to the destination server.
v. Load the page into memory.

Figure 6: Lazy migration from source server to destination server


During live migration, all processes’ private data are saved to a dump file, which is then
transferred to the destination server. In the case of large memory usage, the size of the dump file
can be huge, resulting in an increase of dump file transfer time, and thus in an increased delay in
service. To handle this case, another type of live migration can be used—lazy migration. The
idea is the following—all the memory pages allocated by processes are marked with a special
flag, which is cleared if a page is changed. After that, a container can be frozen and its state can
be dumped, but in this case only pages without this flag are stored. That helps to reduce the size
of a dump file.
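To make the on-demand flow in steps i–v above concrete, here is a toy page server in Python: the source answers page-number requests over a socket, and the destination fetches a page only when it is first touched. This is a self-contained illustration of the protocol shape, not CRIU's actual lazy-pages implementation.

```python
import socket
import threading

PAGE = 4096
PAGES = {n: bytes([n % 256]) * PAGE for n in range(64)}  # toy source "memory"

srv = socket.socket()
srv.bind(("127.0.0.1", 0)); srv.listen(1)
port = srv.getsockname()[1]

def page_server():
    """Source side: wait for page numbers and send the page contents back."""
    conn, _ = srv.accept()
    with conn:
        while (req := conn.recv(8)):                        # ii-iii: receive the
            conn.sendall(PAGES[int.from_bytes(req, "big")])  # request, find page

threading.Thread(target=page_server, daemon=True).start()

# Destination side: the restored container faults on page 7, so it asks the
# source for that page over the network and loads it into local memory.
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall((7).to_bytes(8, "big"))                 # i-ii: fault -> request
local_memory = {7: cli.recv(PAGE)}                  # iv-v: transfer and load
cli.close()
print("page 7 fetched:", len(local_memory[7]), "bytes")
```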

b. Iterative migration – iterative migration of memory before the actual migration of
the container.

Figure 7: Iterative migration from source server to destination server

Another way to decrease the size of the dump file is to transfer memory pages in advance. In this
case, all the pages are transferred to the destination server before the container freeze. But as
processes continue their normal execution, pages can be changed and already transferred pages
can become outdated. That is why pages should be transferred iteratively. In the first step, all
pages are marked with a clean flag and transferred to the destination server. Some pages may be
changed during this process, in which case the clean flag is removed. In the second step, only the
changed pages are transferred to the destination server.
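CRIU supports this iterative scheme directly through its pre-dump mode: each pre-dump copies memory while the process keeps running and tracks subsequent writes, so the final dump contains only pages dirtied since the last round. A sketch follows; the PID and directories are hypothetical, and the option names come from CRIU's iterative-dump workflow, so verify them against the installed version.

```python
import subprocess

PID = 4242  # hypothetical target process tree

def criu(*args):
    subprocess.run(["criu", *args], check=True)

# Round 1: copy memory while the process keeps running, tracking later writes.
criu("pre-dump", "-t", str(PID), "-D", "/tmp/pre1", "--track-mem")

# Final round: dump only the pages dirtied since the pre-dump, then freeze.
# (--prev-images-dir is interpreted relative to the -D directory.)
criu("dump", "-t", str(PID), "-D", "/tmp/final",
     "--track-mem", "--prev-images-dir", "../pre1", "--shell-job")
```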
IV. Challenges faced during live migration of containers and
solutions proposed

An early challenge in the development of container technology was to find an effective way to
isolate and secure different containers on the same machine.

The challenges are

1. Downtime challenges: Downtime should be as low as possible. When a system
administrator needs to upgrade hardware, it is very painful to migrate all the customers
from one hardware node to another, and in many cases it is simply impossible without
downtime.
- Solution proposed
o Migration is the process of moving a container from one server to another.
Migration can provide fault tolerance, as a container or VM can be migrated to
another host if the system experiences failures. It can also serve for balancing
load, tackling hardware failures, scaling and reallocating resources. The migration
process, whether for a VM or a container, consists of mainly three classes:
1. Memory migration: Memory migration can be divided into two types:
pre-copy and post-copy. In post-copy [13], a container transfers memory after
the processor state is sent to the target location, whereas in the pre-copy
migration mechanism memory is transferred repeatedly first and the processor
state afterwards.
a. The steps of post-copy migration are given below:
i. Stop the container at the source.
ii. Send processor state, register state and device states to the
destination.
iii. Resume the destination container with no memory.
iv. When the container tries to access pages not yet transferred, the
container is stopped and the faulting page is requested over the
network.
Figure 8: Process of post-copy migration

b. Pre-copy follows similar steps, with a difference in the timing of memory
transfer. The steps of pre-copy migration are summarized below:
i. The container at the source continues to run while memory pages are
being copied to the destination.
ii. Copying is repeated, but subsequent rounds only copy pages
modified during the last transfer.
iii. The container at the source is stopped, then the CPU state is copied.
iv. The destination container is started.

Figure 9: Process of pre-copy migration


Figure 10: Comparison of post-copy and pre-copy

2. Network migration: Network connectivity should be maintained after
migration by preserving open connections. When migrating within the same
LAN, the original IP address should be retained even after migration; an
unsolicited ARP reply is then generated to advertise the new location of the
container. If the migration crosses a WAN, technologies such as Virtual
Private Networks (VPNs), tunneling and DNS servers can be used.
3. Disk migration: Disk migration can be optimized by using the concept of
deltas. In this process, write operations at the source are intercepted and
deltas are generated. Deltas are the communication units containing the
written data, the size of the data and the location on the disk. The first step of
the process is examining the stored data and locating blocks which have
changed since the last write. The changed data is then sent to the destination
through the WAN or LAN. A related feature of the suspend/resume operation
is disconnected operation, in which a client can access critical data during
temporary failures of the data repository through the contents of the cache.
The modifications in the cache can be transferred when the disconnection
ends.
4. Suspend/Resume Migration: Suspend/resume migration technology is based
on providing user mobility in a secure way. In this process, the container or
VM is migrated to a destination host and is inactive during the transfer. The
main points of suspend/resume migration are listed below.
a. Network connections are dropped and reestablished at the destination host.
b. Processor state, register state and device states are sent to the destination.
c. Images, local persistent state and ongoing network connections are
transferred, and support for disconnected operation is provided.
d. Delta disk operations are applied to optimize disk transfer.
5. Record/Replay Migration: Record/Replay migration is usually used for
recovering state. The steps of Record/Replay migration are listed below.
a. Find the checkpoint of the last state.
b. Replay the events from the log to get to the desired state.
Events can be classified as deterministic or non-deterministic. Deterministic
events are regular events such as arithmetic, memory and branch instructions,
whose outcomes are deterministic. Non-deterministic events, such as
interrupts and input from devices like the keyboard, mouse, network and
clock, have outcomes that cannot be determined when the process is repeated.
Non-deterministic events can be classified into two categories: external input
and time. The time of an event is the exact point during execution when the
event occurs; external input is data from other devices or human beings. To
replay a container or VM, the non-deterministic events that affect the
computation must be logged. Deterministic events are not logged and can be
recomputed during replay. Replaying the non-deterministic events from the
log and recomputing the deterministic events brings the container to the
desired state. However, a Record/Replay method should address challenges
such as maximizing trace completeness, reducing log file size and keeping
performance overhead low.
6. Migration via Memory Warping: Memory warping, also known as m-warp
[18], is a fast, live container migration approach targeting a common
intra-host migration scenario in public clouds: when performing container
migration, it is preferable to choose/provision a destination VM on the same
physical host (as long as the underlying host remains available with sufficient
resources), as intra-host migration can leverage local memory bandwidth for
fast state transfer and avoid costly inter-host network communication. Such
intra-host container migration is particularly applicable to a VM that needs to
be temporarily shut down for maintenance, upgrades, or recovery from
failures, during which its hosted processes/containers must be migrated.
Figure 11: Migration Time vs Memory Size
o The expected behavior of the migration mechanism is zero downtime.

o Overhead Analysis:

The overhead analysis of live container migration in [18] uses the following
configuration: each (source and destination) VM is configured with sufficient
resources (4 virtual CPUs and 4 GB memory), running on the same physical
host (12 physical CPUs and 128 GB memory). The network bandwidth
between the two intra-host VMs is set to 10 Gbps. Two main metrics are used
to gauge the performance of live container migration: total migration time,
the time between the start and the end of the whole migration; and frozen
time, the time during which the migrated container is suspended (i.e., in the
last iteration).
Figure 12: Breakdown of frozen time at each stage of live container migration

Figure 13: Architecture of m-warp

Memory warping (m-warp) is a fast, live intra-host container migration approach. Instead of
copying a container's memory, m-warp relocates the ownership of the container's physical
memory pages from the source VM to the destination VM on the same host via a highly efficient
memory relocation mechanism. A preliminary evaluation shows that m-warp achieves sub-second
total container migration time regardless of container size, and significant application-level
performance improvement for memory-intensive applications.

2. Problem of security: Live migration is also required for system maintenance, load
balancing, and protecting services from attacks through moving target defense. While
migrating a service, the system should not be vulnerable to attacks. Live migration of
containers can be vulnerable to many kinds of attacks [26], such as eavesdropping,
man-in-the-middle and denial-of-service (DoS) attacks. The migration system should take
precautions against these types of attacks.
3. Live Migration Attacks: Live migration of containers/VMs is susceptible to active and
passive attacks. Active attacks cause loss of data integrity, whereas passive attacks cause
loss of confidentiality of sensitive data. Some of the most notable attacks are
man-in-the-middle, DoS, overflow and replay attacks.
a. Man-in-the-Middle Attack: Attackers can eavesdrop on the data while it migrates
from the source host to the destination and modify its content, which results in the
loss of data integrity.
b. Denial of Service (DoS) Attack: By using false resource advertisement, an
attacker can attract more virtual machines towards a specific machine. The
resulting migrations steal bandwidth and prevent the actually required migrations.
This can lead to serious problems in cloud systems where migrations are started
automatically.
c. Overflow Attack: Attackers can cause stack overflows by creating congestion in
the communication channel traffic, which can result in memory corruption of the
running processes.
d. Replay Attack: Attackers can re-transmit previous copies of memory pages to the
destination host where the changed ones are required; this is possible because of
frequent dirty-page occurrences. Attackers can also modify the order of memory
pages sent from the source to the destination, which results in ordering problems
at the destination host.
4. Live Migration Security Factors:
The factors that need to be achieved for making live migration secure are as follows:
a. Access Control: Access control policies should be defined to ensure that only
users with granted privileges have control over the system.
b. Authentication: Authentication is required between the source and the destination
hosts for the migration process.
c. Non-Repudiation: All the actions of both the source and destination hosts should
be observed. While live migration is occurring, all activities should be logged.
d. Data Confidentiality: Data encryption is required while migrating data between
source and destination hosts.
e. Communication Security: The data transmission channel should be protected on
the migration path between source and destination hosts.
f. Availability: The system should be protected against DoS attacks to make
resources available for legitimate users.
g. Privacy: The migration traffic is required to be isolated from the other networks
in order to protect the system from man-in-the-middle and sniffing attacks.
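One concrete precaution supporting the data integrity and non-repudiation factors above is to compute a cryptographic digest of the dump on the source and verify it on the destination before restoring. A minimal sketch with hypothetical file names:

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a possibly large dump file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while (block := f.read(chunk)):
            h.update(block)
    return h.hexdigest()

# Source side: compute the digest before sending.
expected = sha256_of("/tmp/ct101.dump")
# ... ship the dump over an encrypted channel and the digest over an
# authenticated one; the destination then recomputes and compares ...
if sha256_of("/tmp/ct101.dump") != expected:
    raise RuntimeError("dump was modified in transit; refusing to restore")
```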

V. Secure Live Migration of Containers:

Machen et al. [19] proposed a layered framework for live migration of applications
encapsulated either in containers or virtual machines. Experiments were conducted to compare
the results obtained when working with VMs and with containers. The framework aims at
achieving good performance for the frequent migration needs of mobile edge cloud
architectures.

Proposed Model

Model Architecture

This section describes the proposed model architecture. In our proposed model, there are five
main components. Two of these are the source and destination instances. The remaining
components are an application server, a database (DB) server and the client interface. As shown
in Figure 14, all components have secure connections established between them. The application
server behaves as the controller of the model and initiates the migration. It connects to the
instances by SSH and issues commands over that SSH channel [21][22]. The application server
creates the related SSH channels between the instances and itself. To achieve that, it first creates
an SSH channel between itself and the host instance (instance-1). Then it commands instance-1
to run the application on Docker and to start migration if requested. That is, it commands
instance-1 to send the related checkpoint files to instance-2 (the destination instance) over an
SSH channel created with parameters provided by the application server; the parameters
themselves are delivered over SSH, meaning that an SSH command is sent to instance-1 to
connect it to the other instance.
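In this model, the application server's controller role amounts to running commands on the instances over SSH. The sketch below uses the third-party paramiko library; the host names, user name and key path are placeholders.

```python
import os
import paramiko

def run_remote(host, command, user="ubuntu", key="~/.ssh/id_rsa"):
    """Open an SSH channel to an instance, run one command, return its output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, key_filename=os.path.expanduser(key))
    _, stdout, _ = client.exec_command(command)
    out = stdout.read().decode()
    client.close()
    return out

# As in the model: tell instance-1 to push its checkpoint files to instance-2.
run_remote("instance-1", "scp -r /tmp/ckpt instance-2:/tmp/ckpt")
```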

Figure 14: Model Architecture


Figure 15: Proposed Model activity diagram
Model Execution Plan for Stateless Applications:

In our stateless application example, the Clock application, we retrieve the system time from the
cloud instances. We created a table in the database to store the current instance time. Users
logged into the migration system should, after navigating to the Clock tab, observe the clock
timestamp together with the IP of the providing cloud instance on the screen. Because our
instances are located in the same geographic region, they have the same system time. The user
navigates to the Clock tab in their own browser after connecting to the application server over the
Internet; the clock data is not affected by any user input, and no state is saved at execution time
on the instance machine, which is what makes this application truly stateless.

Model Execution Plan for Stateful Applications

In our stateful application example, the Face Recognition application, we give an image as input
to the application. The application detects the faces in the given image, saves them to the
database, extracts the features of each face and compares them with the images in the database.
We integrated model training functionality into the application in order to analyze the migration
with longer application durations and dynamically changing checkpoint file sizes. In other words,
we change the training set size by providing parameters to the model training function in the
source code. This modification was integrated to allow taking metrics on system performance; it
does not affect the functionality of the application, only the application runtime duration and the
complexity of the checkpoint and resume operations. The application server also acts as a bridge
between the client and the cloud instances, which means that although the service provider
address changes after the migration process, the user does not have to navigate to the new
address of the service provider. Because the application depends on user input and is not required
to execute indefinitely as the Clock application is, we need to know when program execution
ends and make the end user wait until it is finished. To achieve that, we save a status flag in the
database located on the database server. This flag holds information on whether the result set has
been updated or not. The application server waits until the flag indicates that execution has
ended; when the flag turns to 1, the result is ready to display. The application server then
retrieves the output from the database, parses the result and renders the page accordingly.
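The wait-for-flag logic described above is a simple polling loop. Below is a sketch against a local SQLite table; the `jobs` schema with `status` and `result` columns is illustrative, since the actual model uses a separate DB server.

```python
import sqlite3
import time

def wait_for_result(db_path, job_id, poll=0.5, timeout=60.0):
    """Block until the job's status flag turns to 1, then return its result."""
    conn = sqlite3.connect(db_path)
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            row = conn.execute(
                "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
            ).fetchone()
            if row and row[0] == 1:   # 1 => execution finished, result ready
                return row[1]
            time.sleep(poll)          # not ready yet; poll again
        raise TimeoutError("execution did not finish before the deadline")
    finally:
        conn.close()
```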

Figure 16: A simple clock application execution plan


VI. Load balancing

Container load balancing helps to deliver traffic management services for containerized
applications efficiently. Today’s developers use containers to quickly test, deploy, and scale
applications through continuous integration and continuous delivery. But the transient and
stateless nature of container-based applications requires different traffic control for optimal
performance. A load balancer in front of the Docker engine will result in higher availability
and scalability of client requests. This ensures uninterrupted performance of the
microservices-based applications running inside the container. The ability to update a single
microservice without disruption is made possible by load balancing Docker containers. When
containers are deployed across a cluster of servers, load balancers running in Docker
containers make it possible for multiple containers to be accessed on the same host port.
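At its core, such a balancer simply rotates incoming requests across healthy container backends. Below is a toy non-weighted round-robin picker; the backend addresses are made up.

```python
import itertools

class RoundRobinBalancer:
    """Pick the next healthy container backend for each incoming request."""
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)      # a failed health check removes it

    def pick(self):
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends left")

lb = RoundRobinBalancer(["10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.4:8080"])
print([lb.pick() for _ in range(4)])       # rotates through the three backends
```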

Working of container load balancing

In addition to management tasks, container load balancing also involves infrastructure services
that make sure the applications are secure and running efficiently. The following are necessary:

• Containers deployed across a cluster of servers.


• Continuously updated without disruption.
• Load balancing for multiple containers on a single host accessed on the same port.
• Secure container communication.
• Monitoring of containers and the cluster.

Benefits of load balancing

The benefits of container load balancing include:

• Balanced distribution — A user-defined load balancing algorithm distributes traffic
evenly (if non-weighted round robin is applied) across the healthiest and most
available backends in the endpoint group.
• Accurate health checks — Direct health checks of the pods are a more accurate way to
determine the health of the backends.
• Better visibility and security — Visibility is possible at the pod or even container
level, depending on the granularity required. The source IP is preserved for easier
tracking of traffic sources.

VII. Novel Proposed Solution

We designed a policy-programmable container migration architecture based on Docker [30].
The policy-based architecture allows us to change policies with a simple configuration file, so
programming the migration mechanism [26][27] is easy. We then tested load balancing policies
within our SDN-based prototype over Mininet. Third, we designed and evaluated novel Moving
Target Defense (MTD) solutions [25] inspired by network coding. The policy-based migration
system can perform software-defined measurement based on the network traffic statistics
obtained through the SDN controller. We developed algorithms to make migration decisions and
applied them to two use cases. The first is load balancing, featuring three policies:
bandwidth-based, shortest path, and random. The second is Moving Target Defense, where the
solutions inspired by network coding also feature three policies: Shamir, Digital Fountain, and a
pseudo-random function.

Figure 17: Architecture for Policy programmable live migration technique


This application allows migration by monitoring the network traffic. The destination host is
selected according to different criteria; we focused on three policies for selecting the destination
(a selection sketch follows the list):

• Random: the destination host is selected at random.
• Bandwidth-based: the destination host is the one with the maximum available outgoing
bandwidth. We define this value as the minimum link capacity of the links in the path.
• Shortest path: leveraging the Floodlight controller, we obtain the network topology and
compute the shortest path for each pair of nodes.
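A minimal sketch of that selection step, with made-up per-host figures standing in for the statistics the prototype would pull from the Floodlight controller:

```python
import random

# Toy view of the topology: available outgoing bandwidth (the minimum link
# capacity along the path, in Mbps) and hop count for each candidate host.
CANDIDATES = {
    "h2": {"bw": 400, "hops": 3},
    "h3": {"bw": 900, "hops": 5},
    "h4": {"bw": 250, "hops": 2},
}

def pick_destination(policy):
    if policy == "random":
        return random.choice(list(CANDIDATES))
    if policy == "bandwidth":
        return max(CANDIDATES, key=lambda h: CANDIDATES[h]["bw"])    # -> h3
    if policy == "shortest_path":
        return min(CANDIDATES, key=lambda h: CANDIDATES[h]["hops"])  # -> h4
    raise ValueError(f"unknown policy: {policy}")

for p in ("random", "bandwidth", "shortest_path"):
    print(p, "->", pick_destination(p))
```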

Figure 18: Network diagram for Policy programmable live migration technique
VIII. Conclusion

Live migration of containers is the most used and preferred technique when compared to VM
migrations and cold migrations of containers. In this survey we have briefly discussed the
procedure of live migration, the challenges faced during it, and the solutions proposed. To ensure
that migration of resources between two physical systems using containers succeeds without
shutting down the system, it is better to combine the proposed solutions discussed above instead
of implementing an individual solution for each problem. Our future work will focus on
proposing a single solution for the three main challenges, i.e., keeping downtime as low as
possible, security, and load balancing, as this decreases the overhead of performing live
migration using containers.
IX. References

[1] N. Pokhrel, "Live Container Migration: Opportunities and Challenges," Aalto University.

[2] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and
Kubernetes," Communications of the ACM, vol. 59, pp. 50–57, 2016.

[3] M. Rapoport, "Userfaultfd and post-copy migration." Available:
http://www.slideshare.net/kerneltlv/userfaultfd-and-postcopy-migration [Online; accessed on
December 18, 2021].

[4] A. Kuznetsov, K. Kolyshkin, and A. Mirkin, "Containers checkpointing and live migration,"
in Ottawa Linux Symposium, 2008.

[5] O. O. Sudakov, Yu. V. Boyko, O. V. Tretyak, T. P. Korotkova, and E. S. Meshcheryakov,
"Process checkpointing and restart system for Linux," Mathematical Machines and Systems,
2003.

[6] E. Pinheiro, "Truly-Transparent Checkpointing of Parallel Applications," Federal University
of Rio de Janeiro (UFRJ).

[7] A. Desai, "Virtual Machine," 2012. Available:
http://searchservervirtualization.techtarget.com/definition/virtualmachine [Online; accessed on
December 18, 2021].

[8] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya, "Cost of Virtual Machine Live
Migration in Clouds: A Performance Evaluation," in 1st International Conference on Cloud
Computing, Berlin, Germany, 2009, pp. 254–265.

[9] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield,
"Live migration of virtual machines," in 2nd Conference on Symposium on Networked Systems
Design & Implementation (NSDI), Volume 2, USENIX Association, 2005.

[10] W. Huang, Q. Gao, J. Liu, and D. K. Panda, "High performance virtual machine migration
with RDMA over modern interconnects," in IEEE International Conference on Cluster
Computing, 2007, pp. 11–20.

[11] Y. Luo, B. Zhang, X. Wang, Z. Wang, Y. Sun, and H. Chen, "Live and incremental
whole-system migration of virtual machines using block-bitmap," in IEEE International
Conference on Cluster Computing, 2008, pp. 99–106.

[12] R. Bradford, E. Kotsovinos, A. Feldmann, and H. Schiöberg, "Live wide-area migration of
virtual machines including local persistent state," in 3rd International Conference on Virtual
Execution Environments, San Diego, California, USA: ACM, 2007.

[13] M. R. Hines, U. Deshpande, and K. Gopalan, "Post-copy live migration of virtual
machines," SIGOPS Operating Systems Review, vol. 43, pp. 14–26, 2009.

[14] "Docker." Available: https://docs.docker.com/get-started/overview [Online; accessed on
December 18, 2021].

[15] D. Kapil, E. S. Pilli, and R. C. Joshi, "Live virtual machine migration techniques: Survey
and research challenges," in 2013 IEEE 3rd International Advance Computing Conference
(IACC), IEEE, 2013, pp. 963–969.

[16] P. Kokkinos, D. Kalogeras, A. Levin, and E. Varvarigos, "Survey: Live migration and
disaster recovery over long-distance networks," ACM Computing Surveys (CSUR), vol. 49,
no. 2, p. 26, 2016.

[17] Z. Mavuş and P. Angın, "A Secure Model for Efficient Live Migration of Containers,"
Middle East Technical University, Ankara, Turkey.

[18] P. K. Sinha, S. S. Doddamani, H. Lu, and K. Gopalan, "mWarp: Accelerating Intra-Host
Live Container Migration via Memory Warping," State University of New York (SUNY) at
Binghamton.

[19] W. Li, A. Kanso, and A. Gherbi, "Leveraging Linux containers to achieve high availability
for cloud services," in Proceedings of the 2015 IEEE International Conference on Cloud
Engineering (IC2E), 2015, pp. 76–83.

[20] Y. Chen, "Checkpoint and Restore of Micro-service in Docker Containers," in Proc. ICMII,
2015, pp. 915–918.

[21] "SSH File Transfer Protocol," https://www.ssh.com/ssh/sftp/ [Online; accessed on
December 18, 2021].

[22] "SSH Public Key Authentication,"
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/ [Online;
accessed on December 18, 2021].

[23] "Checkpoint/Restore In Userspace," https://criu.org/Main_Page, 2012 [Online; accessed on
December 18, 2021].

[24] Y. Chen, "Checkpoint and Restore of Micro-service in Docker Containers," in Proc. ICMII,
2015, pp. 915–918.

[25] M. Azab, B. M. Mokhtar, A. S. Abed, and M. Eltoweissy, "Smart Moving Target Defense
for Linux Container Resiliency," in IEEE 2nd International Conference on Collaboration and
Internet Computing (CIC), 2016, pp. 122–130.

[26] A. A. Mohallel, J. M. Bass, and A. Dehghantanha, "Experimenting with Docker: Linux
container and base OS attack surfaces," in Proc. of the 2016 International Conference on
Information Society (i-Society'16), Dublin, Ireland: Infonomics Society, October 2017,
pp. 17–21.

[27] V. Medina and J. M. Garcia, "A Survey of Migration Mechanisms of Virtual Machines,"
ACM Computing Surveys (CSUR), vol. 46, no. 3, pp. 1–33, January 2014.

[28] "Understanding the SSH Encryption and Connection Process,"
https://www.digitalocean.com/community/tutorials/understanding-the-ssh-encryption-and-connection-process,
2014 [Online; accessed on September 2, 2019].

[29] H. Zhong and J. Nieh, "CRAK: Linux Checkpoint/Restart as a Kernel Module," Department
of Computer Science, Columbia University, Technical Report CUCS-014-01, November 2001.

[30] X. Tao, F. Esposito, A. Sacco, and G. Marchetto, "A Policy-Based Architecture for
Container Migration in Software Defined Infrastructures," Politecnico di Torino, Italy, and Saint
Louis University, USA.

[31] G. Soni and M. Kalra, "Comparative study of live virtual machine migration techniques in
cloud," International Journal of Computer Applications, vol. 84, no. 14, 2013.

[32] "OpenVZ Containers," https://openvz.org/Main, 2005 [Online; accessed on December 18,
2021].

[33] "Containers Live Migration: Behind the Scenes,"
https://www.infoq.com/articles/container-live-migration/ [Online; accessed on December 18,
2021].

[34] "Container Load Balancing," https://avinetworks.com/glossary/container-load-balancing/.
