From Simple Idea to DevOps Darling: A Brief History of Container Technologies
[Timeline infographic, 2000–2016. Milestones recoverable from the graphic:
- SELinux added to the Linux mainline
- Red Hat Enterprise Linux 4 adds SELinux
- Solaris Zones introduces virtualized OS environments
- Google implements generic process containers
- Generic process containers renamed control groups (cgroups) and added to the Linux kernel
- OpenShift 3 launches as a general-purpose container application platform built on Docker format and Kubernetes orchestration
- Red Hat CloudForms adds container management
- Red Hat Enterprise Linux makes .NET Core available as a container]
Docker overview (Docker v19.03)

Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker's methodologies for shipping, testing, and deploying code quickly, you can significantly reduce the delay between writing code and running it in production.

Docker provides tooling and a platform to manage the lifecycle of your containers:

Develop your application and its supporting components using containers.
The container becomes the unit for distributing and testing your application.
When you're ready, deploy your application into your production environment, as a container or an orchestrated service. This works the same whether your production environment is a local data center, a cloud provider, or a hybrid of the two.
Docker Engine
Docker Engine is a client-server application with these major components:
A REST API which specifies interfaces that programs can use to talk to the daemon and
instruct it what to do.
The CLI uses the Docker REST API to control or interact with the Docker daemon through
scripting or direct CLI commands. Many other Docker applications use the underlying API
and CLI.
The daemon creates and manages Docker objects, such as images, containers, networks, and
volumes.
Note: Docker is licensed under the open source Apache 2.0 license.
What can I use Docker for?

Your developers write code locally and share their work with their colleagues using
Docker containers.
They use Docker to push their applications into a test environment and execute
automated and manual tests.
When developers find bugs, they can fix them in the development environment and
redeploy them to the test environment for testing and validation.
When testing is complete, getting the fix to the customer is as simple as pushing the
updated image to the production environment.
Docker’s container-based platform allows for highly portable workloads. Docker containers
can run on a developer’s local laptop, on physical or virtual machines in a data center, on
cloud providers, or in a mixture of environments.
Docker’s portability and lightweight nature also make it easy to dynamically manage
workloads, scaling up or tearing down applications and services as business needs dictate, in
near real time.
Docker architecture
Docker uses a client-server architecture. The Docker client talks to the Docker daemon, which
does the heavy lifting of building, running, and distributing your Docker containers. The
Docker client and daemon can run on the same system, or you can connect a Docker client to
a remote Docker daemon. The Docker client and daemon communicate using a REST API,
over UNIX sockets or a network interface.
Docker registries
A Docker registry stores Docker images. Docker Hub is a public registry that anyone can use,
and Docker is configured to look for images on Docker Hub by default. You can even run
your own private registry. If you use Docker Datacenter (DDC), it includes Docker Trusted
Registry (DTR).
When you use the docker pull or docker run commands, the required images are
pulled from your configured registry. When you use the docker push command, your
image is pushed to your configured registry.
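For example (the registry host below is hypothetical; replace it with your own):

$ docker pull ubuntu                                  # pulled from Docker Hub by default
$ docker tag ubuntu registry.example.com/demo/ubuntu  # retag for a private registry
$ docker push registry.example.com/demo/ubuntu        # pushed to the configured registry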
Docker objects
When you use Docker, you are creating and using images, containers, networks, volumes,
plugins, and other objects. This section is a brief overview of some of those objects.
IMAGES
An image is a read-only template with instructions for creating a Docker container. Often, an
image is based on another image, with some additional customization. For example, you may
build an image which is based on the ubuntu image, but installs the Apache web server
and your application, as well as the configuration details needed to make your application
run.
You might create your own images or you might only use those created by others and
published in a registry. To build your own image, you create a Dockerfile with a simple syntax
for defining the steps needed to create the image and run it. Each instruction in a Dockerfile
creates a layer in the image. When you change the Dockerfile and rebuild the image, only
those layers which have changed are rebuilt. This is part of what makes images so
lightweight, small, and fast, when compared to other virtualization technologies.
CONTAINERS
A container is a runnable instance of an image. You can create, start, stop, move, or delete a
container using the Docker API or CLI. You can connect a container to one or more networks,
attach storage to it, or even create a new image based on its current state.
By default, a container is relatively well isolated from other containers and its host machine.
You can control how isolated a container’s network, storage, or other underlying subsystems
are from other containers or from the host machine.
A container is defined by its image as well as any configuration options you provide to it
when you create or start it. When a container is removed, any changes to its state that are
not stored in persistent storage disappear.
The following command runs an ubuntu container, attaches interactively to your local
command-line session, and runs /bin/bash .
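The flags described in the steps below correspond to the usual interactive invocation:

$ docker run -i -t ubuntu /bin/bash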
When you run this command, the following happens (assuming you are using the default
registry configuration):
1. If you do not have the ubuntu image locally, Docker pulls it from your configured
registry, as though you had run docker pull ubuntu manually.
2. Docker creates a new container, as though you had run a docker container create
command manually.
3. Docker allocates a read-write filesystem to the container, as its final layer. This allows a
running container to create or modify files and directories in its local filesystem.
4. Docker creates a network interface to connect the container to the default network,
since you did not specify any networking options. This includes assigning an IP address
to the container. By default, containers can connect to external networks using the
host machine’s network connection.
5. Docker starts the container and executes /bin/bash . Because the container is
running interactively and attached to your terminal (due to the -i and -t flags),
you can provide input using your keyboard while the output is logged to your terminal.
6. When you type exit to terminate the /bin/bash command, the container stops
but is not removed. You can start it again or remove it.
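To continue the example, the stopped container can be restarted or removed with the usual lifecycle commands (the container name/ID is a placeholder):

$ docker ps -a                    # the exited container is still listed
$ docker start -a -i <container>  # start it again, attached and interactive
$ docker rm <container>           # or remove it for good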
SERVICES
Services allow you to scale containers across multiple Docker daemons, which all work
together as a swarm with multiple managers and workers. Each member of a swarm is a
Docker daemon, and all the daemons communicate using the Docker API. A service allows
you to de!ne the desired state, such as the number of replicas of the service that must be
available at any given time. By default, the service is load-balanced across all worker nodes.
To the consumer, the Docker service appears to be a single application. Docker Engine
supports swarm mode in Docker 1.12 and higher.
Namespaces
Docker uses a technology called namespaces to provide the isolated workspace called the
container. When you run a container, Docker creates a set of namespaces for that container.
These namespaces provide a layer of isolation. Each aspect of a container runs in a separate
namespace and its access is limited to that namespace.
Control groups
Docker Engine on Linux also relies on another technology called control groups ( cgroups ).
A cgroup limits an application to a specific set of resources. Control groups allow Docker
Engine to share available hardware resources to containers and optionally enforce limits and
constraints. For example, you can limit the memory available to a specific container.
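For instance, resource limits can be set per container at run time; a sketch using standard docker run flags:

$ docker run -d -m 512m --cpus 1.5 nginx   # cap the container at 512 MiB of RAM and 1.5 CPUs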
Container format
Docker Engine combines the namespaces, control groups, and UnionFS into a wrapper called
a container format. The default container format is libcontainer . In the future, Docker
may support other container formats by integrating with technologies such as BSD Jails or
Solaris Zones.
Next steps
Read about installing Docker.
Get hands-on experience with the Getting started with Docker tutorial.
In the previous chapter, you deployed your first applications in OpenShift. In this
chapter, we’ll look deeper into your OpenShift cluster and investigate how these
containers isolate their processes on the application node.
Knowledge of how containers work in a platform like OpenShift is some of the
most powerful information in IT right now. This fundamental understanding of
how a container actually works as part of a Linux server informs how systems are
designed and how issues are analyzed when they inevitably occur.
This is a challenging chapter—not because you'll be editing a lot of configurations
and making complex changes, but because we're talking about the fundamental
layers of abstraction that make a container a container. Let's get started by
attempting to define exactly what a container is.
6 The OpenShift internal load balancer is updated with an entry for the DNS
record for the application. This entry will be linked to a component that’s cre-
ated by Kubernetes, which we’ll get to shortly.
7 OpenShift creates an image stream component. In OpenShift, an image stream
monitors the builder image, deployment config, and other components for
changes. If a change is detected, image streams can trigger application rede-
ployments to reflect changes.
Figure 3.1 shows how these components are linked together. When a developer cre-
ates source code and triggers a new application deployment (in this case, using the oc
command-line tool), OpenShift creates the deployment config, image stream, and
build config components.
[Figure 3.1 diagram: a developer's source code and the oc new-app command flow into OpenShift, which creates the build config, image stream, and deployment config, stores the built image in the image registry, and creates the DNS route for external users.]
The build config creates an application-specific custom container image using the
specified builder image and source code. This image is stored in the OpenShift image
registry. The deployment config component creates an application deployment that’s
unique for each version of the application. The image stream is created and monitors
for changes to the deployment config and related images in the internal registry. The
DNS route is also created and will be linked to a Kubernetes object.
In figure 3.1, notice that the users are sitting by themselves with no access to the
application. There is no application. OpenShift depends on Kubernetes, as well as
docker, to get the deployed application to the user. Next, we’ll look at Kubernetes’
responsibilities in OpenShift.
NOTE Typically, a single pod is made up of a single container. But in some sit-
uations, it makes sense to have a single pod consist of multiple containers.
Figure 3.2 illustrates the relationships between the Kubernetes components that are
created. The replication controller dictates how many pods are created for an initial
application deployment and is linked to the OpenShift deployment component.
Also linked to the pod component is a Kubernetes service. The service represents
all the pods deployed by a replication controller. It provides a single IP address in
OpenShift to access your application as it’s scaled up and down on different nodes in
your cluster. The service is the internal IP address that’s referenced in the route cre-
ated in the OpenShift load balancer.
[Figure 3.2 diagram: the OpenShift load balancer, DNS route, and deployment sit above the Kubernetes replication controller, pod, and service.]
Figure 3.2 Kubernetes components that are created when applications are deployed
We’re getting closer to the application itself, but we haven’t gotten there yet. Kuberne-
tes is used to orchestrate containers in an OpenShift cluster. But on each application
node, Kubernetes depends on docker to create the containers for each application
deployment.
NOTE Docker is currently the container runtime for OpenShift. But a new
runtime is supported as of OpenShift 3.9. It’s called cri-o, and you can find
more information at http://cri-o.io.
Kubernetes controls docker to create containers that house the application. These
containers use the custom base image as the starting point for the files that are visible
to applications in the container. Finally, the docker container is associated with the
Kubernetes pod (see figure 3.3).
To isolate the libraries and applications in the container image, along with other
server resources, docker uses Linux kernel components. These kernel-level resources
are the components that isolate the applications in your container from everything
else on the application node. Let’s look at these next.
[Figure 3.3 diagram: the OpenShift image registry and the Kubernetes replication controller feed a pod; the docker-created container holding the application sits inside the pod, on top of the Linux kernel with its control groups, namespaces, and SELinux contexts. Containers are associated with a Kubernetes pod.]
we’ll discuss how you can investigate a container from the application node. From the
point of view of being inside the container, an application only has the resources allo-
cated to it that are included in its unique namespaces. Let’s confirm that next.
In the previous sections, we looked at each individual layer of OpenShift. Let’s put all
of these together before we dive down into the weeds of the Linux kernel.
[Figure 3.5 diagram: developers (source code, oc new-app) and users interact with OpenShift; OpenShift drives Kubernetes (replication controller, pod, service), Kubernetes drives docker containers, and the Linux kernel isolates each container with control groups, namespaces, and SELinux contexts.]
Figure 3.5 OpenShift deployment including components that make up the container
Developers and users interact primarily with OpenShift and its services. OpenShift
works with Kubernetes to ensure that user requests are fulfilled and applications are
delivered consistently according to the developer’s designs.
As you’ll recall, one of the acceptable definitions for a container earlier in this
chapter was that they’re “fancy processes.” We developed this definition by explaining
how a container takes an application process and uses namespaces to limit access to
resources on the host. We’ll continue to develop this definition by interacting with
these fancy processes in more depth in chapters 9 and 10.
Like any other process running on a Linux server, each container has an assigned
process ID (PID) on the application node.
OpenShift uses five Linux namespaces to isolate processes and resources on applica-
tion nodes. Coming up with a concise definition for exactly what a namespace does is
a little difficult. Two analogies best describe their most important properties, if you’ll
forgive a little poetic license:
Namespaces are like paper walls in the Linux kernel. They’re lightweight and
easy to stand up and tear down, but they offer sufficient privacy when they’re in
place.
Namespaces are similar to two-way mirrors. From within the container, only the
resources in the namespace are available. But with proper tooling, you can see
what’s in a namespace from the host system.
The following snippet lists all namespaces for app-cli with lsns:
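Assuming you have already located the container's main PID on the application node (PID 4470 is used for app-cli later in this chapter), the command is:

# lsns -p 4470

Each of the container's mount, UTS, IPC, PID, and network namespaces appears as a row in the output.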
As you can see, the five namespaces that OpenShift uses to isolate applications are as
follows:
Mount—Ensures that only the correct content is available to applications in the
container
Network—Gives each container its own isolated network stack
PID—Provides each container with its own set of PID counters
UTS—Gives each container its own hostname and domain name
IPC—Provides shared memory isolation for each container
There are currently two additional namespaces in the Linux kernel that aren’t used by
OpenShift:
Cgroup—Cgroups are used as a shared resource on an OpenShift node, so this
namespace isn’t required for effective isolation.
What is /usr/bin/pod?
The IPC and network namespaces are associated with a different PID for an applica-
tion called /usr/bin/pod. This is a pseudo-application that’s used for containers cre-
ated by Kubernetes.
Under most circumstances, a pod consists of one container. There are conditions,
however, where a single pod may contain multiple containers. Those situations are
outside the scope of this chapter; but when this happens, all the containers in the
pod share these namespaces. That means they share a single IP address and can
communicate with shared memory devices as though they’re on the same host.
We’ll discuss the five namespaces used by OpenShift with examples, including how
they enhance your security posture and how they isolate their associated resources.
Let’s start with the mount namespace.
The root filesystem, based on the app-cli container image, is a little more difficult to
uncover, but we’ll do that next.
[Figure 3.6 diagram: the cached copy of the app-cli container image from the image registry and the NFS volume added via PVC are both placed into the mount namespace created for the app-cli container, where the initial process launches the applications.]
Figure 3.6 The mount namespace takes selected content and makes it available to the app-cli
applications.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 8G 0 disk
vda1 253:1 0 8G 0 part /
vdb 253:16 0 20G 0 disk
The LV device that the app-cli container uses for storage is recorded in the informa-
tion from docker inspect. To get the LV for your app-cli container, run the following
command:
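A sketch of that command (the inspect template field is an assumption for the devicemapper storage driver; substitute your container's ID):

# docker ps | grep app-cli
# docker inspect -f '{{ .GraphDriver.Data.DeviceName }}' <container-id>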
You can’t see the LV for app-cli because it’s in a different namespace. No, we’re not
kidding. The mount namespace for your application containers is created in a differ-
ent mount namespace from your application node’s operating system.
When the docker daemon starts, it creates its own mount namespace to contain filesys-
tem content for the containers it creates. You can confirm this by running lsns for the
docker process. To get the PID for the main docker process, run the following pgrep com-
mand (the process dockerd-current is the name for the main docker daemon process):
# pgrep -f dockerd-current
Once you have the docker daemon’s PID, you can use lsns to view its namespaces.
You can tell from the output that the docker daemon is using the system namespaces
created by systemd when the server booted, except for the mount namespace:
# lsns -p 2385
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 221 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
4026531837 user 254 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
4026531838 uts 223 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
You can use a command-line tool named nsenter to enter an active namespace for
another application. It’s a great tool to use when you need to troubleshoot a container
that isn’t performing as it should. To use nsenter, you give it a PID for the container
with the --target option and then instruct it regarding which namespaces you want
to enter for that PID:
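A sketch, reusing the dockerd PID (2385) from the lsns output above:

# nsenter --target 2385 --mount /bin/bash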
When you run the command, you arrive at a prompt similar to your previous prompt.
The big difference is that now you’re operating from inside the namespace you speci-
fied. Run mount from within docker’s mount namespace and grep for your app-cli LV
(the output is trimmed for clarity):
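For example (narrow the match further with the device name reported by docker inspect for your container):

# mount | grep devicemapper/mnt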
From inside docker’s mount namespace, the mount command output includes the
mount point for the root filesystem for app-cli. The LV that docker created for app-cli
is mounted on the application node at /var/lib/docker/devicemapper/mnt/8bd64-
cae… (directory name trimmed for clarity).
Go to that directory while in the docker daemon mount namespace, and you’ll find
a directory named rootfs. This directory is the filesystem for your app-cli container:
# ls -al rootfs
total 32
-rw-r--r--. 1 root root 15759 Aug 1 17:24 anaconda-post.log
lrwxrwxrwx. 1 root root 7 Aug 1 17:23 bin -> usr/bin
drwxr-xr-x. 3 root root 18 Sep 14 22:18 boot
drwxr-xr-x. 4 root root 43 Sep 21 23:19 dev
drwxr-xr-x. 53 root root 4096 Sep 21 23:19 etc
-rw-r--r--. 1 root root 7388 Sep 14 22:16 help.1
drwxr-xr-x. 2 root root 6 Nov 5 2016 home
lrwxrwxrwx. 1 root root 7 Aug 1 17:23 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Aug 1 17:23 lib64 -> usr/lib64
drwx------. 2 root root 6 Aug 1 17:23 lost+found
drwxr-xr-x. 2 root root 6 Nov 5 2016 media
drwxr-xr-x. 2 root root 6 Nov 5 2016 mnt
drwxr-xr-x. 4 root root 32 Sep 14 22:05 opt
It’s been quite a journey to uncover the root filesystem for app-cli. You’ve used infor-
mation from the docker daemon to use multiple command-line tools, including
nsenter, to change from the default mount namespace for your server to the namespace
created by the docker daemon. You’ve done a lot of work to find an isolated filesystem.
Docker does this automatically at the request of OpenShift every time a container is cre-
ated. Understanding how this process works, and where the artifacts are created, is
important when you’re using containers every day for your application workloads.
From the point of view of the applications running in the app-cli container, all
that’s available to them is what’s in the rootfs directory, because the mount namespace
created for the container isolates its content (see figure 3.7). Understanding how
mount namespaces function on an application node, and knowing how to enter a
container namespace manually, are invaluable tools when you’re troubleshooting a
container that’s not functioning as designed.
Figure 3.7 The app-cli mount namespace isolates the contents of the rootfs directory.
Press Ctrl-D to exit the docker daemon’s mount namespace and return to the default
namespace for your application node. Next, we’ll discuss the UTS namespace. It won’t
be as involved an investigation as the mount namespace, but the UTS namespace is
useful for an application platform like OpenShift that deploys horizontally scalable
applications across a cluster of servers.
Time sharing
It can be confusing to talk about time sharing when the UTS namespace has nothing
to do with managing the system clock. Time sharing originally referred to multiple
users sharing time on a system simultaneously. Back in the 1970s, when this con-
cept was created, it was a novel idea.
The UTS data structure in the Linux kernel had its beginnings then. This is where the
hostname, domain name, and other system information are retained. If you’d like to
see all the information in that structure, run uname -a on a Linux server. That com-
mand queries the same data structure.
The easiest way to view the hostname for a server is to run the hostname command, as
follows:
# hostname
You could use nsenter to enter the UTS namespace for the app-cli container, the same
way you entered the mount namespace in the previous section. But there are additional
tools that will execute a command in the namespaces for a running container.
NOTE On the application node, if you use the nip.io domain discussed in
appendix A, your hostname should look similar to ocp2.192.168.122.101
.nip.io.
One of those tools is the docker exec command. To get the hostname value for a run-
ning container, pass docker exec a container’s short ID and the same hostname com-
mand you want to run in the container. Docker executes the specified command for
you in the container’s namespaces and returns the value. The hostname for each
OpenShift container is its pod name:
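For example (the short ID is a placeholder; the pod name shown matches the one used later in this chapter):

# docker exec <short-id> hostname
app-cli-1-18k2s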
Each container has its own hostname because of its unique UTS namespace. If you
scale up app-cli, the container in each pod will have a unique hostname as well. The
value of this is identifying data coming from each container in a scaled-up system. To
confirm that each container has a unique hostname, log in to your cluster as your
developer user:
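A sketch of the login (substitute your own master URL and credentials):

$ oc login -u developer https://<master-host>:8443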
The oc command-line tool has functionality that’s similar to docker exec. Instead of
passing in the short ID for the container, however, you can pass it the pod in which
you want to execute the command. After logging in to your oc client, scale the app-cli
application to two pods with the following command:
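Assuming the deployment config is named app-cli, the scale command is:

$ oc scale dc/app-cli --replicas=2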
This will cause an update to your app-cli deployment config and trigger the creation
of a new app-cli pod. You can get the new pod’s name by running the command oc
get pods --show-all=false. The show-all=false option prevents the output of
pods in a Completed state, so you see only active pods in the output.
Because the container hostname is its corresponding pod name in OpenShift, you
know which pod you were working with using docker directly:
To get the hostname from your new pod, use the oc exec command. It’s similar to
docker exec, but instead of a container’s short ID, you use the pod name to specify
where you want the command to run. The hostname for your new pod matches the
pod name, just like your original pod:
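For example (substitute the pod name returned by oc get pods):

$ oc exec <new-pod-name> hostname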
From your application node, run ps with the --ppid option, and include the PID
you obtained for your app-cli container. Here you can see that the process for
PID 4470 is httpd and that it has spawned several other processes:
# ps --ppid 4470
PID TTY TIME CMD
4506 ? 00:00:00 cat
4510 ? 00:00:01 cat
4542 ? 00:02:55 httpd
4544 ? 00:03:01 httpd
4548 ? 00:03:01 httpd
4565 ? 00:03:01 httpd
4568 ? 00:03:01 httpd
4571 ? 00:03:01 httpd
4574 ? 00:03:00 httpd
4577 ? 00:03:01 httpd
6486 ? 00:03:01 httpd
Use oc exec to get the output of ps for the app-cli pod that matches the PID you col-
lected earlier. If you’ve forgotten, you can compare the hostname in the docker con-
tainer to the pod name. From inside the container, don’t use the --ppid option,
because you want to see all the PIDs visible from within the app-cli container.
When you run the following command, the output is similar to that from the previ-
ous command:
$ oc exec app-cli-1-18k2s ps
PID TTY TIME CMD
1 ? 00:00:27 httpd
18 ? 00:00:00 cat
19 ? 00:00:01 cat
20 ? 00:02:55 httpd
22 ? 00:03:00 httpd
26 ? 00:03:00 httpd
43 ? 00:03:00 httpd
46 ? 00:03:01 httpd
49 ? 00:03:01 httpd
52 ? 00:03:00 httpd
55 ? 00:03:00 httpd
60 ? 00:03:01 httpd
83 ? 00:00:00 ps
So far, we’ve discussed how filesystems, hostnames, and PIDs are isolated in a con-
tainer. Next, let’s take a quick look at how shared memory resources are isolated.
If you run nsenter and include a command as the last argument, then instead of
opening an interactive session in that namespace, the command is executed in the
specified namespace and returns the results. Using this tool, you can run the ip
command from your server’s default namespace in the network namespace of your
app-cli container.
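A sketch of that invocation, reusing the app-cli PID (4470) from the previous section:

# nsenter -t 4470 -n /sbin/ip a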
If you compare this to the output from running the /sbin/ip a command on your
host, the differences are obvious. Your application node will have 10 or more active
network interfaces. These represent the physical and software-defined devices that
make OpenShift function securely. But in the app-cli container, you have a container-
specific loopback interface and a single network interface with a unique MAC and
IP address:
The network namespace is the first component in the OpenShift networking solution.
We’ll discuss how network traffic gets in and out of containers in chapter 10, when we
cover OpenShift networking in depth.
In OpenShift, isolating processes doesn’t happen in the application, or even in the
userspace on the application node. This is a key difference between other types of soft-
ware clusters, and even some other container-based solutions. In OpenShift, isolation
and resource limits are enforced in the Linux kernel on the application nodes. Isola-
tion with kernel namespaces provides a much smaller attack surface. An exploit that
would let someone break out from a container would have to exist in the container run-
time or the kernel itself. With OpenShift, as we’ll discuss in depth in chapter 11 when
we examine security principles in OpenShift, configuration of the kernel and the con-
tainer runtime is tightly controlled.
The last point we’d like to make in this chapter echoes how we began the discus-
sion. Fundamental knowledge of how containers work and use the Linux kernel is
invaluable. When you need to manage your cluster or troubleshoot issues when they
arise, this knowledge lets you think about containers in terms of what they’re doing all
the way to the bottom of the Linux kernel. That makes solving issues and creating sta-
ble configurations easier to accomplish.
Before you move on, clean up by reverting back to a single replica of the app-cli
application with the following command:
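Assuming the same deployment config name as before, that command is:

$ oc scale dc/app-cli --replicas=1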
3.4 Summary
OpenShift orchestrates Kubernetes and docker to deploy and manage applica-
tions in containers.
Multiple levels of management are available in your OpenShift cluster that can
be used for different levels of information.
Containers isolate processes using kernel namespaces.
You can interact with namespaces from the host using special applications and
tools.
nixCraft
Linux and Unix tutorials for new and seasoned sysadmin
Each process/command on a Linux or Unix-like system has a current working directory and a root directory. You can change the root directory of a command using the chroot command, which changes the root directory for both the currently running process and its children.
Difficulty: Advanced
Contents
» Syntax
» Examples
» Exit from chrooted jail
» Find out if a service is in a chrooted jail
» Rescue and fix software RAID
» Should I use the chroot feature?
» Options
» See also

chroot is commonly used for:

1. Privilege separation for unprivileged processes such as a web server or DNS server.
2. Setting up a test environment.
3. Running old programs or ABI-incompatible programs without crashing the application or system.
4. System recovery.
5. Reinstalling a bootloader such as GRUB or LILO.
6. Password recovery – resetting a forgotten password, and more.

Purpose

The chroot command changes its current and root directories to the provided directory and then runs the command, if supplied, or an interactive copy of the user's login shell. Please note that not every application can be chrooted.

Syntax

chroot /path/to/new/root command
OR
chroot /path/to/new/root /path/to/server
In this example, we build a mini-jail for testing purposes, containing only the bash and ls commands.
First, set the jail location using the mkdir command:
$ J=$HOME/jail
$ mkdir -p $J
$ mkdir -p $J/{bin,lib64,lib}
$ cd $J
$ cp -v /bin/{bash,ls} $J/bin
Sample outputs:

Copy the required libs into $J for the ls command. Use the ldd command to print the shared library dependencies of ls:

$ ldd /bin/ls

Sample outputs:
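The copy step depends on the library paths in the ldd output above, so it varies by distribution. A minimal sketch that recreates the needed directory structure inside the jail for both binaries and then enters the jail:

$ for bin in /bin/bash /bin/ls; do ldd "$bin" | grep -o '/[^ ]*' | \
    while read -r lib; do mkdir -p "$J$(dirname "$lib")"; cp -v "$lib" "$J$lib"; done; done
$ sudo chroot $J /bin/bash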
# ls /
# ls /etc/
# ls /var/
Type exit
$ exit
You can easily find out if Postfix mail server is chrooted or not using the following two
commands:
pid=$(pidof -s master)
ls -ld /proc/$pid/root
PID 8613 is pointing to / (root), i.e. the root directory for the application has not been
changed (it is not chrooted). This is a quick and dirty way to find out whether an application
is chrooted without opening configuration files. Here is another example, from a chrooted
nginx server:
pid=$(pidof -s master)
ls -ld /proc/$pid/root
Sample outputs:
I'm assuming that a software-RAID-based Linux system is not booting, so you booted the
system either using a live CD or a network-based remote rescue kernel to fix it. In this
example, I boot a RHEL-based system using a live Linux DVD/CD and chroot into /dev/sda1
and/or /dev/md0 to fix the problem:
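A rescue-session sketch (device names and the bootloader command depend on your layout and release):

# mkdir -p /mnt/sysimage
# mount /dev/sda1 /mnt/sysimage            # or /dev/md0 for the RAID device
# mount --bind /dev  /mnt/sysimage/dev
# mount --bind /proc /mnt/sysimage/proc
# mount --bind /sys  /mnt/sysimage/sys
# chroot /mnt/sysimage
# grub-install /dev/sda                    # reinstall the bootloader, fix configs, etc.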
# Get out of chrooted jail and reboot or format the server as per your
needs ;)
exit
umount {dev,sys,[...],}
reboot
But wait, there's more!
Should you use the chroot feature all the time? In the above example, the program is
fairly simple, but you may end up with several different kinds of problems, such as:
1. Do not forget to update chrooted apps when you upgrade apps locally.
2. Not every app can or should be chrooted.
3. Any app which has to assume root privileges to operate is pointless to attempt to
chroot, as root can generally escape a chroot.
4. Chroot is not a silver bullet. Learn how to secure and harden the rest of the system too.
Comments
You should really have a look at lxc. I’ve tried it on Fedora. It even comes with scripts to
initialize a minimum Fedora or Debian inside a container. You can just install the
necessary libraries and software using the package manager.
I see some similarities between Docker and chroot, although they are made for quite
different purposes.
I would recommend:
#!/bin/bash
JAIL="$(pwd)" #-- if you run this in your jail directory
echo -n "Give command name to search libs for: "
read bashcmd
[ ! -e "$bashcmd" ] && echo "$bashcmd command not found. Aborting." && exit
list="$(ldd "$bashcmd" | egrep -o ' /.*/lib.*\.[0-9]* ' | sed 's/ //g')"
for i in $list; do mkdir -p "$JAIL$(dirname "$i")"; cp -v "$i" "$JAIL${i}"; done
What it does:
It copies the whole directory structure of the dependencies so you won’t get an error if
a subfolder like $HOME/jail/lib/x86_64-linux-gnu/ doesn’t exist.
Aren’t we supposed to `chdir` a folder first before ‘chroot’-ing it as it will tighten holes
in chroot environment?
There are other options that will automate this for you. I’m personally fond of
debootstrap.
a marvelously written document. indeed, very useful for experts and newbies too.
thank you very much for your contribution.
clone(): this creates a new process and attaches it to a new, specified namespace
There are six namespaces currently in use by LXC, with more being developed:
Let's have a look at each in more detail and see some userspace examples, to help us better understand what
happens under the hood.
Mount namespaces
Mount namespaces first appeared in kernel 2.4.19 in 2002 and provided a separate view of the filesystem mount
points for the process and its children. When mounting or unmounting a filesystem, the change will be noticed by
all processes because they all share the same default namespace. When the CLONE_NEWNS flag is passed to the
clone() system call, the new process gets a copy of the calling process mount tree that it can then change
without affecting the parent process. From that point on, all mounts and unmounts in the default namespace will
be visible in the new namespace, but changes in the per-process mount namespaces will not be noticed outside of
it.
#define _GNU_SOURCE
#include <sched.h>
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);
An example call that creates a child process in a new mount namespace passes the CLONE_NEWNS flag to clone().
When the child process is created, it executes the childFunc function, which will perform its work in the new
mount namespace.
The util-linux package provides userspace tools that implement the unshare() call, which effectively
unshares the indicated namespaces from the parent process.
To illustrate this:
2 Next, move the current bash process to its own mount namespace by passing the mount flag to
unshare :
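A sketch using the unshare utility from util-linux:

root@server:~# unshare -m /bin/bash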
3 The bash process is now in a separate namespace. Let's check the associated inode number of the
namespace:
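One way to check it (the inode number shown is illustrative):

root@server:~# readlink /proc/$$/ns/mnt
mnt:[4026532211]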
5 Also, make sure you can see the mount point from the newly created namespace:
As expected, the mount point is visible because it is part of the namespace we created and the current
bash process is running from.
6 Next, start a new terminal session and display the namespace inode ID from it:
Notice how it's different from the mount namespace on the other terminal.
Not surprisingly, the mount point is not visible from the default namespace.
Note
In the context of LXC, mount namespaces are useful because they provide a way for a different filesystem layout to exist
inside the container. It's worth mentioning that before the mount namespaces, a similar process confinement could be
achieved with the chroot() system call, however chroot does not provide the same per-process isolation as
mount namespaces do.
UTS namespaces
Unix Timesharing (UTS) namespaces provide isolation for the hostname and domain name, so that each LXC
container can maintain its own identifier as returned by the hostname -f command. This is needed for most
applications that rely on a properly set hostname.
To create a bash session in a new UTS namespace, we can use the unshare utility again, which uses the
unshare() system call to create the namespace and the execve() system call to execute bash in it:
root@server:~# hostname
server
root@server:~# unshare -u /bin/bash
root@server:~# hostname uts-namespace
root@server:~# hostname
uts-namespace
root@server:~# cat /proc/sys/kernel/hostname
uts-namespace
root@server:~#
As the preceding output shows, the hostname inside the namespace is now uts-namespace .
Next, from a different terminal, check the hostname again to make sure it has not changed:
root@server:~# hostname
server
root@server:~#
To see the actual system calls that the unshare command uses, we can run the strace utility:
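A sketch of such a trace, filtering on the calls of interest:

root@server:~# strace -f -e trace=unshare,execve unshare -u /bin/bash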
From the output we can see that the unshare command is indeed using the unshare() and execve()
system calls and the CLONE_NEWUTS flag to specify new UTS namespace.
IPC namespaces
The Interprocess Communication (IPC) namespaces provide isolation for a set of IPC and synchronization
facilities. These facilities provide a way of exchanging data and synchronizing the actions between threads and
processes. They provide primitives such as semaphores, file locks, and mutexes among others, that are needed to
have true process separation in a container.
PID namespaces
The Process ID (PID) namespaces provide the ability for a process to have an ID that already exists in the default
namespace, for example an ID of 1 . This allows for an init system to run in a container with various other
processes, without causing a collision with the rest of the PIDs on the same OS.
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
First, we include the headers and define the childFunc function that the clone() system call will use.
The function prints out the child PID using the getpid() system call:
In the main() function, we specify the stack size and call clone() , passing the child function
childFunc , the stack pointer, the CLONE_NEWPID flag, and the SIGCHLD signal. The CLONE_NEWPID
flag instructs clone() to create a new PID namespace and the SIGCHLD flag notifies the parent process
when one of its children terminates. The parent process will block on waitpid() if the child process has not
terminated.
From the output, we can see that the child process has a PID of 1 inside its namespace and 17705
otherwise.
Note
Note that error handling has been omitted from the code examples for brevity.
User namespaces
The user namespaces allow a process inside a namespace to have a different user and group ID than that in the
default namespace. In the context of LXC, this allows for a process to run as root inside the container, while
having a non-privileged ID outside. This adds a thin layer of security, because breaking out of the container will
result in a non-privileged user. This has been possible since kernel 3.8, which introduced the ability for
non-privileged processes to create user namespaces.
To create a new user namespace as a non-privileged user and have root inside, we can use the unshare
utility. The following example builds a recent util-linux from source and then maps the root user inside the new namespace:
root@ubuntu:~# cd /usr/src/
root@ubuntu:/usr/src# wget https://www.kernel.org/pub/linux/utils/util-linux/v2.28/util-linux-2.28.tar.gz
root@ubuntu:/usr/src# tar zxfv util-linux-2.28.tar.gz
root@ubuntu:/usr/src# cd util-linux-2.28/
root@ubuntu:/usr/src/util-linux-2.28# ./configure
root@ubuntu:/usr/src/util-linux-2.28# make && make install
root@ubuntu:/usr/src/util-linux-2.28# unshare --map-root-user --user sh -c whoami
root
root@ubuntu:/usr/src/util-linux-2.28#
We can also use the clone() system call with the CLONE_NEWUSER flag to create a process in a user
namespace, as demonstrated by the following program:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
After compilation and execution, the output looks similar to this when run as root - UID of 0 :
Network namespaces
Network namespaces provide isolation of the networking resources, such as network devices, addresses, routes,
and firewall rules. This effectively creates a logical copy of the network stack, allowing multiple processes to listen
on the same port from multiple namespaces. This is the foundation of networking in LXC and there are quite a lot
of other use cases where this can come in handy.
The iproute2 package provides very useful userspace tools that we can use to experiment with the network
namespaces, and is installed by default on almost all Linux systems.
There's always the default network namespace, referred to as the root namespace, where all network interfaces
are initially assigned. To list the network interfaces that belong to the default namespace run the following
command:
root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
root@server:~#
root@server:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
inet 10.1.32.40/24 brd 10.1.32.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::cd5:eff:feb0:a347/64 scope link
valid_lft forever preferred_lft forever
root@server:~#
Also, to list the routes from the root network namespace, execute the following:
root@server:~# ip r s
default via 10.1.32.1 dev eth0
10.1.32.0/24 dev eth0 proto kernel scope link src 10.1.32.40
root@server:~#
Let's create two new network namespaces called ns1 and ns2 and list them:
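Using the iproute2 tools, for example:

root@server:~# ip netns add ns1
root@server:~# ip netns add ns2
root@server:~# ip netns
ns2
ns1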
Now that we have the new network namespaces, we can execute commands inside them:
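For example, to list the interfaces inside ns1 (output abbreviated):

root@server:~# ip netns exec ns1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00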
The preceding output shows that in the ns1 namespace, there's only one network interface, the loopback -
lo interface, and it's in a DOWN state.
We can also start a new bash session inside the namespace and list the interfaces in a similar way:
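For example:

root@server:~# ip netns exec ns1 bash
root@server:~# ip link
root@server:~# exit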
This is more convenient for running multiple commands than specifying each, one at a time. The two network
namespaces are not of much use if not connected to anything, so let's connect them to each other. To do this we'll
use a software bridge called Open vSwitch.
Open vSwitch works like a regular network bridge: it forwards frames between the virtual ports that we
define. Virtual machines such as KVM and Xen, and LXC or Docker containers, can then be connected to it.
Most Debian-based distributions such as Ubuntu provide a package, so let's install that:
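On Ubuntu the package is openvswitch-switch:

root@server:~# apt-get install -y openvswitch-switch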
This installs and starts the Open vSwitch daemon. Time to create the bridge; we'll name it OVS-1 :
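For example:

root@server:~# ovs-vsctl add-br OVS-1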
Note
If you would like to experiment with the latest version of Open vSwitch, you can download the source code from
http://openvswitch.org/download/ (http://openvswitch.org/download/) and compile it.
The newly created bridge can now be seen in the root namespace:
root@server:~# ip a s OVS-1
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
inet6 fe80::f0d9:78ff:fe72:3d77/64 scope link
valid_lft forever preferred_lft forever
root@server:~#
In order to connect both network namespaces, let's first create a virtual pair of interfaces for each namespace:
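A sketch with the iproute2 veth type, using the interface names described below:

root@server:~# ip link add eth1-ns1 type veth peer name veth-ns1
root@server:~# ip link add eth1-ns2 type veth peer name veth-ns2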
The preceding two commands create four virtual interfaces eth1-ns1 , eth1-ns2 and veth-ns1 ,
veth-ns2 . The names are arbitrary.
To list all interfaces that are part of the root network namespace, run:
root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
6: eth1-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
8: eth1-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
root@server:~#
Let's assign the eth1-ns1 and eth1-ns2 interfaces to the ns1 and ns2 namespaces:
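For example:

root@server:~# ip link set eth1-ns1 netns ns1
root@server:~# ip link set eth1-ns2 netns ns2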
Also, confirm they are visible from inside each network namespace:
Notice, how each network namespace now has two interfaces assigned – loopback and eth1-ns* .
If we list the devices from the root namespace, we should see that the interfaces we just moved to ns1 and
ns2 namespaces are no longer visible:
root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
root@server:~#
It's time to connect the other end of the two virtual pipes, the veth-ns1 and veth-ns2 interfaces to the
bridge:
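Using ovs-vsctl, for example:

root@server:~# ovs-vsctl add-port OVS-1 veth-ns1
root@server:~# ovs-vsctl add-port OVS-1 veth-ns2
root@server:~# ovs-vsctl show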
From the preceding output, it's apparent that the bridge now has two ports, veth-ns1 and veth-ns2 .
The last thing left to do is bring the network interfaces up and assign IP addresses:
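A sketch (the 192.168.0.0/24 addresses are arbitrary examples):

root@server:~# ip link set veth-ns1 up
root@server:~# ip link set veth-ns2 up
root@server:~# ip netns exec ns1 ip link set lo up
root@server:~# ip netns exec ns1 ip link set eth1-ns1 up
root@server:~# ip netns exec ns1 ip addr add 192.168.0.1/24 dev eth1-ns1
root@server:~# ip netns exec ns2 ip link set lo up
root@server:~# ip netns exec ns2 ip link set eth1-ns2 up
root@server:~# ip netns exec ns2 ip addr add 192.168.0.2/24 dev eth1-ns2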
With this, we established a connection between both ns1 and ns2 network namespaces through the
Open vSwitch bridge. To confirm, let's use ping :
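Assuming the example addresses above:

root@server:~# ip netns exec ns1 ping -c 3 192.168.0.2
root@server:~# ip netns exec ns2 ping -c 3 192.168.0.1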
Open vSwitch allows for assigning VLAN tags to network interfaces, resulting in traffic isolation between
namespaces. This can be helpful in a scenario where you have multiple namespaces and you want to have
connectivity between some of them.
The following example demonstrates how to tag the virtual interfaces on the ns1 and ns2 namespaces, so
that the traffic will not be visible from each of the two network namespaces:
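For example, putting the two ports in different VLANs:

root@server:~# ovs-vsctl set port veth-ns1 tag=100
root@server:~# ovs-vsctl set port veth-ns2 tag=200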
Both the namespaces should now be isolated in their own VLANs and ping should fail:
We can also use the unshare utility that we saw in the mount and UTS namespaces examples to create a new
network namespace:
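For example:

root@server:~# unshare -n /bin/bash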
The cgroups we are most interested in are described in the following table:
Cgroups are organized in hierarchies, represented as directories in a Virtual File System (VFS). Similar to
process hierarchies, where every process is a descendant of the init or systemd process, cgroups inherit
some of the properties of their parents. Multiple cgroups hierarchies can exist on the system, each one
representing a single or group of resources. It is possible to have hierarchies that combine two or more
subsystems, for example, memory and I/O, and tasks assigned to a group will have limits applied on those
resources.
Note
If you are interested in how the different subsystems are implemented in the kernel, install the kernel source and have a
look at the C files, shown in the third column of the table.
The following diagram helps visualize a single hierarchy that has two subsystems—CPU and I/O—attached to it:
Let's have a look at few practical examples on how to use cgroups to limit resources. This will help us get a better
understanding of how containers work.
Let's assume we have two applications running on a server that are heavily I/O bound: app1 and app2 . We
would like to give more bandwidth to app1 during the day and to app2 during the night. This type of I/O
throughput prioritization can be accomplished using the blkio subsystem.
First, let's attach the blkio subsystem by mounting the cgroup VFS:
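A sketch using the /cgroup mount point referenced later in this section:

root@server:~# mkdir -p /cgroup/blkio
root@server:~# mount -t cgroup -o blkio blkio /cgroup/blkio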
Next, create two priority groups, which will be part of the same blkio hierarchy:
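For example:

root@server:~# mkdir /cgroup/blkio/high_io /cgroup/blkio/low_io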
We need to acquire the PIDs of the app1 and app2 processes and assign them to the high_io and
low_io groups:
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/blkio/high_io/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/blkio/low_io/tasks; done
root@server:~#
The tasks file is where we define the processes/tasks to which the limit should be applied.
Finally, let's set a ratio of 10:1 for the high_io and low_io cgroups. Tasks in those cgroups will
immediately use only the resources made available to them:
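A sketch of setting the weights:
root@server:~# echo 1000 > /cgroup/blkio/high_io/blkio.weight
root@server:~# echo 100 > /cgroup/blkio/low_io/blkio.weight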
The blkio.weight file defines the weight of I/O access available to a process or group of processes, with
values ranging from 100 to 1,000. In this example, the values of 1000 and 100 create a ratio of 10:1.
With this, the low-priority application, app2, will use only about 10 percent of the I/O operations available,
whereas the high-priority application, app1, will use about 90 percent.
If you list the contents of the high_io directory on Ubuntu you will see the following files:
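The exact set of files varies with the kernel version and the I/O scheduler in use, but it can be inspected with:
root@server:~# ls -la /cgroup/blkio/high_io/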
From the preceding output you can see that only some files are writeable. This depends on various OS settings,
such as what I/O scheduler is being used.
We've already seen what the tasks and blkio.weight files are used for. The following is a short
description of the most commonly used files in the blkio subsystem:
File                              Description
blkio.io_queued                   Total number of read/write, sync, or async requests queued up at any given time
blkio.io_service_bytes            The number of bytes transferred to or from the specified device
blkio.io_wait_time                Total amount of time the I/O operations spent waiting in the scheduler queues for the specified device
blkio.throttle.io_service_bytes   The number of bytes transferred to or from the disk
blkio.throttle.io_serviced        The number of I/O operations issued to the specified disk
blkio.weight_device               Same as blkio.weight, but specifies a block device to apply the limit on
Tip
One thing to keep in mind is that changes made by writing to these files directly will not persist after the server restarts. Later
in this chapter, you will learn how to use the userspace tools to generate persistent configuration files.
The memory subsystem performs resource accounting, such as tracking the utilization of anonymous pages,
file caches, swap caches, and general hierarchical accounting, all of which comes with an overhead. Because of this,
the memory cgroup is disabled by default on some Linux distributions. If the commands below fail,
you'll need to enable it by specifying the following GRUB parameter and restarting:
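A sketch for a Debian/Ubuntu-style system; the GRUB file location and the /cgroup/memory mount point are assumptions:
# append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   cgroup_enable=memory swapaccount=1
root@server:~# update-grub
root@server:~# reboot
root@server:~# mkdir -p /cgroup/memory
root@server:~# mount -t cgroup -o memory memory /cgroup/memory
root@server:~# mkdir /cgroup/memory/app1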
Similar to the blkio subsystem, the tasks file is used to specify the PIDs of the processes we are adding to
the cgroup hierarchy, and memory.limit_in_bytes specifies how much memory is to be made available, in
bytes.
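A sketch, using app1 as the example process and an illustrative 1 GB limit:
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/memory/app1/tasks; done
root@server:~# echo 1G > /cgroup/memory/app1/memory.limit_in_bytes
root@server:~# cat /cgroup/memory/app1/memory.limit_in_bytes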
The files and their function in the memory subsystem are described in the following table:
File Description
Limiting the memory available to a process might trigger the Out of Memory (OOM) killer, which might kill the
running task. If this is not the desired behavior and you prefer the process to be suspended waiting for memory to
be freed, the OOM killer can be disabled:
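A sketch, reusing the app1 hierarchy from the previous example; writing 1 to memory.oom_control disables the OOM killer for that group:
root@server:~# echo 1 > /cgroup/memory/app1/memory.oom_control
root@server:~# cat /cgroup/memory/app1/memory.oom_control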
The memory cgroup exposes a wide range of accounting statistics in the memory.stat file, which can be of
interest:
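The counters can be read directly; which ones appear depends on the kernel version:
root@server:~# cat /cgroup/memory/app1/memory.stat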
If you need to start a new task in the app1 memory hierarchy, you can move the current shell process into the
tasks file; all other processes started in this shell will be direct descendants and will inherit the same cgroup
properties:
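A sketch; $$ expands to the PID of the current shell:
root@server:~# echo $$ > /cgroup/memory/app1/tasks
root@server:~# cat /proc/self/cgroup | grep memory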
The cpuset subsystem allows for assigning CPU cores to a set of tasks, similar to the taskset command
in Linux.
The main benefits that the cpu and cpuset subsystems provide are better utilization per processor core
for highly CPU-bound applications. They also allow for distributing load between cores that are otherwise idle at
certain times of the day. In the context of multitenant environments running many LXC containers, the cpu and
cpuset cgroups allow for creating different instance sizes and container flavors, for example, exposing only a
single core per container with 40 percent scheduled work time.
As an example, let's assume we have two processes app1 and app2 , and we would like app1 to use 60
percent of the CPU time and app2 only 40 percent. We start by mounting the cgroup VFS:
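A sketch, assuming /cgroup/cpu as the mount point, followed by the creation of the two groups:
root@server:~# mkdir -p /cgroup/cpu
root@server:~# mount -t cgroup -o cpu cpu /cgroup/cpu
root@server:~# mkdir /cgroup/cpu/limit_60_percent /cgroup/cpu/limit_40_percent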
Also assign CPU shares for each, where app1 will get 60 percent and app2 will get 40 percent of the
scheduled time:
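cpu.shares is a relative weight (the default is 1024), so values of 600 and 400 give the desired 60/40 split when both groups are busy:
root@server:~# echo 600 > /cgroup/cpu/limit_60_percent/cpu.shares
root@server:~# echo 400 > /cgroup/cpu/limit_40_percent/cpu.shares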
Finally, assign the app1 and app2 PIDs to the tasks files of their respective cgroups:
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/cpu/limit_60_percent/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/cpu/limit_40_percent/tasks; done
root@server:~#
To demonstrate how the cpuset subsystem works, let's create two cpuset hierarchies named app1 and
app2. The app1 cgroup will contain CPUs 0 and 1, while the app2 cgroup will contain only CPU 1:
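A sketch, assuming /cgroup/cpuset as the mount point; note that cpuset.mems must also be set before tasks can be attached (memory node 0 is assumed, as on a single-socket machine):
root@server:~# mkdir -p /cgroup/cpuset
root@server:~# mount -t cgroup -o cpuset cpuset /cgroup/cpuset
root@server:~# mkdir /cgroup/cpuset/app1 /cgroup/cpuset/app2
root@server:~# echo 0-1 > /cgroup/cpuset/app1/cpuset.cpus
root@server:~# echo 1 > /cgroup/cpuset/app2/cpuset.cpus
root@server:~# echo 0 > /cgroup/cpuset/app1/cpuset.mems
root@server:~# echo 0 > /cgroup/cpuset/app2/cpuset.mems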
The most commonly used files in the cpuset subsystem are described in the following table:
File                              Description
cpuset.cpu_exclusive              Checks if other cpuset hierarchies share the settings defined in the current group
cpuset.cpus                       List of the physical numbers of the CPUs on which processes in that cpuset are allowed to execute
cpuset.mem_exclusive              Checks if the cpuset should have exclusive use of its memory nodes
cpuset.mem_hardwall               Checks if each task's user allocations should be kept separate
cpuset.memory_migrate             Checks if a page in memory should migrate to a new node if the values in cpuset.mems change
cpuset.memory_pressure            Contains the running average of the memory pressure created by the processes
cpuset.memory_spread_page         Checks if filesystem buffers should spread evenly across the memory nodes
cpuset.memory_spread_slab         Checks if kernel slab caches for file I/O operations should spread evenly across the cpuset
cpuset.mems                       Specifies the memory nodes that tasks in this cgroup are permitted to access
cpuset.sched_load_balance         Checks if the kernel should balance load across the CPUs in the cpuset by moving processes from overloaded CPUs to less utilized CPUs
cpuset.sched_relax_domain_level   Contains the width of the range of CPUs across which the kernel should attempt to balance loads
notify_on_release                 Checks if the hierarchy should receive special handling after it is released and no processes are using it
The next example shows how to suspend the execution of the top process, check its state, and then resume it.
First, mount the freezer subsystem and create the new hierarchy:
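A sketch, assuming /cgroup/freezer as the mount point:
root@server:~# mkdir -p /cgroup/freezer
root@server:~# mount -t cgroup -o freezer freezer /cgroup/freezer
root@server:~# mkdir /cgroup/freezer/frozen_group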
In a new terminal, start the top process and observe how it periodically refreshes. Back in the original
terminal, add the PID of top to the frozen_group tasks file and observe its state:
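A sketch; freezer.state accepts the values FROZEN and THAWED (FREEZING is only ever reported, never written):
root@server:~# echo $(pidof top) > /cgroup/freezer/frozen_group/tasks
root@server:~# echo FROZEN > /cgroup/freezer/frozen_group/freezer.state
root@server:~# cat /cgroup/freezer/frozen_group/freezer.state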
Notice how the top process output is not refreshing anymore, and upon inspection of its status file, you can see
that it is now in the blocked state.
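To confirm the state and then resume the process, a sketch:
root@server:~# grep State /proc/$(pidof top)/status
root@server:~# echo THAWED > /cgroup/freezer/frozen_group/freezer.state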
A commonly used file in the freezer subsystem is described below:
File                      Description
freezer.self_freezing     Shows the self-state. Shows 0 if the self-state is THAWED; otherwise, 1.
To address this, there are packages that provide userspace tools and daemons that are quite easy to use. Let's see
a few examples.
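On an Ubuntu system the tools used below typically come from packages such as cgroup-bin, cgroup-lite, and cgmanager; treat the exact package names as an assumption, since they vary between distributions and releases. Once installed, the cgroups-mount helper mounts the available subsystems:
root@server:~# apt-get update && apt-get install -y cgroup-bin cgroup-lite cgmanager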
root@server:~# cgroups-mount
root@server:~# cat /proc/mounts | grep cgroup
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-ag
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agen
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-re
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,release_agent=/run/cgmanager/agents/cgm-release-ag
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cp
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,release_agent=/run/cgmanager/agents/cgm-release-
Notice from the preceding output the location of the cgroups: /sys/fs/cgroup. This is the default location
on many Linux distributions, and in most cases the various subsystems have already been mounted.
To verify what cgroup subsystems are in use, we can check with the following commands:
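Two common checks; lssubsys is part of the libcgroup tools:
root@server:~# cat /proc/cgroups
root@server:~# lssubsys -am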
Next, let's create a blkio hierarchy and add an already running process to it with cgclassify . This is
similar to what we did earlier, by creating the directories and the files by hand:
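A sketch using cgcreate and cgclassify, with app1 and app2 standing in for the processes used earlier:
root@server:~# cgcreate -g blkio:high_io -g blkio:low_io
root@server:~# cgclassify -g blkio:high_io $(pidof app1)
root@server:~# cgclassify -g blkio:low_io $(pidof app2)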
Now that we have defined the high_io and low_io cgroups and added a process to them, let's generate a
configuration file that can be used later to reapply the setup:
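A sketch using cgsnapshot from the libcgroup tools; the output file path is illustrative:
root@server:~# cgsnapshot -s > /tmp/cgconfig_snapshot.conf
root@server:~# head /tmp/cgconfig_snapshot.conf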
To start a new process in the high_io group, we can use the cgexec command:
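A sketch that starts an interactive bash shell inside the high_io group:
root@server:~# cgexec -g blkio:high_io bash
root@server:~# cat /sys/fs/cgroup/blkio/high_io/tasks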
In the preceding example, we started a new bash process in the high_io cgroup , as confirmed by looking
at the tasks file.
To move an already running process to the memory subsystem, first we create the high_prio and
low_prio groups and move the task with cgclassify :
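A sketch; the groups are created in both the cpu and memory hierarchies, matching the lscgroup output shown further down:
root@server:~# cgcreate -g cpu,memory:high_prio -g cpu,memory:low_prio
root@server:~# cgclassify -g cpu,memory:high_prio $(pidof app1)
root@server:~# cgclassify -g cpu,memory:low_prio $(pidof app2)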
To set the memory and CPU limits, we can use the cgset command. Recall that earlier we used the
echo command to manually write the PIDs and memory limits to the tasks and
memory.limit_in_bytes files:
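A sketch of cgset with illustrative values:
root@server:~# cgset -r memory.limit_in_bytes=1G high_prio
root@server:~# cgset -r cpu.shares=1000 high_prio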
To see how the cgroup hierarchies look, we can use the lscgroup utility:
root@server:~# lscgroup
cpuset:/
cpu:/
cpu:/low_prio
cpu:/high_prio
cpuacct:/
memory:/
memory:/low_prio
memory:/high_prio
devices:/
freezer:/
blkio:/
blkio:/low_io
blkio:/high_io
perf_event:/
hugetlb:/
root@server:~#
The preceding output confirms the existence of the blkio , memory , and cpu hierarchies and their
children.
Once finished, you can delete the hierarchies with cgdelete , which deletes the respective directories on the
VFS:
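A sketch that removes the groups created above:
root@server:~# cgdelete -g cpu,memory:high_prio -g cpu,memory:low_prio
root@server:~# cgdelete -g blkio:high_io -g blkio:low_io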
To completely clear the cgroups, we can use the cgclear utility, which will unmount the cgroup directories:
root@server:~# cgclear
root@server:~# lscgroup
cgroups can't be listed: Cgroup is not mounted
root@server:~#
If multiple services are running on the server, the CPU resources will be shared equally among them by default,
because systemd assigns equal weights to each. To change this behavior for an application, we can edit its
service file and define the CPU shares, allocated memory, and I/O.
The following example demonstrates how to change the CPU shares, memory, and I/O limits for the nginx process:
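A sketch using a systemd drop-in file; the values are illustrative, and CPUShares, MemoryLimit, and BlockIOWeight are the legacy cgroup v1 directive names:
root@server:~# mkdir -p /etc/systemd/system/nginx.service.d
root@server:~# cat << EOF > /etc/systemd/system/nginx.service.d/limits.conf
[Service]
CPUShares=2000
MemoryLimit=1G
BlockIOWeight=500
EOF
root@server:~# systemctl daemon-reload
root@server:~# systemctl restart nginx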
This will create and update the necessary control files under /sys/fs/cgroup and apply the limits.