
EVOLUTION OF CONTAINERS

From simple idea to DevOps darling, a brief history of container technologies.

2000: Jails added to FreeBSD.

2001: Jacques Gélinas creates the Linux-VServer project with kernel-level isolation.

2003: SELinux added to the Linux mainline kernel.

2005: Red Hat Enterprise Linux 4 adds SELinux. Solaris Zones introduce virtualized OS environments.

2006: Google implements generic process containers.

2007: Generic process containers are renamed control groups (cgroups) and added to the Linux kernel.
AIX Workload Partitions (WPARs) allow users to virtualize the OS.

2008: The Linux Containers (LXC) project improves the user experience.

2009: Kernel and user namespaces help define privileges.

2010: Development of systemd begins. Red Hat Enterprise Linux 6 launches with cgroups and namespaces.

2011: Red Hat launches OpenShift as the first enterprise PaaS based on Linux containers.

2013: The Docker project is announced by dotCloud in a PyCon lightning talk. Red Hat announces its
intent to bring Docker technology to Red Hat Enterprise Linux and OpenShift.

2014: Red Hat Enterprise Linux 7 launches with Docker support. A redesign of cgroups increases
consistency and manageability. Google announces Kubernetes, a container orchestration system.
A prototype version of rkt becomes available on GitHub.

2015: Red Hat launches Container Certification and the Container Development Kit (CDK) for ISV
partners. Red Hat Enterprise Linux Atomic Host launches with Docker and Kubernetes. OpenShift 3
launches as a general-purpose container application platform built on the Docker format and
Kubernetes orchestration. The Open Container Initiative (OCI) and the Cloud Native Computing
Foundation are announced.

2016: Red Hat CloudForms adds container management. Red Hat Enterprise Linux makes .NET Core
available as a container. Red Hat OpenShift Container Platform is announced.


Docker overview

Docker is an open platform for developing, shipping, and running applications. Docker
enables you to separate your applications from your infrastructure so you can deliver
software quickly. With Docker, you can manage your infrastructure in the same ways you
manage your applications. By taking advantage of Docker’s methodologies for shipping,
testing, and deploying code quickly, you can significantly reduce the delay between writing
code and running it in production.

The Docker platform

Docker provides the ability to package and run an application in a loosely isolated
environment called a container. The isolation and security allow you to run many containers
simultaneously on a given host. Containers are lightweight because they don’t need the extra
load of a hypervisor, but run directly within the host machine’s kernel. This means you can
run more containers on a given hardware combination than if you were using virtual
machines. You can even run Docker containers within host machines that are actually virtual
machines!

Docker provides tooling and a platform to manage the lifecycle of your containers:

Develop your application and its supporting components using containers.
The container becomes the unit for distributing and testing your application.
When you’re ready, deploy your application into your production environment, as a
container or an orchestrated service. This works the same whether your production
environment is a local data center, a cloud provider, or a hybrid of the two.

Docker Engine
Docker Engine is a client-server application with these major components:

A server which is a type of long-running program called a daemon process (the
dockerd command).

A REST API which specifies interfaces that programs can use to talk to the daemon and
instruct it what to do.

A command line interface (CLI) client (the docker command).

The CLI uses the Docker REST API to control or interact with the Docker daemon through
scripting or direct CLI commands. Many other Docker applications use the underlying API
and CLI.

The daemon creates and manages Docker objects, such as images, containers, networks, and
volumes.

Note: Docker is licensed under the open source Apache 2.0 license.

For more details, see Docker Architecture below.
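Because the daemon exposes a REST API, you can talk to it without the CLI at all. A minimal
sketch, assuming a local daemon listening on the default UNIX socket and an API version in
the 1.40 range (adjust the version to match your Engine):

$ curl --unix-socket /var/run/docker.sock http://localhost/v1.40/containers/json   # same data the CLI shows
$ docker ps                                                                         # the equivalent CLI command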

What can I use Docker for?

Fast, consistent delivery of your applications

Docker streamlines the development lifecycle by allowing developers to work in standardized
environments using local containers which provide your applications and services.
Containers are great for continuous integration and continuous delivery (CI/CD) workflows.

Consider the following example scenario:

Your developers write code locally and share their work with their colleagues using
Docker containers.
They use Docker to push their applications into a test environment and execute
automated and manual tests.
When developers find bugs, they can fix them in the development environment and
redeploy them to the test environment for testing and validation.
When testing is complete, getting the fix to the customer is as simple as pushing the
updated image to the production environment.

Responsive deployment and scaling

Docker’s container-based platform allows for highly portable workloads. Docker containers
can run on a developer’s local laptop, on physical or virtual machines in a data center, on
cloud providers, or in a mixture of environments.

Docker’s portability and lightweight nature also make it easy to dynamically manage
workloads, scaling up or tearing down applications and services as business needs dictate, in
near real time.

Running more workloads on the same hardware

Docker is lightweight and fast. It provides a viable, cost-effective alternative to hypervisor-
based virtual machines, so you can use more of your compute capacity to achieve your
business goals. Docker is perfect for high density environments and for small and medium
deployments where you need to do more with fewer resources.

Docker architecture
Docker uses a client-server architecture. The Docker client talks to the Docker daemon, which
does the heavy lifting of building, running, and distributing your Docker containers. The
Docker client and daemon can run on the same system, or you can connect a Docker client to
a remote Docker daemon. The Docker client and daemon communicate using a REST API,
over UNIX sockets or a network interface.

The Docker daemon


The Docker daemon ( dockerd ) listens for Docker API requests and manages Docker objects
such as images, containers, networks, and volumes. A daemon can also communicate with
other daemons to manage Docker services.

The Docker client


The Docker client ( docker ) is the primary way that many Docker users interact with Docker.
When you use commands such as docker run , the client sends these commands to
dockerd , which carries them out. The docker command uses the Docker API. The Docker
client can communicate with more than one daemon.

Docker registries
A Docker registry stores Docker images. Docker Hub is a public registry that anyone can use,
and Docker is configured to look for images on Docker Hub by default. You can even run
your own private registry. If you use Docker Datacenter (DDC), it includes Docker Trusted
Registry (DTR).

When you use the docker pull or docker run commands, the required images are
pulled from your configured registry. When you use the docker push command, your
image is pushed to your configured registry.
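For example, a minimal pull/tag/push round trip might look like the following sketch, where
registry.example.com stands in for your own private registry host:

$ docker pull ubuntu                                   # pulled from Docker Hub, the default registry
$ docker tag ubuntu registry.example.com/demo/ubuntu   # re-tag the image for a private registry
$ docker push registry.example.com/demo/ubuntu         # pushed to that configured registry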

Docker objects
When you use Docker, you are creating and using images, containers, networks, volumes,
plugins, and other objects. This section is a brief overview of some of those objects.

IMAGES

An image is a read-only template with instructions for creating a Docker container. Often, an
image is based on another image, with some additional customization. For example, you may
build an image which is based on the ubuntu image, but installs the Apache web server
and your application, as well as the configuration details needed to make your application
run.

You might create your own images or you might only use those created by others and
published in a registry. To build your own image, you create a Dockerfile with a simple syntax
for defining the steps needed to create the image and run it. Each instruction in a Dockerfile
creates a layer in the image. When you change the Dockerfile and rebuild the image, only
those layers which have changed are rebuilt. This is part of what makes images so
lightweight, small, and fast, when compared to other virtualization technologies.
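As an illustration of the ubuntu-plus-Apache example above, a minimal Dockerfile might look
like the following sketch (the package names, paths, and ./site directory are illustrative,
not taken from the official docs):

FROM ubuntu:18.04                                   # base image: each instruction below adds a layer
RUN apt-get update && apt-get install -y apache2    # install the Apache web server
COPY ./site/ /var/www/html/                         # add your application's files
EXPOSE 80                                           # document the port Apache listens on
CMD ["apachectl", "-D", "FOREGROUND"]               # run Apache in the foreground

Build it with docker build -t my-apache-app . and, on later builds, only the layers affected
by a change are rebuilt.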

CONTAINERS

A container is a runnable instance of an image. You can create, start, stop, move, or delete a
container using the Docker API or CLI. You can connect a container to one or more networks,
attach storage to it, or even create a new image based on its current state.

By default, a container is relatively well isolated from other containers and its host machine.
You can control how isolated a container’s network, storage, or other underlying subsystems
are from other containers or from the host machine.

A container is defined by its image as well as any configuration options you provide to it
when you create or start it. When a container is removed, any changes to its state that are
not stored in persistent storage disappear.

Example docker run command

The following command runs an ubuntu container, attaches interactively to your local
command-line session, and runs /bin/bash .

$ docker run -i -t ubuntu /bin/bash

When you run this command, the following happens (assuming you are using the default
registry configuration):

1. If you do not have the ubuntu image locally, Docker pulls it from your configured
registry, as though you had run docker pull ubuntu manually.

2. Docker creates a new container, as though you had run a
docker container create command manually.

3. Docker allocates a read-write filesystem to the container, as its final layer. This allows a
running container to create or modify files and directories in its local filesystem.

4. Docker creates a network interface to connect the container to the default network,
since you did not specify any networking options. This includes assigning an IP address
to the container. By default, containers can connect to external networks using the
host machine’s network connection.

5. Docker starts the container and executes /bin/bash . Because the container is
running interactively and attached to your terminal (due to the -i and -t flags),
you can provide input using your keyboard while the output is logged to your terminal.

6. When you type exit to terminate the /bin/bash command, the container stops
but is not removed. You can start it again or remove it.
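The follow-up commands for step 6 are short; a quick sketch (get the container ID or name
from docker ps -a):

$ docker ps -a                  # find the stopped container's ID or name
$ docker start <container-id>   # start the same container again
$ docker rm <container-id>      # or remove it permanently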

SERVICES

Services allow you to scale containers across multiple Docker daemons, which all work
together as a swarm with multiple managers and workers. Each member of a swarm is a
Docker daemon, and all the daemons communicate using the Docker API. A service allows
you to define the desired state, such as the number of replicas of the service that must be
available at any given time. By default, the service is load-balanced across all worker nodes.
To the consumer, the Docker service appears to be a single application. Docker Engine
supports swarm mode in Docker 1.12 and higher.
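As a sketch of what declaring that desired state looks like (this assumes a swarm has already
been initialized with docker swarm init, and nginx is just an example image):

$ docker service create --name web --replicas 3 -p 8080:80 nginx   # desired state: three replicas
$ docker service ls                                                 # check how many replicas are running
$ docker service scale web=5                                        # change the desired state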

The underlying technology


Docker is written in Go and takes advantage of several features of the Linux kernel to deliver
its functionality.

Namespaces
Docker uses a technology called namespaces to provide the isolated workspace called the
container. When you run a container, Docker creates a set of namespaces for that container.

These namespaces provide a layer of isolation. Each aspect of a container runs in a separate
namespace and its access is limited to that namespace.

Docker Engine uses namespaces such as the following on Linux:

The pid namespace: Process isolation (PID: Process ID).


The net namespace: Managing network interfaces (NET: Networking).
The ipc namespace: Managing access to IPC resources (IPC: InterProcess
Communication).
The mnt namespace: Managing filesystem mount points (MNT: Mount).
The uts namespace: Isolating kernel and version identifiers. (UTS: Unix
Timesharing System).
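One way to observe these namespaces from the host is to compare the namespace links under
/proc for your own shell and for a container's main process. A hedged sketch (the nginx image
and ns-demo name are placeholders, and the IDs will differ on your system):

$ docker run -d --name ns-demo nginx                              # start a throwaway container
$ readlink /proc/self/ns/pid /proc/self/ns/net                    # namespace IDs for your host shell
$ docker inspect -f '{{ .State.Pid }}' ns-demo                    # host PID of the container's main process
$ sudo readlink /proc/<that-pid>/ns/pid /proc/<that-pid>/ns/net   # different IDs: the container has its own namespaces
$ docker rm -f ns-demo                                            # clean up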

Control groups
Docker Engine on Linux also relies on another technology called control groups ( cgroups ).
A cgroup limits an application to a specific set of resources. Control groups allow Docker
Engine to share available hardware resources to containers and optionally enforce limits and
constraints. For example, you can limit the memory available to a specific container.
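For example, a hedged sketch of setting such limits with flags that map onto cgroup
controllers:

$ docker run -d --memory=512m --cpus=1.5 --name limited nginx   # cap memory at 512 MiB and CPU at 1.5 cores
$ docker stats --no-stream limited                              # observe usage against those limits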

Union file systems


Union file systems, or UnionFS, are file systems that operate by creating layers, making them
very lightweight and fast. Docker Engine uses UnionFS to provide the building blocks for
containers. Docker Engine can use multiple UnionFS variants, including AUFS, btrfs, vfs, and
DeviceMapper.
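To see which storage backend your own installation uses, and how an image breaks down into
layers, a quick sketch:

$ docker info --format '{{ .Driver }}'   # prints the storage driver, for example overlay2 or devicemapper
$ docker history ubuntu                  # one row per layer in the ubuntu image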

Container format
Docker Engine combines the namespaces, control groups, and UnionFS into a wrapper called
a container format. The default container format is libcontainer . In the future, Docker
may support other container formats by integrating with technologies such as BSD Jails or
Solaris Zones.

Next steps
Read about installing Docker.
Get hands-on experience with the Getting started with Docker tutorial.




Containers are Linux

This chapter covers


How OpenShift, Kubernetes, and docker work together
How containers isolate processes with namespaces

In the previous chapter, you deployed your first applications in OpenShift. In this
chapter, we’ll look deeper into your OpenShift cluster and investigate how these
containers isolate their processes on the application node.
Knowledge of how containers work in a platform like OpenShift is some of the
most powerful information in IT right now. This fundamental understanding of
how a container actually works as part of a Linux server informs how systems are
designed and how issues are analyzed when they inevitably occur.
This is a challenging chapter—not because you’ll be editing a lot of configura-
tions and making complex changes, but because we’re talking about the funda-
mental layers of abstraction that make a container a container. Let’s get started by
attempting to define exactly what a container is.

3.1 Defining containers


You can find five different container experts and ask them to define what a con-
tainer is, and you’re likely to get five different answers. The following are some of
our personal favorites, all of which are correct from a certain perspective:


A transportable unit to move applications around. This is a typical developer’s
answer.
A fancy Linux process (one of our personal favorites).
A more effective way to isolate processes on a Linux system. This is a more
operations-centered answer.
What we need to untangle is the fact that they’re all correct, depending on your point
of view.
In chapter 1, we talked about how OpenShift uses Kubernetes and docker to
orchestrate and deploy applications in containers in your cluster. But we haven’t
talked much about which application component is created by each of these services.
Before we move forward, it’s important for you to understand these responsibilities as
you begin interacting with application components directly.

3.2 How OpenShift components work together


When you deploy an application in OpenShift, the request starts in the OpenShift
API. We discussed this process at a high level in chapter 2. To really understand how
containers isolate the processes within them, we need to take a more detailed look at
how these services work together to deploy your application. The relationship
between OpenShift, Kubernetes, docker, and, ultimately, the Linux kernel is a chain
of dependencies.
When you deploy an application in OpenShift, the process starts with the Open-
Shift services.

3.2.1 OpenShift manages deployments


Deploying applications begins with application components that are unique to Open-
Shift. The process is as follows:
1 OpenShift creates a custom container image using your source code and the
builder image template you specified. For example, app-cli and app-gui use the
PHP builder image.
2 This image is uploaded to the OpenShift container image registry.
3 OpenShift creates a build config to document how your application is built.
This includes which image was created, the builder image used, the location of
the source code, and other information.
4 OpenShift creates a deployment config to control deployments and deploy and
update your applications. Information in deployment configs includes the
number of replicas, the upgrade method, and application-specific variables and
mounted volumes.
5 OpenShift creates a deployment, which represents a single deployed version of
an application. Each unique application deployment is associated with your
application’s deployment config component.

6 The OpenShift internal load balancer is updated with an entry for the DNS
record for the application. This entry will be linked to a component that’s cre-
ated by Kubernetes, which we’ll get to shortly.
7 OpenShift creates an image stream component. In OpenShift, an image stream
monitors the builder image, deployment config, and other components for
changes. If a change is detected, image streams can trigger application rede-
ployments to reflect changes.
Figure 3.1 shows how these components are linked together. When a developer cre-
ates source code and triggers a new application deployment (in this case, using the oc
command-line tool), OpenShift creates the deployment config, image stream, and
build config components.
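For reference, a hedged sketch of what triggering such a deployment looks like from the
command line (the Git URL is a placeholder for your application's source repository; the
exact command you ran in chapter 2 may differ):

$ oc new-app php~https://github.com/example/app-cli.git --name=app-cli   # S2I build using the PHP builder image
$ oc get bc,is,dc                                                         # the build config, image stream, and deployment config OpenShift created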

Figure 3.1 Application components created by OpenShift during application deployment



The build config creates an application-specific custom container image using the
specified builder image and source code. This image is stored in the OpenShift image
registry. The deployment config component creates an application deployment that’s
unique for each version of the application. The image stream is created and monitors
for changes to the deployment config and related images in the internal registry. The
DNS route is also created and will be linked to a Kubernetes object.
In figure 3.1, notice that the users are sitting by themselves with no access to the
application. There is no application. OpenShift depends on Kubernetes, as well as
docker, to get the deployed application to the user. Next, we’ll look at Kubernetes’
responsibilities in OpenShift.

3.2.2 Kubernetes schedules applications across nodes


Kubernetes is the orchestration engine at the heart of OpenShift. In many ways, an
OpenShift cluster is a Kubernetes cluster. When you initially deployed app-cli, Kuber-
netes created several application components:
Replication controller—Scales the application as needed in Kubernetes. This com-
ponent also ensures that the desired number of replicas in the deployment con-
fig is maintained at all times.
Service—Exposes the application. A Kubernetes service is a single IP address
that’s used to access all the active pods for an application deployment. When
you scale an application up or down, the number of pods changes, but they’re
all accessed through a single service.
Pods—Represent the smallest scalable unit in OpenShift.

NOTE Typically, a single pod is made up of a single container. But in some sit-
uations, it makes sense to have a single pod consist of multiple containers.

Figure 3.2 illustrates the relationships between the Kubernetes components that are
created. The replication controller dictates how many pods are created for an initial
application deployment and is linked to the OpenShift deployment component.
Also linked to the pod component is a Kubernetes service. The service represents
all the pods deployed by a replication controller. It provides a single IP address in
OpenShift to access your application as it’s scaled up and down on different nodes in
your cluster. The service is the internal IP address that’s referenced in the route cre-
ated in the OpenShift load balancer.

NOTE The relationship between deployments and replication controllers is


how applications are deployed, scaled, and upgraded. When changes are
made to a deployment config, a new deployment is created, which in turn cre-
ates a new replication controller. The replication controller then creates the
desired number of pods within the cluster, which is where your application is
actually deployed.
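If you want to see these objects for yourself, the oc client can list them directly; a quick
sketch (run inside the project where app-cli is deployed):

$ oc get rc,services,pods       # replication controllers, services, and pods in the current project
$ oc describe service app-cli   # the single IP address that fronts every app-cli pod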

Figure 3.2 Kubernetes components that are created when applications are deployed

We’re getting closer to the application itself, but we haven’t gotten there yet. Kuberne-
tes is used to orchestrate containers in an OpenShift cluster. But on each application
node, Kubernetes depends on docker to create the containers for each application
deployment.

3.2.3 Docker creates containers


Docker is a container runtime. A container runtime is the application on a server that
creates, maintains, and removes containers. A container runtime can act as a stand-
alone tool on a laptop or a single server, but it’s at its most powerful when being
orchestrated across a cluster by a tool like Kubernetes.

NOTE Docker is currently the container runtime for OpenShift. But a new
runtime is supported as of OpenShift 3.9. It’s called cri-o, and you can find
more information at http://cri-o.io.

Kubernetes controls docker to create containers that house the application. These
containers use the custom base image as the starting point for the files that are visible
to applications in the container. Finally, the docker container is associated with the
Kubernetes pod (see figure 3.3).
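If you're curious, you can see the containers docker has created by querying the docker
daemon on the application node itself; a minimal sketch (the generated container names come
from Kubernetes and will differ on your node):

# docker ps --filter name=app-cli           # running containers whose generated name includes app-cli
# docker ps --format '{{.ID}} {{.Names}}'   # or list the ID and generated name of every container on the node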
To isolate the libraries and applications in the container image, along with other
server resources, docker uses Linux kernel components. These kernel-level resources
are the components that isolate the applications in your container from everything
else on the application node. Let’s look at these next.

Figure 3.3 Docker containers are associated with Kubernetes pods.

3.2.4 Linux isolates and limits resources


We’re down to the core of what makes a container a container in OpenShift and
Linux. Docker uses three Linux kernel components to isolate the applications run-
ning in containers it creates and limit their access to resources on the host:
Linux namespaces—Provide isolation for the resources running in the container.
Although the term is the same, this is a different concept than Kubernetes
namespaces (http://mng.bz/X8yz), which are roughly analogous to an Open-
Shift project. We’ll discuss these in more depth in chapter 7. For the sake of
brevity, in this chapter, when we reference namespaces, we’re talking about
Linux namespaces.
Control groups (cgroups)—Provide maximum, guaranteed access limits for CPU
and memory on the application node. We’ll look at cgroups in depth in chapter 9.
SELinux contexts—Prevent the container applications from improperly access-
ing resources on the host or in other containers. An SELinux context is a
unique label that’s applied to a container’s resources on the application node.
This unique label prevents the container from accessing anything that doesn’t
have a matching label on the host. We’ll discuss SELinux contexts in more
depth in chapter 11.
The docker daemon creates these kernel resources dynamically when the container is
created. These resources are associated with the applications that are launched for the
corresponding container; your application is now running in a container (figure 3.4).
Applications in OpenShift are run and associated with these kernel components.
They provide the isolation that you see from inside a container. In upcoming sections,

Figure 3.4 Linux kernel components used to isolate containers

we’ll discuss how you can investigate a container from the application node. From the
point of view of being inside the container, an application only has the resources allo-
cated to it that are included in its unique namespaces. Let’s confirm that next.

Userspace and kernelspace


A Linux server is separated into two primary resource groups: the userspace and the
kernelspace. The userspace is where applications run. Any process that isn’t part of
the kernel is considered part of the userspace on a Linux server.
The kernelspace is the kernel itself. Without special administrator privileges like
those the root user has, users can’t make changes to code that’s running in the ker-
nelspace.
The applications in a container run in the userspace, but the components that isolate
the applications in the container run in the kernelspace. That means containers are
isolated using kernel components that can’t be modified from inside the container.

In the previous sections, we looked at each individual layer of OpenShift. Let’s put all
of these together before we dive down into the weeds of the Linux kernel.

3.2.5 Putting it all together


The automated workflow that’s executed when you deploy an application in Open-
Shift includes OpenShift, Kubernetes, docker, and the Linux kernel. The interactions
and dependencies stretch across multiple services, as outlined in figure 3.5.

Figure 3.5 OpenShift deployment including components that make up the container

Developers and users interact primarily with OpenShift and its services. OpenShift
works with Kubernetes to ensure that user requests are fulfilled and applications are
delivered consistently according to the developer’s designs.
As you’ll recall, one of the acceptable definitions for a container earlier in this
chapter was that they’re “fancy processes.” We developed this definition by explaining
how a container takes an application process and uses namespaces to limit access to
resources on the host. We’ll continue to develop this definition by interacting with
these fancy processes in more depth in chapters 9 and 10.
Like any other process running on a Linux server, each container has an assigned
process ID (PID) on the application node.

3.3 Application isolation with kernel namespaces


Armed with the PID for the current app-cli container, you can begin to analyze how
containers isolate process resources with Linux namespaces. Earlier in this chapter, we
discussed how kernel namespaces are used to isolate the applications in a container
from the other processes on a host. Docker creates a unique set of namespaces to iso-
late the resources in each container. Looking again at figure 3.4, the application is
linked to the namespaces because they’re unique for each container. Cgroups and
SELinux are both configured to include information for a newly created container,
but those kernel resources are shared among all containers running on the applica-
tion node.
To get a list of the namespaces that were created for app-cli, use the lsns com-
mand. You need the PID for app-cli to pass as a parameter to lsns. Appendix C walks
you through how to use the docker daemon to get the host PID for a container, along
with some other helpful docker commands. Use this appendix as a reference to get
the host PID for your app-cli container.
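As a shortcut, the docker daemon can report that host PID directly; a hedged sketch using the
container ID that appears later in this chapter (your ID and PID will differ):

# docker inspect -f '{{ .State.Pid }}' fae8e211e7a7
4470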
The lsns command accepts a PID with the -p option and outputs the namespaces
associated with that PID. The output for lsns has the following six columns:
NS—Inode associated with the namespace
TYPE—Type of namespace created
NPROCS—Number of processes associated with the namespace
PID—Process used to create the namespace
USER—User that owns the namespace
COMMAND—Command executed to launch the process to create the namespace
When you run the command, the output from lsns shows six namespaces for app-cli.
Five of these namespaces are unique to app-cli and provide the container isolation
that we’re discussing in this chapter. There are also two additional namespaces in
Linux that aren’t used directly by OpenShift. The user namespace isn’t currently used
by OpenShift, and the cgroup namespace is shared between all containers on
the system.

NOTE On an OpenShift application node, the user namespace is shared


across all applications on the host. The user namespace was created by PID 1
on the host, has over 200 processes associated with it, and is associated with
the systemd command. The other namespaces associated with the app-cli PID
have far fewer processes and aren’t owned by PID 1 on the host.

OpenShift uses five Linux namespaces to isolate processes and resources on applica-
tion nodes. Coming up with a concise definition for exactly what a namespace does is
a little difficult. Two analogies best describe their most important properties, if you’ll
forgive a little poetic license:
Namespaces are like paper walls in the Linux kernel. They’re lightweight and
easy to stand up and tear down, but they offer sufficient privacy when they’re in
place.
Namespaces are similar to two-way mirrors. From within the container, only the
resources in the namespace are available. But with proper tooling, you can see
what’s in a namespace from the host system.
The following snippet lists all namespaces for app-cli with lsns:

# lsns -p 4470
        NS TYPE NPROCS   PID USER       COMMAND
4026531837 user    254     1 root       /usr/lib/systemd/systemd --switched-root --system --deserialize 20
4026532211 mnt      12  4470 1000080000 httpd -D FOREGROUND
4026532212 uts      12  4470 1000080000 httpd -D FOREGROUND
4026532213 pid      12  4470 1000080000 httpd -D FOREGROUND
4026532420 ipc      13  3476 1001       /usr/bin/pod
4026532423 net      13  3476 1001       /usr/bin/pod

As you can see, the five namespaces that OpenShift uses to isolate applications are as
follows:
Mount—Ensures that only the correct content is available to applications in the
container
Network—Gives each container its own isolated network stack
PID—Provides each container with its own set of PID counters
UTS—Gives each container its own hostname and domain name
IPC—Provides shared memory isolation for each container
There are currently two additional namespaces in the Linux kernel that aren’t used by
OpenShift:
Cgroup—Cgroups are used as a shared resource on an OpenShift node, so this
namespace isn’t required for effective isolation.

User—This namespace can map a user in a container to a different user on the


host. For example, a user with ID 0 in the container could have user ID 5000
when interacting with resources outside the container. This feature can be
enabled in OpenShift, but there are issues with performance and node configu-
ration that fall out of scope for our example cluster. If you’d like more informa-
tion on enabling the user namespace to work with docker, and thus with
OpenShift, see the article “Hardening Docker Hosts with User Namespaces” by
Chris Binnie (Linux.com, http://mng.bz/Giwd).

What is /usr/bin/pod?
The IPC and network namespaces are associated with a different PID for an applica-
tion called /usr/bin/pod. This is a pseudo-application that’s used for containers cre-
ated by Kubernetes.
Under most circumstances, a pod consists of one container. There are conditions,
however, where a single pod may contain multiple containers. Those situations are
outside the scope of this chapter; but when this happens, all the containers in the
pod share these namespaces. That means they share a single IP address and can
communicate with shared memory devices as though they’re on the same host.

We’ll discuss the five namespaces used by OpenShift with examples, including how
they enhance your security posture and how they isolate their associated resources.
Let’s start with the mount namespace.

3.3.1 The mount namespace


The mount namespace isolates filesystem content, ensuring that content assigned to
the container by OpenShift is the only content available to the processes running in
the container. The mount namespace for the app-cli container allows the applications
in the container to access only the content in the custom app-cli container image, and
any information stored on the persistent volume associated with the persistent volume
claim (PVC) for app-cli (see figure 3.6).

NOTE Applications always need persistent storage. Persistent storage allows


data to persist when a pod is removed from the cluster. It also allows data to
be shared between multiple pods when needed. You’ll learn how to configure
and use persistent storage on an NFS server with OpenShift in chapter 7.

The root filesystem, based on the app-cli container image, is a little more difficult to
uncover, but we’ll do that next.

Figure 3.6 The mount namespace takes selected content and makes it available to the app-cli
applications.

ACCESSING CONTAINER ROOT FILESYSTEMS


When you configured OpenShift, you specified a block device for docker to use for
container storage. Your OpenShift configuration uses logical volume management
(LVM) on this device for container storage. Each container gets its own logical vol-
ume (LV) when it’s created. This storage solution is fast and scales well for large pro-
duction clusters.
To view all LVs created by docker on your host, run the lsblk command. This
command shows all block devices on your host, as well as any LVs. It confirms that
docker has been creating LVs for your containers:

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 8G 0 disk
vda1 253:1 0 8G 0 part /
vdb 253:16 0 20G 0 disk

vdb1 253:17 0 20G 0 part
docker_vg-docker--pool_tmeta 252:0 0 24M 0 lvm
docker_vg-docker--pool 252:2 0 8G 0 lvm
docker-253:1-10125-e27ee79f... 252:3 0 10G 0 dm
docker-253:1-10125-6ec90d0f... 252:4 0 10G 0 dm
...
docker_vg-docker--pool_tdata 252:1 0 8G 0 lvm
docker_vg-docker--pool 252:2 0 8G 0 lvm
docker-253:1-10125-e27ee79f... 252:3 0 10G 0 dm
docker-253:1-10125-6ec90d0f... 252:4 0 10G 0 dm
...

The LV device that the app-cli container uses for storage is recorded in the informa-
tion from docker inspect. To get the LV for your app-cli container, run the following
command:

docker inspect -f '{{ .GraphDriver.Data.DeviceName }}' fae8e211e7a7

You’ll get a value similar to
docker-253:1-10125-8bd64caed0421039e83ee4f1cdcbcf25708e3da97081d43a99b6d20a3eb09c98.
This is the name for the LV that’s being used as the root filesystem for the app-cli container.
Unfortunately, when you run the following mount command to see where this LV is
mounted, you don’t get any results:

mount | grep docker-253:1-10125-


➥ 8bd64caed0421039e83ee4f1cdcbcf25708e3da97081d43a99b6d20a3eb09c9

You can’t see the LV for app-cli because it’s in a different namespace. No, we’re not
kidding. The mount namespace for your application containers is created in a differ-
ent mount namespace from your application node’s operating system.
When the docker daemon starts, it creates its own mount namespace to contain filesys-
tem content for the containers it creates. You can confirm this by running lsns for the
docker process. To get the PID for the main docker process, run the following pgrep com-
mand (the process dockerd-current is the name for the main docker daemon process):

# pgrep -f dockerd-current

Once you have the docker daemon’s PID, you can use lsns to view its namespaces.
You can tell from the output that the docker daemon is using the system namespaces
created by systemd when the server booted, except for the mount namespace:

# lsns -p 2385
NS TYPE NPROCS PID USER COMMAND
4026531836 pid 221 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
4026531837 user 254 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
4026531838 uts 223 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20

4026531839 ipc 221 1 root /usr/lib/systemd/systemd --switched-root


➥ --system --deserialize 20
4026531956 net 223 1 root /usr/lib/systemd/systemd --switched-root
➥ --system --deserialize 20
4026532298 mnt 12 2385 root /usr/bin/dockerd-current --add-runtime
➥ docker-runc=/usr/libexec/docker/docker-runc-current
--default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd
➥ --userland-proxy-p

You can use a command-line tool named nsenter to enter an active namespace for
another application. It’s a great tool to use when you need to troubleshoot a container
that isn’t performing as it should. To use nsenter, you give it a PID for the container
with the --target option and then instruct it regarding which namespaces you want
to enter for that PID:

$ nsenter --target 2385 --mount

When you run the command, you arrive at a prompt similar to your previous prompt.
The big difference is that now you’re operating from inside the namespace you speci-
fied. Run mount from within docker’s mount namespace and grep for your app-cli LV
(the output is trimmed for clarity):

mount | grep docker-253:1-10125-8bd64cae...


/dev/mapper/docker-253:1-10125-8bd64cae... on ➥
/var/lib/docker/devicemapper/mnt/8bd64cae... type xfs (rw,relatime,➥
context="system_u:object_r:svirt_sandbox_file_t:s0:c4,c9",nouuid,attr2,inode64,
➥ sunit=1024,swidth=1024,noquota)

From inside docker’s mount namespace, the mount command output includes the
mount point for the root filesystem for app-cli. The LV that docker created for app-cli
is mounted on the application node at /var/lib/docker/devicemapper/mnt/8bd64-
cae… (directory name trimmed for clarity).
Go to that directory while in the docker daemon mount namespace, and you’ll find
a directory named rootfs. This directory is the filesystem for your app-cli container:

# ls -al rootfs
total 32
-rw-r--r--. 1 root root 15759 Aug 1 17:24 anaconda-post.log
lrwxrwxrwx. 1 root root 7 Aug 1 17:23 bin -> usr/bin
drwxr-xr-x. 3 root root 18 Sep 14 22:18 boot
drwxr-xr-x. 4 root root 43 Sep 21 23:19 dev
drwxr-xr-x. 53 root root 4096 Sep 21 23:19 etc
-rw-r--r--. 1 root root 7388 Sep 14 22:16 help.1
drwxr-xr-x. 2 root root 6 Nov 5 2016 home
lrwxrwxrwx. 1 root root 7 Aug 1 17:23 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Aug 1 17:23 lib64 -> usr/lib64
drwx------. 2 root root 6 Aug 1 17:23 lost+found
drwxr-xr-x. 2 root root 6 Nov 5 2016 media
drwxr-xr-x. 2 root root 6 Nov 5 2016 mnt
drwxr-xr-x. 4 root root 32 Sep 14 22:05 opt

drwxr-xr-x. 2 root root 6 Aug 1 17:23 proc
dr-xr-x---. 2 root root 137 Aug 1 17:24 root
drwxr-xr-x. 11 root root 145 Sep 13 15:35 run
lrwxrwxrwx. 1 root root 8 Aug 1 17:23 sbin -> usr/sbin
...

It’s been quite a journey to uncover the root filesystem for app-cli. You’ve used infor-
mation from the docker daemon to use multiple command-line tools, including
nsenter, to change from the default mount namespace for your server to the namespace
created by the docker daemon. You’ve done a lot of work to find an isolated filesystem.
Docker does this automatically at the request of OpenShift every time a container is cre-
ated. Understanding how this process works, and where the artifacts are created, is
important when you’re using containers every day for your application workloads.
From the point of view of the applications running in the app-cli container, all
that’s available to them is what’s in the rootfs directory, because the mount namespace
created for the container isolates its content (see figure 3.7). Understanding how
mount namespaces function on an application node, and knowing how to enter a
container namespace manually, are invaluable tools when you’re troubleshooting a
container that’s not functioning as designed.

Figure 3.7 The app-cli mount namespace isolates the contents of the rootfs directory.

Press Ctrl-D to exit the docker daemon’s mount namespace and return to the default
namespace for your application node. Next, we’ll discuss the UTS namespace. It won’t
be as involved an investigation as the mount namespace, but the UTS namespace is
useful for an application platform like OpenShift that deploys horizontally scalable
applications across a cluster of servers.

3.3.2 The UTS namespace


UTS stands for Unix time sharing in the Linux kernel. The UTS namespace lets each
container have its own hostname and domain name.

Time sharing
It can be confusing to talk about time sharing when the UTS namespace has nothing
to do with managing the system clock. Time sharing originally referred to multiple
users sharing time on a system simultaneously. Back in the 1970s, when this con-
cept was created, it was a novel idea.
The UTS data structure in the Linux kernel had its beginnings then. This is where the
hostname, domain name, and other system information are retained. If you’d like to
see all the information in that structure, run uname -a on a Linux server. That com-
mand queries the same data structure.

The easiest way to view the hostname for a server is to run the hostname command, as
follows:

# hostname

You could use nsenter to enter the UTS namespace for the app-cli container, the same
way you entered the mount namespace in the previous section. But there are additional
tools that will execute a command in the namespaces for a running container.

NOTE On the application node, if you use the nip.io domain discussed in
appendix A, your hostname should look similar to ocp2.192.168.122.101
.nip.io.

One of those tools is the docker exec command. To get the hostname value for a run-
ning container, pass docker exec a container’s short ID and the same hostname com-
mand you want to run in the container. Docker executes the specified command for
you in the container’s namespaces and returns the value. The hostname for each
OpenShift container is its pod name:

# docker exec fae8e211e7a7 hostname


app-cli-1-18k2s

Each container has its own hostname because of its unique UTS namespace. If you
scale up app-cli, the container in each pod will have a unique hostname as well. The

value of this is identifying data coming from each container in a scaled-up system. To
confirm that each container has a unique hostname, log in to your cluster as your
developer user:

oc login -u developer -p developer https://ocp1.192.168.122.100.nip.io:8443

The oc command-line tool has functionality that’s similar to docker exec. Instead of
passing in the short ID for the container, however, you can pass it the pod in which
you want to execute the command. After logging in to your oc client, scale the app-cli
application to two pods with the following command:

oc scale dc/app-cli --replicas=2

This will cause an update to your app-cli deployment config and trigger the creation
of a new app-cli pod. You can get the new pod’s name by running the command oc
get pods --show-all=false. The show-all=false option prevents the output of
pods in a Completed state, so you see only active pods in the output.
Because the container hostname is its corresponding pod name in OpenShift, you
know which pod you were working with using docker directly:

$ oc get pods --show-all=false
NAME              READY     STATUS    RESTARTS   AGE
app-cli-1-18k2s   1/1       Running   1          5d     (original app-cli pod)
app-cli-1-9hsz1   1/1       Running   0          42m    (new app-cli pod)
app-gui-1-l65d9   1/1       Running   1          5d

To get the hostname from your new pod, use the oc exec command. It’s similar to
docker exec, but instead of a container’s short ID, you use the pod name to specify
where you want the command to run. The hostname for your new pod matches the
pod name, just like your original pod:

$ oc exec app-cli-1-9hsz1 hostname


app-cli-1-9hsz1

When you’re troubleshooting application-level issues on your cluster, this is an incredibly
useful benefit provided by the UTS namespace. Now that you know how host-
names work in containers, we’ll investigate the PID namespace.

3.3.3 PIDs in containers


Because PIDs are how one application sends signals and information to other applica-
tions, isolating visible PIDs in a container to only the applications in it is an important
security feature. This is accomplished using the PID namespace.
On a Linux server, the ps command shows all running processes, along with their
associated PIDs, on the host. This command typically has a lot of output on a busy sys-
tem. The --ppid option limits the output to a single PID and any child processes it has
spawned.

From your application node, run ps with the --ppid option, and include the PID
you obtained for your app-cli container. Here you can see that the process for
PID 4470 is httpd and that it has spawned several other processes:

# ps --ppid 4470
PID TTY TIME CMD
4506 ? 00:00:00 cat
4510 ? 00:00:01 cat
4542 ? 00:02:55 httpd
4544 ? 00:03:01 httpd
4548 ? 00:03:01 httpd
4565 ? 00:03:01 httpd
4568 ? 00:03:01 httpd
4571 ? 00:03:01 httpd
4574 ? 00:03:00 httpd
4577 ? 00:03:01 httpd
6486 ? 00:03:01 httpd

Use oc exec to get the output of ps for the app-cli pod that matches the PID you col-
lected earlier. If you’ve forgotten, you can compare the hostname in the docker con-
tainer to the pod name. From inside the container, don’t use the --ppid option,
because you want to see all the PIDs visible from within the app-cli container.
When you run the following command, the output is similar to that from the previ-
ous command:

$ oc exec app-cli-1-18k2s ps
PID TTY TIME CMD
1 ? 00:00:27 httpd
18 ? 00:00:00 cat
19 ? 00:00:01 cat
20 ? 00:02:55 httpd
22 ? 00:03:00 httpd
26 ? 00:03:00 httpd
43 ? 00:03:00 httpd
46 ? 00:03:01 httpd
49 ? 00:03:01 httpd
52 ? 00:03:00 httpd
55 ? 00:03:00 httpd
60 ? 00:03:01 httpd
83 ? 00:00:00 ps

There are three main differences in the output:


The initial httpd command (PID 4470) is listed in the output.
The ps command is listed in the output.
The PIDs are completely different.
Each container has a unique PID namespace. That means from inside the container,
the initial command that started the container (PID 4470) is viewed as PID 1. All the
processes it spawned also have PIDs in the same container-specific namespace.

NOTE Applications that are created by a process already in a container auto-


matically inherit the container’s namespace. This makes it easier for applica-
tions in the container to communicate.

So far, we’ve discussed how filesystems, hostnames, and PIDs are isolated in a con-
tainer. Next, let’s take a quick look at how shared memory resources are isolated.

3.3.4 Shared memory resources


Applications can be designed to share memory resources. For example, application A
can write a value into a special, shared section of system memory, and the value can be
read and used by application B. The following shared memory resources, docu-
mented at http://mng.bz/Xjai, are isolated for each container in OpenShift:
POSIX message queue interfaces in /proc/sys/fs/mqueue
The following shared memory parameters:
– msgmax
– msgmnb
– msgmni
– sem
– shmall
– shmmax
– shmmni
– shm_rmid_forced
IPC interfaces in /proc/sysvipc
If a container is destroyed, shared memory resources are destroyed as well. Because
these resources are application-specific, you’ll work with them more in chapter 8
when you deploy a stateful application.
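One quick, hedged way to see this isolation (assuming the ipcmk and ipcs utilities from
util-linux are available on the node and in the container image) is to create a System V
shared memory segment on the host and confirm that it isn't visible from inside the
container:

# ipcmk -M 1024                         # create a 1 KiB shared memory segment in the host's IPC namespace
# ipcs -m                               # the new segment is listed on the host
$ oc exec app-cli-1-18k2s -- ipcs -m    # but it doesn't appear inside the app-cli container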
The last namespace to discuss is the network namespace.

3.3.5 Container networking


The fifth kernel namespace that’s used by docker to isolate containers in OpenShift is
the network namespace. There’s nothing funny about the name for this namespace.
The network namespace isolates network resources and traffic in a container.
The resources in this definition include the entire TCP/IP stack that’s used by applications
in the container.
Chapter 10 is dedicated to going deep into OpenShift’s software-defined network-
ing, but we need to illustrate in this chapter how the view from within the container is
drastically different than the view from your host.
The PHP builder image you used to create app-cli and app-gui doesn’t have the ip
utility installed. You could install it into the running container using yum. But a faster
way is to use nsenter. Earlier, you used nsenter to enter the mount namespace of the
docker process so you could view the root filesystem for app-cli.

The OSI model


It would be great if we could go through the OSI model here. Unfortunately, it’s out of
scope for this book. In short, it’s a model to describe how data travels in a TCP/IP
network. There are seven layers. You’ll often hear about layer 3 devices, or a layer 2
switch; when someone says that, they’re referring to the layer of the OSI model on
which a particular device operates. Additionally, the OSI model is a great tool to use
any time you need to understand how data moves through any system or application.
If you haven’t read up on the OSI model before, it’s worth your time to look at the
article “The OSI Model Explained: How to Understand (and Remember) the 7 Layer
Network Model” by Keith Shaw (Network World, http://mng.bz/CQCE).

If you run nsenter and include a command as the last argument, then instead of
opening an interactive session in that namespace, the command is executed in the
specified namespace and returns the results. Using this tool, you can run the ip
command from your server’s default namespace in the network namespace of your
app-cli container.
If you compare this to the output from running the /sbin/ip a command on your
host, the differences are obvious. Your application node will have 10 or more active
network interfaces. These represent the physical and software-defined devices that
make OpenShift function securely. But in the app-cli container, you have a container-
specific loopback interface and a single network interface with a unique MAC and
IP address:

# nsenter -t 5136 -n /sbin/ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
    link/ether 0a:58:0a:81:00:2e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.129.0.46/23 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe81:2e/64 scope link
       valid_lft forever preferred_lft forever

The network namespace is the first component in the OpenShift networking solution.
We’ll discuss how network traffic gets in and out of containers in chapter 10, when we
cover OpenShift networking in depth.
In OpenShift, isolating processes doesn’t happen in the application, or even in the
userspace on the application node. This is a key difference from other types of software
clusters, and even from some other container-based solutions. In OpenShift, isolation
and resource limits are enforced in the Linux kernel on the application nodes. Isolation
with kernel namespaces provides a much smaller attack surface. An exploit that would
let someone break out of a container would have to exist in the container runtime or
the kernel itself. With OpenShift, as we'll discuss in depth in chapter 11 when we
examine security principles in OpenShift, configuration of the kernel and the container
runtime is tightly controlled.
The last point we'd like to make in this chapter echoes how we began the discussion.
Fundamental knowledge of how containers work and use the Linux kernel is invaluable.
When you need to manage your cluster or troubleshoot issues when they arise, this
knowledge lets you think about containers in terms of what they're doing all the way
to the bottom of the Linux kernel. That makes solving issues and creating stable
configurations easier to accomplish.
Before you move on, clean up by reverting back to a single replica of the app-cli
application with the following command:

oc scale dc/app-cli --replicas=1

3.4 Summary
OpenShift orchestrates Kubernetes and docker to deploy and manage applications in containers.
Multiple levels of management are available in your OpenShift cluster and can be used to get different levels of information.
Containers isolate processes using kernel namespaces.
You can interact with namespaces from the host using special applications and tools.

nixCraft
Linux and Unix tutorials for new and seasoned sysadmin

Linux / Unix: chroot Command Examples


last updated March 9, 2014 in Commands, Linux, UNIX

I am a new Linux and Unix user. How do I change the root directory of a command? How do I change the root directory of a process, such as a web server, using the chroot command to isolate the file system? How do I use chroot to recover a password or fix a damaged Linux/Unix based environment?

Each process/command on a Linux or Unix-like system has a root directory (in addition to its current working directory). You can change the root directory of a command using the chroot command, which changes the root directory for both the currently running process and its children.

chroot command details

Description: Change root directory
Category: Processes Management
Difficulty: Advanced
Root privileges: Yes
Estimated completion time: 30m

Contents
» Syntax
» Examples
» Exit from chrooted jail
» Find out if service in chrooted jail
» Rescue and fix software RAID
» Should I use the chroot feature?
» Options
» See also

A process/command that is run in such a modified environment cannot access files outside the new root directory. This modified environment is commonly known as a "jailed directory" or "chroot jail". Only a privileged process (one running as the root user) can use the chroot command. This is useful to:
1. Privilege separation for an unprivileged process such as a web server or DNS server.
2. Setting up a test environment.
3. Run old programs or ABI-incompatible programs without crashing the application or system.
4. System recovery.
5. Reinstall the bootloader such as GRUB or LILO.
6. Password recovery – reset a forgotten password and more.

Purpose

The chroot command changes its current and root directories to the provided directory and then runs the command, if supplied, or an interactive copy of the user's login shell. Please note that not every application can be chrooted.

Syntax

The basic syntax is as follows:

chroot /path/to/new/root command

OR

chroot /path/to/new/root /path/to/server

OR

chroot [options] /path/to/new/root /path/to/server

chroot command examples

In this example, build a mini-jail for testing purposes with the bash and ls commands only.
First, set the jail location using the mkdir command:
$ J=$HOME/jail

Create directories inside $J:

$ mkdir -p $J
$ mkdir -p $J/{bin,lib64,lib}
$ cd $J

Copy /bin/bash and /bin/ls into $J/bin/ location using cp command:

$ cp -v /bin/{bash,ls} $J/bin

Copy required libs in $J. Use the ldd command to print shared library dependencies for bash:

$ ldd /bin/bash

Sample outputs:

linux-vdso.so.1 => (0x00007fff8d987000)
libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00000032f7a00000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000032f6e00000)
libc.so.6 => /lib64/libc.so.6 (0x00000032f7200000)
/lib64/ld-linux-x86-64.so.2 (0x00000032f6a00000)

Copy libs in $J correctly from the above output:

$ cp -v /lib64/libtinfo.so.5 /lib64/libdl.so.2 /lib64/libc.so.6 /lib64/ld-linux-x86-64.so.2 $J/lib64/

Sample outputs:

`/lib64/libtinfo.so.5' -> `/home/vivek/jail/lib64/libtinfo.so.5'
`/lib64/libdl.so.2' -> `/home/vivek/jail/lib64/libdl.so.2'
`/lib64/libc.so.6' -> `/home/vivek/jail/lib64/libc.so.6'
`/lib64/ld-linux-x86-64.so.2' -> `/home/vivek/jail/lib64/ld-linux-x86-64.so.2'

Copy required libs in $J for the ls command. Use the ldd command to print shared library dependencies for the ls command:

$ ldd /bin/ls

Sample outputs:

linux-vdso.so.1 => (0x00007fff68dff000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00000032f8a00000)
librt.so.1 => /lib64/librt.so.1 (0x00000032f7a00000)
libcap.so.2 => /lib64/libcap.so.2 (0x00000032fda00000)
libacl.so.1 => /lib64/libacl.so.1 (0x00000032fbe00000)
libc.so.6 => /lib64/libc.so.6 (0x00000032f7200000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000032f6e00000)
/lib64/ld-linux-x86-64.so.2 (0x00000032f6a00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000032f7600000)
libattr.so.1 => /lib64/libattr.so.1 (0x00000032f9600000)

You can copy libs one-by-one or try a bash shell for loop as follows:

list="$(ldd /bin/ls | egrep -o '/lib.*\.[0-9]')"
for i in $list; do cp -v "$i" "${J}${i}"; done

Sample outputs:

`/lib64/libselinux.so.1' -> `/home/vivek/jail/lib64/libselinux.so.1'
`/lib64/librt.so.1' -> `/home/vivek/jail/lib64/librt.so.1'
`/lib64/libcap.so.2' -> `/home/vivek/jail/lib64/libcap.so.2'
`/lib64/libacl.so.1' -> `/home/vivek/jail/lib64/libacl.so.1'
`/lib64/libc.so.6' -> `/home/vivek/jail/lib64/libc.so.6'
`/lib64/libdl.so.2' -> `/home/vivek/jail/lib64/libdl.so.2'
`/lib64/ld-linux-x86-64.so.2' -> `/home/vivek/jail/lib64/ld-linux-x86-64.so.2'
`/lib64/libpthread.so.0' -> `/home/vivek/jail/lib64/libpthread.so.0'
`/lib64/libattr.so.1' -> `/home/vivek/jail/lib64/libattr.so.1'

Finally, chroot into your new jail:

$ sudo chroot $J /bin/bash

Try browsing /etc or /var:

# ls /
# ls /etc/
# ls /var/

The chrooted bash and ls applications are locked into a particular directory ($J, i.e. $HOME/jail) and unable to wander around the rest of the directory tree; they see that directory as their "/" (root) directory. This is a tremendous boost to security if configured properly. I usually lock down the following applications using the same techniques:

1. Apache – Red Hat / CentOS: Chroot Apache 2 Web Server


2. Nginx – Linux nginx: Chroot (Jail) Setup
3. Chroot Lighttpd web server on a Linux based system
4. Chroot mail server.
5. Chroot Bind DNS server and more.

How do I exit from chrooted jail?

Type exit

$ exit

Sample session from above commands:

Animated gif 01: Linux / Unix: Bash Chroot ls Command Demo

Find out if service in chrooted jail or not

You can easily find out if Postfix mail server is chrooted or not using the following two
commands:

pid=$(pidof -s master)
ls -ld /proc/$pid/root

Sample outputs from my Linux based server:

lrwxrwxrwx. 1 root root 0 Mar 9 11:16 /proc/8613/root -> /

PID 8613 points to / (root), i.e. the root directory for the application has not been changed, so it is not chrooted. This is a quick and dirty way to find out whether an application is chrooted without opening configuration files. Here is another example from a chrooted nginx server:

pid=$(pidof -s nginx)
ls -ld /proc/$pid/root

Sample outputs:

lrwxrwxrwx 1 nginx nginx 0 Mar 9 11:17 /proc/4233/root -> /nginxjail

The root directory for the application has been changed to /nginxjail.

Rescue and fix a software RAID system with chroot

I'm assuming that a software RAID based Linux system is not booting, so you booted the system either using a Live CD or a network-based remote rescue kernel to fix it. In this example, I boot a RHEL based system using a live Linux DVD/CD and chroot into /dev/sda1 and/or /dev/md0 to fix the problem:

## Recover data, at live cd prompt type the following commands. ##


## /dev/sda1 main system partition ##
## /dev/md0 /data partition ##
# Set jail dir
d=/chroot
mkdir $d

# Mount sda1 and required dirs


mount /dev/sda1 $d
mount -o bind /dev $d/dev
mount -o bind /sys $d/sys
mount -o bind /dev/shm $d/dev/shm
mount -o bind /proc $d/proc

# Mount software raid /dev/md0


mount /dev/md0 $d/data

# Chroot to our newly created jail. This allows us to fix bootloader or


grab data before everything goes to /dev/null
chroot $d

# Can you see?


ls
df

# Get files to safe location


rsync -avr /path/to/my_precious_data_dir user@safe.location.cyberciti.biz:/path/to/dest

# Get out of chrooted jail and reboot or format the server as per your
needs ;)
exit
umount {dev,sys,[...],}
reboot

BUT WAIT, THERE'S MORE!

See all other chroot command related examples on nixCraft:

1. Ubuntu: Mount Encrypted Home Directory (~/.private) From an Ubuntu Live CD


2. Linux Configure rssh Chroot Jail To Lock Users To Their Home Directories Only
3. Fix a dual boot MS-Windows XP/Vista/7/Server and Linux problem
4. Restore Debian Linux Grub boot loader

A note about chrooting apps on Linux or Unix-like systems

Should you use the chroot feature all the time? In the above example, the program is fairly simple, but you may end up with several different kinds of problems, such as:

1. Missing libs in the jail can result in a broken jail.
2. Complex programs are difficult to chroot. I suggest you either try a real jail such as the one provided by FreeBSD, or use a virtualization solution such as KVM on Linux.
3. An app running in a jail cannot run any other programs, cannot alter any files, and cannot assume another user's identity. If you loosen these restrictions, you have lessened your security, chroot or no chroot.

Also note that:

1. Do not forget to update chrooted apps when you upgrade apps locally.
2. Not every app can or should be chrooted.
3. Any app which has to assume root privileges to operate is pointless to attempt to chroot, as root can generally escape a chroot.
4. Chroot is not a silver bullet. Learn how to secure and harden the rest of the system too.

chroot command options

From the chroot(8) command man page:

--userspec=USER:GROUP specify user and group (ID or name) to use


--groups=G_LIST specify supplementary groups as g1,g2,..,gN
--help display this help and exit
--version output version information and exit

See also

chroot(8) Linux/Unix command man page


Man pages – chroot(2)
OpenBSD documentation – See Apache chrooting faq for more information.



Posted by: Vivek Gite

The author is the creator of nixCraft and a seasoned sysadmin, DevOps engineer, and a trainer for the Linux operating system/Unix shell scripting.



10 comments

autarch princeps March 9, 2014 at 6:16 pm

You should really have a look at lxc. I’ve tried it on Fedora. It even comes with scripts to
initialize a minimum Fedora or Debian inside a container. You can just install the
necessary libraries and software using the package manager.

Prasad March 10, 2014 at 3:25 am

Best document on chroot. Thanks for sharing.

Mesut Tasci March 10, 2014 at 4:06 pm

How you make this gif?

Kettu March 17, 2014 at 8:57 pm

I see some similarities between Docker and chroot, although they are made for quite
different purposes.

Sebastien December 23, 2014 at 2:34 pm

list="$(ldd /bin/ls | egrep -o '/lib.*\.[0-9]')"

I would recommend:

#!/bin/bash
JAIL="$(pwd)" #-- if you run this in your jail directory
echo -n "Give command name to seach libs for: "
read bashcmd
[ ! -e "$bashcmd" ] && echo "$bashcmd command not found. Abording." && exit
list="$(ldd $bashcmd | egrep -o ' /.*/lib.*\.[0-9]* '|sed 's/ //g')"
for i in $list; do cp -v "$i" "$JAIL${i}"; done

Christian February 16, 2015 at 9:18 am

Enhancement for your loop to copy the bash and ls dependencies:

for i in $list; do cp -v --parents "$i" "${J}"; done

What it does:
It copies the whole directory structure of the dependencies so you won’t get an error if
a subfolder like $HOME/jail/lib/x86_64-linux-gnu/ doesn’t exist.

Faron May 14, 2015 at 9:46 pm

Aren’t we supposed to `chdir` a folder first before ‘chroot’-ing it as it will tighten holes
in chroot environment?

Diromedes Malta July 13, 2015 at 9:47 pm

Very useful. Thanks.

ceneblock June 15, 2016 at 8:17 pm

There are other options that will automate this for you. I’m personally fond of
debootstrap.

sameer.oak September 13, 2017 at 9:16 am

a marvelously written document. indeed, very useful for experts and newbies too.
thank you very much for your contribution.

©2000-2020 nixCraft. All rights reserved.

Linux namespaces – the foundation of LXC


Namespaces are the foundation of lightweight process virtualization. They enable a process and its children to
have different views of the underlying system. This is achieved by the addition of the unshare() and
setns() system calls, and the inclusion of six new constant flags passed to the clone() , unshare() ,
and setns() system calls:

• clone(): This creates a new process and attaches it to a new specified namespace

• unshare(): This attaches the current process to a new specified namespace

• setns(): This attaches a process to an already existing namespace

There are six namespaces currently in use by LXC, with more being developed:

• Mount namespaces, specified by the CLONE_NEWNS flag

• UTS namespaces, specified by the CLONE_NEWUTS flag

• IPC namespaces, specified by the CLONE_NEWIPC flag

• PID namespaces, specified by the CLONE_NEWPID flag

• User namespaces, specified by the CLONE_NEWUSER flag

• Network namespaces, specified by the CLONE_NEWNET flag

Let's have a look at each in more detail and see some userspace examples, to help us better understand what
happens under the hood.
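
A quick way to see the namespaces a process currently belongs to is to list the /proc/PID/ns directory; each entry is a symbolic link of the form type:[inode], and two processes share a namespace when the inode numbers match. A session looks roughly like this (the inode numbers, and on newer kernels the exact set of entries, will differ on your system):

root@server:~# ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Apr 16 06:07 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr 16 06:07 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr 16 06:07 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Apr 16 06:07 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr 16 06:07 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr 16 06:07 uts -> uts:[4026531838]
root@server:~#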

Mount namespaces
Mount namespaces first appeared in kernel 2.4.19 in 2002 and provided a separate view of the filesystem mount
points for the process and its children. When mounting or unmounting a filesystem, the change will be noticed by
all processes because they all share the same default namespace. When the CLONE_NEWNS flag is passed to the
clone() system call, the new process gets a copy of the calling process mount tree that it can then change

without affecting the parent process. From that point on, all mounts and unmounts in the default namespace will
be visible in the new namespace, but changes in the per-process mount namespaces will not be noticed outside of
it.

The clone() prototype is as follows:


#define _GNU_SOURCE
#include <sched.h>
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);

An example call that creates a child process in a new mount namespace looks like this:


child_pid = clone(childFunc, child_stack + STACK_SIZE, CLONE_NEWNS | SIGCHLD, argv[1]);

When the child process is created, it executes the childFunc function, which will perform its work in the new
mount namespace.

The util-linux package provides userspace tools that implement the unshare() call, which effectively
unshares the indicated namespaces from the parent process.

To illustrate this:

1 First open a terminal and create a directory in /tmp as follows:


root@server:~# mkdir /tmp/mount_ns


root@server:~#

2 Next, move the current bash process to its own mount namespace by passing the mount flag to
unshare :


root@server:~# unshare -m /bin/bash


root@server:~#

3 The bash process is now in a separate namespace. Let's check the associated inode number of the
namespace:

root@server:~# readlink /proc/$$/ns/mnt


mnt:[4026532211]
root@server:~#

4 Next, create a temporary mount point:


root@server:~# mount -n -t tmpfs tmpfs /tmp/mount_ns


root@server:~#

5 Also, make sure you can see the mount point from the newly created namespace:

root@server:~# df -h | grep mount_ns
tmpfs           3.9G     0  3.9G   0% /tmp/mount_ns
root@server:~# cat /proc/mounts | grep mount_ns
tmpfs /tmp/mount_ns tmpfs rw,relatime 0 0
root@server:~#

As expected, the mount point is visible because it is part of the namespace we created and that the current
bash process is running in.

6 Next, start a new terminal session and display the namespace inode ID from it:


root@server:~# readlink /proc/$$/ns/mnt


mnt:[4026531840]
root@server:~#

Notice how it's different from the mount namespace on the other terminal.

7 Finally, check if the mount point is visible in the new terminal:


root@server:~# cat /proc/mounts | grep mount_ns


root@server:~# df -h | grep mount_ns
root@server:~#

Not surprisingly, the mount point is not visible from the default namespace.

Note
In the context of LXC, mount namespaces are useful because they provide a way for a different filesystem layout to exist
inside the container. It's worth mentioning that before the mount namespaces, a similar process confinement could be
achieved with the chroot() system call, however chroot does not provide the same per-process isolation as
mount namespaces do.

UTS namespaces
Unix Timesharing (UTS) namespaces provide isolation for the hostname and domain name, so that each LXC
container can maintain its own identifier as returned by the hostname -f command. This is needed for most
applications that rely on a properly set hostname.

To create a bash session in a new UTS namespace, we can use the unshare utility again, which uses the
unshare() system call to create the namespace and the execve() system call to execute bash in it:


root@server:~# hostname
server
root@server:~# unshare -u /bin/bash
root@server:~# hostname uts-namespace
root@server:~# hostname
uts-namespace
root@server:~# cat /proc/sys/kernel/hostname
uts-namespace
root@server:~#

As the preceding output shows, the hostname inside the namespace is now uts-namespace .

Next, from a different terminal, check the hostname again to make sure it has not changed:


root@server:~# hostname
server
root@server:~#

As expected, the hostname only changed in the new UTS namespace.

To see the actual system calls that the unshare command uses, we can run the strace utility:


root@server:~# strace -s 2000 -f unshare -u /bin/bash


...
unshare(CLONE_NEWUTS) = 0
getgid() = 0
setgid(0) = 0
getuid() = 0
setuid(0) = 0
execve("/bin/bash", ["/bin/bash"], [/* 15 vars */]) = 0
...

From the output we can see that the unshare command is indeed using the unshare() and execve()

system calls and the CLONE_NEWUTS flag to specify new UTS namespace.

IPC namespaces
The Interprocess Communication (IPC) namespaces provide isolation for a set of IPC and synchronization
facilities. These facilities provide a way of exchanging data and synchronizing the actions between threads and
processes. They provide primitives such as semaphores, file locks, and mutexes among others, that are needed to
have true process separation in a container.
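
As a quick illustration of this isolation (a minimal sketch assuming the ipcmk and ipcs utilities from util-linux are available), create a System V message queue in the default IPC namespace and then look for it from inside a new IPC namespace created with unshare; the key value shown is only illustrative:

root@server:~# ipcmk -Q                     # create a message queue in the default IPC namespace
Message queue id: 0
root@server:~# ipcs -q | grep 0x            # the queue is listed here
0x52fa823b 0          root       644        0            0
root@server:~# unshare -i /bin/bash         # start a shell in a new IPC namespace
root@server:~# ipcs -q | grep 0x            # the queue from the default namespace is not visible
root@server:~# exit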

PID namespaces
The Process ID (PID) namespaces provide the ability for a process to have an ID that already exists in the default
namespace, for example an ID of 1 . This allows for an init system to run in a container with various other
processes, without causing a collision with the rest of the PIDs on the same OS.

To demonstrate this concept, open up pid_namespace.c :


#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
#include <sys/wait.h>    /* needed for the waitpid() call below */

static int childFunc(void *arg)


{
printf("Process ID in child = %ld\n", (long) getpid());
}

First, we include the headers and define the childFunc function that the clone() system call will use.
The function prints out the child PID using the getpid() system call:


static char child_stack[1024*1024];

int main(int argc, char *argv[])


{
pid_t child_pid;

    child_pid = clone(childFunc, child_stack + (1024*1024),
                      CLONE_NEWPID | SIGCHLD, NULL);

printf("PID of cloned process: %ld\n", (long) child_pid);


waitpid(child_pid, NULL, 0);
exit(EXIT_SUCCESS);
}

In the main() function, we specify the stack size and call clone() , passing the child function
childFunc , the stack pointer, the CLONE_NEWPID flag, and the SIGCHLD signal. The CLONE_NEWPID

flag instructs clone() to create a new PID namespace and the SIGCHLD flag notifies the parent process
when one of its children terminates. The parent process will block on waitpid() if the child process has not
terminated.

Compile and then run the program with the following:


root@server:~# gcc pid_namespace.c -o pid_namespace


root@server:~# ./pid_namespace
PID of cloned process: 17705
Process ID in child = 1
root@server:~#

From the output, we can see that the child process has a PID of 1 inside its namespace and 17705

otherwise.

Note
Note that error handling has been omitted from the code examples for brevity.

User namespaces
The user namespaces allow a process inside a namespace to have a different user and group ID than it has in the
default namespace. In the context of LXC, this allows a process to run as root inside the container, while
having a non-privileged ID outside. This adds a thin layer of security, because breaking out of the container will
result in a non-privileged user. This has been possible since kernel 3.8, which introduced the ability for non-
privileged processes to create user namespaces.

To create a new user namespace as a non-privileged user and have root inside, we can use the unshare

utility. Let's install the latest version from source:


root@ubuntu:~# cd /usr/src/
root@ubuntu:/usr/src# wget https://www.kernel.org/pub/linux/utils/util-linux/v2.28/util-linux-2.28.tar.gz
root@ubuntu:/usr/src# tar zxfv util-linux-2.28.tar.gz
root@ubuntu:/usr/src# cd util-linux-2.28/
root@ubuntu:/usr/src/util-linux-2.28# ./configure
root@ubuntu:/usr/src/util-linux-2.28# make && make install
root@ubuntu:/usr/src/util-linux-2.28# unshare --map-root-user --user sh -c whoami
root
root@ubuntu:/usr/src/util-linux-2.28#

We can also use the clone() system call with the CLONE_NEWUSER flag to create a process in a user
namespace, as demonstrated by the following program:


#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sched.h>
#include <unistd.h>      /* for geteuid() and getegid() */
#include <sys/wait.h>    /* for waitpid() */

static int childFunc(void *arg)
{
    printf("UID inside the namespace is %ld\n", (long) geteuid());
    printf("GID inside the namespace is %ld\n", (long) getegid());
}

static char child_stack[1024*1024];

int main(int argc, char *argv[])


{
pid_t child_pid;

After compilation and execution, the output looks similar to this when run as root - UID of 0 :


root@server:~# gcc user_namespace.c -o user_namespace


root@server:~# ./user_namespace
UID outside the namespace is 0
GID outside the namespace is 0
UID inside the namespace is 65534
GID inside the namespace is 65534
root@server:~#

Network namespaces
Network namespaces provide isolation of the networking resources, such as network devices, addresses, routes,
and firewall rules. This effectively creates a logical copy of the network stack, allowing multiple processes to listen
on the same port from multiple namespaces. This is the foundation of networking in LXC and there are quite a lot
of other use cases where this can come in handy.

The iproute2 package provides very useful userspace tools that we can use to experiment with the network
namespaces, and is installed by default on almost all Linux systems.

There's always the default network namespace, referred to as the root namespace, where all network interfaces
are initially assigned. To list the network interfaces that belong to the default namespace run the following
command:

root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
root@server:~#

In this case, there are two interfaces – lo and eth0 .

To list their configuration, we can run the following:


root@server:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
    inet 10.1.32.40/24 brd 10.1.32.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5:eff:feb0:a347/64 scope link
       valid_lft forever preferred_lft forever
root@server:~#

Also, to list the routes from the root network namespace, execute the following:


root@server:~# ip r s
default via 10.1.32.1 dev eth0
10.1.32.0/24 dev eth0 proto kernel scope link src 10.1.32.40
root@server:~#

Let's create two new network namespaces called ns1 and ns2 and list them:


root@server:~# ip netns add ns1


root@server:~# ip netns add ns2
root@server:~# ip netns
ns2
ns1
root@server:~#

Now that we have the new network namespaces, we can execute commands inside them:

root@server:~# ip netns exec ns1 ip link


1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~#

The preceding output shows that in the ns1 namespace, there's only one network interface, the loopback -
lo interface, and it's in a DOWN state.

We can also start a new bash session inside the namespace and list the interfaces in a similar way:


root@server:~# ip netns exec ns1 bash


root@server:~# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~# exit
root@server:~#

This is more convenient for running multiple commands than specifying each, one at a time. The two network
namespaces are not of much use if not connected to anything, so let's connect them to each other. To do this we'll
use a software bridge called Open vSwitch.

Open vSwitch works just like a regular network bridge: it forwards frames between virtual ports that we
define. Virtual machines running under hypervisors such as KVM and Xen, as well as LXC or Docker containers, can then be connected to it.

Most Debian-based distributions such as Ubuntu provide a package, so let's install that:


root@server:~# apt-get install -y openvswitch-switch


root@server:~#

This installs and starts the Open vSwitch daemon. Time to create the bridge; we'll name it OVS-1 :


root@server:~# ovs-vsctl add-br OVS-1


root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
Port "OVS-1"
Interface "OVS-1"
type: internal
ovs_version: "2.0.2"
root@server:~#

Note
If you would like to experiment with the latest version of Open vSwitch, you can download the source code from
http://openvswitch.org/download/ (http://openvswitch.org/download/) and compile it.

The newly created bridge can now be seen in the root namespace:


root@server:~# ip a s OVS-1
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::f0d9:78ff:fe72:3d77/64 scope link
       valid_lft forever preferred_lft forever
root@server:~#

In order to connect both network namespaces, let's first create a virtual pair of interfaces for each namespace:


root@server:~# ip link add eth1-ns1 type veth peer name veth-ns1


root@server:~# ip link add eth1-ns2 type veth peer name veth-ns2
root@server:~#

The preceding two commands create four virtual interfaces eth1-ns1 , eth1-ns2 and veth-ns1 ,
veth-ns2 . The names are arbitrary.

To list all interfaces that are part of the root network namespace, run:


root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
    link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
6: eth1-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
8: eth1-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
root@server:~#

Let's assign the eth1-ns1 and eth1-ns2 interfaces to the ns1 and ns2 namespaces:

root@server:~# ip link set eth1-ns1 netns ns1


root@server:~# ip link set eth1-ns2 netns ns2

Also, confirm they are visible from inside each network namespace:


root@server:~# ip netns exec ns1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: eth1-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
root@server:~#
root@server:~# ip netns exec ns2 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8: eth1-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
root@server:~#

Notice how each network namespace now has two interfaces assigned – loopback and eth1-ns*.

If we list the devices from the root namespace, we should see that the interfaces we just moved to ns1 and
ns2 namespaces are no longer visible:


root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
    link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
root@server:~#

It's time to connect the other end of the two virtual pipes, the veth-ns1 and veth-ns2 interfaces to the
bridge:

root@server:~# ovs-vsctl add-port OVS-1 veth-ns1


root@server:~# ovs-vsctl add-port OVS-1 veth-ns2
root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
Port "OVS-1"
Interface "OVS-1"
type: internal
Port "veth-ns1"
Interface "veth-ns1"
Port "veth-ns2"
Interface "veth-ns2"
ovs_version: "2.0.2"
root@server:~#

From the preceding output, it's apparent that the bridge now has two ports, veth-ns1 and veth-ns2 .

The last thing left to do is bring the network interfaces up and assign IP addresses:


root@server:~# ip link set veth-ns1 up


root@server:~# ip link set veth-ns2 up
root@server:~# ip netns exec ns1 ip link set dev lo up
root@server:~# ip netns exec ns1 ip link set dev eth1-ns1 up
root@server:~# ip netns exec ns1 ip address add 192.168.0.1/24 dev eth1-ns1
root@server:~# ip netns exec ns1 ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
6: eth1-ns1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/24 scope global eth1-ns1
valid_lft forever preferred_lft forever
inet6 fe80::8c99:3fff:feb8:4331/64 scope link
valid_lft forever preferred_lft forever
root@server:~#

Similarly for the ns2 namespace:

root@server:~# ip netns exec ns2 ip link set dev lo up


root@server:~# ip netns exec ns2 ip link set dev eth1-ns2 up
root@server:~# ip netns exec ns2 ip address add 192.168.0.2/24 dev eth1-ns2
root@server:~# ip netns exec ns2 ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
8: eth1-ns2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.2/24 scope global eth1-ns2
valid_lft forever preferred_lft forever
inet6 fe80::f871:b8ff:fea1:7f85/64 scope link
valid_lft forever preferred_lft forever
root@server:~#

With this, we established a connection between both ns1 and ns2 network namespaces through the
Open vSwitch bridge. To confirm, let's use ping :

root@server:~# ip netns exec ns1 ping -c 3 192.168.0.2


PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.414 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.030 ms
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.027/0.157/0.414/0.181 ms
root@server:~#
root@server:~# ip netns exec ns2 ping -c 3 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=0.150 ms
64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=0.025 ms
64 bytes from 192.168.0.1: icmp_seq=3 ttl=64 time=0.027 ms
--- 192.168.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.025/0.067/0.150/0.058 ms
root@server:~#

Open vSwitch allows for assigning VLAN tags to network interfaces, resulting in traffic isolation between
namespaces. This can be helpful in a scenario where you have multiple namespaces and you want to have
connectivity between some of them.

The following example demonstrates how to tag the virtual interfaces on the ns1 and ns2 namespaces, so
that the traffic will not be visible from each of the two network namespaces:


root@server:~# ovs-vsctl set port veth-ns1 tag=100


root@server:~# ovs-vsctl set port veth-ns2 tag=200
root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
Port "OVS-1"
Interface "OVS-1"
type: internal
Port "veth-ns1"
tag: 100
Interface "veth-ns1"
Port "veth-ns2"
tag: 200
Interface "veth-ns2"
ovs_version: "2.0.2"
root@server:~#

Both the namespaces should now be isolated in their own VLANs and ping should fail:

root@server:~# ip netns exec ns1 ping -c 3 192.168.0.2


PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
root@server:~# ip netns exec ns2 ping -c 3 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
--- 192.168.0.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
root@server:~#

We can also use the unshare utility that we saw in the mount and UTS namespaces examples to create a new
network namespace:


root@server:~# unshare --net /bin/bash
root@server:~# ip a s
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~# exit
root@server:~#

Resource management with cgroups


Cgroups are kernel features that allow fine-grained control over resource allocation for a single process, or a
group of processes, called tasks. In the context of LXC this is quite important, because it makes it possible to
assign limits to how much memory, CPU time, or I/O any given container can use.

The cgroups we are most interested in are described in the following table:

Subsystem Description Defined in

cpu Allocates CPU time for tasks kernel/sched/core.c

cpuacct Accounts for CPU usage kernel/sched/core.c

cpuset Assigns CPU cores to tasks kernel/cpuset.c

memory Allocates memory for tasks mm/memcontrol.c

blkio Limits the I/O access to devices block/blk-cgroup.c

devices Allows/denies access to devices security/device_cgroup.c

freezer Suspends/resumes tasks kernel/cgroup_freezer.c

net_cls Tags network packets net/sched/cls_cgroup.c

net_prio Prioritizes network traffic net/core/netprio_cgroup.c

hugetlb Limits the HugeTLB mm/hugetlb_cgroup.c

Cgroups are organized in hierarchies, represented as directories in a Virtual File System (VFS). Similar to
process hierarchies, where every process is a descendent of the init or systemd process, cgroups inherit
some of the properties of their parents. Multiple cgroups hierarchies can exist on the system, each one
representing a single or group of resources. It is possible to have hierarchies that combine two or more
subsystems, for example, memory and I/O, and tasks assigned to a group will have limits applied on those
resources.
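
For example, a single hierarchy with both the cpu and blkio subsystems attached can be created by co-mounting the two controllers (a minimal sketch for a cgroup v1 system; it will fail if either controller is already mounted in another hierarchy, for instance by systemd):

root@server:~# mkdir -p /cgroup/cpu_blkio
root@server:~# mount -t cgroup -o cpu,blkio cpu_blkio /cgroup/cpu_blkio
root@server:~# mkdir /cgroup/cpu_blkio/app1
root@server:~# echo $$ > /cgroup/cpu_blkio/app1/tasks    # CPU and I/O limits set in app1 now apply to this shell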

Note
If you are interested in how the different subsystems are implemented in the kernel, install the kernel source and have a
look at the C files, shown in the third column of the table.

The following diagram helps visualize a single hierarchy that has two subsystems—CPU and I/O—attached to it:

Cgroups can be used in two ways:

• By manually manipulating files and directories on a mounted VFS

• Using userspace tools provided by various packages such as cgroup-bin on Debian/Ubuntu and libcgroup on RHEL/CentOS

Let's have a look at a few practical examples of how to use cgroups to limit resources. This will help us get a better
understanding of how containers work.
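
The examples that follow manipulate the mounted VFS directly. For comparison, the same kind of setup with the userspace tools mentioned above would look roughly like this (a sketch assuming the cgroup-tools package, the successor to cgroup-bin, is installed; app1 is a hypothetical group and binary name):

root@server:~# apt-get install -y cgroup-tools              # provides cgcreate, cgset, and cgexec
root@server:~# cgcreate -g memory,blkio:/app1               # create the app1 group in the memory and blkio hierarchies
root@server:~# cgset -r memory.limit_in_bytes=1G app1       # equivalent to echoing the value into the VFS file
root@server:~# cgexec -g memory,blkio:app1 /usr/bin/app1    # launch a process directly inside the group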

Limiting I/O throughput

Let's assume we have two applications running on a server that are heavily I/O bound: app1 and app2 . We
would like to give more bandwidth to app1 during the day and to app2 during the night. This type of I/O
throughput prioritization can be accomplished using the blkio subsystem.

First, let's attach the blkio subsystem by mounting the cgroup VFS:


root@server:~# mkdir -p /cgroup/blkio


root@server:~# mount -t cgroup -o blkio blkio /cgroup/blkio
root@server:~# cat /proc/mounts | grep cgroup
blkio /cgroup/blkio cgroup rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agent.bl
root@server:~#

Next, create two priority groups, which will be part of the same blkio hierarchy:


root@server:~# mkdir /cgroup/blkio/high_io


root@server:~# mkdir /cgroup/blkio/low_io
root@server:~#

We need to acquire the PIDs of the app1 and app2 processes and assign them to the high_io and
low_io groups:


root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/blkio/high_io/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/blkio/low_io/tasks; done
root@server:~#

The blkio hierarchy we've created

The tasks file is where we define what processes/tasks the limit should be applied on.

Finally, let's set a ratio of 10:1 for the high_io and low_io cgroups. Tasks in those cgroups will
immediately use only the resources made available to them:


root@server:~# echo 1000 > /cgroup/blkio/high_io/blkio.weight


root@server:~# echo 100 > /cgroup/blkio/low_io/blkio.weight
root@server:~#

The blkio.weight file defines the weight of I/O access available to a process or group of processes, with
values ranging from 100 to 1,000. In this example, the values of 1000 and 100 create a ratio of 10:1.

With this, the low priority application, app2 will use only about 10 percent of the I/O operations available,
whereas the high priority application, app1 , will use about 90 percent.

If you list the contents of the high_io directory on Ubuntu you will see the following files:


root@server:~# ls -la /cgroup/blkio/high_io/


drwxr-xr-x 2 root root 0 Aug 24 16:14 .
drwxr-xr-x 4 root root 0 Aug 19 21:14 ..
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_merged
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_merged_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_queued
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_queued_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_bytes
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_bytes_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_serviced
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_serviced_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_time
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_time_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_wait_time
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_wait_time_recursive
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.leaf_weight
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.leaf_weight_device
--w------- 1 root root 0 Aug 24 16:14 blkio.reset_stats
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.sectors
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.sectors_recursive

From the preceding output you can see that only some files are writeable. This depends on various OS settings,
such as what I/O scheduler is being used.

We've already seen what the tasks and blkio.weight files are used for. The following is a short
description of the most commonly used files in the blkio subsystem:

File – Description

blkio.io_merged – Total number of reads/writes, sync, or async merged into requests
blkio.io_queued – Total number of read/write, sync, or async requests queued up at any given time
blkio.io_service_bytes – The number of bytes transferred to or from the specified device
blkio.io_serviced – The number of I/O operations issued to the specified device
blkio.io_service_time – Total amount of time between request dispatch and request completion in nanoseconds for the specified device
blkio.io_wait_time – Total amount of time the I/O operations spent waiting in the scheduler queues for the specified device
blkio.leaf_weight – Similar to blkio.weight and can be applied to the Completely Fair Queuing (CFQ) I/O scheduler
blkio.reset_stats – Writing an integer to this file will reset all statistics
blkio.sectors – The number of sectors transferred to or from the specified device
blkio.throttle.io_service_bytes – The number of bytes transferred to or from the disk
blkio.throttle.io_serviced – The number of I/O operations issued to the specified disk
blkio.time – The disk time allocated to a device in milliseconds
blkio.weight – Specifies weight for a cgroup hierarchy
blkio.weight_device – Same as blkio.weight, but specifies a block device to apply the limit on
tasks – Attach tasks to the cgroup

Tip
One thing to keep in mind is that writing to the files directly to make changes will not persist after the server restarts. Later
in this chapter, you will learn how to use the userspace tools to generate persistent configuration files.

Limiting memory usage


The memory subsystem controls how much memory is presented to and available for use by processes. This
can be particularly useful in multitenant environments where better control over how much memory a user
process can utilize is needed, or to limit memory hungry applications. Containerized solutions like LXC can use the
memory subsystem to manage the size of the instances, without needing to restart the entire container.

The memory subsystem performs resource accounting, such as tracking the utilization of anonymous pages,
file caches, swap caches, and general hierarchical accounting, all of which presents an overhead. Because of this,
the memory cgroup is disabled by default on some Linux distributions. If the commands below fail,
you'll need to enable it by specifying the following GRUB parameter and restarting:


root@server:~# vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="cgroup_enable=memory"
root@server:~# update-grub && reboot

First, let's mount the memory cgroup:


root@server:~# mkdir -p /cgroup/memory


root@server:~# mount -t cgroup -o memory memory /cgroup/memory
root@server:~# cat /proc/mounts | grep memory
memory /cgroup/memory cgroup rw, relatime, memory, release_agent=/run/cgmanager/agents/cgm-release-agent.
root@server:~#

Then set the app1 memory to 1 GB:


root@server:~# mkdir /cgroup/memory/app1


root@server:~# echo 1G > /cgroup/memory/app1/memory.limit_in_bytes
root@server:~# cat /cgroup/memory/app1/memory.limit_in_bytes
1073741824
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/memory/app1/tasks; done
root@server:~#

The memory hierarchy for the app1 process

Similar to the blkio subsystem, the tasks file is used to specify the PID of the processes we are adding to
the cgroup hierarchy, and the memory.limit_in_bytes specifies how much memory is to be made available in
bytes.

The app1 memory hierarchy contains the following files:


root@server:~# ls -la /cgroup/memory/app1/


drwxr-xr-x 2 root root 0 Aug 24 22:05 .
drwxr-xr-x 3 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 24 22:05 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 24 22:05 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 24 22:05 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.failcnt
--w------- 1 root root 0 Aug 24 22:05 memory.force_empty
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.move_charge_at_immigrate

The files and their function in the memory subsystem are described in the following table:

File Description

memory.failcnt Shows the total number of memory limit hits

memory.force_empty If set to 0 , frees memory used by tasks

memory.kmem.failcnt Shows the total number of kernel memory limit hits

memory.kmem.limit_in_bytes Sets or shows kernel memory hard limit

memory.kmem.max_usage_in_bytes Shows maximum kernel memory usage

memory.kmem.tcp.failcnt Shows the number of TCP buffer memory limit hits

memory.kmem.tcp.limit_in_bytes Sets or shows hard limit for TCP buffer memory

memory.kmem.tcp.max_usage_in_bytes Shows maximum TCP buffer memory usage

memory.kmem.tcp.usage_in_bytes Shows current TCP buffer memory

memory.kmem.usage_in_bytes Shows current kernel memory

memory.limit_in_bytes Sets or shows memory usage limit

memory.max_usage_in_bytes Shows maximum memory usage


memory.move_charge_at_immigrate Sets or shows controls of moving charges

memory.numa_stat Shows memory usage per NUMA node

memory.oom_control Sets or shows the OOM controls

memory.pressure_level Sets memory pressure notifications

memory.soft_limit_in_bytes Sets or shows soft limit of memory usage

memory.stat Shows various statistics

memory.swappiness Sets or shows swappiness level

memory.usage_in_bytes Shows current memory usage

memory.use_hierarchy Sets whether memory usage is accounted for and reclaimed hierarchically from child cgroups

tasks Attaches tasks to the cgroup
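
A couple of the writable files above can be exercised directly. The following is only a minimal sketch, assuming the /cgroup/memory/app1 hierarchy created earlier; the 512M and 10 values are purely illustrative:

# Sketch: a soft limit is only enforced when the system is under memory pressure,
# and swappiness tunes how aggressively this cgroup's pages are swapped out.
echo 512M > /cgroup/memory/app1/memory.soft_limit_in_bytes
echo 10 > /cgroup/memory/app1/memory.swappiness
grep -E 'cache|rss|swap' /cgroup/memory/app1/memory.stat   # spot-check the accounting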

Limiting the memory available to a process might trigger the Out of Memory (OOM) killer, which might kill the
running task. If this is not the desired behavior and you prefer the process to be suspended, waiting for memory to
be freed, the OOM killer can be disabled:

Copy

root@server:~# cat /cgroup/memory/app1/memory.oom_control


oom_kill_disable 0
under_oom 0
root@server:~# echo 1 > /cgroup/memory/app1/memory.oom_control
root@server:~#
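
With the OOM killer disabled, a task that exceeds the limit is paused rather than killed. A quick sketch of how this could be observed, reading the same file as above:

# Sketch: oom_kill_disable should now read 1; under_oom flips to 1 while a task
# in the cgroup is blocked waiting for memory to be freed.
cat /cgroup/memory/app1/memory.oom_control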

The memory cgroup presents a wide slew of accounting statistics in the memory.stat file, which can be of
interest:

Copy

root@server:~# head /cgroup/memory/app1/memory.stat


cache 43325 # Number of bytes of page cache memory
rss 55d43 # Number of bytes of anonymous and swap cache memory
rss_huge 0 # Number of anonymous transparent hugepages
mapped_file 2 # Number of bytes of mapped file
writeback 0 # Number of bytes of cache queued for syncing
pgpgin 0 # Number of charging events to the memory cgroup
pgpgout 0 # Number of uncharging events to the memory cgroup
pgfault 0 # Total number of page faults
pgmajfault 0 # Number of major page faults
inactive_anon 0 # Anonymous and swap cache memory on inactive LRU list

If you need to start a new task in the app1 memory hierarchy you can move the current shell process into the
tasks file, and all other processes started in this shell will be direct descendants and inherit the same cgroup
properties:


Copy

root@server:~# echo $$ > /cgroup/memory/app1/tasks


root@server:~# echo "The memory limit is now applied to all processes started from this shell"
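
A quick way to confirm that the inherited limit is being enforced from that shell is to generate some memory pressure and read back the accounting files. This is only a sketch, assuming the mount point used above; the dd size is illustrative, and the exact behavior depends on available swap and on whether the OOM killer was disabled earlier:

# Sketch: pages written to tmpfs are charged to the app1 cgroup of this shell.
dd if=/dev/zero of=/dev/shm/fill bs=1M count=2048 2>/dev/null || true
# Depending on swap and the OOM setting, dd may be slowed by reclaim, blocked, or killed.
cat /cgroup/memory/app1/memory.max_usage_in_bytes   # peak usage, capped near 1 GB
cat /cgroup/memory/app1/memory.failcnt              # number of times the limit was hit
rm -f /dev/shm/fill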

The cpu and cpuset subsystems


The cpu subsystem schedules CPU time to cgroup hierarchies and their tasks. It provides finer control over
CPU execution time than the default behavior of the Completely Fair Scheduler (CFS).

The cpuset subsystem allows for assigning CPU cores to a set of tasks, similar to the taskset command
in Linux.

The main benefits that the cpu and cpuset subsystems provide are better utilization per processor core
for highly CPU-bound applications. They also allow for distributing load between cores that are otherwise idle at
certain times of the day. In the context of multitenant environments running many LXC containers, the cpu and
cpuset cgroups allow for creating different instance sizes and container flavors, for example exposing only a
single core per container, with 40 percent of the scheduled work time.

As an example, let's assume we have two processes app1 and app2 , and we would like app1 to use 60
percent of the CPU time and app2 only 40 percent. We start by mounting the cgroup VFS:

Copy

root@server:~# mkdir -p /cgroup/cpu


root@server:~# mount -t cgroup -o cpu cpu /cgroup/cpu
root@server:~# cat /proc/mounts | grep cpu
cpu /cgroup/cpu cgroup rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cpu 0 0

Then we create two child hierarchies:

Copy

root@server:~# mkdir /cgroup/cpu/limit_60_percent


root@server:~# mkdir /cgroup/cpu/limit_40_percent

Next, assign CPU shares for each, where app1 will get 60 percent and app2 will get 40 percent of the
scheduled time:

Copy

root@server:~# echo 600 > /cgroup/cpu/limit_60_percent/cpu.shares


root@server:~# echo 400 > /cgroup/cpu/limit_40_percent/cpu.shares

Finally, we move the PIDs in the tasks files:

Copy

root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/cpu/limit_60_percent/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/cpu/limit_40_percent/tasks; done
root@server:~#

The cpu subsystem contains the following control files:

Copy

root@server:~# ls -la /cgroup/cpu/limit_60_percent/


drwxr-xr-x 2 root root 0 Aug 25 15:13 .
drwxr-xr-x 4 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 15:13 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 15:13 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 15:13 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 25 15:13 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Aug 25 15:13 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Aug 25 15:14 cpu.shares
-r--r--r-- 1 root root 0 Aug 25 15:13 cpu.stat
-rw-r--r-- 1 root root 0 Aug 25 15:13 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 15:13 tasks
root@server:~#

Here's a brief explanation of each:

File Description

cpu.cfs_period_us CPU resource reallocation in microseconds

cpu.cfs_quota_us Run duration of tasks in microseconds during one cpu.cfs_period_us period

cpu.shares Relative share of CPU time available to the tasks

cpu.stat Shows CPU time statistics

tasks Attaches tasks to the cgroup
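
Note that cpu.shares only expresses a relative weight; an absolute ceiling can be set with the quota and period files listed above. A minimal sketch, assuming the limit_60_percent hierarchy created earlier; the 100 ms period and 40 ms quota are illustrative values:

# Sketch: cap the group to 40 ms of CPU time per 100 ms period (roughly 40% of
# one core), regardless of how many shares it holds.
echo 100000 > /cgroup/cpu/limit_60_percent/cpu.cfs_period_us
echo 40000 > /cgroup/cpu/limit_60_percent/cpu.cfs_quota_us
cat /cgroup/cpu/limit_60_percent/cpu.stat   # nr_throttled grows once the cap is hit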

The cpu.stat file is of particular interest:

Copy

root@server:~# cat /cgroup/cpu/limit_60_percent/cpu.stat


nr_periods 0 # number of elapsed period intervals, as specified in
# cpu.cfs_period_us
nr_throttled 0 # number of times a task was not scheduled to run
# because of quota limit
throttled_time 0 # total time in nanoseconds for which tasks have been
# throttled
root@server:~#

To demonstrate how the cpuset subsystem works, let's create cpuset hierarchies named app1 ,
containing CPUs 0 and 1. The app2 cgroup will contain only CPU 1:


Copy

root@server:~# mkdir /cgroup/cpuset


root@server:~# mount -t cgroup -o cpuset cpuset /cgroup/cpuset
root@server:~# mkdir /cgroup/cpuset/app{1..2}
root@server:~# echo 0-1 > /cgroup/cpuset/app1/cpuset.cpus
root@server:~# echo 1 > /cgroup/cpuset/app2/cpuset.cpus
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/cpuset/app1/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/cpuset/app2/tasks; done
root@server:~#
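
On many kernels, a task cannot be attached to a cpuset until its memory-node list is also populated; if the echo into the tasks file fails with "No space left on device", set cpuset.mems first. A minimal sketch, assuming a single NUMA node numbered 0:

# Sketch: give both cpusets access to memory node 0 before attaching tasks.
echo 0 > /cgroup/cpuset/app1/cpuset.mems
echo 0 > /cgroup/cpuset/app2/cpuset.mems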

To check if the app1 process is pinned to CPU 0 and 1, we can use:

Copy

root@server:~# taskset -c -p $(pidof app1)


pid 8052's current affinity list: 0,1
root@server:~# taskset -c -p $(pidof app2)
pid 8052's current affinity list: 1
root@server:~#

The cpuset app1 hierarchy contains the following files:

Copy

root@server:~# ls -la /cgroup/cpuset/app1/


drwxr-xr-x 2 root root 0 Aug 25 16:47 .
drwxr-xr-x 5 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 16:47 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 16:47 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 16:47 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Aug 25 17:57 cpuset.cpus
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mems
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Aug 25 16:47 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 17:13 tasks
root@server:~#

A brief description of the control files is as follows:

File Description

cpuset.cpu_exclusive Checks if other cpuset hierarchies share the settings defined in the current group

cpuset.cpus List of the physical numbers of the CPUs on which processes in that cpuset are allowed to execute

cpuset.mem_exclusive Checks if the cpuset should have exclusive use of its memory nodes

cpuset.mem_hardwall Checks if each task's user allocation should be kept separate

cpuset.memory_migrate Checks if a page in memory should migrate to a new node if the values in cpuset.mems change

cpuset.memory_pressure Contains the running average of the memory pressure created by the processes

cpuset.memory_spread_page Checks if filesystem buffers should spread evenly across the memory nodes

cpuset.memory_spread_slab Checks if kernel slab caches for file I/O operations should spread evenly across the cpuset

cpuset.mems Specifies the memory nodes that tasks in this cgroup are permitted to access

cpuset.sched_load_balance Checks if the kernel should balance load across the CPUs in the cpuset by moving processes from overloaded CPUs to less utilized ones

cpuset.sched_relax_domain_level Contains the width of the range of CPUs across which the kernel should attempt to balance loads

notify_on_release Checks if the hierarchy should receive special handling after it is released and no process is using it

tasks Attaches tasks to the cgroup

The cgroup freezer subsystem


The freezer subsystem can be used to suspend the current state of running tasks for the purposes of
analyzing them, or to create a checkpoint that can be used to migrate the process to a different server. Another
use case is when a process is negatively impacting the system and needs to be temporarily paused, without losing
its current state data.

The next example shows how to suspend the execution of the top process, check its state, and then resume it.

First, mount the freezer subsystem and create the new hierarchy:

Copy

root@server:~# mkdir /cgroup/freezer


root@server:~# mount -t cgroup -o freezer freezer /cgroup/freezer
root@server:~# mkdir /cgroup/freezer/frozen_group
root@server:~# cat /proc/mounts | grep freezer
freezer /cgroup/freezer cgroup rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.
root@server:~#


In a new terminal, start the top process and observe how it periodically refreshes. Back in the original
terminal, add the PID of top to the frozen_group task file and observe its state:

Copy

root@server:~# echo 25731 > /cgroup/freezer/frozen_group/tasks


root@server:~# cat /cgroup/freezer/frozen_group/freezer.state
THAWED
root@server:~#

To freeze the process, echo the following:

Copy

root@server:~# echo FROZEN > /cgroup/freezer/frozen_group/freezer.state


root@server:~# cat /cgroup/freezer/frozen_group/freezer.state
FROZEN
root@server:~# cat /proc/25731/status | grep -i state
State: D (disk sleep)
root@server:~#

Notice how the top process output is not refreshing anymore, and upon inspection of its status file, you can see
that it is now in the blocked state.

To resume it, execute the following:

Copy

root@server:~# echo THAWED > /cgroup/freezer/frozen_group/freezer.state


root@server:~# cat /proc/25731/status | grep -i state
State: S (sleeping)
root@server:~#

Inspecting the frozen_group hierarchy yields the following files:

Copy

root@server:~# ls -la /cgroup/freezer/frozen_group/


drwxr-xr-x 2 root root 0 Aug 25 20:50 .
drwxr-xr-x 4 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 20:50 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 20:50 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 20:50 cgroup.procs
-r--r--r-- 1 root root 0 Aug 25 20:50 freezer.parent_freezing
-r--r--r-- 1 root root 0 Aug 25 20:50 freezer.self_freezing
-rw-r--r-- 1 root root 0 Aug 25 21:00 freezer.state
-rw-r--r-- 1 root root 0 Aug 25 20:50 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 20:59 tasks
root@server:~#

The few files of interest are described in the following table:


File Description

freezer.parent_freezing Shows the parent state. Shows 0 if none of the cgroup's ancestors is FROZEN ; otherwise, 1 .

freezer.self_freezing Shows the self-state. Shows 0 if the self-state is THAWED ; otherwise, 1 .

freezer.state Sets the self-state of the cgroup to either THAWED or FROZEN .

tasks Attaches tasks to the cgroup.
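
Keep in mind that freezing is asynchronous: immediately after writing FROZEN, freezer.state may briefly read FREEZING while the tasks are being stopped. A small sketch of how the transition could be waited on, assuming the frozen_group hierarchy used above:

# Sketch: poll freezer.state until the whole group reports FROZEN.
echo FROZEN > /cgroup/freezer/frozen_group/freezer.state
while [ "$(cat /cgroup/freezer/frozen_group/freezer.state)" != "FROZEN" ]; do
    sleep 0.1   # state reads FREEZING while tasks are still being stopped
done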

Using userspace tools to manage cgroups and persist changes


Working with the cgroups subsystems by manipulating directories and files directly is a fast and convenient way
to prototype and test changes; however, it comes with a few drawbacks: the changes will not persist across
a server restart, and there is not much error reporting or handling.

To address this, there are packages that provide userspace tools and daemons that are quite easy to use. Let's see
a few examples.

To install the tools on Debian/Ubuntu, run the following:

Copy

root@server:~# apt-get install -y cgroup-bin cgroup-lite libcgroup1


root@server:~# service cgroup-lite start

On RHEL/CentOS, execute the following:

Copy

root@server:~# yum install libcgroup


root@server:~# service cgconfig start

To mount all subsystems, run the following:

Copy

root@server:~# cgroups-mount
root@server:~# cat /proc/mounts | grep cgroup
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-ag
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agen
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-re
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,release_agent=/run/cgmanager/agents/cgm-release-ag
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cp
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,release_agent=/run/cgmanager/agents/cgm-release-


Notice from the preceding output the location of the cgroups: /sys/fs/cgroup. This is the default location
on many Linux distributions, and in most cases the various subsystems have already been mounted.
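
If the libcgroup tools are installed, the mounted controllers can also be listed with lssubsys, for example:

# Sketch: list every cgroup controller together with its mount point.
lssubsys -am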

To verify what cgroup subsystems are in use, we can check with the following commands:

Copy

root@server:~# cat /proc/cgroups


#subsys_name hierarchy num_cgroups enabled
cpuset 7 1 1
cpu 8 2 1
cpuacct 9 1 1
memory 10 2 1
devices 11 1 1
freezer 12 1 1
blkio 6 3 1
perf_event 13 1 1
hugetlb 14 1 1

Next, let's create a blkio hierarchy and add an already running process to it with cgclassify . This is
similar to what we did earlier, by creating the directories and the files by hand:

Copy

root@server:~# cgcreate -g blkio:high_io


root@server:~# cgcreate -g blkio:low_io
root@server:~# cgclassify -g blkio:low_io $(pidof app1)
root@server:~# cat /sys/fs/cgroup/blkio/low_io/tasks
8052
root@server:~# cgset -r blkio.weight=1000 high_io
root@server:~# cgset -r blkio.weight=100 low_io
root@server:~# cat /sys/fs/cgroup/blkio/high_io/blkio.weight
1000
root@server:~#

Now that we have defined the high_io and low_io cgroups and added a process to them, let's generate a
configuration file that can be used later to reapply the setup:

Copy


root@server:~# cgsnapshot -s -f /tmp/cgconfig_io.conf


cpuset = /sys/fs/cgroup/cpuset;
cpu = /sys/fs/cgroup/cpu;
cpuacct = /sys/fs/cgroup/cpuacct;
memory = /sys/fs/cgroup/memory;
devices = /sys/fs/cgroup/devices;
freezer = /sys/fs/cgroup/freezer;
blkio = /sys/fs/cgroup/blkio;
perf_event = /sys/fs/cgroup/perf_event;
hugetlb = /sys/fs/cgroup/hugetlb;
root@server:~# cat /tmp/cgconfig_io.conf
# Configuration file generated by cgsnapshot
mount {
blkio = /sys/fs/cgroup/blkio;
}
group low_io {
blkio {
blkio.leaf_weight="500";
blkio.leaf_weight_device="";
blkio.weight="100";
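
To reapply the saved configuration after a restart, the snapshot can be loaded with cgconfigparser, or placed where the cgconfig service reads it on boot. A minimal sketch, assuming the libcgroup tools installed earlier; the destination path follows the RHEL/CentOS convention:

# Sketch: reload the hierarchy and limits captured by cgsnapshot.
cgconfigparser -l /tmp/cgconfig_io.conf
# Or make it permanent on RHEL/CentOS, where cgconfig reads /etc/cgconfig.conf at boot:
cp /tmp/cgconfig_io.conf /etc/cgconfig.conf
service cgconfig restart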

To start a new process in the high_io group, we can use the cgexec command:

Copy

root@server:~# cgexec -g blkio:high_io bash


root@server:~# echo $$
19654
root@server:~# cat /sys/fs/cgroup/blkio/high_io/tasks
19654
root@server:~#

In the preceding example, we started a new bash process in the high_io cgroup , as confirmed by looking
at the tasks file.

To move an already running process to the memory subsystem, first we create the high_prio and
low_prio groups and move the task with cgclassify :

Copy

root@server:~# cgcreate -g cpu,memory:high_prio


root@server:~# cgcreate -g cpu,memory:low_prio
root@server:~# cgclassify -g cpu,memory:high_prio 8052
root@server:~# cat /sys/fs/cgroup/memory/high_prio/tasks
8052
root@server:~# cat /sys/fs/cgroup/cpu/high_prio/tasks
8052
root@server:~#

To set the memory and CPU limits, we can use the cgset command. Recall that earlier we used the
echo command to manually write the PIDs and memory limits to the tasks and
memory.limit_in_bytes files:

Copy


root@server:~# cgset -r memory.limit_in_bytes=1G low_prio


root@server:~# cat /sys/fs/cgroup/memory/low_prio/memory.limit_in_bytes
1073741824
root@server:~# cgset -r cpu.shares=1000 high_prio
root@server:~# cat /sys/fs/cgroup/cpu/high_prio/cpu.shares
1000
root@server:~#
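
The values can be read back with cgget, the read-side counterpart of cgset; a short sketch using the groups created above:

# Sketch: print selected parameters for both cgroups.
cgget -r cpu.shares -r memory.limit_in_bytes high_prio low_prio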

To see how the cgroup hierarchies look, we can use the lscgroup utility:

Copy

root@server:~# lscgroup
cpuset:/
cpu:/
cpu:/low_prio
cpu:/high_prio
cpuacct:/
memory:/
memory:/low_prio
memory:/high_prio
devices:/
freezer:/
blkio:/
blkio:/low_io
blkio:/high_io
perf_event:/
hugetlb:/
root@server:~#

The preceding output confirms the existence of the blkio , memory , and cpu hierarchies and their
children.

Once finished, you can delete the hierarchies with cgdelete , which deletes the respective directories on the
VFS:

Copy

root@server:~# cgdelete -g cpu,memory:high_prio


root@server:~# cgdelete -g cpu,memory:low_prio
root@server:~# lscgroup
cpuset:/
cpu:/
cpuacct:/
memory:/
devices:/
freezer:/
blkio:/
blkio:/low_io
blkio:/high_io
perf_event:/
hugetlb:/
root@server:~#

To completely clear the cgroups, we can use the cgclear utility, which will unmount the cgroup directories:


Copy

root@server:~# cgclear
root@server:~# lscgroup
cgroups can't be listed: Cgroup is not mounted
root@server:~#

Managing resources with systemd


With the increased adoption of systemd as an init system, new ways of manipulating cgroups were
introduced. For example, if the cpu controller is enabled in the kernel, systemd will create a cgroup for each
service by default. This behavior can be changed by adding or removing cgroup subsystems in the configuration
file of systemd , usually found at /etc/systemd/system.conf .

If multiple services are running on the server, the CPU resources will be shared equally among them by default,
because systemd assigns equal weights to each. To change this behavior for an application, we can edit its
service file and define the CPU shares, allocated memory, and I/O.

The following example demonstrates how to change the CPU shares, memory, and I/O limits for the nginx process:

Copy

root@server:~# vim /etc/systemd/system/nginx.service


.include /usr/lib/systemd/system/nginx.service
[Service]
CPUShares=2000
MemoryLimit=1G
BlockIOWeight=100

To apply the changes, first reload systemd, then restart nginx:

Copy

root@server:~# systemctl daemon-reload


root@server:~# systemctl restart nginx.service
root@server:~#

This will create and update the necessary control files in /sys/fs/cgroup/systemd and apply the limits.
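
On systemd versions that still accept these legacy directives, the same limits can also be applied at runtime with systemctl set-property, without editing the unit file by hand. This is only a sketch; dropping --runtime would persist the settings as a drop-in under /etc/systemd/system/nginx.service.d/:

# Sketch: apply CPU, memory, and block I/O limits to the running service.
systemctl set-property --runtime nginx.service CPUShares=2000 MemoryLimit=1G BlockIOWeight=100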
