Linux Container

Linux Containers: Basic Concepts
Lucian Carata
FRESCO Talklet, 3 Oct 2014
Underlying kernel mechanisms
cgroups: manage resources for groups of processes
namespaces: per-process resource isolation
seccomp: limit available system calls
capabilities: limit available privileges
CRIU: checkpoint/restore (with kernel support)
These mechanisms are orthogonal and are combined to implement actual container functionality.
cgroups: user space view
low-level filesystem interface similar to sysfs (/sys) and procfs (/proc)
new filesystem type “cgroup”, default location in /sys/fs/cgroup
cgroup hierarchies: each subsystem can be used at most once*
subsystems (controllers): cpu, cpuacct, cpuset, memory, hugetlb, devices, blkio, net_cls, net_prio, freezer, perf
[Figure: example hierarchies under /sys/fs/cgroup; TL = top-level cgroup (mount); some controllers are marked as built as kernel module]
/sys/fs/cgroup
    TL /cpu (cpu, cpuacct)
        /highpriority
        /normal
        /experiment_1
    TL /mem (memory)
        /opus
        /normal
        /experiment_1
* or, if a new top-level cgroup is created with an already existing combination of subsystems, the previous top-level cgroup will be used behind the scenes
● systemd pre-mounts directories with certain controllers, which makes new hierarchies (with different controller combinations) difficult to achieve
● each process can appear at most once within a cgroup hierarchy (from the top level towards its descendants)
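Because a process appears at most once per hierarchy, the kernel can report its membership as one line per hierarchy in /proc/<pid>/cgroup, in the form hierarchy-ID:controller-list:cgroup-path. The sketch below parses that format; the function names and the sample text are illustrative assumptions, not part of the talk.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts.

    Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the cgroup path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def cgroups_of(pid="self"):
    """Return the cgroup membership of a process, or [] if /proc is unavailable."""
    proc_file = Path("/proc") / str(pid) / "cgroup"
    try:
        return parse_proc_cgroup(proc_file.read_text())
    except OSError:
        return []


# Hypothetical sample mirroring the hierarchies on the slide.
SAMPLE = "3:cpu,cpuacct:/highpriority\n2:memory:/opus\n"

for entry in parse_proc_cgroup(SAMPLE):
    print(entry["hierarchy"], entry["controllers"], entry["path"])
```

Calling cgroups_of() with no argument inspects the current process; on a system without /proc it simply returns an empty list.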
cgroups: user space view
cgroup hierarchies
[Figure: the hierarchies under /sys/fs/cgroup (TL /cpu with /highpriority, /normal, /experiment_1; TL /mem with /opus, /normal, /experiment_1), annotated with the files found in the cgroup directories of the /cpu hierarchy]
common:  tasks, cgroup.procs, release_agent, notify_on_release, cgroup.clone_children, cgroup.sane_behavior
cpuacct: cpuacct.stat, cpuacct.usage, cpuacct.usage_percpu
cpu:     cpu.stat, cpu.shares, cpu.cfs_period_us, cpu.cfs_quota_us, cpu.rt_period_us, cpu.rt_runtime_us
other controllers: cpuset, memory, hugetlb, devices, blkio, net_cls, net_prio, freezer, perf
● by default, the top-level cgroup contains all running tasks. A cgroup created as a subdirectory starts with no tasks; tasks must be added manually by writing their PIDs to its “tasks” file
● release_agent is only present in the top-level cgroup and contains a command to be run when the last process of a cgroup terminates. notify_on_release must be set in each cgroup for which that command should actually execute
● cpu controller: by default, the kernel scheduler aims to give equal cpu time to all processes. cgroups can be used for fair grouping between arbitrary sets of processes (for example, 30 apache processes and 10 postgres processes: placed in two cgroups with equal cpu.shares, each group gets half of the cpu under contention, instead of apache getting three quarters)
● net_cls: interface for tagging network packets with a class identifier (so that rules based on packet class can be added later)
● the memory controller has hierarchical support and allows for soft limits (a cgroup can use as much memory as needed provided there is no memory contention and the hard limit is not exceeded)
○ hierarchical support means that child cgroups contribute to the memory usage of their ancestors. If an ancestor exceeds a limit, memory will be reclaimed from the ancestor and all its children
● cpuset is also hierarchical
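The steps described above (create a child cgroup, set a control value, add a task) are plain filesystem operations. A minimal sketch follows, assuming the cgroup v1 file names used on these slides (tasks, cpu.shares). The helper names are hypothetical, and the demo runs against a scratch directory rather than a real cgroup mount, so it only illustrates the sequence of writes.

```python
import os
import tempfile
from pathlib import Path


def create_cgroup(hierarchy_root, name):
    """Create a child cgroup; in cgroupfs this is just a mkdir."""
    cgroup_dir = Path(hierarchy_root) / name
    cgroup_dir.mkdir(parents=True, exist_ok=True)
    return cgroup_dir


def write_control_file(cgroup_dir, filename, value):
    """Write one value to a control file such as cpu.shares or tasks."""
    with open(Path(cgroup_dir) / filename, "w") as f:
        f.write(f"{value}\n")


def add_task(cgroup_dir, pid):
    """Move a task into the cgroup by writing its PID to the tasks file."""
    write_control_file(cgroup_dir, "tasks", pid)


def demo(hierarchy_root):
    """Create experiment_1, give it 512 cpu shares, and add this process."""
    group = create_cgroup(hierarchy_root, "experiment_1")
    write_control_file(group, "cpu.shares", 512)
    add_task(group, os.getpid())
    return group


# Dry run against a scratch directory, so no root privileges are needed.
# On a real cgroup v1 system the root would be e.g. /sys/fs/cgroup/cpu.
with tempfile.TemporaryDirectory() as scratch:
    group = demo(scratch)
    shares = (group / "cpu.shares").read_text().strip()
    tasks = (group / "tasks").read_text().strip()
    print(group.name, shares, tasks == str(os.getpid()))
```

On a real mount the kernel creates the control files itself when the directory is made, and writing to them usually requires root.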
cgroups: kernel space view
task_struct
    css_set *cgroups
    list_head cg_list        (list of all tasks using the same css_set)
css_set (include/linux/cgroup.h)
    list_head tasks
    cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]
kernel code for attaching/detaching a task from a css_set: init/main.c, fork(), exit()
at system boot, an initial css_set (init_css_set) is created
a css_set groups all the tasks that are under the same state configuration for all enabled controllers (they share cgroups in all hierarchies)
the cgroup hierarchy is not directly accessible from a given task (that lookup is required less often)
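To make the css_set idea concrete, here is a small user-space model (not kernel code): tasks are grouped by their full cgroup membership across all hierarchies, so that tasks sharing every cgroup end up in the same set. The PIDs and cgroup assignments are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tasks and their cgroup in each of two hierarchies (cpu, mem).
TASK_CGROUPS = {
    101: {"cpu": "/highpriority", "mem": "/opus"},
    102: {"cpu": "/highpriority", "mem": "/opus"},
    103: {"cpu": "/normal", "mem": "/opus"},
    104: {"cpu": "/normal", "mem": "/normal"},
}


def build_css_sets(task_cgroups):
    """Group tasks by their full cgroup membership across all hierarchies.

    The dictionary key plays the role of a css_set (one cgroup per
    hierarchy); the value plays the role of its list of tasks.
    """
    css_sets = defaultdict(list)
    for pid, membership in sorted(task_cgroups.items()):
        key = tuple(sorted(membership.items()))
        css_sets[key].append(pid)
    return dict(css_sets)


css_sets = build_css_sets(TASK_CGROUPS)
for key, pids in css_sets.items():
    print(dict(key), "->", pids)
```

Tasks 101 and 102 share one set because they agree in both hierarchies; 103 and 104 each get their own, since differing in a single hierarchy is enough to require a separate css_set.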
include/linux/cgroup_subsys.h
cgroup_subsys
    int (*attach)(...)
    void (*fork)(...)
    void (*exit)(...)
    void (*bind)(...)
    ...
    const char* name;
    cgroupfs_root *root;
    cftype *base_cftypes
subsystem instances:
    cgroup_subsys cpuset_subsys
    cgroup_subsys freezer_subsys
    cgroup_subsys mem_cgroup_subsys
    ...
Example instance: cgroup_subsys cpuset_subsys
    .base_cftypes = files
cgroups summary
cgroup hierarchies: each subsystem (controller) can be used at most once*
[Figure repeated from the user space view: /sys/fs/cgroup with top-level cgroups (mounts) /cpu (cpu, cpuacct) and /mem (memory), each with child cgroups]
● demo: show /sys/fs/cgroup in a terminal
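Besides browsing /sys/fs/cgroup, the controller-to-hierarchy mapping can be read from /proc/cgroups, which lists one row per controller with the columns subsys_name, hierarchy, num_cgroups and enabled. A minimal parsing sketch, with hypothetical function names and made-up sample values:

```python
def parse_proc_cgroups(text):
    """Parse /proc/cgroups: one row per controller known to the kernel.

    Columns: subsys_name, hierarchy, num_cgroups, enabled.
    """
    controllers = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip the header line and blanks
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers[name] = {
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": enabled == "1",
        }
    return controllers


def controllers_by_hierarchy(controllers):
    """Group controller names by the hierarchy ID they are attached to."""
    grouped = {}
    for name, info in sorted(controllers.items()):
        grouped.setdefault(info["hierarchy"], []).append(name)
    return grouped


# Hypothetical sample: cpu and cpuacct share hierarchy 3, memory is alone
# on hierarchy 2, and freezer (hierarchy 0) is not mounted on a v1 hierarchy.
SAMPLE = """#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t3\t4\t1
cpuacct\t3\t4\t1
memory\t2\t4\t1
freezer\t0\t1\t1
"""

print(controllers_by_hierarchy(parse_proc_cgroups(SAMPLE)))
```

Controllers that share a hierarchy ID were mounted together, which matches the "each subsystem can be used at most once" rule: every controller shows exactly one hierarchy.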
namespaces: user space view
Namespaces limit the scope of kernel-side names and data structures at process granularity
mnt   (mount points, filesystems)             CLONE_NEWNS
pid   (processes)                             CLONE_NEWPID
net   (network stack)                         CLONE_NEWNET
ipc   (System V IPC)                          CLONE_NEWIPC
uts   (unix timesharing: domain name, etc)    CLONE_NEWUTS
user  (UIDs)                                  CLONE_NEWUSER
The main purpose of a namespace is to isolate whatever it contains from other namespaces running on the same kernel
Three system calls for management
clone(): new process, new namespace, attach process to ns
unshare(): new namespace, attach current process to it
setns(int fd, int nstype): join an existing namespace
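As an illustration of how the CLONE_NEW* flags are passed to these calls, the sketch below combines flags by namespace name and wraps unshare(2) through ctypes. The wrapper and the name-to-flag table are assumptions made for this example; the flag values are those from <linux/sched.h>. Only the flag arithmetic runs when the block is executed, since the call itself usually needs root.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_flags(*names):
    """OR together the CLONE_NEW* flags for the given namespace names."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare(*names):
    """Move the calling process into new namespaces of the given types.

    Thin wrapper over unshare(2); usually needs CAP_SYS_ADMIN (root).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ctypes.c_int(namespace_flags(*names))) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


print(hex(namespace_flags("uts", "net")))
```

Run as root, unshare("uts") would give the process a private copy of the hostname and domain name, which it could then change without affecting the rest of the system.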
each namespace is identified by a unique inode
six entries (inodes) added to /proc/<pid>/ns/
two processes are in the same namespace if they see the same inode for equivalent namespace types (mnt, net, user, ...)
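The inode comparison can be done from user space by reading the symbolic links in /proc/<pid>/ns/, whose targets look like uts:[4026531838]. A minimal sketch, with hypothetical helper names and a made-up inode number in the sample:

```python
import os
import re


def parse_ns_link(target):
    """Split a /proc/<pid>/ns/<type> link target like 'uts:[4026531838]'."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespace_inode(pid, nstype):
    """Return the inode number identifying the namespace of a process."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{nstype}"))[1]


def same_namespace(pid_a, pid_b, nstype):
    """True if both processes see the same inode for this namespace type."""
    return namespace_inode(pid_a, nstype) == namespace_inode(pid_b, nstype)


# Hypothetical link target, in the format shown by `ls -l /proc/self/ns`.
print(parse_ns_link("uts:[4026531838]"))
```

For example, same_namespace(1, os.getpid(), "net") would report whether the current process shares the network namespace of PID 1; reading another process's entries may require suitable permissions.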
User space utilities
* iproute2 (ip netns add, etc)
* unshare, nsenter (part of util-linux)
* shadow, shadow-utils (for user ns)
nsenter is a wrapper around setns
unshare has support for all 6 namespaces
namespaces: kernel space view
task_struct
    struct nsproxy *nsproxy
    struct cred *cred
nsproxy (include/linux/nsproxy.h)
    atomic_t count
    struct uts_namespace *uts_ns
    struct ipc_namespace *ipc_ns
    struct mnt_namespace *mnt_ns
    struct pid_namespace *pid_ns_for_children
    struct net *net_ns
cred (include/linux/cred.h)
    ...
    struct user_namespace *user_ns
include/linux/nsproxy.h
    nsproxy* task_nsproxy(struct task_struct *tsk)
For each namespace type, a default namespace exists (the global namespace)
struct nsproxy is shared by all tasks with the same set of namespaces
Example for uts namespace
task_struct
    struct nsproxy *nsproxy
    ...
nsproxy
    struct uts_namespace *uts_ns
    ...
new_utsname (include/uapi/linux/utsname.h)
    char sysname []
    char nodename []
    char release []
    char version []
    char machine []
    char domainname []
global access to hostname: system_utsname.nodename
namespace-aware access to hostname: (&current->nsproxy->uts_ns->name)->nodename
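From user space, the same fields are visible through uname(2), which returns the values of the calling process's uts namespace. A minimal sketch using Python's os.uname(); note that its result exposes five of the six fields (domainname is not included), and the function name is hypothetical.

```python
import os
import socket


def uts_fields():
    """Return the new_utsname fields visible to this process via uname(2)."""
    info = os.uname()
    return {
        "sysname": info.sysname,
        "nodename": info.nodename,
        "release": info.release,
        "version": info.version,
        "machine": info.machine,
    }


fields = uts_fields()
# gethostname() normally reports the same nodename that uname() does.
print(fields["nodename"], socket.gethostname())
```

A process placed in a new uts namespace can change its hostname, and only processes in that namespace will see the new nodename here.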
Example for net namespace
task_struct
    struct nsproxy *nsproxy
    ...
nsproxy
    struct net *net_ns
    ...
net (include/net/net_namespace.h)
    Logical copy of the network stack:
        loopback device
        all network tables (routing, etc)
        all sockets
        procfs and sysfs entries
a network device belongs to exactly one network namespace
a socket belongs to exactly one network namespace
a new network namespace only includes the loopback device
communication between namespaces using veth or unix sockets
namespaces summary
mnt (mount points, filesystems)
pid (processes)
net (network stack)
ipc (System V IPC)
uts (unix timesharing: domain name, etc)
user (UIDs)
Containers
A light form of resource virtualization based on kernel mechanisms
A container is a userspace construct
Multiple containers run on top of the same kernel, each with the illusion that it is the only one using resources (cpu, memory, disk, network)
some implementations offer support for:
    container templates
    deployment / migration
    union filesystems
[Figure taken from the Docker documentation]
Container solutions
Mainline
    Google containers (lmctfy)
        uses cgroups only, offers CPU & memory isolation
        no isolation for: disk I/O, network, filesystem, checkpoint/restore
        adds some cgroup files: cpu.lat, cpuacct.histogram
    LXC: userspace containerisation tools
    Docker
    systemd-nspawn
Forks
    Vserver, OpenVZ
Container solutions: LXC
An LXC container is a userspace process created with the clone() system call
    with its own pid namespace
    with its own mnt namespace
    net namespace (configurable): lxc.network.type
Offers container templates: /usr/share/lxc/templates
    shell scripts
    lxc-create -t ubuntu -n containerName
    also creates cgroup /sys/fs/cgroup/<controller>/lxc/containerName
Container solutions: Docker
A Linux container engine
    multiple backend drivers
    application- rather than machine-centric
    app build tools
    diff-based deployment of updates (AUFS)
    versioning (git-like) and reuse
    links (tunnels) between containers
[Figure taken from the Docker documentation]
Questions?
Thank you! Lucian Carata
lc525@cam.ac.uk
More details
cgroups: http://media.wix.com/ugd/295986_d73d8d6087ed430c34c21f90b0b607fd.pdf
namespaces: http://lwn.net/Articles/531114/ (and series)
