Helper tools and kernel extensions for LXC containers
PlanetLab recently switched to using LXC - a container-based virtualization mechanism implemented in the Linux kernel. For managing containers, we modified our Node Manager to interface with libvirt, a de facto VM management framework. In the process of migrating our system we encountered some difficulties:
- By default, the /proc filesystem in containers was mounted in read-write mode, which allowed containers to run potentially dangerous commands such as rebooting the node via /proc/sysrq-trigger. While there were work-in-progress solutions to this problem based on implementations of mandatory access control in the kernel, they were either incomplete or required patches to be applied to the kernel, which conflicted with our requirement to not have to build our own kernel.
- Lack of support for shared IP addresses between containers. By default, libvirt populates containers with a virtual device with a private IP address. Traffic is relayed between the container and the outside using NAT. Since a number of PlanetLab services need to listen on the public IP address of a node, we needed some type of IP address sharing.
- Libvirt's support for LXC is thin. There is an excellent set of tools written by the developers of LXC but it is incompatible with libvirt. What we needed was something simple that would let administrators enter a container's namespaces for debugging purposes, and that would put users in containers when they logged in.
We address the problems mentioned above through the following tools. Their source code can be found in our git repository. The tools are basically functional, but have not been tested extensively. They are currently in beta deployment on PlanetLab.
- procprotect
A kernel module that implements simple ACLs for limiting access to entries in /proc. The user interface is simple: to prevent access to a file or directory, run echo <prefix> > /proc/procprotect. For example: echo sysrq > /proc/procprotect
- transforward
A kernel module that enables certain sockets to be hoisted into root context. With this module loaded, if a process in one container explicitly binds to an IP address assigned in another container, then the corresponding socket's network namespace gets switched to the second container. In effect, a process can bind to devices in other containers. Note that IP addresses that are to be made globally visible in this way must first be whitelisted.
- bind_public
A library that, when linked with an existing binary, forces it to bind to the current host's public IP address via transforward, courtesy of Jude Nelson.
- lxcsu-user
A tool for natively entering a container via the setns system call. Type lxcsu -n -i -m princeton_vcoblitz to get into the princeton_vcoblitz container. Requires the lxcsu kernel module described below.
- lxcsu
The setns system call in current stable kernels does not support the mount and pid namespaces. This kernel module encapsulates Eric Biederman's patches for this purpose, making it possible to switch mount namespaces without rebuilding the kernel. It is intended for use until the aforementioned patches make it into the mainline kernel.
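To illustrate the general mechanism that lxcsu-user relies on, here is a minimal sketch of entering another process's namespaces with setns. It is not the lxcsu source: the helper names ns_path and enter_namespaces are hypothetical, the sketch assumes a Linux kernel that exposes /proc/<pid>/ns/<name> entries, and it assumes the caller already knows the PID of a process running inside the target container. It must be run as root, and whether the mnt and pid namespaces can be entered depends on the kernel support discussed above.

```python
import ctypes
import os


def ns_path(pid, name):
    """Return the /proc path of one namespace of a process (hypothetical helper)."""
    return "/proc/%d/ns/%s" % (pid, name)


def enter_namespaces(pid, names):
    """Join the listed namespaces (e.g. "net", "ipc") of process `pid`.

    All namespace files are opened before any switch is made, so that
    changing one namespace cannot affect how the remaining paths resolve.
    """
    # setns is called through libc; CDLL(None) loads the symbols already
    # linked into the running process (assumption: glibc on Linux).
    libc = ctypes.CDLL(None, use_errno=True)
    fds = []
    try:
        for name in names:
            fds.append((name, os.open(ns_path(pid, name), os.O_RDONLY)))
        for name, fd in fds:
            # The second argument 0 means "accept any namespace type".
            if libc.setns(fd, 0) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err), ns_path(pid, name))
    finally:
        for _, fd in fds:
            os.close(fd)
```

A caller would typically run enter_namespaces(container_pid, ["net", "ipc"]) and then exec a shell, where container_pid is a placeholder for a PID obtained by other means.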