While using docker it is quite natural to feel like we are dealing with a virtual machine because we can install something in a container, it has an IP, file tree is different and whatnot. In reality, containers are just a mechanism that allows Linux processes to run with some degree of isolation and separation. Containers basically use few different primitives in Linux combined. By using Linux namespaces containers can make these happen.
What are namespaces? Namespaces are another Linux kernel primitive that can be used to control access and visibility of resources. Generally, when properties like names or permissions change, it is visible only within the namespace and it is not visible to the processes outside of it. So it makes it look like that the process has its own isolated copy of a given resource. Docker uses these properties of the namespace to provide isolation. Namespaces can map resources outside a namespace to a resource inside a namespace. Linux has at least 7 kinds of namespaces that cover different kinds of resources: Mount, UTS, IPC, PID, Network, User, Cgroup. Sometimes we these namespaces individually and sometimes we have to use them together because types like IPC, Cgroup, PID use file systems that need to be mounted. A name server does not stay open when there is no running process on it or a bind mount. That’s why a docker gets killed when the process inside it gets killed, it is because of how Linux System deals with its PID 1, which I am going to discuss later in this blog.
How would I find out the namespace of a process? We can find any information about a process by accessing the proc virtual file system. If we check carefully we should be able to find a directory called ns which contains symbolic links to the namespace. These symbolic link structures can provide us the information about the namespace type and inode number that the kernel uses. By comparing this inode number we can confirm if both of the processes are part of the same namespace or not.
How can we use namespace? A set of Linux syscall is used to work with namespaces, most commonly clone, unshare, etc. Clone is a sort of more general version of the fork system call that allows us to create a new child process. It is controlled by a set of flags that we can if we want the new child process to be in some new namespaces and we want to move the process to move itself from one namespace that it is currently resident in, to another existing namespace. unshare supports the same flags as clone. unshare is a syscall that a running process can use to move into a new namespace. Namespaces automatically close when nothing is holding them open so it ceases to exist when the running process stops or kills. Creating a user namespace requires no privilege but for another namespace, it requires privilege. So time to time you will be needed to use sudo. We can use commands like setns, nsenter, and ip netns to switch or enter from one namespace to another namespace.
Every docker container has its own hostname. Docker uses UTS namespace to achieve it. UTS namespaces isolate two system identifiers of node name (generally known as hostname) and the NIS domain name. On a Linux system, there might be multiple UTS namespaces. If one of those processes changes the hostname the change will be visible to the other processes within that namespace.
Lets use unshare command to create new namespaces and execute some command in those namespaces.
echo $$ readlink /proc/$$/ns/uts sudo unsure -u bash echo $$ readlink /proc/$$/ns/uts hostname new_host_name readlink /proc/$$/ns/mnt nsenter -t PID -u bash
Another fascinating feature of a Linux container is that it is successful to make us believe that every container has a completely different set of files but in reality, it uses Linux mount namespaces to provide a separate view of the container image mounted on the containers root file system. Mount namespace allows us to isolate a set of mount points seen by a process. Mount points are just tuples that keep track of the device, its pathname, and the paren of the mount. By tracking parents, mount point helps us to maintain a single directory hierarchy. So across namespaces, we can have processes that have a different set of mount points and they see a completely different single directory hierarchy tree. It hides the host file system from view to share data across containers or between the host and a container. Volumes are really amount added to the container’s mount namespace. There are ways to share subtrees that allow us to propagate mount points across different namespaces.
Docker makes sure that from one container we don’t access another process of a container. PID namespaces play a great role in it. PID namespace isolates process id. We can only see the id of a process when the process that is part of the same PID namespaces. Since every PID in a namespace is isolated and unique in that isolated space, we don’t need to worry about potential conflicts of id. We can use the PID namespace in many ways, if we want to freeze a process and move it to another core or another machine, we can do that without thinking about the conflict it may cause in terms of PID conflict. PID namespace also allows us to use PID 1 across different namespaces. PID 1 is a special thing on any UNIX system. PID 1 is the parent of all orphaned processes and is responsible for doing certain system initialization. PID namespace is different from all other namespaces. PID namespace maintains a hierarchy. This type of hierarchy has a depth of 32. Each PID namespace has a parent PID namespace that has been initiated by clone() or unshare(). For when the fork happens the process stays at the same level but when clone happens it goes to a lower level. All these 32 generations are initiated from PID 1 on that process. So as a parent of all processes, PID 1 can see all the processes in the namespace, but lower-level PIDs can’t see higher-level PIDs. When PID 1 goes down or gets killed Linux kernel panics and gets restarted. When PID 1 in a namespace goes away, the namespace is considered unusable
sudo unshare -p -f bash echo $$ ls /proc/
We are going to see many processes, but we should not see any processes, because PID should not have any processes yet, right? It is because we unshare only the PID namespace, and totally forgot to unshare mount point, so we can still have access to all processes that are part of PID1 across all namespaces.
Now let’s unshare mount and PID namespace together:
sudo unshare -m -p -f bash mount -t proc none /proc
docker typically uses a separate network namespace per container. Network namespace a separated view of the network with different network resources like routing table rules, firewall rules the socket port, some directories in /proc/net and sys/class/net, and so on. Network namespaces can be connected using a virtual ethernet device pair or veth pair and these vets pair are being isolated using network namespace. These veth pair are connected to a Linux bridge to enable outbound connectivity. Kubernetes pods and ECS tasks get a unified view of the network and a single IP address exposed to address all of the containers in the same pod or task. Every container has its own socket port number space. It can use some NAT rules on the container we could run a web server on port 80.
ip link sudo unshare —net ip link touch /var/run/netns/new mount —bind /proc/$$/ns/net /var/run/netns/new exit docker run —name busybox —detach busybox ps ax | grep busybox sudo redlink /proc/PID/ns/net docker inspect busybox | jq ..NetworkSettings.IPAddress nsenter --target PID --net ifconfig eth0 ip link | grep veth ifconfig vethxxxxxx
Since dockers are just a clever use of Linux namespaces, we can interchangeably use all namespace-related commands and tools to debug dockers as well. For example, to enter namespaces we can use commands like setns, nsenter, ip netns. To demonstrate that we can run a docker image, and find the process id of that image, and then let’s use nsenter command to access to enter the namespace and run commands inside that namespace that has been created by docker.
docker run —name redis —detach redis ps ax | grep redis nsenter —target PID —mount /usr/local/bin/redis-server