Core

To a Senior Backend Engineer, "Docker" is often a misnomer. We usually say "Docker" when we actually mean a stack of components, several of them standardized by the Open Container Initiative (OCI), interacting with Linux kernel primitives.

At its core, Docker is not a virtualization technology; it is a process isolation technology. It is a user-space tool that manipulates kernel features to trick a process into thinking it has its own dedicated machine.

The sections below break Docker's core theory down into Architecture, Kernel Primitives, Storage, and Networking.


1. The Architecture: It's Not Monolithic

Years ago, Docker was a single binary (docker) that did everything. Today, it is a modular stack. When you run docker run, a relay race occurs between distinct components.

The Stack

  1. Docker Client (CLI): The user interface. It converts your commands into REST API calls sent to the daemon.

  2. Dockerd (The Daemon): The API server. It manages higher-level objects like images, networks, and volumes. It does not run the containers. It delegates that to containerd.

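  3. Containerd: The industry-standard container runtime daemon (a CNCF graduated project). It supervises the container lifecycle (create, start, stop, delete). Depending on the Docker version and configuration, image pulls and layer storage are handled either by dockerd itself or by containerd's image store; container networking stays with dockerd. To run a container, containerd starts a shim, which in turn invokes runc.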

  4. Runc: The low-level OCI Runtime. This is the binary that actually talks to the Kernel. It sets up the namespaces and cgroups, spawns the process, and then exits.

  5. Containerd-Shim: A small piece of code that sits between containerd and the container.

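    • Why it exists: It becomes the parent of the container process, which lets runc exit (keeping the runtime lightweight) and lets containerd be restarted or upgraded without killing running containers (the same holds for dockerd when its live-restore option is enabled). It also holds the container's stdout/stderr open and reports the exit status back to containerd.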
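To make the first hop of the stack concrete, here is a minimal sketch of what the CLI's "REST API calls" might look like on the wire. It assumes the daemon listens on the default Unix socket /var/run/docker.sock and uses the unversioned Engine API paths /containers/create and /containers/{id}/start; real clients normally add a version prefix and many more fields. The helper names are hypothetical.

```python
import json
import socket

DOCKER_SOCKET = "/var/run/docker.sock"  # assumed default daemon socket on Linux


def build_create_request(image: str, name: str) -> bytes:
    """Raw HTTP/1.1 request asking the daemon to create (not start) a container."""
    body = json.dumps({"Image": image}).encode()
    head = (
        f"POST /containers/create?name={name} HTTP/1.1\r\n"
        "Host: docker\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()
    return head + body


def build_start_request(container_id: str) -> bytes:
    """Raw HTTP/1.1 request asking the daemon to start an existing container."""
    return (
        f"POST /containers/{container_id}/start HTTP/1.1\r\n"
        "Host: docker\r\n"
        "Content-Length: 0\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()


def send(request: bytes, path: str = DOCKER_SOCKET) -> bytes:
    """Send one request over the Unix socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(request)
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)
```

Roughly speaking, docker run is a create call followed by a start call. Calling send() needs a running daemon and permission to open the socket; the two builder functions work anywhere.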

The "Life of a Request"

  1. You type docker run nginx.

  2. Client POSTs to Dockerd.

  3. Dockerd calls Containerd ("Get me an nginx container").

  4. The nginx image is pulled if it is not already present locally, and its layers and configuration are prepared as an OCI Bundle (a root filesystem plus a config.json).

  5. Containerd starts a Shim, which invokes Runc.

  6. Runc interacts with the Kernel to create the isolated environment.

  7. Runc starts the nginx process and exits.

  8. The nginx process is reparented to the Shim, which supervises it from then on.
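The OCI Bundle handed to runc in the steps above is a directory containing a root filesystem and a config.json. The sketch below builds a heavily reduced config.json as a Python dict. The field names follow the OCI runtime specification, but the function name, the default hostname, and the chosen values are illustrative assumptions; the file that containerd actually generates carries many more fields (mounts, capabilities, cgroup settings, and so on).

```python
import json


def minimal_oci_config(args, hostname="demo"):
    """Return a reduced OCI runtime config for a Linux container (illustrative only)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),            # command runc will exec inside the container
            "cwd": "/",
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": "rootfs", "readonly": False},  # relative to the bundle directory
        "hostname": hostname,
        "linux": {
            # One entry per namespace runc should create for the process.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(minimal_oci_config(["nginx", "-g", "daemon off;"]), indent=2))
```

The namespaces list is the link to the next section: each entry tells runc which kernel namespace to set up before it starts the process.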


2. The Kernel Primitives: The "Magic"

Docker relies primarily on two Linux Kernel features to create the illusion of a container.

A. Namespaces (Isolation)

Namespaces limit what a process can see. They partition kernel resources such that one set of processes sees one set of resources, while another set sees a different set.

  • PID Namespace: The container has its own Process ID 1. Inside, nginx might be PID 1. On the host, it might be PID 14302.

  • MNT Namespace (Mount): The container has its own root filesystem (/). It cannot see the host's /home or /var unless explicitly mounted.

  • NET Namespace: The container gets its own network stack (interfaces, IP addresses, routing table, loopback device, and port space).

  • UTS Namespace: Allows the container to have its own hostname.

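  • IPC Namespace: Isolates System V IPC objects and POSIX message queues, so the container cannot attach to shared memory segments belonging to the host or to other containers.

  • USER Namespace: (Optional, and not enabled by default in Docker) Maps root inside the container to a non-privileged user on the host.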
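Namespace membership is visible from user space: on Linux, each entry under /proc/<pid>/ns is a symbolic link whose target looks like pid:[4026531836]. Two processes share a namespace exactly when those identifiers match. The following sketch reads and parses these links; the function names are hypothetical, and reading another process's entries may require elevated privileges.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531840]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (namespace kind, identifier)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid="self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace identifier (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def shared_namespaces(pid_a, pid_b) -> list[str]:
    """Names of the namespaces that two processes have in common."""
    a, b = namespaces_of(pid_a), namespaces_of(pid_b)
    return [name for name in a if a[name] == b.get(name)]
```

Comparing a containerized process with a host shell via shared_namespaces() would typically show a different pid, mnt, net, uts, and ipc namespace, while user is usually shared unless user namespace remapping is configured.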

B. Cgroups (Control Groups)

Cgroups limit how much a process can use. While Namespaces hide the neighbors, Cgroups build the walls.

  • Resource Limiting: Cap CPU time (quotas) and memory usage; a container that exceeds its memory limit has a process killed by the OOM Killer.

  • Prioritization: Give critical containers more CPU time during contention (shares/weights).

  • Accounting: Measure exactly how much resource a container consumes.
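Under cgroup v2, these limits end up as small text values in files such as cpu.max and memory.max inside the container's cgroup directory. The sketch below shows how a CPU and a memory limit might translate into those values. The function names are hypothetical, and the 100000 microsecond period is the usual default rather than a requirement.

```python
def cpu_max(cpus, period_us=100_000):
    """Value for cgroup v2 cpu.max: "<quota> <period>", or "max <period>" for no cap."""
    if cpus is None:
        return f"max {period_us}"
    # Quota is the CPU time (in microseconds) allowed per period.
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def memory_max(mebibytes):
    """Value for cgroup v2 memory.max in bytes, or "max" for no cap."""
    if mebibytes is None:
        return "max"
    return str(mebibytes * 1024 * 1024)


if __name__ == "__main__":
    # Roughly what a limit of 1.5 CPUs and 256 MiB would look like.
    print("cpu.max    =", cpu_max(1.5))
    print("memory.max =", memory_max(256))
```

A quota of 150000 per 100000 microsecond period means the group may use up to one and a half CPUs' worth of time in each period before it is throttled.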


3. The Filesystem: Layers & UnionFS

This is often the most misunderstood part of Docker theory. Docker images are not single files; they are a stack of read-only layers (tarballs).

Union File System (OverlayFS)

Docker uses a union filesystem (usually OverlayFS, through the overlay2 storage driver on modern Linux) to combine these layers into a single view.

  • LowerDir: The read-only image layers (e.g., Ubuntu Base -> Apt install Python -> Add App Code).

  • UpperDir: A thin, read-write layer created when the container starts.

  • MergedDir: The unified view the application sees.
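The three directories above map directly onto the options of an OverlayFS mount. The sketch below assembles that option string; the paths are hypothetical placeholders, not Docker's real on-disk layout. Two details are worth noting: OverlayFS expects the top-most lower layer first in lowerdir, and a writable mount also needs an empty workdir on the same filesystem as upperdir.

```python
def overlay_mount_options(image_layers, upperdir, workdir):
    """Build the -o option string for an OverlayFS mount.

    image_layers is ordered base first (the order in which an image is built);
    OverlayFS wants the top-most layer first, so the list is reversed.
    """
    lowerdir = ":".join(reversed(image_layers))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


if __name__ == "__main__":
    options = overlay_mount_options(
        ["/layers/ubuntu", "/layers/python", "/layers/app"],  # hypothetical paths
        upperdir="/containers/c1/upper",
        workdir="/containers/c1/work",
    )
    # The resulting command would be run as root on a Linux host.
    print(f"mount -t overlay overlay -o {options} /containers/c1/merged")
```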

Copy-on-Write (CoW)

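This mechanism is why Docker containers start quickly and take up very little space: starting a container adds only an empty writable layer on top of image layers that are shared, not copied.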

  1. Read: If your app reads config.ini located in a lower (image) layer, it reads it directly from there. No copy is made.

  2. Write: If your app tries to modify config.ini, OverlayFS first copies the whole file from the read-only LowerDir to the read-write UpperDir (a "copy-up"), and the write is then applied to that copy. The first write to a large file is therefore more expensive than later ones.

  3. Result: The container now sees its own copy. The original image remains untouched for other containers to use.
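The read and write paths above can be modeled in a few lines. This is a toy, in-memory simulation of the lookup and copy-up logic, not how OverlayFS is implemented; the class and method names are invented for illustration, and layers are plain dicts mapping a path to file contents.

```python
class OverlayView:
    """Toy model of one container's merged view over shared image layers."""

    def __init__(self, lower_layers):
        self.lower = lower_layers  # shared, read-only layers, base first
        self.upper = {}            # this container's private writable layer
        self.copy_ups = 0          # how many files were copied up so far

    def read(self, path):
        # The upper layer wins; otherwise search lower layers from top to base.
        if path in self.upper:
            return self.upper[path]
        for layer in reversed(self.lower):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        if path not in self.upper:
            try:
                original = self.read(path)  # copy-up: fetch the whole file first
                self.copy_ups += 1
            except FileNotFoundError:
                original = ""               # brand-new file, nothing to copy
            self.upper[path] = original
        self.upper[path] += data            # the write only touches the upper layer


if __name__ == "__main__":
    image = [{"config.ini": "debug=false\n"}]
    a, b = OverlayView(image), OverlayView(image)
    a.append("config.ini", "port=8080\n")
    print(repr(a.read("config.ini")))  # container a sees its modified copy
    print(repr(b.read("config.ini")))  # container b still sees the image's file
```

Both views share the same image list, which is why starting a second container costs almost nothing until it writes.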


4. Networking: The Bridge

How does a process inside a confined NET Namespace talk to the internet?

  • Veth Pairs (Virtual Ethernet): Docker creates a virtual cable. One end (eth0) sits inside the container. The other end (vethXXXX) sits on the host machine.

  • The Bridge (docker0): The host end of the cable is plugged into a virtual switch called docker0.

  • NAT (Network Address Translation): To get out to the internet, iptables rules on the host masquerade (MASQUERADE target) the traffic, rewriting the source address so that it appears to come from the host's own IP on the outgoing interface.
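The masquerading step can be pictured as a translation table kept by the host. The following toy model is not netfilter's implementation: it ignores protocols, timeouts, and port range limits, and every address in it is an assumed example (172.17.0.x for containers on the default bridge, documentation ranges for the host and the remote server). It only shows the idea of rewriting the source on the way out and reversing it for replies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Masquerade:
    """Toy source NAT: containers appear to the outside as the host's IP."""

    def __init__(self, host_ip):
        self.host_ip = host_ip
        # (host port, remote ip, remote port) -> (container ip, container port)
        self.table = {}

    def outbound(self, pkt):
        origin = (pkt.src_ip, pkt.src_port)
        port = pkt.src_port
        # Keep the source port when possible; pick another only on a collision.
        while self.table.get((port, pkt.dst_ip, pkt.dst_port), origin) != origin:
            port += 1
        self.table[(port, pkt.dst_ip, pkt.dst_port)] = origin
        return Packet(self.host_ip, port, pkt.dst_ip, pkt.dst_port)

    def inbound(self, reply):
        # A reply arrives addressed to the host; map it back to the container.
        key = (reply.dst_port, reply.src_ip, reply.src_port)
        container_ip, container_port = self.table[key]
        return Packet(reply.src_ip, reply.src_port, container_ip, container_port)


if __name__ == "__main__":
    nat = Masquerade("203.0.113.10")
    out = nat.outbound(Packet("172.17.0.2", 40000, "198.51.100.7", 443))
    print(out)
    print(nat.inbound(Packet("198.51.100.7", 443, out.src_ip, out.src_port)))
```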


Summary for Engineers

  • Containers are just processes. If you run ps aux on the host, you can find the container's process.

  • Images are just files. They are layer tarballs plus JSON metadata, unpacked and managed by a storage driver (historically called a graph driver).

  • "Docker" is a manager. The real work is done by runc (setup) and the Linux Kernel (execution).
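Because a container is an ordinary host process, its cgroup membership is readable from /proc/<pid>/cgroup on the host. The sketch below extracts a container ID from that file's contents. The two path layouts it recognizes (docker-<id>.scope and /docker/<id>) are common ones but are an assumption here; the exact layout depends on the cgroup version and cgroup driver in use, and the function names are hypothetical.

```python
import re

# Matches ".../docker-<64 hex chars>.scope" and ".../docker/<64 hex chars>".
DOCKER_ID = re.compile(r"docker[-/]([0-9a-f]{64})")


def container_id_from_cgroup(text):
    """Return the Docker container ID found in /proc/<pid>/cgroup text, or None."""
    for line in text.splitlines():
        # Each line has the form "hierarchy-id:controllers:path".
        path = line.split(":", 2)[-1]
        match = DOCKER_ID.search(path)
        if match:
            return match.group(1)
    return None


def container_id_of(pid):
    """Look up the container ID for a host PID (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return container_id_from_cgroup(handle.read())
```

Given a PID taken from ps aux on the host, container_id_of(pid) would return None for a regular host process and the full container ID for a process that Docker started, under the layout assumptions stated above.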
