Kubernetes - Container Runtimes
What are Containers?
Think of "container" as just another packaging format, like `.iso` files for disk images, `.deb`/`.rpm` for Linux packages, or `.zip`/`.tgz` for binary or arbitrary files.
The ecosystem is more than just a format; it includes:
- Image format
- Distribution
- Runtime
- Orchestration
Unlike traditional virtualization, containerization takes place at the kernel level. Most modern operating system kernels now support the primitives necessary for containerization, including Linux with OpenVZ, VServer, and more recently LXC.
A container image is a tar file containing tar files; each inner tar file is a layer.
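The tar-of-tars structure can be sketched with plain `tar` (the file and layer names here are made up for illustration, not a real image format):

```shell
# Build a toy "image": a tar archive whose entries are themselves tar archives (layers).
mkdir -p demo/layer1 demo/layer2
echo "from layer 1" > demo/layer1/a.txt
echo "from layer 2" > demo/layer2/b.txt

tar -C demo/layer1 -cf layer1.tar a.txt   # each layer is a tar of filesystem content
tar -C demo/layer2 -cf layer2.tar b.txt
tar -cf image.tar layer1.tar layer2.tar   # the "image" wraps the layer tars

tar -tf image.tar                          # lists the two layer tars
```

Real images additionally carry JSON metadata (manifest, config) alongside the layers, but the nesting idea is the same.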
Read more: Containers vs VMs
Standards
TL;DR: OCI vs CRI

- OCI for low-level specs (think containers, `runc`).
- CRI for high-level specs (think k8s, `containerd`).
OCI: Open Container Initiative
https://www.opencontainers.org/
Defines important specs so that images can be packed/unpacked by different tools and run by different runtimes:

- the Runtime Specification (runtime-spec)
- the Image Specification (image-spec)
- the Distribution Specification (distribution-spec)

`runc` (https://github.com/opencontainers/runc) is a CLI tool for spawning and running containers according to the OCI specification.
CRI: Container Runtime Interface
Defines an API between Kubernetes and the container runtime (whose low-level behavior is defined by the OCI).
Notable Projects
- Docker: an open-source Linux containerization technology; a packaging, distribution, and runtime solution.
- `containerd`: container daemon. Docker spun out its container runtime and donated it to the CNCF; containerd is now a graduated CNCF project. It uses `runc` as its runtime and is used by Docker, Kubernetes, AWS ECS, etc.
- cgroup: limits and isolates resources (CPU, memory, disk I/O, network, etc.)
- LXC (Linux Containers)
- gVisor: a user-space kernel for containers. It limits the host kernel surface accessible to the application while still giving the application access to all the features it expects. It leverages existing host kernel functionality and runs as a normal user-space process. Useful for running untrusted workloads, with lower memory and startup overhead than a full VM.
Runtime
In 2020, Kubernetes deprecated Docker as a container runtime after version 1.20, in favor of runtimes that use the Container Runtime Interface (CRI): `containerd` and CRI-O. (Note that Docker is still a useful tool for building containers, and the images that result from running `docker build` can still run in your Kubernetes cluster.)
- `runc`: the low-level container runtime (the thing that actually creates and runs containers). It includes `libcontainer`, a native Go-based implementation for creating containers. Docker donated `runc` to the OCI.
- `containerd`: CNCF graduated project; contributors include Google, Microsoft, Alibaba, etc. It came from Docker and was made CRI-compliant.
- CRI-O: CNCF incubating project; contributors include Red Hat, IBM, Intel, etc. Created from the ground up for K8s.
Docker's default runtime is `runc`:
$ docker run --runtime=runc ...
gVisor can be integrated with Docker by changing the runtime from `runc` to `runsc` ("run sandboxed container"):
$ docker run --runtime=runsc ...
gVisor runs slower than the default Docker runtime due to the sandboxing overhead: https://github.com/google/gvisor/issues/102
LXC vs LXD vs cgroups vs Docker
- Linux Containers (LXC): built on top of `cgroups`; operating-system-level virtualization technology for running multiple isolated Linux systems (containers) on a single control host.
- `cgroups`: kernel feature to limit, account for, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups (namespace isolation comes from the separate kernel namespaces feature).
- LXD: similar to LXC, but a REST API on top of `liblxc`.
- Docker: application container; LXC/LXD: system container. Docker initially used `liblxc` but later changed to `libcontainer`.
Who's Not Using Containers?
Containers are gaining momentum and popularity, and many companies are adopting them. Two notable exceptions: Google and Facebook.

Google has its own packaging format, MPM. MPM on Borg is similar to containers on Kubernetes, and Kubernetes is the open-source version of Borg.

Facebook uses Tupperware. Why not Docker? It didn't exist then.
CRI
- Defines the main gRPC protocol for the communication between the kubelet and the container runtime.
- Implemented by container runtimes (e.g. `containerd`).
- CRI = RuntimeService + ImageService: https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.proto
- The kubelet acts as a client when connecting to the container runtime via gRPC. The runtime and image service endpoints have to be available in the container runtime.
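An abridged excerpt of the two services from that proto file (only a few of the many RPCs are shown; see api.proto for the full interface):

```protobuf
// Abridged from kubernetes/cri-api runtime/v1/api.proto.
service RuntimeService {
    rpc Version(VersionRequest) returns (VersionResponse) {}
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
}

service ImageService {
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
}
```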
View containerd logs:
journalctl -u containerd
`crictl image` is equivalent to `ctr -n=k8s.io images ls`
`kind load` invokes `ctr --namespace=k8s.io images import --digests --snapshotter=<snapshotter> -`
Check processes:
crictl ps
ctr vs crictl
- `ctr`: containerd CLI, not related to k8s.
- `crictl`: CRI-compatible container runtime command-line interface, related to k8s.
OCI Bundle
An OCI bundle can be loaded into Harbor without further processing.
Benefits:
- Deduplication of image layers across releases, saving space.
- File verification against corruption via SHA digests stored in the index and manifests.
- The image manifest of the bundle can be listed (before storing), giving full transparency to the customer.
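The digest-based verification can be sketched with plain shell tools; the path follows the `oci/blobs/<alg>/<digest>` convention, while the blob content here is made up:

```shell
# Store a blob under its own SHA-256 digest (content addressing), then verify it.
mkdir -p oci/blobs/sha256
echo "example layer content" > layer.bin

digest=$(sha256sum layer.bin | awk '{print $1}')
cp layer.bin "oci/blobs/sha256/$digest"

# Verification: recompute the digest and compare it with the file name.
stored=$(sha256sum "oci/blobs/sha256/$digest" | awk '{print $1}')
[ "$digest" = "$stored" ] && echo "blob verified"
```

Any corruption of the stored file changes its recomputed digest, so the mismatch with the file name exposes the damage.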
Structure:
- An OCI image bundle can be nested.
- An `oci-layout` file specifies the layout version: `"imageLayoutVersion": "1.0.0"`
- The root level must have `index.json`, whose `MediaType` is one of:
  - index: `application/vnd.oci.image.index.v1+json`
  - manifest: `application/vnd.oci.image.manifest.v1+json`
- Artifacts are stored in `oci/blobs/<alg>/<digest>`, e.g. `oci/blobs/sha256/XXXXXXXXXXX`
  - `MediaType` is flexible (some are JSON, some are binary):
    - `application/vnd.oci.image.config.v1+json`
    - `application/vnd.oci.image.layer.v1.tar+gzip`
- References are all done by digest:
  - "the file whose digest is `sha256:cdce9e...`"
  - instead of "the file whose name is `layer.tar.gz`"
- Merkle DAG (Directed Acyclic Graph)
  - content dedup by digest
  - immutable, tamper-proof
  - no circular dependencies
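Putting those pieces together, a minimal layout might look like this (the digests are truncated and illustrative, not from a real image):

```
oci/
├── oci-layout          # {"imageLayoutVersion": "1.0.0"}
├── index.json          # application/vnd.oci.image.index.v1+json
└── blobs/
    └── sha256/
        ├── cdce9e...   # manifest: application/vnd.oci.image.manifest.v1+json
        ├── 4f53cd...   # config:   application/vnd.oci.image.config.v1+json
        └── a3ed95...   # layer:    application/vnd.oci.image.layer.v1.tar+gzip
```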
OCI Image -> OCI Runtime Spec bundle
- Start with an OCI image
- Apply the filesystem layers in the order specified in the manifest
- Generate `config.json`
- The OCI Runtime Spec bundle is formed; `runc` now has enough information to run: apply cgroups/namespaces etc. on the Linux host
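The layer-application step can be sketched with plain `tar` (toy layers, not a real image): layers are extracted in manifest order, so a file in a later layer overrides the same path from an earlier one.

```shell
# Two toy layers; both contain etc/motd, and the later layer wins.
mkdir -p l1/etc l2/etc rootfs
echo "from layer 1" > l1/etc/motd
echo "from layer 2" > l2/etc/motd
tar -C l1 -cf layer1.tar etc
tar -C l2 -cf layer2.tar etc

# Apply the layers in manifest order onto the root filesystem.
for layer in layer1.tar layer2.tar; do
    tar -C rootfs -xf "$layer"
done

cat rootfs/etc/motd   # prints: from layer 2
```

Real runtimes also handle deletions via whiteout files and typically use union/overlay filesystems instead of copying, but the ordering semantics are the same.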