Kubernetes - Pod

apiVersion: v1
kind: Pod

Use controllers to manage pods; do not create or manage pods directly.

Deployment -> ReplicaSet -> Pod
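
The chain above can be sketched with a minimal Deployment manifest. The names (web) and the image are placeholders chosen for illustration, not taken from these notes. The Deployment creates a ReplicaSet, and the ReplicaSet creates pods from spec.template.

```yaml
# Minimal sketch; "web" and the image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3              # the ReplicaSet keeps 3 pods running
  selector:
    matchLabels:
      app: web
  template:                # pod template used by the ReplicaSet
    metadata:
      labels:
        app: web           # must match spec.selector.matchLabels
    spec:
      containers:
      - name: web
        image: nginx:1.27
```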

Containers in a Pod

In the simplest case, a pod has just 1 container; in some cases a pod has more than 1 container; with sidecars, each pod has at least 2 containers.

Order: spec.initContainers start first, then spec.containers; there is no guaranteed order when starting the containers in spec.containers.

Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
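
A minimal sketch of an init container that must finish before the app container starts. The pod name, the "db" Service name, and the busybox image are assumptions made for illustration.

```yaml
# Sketch only; "db" is a hypothetical Service the app depends on.
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    # Loops until the name resolves, then exits so the app can start.
    command: ['sh', '-c', 'until nslookup db; do sleep 2; done']
  containers:
  - name: app
    image: some/image
```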

Native Sidecar Containers since Kubernetes 1.28

Before 1.28:

sidecar container is part of spec.containers
- if the app container starts faster than the sidecar container, or shuts down after it (i.e. the sidecar container's lifecycle is shorter than the app container's), the app container cannot access the network while the sidecar is not running.
- if the app container exits but the sidecar container keeps running, the pod keeps running indefinitely.
- init containers run before the sidecar container starts, so they cannot access the network.

Since 1.28:

sidecar container is part of spec.initContainers but with restartPolicy: Always
- later containers in the list of spec.initContainers, and all normal spec.containers, will not start until the sidecar container has started.
- the pod can terminate even if the sidecar container is still running: once the main containers exit, the sidecar is shut down too.

Example:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: network-proxy
    image: network-proxy:1.0
    restartPolicy: Always
  containers:
  ...

Sidecar containers can communicate with the main container, for example over a local socket using gRPC.

Pod Termination Process

TL;DR: In Kubernetes, during normal pod termination a SIGTERM signal is sent before SIGKILL, to give containers a chance to shut down gracefully.

A Terminating Pod is not deleted yet. Here's the termination process:

1. Terminating: the Pod is removed from the endpoints list of all Services, so it stops getting new traffic. Containers running in the pod are not affected yet.
2. The PreStop hook is triggered, if one is defined.
3. The SIGTERM signal is sent to the containers in the pod.
4. K8s waits up to the "termination grace period" (the timer starts BEFORE the PreStop hook is triggered). By default, this is 30 seconds; it can be configured with the Pod's .spec.terminationGracePeriodSeconds.
5. If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. The pod is then removed, and at this point all Kubernetes objects are cleaned up as well.
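
The two knobs in this process can be sketched as follows. The pod name, the 60-second value, and the sleep in the hook are illustrative assumptions; the hook also assumes the image ships a shell.

```yaml
# Sketch only; values are illustrative, not recommendations.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  terminationGracePeriodSeconds: 60   # default is 30
  containers:
  - name: app
    image: some/image
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent; its duration counts
          # against the grace period.
          command: ['sh', '-c', 'sleep 10']
```

A short sleep in the PreStop hook is one common way to let in-flight requests drain while the endpoint removal propagates.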

Expect exit code 143 (128 + 15) if the container was terminated by SIGTERM, or exit code 137 (128 + 9) if it was forcefully terminated using SIGKILL. A process that handles SIGTERM itself may exit with its own code, such as 0.

Note on SIGTERM:

main process: the signal is delivered to PID 1 of the container. If the entrypoint is a shell script, PID 1 is the shell rather than the main process, so the signal may never reach the main process. To propagate SIGTERM properly, modify the entrypoint script, for example by launching the main process with the exec command. exec replaces the shell process with the main process, so the main process becomes PID 1 and receives the signal directly.
child process: if the main process launches additional child processes, it’s important to ensure that the SIGTERM signal is propagated to those child processes as well. Otherwise, those child processes may be abruptly terminated and leave behind unclean state.

If /bin/sh is running as pid 1, it swallows the SIGTERM and doesn't terminate the subprocess.
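
A minimal sketch of the exec approach, written inline in the pod spec instead of a separate entrypoint script. The pod name and the setup step are hypothetical; /bin/command stands in for the real main process.

```yaml
# Sketch only; paths and names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: signal-demo
spec:
  containers:
  - name: app
    image: some/image
    command: ['/bin/sh', '-c']
    # The shell runs the setup step, then exec replaces the shell
    # with /bin/command, which becomes PID 1 and receives SIGTERM.
    args: ['echo "setup done"; exec /bin/command arg1']
```

Without exec, the shell would stay as PID 1 and /bin/command would run as its child.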

Volume

Inside a Pod, volumes need to be configured in 2 places:

.spec.volumes: ref to pvc, configmap, secret, emptyDir, etc.
.spec.containers[].volumeMounts:
- ref to volumes
- mountPath

Volumes examples:

volumes:
  # PVC example
  - name: mydisk
    persistentVolumeClaim:
      claimName: xxx-xxx
  # host-path example
  - name: myvolume
    hostPath:
      path: /mnt/vpath
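
A sketch showing how the two places fit together: each volumeMounts entry refers to a volume by name. The pod name, mount paths, and the emptyDir volume are assumptions for illustration.

```yaml
# Sketch only; mount paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: app
    image: some/image
    volumeMounts:
    - name: mydisk          # must match a name under spec.volumes
      mountPath: /data
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: mydisk
    persistentVolumeClaim:
      claimName: xxx-xxx
  - name: scratch
    emptyDir: {}
```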

Entrypoint

A pod can override entrypoint by specifying command in the spec:

kind: Pod
spec:
  containers:
    - image: some/image
      command: ['/bin/command']
      args: ['arg1', 'arg2', 'arg3']

Pod fields override Dockerfile:

command => overrides ENTRYPOINT
args => overrides CMD

Toleration

If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.

If the node stays in the cluster during maintenance, run

$ kubectl uncordon <node name>

afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.

spec.hostNetwork

If .spec.hostNetwork is true, the pod shares the host's network namespace: its podIP is set to the host machine's IP, and it can use the host machine's physical interface (eth0) directly. This bypasses the Kubernetes pod network for that pod.

Use case: a worker pool consisting of Pods that don't need to run HTTP servers but instead make outgoing requests to a message broker.

spec.dnsPolicy

If hostNetwork: true, set dnsPolicy: ClusterFirstWithHostNet. Even though the pod uses the host network, it still relies on CoreDNS for cluster DNS resolution. This allows the pod to call out to other Services.

It is required to use ClusterFirstWithHostNet if the pod needs to resolve cluster.local based DNS names.

Pods running with hostNetwork: true and ClusterFirst will fall back to the behavior of the Default policy.
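
A minimal sketch combining the two fields. The pod and container names are placeholders.

```yaml
# Sketch only; names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-demo
spec:
  hostNetwork: true
  # Needed so cluster.local names still resolve via cluster DNS.
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: worker
    image: some/image
```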

spec.priorityClassName

For critical pods (e.g. networking, DNS, etc), set priorityClassName to:

system-node-critical (highest priority)
system-cluster-critical (lower than system-node-critical but still critical)
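
A sketch of a pod using one of these classes. The pod name and the kube-system namespace are assumptions for illustration; some clusters restrict where the system priority classes may be used.

```yaml
# Sketch only; name and namespace are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: critical-demo
  namespace: kube-system
spec:
  priorityClassName: system-cluster-critical
  containers:
  - name: agent
    image: some/image
```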

Images

A pod references an image by tag

haproxy:v2.2.25-gke.4

or by digest (sha256)

haproxy:v2.2.25-gke.4@sha256:27f260a9f4b2f1848b360c044d94b957b2fe513d2c2afc6daf4e3c148c2a39f4

Digest Stability: While constructing the image bundle, we use docker save to save the images into the image bundle tarball. Digests computed for images saved this way are not stable. In other words, when different customers copy these images to different private registries, the digest changes, most likely because docker save does not preserve the original registry manifest, so the manifest and its digest are regenerated when the image is pushed. The solution is to save the images using OCI-format image manifests, with an image save tool that supports this format (e.g. crane).