Kubernetes - Pod
apiVersion: v1
kind: Pod
Use controllers to manage pods; do not manage pods directly.
Deployment -> ReplicaSet -> Pod
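As a minimal sketch of that chain, the Deployment below creates a ReplicaSet, which in turn creates the Pods from its template. The name `web`, the label `app: web`, and the image `nginx:1.25` are placeholders chosen for illustration, not taken from these notes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # hypothetical name
spec:
  replicas: 3               # the ReplicaSet keeps 3 Pods running
  selector:
    matchLabels:
      app: web              # must match the Pod template labels below
  template:                 # Pod template: same fields as a standalone Pod
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25   # placeholder image
```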
Containers in a Pod
In the simplest case, a pod has just 1 container; in some cases a pod has more than 1 container; with sidecars, a pod has at least 2 containers.
Order: spec.initContainers start first, one at a time in list order, then spec.containers; no startup order is guaranteed among the containers in spec.containers.
Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
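The ordering above can be sketched with a Pod that has two init containers and one application container. This is an illustrative manifest only: the pod name, the `busybox:1.36` image, the `db` Service name, and the `/work` path are all assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-order-demo        # hypothetical name
spec:
  initContainers:              # run one at a time, in list order, each to completion
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nslookup db; do sleep 2; done']   # "db" is an assumed Service name
  - name: prepare-config
    image: busybox:1.36
    command: ['sh', '-c', 'echo ready > /work/flag']
    volumeMounts:
    - name: work
      mountPath: /work
  containers:                  # started only after every init container has exited successfully
  - name: app
    image: busybox:1.36
    command: ['sh', '-c', 'cat /work/flag && sleep 3600']
    volumeMounts:
    - name: work
      mountPath: /work
  volumes:
  - name: work
    emptyDir: {}
```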
Native Sidecar Containers since Kubernetes 1.28
Before 1.28:
- sidecar container is part of spec.containers
- if the app container starts faster than the sidecar container, or shuts down after the sidecar container (i.e. the sidecar container's lifecycle is shorter than the app container's), the app container cannot access the network.
- if the app container exits but the sidecar container keeps running, the pod will keep running indefinitely.
- init containers run before the sidecar container, so they cannot access the network.
Since 1.28:
- sidecar container is part of spec.initContainers but with restartPolicy: Always
- later containers in the list of spec.initContainers, and all normal spec.containers, will not start until the sidecar container has started.
- the pod will terminate even if the sidecar container is still running.
Example:
apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: network-proxy
    image: network-proxy:1.0
    restartPolicy: Always
  containers:
  ...
Sidecar containers can communicate with the main container over a socket, for example using gRPC.
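One possible way to share such a socket is an emptyDir volume mounted into both containers, as in the sketch below. The socket path, the `PROXY_SOCKET` variable, and the `some/app:1.0` image are hypothetical; only the `network-proxy:1.0` image name comes from the example above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-socket-demo      # hypothetical name
spec:
  initContainers:
  - name: network-proxy
    image: network-proxy:1.0     # image name from the example above
    restartPolicy: Always        # marks this init container as a native sidecar
    volumeMounts:
    - name: sockets
      mountPath: /var/run/proxy  # assumed directory where the proxy creates its socket
  containers:
  - name: app
    image: some/app:1.0          # placeholder image
    env:
    - name: PROXY_SOCKET         # hypothetical variable read by the app
      value: /var/run/proxy/proxy.sock
    volumeMounts:
    - name: sockets
      mountPath: /var/run/proxy
  volumes:
  - name: sockets
    emptyDir: {}                 # shared by both containers for the lifetime of the pod
```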
Pod Termination Process
TL;DR: During normal pod termination, Kubernetes always sends the SIGTERM signal before SIGKILL, to give containers a chance to shut down gracefully.
A Terminating Pod is not deleted yet. Here's the termination process:
- Terminating: the Pod is removed from the endpoints list of all Services, so it stops getting new traffic. Containers running in the pod are not affected.
- Trigger the PreStop hook.
- SIGTERM signal is sent to the pod (and the containers).
- K8s waits for up to the "termination grace period" (the timer starts BEFORE the PreStop hook is triggered). By default, this is 30 seconds. It can be configured with the Pod's .spec.terminationGracePeriodSeconds.
- SIGKILL signal is sent to the pod, and the pod is removed. If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
Expect exit code 143 (128 + 15) if the container was terminated by SIGTERM, or exit code 137 (128 + 9) if it was forcefully terminated using SIGKILL.
Note on SIGTERM:
- main process: the signal is delivered to PID 1 in the container. If the entrypoint is a shell script, the shell receives SIGTERM, not the main process it launched. To make sure the main process gets the signal, use the exec command in the entrypoint script to replace the shell process with the main process. The main process then runs as PID 1 and receives SIGTERM directly.
- child process: if the main process launches additional child processes, it's important to ensure that the SIGTERM signal is propagated to those child processes as well. Otherwise, those child processes may be abruptly terminated and leave behind unclean state.
If /bin/sh is running as pid 1, it swallows the SIGTERM and doesn't terminate the subprocess.
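The pieces above (grace period, PreStop hook, and exec in the entrypoint) could be combined roughly as follows. This is a sketch under assumptions: the pod name, the `/app/server` binary path, the 60-second grace period, and the 5-second sleep are illustrative values, not recommendations from these notes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo                  # hypothetical name
spec:
  terminationGracePeriodSeconds: 60    # default is 30; covers the preStop hook and SIGTERM handling
  containers:
  - name: app
    image: some/image                  # placeholder image
    # "exec" replaces the shell with the server, so the server runs as PID 1
    # and receives SIGTERM directly instead of /bin/sh swallowing it.
    command: ['/bin/sh', '-c', 'exec /app/server']   # /app/server is an assumed binary path
    lifecycle:
      preStop:
        exec:
          command: ['/bin/sh', '-c', 'sleep 5']      # assumed short pause before SIGTERM is sent
```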
Volume
Inside Pod, volumes need to be configured in 2 types of places:
- .spec.volumes: ref to pvc, configmap, secret, emptyDir, etc.
- .spec.containers[].volumeMounts:
  - ref to volumes
  - mountPath
Volumes examples:
volumes:
# PVC example
- name: mydisk
  persistentVolumeClaim:
    claimName: xxx-xxx
# host-path example
- name: myvolume
  hostPath:
    path: /mnt/vpath
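To show both places together, here is a sketch of a complete Pod that declares a volume and mounts it. The pod name, the image, and the `/data` mount path are assumptions; `xxx-xxx` is the placeholder claim name used above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo            # hypothetical name
spec:
  volumes:                     # place 1: declare the volume and its source
  - name: mydisk
    persistentVolumeClaim:
      claimName: xxx-xxx       # placeholder claim name from the notes above
  containers:
  - name: app
    image: some/image          # placeholder image
    volumeMounts:              # place 2: mount the declared volume into the container
    - name: mydisk             # must match a name under spec.volumes
      mountPath: /data         # assumed mount path
```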
Entrypoint
A pod can override entrypoint by specifying command in the spec:
kind: Pod
spec:
  containers:
  - image: some/image
    command: ['/bin/command']
    args: ['arg1', 'arg2', 'arg3']
Pod fields override Dockerfile:
- command => overrides ENTRYPOINT
- args => overrides CMD
Toleration
If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
After maintenance on a drained node that stays in the cluster, run
$ kubectl uncordon <node name>
to tell Kubernetes that it can resume scheduling new pods onto the node.
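For recognition when reviewing manifests, a toleration for the node.kubernetes.io/unschedulable taint looks roughly like the sketch below. The pod name and image are placeholders. DaemonSet pods get this toleration added automatically by the DaemonSet controller, so it rarely needs to be written by hand.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerates-cordon       # hypothetical name
spec:
  tolerations:
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule         # lets the pod be scheduled onto cordoned or drained nodes
  containers:
  - name: agent
    image: some/image          # placeholder image
```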
spec.hostNetwork
If .spec.hostNetwork is true, the pod shares the host's network namespace: the pod's podIP is set to the host machine IP, and the pod can use the host machine's physical interface (eth0) directly. The pod bypasses Kubernetes's pod networking layer.
Use case: a worker pool consisting of Pods that don't need to run HTTP servers but instead make outgoing requests to a message broker.
spec.dnsPolicy
If hostNetwork: true, set dnsPolicy: ClusterFirstWithHostNet. Even though the pod uses the host network, it then still relies on CoreDNS for cluster DNS resolution. This allows the pod to call out to other Services.
It is required to use ClusterFirstWithHostNet if the pod needs to resolve cluster.local based DNS names.
Pods running with hostNetwork: true and ClusterFirst will fall back to the behavior of the Default policy.
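A minimal sketch of a host-network pod that keeps cluster DNS working, assuming a hypothetical worker pod and a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-worker                 # hypothetical name
spec:
  hostNetwork: true                    # share the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet   # keep resolving cluster.local names via cluster DNS
  containers:
  - name: worker
    image: some/image                  # placeholder image
```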
spec.priorityClassName
For critical pods (e.g. networking, DNS, etc), set priorityClassName to:
- system-node-critical (highest priority)
- system-cluster-critical (lower than system-node-critical but still critical)
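In a manifest, the priority class is set on the Pod spec, as in this sketch. The pod name, the namespace, and the image are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-agent                      # hypothetical name
  namespace: kube-system                    # assumed namespace for a system component
spec:
  priorityClassName: system-node-critical   # built-in priority class
  containers:
  - name: agent
    image: some/image                       # placeholder image
```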
Images
A Pod can reference an image by tag
haproxy:v2.2.25-gke.4
or by digest (sha256)
haproxy:v2.2.25-gke.4@sha256:27f260a9f4b2f1848b360c044d94b957b2fe513d2c2afc6daf4e3c148c2a39f4
Digest Stability: While constructing the image bundle, we use docker save to save the images into the image bundle tarball. Digests computed for images saved this way are not stable. In other words, when these images are copied to different private registries by different customers, the digest changes, because the registry URL is taken into account when computing the digest. The solution for making the digests stable is to save the images using OCI-format image manifests, with an image save tool that supports this format (e.g. crane).