Kubernetes - Pod
apiVersion: v1
kind: Pod
Use controllers to manage pods; do not manage pods directly:
Deployment -> ReplicaSet -> Pod
Containers in a Pod
In the simplest case, each pod has just 1 container; in some cases a pod has more than 1 container; with sidecars, each pod has at least 2 containers.
Order: spec.initContainers start first, then spec.containers; there is no specific order when starting the containers in spec.containers.
Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
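As a sketch, a hypothetical init container (the pod name, container names, and probe command are illustrative) that runs to completion before the main container starts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init            # hypothetical name
spec:
  initContainers:
    - name: wait-for-db          # must exit successfully before `app` starts
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z db 5432; do sleep 1; done']
  containers:
    - name: app
      image: some/image
```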
Native Sidecar Containers since Kubernetes 1.28
Before 1.28:
- The sidecar container is part of spec.containers.
- If the app container starts faster than the sidecar container, or shuts down after it (i.e. the sidecar container's life cycle is shorter than the app container's), the app container cannot access the network.
- If the app container exits but the sidecar container keeps running, the pod will keep running indefinitely.
- Init containers run before sidecar containers, so they cannot access the network.
After 1.28:
- The sidecar container is part of spec.initContainers, but with restartPolicy: Always.
- Later containers in the spec.initContainers list, and all normal spec.containers, will not start until the sidecar container is ready.
- The pod will terminate even if the sidecar container is still running.
Example:
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: network-proxy
      image: network-proxy:1.0
      restartPolicy: Always
  containers:
    ...
Pod Termination Process
TL;DR: In Kubernetes, a SIGTERM signal is always sent before SIGKILL, to give containers a chance to shut down gracefully.
A Terminating Pod is not deleted yet. Here's the termination process:
- Terminating: the Pod is removed from the endpoints list of all Services, so it stops getting new traffic. Containers running in the pod are not affected.
- The PreStop hook is triggered, then the SIGTERM signal is sent to the pod (and its containers).
- K8s waits for up to the "termination grace period" (the timer starts BEFORE the PreStop hook is triggered). By default this is 30 seconds; it can be configured via the Pod's .spec.terminationGracePeriodSeconds.
- If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
Expect exit code 143 if the container terminated gracefully with SIGTERM, or exit code 137 if it was forcefully terminated using SIGKILL.
Note on SIGTERM:
- Main process: by default, only the main process receives the signal. To properly propagate SIGTERM to the main process in a Kubernetes pod, modify the entrypoint script so the signal is forwarded to it. One way is to use the exec command to replace the shell script process with the main process when launching it; then, when SIGTERM is sent to the shell script process, it reaches the main process.
- Child processes: if the main process launches additional child processes, ensure SIGTERM is propagated to them as well. Otherwise, those child processes may be abruptly terminated and leave behind unclean state.
If /bin/sh is running as PID 1, it swallows the SIGTERM and doesn't terminate the subprocess.
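A minimal sketch of the exec trick: the script below stands in for a container entrypoint, with sleep playing the main process. Because of exec, sleep replaces the shell and receives SIGTERM directly, so it exits with code 143 (128 + 15):

```shell
# Write a stand-in entrypoint; `exec` makes `sleep` replace the shell process,
# so the signal is delivered to the main process instead of being swallowed.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/sh
exec sleep 300
EOF
chmod +x /tmp/entrypoint.sh

# Launch it, send SIGTERM, and capture the exit code.
/tmp/entrypoint.sh &
pid=$!
sleep 0.2
kill -TERM "$pid"
wait "$pid"
code=$?
echo "exit code: $code"   # 143 = terminated gracefully by SIGTERM
```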
Volume
Inside a Pod, volumes need to be configured in 2 types of places:
- .spec.volumes: ref to pvc, configmap, secret, emptyDir, etc.
- .spec.containers[].volumeMounts:
  - ref to volumes
  - mountPath
Volumes examples:
volumes:
  # PVC example
  - name: mydisk
    persistentVolumeClaim:
      claimName: xxx-xxx
  # hostPath example
  - name: myvolume
    hostPath:
      path: /mnt/vpath
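The volumeMounts side then references those volumes by name (the container name and mount paths below are illustrative):

```yaml
containers:
  - name: app
    image: some/image
    volumeMounts:
      - name: mydisk        # matches .spec.volumes[].name
        mountPath: /data
      - name: myvolume
        mountPath: /mnt/host
```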
Entrypoint
A pod can override the entrypoint by specifying command in the spec:
kind: Pod
spec:
  containers:
    - image: some/image
      command: ['/bin/command']
      args: ['arg1', 'arg2', 'arg3']
Pod fields override the Dockerfile:
- command => overrides ENTRYPOINT
- args => overrides CMD
Toleration
If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
Run $ kubectl uncordon <node name> afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
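For reference, the toleration that the DaemonSet controller adds automatically looks roughly like this, which is why DaemonSet pods survive a drain:

```yaml
tolerations:
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule
```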
spec.hostNetwork
If .spec.hostNetwork is true, the pod's podIP is set to the host machine IP, so it can use the host machine's physical interface (eth0). It disables Kubernetes's networking layer.
Use case: a worker pool consisting of Pods that don't need to run HTTP servers but instead make outgoing requests to a message broker.
spec.dnsPolicy
If hostNetwork: true, then set dnsPolicy: ClusterFirstWithHostNet. Even though the pod uses the host network, it still requires CoreDNS for cluster DNS resolution; this allows the pod to call out to other Services.
It is required to use ClusterFirstWithHostNet if the pod needs to resolve cluster.local-based DNS names.
Pods running with hostNetwork: true and ClusterFirst will fall back to the behavior of the Default policy.
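Putting the two fields together, a sketch (container name and image are illustrative):

```yaml
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS resolution
  containers:
    - name: worker
      image: some/image
```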
spec.priorityClassName
For critical pods (e.g. networking, DNS, etc.), set priorityClassName to:
- system-node-critical (highest priority)
- system-cluster-critical (lower than system-node-critical but still critical)
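For example:

```yaml
spec:
  priorityClassName: system-cluster-critical
```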
Images
A Pod references an image by tag, e.g. haproxy:v2.2.25-gke.4, or by digest, e.g. haproxy:v2.2.25-gke.4@sha256:27f260a9f4b2f1848b360c044d94b957b2fe513d2c2afc6daf4e3c148c2a39f4.
Digest Stability: while constructing the image bundle, we use docker save to save the image into the image bundle tarball. Digests computed for images saved using this scheme are not stable: when these images are copied to different private registries by different customers, the digest changes, because the registry URL is taken into account for computing the digest. The solution to make the digests stable is to save the images using OCI-format image manifests, using an image save tool that supports this format (e.g. crane).
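A sketch with crane (assuming crane is installed; the registry name is a placeholder). Saving with --format=oci keeps OCI image manifests, so the digest survives the copy:

```shell
# Save the image with OCI manifests instead of `docker save`'s legacy format
crane pull --format=oci haproxy:v2.2.25-gke.4 haproxy-oci.tar

# Push the tarball to a private registry; the digest stays stable
crane push haproxy-oci.tar registry.example.com/haproxy:v2.2.25-gke.4

# Inspect the digest after the copy
crane digest registry.example.com/haproxy:v2.2.25-gke.4
```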