Kubernetes - Pod
```yaml
apiVersion: v1
kind: Pod
```
Use controllers to manage pods; do not manage pods directly:
Deployment
-> ReplicaSet
-> Pod
Containers in a Pod
In the simplest case, each pod has just 1 container; in some cases a pod has more than 1 container; with sidecars, each pod has at least 2 containers.
Order: `spec.initContainers` start first, then `spec.containers`; there is no specific order when starting the containers in `spec.containers`.
Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
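A minimal sketch of an init container that blocks until a dependency is reachable (the `db:5432` address, container names, and images are placeholders):

```yaml
apiVersion: v1
kind: Pod
spec:
  initContainers:
  # Runs to completion before the app container starts.
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db 5432; do sleep 1; done']
  containers:
  - name: app
    image: some/app:1.0
```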
Native Sidecar Containers since Kubernetes 1.28
Before 1.28:
- the sidecar container is part of `spec.containers`
- if the app container starts faster than the sidecar container, or shuts down after the sidecar container (i.e. the sidecar's life cycle is shorter than the app container's), the app container cannot access the network.
- if the app container exits but the sidecar container keeps running, the pod runs indefinitely.
- init containers run before the sidecar container, so they cannot access the network.
After 1.28:
- the sidecar container is part of `spec.initContainers`, but with `restartPolicy: Always`
- later containers in the `spec.initContainers` list, and all normal `spec.containers`, will not start until the sidecar container is ready.
- the pod will terminate even if the sidecar container is still running.
Example:
```yaml
apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: network-proxy
    image: network-proxy:1.0
    restartPolicy: Always
  containers:
  ...
```
Pod Termination Process
TL;DR: In Kubernetes, a `SIGTERM` signal is always sent before `SIGKILL`, to give containers a chance to shut down gracefully.
A `Terminating` Pod is not deleted yet. Here's the termination process:
- Terminating: the `Pod` is removed from the endpoints list of all `Service`s, so it stops getting new traffic. Containers running in the pod are not affected.
- The `PreStop` hook is triggered, then the `SIGTERM` signal is sent to the pod (and its containers).
- K8s waits for up to the "termination grace period" (the timer starts BEFORE the `PreStop` hook is triggered). By default this is 30 seconds; it can be configured by the Pod's `.spec.terminationGracePeriodSeconds`.
- If the containers are still running after the grace period, they are sent the `SIGKILL` signal and forcibly removed, and the pod is deleted. At this point, all related Kubernetes objects are cleaned up as well.
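The grace period and `PreStop` hook map to the Pod spec like this (the image name, hook command, and timings are illustrative, not prescriptive):

```yaml
spec:
  terminationGracePeriodSeconds: 60   # default is 30
  containers:
  - name: app
    image: some/app:1.0               # placeholder image
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent; the grace-period timer
          # is already counting down while this runs.
          command: ['sh', '-c', 'sleep 5']
```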
Expect exit code `143` if the container terminated gracefully with `SIGTERM`, or exit code `137` if it was forcefully terminated using `SIGKILL` (exit code = 128 + signal number).
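These exit codes follow the Unix convention of 128 + signal number (SIGTERM = 15 → 143, SIGKILL = 9 → 137), which can be checked locally without a cluster:

```shell
# A shell that kills itself with SIGTERM exits with 128 + 15 = 143.
sh -c 'kill -TERM $$'; echo $?   # 143
# SIGKILL cannot be caught; the shell exits with 128 + 9 = 137.
sh -c 'kill -KILL $$'; echo $?   # 137
```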
Note on `SIGTERM`:
- main process: by default, only the main process receives the signal. To properly propagate the `SIGTERM` signal to the main process in a Kubernetes pod, modify the entrypoint script so that the signal is forwarded to it. One way is to use the `exec` command to replace the shell-script process with the main process when launching it; the `SIGTERM` signal sent to the shell script's PID then reaches the main process directly.
- child processes: if the main process launches additional child processes, make sure the `SIGTERM` signal is propagated to those child processes as well. Otherwise, those child processes may be abruptly terminated and leave behind unclean state.
If `/bin/sh` is running as PID 1, it swallows the `SIGTERM` and doesn't terminate the subprocess.
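To see why `exec` matters, here's a runnable sketch (using `sleep` as a stand-in for the app): the shell replaces itself with the program via `exec`, keeping the same PID, so a `SIGTERM` sent to that PID hits the program directly.

```shell
# exec replaces the shell with `sleep`, so the shell's PID
# becomes the sleep process itself.
sh -c 'exec sleep 30' &
pid=$!
sleep 1                  # give the shell time to exec
kill -TERM "$pid"
wait "$pid"
echo $?                  # 143 = 128 + 15: sleep received SIGTERM directly
```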
Volume
Inside a `Pod`, volumes need to be configured in 2 types of places:
- `.spec.volumes`: ref to a PVC, ConfigMap, Secret, `emptyDir`, etc.
- `.spec.containers[].volumeMounts`:
  - ref to a volume (by name)
  - `mountPath`
Volumes examples:
```yaml
volumes:
# PVC example
- name: mydisk
  persistentVolumeClaim:
    claimName: xxx-xxx
# host-path example
- name: myvolume
  hostPath:
    path: /mnt/vpath
```
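The matching `volumeMounts` side, referencing the volumes above by name (the container name and mount paths are illustrative):

```yaml
containers:
- name: app
  image: some/image
  volumeMounts:
  - name: mydisk          # must match a .spec.volumes entry
    mountPath: /data
  - name: myvolume
    mountPath: /mnt/host
```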
Entrypoint
A pod can override the entrypoint by specifying `command` in the spec:
```yaml
kind: Pod
spec:
  containers:
  - image: some/image
    command: ['/bin/command']
    args: ['arg1', 'arg2', 'arg3']
```
Pod fields override Dockerfile:
- `command` => overrides `ENTRYPOINT`
- `args` => overrides `CMD`
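For example, assuming an image whose Dockerfile sets both `ENTRYPOINT` and `CMD` (hypothetical image), setting only `args` keeps the image's entrypoint and replaces just the default arguments:

```yaml
kind: Pod
spec:
  containers:
  - image: some/image
    # CMD is replaced; ENTRYPOINT from the image is kept.
    args: ['arg1']
```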
Toleration
If any new Pods tolerate the `node.kubernetes.io/unschedulable` taint, then those Pods might be scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
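For reference, such a toleration looks like this (shown only to illustrate the shape; as noted above, avoid it outside DaemonSets):

```yaml
tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```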
Run `kubectl uncordon <node name>` afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.