
Kubernetes - Pod

Last Updated: 2024-03-03
apiVersion: v1
kind: Pod

Use controllers (e.g. Deployments) to manage Pods; do not manage Pods directly.

Deployment -> ReplicaSet -> Pod
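
A minimal sketch of this chain (the name and image below are placeholders): the Deployment owns a ReplicaSet, which in turn creates and owns the Pods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:                     # Pod template; the ReplicaSet creates Pods from it
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: some/image       # placeholder image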

Containers in a Pod

In the simplest case, each pod has just 1 container; in some cases a pod has more than 1 container; with sidecars, each pod has at least 2 containers.

Order: spec.initContainers start first (sequentially, each must complete before the next starts), then spec.containers; there is no specific order when starting containers in spec.containers.

Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
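
A minimal sketch of an init container (the init image, the my-db Service, and the app image are hypothetical): the init container blocks until its check succeeds, then the app container starts.

apiVersion: v1
kind: Pod
metadata:
  name: init-demo                # hypothetical name
spec:
  initContainers:
  - name: wait-for-db            # must run to completion before the app container starts
    image: busybox:1.36
    command: ['sh', '-c', 'until nslookup my-db; do sleep 2; done']   # my-db is a hypothetical Service
  containers:
  - name: app
    image: some/image            # placeholder image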

Native Sidecar Containers since Kubernetes 1.28

Before 1.28:

  • sidecar container is part of spec.containers
    • if the app container starts faster than the sidecar container, or shuts down after the sidecar container (i.e. the sidecar container's lifecycle is shorter than the app container's), the app container cannot access the network.
    • if the app container exits but the sidecar container keeps running, the pod will keep running indefinitely.
    • init containers run before the sidecar container, so they cannot access the network.

After 1.28:

  • sidecar container is part of spec.initContainers but with restartPolicy: Always
    • later containers in the list of spec.initContainers, and all normal spec.containers, will not start until the sidecar container is ready.
    • the pod can terminate (once all app containers have exited) even if the sidecar container is still running.

Example:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: network-proxy
    image: network-proxy:1.0
    restartPolicy: Always
  containers:
  ...

Pod Termination Process

TL;DR: In Kubernetes, a SIGTERM signal is always sent before SIGKILL, to give containers a chance to shut down gracefully.

A Terminating Pod is not deleted yet. Here's the termination process:

  • Terminating: the Pod is removed from the endpoints of all Services, so it stops receiving new traffic. Containers running in the pod are not affected yet.
  • The PreStop hook (if defined) is triggered.
  • The SIGTERM signal is sent to the pod's containers (specifically, to the main process, PID 1, in each container).
  • K8s waits up to the "termination grace period" (the timer starts BEFORE the PreStop hook is triggered, so the hook's duration counts against it). By default, this is 30 seconds; it can be configured via the Pod's .spec.terminationGracePeriodSeconds.
  • If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed, and the pod is deleted. At this point, all related Kubernetes objects are cleaned up as well.

Expect exit code 143 (128 + 15) if the container terminated gracefully with SIGTERM, or exit code 137 (128 + 9) if it was forcefully terminated using SIGKILL.
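
A minimal sketch combining the grace period and a PreStop hook (the image and the sleep duration are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo       # hypothetical name
spec:
  terminationGracePeriodSeconds: 60  # the PreStop hook and SIGTERM handling share this budget
  containers:
  - name: app
    image: some/image                # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ['sh', '-c', 'sleep 5']   # e.g. give load balancers time to stop sending traffic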

Note on SIGTERM:

  • main process: by default, only the main process (PID 1) receives the signal. If the container is launched via a shell entrypoint script, the script process is PID 1, so the script must forward the SIGTERM to the main process. One way to do this is to launch the main process with exec, which replaces the shell script process with the main process; the SIGTERM then reaches the main process directly.
  • child process: if the main process launches additional child processes, it’s important to ensure that the SIGTERM signal is propagated to those child processes as well. Otherwise, those child processes may be abruptly terminated and leave behind unclean state.

If /bin/sh is running as pid 1, it swallows the SIGTERM and doesn't terminate the subprocess.
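
A minimal sketch of the exec pattern (the image and binary path are hypothetical): the shell replaces itself with the main process, so SIGTERM is delivered to the application rather than being swallowed by /bin/sh.

apiVersion: v1
kind: Pod
metadata:
  name: sigterm-demo          # hypothetical name
spec:
  containers:
  - name: app
    image: some/image         # placeholder image
    # exec replaces /bin/sh with the main process, so the main process runs as PID 1
    # and receives SIGTERM directly
    command: ['/bin/sh', '-c', 'echo "setup done"; exec /usr/local/bin/app']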

Volume

Inside a Pod, volumes need to be configured in 2 places:

  • .spec.volumes: ref to pvc, configmap, secret, emptyDir, etc.
  • .spec.containers[].volumeMounts:
    • ref to volumes
    • mountPath

Volume examples:

volumes:
  # PVC example
  - name: mydisk
    persistentVolumeClaim:
      claimName: xxx-xxx
  # host-path example
  - name: myvolume
    hostPath:
      path: /mnt/vpath
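
A matching volumeMounts sketch inside .spec.containers (the image and mount paths are placeholders); each entry references a volume defined above by name:

containers:
  - name: app
    image: some/image          # placeholder image
    volumeMounts:
      - name: mydisk           # must match a name under .spec.volumes
        mountPath: /data       # placeholder path
      - name: myvolume
        mountPath: /mnt/host   # placeholder path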

Entrypoint

A pod can override the image's entrypoint by specifying command in its spec:

kind: Pod
spec:
  containers:
    - image: some/image
      command: ['/bin/command']
      args: ['arg1', 'arg2', 'arg3']

Pod fields override Dockerfile:

  • command => overrides ENTRYPOINT
  • args => overrides CMD
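
For example, to keep the image's ENTRYPOINT and replace only CMD, specify args without command (the flag is a placeholder):

containers:
  - image: some/image
    args: ['--debug']          # ENTRYPOINT from the Dockerfile is kept; only CMD is replaced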

Toleration

If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
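
For reference, a minimal sketch of such a toleration in a Pod spec (normally only appropriate in a DaemonSet's Pod template):

tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule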

Run

$ kubectl uncordon <node name>

afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.