
Kubernetes - Storage

Lifecycle of the storage

Storage lifecycle: Provision -> Attach -> Mount -> [volume fully setup, pod executing] -> Unmount -> Detach -> Delete

Provision

When a PersistentVolumeClaim object is created (a namespaced object, like the Pod that uses it), a volume is created on demand, represented by a cluster-scoped PersistentVolume object. The PVC and PV objects are bound together with bidirectional pointers to each other.

This logic is implemented by the PersistentVolume controller inside kube-controller-manager.
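
As a minimal sketch (all names and sizes here are hypothetical), dynamic provisioning starts from a claim like this; once the PersistentVolume controller binds it, the PVC's .spec.volumeName points at the PV and the PV's .spec.claimRef points back at the PVC — the bidirectional pointers mentioned above:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim             # hypothetical name
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard # assumes a StorageClass with this name exists
  resources:
    requests:
      storage: 5Gi
```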

Attach

Once a Pod is scheduled to a node, the volume's backing disk is attached to that node.

This logic is handled by the AttachDetach controller inside kube-controller-manager.

If the disk is attached, the corresponding device should appear in /dev/disk/by-id/ on the node.

Mount

Once a Pod is scheduled to a node, kubelet mounts the volume and makes it available to the container. This happens on the node and involves a few steps:

  • Mount the device to a global directory; the mounted filesystem info may be found under /var/lib/kubelet/plugins/kubernetes.io/.
  • Bind mount the volume from the global directory to a pod directory, /var/lib/kubelet/pods/<pod uid>/volumes.

Unmount

Once a Pod is deleted and its containers have been terminated, it is safe to unmount the volume. This is handled by kubelet on the node.

  • UnmountVolume: Unmount the pod bind mount
  • UnmountDevice: Unmount the global mount if this is the last reference to the volume

kubelet then marks the volume as safe to detach.

Detach

When a Pod is deleted from a node, and it is safe to detach, then detach it. This is handled by the AttachDetach controller.

Delete

When a PVC is deleted and the reclaim policy is Delete, delete the volume. This is handled by the PersistentVolume controller.
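
The reclaim policy usually comes from the StorageClass that provisioned the PV. A minimal sketch (the name and provisioner are illustrative; the provisioner must match a CSI driver installed in the cluster):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-delete            # hypothetical name
provisioner: ebs.csi.aws.com   # example CSI provisioner
reclaimPolicy: Delete          # the PV and backing disk are removed when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
```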

System Storage vs Application Storage

There are two categories of storage.

  • system storage: stored locally, on the control plane nodes (e.g., etcd, keys, certificates) and on the worker nodes (e.g., logs, metrics).
    • Etcd: fault-tolerance can be achieved either through master replication (i.e., running multiple masters, each using non-fault-tolerant (local) storage) or by a single master writing to / reading from fault-tolerant storage.
    • Keys and certificates, Audit logs: require encryption and restricted mutability.
    • System logs (e.g. Fluentd) and metrics (e.g. Prometheus): may not require fault-tolerant storage, as they are usually exported to a cloud backend and typically need local storage only for buffering (e.g., to cover up to 24h of network unavailability).
  • application storage: requires CSI drivers for customer-provided external storage. Options:
    • use pre-existing fault-tolerant on-prem storage solutions like NetApp or EMC
    • use a storage solution on top of a K8s cluster.
      • fault-tolerant K8s-managed storage: e.g. Ceph, EdgeFS, etc.
      • non-fault-tolerant: e.g. Persistent Local Volumes.

Access mode

  • ReadWriteOnce (RWO) – only one node is allowed to access the storage volume at a time for read and write access. RWO is supported by all PVs.
  • ReadOnlyMany (ROX) – many nodes may access the storage volume in read-only mode. ROX is supported primarily by file and file-like protocols, e.g. NFS and CephFS. However, some block protocols are supported, such as iSCSI.
  • ReadWriteMany (RWX) – many nodes may simultaneously read and write to the storage volume. RWX is supported by file and file-like protocols only, such as NFS.
  • ReadWriteOncePod (RWOP) – the volume can be mounted as read-write by a single Pod. GA in Kubernetes 1.29.

https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes

A block storage device (e.g., an SSD-backed cloud disk) can typically be attached to only a single VM instance at a time, so it maps to ReadWriteOnce (only one node can read/write to it).

File-based volumes (file shares such as EFS and FSx) allow numerous (Many) resources to connect to them and read/write data at the same time; a file storage mount such as an NFS or Samba share can be mounted on multiple virtual machines simultaneously.

Although the ReadWriteOnce access mode restricts volume access to a single node, it is still possible for multiple pods on that node to read from and write to the same volume. This can be a major problem for applications that require at most one writer for data-safety guarantees. That is why ReadWriteOncePod was created.
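
Requesting single-pod access is just a different access mode on the claim. A sketch (the claim name and size are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim         # hypothetical name
spec:
  accessModes:
    - ReadWriteOncePod     # at most one Pod may mount this volume read-write
  resources:
    requests:
      storage: 10Gi
```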

Key requirements of K8s system storage

  • fault tolerance (persisted state must be durable) and
  • bootstrapping (storage must be available even before the cluster control plane is fully operational)

Storage Options

Ephemeral: local ephemeral storage is managed by kubelet on each node, e.g. emptyDir, configMap, downwardAPI, secret

  • Erased when a pod is removed.
  • Standard Kubernetes volume types:
    • emptyDir: all containers in the Pod can read and write the same files in the emptyDir volume.
    • secret, configMap: inject secrets and configuration data into the pod.
    • downwardAPI: the downward API allows containers to consume information about themselves or the cluster without using the Kubernetes client or API server, e.g. the Pod's name, namespace, annotations, and labels. A downwardAPI volume makes downward API data available to applications.
  • Backed by local disks.
  • Manage sharing via Pod ephemeral-storage requests/limits, node allocatable.
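
The points above can be combined in one Pod spec: an emptyDir volume shared by the Pod's containers, with ephemeral-storage requests/limits to manage sharing of the node's local disk (names and sizes here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-pod          # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
      resources:
        requests:
          ephemeral-storage: "1Gi"
        limits:
          ephemeral-storage: "2Gi"   # exceeding this can get the Pod evicted
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 500Mi             # cap on the emptyDir itself
```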

Local and HostPath:

  • hostPath: mounts a file or directory from the host node's filesystem into your Pod. Should be avoided.
  • local: local storage device such as a disk, partition or directory. If a node becomes unhealthy, then the local volume becomes inaccessible by the pod.
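
A local PersistentVolume must carry node affinity so the scheduler places consuming Pods on the node that actually has the disk. A sketch (the PV name, path, and node name are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv       # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1      # disk, partition, or directory on the node
  nodeAffinity:                # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]
```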

Backend technology or protocols

  • nfs
  • iscsi
  • fc: fibre channel

Open Source Projects

  • Ceph (Red Hat) / Rook: k8s -> Rook -> Ceph
  • LongHorn (Rancher)
  • OpenEBS

Cloud big 3:

  • Amazon EBS
  • Google Persistent Disk
  • Azure Disk Storage

Enterprise:

  • NetApp: k8s -> Trident -> ONTAP
    • .spec.csi.driver: csi.trident.netapp.io
  • Pure Storage: Portworx
  • HPE Storage
  • Dell EMC
  • Red Hat Container Storage Platform
  • MayaData Kubera
  • Robin
  • StorageOS
  • Diamanti

References

Pod:

  • .spec.volumes: volumes available for the pod
  • .spec.containers[].volumeMounts: where to mount those volumes into containers
  • .spec.containers[].resources.limits.ephemeral-storage
  • .spec.containers[].resources.requests.ephemeral-storage

Container Storage Interface (CSI)

The Container Storage Interface (CSI) is a specification: it defines a standard interface for container orchestration systems to expose arbitrary storage systems (block and file storage) to their container workloads.

Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code.

Where is CSI called

  • Kubelet directly issues CSI calls (like NodeStageVolume, NodePublishVolume, etc.) to CSI drivers via a Unix Domain Socket to mount and unmount volumes.
  • Kubelet discovers CSI drivers (and the Unix Domain Socket to use to interact with a CSI driver) via the kubelet plugin registration mechanism.
  • Kubernetes master components do not communicate directly (via a Unix Domain Socket or otherwise) with CSI drivers. Kubernetes master components interact only with the Kubernetes API.
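
From the API side, a CSI-backed volume is just a PersistentVolume whose source names the driver. A sketch (the driver shown is an example; the volumeHandle is a hypothetical ID meaningful only to that driver):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: csi-pv-example           # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: ebs.csi.aws.com      # example CSI driver name
    volumeHandle: vol-0123456789 # hypothetical ID understood by the driver
    fsType: ext4
```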

HostPath

A host path volume mounts a file or directory from the file system of the host node into your pod. The CSI hostpath example driver stores volumes under /var/lib/csi-hostpath-data/<pvc-id> on the node, which means writes are not even guaranteed to be persisted (the contents might be in RAM).
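
A minimal hostPath Pod sketch (names and the mounted path are illustrative; mounting read-only limits the blast radius):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-pod       # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: host-logs
          mountPath: /host-logs
          readOnly: true
  volumes:
    - name: host-logs
      hostPath:
        path: /var/log     # directory on the node's filesystem
        type: Directory    # fail if the path does not exist
```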

downwardAPI

A downwardAPI volume makes the Pod's own info available to the container. For example, this makes metadata.labels available at /etc/podinfo/labels:

spec:
  containers:
    - ...
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels

NetApp

NetApp Harvest

https://github.com/NetApp/harvest

The default package collects performance, capacity and hardware metrics from ONTAP clusters.

NetApp Trident

https://github.com/NetApp/trident

Trident is an external provisioner controller:

  • run as a k8s pod or deployment; provides dynamic storage orchestration services for your Kubernetes workloads.
  • monitors activities on PVC / PV / StorageClass.
  • a single provisioner for different storage platforms (ONTAP and others).
  • Trident CSI driver talks to ONTAP REST API.

How Trident interacts with k8s (adapted from the Trident official doc) (TL;DR: PersistentVolumeClaim -> PersistentVolume -> TridentVolume -> actual storage):

  • A user creates a PersistentVolumeClaim requesting a new PersistentVolume of a particular size from a Kubernetes StorageClass that was previously configured by the administrator.
  • The Kubernetes StorageClass identifies Trident as its provisioner and includes parameters that tell Trident how to provision a volume for the requested class.
  • Trident looks at its own TridentStorageClass with the same name that identifies the matching Backends and StoragePools that it can use to provision volumes for the class.
  • Trident provisions storage on a matching backend and creates two objects:
    • a PersistentVolume in Kubernetes that tells Kubernetes how to find, mount and treat the volume.
    • a TridentVolume that retains the relationship between the PersistentVolume and the actual storage.
  • Kubernetes binds the PersistentVolumeClaim to the new PersistentVolume. Pods that include the PersistentVolumeClaim will mount that PersistentVolume on whichever host they run on.
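
A sketch of a StorageClass that names Trident as its provisioner (the class name is hypothetical, and the backendType parameter assumes an ONTAP NAS backend has been configured in Trident):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-gold           # hypothetical name
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas     # assumption: a matching Trident backend exists
```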

Trident backend related CRs: TridentBackend / TridentBackendConfig.

Trident CLI: tridentctl.