Kubernetes - Troubleshooting
... is attempting to grant RBAC permissions not currently held
Error:
Error from server (Forbidden): clusterroles.rbac.authorization.k8s.io "foo-cluster-role" is forbidden: user "[email protected]" (groups=["bar"]) is attempting to grant RBAC permissions not currently held:
{APIGroups:[""], Resources:["nodes"], Verbs:["list"]}
Solution: use kubectl patch to add the missing permission:
$ kubectl patch clusterrole cluster-role-name \
--kubeconfig ${KUBECONFIG} \
--type='json' \
-p='[{"op": "add", "path": "/rules/0", "value":{ "apiGroups": [""], "resources": ["nodes"], "verbs": ["list"]}}]'
If kubectl patch fails because the current user does not hold the permission (and therefore cannot grant it to this ClusterRole), check your kubeconfig; if there is another context with higher permissions, switch to it:
$ kubectl config use-context admin-context
Then patch again.
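To list the contexts available in your kubeconfig:
$ kubectl config get-contexts --kubeconfig ${KUBECONFIG}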
Err:28
Error: Err:28
map[DriverName:filesystem Enclosed:map[Err:28 Op:mkdir Path:/var/lib/registry/docker/registry/v2/repositories/<project>/<repository>]]
Root cause: Err 28 is ENOSPC; the registry volume has run out of disk space.
Verification: check the disk space of the Harbor registry pod:
$ kubectl -n HARBOR_NAMESPACE exec HARBOR_REGISTRY_POD_NAME -- df -ah | less
Solution: increase the storage size of the registry's PersistentVolumeClaim.
# Get the pod
POD=$(kubectl get pods -n HARBOR_NAMESPACE -l goharbor.io/operator-controller=registry -o name --kubeconfig=/path/to/kubeconfig)
# Set the new size
STORAGE_SIZE=400Gi
# Patch PVC
kubectl patch persistentvolumeclaim/harbor-registry \
--kubeconfig=/path/to/kubeconfig \
-n HARBOR_NAMESPACE --type=merge \
-p '{"spec": {"resources": {"requests": {"storage": "'$STORAGE_SIZE'"}}}}'
# Wait until the storage capacity is changed
kubectl --kubeconfig=/path/to/kubeconfig -n HARBOR_NAMESPACE exec $POD -- df -ah | grep "/var/lib/registry"
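If the PVC size does not change, check that its StorageClass allows volume expansion (allowVolumeExpansion must be true); STORAGE_CLASS_NAME below is a placeholder for whatever the first command prints:
# Find the StorageClass backing the PVC
kubectl --kubeconfig=/path/to/kubeconfig -n HARBOR_NAMESPACE get pvc harbor-registry -o jsonpath='{.spec.storageClassName}{"\n"}'
# Check whether it can be expanded
kubectl --kubeconfig=/path/to/kubeconfig get storageclass STORAGE_CLASS_NAME -o jsonpath='{.allowVolumeExpansion}{"\n"}'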
Err:30
Error: Err:30
map[DriverName:filesystem Enclosed:map[Err:30 Op:mkdir Path:/var/lib/registry/docker/registry/v2/repositories/<project>/<repository>]]
Root cause: Err 30 is -EROFS; something attempted to write to a filesystem that is mounted read-only.
Verification:
# Get the pod.
POD=$(kubectl get pods -n HARBOR_NAMESPACE -l goharbor.io/operator-controller=registry -o name --kubeconfig=/path/to/kubeconfig)
kubectl --kubeconfig=/path/to/kubeconfig -n HARBOR_NAMESPACE exec $POD -- mount | grep /var/lib/registry
# Check if it is mounted as `ro`.
Solution: delete and recreate the pod, then check whether the volume is re-attached as rw.
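A minimal sketch, assuming the registry pod is managed by a Deployment so it is recreated automatically ($POD as obtained above):
kubectl --kubeconfig=/path/to/kubeconfig -n HARBOR_NAMESPACE delete $POD
# After the new pod is Running, re-run the mount check above and confirm the volume is rw.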
Object stuck in Terminating Status
Check the finalizers of the object. An object will not be removed until its metadata.finalizers field is empty.
The target object remains in a terminating state while the control plane, or other components, take the actions defined by the finalizers.
https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/
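To inspect the finalizers (RESOURCE, NAME and NAMESPACE are placeholders), and, only as a last resort when you are sure the responsible controller will never complete its cleanup, clear them:
$ kubectl -n NAMESPACE get RESOURCE NAME -o jsonpath='{.metadata.finalizers}{"\n"}'
$ kubectl -n NAMESPACE patch RESOURCE NAME --type=merge -p '{"metadata":{"finalizers":null}}'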
message: 'The node was low on resource: ephemeral-storage.
Error
Pods are failing:
"message: 'The node was low on resource: ephemeral-storage."
Debug
Check disk usage
$ df -h
If the disk is indeed full, check what is taking up the space in /var/lib/kubelet or /var/log.
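For example (paths can vary by distro and container runtime):
$ sudo du -h --max-depth=2 /var/lib/kubelet | sort -h | tail
$ sudo du -h --max-depth=2 /var/log | sort -h | tail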
no kind is registered for the type ... in scheme ...
Add AddToScheme():
import (
    foov1 "path/to/foo/v1"
    "k8s.io/apimachinery/pkg/runtime"
    runtimeutil "k8s.io/apimachinery/pkg/util/runtime"
)

// scheme is the runtime.Scheme used by your client/manager.
scheme := runtime.NewScheme()
// Register the foo/v1 API group's types so the client knows the kind.
runtimeutil.Must(foov1.AddToScheme(scheme))
too many open files
Check:
$ sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# or
$ cat /proc/sys/fs/inotify/max_user_watches
$ cat /proc/sys/fs/inotify/max_user_instances
$ ulimit -n
1024
$ systemctl show xxxxx | grep LimitNOFILE
LimitNOFILE=262144
LimitNOFILESoft=1024
# find pid
$ systemctl status xxxxx
# check open files of the pid (e.g. pid=1964090)
ls "/proc/1964090/fd" | wc -l
1024
To increase temporarily:
$ sudo sysctl fs.inotify.max_user_watches=524288
$ sudo sysctl fs.inotify.max_user_instances=512
To make the changes persistent, edit the file /etc/sysctl.conf
and add these lines:
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
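Then reload the settings:
$ sudo sysctl -p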
Count inotify watches by user
find /proc/*/fd -user "$USER" -lname anon_inode:inotify \
-printf '%hinfo/%f\n' 2>/dev/null |
xargs cat | grep -c '^inotify'
Count open files:
#!/bin/bash
pids=$(ls -d /proc/[0-9]*)
for p in ${pids}; do
count=$(ls "${p}/fd" 2>/dev/null | wc -l)
if [ "${count}" -gt "150" ]; then
name=$(cat ${p}/comm)
echo "${name} has ${count} open files"
fi
done
"timed out waiting for cache to be synced"
Likely a missing CRD or missing RBAC permissions: the controller's informer cannot list/watch the resource, so its cache never syncs.
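For example, for a controller watching a custom resource foos.example.com running under a service account (both names are placeholders), check that the CRD exists and that the service account may list/watch it:
$ kubectl get crd foos.example.com
$ kubectl auth can-i list foos.example.com --as=system:serviceaccount:CONTROLLER_NAMESPACE:CONTROLLER_SA
$ kubectl auth can-i watch foos.example.com --as=system:serviceaccount:CONTROLLER_NAMESPACE:CONTROLLER_SA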
failed to call webhook: the server could not find the requested resource
- Check your ValidatingWebhookConfiguration CRs.
- Check the Service of the webhook.
- Check the Deployment of the webhook backend; see if it is up and running, and whether it is busy dealing with something else (example commands below).
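Example checks (WEBHOOK_NAMESPACE, WEBHOOK_SERVICE and WEBHOOK_DEPLOYMENT are placeholders for your setup):
$ kubectl get validatingwebhookconfigurations
$ kubectl -n WEBHOOK_NAMESPACE get svc WEBHOOK_SERVICE
$ kubectl -n WEBHOOK_NAMESPACE get deploy WEBHOOK_DEPLOYMENT
$ kubectl -n WEBHOOK_NAMESPACE logs deploy/WEBHOOK_DEPLOYMENT --tail=100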
the object has been modified; please apply your changes to the latest version and try again
When you "Get" an object, the object has a resourceVersion
inside ObjectMeta
; if resourceVersion
is changed before you submit your update, API server detects a diff in resourceVersion
so it rejects your request.
Under the hood etcd stores a 64-bit int called revision
for each object.
Solution: use retry.RetryOnConflict:
import (
    "fmt"

    "k8s.io/client-go/util/retry"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // On every attempt CreateOrUpdate re-reads the latest version of the
    // object before applying the mutation and sending the update.
    if _, err := controllerutil.CreateOrUpdate(ctx, client, object, func() error {
        // ... mutate object here ...
        return nil
    }); err != nil {
        return err
    }
    return nil
})
if err != nil {
    return fmt.Errorf("failed to update object: %w", err)
}
Pod cannot be scheduled
Possible causes:
- The cluster not having enough CPU or RAM available to meet the Pod's requirements.
- No node having free ports for the requested pod ports, e.g. "1 node(s) didn't have free ports for the requested pod ports."
- Pod affinity or anti-affinity rules preventing it from being scheduled onto the available nodes.
- Nodes being cordoned due to updates or restarts.
- The Pod requiring a persistent volume that's unavailable, or bound to an unavailable node.
- kube-scheduler itself missing RBAC permissions, e.g. "User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope".
How to troubleshoot:
- Check the status of the Pod.
- Check the status of the Node.
- Check the logs of kube-scheduler (example commands below).
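For example (the kube-scheduler label may differ depending on how the cluster was installed; component=kube-scheduler is what kubeadm uses):
$ kubectl -n NAMESPACE describe pod POD_NAME        # look at the Events section for the scheduler's message
$ kubectl describe node NODE_NAME                   # check allocatable resources, taints and conditions
$ kubectl -n kube-system logs -l component=kube-scheduler --tail=100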
Pod takes a long time to shutdown / SIGTERM is not properly handled
If the container uses /bin/sh -c ./startup.sh as its command, the shell runs as PID 1 and does not handle the SIGTERM it receives when the Pod is asked to shut down, nor does it forward it to the startup script. Kubernetes asks the container to stop, waits until its timeout (20 minutes in this case), and then sends the container a SIGKILL. In the meantime, the shell process is oblivious and doesn't know it should shut down.
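The wait before the SIGKILL is the Pod's terminationGracePeriodSeconds (the Kubernetes default is 30s, so 20 minutes here implies it was raised for this workload); to check it (POD_NAME is a placeholder):
$ kubectl get pod POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'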
To fix this, one way is to use Tini (https://github.com/krallin/tini):
with Tini, SIGTERM properly terminates your process even if you didn't explicitly install a signal handler for it.
For example, if you use Bazel to build the container image:
container_image(
name = "docker_image",
cmd = [
"/bin/sh",
"-c",
"/startup.sh",
],
# ...
)
Replace cmd with:
cmd = [
"/usr/bin/tini",
"/startup.sh",
],