
GCP - GKE

What is GKE Metadata Server?

gke-metadata-server: a reimplementation of a subset of the GCE Metadata Server API, without legacy APIs or access to VM metadata. When a pod requests an access token, gke-metadata-server identifies the calling pod from the request's source IP (consulting the container runtime via the CRI), and then exchanges the pod's Kubernetes service account token (a JWT) for a Google access token using the IAM Security Token Service.
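From inside a pod, this exchange is transparent: the pod queries the usual metadata endpoint and gets back a Google access token for its mapped identity. A minimal sketch, run from a shell inside a pod on a Workload Identity-enabled node pool:

```shell
# Request an access token from the metadata server; the Metadata-Flavor
# header is mandatory, requests without it are rejected.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
# The response is JSON containing access_token, expires_in, and token_type.
```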

The GKE metadata server is a component of the GKE Workload Identity feature that maps Kubernetes identities to Cloud IAM. It runs as a daemonset on every node in a Workload Identity enabled node pool and implements a metadata API that is compatible with the Compute Engine and App Engine metadata servers, exposing this API to pods on the node.

The GKE metadata server includes a token endpoint that returns a Google access token based on a pod's Kubernetes service account. It also serves endpoints that return static metadata such as the numeric project ID, project ID, cluster name, hostname, GCE instance ID, and zone.
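For example, the static endpoints can be queried from inside a pod with plain HTTP (a sketch; paths follow the standard computeMetadata/v1 layout):

```shell
H='Metadata-Flavor: Google'
BASE='http://metadata.google.internal/computeMetadata/v1'
curl -s -H "$H" "$BASE/project/project-id"                 # project ID
curl -s -H "$H" "$BASE/project/numeric-project-id"         # numeric project ID
curl -s -H "$H" "$BASE/instance/attributes/cluster-name"   # cluster name
curl -s -H "$H" "$BASE/instance/zone"                      # zone
```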

To check the status of the gke-metadata-server pod:

$ kubectl get pods -A | grep gke-metadata-server

If the pod status is Running, it is working properly. If it is stuck in ImagePullBackOff, the GSA associated with your node pool needs permission to pull the container image from your registry. Grant it a role that includes the storage.objects.get permission, such as "Storage Object Viewer" (roles/storage.objectViewer).
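Granting that permission from the CLI might look like the following sketch (PROJECT_ID and NODE_GSA are placeholders for your project and node service account):

```shell
# Let the node pool's service account read objects from the registry bucket.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member "serviceAccount:NODE_GSA@PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/storage.objectViewer"
```

If the image lives in Artifact Registry instead of Container Registry, the equivalent role is roles/artifactregistry.reader.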

On Workload Identity-enabled node pools, all packets from pods bound for the GCE metadata server (metadata.google.internal, 169.254.169.254) are instead delivered to the gke-metadata-server daemonset on the node.

Where are the Nodes running?

The Kubernetes control plane runs in a Google-managed tenant project; the user cluster's nodes run as Compute Engine VMs in the customer's project.

What are supported Node OS Images?

GKE provides two primary families of Linux-based node images:

  • Container-Optimized OS: the default node OS Image in GKE.
  • Ubuntu.

What is Workload Identity?

Workload Identity (WI) is a GKE feature that allows Kubernetes pods to authenticate to GCP APIs without manually managing IAM service account credentials.

It's built on top of Workload Identity Pools, which allow federating external identity providers into Cloud IAM. It's the recommended way for customers to integrate their GKE Workloads with other GCP services, since it provides a drop-in experience (the GKE Metadata Server) that maintains compatibility with existing client libraries.

Workload Identity solves the multitenancy and key management problems by giving each Kubernetes service account its own Google identity, and letting each pod automatically pull an access token for this identity using the existing Google client libraries.
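The binding itself is a documented two-step setup; in this sketch, GSA_NAME, PROJECT_ID, NAMESPACE, and KSA_NAME are placeholders:

```shell
# 1. Allow the Kubernetes service account to impersonate the Google service
#    account through the cluster's Workload Identity pool.
gcloud iam service-accounts add-iam-policy-binding \
  GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

# 2. Annotate the KSA so the GKE metadata server knows which GSA to mint
#    tokens for.
kubectl annotate serviceaccount KSA_NAME \
  --namespace NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
```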

How to enable GKE Sandbox?

GKE Sandbox, built on the open-source gVisor runtime, adds a second layer of isolation between containers and the host kernel, making it a common choice for running AI workloads and untrusted code on Google Kubernetes Engine.

Because GKE Sandbox cannot be enabled on the default node pool (system services such as networking and logging agents need direct kernel access), you must follow these steps to set up a dedicated node pool.

Step 1: Enable GKE Sandbox on the Node Pool

The configuration differs depending on whether you are using GKE Standard or GKE Autopilot.

A. For GKE Standard (Manual Control)

In Standard mode, you must create a specific node pool for your sandboxed workloads.

  1. Create a New Node Pool: In the GKE console, navigate to your cluster and click Add Node Pool.
  2. Image Type: Under the Nodes tab, you must select Container-Optimized OS with containerd (cos_containerd). gVisor does not work with Ubuntu or Docker runtimes.
  3. Enable Sandbox: Navigate to the Security tab and check the box Enable sandbox with gVisor.
  4. Machine Type: For older GKE versions, avoid e2-micro/small/medium. Stick to n1, n2, or e2-standard (2+ vCPU) to ensure stability.

Using CLI:

gcloud container node-pools create sandbox-pool \
    --cluster [CLUSTER_NAME] \
    --image-type cos_containerd \
    --sandbox type=gvisor \
    --machine-type n2-standard-2

B. For GKE Autopilot (Zero-Config)

In Autopilot, you don't need to configure node pools. Google manages the underlying hardware for you. As long as your cluster is version 1.27.4-gke.800 or later, gVisor is ready to go. You only need to request it in your YAML (Step 2).

Step 2: Update Your Workload Manifest

Simply enabling the sandbox on the node pool isn't enough; you must explicitly tell Kubernetes to put your specific Pod inside the gVisor "jail."

You do this by adding the runtimeClassName: gvisor field to your Pod spec.

Example: A Sandboxed AI Python Runner

apiVersion: v1
kind: Pod
metadata:
  name: untrusted-ai-agent
spec:
  # This is the magic line that activates gVisor
  runtimeClassName: gvisor
  containers:
    - name: python-executor
      image: python:3.9-slim
      command: ['python3', '-c', "import os; print('Running in gVisor')"]

Step 3: Enabling GPU Support for AI (The "nvproxy" Layer)

If you are running LLM inference or AI-generated CUDA code, you can now use GPUs inside gVisor (GKE version 1.29.2-gke.1108000+).

  1. Configure Node Pool: Ensure your node pool has GPUs attached (e.g., L4 or T4).
  2. The Manifest: Use the same runtimeClassName, but add the standard GPU resource limits. GKE will automatically use nvproxy to safely pass GPU instructions through the sandbox.
spec:
  runtimeClassName: gvisor
  containers:
    - name: gpu-inference
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
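The node-pool side of this could be created from the CLI roughly as follows (a sketch assuming an NVIDIA L4 on a g2-standard-4 machine; adjust names and GPU type for your cluster):

```shell
# GPUs plus gVisor still require the cos_containerd image type.
gcloud container node-pools create gpu-sandbox-pool \
  --cluster CLUSTER_NAME \
  --image-type cos_containerd \
  --sandbox type=gvisor \
  --machine-type g2-standard-4 \
  --accelerator type=nvidia-l4,count=1
```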

Step 4: Verification

Once your Pod is running, you can verify that it is actually inside the gVisor sandbox by running:

kubectl get pod [POD_NAME] -o jsonpath='{.spec.runtimeClassName}'
# Output should be: gvisor

Or, run a command inside the pod to inspect the kernel. gVisor implements its own application kernel and reports a fixed, older Linux version (for example, 4.4.0) regardless of what the host is running:

kubectl exec [POD_NAME] -- uname -a
# Reports gVisor's emulated kernel version, not the host kernel.

kubectl exec [POD_NAME] -- dmesg
# The boot log begins with "Starting gVisor...".

Pro-Tips for AI Developers:

  • Metadata Access: By default, gVisor blocks direct access to the Compute Engine metadata server (to prevent your AI code from stealing the node's identity token). If your AI app needs to talk to Google Cloud APIs (like Vertex AI or Cloud Storage), you must use Workload Identity.
  • Performance: Don't be afraid to use gVisor for math-heavy AI. Compute-bound work happens on the CPU/GPU rather than through system calls, so the overhead for such workloads is usually small; syscall- and I/O-heavy workloads pay a larger penalty.
  • Logging: Enable Cloud Logging for the cluster. gVisor logs denied operations (e.g., an attempt to use a forbidden system call), which is vital for debugging why an AI agent's code might be failing.
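Turning on workload logs for an existing cluster is a single command (CLUSTER_NAME is a placeholder):

```shell
# Send both system and workload logs to Cloud Logging.
gcloud container clusters update CLUSTER_NAME \
  --logging=SYSTEM,WORKLOAD
```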