Cheatsheet - Kubeflow
Kubeflow is an open-source project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. It provides components for all stages of the ML lifecycle, from data preparation and model training to hyperparameter tuning and model serving.
1. Core Concepts & Architecture
- Kubernetes-Native: Kubeflow leverages Kubernetes for container orchestration, resource management, and scalability.
- Central Dashboard: A web-based UI for managing all Kubeflow components.
- Namespaces: Each Kubeflow user typically gets their own Kubernetes namespace (often called a "profile") for isolation.
- Components: Modular services for specific ML tasks.
2. Common Kubeflow Components
2.1 Kubeflow Notebooks (Jupyter/VS Code)
- Purpose: Interactive development environment for ML experiments.
- Access: Via Central Dashboard -> Notebooks.
- Key Features:
- Spawn new Notebook Server: Choose Docker image (TensorFlow, PyTorch, custom), CPU/GPU, memory, storage.
- Persistent Volume Claim (PVC): Attach persistent storage for notebooks and data.
- Custom Images: Use your own Docker images with pre-installed libraries.
- Workload Identity: Assign a GCP Service Account (or AWS IAM role, Azure Managed Identity) to your notebook for secure cloud resource access.
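If you attach a cloud identity to a notebook, it is worth verifying the binding before running real workloads. A minimal sketch, assuming a GCP Workload Identity setup with the `google-cloud-storage` client installed; the bucket name is a placeholder:

```python
# Quick check that the notebook's attached identity can reach cloud storage.
from google.cloud import storage

client = storage.Client()  # picks up the notebook's bound service account
for blob in client.list_blobs("my-ml-datasets", max_results=5):  # placeholder bucket
    print(blob.name)
```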
2.2 Kubeflow Pipelines (KFP)
- Purpose: Orchestrate complex ML workflows as directed acyclic graphs (DAGs) of components.
- SDK: `kfp` (the Kubeflow Pipelines SDK for Python).
- Key Concepts:
- Component: A self-contained piece of code (often a Docker image) that performs a single task (e.g., data preprocessing, model training).
- Pipeline: A DAG of components, defining the workflow.
- Run: An execution of a pipeline.
- Experiment: A logical grouping of pipeline runs.
- Defining a Component:
- Python Function-based: Simplest. Decorate a Python function.

```python
from kfp import dsl

@dsl.component
def add(a: float, b: float) -> float:
    return a + b
```

- Container-based (YAML): More flexible, uses a Docker image. Note the `inputValue`/`outputPath` placeholders, which wire the declared inputs and outputs into the container's arguments:

```yaml
# component.yaml
name: my-component
description: A sample component
inputs:
- { name: message, type: String }
outputs:
- { name: output_message, type: String }
implementation:
  container:
    image: python:3.9-slim
    command: ['python', '-c']
    args:
    - |
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument("--message", type=str)
      parser.add_argument("--output_message_path", type=str)
      args = parser.parse_args()
      with open(args.output_message_path, "w") as f:
          f.write(f"Hello, {args.message}!")
    - --message
    - { inputValue: message }
    - --output_message_path
    - { outputPath: output_message }
```
- Defining a Pipeline (Python):

```python
from kfp import dsl, compiler

@dsl.pipeline(name="my-first-pipeline", description="A simple pipeline")
def my_pipeline(msg: str = "World"):
    op1 = add_op(a=10.0, b=20.0)  # assuming 'add_op' is a defined component
    op2 = greet_op(message=msg)   # assuming 'greet_op' is a defined component
    # Components can pass outputs as inputs:
    # op3 = another_op(input_from_op1=op1.output)

# Compile the pipeline
compiler.Compiler().compile(my_pipeline, 'my_pipeline.yaml')
```
- Running a Pipeline:
- From UI: Upload the compiled YAML or connect to a Git repo.
- From SDK:

```python
from kfp import Client

client = Client()  # connects to Kubeflow Pipelines in your cluster
client.create_run_from_pipeline_func(my_pipeline, arguments={'msg': 'Kubeflow'})
```
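If you have already compiled the pipeline to YAML, you can submit the package file instead of the function via `create_run_from_pipeline_package`. A minimal sketch, assuming the same in-cluster connection; the experiment name is an illustrative placeholder:

```python
from kfp import Client

client = Client()
# Submit a previously compiled pipeline package under a named experiment.
# "dev-experiments" is a placeholder experiment name.
client.create_run_from_pipeline_package(
    'my_pipeline.yaml',
    arguments={'msg': 'Kubeflow'},
    experiment_name='dev-experiments',
)
```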
2.3 KFServing (now KServe)
- Purpose: Deploy ML models to Kubernetes with serverless capabilities, auto-scaling, canary rollouts, and multi-model hosting.
- Core Resource: `InferenceService` (a Custom Resource Definition, CRD).
- Key Features:
- Serverless Auto-scaling: Scales to zero when idle.
- Canary Rollouts: Gradually shift traffic to new model versions.
- A/B Testing: Route traffic between multiple models.
- Model Explanations: Integrates with explainability toolkits (e.g., Alibi Explain, ART).
- Built-in Runtimes: Supports common frameworks like TensorFlow, PyTorch, Scikit-learn, XGBoost.
- Example InferenceService (YAML):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    sklearn:
      storageUri: gs://my-model-bucket/sklearn-iris-model  # or s3://, pvc://
    # PyTorch example:
    # pytorch:
    #   storageUri: gs://my-model-bucket/pytorch-model
    #   runtimeVersion: "1.10"  # optional
    #   protocolVersion: "v2"   # optional, for the v2 API
```
- Deployment: `kubectl apply -f my-model-service.yaml`
- Inference: KServe exposes a URL for your model. You send HTTP POST requests (e.g., JSON payload) to this URL.
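Once the service is ready, a prediction request is a plain HTTP POST. A minimal sketch using the KServe v1 prediction protocol against the four-feature sklearn iris model above; the URL is a placeholder (take the real one from `kubectl get inferenceservice`), and your ingress may additionally require an auth or `Host` header:

```python
import requests

# Placeholder URL; get the real one from `kubectl get inferenceservice`.
url = ("http://my-sklearn-model.kubeflow-user-example-com.example.com"
       "/v1/models/my-sklearn-model:predict")

# KServe v1 protocol: a JSON body with an "instances" list.
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}

resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json())  # e.g., {"predictions": [1]}
```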
2.4 Katib
- Purpose: Automate hyperparameter tuning and Neural Architecture Search (NAS).
- Core Resource: `Experiment` (CRD).
- Key Concepts:
- Experiment: Defines the search space for hyperparameters or architecture, objective metrics (e.g., accuracy, loss), and a search algorithm (e.g., Grid, Random, Bayesian Optimization).
- Trial: A single run of the training job with a specific set of hyperparameter values or a specific architecture.
- Suggestion: Katib's recommendation for the next set of hyperparameters to try.
- Example Katib Experiment (YAML):
```yaml
apiVersion: katib.kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorch-experiment
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3  # number of trials to run in parallel
  maxTrialCount: 12      # total number of trials
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: '0.001'
      max: '0.01'
  - name: momentum
    parameterType: double
    feasibleSpace:
      min: '0.1'
      max: '0.9'
  trialTemplate:
    primaryContainerName: pytorch
    # Map the search-space parameters above to the ${trialParameters.*}
    # placeholders substituted into the training job below.
    trialParameters:
    - name: lr
      reference: lr
    - name: momentum
      reference: momentum
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflow/pytorch-mnist-with-pvc:v1.0
                  command:
                  - 'python'
                  - '/opt/model.py'
                  - '--lr=${trialParameters.lr}'
                  - '--momentum=${trialParameters.momentum}'
```
- Deployment: `kubectl apply -f my-katib-experiment.yaml`
- Monitoring: Track trials and their results via the Katib UI in the Central Dashboard.
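For Katib to evaluate each trial, the training code must report the objective metric in a form the metrics collector can read; with the default stdout collector, printing `<metric-name>=<value>` lines is enough. A minimal sketch of what `/opt/model.py` might log, with the actual training loop elided and a placeholder loss value:

```python
import argparse

# Parse the hyperparameters Katib substitutes into the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--momentum", type=float, default=0.9)
args = parser.parse_args()

for epoch in range(5):
    # ... run one epoch of training with args.lr / args.momentum ...
    loss = 1.0 / (epoch + 1)  # placeholder standing in for the real loss
    # The default stdout metrics collector parses lines like "loss=0.5",
    # matching objectiveMetricName in the Experiment spec.
    print(f"loss={loss}")
```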
2.5 Training Operators (e.g., TFJob, PyTorchJob)
- Purpose: Run distributed ML training jobs natively on Kubernetes.
- Custom Resources: `TFJob`, `PyTorchJob`, `MPIJob`, `XGBoostJob`, `PaddleJob`, `MXNetJob`, `RayJob`.
- Key Features:
- Distributed Training: Automatically sets up communication between workers, PS (parameter servers), and chief replicas.
- Fault Tolerance: Handles node failures and restarts jobs.
- Resource Management: Allocates specific CPU/GPU resources to each replica.
- Example TFJob (YAML):
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-resnet
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.10.0-gpu
            command: ['python', 'model.py']
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.10.0-gpu
            command: ['python', 'model.py']
            resources:
              limits:
                nvidia.com/gpu: 1
```
- Deployment: `kubectl apply -f my-tfjob.yaml`
- Monitoring: Use `kubectl get tfjob` or the Training Operators UI in the Central Dashboard.
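The operator coordinates the replicas by injecting a `TF_CONFIG` environment variable into each pod, describing the cluster layout and that pod's role. A sketch of how `model.py` might use it, assuming TensorFlow 2.x:

```python
import json
import os

import tensorflow as tf

# The training operator sets TF_CONFIG in every replica, e.g.:
# {"cluster": {"chief": [...], "worker": [...]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("Running as:", tf_config.get("task"))

# MultiWorkerMirroredStrategy reads TF_CONFIG to coordinate the replicas.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # A model built inside the strategy scope is replicated across workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```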
3. CLI Tools
- `kubectl`: The standard Kubernetes CLI, used for managing all Kubeflow resources (CRDs).

```bash
kubectl get pods -n <namespace>
kubectl get notebook -n <namespace>
kubectl get inferenceservice -n <namespace>
kubectl logs <pod-name> -n <namespace>
```

- `kfctl` (Legacy): Previously used for Kubeflow installation; now largely replaced by `kustomize` or cloud-specific deployment tools.
- `kfp` SDK (Python): For defining, compiling, and submitting Kubeflow Pipelines.
4. General Tips & Best Practices
- Namespaces/Profiles: Always work within your assigned namespace (`kubectl config set-context --current --namespace=<your-namespace>`).
- Persistent Storage: Always use PVCs for data that needs to persist beyond a pod's lifecycle (notebooks, datasets, model checkpoints).
- Docker Images: Use custom Docker images for your components to ensure consistent environments and manage dependencies.
- Resource Requests/Limits: Specify CPU, memory, and GPU requests/limits for all your pods (notebooks, pipeline components, training jobs) to ensure fair scheduling and prevent resource exhaustion; see the sketch after this list.
- Workload Identity/IAM: Leverage your cloud provider's workload identity mechanism (e.g., GCP Workload Identity, AWS IRSA, Azure Workload Identity) for secure access to cloud services instead of hardcoding credentials.
- Central Dashboard: Your primary entry point for managing and monitoring. Explore all the sections (Notebooks, Pipelines, Models, Experiments).
- Logging & Monitoring: Integrate with Prometheus/Grafana or cloud-native monitoring (e.g., Google Cloud Monitoring) for observing your ML workloads.
- GitOps for Pipelines: Store your pipeline definitions in Git and use CI/CD to automatically compile and deploy them.
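For pipeline components, resource limits can be set from the `kfp` SDK instead of raw YAML. A minimal sketch using kfp v2 task-level setters on the `add` component defined earlier; the limit values are illustrative:

```python
from kfp import dsl

@dsl.component
def add(a: float, b: float) -> float:
    return a + b

@dsl.pipeline(name="resource-limits-demo")
def demo_pipeline():
    task = add(a=1.0, b=2.0)
    # Illustrative limits; size these to your workload.
    task.set_cpu_limit("1")
    task.set_memory_limit("2G")
    # For GPU components (values depend on your cluster's accelerators):
    # task.set_accelerator_type("nvidia.com/gpu")
    # task.set_accelerator_limit(1)
```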