Cheatsheet - Kubeflow
Kubeflow is an open-source project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. It provides components for all stages of the ML lifecycle, from data preparation and model training to hyperparameter tuning and model serving.
1. Core Concepts & Architecture
- Kubernetes-Native: Kubeflow leverages Kubernetes for container orchestration, resource management, and scalability.
- Central Dashboard: A web-based UI for managing all Kubeflow components.
- Namespaces: Each Kubeflow user typically gets their own Kubernetes namespace (often called a "profile") for isolation.
- Components: Modular services for specific ML tasks.
2. Common Kubeflow Components
2.1 Kubeflow Notebooks (Jupyter/VS Code)
- Purpose: Interactive development environment for ML experiments.
- Access: Via Central Dashboard -> Notebooks.
- Key Features:
- Spawn new Notebook Server: Choose Docker image (TensorFlow, PyTorch, custom), CPU/GPU, memory, storage.
- Persistent Volume Claim (PVC): Attach persistent storage for notebooks and data.
- Custom Images: Use your own Docker images with pre-installed libraries.
- Workload Identity: Assign a GCP Service Account (or AWS IAM role, Azure Managed Identity) to your notebook for secure cloud resource access.
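If you attach a cloud identity to a notebook, it is worth verifying the binding before running real workloads. A minimal sketch, assuming a GCP Workload Identity setup with the `google-cloud-storage` client installed; the bucket name is a placeholder:

```python
# Quick check that the notebook's attached identity can reach cloud storage.
from google.cloud import storage

client = storage.Client()  # picks up the notebook's bound service account
for blob in client.list_blobs("my-ml-datasets", max_results=5):  # placeholder bucket
    print(blob.name)
```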
2.2 Kubeflow Pipelines (KFP)
- Purpose: Orchestrate complex ML workflows as directed acyclic graphs (DAGs) of components.
- SDK: `kfp` (the Kubeflow Pipelines SDK for Python).
- Key Concepts:
- Component: A self-contained piece of code (often a Docker image) that performs a single task (e.g., data preprocessing, model training).
- Pipeline: A DAG of components, defining the workflow.
- Run: An execution of a pipeline.
- Experiment: A logical grouping of pipeline runs.
- Defining a Component:
- Python Function-based: Simplest. Decorate a Python function.

```python
from kfp import dsl

@dsl.component
def add(a: float, b: float) -> float:
    return a + b
```

- Container-based (YAML): More flexible, uses a Docker image. Note the `inputValue`/`outputPath` placeholders, which wire the declared inputs and outputs into the container's arguments:

```yaml
# component.yaml
name: my-component
description: A sample component
inputs:
- { name: message, type: String }
outputs:
- { name: output_message, type: String }
implementation:
  container:
    image: python:3.9-slim
    command: ['python', '-c']
    args:
    - |
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument("--message", type=str)
      parser.add_argument("--output_message_path", type=str)
      args = parser.parse_args()
      with open(args.output_message_path, "w") as f:
          f.write(f"Hello, {args.message}!")
    - --message
    - { inputValue: message }
    - --output_message_path
    - { outputPath: output_message }
```
- Defining a Pipeline (Python):

```python
from kfp import dsl, compiler

@dsl.pipeline(name="my-first-pipeline", description="A simple pipeline")
def my_pipeline(msg: str = "World"):
    op1 = add_op(a=10.0, b=20.0)  # assuming 'add_op' is a defined component
    op2 = greet_op(message=msg)   # assuming 'greet_op' is a defined component
    # Components can pass outputs as inputs:
    # op3 = another_op(input_from_op1=op1.output)

# Compile the pipeline
compiler.Compiler().compile(my_pipeline, 'my_pipeline.yaml')
```
- Running a Pipeline:
- From UI: Upload the compiled YAML or connect to a Git repo.
- From SDK:

```python
from kfp import Client

client = Client()  # connects to Kubeflow Pipelines in your cluster
client.create_run_from_pipeline_func(my_pipeline, arguments={'msg': 'Kubeflow'})
```
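If you have already compiled the pipeline to YAML, you can submit the package file instead of the function via `create_run_from_pipeline_package`. A minimal sketch, assuming the same in-cluster connection; the experiment name is an illustrative placeholder:

```python
from kfp import Client

client = Client()
# Submit a previously compiled pipeline package under a named experiment.
# "dev-experiments" is a placeholder experiment name.
client.create_run_from_pipeline_package(
    'my_pipeline.yaml',
    arguments={'msg': 'Kubeflow'},
    experiment_name='dev-experiments',
)
```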
2.3 KFServing (now KServe)
- Purpose: Deploy ML models to Kubernetes with serverless capabilities, auto-scaling, canary rollouts, and multi-model hosting.
- Core Resource: `InferenceService` (a Custom Resource Definition, CRD).
- Key Features:
- Serverless Auto-scaling: Scales to zero when idle.
- Canary Rollouts: Gradually shift traffic to new model versions.
- A/B Testing: Route traffic between multiple models.
- Model Explanations: Integrates with explainability toolkits (e.g., Alibi Explain, ART).
- Built-in Runtimes: Supports common frameworks like TensorFlow, PyTorch, Scikit-learn, XGBoost.
- Example InferenceService (YAML):
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    sklearn:
      storageUri: gs://my-model-bucket/sklearn-iris-model  # or s3://, pvc://
    # PyTorch example:
    # pytorch:
    #   storageUri: gs://my-model-bucket/pytorch-model
    #   runtimeVersion: "1.10"  # optional
    #   protocolVersion: "v2"   # optional, for the v2 API
```
- Deployment: `kubectl apply -f my-model-service.yaml`
- Inference: KServe exposes a URL for your model. You send HTTP POST requests (e.g., JSON payload) to this URL.
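Once the service is ready, a prediction request is a plain HTTP POST. A minimal sketch using the KServe v1 prediction protocol against the four-feature sklearn iris model above; the URL is a placeholder (take the real one from `kubectl get inferenceservice`), and your ingress may additionally require an auth or `Host` header:

```python
import requests

# Placeholder URL; get the real one from `kubectl get inferenceservice`.
url = ("http://my-sklearn-model.kubeflow-user-example-com.example.com"
       "/v1/models/my-sklearn-model:predict")

# KServe v1 protocol: a JSON body with an "instances" list.
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}

resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json())  # e.g., {"predictions": [1]}
```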
2.4 Katib
- Purpose: Automate hyperparameter tuning and Neural Architecture Search (NAS).
- Core Resource: `Experiment` (CRD).
- Key Concepts:
- Experiment: Defines the search space for hyperparameters or architecture, objective metrics (e.g., accuracy, loss), and a search algorithm (e.g., Grid, Random, Bayesian Optimization).
- Trial: A single run of the training job with a specific set of hyperparameter values or a specific architecture.
- Suggestion: Katib's recommendation for the next set of hyperparameters to try.
- Example Katib Experiment (YAML):
```yaml
apiVersion: katib.kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorch-experiment
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3  # number of trials to run in parallel
  maxTrialCount: 12      # total number of trials
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: '0.001'
      max: '0.01'
  - name: momentum
    parameterType: double
    feasibleSpace:
      min: '0.1'
      max: '0.9'
  trialTemplate:
    primaryContainerName: pytorch
    # Map the search-space parameters above to the ${trialParameters.*}
    # placeholders substituted into the training job below.
    trialParameters:
    - name: lr
      reference: lr
    - name: momentum
      reference: momentum
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                - name: pytorch
                  image: docker.io/kubeflow/pytorch-mnist-with-pvc:v1.0
                  command:
                  - 'python'
                  - '/opt/model.py'
                  - '--lr=${trialParameters.lr}'
                  - '--momentum=${trialParameters.momentum}'
```
- Deployment: `kubectl apply -f my-katib-experiment.yaml`
- Monitoring: Track trials and their results via the Katib UI in the Central Dashboard.
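For Katib to evaluate each trial, the training code must report the objective metric in a form the metrics collector can read; with the default stdout collector, printing `<metric-name>=<value>` lines is enough. A minimal sketch of what `/opt/model.py` might log, with the actual training loop elided and a placeholder loss value:

```python
import argparse

# Parse the hyperparameters Katib substitutes into the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--momentum", type=float, default=0.9)
args = parser.parse_args()

for epoch in range(5):
    # ... run one epoch of training with args.lr / args.momentum ...
    loss = 1.0 / (epoch + 1)  # placeholder standing in for the real loss
    # The default stdout metrics collector parses lines like "loss=0.5",
    # matching objectiveMetricName in the Experiment spec.
    print(f"loss={loss}")
```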
2.5 Training Operators (e.g., TFJob, PyTorchJob)
- Purpose: Run distributed ML training jobs natively on Kubernetes.
- Custom Resources: `TFJob`, `PyTorchJob`, `MPIJob`, `XGBoostJob`, `PaddleJob`, `MXNetJob`, `RayJob`.
- Key Features:
- Distributed Training: Automatically sets up communication between workers, PS (parameter servers), and chief replicas.
- Fault Tolerance: Handles node failures and restarts jobs.
- Resource Management: Allocates specific CPU/GPU resources to each replica.
- Example TFJob (YAML):
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-resnet
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.10.0-gpu
            command: ['python', 'model.py']
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.10.0-gpu
            command: ['python', 'model.py']
            resources:
              limits:
                nvidia.com/gpu: 1
```
- Deployment: `kubectl apply -f my-tfjob.yaml`
- Monitoring: Use `kubectl get tfjob` or the Training Operators UI in the Central Dashboard.
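The operator coordinates the replicas by injecting a `TF_CONFIG` environment variable into each pod, describing the cluster layout and that pod's role. A sketch of how `model.py` might use it, assuming TensorFlow 2.x:

```python
import json
import os

import tensorflow as tf

# The training operator sets TF_CONFIG in every replica, e.g.:
# {"cluster": {"chief": [...], "worker": [...]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("Running as:", tf_config.get("task"))

# MultiWorkerMirroredStrategy reads TF_CONFIG to coordinate the replicas.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # A model built inside the strategy scope is replicated across workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```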
3. CLI Tools
- `kubectl`: The standard Kubernetes CLI, used for managing all Kubeflow resources (CRDs).

```bash
kubectl get pods -n <namespace>
kubectl get notebook -n <namespace>
kubectl get inferenceservice -n <namespace>
kubectl logs <pod-name> -n <namespace>
```

- `kfctl` (Legacy): Previously used for Kubeflow installation; now largely replaced by `kustomize` or cloud-specific deployment tools.
- `kfp` SDK (Python): For defining, compiling, and submitting Kubeflow Pipelines.
4. General Tips & Best Practices
- Namespaces/Profiles: Always work within your assigned namespace (`kubectl config set-context --current --namespace=<your-namespace>`).
- Persistent Storage: Always use PVCs for data that needs to persist beyond a pod's lifecycle (notebooks, datasets, model checkpoints).
- Docker Images: Use custom Docker images for your components to ensure consistent environments and manage dependencies.
- Resource Requests/Limits: Specify CPU, memory, and GPU requests/limits for all your pods (notebooks, pipeline components, training jobs) to ensure fair scheduling and prevent resource exhaustion; see the sketch after this list.
- Workload Identity/IAM: Leverage your cloud provider's workload identity mechanism (e.g., GCP Workload Identity, AWS IRSA, Azure Workload Identity) for secure access to cloud services instead of hardcoding credentials.
- Central Dashboard: Your primary entry point for managing and monitoring. Explore all the sections (Notebooks, Pipelines, Models, Experiments).
- Logging & Monitoring: Integrate with Prometheus/Grafana or cloud-native monitoring (e.g., Google Cloud Monitoring) for observing your ML workloads.
- GitOps for Pipelines: Store your pipeline definitions in Git and use CI/CD to automatically compile and deploy them.
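For pipeline components, resource limits can be set from the `kfp` SDK instead of raw YAML. A minimal sketch using kfp v2 task-level setters on the `add` component defined earlier; the limit values are illustrative:

```python
from kfp import dsl

@dsl.component
def add(a: float, b: float) -> float:
    return a + b

@dsl.pipeline(name="resource-limits-demo")
def demo_pipeline():
    task = add(a=1.0, b=2.0)
    # Illustrative limits; size these to your workload.
    task.set_cpu_limit("1")
    task.set_memory_limit("2G")
    # For GPU components (values depend on your cluster's accelerators):
    # task.set_accelerator_type("nvidia.com/gpu")
    # task.set_accelerator_limit(1)
```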