Cheatsheet - Kubeflow

Kubeflow is an open-source project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. It provides components for all stages of the ML lifecycle, from data preparation and model training to hyperparameter tuning and model serving.

1. Core Concepts & Architecture

  • Kubernetes-Native: Kubeflow leverages Kubernetes for container orchestration, resource management, and scalability.
  • Central Dashboard: A web-based UI for managing all Kubeflow components.
  • Namespaces: Each Kubeflow user typically gets their own Kubernetes namespace (often called a "profile") for isolation.
  • Components: Modular services for specific ML tasks.

2. Common Kubeflow Components

2.1 Kubeflow Notebooks (Jupyter/VS Code)

  • Purpose: Interactive development environment for ML experiments.
  • Access: Via Central Dashboard -> Notebooks.
  • Key Features:
    • Spawn new Notebook Server: Choose Docker image (TensorFlow, PyTorch, custom), CPU/GPU, memory, storage.
    • Persistent Volume Claim (PVC): Attach persistent storage for notebooks and data.
    • Custom Images: Use your own Docker images with pre-installed libraries.
    • Workload Identity: Assign a GCP Service Account (or AWS IAM role, Azure Managed Identity) to your notebook for secure cloud resource access.
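  • Example Notebook (YAML): notebook servers are themselves Kubernetes custom resources, so they can also be created declaratively. A minimal sketch (image tag, PVC name, and namespace are illustrative):
    apiVersion: kubeflow.org/v1
    kind: Notebook
    metadata:
      name: my-notebook
      namespace: kubeflow-user-example-com
    spec:
      template:
        spec:
          containers:
            - name: my-notebook
              image: kubeflownotebookswg/jupyter-scipy:v1.8.0
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
              volumeMounts:
                - name: workspace
                  mountPath: /home/jovyan
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: my-notebook-workspace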

2.2 Kubeflow Pipelines (KFP)

  • Purpose: Orchestrate complex ML workflows as directed acyclic graphs (DAGs) of components.

  • SDK: kfp (Kubeflow Pipelines SDK for Python).

  • Key Concepts:

    • Component: A self-contained piece of code (often a Docker image) that performs a single task (e.g., data preprocessing, model training).
    • Pipeline: A DAG of components, defining the workflow.
    • Run: An execution of a pipeline.
    • Experiment: A logical grouping of pipeline runs.
  • Defining a Component:

    • Python Function-based: The simplest approach; decorate a Python function (kfp v2 SDK).
      from kfp import dsl
      @dsl.component
      def add(a: float, b: float) -> float:
          return a + b
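      # Optional (kfp v2): pin the component's environment, e.g.
      # @dsl.component(base_image='python:3.9', packages_to_install=['numpy'])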
      
    • Container-based (YAML): More flexible, uses a Docker image.
      # component.yaml
      name: my-component
      description: A sample component
      inputs:
        - { name: message, type: String }
      outputs:
        - { name: output_message, type: String }
      implementation:
        container:
          image: python:3.9-slim
          command: ['python', '-c']
          args:
            - |
              import argparse, os
              parser = argparse.ArgumentParser()
              parser.add_argument("--message", type=str)
              parser.add_argument("--output_message_path", type=str)
              args = parser.parse_args()
              # The component must create the output's parent directory itself
              os.makedirs(os.path.dirname(args.output_message_path), exist_ok=True)
              with open(args.output_message_path, "w") as f:
                  f.write(f"Hello, {args.message}!")
            # Wire the declared input value and output path into the command line:
            - --message
            - { inputValue: message }
            - --output_message_path
            - { outputPath: output_message }
      
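    • Loading a YAML component (Python): a short sketch; the file name refers to the component.yaml above.
      from kfp import components
      
      # The returned factory is callable inside a @dsl.pipeline function
      greet_op = components.load_component_from_file('component.yaml')
      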
  • Defining a Pipeline (Python):

    from kfp import dsl, compiler
    
    @dsl.pipeline(name="my-first-pipeline", description="A simple pipeline")
    def my_pipeline(msg: str = "World"):
        op1 = add(a=10.0, b=20.0)    # the 'add' component defined above
        op2 = greet_op(message=msg)  # the YAML component loaded via the SDK
        # Components can pass outputs as inputs:
        # op3 = another_op(input_from_op1=op1.output)
    
    # Compile the pipeline
    compiler.Compiler().compile(my_pipeline, package_path='my_pipeline.yaml')
    
  • Running a Pipeline:

    • From UI: Upload the compiled YAML or connect to a Git repo.
    • From SDK:
      from kfp import Client
      client = Client() # Connects to Kubeflow Pipelines in your cluster
      client.create_run_from_pipeline_func(my_pipeline, arguments={'msg': 'Kubeflow'})
      
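    • Connecting from outside the cluster: pass the endpoint explicitly (a sketch; host and port depend on your deployment):
      # e.g. after: kubectl port-forward svc/ml-pipeline-ui 8080:80 -n kubeflow
      client = Client(host='http://localhost:8080')
      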

2.3 KServe (formerly KFServing)

  • Purpose: Deploy ML models to Kubernetes with serverless capabilities, auto-scaling, canary rollouts, and multi-model hosting.
  • Core Resource: InferenceService (Custom Resource Definition - CRD).
  • Key Features:
    • Serverless Auto-scaling: Scales to zero when idle.
    • Canary Rollouts: Gradually shift traffic to new model versions.
    • A/B Testing: Route traffic between multiple models.
    • Model Explanations: Integrations with explainer frameworks such as Alibi Explain and ART.
    • Built-in Runtimes: Supports common frameworks like TensorFlow, PyTorch, Scikit-learn, XGBoost.
  • Example InferenceService (YAML):
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-sklearn-model
    spec:
      predictor:
        minReplicas: 1 # Replica bounds sit at the predictor level; 0 enables scale-to-zero
        maxReplicas: 3
        sklearn:
          storageUri: gs://my-model-bucket/sklearn-iris-model # Or s3://, pvc://
        # PyTorch example:
        # pytorch:
        #   storageUri: gs://my-model-bucket/pytorch-model
        #   runtimeVersion: "1.10" # Optional
        #   protocolVersion: "v2" # Optional, for the v2 inference API
    
  • Deployment: kubectl apply -f my-model-service.yaml
  • Inference: KServe exposes an HTTP(S) URL for each InferenceService; send POST requests with a JSON payload to it, as in the sketch below.
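  • Example request (Python): a minimal sketch of the KServe v1 REST protocol; the hostname is illustrative (find the real one via kubectl get inferenceservice my-sklearn-model):
    import requests
    
    # v1 protocol: POST /v1/models/<name>:predict with an "instances" payload
    url = "http://my-sklearn-model.kubeflow-user-example-com.example.com/v1/models/my-sklearn-model:predict"
    payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one iris sample
    resp = requests.post(url, json=payload)
    print(resp.json())  # e.g. {"predictions": [0]}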

2.4 Katib

  • Purpose: Automate hyperparameter tuning and Neural Architecture Search (NAS).
  • Core Resource: Experiment (CRD).
  • Key Concepts:
    • Experiment: Defines the search space for hyperparameters or architecture, objective metrics (e.g., accuracy, loss), and a search algorithm (e.g., Grid, Random, Bayesian Optimization).
    • Trial: A single run of the training job with a specific set of hyperparameter values or a specific architecture.
    • Suggestion: Katib's recommendation for the next set of hyperparameters to try.
  • Example Katib Experiment (YAML):
    apiVersion: katib.kubeflow.org/v1beta1
    kind: Experiment
    metadata:
      name: pytorch-experiment
      namespace: kubeflow-user-example-com
    spec:
      objective:
        type: minimize
        goal: 0.01
        objectiveMetricName: loss
      algorithm:
        algorithmName: random
      parallelTrialCount: 3 # Number of trials to run in parallel
      maxTrialCount: 12 # Total number of trials
      parameters:
        - name: lr
          parameterType: double
          feasibleSpace:
            min: '0.001'
            max: '0.01'
        - name: momentum
          parameterType: double
          feasibleSpace:
            min: '0.1'
            max: '0.9'
      trialTemplate:
        primaryContainerName: pytorch
        # Map each search-space parameter above to a ${trialParameters.*}
        # placeholder in the trial spec below; Katib substitutes a concrete
        # value for every trial.
        trialParameters:
          - name: lr
            reference: lr
          - name: momentum
            reference: momentum
        # The training job run for each trial (here a PyTorchJob):
        trialSpec:
          apiVersion: kubeflow.org/v1
          kind: PyTorchJob
          spec:
            pytorchReplicaSpecs:
              Worker:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    containers:
                      - name: pytorch
                        image: docker.io/kubeflow/pytorch-mnist-with-pvc:v1.0
                        command:
                          - 'python'
                          - '/opt/model.py'
                          - '--lr=${trialParameters.lr}'
                          - '--momentum=${trialParameters.momentum}'
    
  • Deployment: kubectl apply -f my-katib-experiment.yaml
  • Monitoring: Track trials and their results via the Katib UI in the Central Dashboard.
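  • SDK access (Python): a sketch using the kubeflow-katib SDK (pip install kubeflow-katib); method names reflect recent SDK versions and may differ in yours:
    from kubeflow.katib import KatibClient
    
    client = KatibClient()
    # Check experiment status and the best hyperparameters found so far
    exp = client.get_experiment(name="pytorch-experiment",
                                namespace="kubeflow-user-example-com")
    print(client.get_optimal_hyperparameters(name="pytorch-experiment",
                                             namespace="kubeflow-user-example-com"))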

2.5 Training Operators (e.g., TFJob, PyTorchJob)

  • Purpose: Run distributed ML training jobs natively on Kubernetes.
  • Custom Resources: TFJob, PyTorchJob, MPIJob, XGBoostJob, PaddleJob, MXJob (MXNet).
  • Key Features:
    • Distributed Training: Automatically sets up communication between workers, PS (parameter servers), and chief replicas.
    • Fault Tolerance: Handles node failures and restarts jobs.
    • Resource Management: Allocates specific CPU/GPU resources to each replica.
  • Example TFJob (YAML):
    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: tfjob-resnet
      namespace: kubeflow-user-example-com
    spec:
      tfReplicaSpecs:
        Chief:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: tensorflow
                  image: tensorflow/tensorflow:2.10.0-gpu
                  command: ['python', 'model.py']
                  resources:
                    limits:
                      nvidia.com/gpu: 1
        Worker:
          replicas: 2
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: tensorflow
                  image: tensorflow/tensorflow:2.10.0-gpu
                  command: ['python', 'model.py']
                  resources:
                    limits:
                      nvidia.com/gpu: 1
    
  • Deployment: kubectl apply -f my-tfjob.yaml
  • Monitoring: Use kubectl get tfjob or the Training Operators UI in the Central Dashboard.
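  • Inside the training code: the operator injects a TF_CONFIG environment variable on every replica describing the cluster topology, which tf.distribute strategies read automatically. A minimal sketch of what model.py might do (illustrative, not the actual script in the image above):
    import json, os
    import tensorflow as tf
    
    # TF_CONFIG is set by the training operator; strategies parse it themselves
    cluster = json.loads(os.environ.get("TF_CONFIG", "{}"))
    print("Cluster spec:", cluster.get("cluster", {}))
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")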

3. CLI Tools

  • kubectl: The standard Kubernetes CLI. Used for managing all Kubeflow resources (CRDs).
    • kubectl get pods -n <namespace>
    • kubectl get notebook -n <namespace>
    • kubectl get inferenceservice -n <namespace>
    • kubectl logs <pod-name> -n <namespace>
  • kfctl (Legacy): Previously used for Kubeflow installation. Now largely replaced by kustomize or cloud-specific deployment tools.
  • kfp SDK (Python): For defining, compiling, and submitting Kubeflow Pipelines.

4. General Tips & Best Practices

  • Namespaces/Profiles: Always work within your assigned namespace (kubectl config set-context --current --namespace=<your-namespace>).
  • Persistent Storage: Always use PVCs for data that needs to persist beyond a pod's lifecycle (notebooks, datasets, model checkpoints).
  • Docker Images: Use custom Docker images for your components to ensure consistent environments and manage dependencies.
  • Resource Requests/Limits: Specify CPU, memory, and GPU requests/limits for all your pods (notebooks, pipeline components, training jobs) to ensure fair scheduling and prevent resource exhaustion.
  • Workload Identity/IAM: Leverage Kubernetes' Workload Identity (or cloud-provider equivalents) for secure access to cloud services instead of hardcoding credentials.
  • Central Dashboard: Your primary entry point for managing and monitoring. Explore all the sections (Notebooks, Pipelines, Models, Experiments).
  • Logging & Monitoring: Integrate with Prometheus/Grafana or cloud-native monitoring (e.g., Google Cloud Monitoring) for observing your ML workloads.
  • GitOps for Pipelines: Store your pipeline definitions in Git and use CI/CD to automatically compile and deploy them.