GCP - GCE
What is GCE Metadata Server?
When a Compute Engine virtual machine is created, Google Cloud automatically sets up a special network route and DNS entry on that instance. This entry resolves the domain metadata.google.internal to the private IP address 169.254.169.254.
This is a link-local address, which means it's only valid and accessible within the specific network segment of the instance itself. The GCE network is designed to route any traffic sent to this special IP address to the metadata server, which is a service that runs independently of the instance's operating system.
To get a list of all available metadata attributes, you can use the following command. The trailing slash is important.
curl "http://metadata.google.internal/computeMetadata/v1/" \
-H "Metadata-Flavor: Google"
The server will return a list of available top-level metadata keys.
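Beyond listing top-level keys, you can walk into specific paths or ask for everything at once. A small sketch (these only work from inside a GCE VM, and the `Metadata-Flavor: Google` header is mandatory or the server rejects the request):

```shell
# Fetch a single attribute, e.g. the instance name:
curl "http://metadata.google.internal/computeMetadata/v1/instance/name" \
  -H "Metadata-Flavor: Google"

# Dump the whole instance subtree at once; with recursive=true the
# metadata server returns the directory contents as JSON.
curl "http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true" \
  -H "Metadata-Flavor: Google"
```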
Is GCE Metadata Server per GCE Instance?
Yes. Each Google Compute Engine virtual machine instance has its own dedicated Metadata Server (MDS). The MDS runs as an HTTP server within the GCE VM controller process on the host.
Note that it is "Per VM," but runs on the Host, NOT inside the VM.
The GCE Metadata Server is a distributed service provided by the physical host machine (the hypervisor), not a process inside your VM's Operating System.
- Logical View (Per VM): From the perspective of your code, it is per-instance. A request from VM-A only returns VM-A's data, and it is impossible for VM-A to query VM-B’s metadata via that local IP.
- Physical View (Host-Level): The actual software handling the request runs on the physical Google host. When your VM sends a packet to
169.254.169.254, the hypervisor intercepts that packet before it ever touches a physical network cable.
Why is it using 169.254.169.254?
This IP falls in 169.254.0.0/16, the block reserved for IPv4 link-local addresses. It is non-routable, meaning it can only exist on a single local "hop."
What is Managed Instance Group (MIG)?
A Managed Instance Group is a collection of virtual machine (VM) instances that are managed as a single entity. MIGs help you run your workloads on a group of identical VMs, offering features that improve scalability, availability, and cost-effectiveness.
Key characteristics and benefits:
- Scalability:
- Autoscaling: This is one of the most powerful features. You can configure a MIG to automatically add or remove VM instances based on the load (e.g., CPU utilization, HTTP load balancing serving capacity, Cloud Monitoring metrics, or queue size from Pub/Sub). This ensures your application can handle traffic spikes and scales down during low usage to save costs.
- Load Balancing Integration: MIGs are tightly integrated with Google Cloud's load balancers (e.g., HTTP(S) Load Balancing, Network Load Balancing). The load balancer distributes incoming traffic across the healthy instances in the MIG.
- High Availability:
- Autohealing: MIGs continuously monitor the health of individual VM instances. If an instance becomes unresponsive or unhealthy (based on application-level health checks), the MIG automatically deletes it and recreates a new instance, ensuring your application remains available.
- Automatic Rolling Updates: MIGs support rolling updates, allowing you to deploy new versions of your application or update the underlying VM image without downtime. You can control the update strategy (e.g., Canary deployments, blue/green deployments).
- Multi-zone or Regional Deployment: You can create MIGs that span multiple zones within a region (Regional MIGs) for even higher availability and resilience against zone failures. Zonal MIGs exist within a single zone.
- Ease of Management:
- Instance Templates: All instances within a MIG are created from a common instance template. This template defines the VM's machine type, boot disk image, network settings, metadata (like startup scripts), and other properties, ensuring consistency across all instances.
- Configuration Consistency: Because all instances are based on the same template, you get consistent configurations, simplifying management and troubleshooting.
- Simplified Operations: You manage the group as a whole rather than individual VMs, significantly reducing operational overhead.
Types of Managed Instance Groups:
- Zonal MIGs: All VM instances in the group are located in a single zone. They offer high availability within that zone.
- Regional MIGs: The VM instances are distributed across multiple zones within a single region. This provides higher resilience against a single zone failure and can distribute load more effectively across zones.
Common Use Cases for MIGs:
- Web Servers/Application Servers: Easily scale your web application frontend or backend services based on user traffic.
- Batch Processing: Run a fleet of workers that can scale up to process large datasets and scale down when done.
- Microservices: Deploy and manage microservices that require high availability and automatic scaling.
- Stateful Workloads (with additional considerations): While primarily designed for stateless applications, MIGs can support stateful workloads with specific configurations (e.g., persistent disks per instance, stateful instance templates).
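The template-then-group workflow described above can be sketched with gcloud. The names, region, and autoscaling thresholds below are hypothetical placeholders:

```shell
# 1. Define the common instance template all group members are stamped from.
gcloud compute instance-templates create web-template \
  --machine-type=e2-standard-2 \
  --image-family=debian-12 --image-project=debian-cloud

# 2. Create a regional MIG (instances spread across zones in us-central1).
gcloud compute instance-groups managed create web-mig \
  --template=web-template --size=2 --region=us-central1

# 3. Attach an autoscaler driven by CPU utilization.
gcloud compute instance-groups managed set-autoscaling web-mig \
  --region=us-central1 \
  --min-num-replicas=2 --max-num-replicas=10 \
  --target-cpu-utilization=0.6
```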
What is Google Guest Agent
The Google Guest Agent is a set of services and daemons that run inside your Google Compute Engine (GCE) virtual machine instances.
If the Google Cloud Console is the "Manager" outside the VM, the Guest Agent is the "Receptionist" inside the VM that takes orders from the manager and makes sure the OS actually carries them out.
What does it actually do?
The Guest Agent is responsible for bridging the gap between the Google Cloud control plane and the Linux/Windows operating system. Its primary jobs include:
Managing SSH Keys
When you click the "SSH" button in the Google Cloud Console, Google doesn't magically teleport you into the machine.
- Google adds your public SSH key to the VM Metadata.
- The Guest Agent (watching for changes) sees the new key.
- It creates the user account on the OS and adds the key to the ~/.ssh/authorized_keys file.
- Only then can you log in.
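You can trigger the same flow manually by writing to the instance's `ssh-keys` metadata. A sketch (VM name, zone, and username are hypothetical; note this command replaces the existing `ssh-keys` value rather than appending to it):

```shell
# The ssh-keys metadata format is "USERNAME:PUBLIC_KEY". The guest agent
# sees the change and writes the key into ~alice/.ssh/authorized_keys.
gcloud compute instances add-metadata my-vm \
  --zone=us-central1-a \
  --metadata=ssh-keys="alice:$(cat ~/.ssh/id_ed25519.pub)"
```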
Network Configuration
The Guest Agent handles complex networking tasks that the OS wouldn't know about on its own:
- IP Aliases: Setting up secondary IP addresses.
- Routes: Configuring routes for VPC features.
- Network Interfaces: Handling hot-plugging of new network interfaces.
Account Management (OS Login)
If you use OS Login (the feature that links your Linux users to your Google IAM identity), the Guest Agent is the component that talks to the Google API to verify who you are when you try to sudo or log in.
Metadata Syncing
Google VMs have a "Metadata Server" (metadata.google.internal). This server stores information like the VM name, project ID, and custom scripts. The Guest Agent constantly polls this server to see if anything has changed (like a request to shut down the VM or a change in permissions).
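The agent doesn't have to poll in a tight loop: the metadata server supports long-polling via the `wait_for_change` query parameter. A sketch of watching the instance attributes for a change (works only from inside a GCE VM):

```shell
# Blocks until some instance attribute changes, or until timeout_sec
# elapses, then returns the (new) values.
curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/?recursive=true&wait_for_change=true&timeout_sec=60" \
  -H "Metadata-Flavor: Google"
```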
Can I trust Google Guest Agent?
Its source code can be found on GitHub:
- https://github.com/GoogleCloudPlatform/guest-agent: hosts the legacy, monolithic guest agent codebase.
- https://github.com/GoogleCloudPlatform/google-guest-agent: hosts the new, plugin-based guest agent architecture.
Changing to a Plugin Architecture
Google transitioned the Guest Agent from a monolithic design to a plugin-based architecture.
- The Benefit: If one part of the agent (like the network plugin) crashes, it doesn't take down the entire service.
- New Control: You can now selectively enable or disable specific plugins (like "Workload Manager") to reduce resource overhead.
Common Commands
If you need to check or restart the agent on a Linux VM:
- Check Status: sudo systemctl status google-guest-agent
- Restart Agent: sudo systemctl restart google-guest-agent
- View Logs: sudo journalctl -u google-guest-agent
What is a VM Extension Policy?
Think of it as the spec section of a Kubernetes resource: it describes the desired state, and the Guest Platform acts like a Kubernetes controller, reconciling the VM toward that state by installing or updating the extension.
GCE Instance Types
Google Compute Engine (GCE) instance types follow a structured naming convention that allows users to quickly understand their core characteristics, including their machine family, generation, processor, and resource allocation. Decoding these names involves recognizing common patterns and suffixes.
Here's a breakdown of how to interpret GCE instance types:
General Structure of GCE Instance Types
GCE instance types typically follow a pattern that includes: [FAMILY][GENERATION][PROCESSOR_VARIANT]-[RESOURCE_ALLOCATION]-[OPTIONAL_FEATURES].
- Machine Family (e.g., C, N, E, M, A, H, T, Z): The first letter or letter-number combination usually indicates the machine family, which is optimized for specific workloads.
- C (Compute-optimized): Designed for CPU-intensive workloads requiring high performance, faster processors, and advanced networking. Examples include high-performance computing (HPC), gaming servers, and latency-sensitive applications.
- N (General-purpose): Offers a balance of compute and memory resources, suitable for a wide variety of workloads like web servers and databases.
- E (Cost-optimized General-purpose): Provides a good performance-to-cost ratio and is suitable for most general workloads.
- M (Memory-optimized): Ideal for memory-intensive applications such as large-scale databases and in-memory analytics, offering the highest memory-to-vCPU ratios.
- A (Accelerator-optimized): Includes GPUs or TPUs, designed for machine learning and video processing.
- H (High-performance computing): Specifically designed for HPC workloads that require high-bandwidth memory and fast interconnects.
- T (Tau - Scale-out optimized): Optimized for scale-out workloads, offering compelling price for performance.
- Z (Storage-optimized): Provides high-performance search and data analysis for medium-sized datasets, often with high-capacity local SSDs.
- Generation (Number following the family letter): An ascending number typically denotes a newer generation of the machine series, often indicating updated CPU platforms or technologies. For example, N2 is a newer generation than N1.
- Processor Variant (Optional Letter, e.g., D):
  - D (AMD EPYC processor): Indicates that the instance uses AMD EPYC processors. For example, C3D utilizes 4th Gen AMD EPYC Genoa processors, while C2D uses AMD Milan-based VMs.
  - Instances without a 'D' (like C3 or N2) typically use Intel Xeon processors. C3 uses 4th Gen Intel Xeon Scalable processors (Sapphire Rapids), and N2 uses Intel Ice Lake and Cascade Lake platforms.
- Resource Allocation (-standard, -highcpu, -highmem, etc.): This part of the name indicates the vCPU-to-memory ratio.
  - -standard: Offers a balanced vCPU-to-memory ratio, typically around 4 GB of memory per vCPU. For example, c3-standard-22 has 22 vCPUs and 88 GB of memory.
  - -highcpu: Prioritizes vCPUs, with a lower memory-to-vCPU ratio, typically 1 to 3 GB of memory per vCPU (often 2 GB).
  - -highmem: Prioritizes memory, offering a higher memory-to-vCPU ratio, typically 7 to 12 GB of memory per vCPU (often 8 GB).
  - Other less common ratios include -megamem (12-15 GB/vCPU), -ultramem (24-31 GB/vCPU), and -hypermem (15-24 GB/vCPU).
- Number (e.g., 22, 360): This number usually specifies the total number of vCPUs for that instance type. For example, c3-standard-22 has 22 vCPUs, and c3d-highmem-360 has 360 vCPUs.
- Optional Features (-lssd, -metal):
  - -lssd (Local SSD): Indicates that the instance type comes with a predetermined number of local SSDs automatically attached, providing high-performance local storage.
  - -metal (Bare Metal): Designates bare metal instances, available in certain series like C3, C3D, and C4.
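The naming pattern above can be illustrated with a few lines of bash. This is a rough sketch of the `[FAMILY]-[RATIO]-[vCPUs]-[FEATURES]` split, not an official parser (it ignores edge cases like custom machine types):

```shell
#!/usr/bin/env bash
# Split a machine type name on "-" into its component parts.
decode_machine_type() {
  local name="$1"
  local series ratio vcpus extra
  IFS='-' read -r series ratio vcpus extra <<< "$name"
  echo "series=$series ratio=$ratio vcpus=$vcpus features=${extra:-none}"
}

decode_machine_type "c3-standard-22"       # series=c3 ratio=standard vcpus=22 features=none
decode_machine_type "c3d-highmem-360-lssd" # series=c3d ratio=highmem vcpus=360 features=lssd
```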
Key differences
- E and N support custom VM shapes (custom machine types); C supports only fixed, predefined shapes.
- Different VM types support different Disk Types.
Examples
- C3: Refers to the C3 machine series, which is compute-optimized and uses 4th Gen Intel Xeon Scalable processors (Sapphire Rapids).
- C3D: Refers to the C3D machine series, which is also compute-optimized but uses 4th Gen AMD EPYC Genoa processors.
- N2: Refers to the N2 machine series, which is general-purpose and uses Intel Ice Lake and Cascade Lake CPU platforms.
- c3-standard-22: Decodes to a C3 series instance (compute-optimized, Intel processor), with a standard vCPU-to-memory ratio (4 GB/vCPU), and 22 vCPUs, resulting in 88 GB of memory.
- c3d-highmem-360-lssd: Decodes to a C3D series instance (compute-optimized, AMD processor), with a highmem vCPU-to-memory ratio (8 GB/vCPU), 360 vCPUs, and local SSDs attached.
Availabilities
Different zones offer different instance types. For example, E2 is broadly available, while newer GCP zones may not offer N1 or N2/N2D at all.
To find all the locations that support a specific VM type:
gcloud compute machine-types list --filter=name~n4
gcloud compute machine-types list --filter name=c3-standard-4
Can I store VM images in Artifact Registry?
No. You cannot directly store a Google Compute Engine (GCE) VM disk image or machine image in Artifact Registry.
- VM images are stored in GCS
- Container images are stored in AR.
VM Image vs Machine Image
- A VM Image captures the boot disk and its file system;
- A Machine Image captures the entire VM instance (including all attached disks and the instance's hardware configuration).
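The distinction shows up directly in the gcloud surface, which has separate commands for each. A sketch with hypothetical disk/instance names:

```shell
# Disk image: capture a single boot disk.
gcloud compute images create my-image \
  --source-disk=my-boot-disk --source-disk-zone=us-central1-a

# Machine image: capture the whole instance (all disks plus the
# instance's configuration).
gcloud compute machine-images create my-machine-image \
  --source-instance=my-vm --source-instance-zone=us-central1-a
```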
TPU/GPU VMs
How does GPU VMs work?
GCE instances (like A4 or A4 Ultra) have physically attached GPUs.
- GPU-enabled GCE instances have both CPU and GPU.
- When you provision a GCE instance with GPUs (e.g., an a2-highgpu-8g instance with 8 NVIDIA A100 GPUs), you are essentially getting a virtual machine that includes both standard x86 CPUs (virtual cores) and dedicated physical GPUs.
- The CPU acts as the host processor for the VM, running the operating system, managing I/O, and handling general-purpose computations.
- The GPUs are attached as accelerators, and the CPU offloads computationally intensive tasks (like matrix multiplications in ML) to them.
This is the standard model for GPU acceleration in cloud VMs.
Could GPU VMs and non-GPU VMs run on the same bare metal machine?
Yes. If a bare metal node has 8 GPUs, it can host up to 8 VMs that each require 1 GPU, alongside many other instance types without GPUs.
How does TPU VMs work?
TPU VMs are designed to be more integrated and specialized, and they do include a CPU, but the relationship is different from a GPU-enabled GCE instance.
- Early TPU Architecture (TPU Node API): In earlier generations (like TPU v2 and v3), you would typically provision a "TPU Node" which was a dedicated hardware accelerator. You would then connect to this TPU Node from a separate GCE VM (a "host VM") that ran your code. The host VM would send instructions and data to the TPU Node. So, the CPU and TPU were distinct entities, though tightly coupled.
- Current TPU VM Architecture (TPU v4 and later): With TPU v4 and subsequent generations (like v5e, v5p, v6e, v7x), Google introduced the TPU VM architecture. In this model:
- The TPU chip(s) and a CPU are co-located within a single virtual machine. This means you get a single VM instance that has both the TPU hardware and a CPU.
- The CPU in a TPU VM is primarily there to serve as the host for the operating system and to manage the TPU device. It handles tasks like loading data, orchestrating the ML workload, and interacting with the TPU hardware.
- The heavy lifting of the machine learning computations is performed directly on the TPU chips. The CPU is not intended for general-purpose computation in the same way as a standalone GCE CPU instance.
- This integrated TPU VM architecture simplifies development and deployment, as you no longer need to manage a separate host VM and a TPU Node. Your code runs directly on the TPU VM, and it interacts with the co-located TPU chips.
The CPU(s) and TPUs within a TPU VM environment are physically built together and offered as predefined machine types.
The TPU hardware (chips, often on trays) is tightly integrated with a dedicated CPU "host machine tray" or "index tray" in the data center.
Because these are offered as bundled SKUs, you cannot arbitrarily mix different CPU types or customize the ratio of vCPUs/RAM to TPU chips. The configurations are set to provide an optimized balance for typical TPU workloads.
TPU VM Architecture: The current standard is the 1VM architecture, where a single Virtual Machine (VM) runs directly on the physical host machine connected to the TPU hardware. You have root access within this VM.
Are TPUs and CPUs virtualized?
- TPU: NO. TPUs are allocated to Virtual Machines (VMs) in fixed, whole units.
- CPU: YES. CPUs are virtualized into multiple vCPUs.
Can a bare metal machine host multiple TPU VMs?
Yes. A single physical bare metal machine can host multiple TPU VMs.
- A physical machine with, for instance, 8 TPU chips can be partitioned.
- You could run eight separate VMs on this single physical host. Each VM gets exclusive access to one of the host's TPU chips and a dedicated portion of the host's CPU (24 vCPUs) and RAM.
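Provisioning one of these TPU VMs looks like the sketch below. The name and zone are hypothetical, and the accelerator type and runtime version must match a TPU generation actually offered in that zone:

```shell
# Create a single TPU VM with a v4 slice (4 chips / 8 TensorCores).
gcloud compute tpus tpu-vm create my-tpu \
  --zone=us-central2-b \
  --accelerator-type=v4-8 \
  --version=tpu-ubuntu2204-base
```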
TPU Software Stack
- JAX: Operates at a high level, focused on numerical computation and transformations like JIT compilation. When targeting TPUs, JAX uses XLA.
- XLA (Accelerated Linear Algebra): This is a compiler for linear algebra that takes operations from frameworks like JAX (in an intermediate representation like StableHLO) and compiles them into highly optimized machine code for specific hardware, including TPUs. XLA generates the instructions that will run on the ASIC's cores (like TensorCore or SparseCore).
- TPU Runtime: This layer sits between the high-level frameworks and the low-level hardware interface. Key components include:
- ASIC-SW Driver: responsible for direct communication with the ASIC hardware.