GCP - GCE
What is GCE Metadata Server?
When a Compute Engine virtual machine is created, Google Cloud automatically sets up a special network route and DNS entry on that instance. This entry resolves the domain metadata.google.internal to the IP address 169.254.169.254.
This is a link-local address, which means it's only valid and accessible within the specific network segment of the instance itself. The GCE network is designed to route any traffic sent to this special IP address to the metadata server, which is a service that runs independently of the instance's operating system.
To list the top-level metadata entries, use the following command. The trailing slash is important.
curl "http://metadata.google.internal/computeMetadata/v1/" \
-H "Metadata-Flavor: Google"
The server will return a list of available top-level metadata keys.
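The same request can be issued from code. The following is a minimal Python sketch using only the standard library. The helper names build_metadata_request and fetch_metadata are illustrative choices, not part of any Google library, and the fetch only succeeds from inside a GCE VM.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"


def build_metadata_request(path: str = "") -> urllib.request.Request:
    """Build a GET request for a metadata path (hypothetical helper)."""
    # The Metadata-Flavor header is required; the server rejects
    # requests that do not carry it.
    return urllib.request.Request(
        METADATA_ROOT + path.lstrip("/"),
        headers={"Metadata-Flavor": "Google"},
    )


def fetch_metadata(path: str = "", timeout: float = 2.0) -> str:
    """Fetch a metadata value as text. Only works from inside a GCE VM."""
    request = build_metadata_request(path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On a VM, calling fetch_metadata() with no arguments should return the same top-level listing as the curl command.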
Is GCE Metadata Server per GCE Instance?
Yes. Each Google Compute Engine virtual machine instance has its own dedicated Metadata Server (MDS). The MDS runs as an HTTP server within the GCE VM controller process.
Why is it using 169.254.169.254?
The address belongs to 169.254.0.0/16, the range reserved for IPv4 link-local addresses (RFC 3927). Routers do not forward packets sent to this range, so the address is reachable only on the local link, that is, within a single "hop."
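A quick way to confirm the classification is Python's standard ipaddress module. This is only a small check of the address range, not anything GCE-specific.

```python
import ipaddress

addr = ipaddress.ip_address("169.254.169.254")
link_local_net = ipaddress.ip_network("169.254.0.0/16")

# Both checks express the same fact: the metadata address sits in the
# IPv4 link-local range.
print(addr in link_local_net)  # True
print(addr.is_link_local)      # True
```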
What is Managed Instance Group (MIG)?
A Managed Instance Group is a collection of virtual machine (VM) instances that are managed as a single entity. MIGs help you run your workloads on a group of identical VMs, offering features that improve scalability, availability, and cost-effectiveness.
Key characteristics and benefits:
- Scalability:
  - Autoscaling: This is one of the most powerful features. You can configure a MIG to automatically add or remove VM instances based on the load (e.g., CPU utilization, HTTP load balancing serving capacity, Cloud Monitoring metrics, or queue size from Pub/Sub). This ensures your application can handle traffic spikes and scales down during low usage to save costs.
  - Load Balancing Integration: MIGs are tightly integrated with Google Cloud's load balancers (e.g., HTTP(S) Load Balancing, Network Load Balancing). The load balancer distributes incoming traffic across the healthy instances in the MIG.
- High Availability:
  - Autohealing: MIGs continuously monitor the health of individual VM instances. If an instance becomes unresponsive or unhealthy (based on application-level health checks), the MIG automatically deletes it and recreates a new instance, ensuring your application remains available.
  - Automatic Rolling Updates: MIGs support rolling updates, allowing you to deploy new versions of your application or update the underlying VM image without downtime. You can control the update strategy (e.g., Canary deployments, blue/green deployments).
  - Multi-zone or Regional Deployment: You can create MIGs that span multiple zones within a region (Regional MIGs) for even higher availability and resilience against zone failures. Zonal MIGs exist within a single zone.
- Ease of Management:
  - Instance Templates: All instances within a MIG are created from a common instance template. This template defines the VM's machine type, boot disk image, network settings, metadata (like startup scripts), and other properties, ensuring consistency across all instances.
  - Configuration Consistency: Because all instances are based on the same template, you get consistent configurations, simplifying management and troubleshooting.
  - Simplified Operations: You manage the group as a whole rather than individual VMs, significantly reducing operational overhead.
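To make the autoscaling idea concrete, here is a simplified target-utilization rule in Python. It is an illustrative model only, not Google's actual autoscaler algorithm (which also applies stabilization windows and other safeguards). The function name and parameters are assumptions made for this sketch.

```python
def recommended_size(current_size: int, observed_pct: int, target_pct: int,
                     min_size: int, max_size: int) -> int:
    """Simplified target-utilization scaling rule (illustrative only).

    observed_pct and target_pct are average utilization percentages,
    for example 90 and 60.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    # Enough instances so that the same total load lands at the target
    # utilization. Integer ceiling division avoids float rounding issues.
    needed = -(-(current_size * observed_pct) // target_pct)
    # Never go outside the configured group bounds.
    return max(min_size, min(max_size, needed))


print(recommended_size(4, 90, 60, min_size=1, max_size=10))  # 6
```

With 4 instances at 90% utilization and a 60% target, the rule asks for 6 instances; at low load it shrinks the group but never below min_size.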
Types of Managed Instance Groups:
- Zonal MIGs: All VM instances in the group are located in a single zone. They offer high availability within that zone.
- Regional MIGs: The VM instances are distributed across multiple zones within a single region. This provides higher resilience against a single zone failure and can distribute load more effectively across zones.
Common Use Cases for MIGs:
- Web Servers/Application Servers: Easily scale your web application frontend or backend services based on user traffic.
- Batch Processing: Run a fleet of workers that can scale up to process large datasets and scale down when done.
- Microservices: Deploy and manage microservices that require high availability and automatic scaling.
- Stateful Workloads (with additional considerations): While primarily designed for stateless applications, MIGs can support stateful workloads with specific configurations (e.g., a stateful policy or per-instance configurations that preserve each instance's persistent disks and other state).
What is Google Guest Agent?
The Google Guest Agent is a set of services and daemons that run inside your Google Compute Engine (GCE) virtual machine instances.
If the Google Cloud Console is the "Manager" outside the VM, the Guest Agent is the "Receptionist" inside the VM that takes orders from the manager and makes sure the OS actually carries them out.
What does it actually do?
The Guest Agent is responsible for bridging the gap between the Google Cloud control plane and the Linux/Windows operating system. Its primary jobs include:
Managing SSH Keys
When you click the "SSH" button in the Google Cloud Console, Google doesn't magically teleport you into the machine.
- Google adds your public SSH key to the VM Metadata.
- The Guest Agent (watching for changes) sees the new key.
- It creates the user account on the OS and adds the key to the ~/.ssh/authorized_keys file.
- Only then can you log in.
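The key-handling step can be sketched as follows. SSH keys in metadata are stored one per line in the form USERNAME:KEY. The function below is a hypothetical illustration of how such a value could be grouped per user; it is not the agent's actual code, which handles more cases. The sample keys are placeholders.

```python
def parse_ssh_keys(metadata_value: str) -> dict[str, list[str]]:
    """Group 'USERNAME:KEY' lines by username (illustrative helper)."""
    keys_by_user: dict[str, list[str]] = {}
    for line in metadata_value.splitlines():
        line = line.strip()
        # Skip blank lines and entries without a username prefix.
        if not line or ":" not in line:
            continue
        user, key = line.split(":", 1)
        keys_by_user.setdefault(user, []).append(key.strip())
    return keys_by_user


sample = (
    "alice:ssh-ed25519 AAAAPLACEHOLDER1 alice@laptop\n"
    "bob:ssh-rsa AAAAPLACEHOLDER2 bob@desktop\n"
    "alice:ssh-rsa AAAAPLACEHOLDER3 alice@work\n"
)
keys = parse_ssh_keys(sample)
print(sorted(keys))        # ['alice', 'bob']
print(len(keys["alice"]))  # 2
```

Each user's list corresponds to the entries that would end up in that user's authorized_keys file.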
Network Configuration
The Guest Agent handles complex networking tasks that the OS wouldn't know about on its own:
- IP Aliases: Setting up secondary IP addresses.
- Routes: Configuring routes for VPC features.
- Network Interfaces: Handling hot-plugging of new network interfaces.
Account Management (OS Login)
If you use OS Login (the feature that links your Linux users to your Google IAM identity), the guest environment handles it: the Guest Agent enables and configures OS Login on the VM, and the accompanying NSS and PAM modules talk to the Google API to verify who you are when you log in or use sudo.
Metadata Syncing
Google VMs have a "Metadata Server" (metadata.google.internal). This server stores information like the VM name, project ID, and custom scripts. The Guest Agent continuously watches this server for changes (like a request to shut down the VM or a change in permissions), using long-lived requests that return when a value changes rather than a tight polling loop.
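The metadata server supports a wait_for_change query parameter that holds a request open until the watched value changes, optionally combined with last_etag and timeout_sec. The sketch below only builds such a URL; the helper name build_watch_url is an assumption for illustration, and the request itself (sent with the Metadata-Flavor: Google header) works only from inside a VM.

```python
from urllib.parse import urlencode

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"


def build_watch_url(path: str, last_etag: str = "", timeout_sec: int = 60) -> str:
    """Build a URL that blocks until the metadata at `path` changes."""
    params = {"wait_for_change": "true", "timeout_sec": str(timeout_sec)}
    if last_etag:
        # Passing the previous ETag makes the server return immediately
        # if the value already differs from what the caller last saw.
        params["last_etag"] = last_etag
    return METADATA_ROOT + path.lstrip("/") + "?" + urlencode(params)


print(build_watch_url("instance/attributes/", last_etag="abc123", timeout_sec=30))
```

A watcher would typically loop: send the request, process the response, remember the new ETag, and repeat.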
Can I trust Google Guest Agent?
The agent is open source, so you can audit what it does. Its source code can be found on GitHub:
- https://github.com/GoogleCloudPlatform/guest-agent: hosts the legacy, monolithic guest agent codebase.
- https://github.com/GoogleCloudPlatform/google-guest-agent: hosts the new, plugin-based guest agent architecture.
Changing to a Plugin Architecture
Google transitioned the Guest Agent from a monolithic design to a plugin-based architecture.
- The Benefit: If one part of the agent (like the network plugin) crashes, it doesn't take down the entire service.
- New Control: You can now selectively enable or disable specific plugins (like "Workload Manager") to reduce resource overhead.
Common Commands
If you need to check or restart the agent on a Linux VM:
- Check Status: sudo systemctl status google-guest-agent
- Restart Agent: sudo systemctl restart google-guest-agent
- View Logs: sudo journalctl -u google-guest-agent
GCE Instance Types
Google Compute Engine (GCE) instance types follow a structured naming convention that allows users to quickly understand their core characteristics, including their machine family, generation, processor, and resource allocation. Decoding these names involves recognizing common patterns and suffixes.
Here's a breakdown of how to interpret GCE instance types:
General Structure of GCE Instance Types
GCE instance types typically follow a pattern that includes: [FAMILY][GENERATION][PROCESSOR_VARIANT]-[RESOURCE_ALLOCATION]-[VCPU_COUNT]-[OPTIONAL_FEATURES].
- Machine Family (e.g., C, N, E, M, A, H, T, Z): The first letter usually indicates the machine family, which is optimized for specific workloads.
  - C (Compute-optimized): Designed for CPU-intensive workloads requiring high performance, faster processors, and advanced networking. Examples include high-performance computing (HPC), gaming servers, and latency-sensitive applications. Note that Google's documentation lists the newer C3, C3D, and C4 series under the general-purpose family; C2 and C2D are the series it labels compute-optimized.
  - N (General-purpose): Offers a balance of compute and memory resources, suitable for a wide variety of workloads like web servers and databases.
  - E (Cost-optimized General-purpose): Provides a good performance-to-cost ratio and is suitable for most general workloads.
  - M (Memory-optimized): Ideal for memory-intensive applications such as large-scale databases and in-memory analytics, offering the highest memory-to-vCPU ratios.
  - A (Accelerator-optimized): Comes with attached GPUs, designed for machine learning and video processing.
  - H (High-performance computing): Specifically designed for HPC workloads that require high-bandwidth memory and fast interconnects.
  - T (Tau - Scale-out optimized): Optimized for scale-out workloads, offering compelling price for performance.
  - Z (Storage-optimized): Provides high-performance search and data analysis for medium-sized datasets, often with high-capacity local SSDs.
- Generation (number following the family letter): An ascending number typically denotes a newer generation of the machine series, often indicating updated CPU platforms or technologies. For example, N2 is a newer generation than N1.
- Processor Variant (optional letter, e.g., D):
  - D (AMD EPYC processor): Indicates that the instance uses AMD EPYC processors. For example, C3D utilizes 4th Gen AMD EPYC Genoa processors, while C2D uses 3rd Gen AMD EPYC Milan processors.
  - Instances without a 'D' (like C3 or N2) typically use Intel Xeon processors. C3 uses 4th Gen Intel Xeon Scalable processors (Sapphire Rapids), and N2 uses Intel Ice Lake and Cascade Lake platforms.
- Resource Allocation (-standard, -highcpu, -highmem, etc.): This part of the name indicates the vCPU-to-memory ratio.
  - -standard: Offers a balanced vCPU-to-memory ratio, typically around 4 GB of memory per vCPU. For example, c3-standard-22 has 22 vCPUs and 88 GB of memory.
  - -highcpu: Prioritizes vCPUs, with a lower memory-to-vCPU ratio, typically 1 to 3 GB of memory per vCPU (often 2 GB).
  - -highmem: Prioritizes memory, offering a higher memory-to-vCPU ratio, typically 7 to 12 GB of memory per vCPU (often 8 GB).
  - Other less common ratios include megamem (12-15 GB/vCPU), ultramem (24-31 GB/vCPU), and hypermem (15-24 GB/vCPU).
- Number (e.g., 22, 360): This number usually specifies the total number of vCPUs for that instance type. For example, c3-standard-22 has 22 vCPUs, and c3d-highmem-360 has 360 vCPUs.
- Optional Features (-lssd, -metal):
  - -lssd (Local SSD): Indicates that the instance type comes with a predetermined number of local SSDs automatically attached, providing high-performance local storage.
  - -metal (Bare Metal): Designates bare metal instances, available in certain series like C3, C3D, and C4.
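The naming pattern can be decoded mechanically. The sketch below is a hypothetical parser covering only names that follow the pattern described in this section; shared-core names such as e2-micro and accelerator suffixes do not fit and return None. The memory ratios are the approximate "typical" values from this section, not authoritative specifications.

```python
import re
from typing import NamedTuple, Optional


class MachineType(NamedTuple):
    family: str      # e.g. "c"
    generation: int  # e.g. 3
    variant: str     # e.g. "d" for AMD, "" if absent
    allocation: str  # e.g. "standard", "highmem"
    vcpus: int
    features: tuple  # e.g. ("lssd",)


_PATTERN = re.compile(r"^([a-z])(\d+)([a-z]?)-([a-z]+)-(\d+)((?:-[a-z]+)*)$")

# Approximate GB of memory per vCPU, taken from the typical values above.
_GB_PER_VCPU = {"standard": 4, "highcpu": 2, "highmem": 8}


def parse_machine_type(name: str) -> Optional[MachineType]:
    """Split a machine type name into its parts, or return None."""
    match = _PATTERN.match(name.lower())
    if match is None:
        return None
    family, gen, variant, allocation, vcpus, extras = match.groups()
    features = tuple(part for part in extras.split("-") if part)
    return MachineType(family, int(gen), variant, allocation, int(vcpus), features)


def estimated_memory_gb(mt: MachineType) -> Optional[int]:
    """Rough memory estimate; None when the ratio is not in the table."""
    ratio = _GB_PER_VCPU.get(mt.allocation)
    return None if ratio is None else ratio * mt.vcpus


print(parse_machine_type("c3d-highmem-360-lssd"))
print(estimated_memory_gb(parse_machine_type("c3-standard-22")))  # 88
```

For exact figures, the output of gcloud compute machine-types describe remains the source of truth; the estimate here is only a reading aid.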
Key differences
- E and N support custom VM shapes (custom machine types); C supports only fixed, predefined VM shapes.
- Different VM types support different Disk Types.
Examples
- C3: Refers to the C3 machine series, which uses 4th Gen Intel Xeon Scalable processors (Sapphire Rapids). The C prefix suggests compute-optimized, although Google's documentation lists C3 under the general-purpose family.
- C3D: Refers to the C3D machine series, the AMD counterpart of C3, which uses 4th Gen AMD EPYC Genoa processors.
- N2: Refers to the N2 machine series, which is general-purpose and uses Intel Ice Lake and Cascade Lake CPU platforms.
- c3-standard-22: Decodes to a C3 series instance (Intel processor), with a standard vCPU-to-memory ratio (4 GB/vCPU), and 22 vCPUs, resulting in 88 GB of memory.
- c3d-highmem-360-lssd: Decodes to a C3D series instance (AMD processor), with a highmem vCPU-to-memory ratio (8 GB/vCPU), 360 vCPUs, and local SSDs attached.
Availabilities
Different zones offer different instance types. For example, E2 should be available everywhere, while newer GCP zones may not offer N1, N2, or N2D.
To find all the zones that support a specific VM type:
gcloud compute machine-types list --filter=name~n4
gcloud compute machine-types list --filter name=c3-standard-4
Can I store VM images in Artifact Registry?
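No. You cannot directly store a Google Compute Engine (GCE) VM disk image or machine image in Artifact Registry.
- VM images are Compute Engine resources. Compute Engine keeps them in Google-managed storage at the image's storage location, and they can be exported to, or imported from, a Cloud Storage (GCS) bucket.
- Container images are stored in Artifact Registry (AR).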
VM Image vs Machine Image
- A VM Image captures the boot disk and its file system;
- A Machine Image captures the entire VM instance (including all attached disks and the instance's hardware configuration).