
AI - vLLM

vLLM is an open-source library for fast, memory-efficient inference and serving of Large Language Models (LLMs). The "v" is commonly read as "virtual," a nod to the virtual-memory idea behind its design.

Its core innovation, called PagedAttention, is directly inspired by the way operating systems manage virtual memory.

The Problem: The "KV Cache"

When an LLM (like Llama 3 or GPT-4) generates text, it produces one token at a time, and each new token attends to every token that came before it. To avoid recomputing the attention keys and values for the whole conversation at every step, the model keeps them in GPU memory. This store is called the KV (Key-Value) Cache.

The problem with the KV Cache:

  • It's Huge: For long conversations, it can take up gigabytes of VRAM.
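  • It's Unpredictable: You don't know in advance how long the model's response will be.
  • Fragmentation: Conventional serving systems reserve one contiguous chunk of memory per request, typically sized for the longest response allowed. If the request ends early, the unused tail of the chunk is wasted (internal fragmentation). Gaps left between chunks of different sizes can be too small to reuse (external fragmentation). If the response outgrows its chunk, the request runs out of space.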
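
To put rough numbers on "huge" and "wasted," the arithmetic can be sketched as below. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) and the token counts are assumptions picked for illustration, not figures from any particular deployment.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

reserved_tokens = 4096   # assumed contiguous reservation (maximum length)
used_tokens = 512        # assumed actual length of the finished request

reserved_gib = per_token * reserved_tokens / 2**30
wasted_gib = per_token * (reserved_tokens - used_tokens) / 2**30

print(f"{per_token // 1024} KiB per token")        # 512 KiB per token
print(f"{reserved_gib:.2f} GiB reserved")           # 2.00 GiB reserved
print(f"{wasted_gib:.2f} GiB never used")           # 1.75 GiB never used
```

Under these assumptions a single short request strands most of its reservation, which is the internal fragmentation described above.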

The Solution: PagedAttention

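vLLM solves this by managing GPU memory much like an operating system such as Linux manages RAM with paging.

Instead of allocating a single contiguous block of memory for a sequence, vLLM divides the KV cache into fixed-size blocks, the equivalent of pages. Each block holds the keys and values for a fixed number of tokens.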

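  • Virtual Mapping: The "context" of a chat looks contiguous to the model, but it isn't stored in one long line in memory. It is scattered across different "physical" blocks on the GPU, allocated only as the sequence grows.
  • Mapping Table: Just as Linux page tables map virtual pages to physical frames, vLLM keeps a block table for each sequence that maps its logical blocks to physical blocks. Unlike the multi-level Linux structure (PGD down to PTE), the block table is a flat list.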

Result: Memory waste is limited to the partially filled last block of each sequence, and the vLLM authors report under 4% waste in practice. This allows vLLM to fit much larger batches of requests on a single GPU.
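
The bookkeeping behind this can be sketched with a toy allocator. This is a minimal illustration of the block-table idea, not vLLM's actual implementation: the class name `BlockAllocator`, its methods, and the tiny block size are all hypothetical.

```python
class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks (not vLLM's real allocator)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.num_tokens: dict[str, int] = {}          # seq id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(4):
    alloc.append_token("a")   # fills sequence a's first block
for _ in range(3):
    alloc.append_token("b")   # sequence b takes the next free block
for _ in range(2):
    alloc.append_token("a")   # a grows into a non-adjacent block

print(alloc.block_tables)     # {'a': [7, 5], 'b': [6]}
```

Sequence "a" ends up in blocks 7 and 5, with "b" sitting between them in block 6. Nothing has to be contiguous, and the only unused slots are in each sequence's last block.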

Key Features of vLLM

  • High Throughput: It can serve many times more requests per second than a baseline library such as Hugging Face Transformers. The vLLM project's launch benchmarks reported up to 24x higher throughput, though the gain depends on the model, hardware, and workload.
  • Continuous Batching: Traditional servers wait for a whole "batch" of requests to finish before starting new ones. vLLM uses "iteration-level scheduling," meaning as soon as one request's sequence is finished, a waiting request takes its slot in the batch at the next decoding step.
  • Quantization Support: It supports formats like AWQ and FP8, which shrink model weights so they use less memory and can run faster.
  • OpenAI Compatibility: It ships an HTTP server that implements the OpenAI API, so you can swap out "GPT-4" for a local model running on vLLM by changing the base URL and model name in your app rather than rewriting it.
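
The effect of iteration-level scheduling can be seen in a small simulation. This is a simplified sketch, not vLLM's scheduler: it counts decoding steps only, assumes one token per request per step, and ignores prefill cost and memory limits. The function names and request lengths are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Steps when a freed slot is refilled before the next iteration."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding iteration emits one token per active request;
        # requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2]  # hypothetical output lengths, in tokens
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

With static batching, short requests sit idle until the longest one in their batch finishes. Refilling slots every iteration removes that idle time, so the same work completes in fewer steps.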

Why does it matter?

If you are a company trying to run your own LLM, the difference can look roughly like this (illustrative numbers, not a benchmark):

  • Without vLLM: You might only be able to handle 2 users at a time on one NVIDIA A100 GPU before it runs out of memory.
  • With vLLM: You might be able to handle 20–40 users at a time on that same GPU. This dramatically lowers the cost of running AI.
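
Where a gap like this comes from can be estimated with simple arithmetic. The sketch below compares reserving the maximum length per request against allocating blocks on demand. Every number in it (KV budget, maximum length, typical length, block size) is an assumption for illustration, and it treats all requests as the same length, so the resulting ratio is not a prediction for any real workload.

```python
import math


def max_concurrent_contiguous(budget_tokens: int, max_len: int) -> int:
    """Requests that fit when each one reserves room for max_len tokens."""
    return budget_tokens // max_len


def max_concurrent_paged(budget_tokens: int, actual_len: int,
                         block_size: int) -> int:
    """Requests that fit when each one holds only the blocks it has filled."""
    total_blocks = budget_tokens // block_size
    blocks_per_request = math.ceil(actual_len / block_size)
    return total_blocks // blocks_per_request


budget_tokens = 65536  # assumed KV cache capacity, measured in tokens

print(max_concurrent_contiguous(budget_tokens, max_len=4096))               # 16
print(max_concurrent_paged(budget_tokens, actual_len=500, block_size=16))   # 128
```

The shorter typical requests are relative to the maximum length, the larger the advantage of paying only for tokens that actually exist.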

Source Code

https://github.com/vllm-project/vllm

Summary

vLLM is one of the most widely used engines for self-hosting LLMs because it manages GPU memory efficiently and maximizes the number of users you can serve at once.