TL;DR Link to heading

The blog explores LLM inference acceleration from a systems perspective: it first introduces a chatbot architecture for production environments, then covers advanced techniques in model optimization, focusing on KV cache management, PD (prefill-decode) disaggregation, and Continuous Batching. It explains how the KV cache improves inference efficiency by storing key-value pairs, how PD disaggregation separates the prefill and decode phases across devices for better memory and compute distribution, and how Continuous Batching maximizes throughput by dynamically batching requests. Together, these strategies enhance model performance, reduce latency, and optimize resource utilization in large-scale deployments.

High-Level Architecture Link to heading

Building a production-grade chatbot like ChatGPT requires balancing scalability, low latency, and reliability. At a high level, the system consists of a front-end that handles user requests - either through a web UI or RESTful API - and a back-end that powers the core intelligence. The back-end integrates the inference engine, data storage and retrieval, and additional modules like RAG (retrieval-augmented generation) or tool-calling systems to enhance LLM performance.

From a system design perspective, the architecture can be broken down into three core areas. User Interaction is where users connect through web clients or APIs, sending prompts and receiving responses in real-time. Security is enforced with authentication tokens or API keys, while rate limiting ensures fair usage. Requests are routed through load balancers to multiple session manager instances, enabling horizontal scaling and smooth session handling.
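For illustration, here is a minimal sketch of such an entry point using FastAPI with a hypothetical X-Api-Key check. The key set, route name, and response shape are assumptions for the sketch; in practice key storage, rate limiting, and TLS termination usually live at the gateway or load balancer.

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
VALID_API_KEYS = {"demo-key-123"}  # hypothetical; real keys come from a secrets store

class ChatRequest(BaseModel):
    session_id: str
    prompt: str

@app.post("/v1/chat")
async def chat(req: ChatRequest, x_api_key: str = Header(...)) -> dict:
    # reject requests that do not carry a valid API key
    if x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    # hand the request to the session manager / message queue; stubbed out here
    return {"session_id": req.session_id, "status": "queued"}
```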

Session Management takes care of maintaining conversation context for long-term, stateful interactions. Redis is a common choice here, thanks to its in-memory speed, durability, and ability to scale. It stores conversation history and provides microsecond-level access for inference. Each incoming message retrieves the relevant context from Redis, combines it with the new input, and pushes it into a queue for processing. The generated responses are then stored back in Redis and returned to the user.
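A minimal sketch of this pattern with redis-py, assuming a hypothetical chat:history:<session_id> key scheme and a per-session TTL; a production system would also trim or summarize long histories.

```python
import json
import redis  # redis-py; assumes a Redis instance on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

HISTORY_KEY = "chat:history:{session_id}"  # hypothetical key naming scheme

def append_turn(session_id: str, role: str, content: str, ttl_s: int = 3600) -> None:
    """Append one message to the session's history and refresh its TTL."""
    key = HISTORY_KEY.format(session_id=session_id)
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, ttl_s)

def load_context(session_id: str, max_turns: int = 20) -> list[dict]:
    """Fetch the most recent turns to build the prompt for inference."""
    key = HISTORY_KEY.format(session_id=session_id)
    return [json.loads(x) for x in r.lrange(key, -max_turns, -1)]
```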

Message Queue decouples session management from inference, allowing each to scale independently. It guarantees fault-tolerant delivery of tasks and supports asynchronous or streaming responses. Popular brokers for this include Kafka, RabbitMQ, and Redis Streams.
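Here is a small sketch of the decoupling using Redis Streams with a consumer group. The stream and group names are illustrative, and run_inference is a placeholder for the actual engine call.

```python
import redis

r = redis.Redis(decode_responses=True)
STREAM, GROUP = "inference:requests", "inference-workers"  # hypothetical names

# Session-manager side: enqueue a job and return immediately.
def enqueue(session_id: str, prompt: str) -> str:
    return r.xadd(STREAM, {"session_id": session_id, "prompt": prompt})

# Worker side: consume jobs as part of a consumer group, acknowledge when done.
def worker_loop(consumer: str = "worker-1") -> None:
    try:
        r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # consumer group already exists
    while True:
        for _, messages in r.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=1, block=5000):
            for msg_id, fields in messages:
                # run_inference(fields) would call the inference engine here
                r.xack(STREAM, GROUP, msg_id)
```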

Cloud providers such as AWS, Google Cloud, and Azure offer managed message queue services and load balancers, which makes it easy to expose an LLM service to the public.
This architecture ensures that the chatbot can handle high traffic, maintain context, and respond in real time, while remaining robust and scalable for production use.

Model Inference Link to heading

The most critical component of any LLM application is the Inference Service, the engine that powers model predictions. Open-source models are a great starting point, but it is crucial to understand how tokens are generated efficiently behind the scenes.

  • KV Cache is essential: almost all popular frameworks, such as vLLM and llama.cpp, manage it carefully. Please refer to this blog.

Transformer Architecture Recap Link to heading

Here’s a simplified view of what happens under the hood:

  1. Tokenization: The input prompt is broken down into tokens.

  2. Embedding & Transformer Layers: Tokens pass through an embedding layer, followed by N Transformer layers (sometimes hundreds).

  3. Sampling: Output probabilities are sampled to produce predictions.

  4. Detokenization: Tokens are converted back into human-readable text.
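A minimal sketch of these four steps with Hugging Face transformers; gpt2 is only a small stand-in model and the sampling parameters are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The key idea behind KV caching is"
input_ids = tok(prompt, return_tensors="pt").input_ids        # 1. tokenization

with torch.no_grad():
    out = model.generate(                                      # 2-3. transformer layers + sampling
        input_ids,
        max_new_tokens=32,
        do_sample=True,
        temperature=0.8,
        use_cache=True,                                        # reuse the KV cache during decode
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0], skip_special_tokens=True))            # 4. detokenization
```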

Where the Computation Gets Heavy

  • Multi-Head Attention (MHA) and the feedforward network are the most time-consuming.
  • Each self-attention layer projects its inputs into Q, K, and V matrices, which involves large matrix multiplications.
  • GPT-style decoding is token-by-token and autoregressive: each generated token is fed back to predict the next one.
  • The KV Cache stores the key-value pairs of previous tokens to avoid recomputing them, reducing the per-token attention cost from quadratic to roughly linear in sequence length (a toy sketch follows this list).
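To make the caching idea concrete, here is a toy single-head attention step in NumPy that appends each new token's K/V to a cache and attends over it. Real implementations are multi-head, batched, and fused, but the bookkeeping is the same.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache = np.zeros((0, d))   # grows by one row per generated token
V_cache = np.zeros((0, d))

def attend_next(x_new: np.ndarray) -> np.ndarray:
    """Process ONE new token: project it, append its K/V to the cache, attend over the cache."""
    global K_cache, V_cache
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    K_cache = np.vstack([K_cache, k])          # cached K/V of all previous tokens are reused
    V_cache = np.vstack([V_cache, v])
    scores = (q @ K_cache.T) / np.sqrt(d)      # only the new token's query is computed
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                   # attention output for the new token

for _ in range(5):                             # simulate 5 decode steps
    out = attend_next(rng.standard_normal((1, d)))
print("cached tokens:", K_cache.shape[0])      # -> 5
```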

Calculating Throughput Link to heading

When running inference, one of the most practical questions is:

“Given my GPU, how many users can I serve at once?”

The answer comes from GPU memory budgeting.

  1. Model Weights. Every model has a fixed memory footprint. For example: Llama2 7B in FP16: 7B × 2 bytes ≈ 14GB.
    This is a one-time cost that always sits in GPU memory.

  2. KV Cache Per Token. For each token, the model stores Key (K) and Value (V) vectors for every Transformer layer. The formula is:
    KV size per token = 2 × (number of layers) × (hidden size) × (bytes per element)
    For Llama2 7B with 32 layers, a hidden dimension of 4096, and FP16 precision: 2 × 32 × 4096 × 2 bytes = 524,288 bytes ≈ 512KB per token.

  3. Multiply by Context Length. If the context length is 4096 tokens: 4096 × 512KB ≈ 2GB KV Cache per session

  4. Subtract Model Weights. On an A100 GPU (80GB): 80GB total - 14GB weights ≈ 66GB free for KV Cache

  5. Divide to Get Maximum Sessions. Each session consumes ~2GB KV memory: 66GB ÷ 2GB ≈ 33 concurrent sessions.

Modern LLMs such as Llama3 and Qwen3 use Grouped Query Attention (GQA), in which several query heads share a single key/value head pair (for example, 32 query heads sharing 8 KV heads in Llama3 8B). This reduces the KV cache per token (and, modestly, the attention projection weights) by the ratio of query heads to KV heads, although the cache can still dominate memory for very long context windows or models with many layers.
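The whole budgeting exercise fits in a few lines of Python. The numbers below mirror the Llama2 7B / A100 example above, treat 80GB as GiB, and ignore activation memory and framework overhead, so take the session count as a rough upper bound; GQA is modeled through the kv_heads / attn_heads ratio (equal values reproduce plain MHA).

```python
def kv_bytes_per_token(layers, hidden, dtype_bytes=2, attn_heads=32, kv_heads=32):
    # factor 2 accounts for K and V; GQA shrinks the cached size by kv_heads / attn_heads
    return 2 * layers * hidden * (kv_heads / attn_heads) * dtype_bytes

def max_sessions(gpu_gb, params_b, ctx_len, layers, hidden, dtype_bytes=2,
                 attn_heads=32, kv_heads=32):
    weights = params_b * 1e9 * dtype_bytes                       # one-time weight footprint
    kv_per_session = ctx_len * kv_bytes_per_token(layers, hidden, dtype_bytes,
                                                  attn_heads, kv_heads)
    free = gpu_gb * 1024**3 - weights                            # memory left for KV cache
    return int(free // kv_per_session)

# Llama2 7B, FP16, 4096-token context on an 80GB A100 (plain MHA):
print(kv_bytes_per_token(32, 4096) / 1024)   # -> 512.0 KB per token
print(max_sessions(80, 7, 4096, 32, 4096))   # -> ~33 concurrent sessions
```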

Prefill vs Decode Link to heading

Inference runs in two phases:

  • Prefill: The model processes the entire prompt at once and writes all tokens into the KV Cache. This phase dominates time-to-first-token (TTFT), which can be high for long prompts (e.g., summarizing a 5K-token document means a slow first response).

  • Decode: The model generates tokens one by one using cached KV pairs. Much faster per-token latency. Ideal for streaming outputs (chat, creative writing).
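The two phases are easy to see with an explicit Hugging Face transformers loop: one forward pass over the whole prompt fills the cache (TTFT), then each decode step feeds only the newest token plus past_key_values. gpt2 and the repeated filler prompt are just stand-ins.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tok("Summarize: " + "lorem ipsum " * 100, return_tensors="pt").input_ids

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)                 # prefill: whole prompt at once
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(-1)
    print(f"TTFT (prefill): {time.perf_counter() - t0:.3f}s")

    for _ in range(20):                                    # decode: one token per step
        t1 = time.perf_counter()
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
        print(f"per-token: {time.perf_counter() - t1:.4f}s")
```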

PD Disaggregation (Prefill–Decode Split)

Large-scale inference often benefits from separating prefill and decode workloads across different GPU nodes. These two phases have very different compute profiles. Prefill is bursty and compute-intensive, while decode is lightweight but latency-sensitive. Running both on the same GPU leads to resource waste: prefill stalls decodes, and decode underutilizes GPUs during prefill bursts.

Think of it as separating “reading” from “speaking”: some GPUs are optimized for digesting the input, others for generating fluent output.

Benefits

  • Higher throughput by matching workloads to the right hardware.

  • Lower latency for real-time token streaming.

  • Flexible scaling (e.g., more decode nodes for chatbots, more prefill nodes for long-doc summarization).

Limitations

  • Splitting prefill and decode requires moving KV Cache across nodes.

  • This transfer can dominate overhead, causing longer Time-Per-Output-Token (TPOT).
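A quick back-of-the-envelope calculation shows why the transfer matters. The link bandwidths below are illustrative assumptions, not measurements, and the per-token size reuses the 512KB figure from the budgeting example above.

```python
KV_PER_TOKEN = 512 * 1024            # bytes per token (Llama2 7B, FP16, MHA)
PROMPT_TOKENS = 4096

kv_bytes = PROMPT_TOKENS * KV_PER_TOKEN          # ~2 GB to ship from prefill to decode node

for link, gbps in {"100 GbE": 100, "400 GbE": 400, "NVLink-class": 3200}.items():
    seconds = kv_bytes * 8 / (gbps * 1e9)        # bits over an assumed link rate
    print(f"{link:>13}: {seconds * 1000:7.1f} ms to move the KV cache")

# If a decode step takes ~20-50 ms, a transfer in the hundreds of milliseconds clearly
# dominates TPOT unless it is overlapped with compute or the interconnect is fast enough.
```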

The state of the art is moving beyond static PD splits toward adaptive, cache-aware scheduling and intra-GPU resource sharing, pushing both throughput and latency to new levels. For instance, NVIDIA Dynamo offers a GPU planner that monitors system state and adjusts workers to keep operation efficient, supporting both aggregated and disaggregated serving modes. Instead of dedicating separate GPUs to prefill and decode or batching them sequentially, Nexus dynamically partitions resources within a single GPU, allocating compute and memory bandwidth between the two phases on the fly depending on current load and performance goals. The SGLang scheduler dynamically interrupts ongoing decode batches when new prefill requests arrive, effectively inserting prefill into the running workload loop to improve utilization.

Continuous Batching Link to heading

In traditional inference setups, naive batching is common: the system waits until enough requests arrive before processing them together on the GPU. While simple, this approach has two main drawbacks:

  • Latency spikes: early requests sit idle, waiting for the batch to fill.
  • Underutilized GPU resources: GPU cores may remain idle if the batch isn’t full.

Continuous batching solves this by dynamically filling GPU slots as requests arrive, maximizing utilization without making users wait unnecessarily.

Imagine a GPU that can process 8 sequences in parallel. Requests arrive at different times:

Time | Incoming Requests | GPU Slots Filled | Notes
-----|-------------------|------------------|------
t0   | 3                 | 3/8              | GPU starts immediately with the 3 requests
t1   | 2 more            | 5/8              | GPU fills slots dynamically, no waiting
t2   | 4 more            | 8/8              | GPU fully utilized; remaining requests queue for the next cycle
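The behaviour in the table can be captured by a toy scheduler loop that refills free slots on every decode step; remaining token counts stand in for real sequences.

```python
import random
from collections import deque

MAX_SLOTS = 8
queue = deque((f"req{i}", random.randint(3, 10)) for i in range(20))  # (id, tokens left)
running = {}

step = 0
while queue or running:
    # admit new requests into any free slots (no waiting for a "full" batch)
    while queue and len(running) < MAX_SLOTS:
        rid, remaining = queue.popleft()
        running[rid] = remaining
    # one decode step for every active sequence
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]          # finished sequences free their slot immediately
    step += 1

print(f"finished 20 requests in {step} decode steps with at most {MAX_SLOTS} active")
```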

Continuous batching becomes especially practical when you only have a single GPU, because you cannot offload prefill tasks to separate nodes. Since the decode phase is much lighter than prefill, the GPU often has enough spare capacity to start prefill for a new incoming prompt while still decoding tokens for previous requests. This allows overlapping prefill and decode work, improving GPU utilization and reducing latency.

Additional optimizations like CUDA Graphs on decode nodes can reduce CPU overhead and improve efficiency.
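As a rough sketch (assuming a CUDA-capable GPU and PyTorch), a static-shape decode step can be captured once and replayed, so the per-step CPU launch overhead nearly disappears; the matmul below is only a stand-in for a real decode kernel.

```python
import torch

device = "cuda"
x = torch.randn(8, 4096, device=device)        # static input buffer (a batch of 8 "tokens")
w = torch.randn(4096, 4096, device=device)     # stand-in weight matrix

# warm up on a side stream so lazy CUDA initialization does not end up inside the graph
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = x @ w
torch.cuda.current_stream().wait_stream(s)

# capture one decode-like step into a graph, then replay it with minimal CPU overhead
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x @ w

for _ in range(100):
    x.copy_(torch.randn(8, 4096, device=device))  # update the static input in place
    g.replay()                                     # re-runs the captured kernels

torch.cuda.synchronize()
print(y.norm().item())                             # result of the last replayed step
```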