How LLMs Generate Tokens in Production

May 26, 2026

A walkthrough of the path from prompt text to generated tokens, and why production LLM serving is really about scheduling and cache reuse.

I came across this guide on operating Mixture-of-Experts (MoE) models in production. To make sense of it, I had to back up and understand the serving loop underneath: prefill, decode, KV cache, batching, and routing, the things that serving engines like vLLM and SGLang manage. So I’m writing up my current understanding, with some visuals.

Classical ML inference like linear regression, SVMs or CNNs has a tidy shape. An image classifier gets one image and returns one label. A reranker gets a fixed list and returns scores. The input and output sizes are usually known before the request starts.

LLM inference is different because both sides are variable length.

[Figure: Input (length unknown: short question, long chat, uploaded docs) -> LLM -> Output (length unknown: one word, code block, long answer). Unlike fixed-shape inference, requests arrive and finish at different sizes.]

The input can be a three-word question or a 200-page context window. The output can be one token or a long answer. The server does not know in advance how much work the request will take, and it has to keep producing tokens while other requests arrive, grow, and finish around it.

That variable-length shape is what makes LLM serving feel strange. The model is not just “called” once. It first reads the prompt, then writes the answer one token at a time, carrying forward a growing memory of the conversation.

Here is the path at a high level. The first three steps turn the prompt into model state. The later steps are about extending that state, reusing it, and serving many requests without wasting the GPU.

01

Tokenize

Text becomes token IDs. Billing, latency, and context windows are counted in tokens, not words.

02

Embed

Token IDs become vectors with position information, so the transformer can do math on text.

03

Prefill

The model reads the prompt once. This is parallel over many tokens and tends to be compute-heavy.

04

Decode

The model writes one new token, appends it, and repeats. This is serial and often memory-bandwidth-bound.

05

Reuse Cache

Saved keys and values avoid recomputing old context. The cache grows as the conversation grows.

06

Batch

Serve many users together. New requests join while finished ones leave.

The rest of the post walks through those boxes in order, then comes back to the production-serving tricks that make the last two boxes work at scale.

Text Becomes Tokens

The model never sees the string "tokenization is weird". The front door turns text into token IDs, then token vectors.

# prompt text
text = "tokenization is weird"

# cl100k_base tokens: ["token", "ization", " is", " weird"]
token_ids = tokenizer.encode(text)      # [5963, 2065, 374, 16682]
vectors = embedding[token_ids]          # [[0.12, ...], [0.08, ...], [0.44, ...], [0.19, ...]]

x = vectors + position_encoding         # shape: [4, d_model]

The exact split depends on the tokenizer; this example uses tiktoken with cl100k_base. token_ids are integer IDs from the tokenizer vocabulary. embedding is the learned lookup table that turns each ID into a vector. position_encoding tells the model where each token sits in the sequence; this deep dive on RoPE positional encoding goes deeper. d_model is the width of each token vector: for example, if d_model = 4096, each token becomes a list of 4,096 numbers. x is the tensor the Transformer will operate on.
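To make the ID-to-vector lookup concrete, here is a toy sketch in plain Python. Everything in it is invented for illustration: the four-entry vocabulary, the IDs 0 to 3, the greedy longest-match rule (real BPE tokenizers such as tiktoken merge pieces differently), and the small random embedding table standing in for a learned one.

```python
import random

# Hypothetical toy vocabulary; real tokenizers have on the order of 100k entries.
VOCAB = {"token": 0, "ization": 1, " is": 2, " weird": 3}
D_MODEL = 4  # tiny vector width so the output stays readable


def encode(text):
    """Greedy longest-match tokenization against VOCAB (a simplification of BPE)."""
    ids = []
    i = 0
    while i < len(text):
        match = max((p for p in VOCAB if text.startswith(p, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches at position {i}")
        ids.append(VOCAB[match])
        i += len(match)
    return ids


# Stand-in for the learned embedding table: one row of D_MODEL numbers per token ID.
rng = random.Random(0)
embedding = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]

token_ids = encode("tokenization is weird")
vectors = [embedding[t] for t in token_ids]  # shape: [4, D_MODEL]
print(token_ids)
```

The point is only the shape of the pipeline: a string becomes a list of integers, and each integer selects one row of a table.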

Prefill Reads The Prompt

After tokenization, the model runs the prompt through the Transformer. This phase is called prefill. It is the big read: the model processes the input tokens and writes a KV cache that will be reused during generation.

for layer in model.layers:
    layer_input = x  # shape: [seq_len, d_model]

    # Self-attention projections (each layer has its own weights). @ means matrix multiply.
    Q = layer_input @ W_query  # shape: [seq_len, d_model]
    K = layer_input @ W_key    # shape: [seq_len, d_model]
    V = layer_input @ W_value  # shape: [seq_len, d_model]

    # .T means transpose; every query row is compared against every key row.
    scores = (Q @ K.T) / sqrt(d_model)  # shape: [seq_len, seq_len]
    # Causal mask: a token attends to itself and earlier tokens, never later ones.
    # mask_future_positions is a placeholder that sets entries above the diagonal to -inf.
    scores = mask_future_positions(scores)
    attention_probabilities = softmax(scores)
    attention_output = attention_probabilities @ V

    x = attention_output  # shape: [seq_len, d_model]
    kv_cache[layer].store(K, V)

logits = to_vocab(x[-1])  # one score for every token in the vocabulary
# sample picks one token ID from those logits
first_token = sample(logits)

Attention is the model asking, for each token, which other tokens should influence this token right now, and by how much.

W_query, W_key, and W_value are learned weight matrices. During training they are adjusted so the model can extract useful patterns from token embeddings. They turn the input into Q, K, and V: queries ask what a token is looking for, keys describe what each token can offer, and values carry the information that gets mixed together.

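scores measures how much each token should attend to each other token. In a decoder-only LLM a causal mask then hides later positions, so each token only attends to itself and the tokens before it; that is also why cached K and V stay valid as new tokens are appended. The division by sqrt(d_model) keeps the dot products from getting too large before softmax; otherwise one or two huge scores can dominate too early. softmax turns each row of scores into probabilities between 0 and 1 that sum to 1, and attention_probabilities @ V mixes the value vectors using those probabilities.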

kv_cache[layer] stores the K and V tensors for this layer so decode can reuse them. x[-1] selects the last prompt position. to_vocab turns that vector into one score for every token in the vocabulary, and sample picks the first generated token from those scores.

Prefill is expensive, but it has a useful shape. If the prompt has 8,000 tokens, the server can process many of those tokens in parallel. The work is wide.

This is why prefill is compute-intensive. The expensive part is doing large matrix multiplications over many prompt tokens at once. GPUs are good at that kind of dense parallel math, so prefill can use a lot of compute efficiently.

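Note: this is a simplified Transformer layer focused on self-attention and the KV cache. It leaves out multi-head attention, the feed-forward network, residual connections, and normalization. With multiple heads, the scaling factor is the square root of the per-head dimension rather than d_model.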
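Here is a small runnable version of that attention step in plain Python, as a sketch rather than how any real engine does it. It assumes a single head, uses identity projections (Q = K = V = x) in place of learned weights, and applies the causal mask by only scoring keys at or before the query position.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention over lists of vectors, with a causal mask."""
    d = len(Q[0])
    outputs, all_probs = [], []
    for i, q in enumerate(Q):
        # Position i may only look at keys 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in range(i + 1)]
        probs = softmax(scores)
        # Mix the visible value vectors using the attention probabilities.
        mixed = [sum(p * V[j][c] for j, p in enumerate(probs)) for c in range(len(V[0]))]
        outputs.append(mixed)
        all_probs.append(probs)
    return outputs, all_probs


# Three toy token vectors; identity projections stand in for W_query, W_key, W_value.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, probs = causal_attention(x, x, x)
for i, row in enumerate(probs):
    print(i, [round(p, 3) for p in row])
```

The first token can only attend to itself, so its attention row is a single probability of 1. Each later row has one more entry than the row before it, which is the triangular shape the mask creates.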

Decode Writes The Answer

Decode is the narrow part. It produces one token, appends that token’s keys and values to the cache, then repeats. Every step needs the layer weights and the old cache.

token = first_token

while token != END:
    x = embed(token)  # shape: [1, d_model]

    for layer in model.layers:
        Q = x @ W_query  # shape: [1, d_model]
        K = x @ W_key    # shape: [1, d_model]
        V = x @ W_value  # shape: [1, d_model]

        cached_K, cached_V = kv_cache[layer].read_all()
        all_K = concat(cached_K, K)  # shape: [context_tokens, d_model]
        all_V = concat(cached_V, V)  # shape: [context_tokens, d_model]

        scores = (Q @ all_K.T) / sqrt(d_model)
        x = softmax(scores) @ all_V
        kv_cache[layer].append(K, V)

    logits = to_vocab(x[-1])  # one score for every token in the vocabulary
    # sample picks the next token ID from those logits
    token = sample(logits)

Most names are the same, but their shapes have changed. token is now the single token being generated. embed(token) produces one vector, so x has shape [1, d_model] instead of [seq_len, d_model]. cached_K and cached_V are all the keys and values from earlier prompt and output tokens. all_K and all_V extend that attention context with the new token. logits are the raw scores for the next-token vocabulary before sampling.

That while loop is the hard part. You cannot fully parallelize “what is the next token?” before you know the current token. You can make each step fast. You can batch many users together. You can sometimes skip steps. But the basic shape remains serial.
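To convince myself that the cache changes the cost and not the answer, here is a toy comparison. It is a sketch under simplifying assumptions: one head, one layer, a made-up project function (doubling each number) standing in for the K and V projections, and the raw token vector used as the query. It counts how many projections each approach performs.

```python
import math

calls = {"n": 0}  # counts how many times a token is projected


def project(x):
    """Stand-in for x @ W_key / x @ W_value; the real projections are learned matrices."""
    calls["n"] += 1
    return [2.0 * v for v in x]


def attend(q, keys, values):
    """One query attending over every key/value pair it is given."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# With a cache: project only the new token, then read everything stored so far.
cache_k, cache_v = [], []
incremental = []
for x in tokens:
    kv = project(x)
    cache_k.append(kv)
    cache_v.append(kv)
    incremental.append(attend(x, cache_k, cache_v))
cached_calls = calls["n"]

# Without a cache: re-project the whole prefix at every step.
calls["n"] = 0
recomputed = []
for i in range(len(tokens)):
    kvs = [project(t) for t in tokens[: i + 1]]
    recomputed.append(attend(tokens[i], kvs, kvs))
naive_calls = calls["n"]

print(cached_calls, naive_calls, incremental == recomputed)
```

Both paths produce the same attention outputs, but the cached path projects each token once while the recompute path projects the whole prefix again at every step. The cache trades that repeated compute for memory that has to be stored and reread.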

This is why decode is usually memory-bandwidth-intensive. Each step only does math for one new token, so there is less parallel compute to hide the cost of moving data around. The model weights and KV cache usually live in GPU memory. For every decode step, chunks of those tensors are streamed from GPU memory into the GPU’s compute units, used for the matrix multiplications, and then the new K and V for the generated token are written back to the cache.

If the model is split across multiple GPUs, some activations or routed tensors also move across GPU-to-GPU links such as NVLink or PCIe. If the weights or KV cache do not fit in GPU memory, serving systems may offload some data to CPU RAM or even disk, then move it back when needed. That is much slower than keeping it on the GPU, so avoiding those transfers is a major part of production inference.

As the context gets longer, more cache has to be read for each generated token. The bottleneck shifts from “how many multiplications can I do?” to “how fast can I move weights and cache through memory?”

The sizes get large quickly. A 7B parameter model in 16-bit precision is roughly 14 GB of weights. A 70B model is roughly 140 GB before quantization or serving overhead. The KV cache grows with the number of layers, the context length, and the vector width:

kv_cache_bytes = layers * context_tokens * 2 * d_model * bytes_per_number
# 2 because the cache stores both K and V

For a 32-layer model with d_model = 4096, an 8,000-token context, and 16-bit cache values, that rough calculation is about 4 GB for one long request. Real systems have tricks to reduce this, but the shape is the important part: during decode, every active request carries memory that grows as the conversation grows.
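The arithmetic above is easy to check in a few lines of Python. The formula is the one from this post; the 80 GB GPU in the last step is a number I picked for illustration, and the estimate ignores activations, fragmentation, and any cache-reduction tricks.

```python
def kv_cache_bytes(layers, context_tokens, d_model, bytes_per_number=2):
    """Rough KV-cache size for one request; the factor 2 covers both K and V."""
    return layers * context_tokens * 2 * d_model * bytes_per_number


def weight_bytes(parameters, bytes_per_number=2):
    """Rough weight size at a given precision (2 bytes = 16-bit)."""
    return parameters * bytes_per_number


one_request = kv_cache_bytes(layers=32, context_tokens=8_000, d_model=4096)
weights_7b = weight_bytes(7_000_000_000)

print(f"KV cache for one request: {one_request / 1e9:.2f} GB")
print(f"7B weights at 16-bit: {weights_7b / 1e9:.0f} GB")

# Hypothetical 80 GB GPU: how many such long requests fit next to the weights?
gpu_bytes = 80_000_000_000
requests_that_fit = (gpu_bytes - weights_7b) // one_request
print(f"Long requests that fit: {requests_that_fit}")
```

Under these assumptions only about fifteen 8,000-token requests fit alongside a 7B model, which is why the cache, not the weights, often sets the limit on how many conversations a GPU can hold.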

[Figure: Prefill, one wide read. Many prompt tokens (p1 to p6) move through the model together in a parallel pass, then write the first KV cache entries.]

[Figure: Decode, many narrow steps. One new token (out 1, out 2, out 3) is produced at a time, while each step rereads the growing cache.]

Now that the basic loop is in place, we can look at the tricks that make LLM serving efficient and scalable in production.

Batching Has To Keep Moving

There is a natural question here: if the server already has a layer’s weights ready, why run only one request through it? The expensive part is often moving those weights and cache through the GPU. Batching lets several requests use the same loaded weights at the same time, whether they are doing prefill work or decode work.

Batching is the basic serving trick: instead of running one request at a time, the server groups multiple requests together so the GPU has more work to do in parallel. This works especially well when the requests have the same shape.

LLMs make that harder. Requests have variable input length and variable output length. Some users finish early. Some keep generating. If the server treats a batch as one fixed group, empty slots waste GPU time.

It also creates head-of-line blocking: one slow request at the front of the queue can make everyone behind it wait. In an LLM batch, one long-running generation can keep shorter requests stuck even after they could have been served.

Continuous batching treats the batch as a moving set. At each prefill or decode tick, the active requests run one forward pass through the model’s Transformer layers. Then finished work leaves, and new work enters.

         T1   T2   T3   T4   T5   T6   T7   T8
slot 1   R1   R1   R1   R1   R1   end  R6   R6
slot 2   R2   R2   R2   R2   R2   R2   R2   end
slot 3   R3   R3   R3   R3   end  R5   R5   R5
slot 4   R4   R4   R4   R4   R4   R4   end  R7

Legend (color-coded in the original figure): prefill work or prefill chunk, decode work, newly admitted request, request finished.
T1-T8: scheduler ticks over time. slot: one active lane in the GPU batch. R1, R2, ...: different user requests.

This is one of the biggest differences from ordinary fixed-shape inference. The batch is not a single group that starts together and ends together. It is a changing set of active requests.
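A tiny simulation makes the difference visible. This is a sketch with invented numbers: each request is reduced to "how many ticks of work it needs", a tick advances every active request by one step, and prefill versus decode cost is ignored. The request lengths are hypothetical and do not reproduce the timeline above exactly.

```python
from collections import deque


def continuous_batching(lengths, slots):
    """Ticks needed when free slots are refilled at every tick."""
    waiting = deque(lengths)
    active = []  # remaining steps for each active request
    ticks = 0
    while waiting or active:
        # Admit new work into any free slot.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One tick: every active request advances by one step.
        active = [r - 1 for r in active]
        # Finished requests leave and free their slots.
        active = [r for r in active if r > 0]
        ticks += 1
    return ticks


def static_batching(lengths, slots):
    """Ticks needed when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


lengths = [5, 7, 4, 6, 3, 2, 1]  # hypothetical work for R1..R7
print("continuous:", continuous_batching(lengths, slots=4))
print("static:", static_batching(lengths, slots=4))
```

With these lengths the moving batch finishes in 7 ticks while the fixed batches take 10, because short requests no longer wait for the longest request in their group.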

The KV Cache Needs A Memory Manager

Continuous batching keeps many slots active at the same time. Each slot may belong to a different request, and each request carries a KV cache that grows with its context. Multiply that by long prompts, long answers, many layers, and large vector widths, and the amount of KV cache can explode.

That is why efficient KV-cache management matters. The cache reduces repeated attention work, but it also becomes a large and growing memory object for every active request.

The naive version is to reserve one long contiguous cache buffer per request. That wastes memory because requests end at different lengths, and it makes it hard to pack many active conversations into GPU memory. Production systems treat the KV cache more like virtual memory.

Paging is the big idea behind vLLM’s PagedAttention design. Instead of requiring one contiguous cache region, the cache is split into fixed-size blocks. A request keeps a table that says which cache blocks contain its tokens.

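for token in generated_tokens:
    # Allocate a new block only when the request's last block is full.
    # last_block, is_full, and allocate are placeholder names for this sketch.
    if request.last_block is None or request.last_block.is_full():
        request.last_block = kv_pool.allocate()
        request.block_table.append(request.last_block.id)

    request.last_block.append(K, V)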

Now a request can grow by adding blocks, finish by returning blocks, and reuse memory without moving one giant cache tensor around. The important shift is that KV cache stops being “an array attached to a request” and becomes managed serving state.
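Here is a minimal runnable sketch of that bookkeeping. It is my own toy model, not vLLM's implementation: BlockPool, Request, and their methods are hypothetical names, blocks are just integer IDs, and no actual K or V tensors are stored.

```python
class BlockPool:
    """Fixed pool of KV-cache blocks, each holding up to block_size tokens."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)


class Request:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # A new block is needed only when the last one is full.
        if self.num_tokens % self.pool.block_size == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def locate(self, token_index):
        """Map a token position to (physical block ID, offset inside the block)."""
        block = self.block_table[token_index // self.pool.block_size]
        return block, token_index % self.pool.block_size

    def finish(self):
        # Return every block to the pool so other requests can reuse them.
        self.pool.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8, block_size=4)
a, b = Request(pool), Request(pool)
for _ in range(6):
    a.append_token()
for _ in range(3):
    b.append_token()
print(a.block_table, b.block_table, len(pool.free))
```

Request a holds six tokens in two blocks and request b holds three tokens in one, so at most one partially filled block per request is wasted. When a request finishes, its blocks go straight back to the pool without copying anything.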

Other techniques reduce how much cache has to be stored or moved:

  • Compression stores fewer bytes per cached vector. Google’s TurboQuant post is a good example of KV-cache compression aimed at reducing memory footprint while keeping attention scores useful.
  • Quantization stores numbers with fewer bits. For weights or cache values, that means less memory to store and less data to move. This quantization explainer is a good ground-up walkthrough.

Paging, compression, and quantization are all aimed at the same bottleneck: decode is often waiting on memory, so production systems try to store less, move less, and reuse what is already nearby.

Long Prompts Should Not Freeze Active Generations

The moving-batch picture still leaves one question: how big is one yellow prefill block?

If a prompt is short, prefill can fit neatly into a tick or two. If a prompt is 20,000 tokens, treating the whole prefill as one schedulable unit can still monopolize compute. The request has a slot, but that slot is occupied by a huge piece of work. Active decoders may pause while the long prompt is processed. Users feel that as stutter: tokens stop arriving even though generation already started.

Chunked prefill makes the yellow blocks smaller. It breaks a long prompt into bounded pieces, then lets decode work slip between those pieces.

for chunk in chunks(long_prompt, size=256):
    prefill(chunk, req.kv_cache)

    # let active generations breathe
    if active_decoders:
        decode_step(active_decoders)

A scheduler can also protect decode first and spend leftover budget on waiting prefills.

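budget = GPU_TOKEN_BUDGET

# Decode first: each active request costs one token of budget.
decode_step(active_requests)
budget -= len(active_requests)

# Spend what is left on prefill chunks, oldest waiting request first.
while budget >= CHUNK and waiting_prefills:
    req = waiting_prefills[0]
    prefill(req.next_chunk(), req.kv_cache)
    budget -= CHUNK

    # prefill_done is a placeholder: once the whole prompt is read, the request starts decoding.
    if req.prefill_done():
        waiting_prefills.pop(0)
        active_requests.append(req)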

That is a concrete serving policy: protect tokens already in flight. The user who is already watching tokens stream should not freeze because someone else arrived with a huge prompt.

The model has not changed. The math inside a token step has not changed. But the experience changes because the scheduler respects the shape of the work.
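The decode-first policy can be simulated in a few lines. This is a sketch with made-up numbers: one long prompt arrives while several requests are already decoding, each tick has a fixed token budget, and the names run_ticks, token_budget, and chunk are mine.

```python
def run_ticks(prompt_tokens, decoders, token_budget, chunk):
    """Simulate decode-first scheduling; return (ticks used, tokens per decoder)."""
    if token_budget - decoders < chunk:
        raise ValueError("budget leaves no room for a prefill chunk")
    remaining = prompt_tokens
    produced = [0] * decoders
    ticks = 0
    while remaining > 0:
        budget = token_budget
        # Decode first: every active decoder gets its one token this tick.
        for i in range(decoders):
            produced[i] += 1
        budget -= decoders
        # Spend the leftover budget on prefill chunks of the long prompt.
        while budget >= chunk and remaining > 0:
            remaining -= min(chunk, remaining)
            budget -= chunk
        ticks += 1
    return ticks, produced


ticks, produced = run_ticks(prompt_tokens=2000, decoders=8, token_budget=520, chunk=256)
print(ticks, produced)
```

In this run the 2,000-token prompt takes four ticks to prefill, and every one of the eight active decoders still receives a token on each of those ticks. Without chunking, those decoders would have waited for the whole prompt to be read before seeing their next token.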

Sometimes You Can Skip Decode Steps

Normal decode asks the big model to choose one token, append it, then choose the next token. That choice is serial because token 4 depends on token 3, token 3 depends on token 2, and so on.

Verification is easier because the candidate tokens are already known. If a smaller or cheaper draft path proposes four tokens, the target model can treat them like a short extension of the prompt. It reuses the KV cache for the existing prefix, runs a prefill-like pass over the draft chunk, computes logits at each draft position in parallel, and checks how many proposed tokens it can accept from left to right.

The target model still does work for each proposed token. The win is that it avoids four separate serial decode steps. Instead of “generate token 1, then token 2, then token 3, then token 4,” verification asks “given these four proposed tokens, which prefix of them would the target model have accepted?” The draft chunk is known, so the target can process it as one block.

Speculative decoding uses that asymmetry. A draft path proposes several tokens. The target model verifies them. Accepted tokens move the output forward without paying one full target-model decode step per token.

There is no universal confidence threshold. If generation is greedy, verification can be as simple as “did the draft token match the target model’s top token?” If generation is sampling, the usual acceptance rule is probabilistic: accept a proposed token with probability min(1, target_prob / draft_prob). If the target model liked the token at least as much as the draft model did, it is accepted. If the draft model was more optimistic than the target model, it is accepted only some of the time.

For example, suppose the prompt is "The capital of France is" and the draft model proposes [" Paris", ".", " It", " is"]. If the draft model gave " Paris" probability 0.70, and the target model gives that same proposed token probability 0.91, then accept_prob = min(1, 0.91 / 0.70) = 1.0, so it is always accepted. If the draft model gave "." probability 0.60, but the target model gives it probability 0.52, then accept_prob = min(1, 0.52 / 0.60) = 0.87, so it is accepted 87% of the time.

prefix = "The capital of France is"
draft_tokens, draft_dists = small_model.generate(prefix, k=4)
# draft_tokens: [" Paris", ".", " It", " is"]

# One target-model pass scores all proposed positions using the prefix KV cache.
logits = target.forward_with_cache(prefix_kv_cache, draft_tokens)
target_dists = softmax(logits)

accepted = []
for proposed, draft_dist, target_dist in zip(draft_tokens, draft_dists, target_dists):
    # Compare the two models' probabilities for this exact proposed token.
    accept_prob = min(1, target_dist[proposed] / draft_dist[proposed])

    if random() < accept_prob:
        accepted.append(proposed)
    else:
        replacement_dist = normalize_positive_part(target_dist - draft_dist)
        accepted.append(sample(replacement_dist))
        break

prefix += accepted

When a proposed token is not accepted, the target model samples a replacement token for that position. The server keeps the accepted prefix plus that replacement, discards the rest of the draft tokens, and continues generation from there. In other words, decode resumes from the first point where the draft stopped matching the target.
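The acceptance rule is small enough to run on toy distributions. In this sketch the "models" are just hand-written dictionaries mapping tokens to probabilities (the alternatives " Lyon" and "," are invented), so it only demonstrates the accept, reject, and resample logic, not real drafting or a real target pass.

```python
import random


def accept_probability(token, draft_dist, target_dist):
    """min(1, target_prob / draft_prob) for the proposed token."""
    return min(1.0, target_dist.get(token, 0.0) / draft_dist[token])


def residual(target_dist, draft_dist):
    """Normalized positive part of (target - draft), used after a rejection."""
    diff = {t: max(0.0, p - draft_dist.get(t, 0.0)) for t, p in target_dist.items()}
    total = sum(diff.values())
    return {t: v / total for t, v in diff.items()}


def sample(dist, rng):
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


def verify(draft_tokens, draft_dists, target_dists, rng):
    """Accept a prefix of the draft; on the first rejection, resample and stop."""
    accepted = []
    for token, q, p in zip(draft_tokens, draft_dists, target_dists):
        if rng.random() < accept_probability(token, q, p):
            accepted.append(token)
        else:
            accepted.append(sample(residual(p, q), rng))
            break
    return accepted


# Toy distributions matching the numbers in the example above.
draft_dists = [{" Paris": 0.70, " Lyon": 0.30}, {".": 0.60, ",": 0.40}]
target_dists = [{" Paris": 0.91, " Lyon": 0.09}, {".": 0.52, ",": 0.48}]

print(accept_probability(" Paris", draft_dists[0], target_dists[0]))
print(round(accept_probability(".", draft_dists[1], target_dists[1]), 2))
print(verify([" Paris", "."], draft_dists, target_dists, random.Random(0)))
```

Sampling the replacement from the leftover probability mass, rather than from the target distribution directly, is what lets the accepted-plus-replaced tokens follow the target model's distribution. In this toy case a rejected "." can only be replaced by ",", because that is the only token the target likes more than the draft did.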

Multi-token prediction, or MTP, moves the draft path inside the model itself. An MTP-trained model has the normal next-token output plus extra prediction outputs trained to guess token t+2, t+3, and so on from the same hidden state. The normal output gives the token plain decoding would have produced. The extra outputs give a built-in draft for the next few positions, so there is no separate small model doing autoregressive drafting.

The loop has the same draft-then-verify shape:

# One forward pass at position t.
next_token, draft_tokens, draft_dists = model.propose_multiple(prefix)
# next_token: the normal t+1 prediction
# draft_tokens: proposed t+2, t+3, ..., t+n

candidate = [next_token] + draft_tokens

# One verification pass over the proposed block.
# In MTP, the same model verifies its own built-in draft.
logits = model.forward_with_cache(prefix_kv_cache, candidate)
target_dists = softmax(logits)

accepted = [next_token]
for proposed, draft_dist, target_dist in zip(draft_tokens, draft_dists, target_dists[1:]):
    accept_prob = min(1, target_dist[proposed] / draft_dist[proposed])

    if random() < accept_prob:
        accepted.append(proposed)
    else:
        replacement_dist = normalize_positive_part(target_dist - draft_dist)
        accepted.append(sample(replacement_dist))
        break

prefix += accepted

After verification, the model also has a fresh hidden state at the last accepted position. That fresh state can immediately produce the next group of proposal tokens. So after the first step, each iteration is roughly one proposal pass and one verification pass, instead of one full decode step per generated token.

This does not change what “correct” generation means. The target model still gets the final say. It changes how many serial target steps you spend to get there. The vLLM speculative decoding post goes deeper on the serving details.

Some Models Add Expert Routing

The intuition behind MoE is simple: you want the quality benefits of a very large model, but you do not want to run the whole model for every token. Instead of making every token pass through every feed-forward block, you create many specialized feed-forward blocks and use only a few of them per token.

That is the routing idea. In a dense model, every token flows through the same weights. In a Mixture-of-Experts model, a router picks a small number of expert MLPs for each token.

This is why MoE exists: it lets you scale total parameter count, which can help quality, without scaling the FLOPs spent on every token by the same amount, which helps cost and latency. The model may contain many experts, but each token only activates a few of them.

MoE also pairs naturally with multi-machine inference. Experts are parallel along the expert dimension, so sharding them across GPUs or machines is relatively clean: send tokens to the experts they selected, run those experts, then combine the outputs. The main tax is the routing communication, often an all-to-all exchange where tokens have to move to the machines that own their chosen experts.

[Figure: an MoE layer inside the LLM. The attention layers are replicated across all GPUs; between them, a gating network selects the top-k experts, with Expert 1 on GPU 1, Expert 2 on GPU 2, and Expert 3 on GPU 3.]

scores = router(hidden_states)
expert_ids = topk(scores, k=2)

buckets = group_by_expert(tokens, expert_ids)
expert_out = {}

for expert, toks in buckets.items():
    expert_out[expert] = experts[expert](toks)

y = combine_by_token(expert_out, scores)

scores are the router’s preferences for each expert. topk(scores, k=2) chooses the two experts each token will use. group_by_expert turns token-level routing decisions into expert-level batches, because tiny scattered expert calls are bad for utilization. combine_by_token puts the expert outputs back into the original token order, weighted by the router scores.
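Here is a runnable toy version of that routing step. It is a sketch under assumptions of my own: tokens are single numbers instead of vectors, the three "experts" are simple multiply functions, and the combine weights are a softmax over each token's selected scores, which is one common choice but not the only one.

```python
import math
from collections import defaultdict


def top_k_route(router_scores, k):
    """Pick k experts per token and softmax their scores into combine weights."""
    routes = []
    for scores in router_scores:
        top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        exps = [math.exp(scores[e]) for e in top]
        total = sum(exps)
        routes.append([(e, x / total) for e, x in zip(top, exps)])
    return routes


def moe_layer(tokens, router_scores, experts, k=2):
    routes = top_k_route(router_scores, k)
    # Group token indices by expert so each expert runs once on a batch.
    buckets = defaultdict(list)
    for t, route in enumerate(routes):
        for expert_id, weight in route:
            buckets[expert_id].append((t, weight))
    out = [0.0] * len(tokens)
    for expert_id, items in buckets.items():
        batch = [tokens[t] for t, _ in items]
        results = experts[expert_id](batch)
        # Scatter results back to their token positions, weighted by the router.
        for (t, weight), y in zip(items, results):
            out[t] += weight * y
    return out, buckets


# Hypothetical experts: each just scales its inputs.
experts = [
    lambda batch: [1 * v for v in batch],
    lambda batch: [10 * v for v in batch],
    lambda batch: [100 * v for v in batch],
]
tokens = [1.0, 2.0, 3.0]
router_scores = [[2.0, 2.0, -5.0], [0.0, 3.0, 3.0], [1.0, -1.0, 1.0]]

out, buckets = moe_layer(tokens, router_scores, experts)
print(out)
print({e: [t for t, _ in items] for e, items in buckets.items()})
```

Each expert is called exactly once with the tokens that chose it, which is the bucketing step that keeps utilization reasonable. In a sharded deployment, building and undoing those buckets is where the all-to-all communication happens.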

That is the bargain: more parameters without running all of them, clean expert sharding across machines, and a communication bill you now have to manage.

There is another routing question: which worker should get the request? If multiple requests share a prefix, or if a user is continuing a conversation, the best worker may be the worker that already has the right cache.

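def score(worker, prompt):
    hit = worker.prefix_cache.match(prompt)  # e.g. how many prompt tokens are already cached
    wait = worker.queue_tokens               # tokens queued ahead on this worker
    return 3 * hit - wait                    # the weight 3 is only illustrative

worker = max(workers, key=lambda w: score(w, prompt))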

if worker.has_prefix(prompt):
    worker.continue_from_cache(prompt)
else:
    worker.prefill(prompt)

Cache locality can be worth more than sending the request to the emptiest queue. For agents, chat apps, structured output, and workloads with repeated system prompts, that matters a lot.
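A runnable toy version of that trade-off, with everything invented for illustration: the Worker class, the list of cached prompts standing in for a real prefix cache, and the same arbitrary weight of 3 on cache hits.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Worker:
    def __init__(self, name, cached_prompts, queue_tokens):
        self.name = name
        self.cached_prompts = cached_prompts  # token lists already in this worker's KV cache
        self.queue_tokens = queue_tokens      # tokens of work waiting on this worker

    def prefix_hit(self, prompt):
        return max((shared_prefix_len(c, prompt) for c in self.cached_prompts), default=0)


def score(worker, prompt, hit_weight=3):
    # Reward cached prefix tokens, penalize queue depth.
    return hit_weight * worker.prefix_hit(prompt) - worker.queue_tokens


def best_worker(workers, prompt):
    return max(workers, key=lambda w: score(w, prompt))


system_prompt = list(range(100))        # 100 shared system-prompt tokens
prompt = system_prompt + [900, 901]     # a new user turn on top of it

warm = Worker("warm", cached_prompts=[system_prompt + [500]], queue_tokens=200)
idle = Worker("idle", cached_prompts=[], queue_tokens=0)

print(best_worker([warm, idle], prompt).name)

warm.queue_tokens = 400                 # the warm worker gets much busier
print(best_worker([warm, idle], prompt).name)
```

With a moderate queue the warm worker wins because it can skip prefill for the 100 shared tokens. Once its queue is long enough, the idle worker wins even though it has to prefill from scratch. Where that crossover sits depends entirely on the weight, which a real router would have to tune.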

The Serving Knobs

The tricks above are not an exhaustive map of production LLM serving, but they gave me four useful knobs to look for. They either keep batches full, reuse memory, schedule around the prefill/decode split, or reduce serial work.

Queue

Continuous batching

Add and remove requests at every decode step instead of waiting for a whole batch to finish.

State

KV-cache management

Page it, reuse it, compress it, evict it, or move it. The cache becomes the serving state.

Phase

Scheduling

Chunk prefill, prioritize decode, or split prefill and decode onto different GPU pools.

Loop

Fewer serial steps

Speculative decoding and multi-token prediction (MTP) try to produce more accepted tokens per expensive pass.

This is also a useful way to think about serving engines:

  • vLLM is the throughput workhorse in my head: continuous batching, PagedAttention-style KV memory management, prefix caching, and broad quantization support. It also has the broadest hardware story of the three, with serious support beyond NVIDIA.
  • SGLang is strong when prompts share structure or prefixes: agents, RAG, chat templates, repeated system prompts, and structured generation. Its RadixAttention work is about more aggressive KV-cache reuse across related requests. I still need to understand that mechanism better. SGLang is also a serious runtime for large MoE models, where expert parallelism matters.
  • TensorRT-LLM is the NVIDIA-heavy path. It compiles the model into an optimized TensorRT engine for a particular model shape, precision, parallelism setup, and serving configuration. That compilation step is the point: it gives the runtime more room to fuse kernels, tune memory layout, and squeeze throughput from NVIDIA GPUs. The tradeoff is workflow rigidity. If the model, quantization, max sequence length, or deployment shape changes, you may need to rebuild or retune the engine instead of just pointing a server at a checkpoint and iterating quickly.

All of these systems keep moving, and these descriptions may no longer hold by the time you read this. But the questions above still help: how does the engine batch work, manage KV memory, reuse prefixes, handle MoE, fit your hardware, and support the models you need?

The core concepts are simple once the variable-length shape is visible. Prefill reads the prompt. Decode writes the answer one token at a time. The KV cache carries the conversation forward. The serving stack keeps the GPU busy while requests of different lengths enter, grow, and leave.

The next thing I want to do is tear down small versions of these systems and map the ideas above to actual code. Two good starting points are GeeeekExplorer/nano-vllm and sgl-project/mini-sglang. They are small enough to read, but close enough to the real serving engines to make the abstractions concrete.