AI Token per Second Optimization Techniques
Key Takeaways
- Among the most effective AI token per second optimization techniques, memory bandwidth optimization can increase TPS by 40–60% through methods like KV cache quantization and paging.
- Attention mechanism optimizations (Flash Attention, Grouped Query Attention) reduce memory usage by up to 50% while maintaining accuracy.
- Model serving techniques like speculative inference and continuous batching can boost throughput by 2-4x in production environments.
- Hardware-aware optimizations, including tensor parallelism and mixed-precision inference, deliver measurable performance gains.
- Combined optimization strategies can achieve 10x+ TPS improvements without quality degradation.
Large language models are transforming natural language processing and broader artificial intelligence applications, but their computational demands create significant bottlenecks in production environments. When your AI models process millions of input tokens and output tokens daily, every millisecond of latency and every token per second of throughput directly impacts user experience (especially in latency-sensitive flows such as mobile web sign-in) and overall operational costs.
Token per second (TPS) optimization has emerged as a critical discipline for organizations deploying large language models and other AI models at scale. The difference between an unoptimized model generating 10 tokens per second and an optimized system achieving 100+ tokens per second translates to dramatically improved user experiences and reduced infrastructure costs.
This comprehensive guide explores proven AI token-per-second optimization techniques that can significantly enhance your AI models’ throughput without compromising output quality. From memory bandwidth optimization to advanced serving techniques, these strategies represent the current state-of-the-art in LLM performance engineering.

Understanding AI Token Generation Speed Bottlenecks
Before diving into specific optimization techniques, it’s essential to understand the fundamental bottlenecks that limit token generation speed in large language models. Token generation occurs in two distinct phases, each with different performance characteristics and optimization opportunities.
Prefill vs Decode Phases
The prefill phase processes all input tokens simultaneously in parallel, utilizing the model’s full computational capacity efficiently. During prefill, the model constructs the initial key-value cache from the input prompt, laying the foundation for subsequent token generation. This phase typically achieves high throughput measured in thousands of tokens per second.
The decode phase generates output tokens autoregressively, where each new token depends on all previous tokens in the sequence. This sequential dependency makes decoding inherently memory-bound rather than compute-bound, typically achieving much lower tokens per second than prefill. For most production workloads, decode performance determines the user-perceived latency.
Memory Bandwidth as the Primary Constraint
Modern AI models face memory access limitations as their primary bottleneck during token generation. Large models with billions of parameters require substantial data movement between GPU memory and compute units for each generated output token. This memory-bound behavior means that traditional compute optimizations provide diminishing returns compared to memory-focused techniques.
The key-value (KV) cache, which stores attention keys and values for every processed position, grows linearly with both sequence length and batch size, creating additional memory pressure. For a model like Llama 2 70B, the KV cache can consume several gigabytes of GPU memory for longer sequences, significantly limiting the number of sequences the system can process concurrently.
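As a rough illustration of that growth, the sketch below estimates KV cache size from standard Llama 2 70B dimensions (80 layers, 8 KV heads under GQA, head dimension 128); the batch size and sequence length are example values.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # 2x accounts for storing both keys and values at every layer and position
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Llama 2 70B-style dimensions: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=8, bytes_per_value=2)
print(f"KV cache: {size / 1e9:.1f} GB")  # ~10.7 GB for this configuration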
Model Size vs TPS Trade-offs
Larger models generally produce higher-quality outputs, often with deeper reasoning and richer detail, but they also demand substantially more computational resources, resulting in lower tokens per second. This fundamental tradeoff means organizations must balance output quality requirements against throughput and latency constraints.
For concrete examples:
- GPT-5.1-class frontier model: In typical enterprise deployments, large frontier models of this class often land in the 80–150 tokens per second range per request at moderate context lengths when served on H100-class GPUs with optimized runtimes, with higher effective throughput under heavy batching.
- Llama 3 70B-class model: A 70B-scale Llama 3 variant commonly reaches roughly 35–60 tokens per second on H100 80GB when you enable Flash Attention, KV cache optimizations, and continuous batching, with lower numbers on older A100 hardware.
- Llama 3 8B: Smaller Llama 3 models (around 8B parameters) can reach 250–400 tokens per second on H100-class GPUs under optimized serving and aggressive batching, and can push much higher aggregate throughput on specialized LLM hardware such as Cerebras CS-3 or next-gen GPU clusters.
These baseline performance metrics highlight a significant opportunity for optimization across various model sizes and hardware configurations.

Memory Bandwidth Optimization Techniques
Memory bandwidth optimization addresses the fundamental bottleneck limiting token generation speed in most production deployments. By reducing memory movement and improving memory utilization efficiency, these techniques can deliver 40-60% improvements in tokens per second performance.
KV Cache Quantization
KV cache quantization reduces memory requirements by storing key and value tensors in lower precision formats while preserving model accuracy. This technique directly addresses the memory bandwidth bottleneck by reducing the total amount of data that must be moved during attention computation.
8-bit Quantization Implementation: FP8 KV cache (8-bit floating-point) optimization typically achieves roughly 50% memory savings compared with FP16 KV caches, with minimal impact on accuracy. In many modern implementations, the process converts FP16 key and value tensors in the KV cache to FP8 formats (rather than pure INT8), using calibrated scaling factors that preserve the key–value structure needed for accurate attention. Frameworks such as vLLM expose FP8 KV cache support through simple configuration options, and similar low-precision KV cache techniques are emerging in other optimized inference stacks.
# vLLM FP8 KV cache configuration
from vllm import LLM
model = LLM(
    model="meta-llama/Llama-2-70b-hf",
    kv_cache_dtype="fp8",  # Enable FP8 (8-bit) KV cache
)
4-bit Quantization Methods: 4-bit KV cache quantization achieves 75% memory reduction but requires more sophisticated calibration to maintain output quality. This aggressive quantization works best with models fine-tuned specifically for quantized inference or when using advanced quantization algorithms such as GPTQ or AWQ.
Quality preservation depends on careful calibration, representative dataset selection, and per-layer scaling factor optimization. For production deployments, thorough quality evaluation across representative test cases is essential before adopting 4-bit KV cache quantization.
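One lightweight way to run that evaluation is to compare greedy outputs from a baseline and a quantized configuration and track how often they diverge. The sketch below assumes a hypothetical build_engine helper that wraps whichever serving stack you use; the 5% threshold is only an example.
# build_engine is a hypothetical helper returning a generate(prompt) -> str callable
# for a given KV cache precision; wire it to whichever serving stack you use.
def divergence_rate(prompts, generate_baseline, generate_quantized):
    diverged = sum(1 for p in prompts if generate_baseline(p) != generate_quantized(p))
    return diverged / len(prompts)

# Example gate: reject the quantized config if more than 5% of greedy outputs change
# baseline = build_engine(kv_cache_dtype="auto")
# quantized = build_engine(kv_cache_dtype="int4")
# assert divergence_rate(eval_prompts, baseline, quantized) < 0.05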
PagedAttention and Memory Management
PagedAttention revolutionizes memory management in LLM serving by eliminating the memory fragmentation that plagues traditional pre-allocation strategies. Instead of pre-allocating contiguous memory blocks for each sequence’s KV cache, PagedAttention uses virtual memory concepts to allocate memory in smaller, fixed-size blocks.
Block-based Allocation Strategy: The system divides the KV cache into logical blocks, typically containing tokens for 16-32 positions. When a sequence requires additional memory, the system allocates new blocks dynamically without requiring contiguous memory space. This approach reduces memory fragmentation by up to 80% compared to traditional methods.
Memory Utilization Improvements: PagedAttention enables much higher memory utilization by allowing sequences of different lengths to share GPU memory efficiently. Where traditional pre-allocation might achieve 60-70% memory utilization, PagedAttention consistently achieves 90%+ utilization across varied workloads.
The technique integrates seamlessly with continuous batching systems, enabling dynamic request scheduling without memory allocation constraints. This integration provides the foundation for advanced serving optimizations, such as speculative inference and multi-request processing.
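The sketch below illustrates the block-based accounting under an assumed block size of 16 positions, comparing paged allocation against a contiguous pre-allocator that reserves space for the maximum sequence length.
import math

BLOCK_SIZE = 16      # token positions per KV cache block (assumed)
MAX_SEQ_LEN = 4096   # what a contiguous pre-allocator would reserve per sequence

def blocks_needed(current_len, block_size=BLOCK_SIZE):
    return math.ceil(current_len / block_size)

seq_lens = [350, 1200, 40]  # three in-flight sequences of very different lengths
paged_slots = sum(blocks_needed(n) * BLOCK_SIZE for n in seq_lens)
prealloc_slots = len(seq_lens) * MAX_SEQ_LEN
print(f"Utilization, paged:      {sum(seq_lens) / paged_slots:.0%}")     # ~99%
print(f"Utilization, contiguous: {sum(seq_lens) / prealloc_slots:.0%}")  # ~13%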

Attention Mechanism Optimizations
Attention computation represents a significant computational bottleneck in transformer-based large language models. Modern attention optimizations focus on reducing memory movement and computational complexity while preserving the mathematical properties that enable high-quality natural language text generation.
Flash Attention Implementation
Flash Attention fundamentally reimagines attention computation by fusing operations and optimizing memory access patterns. Instead of materializing large attention matrices in GPU memory, Flash Attention computes attention in smaller blocks, dramatically reducing memory requirements and improving computational efficiency.
Memory Hierarchy Optimization: Flash Attention exploits the memory hierarchy of modern GPUs by keeping intermediate computations in fast SRAM rather than slower HBM memory. This optimization reduces memory I/O by up to 90% for attention computation, directly translating to faster token generation.
The technique achieves 2-4x speedup in attention computation without changing the underlying mathematical operations. For H100 and A100 GPUs, Flash Attention provides the most significant performance improvements at longer context lengths, where traditional attention becomes increasingly memory-bound.
Integration with Modern Frameworks:
Flash Attention integrates natively with PyTorch 2.0 through the scaled_dot_product_attention API, which automatically dispatches to FlashAttention-style kernels when running on GPU with supported dtypes. Hugging Face Transformers inherits these benefits for models that rely on PyTorch’s SDPA path.
Modern inference frameworks—including vLLM and TensorRT-LLM—also enable Flash Attention–style fused kernels by default for supported model architectures to reduce memory movement and increase throughput.
# PyTorch 2.0 Flash Attention usage
import torch.nn.functional as F
# Automatic Flash Attention dispatch when conditions are met
attention_output = F.scaled_dot_product_attention(
    query, key, value,
    is_causal=True,  # Enables causal masking
)
Multi-Query and Grouped-Query Attention
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce the size of KV caches by sharing key and value representations across multiple attention heads. This architectural change provides substantial memory savings while maintaining most of the representational power of full Multi-Head Attention.
MQA Implementation Benefits: Multi-Query Attention uses a single key and value head shared across all query heads, shrinking the KV cache by a factor equal to the number of attention heads. For a model with 32 attention heads, MQA reduces KV cache memory requirements by approximately 97%, enabling much larger batch sizes and longer sequence processing.
Grouped-Query Attention Balancing: GQA represents a balanced approach between MQA and traditional Multi-Head Attention, grouping multiple query heads to share key-value pairs. Llama 2 70B employs GQA with 8 key-value heads for 64 query heads, achieving significant memory savings while preserving more representational capacity than pure MQA.
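To make the memory arithmetic concrete, here is a small sketch of how head sharing scales the KV cache, using the head counts cited above; real savings also depend on precision and memory layout.
# KV cache size scales with the number of key-value heads, so sharing heads
# shrinks it proportionally (precision and layout effects are ignored here).
def kv_cache_reduction(num_query_heads, num_kv_heads):
    return num_query_heads / num_kv_heads

print(kv_cache_reduction(64, 64))  # full MHA baseline: 1x
print(kv_cache_reduction(64, 8))   # GQA as in Llama 2 70B: 8x smaller cache
print(kv_cache_reduction(32, 1))   # MQA on a 32-head model: 32x smaller cache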
Training and Fine-tuning Considerations: Models must be specifically trained or fine-tuned with MQA or GQA architectures - these optimizations cannot be applied post-training without significant accuracy degradation. However, the training overhead is minimal, and the resulting models often achieve comparable quality to traditional attention mechanisms.
Performance benchmarks show:
- MQA: KV cache shrinks by a factor equal to the query head count (roughly 32x for a 32-head model), with 40-60% improvement in decode TPS
- GQA: Around 8x reduction in KV cache size in architectures like Llama 2 70B, with 30–50% improvement in decode TPS depending on sequence length and batch size
- Quality impact: Typically less than 2% degradation in benchmark scores
Model Parallelization for Higher TPS
Model parallelization enables scaling beyond the limitations of single GPU memory and compute by distributing model parameters and computation across multiple devices. Effective parallelization strategies strike a balance between communication overhead and the benefits of increased computational resources and memory bandwidth.
Tensor Parallelism Optimization
Tensor parallelism splits individual model layers across multiple GPUs, enabling larger models to fit in memory while potentially reducing latency through parallel computation. The optimal GPU count depends on model size, batch size, computational load, and the available communication infrastructure.
Optimal GPU Count Selection: For most production deployments, 2-8 GPUs provide the best balance between performance gains and communication overhead. Beyond 8 GPUs, communication costs often outweigh computational benefits unless high-bandwidth interconnects like NVLink or InfiniBand are used.
Communication Overhead Analysis: The effectiveness of tensor parallelism depends heavily on inter-GPU communication bandwidth. NCCL optimization becomes critical for multi-GPU setups, particularly when scaling beyond single-node configurations. Proper CUDA stream management and communication overlap can reduce tensor parallelism overhead by 20–40%.
Concrete Performance Examples:
- Llama 4 Scout (MoE, 17B active / 109B total parameters): On A100 GPUs, many teams can deploy Scout on a single GPU for moderate batch sizes, but 2-way tensor parallelism often delivers on the order of 40–60% latency reduction for larger batch configurations or very long context windows by better balancing computational load across devices.
- Llama 4 Maverick (MoE, 17B active / 400B total parameters): On A100 80GB GPUs, deployments commonly rely on 4-way tensor parallelism (combined with data or expert parallelism) to support long contexts and higher throughput, while H100 80GB GPUs can typically host the active parameter set on a single device and use tensor parallelism primarily to scale throughput rather than to make the model fit in memory.
- Communication scaling: Performance improvements for tensor parallelism often plateau around 8–16 GPUs for most modern transformer and mixture-of-experts designs, unless very high-bandwidth interconnects are available.
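In practice, enabling tensor parallelism is usually a one-line serving change. The sketch below shows a minimal vLLM configuration assuming four NVLink-connected GPUs; the memory utilization setting is an example value.
from vllm import LLM

# Shard each layer's weights across 4 GPUs; vLLM handles the NCCL all-reduces
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,        # number of GPUs to shard each layer across
    gpu_memory_utilization=0.90,   # example headroom setting
)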
Pipeline Parallelism Considerations
Pipeline parallelism distributes model layers across multiple GPUs, enabling simultaneous processing of multiple requests at different pipeline stages. This approach provides excellent memory scaling but introduces pipeline bubbles that can reduce overall utilization.
Layer Distribution Strategies: Effective pipeline parallelism requires careful layer distribution to balance compute load across pipeline stages. Transformer models benefit from distributing attention and feed-forward layers evenly, as attention layers typically consume more memory, while feed-forward layers require more computation.
Microbatching Optimization: Microbatching divides each batch into smaller chunks, which are processed sequentially through the pipeline, thereby reducing pipeline bubbles and improving utilization. The optimal microbatch size depends on pipeline depth and model characteristics, with targets of 80% or higher pipeline utilization achievable through proper tuning.
Memory vs Compute Trade-offs: Pipeline parallelism excels when memory constraints limit batch size more than computational capacity. For memory-bound workloads typical in LLM inference, pipeline parallelism often outperforms tensor parallelism by enabling larger effective batch sizes despite communication overhead.
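A simple way to reason about those trade-offs is the idealized bubble model for a GPipe-style schedule, where utilization is m / (m + p - 1) for p pipeline stages and m microbatches; the sketch below tabulates a few assumed configurations.
# Idealized GPipe-style schedule: with p stages and m microbatches, the bubble
# fraction is (p - 1) / (m + p - 1), so utilization is m / (m + p - 1).
def pipeline_utilization(num_stages, num_microbatches):
    return num_microbatches / (num_microbatches + num_stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} microbatches -> {pipeline_utilization(4, m):.0%} utilization")
# 1 -> 25%, 4 -> 57%, 16 -> 84%, 64 -> 96%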

Advanced Serving Techniques
Production LLM deployments require sophisticated serving techniques that optimize for both throughput and latency while managing dynamic request patterns. These advanced methods can boost throughput by 2-4x compared to static batching approaches.
Continuous Batching Implementation
Continuous batching revolutionizes LLM serving by dynamically adding new requests to ongoing batches as previous requests complete, eliminating the traditional tradeoff between latency and throughput that characterizes static batching systems.
Dynamic Request Scheduling: Instead of waiting for entire batches to complete before starting new requests, continuous batching maintains a dynamic pool of active requests. As individual sequences finish generation, the system immediately adds new requests to maintain optimal GPU utilization, while gracefully handling the request cancellations that appear in real-world workloads. This approach typically achieves around 3x throughput improvements over static batching while maintaining low latency for individual requests, which is essential for real-time interactions such as mobile web sign-in flows.
Memory Allocation for Variable Lengths: Continuous batching requires sophisticated memory management to handle sequences of varying lengths efficiently. Integration with PagedAttention enables dynamic memory allocation without fragmentation, supporting the variable-length sequences that characterize real-world LLM workloads.
Framework Implementation Examples: Modern serving frameworks implement continuous (in-flight) batching differently depending on their architectural focus:
- vLLM: Combines continuous batching with PagedAttention for maximum memory efficiency and high throughput.
- TensorRT-LLM: Uses CUDA-optimized kernels and in-flight batching tuned specifically for NVIDIA GPUs to maximize Tensor Core performance.
- llama.cpp: Provides lightweight in-flight batching and efficient inference on CPU and edge hardware, making it suitable for resource-constrained deployments.
from vllm import LLM, SamplingParams
prompts = [
    "Explain continuous batching in large language model serving.",
    "Give three benefits of PagedAttention for production workloads.",
]
# Automatic continuous batching with optimized memory management
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    max_num_seqs=256,  # Maximum concurrent sequences
    block_size=16,     # PagedAttention block size
)
sampling_params = SamplingParams(max_tokens=256)
# Requests are processed dynamically as they arrive
outputs = llm.generate(prompts, sampling_params)
Speculative Inference
Speculative inference accelerates token generation by using a smaller, faster “draft” model to predict multiple tokens ahead, then verifying these predictions with the larger target model in parallel. This technique can provide 2x+ speedup while maintaining identical output quality to standard autoregressive generation.
Draft Model Selection Strategies: Effective speculative inference relies on selecting draft models that strike a balance between speed and accuracy. The draft model should be 4-8x faster than the target model while maintaining reasonable prediction accuracy for the target domain. Common approaches include:
- Using smaller versions of the same model family (e.g., Llama 3 8B drafting for Llama 3 70B, or Llama 3.1 8B drafting for Llama 3.1 405B)
- Pairing newer high-efficiency models with large general-purpose models (e.g., GPT-4.1-mini drafting for GPT-5.1)
- Fine-tuning compact models specifically for draft-token generation to improve acceptance rates during verification
- Using distilled student models trained to mimic the target model’s token distribution, improving prediction alignment during speculative decoding
Verification and Quality Maintenance: The verification process ensures that speculative inference produces identical outputs to standard generation. The target model processes the draft tokens in parallel, accepting accurate predictions and rejecting those that are incorrect. This parallel verification maintains perfect quality, ensuring the model’s output remains identical to standard generation, while still achieving significant speedup when draft predictions are accurate.
Cost-Benefit Analysis: Speculative inference provides the most significant benefits when:
- Draft model prediction accuracy exceeds 70%
- The target model is memory bandwidth bound rather than compute bound
- Hardware can efficiently run both draft and target models simultaneously
Performance improvements vary significantly based on the task and model combination, with text completion and code generation typically showing better results than creative writing tasks.
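For planning purposes, the expected gain can be approximated with the standard speculative decoding model (following the analysis in Leviathan et al., 2023), using an assumed acceptance rate, draft length, and draft-to-target cost ratio:
# With per-token acceptance rate alpha, k draft tokens per step, and a draft model
# costing a fraction c of a target forward pass, each verification step emits
# (1 - alpha**(k + 1)) / (1 - alpha) tokens on average.
def expected_speedup(alpha, k, c):
    tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_step = k * c + 1  # k draft passes plus one target verification pass
    return tokens_per_step / cost_per_step

print(f"{expected_speedup(alpha=0.7, k=4, c=0.1):.2f}x")  # ~1.98x
print(f"{expected_speedup(alpha=0.9, k=4, c=0.1):.2f}x")  # ~2.93x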
Quantization Strategies for TPS Gains
Quantization reduces the numerical precision of AI models, especially their weights, to achieve substantial performance improvements while preserving output quality. Modern quantization techniques can deliver 1.5–3x speed improvements and dramatic memory reductions, enabling deployment of larger models on existing hardware.
Weight Quantization Methods
Weight quantization converts model weights from full precision (FP16/FP32) to lower-precision formats (INT8/INT4), reducing both memory requirements and computational demands while maintaining model accuracy through sophisticated calibration techniques.
INT8 Quantization Implementation: INT8 weight quantization typically achieves 1.5-2x speed improvements with minimal accuracy degradation. Modern frameworks implement INT8 quantization through calibration datasets that capture the statistical properties of model activations, enabling the accurate computation of quantization scale factors.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# On-the-fly GPTQ quantization requires the optimum and auto-gptq backends
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = GPTQConfig(
    bits=4,            # 4-bit GPTQ weights
    group_size=128,    # Common group size for Llama models
    desc_act=False,    # Disable activation-order (act-order) reordering for stability
    sym=True,          # Symmetric quantization (recommended)
    dataset="c4",      # Calibration dataset used to compute quantization scales
    tokenizer=tokenizer,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
Advanced Quantization Algorithms: Modern quantization techniques like GPTQ and AWQ provide superior quality preservation compared to naive quantization approaches:
- GPTQ: Uses gradients to optimize quantization parameters, maintaining quality even at 4-bit precision
- AWQ: Focuses on preserving weights that most impact model accuracy, enabling aggressive quantization
- NF4: A 4-bit NormalFloat format (introduced with QLoRA) designed for the normally distributed weights typical of transformer models
Model-Specific Recommendations: Different model architectures benefit from tailored quantization approaches:
- Llama models: GPTQ with 4-bit precision provides excellent quality-performance balance
- Mistral 7B: AWQ quantization typically preserves quality better than GPTQ
- Code models: More conservative 8-bit quantization is often required for accuracy preservation
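For AWQ specifically, the common workflow is to load a pre-quantized checkpoint rather than quantize on the fly. A minimal sketch, assuming the autoawq backend is installed and using an illustrative checkpoint id:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized AWQ export (repo id is illustrative; substitute your own)
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")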
Activation Quantization
Activation quantization extends quantization to intermediate computations during inference, providing additional memory and computational savings beyond those achieved with weight quantization alone.
Mixed Precision Strategies: Effective activation quantization employs mixed-precision approaches that keep sensitive operations in full precision while quantizing less critical computations. Attention computations typically require higher precision, while feed-forward network activations often tolerate aggressive quantization.
SmoothQuant and Outlier Handling: Advanced activation quantization techniques, such as SmoothQuant, address the outlier problem that renders naive activation quantization ineffective. By redistributing quantization difficulty from activations to weights, SmoothQuant enables effective 8-bit activation quantization while maintaining model accuracy.
Hardware Acceleration Integration: Modern GPUs provide specialized instructions for quantized computations. INT8 Tensor Cores on A100 and H100 GPUs deliver significant performance improvements for quantized models, making activation quantization increasingly attractive for production deployments.

Hardware-Specific Optimization Techniques
Different hardware platforms require tailored optimization strategies to achieve optimal token-per-second performance. Understanding platform-specific capabilities enables targeted optimizations that maximize the available computational and memory resources.
NVIDIA GPU Optimizations
NVIDIA GPUs dominate LLM inference workloads, and platform-specific optimizations can deliver substantial performance improvements beyond those achieved with generic techniques.
TensorRT-LLM Kernel Fusion: TensorRT-LLM provides CUDA-optimized kernels that fuse multiple operations, reducing kernel launch overhead and improving memory bandwidth utilization. These optimizations can deliver on the order of 40% speedups over standard PyTorch implementations through advanced kernel fusion and memory layout optimization.
H100 vs A100 Optimization Differences: The H100 architecture introduces several capabilities that change optimization priorities:
- Transformer Engine: Native FP8 support that can deliver up to around 2x throughput gains on core attention and matrix-multiplication kernels, with end-to-end LLM inference speed improvements typically in the 1.3–1.7x range, depending on the design of the system and batch size
- Increased Memory Bandwidth: Over 3 TB/s of HBM3 bandwidth enables different memory-compute balance points
- Improved Tensor Cores: Enhanced mixed-precision capabilities benefit quantized inference
FP8 Precision Implementation: H100’s native FP8 support can deliver up to 2x throughput gains over FP16 on supported kernels while maintaining quality comparable to BF16. FP8 optimization requires model-specific calibration but delivers substantial benefits for large model inference.
CUDA Graph Optimization: CUDA Graphs reduce kernel launch overhead by pre-recording execution sequences, which is particularly beneficial for small-batch inference where kernel launch costs dominate. Proper CUDA Graph implementation can improve small-batch latency by 20-30%.
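A minimal capture-and-replay sketch using PyTorch's CUDA Graphs API is shown below; the small linear layer stands in for a real decode step, and static shapes and buffers are assumed throughout.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.randn(8, 4096, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one forward pass into a graph with fixed shapes and buffers
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then relaunch the whole
# recorded kernel sequence with a single call instead of per-kernel launches
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()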
CPU and Edge Device Optimizations
While GPUs dominate large-scale deployments, CPU and edge inference remain essential for specific use cases and deployment constraints.
ONNX Runtime CPU Optimizations: ONNX Runtime provides highly optimized CPU inference through vectorized operations and platform-specific optimizations. For Intel CPUs, ONNX Runtime leverages AVX-512 instructions and optimized BLAS libraries to achieve competitive performance for smaller models.
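A typical CPU session setup looks like the sketch below; the thread count and model path are placeholders to adapt to your hardware and exported model.
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 8  # match to the physical core count
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "model.onnx",                       # placeholder path to an exported model
    sess_options=options,
    providers=["CPUExecutionProvider"],
)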
ARM64 Specific Optimizations: ARM64 processors benefit from NEON vectorization and optimized quantization kernels. Apple Silicon and AWS Graviton processors show particular strength in quantized inference workloads, making them viable alternatives for cost-sensitive deployments.
Memory Mapping Strategies: CPU inference benefits from sophisticated memory mapping that minimizes model loading times and memory usage. Techniques include:
- Memory-mapped model files to reduce RAM requirements
- Lazy loading of model parameters to minimize startup latency
- Optimized weight layouts for cache-friendly access patterns
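As one example of the first two techniques, safetensors checkpoints can be opened lazily so individual weights are materialized only when accessed; the file path is a placeholder and the tensor name follows the Llama checkpoint convention.
from safetensors import safe_open

# Tensors are read on demand from the memory-mapped file, keeping startup RAM low
with safe_open("model-00001-of-00002.safetensors", framework="pt", device="cpu") as f:
    names = list(f.keys())
    embeddings = f.get_tensor("model.embed_tokens.weight")  # loaded only when requested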
Economic Considerations: CPU inference makes economic sense for:
- Low-throughput applications where GPU utilization would be poor
- Edge deployments where GPU hardware is unavailable
- Applications requiring very low latency where CPU cache advantages matter
Benchmarking and Measurement Best Practices
Accurate performance measurement is essential for AI token per second optimization, but common measurement mistakes can lead to misleading results and suboptimal optimization decisions.
Proper TPS Measurement Methodologies
Distinguishing Prefill and Decode Performance: Measuring only aggregate tokens per second obscures the critical distinction between prefill and decode performance. Prefill TPS reflects input processing speed, while decode TPS measures the actual generation speed that users experience. These metrics can differ by orders of magnitude and require separate optimization strategies.
Time to First Token (TTFT) Measurement: TTFT captures the user-perceived latency for interactive applications. This metric includes model loading, tokenization—including byte pair encoding for breaking text into subword units—prefill computation, and scheduling delays. Optimizing TTFT requires different techniques than optimizing overall throughput.
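A minimal harness for separating these metrics might look like the sketch below, which assumes a hypothetical stream_tokens(prompt) generator that yields tokens from your serving endpoint as they are produced.
import time

def measure_ttft_and_decode_tps(stream_tokens, prompt):
    # stream_tokens(prompt) is assumed to yield generated tokens as they arrive
    start = time.perf_counter()
    first_token_time = None
    generated = 0
    for _ in stream_tokens(prompt):
        generated += 1
        if first_token_time is None:
            first_token_time = time.perf_counter()
    end = time.perf_counter()
    if first_token_time is None:
        return float("nan"), 0.0
    ttft = first_token_time - start
    decode_tps = (generated - 1) / (end - first_token_time) if generated > 1 else 0.0
    return ttft, decode_tps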
Model Bandwidth Utilization Calculation: Model Bandwidth Utilization (MBU) quantifies how effectively the system utilizes available memory bandwidth:
MBU = (Memory Access per Token × Tokens per Second) / Peak Memory Bandwidth
High MBU (>80%) indicates memory-bound workloads where memory optimizations provide the most significant benefits, while low MBU suggests compute-bound scenarios where different optimization strategies apply.
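A worked example of the calculation, assuming a 7B-parameter model served in FP16 on a single A100 80GB (roughly 2 TB/s peak bandwidth) at an example decode rate of 120 tokens per second:
params = 7e9                     # 7B-parameter model
bytes_per_token = params * 2     # FP16 weights read once per decode step (simplified)
tokens_per_second = 120          # example measured decode rate
peak_bandwidth = 2.0e12          # ~2 TB/s on A100 80GB

mbu = (bytes_per_token * tokens_per_second) / peak_bandwidth
print(f"MBU: {mbu:.0%}")  # ~84%, i.e. still firmly memory bound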
End-to-End Performance Testing
Realistic Workload Simulation: Production benchmarks should reflect actual usage patterns, including:
- Variable natural language input prompt lengths matching real-world distributions
- Representative output token targets for the specific application
- Concurrent request patterns that simulate production load
- Mixed batch sizes that reflect actual serving scenarios
Quality Preservation Validation: Performance optimizations must maintain output quality, requiring systematic quality measurement across optimization changes. Automated evaluation using benchmark datasets enables rapid iteration while preventing quality regressions.
Common Measurement Pitfalls: Avoid these frequent benchmarking mistakes:
- Measuring only single-request latency instead of concurrent throughput, especially when request cancellations occur during high-traffic workloads
- Using unrealistic prompt lengths that don’t match production workloads
- Ignoring warmup effects that affect real-world performance
- Focusing only on peak throughput without measuring latency distribution
Implementation Roadmap and Tool Selection
The systematic implementation of AI token-per-second optimization techniques requires careful prioritization and tool selection based on specific deployment requirements and constraints.
Step-by-Step Optimization Implementation
Phase 1: Baseline Measurement and Low-Hanging Fruit (Week 1-2):
- Implement comprehensive performance measurement across prefill and decode phases
- Deploy Flash Attention if not already enabled
- Configure optimal batch sizes for your hardware configuration
- Enable KV cache quantization (8-bit) for immediate memory savings
Phase 2: Memory Optimization (Week 3-4):
- Implement PagedAttention for improved memory utilization
- Evaluate 4-bit KV cache quantization for aggressive memory reduction
- Optimize memory pool management and allocation strategies
- Implement continuous batching for enhanced throughput
Phase 3: Model-Level Optimizations (Week 5-8):
- Evaluate weight quantization (GPTQ/AWQ) for your specific model
- Implement tensor parallelism if single-GPU memory is insufficient
- Consider speculative inference for latency-critical applications
- Fine-tune parallelization strategies based on workload characteristics
Phase 4: Advanced Techniques (Week 9-12):
- Deploy hardware-specific optimizations (TensorRT-LLM, FP8 on H100)
- Implement custom kernels for workload-specific bottlenecks
- Optimize for specific deployment constraints (edge, CPU-only)
- Advanced serving optimizations for production scalability
Tool Comparison and Selection
ROI Estimation Framework: Prioritize optimizations based on expected return on investment:
- Memory optimizations: Highest ROI for memory-bound workloads (most LLM inference)
- Attention optimizations: Medium-high ROI, especially for more extended sequences
- Quantization: Medium ROI, requires quality validation overhead
- Hardware-specific: High ROI for dedicated hardware, low for mixed environments
Production Deployment Considerations
Monitoring and Observability: Production deployments require comprehensive monitoring of key metrics:
- Real-time TPS measurement across prefill and decode phases
- Memory utilization and KV cache efficiency
- Request queuing and batching effectiveness, including monitoring for request cancellation behavior that can affect throughput and latency stability
- Quality metrics to detect optimization-related degradation
Cost-Performance Optimization: Balance infrastructure costs against performance requirements:
- Evaluate GPU utilization to right-size deployments
- Consider a mixture of optimization techniques based on traffic patterns
- Implement auto-scaling policies that account for optimization effectiveness
- Monitor cost per token to optimize technique selection