Disaggregated inference

ScientificConcept

An LLM serving architecture that separates the prefill and decode phases of inference onto different hardware or software resources.


First Mentioned

3/22/2026, 10:45:29 PM

Last Updated

3/22/2026, 10:49:21 PM

Research Retrieved

3/22/2026, 10:49:21 PM

Summary

Disaggregated inference is an architectural paradigm in machine learning, specifically for Large Language Models (LLMs), that decouples the distinct phases of inference—prefill and decode—onto separate hardware or software processes. The prefill phase is typically compute-bound, processing input prompts in parallel, while the decode phase is memory-bandwidth intensive, generating tokens autoregressively. By separating these phases, systems can optimize resource allocation, reduce inter-token latency (ITL), and prevent 'prefill stalls' where new requests interrupt ongoing generation. This concept is central to Nvidia's 'Dynamo' operating system, which manages massive AI workloads across hardware like Blackwell and Vera Rubin, and is supported by cloud providers like AWS through the Neuron SDK. It is considered a foundational design for scaling AI agents and complex simulations in robotics, digital biology, and healthcare.
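The prefill/decode split described above can be sketched in a few lines of toy Python. This is purely illustrative (no real model; token values and KV entries are placeholders): prefill consumes the whole prompt in one parallel pass and builds the KV cache, while decode grows that cache one token per step, which is why one phase is compute-bound and the other memory-bandwidth-bound.

```python
def prefill(prompt_tokens):
    """Compute-bound phase: one pass over all prompt tokens in parallel."""
    # Each entry stands in for a (key, value) pair computed per token.
    kv_cache = [("k%d" % t, "v%d" % t) for t in prompt_tokens]
    first_token = len(prompt_tokens)  # placeholder for the sampled next token
    return kv_cache, first_token

def decode(kv_cache, token, steps):
    """Memory-bound phase: autoregressive, one token per step,
    rereading the ever-growing KV cache each time."""
    out = []
    for _ in range(steps):
        kv_cache.append(("k%d" % token, "v%d" % token))
        token += 1  # stand-in for sampling the next token
        out.append(token)
    return out

# In a disaggregated deployment, prefill() runs on one worker pool and the
# resulting KV cache is shipped to a separate decode pool.
cache, tok = prefill([1, 2, 3])
print(decode(cache, tok, 4))  # → [4, 5, 6, 7]
print(len(cache))             # → 7 (3 prompt + 4 generated entries)
```

In a monolithic deployment both functions run on the same device; disaggregation moves them onto separate pools and transfers `cache` between them.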

Referenced in 1 Document
Research Data
Extracted Attributes
  • Field

    Machine Learning / AI Infrastructure

  • Key Benefits

    Throughput optimization, reduction of inter-token latency (ITL), and elastic scaling

  • Primary Phases

    Prefill (compute-bound) and Decode (memory-bandwidth bound)

  • Target Architectures

    Transformer-based Large Language Models (LLMs)

  • Implementation Software

    Nvidia Dynamo, AWS Neuron SDK, vLLM, llm-d

Timeline
  • The 'Attention Is All You Need' paper is published, introducing the Transformer architecture which forms the basis for modern inference phases. (Source: Wikipedia)

    2017-06-12

  • Research by Jiang et al. contributes to the development of disaggregated inference systems. (Source: Emergent Mind)

    2025-02-11

  • Nvidia Research publishes 'Beyond the Buzz: A Pragmatic Take on Inference Disaggregation', the first systematic study of the concept at scale. (Source: Nvidia Research / ArXiv)

    2025-06-01

  • Mitra et al. publish research on scaling inference across modern data centers using disaggregation. (Source: Emergent Mind)

    2025-06-05

  • Li et al. release findings on phase-specialized hardware utilization in LLM serving. (Source: Emergent Mind)

    2025-08-27

  • Liu et al. publish research on the architectural challenges of disaggregated inference. (Source: Emergent Mind)

    2025-09-22

  • Zhang et al. provide insights into avoiding cross-phase interference through disaggregation. (Source: Emergent Mind)

    2025-10-09

  • Jensen Huang discusses the 'inference explosion' and the role of disaggregated inference in the Dynamo OS on the All-In Podcast. (Source: Document b3924e92-7a2e-4033-92dc-8fdf1a6f3dce)

    2026-03-01

Transformer (deep learning)

In deep learning, the transformer is an artificial neural network architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets. The modern version of the transformer was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google. The predecessors of transformers were developed as an improvement over previous architectures for machine translation, but have found many applications since. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multimodal learning, robotics, and playing chess. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (bidirectional encoder representations from transformers).
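The multi-head attention mechanism described above reduces, per head, to scaled dot-product attention. A minimal single-head, single-query sketch in plain Python (toy vectors, no batching, masking, or learned projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    softmax(q . K / sqrt(d)) weighted sum over the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# One query attending over two tokens; the token whose key aligns with
# the query receives the larger weight, so its value dominates the output.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

A transformer layer runs this in parallel across many heads and all token positions; during prefill the keys and values for the whole prompt are computed at once, which is what populates the KV cache that decode later reuses.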

Web Search Results
  • Disaggregated Inference in LLMs - Emergent Mind

    Disaggregated inference refers to a paradigm in machine learning systems—especially LLMs—where distinct phases of inference (typically prefill and decode) are isolated onto separate hardware, processes, or architectural components. This separation is driven by the diverging resource requirements and performance bottlenecks of each phase when scaling inference across modern data center and edge deployments. Disaggregated inference is now a foundational design in efficient LLM serving, supporting heterogeneous clusters, cost-efficiency, specialized accelerators, and sophisticated software scheduling techniques. [...] The optimal phase/hardware allocation must track workload dynamics, and static ratios or naive scaling schemes quickly degrade away from the Pareto frontier. ## 7. Broader Implications and Future Directions Disaggregated inference is reshaping how LLM and deep learning inference systems are architected and deployed at scale. Disaggregated inference is now a defining architectural and algorithmic challenge in large-scale deep learning deployment, and remains an active area of systems, algorithms, and hardware research (Liu et al., 22 Sep 2025, Mitra et al., 5 Jun 2025, Zhang et al., 9 Oct 2025, Li et al., 27 Aug 2025, Jiang et al., 11 Feb 2025, Zhu et al., 3 Apr 2025, Chen et al., 2024, Hu et al., 2024). [...] ## 1. Motivation and Foundational Principles The motivation for disaggregated inference begins with the observation that LLM inference is inherently multi-phase. In particular, transformer-based autoregressive models have diverging resource profiles across phases. Traditional monolithic deployments, where both phases run on the same hardware, suffer inefficiencies. Disaggregation—separating prefill and decode across different hardware pools or processes—addresses these by enabling phase-specialized hardware utilization, elastic scaling, better queueing, and avoidance of cross-phase interference (Zhang et al., 9 Oct 2025, Jiang et al., 11 Feb 2025).
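The phase-separation model this snippet describes can be sketched as two independent worker pools joined by a hand-off queue. All names and the queue structure below are hypothetical, not taken from any real serving stack; the point is that neither pool ever preempts the other:

```python
from collections import deque

prefill_queue = deque(["req-A", "req-B", "req-C"])  # incoming prompts
handoff = deque()                                    # KV caches in flight
completed = []

def prefill_worker():
    """Drains prompts on the compute-optimized pool and emits KV caches."""
    while prefill_queue:
        req = prefill_queue.popleft()
        handoff.append((req, f"kv-cache({req})"))  # computed KV cache

def decode_worker():
    """Consumes KV caches on the bandwidth-optimized pool and generates."""
    while handoff:
        req, kv = handoff.popleft()
        completed.append((req, f"tokens-from-{kv}"))

prefill_worker()
decode_worker()
print([r for r, _ in completed])  # → ['req-A', 'req-B', 'req-C']
```

A real system runs the two workers concurrently on separate hardware pools and can scale each pool independently, which is the "elastic scaling" and "avoidance of cross-phase interference" the snippet refers to.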

  • Disaggregated Inference [BETA] - AWS Neuron SDK

    Disaggregated Inference (DI), also known as disaggregated serving, disaggregated prefill, P/D disaggregation, is an LLM serving architecture that separates the prefill and decode phases of inference onto different hardware resources. To achieve this, the prefill worker needs to transfer the computed KV cache to the decode worker to resume decoding. Separating the compute intensive prefill phase from the memory bandwidth intensive decode phase can improve the LLM serving experience by [...] Disaggregated Inference avoids prefill stall because the decode workflow is never interrupted by a prefill as it receives KV caches asynchronously while decoding. The overall ITL on DI is affected by the transfer time of the KV cache but this does not scale with batch size. For example, in a continuous batching workload of batch size 8 each request will on average be interrupted 7 times whereas in DI each request is only affected by a single transfer since it happens asynchronously. [...] 1. Removing prefill interruptions to decode from continuous batching to reduce inter token latency (ITL). These gains can be used to achieve higher throughput by running with a higher decode batch size while staying under Service Level Objectives (SLO). 2. Adapt to changing traffic patterns while still remaining under application SLOs. 3. Enable independent scaling of resources and parallelism strategies for prefill (compute bound) and decode (memory bound). Note Automatic Prefix Caching is not supported with DI. ## High-Level Flow on Neuron Disaggregated Inference is mainly implemented through Neuron’s vLLM fork aws-neuron/upstreaming-to-vllm and the Neuron Runtime. There are three main components to a DI workflow.
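The batch-size-8 arithmetic in this snippet generalizes to a simple back-of-envelope model: under continuous batching each decoding request is stalled roughly once per prefill of each other batch member, while under DI it pays a single asynchronous KV-cache transfer regardless of batch size. A sketch (the linear model is an assumption for illustration, not Neuron's actual accounting):

```python
def interruptions_continuous_batching(batch_size):
    """Each request is stalled once by each of the other B-1 prefills."""
    return batch_size - 1

def interruptions_disaggregated(batch_size):
    """One asynchronous KV-cache transfer, independent of batch size."""
    return 1

# The gap widens linearly with decode batch size, which is why DI lets
# you raise the batch (for throughput) while staying under latency SLOs.
for b in (2, 8, 32):
    print(b, interruptions_continuous_batching(b),
          interruptions_disaggregated(b))
```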

  • Introducing Disaggregated Inference on AWS powered by llm-d

    With SageMaker HyperPod’s observability dashboards, you can monitor key metrics during inference time such as GPU utilization, EFA metrics, and error counts to proactively monitor and optimize your inference workloads. ## Best Practices Disaggregated inference allows you to scale your prefill nodes separately from your decode nodes, allowing you to tune performance for your workloads. For example, larger input sequence lengths with short output sequence lengths are a prefill-heavy workload. Disaggregated inference allows you to scale your prefill pods to handle more requests efficiently without an increase in cost. It is not for all workloads, however. You can try it with larger models, longer input sequences, and sparse MoE architectures. [...] LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute bound. It processes the entire input prompt in parallel to generate the initial set of key-value (KV) cache entries. The decode phase is memory bound. It autoregressively generates one token at a time while requiring substantial memory bandwidth to access model weights and the ever-growing KV cache. Adding to this complexity, inference requests vary widely in computational requirements based on input and output length, making efficient resource utilization particularly challenging. [...] We are announcing a joint effort with the llm-d team to bring powerful disaggregated inference capabilities to AWS so that customers can boost performance, maximize GPU utilization, and improve costs for serving large-scale inference workloads.
This launch is the result of several months of close collaboration with the llm-d community to deliver a new container `ghcr.io/llm-d/llm-d-aws` that includes libraries that are specific to AWS, such as Elastic Fabric Adapter (EFA) and libfabric, along with integration of llm-d with the NIXL library to support critical features such as multi-node disaggregated inference and expert parallelism. We have also conducted extensive benchmarking through multiple iterations to arrive at a stable release that allows customers to access these powerful
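The prefill-heavy sizing argument in the snippet above can be made concrete with a rough rate-matching calculation. All throughput numbers below are invented for illustration; the point is only that long prompts with short outputs push the pod ratio toward prefill:

```python
import math

def pod_counts(req_rate, prompt_len, output_len,
               prefill_tok_per_s, decode_tok_per_s):
    """Estimate how many prefill vs decode pods keep the pools rate-matched.

    req_rate: requests per second; *_tok_per_s: per-pod throughputs."""
    prefill_load = req_rate * prompt_len   # prompt tokens/s to be prefilled
    decode_load = req_rate * output_len    # tokens/s to be generated
    return (math.ceil(prefill_load / prefill_tok_per_s),
            math.ceil(decode_load / decode_tok_per_s))

# Prefill-heavy workload: long prompts (8k tokens), short answers (200).
print(pod_counts(req_rate=10, prompt_len=8000, output_len=200,
                 prefill_tok_per_s=20000, decode_tok_per_s=2000))  # → (4, 1)
```

A monolithic deployment would have to provision every node for the peak of both loads; disaggregation lets the two pool sizes track their own loads independently.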

  • Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

    As inference scales to multi-node deployments, disaggregation—splitting inference into distinct phases—offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity. [...] In contrast, disaggregated inference serving [3, 4, 5, 6, 7] decouples the prefill and decode phases, allowing each to run on a separate model instance — potentially across different GPUs. This separation enables each phase to independently adopt model partitioning and batching strategies tailored to its performance targets. Moreover, it eliminates artificial slowdowns in prefill caused by strict TTL service-level agreements, as seen in piggybacking. Figure 2 illustrates these modes of serving. As demonstrated in Figure 1, disaggregation is not a universal solution. In the following sections, we examine the performance benefits of disaggregated inference serving across a broad design space. Appendix A summarizes the key metrics referenced throughout the paper.
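The throughput-interactivity trade-off this paper navigates can be illustrated with a toy latency model (all constants are invented): enlarging the decode batch raises aggregate tokens/s but also raises inter-token latency, tracing out the kind of Pareto frontier that disaggregation tries to improve.

```python
def decode_point(batch_size, step_time_ms=20.0, per_seq_overhead_ms=1.5):
    """Toy model: per-step latency grows linearly with batch size."""
    itl_ms = step_time_ms + per_seq_overhead_ms * batch_size  # latency/token
    throughput = batch_size * 1000.0 / itl_ms                 # tok/s, batch-wide
    return itl_ms, throughput

# Sweeping the decode batch shows both numbers rising together: you buy
# throughput by paying interactivity, and vice versa.
for b in (1, 8, 32, 128):
    itl, tput = decode_point(b)
    print(f"batch={b:4d}  ITL={itl:6.1f} ms  throughput={tput:8.1f} tok/s")
```

Disaggregation shifts this curve by letting decode run at a large, steady batch while prefill is absorbed elsewhere, which is why the paper finds rate matching between the two pools so critical.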