Agentic AI scaling requires new memory architecture

Agentic AI represents a distinct evolution from stateless chatbots toward complex workflows, and scaling it requires new memory architecture.

As foundation models scale toward trillions of parameters and context windows reach millions of tokens, the computational cost of remembering history is rising faster than the ability to process it.

Organisations deploying these systems now face a bottleneck: the sheer volume of “long-term memory”, held technically as the Key-Value (KV) cache, overwhelms existing hardware architectures.

Current infrastructure forces a binary choice: store inference context in scarce, high-bandwidth GPU memory (HBM) or relegate it to slow, general-purpose storage. The former is prohibitively expensive for large contexts; the latter creates latency that renders real-time agentic interactions unviable.

To close this widening gap, which is holding back the scaling of agentic AI, NVIDIA has introduced the Inference Context Memory Storage (ICMS) platform within its Rubin architecture, proposing a new storage tier designed specifically for the ephemeral, high-velocity nature of AI memory.

“AI is revolutionising the entire computing stack—and now, storage,” said NVIDIA CEO Jensen Huang. “AI is no longer about one-shot chatbots but intelligent collaborators that understand the physical world, reason over long horizons, stay grounded in facts, use tools to do real work, and retain both short- and long-term memory.”

The operational challenge lies in the specific behaviour of transformer-based models. To avoid recomputing an entire conversation history for every new word generated, models store previous states in the KV cache. In agentic workflows, this cache acts as persistent memory across tools and sessions, growing linearly with sequence length.
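
To make that growth concrete, the back-of-envelope sketch below estimates KV cache size for an illustrative decoder; the model dimensions are assumptions chosen for illustration, not figures tied to any particular NVIDIA model.

```python
# Back-of-envelope estimate of KV cache size for a transformer decoder.
# All model dimensions below are illustrative assumptions; the point is
# the linear growth of the cache with sequence length.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Keys and values are stored per layer, per head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:,.1f} GiB of KV cache")
```

Even with these modest assumptions, a million-token context runs to hundreds of gigabytes of cache per request, far beyond what a single GPU's HBM can hold.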

This creates a distinct data class. Unlike financial records or customer logs, KV cache is derived data; it is essential for immediate performance but does not require the heavy durability guarantees of enterprise file systems. General-purpose storage stacks, running on standard CPUs, expend energy on metadata management and replication that agentic workloads do not require.

The current hierarchy, spanning from GPU HBM (G1) to shared storage (G4), is becoming inefficient:

[Diagram: NVIDIA’s memory hierarchy, from G1 (GPU HBM) to G4 (shared storage). Credit: NVIDIA]

As context spills from the GPU (G1) to system RAM (G2) and eventually to shared storage (G4), efficiency plummets. Moving active context to the G4 tier introduces millisecond-level latency and increases the power cost per token, leaving expensive GPUs idle while they await data.
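
The rough calculation below illustrates why placement matters so much; the per-tier bandwidth figures are round-number assumptions for illustration, not measured values.

```python
# Rough illustration of tier placement: time to move a 100 GiB context
# at bandwidths loosely representative of each tier. The bandwidth
# figures are round-number assumptions, not benchmarks.

context_gib = 100
tiers = {
    "G1 GPU HBM":        8_000,  # GiB/s, order of magnitude only
    "G2 system RAM":       400,
    "G3 local NVMe":        25,
    "G4 shared storage":     5,
}

for tier, bandwidth_gib_s in tiers.items():
    ms = context_gib / bandwidth_gib_s * 1_000
    print(f"{tier:<18} ~{ms:>8.1f} ms to move {context_gib} GiB")
```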

For the enterprise, this manifests as a bloated Total Cost of Ownership (TCO), where power is wasted on infrastructure overhead rather than active reasoning.

A new memory tier for the AI factory

The industry response involves inserting a purpose-built layer into this hierarchy. The ICMS platform establishes a “G3.5” tier—an Ethernet-attached flash layer designed explicitly for gigascale inference.

This approach integrates storage directly into the compute pod. By utilising the NVIDIA BlueField-4 data processor, the platform offloads the management of this context data from the host CPU. The system provides petabytes of shared capacity per pod, boosting the scaling of agentic AI by allowing agents to retain massive amounts of history without occupying expensive HBM.

The operational benefit is quantifiable in throughput and energy. By keeping relevant context in this intermediate tier – which is faster than standard storage, but cheaper than HBM – the system can “prestage” memory back to the GPU before it is needed. This reduces the idle time of the GPU decoder, enabling up to 5x higher tokens-per-second (TPS) for long-context workloads.
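
A hypothetical sketch of the prestaging idea follows; the ContextTier class and its methods are invented names for illustration and do not reflect NVIDIA’s actual interfaces.

```python
# Hypothetical sketch of "prestaging": while the GPU decodes the current
# request, the next request's KV blocks are fetched from the context tier
# in the background so the decoder never waits on storage.
# ContextTier and its methods are invented for illustration.

import asyncio

class ContextTier:
    """Stand-in for an Ethernet-attached flash tier holding KV blocks."""
    async def fetch(self, request_id: str) -> bytes:
        await asyncio.sleep(0.05)          # simulated storage latency
        return b"kv-blocks-for-" + request_id.encode()

async def decode(request_id: str, kv_blocks: bytes) -> str:
    await asyncio.sleep(0.10)              # simulated GPU decode time
    return f"tokens for {request_id}"

async def serve(requests: list[str]) -> None:
    tier = ContextTier()
    prefetch = asyncio.create_task(tier.fetch(requests[0]))
    for i, req in enumerate(requests):
        kv = await prefetch                # already in flight: little or no stall
        if i + 1 < len(requests):          # prestage the next request's context
            prefetch = asyncio.create_task(tier.fetch(requests[i + 1]))
        print(await decode(req, kv))

asyncio.run(serve(["req-1", "req-2", "req-3"]))
```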

From an energy perspective, the implications are equally measurable. Because the architecture removes the overhead of general-purpose storage protocols, it delivers 5x better power efficiency than traditional methods.

Integrating the data plane

Implementing this architecture requires a change in how IT teams view storage networking. The ICMS platform relies on NVIDIA Spectrum-X Ethernet to provide the high-bandwidth, low-jitter connectivity required to treat flash storage almost as if it were local memory.

For enterprise infrastructure teams, the integration point is the orchestration layer. Frameworks such as NVIDIA Dynamo and the Inference Transfer Library (NIXL) manage the movement of KV blocks between tiers.

These tools coordinate with the storage layer to ensure that the correct context is loaded into the GPU memory (G1) or host memory (G2) exactly when the AI model requires it. The NVIDIA DOCA framework further supports this by providing a KV communication layer that treats context cache as a first-class resource.
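
The placement decision might look something like the hypothetical sketch below; the KVBlock structure and the thresholds are invented for illustration and are not the Dynamo or NIXL APIs.

```python
# Hypothetical tier-placement policy for KV blocks. The tier names mirror
# the hierarchy described in the article; the data structure and the
# thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class KVBlock:
    request_id: str
    size_gib: float
    seconds_since_last_use: float

def place(block: KVBlock) -> str:
    """Hot, small context stays close to the GPU; colder context drops to
    the shared context tier rather than durable G4 storage."""
    if block.seconds_since_last_use < 1 and block.size_gib <= 8:
        return "G1 (GPU HBM)"
    if block.seconds_since_last_use < 30:
        return "G2 (host memory)"
    return "G3.5 (context memory tier)"

for blk in [KVBlock("agent-7", 4.0, 0.2),
            KVBlock("agent-7", 4.0, 12.0),
            KVBlock("agent-3", 40.0, 600.0)]:
    print(blk.request_id, "->", place(blk))
```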

Major storage vendors are already aligning with this architecture. Companies including AIC, Cloudian, DDN, Dell Technologies, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA are building platforms with BlueField-4. These solutions are expected to be available in the second half of this year.

Redefining infrastructure for scaling agentic AI

Adopting a dedicated context memory tier impacts capacity planning and datacentre design.

  • Reclassifying data: CIOs must recognise KV cache as a unique data type. It is “ephemeral but latency-sensitive,” distinct from “durable and cold” compliance data. The G3.5 tier handles the former, allowing durable G4 storage to focus on long-term logs and artifacts (a minimal sketch of this split follows the list).
  • Orchestration maturity: Success depends on software that can intelligently place workloads. The system uses topology-aware orchestration (via NVIDIA Grove) to place jobs near their cached context, minimising data movement across the fabric.
  • Power density: By fitting more usable capacity into the same rack footprint, organisations can extend the life of existing facilities. However, this increases the density of compute per square metre, requiring adequate cooling and power distribution planning.
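
As a minimal sketch of the reclassification point above, the following contrasts the two data classes; the policy values are assumptions for illustration, not NVIDIA guidance.

```python
# Illustrative split between ephemeral-but-latency-sensitive KV cache and
# durable, cold compliance data. Tier names and values are assumptions.

from dataclasses import dataclass

@dataclass
class StoragePolicy:
    tier: str
    replicated: bool          # durability guarantees
    target_latency_ms: float  # acceptable access latency

POLICIES = {
    "kv_cache": StoragePolicy(
        tier="G3.5 context memory", replicated=False, target_latency_ms=1.0),
    "compliance_logs": StoragePolicy(
        tier="G4 shared storage", replicated=True, target_latency_ms=100.0),
}

print(POLICIES["kv_cache"])
print(POLICIES["compliance_logs"])
```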

The transition to agentic AI forces a physical reconfiguration of the datacentre. The prevailing model of separating compute completely from slow, persistent storage is incompatible with the real-time retrieval needs of agents with photographic memories.

By introducing a specialised context tier, enterprises can decouple the growth of model memory from the cost of GPU HBM. The architecture allows multiple agents to share a massive, low-power memory pool, reducing the cost of serving complex queries while sustaining the high-throughput reasoning that scaling agentic AI demands.

As organisations plan their next cycle of infrastructure investment, evaluating the efficiency of the memory hierarchy will be as vital as selecting the GPU itself.

