Introducing MinIO MemKV: Purpose built Context Store for Inference at scale

MinIO MemKV logo over a digital 3D graph with blue and teal lines and binary digits.

Introducing MinIO MemKV: Purpose-Built Context Store for Inference at Scale

GPU memory is the fastest and most expensive resource in an AI data center. It's also the smallest. When a production inference cluster runs out of it, and it always does, the context gets discarded. There is no shared tier to save it to. The GPU that picks up the next request starts over from scratch. This leads to wasted GPU cycles, wasting previous compute cycles and power.

Until now, that was the only option. Either pin sessions to individual GPUs and hope the context is still in memory, or discard it and recompute. Both paths break at scale. Pinning kills scheduling flexibility. Discarding triggers full prefill recomputation: every token of prior context, rebuilt as if the work was never done. That recomputation, in a typical production deployment, consumes more than half of all GPU cycles. Not because the GPUs are slow. Because there was never a place to put evicted context where another GPU could retrieve it at the speed inference demands.

Today we're launching MinIO MemKV. It's a purpose-built context memory store that turns petabytes of NVMe flash into a shared memory tier the GPU pod can access at microsecond speed over RDMA. No file system in the path. No protocol translation. No storage servers between the GPU and its context. The data moves from NVMe to GPU memory as if the drives were local, except they're shared across the entire cluster and they scale to petabytes.

The engineering team built it because we'd been hearing the same thing from every customer: the most expensive hardware in the data center was burning money rebuilding work it already did. MemKV is the fix.

The GPU Utilization Myth

Most infrastructure teams look at GPU utilization as their primary health metric for inference. If utilization is above 90%, the cluster is working. The investment is justified. Move on to the next problem.

That assumption is wrong, and it's costing enterprises millions of dollars a year.

High utilization only means the GPUs are busy. It says nothing about whether they're busy doing useful work. Pull up any GPU utilization dashboard in a production inference cluster and nothing in it distinguishes a GPU generating new tokens from a GPU recomputing context it already generated last session. Both show up as utilization. Only one is producing value. In a typical production deployment, more than half of those busy cycles are prefill recomputation: the GPU rebuilding context that was already computed and already paid for, because it was evicted from memory since the last time it was needed.

The industry calls this the recompute tax. Here's how it works.

When an AI model processes a request, it builds context: a running record of everything it has seen, computed, and reasoned about so far. That context lives in GPU memory as a KV cache. GPU memory is small and expensive. When it fills up, or when the workload gets routed to a different node, context gets evicted. The next request that needs that context triggers a full prefill recomputation. At 128K tokens, that's roughly 40 GB of state, rebuilt from scratch.

At the GPU density that hyperscalers and neoclouds are running today, this isn't a minor inefficiency. It's the reason inference clusters are paying for a fleet of GPUs and getting useful output from roughly half of them.

The math is straightforward. Take a typical enterprise inference cluster of 128 GPUs. Context windows of 64K to 128K tokens, where a single 128K conversation consumes roughly 40 GB of KV cache. Cache miss rates routinely above 50% in production without a shared context tier. Cloud GPU rates of $2 to $4 per hour. Multiply it out and the recompute tax costs up to $2 million a year. That's not a projection. That's arithmetic. Two million dollars a year, spent on GPUs doing work they've already done, producing nothing new.

Eliminating Context Loss

The recompute tax doesn't just waste money. It caps concurrency. Every GPU cycle spent recomputing is a cycle unavailable for the next user. As context windows grow and agentic workflows chain more steps together, the tax grows quadratically while the GPU budget grows linearly. At scale, that's structural drag.

We ran benchmarks on Llama 3.1 70B at production concurrency. Without MemKV, time to first token at 64K context was 53 seconds. Fifty-three seconds of a GPU doing nothing but rebuilding context it already had.

With MemKV, that same request produced its first token in 703 milliseconds.

Then we pushed it further. At 128K token context lengths, the system without MemKV didn't slow down. It stopped. Out of memory. The GPUs physically couldn't hold enough context to serve more than a single user. MemKV kept serving. From a single user to over a hundred, the experience was the same.

That's the result that matters more than any single number. This isn't a performance optimization. It's a capability that didn't exist before. Without MemKV, these workloads are impossible. With it, they're routine.

Breaking the Speed-Scale Tradeoff

If the recompute tax is the problem, why hasn't anyone solved it already?

Because until recently, the infrastructure stack didn't have a place to put the solution. GPU memory at the top of the hierarchy is fast but tiny. Shared storage at the bottom scales but introduces latency that stalls inference. And nothing in between was designed for the workload.

Local NVMe offload (G3) helps on a single server. Orchestration can spill KV cache to direct-attached SSDs, and that reduces recomputation on that node. But context stays stranded on that machine. When the data routes a workload to a different node, the context doesn't follow, and the GPU recomputes. You've solved the capacity problem for one server and created a scheduling problem for the cluster.

Shared storage (G4) has the capacity and the namespace. But it's built for durable data with enterprise data services: erasure coding across racks, ACID consistency, encryption, lifecycle management. That overhead exists for good reason when you're storing training data or model weights that need to survive for years. It's wasted on ephemeral context that lives for hours and gets accessed in 2 to 16 MB throughput-oriented blocks. Millisecond latency that's fine for loading a model is unacceptable for retrieving the KV cache that a GPU needs right now.

G3.5 closes that gap. A new tier in the memory hierarchy, purpose-built for context memory, combining microsecond access with shared petabyte-scale capacity. MemKV was built from scratch for G3.5 and nothing else.

Chart showing storage hierarchy from fastest smallest GPU HBM to slowest largest shared network storage with details.

Purpose-Built for Inference at Scale

Every storage vendor in the G3.5 ecosystem is claiming context memory support. You've seen the announcements. What most of them are actually doing is taking a platform built for durable file or object data and extending it into the inference data path. The data still travels through protocol nodes, metadata services, and file system translation layers designed for a fundamentally different workload. x86 storage servers still sit alongside the inference infrastructure, consuming power and rack space. Enterprise durability, replication, and heavyweight data services still apply to ephemeral context that doesn't need any of it.

You wouldn't run a database on a tape library just because the tape library added a SQL interface. Same logic.

MemKV doesn't carry that overhead because it was never built to:

  • ARM64-native single binary, embedded directly in the G3.5 storage tier
  • RDMA data path from NVMe to GPU memory, bypassing kernel stacks and protocol translation entirely
  • Throughput-oriented 2 to 16 MB block sizes that match how GPUs consume KV cache at inference time, not the 4 KB blocks legacy storage was designed around
  • Flash-native KV cache, not retrofitted file or object storage

That combination of choices, building for one tier on one class of hardware with one access pattern, is what produces the performance. Near wire speed on 800GbE fabric. Not because we optimized harder. Because we built with fewer things in the way. The benchmarks are a consequence of the architecture, not the other way around.

What This Means in Dollars

Take an enterprise running 128 GPUs serving an agentic coding assistant. Context windows are 128K tokens. Concurrency is in the hundreds of sessions. Without a shared context tier, every session that gets routed to a GPU that doesn't hold its context triggers a full prefill recomputation. At 50% cache miss rates, half the sessions are burning GPU compute on work that's already been done.

With MemKV, those cache misses become cache hits at microsecond retrieval cost. A petabyte of context memory running on standard NVMe drives costs under $80,000, less than the price of a single GPU. GPU utilization climbs from 50% to above 90% because the GPUs that were stalled on recomputation are now generating tokens. Cost per token drops 40 to 60% across the cluster. Every GPU recovered from recomputation is also a GPU you don't need to buy, power, or cool, and at 1,200 watts per modern AI GPU, the energy savings compound fast.

This isn't a storage story. This is a "how many fewer GPUs do I need to buy next quarter" story. The recompute tax is the dominant cost driver in production inference, and MemKV collapses it.

What's Next

Successful AI inference at scale requires making sure that no GPU resources are wasted, and only a purpose-built context store like MemKV can do that.

And the best part is that MemKV is available today. You can get access and learn more here.

Whether you're exploring AI-native object storage or planning your next deployment, we'd love to help.
Let's start a conversation or jump right in and try AIStor yourself.
Contact Us
Download AIStor
// Rich Text image lightbox