
What are the data storage requirements for AI?
Enterprise data storage requirements for AI encompass the critical capabilities that storage infrastructure must provide to support AI workloads effectively—from model training and inference to data pipelines and MLOps. These requirements include high performance (sub-millisecond latency and 100+ Gbps throughput), exabyte-scale capacity within a single namespace, data durability through erasure coding and replication, always-on service resiliency, software-defined flexibility, full S3 API compatibility, and comprehensive security controls.
Choosing a data storage solution that meets all the requirements of AI may sound daunting, but it is not. At MinIO, we work with many customers building traditional AI, generative AI, and even agentic AI solutions. We have noticed seven core data storage requirements for all forms of AI. Customers who account for these requirements when planning data pipelines, model training, LLM fine-tuning, machine learning operations, document pipelines for generative AI, and even agentic workflows will have a foundation for all their AI initiatives.
Let’s examine these requirements more closely to define them further and match them to AIStor functionality.
Performance is a general term the industry uses to refer to the speed of an individual request (latency) and the amount of data a storage solution can move over a specified time frame (bandwidth). Unlike traditional NAS/SAN storage designed for general-purpose file access, AI workloads require sub-millisecond latency and sustained throughput exceeding 100 Gbps to keep GPUs fed with data. Low latency and high bandwidth are always the goal when building performant storage solutions.
Latency measures how quickly a storage solution receives and sends data for an individual request. It is usually measured in milliseconds or microseconds (with sub-millisecond response times being the target for AI workloads) and is a per-client measure. For example, if you code a data pipeline that uses the AIStor SDK to write data to AIStor, latency refers to how long an individual PUT request takes to complete.
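As a rough illustration of per-request latency, the sketch below times a callable repeatedly and reports the median and 99th-percentile values. The function names are hypothetical, and `fake_put` is only a stand-in for a real PUT made through a storage SDK; it does not talk to AIStor.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies_ms(request: Callable[[], None], samples: int = 100) -> List[float]:
    """Call `request` `samples` times and return each call's latency in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        request()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the median and the nearest-rank 99th percentile of the samples."""
    ordered = sorted(latencies_ms)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {"p50_ms": statistics.median(ordered), "p99_ms": ordered[p99_index]}


def fake_put() -> None:
    """Hypothetical stand-in for a single PUT request; replace with a real SDK call."""
    time.sleep(0.0005)


print(summarize(measure_latencies_ms(fake_put, samples=20)))
```

Tail latency (p99) is reported alongside the median because a pipeline that issues many requests tends to be held up by its slowest ones.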
Bandwidth (or throughput) measures the rate at which an entire storage system processes data (PUTs and GETs). Since it is a rate, it is measured in gigabits per second; AI factories typically require 100+ Gbps of aggregate throughput. Going back to our pipeline example, if your pipeline were running in a cluster with hundreds (or thousands) of pods sending requests simultaneously, bandwidth would be the total data moved by all those requests, measured in gigabits, over a one-second period.
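The aggregate nature of bandwidth can be sketched as simple arithmetic. The pod count and per-pod byte counts below are hypothetical numbers chosen only to make the unit conversion easy to follow.

```python
from typing import Iterable


def aggregate_gbps(bytes_per_client: Iterable[int], window_seconds: float) -> float:
    """Sum the bytes moved by every client in the window and convert to gigabits per second."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total_bits = sum(bytes_per_client) * 8
    return total_bits / window_seconds / 1e9


# Hypothetical load: 500 pods each move 250 MB during a one-second window.
pods = [250_000_000] * 500
print(aggregate_gbps(pods, 1.0))  # 1000.0
```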
AIStor is written in Go, a systems language that compiles to native machine code. In many places, AIStor is coded directly in SIMD-optimized assembly, because hand-tuned assembly can outperform compiler-generated code on hot paths. Go plus SIMD-optimized assembly keeps latency as low as possible. Performance benchmarks with a distributed AIStor setup delivered 349GB/sec read and 177.6GB/sec write throughput with a 32-node cluster. AIStor is also architected so that it does not need gateway translation, a metadata DB, or a back-end storage network.
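For a sense of what the 32-node figures mean per server, the back-of-the-envelope sketch below converts cluster throughput into per-node network demand. It assumes decimal gigabytes and a load spread evenly across nodes, neither of which the benchmark summary above confirms.

```python
def per_node_gbps(cluster_gb_per_sec: float, nodes: int) -> float:
    """Convert cluster throughput in GB/sec to the gigabits per second each node must carry."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return cluster_gb_per_sec * 8 / nodes


read_gbps = per_node_gbps(349.0, 32)
write_gbps = per_node_gbps(177.6, 32)
print(f"read: {read_gbps:.2f} Gbps per node, write: {write_gbps:.2f} Gbps per node")
# read: 87.25 Gbps per node, write: 44.40 Gbps per node
```

Under those assumptions, each node would need to sustain roughly 87 Gbps on reads, which is a useful starting point when sizing network interfaces.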
With the implementation optimizations mentioned above, AIStor can saturate networks slower than 400 Gbps. So, to help make the most efficient use of networking resources, MinIO partnered with NVIDIA to implement S3 over RDMA. Remote Direct Memory Access (RDMA) allows data to be moved directly between the memory of two systems, bypassing the CPU, operating system, and the TCP/IP stack, which reduces latency by up to 50% compared to traditional TCP-based transfers.
Additionally, GPUDirect Storage is available for customers investing in GPUs. GPUDirect Storage is NVIDIA's technology that provides a direct path to GPU memory from AIStor and eliminates the need to use CPU memory as a bounce buffer when sending data to GPUs. When used with S3 over RDMA, organizations can build AI solutions that efficiently use the network and deliver data directly to the GPU, offloading the CPU for other tasks.
A scalable storage solution can easily add capacity within a single namespace that meets an organization's needs without over-purchasing. Unlike traditional storage appliances that force capacity upgrades in discrete tiers, software-defined solutions allow granular scaling. Exabyte scale means supporting 1,000+ petabytes of data within a unified namespace.
Once an organization outgrows its initial tiered purchase, it is forced to discard its original purchase in favor of the next tier, even if the new tier is more than it needs. Furthermore, performance should not degrade as the solution is scaled, a property known as linear performance scalability.
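The cost of tiered purchasing can be illustrated with a small calculation. The tier sizes and the 100 TB increment below are hypothetical values, not figures from any vendor's price list.

```python
import math
from typing import Sequence


def tiered_purchase_tb(needed_tb: float, tiers_tb: Sequence[float]) -> float:
    """Return the smallest fixed tier that covers the required capacity."""
    for tier in sorted(tiers_tb):
        if tier >= needed_tb:
            return tier
    raise ValueError("required capacity exceeds the largest tier")


def granular_purchase_tb(needed_tb: float, increment_tb: float) -> float:
    """Round the required capacity up to the next purchasable increment."""
    return math.ceil(needed_tb / increment_tb) * increment_tb


TIERS_TB = [500, 1000, 2000]  # hypothetical appliance tiers
INCREMENT_TB = 100            # hypothetical capacity added per server
need_tb = 1100

tiered = tiered_purchase_tb(need_tb, TIERS_TB)
granular = granular_purchase_tb(need_tb, INCREMENT_TB)
print(tiered - need_tb, granular - need_tb)  # 900 0
```

In this made-up scenario, the tiered model leaves 900 TB of purchased but unneeded capacity, while the granular model leaves none.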
AIStor’s performance scales linearly from 100s of TBs to 100s of PBs and beyond. Performance benchmarks with a distributed AIStor setup delivered 46.54GB/sec average read throughput (GET) and 34.4GB/sec write throughput (PUT) with an 8-node cluster. When AIStor was scaled to a 32-node cluster, it delivered 349GB/sec read and 177.6GB/sec write throughput. Consequently, we have customers using us at an exabyte scale within a single namespace.
MinIO’s Datapod blueprint is a comprehensive guide for building a data infrastructure with AIStor to support exascale AI/ML workloads.
Durability is the ability of a system to preserve data in the face of failures. Failures can be local or can be complete disasters that take an entire data center down. For local durability, a storage solution should make multiple copies of each object on separate drives within a cluster or deployment. When local failures occur, like the failure of a pod or drive within a cluster, a storage system should be able to self-heal by detecting the failure, making copies of the data lost due to the failure, and bringing the system back to a healthy state. To provide durability in the event of a complete site disaster, storage systems should replicate data as it is written to a geographically separate data center. Should a disaster occur, the system should automatically fail over to the working site.
AIStor implements erasure coding to provide durability during drive or node-level failures. Erasure coding is a data protection method that breaks data into fragments, expands and encodes it with redundant pieces, and stores the fragments across multiple drives, allowing data recovery even when multiple drives fail simultaneously. For disaster recovery, AIStor’s active-active replication can be used to replicate all data and configuration to two or more sites. AIStor supports two flavors of active-active replication. Synchronous replication ensures that all sites have successfully saved an object before a write operation is acknowledged as successful. Asynchronous replication writes an object to one site, queues the object for replication to the other sites, and returns an HTTP 200 code without waiting for those sites.
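The difference between the two replication flavors comes down to when the write is acknowledged. The following is a minimal in-memory sketch of that ordering only; the `Site` and `AsyncReplicator` classes are hypothetical and do not reflect how AIStor implements replication.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Site:
    """Hypothetical in-memory stand-in for one storage site."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def put_synchronous(sites: List[Site], key: str, data: bytes) -> int:
    """Acknowledge the write only after every site has stored the object."""
    for site in sites:
        site.save(key, data)
    return 200


class AsyncReplicator:
    """Acknowledge after the local write; remote sites catch up from a queue."""

    def __init__(self, local: Site, remotes: List[Site]) -> None:
        self.local = local
        self.remotes = remotes
        self.queue: Deque[Tuple[str, bytes]] = deque()

    def put(self, key: str, data: bytes) -> int:
        self.local.save(key, data)
        self.queue.append((key, data))  # replication happens later
        return 200

    def drain(self) -> None:
        """Replicate every queued object to the remote sites."""
        while self.queue:
            key, data = self.queue.popleft()
            for remote in self.remotes:
                remote.save(key, data)


east, west = Site("east"), Site("west")
replicator = AsyncReplicator(east, [west])
replicator.put("model.ckpt", b"weights")
print("model.ckpt" in west.objects)  # False
replicator.drain()
print("model.ckpt" in west.objects)  # True
```

The trade-off this ordering implies is that synchronous writes wait on the slowest site, while asynchronous writes return sooner but leave a window in which remote sites lag behind.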
A resilient storage service is always on. Today, the best way to achieve this is to use a Kubernetes-like cluster and a binary optimized to run within these clusters. (This provides the added advantage of making your solution cloud native and portable to other cloud native environments.) If a pod crashes, it is automatically restarted. In the event of a node failure, affected pods are moved to healthy nodes. Services in these environments should self-heal after failures, automatically fixing data.
AIStor is cloud-native and built for Kubernetes. It uses erasure coding to split each object into data and parity shards and spread them across different drives associated with the pods running the AIStor binary. If a failure occurs and a pod dies, the rest of the cluster works to bring the cluster back to its original healthy state by reconstructing the lost shards from the surviving ones.
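To make the self-healing idea concrete, here is a deliberately simplified sketch that uses a single XOR parity shard. This is a stand-in technique that tolerates only one lost shard; production erasure codes (commonly Reed-Solomon) tolerate several, and nothing here represents AIStor's actual encoding. All function names are hypothetical.

```python
from typing import List, Optional, Sequence


def xor_all(shards: Sequence[bytes]) -> bytes:
    """XOR equal-length shards together byte by byte."""
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, value in enumerate(shard):
            out[i] ^= value
    return bytes(out)


def split_with_parity(data: bytes, data_shards: int) -> List[bytes]:
    """Split data into equal-length shards (zero padded) and append one parity shard."""
    shard_len = -(-len(data) // data_shards)  # ceiling division
    padded = data.ljust(shard_len * data_shards, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(data_shards)]
    shards.append(xor_all(shards))
    return shards


def heal(shards: Sequence[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) from the survivors."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can rebuild only one lost shard")
    healed = list(shards)
    if missing:
        survivors = [shard for shard in shards if shard is not None]
        healed[missing[0]] = xor_all(survivors)
    return healed


data = b"training-batch-0001"
shards = split_with_parity(data, data_shards=4)
damaged = list(shards)
damaged[2] = None  # simulate a failed drive
restored = heal(damaged)
print(b"".join(restored[:4])[:len(data)] == data)  # True
```

The recovery works because XOR-ing every surviving shard, parity included, cancels out everything except the missing shard.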
For customers who prefer bare metal installations, AIStor can be installed on multiple servers that use replication. If one server fails, the others can be used until the failure is mitigated.
A software-defined storage solution provides deployment options. Many enterprises operate under a tight budget and need a storage solution that can run on commodity hardware. Others have standardized on a specific clustering technology like Kubernetes or OpenShift—a software-defined storage solution can be deployed to all these platforms. Another advantage of software-defined storage is that it can also be installed on an engineer’s workstation, allowing engineers to prototype and interact with storage in the same way a production service will need to interact with data.
Software-defined storage is in direct contrast with an appliance-based solution, where you do not have hardware options. Additionally, increasing capacity often means throwing out an old appliance and replacing it with a larger and more expensive one. Finally, prototyping is difficult because an appliance cannot be installed on an engineer's workstation.
AIStor's software-defined nature can be considered part of its DNA. Every line of code written is tested on multiple deployment options: bare metal (Linux and Windows), Kubernetes, OpenShift, and engineering workstations.
When Amazon first introduced its Simple Storage Service (S3), it aimed to build a storage system for internet-scale data. In other words, it had to be fast, scalable, and resilient. Amazon started from scratch because POSIX-based storage solutions were built for desktops and lack the scalability and concurrent access patterns that AI workloads demand. Compared to POSIX-based systems that use hierarchical file paths and locking mechanisms, S3's flat namespace and RESTful API enable the massive parallelism essential for distributed AI training. Since then, S3 has become a standard that enables interoperability. S3-compliant storage can be used from distributed processing engines like those found in data lakehouse architectures, from distributed AI/ML frameworks that allow model training and inference to be distributed across a cluster, and from machine learning operations (MLOps) tools for tracking experiments and checkpointing models.
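The flat namespace can be pictured as a single map from full key to object, with folder-like views computed at list time from a prefix and a delimiter. The sketch below imitates that listing behavior in plain Python under that assumption; `list_objects` is a hypothetical helper, not an SDK call, and the keys are made up.

```python
from typing import Dict, List, Tuple


def list_objects(
    store: Dict[str, bytes], prefix: str = "", delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Return (keys, common_prefixes) for a flat key space, S3-listing style.

    Keys that contain the delimiter after the prefix are rolled up into a
    common prefix instead of being returned individually.
    """
    keys: List[str] = []
    common = set()
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common)


# Hypothetical bucket contents: there are no real directories, only keys.
bucket = {
    "datasets/README.md": b"",
    "datasets/train/part-0.parquet": b"",
    "datasets/train/part-1.parquet": b"",
    "datasets/val/part-0.parquet": b"",
    "models/ckpt-1.pt": b"",
}

print(list_objects(bucket, prefix="datasets/"))
# (['datasets/README.md'], ['datasets/train/', 'datasets/val/'])
```

Because every object is addressed by its full key, many clients can read and write different keys at once without coordinating on shared directory structures.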
AIStor was built as an object-native, fully S3-compliant storage solution from its inception. It is not a SAN/NAS storage solution with an S3 gateway bolted to it. It is S3 through and through. Our partner ecosystem spans solutions that range from data lakehouses for business intelligence and data analytics to AI/ML tooling for distributed training, inference, and MLOps.
Security in storage systems is an important requirement for AI workloads due to the sensitive nature of the data used to train models. For example, AI storage systems often process and store large volumes of personal information, proprietary business data, and intellectual property. Additionally, many enterprises are subject to regulatory requirements that dictate how data must be handled and protected. Inadequately secured data can lead to compliance violations, exposing companies to potential fines and legal actions.
AIStor supports data encryption at rest and in transit, securing the data from unauthorized access. AIStor's support for identity and access management (IAM) also enables organizations to control the data they store for AI workloads, ensuring that only authorized users or applications can access and modify it. These data protection mechanisms maintain the integrity and confidentiality of AI datasets throughout their lifecycle.
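As an illustration of access control, the sketch below builds a read-only policy document in the AWS-style JSON policy syntax that S3-compatible stores commonly accept. Whether a given deployment accepts this exact document is an assumption to verify; the bucket name, prefix, and helper function are hypothetical.

```python
import json
from typing import Any, Dict


def read_only_policy(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Build a policy that allows listing a bucket and reading objects under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


# Hypothetical bucket and prefix for a training job that should never write.
policy = read_only_policy("training-data", "curated/")
print(json.dumps(policy, indent=2))
```

Granting a training job read-only access to a single prefix is one way to keep curated datasets from being modified by the workloads that consume them.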
AI is changing how the industry manages data. Workloads are getting more complicated, GPUs are getting faster, and organizations are increasingly using faster networks. Accounting for the core storage requirements outlined in this post when planning AI initiatives will ensure performance, scalability, data durability, service resiliency, and security. Additionally, a software-defined design and S3 compatibility provide flexibility and interoperability, respectively.
What is the most important storage requirement for AI?
Performance is typically the most critical requirement, as slow storage creates GPU idle time that wastes expensive compute resources. However, all seven requirements work together—without durability, you risk losing training data; without scalability, you cannot grow with your AI initiatives.
How much storage do AI workloads typically need?
AI storage needs vary widely based on use case. Large language model training can require petabytes of training data, while enterprise AI applications may start with tens of terabytes. Organizations should plan for exabyte-scale capacity within a single namespace to accommodate growth.
Why is S3 compatibility important for AI storage?
S3 has become the de facto standard API for AI/ML frameworks, data lakehouses, and MLOps tools. Full S3 compatibility ensures your storage integrates seamlessly with tools like PyTorch, TensorFlow, Spark, and MLflow without custom connectors.
What makes AI storage requirements different from traditional storage?
AI workloads demand higher throughput (100+ Gbps vs. typical enterprise needs), larger scale (petabytes to exabytes), and direct GPU integration (GPUDirect Storage, RDMA) that traditional NAS/SAN architectures cannot efficiently provide.