7 Enterprise Data Storage Requirements for AI

Performance

Performance is a general term the industry uses to refer to the speed of an individual request (latency) and the amount of data a storage solution can process over a specified time frame (bandwidth). Low latency and high bandwidth are always the goal when building performant storage solutions.

Latency measures how quickly a storage solution receives or sends data for an individual request. It is usually measured in milliseconds or microseconds and is a per-client measure. For example, if you code a data pipeline that uses the AIStor SDK to write data to AIStor, latency refers to how long an individual PUT request takes to complete.

Bandwidth (or throughput) measures the rate at which an entire storage system processes data (PUTs and GETs). Since it is a rate, it is measured in units of data per second, typically gigabits per second (Gbps) or gigabytes per second (GB/sec). Going back to our pipeline example, if your pipeline was running in a cluster with hundreds (or thousands) of pods sending requests simultaneously, bandwidth is the total amount of data moved by all those requests over a 1-second period.
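To make the distinction concrete, here is a minimal Python sketch that derives both numbers from the same set of request timings. The `summarize` function and the sample timings are hypothetical illustrations, not part of any AIStor SDK; in practice the timings would come from your own client-side instrumentation.

```python
from statistics import median

def summarize(requests):
    """Summarize a list of (start_s, end_s, bytes_transferred) tuples.

    Latency is computed per request; throughput is computed across
    all requests over the wall-clock window they span.
    """
    latencies_ms = [(end - start) * 1000.0 for start, end, _ in requests]
    window_s = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    total_bits = sum(size for _, _, size in requests) * 8
    return {
        "median_latency_ms": median(latencies_ms),
        "max_latency_ms": max(latencies_ms),
        "throughput_gbps": total_bits / window_s / 1e9,  # gigabits per second
    }

# Hypothetical sample: four clients each complete a 125 MB PUT within the same second.
sample = [
    (0.00, 0.80, 125_000_000),
    (0.00, 0.90, 125_000_000),
    (0.10, 1.00, 125_000_000),
    (0.20, 1.00, 125_000_000),
]
stats = summarize(sample)
print(stats)
```

Each request in the sample takes 800 to 900 milliseconds (latency), while the four requests together move 500 MB, or 4 gigabits, in one second (bandwidth). Adding more clients can raise bandwidth without changing the latency any single client observes.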

AIStor is coded for low latency and architected for high bandwidth

AIStor is written in Go, a systems language that compiles to native machine code. In many places, AIStor is coded directly in SIMD-optimized assembly because hand-tuned assembly can outperform compiler-generated code on performance-critical paths. Go plus SIMD-optimized assembly keeps latency low. Performance benchmarks with a distributed AIStor setup delivered 349GB/sec read and 177.6GB/sec write throughput with a 32-node cluster. Consequently, we have customers using us at an exabyte scale within a single namespace. AIStor is also architected so that it does not need gateway translation, a metadata DB, or a back-end storage network.

With the implementation optimizations mentioned above, AIStor will saturate networks slower than 400 Gbps. So, to help make the most efficient use of networking resources, MinIO partnered with NVIDIA to implement S3 over RDMA. Remote Direct Memory Access (RDMA) allows data to be moved directly between the memory of two systems, bypassing the CPU, operating system, and the TCP/IP stack.

Additionally, GPUDirect Storage is available for customers investing in GPUs. GPUDirect Storage provides a direct path to GPU memory from AIStor and eliminates the need to use CPU memory as a bounce buffer when sending data to GPUs. When used with S3 over RDMA, organizations can build AI solutions that efficiently use the network and deliver data directly to the GPU, freeing the CPU for other tasks.

What Makes AIStor Fast

Single Layer

AIStor is a single-layer, object-only system. Multiple layers add latency and complexity.

No Metadata Database

Writing an object and its metadata together makes each operation a single, atomic write. Other vendors require multiple writes to different locations.

SIMD Acceleration

The core parts of MinIO are written in assembly language using SIMD extensions (e.g., AVX-512, NEON, VSX), which makes it very fast on commodity hardware.

Combinations of Go + GoAsm

Combining Go with assembly language, and targeting each to the task it suits, delivers faster-than-C performance.

Single Namespace Scalability

A scalable storage solution can easily add capacity within a single namespace that meets an organization's needs without over-purchasing. This is in contrast to an appliance-based solution, which delivers scale in discrete tiers. Once an organization outgrows its initial tiered purchase, it is forced to discard its original purchase in favor of the next tier, even if the new tier is more than it needs. Furthermore, performance should grow in proportion to the resources added rather than degrade as the solution is scaled, a property known as linear performance scalability.

AIStor is a scalable storage solution

AIStor’s performance scales linearly from 100s of TBs to 100s of PBs and beyond. Performance benchmarks with a distributed AIStor setup delivered 46.54GB/sec average read throughput (GET) and 34.4GB/sec write throughput (PUT) with an 8-node cluster. When AIStor was scaled to a 32-node cluster, it delivered 349GB/sec read and 177.6GB/sec write throughput. Consequently, we have customers using us at an exabyte scale within a single namespace.
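One way to check a linear-scaling claim is to compare per-node throughput at two cluster sizes. The sketch below does this in Python with the benchmark figures quoted above. The helper names are hypothetical, and the text does not say whether the two runs used the same hardware or networking, so the result is indicative rather than a controlled comparison.

```python
def per_node(throughput_gb_s, nodes):
    """Average throughput contributed by each node, in GB/sec."""
    return throughput_gb_s / nodes

def scaling_efficiency(small, large):
    """Ratio of per-node throughput at the larger size to the smaller size.

    small, large: (nodes, throughput_gb_s) tuples.
    1.0 means perfectly linear scaling; below 1.0 means per-node
    throughput degraded as the cluster grew.
    """
    return per_node(large[1], large[0]) / per_node(small[1], small[0])

# Figures quoted in the text: (nodes, GB/sec).
read_8, read_32 = (8, 46.54), (32, 349.0)
write_8, write_32 = (8, 34.4), (32, 177.6)

read_eff = scaling_efficiency(read_8, read_32)
write_eff = scaling_efficiency(write_8, write_32)
print(f"read efficiency:  {read_eff:.2f}")
print(f"write efficiency: {write_eff:.2f}")
```

For both reads and writes the ratio comes out above 1.0, meaning per-node throughput in the 32-node run was no lower than in the 8-node run.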

MinIO’s Datapod blueprint is a comprehensive guide for building a data infrastructure with AIStor to support exascale AI/ML workloads.

Data Durability

Durability is the ability of a system to preserve data in the face of failures. Failures could be local failures or complete disasters that result in an entire data center going down. For local durability, a storage solution should store each object redundantly on separate drives within a cluster or deployment. When local failures occur, like the failure of a pod or drive within a cluster, a storage system should be able to self-heal by detecting the failure, recreating the data lost due to the failure, and bringing the system back to a healthy state. To provide durability in the event of a complete site disaster, storage systems should replicate data as it is written to a geographically separate data center. Should a disaster occur, the system should automatically fail over to the working site.

AIStor is a resilient service for durable data

AIStor implements erasure coding to provide durability during drive or node-level failures. For disaster recovery, AIStor’s active-active replication can be used to replicate all data and configuration to two or more sites. AIStor supports two flavors of active-active replication. Synchronous replication ensures that all sites have successfully saved an object before a write operation is acknowledged as successful. Asynchronous replication writes an object to one site, queues the object for replication to the other sites, and returns an HTTP 200 status code without waiting for those sites.
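The difference between the two replication modes comes down to when the client receives its acknowledgment. The following Python sketch models that difference with in-memory dictionaries. It is a toy illustration under stated assumptions, not AIStor's implementation: the `Site` class, the function names, and the queue are all hypothetical.

```python
from collections import deque

class Site:
    """A hypothetical storage site holding objects in memory."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def save(self, key, data):
        self.objects[key] = data

def put_sync(sites, key, data):
    """Synchronous mode: acknowledge only after every site has stored the object."""
    for site in sites:
        site.save(key, data)
    return 200

def put_async(local, queue, key, data):
    """Asynchronous mode: store locally, queue for the other sites, acknowledge at once."""
    local.save(key, data)
    queue.append((key, data))
    return 200

def drain(queue, remotes):
    """Background replication: copy every queued object to the remote sites."""
    while queue:
        key, data = queue.popleft()
        for site in remotes:
            site.save(key, data)

site_a, site_b = Site("site-a"), Site("site-b")
pending = deque()

status = put_async(site_a, pending, "model.ckpt", b"weights")
replicated_before_drain = "model.ckpt" in site_b.objects  # not yet on site-b
drain(pending, [site_b])
replicated_after_drain = "model.ckpt" in site_b.objects   # now on site-b
```

The trade-off is visible in the sketch: the asynchronous write returns before the second site holds the object, so it is faster but leaves a window in which the object exists at only one site. The synchronous write closes that window at the cost of waiting for every site.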

Service Resiliency

A resilient storage service is always on. Today, the best way to achieve this is to use a Kubernetes-like cluster and a binary optimized to run within these clusters. (This provides the added advantage of making your solution cloud native and portable to other cloud native environments.) If a pod crashes, it is automatically restarted. In the event of a node failure, affected pods are moved to healthy nodes. Services in these environments should self-heal after failures, automatically fixing data.

AIStor is always on

AIStor is cloud-native and built for Kubernetes. It uses erasure coding to split each object into data and parity shards and spread those shards across different drives associated with the pods running the AIStor binary. If a failure occurs and a pod dies, the rest of the cluster will work to bring the cluster back to its original healthy state by reconstructing the lost shards from the surviving ones.
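Erasure coding protects data with parity rather than full duplicate copies, so a lost piece can be recomputed from the pieces that survive. The Python sketch below illustrates the idea with a single XOR parity shard, which can recover from the loss of any one shard. This is a deliberately simplified stand-in: production erasure coding typically uses Reed-Solomon codes that tolerate multiple simultaneous failures, and the function names here are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal-size shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for shard in shards[1:]:
        parity = xor_bytes(parity, shard)
    return shards + [parity]

def rebuild(shards):
    """Recover the single missing shard (marked None) by XOR-ing the survivors."""
    missing = shards.index(None)
    survivors = [s for s in shards if s is not None]
    recovered = survivors[0]
    for shard in survivors[1:]:
        recovered = xor_bytes(recovered, shard)
    healed = list(shards)
    healed[missing] = recovered
    return healed

original = b"training-data-batch-0001"
shards = encode(original, k=4)   # 4 data shards + 1 parity shard, one per drive
shards[2] = None                 # simulate losing the drive that held shard 2
healed = rebuild(shards)
restored = b"".join(healed[:4])[:len(original)]
```

With four data shards and one parity shard, the storage overhead is 25 percent, compared with 100 percent for keeping a second full copy.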

For customers who prefer bare metal installations, AIStor can be installed on multiple servers that use replication. If one server fails, another can be used until the failure is mitigated.

Software Defined

A software-defined storage solution provides deployment options. Many enterprises operate under a tight budget and need a storage solution that can run on commodity hardware. Others have standardized on a specific clustering technology like Kubernetes or OpenShift—a software-defined storage solution can be deployed to all these platforms. Another advantage of software-defined storage is that it can also be installed on an engineer’s workstation, allowing engineers to prototype and interact with storage in the same way a production service will need to interact with data.

Software-defined storage stands in direct contrast to an appliance-based solution, where you do not have hardware options. Additionally, increasing capacity often means throwing out an old appliance and replacing it with a larger and more expensive one. Finally, prototyping is difficult because an appliance cannot be installed on an engineer's workstation.

AIStor is Software Defined

AIStor's software-defined nature can be considered part of its DNA. Every line of code written is tested on multiple deployment options: bare metal (Linux and Windows), Kubernetes, OpenShift, and engineering workstations.

Full S3 Compatibility

When Amazon first introduced its Simple Storage Service (S3), it aimed to build a storage system for internet-scale data. In other words, it had to be fast, scalable, and resilient. Amazon started from scratch because POSIX-based storage solutions were built for desktops. S3 was designed for unlimited scalability and resiliency. Since then, S3 has become a standard that enables interoperability. An S3-compliant storage solution can be used by distributed processing engines like those found in data lakehouse architectures, distributed AI/ML frameworks that allow model training and inference to be distributed across a cluster, and machine learning operations (MLOps) tools for tracking experiments and checkpointing models.

AIStor and S3

AIStor was built as an object-native, fully S3-compliant storage solution from its inception. It is not a SAN/NAS storage solution with an S3 gateway bolted to it. It is S3 through and through. Consequently, we have customers using us at an exabyte scale. Our partner ecosystem spans solutions that range from data lakehouses for business intelligence and data analytics to AI/ML tooling for distributed training/inference and MLOps tooling.

Security

Security in storage systems is an important requirement for AI workloads due to the sensitive nature of the data used to train models. For example, AI storage systems often process and store large volumes of personal information, proprietary business data, and intellectual property. Additionally, many enterprises are subject to regulatory requirements that dictate how data must be handled and protected. Inadequately secured data can lead to compliance violations, exposing companies to potential fines and legal actions.

AIStor secures data at rest and in transit

AIStor supports data encryption at rest and in transit, securing the data from unauthorized access. AIStor's support for identity and access management (IAM) also lets organizations control access to the data stored for AI workloads, ensuring that only authorized users or applications can access and modify the data. These data protection mechanisms maintain the integrity and confidentiality of AI datasets throughout their lifecycle.
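As an illustration of access control, the Python sketch below builds a read-only policy document in the JSON policy grammar used by Amazon S3. It assumes the storage service accepts S3-style policies, which the text does not state explicitly; the function name and the `training-data` bucket are hypothetical.

```python
import json

def read_only_policy(bucket):
    """Build an S3-style policy that allows listing and reading one bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself (for listing)
                    f"arn:aws:s3:::{bucket}/*",    # every object in the bucket (for reads)
                ],
            }
        ],
    }

policy_json = json.dumps(read_only_policy("training-data"), indent=2)
print(policy_json)
```

A policy like this could be attached to the identity used by a training job so that the job can read its dataset but cannot modify or delete it.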