MinIO for Modern Datalakes

Modern datalakes and data lakehouses are built on modern object storage. That means they are built on MinIO.

MinIO offers a unified storage solution for modern datalakes and lakehouses that can run anywhere: private clouds, public clouds, colos, bare metal - even at the edge. It is fast, scalable, cloud-native and ready to go - all batteries included.

Open Table Format Ready

The modern datalake is multi-engine, and those engines (Spark, Flink, Trino, Arrow, Dask, etc.) all need to be tied into a cohesive architecture. The modern datalake has to deliver central table storage, portable compute, access control and persistent structure. That is where formats like Iceberg, Hudi and Delta Lake come into play. They are designed for the modern datalake and each is supported by MinIO. We might have an opinion on which one wins (you can always ask us…) but we are committed to supporting them until it doesn’t make sense (see Docker Swarm and Mesosphere).
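
As a rough illustration of how an engine might be tied to a table format on MinIO, the sketch below assembles the Spark settings for an Iceberg catalog whose warehouse lives in a bucket behind an S3-compatible endpoint. The endpoint URL, credentials, bucket name, catalog name and the helper function are all placeholders invented for this example; the property names are the standard Iceberg catalog and Hadoop S3A keys.

```python
def iceberg_on_minio_conf(endpoint, access_key, secret_key, bucket, catalog="lake"):
    """Build Spark settings for an Iceberg catalog on an S3-compatible endpoint.

    Hypothetical helper: every argument value used below is a placeholder.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Register an Iceberg catalog backed by a warehouse directory.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": f"s3a://{bucket}/warehouse",
        # Point the Hadoop S3A connector at the object store instead of AWS.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing (endpoint/bucket/key) suits self-hosted endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


conf = iceberg_on_minio_conf(
    endpoint="http://minio.example.internal:9000",  # assumed address
    access_key="EXAMPLE_KEY",                       # placeholder credential
    secret_key="EXAMPLE_SECRET",                    # placeholder credential
    bucket="datalake",                              # assumed bucket name
)
for key in sorted(conf):
    print(key)
```

Each key/value pair would typically be passed to `SparkSession.builder.config(key, value)`, with the Iceberg Spark runtime and `hadoop-aws` packages on the classpath. Hudi and Delta Lake use different catalog settings but the same S3A endpoint properties.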

Cloud Native

MinIO was born in the cloud and adheres to the principles of the cloud operating model - containerization, orchestration, microservices, APIs, infrastructure as code and automation. Because of this, the cloud native ecosystem “just works” with MinIO - from Spark to Presto/Trino, from Snowflake to Dremio, from NiFi to Kafka, from Prometheus to OpenObserve, from Istio to Linkerd and from HashiCorp Vault to Keycloak.

Don’t take our word for it - enter your favorite cloud-native technology and let Google provide you with the evidence.

Multi-Engine

MinIO supports every S3-compatible query engine, which is to say all of them. Don’t see one you use? Drop us a note and we will look into it.

Performant

The modern datalake demands a level of performance, and more importantly, performance at scale, that Hadoop could only dream of and old school object stores only fantasize about. MinIO has proven in multiple benchmarks that it is materially faster than Hadoop and the migration path is clearly documented. This means better performance for your query engines (Spark, Presto, Trino, Snowflake, Microsoft SQL Server, Teradata and more). This also includes your AI/ML platforms - from MLflow to Kubeflow.

We publish our benchmarks for the world to see and make them repeatable. See how we notched 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs in this post.
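
To make the units in those figures concrete, here is a small worked conversion between GiB/s (binary) and GB/s (decimal), along with a per-node average. The per-node number assumes the load is spread evenly across the 32 nodes, which is an assumption of this sketch rather than something the benchmark states.

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one gigabyte
NODES = 32    # node count quoted for the benchmark


def gib_to_gb(gib_per_s):
    """Convert a throughput from GiB/s to GB/s."""
    return gib_per_s * GIB / GB


for op, gib_per_s in (("GET", 325), ("PUT", 165)):
    aggregate_gb = gib_to_gb(gib_per_s)
    per_node_gib = gib_per_s / NODES  # assumes an even spread across nodes
    print(f"{op}: {aggregate_gb:.0f} GB/s aggregate, {per_node_gib:.2f} GiB/s per node")
```

The conversion reproduces the 349 GB/s and 177 GB/s figures quoted above and works out to roughly 10 GiB/s of reads and 5 GiB/s of writes per node.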

Lightweight

MinIO’s server binary is under 100 MB. Despite its size, it is powerful enough to run in the datacenter, yet still small enough to live comfortably at the edge. There is no such alternative in the Hadoop world. For enterprises, this means that S3 applications can access data anywhere, anytime, and with the same API. By deploying MinIO at edge locations and using its replication capability, we can capture and filter data at the edge and ship it to the mother cluster for aggregation and further analytics.
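
One way to picture "the same API" is that an S3 client for an edge site and one for the core cluster differ only in the endpoint they point at. The sketch below builds the keyword arguments an S3 SDK client would take for each site. The site names, URLs and credentials are invented for illustration, and the dictionary keys follow the argument names that `boto3.client("s3", ...)` accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """A hypothetical deployment location (names and URLs are placeholders)."""
    name: str
    endpoint_url: str


def s3_client_kwargs(site, access_key, secret_key):
    """Return keyword arguments for an S3 SDK client aimed at this site."""
    return {
        "endpoint_url": site.endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


edge = Site("edge", "https://minio.edge.example.internal:9000")
core = Site("core", "https://minio.core.example.internal:9000")

edge_kwargs = s3_client_kwargs(edge, "EXAMPLE_KEY", "EXAMPLE_SECRET")
core_kwargs = s3_client_kwargs(core, "EXAMPLE_KEY", "EXAMPLE_SECRET")

# Only the endpoint differs; application code using the client stays the same.
differing = [k for k in edge_kwargs if edge_kwargs[k] != core_kwargs[k]]
print(differing)
```

With an SDK installed, something like `boto3.client("s3", **edge_kwargs)` would give the application a client for the edge site, and swapping in `core_kwargs` would retarget it without other code changes.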

Disaggregated

The modern datalake extends the disaggregation seen in the Hadoop breakup. Modern datalakes have high speed query processing engines and they have high throughput storage. The modern datalake is far too large to fit into a database, so the data resides on the object store. This way, the database can focus on the query optimization function and outsource the storage functions to a high-speed object store. By keeping a subset of the data in memory and leveraging capabilities like predicate pushdown (S3 Select) and external tables, the query engine gains far more flexibility.
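
The effect of predicate pushdown can be shown with a toy model: the same filter is applied either by the engine after fetching a whole object, or by the storage side before anything is sent. This is a simulation in plain Python, not a call to S3 Select; the CSV content, column names and function names are made up for the example. With S3 Select the filter would be expressed as SQL along the lines of `SELECT * FROM S3Object s WHERE s.region = 'eu'`.

```python
import csv
import io

# A stand-in for a CSV object held in the object store (invented data).
OBJECT = "id,region,amount\n1,eu,10\n2,us,25\n3,eu,40\n4,ap,5\n"


def scan_without_pushdown(obj, predicate):
    """Storage returns the whole object; the engine filters afterwards."""
    transferred = len(obj.encode())
    rows = [row for row in csv.DictReader(io.StringIO(obj)) if predicate(row)]
    return rows, transferred


def scan_with_pushdown(obj, predicate):
    """Storage applies the predicate and returns only the matching rows."""
    reader = csv.DictReader(io.StringIO(obj))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if predicate(row):
            writer.writerow(row)
    payload = out.getvalue()  # this is all that crosses the network
    rows = list(csv.DictReader(io.StringIO(payload)))
    return rows, len(payload.encode())


def is_eu(row):
    return row["region"] == "eu"


full_rows, full_bytes = scan_without_pushdown(OBJECT, is_eu)
pushed_rows, pushed_bytes = scan_with_pushdown(OBJECT, is_eu)
print(full_rows == pushed_rows, full_bytes, pushed_bytes)
```

Both scans return the same rows, but the pushdown variant moves fewer bytes to the engine. On real datasets the saving depends on how selective the predicate is.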

Open Source

The enterprises that adopted Hadoop did so out of a preference for open source technologies. Enterprises want the datalake, as Hadoop’s logical successor, to be open source as well. That is why Iceberg has flourished and why Databricks open sourced Delta Lake.

The ability to inspect, the freedom from lock-in, and the comfort that comes from tens of thousands of users have real value. MinIO is also 100% open source, ensuring that organizations can stay true to their goals while investing in a modern datalake.

Hungry

Data is constantly getting generated - and that means it must constantly be ingested - without incurring indigestion. MinIO is built for this world and works out of the box with Kafka, Flink, RabbitMQ and a host of other solutions. The result is a datalake/datalakehouse that becomes the single source of truth and can expand seamlessly to EBs and beyond.

MinIO has multiple clients whose data ingest exceeds 250 PB a day.

Simple

Simplicity is hard. It takes work, discipline, and above all, commitment. MinIO’s simplicity is legendary and is the result of a philosophical commitment to making our software easy to deploy, use, upgrade, and scale. The modern datalake need not be complex. There are a handful of pieces and we are committed to ensuring that MinIO is the easiest to adopt and deploy.

ELT or ETL - It Just Works

It is not just that MinIO works with every data streaming protocol and every data pipeline, it is that every data streaming protocol and every data pipeline works with MinIO. Every vendor tests extensively and frequently so that data pipelines are resilient and performant.

Resilient

MinIO protects data with per-object, inline erasure coding, which is far more efficient than the HDFS alternatives, which arrived after replication and never gained adoption. In addition, MinIO’s bitrot detection ensures that it will never read corrupted data, capturing and healing corrupted objects on the fly. MinIO also supports cross-region, active-active replication. Finally, MinIO supports a complete object locking framework offering both Legal Hold and Retention (with Governance and Compliance modes).
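
The efficiency argument for erasure coding over replication comes down to simple arithmetic, and bitrot detection comes down to comparing a stored checksum with a freshly computed one. The sketch below illustrates both ideas. The 12 data + 4 parity layout is an illustrative choice, not a statement of MinIO’s defaults, and SHA-256 is used only as a convenient standard-library stand-in for whatever hash the server uses.

```python
import hashlib


def usable_fraction_erasure(data_shards, parity_shards):
    """Fraction of raw capacity that holds user data under erasure coding."""
    return data_shards / (data_shards + parity_shards)


def usable_fraction_replication(copies):
    """Fraction of raw capacity that holds user data with N full copies."""
    return 1 / copies


def checksum(shard):
    """Digest stored alongside a shard (SHA-256 as a stand-in hash)."""
    return hashlib.sha256(shard).hexdigest()


def find_corrupt_shards(shards, expected):
    """Return indexes of shards whose current digest differs from the stored one."""
    return [i for i, (s, d) in enumerate(zip(shards, expected)) if checksum(s) != d]


# Capacity comparison: 12 data + 4 parity shards versus 3 full replicas.
print(f"erasure coded: {usable_fraction_erasure(12, 4):.0%} usable")
print(f"3x replicated: {usable_fraction_replication(3):.0%} usable")

# Bitrot check: record digests at write time, flip one bit, then verify.
shards = [bytes([i]) * 8 for i in range(4)]
expected = [checksum(s) for s in shards]
damaged = list(shards)
damaged[2] = bytes([damaged[2][0] ^ 0x01]) + damaged[2][1:]  # simulate silent corruption
print(find_corrupt_shards(damaged, expected))
```

In this illustrative layout three quarters of the raw capacity is usable while any four shards can be lost, compared with one third usable for triple replication. A shard flagged by the checksum comparison would then be rebuilt from the remaining data and parity shards.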

Software Defined

Hadoop HDFS’ successor isn’t a hardware appliance, it is software running on commodity hardware. That is what MinIO is — software. Like Hadoop HDFS, MinIO is designed to take full advantage of commodity servers. With the ability to leverage NVMe drives and 100 GbE networking, MinIO can shrink the datacenter — improving operational efficiency and manageability. Indeed, companies that build replacement datalakes reduce their HW footprint by 60% or more, while improving performance and reducing the FTE required to manage it.

Secure

MinIO supports multiple, sophisticated server-side encryption schemes to protect data wherever it may be, in flight or at rest. MinIO’s approach assures confidentiality, integrity, and authenticity with negligible performance overhead. Server-side and client-side encryption are supported using AES-256-GCM, ChaCha20-Poly1305, and AES-CBC, ensuring application compatibility. Furthermore, MinIO supports industry-leading key management systems (KMS).
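
As one concrete example of a server-side scheme, the S3 API lets a client supply its own key with each request (SSE-C). The sketch below builds the three request headers that API defines. Choosing SSE-C here is an assumption made for illustration, since the text above does not name specific schemes, and the helper function and randomly generated key are placeholders.

```python
import base64
import hashlib
import os


def sse_c_headers(key):
    """Build the S3 SSE-C request headers for a client-supplied 256-bit key."""
    if len(key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    key_md5 = hashlib.md5(key).digest()  # lets the server check the key arrived intact
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode(),
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(key_md5).decode(),
    }


key = os.urandom(32)  # placeholder; a real key would come from a KMS or secret store
headers = sse_c_headers(key)
for name in headers:
    print(name)
```

Because the key travels in the request headers, such requests should only be sent over TLS, and the same key has to be supplied again to read the object back.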
