Every technology shift begins with real-world examples. For architects and engineers building or planning a data lakehouse, understanding what's working today and what isn't is essential. The idea of a lakehouse, combining the scale, flexibility, and affordability of data lakes with the performance and manageability of data warehouses, is compelling. But abstract concepts alone aren't enough to guide critical architecture decisions.
This page provides detailed, practical examples of organizations currently deploying open-source lakehouse stacks, including Apache Iceberg, Apache Hudi, StarRocks, Dremio, and more, to tackle specific, high-stakes data challenges. Many of these examples are Hadoop migration stories, detailing the substantial performance and cost benefits these organizations realized by moving to a data lakehouse.
Whether your priority is achieving sub-second analytics at petabyte scale, enabling real-time ingestion without compromising governance, or significantly cutting storage costs, these case studies will show you what's possible and how to get there.
WeChat’s data infrastructure, serving 1.3B users, was historically split between a Hadoop-based data lake and separate data warehouses for different use cases. This led to duplicated pipelines, heavy maintenance overhead, and governance issues from multiple copies of data. With some tables ingesting trillions of records per day, WeChat needed to drastically improve efficiency.
For the new architecture, they targeted sub-5-second query latencies and fresher data (seconds to minutes old) even on billion-record scans, and knew that HDFS couldn’t keep up. In 2023 WeChat redesigned its platform around an open lakehouse stack: Apache Iceberg as the table format (providing ACID transactions and schema evolution) and StarRocks as a high-concurrency MPP SQL engine for low-latency queries. As an iterative approach to infrastructure, they kept Apache Spark for large batch processing, but StarRocks now handles both real-time and interactive analytics on the same data.
The new architecture supports streaming ingestion: hot data flows into a small real-time warehouse and then into the Iceberg lake, where StarRocks can query “hot” and “cold” data together via federated queries. For near-real-time feeds, raw events land directly in Iceberg and StarRocks uses materialized views to continuously transform them – eliminating the old dual pipelines.
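As a rough illustration of this pattern, the sketch below registers an Iceberg catalog in StarRocks and defines an asynchronous materialized view over raw events. It is not WeChat's actual configuration: the hosts, catalog properties, and table names are hypothetical, and exact DDL options vary across StarRocks versions. Because StarRocks speaks the MySQL wire protocol, any MySQL-compatible client can submit this DDL.

```python
# Hypothetical sketch: expose an Iceberg lake to StarRocks and keep a
# continuously refreshed materialized view over raw events.
import pymysql  # StarRocks speaks the MySQL wire protocol

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***")
with conn.cursor() as cur:
    # Register the Iceberg lake ("cold" data) as an external catalog.
    # Property keys vary by StarRocks version and metastore type.
    cur.execute("""
        CREATE EXTERNAL CATALOG iceberg_lake
        PROPERTIES (
            "type" = "iceberg",
            "iceberg.catalog.type" = "hive",
            "hive.metastore.uris" = "thrift://metastore.example.com:9083"
        )
    """)
    # An async materialized view continuously transforms raw events,
    # replacing a separate batch transformation pipeline.
    cur.execute("""
        CREATE MATERIALIZED VIEW dw.events_hourly
        DISTRIBUTED BY HASH(event_hour)
        REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
        AS
        SELECT date_trunc('hour', event_time) AS event_hour,
               event_type,
               count(*) AS events
        FROM iceberg_lake.raw_db.events
        GROUP BY 1, 2
    """)
conn.close()
```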
By unifying all workloads on one open platform, WeChat achieved significant operational gains. The lakehouse halved the number of data engineering tasks needed day-to-day and reduced storage costs by over 65% by dropping duplicate datasets. Offline data workflows that once took hours were streamlined (development cycles shortened by ~2 hours). Query performance also improved dramatically, batch ingestion delays were removed, and even complex analytic queries now see sub-second latency in production.
Tencent Games (which operates dozens of popular games) found their data split into silos, making global analysis difficult. Originally, their game logs lived in HDFS, transactional data in MySQL/PostgreSQL, and real-time streams in Druid. In a tale as old as Hadoop, this architecture resulted in wasted storage (multiple copies, pre-aggregated tables) and inflexibility: any schema change meant reengineering the pipeline and backfilling data. As data volumes grew to trillions of events per day, the old architecture could not meet the required sub-second query responses or handle the ever-expanding storage footprint.
Tencent Games migrated to an Apache Iceberg-based lakehouse backed by object storage, with StarRocks as the unified SQL query engine and ingestion layer. Iceberg allowed them to keep a single copy of all game data in a petabyte-scale lake. StarRocks was deployed for its real-time analytics capabilities (vectorized execution, high concurrency, and the ability to handle mutable data) so that freshly ingested events and historical Iceberg data can be queried together via one engine.
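To make the "single copy in Iceberg" idea concrete, here is a minimal sketch of a Spark batch job appending game events to an Iceberg table on S3-compatible object storage, which a query engine like StarRocks can then read in place. It assumes the Iceberg Spark runtime is on the classpath; the bucket, catalog, schema, and column names are hypothetical.

```python
# Hypothetical sketch: append game events into a single Iceberg table on
# S3-compatible object storage; Spark writes it, other engines query it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("game-events-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://game-lake/warehouse")
    .getOrCreate()
)

# One copy of the data: a partitioned Iceberg table shared by every engine.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.game_events (
        event_time TIMESTAMP,
        game_id    STRING,
        player_id  STRING,
        event_type STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time), game_id)
""")

# Landing data is assumed to carry these fields; cast to match the table schema.
events = (
    spark.read.json("s3a://game-lake/landing/2024-01-01/")
         .selectExpr("CAST(event_time AS TIMESTAMP) AS event_time",
                     "game_id", "player_id", "event_type")
)
events.writeTo("lake.analytics.game_events").append()
```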
As a result of this change, Tencent Games reported a 15× reduction in storage costs after consolidating workloads on Iceberg (no more multiple copies). Their performance goals were achieved as well: they could sustain sub-second query latency at petabyte scale with second-level data freshness, even while ingesting trillions of new records per day into their data lake. In internal benchmarks, the Iceberg + StarRocks solution easily met Tencent’s requirements for both high throughput and low latency. In short, the company went from a rigid, batch-oriented system to a flexible open lakehouse that powers interactive analytics for all their games (and dramatically simplifies operations and governance in the process).
Walmart’s data lakes (spanning hundreds of thousands of cores on-prem and in the cloud) were hitting limits in performance and data freshness as the company aimed to modernize analytics. They faced typical big-data issues: duplicate data across various Hadoop and warehouse systems, consistency problems, and high latency for critical updates. Certain internal workloads could not meet business SLAs – for example, a large daily batch job (WL1) suffered from late-arriving data causing expensive reprocessing, and a streaming upsert pipeline (WL2) needed to reflect changes from an upstream Cassandra store with minimal delay. Walmart’s goal was to support near-real-time analytics and updates on their lake data (for use cases like inventory, supply chain, etc.) without sacrificing consistency.
After extensive testing, Apache Hudi emerged as the best fit for Walmart’s needs, particularly due to its focus on incremental updates which aligned well with Walmart’s requirement to handle late-arriving data and change streams efficiently. The Hudi adoption allowed them to build a unified pipeline for both their time-partitioned batch data and their CDC/upsert streams.
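The sketch below shows what such an upsert of a change batch looks like in PySpark with standard Hudi write options. It is illustrative only, not Walmart's pipeline: the table name, key fields, and paths are hypothetical. The precombine field is what lets late-arriving records be reconciled against the latest version of each key.

```python
# Hypothetical sketch: upsert a micro-batch of change records into a Hudi
# table. The precombine field ("updated_at") decides which version of a
# record wins when late-arriving data shows up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

changes = spark.read.parquet("s3a://lake/staging/orders_changes/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")   # append mode + upsert op updates existing keys in place
    .save("s3a://lake/warehouse/orders"))
```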
By implementing Hudi, Walmart achieved major gains in data freshness, consistency, and pipeline efficiency. In Walmart’s own words, Hudi “dramatically improved ingestion performance”. One critical batch job ran 5× faster than the previous ORC-based approach in tests. Hudi’s copy-on-write and merge-on-read capabilities allowed record-level updates/deletes which eliminated data duplication and ensured that consumers always see consistent, up-to-date data on the lake. This unlocked true real-time use cases; streaming feeds that were once hard to incorporate can now be ingested and queried within minutes.
In addition, Walmart noted that by replacing Hadoop, Hudi reduced the overall size of their data store (by removing redundant copies) and improved query performance for downstream analytics. Equally important, the solution fit into Walmart’s open-source strategy and existing infrastructure – it ran on their mix of Spark versions (Hudi was uniquely compatible with older Spark 2.4 in parts of their ecosystem) and avoided proprietary lock-in. Walmart has since migrated several key internal data domains to Hudi and is continuing to expand this modern lakehouse approach across the enterprise.
Robinhood, a financial services platform, needed to scale its data platform to support everything from real-time fraud detection to ad-hoc analytics, all under strict regulatory requirements. By 2023 their data lakehouse was ingesting over 10,000 data sources (streams of app events, database CDC feeds, third-party APIs) and had grown to multiple petabytes. The challenge was to serve diverse use cases with different freshness needs – some datasets must be updated in near real-time (for risk models and operational dashboards), while others are more static – without maintaining separate systems. At the same time, as a broker-dealer, Robinhood must enforce strong data governance: for example, ensuring customers’ personal data can be purged on request (“right to be forgotten” under GDPR/CCPA) across all storage layers. Handling such GDPR compliance at scale (deleting or masking a single user’s data across a multi-PB lake) was extremely difficult with their legacy data lake setup. Robinhood’s lean data engineering team needed an architecture that could balance high performance with governance, as failing either could have business and compliance implications.
Robinhood implemented a tiered lakehouse architecture built on Apache Hudi to meet these needs. Incoming data is categorized into tiers by criticality and freshness requirements. For the highest tier (e.g. transactional data for fraud checks), they use change data capture (Debezium on PostgreSQL) to feed Hudi tables in near real-time, ensuring low-latency updates. Less critical data flows in on a schedule to lower tiers. Across all tiers, Apache Hudi’s incremental processing is the backbone: it allows continuous upserts and partition pruning on the lake storage, so that fresh data is queryable quickly without full reprocessing. The team also leverages Hudi’s metadata tracking and timestamp-based queries to manage data lifecycle – e.g. identifying all records for a given user to delete for compliance. In practice, Robinhood’s Hudi tables are accessed by various engines (Spark jobs for ETL, a Trino cluster for analytics, etc.), but the data governance is enforced at the table level. They tag data with zones/tiers and attach retention policies, using Hudi’s support for deletes and updates to propagate GDPR “forget” requests efficiently. This design allows them to use the same unified architecture for both high-speed streaming data and governed archival data.
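As a hedged illustration of the "forget" flow (not Robinhood's actual pipeline), the sketch below locates a hypothetical user's records in a Hudi table and issues record-level deletes; the table name, key fields, and paths are assumptions.

```python
# Hypothetical sketch: honor a GDPR "forget" request by deleting every
# record for one user from a Hudi table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-delete-sketch").getOrCreate()

table_path = "s3a://lake/warehouse/user_events"

# Find the record keys and partitions holding this user's data.
to_forget = (
    spark.read.format("hudi").load(table_path)
         .where("user_id = 'u-12345'")
         .select("event_id", "event_date", "updated_at")
)

delete_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "delete",
}

# Writing the keys back with the "delete" operation removes those records.
(to_forget.write.format("hudi")
    .options(**delete_options)
    .mode("append")
    .save(table_path))
```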
Robinhood’s data platform now handles exponential data growth while remaining compliant and agile. Thanks to Hudi, they can support real-time and batch workloads on one platform and meet strict SLAs for data freshness and quality. For example, a critical pipeline might guarantee that new events are reflected in analytics within a few minutes, even as the dataset scales, which was hard to imagine before. At the same time, data governance processes (like GDPR deletions) are now feasible at scale – Robinhood’s engineers demonstrated that they can reliably delete all data for a given user across a multi-petabyte lakehouse, using Hudi’s indexing to pinpoint data slices that need removal. Their tiered Hudi implementation also improved reliability: each data domain is ingested with the appropriate frequency and quality checks, which has reduced incidents and allowed the small data team to focus on new features rather than babysitting pipelines. Overall, Robinhood’s case shows that an open-source Hudi-based lakehouse can satisfy demanding financial industry requirements (fresh data for ML models, auditable data lineage, compliance) all within a single platform design.
Datto, a provider of data protection and security software, built a data lake on AWS to serve analytics to their customers (managed service providers). However, their initial architecture had performance and cost issues. Incoming events (about 150,000 per day from many endpoint agents) were stored as raw JSON files on S3, queried via Amazon EMR and Hive/Presto. This meant queries scanned thousands of small JSON files and often took 30–60 seconds, sometimes several minutes, to return results. Such slow responses were risky for a customer-facing product (e.g. an analyst investigating an alert might time out waiting for data). Additionally, scanning 150k JSON objects to find a few hundred relevant records was extremely inefficient, contributing significantly to S3 and compute bills. The ingestion side was also cumbersome: Datto used batch ETL jobs (400+ SQL operations orchestrated in Airflow) to prepare each day’s data, which introduced latency and complexity. They considered moving to streaming ingestion, but operating a complex Kafka pipeline with exactly-once semantics appeared to trade one problem for another. In summary, Datto needed to speed up query performance and cut costs on their lakehouse, while simplifying a brittle ingestion process.
In 2023, Datto decided to introduce a modern open table format (Apache Iceberg) to their S3 data lake and to upgrade their query engine to Starburst Trino (a high-performance SQL engine based on Presto). Iceberg gave them an immediate boost by organizing data into columnar Parquet files with partition pruning, instead of a swamp of JSON. They leveraged Iceberg’s schema evolution to gradually restructure data (with features like time-travel for auditing and snapshot rollback for safety). To avoid the burden of manually tuning Iceberg table layouts, Datto adopted Tabular (a managed Iceberg service) for optimizing file sizes, compaction, and ingesting data with lower latency than their old batch jobs. The combination of Trino + Iceberg meant that queries could use SQL on the lakehouse with much less overhead, e.g. filtering by event type would cause Iceberg to skip irrelevant files instead of scanning everything. The migration was completed in a matter of weeks and required minimal code changes to their applications (Tabular’s platform handled the heavy lifting of converting existing data and queries to Iceberg).
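To show what these capabilities look like in practice, here is a sketch of queries through Trino's Iceberg connector, including a time-travel read and a snapshot rollback. The host, catalog, table, columns, and snapshot ID are all hypothetical, and procedure syntax can vary by Trino version.

```python
# Hypothetical sketch: query an Iceberg table through Trino, including a
# time-travel read and a snapshot rollback for recovery.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="alerts",
)
cur = conn.cursor()

# Pruned scan: only Parquet files whose metadata matches the filters are
# read, instead of every raw JSON object.
cur.execute("""
    SELECT agent_id, count(*) AS hits
    FROM endpoint_events
    WHERE event_type = 'ransomware_detected'
      AND event_date = DATE '2023-06-01'
    GROUP BY agent_id
""")
print(cur.fetchall())

# Time travel: audit what the table contained at an earlier point in time.
cur.execute("""
    SELECT count(*)
    FROM endpoint_events
    FOR TIMESTAMP AS OF TIMESTAMP '2023-06-01 00:00:00 UTC'
""")
print(cur.fetchone())

# Roll back to a known-good snapshot after a bad write (snapshot ID made up).
cur.execute(
    "CALL iceberg.system.rollback_to_snapshot('alerts', 'endpoint_events', 8954597067493)"
)
```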
The impact was quickly evident. Datto saw a 2–3× reduction in query response times (p50 and p99) after migrating to the Iceberg-based architecture. What used to be 1+ minute worst-case queries came back in just seconds, improving the experience for analysts and customers alike. The simplified architecture also reduced operational headaches, as it had fewer moving parts than the previous EMR + Hive + extensive Airflow setup. Cost-wise, Iceberg enabled much more efficient scans: instead of reading 150k JSON files for every query, the engine now reads a fraction of that (only the relevant Parquet splits), which significantly lowered the S3 I/O costs per query. Datto’s data team noted that they avoided adding Kafka complexity yet still achieved near-real-time data availability. Overall, by embracing an open lakehouse stack (Iceberg + Trino), Datto solved their performance bottleneck and created a more future-proof analytics foundation for their SaaS product.
Memed, a Brazilian health-tech company offering digital prescriptions, found that their original data platform could not keep up with the growing demand for insights. They had built a basic data lake and a set of ETL pipelines, but it was too slow and unreliable: nightly ETL jobs took a long time and often crashed, resulting in missing or inconsistent data in their analytics layer. Instead of a single source of truth, different teams ended up with conflicting metrics. Any attempt to fix or improve the ETL was “cumbersome and prone to error”, to the point that some analysts bypassed the pipeline entirely (doing data transforms inside dashboards by themselves). This kludge undermined confidence in the data. Moreover, the legacy stack imposed limits on how much data could be queried, preventing Memed from leveraging all their granular transactional data at once. The company needed to modernize their architecture for scalability and real-time needs – for instance, providing up-to-date dashboard insights to doctors and pharmacies who relied on Memed’s data for patient care decisions.
Memed chose Dremio, an open data lakehouse platform, to overhaul their analytics stack. Dremio’s SQL engine (built on Apache Arrow) now queries data directly on their Amazon S3 data lake, avoiding the need for a separate data warehouse. The team migrated their semi-structured and unstructured data into a well-organized Iceberg table format on S3 (Dremio is a strong proponent of Apache Iceberg). This provided the necessary “structure” on the lake, while Dremio’s query accelerator provided the speed. They made heavy use of Dremio’s Reflections (materialized views) to pre-compute frequent query results and automatically boost performance for dashboard users. With Dremio’s self-service SQL interface, Memed’s analysts can now define transformations and joins on the fly (data-as-code) instead of relying on brittle external ETL scripts. Importantly, Dremio’s open architecture let them integrate various data sources (MySQL replicas, clickstream data, etc.) and still query across them in one place, simplifying what was previously a fragmented process.
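For a sense of how applications can hit such a lakehouse programmatically, here is a hedged sketch that submits SQL to Dremio over its REST API and polls for results. The host, credentials, and dataset path are hypothetical, and endpoint paths may differ across Dremio versions and editions.

```python
# Hypothetical sketch: submit a SQL query to Dremio over its REST API and
# poll for the result. Endpoint paths may vary by Dremio version/edition.
import time
import requests

BASE = "https://dremio.example.com:9047"

# Log in and obtain an auth token.
login = requests.post(f"{BASE}/apiv2/login",
                      json={"userName": "analyst", "password": "***"})
headers = {"Authorization": f"_dremio{login.json()['token']}"}

# Submit SQL against the lakehouse; Reflections accelerate it transparently.
job = requests.post(f"{BASE}/api/v3/sql", headers=headers, json={
    "sql": 'SELECT pharmacy_id, count(*) AS prescriptions '
           'FROM lake.curated.prescriptions '
           'GROUP BY pharmacy_id'
}).json()

# Poll until the job finishes, then fetch the first page of results.
while True:
    status = requests.get(f"{BASE}/api/v3/job/{job['id']}",
                          headers=headers).json()
    if status["jobState"] in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

rows = requests.get(f"{BASE}/api/v3/job/{job['id']}/results",
                    headers=headers).json()
print(rows.get("rows", []))
```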
The new lakehouse architecture had an immediate positive impact. Memed’s daily ETL process, which used to take 30–40 minutes, now completes in around 40 seconds, essentially real-time by comparison. This 60× improvement means data is refreshed quickly and reliably each day. Analysts who were previously constrained by slow queries or incomplete data can now run full analytical queries in ~10 minutes (even over large datasets) and spend almost no time waiting on data prep jobs. Consequently, the data team has freed up time to focus on strategic projects instead of firefighting ETL issues. Memed also eliminated prior limitations on querying transactional data size, so they can derive insights from the raw, detailed data that was once off-limits. Critically for the business, the dashboards provided to external clients (doctors, hospitals, insurers) are now accurate and timely. “External clients who rely on our dashboards now have access to accurate and timely insights,” said Memed’s Head of Data after the Dremio rollout. This improved data reliability helps healthcare providers make better decisions and underscores how an open lakehouse approach (combining Iceberg’s data management with Dremio’s querying) solved Memed’s growth pains in data analytics.
Nomura, a global financial services firm, faced significant challenges with their legacy Hadoop infrastructure, which struggled to handle the ingestion of approximately 1.8 billion rows daily. This setup resulted in frequent outages, limited system diagnostics, and difficulties in meeting stringent SLAs for critical financial risk reporting.
In response, Nomura transitioned from Hadoop to a modern data lakehouse architecture built around Dremio's data processing engine. They adopted Apache Iceberg as their open table format, with object storage provided by MinIO, chosen for its performance, scalability, and compatibility with Iceberg. To improve operational insights, Nomura developed a structured metadata layer, implemented a query replay system for rigorous testing, and upgraded their monitoring dashboards to include granular, executor-level metrics.
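A minimal sketch of reading Iceberg tables whose files live in MinIO, assuming a PyIceberg client and a REST catalog service; the endpoints, credentials, and table names are hypothetical, not Nomura's configuration.

```python
# Hypothetical sketch: read an Iceberg table whose data files live in a
# MinIO bucket, via PyIceberg and a REST catalog service.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "risk",
    **{
        "uri": "http://iceberg-rest.example.com:8181",   # catalog endpoint
        "s3.endpoint": "http://minio.example.com:9000",  # on-prem object store
        "s3.access-key-id": "AKIA...",
        "s3.secret-access-key": "***",
    },
)

table = catalog.load_table("risk.position_snapshots")
print(table.schema())

# Scan only the slice needed for one day's risk report.
batch = table.scan(row_filter="business_date = '2024-01-02'").to_arrow()
print(batch.num_rows)
```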
This comprehensive modernization delivered a 13.9% increase in overall system performance and significantly improved platform stability. The new architecture now comfortably supports over 500,000 queries daily, dramatically streamlining ETL processes. By establishing a hybrid on-premises and cloud data platform, Nomura positioned itself for future scalability, greater efficiency, and enhanced reliability in critical business operations.
Organizations are choosing on-premises data lakehouse architectures for their economics, performance at scale, and security. From a cost perspective, on-prem storage offers a predictable model: there are no egress charges or per-request API fees. This means no surprise bills and a total cost of ownership that improves as capacity scales.
MinIO AIStor is the de facto on-prem object storage software, with over 2 billion Docker pulls worldwide and growing. If you’re looking for examples of where other enterprises have built their data lakehouses, AIStor has a proven track record as the foundational storage layer for platforms serving billions of users.
Performance-wise, AIStor has demonstrated extreme throughput and low latency, yielding faster time-to-first-byte and query responsiveness: single-digit millisecond TTFB is achievable, compared to the ~30 ms typical of AWS S3 Standard. Notably, AWS itself introduced S3 Express One Zone (a high-performance S3 tier) claiming 10× lower latency and higher I/O than standard S3, but at 8× the cost per GB of the standard tier. AIStor provides that cloud-grade performance without punitive pricing or trade-offs, all while keeping data on infrastructure you control. Security and governance are stronger on-prem as well: AIStor gives enterprises data sovereignty, with full control over data placement and fine-grained access policies down to individual objects. In short, an AIStor-backed lakehouse offers cloud-like scalability and throughput with predictable economics and enhanced control – a compelling combination for large-scale analytics and AI workloads.
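TTFB is easy to sanity-check against your own deployment. The sketch below takes a rough time-to-first-byte measurement with the MinIO Python SDK; the endpoint, credentials, bucket, and object are hypothetical, and real numbers depend entirely on your network and cluster.

```python
# Hypothetical sketch: rough time-to-first-byte measurement against an
# S3-compatible endpoint using the MinIO Python SDK.
import time
from minio import Minio

client = Minio("aistor.example.com:9000",
               access_key="AKIA...", secret_key="***", secure=True)

BUCKET, OBJECT = "benchmarks", "sample-1mb.bin"

samples = []
for _ in range(100):
    start = time.perf_counter()
    resp = client.get_object(BUCKET, OBJECT)
    resp.read(1)                                   # first byte received
    samples.append((time.perf_counter() - start) * 1000)
    resp.close()
    resp.release_conn()

samples.sort()
print(f"p50 TTFB: {samples[49]:.1f} ms, p99 TTFB: {samples[98]:.1f} ms")
```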
Compared to other on-prem object storage vendors, AIStor’s architecture is purpose-built to avoid common scale bottlenecks. Other S3-compatible systems often rely on separate metadata services or databases that become choke points: every small-file operation incurs a metadata lookup, doubling latency and straining throughput. AIStor eliminates that entire layer: it does not use an external metadata database, instead storing metadata with the objects on disk via consistent hashing. This means there is no single metadata node to saturate or fail, so AIStor remains fast and strictly consistent while handling millions of object operations, even at petabyte scale.
The result is a simpler, horizontally scalable design with fewer moving parts (no dedicated index servers or name nodes). That simplicity translates to better operational stability and easier integration.
Together, these real-world outcomes indicate that the open data lakehouse paradigm is on track to become the enterprise standard for modern data platforms. Forward-looking data teams increasingly favor this architecture for its high performance, lower costs, and freedom from vendor lock-in. A combination that traditional warehouses and proprietary stacks struggle to match. Please reach out to us at hello@min.io or on our Slack channel if you have any questions.