MLPerf Storage v2.0 Eclipses Records in AI Benchmarking

MLCommons today announced results for its MLPerf Storage v2.0 benchmark, setting new records with over 200 performance results from 26 organizations. The results provide a trove of new data for AI trainers looking to make informed storage decisions and avoid bottlenecks in machine learning (ML) workloads.

The dramatic surge in participation compared to the v1.0 benchmark signals how critical storage has become for AI training systems as they scale to billions of parameters and clusters reach hundreds of thousands of accelerators. Companies ranging from tech giants to specialized storage providers submitted results, representing seven different countries in what officials called unprecedented global engagement.

“The MLPerf Storage benchmark has set new records for an MLPerf benchmark, both for the number of organizations participating and the total number of submissions,” said David Kanter, Head of MLPerf at MLCommons. “The AI community clearly sees the importance of our work in publishing accurate, reliable, unbiased performance data on storage systems, and it has stepped up globally to be a part of it.”

A total of 26 organizations submitted results: Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, KIOXIA, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, WDC, and YanRong.

Benchmark Suite Tests Real-World AI Training Scenarios

The MLPerf Storage benchmarks focus on testing a storage system’s ability to keep pace with accelerators, either graphics processing units (GPUs) or application-specific integrated circuits (ASICs). Among other metrics, the suite measures whether the storage system can keep accelerator utilization above 90% across different ML workloads.

The v2.0 results show that storage systems now simultaneously support roughly twice as many accelerators as in the previous benchmark round, a critical improvement as training clusters continue to grow to meet demand.

The suite evaluates how well storage systems handle the data demands of actual AI training without requiring organizations to run full training jobs. The benchmarks work by simulating the “think time” of accelerators, the processing periods when they’re computing rather than reading or writing data. This approach generates realistic storage access patterns while testing whether storage systems can maintain the required performance levels to keep accelerators fed with data across different system configurations.
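
To make the “think time” idea concrete, here is a minimal sketch of the approach, assuming a hypothetical read_batch() callable that fetches one batch from the storage system under test. This is an illustration of the concept, not the actual MLPerf Storage harness:

```python
import time

def simulate_epoch(read_batch, num_batches, think_time_s):
    """Toy emulation of one accelerator during a training epoch.

    read_batch   -- hypothetical callable that reads one batch from the storage under test
    num_batches  -- number of batches to simulate
    think_time_s -- emulated per-batch compute ("think") time in seconds
    """
    compute_time = 0.0
    start = time.perf_counter()
    for _ in range(num_batches):
        read_batch()               # real I/O against the storage system happens here
        time.sleep(think_time_s)   # the accelerator "thinks" instead of running a real model
        compute_time += think_time_s
    elapsed = time.perf_counter() - start
    # Accelerator utilization: fraction of wall-clock time spent computing.
    return compute_time / elapsed
```

If the storage system cannot keep up, the loop stalls inside read_batch() and the reported utilization falls below the 90% threshold the benchmark requires.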

The v2.0 suite carries over three core workloads from v1.0 that represent common AI applications: 3D U-Net for medical image segmentation, ResNet-50 for image classification, and CosmoFlow for cosmological parameter prediction in scientific computing.

New Checkpoint Benchmark Tests Address “Chronic Issue” in AI Training

The v2.0 suite introduces new tests to meet a harsh mathematical reality of AI training: in a 100,000-accelerator cluster running at full utilization for extended periods, failures can occur every 30 minutes. In a theoretical million-accelerator system, that’s a failure every three minutes.
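
The arithmetic behind those figures is simple: the cluster-wide failure rate grows linearly with the number of accelerators, so the expected time between failures shrinks in proportion. A quick sketch, where the per-accelerator reliability figure is an assumption inferred from the 100,000-accelerator example rather than a published number:

```python
# Per-accelerator mean time between failures (MTBF), inferred from a 100,000-accelerator
# cluster failing roughly every 30 minutes. This value is an assumption for illustration.
PER_ACCELERATOR_MTBF_HOURS = 100_000 * 0.5  # = 50,000 hours

def cluster_failure_interval_minutes(num_accelerators):
    """Expected time between failures anywhere in the cluster, in minutes."""
    return PER_ACCELERATOR_MTBF_HOURS / num_accelerators * 60

print(cluster_failure_interval_minutes(100_000))    # ~30 minutes
print(cluster_failure_interval_minutes(1_000_000))  # ~3 minutes
```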

The new checkpointing tests address this challenge head-on. Regular checkpoints—saved snapshots of training progress—are essential to mitigate the effects of accelerators failing. To optimize the use of these checkpoints, however, AI trainers require accurate data on the scale and performance of storage systems. The MLPerf Storage v2.0 checkpoint benchmark provides that data.
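
For a sense of how that data gets used: a common rule of thumb for choosing a checkpoint interval is the Young/Daly approximation, which balances the time spent writing checkpoints against the work lost when a failure strikes. This is a general planning heuristic, not part of the MLPerf Storage benchmark itself, and the checkpoint write time below is a hypothetical value:

```python
import math

def optimal_checkpoint_interval_s(checkpoint_write_s, mtbf_s):
    """Young/Daly first-order approximation: T_opt ~= sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_write_s * mtbf_s)

# Hypothetical inputs: a 60-second checkpoint write on a cluster that fails every 30 minutes.
interval_s = optimal_checkpoint_interval_s(checkpoint_write_s=60, mtbf_s=30 * 60)
print(f"checkpoint roughly every {interval_s / 60:.1f} minutes")  # ~7.7 minutes
```

Faster checkpoint writes shrink the optimal interval and the amount of work lost per failure, which is exactly the storage behavior the new tests quantify.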

More information on checkpointing and the design of the benchmarks can be found in a blog post by Wes Vaske, a member of the MLPerf Storage working group.

Technical Diversity Reflects Industry Innovation

The submissions showcase remarkable diversity in approaches to high-performance AI storage. The v2.0 results include 6 local storage solutions, 2 systems using in-storage accelerators, 13 software-defined solutions, 12 block systems, 16 on-premises shared storage solutions, and 2 object stores.

This technical variety reflects what MLPerf Storage working group co-chair Oana Balmau called innovation driven by necessity. “Everything is scaling up: models, parameters, training datasets, clusters, and accelerators,” she said. “It’s no surprise to see that storage system providers are innovating to support ever larger scale systems.”

Major Players Showcase Enterprise-Grade Solutions

Enterprise storage leaders demonstrated significant advances in supporting massive AI training clusters.

DDN’s AI400X3 appliance achieved over 110 GiB/s sustained read throughput while supporting up to 640 simulated H100 GPUs on ResNet-50, representing a 2x performance improvement over the previous generation.

HPE submitted results for its Cray Supercomputing Storage Systems E2000. The E2000 more than doubles I/O performance compared to previous generations and powers six of the world’s 10 fastest supercomputers, demonstrating proven scalability at unprecedented computational scales.

IBM showcased real-world performance with its Storage Scale system, which delivered 656.7 GiB/s read bandwidth for the massive Llama 3 1T model—equivalent to loading the entire trillion-parameter model in approximately 23 seconds—while simultaneously supporting mixed production workloads.
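
As a rough sanity check on that figure (our back-of-envelope arithmetic, not a number from the submission): a trillion-parameter training checkpoint stored with fp32 weights plus Adam optimizer state comes to roughly 16 bytes per parameter, or about 16 TB, and reading 16 TB at 656.7 GiB/s takes approximately 23 seconds.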

Quanta Cloud Technology (QCT) demonstrated the effectiveness of thoughtful system design through its QuantaGrid D54X-1U server platform, testing configurations with both Solidigm D7-PS1010 NVMe SSDs for low-latency metadata operations and D5-P5336 NVMe SSDs for high-capacity streaming read throughput.

The TechArena Take

When you’re running million-dollar training jobs that can fail every few minutes, storage is mission-critical infrastructure. The overall improvement in the number of accelerators that storage systems can support, combined with record participation, reveals an ecosystem that’s taking storage seriously as a potential bottleneck to AI training efficiency.

We’re also excited to see the diversity of approaches represented in these results. With six different storage architectures represented, spanning everything from local NVMe to object stores, there’s clearly no single “right” answer yet. The industry is still experimenting, which means significant performance gains are likely still on the table. We’ll be watching for those gains in the next benchmark round.

The complete MLPerf Storage v2.0 results are available at MLCommons.org.
