Introduction to MLPerf with Bit Digital
In this article, we will take a look at the industry-leading benchmark suite for ML systems: MLPerf. We cover:
- What is MLPerf?
- How MLPerf helps the AI community.
- Procedures available for MLPerf training and inference benchmarks.
What is MLPerf?
MLPerf is an industry-standard benchmark suite for evaluating Machine Learning systems, developed and maintained by the MLCommons consortium. It is used to evaluate performance across both training and inference. Moreover, the testing is designed to cover different hardware, software frameworks, and cloud platforms to provide a standardized way to compare the performance of various machine learning solutions.
The tests available are highly varied and cover diverse use cases like image classification, language modeling, and object detection, to name a few. By measuring the capabilities of these systems across varied tasks, MLPerf essentially acts as an objective measuring stick for assessing how fast and efficiently a system can perform machine learning tasks.
In practice, MLPerf has become the de facto standard for testing ML systems, allowing accuracy and capability to be compared against state-of-the-art submissions from the community. The open-source, community-driven nature of the project builds a collective understanding of what a given machine setup should be capable of.
How MLPerf helps the AI Community
MLPerf’s main contribution is giving the broader AI community a standardized way to assess hardware. These tests can be reproduced on a wide variety of systems, and they run the gamut from Large Language Models of extreme size all the way down to edge models designed to run on mobile devices. By benchmarking their hardware setup against these industry standards, users can accurately determine its efficacy and efficiency in the context of other systems.
MLPerf: Training and Inference
MLPerf covers both major ML processes: training and inference. Each process has a set of curated testing procedures that can be replicated across all types of machine learning systems. These procedures are developed around reference models that are approved by the community and MLCommons. Let’s take a closer look at the types of MLPerf benchmarks available for training and inference.
Types of MLPerf subjects
Training
Training is the cornerstone of any Machine Learning system, and without training paradigms, the AI revolution would not have been possible. Innovations in training regimens map directly to real advances in AI development. By testing and recreating the best training procedures, we continue to learn, optimize, and innovate further on this front.
Below, we can see the official list of MLPerf training tasks from MLPerf Training 4.1, the latest round to be released. The training benchmarks are composed of different sub-disciplines of Deep Learning (DL) model training on different DL frameworks. By running these benchmarks on a machine, users can assess how their setup compares to optimized setups for each task type from a large number of respected submitters. For example, a machine may be optimized for LLMs and Transformers, but not necessarily for training a GNN. By comparing results for each sub-discipline of DL, we can develop a holistic understanding of how capable a machine setup is.
The training tasks assessed in detail are: object detection, image generation, NLP, Large Language Model training from scratch, LLM fine-tuning, recommendation systems, and Graph Neural Networks. Each of the datasets is well regarded for its robustness and safety, and the reference models behind these tasks come from established institutions like Google, NVIDIA, and OpenAI.
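To give a feel for what a training submission involves, the sketch below shows the style of compliance logging MLPerf Training runs are wrapped in. It is a minimal sketch, assuming the mlperf_logging package from the MLCommons logging repository is installed; the log file name, epoch count, and accuracy value are placeholders rather than outputs of a real run.

```python
# Sketch of MLPerf-style compliance logging around a training loop.
# Assumes the mlperf_logging package (MLCommons logging repo) is available;
# file name, epoch count, and accuracy values below are placeholders.
from mlperf_logging import mllog

mllog.config(filename="training_result_0.txt")  # hypothetical log file
mllogger = mllog.get_mllogger()

mllogger.start(key=mllog.constants.INIT_START)
# ... build the model, data loaders, and optimizer here ...
mllogger.end(key=mllog.constants.INIT_STOP)

mllogger.start(key=mllog.constants.RUN_START)
for epoch in range(1, 4):  # placeholder epoch count
    mllogger.start(key=mllog.constants.EPOCH_START, metadata={"epoch_num": epoch})
    # ... one epoch of training ...
    mllogger.end(key=mllog.constants.EPOCH_STOP, metadata={"epoch_num": epoch})
    # Report the quality metric that the benchmark's target is defined on.
    mllogger.event(key=mllog.constants.EVAL_ACCURACY,
                   value=0.72, metadata={"epoch_num": epoch})
mllogger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```

The official reference implementations in the mlcommons/training repository instrument their training loops in this manner so that results can be verified and compared across submitters.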
Inference
The MLPerf Inference 5.0 round is currently ongoing and tests an even wider variety of ML workloads. These range from classical assessments like image classification and NLP to more advanced applications like PointPainting (3D object detection) and Large Language Model inference. The growing number of open contributions reflects the demand for more diverse, powerful hardware systems.
Let’s take a deeper look at the available tests below.
The sub-disciplines covered include image classification, object detection, language models, recommendation systems, 3D object modeling, Large Language Models, Mixture-of-Experts LLMs, graph neural networks, and 3D object detection. Cumulatively, these tasks cover nearly the full spectrum of popular DL applications, and they are designed to be run on both data center and edge devices. This wide variety of inference tasks gives us a methodology for assessing the efficacy of any machine. New hardware releases can be compared directly against these results, creating a truly standardized benchmark for ML/DL hardware systems.
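All of these inference benchmarks are driven by MLPerf's LoadGen library, which issues queries according to the chosen scenario and measures latency and throughput. Below is a minimal sketch of a LoadGen harness, assuming the mlperf_loadgen Python bindings from the mlcommons/inference repository are installed; the stand-in "model" and sample counts are placeholders, and exact binding signatures can vary between loadgen versions.

```python
# Minimal LoadGen harness sketch (Offline scenario) with a stand-in model.
# Assumes the mlperf_loadgen Python bindings are installed; sample counts
# and the fake 4-byte "result" are placeholders for a real model/dataset.
import array
import mlperf_loadgen as lg

TOTAL_SAMPLES = 1024        # size of the stand-in dataset
PERFORMANCE_SAMPLES = 256   # samples LoadGen keeps resident at once

def load_samples(sample_indices):
    pass  # normally: move these dataset samples into host/GPU memory

def unload_samples(sample_indices):
    pass  # normally: release those samples

def issue_queries(query_samples):
    # LoadGen hands us a batch of queries; run the model and report back.
    responses = []
    for qs in query_samples:
        result = array.array("B", [0, 0, 0, 0])  # stand-in inference output
        addr, count = result.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, count * result.itemsize))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(TOTAL_SAMPLES, PERFORMANCE_SAMPLES, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```

In a real run, the official harness for each benchmark replaces these stand-in pieces with the benchmark's model, dataset, and accuracy checks, and the LoadGen output logs are post-processed into the submission format.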
Running MLPerf with WhiteFiber
In the follow-up to this article, we will begin showing our results for an inference benchmark on a WhiteFiber GPU machine. These bare metal GPUs offer full functionality on NVIDIA H200 nodes. We intend to recreate the inference benchmark for the LLM Llama 3.1 405B. Please follow @WhiteFiber_ for more updates!