Introduction to MLPerf with Bit Digital

In this article, we take a look at the industry-leading benchmark suite for ML systems: MLPerf. Specifically, we cover:

  • What is MLPerf?
  • How MLPerf helps the AI community.
  • Procedures available for MLPerf training and inference benchmarks.

What is MLPerf?

MLPerf is an industry-standard benchmark suite for evaluating Machine Learning systems, developed by the MLCommons consortium. It evaluates performance across both training and inference workloads. The tests are designed to cover different hardware, software frameworks, and cloud platforms, providing a standardized way to compare the performance of various machine learning solutions.

The tests available are highly varied and cover diverse use cases like image classification, language modeling, and object detection, to name a few. By measuring the capabilities of these systems across varied tasks, MLPerf essentially acts as an objective measuring stick for assessing how fast and efficiently a system can perform machine learning tasks.

In practice, MLPerf has become the de facto standard for assessing ML systems: results are measured against state-of-the-art submissions from the community. Because the project is open source and community driven, these published results give everyone a shared picture of what a given machine setup should be capable of.

How MLPerf helps the AI Community

The main contribution of MLPerf is giving the AI community a standardized way to assess hardware. The tests can be reproduced on a wide variety of systems, and they run the gamut from extremely large Language Models down to edge models designed for mobile devices. By running these benchmarks and comparing against published results, users can accurately gauge the efficacy and efficiency of their hardware relative to other systems.

MLPerf: Training and Inference

MLPerf focuses on both major ML processes: training and inference. Each process has a set of curated testing procedures that can be replicated across all types of machine learning systems. These procedures are built around reference models approved by the MLCommons community. Let’s take a closer look at the types of MLPerf benchmarks available for training and inference.

Types of MLPerf Benchmarks

Training

Training is the cornerstone of any Machine Learning system, and without training paradigms, the AI revolution would not have been possible. Innovations in training regimens correspond directly to real advancements in AI development. By testing and recreating the best training procedures, we continue to learn, optimize, and drive further innovation on this front.

Below is the official list of MLPerf training tasks from MLPerf Training v4.1, the latest round to be released. The benchmarks span different sub-disciplines of Deep Learning (DL) model training and different DL frameworks. By running them, users can see how their setup compares to optimized submissions for each task type from a large number of respected submitters. For example, a machine may be well tuned for LLMs and Transformers, but not necessarily for training a GNN. By comparing results across each sub-discipline of DL, we can develop a holistic understanding of how capable a machine setup is.

| Model | Reference Implementation | Framework | Dataset | Model Parameter Count |
|---|---|---|---|---|
| RetinaNet | vision/object detection | pytorch | OpenImages | 37M |
| Stable Diffusion v2 | image generation | pytorch | LAION-400M-filtered | 865M |
| BERT-large | language/nlp | tensorflow | Wikipedia 2020/01/01 | 340M |
| GPT-3 | language/llm | paxml, megatron-lm | C4 | 175B |
| Llama2 70B-LoRA | language/LLM fine-tuning | pytorch | SCROLLS GovReport | 70B |
| DLRMv2 | recommendation | torchrec | Criteo 3.5TB multi-hot | 167M |
| RGAT | GNN | pytorch | IGBFull | 25M |

These training tasks cover object detection, image generation, NLP, Large Language Model training from scratch, LLM fine-tuning, recommendation systems, and Graph Neural Networks. Each of the datasets is well regarded for its robustness and safety, and results for these benchmarks are submitted and reported by established institutions like Google and NVIDIA.
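
MLPerf Training results are reported as time-to-train: the wall-clock time a system needs to train the reference model to a fixed quality target. As a simplified, hypothetical illustration of that metric (not the official harness, whose reference implementations live in the mlcommons/training repository), the Python sketch below times a toy training loop until a placeholder evaluation metric crosses a target threshold.

```python
# Hypothetical illustration of the MLPerf Training metric: wall-clock
# "time to train" until a model reaches a fixed quality target.
# This is NOT the official MLPerf harness; the real reference
# implementations live in the mlcommons/training repository.
import random
import time


def train_one_epoch(epoch: int) -> float:
    """Stand-in for a real training epoch; returns a fake eval metric."""
    time.sleep(0.1)  # pretend to do work
    return 1.0 - 0.5 ** (epoch + 1) + random.uniform(-0.01, 0.01)


def time_to_train(target_quality: float, max_epochs: int = 50) -> float:
    """Return the seconds needed to reach `target_quality`, MLPerf-style."""
    start = time.perf_counter()
    for epoch in range(max_epochs):
        quality = train_one_epoch(epoch)
        print(f"epoch {epoch}: eval metric = {quality:.4f}")
        if quality >= target_quality:  # quality target reached, stop the clock
            return time.perf_counter() - start
    raise RuntimeError("quality target not reached within max_epochs")


if __name__ == "__main__":
    # e.g. a benchmark might define "train until the eval metric reaches 0.95"
    print(f"time to train: {time_to_train(0.95):.2f} s")
```

The real benchmarks work the same way in spirit, but the quality target (for example, a validation accuracy or loss threshold) is fixed by the benchmark rules so that every submitter stops the clock at the same point.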

Inference

The MLPerf Inference v5.0 round is currently ongoing and tests an even wider variety of ML systems. These range from classical assessments like image classification and NLP to more advanced applications like PointPainting and Large Language Model inference. The growing number of open contributions reflects the demand for more diverse, powerful hardware systems.

Let’s take a deeper look at the available tests below. 

| Model | Reference App | Framework | Dataset | Category |
|---|---|---|---|---|
| resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge, datacenter |
| retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge, datacenter |
| bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge |
| dlrm-v2 | recommendation/dlrm_v2 | pytorch | Multihot Criteo Terabyte | datacenter |
| 3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge, datacenter |
| gpt-j | language/gpt-j | pytorch | CNN-Daily Mail | edge, datacenter |
| stable-diffusion-xl | text_to_image | pytorch | COCO 2014 | edge, datacenter |
| llama2-70b | language/llama2-70b | pytorch | OpenOrca | datacenter |
| llama3.1-405b | language/llama3-405b | pytorch | LongBench, LongDataCollections, Ruler, GovReport | datacenter |
| mixtral-8x7b | language/mixtral-8x7b | pytorch | OpenOrca, MBXP, GSM8K | datacenter |
| rgat | graph/rgat | pytorch | IGBH | datacenter |
| pointpainting | automotive/3d-object-detection | pytorch, onnx | Waymo Open Dataset | edge |

The sub-disciplines covered above include image classification, object detection, language models, recommendation systems, 3D medical imaging, text-to-image generation, Large Language Models, Mixture-of-Experts LLMs, graph neural networks, and automotive 3D object detection. Cumulatively, these tasks cover nearly the full spectrum of popular DL applications and are designed to run on both data center and edge devices. This wide variety of inference tasks provides a methodology for assessing the efficacy of virtually any machine. New hardware releases can be compared directly against these results, creating a true standardized benchmark for ML/DL hardware systems.
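
Under the hood, the inference benchmarks above are driven by MLCommons’ LoadGen library, which issues queries under a chosen scenario (SingleStream, MultiStream, Server, or Offline) and records latency and throughput. The sketch below shows roughly how a system under test can be wired to LoadGen’s Python bindings (module mlperf_loadgen), using a dummy model and empty responses; it reflects our reading of those bindings rather than an official harness, and the real reference apps in the mlcommons/inference repository additionally handle datasets, accuracy checks, and logging.

```python
# A minimal, hypothetical sketch of how an MLPerf Inference run is wired
# together with the LoadGen Python bindings (module: mlperf_loadgen).
# It uses a dummy "model" and returns empty responses, so it only
# demonstrates the control flow, not a valid submission.
import mlperf_loadgen as lg

N_SAMPLES = 1024  # size of the (fake) dataset


def issue_queries(query_samples):
    """Called by LoadGen with a batch of queries to run inference on."""
    responses = []
    for qs in query_samples:
        # A real SUT would run the model on sample `qs.index` here and
        # point the response at the output buffer; we return an empty one.
        responses.append(lg.QuerySampleResponse(qs.id, 0, 0))
    lg.QuerySamplesComplete(responses)


def flush_queries():
    """Called by LoadGen when outstanding queries should be flushed."""
    pass


def load_samples(sample_indices):
    """A real QSL would load these samples into host/device memory."""
    pass


def unload_samples(sample_indices):
    pass


if __name__ == "__main__":
    settings = lg.TestSettings()
    settings.scenario = lg.TestScenario.Offline  # datacenter-style throughput test
    settings.mode = lg.TestMode.PerformanceOnly

    sut = lg.ConstructSUT(issue_queries, flush_queries)
    qsl = lg.ConstructQSL(N_SAMPLES, N_SAMPLES, load_samples, unload_samples)

    lg.StartTest(sut, qsl, settings)  # results are written to mlperf_log_* files

    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)
```

The scenario choice is what separates the two categories in the table: datacenter submissions typically use the Server and Offline scenarios, while edge submissions use SingleStream, MultiStream, and Offline.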

Running MLPerf with WhiteFiber

In the follow-up to this article, we will begin showing our results for an inference benchmark on a WhiteFiber GPU machine. These bare metal GPUs offer full functionality on NVIDIA H200 nodes. We intend to recreate the inference benchmark for the LLM Llama 3.1 405B. Please follow @WhiteFiber_ for more updates!