Introduction to MLPerf with Bit Digital

In this article, we take a look at the industry-leading benchmark suite for ML systems: MLPerf. Specifically, we cover:

  • What is MLPerf?
  • How MLPerf helps the AI community.
  • Procedures available for MLPerf training and inference benchmarks.

What is MLPerf?

MLPerf is an industry-standard benchmark suite for evaluating Machine Learning systems, developed by the MLCommons consortium. It evaluates performance across both training and inference workloads. The tests are designed to cover different hardware, software frameworks, and cloud platforms, providing a standardized way to compare the performance of various machine learning solutions.

The tests available are highly varied and cover diverse use cases like image classification, language modeling, and object detection, to name a few. By measuring the capabilities of these systems across varied tasks, MLPerf essentially acts as an objective measuring stick for assessing how fast and efficiently a system can perform machine learning tasks.

In practice, MLPerf has become the de facto standard for assessing ML systems: results are measured against state-of-the-art submissions from the community. Because the project is open source and community driven, these published results give everyone a shared picture of what a given machine setup should be capable of.

How MLPerf helps the AI Community

The main contribution of MLPerf is giving the AI community a standardized way to assess hardware. The tests can be reproduced on a wide variety of systems, and they run the gamut from extremely large Language Models down to edge models designed for mobile devices. By running these benchmarks and comparing against published results, users can accurately gauge the efficacy and efficiency of their hardware relative to other systems.

MLPerf: Training and Inference

MLPerf focuses on both major ML processes: training and inference. Each process has a set of curated testing procedures that can be replicated across all types of machine learning systems. These procedures are built around reference models approved by the MLCommons community. Let’s take a closer look at the types of MLPerf benchmarks available for training and inference.

Types of MLPerf Benchmarks

Training

Training is the cornerstone of any Machine Learning system, and without training paradigms, the AI revolution would not have been possible. Innovations in training regimens correspond directly to real advancements in AI development. By testing and recreating the best training procedures, we continue to learn, optimize, and drive further innovation on this front.

Below is the official list of MLPerf training tasks from MLPerf Training v4.1, the latest round to be released. The benchmarks span different sub-disciplines of Deep Learning (DL) model training and different DL frameworks. By running them, users can see how their setup compares to optimized submissions for each task type from a large number of respected submitters. For example, a machine may be well tuned for LLMs and Transformers, but not necessarily for training a GNN. By comparing results across each sub-discipline of DL, we can develop a holistic understanding of how capable a machine setup is.

| Model | Reference Implementation | Framework | Dataset | Model Parameter Count |
|---|---|---|---|---|
| RetinaNet | vision/object detection | pytorch | OpenImages | 37M |
| Stable Diffusion v2 | image generation | pytorch | LAION-400M-filtered | 865M |
| BERT-large | language/nlp | tensorflow | Wikipedia 2020/01/01 | 340M |
| GPT-3 | language/llm | paxml, megatron-lm | C4 | 175B |
| Llama2 70B-LoRA | language/LLM fine-tuning | pytorch | SCROLLS GovReport | 70B |
| DLRMv2 | recommendation | torchrec | Criteo 3.5TB multi-hot | 167M |
| RGAT | GNN | pytorch | IGBFull | 25M |

These training tasks cover object detection, image generation, NLP, Large Language Model training from scratch, LLM fine-tuning, recommendation systems, and Graph Neural Networks. Each of the datasets is well regarded for its robustness and safety, and results for these benchmarks are submitted and reported by established institutions like Google and NVIDIA.
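
MLPerf Training results are reported as time-to-train: the wall-clock time a system needs to train the reference model to a fixed quality target. As a simplified, hypothetical illustration of that metric (not the official harness, whose reference implementations live in the mlcommons/training repository), the Python sketch below times a toy training loop until a placeholder evaluation metric crosses a target threshold.

```python
# Hypothetical illustration of the MLPerf Training metric: wall-clock
# "time to train" until a model reaches a fixed quality target.
# This is NOT the official MLPerf harness; the real reference
# implementations live in the mlcommons/training repository.
import random
import time


def train_one_epoch(epoch: int) -> float:
    """Stand-in for a real training epoch; returns a fake eval metric."""
    time.sleep(0.1)  # pretend to do work
    return 1.0 - 0.5 ** (epoch + 1) + random.uniform(-0.01, 0.01)


def time_to_train(target_quality: float, max_epochs: int = 50) -> float:
    """Return the seconds needed to reach `target_quality`, MLPerf-style."""
    start = time.perf_counter()
    for epoch in range(max_epochs):
        quality = train_one_epoch(epoch)
        print(f"epoch {epoch}: eval metric = {quality:.4f}")
        if quality >= target_quality:  # quality target reached, stop the clock
            return time.perf_counter() - start
    raise RuntimeError("quality target not reached within max_epochs")


if __name__ == "__main__":
    # e.g. a benchmark might define "train until the eval metric reaches 0.95"
    print(f"time to train: {time_to_train(0.95):.2f} s")
```

The real benchmarks work the same way in spirit, but the quality target (for example, a validation accuracy or loss threshold) is fixed by the benchmark rules so that every submitter stops the clock at the same point.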

Inference

The MLPerf Inference v5.0 round is currently ongoing and tests an even wider variety of ML systems. These range from classical assessments like image classification and NLP to more advanced applications like PointPainting and Large Language Model inference. The growing number of open contributions reflects the demand for more diverse, powerful hardware systems.

Let’s take a deeper look at the available tests below. 

| Model | Reference App | Framework | Dataset | Category |
|---|---|---|---|---|
| resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge, datacenter |
| retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge, datacenter |
| bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge |
| dlrm-v2 | recommendation/dlrm_v2 | pytorch | Multihot Criteo Terabyte | datacenter |
| 3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge, datacenter |
| gpt-j | language/gpt-j | pytorch | CNN-Daily Mail | edge, datacenter |
| stable-diffusion-xl | text_to_image | pytorch | COCO 2014 | edge, datacenter |
| llama2-70b | language/llama2-70b | pytorch | OpenOrca | datacenter |
| llama3.1-405b | language/llama3-405b | pytorch | LongBench, LongDataCollections, Ruler, GovReport | datacenter |
| mixtral-8x7b | language/mixtral-8x7b | pytorch | OpenOrca, MBXP, GSM8K | datacenter |
| rgat | graph/rgat | pytorch | IGBH | datacenter |
| pointpainting | automotive/3d-object-detection | pytorch, onnx | Waymo Open Dataset | edge |

The sub-disciplines covered above include image classification, object detection, language models, recommendation systems, 3D medical imaging, text-to-image generation, Large Language Models, Mixture-of-Experts LLMs, graph neural networks, and automotive 3D object detection. Cumulatively, these tasks cover nearly the full spectrum of popular DL applications and are designed to run on both data center and edge devices. This wide variety of inference tasks provides a methodology for assessing the efficacy of virtually any machine. New hardware releases can be compared directly against these results, creating a true standardized benchmark for ML/DL hardware systems.
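
Under the hood, the inference benchmarks above are driven by MLCommons’ LoadGen library, which issues queries under a chosen scenario (SingleStream, MultiStream, Server, or Offline) and records latency and throughput. The sketch below shows roughly how a system under test can be wired to LoadGen’s Python bindings (module mlperf_loadgen), using a dummy model and empty responses; it reflects our reading of those bindings rather than an official harness, and the real reference apps in the mlcommons/inference repository additionally handle datasets, accuracy checks, and logging.

```python
# A minimal, hypothetical sketch of how an MLPerf Inference run is wired
# together with the LoadGen Python bindings (module: mlperf_loadgen).
# It uses a dummy "model" and returns empty responses, so it only
# demonstrates the control flow, not a valid submission.
import mlperf_loadgen as lg

N_SAMPLES = 1024  # size of the (fake) dataset


def issue_queries(query_samples):
    """Called by LoadGen with a batch of queries to run inference on."""
    responses = []
    for qs in query_samples:
        # A real SUT would run the model on sample `qs.index` here and
        # point the response at the output buffer; we return an empty one.
        responses.append(lg.QuerySampleResponse(qs.id, 0, 0))
    lg.QuerySamplesComplete(responses)


def flush_queries():
    """Called by LoadGen when outstanding queries should be flushed."""
    pass


def load_samples(sample_indices):
    """A real QSL would load these samples into host/device memory."""
    pass


def unload_samples(sample_indices):
    pass


if __name__ == "__main__":
    settings = lg.TestSettings()
    settings.scenario = lg.TestScenario.Offline  # datacenter-style throughput test
    settings.mode = lg.TestMode.PerformanceOnly

    sut = lg.ConstructSUT(issue_queries, flush_queries)
    qsl = lg.ConstructQSL(N_SAMPLES, N_SAMPLES, load_samples, unload_samples)

    lg.StartTest(sut, qsl, settings)  # results are written to mlperf_log_* files

    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)
```

The scenario choice is what separates the two categories in the table: datacenter submissions typically use the Server and Offline scenarios, while edge submissions use SingleStream, MultiStream, and Offline.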

Running MLPerf with WhiteFiber

In the follow-up to this article, we will begin showing our results for an inference benchmark on a WhiteFiber GPU machine. These bare metal GPUs offer full functionality on NVIDIA H200 nodes. We intend to recreate the inference benchmark for the LLM Llama 3.1 405B. Please follow @WhiteFiber_ for more updates!