
Architecting a Heterogeneous AI Cloud for Training and Inference


As with all things, one size rarely fits all. When it comes to AI infrastructure, it’s entirely feasible to spin up a cluster with your GPU of choice and get to training or serving inference workloads. However, as workloads scale and become more complex, teams look for ways to optimize their infrastructure investments, both to increase the efficiency of their workloads and to reduce capital overhead.

The goal is to match each workload, whether training massive models or serving rapid inference requests, with the best-suited resources. In this guide, we outline considerations and best practices for designing such a heterogeneous infrastructure, including how to leverage different GPU models, high-speed storage, and networking to maximize performance for both training and inference workloads.

WHY HETEROGENEOUS INFRASTRUCTURE FOR AI?

AI workloads come in many shapes and sizes. A one-size-fits-all hardware approach can lead to inefficiencies. For example, a large neural network training job may need many high-memory GPUs working in tandem, while a real-time inference service might run best on a fleet of smaller, cost-effective GPU instances. Heterogeneous infrastructure allows you to mix and match resources to meet these varying demands. This can improve utilization and cost-efficiency, ensuring expensive accelerators are fully used and not sitting idle. Studies have shown that mismatches in the stack (like slow storage feeding fast GPUs) can drop GPU utilization to as low as 30%, whereas a balanced design can push utilization into the 90%+ range. In short, embracing heterogeneity means optimizing each layer of your stack (compute, storage, network) for the specific needs of different AI tasks.

COMPUTE LAYER: CHOOSING THE RIGHT GPUS (AND MORE)

At the heart of any AI cloud are the GPUs or other accelerators. A heterogeneous strategy uses different GPU models or configurations to suit different jobs:

Latest vs. Previous Generation GPUs: B200, H200, H100

H200 vs. H100:

NVIDIA’s H200 builds on the H100 foundation, offering ~1.4× higher memory bandwidth - 4.8 TB/s vs. 3.35 TB/s - and a larger 141 GB HBM3e memory (vs. 80 GB HBM3). In practical terms, H200 delivers roughly 2× faster inference on large models like Llama 2 70B compared to H100, while significantly reducing power draw and lowering total cost of ownership by about 50% in inference settings.

B200 vs. H200/H100:

The new B200 GPU, based on the Blackwell architecture, offers a major leap: 2.5× to 4× higher training throughput over H100/H200, depending on scale and interconnects. On inference, a single B200 outperforms 3–4 H100s, delivering massive performance gains for real-time AI services. Despite its larger size, the B200 also improves energy efficiency; some self-hosted deployments report up to 57% faster training benchmark results and roughly 10× lower operating cost compared to H100 setups.

Best practice: allocate cutting-edge GPUs to jobs that truly need the extra performance (e.g. training huge models or serving very high-throughput inference). Meanwhile, consider using older or mid-range GPUs for less demanding tasks to save cost, as long as those GPUs have sufficient memory and capability for the model.

GPU Memory and Model Size:

Ensure the GPU chosen has enough memory for the model and batch size. Training a large model might require 80 GB memory GPUs (A100 80GB or H100 80GB), whereas a smaller model could run on a 16–24 GB GPU. Insufficient GPU memory leads to frequent data transfers or not being able to load the model at all, severely hurting performance. For training, if the model is bigger than one GPU’s memory, you'll need multiple GPUs with fast interconnect or model parallelism. For inference, large models (like a 70B parameter language model) might only fit on a high-memory GPU or with techniques like model sharding across GPUs, which adds complexity.

Choose GPUs with memory sizes appropriate to your model sizes – it’s often more efficient to use one GPU that fits the model than to split across several smaller ones if possible.
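
To make this concrete, here is a rough back-of-envelope memory estimate as a Python sketch. The 2-bytes-per-parameter and ~16-bytes-per-parameter figures assume fp16 inference and mixed-precision Adam training respectively, and the calculation deliberately ignores activations, KV caches, and framework overhead, so treat the output as a floor rather than a sizing guarantee.

```python
# Rough back-of-envelope GPU memory estimate for a model.
# Ignores activations, KV cache, and framework overhead (illustrative only).

def estimate_memory_gb(n_params_billion: float, bytes_per_param: int = 2,
                       training: bool = False) -> float:
    """Weights only for inference; ~16 bytes/param for mixed-precision Adam training
    (fp16 weights + fp16 grads + fp32 master weights and optimizer moments)."""
    if training:
        return n_params_billion * 1e9 * 16 / 1e9
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# A 70B-parameter model in fp16 needs ~140 GB just for weights: it does not fit
# on a single 80 GB H100, fits tightly on a 141 GB H200, and training it with
# Adam would need on the order of 1.1 TB spread across many GPUs.
print(estimate_memory_gb(70, training=False))   # ~140.0 GB
print(estimate_memory_gb(70, training=True))    # ~1120.0 GB
```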

Multi-Instance GPUs (MIG) and Fractional GPUs:

Newer NVIDIA data center GPUs support Multi-Instance GPU (MIG), allowing a single physical GPU to be partitioned into multiple isolated instances. For example, an NVIDIA A100 or H100 can be split into up to seven GPU instances, each with its own dedicated share of memory and cores. This is extremely useful for inference workloads. If one H100 is more powerful than needed for a single model service, you can slice it into, say, two or more instances. Each instance can run a separate model or serve requests independently, improving overall utilization. In fact, using fractional H100 GPUs can yield equal or better performance compared to using full A100 GPUs, at around 20% lower cost for the same inference workload. This means two H100 MIG instances (each using roughly half the GPU) can each match an A100’s throughput while the total H100 is used more efficiently. Best practice for inference is to leverage MIG or similar technologies to right-size the GPU allocation to the model’s needs – especially when you have large GPUs but many smaller models or microservices to serve.
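
As a minimal illustration of how a service can be pinned to one MIG slice, the sketch below sets CUDA_VISIBLE_DEVICES to a MIG instance UUID before the CUDA runtime initializes. The UUID shown is a placeholder; real MIG UUIDs are listed by `nvidia-smi -L` on a MIG-enabled GPU.

```python
# Pin this inference worker to a single MIG slice of a larger GPU.
# Replace the placeholder UUID with one reported by `nvidia-smi -L`.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-00000000-0000-0000-0000-000000000000"

import torch  # import after setting the env var so CUDA only sees this slice

if torch.cuda.is_available():
    device = torch.device("cuda:0")            # the single visible MIG instance
    print(torch.cuda.get_device_name(device))
    # model = load_model().to(device)          # each worker gets an isolated slice
else:
    print("No visible MIG device - check the UUID and MIG configuration.")
```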

Mixing GPU Types in Training Jobs:

One common question is whether you should mix different GPU models in a single distributed training job (for example, some newer and some older GPUs working together on one training run). In general, avoid mixing GPUs within one training job if possible. The reason is that training processes synchronize at iteration boundaries; if one GPU is slower (older generation or less memory bandwidth), it will bottleneck the others (the fast GPUs have to wait). It’s better to run a training job on homogeneous GPUs for consistency in performance. If you have a heterogeneous cluster, schedule each training job to a homogeneous set of GPUs. Research on GPU cluster schedulers also notes that different models have varying performance on different GPU types, and that choosing the “right” GPU for each job is important to maximize throughput per dollar. In practice, that means your cluster management should be heterogeneity-aware – matching each training job to the most appropriate GPU type (considering both performance and cost).
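
The sketch below illustrates the idea of heterogeneity-aware placement with a toy "throughput per dollar" rule. The GpuPool class, the throughput table, and the prices are all hypothetical placeholders for illustration, not a real scheduler or real benchmark data.

```python
# Toy heterogeneity-aware placement: for each job, pick the homogeneous GPU pool
# that maximizes estimated throughput per dollar. All numbers are illustrative.
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str
    free_gpus: int
    hourly_cost: float   # $ per GPU-hour (hypothetical)

# Hypothetical relative throughput of each job type on each GPU type.
THROUGHPUT = {
    ("llm_pretrain", "B200"): 4.0,
    ("llm_pretrain", "H100"): 1.0,
    ("small_finetune", "B200"): 1.3,
    ("small_finetune", "H100"): 1.0,
}

def best_pool(job_type: str, gpus_needed: int, pools: list[GpuPool]) -> GpuPool | None:
    # Never split a job across pools: choose one homogeneous pool with enough
    # free GPUs, ranked by throughput per dollar.
    candidates = [p for p in pools
                  if p.free_gpus >= gpus_needed and (job_type, p.name) in THROUGHPUT]
    if not candidates:
        return None
    return max(candidates, key=lambda p: THROUGHPUT[(job_type, p.name)] / p.hourly_cost)

pools = [GpuPool("B200", free_gpus=8, hourly_cost=6.0),
         GpuPool("H100", free_gpus=16, hourly_cost=2.5)]
print(best_pool("small_finetune", 2, pools).name)   # H100: better $/throughput here
print(best_pool("llm_pretrain", 8, pools).name)     # B200: wins despite higher price
```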

Alternative Accelerators:

GPUs are the dominant compute engine for AI, but a heterogeneous approach can also include CPUs or specialized accelerators for certain tasks. For example, some inference workloads (especially classical ML or smaller models) might run sufficiently on CPU if GPUs are scarce, and some training could be offloaded to TPUs or custom ASICs if available. The key is to evaluate the performance needs: use GPUs for what they’re best at (massively parallel computations), and don’t waste a top-tier GPU on a job that could run on a cheaper resource without impacting SLAs. In a cloud context, this might mean offering a range of instance types (CPU-only, small GPU, large GPU, etc.) and using the right tool for each job.
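
A minimal sketch of this "right tool for the job" dispatch logic follows; the thresholds for what counts as a small model or a relaxed latency budget are purely hypothetical and would come from your own benchmarking in practice.

```python
# Route work to the cheapest resource that meets its needs. Thresholds are
# illustrative placeholders, not recommendations.
import torch

def pick_device(n_params: int, latency_budget_ms: float) -> torch.device:
    small_model = n_params < 50_000_000         # e.g. classical ML / tiny networks
    relaxed_latency = latency_budget_ms > 200   # batch or offline-style requests
    if small_model and relaxed_latency:
        return torch.device("cpu")              # don't burn a top-tier GPU on this
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")                  # graceful fallback when GPUs are scarce

print(pick_device(n_params=5_000_000, latency_budget_ms=500))  # cpu
```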

STORAGE LAYER: FEEDING DATA AT HIGH SPEED

Fast compute is useless without fast data. AI training in particular is data intensive, reading huge datasets (images, text, etc.) continuously. If the storage system can’t keep up with the GPUs’ data consumption rate, the GPUs will stall waiting for data. In one case study from Weka, Stability AI saw their GPU utilization jump from 30% to 93% simply by switching to a high-performance storage solution. In other words, their expensive GPUs were idle 70% of the time due to I/O bottlenecks until the storage was optimized! Here’s how to ensure your storage keeps your GPUs busy:

  • High-Throughput Distributed Storage:
    Use storage systems that are proven to deliver very high read throughput and low latency. Modern AI-optimized file systems or object storage like WEKA or VAST Data can sustain tens of gigabytes per second of throughput to many clients, and handle the small random reads typical of AI training (reading many small files or patches). Traditional network file systems might not scale; for example, some legacy NAS systems struggle with the concurrent access patterns of AI, leading to bottlenecks. By contrast, specialized AI storage platforms eliminate many of those bottlenecks and can feed data fast enough to saturate dozens of GPUs. In MLPerf Storage benchmarks, a single client with an optimized storage system could keep 90+% utilization on up to 74 H100 GPUs with over 13 GB/s throughput in one test, highlighting how strong storage performance enables large-scale training.
  • Local NVMe Caching:
    If you use a cloud environment or a cluster with shared network storage, consider caching hot data on local NVMe SSDs where possible. For example, when a training job starts, it might stage a portion of the dataset to a local SSD scratch space on the GPU server (see the staging sketch after this list). Local NVMe can offer extremely high IOPS and throughput, reducing dependency on network storage for repeated reads. This is especially useful if your workload tends to reuse a subset of data or if the cluster is multi-tenant (to avoid all jobs hammering the central storage simultaneously).
  • GPUDirect Storage (GDS):
    Modern NVIDIA GPUs support GPUDirect Storage, which allows the GPUs to directly DMA data from storage (or network) into GPU memory, bypassing the CPU. If your storage and network support it, enabling GDS can cut down latency and CPU overhead for data feeding. This ensures data flows straight into the GPU’s memory as fast as possible. In practice, this means less CPU bottleneck on input pipelines and more consistent throughput, which helps keep those utilization numbers high.
  • For Inference:
    Storage is usually less of a bottleneck in inference than in training, but it still matters. Inference services often need to load trained models (which can be gigabytes in size) from storage into memory. A well-architected inference system will keep frequently-used models cached in memory or on local disk of the serving nodes to avoid loading from a slow central store for each request. If you have hundreds of models (think of a scenario like running many different customer models on a shared platform), using a fast distributed storage and perhaps an SSD cache on each node will ensure new models load quickly when needed. Also, if your inference deals with data inputs like images or videos coming from storage, similar rules to training apply – use fast data paths to prevent request latency from spiking due to file I/O.
  • Throughput vs. Capacity Planning:
    AI datasets are enormous, so storage architecture often needs to balance sheer capacity with performance. A best practice is to use a tiered approach: a tier of ultra-fast storage (NVMe-based NAS or parallel file system) for active datasets, and a capacity tier (like cloud object storage or cheaper disks) for colder data or archives. The active tier should be sized to the working set of your current training jobs. Monitor your jobs’ I/O patterns – if your GPUs aren’t near 100% utilization and you see IO wait times, it’s a sign your storage is a bottleneck and you may need to scale up throughput (add more storage nodes, enable better caching, etc.).
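
Here is a minimal staging sketch for the local NVMe caching pattern described above. The shared and scratch paths are placeholders, and the commented DataLoader lines are indicative only; adapt them to your dataset format and framework.

```python
# Stage a dataset shard from shared network storage onto local NVMe scratch
# before training, so repeated epochs read from fast local disk instead of
# the network filesystem. Paths are hypothetical placeholders.
import shutil
from pathlib import Path

SHARED = Path("/mnt/shared-datasets/imagenet-shard-000")   # network filesystem
SCRATCH = Path("/nvme/scratch/imagenet-shard-000")         # local NVMe on the GPU node

def stage_to_nvme(src: Path, dst: Path) -> Path:
    """Copy the shard once; skip the copy if a previous job already staged it."""
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst

local_root = stage_to_nvme(SHARED, SCRATCH)

# Then point your dataset/dataloader at the local copy, e.g. with PyTorch:
# dataset = torchvision.datasets.ImageFolder(local_root, transform=...)
# loader = torch.utils.data.DataLoader(dataset, batch_size=256,
#                                      num_workers=8, pin_memory=True)
```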

NETWORKING: HIGH-SPEED FABRIC FOR DISTRIBUTED WORKLOADS

In a heterogeneous AI cloud, network connectivity ties everything together – connecting GPUs to each other (for multi-GPU training), GPUs to storage, and inference servers to clients. Network design can make or break performance at scale:

Intra-Node Interconnects (NVLink and NVSwitch):

Within a server, if you have multiple GPUs, the presence of NVLink or NVSwitch can dramatically speed up multi-GPU training. NVLink is NVIDIA’s high-bandwidth interconnect; for example, an H100 GPU in an NVLink-connected system can achieve up to 900 GB/s of peer-to-peer bandwidth, versus roughly 64 GB/s over a standard PCIe interface. NVIDIA’s newer H200 GPUs continue this trend with fourth-generation NVLink, providing similarly high GPU-to-GPU communication bandwidth (around 900 GB/s as well). This difference means that GPUs can exchange data (such as model parameters or gradients) almost an order of magnitude faster than they could over PCIe. For distributed training of a single model across, say, 8 GPUs in one node, having NVLink/NVSwitch connectivity (as found in NVIDIA’s HGX baseboards) is critical – it drastically reduces communication overhead and improves scaling efficiency. When purchasing or designing GPU servers for training, prioritize systems that include these high-speed GPU interconnects, especially if you plan to use multiple GPUs in one job.
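
A quick way to see what your intra-node fabric actually delivers is a rough peer-to-peer copy benchmark like the sketch below. It assumes at least two CUDA GPUs are visible, and the number it prints is indicative only (not a rigorous measurement); on an NVLink/NVSwitch system you should see hundreds of GB/s, while PCIe tops out in the tens.

```python
# Rough GPU-to-GPU copy benchmark between two GPUs in the same server.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

size_gb = 1
x = torch.empty(int(size_gb * 1024**3 // 4), dtype=torch.float32, device="cuda:0")
y = torch.empty_like(x, device="cuda:1")

y.copy_(x)                                            # warm-up copy
torch.cuda.synchronize("cuda:0"); torch.cuda.synchronize("cuda:1")

iters = 10
start = time.perf_counter()
for _ in range(iters):
    y.copy_(x)                                        # device-to-device copy (NVLink or PCIe)
torch.cuda.synchronize("cuda:0"); torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - start

print(f"~{size_gb * iters / elapsed:.1f} GB/s effective GPU-to-GPU copy bandwidth")
```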

Cluster Networking (Ethernet vs InfiniBand):

For scaling across multiple servers (nodes), the network between servers becomes the limiter. Traditional 10 or 25 GbE (Gigabit Ethernet) networks will likely be insufficient for heavy distributed training across nodes, because gradient exchanges or parameter server communications can saturate those links and slow down training steps. High-performance options include 100 GbE or 200 GbE networking, and InfiniBand (which offers low latency and high bandwidth, commonly 100 Gbps and up with RDMA capabilities). Newer solutions, like NVIDIA’s Spectrum-X 400Gbps Ethernet fabrics, aim to approach InfiniBand-like performance on Ethernet and can support extremely high throughput clusters. The key is to ensure each GPU node has as much network bandwidth as it has data to send/receive. A good rule of thumb: match the network to the aggregate data-exchange bandwidth needed by all GPUs participating in a single training job. For example, if you have 8 GPUs that each need to exchange 2 GB of data per second during training, your network backbone per node should comfortably handle >16 GB/s (~128 Gbps) to avoid slowing the GPUs. Many AI cloud providers (including specialized ones like WhiteFiber) use advanced Ethernet mesh topologies or InfiniBand to deliver on the order of terabits of networking per node to support massive scale training.
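
The arithmetic behind that rule of thumb, as a tiny script you can adapt; the inputs shown are the same illustrative numbers used in the paragraph above, so substitute your own per-GPU exchange rate.

```python
# Back-of-envelope network sizing from the rule of thumb above (illustrative inputs).
gpus_per_node = 8
gb_per_gpu_per_s = 2                                   # GB/s each GPU must exchange

node_gbytes_per_s = gpus_per_node * gb_per_gpu_per_s   # 16 GB/s
node_gbits_per_s = node_gbytes_per_s * 8               # 128 Gbps

print(f"Need >= {node_gbits_per_s} Gbps of node bandwidth; "
      f"a single 100 GbE link would already be a bottleneck.")
```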

Network for Storage and Inference:

The same fast network is also what carries data from storage to GPUs (unless storage is local). So, a weak link in networking will affect data feeding as well. Ensure that your storage servers and GPU servers are connected via high speed links (100 Gbps+ recommended for serious AI workloads). For inference, network considerations include the client-to-server path: if you deploy GPU inference across many nodes, you might use a load balancer or API gateway – make sure your network can handle the request throughput without adding too much latency. Within the inference cluster, if microservices or ensembles of models communicate, low-latency networking helps maintain snappy responses.

Topology and Placement:

In a heterogeneous setup, you might have different clusters or segments of the network. For instance, you might isolate a high-speed training cluster on its own InfiniBand network, separate from a general-purpose Ethernet network for inference and other services. This separation can ensure that training traffic (which can be extremely heavy) doesn’t interfere with user-facing inference traffic. If using a unified network, consider quality of service (QoS) settings or software-defined networking to prioritize latency-sensitive flows (inference queries) over bulk flows (synchronizing training data). Keeping training GPUs “close” to each other (for example, scheduling jobs so that GPUs are on the same switch or same rack) will also improve performance, as opposed to having GPUs communicating across long network paths. Many cluster schedulers try to collocate GPUs for each job to minimize network hops, which is a good practice to adopt.

USE CASES AND EXAMPLE SCENARIOS

To cement these ideas, let’s consider a couple of example scenarios and how a heterogeneous approach applies:

  • Use Case 1: Training a Large Language Model (LLM).
    This is a training workload that might use dozens or hundreds of GPUs. The best practice here is to use the most powerful GPUs available (like H200 or B200 with NVLink) because training time scales roughly with compute power. We would design a cluster (or choose a cloud instance type) that has multiple GPUs per node with NVSwitch for intra-node speed, and InfiniBand or 200+ Gbps Ethernet between nodes. Storage needs to be very high throughput – potentially using a parallel file system that can deliver 10+ GB/s per node to keep data flowing. We would avoid mixing different GPU models; all GPUs in this training run should be identical for consistency. Network overhead should be minimized by placing the GPUs in the same cluster/rack. In orchestration, this would be scheduled as a high-priority job, getting exclusive access to a chunk of the cluster. All these choices ensure that nothing impedes the GPUs from running at full tilt (a minimal distributed-training sketch follows this list).
  • Use Case 2: Real-Time Inference for a Chatbot Service.
    Here we have an inference workload that serves many user queries per second with low latency. Suppose the model is moderately large (a few billion parameters). We might choose to deploy this on H100 GPUs initially for cost reasons, but with optimization (like quantization to lower precision) it might even run on smaller GPUs, or on CPUs for some portion of the traffic. However, if latency is critical, using an H200 could cut response time because it can process more tokens/sec and handle more concurrent requests. We might use MIG to split a single GPU and serve two instances of the model if one instance doesn’t fully utilize the GPU (this increases overall throughput per GPU). The design would include load balancing to route requests to multiple GPU servers. Each server should have the model loaded in memory; thus, if we scale to N replicas, ensure you have a fast way to load the model weights initially (perhaps from an SSD or a fast network store). Networking matters mostly for carrying request/response traffic, which is usually smaller than training sync traffic, but still ensure your inference cluster has redundant high-bandwidth links to handle traffic spikes. Orchestration-wise, we keep these inference deployments always running, possibly with auto-scaling policies based on request rate. We’d monitor latency and GPU utilization; if GPUs are underutilized, we could increase concurrency per GPU (serve more parallel requests or models) until we reach a good balance. Using a serving framework like NVIDIA Triton Inference Server can help maximize GPU utilization by batching requests together (a sample client sketch follows this list). The heterogeneous aspect comes in if, for example, we introduce a new GPU type: say we add B200 nodes gradually for the heaviest models, or use cloud instances with newer GPUs when on-prem ones are busy. We must ensure the software (and possibly the model quantization) is compatible across both. Over time, we might phase out older GPUs as load grows, or repurpose them for other, lighter services.
  • Use Case 3: Mixed Workload AI Lab.
    Consider a smaller-scale environment like an enterprise AI lab that runs many experiments: some jobs are training small models on 1-2 GPUs, others are doing hyperparameter tuning across several GPUs, and some services are hosting models for internal demos. In a heterogeneous setup, this lab could have a few high-end GPU servers and some mid-range ones. The trick is to use them efficiently: schedule the heavy experiments on the high-end GPUs (they finish faster and free up resources), and run smaller training runs or notebook sessions on the lower-end GPUs, which are perfectly sufficient for those tasks. Inference demos could run on fractional GPUs or on an older GPU if latency isn’t critical. This kind of environment benefits greatly from a flexible scheduler that can queue jobs and assign them to whichever GPU type is free and suitable. For instance, if the top GPUs are busy, a less urgent training job could run on a slower GPU and still get done in time. By monitoring job completion times and utilization, the lab can identify whether it needs more of one kind of GPU or whether some GPUs are underutilized. The heterogeneous approach here ensures no GPU sits idle while a queue waits – there’s always a way to use any available compute for something that fits its capability.
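
For Use Case 1, here is a minimal sketch of a homogeneous multi-GPU training launch with PyTorch DDP over NCCL (which uses NVLink within a node and InfiniBand/RoCE between nodes where available). The model, data, and launch arguments are placeholders, and the sketch assumes a launcher such as `torchrun --nproc_per_node=8 --nnodes=4 ... train.py` sets LOCAL_RANK and the rendezvous environment variables.

```python
# train.py - placeholder distributed training loop for identical GPUs.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL picks NVLink/IB paths automatically
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()     # stand-in for the real LLM
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):                         # stand-in training loop with random data
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()                            # gradient all-reduce happens here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```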
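
For Use Case 2, here is a hypothetical client call against a Triton Inference Server instance that has dynamic batching enabled, so concurrent requests get batched on the GPU. It assumes a server reachable at localhost:8000; the model name (chatbot_llm), tensor names (input_ids, logits), and token values are placeholders that must match your actual model configuration.

```python
# Hypothetical Triton HTTP client for a chatbot-style model.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)  # fake tokens
inp = httpclient.InferInput("input_ids", token_ids.shape, "INT64")
inp.set_data_from_numpy(token_ids)

# The server batches this request with others arriving in the same window.
result = client.infer(model_name="chatbot_llm", inputs=[inp])
logits = result.as_numpy("logits")
print(logits.shape)
```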

KEY TAKEAWAYS AND BEST PRACTICES

Designing a heterogeneous AI cloud requires a holistic view of compute, storage, and networking, tailored to both training and inference. Here is a summary of key best practices:

  • Match Workload to GPU Type:
    Use the latest GPUs (e.g. B200) for the most demanding training jobs or high-throughput inference needs, since they offer significantly higher performance (up to 2× faster inference than the previous generation). Deploy less intensive jobs on older or smaller GPUs to save costs, and leverage partitioning (MIG) to run multiple inference tasks on one physical GPU when appropriate.
  • Ensure Sufficient GPU Memory:
    Choose GPUs with enough memory for your models to avoid memory bottlenecks. It’s often better to use one large-memory GPU than two smaller ones that split a model, to reduce communication overhead.
  • Maximize Data Throughput:
    Invest in high-speed storage (NVMe-based distributed filesystems or similar) so your GPUs are constantly fed with data. Slow storage can leave expensive GPUs underutilized (e.g. only 30% busy) whereas optimized storage can drive utilization to 90%+. Monitor I/O and consider techniques like local caching and GPUDirect Storage to minimize data wait times.
  • Use High-Bandwidth Interconnects:
    When training across multiple GPUs, especially in the same server, technologies like NVLink and NVSwitch are essential. They provide an order of magnitude more bandwidth than PCIe (600-900 GB/s vs 64-128 GB/s), enabling near-linear scaling. Similarly, use cluster networking (InfiniBand or high-speed Ethernet with RDMA) that can handle the aggregate data exchange of distributed training.
  • Separate and Optimize Workloads:
    Acknowledge the differences between training and inference. Consider isolating training jobs from inference services (via dedicated resources or scheduling policies) to meet their distinct needs (throughput vs. latency). Tune your cluster scheduler or Kubernetes setup to be aware of GPU heterogeneity – e.g., schedule jobs on the most cost-effective resource that meets the performance requirement, and avoid mixing vastly different GPU speeds in one job.
  • Leverage Elasticity:
    In a cloud context, scale resources up or down based on demand. Spin up additional GPU instances for inference during peak usage, and run training jobs in off-peak times if possible. This can be orchestrated automatically with the right tools, ensuring you pay only for what you need while maintaining performance.
  • Monitor and Iterate:
    Implement strong monitoring for GPU utilization, job latency, and throughput at each layer. Use these metrics to identify bottlenecks (e.g., if GPUs are at 50% utilization, is it due to data input, network lag, or something else?). Continuously refine your architecture - for example, adding more storage nodes, upgrading a network switch, or re-balancing which workloads go to which GPUs - based on real data.
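
As a starting point for that kind of monitoring, here is a small probe using NVIDIA's NVML Python bindings (the pynvml package). The 50% threshold for flagging a GPU as underutilized is an arbitrary illustrative value; in practice you would feed these metrics into your existing dashboards rather than print them.

```python
# Quick utilization probe: poll NVML and flag GPUs that look starved.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        print(f"GPU {i} ({name}): {util.gpu}% SM, {mem.used / mem.total:.0%} memory in use")
        if util.gpu < 50:
            print("  -> underutilized: check data pipeline, network, or batch size")
finally:
    pynvml.nvmlShutdown()
```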

By thoughtfully combining different GPUs, storage solutions, and networks, you can architect an AI cloud that delivers optimal performance for every workload. The heterogeneous approach is all about using the right tool for the job: whether it’s a powerhouse GPU to crush a training job in record time, or splitting a big GPU into fractions to serve many inference queries economically. With these best practices, you can design a flexible infrastructure that meets broad AI needs – from rapid model development to reliable production deployment – all while keeping efficiency and scalability in focus.