Last updated: January 5, 2026

What Are the Key Components of AI Infrastructure?

AI workloads have different infrastructure requirements than traditional enterprise applications. Training large language models or running inference at scale creates demands on compute, storage, networking, and power that differ from serving web requests or processing transactions. This article examines the primary components of AI infrastructure and their technical characteristics.

Compute layer

The compute layer forms the foundation of AI infrastructure. Modern AI workloads rely on GPUs rather than CPUs for the parallel processing required by neural network training and inference. WhiteFiber deploys NVIDIA H100, H200, and B200 GPUs in its infrastructure, with GB200 Superchips available for workloads requiring higher throughput.

Different GPU configurations come with different specifications. An 8-GPU H100 system delivers roughly 32 petaFLOPS of FP8 AI performance. The H200 offers 2X faster networking capability. The GB200 architecture combines Grace CPUs with Blackwell GPUs to deliver 1.8 TB/s of GPU-to-GPU NVLink bandwidth and 72 petaFLOPS for training workloads. GPU memory bandwidth and interconnect speed determine how quickly models train and how many tokens per second inference can serve.
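
As a rough illustration of how these figures translate into training time, the back-of-the-envelope sketch below applies the common 6 × parameters × tokens approximation for training FLOPs. The model size, token count, GPU count, and sustained-throughput figures are illustrative assumptions, not WhiteFiber specifications.

```python
# Back-of-the-envelope training time estimate (illustrative numbers only).
# Uses the common approximation: training FLOPs ~= 6 * parameters * tokens.

params = 70e9            # assumed model size: 70B parameters
tokens = 2e12            # assumed training set: 2T tokens
total_flops = 6 * params * tokens

# Assumed sustained throughput per GPU: real training rarely hits peak FLOPS,
# so a fraction of a ~1 petaFLOPS-class GPU is used as a planning figure.
sustained_flops_per_gpu = 0.4 * 1e15
num_gpus = 512

seconds = total_flops / (sustained_flops_per_gpu * num_gpus)
print(f"Estimated training time: {seconds / 86400:.1f} days on {num_gpus} GPUs")
```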

CPU compute remains relevant in AI infrastructure. Data preprocessing, orchestration tasks, and certain inference workloads run on CPU nodes. Infrastructure typically includes GPU clusters for training and inference alongside CPU capacity for these supporting tasks.
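
A minimal PyTorch sketch of this division of labor is shown below: CPU worker processes decode and augment data while the GPU runs the training step. The dataset path, transforms, and worker counts are placeholders rather than a recommended configuration.

```python
# Minimal sketch: CPU workers handle decoding and augmentation while the GPU
# runs forward/backward passes. Paths and parameters are placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("/data/train", transform=preprocess)

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=16,      # CPU processes doing decode + augmentation
    pin_memory=True,     # page-locked host memory speeds host-to-GPU copies
    prefetch_factor=4,   # each worker keeps a few batches staged ahead
)

device = torch.device("cuda")
for images, labels in loader:
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass runs on the GPU here ...
    break
```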

Storage architecture

Storage represents a common bottleneck in AI infrastructure. GPU clusters can process data faster than many storage systems can deliver it. Training a large model requires reading billions of parameters and feeding continuous streams of training data to dozens or hundreds of GPUs simultaneously.

AI-optimized storage systems deliver data at rates that match GPU consumption. WhiteFiber's storage stack includes three systems for different access patterns:

VAST Data: Provides low-latency access for datasets requiring random reads, such as computer vision training with millions of individual image files

WEKA: Delivers parallel file system performance for workloads needing consistent high throughput across many nodes

CEPH: Offers distributed object storage for large datasets that can tolerate slightly higher latency

The storage infrastructure delivers 40 GBps of read performance per node, scaling to 500 GBps for multi-node systems. GPUDirect RDMA enables direct memory transfers from storage to GPU memory, bypassing the CPU, which matters most when datasets exceed local cache capacity.
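
As a quick feasibility check, per-node read bandwidth can be compared against the aggregate rate at which a node's GPUs consume training data. The sketch below pairs a hypothetical per-GPU consumption rate with the 40 GBps per-node figure cited above.

```python
# Rough feasibility check: does per-node storage read bandwidth keep the GPUs fed?
# The per-GPU consumption rate is an assumption and varies widely by workload.

gpus_per_node = 8
per_gpu_consumption_gbps = 3.0      # assumed GB/s (gigabytes per second) of training data per GPU
node_read_bandwidth_gbps = 40.0     # per-node read figure cited above

required = gpus_per_node * per_gpu_consumption_gbps
headroom = node_read_bandwidth_gbps / required

print(f"Required: {required:.0f} GB/s, available: {node_read_bandwidth_gbps:.0f} GB/s "
      f"({headroom:.1f}x headroom)")
```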

The storage layer also handles checkpointing. Training runs that take days or weeks save model state regularly. Write performance (20 GBps per node) determines how much time checkpointing requires.
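
A rough estimate of checkpoint time follows the same arithmetic: total checkpoint size divided by aggregate write bandwidth. The model size, per-parameter training state, and node count below are assumptions for illustration.

```python
# Rough checkpoint-time estimate using the 20 GBps per-node write figure above.
# Model size and per-parameter training state are illustrative assumptions.

params = 70e9                 # assumed 70B-parameter model
bytes_per_param = 14          # rough: BF16 weights + FP32 master copy + Adam moments
write_gbps_per_node = 20.0    # write figure cited above
nodes_writing = 8             # assumed: checkpoint sharded across 8 nodes

checkpoint_gb = params * bytes_per_param / 1e9
seconds = checkpoint_gb / (write_gbps_per_node * nodes_writing)
print(f"~{checkpoint_gb:.0f} GB checkpoint, ~{seconds:.0f} s to write across {nodes_writing} nodes")
```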

Network fabric

Network architecture affects whether GPU clusters can scale efficiently. When training a model across multiple nodes, GPUs exchange gradient updates and synchronize state. Network latency and bandwidth affect training speed. A model that trains in 8 hours on one network might take 12 hours on another.

Two networking technologies are common in AI infrastructure. InfiniBand has traditionally served HPC workloads, offering latency around 5 microseconds and high bandwidth. Ethernet has evolved to compete with InfiniBand through technologies like RoCEv2 and specialized fabrics. WhiteFiber's networking infrastructure uses DriveNets Network Cloud-AI, which delivers latency of around 7 microseconds over Ethernet along with multi-tenant isolation rated at 95%.

The network fabric handles specific communication patterns. All-reduce operations, where every GPU receives aggregated data from all other GPUs, create bandwidth demands. A 3.2 TB/s interconnect fabric can support this communication pattern across large clusters.
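
A minimal sketch of the all-reduce pattern using PyTorch's distributed package over NCCL is shown below. It assumes a standard torchrun launch, and the tensor stands in for a gradient shard; it illustrates the communication primitive rather than a production training loop.

```python
# Minimal all-reduce sketch with torch.distributed over NCCL.
# Assumes launch via `torchrun --nproc_per_node=8 allreduce_demo.py`,
# which sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL runs over the GPU fabric
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient shard: every rank contributes its own values,
    # and all_reduce leaves the elementwise sum on every rank.
    grad = torch.full((1024, 1024), float(dist.get_rank()), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print(f"Sum across {dist.get_world_size()} ranks:", grad[0, 0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```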

Network topology affects performance characteristics. A spine-and-leaf architecture provides predictable latency between any two nodes. Fat-tree topologies can deliver higher total bandwidth but may introduce variable latency depending on which nodes communicate.

Data center foundations

The physical infrastructure supporting AI compute differs from traditional data center design. GPU clusters consume far more power per rack than typical server deployments. An H100 GPU can draw 700W, so an 8-GPU node draws roughly 10kW once CPUs, memory, and networking are included, and a rack holding several such nodes approaches 30-50kW of power draw. Traditional data centers typically provision 5-10kW per rack.
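
The arithmetic behind these rack-level figures can be laid out explicitly; the node overhead and nodes-per-rack values below are assumptions chosen only to illustrate how the range cited above arises.

```python
# Rack power budget sketch: illustrative numbers, not a specific facility design.

gpu_tdp_w = 700          # H100 SXM TDP cited above
gpus_per_node = 8
node_overhead_w = 3000   # assumed CPUs, memory, NICs, and fans per node
nodes_per_rack = 4       # assumed rack layout

node_w = gpus_per_node * gpu_tdp_w + node_overhead_w
rack_kw = nodes_per_rack * node_w / 1000
print(f"~{node_w / 1000:.1f} kW per node, ~{rack_kw:.0f} kW per rack")
```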

This power density requirement relates to several infrastructure characteristics:

Cooling systems

Air cooling has limitations with 30kW+ racks. Direct liquid cooling, where coolant circulates through cold plates attached to GPUs, removes heat while consuming less energy. WhiteFiber's data centers use direct-to-chip liquid cooling for high-density deployments.

Power distribution

Data centers hosting AI infrastructure have access to megawatt-scale power supplies. WhiteFiber's facilities support up to 50kW per cabinet with room-scale deployments reaching 24MW and beyond. Power redundancy through 2N electrical architecture maintains uptime during maintenance or failures.

Physical space

High-density racks require different spacing for coolant distribution and maintenance access than traditional server deployments. Retrofitting existing facilities for AI workloads involves mechanical and electrical upgrades.

Connectivity and backbone

AI systems require connectivity to the broader internet. Training data often comes from distributed sources. Inference services respond to API requests. Model weights and checkpoints may transfer between data centers or to customer systems.

Multiple redundant Tier-1 ISP connections provide several benefits. If one provider experiences an outage or routing issues, traffic can fail over to another. Peering agreements with multiple carriers can reduce latency to end users by selecting optimal routes. For global deployments, choosing ISPs with presence in target regions shapes the connectivity end users experience.

Cross-data-center connectivity enables hybrid deployments. Dark fiber connections between facilities allow workloads to span multiple sites. This supports scenarios like bursting from on-premises clusters to cloud capacity, or distributing training across geographically separated GPUs while maintaining communication latency within acceptable ranges.

Orchestration and management

Running AI workloads at scale uses software to manage cluster resources, schedule jobs, and monitor performance. Kubernetes has become common for containerized AI workloads, though bare-metal GPU provisioning often uses specialized tools.

Job scheduling systems account for GPU topology and communication patterns. A training job requiring 64 GPUs runs faster when those GPUs are physically close and share the same network fabric segment. Schedulers that scatter a job across whichever GPUs happen to be free can slow it down by forcing synchronization traffic across fabric segments.
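
A toy sketch of topology-aware placement is shown below: given free GPUs grouped by fabric segment, it prefers filling a single segment before spilling across segments. The segment names, capacities, and job sizes are hypothetical; production schedulers such as Slurm or Kubernetes with topology-aware plugins encode this logic in their own configuration.

```python
# Toy topology-aware placement: prefer GPUs that share a fabric segment.
# Segment names, capacities, and job sizes below are hypothetical.

free_gpus_by_segment = {
    "leaf-01": 48,
    "leaf-02": 24,
    "leaf-03": 64,
}

def place_job(gpus_needed, free_by_segment):
    # First choice: a single segment that can hold the whole job
    # (the smallest such segment, leaving large blocks free for bigger jobs).
    fits = [(free, seg) for seg, free in free_by_segment.items() if free >= gpus_needed]
    if fits:
        _, segment = min(fits)
        return {segment: gpus_needed}
    # Fallback: spill across segments, largest first, accepting slower all-reduce.
    placement, remaining = {}, gpus_needed
    for segment, free in sorted(free_by_segment.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take:
            placement[segment] = take
            remaining -= take
        if remaining == 0:
            break
    return placement

print(place_job(64, free_gpus_by_segment))   # -> {'leaf-03': 64}
print(place_job(96, free_gpus_by_segment))   # spills across two segments
```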

Monitoring and observability have particular relevance in AI infrastructure. GPU utilization during training indicates whether bottlenecks exist in data loading, network communication, or software efficiency. Storage throughput monitoring shows when dataset access patterns exceed available bandwidth. Network performance metrics indicate whether collective communication operations complete within expected timeframes.
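
A small sketch of per-GPU utilization sampling using NVIDIA's NVML Python bindings is shown below. The sampling loop is deliberately simple; production deployments typically export these metrics through DCGM or a Prometheus exporter rather than an ad-hoc script.

```python
# Ad-hoc GPU utilization sampler using NVIDIA's NVML bindings
# (the nvidia-ml-py package, imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    for _ in range(10):                      # sample for roughly 10 seconds
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            # Sustained low compute utilization during training usually points
            # to a data-loading, network, or CPU preprocessing bottleneck.
            print(f"GPU{i}: compute {util.gpu}%  mem-bw {util.memory}%  "
                  f"mem {mem.used / 2**30:.1f} GiB used")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```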

Integration across layers

AI infrastructure functions as a system. Fast GPUs operate within the constraints of storage speed. High-bandwidth networking requires storage that can feed data at matching rates. Power and cooling set limits on cluster density regardless of available rack space.

WhiteFiber's vertically integrated approach addresses this by controlling the full stack from data center facilities through network fabric to storage architecture. This integration allows optimization across component boundaries.

Infrastructure decisions span all of these components. A GPU cluster designed for inference has different network and storage characteristics than one built for training. Private AI deployments may prioritize different factors than cloud-based infrastructure. Understanding these components helps in planning infrastructure that matches workload requirements.

Frequently asked questions

What determines whether to use InfiniBand or Ethernet for AI workloads?

The choice between InfiniBand and Ethernet depends on workload characteristics and infrastructure requirements. InfiniBand offers latency around 5 microseconds, while modern Ethernet implementations like DriveNets Network Cloud-AI achieve around 7 microseconds. For multi-tenant environments, Ethernet provides stronger isolation (95% versus roughly 70% for InfiniBand). Training workloads with frequent synchronization across many nodes benefit from lower latency. Inference workloads with less inter-GPU communication may not require the lowest possible latency. Cost also factors in, as Ethernet typically has broader vendor support and lower per-port costs at scale.

How much storage bandwidth does AI training actually require?

Storage bandwidth requirements depend on dataset characteristics and batch sizes. Computer vision models training on high-resolution images can require 10-40 GBps per node to keep GPUs fed. Natural language processing workloads with smaller per-sample data sizes may require less bandwidth but need low-latency random access. Large language model training with context windows of 32K+ tokens needs sustained throughput to load training sequences. A general guideline is that storage bandwidth should match or exceed the aggregate data consumption rate of all GPUs in a node. For an 8-GPU node where each GPU processes 1 GB/s of training data, storage bandwidth should deliver 8+ GBps.

Can existing data centers be retrofitted for AI infrastructure?

Retrofitting depends on available power and cooling capacity. Traditional data centers with 5-10kW per rack can support limited AI deployments using air-cooled GPUs. Higher-density deployments requiring 30-50kW per rack need electrical infrastructure upgrades and liquid cooling systems. The cost of retrofitting varies based on existing capacity. Facilities with spare power capacity and adequate cooling infrastructure can be retrofitted more economically than those requiring new electrical substations or cooling plants. WhiteFiber's approach involves retrofitting industrial properties with appropriate electrical and cooling infrastructure to Tier 3 standards.

What causes GPU utilization to drop below 100% during training?

GPU utilization below 100% indicates bottlenecks elsewhere in the system. Slow data loading from storage causes GPUs to wait for training batches. Network bottlenecks during distributed training make GPUs wait for gradient synchronization. Insufficient CPU preprocessing capacity creates delays in preparing data for GPU consumption. Software inefficiencies like unnecessary CPU-GPU memory transfers reduce utilization. Monitoring tools can identify which bottleneck affects utilization. Storage throughput metrics, network performance data, and CPU utilization help diagnose the constraint.

How does infrastructure for training differ from infrastructure for inference?

Training and inference have different resource patterns. Training requires high GPU-to-GPU bandwidth for gradient synchronization across multiple nodes. Inference typically runs on single GPUs or small clusters with less inter-GPU communication. Training needs high-capacity storage to handle large datasets accessed repeatedly. Inference requires lower storage capacity but benefits from low-latency access to model weights. Training uses batch processing where throughput matters more than individual request latency. Inference serves real-time requests where per-request latency affects user experience. Infrastructure optimized for training may include high-bandwidth network fabrics and parallel file systems. Infrastructure optimized for inference may prioritize redundancy, geographic distribution, and request routing capabilities.