
Last updated: June 2026

Let Cooler Heads Prevail

Thermal density in datacenter and cloud design is covered exhaustively in the press and online, from community and ecosystem impact down to the mechanical details of managing electrical and thermal IO within narrow spec tolerances at gigawatt-plus scale. Yet decision-makers figuring out which systems to rent, buy, or design in this generation of mixed cooling approaches are left with their most common question unanswered: do they join the race to the bottom on GPU/hr pricing for one last hardware iteration with air-cooled (AC), low-density systems, or do they get on the liquid-cooled (LC) train that is clearly leaving the station today and figure out how to amortize, or better yet monetize, the associated capex increase?

Most of the conversations we have with clients about this start at "space and power" but very quickly end up in the same place: the technical merits of thermal regulation extend far beyond physical placement, density, and cost structure, and result in tangible double-digit impacts on what actually happens up-stack.

To level-set a bit: Introl published a great article nearly a year ago delving into the opex concerns and mechanical details of various cooling implementations for medium-density (50kW is unfortunately not much these days) racks. Their research clearly shows the economic breaking points for ROI and the physics constraints involved in dissipating that much heat from so little space using air. 

SuperMicro themselves published their analysis shortly thereafter, helping to connect these low-level concerns with actual customer KPIs: throughput, SLA, operating frequencies, etc. Air-cooled systems under full and consistent load scale back to ~75-80% of clock capacity to manage thermal load, which at first glance and in simplified terms makes a ~4k LC GPU cluster computationally comparable to a ~5k-5.3k AC one. One of the key elements of their analysis was the dynamic voltage and frequency scaling (DVFS) rate within the devices as a measure of "what you pay for" relative to "what you can use."
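The cluster-size comparison above is simple division, and it is worth seeing how sensitive it is to the sustained clock fraction. The sketch below assumes throughput scales linearly with clock and that the liquid-cooled fleet holds 100% of clock; the function name and the `lc_clock_fraction` parameter are illustrative, not taken from either cited analysis.

```python
def equivalent_air_cooled_gpus(lc_gpus: int,
                               ac_clock_fraction: float,
                               lc_clock_fraction: float = 1.0) -> float:
    """Air-cooled GPU count needed to match a liquid-cooled fleet.

    Assumes compute scales linearly with sustained clock (a simplification).
    """
    if not 0.0 < ac_clock_fraction <= 1.0:
        raise ValueError("ac_clock_fraction must be in (0, 1]")
    if not 0.0 < lc_clock_fraction <= 1.0:
        raise ValueError("lc_clock_fraction must be in (0, 1]")
    return lc_gpus * lc_clock_fraction / ac_clock_fraction


if __name__ == "__main__":
    # The two ends of the ~75-80% sustained-clock range quoted in the text.
    for fraction in (0.80, 0.75):
        needed = equivalent_air_cooled_gpus(4000, fraction)
        print(f"AC at {fraction:.0%} clock: ~{needed:,.0f} GPUs to match 4,000 LC")
```

The five-point spread in sustained clock moves the answer by a few hundred GPUs, which is why the text quotes both ~5k and ~5.3k for the same comparison.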

That paper makes some compelling arguments as well, and yet most of the literature out there misses a key point: the computational and data operation rates (FLOPs & IOPs, effectively) of each device are directly proportional to the clock frequencies at which it runs under whatever DVFS scaling is applied. While the core and memory clocks can be managed relative to each other at the device level, and to some degree at the host dataplane level (HGX mezzanine and the like), the distribution of operations within the cluster topology has to accommodate those constraints in the fabric and operational coordination layers of the ecosystem.

Diagram showing GPU cluster fabric with four nodes, highlighting throttled lanes, bottlenecks, backpressure, and congestion control through a RoCE/RDMA switch

In human terms we can view the GPU fabric interfaces of a cluster host as an 8-lane highway (simplified to 4 in the visuals). Much like traffic in LA or other warm, densely populated areas, it only takes one car overheating in one lane to screw up the flow across the entire highway as everyone tries to work their way around the steaming car blinking its hazards and rolling to a stop. Every machine in a 1024-node arrangement is its own 8-lane highway, all of them interconnected and carrying traffic at rates set by the GPUs adjacent to the NIC inside the host and by its peers in the topology. Device-local thermal instability in the cores or memory that handle inbound data and push data outbound therefore creates irregular flows across the entire fabric. For a tangible sense of why this is "not good" for workloads of any sort, picture driving through Boston on a 90F+ day during rush hour to make a scheduled appointment.
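The stalled-car effect can be put in numbers with a deliberately pessimistic sketch. It assumes a fully synchronous step in which every node must finish before the next step starts, with no overlap of compute and communication; real workloads hide some of this, so treat it as a bound rather than a prediction. The function names are illustrative.

```python
def mean_clock(clock_fractions: list[float]) -> float:
    """Average sustained clock across the fleet (what a dashboard might show)."""
    return sum(clock_fractions) / len(clock_fractions)


def lockstep_efficiency(clock_fractions: list[float]) -> float:
    """Fraction of ideal throughput when every step waits for the slowest node.

    Time to finish one unit of work is taken as inversely proportional to
    clock, and the step completes only when the slowest node completes.
    """
    step_times = [1.0 / f for f in clock_fractions]
    return 1.0 / max(step_times)


if __name__ == "__main__":
    # 1024 nodes: all at full clock except one throttled to 75%.
    fleet = [1.0] * 1023 + [0.75]
    print(f"mean clock:          {mean_clock(fleet):.4f}")
    print(f"lock-step efficiency: {lockstep_efficiency(fleet):.4f}")
```

Under these assumptions the fleet average barely moves, while the synchronous step rate drops to that of the single throttled node.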

Diagram illustrating RoCE switch congestion where PFC pauses and throttled lanes cause cluster wide performance impact and stalled nodes

Congestion control mechanisms in GPU data planes (networks, RoCE in our example) ultimately come down to "telling a device not to send" (Priority Flow Control, or PFC) because the fabric cannot hold any more data for ordered egress to a recipient that is running at a lower clock speed and consuming data off its ingress port more slowly: cars on the highway having to slow down and wait for lanes to open up until they can drive around the laggard. Explicit Congestion Notification (ECN) propagates through the fabric and culminates in PFC being emitted at the edge ports facing the host NICs. This produces micro-stalls in transmission across as few data paths as possible, to contain the congestion and clear it as quickly as possible, but within the affected domain it still prevents traffic from flowing. This in turn requires the GPU side of the workload to handle the stall gracefully (taking those cycles to work on something else), or at least without breaking coherence; otherwise the stall can back-propagate through the transmission graph and stall other parts of the cluster IO.

Every cycle of wait, pause, and thermal throttling impacts the time to completion and, in the worst cases, data/product quality if coherence control is not properly enforced and validated under fluctuating thermal conditions. Credit-based networks such as InfiniBand don't have the same congestion control mechanics, but the resulting effect is the same: a "lane" which has data to send is not allowed to emit it because the path to the destination isn't clearing fast enough to deliver another transmission correctly.
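To make the "telling a device not to send" behavior concrete, here is a toy single-link model of pause-based flow control. It is not an implementation of PFC or of InfiniBand credits: the tick-based timing, the `xoff`/`xon` thresholds, and the packet rates are all assumptions chosen only to show how a slower receiver turns into idle time at the sender.

```python
def simulate_pause_flow_control(ticks: int, send_rate: int, drain_rate: int,
                                xoff: int, xon: int) -> tuple[int, int]:
    """Return (paused_ticks, delivered_packets) for one sender/receiver pair.

    Each tick the sender adds `send_rate` packets to the receiver queue unless
    paused, then the receiver drains up to `drain_rate`. The sender is paused
    once the queue reaches `xoff` and resumed once it falls to `xon`.
    """
    queue = 0
    paused = False
    paused_ticks = 0
    delivered = 0
    for _ in range(ticks):
        if paused:
            paused_ticks += 1          # sender has data but may not emit it
        else:
            queue += send_rate
        drained = min(queue, drain_rate)
        queue -= drained
        delivered += drained
        if not paused and queue >= xoff:
            paused = True              # receiver signals "stop sending"
        elif paused and queue <= xon:
            paused = False             # backlog cleared enough to resume
    return paused_ticks, delivered


if __name__ == "__main__":
    # Receiver at full clock keeps up; receiver at 75% clock forces pauses.
    for label, drain in (("full clock", 4), ("75% clock", 3)):
        paused, delivered = simulate_pause_flow_control(
            ticks=100, send_rate=4, drain_rate=drain, xoff=8, xon=2)
        print(f"{label}: sender paused {paused} ticks, {delivered} delivered")
```

In this model the sender's idle share settles near one minus the ratio of drain rate to send rate, which is the lane "with data to send" that cannot emit it.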

Comparison of liquid cooled and air cooled GPU clusters showing equal throughput but lower power consumption, fewer GPUs, and reduced costs for liquid cooling systems

Practically this results in a multiplicative effect on the financial models everyone is trying to derive beyond the "4k of LC at ~100% clock ~= 5.3k of 75% DVFS-scaled AC compute" logic:

  • The ultimate off-taker who will run workloads across the topology, such as training or all-to-all MoE inference, has to understand the potential minima and maxima of their throughput and latency curves relative to (DVFS probability) * (device count) to assess whether the space between them is viable. The more systems and lanes they have, the higher the probability of a thermal event stalling a lane and the higher the frequency of micro-interruptions, even with all else being equal; and the more physical pieces of gear there are to fail and impact SLA for the same computing power, the greater the impact on their bottom line.
  • The reseller/aggregator layer now sees SLA requirements as standard contract terms due to the length of these arrangements, now that speculative builds on older-generation equipment are a distant memory. Terms vary, but the performance/throughput consideration exists, as do standard uptime clauses. Participating in the 'race to the bottom for GPU/hr' becomes a dicey proposition in terms of the risk carried by those holding the paper: will the clawbacks extracted exceed the margin they're trying to make, and what is the risk of catastrophic loss from the corners cut to reduce the price of implementation?
  • The builders/financiers/facilities providers being asked to construct these clusters for the first two groups are faced with the ultimate requirements for support, logistics, and often operation, in which the opex variance grows in the same dimensions as the concerns of the first group. Under-investment in new builds, or efforts to reuse existing capacity which cannot run LC at "margin-increasing density," creates risk to both revenue and assets held for the various members of this group while increasing overall pressure on the supply chain because people keep roasting kit.
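The (DVFS probability) * (device count) relationship in the first bullet can be sketched with basic probability. The model assumes each device throttles independently with the same per-step probability, which real fleets sharing airflow and ambient conditions will violate, and the probability value used here is a placeholder rather than a measured figure.

```python
def expected_throttled(p_device: float, device_count: int) -> float:
    """Expected number of throttled devices in a step: p * N."""
    return p_device * device_count


def p_any_throttle(p_device: float, device_count: int) -> float:
    """Probability that at least one device throttles in a step.

    Assumes independent, identically likely events: 1 - (1 - p)^N.
    """
    return 1.0 - (1.0 - p_device) ** device_count


if __name__ == "__main__":
    p = 1e-4  # placeholder per-device, per-step throttle probability
    # Same nominal compute: 4,000 LC-sized fleet vs the larger AC-sized fleet.
    for count in (4000, 5333):
        print(f"{count} devices: expect {expected_throttled(p, count):.2f} "
              f"throttled, P(at least one) = {p_any_throttle(p, count):.3f}")
```

Even with an identical per-device probability, the larger fleet needed to reach the same compute carries more exposure per step; in practice the air-cooled per-device probability would also be the higher of the two.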

In an industry with so many layers, each of which requires some profit margin to survive, the individual slices of pie are pretty thin. Risk tolerance is still fairly high but it doesn't take much to knock a key player entirely out just due to the sums involved. Entities that have consolidated these layers and operate across boundaries win bigger rewards but also concentrate the aforementioned risk profiles into their portfolios whether they know it or not.

The market is inevitably moving to liquid cooling over the next generation or two, but we have really already been there for two cycles at this point, and the technical data is clear. Thermodynamics define how efficiently electrons move through conductive materials, and the physical load of expansion <-> contraction through heat cycles kills hardware faster than constant-state operation because those materials don't grow and shrink at the same rate. An average ~17% loss of operating efficiency at the individual component level alone should give pause to anyone reasoning about the cost involved. Beyond that, assessing the SLA impact of the complexity added to accommodate that deficit, and the actual workload effects on what should be lock-step clustered operation without uniform thermal control of the fleet, requires understanding how that workload communicates with itself and how laggards and blocking in the datapath affect its ultimate efficacy and performance.

FAQs

Should we buy one more generation of air-cooled GPUs, or move to liquid cooling now?

Move to liquid cooling. The industry has effectively been transitioning for two hardware cycles, and the gap is no longer about rack density; it's about usable compute. Air-cooled fleets throttle to 75–80% of clock under sustained load, so the real decision is how you amortize or monetize the capex increase, not whether to make it.

Most conversations start at "space and power" but end in the same place: thermal regulation drives double-digit impacts further up the stack. Betting on one more air-cooled generation is a race to the bottom on GPU/hr that gets harder to win as SLAs tighten.

Why do air-cooled GPUs deliver less compute than liquid-cooled GPUs if the chips are identical?

Because clock frequency is what you actually pay for, and heat suppresses it. FLOPs and IOPs scale directly with the clock rates that cores and memory sustain under DVFS. Under sustained full load, air-cooled GPUs throttle to roughly 75–80% of clock, so a ~4,000-GPU liquid-cooled cluster at full clock is computationally comparable to about 5,000–5,300 air-cooled GPUs.

That equivalence is before the ~17% average efficiency loss at the individual component level, and before the cluster-level effects of uneven throttling. The silicon is the same; the thermal envelope decides how much of it you actually get to use.

Why does thermal throttling on a few GPU nodes slow down the entire cluster?

Because the fabric has to accommodate the slowest nodes. When a device throttles on core or memory heat, it pulls inbound data off its port more slowly, and congestion control forces connected paths to stall rather than overrun it. One laggard ripples across every interconnected lane like a single stalled car backing up an entire highway.

In RoCE fabrics this shows up as ECN propagating through the network and Priority Flow Control telling devices to stop sending, producing micro-stalls at the host-facing ports. InfiniBand uses credit-based flow control instead, but the net effect is identical: a lane with data to send can't emit it. Every pause adds to time-to-completion, and weak coherence enforcement under fluctuating thermals can degrade output quality too.

Why is GPU cooling an SLA and financial risk, not just an engineering decision?

Because the cost of throttling is multiplicative, not additive, and it lands on every layer of the stack. Thermal-event probability rises with device count, so more air-cooled gear means more SLA exposure for the same compute. With long contracts now carrying standard performance and uptime clauses, racing to the bottom on GPU/hr risks clawbacks that exceed margin.

Off-takers running training or all-to-all MoE inference have to model throughput against DVFS probability multiplied by device count. Resellers and aggregators hold the paper on those SLAs. Builders and financiers carry the opex variance and asset risk of kit that degrades faster under thermal cycling. Each layer's margin is thin, and whoever consolidates them concentrates all of that risk into one portfolio.