
As organizations push more workloads into inference and AI-driven applications, compute efficiency is moving to the top of every buyer’s checklist.
We sat down with Cirrascale CEO Dave Driggers to walk through the practical yardsticks they use when evaluating performance, the trade-offs behind scheduling and accelerator selection, and the engineering choices that sustain efficiency even at high rack densities.
Check out the Q&A below to learn how Cirrascale defines compute efficiency in business terms, what levers they pull to optimize GPU utilization, how storage tiers and data movement policies keep costs predictable, and the apples-to-apples tests Driggers recommends for validating provider claims.
Q1: How does Cirrascale define and measure compute efficiency in business terms?
A1: We measure actual job performance and build a TCO model based on using different accelerators (including GPUs) to determine the most cost-efficient platform for the customer to use.
Q2: What levers do you pull to optimize GPU utilization and accelerator selection?
A2: We run the actual workload on different accelerators and measure the relative performance. We then compare both the hardware cost and the operating cost to run each one, and with that data we create a total cost of ownership (TCO) model. With our Inference as a Service offering we also look at when the workloads actually need to run. Is it real time or batch? That determines the scheduling needed.
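The per-accelerator comparison described above can be sketched as a small cost model: amortized hardware cost plus operating cost, divided by measured throughput. Every number below (prices, lifetimes, jobs per hour, and the accelerator names) is a hypothetical placeholder for illustration, not Cirrascale data.

```python
# Simplified TCO-per-job comparison across accelerators.
# All figures are illustrative placeholders, not vendor data.

def tco_per_job(hw_cost, lifetime_hours, op_cost_per_hour, jobs_per_hour):
    """Amortized hardware cost plus operating cost, divided by measured throughput."""
    hourly_hw = hw_cost / lifetime_hours
    return (hourly_hw + op_cost_per_hour) / jobs_per_hour

accelerators = {
    # name: (hardware cost $, amortization hours, operating $/hr, measured jobs/hr)
    "gpu_a": (30000, 3 * 8760, 1.20, 50),
    "gpu_b": (12000, 3 * 8760, 0.80, 18),
}

costs = {name: tco_per_job(*params) for name, params in accelerators.items()}
best = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.4f} per job")
print("most cost-efficient:", best)
```

Note that the ranking can flip depending on the measured throughput: a cheaper accelerator with low jobs-per-hour can still lose on TCO, which is why the workload has to be run on each candidate rather than compared on list price.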
Q3: How do your storage tiers and data movement policies keep costs predictable?
A3: We do not charge for ingress or egress of data, so the bill is very predictable. We also offer multiple tiers of storage to best match the performance requirements.
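The predictability claim above comes down to the bill having no data-movement term at all. As a minimal sketch, assuming hypothetical tier names and per-GB rates:

```python
# Sketch of a storage bill with no ingress/egress term.
# Tier names and $/GB-month rates are hypothetical, not actual pricing.

TIER_RATES = {
    "nvme": 0.20,     # highest performance tier
    "ssd": 0.10,
    "archive": 0.02,  # capacity tier
}

def monthly_bill(usage_gb):
    """usage_gb maps tier name -> GB stored. Data movement adds nothing."""
    return sum(TIER_RATES[tier] * gb for tier, gb in usage_gb.items())

print(monthly_bill({"nvme": 500, "archive": 10000}))  # → 300.0
```

Because the total depends only on capacity per tier, the bill does not change with how often data is moved in or out, which is what makes it forecastable month to month.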
Q4: How do you sustain efficiency at high rack densities?
A4: All of our racks support water to the rack. For densities higher than 75 kW per rack, we leverage direct-to-chip liquid cooling (DLC) plus additional water-to-air cooling at the rack level, such as rear-door heat exchangers (RDHx).
Q5: How do you price inference, and which billing model suits which customers?
A5: We offer both token-based pricing with our Inference as a Service offering and GPU-hour billing on our dedicated inference offerings. Token-based pricing is typically the better deal for customers who are not using the servers 24/7, whereas dedicated inference is better for folks using the GPUs continuously.
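The 24/7-versus-bursty trade-off above can be framed as a break-even utilization: the fraction of time a GPU must be busy before a dedicated GPU-hour rate beats paying per token. The rates and throughput below are hypothetical placeholders, not actual pricing.

```python
# Break-even sketch: per-token vs. dedicated GPU-hour pricing.
# All rates and the throughput figure are illustrative assumptions.

TOKEN_RATE = 2.0e-6        # $ per token (serverless inference)
GPU_HOUR_RATE = 2.50       # $ per dedicated GPU-hour
TOKENS_PER_HOUR = 3.6e6    # sustained throughput of one dedicated GPU

def cheaper_option(utilization):
    """utilization: fraction of each hour the GPU is actually serving tokens."""
    token_cost = TOKEN_RATE * TOKENS_PER_HOUR * utilization  # pay per use
    dedicated_cost = GPU_HOUR_RATE                           # pay regardless of use
    return "token-based" if token_cost < dedicated_cost else "dedicated"

# Utilization at which both models cost the same per wall-clock hour.
break_even = GPU_HOUR_RATE / (TOKEN_RATE * TOKENS_PER_HOUR)
print(f"break-even utilization: {break_even:.0%}")  # ~35% at these rates
```

At these made-up rates a workload busy 20% of the time is cheaper per token, while one busy 90% of the time is cheaper on a dedicated GPU, matching the 24/7 rule of thumb in the answer.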