Cirrascale CEO on Defining Compute Efficiency

October 2, 2025

As organizations push more workloads into inference and AI-driven applications, compute efficiency is moving to the top of every buyer’s checklist.

We sat down with Cirrascale CEO Dave Driggers to walk through the practical yardsticks they use when evaluating performance, the trade-offs behind scheduling and accelerator selection, and the engineering choices that sustain efficiency even at high rack densities.

Check out the Q&A below to learn how Cirrascale defines compute efficiency in business terms, what levers they pull to optimize GPU utilization, how storage tiers and data movement policies keep costs predictable, and the apples-to-apples tests Driggers recommends for validating provider claims.

Q1: How do you define “compute efficiency” in business terms for Cirrascale customers—what simple yardsticks (e.g., performance delivered per dollar or sustained GPU utilization during runs) actually tell you they’re getting efficient compute?

A1: We measure actual job performance and build a total cost of ownership (TCO) model across different accelerators (including GPUs) to determine the most cost-efficient platform for the customer.

Q2: From the provider side, what are the two biggest levers you pull to raise effective GPU utilization—guiding customers to the right accelerator, improving scheduling/pooling to cut idle time, or removing storage/network bottlenecks? A quick example of impact would help.

A2: We run the actual workload on different accelerators and measure the relative performance. We then compare both the hardware cost and the operating cost of running them, and from that data we build the TCO model. With our Inference as a Service offering, we also look at when the workloads actually need to run. Is it real time or batch? That determines the scheduling needed.
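
To make that comparison concrete, here is a minimal Python sketch of the kind of cost-per-unit-of-work calculation described above. The accelerator names, throughput numbers, and hourly rates are hypothetical placeholders, not Cirrascale measurements or pricing.

# Hypothetical TCO-per-unit-of-work comparison, in the spirit of the approach
# described above. All names and figures are illustrative placeholders.
accelerators = {
    # name: measured throughput on the actual workload and all-in hourly cost
    "accelerator_a": {"jobs_per_hour": 120, "cost_per_hour": 4.00},
    "accelerator_b": {"jobs_per_hour": 200, "cost_per_hour": 7.50},
}

def cost_per_job(spec: dict) -> float:
    """Effective cost per unit of work: hourly cost divided by measured throughput."""
    return spec["cost_per_hour"] / spec["jobs_per_hour"]

best = min(accelerators, key=lambda name: cost_per_job(accelerators[name]))
for name, spec in accelerators.items():
    print(f"{name}: ${cost_per_job(spec):.4f} per job")
print(f"Most cost-efficient for this workload: {best}")

Feeding in throughput measured on your own workload, rather than vendor benchmark figures, is what makes this an apples-to-apples comparison.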

Q3: Inference at scale often loses efficiency when data movement gets expensive or slow. What have you done around storage tiers and data-movement policies to keep GPUs fed and the bill predictable?

A3: We do not charge for ingress or egress of data, so the bill is very predictable. We also offer multiple tiers of storage to best match each workload's performance requirements.

Q4: Rack densities are rising fast. Without getting into plumbing, how are you planning for 100–150 kW racks so compute efficiency doesn’t drop to thermal throttling or queue delays? What’s one decision that materially changed outcomes?

A4: All of our racks support water to the rack. For densities higher than 75 kW per rack, we leverage direct liquid-to-chip cooling (DLC) plus additional water-to-air cooling at the rack level, such as rear-door heat exchanger (RDHx) doors.

Q5: Your Inference Cloud uses token-based pricing. How does that map to customers’ efficiency goals versus GPU-hour billing, and when does it meaningfully lower total cost to serve?

A5: We offer both token-based pricing with our Inference as a Service offering and GPU-hour billing on our dedicated inference offerings. Token-based pricing is typically the better deal for customers who are not using the servers 24/7, whereas dedicated inference is better for those using the GPUs continuously.
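
As a rough illustration of where that crossover can sit, the short Python sketch below compares an assumed token-based rate against an assumed GPU-hour rate at different utilization levels. The rates, throughput figure, and resulting break-even point are illustrative assumptions only, not Cirrascale pricing.

# Illustrative break-even between token-based pricing and GPU-hour billing.
# All rates and throughput figures are hypothetical assumptions.
PRICE_PER_MILLION_TOKENS = 2.00   # assumed token-based rate ($)
GPU_HOUR_RATE = 6.00              # assumed dedicated GPU rate ($/hour)
TOKENS_PER_GPU_HOUR = 5_000_000   # assumed sustained throughput at full load

def monthly_cost_tokens(utilization: float, hours: float = 730) -> float:
    """Pay only for the tokens actually generated at the given utilization."""
    tokens = TOKENS_PER_GPU_HOUR * hours * utilization
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

def monthly_cost_dedicated(hours: float = 730) -> float:
    """Pay for the GPU around the clock regardless of utilization."""
    return GPU_HOUR_RATE * hours

for utilization in (0.10, 0.25, 0.50, 0.90):
    tok = monthly_cost_tokens(utilization)
    ded = monthly_cost_dedicated()
    cheaper = "token-based" if tok < ded else "dedicated"
    print(f"{utilization:.0%} utilization: tokens ${tok:,.0f} vs dedicated ${ded:,.0f} -> {cheaper}")

In this toy example, token-based pricing wins at lower sustained utilization and dedicated GPU-hour billing wins near continuous use, which mirrors the 24/7-versus-intermittent distinction Driggers draws.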
