Discover all the highlights from OCP > VIEW our coverage
X

Optimizing AI Inference: Fireside Chat with Cerebras’ Andrew Feldman

Cerebras is on a tear. Just a week ago, I had the privilege of hosting Cerebras CEO Andrew Feldman, for a Fireside Chat. He laid out his vision for the company, the background of delivering a dinner plate sized ASIC to drive AI inference at scale, and how he drives disruptive innovation where others have failed. It was a fascinating interview that showcased why Cerebras is turning customer heads.  

And at AI speed, Cerebras has unleashed new momentum in the market. This week, the company announced a massive expansion of its inference capacity, adding six new AI inference data centers across North America and Europe. Powered by thousands of Cerebras CS-3 systems, this network will deliver over 40 million Llama 70B tokens per second, marking a significant leap forward in AI compute.

What’s more, the company announced a sweeping collaboration with Hugging Face, everyone’s favorite opensource AI model, bringing new capability to over 5 million AI developers.

In an AI landscape where advancements emerge weekly, the scale of deployment and the boldness of the collaboration has the potential to be a game-changer. You’d almost start to wonder if it was the week before GTC 😊.  

How did we get here? Let’s wind the clock back, all the way to last week’s fireside chat. Andrew shared that he and his co-founders crafted a vision to revolutionize AI compute, as far back as 2015, after observing that existing compute architectures weren’t well-suited to handle the unique performance requirements AI demanded. They saw an opportunity to create a new type of processor that was specialized to handle the increasing demands of AI inference. This ambition led to the creation of the largest chip ever designed—one that would not only challenge conventional architectural approaches, but also surpasses them in performance.

One of the most interesting aspects of our conversation was the way Andrew and his team approached the challenge of building such a massive chip. At the heart of Cerebras’ strategy is the belief that merely improving on existing technology wasn’t enough: "If you're going to attack an incumbent, doing something a little better and a little cheaper is unlikely to get the job done," Andrew explained. Instead, Cerebras went big—building a chip 56 times larger than the largest GPU ever made.  

But the scale of impact of Cerebras’ chip isn’t just about size. The design is optimized to deliver high performance with efficient power consumption, making it ideal for navigating the complex computations required by AI inference, especially when you consider power requirements of competing GPUs. The result is a chip that delivers performance that far exceeds traditional GPUs, with some models outpacing 50, 60, or even hundreds of GPUs, depending on the workload.

Today, Cerebras-based platforms are delivering results across a wide range of AI models. Andrew spoke about how their engineers are able to quickly adapt their hardware to new models, something that’s a challenge in today’s warp speed model advancement, to enable customers to utilize models of choice including support for models optimized for different languages. Cerebras technology, for example, was instrumental in optimizing Jais, a leading Arabic-language LLM developed by G42.

This model advancement, of course, assists with performance of end user deployments. One example Andrew shared that particularly resonated with me was their collaboration with GlaxoSmithKline to increase the pace of pharmaceutical innovation using new AI models. I have spoken on this platform before about the potential for AI tools to unlock new frontiers in medicine, and by putting cutting-edge compute in the hands of researchers, Cerebras is playing a role in solving important medical issues and improving lives.  

Looking toward the future, Andrew predicts that AI inference advancement will continue to skyrocket, especially in industries such as healthcare, energy, and communications. He explained that the next wave of AI adoption would be driven by an interesting interplay between edge and cloud, stating that the role of edge computing would enable broad inference adoption and place more demand for data center compute. Drawing a comparison to the launch of the original iPhone, Andrew posited “everybody was worried that it would mean less data center spend.
And lo and behold, we make the device in your hand more powerful, it does more on the edge, and what do we want? We want more—and that means you go back to the data center more and more frequently.”

So what’s the TechArena take? The news this weeks puts a lot of wind in the sails of Cerebras market advancement. I expect that Andrew and the Cerebras team will leverage this news to fill 2025 with continued performance optimization advancement across large language models. I’m expecting to see more customers hop aboard the Cerebras wagon as enterprises seek inference performance at scale and are willing to invest the deep engineering collaboration time to tap Cerebras hardware to its fullest capability. This will be bolstered by the massive uptick in developer engagement through the new Hugging Face collaboration. While Cerebras may not be every enterprise’s cup of tea (for example small models may not require the gut punch of compute power that Cerebras delivers), many will be attracted by Cerebras’ performance and performance efficiency... and having a second source of supply for AI inference at scale. I love that there is another horse in the race and can’t wait to see how this plays out. More on the AI front next week as TechArena ventures to San Jose with front row seats at GTC.

Cerebras is on a tear. Just a week ago, I had the privilege of hosting Cerebras CEO Andrew Feldman, for a Fireside Chat. He laid out his vision for the company, the background of delivering a dinner plate sized ASIC to drive AI inference at scale, and how he drives disruptive innovation where others have failed. It was a fascinating interview that showcased why Cerebras is turning customer heads.  

And at AI speed, Cerebras has unleashed new momentum in the market. This week, the company announced a massive expansion of its inference capacity, adding six new AI inference data centers across North America and Europe. Powered by thousands of Cerebras CS-3 systems, this network will deliver over 40 million Llama 70B tokens per second, marking a significant leap forward in AI compute.

What’s more, the company announced a sweeping collaboration with Hugging Face, everyone’s favorite opensource AI model, bringing new capability to over 5 million AI developers.

In an AI landscape where advancements emerge weekly, the scale of deployment and the boldness of the collaboration has the potential to be a game-changer. You’d almost start to wonder if it was the week before GTC 😊.  

How did we get here? Let’s wind the clock back, all the way to last week’s fireside chat. Andrew shared that he and his co-founders crafted a vision to revolutionize AI compute, as far back as 2015, after observing that existing compute architectures weren’t well-suited to handle the unique performance requirements AI demanded. They saw an opportunity to create a new type of processor that was specialized to handle the increasing demands of AI inference. This ambition led to the creation of the largest chip ever designed—one that would not only challenge conventional architectural approaches, but also surpasses them in performance.

One of the most interesting aspects of our conversation was the way Andrew and his team approached the challenge of building such a massive chip. At the heart of Cerebras’ strategy is the belief that merely improving on existing technology wasn’t enough: "If you're going to attack an incumbent, doing something a little better and a little cheaper is unlikely to get the job done," Andrew explained. Instead, Cerebras went big—building a chip 56 times larger than the largest GPU ever made.  

But the scale of impact of Cerebras’ chip isn’t just about size. The design is optimized to deliver high performance with efficient power consumption, making it ideal for navigating the complex computations required by AI inference, especially when you consider power requirements of competing GPUs. The result is a chip that delivers performance that far exceeds traditional GPUs, with some models outpacing 50, 60, or even hundreds of GPUs, depending on the workload.

Today, Cerebras-based platforms are delivering results across a wide range of AI models. Andrew spoke about how their engineers are able to quickly adapt their hardware to new models, something that’s a challenge in today’s warp speed model advancement, to enable customers to utilize models of choice including support for models optimized for different languages. Cerebras technology, for example, was instrumental in optimizing Jais, a leading Arabic-language LLM developed by G42.

This model advancement, of course, assists with performance of end user deployments. One example Andrew shared that particularly resonated with me was their collaboration with GlaxoSmithKline to increase the pace of pharmaceutical innovation using new AI models. I have spoken on this platform before about the potential for AI tools to unlock new frontiers in medicine, and by putting cutting-edge compute in the hands of researchers, Cerebras is playing a role in solving important medical issues and improving lives.  

Looking toward the future, Andrew predicts that AI inference advancement will continue to skyrocket, especially in industries such as healthcare, energy, and communications. He explained that the next wave of AI adoption would be driven by an interesting interplay between edge and cloud, stating that the role of edge computing would enable broad inference adoption and place more demand for data center compute. Drawing a comparison to the launch of the original iPhone, Andrew posited “everybody was worried that it would mean less data center spend.
And lo and behold, we make the device in your hand more powerful, it does more on the edge, and what do we want? We want more—and that means you go back to the data center more and more frequently.”

So what’s the TechArena take? The news this weeks puts a lot of wind in the sails of Cerebras market advancement. I expect that Andrew and the Cerebras team will leverage this news to fill 2025 with continued performance optimization advancement across large language models. I’m expecting to see more customers hop aboard the Cerebras wagon as enterprises seek inference performance at scale and are willing to invest the deep engineering collaboration time to tap Cerebras hardware to its fullest capability. This will be bolstered by the massive uptick in developer engagement through the new Hugging Face collaboration. While Cerebras may not be every enterprise’s cup of tea (for example small models may not require the gut punch of compute power that Cerebras delivers), many will be attracted by Cerebras’ performance and performance efficiency... and having a second source of supply for AI inference at scale. I love that there is another horse in the race and can’t wait to see how this plays out. More on the AI front next week as TechArena ventures to San Jose with front row seats at GTC.

Transcript

Subscribe to TechArena

Subscribe