OpenAI, an artificial intelligence (AI) research and deployment company, and Cerebras, an American AI and semiconductor company, have signed a multi-year agreement to deploy 750 MW of Cerebras wafer-scale AI systems. The deployment will begin in 2026 and progress in multiple phases through 2028.
In a press release, OpenAI said Cerebras’ purpose-built AI systems are designed to speed up long outputs from AI models and deliver much faster responses. That speed comes from integrating massive compute, memory, and bandwidth on a single chip, eliminating the bottlenecks that slow inference on conventional hardware. Large language models running on Cerebras deliver responses up to 15× faster than GPU-based systems, enabling more natural interactions and novel AI applications. The deal is being touted as the largest high-speed AI inference deployment in the world.
Sachin Katti, Head of Compute Infrastructure, OpenAI, said, “OpenAI’s compute strategy is to build a resilient portfolio that matches the right systems to the right workloads. Cerebras adds a dedicated low-latency inference solution, natural interactions, and a stronger foundation to scale real-time AI to many more people.”
Andrew Feldman, co-founder and CEO, Cerebras, said, “We are delighted to partner with OpenAI, bringing the world’s leading AI models to the world’s fastest AI processor. Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models.”
The OpenAI–Cerebras collaboration points to benchmarks showing significant improvements in AI inference speed on Cerebras’ wafer‑scale engine. For instance, the GPT‑OSS‑120B model runs at about 3,000 tokens per second, considerably faster than typical GPU-based inference. Llama 3.1‑70B achieves around 2,100 tokens per second, roughly 16 times faster than the fastest GPUs, while the Llama API on Cerebras reaches 2,600 tokens per second, about 18 times faster than GPU-based setups. Earlier tests of Llama 3.1‑8B recorded 1,800 tokens per second, roughly 20 times faster than standard cloud GPUs.
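To put these throughput figures in perspective, the short sketch below converts tokens-per-second rates into wall-clock generation times for a fixed-length response. The 1,000-token response length and the GPU baseline rates (back-calculated by dividing each quoted Cerebras rate by its stated speedup multiple) are illustrative assumptions, not figures published in the announcement.

```python
# Illustrative only: convert quoted tokens/second rates into wall-clock
# generation time for a hypothetical 1,000-token response.
# GPU baselines are derived from the speedup multiples quoted above
# (e.g. 3,000 tok/s at ~15x implies ~200 tok/s on GPU); they are
# assumptions, not published benchmark numbers.

RESPONSE_TOKENS = 1_000  # assumed response length

benchmarks = {
    # model: (Cerebras tokens/s, quoted or assumed speedup vs. GPU)
    "GPT-OSS-120B":  (3_000, 15),  # speedup assumed from the "up to 15x" claim
    "Llama 3.1-70B": (2_100, 16),
    "Llama API":     (2_600, 18),
    "Llama 3.1-8B":  (1_800, 20),
}

for model, (cerebras_tps, speedup) in benchmarks.items():
    gpu_tps = cerebras_tps / speedup            # implied GPU baseline (assumption)
    t_cerebras = RESPONSE_TOKENS / cerebras_tps  # seconds on Cerebras
    t_gpu = RESPONSE_TOKENS / gpu_tps            # seconds on the implied GPU baseline
    print(f"{model}: ~{t_cerebras:.2f}s on Cerebras vs ~{t_gpu:.1f}s on GPU")
```

Under these assumptions, a 1,000-token answer that takes several seconds on conventional GPU inference returns in well under a second on the wafer-scale system, which is the practical meaning of the "real-time inference" framing in the announcement.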

