GPT-5.6 Sol on Cerebras: 750 Tokens per Second
GPT-5.6 Sol goes live on Cerebras at up to 750 tokens per second in July. Here is what that speed actually means, and why it is a chip story.
By Capital & Compute
The line everyone quoted from OpenAI’s GPT-5.6 announcement was the speed: Sol, the new flagship, will run on Cerebras hardware at up to 750 tokens per second when it lands in July. That is roughly eleven times the output speed of GPT-5.5 today. The reaction was predictable. Take my kidney, take my ID, give my ADHD this superpower.
Here is the part the screenshots skip. 750 tokens per second is not something OpenAI built. It is something Cerebras does to almost any model you put on it. And the number is lower than what the same chips have already shown, not higher.
| Model (hardware) | Output speed (tokens per second) |
|---|---|
| GPT-5.5 (GPU, xhigh) | 68 |
| GPT-5.6 Sol (Cerebras) | 750 |
| GPT-5.3-Codex-Spark (Cerebras) | 1,000 |
| Llama 3.1-70B (Cerebras) | 2,100 |
| gpt-oss-120B (Cerebras) | 3,000 |
How fast is GPT-5.6 Sol on Cerebras?
GPT-5.6 Sol on Cerebras runs at up to 750 tokens per second, as reported by The Decoder from OpenAI’s announcement. For reference, GPT-5.5 at its highest reasoning effort outputs about 68 tokens per second on standard GPU-backed providers, according to Artificial Analysis. So the Cerebras version is roughly eleven times faster at producing text than its predecessor, and it is the same Sol you would get anywhere else.
To put 68 versus 750 in human terms: at 68 tokens per second, a long agent step or a full file of code trickles out while you wait, watching the cursor. At 750, the same output is done before you have finished reading the first line. The model has not gotten smarter. It has gotten close to instant.
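The arithmetic behind that comparison is simple enough to sketch. The snippet below is a rough illustration, not a benchmark: the two throughput figures come from the sources cited here, while the output sizes (300 and 3,000 tokens) are hypothetical, and the calculation assumes a steady streaming rate and ignores time to first token.

```python
# Throughput figures quoted in the article (tokens per second).
GPU_TPS = 68        # GPT-5.5 at xhigh effort, per Artificial Analysis
CEREBRAS_TPS = 750  # GPT-5.6 Sol on Cerebras, an "up to" figure


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream output_tokens at a steady rate, ignoring time to first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical output sizes, chosen only for illustration.
for label, tokens in [("short reply", 300), ("full file of code", 3_000)]:
    slow = generation_seconds(tokens, GPU_TPS)
    fast = generation_seconds(tokens, CEREBRAS_TPS)
    print(f"{label}: {slow:.1f}s at {GPU_TPS} tok/s, {fast:.1f}s at {CEREBRAS_TPS} tok/s")

print(f"speedup: {CEREBRAS_TPS / GPU_TPS:.1f}x")
```

Under those assumptions, a 3,000-token file takes about 44 seconds at 68 tokens per second and 4 seconds at 750.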
That gap is the whole pitch. And it is real. But the framing in the launch coverage gets the cause wrong.
750 tokens per second is a Cerebras number, not an OpenAI breakthrough
One of the sharper comments under the Reddit thread said it plainly: this is just the speed Cerebras runs things, and any vendor could get it by prioritizing Cerebras. That is correct, and the receipts back it up.
Cerebras builds a wafer-scale chip, the WSE-3, that keeps an entire model’s weights in on-chip memory instead of shuttling them across slow links to stacks of GPUs. The result is output speeds that GPUs do not reach. The company has been publishing these numbers for over a year:
- gpt-oss-120B ran at 3,000 tokens per second on Cerebras, per Cerebras, at full 128k context.
- Llama 3.1-70B hit 2,100 tokens per second, again per Cerebras.
- GPT-5.3-Codex-Spark, OpenAI’s first model on Cerebras, shipped at over 1,000 tokens per second, as ServeTheHome covered. In a coding demo, the Cerebras-backed version finished a task in 9 seconds against nearly 43 on the standard model.
Line those up and 750 stops looking like a peak. It looks like the floor for a large frontier model on this hardware. The smaller the model, the faster Cerebras runs it, because fewer weights mean a larger share of the model sits in, and moves through, on-chip memory. gpt-oss-120B is small and screams at 3,000. Sol is OpenAI’s heaviest model, the one built for the hardest coding and security work, so it is the slowest of the bunch at 750. That ordering is the tell. Speed here tracks model size, not model quality.
So the honest headline is not “OpenAI made Sol 11x faster.” It is “OpenAI put its flagship on the chip that makes everything fast, and even its biggest model clears 750.”
Speed does not change the price. It changes what the model is for.
This is the part worth slowing down on, because it is where the economics live.
Running Sol on Cerebras does not make it cheaper per token. OpenAI prices GPT-5.6 Sol at $5 per million input tokens and $30 per million output, as The Decoder reported from the announcement. The two cheaper tiers, Terra and Luna, come in at $2.50 / $15 and $1 / $6. Those are the rates regardless of how fast the tokens come out. A task that costs a dollar on a slow provider costs a dollar on Cerebras. If anything, low-latency inference tends to carry a premium, not a discount.
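A minimal sketch of that per-task math, using the per-million-token rates quoted above. The function name, the lowercase tier keys, and the example token counts are illustrative assumptions, not anything from OpenAI's API. Note what is missing from the signature: there is no tokens-per-second argument, because throughput does not enter the bill.

```python
# USD per million tokens (input, output), as quoted in the article.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task at list price; serving speed is not a factor."""
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical task: 100k tokens of context in, 10k tokens out.
for tier in PRICES:
    print(f"{tier}: ${task_cost(tier, 100_000, 10_000):.2f}")
```

For that hypothetical task the totals work out to $0.80 on Sol, $0.40 on Terra, and $0.16 on Luna, whether the tokens arrive at 68 per second or 750.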
So what does the speed actually buy? Usability, not savings. There is a category of work that simply does not function at 68 tokens per second:
- Agentic loops where the model calls a tool, reads the result, and decides the next call, dozens of times per task. At 68 tokens per second, each step stalls and the whole chain feels broken. At 750, the loop runs at something like conversation speed.
- Real-time coding assistance, where a suggestion that arrives after you have already typed the line is worthless. The Codex-Spark demo finishing in 9 seconds instead of 43 is the difference between a tool you use and one you tab away from.
- Anything a human waits on live, from a support agent to a voice interface.
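To see why the agent-loop case is so sensitive to throughput, here is a back-of-the-envelope model. Every input except the two throughput figures is an assumption made for illustration: 30 steps per task, 400 generated tokens per step, and half a second of tool latency per call. It is a sketch, not a measurement of any real agent.

```python
def loop_seconds(steps: int, tokens_per_step: int,
                 tokens_per_second: float, tool_seconds: float) -> float:
    """Wall-clock time for a sequential agent loop.

    Each step generates tokens_per_step tokens, then waits tool_seconds
    for the tool call to return. Steps run one after another.
    """
    per_step = tokens_per_step / tokens_per_second + tool_seconds
    return steps * per_step


# Hypothetical workload: 30 steps, 400 tokens each, 0.5 s per tool call.
STEPS, TOKENS, TOOL = 30, 400, 0.5

slow = loop_seconds(STEPS, TOKENS, 68, TOOL)
fast = loop_seconds(STEPS, TOKENS, 750, TOOL)
print(f"68 tok/s:  {slow:.0f}s")
print(f"750 tok/s: {fast:.0f}s")
print(f"end-to-end speedup: {slow / fast:.1f}x")
```

With these made-up inputs the loop drops from roughly 191 seconds to 31. That is about a 6x gain end to end rather than 11x, because the tool latency in each step does not get any faster.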
Speed is a usability unlock, and a real one. It just is not a cost story, and it does not move the per-task math that decides whether running a model is affordable at scale. For that math, see how the GPT-5.6 cost per task shakes out at the confirmed rates, and how Sol stacks up against rivals in the GPT-5.6 versus Claude Fable 5 comparison. The per-token price is the same on a slow chip or a fast one.
The catches the launch tweet left out
Three of them, and each matters more than the speed.
Access is gated, by the government. The Cerebras version, like the rest of GPT-5.6, starts in a limited preview “only open to select partners through the API and Codex, at the explicit direction of the US government,” per The Decoder. OpenAI was blunt about disliking it: it does not believe this kind of government access process should become the long-term default. So “750 tokens per second in July” comes with a quiet asterisk: most people cannot touch it yet. The full picture of why is in the breakdown of how the government became a release gate for this model.
Context windows on Cerebras have been tight. Someone in the thread asked whether Cerebras had improved its context size, and the answer was a wry “if only they had.” Wafer-scale memory is finite, and long context competes with the weights for the same on-chip space. Cerebras has shown 128k context on a small model like gpt-oss-120B, but a model the size of Sol leaves far less room. If the served context is short, that constrains exactly the long-horizon agent and codebase work Sol is built for. Watch for the real number when the preview opens.
Fast, but forgetful. Wafer-scale inference holds nothing between requests. There is no persistent memory across calls, which is why one observer described Cerebras as “ultra-fast forgetting”. For a single burst of generation that is fine. For an agent meant to carry state across a long task, the speed is real but the memory has to be rebuilt and re-sent every turn, which eats back some of the win.
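The re-sending overhead is easy to underestimate, so here is a rough count. The sketch assumes a fully stateless setup in which every turn re-sends the whole history, and the turn count and token sizes are invented for illustration. It says nothing about how Cerebras or OpenAI actually handle caching.

```python
def resent_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens sent when every turn re-sends the full history."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context           # the whole history goes over the wire again
        context += added_per_turn  # history grows after each turn
    return total


def new_tokens_only(turns: int, base_context: int, added_per_turn: int) -> int:
    """Hypothetical baseline: tokens sent if only new material were transmitted."""
    if turns <= 0:
        return 0
    return base_context + (turns - 1) * added_per_turn


# Hypothetical agent run: 20 turns, 2,000-token start, 1,000 tokens added per turn.
print(resent_input_tokens(20, 2_000, 1_000))
print(new_tokens_only(20, 2_000, 1_000))
```

In that example the stateless loop pushes 230,000 input tokens through the model against 21,000 tokens of genuinely new material, and the gap widens as the task gets longer.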
So is it the best thing out of the announcement?
For the narrow case of latency-bound, real-time work, genuinely close. An eleven-times speedup with no loss of model quality is the kind of thing that changes which products are buildable. Voice, live coding, fast agent loops: those get materially better at 750 tokens per second.
For everyone else, it is a chip partnership dressed as a model feature. The intelligence is the same Sol you would get anywhere. The price is the same. The speed comes from Cerebras, which has been doing this to other models for over a year, and which runs Sol slower than its lighter siblings precisely because Sol is the big one. Useful framing for reading the next frontier launch: when a number is set by the hardware, the model maker gets the headline, but the chip did the work.
Frequently asked questions
- How fast is GPT-5.6 Sol on Cerebras?
- Up to 750 tokens per second, starting in July 2026, with access limited as capacity ramps. For comparison, GPT-5.5 at its highest reasoning effort outputs about 68 tokens per second on standard providers, so the Cerebras version is roughly eleven times faster at producing text.
- Is 750 tokens per second fast for an AI model?
- Very fast for a frontier model. Most large reasoning models output 50 to 100 tokens per second on GPUs. But 750 is actually low for Cerebras hardware, which has run smaller models like gpt-oss-120B at 3,000 tokens per second. Sol is slower because it is a much bigger model.
- Does running GPT-5.6 on Cerebras make it cheaper?
- No. Speed and price are separate. GPT-5.6 Sol is billed at $5 per million input tokens and $30 per million output regardless of the hardware. Cerebras changes how fast the tokens arrive, not how much they cost. Low-latency inference often costs more, not less.
- When can I use GPT-5.6 Sol on Cerebras?
- The Cerebras version is slated for July 2026, but it starts in a limited preview restricted to select partners at the direction of the US government, with capacity ramping over time. A wider release timeline has not been confirmed.
Sources
- OpenAI (2026). Previewing GPT-5.6 Sol: a next-generation model. OpenAI announcement. https://openai.com/index/previewing-gpt-5-6-sol/
- The Decoder (2026). OpenAI’s GPT-5.6 Sol launches to rival Claude Mythos under government access rules it calls unsustainable. https://the-decoder.com/openais-claude-mythos-competitor-gpt-5-6-sol-launches-under-government-controlled-access-it-calls-unsustainable/
- Artificial Analysis (2026). GPT-5.5 (xhigh): Intelligence, Performance and Price Analysis. Independent benchmarking. https://artificialanalysis.ai/models/gpt-5-5
- ServeTheHome (2026). OpenAI GPT-5.3-Codex-Spark Now Running at 1K Tokens Per Second on BIG Cerebras Chips. https://www.servethehome.com/openai-gpt-5-3-codex-spark-now-running-at-1k-tokens-per-second-on-big-cerebras-chips/
- Cerebras (2025). Cerebras launches OpenAI’s gpt-oss-120B at a blistering 3,000 tokens/sec. Vendor blog. https://www.cerebras.ai/blog/cerebras-launches-openai-s-gpt-oss-120b-at-a-blistering-3-000-tokens-sec
- Cerebras (2024). Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s. Vendor blog. https://www.cerebras.ai/blog/cerebras-inference-3x-faster
- MemU (2025). Cerebras Delivers 2,100 Tokens Per Second on Llama 70B: the fastest inference without persistent memory means ultra-fast forgetting. https://memu.pro/blog/cerebras-wafer-scale-inference-agent-memory