Qwen 3.7 Max vs Claude for Coding: The Real Cost Per Task
Qwen 3.7 Max lists at about half Claude Opus 4.8’s price and runs the same eval suite for under a third of the cost. The catch is not hidden cost. It is what the price buys.
By Capital & Compute
Qwen 3.7 Max lists at roughly half Claude Opus 4.8’s per-token price, and the saving is real, not a sticker trick. On Artificial Analysis’s independent run, Qwen finishes the same Intelligence Index eval suite for $1,158.72 against Opus 4.8’s $4,011.58. That is about a third of the cost for the same fixed body of work. The catch is not a hidden bill. It is what the low price buys: a lower independent intelligence rank, coding wins that Qwen scores for itself, and an API that routes through China.
So “is Qwen 3.7 Max good enough to replace Claude for coding” has a real answer, and it is not the one the launch coverage gave you. The honest version depends on which task you run, which Claude you are replacing, and whether your code can legally leave the building.
What Qwen 3.7 Max actually is
Alibaba’s Qwen team announced Qwen 3.7 Max on May 20, 2026 at its Cloud Summit, with the commercial API going live on Model Studio a day earlier. It is a reasoning agent model with a 1M-token context window and a native extended-thinking mode, pitched at long-horizon agent work and described by Alibaba as able to run autonomously for up to 35 hours.
Two facts matter more than the headline specs. It is closed-weight and API-only, a break from the open-weight Qwen models that built the brand. And it speaks Anthropic’s Messages API, so it slots into Claude Code by swapping a base URL and a model id. That second fact is why every comparison frames it as a Claude replacement: you can point your existing harness at it in about a minute.
The 35-hour autonomous run is the hook every competing review leads with. Skip past it. A model’s ability to grind unattended says little about what a real task costs you, which is the only question a budget owner is actually asking.
The sticker price says Qwen wins
Start with the rate card, because that is where the “half the price” claim comes from. These are list prices per million tokens, current as of June 21, 2026.
| Model | Input / 1M | Output / 1M | Cached input / 1M |
|---|---|---|---|
| Qwen 3.7 Max (list) | $2.50 | $7.50 | $0.25 |
| Qwen 3.7 Max (promo) | ~$1.25 | ~$3.75 | ~$0.13 |
| Claude Opus 4.8 | $5.00 | $25.00 | $0.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 |
Against Opus 4.8, Qwen’s output token is 30% of the price at list and 15% under the promo. Against Sonnet 4.6 it still wins on both input and output. A roughly 50% promotional discount has been running on the international endpoint, which is where the $1.25/$3.75 comes from. Alibaba has not published an end date, so treat the promo rate as temporary and the list rate as the number you budget against.
One row breaks the clean story, and it is the one competitors skip. Claude Haiku 4.5 lists at $1 input and $5 output, cheaper per token than Qwen 3.7 Max’s list rate on both sides. Qwen undercuts Claude’s flagship. It does not undercut Claude’s cheap tier. If your goal is purely the lowest token price for coding, the answer might be a Claude model, not a Chinese one. Hold that thought for the verdict.
Anthropic’s pricing for the full Claude line is on its API docs; the Qwen rates are confirmed via Artificial Analysis’s model page and live resellers, pending Alibaba’s own rate card.
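The ratios quoted in this section can be rechecked in a few lines. This is a minimal sketch that only restates the rate card above; the dictionary keys are hypothetical labels chosen for illustration, not official model ids.

```python
# List prices in USD per 1M tokens, copied from the rate card above.
# The keys are illustrative labels, not official API model ids.
RATES = {
    "qwen-3.7-max-list":  {"input": 2.50, "output": 7.50,  "cached": 0.25},
    "qwen-3.7-max-promo": {"input": 1.25, "output": 3.75,  "cached": 0.13},
    "claude-opus-4.8":    {"input": 5.00, "output": 25.00, "cached": 0.50},
    "claude-sonnet-4.6":  {"input": 3.00, "output": 15.00, "cached": 0.30},
    "claude-haiku-4.5":   {"input": 1.00, "output": 5.00,  "cached": 0.10},
}


def price_ratio(model, baseline, side):
    """Return model's rate as a fraction of baseline's for one side of the card."""
    return RATES[model][side] / RATES[baseline][side]


def cheaper_on_both_sides(model, other):
    """True if model's input and output rates are both below other's."""
    return all(RATES[model][s] < RATES[other][s] for s in ("input", "output"))


for side in ("input", "output"):
    ratio = price_ratio("qwen-3.7-max-list", "claude-opus-4.8", side)
    print(f"Qwen list vs Opus 4.8, {side}: {ratio:.0%}")

print("Haiku 4.5 under Qwen list on both sides:",
      cheaper_on_both_sides("claude-haiku-4.5", "qwen-3.7-max-list"))
```

Swapping in the promo row shows why the list rate is the safer budgeting number: under the promo, Qwen's output rate drops below Haiku 4.5's, so the ordering depends on a discount with no published end date.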
The cheapness is real, not a sticker trick
Here is where most cost comparisons stop, and where the interesting part begins. A per-token rate tells you nothing until you know how many tokens the model burns to finish the job. A cheap model that rambles can cost more than a pricey one that is terse. That is the price-reversal phenomenon a 2026 Microsoft Research preprint measured across frontier models: in about a third of matchups, the cheaper-listed model finished the work at a higher cost.
So does Qwen ramble? No. This is the myth to kill before it spreads.
Artificial Analysis runs every model through the same fixed eval suite and logs both the token count and the dollar cost. By that independent measure, Qwen 3.7 Max generates about 100M output tokens to run the Intelligence Index, which AA calls “somewhat verbose” against a 93M field average. Claude Opus 4.8 generates 120M on the same suite, which AA calls “very verbose.” Qwen is barely above average. Opus is the chatty one. Verbosity is not quietly multiplying Qwen’s cheap tokens into an expensive bill. If anything it is doing that to Opus.
Which is why the total cost-to-run lands where the sticker price predicted, only more so.
Qwen’s $1,158.72 against Opus 4.8’s $4,011.58 is the cleanest cost-per-task number in the comparison, because it is independent, like-for-like, and not modeled by anyone with a model to sell. Qwen 3.7 Max does the same fixed body of work for about 29% of what Opus 4.8 costs and about 45% of GPT-5.5. The cost-per-task moat this site keeps returning to points one way here: on raw economics, Qwen is genuinely cheap, and the cheapness survives contact with real token counts.
One precise caveat the citations demand. “Cost to run the Intelligence Index” is the cost of AA’s full eval suite, which spans reasoning, knowledge, and coding tasks, not a single coding ticket. It is the best independent proxy for “what this model costs to do a large fixed job,” and it is the honest number to anchor on. It is not literally a per-coding-task price. We will get to a single task in a moment.
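The price-reversal check described above is simple to express in code. The sketch below uses only the figures quoted in this section; the output-only cost is a simplification introduced here for illustration, since the article does not give input-token counts, so it will not match AA's full run cost.

```python
# Figures quoted above from Artificial Analysis's Intelligence Index run.
RUN_COST_USD = {"qwen-3.7-max": 1158.72, "claude-opus-4.8": 4011.58}
OUTPUT_TOKENS_M = {"qwen-3.7-max": 100, "claude-opus-4.8": 120}  # millions, approximate
OUTPUT_PRICE_PER_M = {"qwen-3.7-max": 7.50, "claude-opus-4.8": 25.00}  # list rates


def output_only_cost(model):
    """Output tokens times list output price.

    Assumption for illustration: input tokens and caching are ignored,
    so this understates the real bill.
    """
    return OUTPUT_TOKENS_M[model] * OUTPUT_PRICE_PER_M[model]


def is_price_reversal(cheap, pricey, total_cost):
    """True if the model with the lower sticker rate ends up costing more."""
    lower_sticker = OUTPUT_PRICE_PER_M[cheap] < OUTPUT_PRICE_PER_M[pricey]
    return lower_sticker and total_cost[cheap] > total_cost[pricey]


run_ratio = RUN_COST_USD["qwen-3.7-max"] / RUN_COST_USD["claude-opus-4.8"]
print(f"Qwen run cost as a share of Opus 4.8: {run_ratio:.0%}")
print("Price reversal on the measured run:",
      is_price_reversal("qwen-3.7-max", "claude-opus-4.8", RUN_COST_USD))
for model in RUN_COST_USD:
    print(f"{model}: output-only cost ${output_only_cost(model):,.2f}")
```

A reversal would need Qwen to burn enough extra tokens to cancel its lower rate. On these figures it generates fewer output tokens than Opus, so token count and price pull in the same direction.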
What the low price does not buy
Set the intelligence scores next to the run costs above and the trade becomes visible. The dollars more than triple. The scores barely move.
| Model | AA Intelligence Index |
|---|---|
| Qwen 3.7 Max | 46 |
| GPT-5.5 (xhigh) | 55 |
| Claude Opus 4.8 | 56 |
The whole trade sits in that gap. Spend about 3.5 times as much on Opus 4.8 and you buy ten index points, from 46 to 56. Put differently, Qwen 3.7 Max delivers about 82% of Opus 4.8’s measured intelligence at about 29% of the run cost. For a budget owner, that is a strong value case, and pretending otherwise to protect a premium would be dishonest.
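For readers who want to rerun the value arithmetic, here is a short sketch using only the index scores and run costs quoted in this article. The "extra dollars per extra point" figure is a framing added here for illustration; index points are not a linear unit of capability, so treat it as a rough way to size the premium rather than a measurement.

```python
# Scores and run costs as quoted in this article.
INDEX_SCORE = {"qwen-3.7-max": 46, "claude-opus-4.8": 56}
RUN_COST_USD = {"qwen-3.7-max": 1158.72, "claude-opus-4.8": 4011.58}

cheap, flagship = "qwen-3.7-max", "claude-opus-4.8"

# Share of the flagship's score the cheaper model reaches.
score_share = INDEX_SCORE[cheap] / INDEX_SCORE[flagship]

# How many times more the flagship run costs.
cost_multiple = RUN_COST_USD[flagship] / RUN_COST_USD[cheap]

# Illustrative only: extra dollars on the eval run per extra index point.
extra_points = INDEX_SCORE[flagship] - INDEX_SCORE[cheap]
extra_usd_per_point = (RUN_COST_USD[flagship] - RUN_COST_USD[cheap]) / extra_points

print(f"Score share: {score_share:.0%}")
print(f"Cost multiple: {cost_multiple:.1f}x")
print(f"Extra cost per extra index point on the eval run: ${extra_usd_per_point:,.2f}")
```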
Two honest dents in that case, though.
First, the independent index already deflated the launch hype. At release, coverage put Qwen 3.7 Max near the top of the AA leaderboard at 56.6, the highest-placed Chinese model. The live page now shows 46 and rank 12 of 154, the result of an index recalibration. The model did not get worse; the yardstick got stricter. Either way, the number to quote today is 46, not the launch figure.
Second, the index is a broad composite. It is not a coding-specific score. So it answers “how smart, in general, for the money” cleanly, and it does not by itself answer “how good at coding.” For that, you have to look at the coding benchmarks, and that is where the story gets softer.
Does Qwen 3.7 Max actually beat Claude on coding?
This is where to slow down, because the marketing and the evidence part ways.
Qwen’s announcement reports a strong coding suite: SWE-Bench Pro 60.6, SWE-Bench Multilingual 78.3, Terminal-Bench 2.0-Terminus 69.7, and SWE-Bench Verified 80.4. On paper, leadership numbers.
Every one of those is Qwen-reported. They come from Alibaba’s own announcement, reproduced on Together AI’s model page and in DataCamp’s writeup. Artificial Analysis, the independent benchmarker, does not publish its own SWE-Bench or Terminal-Bench figure for this model. So the coding wins rest on the vendor’s word, and this site’s rule is the same for every vendor: a self-reported number is a self-reported number until someone neutral re-runs it.
One figure in that suite inverts the headline. In Qwen’s own comparison table, on SWE-Bench Verified, the most-cited agentic coding benchmark, Qwen’s 80.4 trails the 80.8 it lists for Claude. Not by much. But “beats Claude on coding” is not what the vendor’s own chart shows, and the Claude figure it used (Opus 4.6 Max) is two releases behind the current Opus 4.8.
None of this makes Qwen bad at code. The benchmarks it reports are genuinely competitive, and on real work a model that scores between roughly 60 and 80 on these suites is useful. What it does mean is that the confident “it beats Claude” framing is not supported by independent measurement, and a careful buyer treats the coding case as “competitive, vendor-reported” rather than “proven winner.”
A real cost per task
The AA cost-to-run is the grounded anchor. For a single ticket, the arithmetic is yours to run, and the direction holds. Here is a transparent illustration, with every assumption on the table.
Take one mid-sized agentic coding task: resolving a bug across a few files in a real repo. Assume the agent loop reads and re-reads context across its turns to a cumulative 1.5M input tokens, and emits 80,000 output tokens of edits, plans, and tool calls. Those are illustrative numbers, not measured from a specific run, and your harness will differ.
- Qwen 3.7 Max (list): 1.5M x $2.50 + 0.08M x $7.50 = $3.75 + $0.60 = $4.35
- Qwen 3.7 Max (promo): about $2.18
- Claude Opus 4.8: 1.5M x $5.00 + 0.08M x $25.00 = $7.50 + $2.00 = $9.50
- Claude Haiku 4.5: 1.5M x $1.00 + 0.08M x $5.00 = $1.50 + $0.40 = $1.90
The same direction as the independent run, with a narrower gap. Qwen lands at about 46% of Opus per task rather than 29%, because this input-heavy profile leans on the input rate, where Qwen is half Opus’s price, not the output rate, where it is under a third. And Haiku 4.5, the cheap-Claude tier, comes in under Qwen, which is the point the rate card already warned about. Prompt caching changes all of these, often dramatically, since cached input on every model here runs about a tenth of the live rate. To pressure-test the numbers against your own token profile and caching, run them through the AI coding cost calculator; the cost-per-task method post explains why tokens-per-task and loop count, not the sticker rate, decide the bill.
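The per-task arithmetic above can be reproduced, and extended to caching, with a small function. This is a sketch under stated assumptions: the 1.5M-input and 80,000-output profile is the illustrative one from this section, the 80% cache share is a hypothetical value picked for the example, and the function ignores any cache-write surcharge a provider may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per 1M tokens, from the rate card in this article."""
    input_usd: float
    output_usd: float
    cached_usd: float


# Keys are illustrative labels, not official API model ids.
RATES = {
    "qwen-3.7-max-list": Rate(2.50, 7.50, 0.25),
    "qwen-3.7-max-promo": Rate(1.25, 3.75, 0.13),
    "claude-opus-4.8": Rate(5.00, 25.00, 0.50),
    "claude-haiku-4.5": Rate(1.00, 5.00, 0.10),
}


def task_cost(rate, input_m, output_m, cached_share=0.0):
    """Cost in USD of one task, with token counts given in millions.

    cached_share is the fraction of input tokens billed at the cached rate.
    Cache-write surcharges are not modeled.
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    live_m = input_m * (1.0 - cached_share)
    cached_m = input_m * cached_share
    return (live_m * rate.input_usd
            + cached_m * rate.cached_usd
            + output_m * rate.output_usd)


INPUT_M, OUTPUT_M = 1.5, 0.08  # the illustrative task profile from the text

for name, rate in RATES.items():
    uncached = task_cost(rate, INPUT_M, OUTPUT_M)
    cached = task_cost(rate, INPUT_M, OUTPUT_M, cached_share=0.8)  # hypothetical share
    print(f"{name}: ${uncached:.2f} uncached, ${cached:.2f} with 80% of input cached")
```

Because input dominates this profile, the cache share moves the total far more than the output rate does, which is why the same task can cost very different amounts across harnesses.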
Does Qwen 3.7 Max work with Claude Code?
Yes. Qwen 3.7 Max exposes an Anthropic-compatible Messages API, so pointing Claude Code at it is a base-URL-and-model-id swap. You keep your existing agent, commands, and config, and route the model calls to Qwen’s endpoint. That compatibility is half the reason the “replace Claude” question is even live: the switching cost is close to zero on the tooling side.
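As a rough sketch of what that swap looks like, the helper below builds an environment for launching Claude Code against a different endpoint. It assumes Claude Code reads the `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` environment variables; check the current Claude Code settings documentation before relying on that. The URL, model id, and the `QWEN_API_KEY` variable name are placeholders, not values confirmed by Alibaba or this article.

```python
import os


def claude_code_env(base_url, model_id, auth_token, parent_env=None):
    """Return a copy of the environment with Claude Code pointed at another endpoint.

    Assumption: Claude Code honors these three variables. Nothing is launched here.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    env["ANTHROPIC_BASE_URL"] = base_url      # Anthropic-compatible endpoint
    env["ANTHROPIC_AUTH_TOKEN"] = auth_token  # provider API key
    env["ANTHROPIC_MODEL"] = model_id         # model id the provider expects
    return env


env = claude_code_env(
    base_url="https://example.invalid/anthropic",    # placeholder, not a real endpoint
    model_id="qwen3.7-max",                          # placeholder model id
    auth_token=os.environ.get("QWEN_API_KEY", ""),   # hypothetical variable name
)

# To launch with this environment (not executed in this sketch):
#   import subprocess
#   subprocess.run(["claude"], env=env)
print({k: v for k, v in env.items() if k in ("ANTHROPIC_BASE_URL", "ANTHROPIC_MODEL")})
```

Building the environment per launch, rather than exporting the variables globally, keeps the swap scoped to one session and makes it easy to switch back to Anthropic's endpoint.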
The switching cost that is not zero is everything in the next section.
The catch nobody mentions: API access and data residency
Here is the section the entire first page of search results leaves out, and it is the one a serious team should read first.
Qwen 3.7 Max is API-only and closed-weight. There are no open weights to self-host, no on-prem option, and no fine-tuning rights. Every request leaves your environment for someone else’s. That someone is Alibaba Cloud, and where the request lands depends on the endpoint you pick. The default international routing runs through Singapore, with a “global” mode available via US and German regions, while the cheapest endpoint runs in mainland China and stores data there. For Western developers without an Alibaba Cloud relationship, OpenRouter resells it and Together AI runs a first-party endpoint with simpler US billing.
This is not hypothetical risk-mongering. Model availability and jurisdiction became a live engineering variable in 2026, not a footnote. The clearest example ran the other direction: a US export-control directive pulled Claude’s top coding model offline for every customer in June 2026. A model routed through China carries the mirror-image version of that exposure. If “which models can I legally run, with my data, this quarter” is part of your decision, Qwen and Claude are not interchangeable at any price.
So, should you replace Claude with Qwen?
There is no single answer, and the honest verdict turns on three things: which task, which Claude, and whose data.
- Replace with Qwen 3.7 Max if your work is cost-sensitive, high-volume, and not bound by data-residency rules: side projects, non-sensitive internal tooling, prototyping, learning. The roughly 3.5x cost advantage over Opus is real and independently measured, and the capability is competitive for everyday coding. The Claude Code compatibility means you lose almost nothing on tooling.
- Stay on Claude Opus 4.8 if the work is hard enough that the ten-point independent intelligence gap pays for itself, which it often does on gnarly multi-file reasoning, or if the task is the kind where you would rather pay for the model that leads. On Qwen’s own SWE-Bench Verified comparison, recall, the Claude model is the one in front.
- Look at Claude Haiku 4.5 before assuming Qwen is cheapest. If pure cost-per-token is the goal and you are leaving Opus to save money, Haiku 4.5 is cheaper than Qwen on the rate card and stays inside Anthropic’s compliance posture. Qwen beats the flagship on price, not the whole line.
- Rule Qwen out, regardless of price, if your code cannot legally or contractually leave for a third-party API that may route through China. No open weights means no way around it.
The most defensible setup for a cost-conscious team is not a wholesale switch. It is routing the daily grind to the cheapest model that clears the task, keeping the flagship for the hard 20%, and keeping anything sensitive on a provider whose data path you can live with. Qwen 3.7 Max earns a seat in that mix on cost. It does not earn a blanket “replace Claude,” because the thing the low price does not buy turns out to matter exactly when the task gets hard or the data gets sensitive.
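That routing rule can be written down as a small policy function. The sketch below is one possible shape, not a recommended implementation: the `approved` flags stand in for your own compliance decisions, the `cleared` set stands in for whatever your own evals show a cheaper model can handle, and the `hard` flag is a judgment you supply. Output list prices come from the rate card above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    hard: bool       # your own judgment of difficulty
    sensitive: bool  # bound by data-residency or contractual rules


# Output list price in USD per 1M tokens, from the rate card.
# "approved" is a placeholder for your own compliance review, not a fact about any vendor.
MODELS = {
    "qwen-3.7-max":     {"output": 7.50,  "approved": False, "flagship": False},
    "claude-haiku-4.5": {"output": 5.00,  "approved": True,  "flagship": False},
    "claude-opus-4.8":  {"output": 25.00, "approved": True,  "flagship": True},
}


def pick_model(task, cleared):
    """Pick the cheapest permitted model that clears the task, else the flagship.

    cleared: names of non-flagship models your own evals show handle this kind of task.
    """
    permitted = {
        name: m for name, m in MODELS.items()
        if m["approved"] or not task.sensitive
    }
    if not task.hard:
        cheap = [n for n, m in permitted.items() if n in cleared and not m["flagship"]]
        if cheap:
            return min(cheap, key=lambda n: permitted[n]["output"])
    flagships = [n for n, m in permitted.items() if m["flagship"]]
    if not flagships:
        raise LookupError(f"no permitted model for task {task.name!r}")
    return min(flagships, key=lambda n: permitted[n]["output"])


print(pick_model(Task("rename helper", hard=False, sensitive=False), {"qwen-3.7-max"}))
print(pick_model(Task("billing refactor", hard=False, sensitive=True),
                 {"qwen-3.7-max", "claude-haiku-4.5"}))
print(pick_model(Task("cross-service deadlock", hard=True, sensitive=False),
                 {"qwen-3.7-max"}))
```

The sensitivity filter runs before any price comparison, which mirrors the order of the argument in this article: the data path can rule a model out before cost is considered.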
Run both on a representative slice of your own work for a week, log the tokens and the dollars, and check where your code actually has to live. On a decision this close, your workload and your compliance rules are the only benchmarks that count.
Frequently asked questions
Is Qwen 3.7 Max open source? No. Unlike earlier open-weight Qwen models, Qwen 3.7 Max is closed-weight and API-only. There are no downloadable weights, no self-hosting, and no fine-tuning.
How much does Qwen 3.7 Max cost? List price is $2.50 per million input tokens and $7.50 per million output, with a roughly 50% promo running on the international endpoint that brings it to about $1.25/$3.75. By Artificial Analysis’s independent run, it costs $1,158.72 to run the full Intelligence Index, against $4,011.58 for Claude Opus 4.8.
Does Qwen 3.7 Max work with Claude Code? Yes. It exposes an Anthropic-compatible Messages API, so you can run it inside Claude Code by changing the base URL and model id.
Is Qwen 3.7 Max better than Claude for coding? On Qwen’s own reported benchmarks it is competitive, but those are vendor figures, independently unverified, and on SWE-Bench Verified Qwen’s self-reported 80.4 trails the 80.8 it lists for Claude Opus 4.6 Max. On the independent Intelligence Index, Opus 4.8 scores higher (56 vs 46).
Is Qwen 3.7 Max safe to use on proprietary or regulated code? That is the key risk, not the price. It is API-only with no on-prem option, and the cheapest endpoint stores data in mainland China. For regulated or contractually restricted code, the data path can disqualify it before cost is even considered.
Sources
Alibaba Qwen Team (2026). Qwen3.7-Max announcement, reported. TechNode (secondary coverage). https://technode.com/2026/05/21/alibaba-introduces-qwen3-7-max-as-next-gen-ai-agent-model/
Anthropic (2026). Pricing. Claude API documentation (vendor documentation). https://platform.claude.com/docs/en/about-claude/pricing
Artificial Analysis (2026). Qwen3.7 Max: Intelligence, Performance & Price Analysis. Artificial Analysis (independent benchmarking). https://artificialanalysis.ai/models/qwen3-7-max
Artificial Analysis (2026). Claude Opus 4.8: Intelligence, Performance & Price Analysis. Artificial Analysis (independent benchmarking). https://artificialanalysis.ai/models/claude-opus-4-8
DataCamp (2026). Qwen3.7-Max: Features, Benchmarks and Agent Capabilities. DataCamp (secondary; reproduces Qwen’s self-reported benchmark suite). https://www.datacamp.com/blog/qwen3-7-max
eesel AI (2026). Qwen pricing in 2026. eesel AI (secondary). https://www.eesel.ai/blog/qwen-pricing
Office Chai (2026). Qwen 3.7 Max Becomes Highest-Placed Chinese Model On Artificial Analysis Index. Office Chai (secondary; reports the 56.6 launch figure). https://officechai.com/ai/qwen-3-7-max-benchmarks/
Together AI (2026). Qwen3.7-Max API. Together AI (first-party reseller; reproduces Qwen’s reported benchmarks). https://www.together.ai/models/qwen37-max
VentureBeat (2026). Alibaba’s proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic’s Claude Code. VentureBeat (secondary coverage). https://venturebeat.com/technology/alibabas-proprietary-qwen3-7-max-can-run-for-35-hours-autonomously-and-supports-external-harnesses-like-anthropics-claude-code
Yotta Labs (2026). Qwen 3.7-Max: Release, Features, Open-Source Status and How to Access. Yotta Labs (secondary). https://www.yottalabs.ai/post/qwen-3-7-max-release-date-features-open-source-status-and-how-to-access-2026