Capital & Compute
Tags: ai, coding-agents, pricing, economics, local-llms, open-source

Cohere North Mini Code: What a Free Coding Model Costs

Cohere North Mini Code is free on the API and open-weight. Here is what it really costs per task once you self-host it on a single H100.

By Capital & Compute

The price tag on Cohere’s new coding model is the most honest zero you will see this year, and the most misleading. North Mini Code, released June 9, 2026, costs $0 per million input tokens and $0 per million output tokens on the hosted endpoints that list it. You can route it through OpenRouter’s free tier today and pay nothing.

So the headline writes itself: a capable agentic coding model, free. The headline is also where most of the coverage stops. Nobody is asking the question that decides whether “free” survives contact with a real workflow: what does one coding task actually cost you?

Because the model is open-weight, that question has two completely different answers, and the gap between them is the whole story.

Figure: How a free coding model reaches $1.74 per task (self-hosted, Akash H100). Waterfall chart: the advertised API price starts at $0.00, the GPU time for one 40,000-token task adds about $0.11 when the H100 is kept busy, and the idle hours at light personal use (about 20 tasks a day) add about $1.63, landing on a real cost of about $1.74 per task.

| Step | Change | Running total |
| --- | --- | --- |
| Advertised API price (OpenRouter free tier) | $0.00 | $0.00 |
| GPU time per task (H100 kept busy) | +$0.11 | $0.11 |
| Idle hours you pay for (light use, ~20 tasks/day) | +$1.63 | $1.74 |
| Real cost per task | $1.74 | $1.74 |

The advertised price is $0. The real cost of one coding task self-hosted on the cheapest H100 (Akash, about $1.45/hr) is built from the GPU time the task uses plus the idle hours you still pay for at light personal volume. Modeled, not a vendor benchmark.

Source: Modeled from H100 hourly rates (Akash via the Capital & Compute GPU breakdown) and an assumed 150 tok/s single-stream throughput on a 3B-active model; illustrative.

Is Cohere North Mini Code free?

Cohere North Mini Code is free per token on hosted endpoints (OpenRouter and Cohere Model Vault) and open-weight under Apache 2.0, but it is not free to self-host: running it needs a GPU that bills by the hour whether you use it or not. The API is free today; the hardware never is. That distinction is the point of this piece.

That answer matters more than it looks, because a free hosted tier is a commercial decision, not a permanent feature. Free tiers get rate-limited, throttled, or quietly retired once a model stops being a customer-acquisition tool. The open weights, by contrast, are yours for good. So the cost that actually has a future is the self-hosted one, and that is the number worth working out.

What North Mini Code actually is

North Mini Code is Cohere’s first model aimed squarely at developers, and the design is built around cheap inference rather than raw size. Per Cohere’s launch post, it is a sparse mixture-of-experts model with 30 billion total parameters but only 3 billion active on any given token. It carries a 256K-token context window and can generate up to 64K tokens in a single response.

The mixture-of-experts trick is what keeps it small at runtime. The model holds the knowledge capacity of a 30B network, but each forward pass routes through roughly 3B parameters, so you pay the inference cost of a 3B model while keeping much of the competence of a far larger one. That is why Cohere lists a minimum hardware bar of a single H100 at FP8, the detail VentureBeat put in its headline. One datacenter GPU, not a cluster.
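A rough way to see why one GPU is enough is to separate memory from compute: every expert has to sit in memory, so the footprint follows the 30B total, while per-token compute follows the 3B active. The sketch below estimates weight memory only. The bytes-per-parameter figures and the 80 GB H100 capacity are assumptions brought in for illustration, not numbers from Cohere, and the KV cache, activations, and runtime overhead are ignored.

```python
# Rough weight-memory estimate for a mixture-of-experts model.
# Assumptions (not from Cohere): 1 byte per parameter at FP8, 0.5 at FP4,
# and 80 GB of memory on one H100. KV cache, activations and runtime
# overhead are ignored, so real headroom is smaller than shown here.

BYTES_PER_PARAM = {"fp8": 1.0, "fp4": 0.5}
H100_MEMORY_GB = 80.0


def weight_memory_gb(total_params_billions: float, precision: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * BYTES_PER_PARAM[precision]


def fits_on_one_gpu(total_params_billions: float, precision: str,
                    gpu_memory_gb: float = H100_MEMORY_GB) -> bool:
    """True if the weights alone are smaller than the GPU's memory."""
    return weight_memory_gb(total_params_billions, precision) < gpu_memory_gb


if __name__ == "__main__":
    # Memory is sized by the 30B total, not the 3B active parameters.
    for precision in ("fp8", "fp4"):
        size = weight_memory_gb(30, precision)
        print(f"{precision}: ~{size:.0f} GB of weights, "
              f"~{H100_MEMORY_GB - size:.0f} GB left for everything else")
```

Whatever is left after the weights has to hold the KV cache, which grows with context length, so a long 256K-token session will eat into that headroom.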

On capability, it punches above its active-parameter count. It scores 33.4 on the Artificial Analysis Coding Index, ahead of Devstral Small 2 and larger open models, and Cohere reports up to 2.8x higher output throughput than Devstral Small 2 at matched concurrency and hardware. Its SWE-bench numbers were measured on the SWE-agent harness. Throughput is not a vanity metric here. For a model you run yourself, tokens per second is the lever that sets cost per task, because the bill is rented time and faster generation means fewer rented seconds per job.

The real cost-per-task math

Here is where the $0 stops being the interesting number. When you self-host, your cost per task is rented GPU time, and that follows the same arithmetic as any owned or rented hardware:

(GPU hourly rate) ÷ (tasks you actually finish in that hour)

Both inputs are knowable. The hourly rate for a single H100, as worked through in the decentralized GPU cost breakdown, runs from about $1.45 on a decentralized marketplace like Akash to roughly $6.88 on AWS on-demand, with neo-cloud specialists in between. The task throughput depends on the model, and North Mini Code’s 3B active parameters make it fast.

Model a single-stream generation rate on the order of 150 tokens per second for a 3B-active model on one H100, and call a real coding task 40,000 generated output tokens (the multi-file-change profile the AI coding cost calculator uses as its reference job). That task occupies the GPU for about 267 seconds of generation, or roughly 0.074 of an hour. On a $1.45/hr Akash H100 that is about $0.11 per task. On a $6.88/hr AWS box, the same task is about $0.51.
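The arithmetic above can be written as a small function so you can plug in your own rate, task size, and throughput. This is a minimal sketch of the busy-GPU case only: it assumes the GPU is billed purely for generation time, and the function names are made up for illustration. The 150 tok/s figure is the modeled assumption from the text, not a measured benchmark.

```python
def gpu_hours_per_task(output_tokens: int, tokens_per_second: float) -> float:
    """Hours of generation one task occupies on the GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second / 3600.0


def busy_cost_per_task(hourly_rate_usd: float, output_tokens: int,
                       tokens_per_second: float) -> float:
    """Cost of one task when the GPU runs tasks back to back (no idle time)."""
    return hourly_rate_usd * gpu_hours_per_task(output_tokens, tokens_per_second)


if __name__ == "__main__":
    TOKENS = 40_000   # reference coding task from the text
    TOK_PER_S = 150   # modeled single-stream rate, an assumption
    for name, rate in (("Akash", 1.45), ("AWS on-demand", 6.88)):
        cost = busy_cost_per_task(rate, TOKENS, TOK_PER_S)
        print(f"{name}: ${cost:.2f} per task at ${rate:.2f}/hr")
```

Because cost scales with the inverse of throughput, doubling tokens per second halves the per-task figure at any hourly rate.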

The provider tier swings that number as much as the model does. AWS on-demand charges nearly five times Akash’s hourly rate for the same generation work, so the per-task cost rises by the same multiple.

Modeled cost per task, by provider and load:

Akash H100, kept busy ($1.45/hr, back-to-back tasks): ~$0.11
AWS H100, kept busy ($6.88/hr, back-to-back tasks): ~$0.51
Akash H100, light use (~20 tasks/day, idle dominates): ~$1.74
AWS H100, light use (~20 tasks/day, idle dominates): ~$8.26

Now the catch, and it is the same one that runs through self-hosted LLM tokenomics: nobody rents a GPU by the task. You rent it by the hour or the month, and you pay for every minute it sits idle waiting for you to type. Keep a $1.45/hr Akash H100 running around the clock and you spend about $1,044 a month. Push a flat-out grind of nearly 10,000 tasks through it and you are back near $0.11 each. Run a realistic personal load of 20 tasks a day, about 600 a month, and that same $1,044 spreads over far less work: roughly $1.74 per task. The hardware did not get more expensive. The idle hours did.
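The idle-hours effect is easy to model: the monthly bill is fixed, so cost per task is just that bill divided by however many tasks you actually run. The sketch below assumes a 30-day month with the GPU rented around the clock, and reuses the modeled 40,000-token task and 150 tok/s rate from earlier; the function names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPU never switched off


def monthly_gpu_cost(hourly_rate_usd: float) -> float:
    """Fixed monthly bill for one always-on GPU."""
    return hourly_rate_usd * HOURS_PER_MONTH


def max_tasks_per_month(output_tokens: int, tokens_per_second: float) -> int:
    """Upper bound on tasks if the GPU generates flat out all month."""
    return int(HOURS_PER_MONTH * 3600 * tokens_per_second / output_tokens)


def cost_per_task_at_volume(hourly_rate_usd: float, tasks_per_day: float,
                            days_per_month: int = 30) -> float:
    """Monthly bill spread over the tasks actually run, idle hours included."""
    tasks = tasks_per_day * days_per_month
    if tasks <= 0:
        raise ValueError("need at least one task to spread the bill over")
    return monthly_gpu_cost(hourly_rate_usd) / tasks


if __name__ == "__main__":
    capacity = max_tasks_per_month(40_000, 150)
    print(f"Akash monthly bill: ${monthly_gpu_cost(1.45):,.0f}")
    print(f"Flat-out capacity: {capacity:,} tasks/month")
    for per_day in (20, 100, capacity / 30):
        cost = cost_per_task_at_volume(1.45, per_day)
        print(f"{per_day:,.0f} tasks/day -> ${cost:.2f} per task")
```

At 20 tasks a day the GPU is generating for only about 6% of the hours you pay for, which is where the gap between the busy and light-use figures comes from.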

When self-hosting a “free” model beats the free API

If the hosted tier is genuinely $0 and the self-hosted version costs real money, why would anyone run their own? The same reasons that show up every time the rate card is not the whole story, which is the trap dissected in why cheaper AI models can cost more.

A free hosted endpoint is the most repriceable thing in computing. It can add rate limits next quarter, gate the model behind a paid plan, or drop it from the free list entirely once it has done its job of getting developers in the door. None of that can touch a set of Apache 2.0 weights sitting on your own disk. You traded a variable risk for a fixed cost, which is exactly the swap that makes sense when the work is sensitive, latency-critical, or simply too important to depend on someone else’s pricing committee.

Then there is data. If your code cannot leave the building for legal or contractual reasons, the cost-per-task comparison is moot: a hosted API is off the table at any price, free included, and a model you can run on one in-house H100 is the only option that exists. North Mini Code fitting on a single GPU is what makes that practical without a cluster, a point worth weighing alongside the broader argument about why local LLMs got good in 2026.

For everyone else, running interactive, bursty, human-paced work at low volume, the free hosted tier wins on price for exactly as long as it stays free. That is a real win. It is also a rented one.
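To put a number on "as long as it stays free", you can ask how many tasks a month it would take for an always-on GPU to beat a hosted price. The sketch below is a thought experiment, not a forecast: the $0.50 hosted price per task is purely hypothetical, since the hosted tier is $0 today, and the 30-day always-on month is an assumption.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPU always on


def break_even_tasks_per_month(hourly_rate_usd: float,
                               hosted_price_per_task_usd: float) -> float:
    """Monthly task count above which self-hosting is cheaper than hosted.

    Returns infinity when the hosted price is zero: a free tier can never
    be beaten on price, only on control, privacy, or predictability.
    """
    if hosted_price_per_task_usd <= 0:
        return math.inf
    return hourly_rate_usd * HOURS_PER_MONTH / hosted_price_per_task_usd


if __name__ == "__main__":
    akash_rate = 1.45
    # 0.00 is today's free tier; 0.50 is a HYPOTHETICAL repriced tier.
    for hosted_price in (0.00, 0.50):
        tasks = break_even_tasks_per_month(akash_rate, hosted_price)
        print(f"hosted ${hosted_price:.2f}/task -> break-even at "
              f"{tasks:,.0f} tasks/month")
```

Any break-even figure also has to sit below the GPU's flat-out capacity (about 9,720 of the modeled tasks a month on one H100), otherwise a single card cannot reach it.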

The bottom line

North Mini Code is a genuinely useful release: a fast, open-weight, single-GPU coding model that scores well for its size and costs nothing to try. The $0 is not a marketing lie. It is just a launch-tier price on a hosted endpoint, and prices like that have a way of changing once the model has served its purpose.

The number that will still be true next year is the self-hosted one, and it is a utilization story, not a sticker price. Keep an H100 busy and a coding task costs around a dime. Let it idle and the same task costs a couple of dollars, because you are paying for the waiting, not the work. Self-host it when control, privacy, or a fixed, un-repriceable bill is worth more than a free tier you do not own. If cheap tokens are the only goal, the free API already won, right up until the day it is no longer free. To track where North Mini Code sits against priced models, the AI model registry keeps the current lineup, and the AI pricing tracker holds the per-token rate cards it is competing against.

Frequently asked questions

Is Cohere North Mini Code free?
Yes, per token on hosted endpoints: North Mini Code lists $0 input and $0 output on OpenRouter's free tier and Cohere Model Vault, and its weights are open under Apache 2.0 with no licence fee. Self-hosting is not free, because the model needs a GPU that costs money by the hour whether you use it or not. The hosted free tier is a launch-tier price that can change; the open weights are permanent.
What hardware do you need to run North Mini Code?
A single NVIDIA H100. Cohere lists a minimum bar of one H100 at FP8 (or FP4), which is possible because the model is a 30B-total, 3B-active mixture-of-experts, so each token only routes through about 3B parameters. That keeps inference cheap enough to fit one datacenter GPU rather than a cluster.
How much does North Mini Code cost per task if you self-host it?
Modeled, a 40,000-token coding task costs roughly $0.11 on a $1.45/hr Akash H100 kept busy and about $0.51 on a $6.88/hr AWS H100, since the task occupies the GPU for about 0.074 of an hour. At light personal volume, where the rented GPU mostly idles, the same task rises to roughly $1.74 because you pay for the idle hours too.
Is North Mini Code good for coding?
For its size, yes. It scores 33.4 on the Artificial Analysis Coding Index, ahead of Devstral Small 2 and larger open models, and Cohere reports up to 2.8x higher output throughput than Devstral Small 2 at matched concurrency. It is built for agentic software tasks with a 256K context window and up to 64K tokens of output.
