Local LLM Tokenomics: Self-Hosted Cost Per Token (2026)
What a self-hosted LLM token really costs in 2026: cost per token across owned hardware, why memory bandwidth sets speed, and where buying beats the API.
By Capital & Compute
A used pile of three RTX 3090s, the kind people sell after a GPU upgrade, generates tokens more than three times faster than NVIDIA’s $4,699 “AI supercomputer on your desk.” Same model, same test. The cheap rig wins on the one number that decides whether a local LLM feels usable: tokens per second during generation.
That single fact is the whole of local-LLM tokenomics in miniature. The price tag tells you almost nothing about how fast a box runs a model, and the speed tells you almost nothing about what a token actually costs you. Both of those are settled by things the spec sheet buries: memory bandwidth, and how hard you keep the hardware working. Get those two wrong and you overpay for a slow machine that never earns back its sticker price.
This is the hardware side of the local-versus-API question. The capability side, why open-weights models got good enough to bother with in the first place, is covered in why local LLMs got good in 2026. Here the question is narrower and more mercenary: if you run a model yourself, on a box you own, what does each token cost, and when does that beat just paying an API?
Is it cheaper to run an LLM locally or use an API?
For almost everyone, the hosted API is cheaper. APIs charge per token and stay cheap at low and medium volume, while owned hardware front-loads a large fixed cost. Self-hosting wins on price only above roughly a billion tokens a month; below that, it makes sense only when privacy, latency, or a fixed bill matters more than the rate.
That holds because the per-token API rate is genuinely low and the hardware bill is genuinely large, so the fixed cost only spreads thin at enormous, steady volume. The rest of this piece works the exact numbers.
Here is what the hardware looks like when you line it up by the number that matters. The model is held constant: gpt-oss-120b, OpenAI’s 120-billion-parameter open-weights model, quantized to MXFP4, which is roughly the largest “serious” model a desk-side machine can hold today.
| Setup | Price (2026) | Memory | Bandwidth | Generation, gpt-oss-120b | Best-case cost/Mtok |
|---|---|---|---|---|---|
| 3x RTX 3090 (used) | ~$3,500 all-in | 72 GB (3x 24) | 936 GB/s per card | ~124 tok/s | ~$0.74 |
| NVIDIA DGX Spark | $4,699 | 128 GB unified | 273 GB/s | ~38.55 tok/s | ~$1.61 |
| Mac Studio (M3 Ultra) | from $3,999 | up to 256 GB | 819 GB/s | ~20-30 tok/s (70B 4-bit model, not gpt-oss-120b) | varies |
| RTX 4090 (single) | ~$1,600 used | 24 GB | 1,008 GB/s | n/a (model exceeds VRAM) | n/a |
| Hosted open-model API | $0 upfront | n/a | n/a | provider-dependent | $0.10-0.90 |
The generation figures come from independent community benchmarking, not vendor marketing: the LMSYS write-up on running GPT-OSS on the DGX Spark (November 2025) and the llama.cpp DGX Spark performance discussion, with the head-to-head against a triple-3090 build documented in a 2026 Dendro Logic concurrency benchmark. Numbers shift with quantization, runtime, and batch size, so read them as the shape of the result, not a fixed constant. The “best-case cost/Mtok” column assumes the machine runs flat out 24/7, which nobody actually does. More on why that number is a fantasy in a minute.
Two cells deserve a flag. The single RTX 4090 has the highest memory bandwidth on the list and still cannot run this model, because 24 GB of VRAM cannot hold a 120B model at all. And the Mac Studio’s headline once read “up to 512 GB”; Apple pulled the 512 GB option in March 2026 as global DRAM shortages bit, capping the machine at 256 GB and raising the 256 GB upgrade by $400. The same memory squeeze is why the DGX Spark’s own price jumped: NVIDIA officially raised the Founders Edition MSRP from $3,999 to $4,699 on February 23, 2026. RAM prices are now a line item in local-AI tokenomics.
Why memory bandwidth, not price, sets tokens per second
Start with what the machine is actually doing when it writes you a reply. To produce each new token, the model reads its active weights out of memory, runs them against the running context, and emits one token. Then it does the whole thing again for the next token. And the next. Autoregressive generation is a sequence of single steps, and every step has to pull the active weight set across the memory bus once.
So the ceiling on generation speed is not how many math operations the chip can do. It is how fast it can move weights from memory into the compute units. That is memory bandwidth, measured in gigabytes per second. A chip with twice the raw FLOPS but half the bandwidth will generate tokens more slowly, because it spends most of its time waiting on memory, not computing.
This is why the DGX Spark, a genuinely capable piece of silicon, generates so slowly. Its Grace Blackwell GB10 chip delivers up to 1 petaFLOP of FP4 compute per NVIDIA’s own spec page, which is enormous. But it feeds that compute from 128 GB of LPDDR5x at 273 GB/s. The RTX 3090, a card from 2020, moves 936 GB/s of GDDR6X. Stack three of them and the aggregate bandwidth dwarfs the Spark’s single pool. The Spark has the bigger engine; the 3090 rig has the wider fuel line, and generation is a fuel-line problem.
| Hardware | Memory bandwidth |
|---|---|
| RTX 4090 | 1,008 GB/s |
| RTX 3090 (per card) | 936 GB/s |
| Mac Studio (M3 Ultra) | 819 GB/s |
| DGX Spark | 273 GB/s |
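To make the fuel-line argument concrete, here is a minimal sketch of the bandwidth ceiling on decode speed: if each token has to stream the active weights across the bus once, tokens per second cannot exceed bandwidth divided by the size of that active set. The bandwidth figures are the ones in the table above. The function name and the 3 GB active-weight size are illustrative assumptions, not measurements from the benchmarks cited here.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if every token streams the active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Memory bandwidth in GB/s, taken from the table above.
BANDWIDTH_GB_S = {
    "RTX 4090": 1008,
    "RTX 3090 (per card)": 936,
    "Mac Studio (M3 Ultra)": 819,
    "DGX Spark": 273,
}

# ASSUMPTION: illustrative size of the weights read per token, not a measured value.
ACTIVE_WEIGHTS_GB = 3.0

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, ACTIVE_WEIGHTS_GB)
    print(f"{name}: at most {ceiling:.0f} tok/s")
```

Treat the result as a ceiling only. Measured throughput lands below it, because context reads, multi-GPU communication, and runtime overhead all take their share. Whatever active-weight size you plug in, the ordering between machines follows the bandwidth column.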
Capacity and bandwidth are different constraints, and you need both. Capacity (total GB of memory) decides whether the model fits at all. Bandwidth (GB/s) decides how fast it runs once it fits. The single RTX 4090 has the best bandwidth on the table and the worst capacity, so it screams on a 13B model and cannot load a 120B one. The DGX Spark inverts that: huge capacity, thin bandwidth, so it holds big models and runs them slowly. A multi-3090 rig and a maxed Mac Studio are the setups that get both, which is exactly why they are what serious local-LLM builders actually buy. Runtimes like llama.cpp and vLLM squeeze more out of a given machine with tricks like speculative decoding, but they cannot move more bytes than the bus allows.
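The capacity half of that check is simple arithmetic, sketched below: parameter count times bits per weight gives a rough weight footprint, which either fits in memory or does not. The helper names are hypothetical. The 4.25 bits per weight is an assumption (4-bit values plus a shared per-block scale, applied uniformly to every parameter), and the sketch ignores KV cache and runtime overhead unless you pass a headroom figure.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the weights plus optional headroom fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


# ASSUMPTION: ~4.25 bits per weight for a 4-bit block-scaled format.
BITS = 4.25

# Memory capacities in GB, taken from the hardware table earlier in the article.
MEMORY_GB = {
    "RTX 4090 (single)": 24,
    "3x RTX 3090": 72,
    "DGX Spark": 128,
    "Mac Studio (M3 Ultra)": 256,
}

for name, memory in MEMORY_GB.items():
    verdict = "fits" if fits(120, BITS, memory) else "does not fit"
    print(f"120B model on {name}: {verdict}")
```

Under these assumptions a 120B model needs roughly 64 GB for weights alone, which rules out a single 24 GB card and leaves the 72 GB rig with little room for context.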
The split nobody mentions: prompt processing versus generation
There is a second number on that benchmark, and it tells the opposite story. On prompt processing, the DGX Spark hits roughly 1,723 tokens per second, slightly ahead of the triple-3090 rig’s 1,642. The “slow” machine wins the prefill phase.
That is not a contradiction. It is the tell that two different phases of inference stress two different parts of the hardware.
Prompt processing (prefill) is when the model reads your input: the system prompt, the pasted file, the long context. It can chew through all those tokens in parallel, in big matrix multiplies, which is a compute-bound job. Here the Spark’s FP4 compute and the architecture’s efficiency shine. Generation (decode) is the one-token-at-a-time phase described above, which is memory-bound. The Spark is fast at the part you wait through once and slow at the part you wait through for every single token of the answer.
The DGX Spark is fast at reading the question and slow at writing the answer. For interactive use, you feel the second number far more than the first.
Which number matters depends on the workload. Stuff a 100,000-token codebase into context and ask one short question, and prefill dominates, so the Spark’s profile is fine. Hold a long back-and-forth where the model writes paragraphs of output, and decode dominates, so the 3090 rig feels three times more responsive. Most interactive use (chat, coding, agents) lives in the second world. That is why the generation number is the one the table sorts on, and the one that should drive a buying decision.
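A rough way to see the two worlds side by side is to model a request as prefill time plus decode time, using the prompt-processing and generation rates quoted above. This is a sketch under a simplifying assumption: both rates stay constant regardless of context length, which real runtimes do not guarantee. The `Machine` class and the two sample workloads are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    prefill_tok_s: float  # prompt-processing rate
    decode_tok_s: float   # generation rate

    def request_seconds(self, prompt_tokens: int, output_tokens: int) -> float:
        """Prefill time plus decode time, assuming constant rates."""
        return (prompt_tokens / self.prefill_tok_s
                + output_tokens / self.decode_tok_s)


# Rates from the benchmark figures quoted in the article.
SPARK = Machine("DGX Spark", prefill_tok_s=1723, decode_tok_s=38.55)
RIG = Machine("3x RTX 3090", prefill_tok_s=1642, decode_tok_s=124)

# Hypothetical workloads: (prompt tokens, output tokens).
WORKLOADS = {
    "big context, short answer": (100_000, 50),
    "chat turn, long answer": (500, 1_000),
}

for label, (prompt, output) in WORKLOADS.items():
    for machine in (SPARK, RIG):
        seconds = machine.request_seconds(prompt, output)
        print(f"{label} on {machine.name}: {seconds:.1f} s")
```

With these inputs the Spark finishes the long-context question slightly sooner (about 59.3 s against 61.3 s), while the 3090 rig finishes the chat turn roughly three times faster (about 8.4 s against 26.2 s).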
What a self-hosted token actually costs
Now the money. The cost of a token you generate yourself is not the GPU’s sticker price and it is not the API’s rate card. It is this:
(hardware amortized over its life + electricity to run it) / tokens you actually generate
The numerator is mostly fixed. The denominator is entirely up to you. That ratio, not the hardware tier, is what sets your real cost per token, and it is why two people with identical rigs can have a 20x difference in what a token costs them.
Work it through with the triple-3090 build, since it is the cheapest fast option. Three used 3090s run roughly $750-1,050 each ($900 is a fair mid-point), and a host to hold them (power supply, board, CPU, memory) adds maybe $800. Call it $3,500 all-in. Amortized over a three-year life, that is about $97 a month before a single token is generated. Under full load the rig draws around 1.1 kW. At the U.S. average residential electricity rate of about 17.7 cents per kWh (EIA, early 2026), running it flat out around the clock adds roughly $140 a month in power.
At a flat-out grind, 124 tokens per second works out to about 321 million output tokens a month. Fixed cost plus power is roughly $237, so the math lands near $0.74 per million output tokens. That is a genuinely low number, and it is the best case the hardware can ever produce. Even so, it only barely undercuts a hosted open-model API such as DeepSeek V4, which lists $0.87 per million output tokens, and the lighter V4-Flash runs cheaper still. Run the rig at its theoretical maximum forever, and you draw level with a price you could have paid with no hardware, no setup, and no electricity bill.
Nobody runs a personal rig at 100% utilization. Push a heavy individual workload of 10 million tokens a month through the same box and the picture inverts. The fixed $97 now spreads over a fraction of the output, so the cost per million tokens climbs to roughly $10, more than ten times the hosted rate. The hardware does not get more expensive; the idle time does. Every hour the GPU sits waiting for you to type is fixed cost amortizing over zero tokens.
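The arithmetic behind both figures can be sketched in a few lines, using the triple-3090 numbers from the text: $3,500 amortized over 36 months, 1.1 kW under load, 17.7 cents per kWh, and 124 tokens per second. The function name is hypothetical, and the sketch assumes a 30-day month and zero power draw while idle, which is optimistic for a real machine.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ASSUMPTION: 30-day month


def cost_per_mtok(tokens_per_month: float,
                  hardware_usd: float = 3500,
                  life_months: int = 36,
                  power_kw: float = 1.1,
                  usd_per_kwh: float = 0.177,
                  tok_s: float = 124) -> float:
    """(Amortized hardware + electricity while generating) per million tokens."""
    capacity = tok_s * SECONDS_PER_MONTH
    if not 0 < tokens_per_month <= capacity:
        raise ValueError("volume must be positive and within the rig's capacity")
    fixed_usd = hardware_usd / life_months
    busy_hours = tokens_per_month / tok_s / 3600
    # ASSUMPTION: power is only paid for while generating; idle draw is ignored.
    power_usd = busy_hours * power_kw * usd_per_kwh
    return (fixed_usd + power_usd) / (tokens_per_month / 1e6)


flat_out = 124 * SECONDS_PER_MONTH          # about 321 million tokens
print(f"{cost_per_mtok(flat_out):.2f}")     # prints 0.74
print(f"{cost_per_mtok(10_000_000):.2f}")   # prints 10.16
```

Only the denominator changes between the two calls. The hardware and the electricity rate are identical, which is the whole point about utilization.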
So where is the breakeven?
Set owned-hardware cost equal to API cost and solve for volume, and the result is brutal at the cheap end. Run the triple-3090 rig flat out and its marginal cost, the electricity alone with the $3,500 hardware treated as already spent, still lands near $0.44 per million tokens. That is already inside the range of the cheapest hosted open-model APIs. So before you count a dollar of the hardware, the rig barely undercuts a $0.50-per-million API, and the fixed cost only fully amortizes against that rate somewhere up in the billions of tokens a month. That is not a personal workload. That is a small product serving real traffic.
Against a premium frontier API at around $5 per million tokens, the breakeven drops to roughly 20 million tokens a month, which is reachable for a heavy user. But that comparison cheats: you would be running an open model on your rig and comparing it to a frontier model on the API, which is a capability downgrade, not a like-for-like swap. The honest comparison, open model against hosted open model, keeps the breakeven up in the billions. This is the same cost-per-token-versus-cost-per-task trap dissected in why cheaper AI models can cost more: the rate card is only half the equation, and the cheap-looking option often loses once you count what it takes to finish the job.
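The breakeven volume falls out of one line of algebra: the monthly fixed cost divided by the gap between the API rate and the rig's marginal (electricity-only) cost per million tokens. The sketch below uses the article's triple-3090 inputs. The function name is hypothetical, and the same assumptions apply as before: a 30-day month and no idle power.

```python
from typing import Optional

FIXED_USD_PER_MONTH = 3500 / 36                    # amortized hardware
CAPACITY_MTOK = 124 * 30 * 24 * 3600 / 1e6         # flat-out monthly output
MARGINAL_USD_PER_MTOK = (1.1 * 720 * 0.177) / CAPACITY_MTOK  # electricity only


def breakeven_mtok(api_usd_per_mtok: float,
                   fixed_usd: float = FIXED_USD_PER_MONTH,
                   marginal_usd_per_mtok: float = MARGINAL_USD_PER_MTOK
                   ) -> Optional[float]:
    """Monthly volume (millions of tokens) where owning matches the API bill.

    Returns None when the API is at or below the rig's marginal cost,
    in which case no volume ever breaks even.
    """
    margin = api_usd_per_mtok - marginal_usd_per_mtok
    if margin <= 0:
        return None
    return fixed_usd / margin


for rate in (0.50, 5.00):
    volume = breakeven_mtok(rate)
    print(f"API at ${rate:.2f}/Mtok: breakeven near {volume:,.0f} Mtok/month "
          f"(one rig tops out near {CAPACITY_MTOK:,.0f})")
```

With these inputs the $5 case lands around 21 million tokens a month and the $0.50 case around 1.5 billion. The second figure is well above the roughly 321 million tokens one rig can generate flat out, so against a cheap hosted rate a single box of this kind does not reach breakeven at all.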
For the full per-token API rate card to benchmark against, the Capital & Compute AI pricing tracker keeps the current numbers, and the AI coding cost calculator models what a real task costs at those rates. If you are weighing rented cloud GPUs instead of owned hardware, that carries the per-token problem and adds an hourly meter on top; the decentralized GPU cost breakdown works that case through.
When local wins for reasons that are not price
Cost is the wrong lens for most people who self-host anyway. Here is when owning the hardware is the right call regardless of the per-token math.
Data that cannot leave. If your inputs are regulated, confidential, or contractually barred from third-party processing, local inference is not an optimization, it is the requirement. No API rate competes with “the data never crossed the network.”
Latency, offline, and no rate limits. A local model has no network round trip, no throttling, and no dependence on someone else’s uptime. For tight interactive loops, air-gapped environments, or a laptop on a plane, that is decisive on its own.
Control and permanence. A model you have downloaded cannot be deprecated out from under you, quietly swapped for a cheaper quantization, or repriced next quarter. Owning the weights removes a class of vendor risk no API tier erases.
A fixed bill. This is the one genuine cost-shape advantage, even when the per-token price loses. Owned hardware turns a variable, metered expense into a flat monthly number you can budget around. For a workload with spiky usage, the predictability can be worth more than the raw savings, because there is no surprise invoice when a job runs long.
The bottom line
Local-LLM tokenomics rewards a clear head about two numbers. Memory bandwidth decides whether the machine is fast, and it is not the number on the price tag, which is how a 2020-era used-GPU rig can embarrass a 2026 “supercomputer” at the task you actually care about. Utilization decides what a token costs, and unless you are feeding the box near-continuously, an owned token costs more than a hosted one, often by an order of magnitude.
None of that makes self-hosting a mistake. It makes it a deliberate choice with a clear shape: a fixed-cost asset that pays off on volume, privacy, latency, or control, and loses on raw price for everyone running interactive, bursty, human-paced work. Buy the hardware when one of those non-price reasons is real. If the only goal is cheap tokens, the API already won, and it did not ask you to assemble anything.
Frequently asked questions
- Is it cheaper to run an LLM locally or use an API?
- For individuals and most teams, the API is cheaper. Hosted open-model APIs run roughly $0.10-0.90 per million tokens, a floor that owned hardware only beats at billions of tokens a month of sustained use. Local hardware wins on price only at very high, steady volume, and otherwise wins on privacy, latency, control, and predictable cost.
- How many tokens per second can a 3090, 4090, Mac Studio, or DGX Spark generate?
- On the gpt-oss-120b model, a three-card RTX 3090 rig generates about 124 tokens per second and a DGX Spark about 38.55, per community benchmarks. A Mac Studio M3 Ultra does roughly 20-30 tokens per second on a 70B 4-bit model. A single RTX 4090 is very fast on small models but its 24 GB cannot hold a 120B model at all.
- Why is the DGX Spark slow at token generation?
- Token generation is limited by memory bandwidth, not compute, because the model reads its active weights from memory once per token. The DGX Spark has strong FP4 compute but only 273 GB/s of LPDDR5x bandwidth, so it generates slowly. It is fast at prompt processing, which is compute-bound and runs in parallel.
- What is the breakeven for self-hosting an LLM?
- Against a cheap hosted open-model API near $0.50 per million tokens, an owned rig only breaks even up in the billions of tokens a month of sustained use, because a flat-out rig's electricity alone already costs about $0.44 per million tokens. Against a premium frontier API near $5 per million tokens it drops to about 20 million tokens a month, but that compares an open local model to a frontier hosted one, which is not like-for-like.
- Does memory bandwidth affect LLM speed?
- Yes, it is the main constraint on generation speed. Each token requires reading the model's active weights across the memory bus, so tokens per second is bounded by bandwidth in GB/s, not by raw compute. Memory capacity is a separate constraint that decides whether a model fits at all; you need enough of both.
Sources
- NVIDIA (2026). NVIDIA DGX Spark (product specifications page). https://www.nvidia.com/en-us/products/workstations/dgx-spark/
- VideoCardz (2026, February). NVIDIA officially raises DGX Spark Founders Edition MSRP to $4,699. https://videocardz.com/newz/nvidia-officially-raises-dgx-spark-founders-edition-msrp-to-4699
- Tom’s Hardware (2026, March). Apple pulls $4,000 512GB Mac Studio upgrade option as AI RAM squeeze continues. https://www.tomshardware.com/tech-industry/apple-pulls-512-mac-studio-upgrade-option
- Apple (2026). Mac Studio: Technical Specifications. https://www.apple.com/mac-studio/specs/
- LMSYS Org (2025, November). Optimizing GPT-OSS on NVIDIA DGX Spark. https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/
- ggml-org (2026). Performance of llama.cpp on NVIDIA DGX Spark (GitHub Discussion #16578). https://github.com/ggml-org/llama.cpp/discussions/16578
- Dendro Logic (2026). NVIDIA DGX Spark Concurrency Benchmark. https://dendro-logic.com/engineering/nvidia-dgx-spark-concurrency-benchmark/
- U.S. Energy Information Administration (2026). Electric Power Monthly (average residential electricity price). https://www.eia.gov/electricity/monthly/
- DeepSeek (2026). DeepSeek API Pricing. https://api-docs.deepseek.com/quick_start/pricing