Can I Run This LLM? Local vs Cloud vs API Cost

Model	Maker	Params	Runs at Q4?	Highest precision that fits	Est. tok/s
Llama 3.2 3B	Meta	3B	Yes	FP16	32 (fast)
Qwen3 4B	Alibaba	4B	Yes	FP16	24 (usable)
Qwen3 8B	Alibaba	8B	Yes	Q8	12 (usable)
Llama 3.1 8B	Meta	8B	Yes	Q8	12 (usable)
gpt-oss-20b (MoE)	OpenAI	20B	Yes	Q4	27 (usable)
Gemma 3 27B	Google	27B	Too big	—	—
Qwen3 32B	Alibaba	32B	Too big	—	—
Llama 3.3 70B	Meta	70B	Too big	—	—
gpt-oss-120b (MoE)	OpenAI	117B	Too big	—	—
Llama 405B	Meta	405B	Too big	—	—
DeepSeek V3 671B (MoE)	DeepSeek	671B	Too big	—	—
DeepSeek R1 671B (MoE)	DeepSeek	671B	Too big	—	—
Kimi K2 (1T, MoE)	Moonshot AI	1000B	Too big	—	—

The one rule: half a gigabyte per billion parameters

A model is a pile of numbers, and memory is where those numbers sit while it runs. At 4-bit quantization (Q4), the format almost everyone runs locally, each billion parameters takes about 0.5GB. Add 15 to 30 percent for the context window and framework overhead, plus headroom for the operating system, and a safe planning figure is 0.75GB per billion. Divide your usable memory by 0.75 and you have the largest model, in billions of parameters, you can comfortably run. The whole checker above is that ratio, applied across real devices and models. The full RAM-for-local-LLM guide walks the entire ladder from an 8GB laptop to a 1.1TB GPU node.
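The rule is small enough to write down directly. The sketch below is an illustration of that arithmetic, not the checker's actual code; the function names are hypothetical, and the two constants are the 0.5GB and 0.75GB figures from the paragraph above.

```python
GB_PER_BILLION_Q4 = 0.5          # Q4 weights only
PLANNING_GB_PER_BILLION = 0.75   # weights plus context, framework, and OS headroom


def weights_gb(params_billion: float) -> float:
    """Approximate size of the Q4 weights, in GB."""
    return params_billion * GB_PER_BILLION_Q4


def largest_model_billion(usable_gb: float) -> float:
    """Largest model, in billions of parameters, that fits comfortably."""
    return usable_gb / PLANNING_GB_PER_BILLION


def fits(params_billion: float, usable_gb: float) -> bool:
    """True if the model's planning footprint is within usable memory."""
    return params_billion * PLANNING_GB_PER_BILLION <= usable_gb


if __name__ == "__main__":
    for gb in (16, 24, 64, 512):
        print(f"{gb}GB usable: up to about {largest_model_billion(gb):.0f}B parameters")
```

On a 16GB machine, `fits(20, 16)` is true and `fits(27, 16)` is false, which matches where the table above draws the line.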

Capacity decides whether it loads; bandwidth decides how fast

There are three kinds of memory, and they differ less in capacity than in speed. System RAM in a normal PC moves on the order of 80 to 100 GB/s. GPU VRAM is far faster: an RTX 5090 moves about 1.8 TB/s. Apple and NVIDIA unified memory sits in between. A model that fits entirely in fast memory runs quickly; the moment it spills into slow system RAM, throughput falls off a cliff. That is why generation speed tracks memory bandwidth, not the price on the box, and why the tokens-per-second figures above are estimates rather than promises. The mechanism is worked through in self-hosted LLM tokenomics.
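To see why bandwidth dominates, assume every weight byte is read once per generated token, so bandwidth divided by weight size is a ceiling on speed. The sketch below applies that to an 8B dense model at Q4 using the bandwidth figures quoted above (90 GB/s is taken as the midpoint of the system RAM range). The function name is hypothetical, and real throughput lands below these ceilings.

```python
def tokens_per_second_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on generation speed if all weights are read once per token."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


# An 8B dense model at Q4 is about 4GB of weights (0.5GB per billion).
WEIGHTS_8B_Q4_GB = 8 * 0.5

memory_tiers = {
    "system RAM (~90 GB/s)": 90.0,
    "RTX 5090 VRAM (~1.8 TB/s)": 1800.0,
}

for name, bandwidth in memory_tiers.items():
    ceiling = tokens_per_second_ceiling(bandwidth, WEIGHTS_8B_Q4_GB)
    print(f"{name}: ceiling of about {ceiling:.1f} tok/s")
```

The same model has a ceiling twenty times higher in VRAM than in system RAM, purely because of where its weights sit.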

MoE: why a trillion-parameter model can run on one node

A dense model uses every parameter for every token. A mixture-of-experts (MoE) model holds many specialists but activates only a few per token: Kimi K2 has about a trillion parameters but only 32B active at a time, and gpt-oss-120b and DeepSeek R1 are MoE too. The rule that follows is worth memorizing: total parameters set how much memory you need, active parameters set how fast it runs. All the weights must load, but the speed only ever pays for the active slice.
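A minimal sketch of that split, assuming Q4 weights at 0.5GB per billion parameters. The `Model` class and its field names are illustrative; the Kimi K2 figures (about 1000B total, 32B active) are the ones given above.

```python
from dataclasses import dataclass

GB_PER_BILLION_Q4 = 0.5


@dataclass(frozen=True)
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    @property
    def memory_gb(self) -> float:
        """Memory needed to load the weights: set by total parameters."""
        return self.total_b * GB_PER_BILLION_Q4

    @property
    def read_per_token_gb(self) -> float:
        """Weight bytes read per token: set by active parameters."""
        return self.active_b * GB_PER_BILLION_Q4


kimi_k2 = Model("Kimi K2", total_b=1000, active_b=32)
dense_32b = Model("dense 32B", total_b=32, active_b=32)

for m in (kimi_k2, dense_32b):
    print(f"{m.name}: load {m.memory_gb:.0f}GB, read {m.read_per_token_gb:.0f}GB per token")
```

Both models read the same amount per token, so they face the same bandwidth ceiling, but the MoE needs roughly thirty times the memory just to load.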

The part the other calculators skip: should you even run it locally?

Every VRAM calculator tells you whether a model fits. Almost none tell you whether it is worth it. For most individuals, the hosted API is cheaper: hosted open-model APIs run roughly $0.10 to $0.90 per million tokens, a floor that owned hardware only beats at billions of tokens a month of sustained use. Renting a cloud GPU sits in between: an NVIDIA H100 rents from about $1.45 an hour on a decentralized network to roughly $6.88 on AWS on-demand. Buying wins when privacy, offline use, latency, or a fixed bill matters more than the per-token rate. To put a number on your own case, the subscription vs API calculator and the cost-per-task calculator turn usage into a break-even.
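A rough way to run that comparison yourself is sketched below. The API prices and H100 hourly rates are the ones quoted above; the $4,000 hardware price and the 1,000 tokens per second of rented throughput are hypothetical placeholders, not figures from this page, and electricity, resale value, and idle time are ignored.

```python
def break_even_million_tokens(hardware_usd: float, api_usd_per_million: float) -> float:
    """Millions of tokens at which owned hardware matches the API bill."""
    return hardware_usd / api_usd_per_million


def rental_usd_per_million(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per million tokens on a rented GPU kept busy at a steady rate."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


HARDWARE_USD = 4000.0      # hypothetical purchase price
RENTED_TOK_PER_S = 1000.0  # hypothetical sustained throughput on one H100

for api_price in (0.10, 0.90):
    millions = break_even_million_tokens(HARDWARE_USD, api_price)
    print(f"API at ${api_price:.2f}/M: break-even near {millions / 1000:.1f} billion tokens")

for hourly in (1.45, 6.88):
    per_million = rental_usd_per_million(hourly, RENTED_TOK_PER_S)
    print(f"H100 at ${hourly:.2f}/h: about ${per_million:.2f} per million tokens")
```

Swap in your own hardware price and measured throughput; the rental figure is very sensitive to how busy you can actually keep the GPU.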

The 2026 catch: the hardware got more expensive

The case for buying got worse in 2026 for a specific reason: the resource local AI runs on is exactly the one that spiked in price. Memory makers have run the most lucrative shortage in chip history, which you can watch on the DRAM price tracker, and it flowed straight through to devices. On June 25, 2026, Apple raised prices across its Mac line. If you are buying, size the machine to the largest model you will genuinely use, not the largest the tier could theoretically hold, and check current model specs and sizes first.

Frequently asked questions

What LLM can I run on my hardware?

The one number that decides it is your memory. At 4-bit (Q4), a model needs about 0.5GB per billion parameters plus headroom for context and the OS, so divide your usable memory by roughly 0.75 to estimate the largest model in billions of parameters. A 16GB machine runs an 8B model comfortably; 24GB reaches a 32B; 64GB reaches a 70B; and the 671B open models need about 512GB. On a 16GB machine here, the largest that fits is gpt-oss-20b (20B).

How much VRAM do I need to run an LLM?

At 4-bit quantization, budget about 0.5GB of VRAM per billion parameters for the weights, plus 15 to 30 percent for the KV cache and overhead. So a 7B model needs roughly 5 to 6GB, a 13B needs about 10GB, and a 70B needs about 42GB. Unquantized 16-bit weights (FP16) take four times that. VRAM is far faster than system RAM, so a model that fits entirely in VRAM runs much quicker.
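The same budget extends to other precisions, since Q8 stores about one byte per parameter and FP16 two. The sketch below picks the highest precision that fits a given memory size. The function names are hypothetical, and the default 30 percent overhead is simply the top of the range above, chosen to be conservative.

```python
from typing import Optional

# Approximate weight size in GB per billion parameters.
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def vram_gb(params_billion: float, precision: str, overhead: float = 0.30) -> float:
    """Weights plus a fractional allowance for KV cache and overhead."""
    return params_billion * GB_PER_BILLION[precision] * (1.0 + overhead)


def highest_precision(params_billion: float, usable_gb: float) -> Optional[str]:
    """Highest precision whose footprint fits, or None if even Q4 is too big."""
    for precision in ("FP16", "Q8", "Q4"):  # ordered from largest to smallest
        if vram_gb(params_billion, precision) <= usable_gb:
            return precision
    return None


if __name__ == "__main__":
    for size in (3, 8, 20, 27):
        print(f"{size}B in 16GB: {highest_precision(size, 16)}")
```

With a 20 percent allowance instead, `vram_gb(70, "Q4", overhead=0.20)` gives the roughly 42GB quoted above for a 70B model.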

Can I run an LLM without a GPU?

Yes. Small 3B to 4B models run on a CPU with 8GB of RAM, and 8B models on 16GB, just slower than on a GPU. At the extreme, a CPU server with 768GB of RAM can hold a 671B model, but it generates only single-digit tokens per second because system RAM has a fraction of the bandwidth of GPU VRAM or Mac unified memory. Capacity decides whether a model loads; bandwidth decides how fast it runs.

Is it cheaper to run an LLM locally or use the API?

For most individuals, the hosted API is cheaper. Hosted open-model APIs run roughly $0.10 to $0.90 per million tokens, a floor that owned hardware only beats at billions of tokens a month of sustained use, because the hardware front-loads a large fixed cost that only pays off at high, steady volume. Buy hardware when privacy, offline use, latency, or a fixed bill matters more than the per-token rate.

How fast will a local LLM run on my machine?

Generation speed is set by memory bandwidth, not the price tag, because the model reads its active weights from memory once per token. A model that fits in fast GPU VRAM generates quickly; the same model spilling into system RAM slows to a crawl. A mixture-of-experts model is the exception: it loads all its weights but only runs the active slice, so a trillion-parameter MoE can generate about as fast as a dense model the size of that slice.

How the numbers are modeled

Feasibility is capacity math: model footprint (parameters times the per-billion figure for the quantization) plus context and OS overhead, compared against your usable memory. Speed is a memory-bandwidth estimate, tokens per second scaling with bandwidth divided by the active-weight bytes read per token, calibrated so a 671B MoE on a 512GB Mac Studio lands near its measured 17 to 18 tokens per second. These are modeled figures for planning, not benchmarks. Treat tokens per second as an order of magnitude.
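One way such a calibration could work is sketched below. This is an illustration, not the page's actual model: the page does not publish its constants, so the roughly 819 GB/s bandwidth for the 512GB Mac Studio and the roughly 37B active parameters for DeepSeek R1 are assumptions brought in from outside this page, and 17.5 is the midpoint of the measured 17 to 18 tokens per second.

```python
GB_PER_BILLION_Q4 = 0.5


def ideal_tok_s(bandwidth_gb_s: float, active_b: float) -> float:
    """Bandwidth divided by the active-weight bytes read per token."""
    return bandwidth_gb_s / (active_b * GB_PER_BILLION_Q4)


def calibrate(measured_tok_s: float, bandwidth_gb_s: float, active_b: float) -> float:
    """Efficiency factor that makes the ideal figure match a measurement."""
    return measured_tok_s / ideal_tok_s(bandwidth_gb_s, active_b)


def estimate_tok_s(bandwidth_gb_s: float, active_b: float, efficiency: float) -> float:
    """Planning estimate: ideal speed scaled by the calibrated efficiency."""
    return efficiency * ideal_tok_s(bandwidth_gb_s, active_b)


# Assumed calibration point (bandwidth and active size are not from this page).
MAC_STUDIO_GB_S = 819.0
DEEPSEEK_R1_ACTIVE_B = 37.0
MEASURED_TOK_S = 17.5

EFFICIENCY = calibrate(MEASURED_TOK_S, MAC_STUDIO_GB_S, DEEPSEEK_R1_ACTIVE_B)
print(f"calibrated efficiency: {EFFICIENCY:.2f}")
```

An efficiency well below 1 is the reason to treat every tokens-per-second figure here as an order of magnitude: the ideal bandwidth ceiling is never reached in practice.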

Sources

Capital & Compute. (2026). How Much RAM Do You Need to Run a Local LLM? (memory math, model footprints, and device tiers). /blog/how-much-ram-to-run-a-local-llm/
OpenAI. (2025). Introducing gpt-oss (16GB and 80GB memory specs for the 20b and 120b models). Verified July 2026. openai.com/index/introducing-gpt-oss
NVIDIA. (2026). DGX Spark (128GB unified memory, 273 GB/s, up to 200B inference). Verified July 2026. nvidia.com/en-us/products/workstations/dgx-spark
TechRadar. (2025). Mac Studio M3 Ultra runs DeepSeek R1 671B entirely in memory (~17 to 18 tokens per second). techradar.com
Capital & Compute. (2026). Decentralized GPU vs Cloud (H100 hourly rates: Akash ~$1.45, AWS ~$6.88). /blog/decentralized-gpu-cost-vs-cloud/
Capital & Compute. (2026). Self-Hosted LLM Cost Per Token (hosted open-model API floor $0.10 to $0.90 per million tokens; break-even math). /blog/self-hosted-llm-cost-per-token/
