Can I run this LLM?
Pick your GPU, Mac, or RAM and see which open models you can actually run, at what quantization and roughly how fast. Then the part every other calculator skips: whether buying the hardware, renting a cloud GPU, or just paying the API is the cheaper way to get the model.
What LLM can I run on my hardware?
The one number that decides it is your memory. At 4-bit (Q4), a model needs about 0.5GB per billion parameters plus headroom for context and the OS, so divide your usable memory by roughly 0.75 to estimate the largest model in billions of parameters. A 16GB machine runs an 8B model comfortably; 24GB reaches a 32B; 64GB reaches a 70B; and the 671B open models need about 512GB. On a 16GB machine here, the largest that fits is gpt-oss-20b (20B).
Example: a 16GB machine has about 13GB usable for the model and runs gpt-oss-20b at Q4 (about 27 tok/s). Change the hardware below to check your own.
Modeled estimate. For most individuals the hosted API is cheaper: owned hardware only wins on price at billions of tokens a month of steady use. Buy when privacy, offline use, or a fixed bill matters more than the rate. Work your own break-even in the subscription vs API calculator, or see the full self-hosted cost-per-token math.
| Model | Developer | Type | Params | Runs at Q4? | Highest precision that fits | ~tok/s | Speed |
|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | Meta | Dense | 3B | Yes | FP16 | 32 | Fast |
| Qwen3 4B | Alibaba | Dense | 4B | Yes | FP16 | 24 | Usable |
| Qwen3 8B | Alibaba | Dense | 8B | Yes | Q8 | 12 | Usable |
| Llama 3.1 8B | Meta | Dense | 8B | Yes | Q8 | 12 | Usable |
| gpt-oss-20b | OpenAI | MoE | 20B | Yes | Q4 | 27 | Usable |
| Gemma 3 27B | Google | Dense | 27B | Too big | n/a | n/a | n/a |
| Qwen3 32B | Alibaba | Dense | 32B | Too big | n/a | n/a | n/a |
| Llama 3.3 70B | Meta | Dense | 70B | Too big | n/a | n/a | n/a |
| gpt-oss-120b | OpenAI | MoE | 117B | Too big | n/a | n/a | n/a |
| Llama 405B | Meta | Dense | 405B | Too big | n/a | n/a | n/a |
| DeepSeek V3 671B | DeepSeek | MoE | 671B | Too big | n/a | n/a | n/a |
| DeepSeek R1 671B | DeepSeek | MoE | 671B | Too big | n/a | n/a | n/a |
| Kimi K2 (1T) | Moonshot AI | MoE | 1000B | Too big | n/a | n/a | n/a |
Feasibility is capacity math: a model needs about 0.5GB per billion parameters at Q4, plus room for context and the OS. Speed is a memory-bandwidth estimate, calibrated to community benchmarks, not a measurement: treat tok/s as an order of magnitude, not a promise. MoE models load all their weights but only run the active slice, so a trillion-parameter MoE can generate as fast as a small model. To run a model you cannot fit locally, see the inference providers.
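The capacity check can be sketched in a few lines of Python. This is an illustration, not the checker's actual code: the 20 percent overhead is an assumed value, the Q8 and FP16 figures are derived from 8 and 16 bits per weight rather than quoted from this page, and the function names are made up for the example.

```python
# GB of weights per billion parameters at each precision.
# Q4 = 0.5 is the figure quoted in the text; Q8 and FP16 follow from
# 8 and 16 bits per weight (an assumption about how precision scales).
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

# Assumed allowance for context and framework overhead (hypothetical value).
OVERHEAD = 0.20


def footprint_gb(params_billions: float, precision: str) -> float:
    """Estimated memory needed to run the model at the given precision."""
    return params_billions * GB_PER_BILLION[precision] * (1 + OVERHEAD)


def best_precision(params_billions: float, usable_gb: float):
    """Highest precision that fits in usable memory, or None if even Q4 is too big."""
    for precision in ("FP16", "Q8", "Q4"):  # ordered from largest to smallest
        if footprint_gb(params_billions, precision) <= usable_gb:
            return precision
    return None


if __name__ == "__main__":
    usable_gb = 13  # the 16GB example machine, per the text
    for name, params in [("3B", 3), ("8B", 8), ("20B", 20), ("27B", 27)]:
        print(name, best_precision(params, usable_gb) or "Too big")
```

With 13GB usable, this sketch agrees with the table for the 16GB example: FP16 for the 3B and 4B models, Q8 for the 8B models, Q4 for the 20B, and nothing from 27B up. A different overhead assumption would shift the borderline cases.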
The one rule: half a gigabyte per billion parameters
A model is a pile of numbers, and memory is where those numbers sit while it runs. At 4-bit quantization (Q4), the format almost everyone runs locally, each billion parameters takes about 0.5GB. Add 15 to 30 percent for the context window and framework overhead, plus some margin for the OS, and a safe planning figure is 0.75GB per billion. Divide your usable memory by 0.75 and you have the largest model, in billions of parameters, you can comfortably run. The whole checker above is that ratio, applied across real devices and models. The full RAM-for-local-LLM guide walks the entire ladder from an 8GB laptop to a 1.1TB GPU node.
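As a quick sanity check, the rule is a single division. The sketch below applies the 0.75GB-per-billion planning figure to a few memory sizes; the function name is illustrative and the inputs are treated as usable memory, which is a simplification.

```python
PLANNING_GB_PER_BILLION = 0.75  # Q4 weights plus context, framework, and OS margin


def largest_model_billions(usable_gb: float) -> float:
    """Largest model, in billions of parameters, that fits comfortably at Q4."""
    if usable_gb <= 0:
        raise ValueError("usable_gb must be positive")
    return usable_gb / PLANNING_GB_PER_BILLION


if __name__ == "__main__":
    for usable_gb in (16, 24, 64, 512):
        size = largest_model_billions(usable_gb)
        print(f"{usable_gb:>4} GB -> about {size:.0f}B parameters")
```

For 24GB this gives 32B, and for 512GB a little over 680B, which is why the 671B models land on the 512GB tier.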
Capacity decides whether it loads; bandwidth decides how fast
Three kinds of memory matter here, and they differ more in speed than in capacity. System RAM in a normal PC moves on the order of 80 to 100 GB/s. GPU VRAM is far faster: an RTX 5090 moves about 1.8 TB/s. Apple and NVIDIA unified memory sit in between. A model that fits entirely in fast memory runs quickly; the moment it spills into slow system RAM, throughput falls off a cliff. That is why generation speed tracks memory bandwidth, not the price on the box, and why the tokens-per-second figures above are estimates rather than promises. The mechanism is worked through in self-hosted LLM tokenomics.
MoE: why a trillion-parameter model can run on one node
A dense model uses every parameter for every token. A mixture-of-experts (MoE) model holds many specialists but activates only a few per token: Kimi K2 has about a trillion parameters but only 32B active at a time, and gpt-oss-120b and DeepSeek R1 are MoE too. The rule that follows is worth memorizing: total parameters set how much memory you need, active parameters set how fast it runs. All the weights must load, but the speed only ever pays for the active slice.
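To make the rule concrete, here is a small Python sketch comparing Kimi K2 with a dense 32B model at Q4. The parameter counts come from the text; the class and method names are invented for the illustration, and the 0.5GB-per-billion figure covers weights only.

```python
from dataclasses import dataclass

GB_PER_BILLION_Q4 = 0.5  # weights only, at 4-bit quantization


@dataclass(frozen=True)
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    def weights_gb(self) -> float:
        """Memory the weights occupy: driven by total parameters."""
        return self.total_b * GB_PER_BILLION_Q4

    def read_per_token_gb(self) -> float:
        """Data read per generated token: driven by active parameters."""
        return self.active_b * GB_PER_BILLION_Q4


kimi_k2 = Model("Kimi K2 (MoE)", total_b=1000, active_b=32)
dense_32b = Model("Dense 32B", total_b=32, active_b=32)

if __name__ == "__main__":
    for m in (kimi_k2, dense_32b):
        print(f"{m.name}: {m.weights_gb():.0f} GB to load, "
              f"{m.read_per_token_gb():.0f} GB read per token")
```

Both models read about 16GB per token, so on the same memory they would generate at a similar rate, but Kimi K2 needs roughly 500GB just to hold its weights against 16GB for the dense model.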
The part the other calculators skip: should you even run it locally?
Every VRAM calculator tells you whether a model fits. Almost none tell you whether it is worth it. For most individuals, the hosted API is cheaper: hosted open-model APIs run roughly $0.10 to $0.90 per million tokens, a floor that owned hardware only beats at billions of tokens a month of sustained use. Renting a cloud GPU sits in between: an NVIDIA H100 rents from about $1.45 an hour on a decentralized network to roughly $6.88 on AWS on-demand. Buying wins when privacy, offline use, latency, or a fixed bill matters more than the per-token rate. To put a number on your own case, the subscription vs API calculator and the cost-per-task calculator turn usage into a break-even.
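A rough way to see why volume matters is to ask how many tokens per second a rented GPU has to produce, around the clock, before it undercuts the API. The sketch below uses the H100 and API prices quoted above. It is a deliberate simplification: it assumes full utilization, a single blended price per token, and no engineering or idle cost, and the function name is illustrative.

```python
def breakeven_tokens_per_second(gpu_usd_per_hour: float,
                                api_usd_per_million_tokens: float) -> float:
    """Sustained throughput at which renting costs the same per token as the API."""
    if gpu_usd_per_hour <= 0 or api_usd_per_million_tokens <= 0:
        raise ValueError("prices must be positive")
    tokens_per_hour = gpu_usd_per_hour / api_usd_per_million_tokens * 1_000_000
    return tokens_per_hour / 3600


if __name__ == "__main__":
    for label, gpu_rate in [("decentralized H100", 1.45), ("AWS H100", 6.88)]:
        for api_price in (0.10, 0.90):
            rate = breakeven_tokens_per_second(gpu_rate, api_price)
            print(f"{label} vs ${api_price:.2f}/M tokens: "
                  f"{rate:,.0f} tok/s sustained to break even")
```

Even in the friendliest pairing, the cheapest H100 against the priciest API, the rental has to sustain several hundred tokens per second every hour it is billed. That is a batch-serving workload, not a single user chatting.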
The 2026 catch: the hardware got more expensive
The case for buying got worse in 2026 for a specific reason: the resource local AI runs on is exactly the one that spiked in price. Memory makers have run the most lucrative shortage in chip history, which you can watch on the DRAM price tracker, and it flowed straight through to devices. On June 25, 2026, Apple raised prices across its Mac line. If you are buying, size the machine to the largest model you will genuinely use, not the largest the tier could theoretically hold, and check current model specs and sizes first.
Frequently asked questions
How much VRAM do I need to run an LLM?
At 4-bit quantization, budget about 0.5GB of VRAM per billion parameters for the weights, plus 15 to 30 percent for the KV cache and overhead. So a 7B model needs roughly 5 to 6GB, a 13B needs about 10GB, and a 70B needs about 42GB. Unquantized FP16 weights take four times the Q4 size, about 2GB per billion parameters. VRAM is far faster than system RAM, so a model that fits entirely in VRAM runs much quicker.
Can I run an LLM without a GPU?
Yes. Small 3B to 4B models run on a CPU with 8GB of RAM, and 8B models on 16GB, just slower than on a GPU. At the extreme, a CPU server with 768GB of RAM can hold a 671B model, but it generates only single-digit tokens per second because system RAM has a fraction of the bandwidth of GPU VRAM or Mac unified memory. Capacity decides whether a model loads; bandwidth decides how fast it runs.
Is it cheaper to run an LLM locally or use the API?
For most individuals, the hosted API is cheaper. Hosted open-model APIs run roughly $0.10 to $0.90 per million tokens, a floor that owned hardware only beats at billions of tokens a month of sustained use, because the hardware front-loads a large fixed cost that only pays off at high, steady volume. Buy hardware when privacy, offline use, latency, or a fixed bill matters more than the per-token rate.
How fast will a local LLM run on my machine?
Generation speed is set by memory bandwidth, not the price tag, because the model reads its active weights from memory once per token. A model that fits in fast GPU VRAM generates quickly; the same model spilling into system RAM slows to a crawl. A mixture-of-experts model is the exception: it loads all its weights but only runs the active slice, so a trillion-parameter MoE can generate as fast as a small dense model.
How the numbers are modeled
Feasibility is capacity math: model footprint (parameters times the per-billion figure for the quantization) plus context and OS overhead, compared against your usable memory. Speed is a memory-bandwidth estimate, tokens per second scaling with bandwidth divided by the active-weight bytes read per token, calibrated so a 671B MoE on a 512GB Mac Studio lands near its measured 17 to 18 tokens per second. These are modeled figures for planning, not benchmarks. Treat tokens per second as an order of magnitude.
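The speed side can be sketched the same way. The Python below is one plausible reading of the description above, not the checker's actual model: it treats tokens per second as bandwidth divided by the bytes of active weights read per token, scaled by an efficiency factor fitted to one measured point. The 819 GB/s bandwidth for the M3 Ultra and the 37B active-parameter count for DeepSeek R1 are figures brought in from outside this page, and the function names are illustrative.

```python
GB_PER_BILLION_Q4 = 0.5  # weights only, at 4-bit quantization


def estimate_tok_s(bandwidth_gb_s: float, active_billions: float,
                   efficiency: float = 1.0) -> float:
    """Bandwidth-bound estimate: one pass over the active weights per token."""
    gb_read_per_token = active_billions * GB_PER_BILLION_Q4
    return efficiency * bandwidth_gb_s / gb_read_per_token


def calibrate_efficiency(measured_tok_s: float, bandwidth_gb_s: float,
                         active_billions: float) -> float:
    """Fraction of the theoretical ceiling that a measured run achieved."""
    return measured_tok_s / estimate_tok_s(bandwidth_gb_s, active_billions)


if __name__ == "__main__":
    # Calibration point: a 671B MoE on a 512GB Mac Studio at 17 to 18 tok/s.
    # Assumed inputs (not from this page): 819 GB/s, 37B active parameters.
    eff = calibrate_efficiency(17.5, bandwidth_gb_s=819, active_billions=37)
    print(f"fitted efficiency: {eff:.2f}")

    # Apply the same factor to a dense 8B model in 100 GB/s system RAM.
    print(f"8B on system RAM: about {estimate_tok_s(100, 8, eff):.0f} tok/s")
```

Under these assumptions the fitted efficiency comes out well below 1, meaning real runs reach only part of the theoretical bandwidth ceiling. A single calibration point cannot capture differences between devices and runtimes, which is one more reason to read the results as orders of magnitude.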
Sources
- Capital & Compute. (2026). How Much RAM Do You Need to Run a Local LLM? (memory math, model footprints, and device tiers). /blog/how-much-ram-to-run-a-local-llm/
- OpenAI. (2025). Introducing gpt-oss (16GB and 80GB memory specs for the 20b and 120b models). Verified July 2026. openai.com/index/introducing-gpt-oss
- NVIDIA. (2026). DGX Spark (128GB unified memory, 273 GB/s, up to 200B inference). Verified July 2026. nvidia.com/en-us/products/workstations/dgx-spark
- TechRadar. (2025). Mac Studio M3 Ultra runs DeepSeek R1 671B entirely in memory (~17 to 18 tokens per second). techradar.com
- Capital & Compute. (2026). Decentralized GPU vs Cloud (H100 hourly rates: Akash ~$1.45, AWS ~$6.88). /blog/decentralized-gpu-cost-vs-cloud/
- Capital & Compute. (2026). Self-Hosted LLM Cost Per Token (hosted open-model API floor $0.10 to $0.90 per million tokens; break-even math). /blog/self-hosted-llm-cost-per-token/