How Much RAM Do You Need to Run a Local LLM?
How much RAM you need to run a local LLM in 2026: what models 8GB to 512GB can run, the per-billion-parameter math, and the device for each tier.
By Capital & Compute
The single number that decides which local LLM you can run is your RAM, and the math is simple enough to do in your head. A model needs roughly half a gigabyte of memory per billion parameters at 4-bit quantization, the format almost everyone runs locally. So 8GB of RAM tops out near a small 7B model, 32GB reaches a 32B, and DeepSeek R1, at 671 billion parameters one of the largest open models available, needs a machine with 512GB to hold it. Above that sit the trillion-parameter models that take a multi-GPU server or a workstation with more than a terabyte of RAM. This guide walks the entire ladder, from an 8GB laptop to a 1.1TB GPU node, covers the difference between system RAM and the much faster VRAM on a graphics card, and tells you which device fits each rung. Everything else is detail on top of that one ratio. In a hurry? The Can I Run This LLM? checker turns this math into a picker: choose your hardware and see which models fit, how fast, and whether owning, renting, or the API is cheaper.
The math: half a gigabyte per billion parameters
A model is a pile of numbers (its parameters, or weights), and memory is just where those numbers sit while the model runs. So the size of the model and the precision of each number set the bill.
At full precision (FP16 or BF16, the 16-bit formats most open models are released in), each parameter takes 2 bytes, so a 7B model needs about 14GB just for weights. Almost nobody runs full precision locally. The standard for local inference is quantization: storing each weight in fewer bits with a small, usually acceptable loss in quality. The common rungs:
- FP16 (full): 2 bytes per parameter, so ≈2 GB per billion.
- Q8 (8-bit): ≈1 byte per parameter, ≈1 GB per billion. Near-lossless.
- Q4 (4-bit): ≈0.5 bytes per parameter, ≈0.5 GB per billion. The practical default, and what every number in this guide assumes unless stated.
Q4 is the one that matters because it is where local LLMs became practical: it roughly quarters the memory cost versus full precision for a quality drop most people cannot feel in everyday use. That is a big part of why local LLMs got good in 2026.
So the working formula is: (usable RAM in GB) ÷ 0.75 ≈ the largest model in billions of parameters you can comfortably run at Q4. The divisor is 0.75 rather than the bare 0.5 to keep a comfortable margin above the weights themselves. A 16GB machine with about 11GB free lands near a 14B model. The rest of this guide turns that into a shopping list.
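As a rough sketch, the working formula can be written as a few lines of Python. The function name and the sample usable-memory values are illustrative assumptions; only the 0.75 divisor comes from the text.

```python
def largest_q4_model_b(usable_gb: float, gb_per_billion: float = 0.75) -> float:
    """Rough ceiling, in billions of parameters, for a comfortable Q4 model.

    gb_per_billion=0.75 is the rule-of-thumb divisor from the text,
    not a measured value for any particular runtime.
    """
    if usable_gb <= 0:
        raise ValueError("usable_gb must be positive")
    return usable_gb / gb_per_billion


# Illustrative usable-memory figures; check your own free RAM first.
for usable_gb in (3.5, 11, 24, 48):
    ceiling = largest_q4_model_b(usable_gb)
    print(f"{usable_gb:>5} GB usable -> about {ceiling:.1f}B parameters at Q4")
```

Treat the result as a ceiling to round down from, then pick the nearest model size that actually exists.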
VRAM, system RAM, and unified memory: the three kinds of memory
“How much RAM” hides a question that decides everything about how a local model behaves: which kind of memory. There are three, and they differ less in capacity than in speed.
- System RAM (DDR5). The sticks in a normal PC or laptop. Cheap and available in huge amounts on a server, but slow by AI standards: a typical dual-channel desktop moves on the order of 80 to 100 GB/s. The CPU reads model weights from it. You can run very large models here, just slowly.
- GPU VRAM (GDDR or HBM). The memory soldered onto a graphics card. Far faster: an RTX 5090 moves about 1.8 TB/s, and a data-center H200 about 4.8 TB/s, twenty to fifty times a desktop’s RAM. This is why a model that fits entirely in VRAM is so much quicker. The catch is capacity: consumer cards top out at 32GB, and the 80 to 192GB cards cost as much as a car.
- Unified memory (Apple Silicon, NVIDIA DGX Spark). A single pool the CPU and GPU share, so there is no slow copy between them. Bandwidth sits in the middle: Apple’s M3 Ultra reaches about 819 GB/s, NVIDIA’s DGX Spark about 273 GB/s. This is the trick that lets a Mac hold a model far larger than any consumer GPU can, at a speed that is slower than that GPU but vastly faster than CPU RAM.
| Memory type | Typical bandwidth | Capacity you can buy | Cost per GB | Best for |
|---|---|---|---|---|
| System RAM (DDR5) | ~80–100 GB/s (desktop); higher on multi-channel servers | 16GB to 1.5TB+ | Lowest | Holding very large models on CPU, slowly |
| Unified memory | ~270–820 GB/s | 16GB to 512GB | Medium | Large models at moderate speed on one box |
| GPU VRAM (GDDR/HBM) | ~1,000–8,000 GB/s | 8GB to 288GB per card | Highest | Maximum speed for any model that fits |
The practical upshot: fit your model in the fastest memory it will fit in. A 13B model belongs in a 16GB GPU, not spread across 64GB of system RAM. A 671B model has no choice but to live in unified memory or across many GPUs. Everything below is really about matching a model to the right kind of memory, not just enough of it.
Quantization: the dial that sets the footprint
Quantization is the single biggest lever on memory, so it is worth seeing the full dial rather than just the Q4 default. Lower bits per weight means a smaller footprint and faster generation, traded against a gradual loss of quality.
| Format | Bits per weight | Memory per 1B params | Quality | When to use |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~2.0 GB | Full (reference) | Training, and serving when memory is not the constraint |
| Q8 | 8 | ~1.0 GB | Near-lossless | When you have headroom and want maximum fidelity |
| Q6 | 6 | ~0.75 GB | Very close to Q8 | A safe step down from Q8 |
| Q5 | 5 | ~0.65 GB | Slightly below Q6 | A middle ground on tight memory |
| Q4 | 4 | ~0.5 GB | Small, usually unnoticeable drop | The default for local inference |
| Q3 | 3 | ~0.4 GB | Noticeable degradation | Only to squeeze a model that almost fits |
| Q2 | 2 | ~0.3 GB | Heavy degradation | Last resort, often not worth it |
Q4 is the sweet spot because it roughly quarters the memory cost of full precision for a quality drop most people cannot feel. Every footprint figure in the tables below assumes Q4 unless noted.
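To make the dial concrete, here is a minimal sketch that multiplies a parameter count by the per-billion figures from the table above. The dictionary and function names are hypothetical, and the results are idealized: real quantized files typically come out somewhat larger, and the context window adds memory on top.

```python
# Approximate GB of memory per billion parameters, copied from the table above.
GB_PER_BILLION = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q6": 0.75,
    "Q5": 0.65,
    "Q4": 0.5,
    "Q3": 0.4,
    "Q2": 0.3,
}


def weights_gb(params_billion: float, fmt: str = "Q4") -> float:
    """Idealized weight footprint in GB for a model of the given size and format."""
    return params_billion * GB_PER_BILLION[fmt]


# Walk a 70B model down the dial.
for fmt in GB_PER_BILLION:
    print(f"70B at {fmt:<4}: ~{weights_gb(70, fmt):.0f} GB")
```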
What model footprints actually look like
The reason RAM is the gatekeeper is the sheer spread of model sizes. A small model that runs on a phone and the largest open models, which need a hyperscaler-grade box, differ by more than two orders of magnitude in memory. On a linear axis the small models would vanish; only a log scale shows the whole ladder at once.
| Model | Approximate memory footprint (Q4) |
|---|---|
| Llama 3.2 (3B) | 2 GB |
| Qwen3 (8B) | 5 GB |
| gpt-oss-20b | 13 GB |
| Qwen3 (32B) | 20 GB |
| Llama 3.3 (70B) | 42 GB |
| gpt-oss-120b | 63 GB |
| DeepSeek R1 (671B) | 404 GB |
The system-RAM ladder: what each tier runs, on what device
This ladder is for system memory and Apple-style unified memory, the pool you size when you buy a laptop, a Mac, or a workstation. The VRAM ladder for graphics cards comes next. The table below is the whole guide in one view. “Usable for the model” assumes you leave headroom for the OS and a modest context window. Model footprints assume Q4. Devices are representative, not exhaustive.
| RAM | Usable for the model | Largest comfortable model (Q4) | Example models | Typical device | What you can actually do |
|---|---|---|---|---|---|
| 8 GB | ~3–4 GB | 3B, up to a tight 7B | Llama 3.2 3B, Qwen3 4B, Gemma 3 4B, Phi | Base MacBook Air, mainstream laptop, high-end phone | Offline chat, summarizing, autocomplete, simple retrieval over a few docs |
| 16 GB | ~10–11 GB | 8B comfortably, 13–14B tight | Qwen3 8B, Llama 3.1 8B, gpt-oss-20b (MoE) | Mid-range laptop, M-series Air/Pro | A genuinely useful daily assistant, decent coding help, RAG over a document set |
| 24–32 GB | ~18–26 GB | 14B to 32B dense | Qwen3 32B, Gemma 3 27B, gpt-oss-20b at full quality | RTX 4090 (24GB) / RTX 5090 (32GB), 32GB Mac | Near-frontier-lite quality, agentic coding, longer context windows |
| 48–64 GB | ~40–52 GB | 70B | Llama 3.3 70B, Qwen 72B | 64GB Mac, dual 24GB GPUs | Strong general reasoning, serious local coding, multi-document RAG |
| 96–128 GB | ~80–110 GB | 120B; 70B at Q8 | gpt-oss-120b (80GB), 70B at higher precision | NVIDIA DGX Spark (128GB), 128GB Mac Studio | Frontier-class open models; fine-tune up to 70B on the DGX Spark |
| 256 GB | ~220 GB | 200B-class, or several big models at once | Large MoE models, multi-model setups | High-RAM Mac Studio, multi-GPU workstation | Run a 200B model plus tooling, or two 70B models side by side |
| 512 GB+ | ~440 GB+ | 405B to 671B | DeepSeek R1/V3 671B, Llama 405B | Mac Studio M3 Ultra (512GB), 8x80GB GPU server | The largest open weights, held entirely in memory |
8GB: small models, real uses
This is the floor, and it is more useful than it sounds. After the operating system takes its cut you have roughly 3 to 4GB for a model, which is a 3B to 4B at Q4. Models like Llama 3.2 3B and Qwen3 4B handle summarization, drafting, autocomplete, and simple question-answering over a handful of documents without ever touching the network. What they are not is a reasoning engine: expect them to stumble on multi-step logic and longer context. On 8GB, a small fast model you actually use beats a larger one you cannot load.
16GB: the mainstream sweet spot
Sixteen gigabytes is where local AI stops being a demo. An 8B model (Qwen3 8B is about 5GB at Q4) leaves plenty of room for context and runs quickly on a modern laptop. This tier also unlocks the first genuinely strong option: OpenAI says its gpt-oss-20b runs on edge devices with just 16GB of memory. Its weights take about 13GB, and its mixture-of-experts design, which activates only a fraction of its parameters per token, keeps it quick once loaded. For most people who want a private, capable assistant on the machine they already own, 16GB is the answer.
24 to 32GB: the prosumer GPU tier
This is the high-end consumer graphics card bracket: an RTX 4090 carries 24GB of VRAM and the newer RTX 5090 carries 32GB. It runs 14B to 32B dense models at Q4, which is where open models start to feel close to the commercial frontier for everyday work. A 32B like Qwen3 32B (about 20GB) fits with room for a long context, and agentic coding becomes realistic. If you are choosing hardware specifically to run models, this tier is the best balance of capability and cost for most enthusiasts.
48 to 128GB: 70B models and the personal supercomputer
A 70B model at Q4 needs roughly 40 to 48GB, so 64GB is the entry point for the heavyweight open models like Llama 3.3 70B. Push to 128GB and you reach the most interesting recent category: the personal AI box. NVIDIA’s DGX Spark pairs 128GB of unified memory with 273 GB/s of bandwidth and, per NVIDIA, runs inference on models up to 200 billion parameters and fine-tunes up to 70B. A 128GB Mac Studio reaches the same class. This is also the tier where gpt-oss-120b lives: OpenAI says the 120B version runs within 80GB of memory.
512GB and up: the largest open weights
At the top of the ladder is one headline use case: running the biggest open models in existence. DeepSeek R1 at 671 billion parameters consumes about 404GB even at 4-bit, which is why it needs a 512GB machine to hold it with working room. The remarkable part is that this is now possible on a single desktop. As MacRumors reported, a Mac Studio with an M3 Ultra and 512GB of unified memory runs DeepSeek R1 locally, and a TechRadar reviewer measured it at roughly 17 to 18 tokens per second while drawing under 200 watts. The alternative, a multi-GPU server, costs far more and burns far more power, and is the start of the next ladder.
The VRAM ladder: consumer cards to rack-scale systems
If you run models on a graphics card rather than in system memory, this is the ladder that matters. VRAM is faster but scarcer, so the rungs are smaller and the prices climb steeply. The table runs from an entry consumer card to a full data-center rack that NVIDIA treats as one giant GPU.
| VRAM | Example hardware | Class | Largest model (Q4) | Notes |
|---|---|---|---|---|
| 8 GB | RTX 4060, RTX 3050 | Consumer | 7B, tight | Entry GPU; keep context short |
| 12 GB | RTX 3060, RTX 4070 | Consumer | 13B | Comfortable small-model card |
| 16 GB | RTX 4080, RTX 5060 Ti 16GB | Consumer | 14B with context | Good 8B card with long context |
| 24 GB | RTX 3090, RTX 4090, RX 7900 XTX | Prosumer | 32B | The long-time enthusiast standard |
| 32 GB | RTX 5090 | Consumer flagship | 32B comfortably | ~1.8 TB/s, the fastest consumer card |
| 48 GB | RTX 6000 Ada, L40S | Workstation | 70B, tight | Single-card 70B becomes possible |
| 96 GB | RTX PRO 6000 Blackwell | Workstation | 120B | The most VRAM on a non-data-center card |
| 80–141 GB | A100 / H100 (80GB), H200 (141GB) | Data center | 70B at FP16 (H200), 100B+ at Q4 | HBM, ~2.0–4.8 TB/s bandwidth |
| 192–288 GB | B200 (192GB), B300 Blackwell Ultra (288GB) | Data center flagship | 200B+ on a single GPU | B300 is the current production flagship |
| 13.4–20.7 TB | GB200 NVL72 / GB300 NVL72 | Rack-scale | Trillion-parameter, served to thousands | 72 GPUs wired as one |
For most people the story stops at 32GB: an RTX 5090 is the fastest card you can put in a desktop and runs anything up to a 32B model briskly. Step up to workstation cards and the RTX PRO 6000 Blackwell carries 96GB, enough for a 120B model on one card. Above that you are buying data-center silicon: an H100 holds 80GB, an H200 holds 141GB, and a single H200 can just hold a 70B model at full FP16 precision (about 140GB of weights), though with very little room left for context.
The current top of the single-GPU ladder is NVIDIA’s Blackwell generation. The B200 carries 192GB of HBM3e, and the Blackwell Ultra B300, the flagship in production as of mid-2026, raises that to 288GB at roughly 8 TB/s. Beyond a single chip, NVIDIA stitches 72 GPUs into one rack-scale unit: the GB200 NVL72 pools 13.4 TB of fast GPU memory, and the GB300 NVL72 (built on B300s) pushes past 20 TB. These are the boxes that serve frontier models to millions of users, not desktop hardware, but they are the literal ceiling of the ladder.
When a model does not fit: offloading and the bandwidth cliff
You do not have to fit a model entirely in one kind of memory. Tools like llama.cpp and LM Studio let you split a model, keeping some layers in fast VRAM and spilling the rest into system RAM where the CPU handles them. This is how a 24GB card runs a 70B model at all. The cost is speed, and it is steep.
Generation slows roughly in proportion to how much of the model lives in slow memory, and a fast GPU cannot buy that time back. Offload half the layers and every token still has to read those CPU-side weights across the much slower memory bus, so you get at most about twice the speed of running on CPU alone, nowhere near half the speed of a GPU-only run. The practical guidance: a model that fits entirely in VRAM is the goal; partial offload is a usable compromise; a model running mostly from system RAM will be slow no matter how fast your GPU is. The one happy exception is mixture-of-experts models, where only a few experts are active per token, so offloading the idle ones hurts far less.
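The bandwidth cliff is easy to see with a back-of-envelope model. The sketch below is a simplification, not how any particular runtime schedules work: it assumes each token reads every weight once, that the VRAM and system-RAM portions are read one after the other, and it ignores compute time, PCIe transfers, and KV cache reads. The function name is hypothetical; the footprint and bandwidth figures are the ones quoted elsewhere in this guide.

```python
def offload_tokens_per_sec(weights_gb: float, vram_fraction: float,
                           vram_gbps: float, ram_gbps: float) -> float:
    """Upper-bound tokens/s when weights are split between VRAM and system RAM."""
    if not 0.0 <= vram_fraction <= 1.0:
        raise ValueError("vram_fraction must be between 0 and 1")
    seconds_per_token = (
        weights_gb * vram_fraction / vram_gbps             # GPU-side layers
        + weights_gb * (1.0 - vram_fraction) / ram_gbps    # CPU-side layers
    )
    return 1.0 / seconds_per_token


WEIGHTS_GB = 42.0    # Llama 3.3 70B at Q4
VRAM_GBPS = 1008.0   # RTX 4090
RAM_GBPS = 83.0      # dual-channel desktop DDR5

# 1.0 is hypothetical here: 42GB does not fit on a 24GB card.
for fraction in (0.0, 0.25, 0.5, 1.0):
    rate = offload_tokens_per_sec(WEIGHTS_GB, fraction, VRAM_GBPS, RAM_GBPS)
    print(f"{fraction:.0%} of weights in VRAM -> at most ~{rate:.1f} tokens/s")
```

Under these assumptions the slow portion dominates the time per token, which is why moving half the model to the GPU falls well short of doubling CPU-only speed.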
MoE vs dense: why a trillion-parameter model can run on one node
The headline parameter count can badly mislead you on memory, because of how modern large models are built. A dense model uses every parameter for every token, so a 70B dense model does 70B parameters’ worth of work each step. A mixture-of-experts (MoE) model holds many specialist sub-networks but activates only a few per token: Kimi K2 has about 1 trillion total parameters but only 32B active at a time, and gpt-oss-120b and DeepSeek R1 are MoE too.
The rule that follows is worth memorizing: total parameters set how much memory you need; active parameters set how fast it runs. All the weights must be loaded, so a 1T-parameter MoE still needs roughly 600GB at Q4. But because only 32B are active per token, it generates about as quickly as a 32B dense model would, far faster than a dense model of the same total size. This is exactly why a trillion-parameter MoE can run at a usable speed on a single 8-GPU node while a much smaller dense model would be slower: the memory holds the whole thing, and the speed only ever pays for the active slice.
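A minimal sketch of that rule, using the Kimi K2 figures above. The helper name is hypothetical, and the 0.6 GB per billion is simply what the roughly 600GB figure for a 1T-parameter model implies, not a measured constant.

```python
def memory_and_read_gb(total_b: float, active_b: float,
                       gb_per_billion: float) -> tuple[float, float]:
    """Return (GB needed to load the model, GB of weights read per token)."""
    return total_b * gb_per_billion, active_b * gb_per_billion


GB_PER_B = 0.6  # implied by ~600GB for 1T parameters at Q4

# MoE: 1T total parameters, 32B active per token.
moe_load, moe_read = memory_and_read_gb(1000, 32, GB_PER_B)
# Dense 32B: every parameter is active on every token.
dense_load, dense_read = memory_and_read_gb(32, 32, GB_PER_B)

print(f"1T MoE    : load ~{moe_load:.0f} GB, read ~{moe_read:.1f} GB per token")
print(f"32B dense : load ~{dense_load:.1f} GB, read ~{dense_read:.1f} GB per token")
```

The two models read the same amount per token, so their generation speed is similar, while the memory needed to hold them differs by a factor of about thirty.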
Beyond 512GB: CPU servers, multi-GPU nodes, and trillion-parameter models
Past half a terabyte, consumer hardware runs out (a maxed Mac Studio stops at 512GB) and you move into two server-shaped options.
CPU servers with a terabyte of RAM. A dual-socket AMD EPYC workstation takes 768GB to 1.5TB of DDR5 across many memory channels, which is enough to hold even the 671B models in higher precision. The trade is speed: running entirely on CPU, builders report DeepSeek R1 671B at roughly 3.5 to 8 tokens per second, depending on quantization and memory channels, on rigs that can cost as little as $2,000 used. It is the cheapest way to touch a frontier-size model, and the slowest. Memory bandwidth, set by the number of populated channels, matters more here than core count.
Multi-GPU nodes. Stack eight data-center cards and the VRAM adds up: 8 x H100 gives 640GB, and 8 x H200 gives about 1.1TB. This is enough to run the largest open weights in fast memory. Per a 2026 GPU sizing cheat sheet, Llama 405B and DeepSeek V3-class models are served on 8-GPU nodes, typically at FP8. Kimi K2, the 1-trillion-parameter MoE, fits on a single 8 x H100 node at Q4 because its weights pack to roughly 620GB, just under the 640GB ceiling. Run the same model at full BF16 precision and it needs well over a terabyte, which is multi-node territory.
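The node arithmetic above is simple enough to check in a few lines. This is a capacity-only sketch with hypothetical helper names: it compares weight size against pooled VRAM and ignores KV cache, activations, and framework overhead, all of which need room too.

```python
NODE_GPUS = 8  # GPUs per node, as in the examples above


def node_vram_gb(per_gpu_gb: float, gpus: int = NODE_GPUS) -> float:
    """Total VRAM across one node."""
    return per_gpu_gb * gpus


def fits(weights_gb: float, per_gpu_gb: float, gpus: int = NODE_GPUS) -> bool:
    """True if the weights alone fit in the node's pooled VRAM."""
    return weights_gb <= node_vram_gb(per_gpu_gb, gpus)


KIMI_K2_Q4_GB = 620.0          # packed Q4 size quoted above
KIMI_K2_BF16_GB = 1000 * 2.0   # 1T parameters at 2 GB per billion

print("8 x H100:", node_vram_gb(80), "GB")
print("8 x H200:", node_vram_gb(141), "GB")
print("Kimi K2 Q4 on 8 x H100:", fits(KIMI_K2_Q4_GB, 80))
print("Kimi K2 BF16 on 8 x H200:", fits(KIMI_K2_BF16_GB, 141))
```

A model that passes this check by only a few gigabytes, as the Q4 case does, leaves little working room in practice.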
The trillion-parameter ceiling. At the very top, full-precision frontier models and high-concurrency serving spill across many nodes linked by NVLink and InfiniBand. This is what the NVL72 racks above are for. For an individual, the realistic options are: a CPU server for slow-but-cheap access to a 671B model, or renting an 8-GPU node by the hour in the cloud. Owning the multi-GPU hardware outright is a six-figure decision that only makes sense at sustained, heavy load, which is the same buy-versus-rent math we run for decentralized GPU compute.
Capacity is not speed: the bandwidth catch
Here is the trap that the RAM-per-billion math hides. Having enough memory to load a model only means it will run, not that it will run well. Every token a model generates requires reading its entire active parameter set out of memory, so generation speed is set by memory bandwidth, not capacity. That is why the same DeepSeek R1 that loads on a 512GB Mac Studio generates at a usable-but-deliberate ~17 tokens per second rather than the hundreds you get from a small model on a fast GPU.
The spread is enormous. The slow system RAM that lets a cheap server hold a 671B model moves bytes about fifty times slower than the HBM on a data-center GPU, which is the entire reason that same model crawls on CPU and flies on an H200. Plotted on a log scale, the memory you can afford and the memory that is fast sit at opposite ends.
| Memory | Approximate bandwidth |
|---|---|
| Desktop DDR5 (system RAM) | 83 GB/s |
| DGX Spark (unified) | 273 GB/s |
| Apple M3 Ultra (unified) | 819 GB/s |
| RTX 4090 (GDDR6X) | 1,008 GB/s |
| RTX 5090 (GDDR7) | 1,792 GB/s |
| H100 (HBM3) | 3,350 GB/s |
| H200 (HBM3e) | 4,800 GB/s |
| B300 Blackwell Ultra (HBM3e) | 8,000 GB/s |
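Because each generated token has to read the active weights once, dividing bandwidth by the active weight size gives a rough ceiling on tokens per second. The sketch below applies that to the bandwidth figures above for a 20GB model (Qwen3 32B at Q4). The names are hypothetical, the calculation ignores whether the model actually fits in each device's memory, and real throughput lands below the ceiling.

```python
# Approximate memory bandwidth in GB/s, from the table above.
BANDWIDTH_GBPS = {
    "Desktop DDR5": 83,
    "DGX Spark": 273,
    "Apple M3 Ultra": 819,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
    "H100": 3350,
    "H200": 4800,
    "B300": 8000,
}


def max_tokens_per_sec(active_weights_gb: float, bandwidth_gbps: float) -> float:
    """Bandwidth-bound ceiling: one full read of the active weights per token."""
    return bandwidth_gbps / active_weights_gb


MODEL_GB = 20.0  # Qwen3 32B at Q4 (dense, so all weights are active)

for name, bandwidth in BANDWIDTH_GBPS.items():
    ceiling = max_tokens_per_sec(MODEL_GB, bandwidth)
    print(f"{name:<15} at most ~{ceiling:.0f} tokens/s")
```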
The economics changed in 2026
A year ago, the pitch for buying a big-memory machine was buy-once, run-free. That math got worse in 2026 for a specific reason: the resource local AI runs on is exactly the one that spiked in price. Memory makers have run the most lucrative shortage in chip history, which you can watch in real time on our memory price tracker, and the cost flowed straight through to devices. On June 25, 2026, Apple raised prices across its Mac and iPad line, pushing the Mac Studio M3 Ultra (the box people buy for the largest models) from $3,999 to $5,299. The increases scaled with memory density, which is to say the local-AI tax was the whole story.
That does not kill the case for buying, but it sharpens the question. If your reasons are privacy, offline use, or a heavy steady workload, owning the hardware still wins. If you just want occasional access to a frontier model, the falling price of API tokens makes renting the better near-term math. Size the machine to the largest model you will genuinely use, not the largest the tier could theoretically hold, and check current model specs and sizes before you commit to a memory budget. On timing, the RAM price forecast is blunt: relief is not expected before late 2027, so buy what you need now rather than waiting out the shortage.
Frequently asked questions
- How much RAM do I need to run a local LLM?
- For a useful general-purpose model, 16GB is the practical minimum and runs 8B models comfortably. 8GB works for small 3B to 4B models. To run a 70B model you need about 64GB, and the very largest open models (671B) require 512GB.
- Can I run a local LLM with 8GB of RAM?
- Yes, but only small models. After the operating system takes its share you have roughly 3 to 4GB for the model, which fits a 3B to 4B model at 4-bit quantization. That is enough for chat, summarizing, and autocomplete, but not for heavy reasoning.
- Is VRAM or system RAM better for local LLMs?
- GPU VRAM is faster because of its higher memory bandwidth, so a model that fits entirely in VRAM generates faster. System RAM (or Apple unified memory) lets you load far larger models for the money, but bandwidth is usually lower, so big models run slower. The ideal is enough fast memory to hold your target model.
- What is the cheapest way to run a large model locally?
- Among machines that run very large models at a usable speed, Apple unified memory currently gives the most gigabytes per dollar: a Mac Studio holds models that would otherwise need a multi-GPU server costing several times more. A used CPU server with a terabyte of system RAM is cheaper still, but far slower. For models up to about 32B, a single high-end consumer GPU like an RTX 5090 is the better value and much faster.
- Does quantization hurt quality?
- Going from full precision to 8-bit is nearly lossless. 4-bit (Q4) is the common local standard and trades a small, usually unnoticeable quality drop for roughly a quarter of the memory. Below 4-bit the quality loss becomes more visible, so Q4 is the practical floor for most uses.
- Can I run a local LLM on CPU only, without a GPU?
- Yes. Any model that fits in system RAM will run on the CPU, and tools like llama.cpp support this directly. It is much slower than a GPU because system memory bandwidth is far lower, but it is how very large models run on cheap servers: a dual-socket EPYC box with 768GB to 1.5TB of RAM runs DeepSeek R1 671B at roughly 3.5 to 8 tokens per second on CPU alone.
- How do I run a model that is bigger than my VRAM?
- Use offloading. Tools like llama.cpp and LM Studio keep as many layers as fit in GPU VRAM and run the rest on the CPU from system RAM. Speed drops roughly in proportion to how much of the model sits in slow memory, so it is a usable compromise rather than a free lunch. Mixture-of-experts models suffer the least because only a few experts are active per token.
- How much memory does a trillion-parameter model need?
- All the weights must be loaded regardless of how many are active per token, so a 1-trillion-parameter mixture-of-experts model like Kimi K2 needs roughly 600GB at 4-bit. That fits on a single 8-GPU node (8x H100 = 640GB) or a workstation with more than a terabyte of RAM. At full BF16 precision the same model needs well over a terabyte and spans multiple nodes.
- What is the highest-end GPU for running LLMs in 2026?
- For a single chip, NVIDIA Blackwell Ultra (B300) is the production flagship at 288GB of HBM3e, with the new Vera Rubin generation (288GB HBM4) entering production in mid-2026. Rack-scale systems like the GB300 NVL72 pool more than 20TB of GPU memory across 72 GPUs. None of these are desktop hardware; the realistic consumer ceiling is the RTX 5090 (32GB) or a workstation RTX PRO 6000 Blackwell (96GB).
- Does context length affect how much RAM I need?
- Yes. The KV cache, the model's working memory for the current conversation, grows with context length and can add several gigabytes on a long prompt or document. Budget headroom beyond the model weights, especially if you plan to use long contexts or feed in large files.
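For a sense of scale on that last answer, the KV cache of a standard transformer can be estimated from the model's shape: one key and one value vector per layer, per KV head, per token. The sketch below uses an assumed shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) of the kind found in 8B-class models with grouped-query attention. These numbers are illustrative, so check your own model's configuration before relying on them.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimated KV cache size in GiB for one sequence.

    The factor of 2 counts keys and values; bytes_per_value=2 assumes
    a 16-bit cache.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1024**3


# Assumed model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_gib(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>7} tokens -> ~{size:.0f} GiB of KV cache")
```

With this shape, the cache grows linearly with context, which is why a long document can cost as much memory as a small model's weights.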
Sources
OpenAI (2025). Introducing gpt-oss. OpenAI. https://openai.com/index/introducing-gpt-oss/
NVIDIA (2026). NVIDIA DGX Spark (product specifications). NVIDIA. https://www.nvidia.com/en-us/products/workstations/dgx-spark/
NVIDIA (2026). GB200 NVL72 (product specifications). NVIDIA. https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA (2026). GTC 2026: Vera Rubin and the next generation of AI. NVIDIA Blog. https://blogs.nvidia.com/blog/gtc-2026-news/
Spheron (2026). GPU Requirements Cheat Sheet 2026. Spheron Blog. https://www.spheron.network/blog/gpu-requirements-cheat-sheet-2026/
Digital Spaceport (2025). How To Run DeepSeek R1 671B Fully Locally On a $2000 EPYC Server. Digital Spaceport. https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/
Saplin, M. (2026). llama.cpp: CPU vs GPU, shared VRAM and Inference Speed. DEV Community. https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl
MacRumors (2025). Mac Studio With M3 Ultra Runs Massive DeepSeek R1 AI Model Locally. MacRumors. https://www.macrumors.com/2025/03/17/apples-m3-ultra-runs-deepseek-r1-efficiently/
TechRadar (2025). Apple Mac Studio M3 Ultra workstation can run Deepseek R1 671B AI model entirely in memory using less than 200W. TechRadar Pro. https://www.techradar.com/pro/apple-mac-studio-m3-ultra-workstation-can-run-deepseek-r1-671b-ai-model-entirely-in-memory-using-less-than-200w-reviewer-finds