How Much RAM Do You Need to Run a Local LLM?

The single number that decides which local LLM you can run is your RAM, and the math is simple enough to do in your head. A model needs roughly half a gigabyte of memory per billion parameters at 4-bit quantization, the format almost everyone runs locally. So 8GB of RAM tops out at a 3B to 4B model (a 7B is a tight squeeze), 32GB reaches a 32B, and DeepSeek R1, at 671 billion parameters one of the largest open models available, needs a machine with 512GB to hold it. Above that sit the trillion-parameter models that take a multi-GPU server or a workstation with more than a terabyte of RAM. Everything else is detail on top of that one ratio.

This guide walks the entire ladder, from an 8GB laptop to a 1.1TB GPU node, covers the difference between system RAM and the much faster VRAM on a graphics card, and tells you which device fits each rung. In a hurry? The "Can I Run This LLM?" checker turns this math into a picker: choose your hardware and see which models fit, how fast, and whether owning, renting, or the API is cheaper.

≈0.5 GB

Memory per billion parameters at 4-bit (Q4). The context window and runtime overhead add 15 to 30 percent (about 0.58 to 0.65 GB per billion), and this guide rounds up to a conservative 0.75 GB per billion for budgeting.

16 GB

Enough to run OpenAI gpt-oss-20b, a capable reasoning model, entirely on a mainstream laptop, per OpenAI's own spec.

512 GB

What it takes to hold DeepSeek R1 (671B) fully in memory. A maxed Mac Studio M3 Ultra is the one consumer box that can.

Key takeaways

The rule of thumb: at 4-bit quantization a model needs about 0.5 GB of memory per billion parameters, plus headroom for context and the operating system. Divide your usable RAM by roughly 0.75 GB to estimate the largest model you can run.
8GB runs small 3B to 7B models for chat, summaries, and autocomplete. 16GB is the mainstream sweet spot: 8B models comfortably, plus MoE models like gpt-oss-20b.
24 to 32GB (an RTX 4090 or 5090, or a 32GB Mac) runs 14B to 32B dense models at near-frontier-lite quality. 64GB reaches a 70B model at 4-bit.
128GB unlocks frontier-class open models: gpt-oss-120b fits in 80GB, and an NVIDIA DGX Spark or a 128GB Mac Studio handles models up to 200B parameters.
The 512GB-plus tier is for the very largest open weights: DeepSeek R1/V3 at 671B and Llama 405B. A maxed Mac Studio M3 Ultra runs DeepSeek R1 entirely in memory at about 17 to 18 tokens per second.
Beyond 512GB you are into servers: a dual-socket EPYC box with 768GB to 1.5TB of RAM runs 671B on CPU alone at single-digit tokens per second, and an 8-GPU node (640GB on H100s, 1.1TB on H200s) runs trillion-parameter MoE models like Kimi K2 at speed.
VRAM is not the same as RAM. GPU VRAM is many times faster than system memory, so a model that fits entirely in VRAM runs far quicker; the moment it spills into system RAM, throughput falls off a cliff.
Capacity decides whether a model loads at all; memory bandwidth decides how fast it runs. A box can have the RAM to hold a model and still feel slow.

The math: half a gigabyte per billion parameters

A model is a pile of numbers (its parameters, or weights), and memory is just where those numbers sit while the model runs. So the size of the model and the precision of each number set the bill.

At 16-bit precision (FP16 or BF16, the format most open weights are released in, and what this guide calls full precision), each parameter takes 2 bytes, so a 7B model needs about 14GB just for weights. Almost nobody runs full precision locally. The standard for local inference is quantization: storing each weight in fewer bits with a small, usually acceptable loss in quality. The common rungs:

FP16 (full): 2 bytes per parameter, so ≈2 GB per billion.
Q8 (8-bit): ≈1 byte per parameter, ≈1 GB per billion. Near-lossless.
Q4 (4-bit): ≈0.5 bytes per parameter, ≈0.5 GB per billion. The practical default, and what every number in this guide assumes unless stated.

Q4 is the one that matters because it is where local LLMs became practical: it roughly quarters the memory cost versus full precision for a quality drop most people cannot feel in everyday use. That is a big part of why local LLMs got good in 2026.

So the working formula is: (usable RAM in GB) ÷ 0.75 ≈ the largest model in billions of parameters you can comfortably run at Q4. A 16GB machine with about 11GB free lands near a 14B model. The rest of this guide turns that into a shopping list.
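The working formula is simple enough to put in a few lines. This is a minimal sketch, not a sizing tool: the function name is made up for illustration, and the "usable" figures are picked from the ranges in the system-RAM ladder table later in this guide.

```python
GB_PER_BILLION_Q4 = 0.75  # the guide's conservative budget: ~0.5 GB of weights plus headroom


def largest_model_billions(usable_ram_gb: float,
                           gb_per_billion: float = GB_PER_BILLION_Q4) -> float:
    """Largest model, in billions of parameters, that fits the budget at Q4."""
    return usable_ram_gb / gb_per_billion


# (total RAM, RAM left for the model) pairs; usable values are illustrative
for total_gb, usable_gb in [(8, 3.5), (16, 11), (32, 24), (64, 48)]:
    size = largest_model_billions(usable_gb)
    print(f"{total_gb} GB machine, {usable_gb} GB usable: about {size:.1f}B parameters")
```

Treat the result as a ceiling to shop under rather than a target: long contexts and other running apps eat into the same budget.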

VRAM, system RAM, and unified memory: the three kinds of memory

“How much RAM” hides a question that decides everything about how a local model behaves: which kind of memory. There are three, and they differ less in capacity than in speed.

System RAM (DDR5). The sticks in a normal PC or laptop. Cheap and available in huge amounts on a server, but slow by AI standards: a typical dual-channel desktop moves on the order of 80 to 100 GB/s. The CPU reads model weights from it. You can run very large models here, just slowly.
GPU VRAM (GDDR or HBM). The memory soldered onto a graphics card. Far faster: an RTX 5090 moves about 1.8 TB/s, and a data-center H200 about 4.8 TB/s, twenty to fifty times a desktop’s RAM. This is why a model that fits entirely in VRAM is so much quicker. The catch is capacity: consumer cards top out at 32GB, and the 80 to 288GB data-center cards cost as much as a car.
Unified memory (Apple Silicon, NVIDIA DGX Spark). A single pool the CPU and GPU share, so there is no slow copy between them. Bandwidth sits in the middle: Apple’s M3 Ultra reaches about 819 GB/s, NVIDIA’s DGX Spark about 273 GB/s. This is the trick that lets a Mac hold a model far larger than any consumer GPU can, at a speed that is slower than that GPU but vastly faster than CPU RAM.

Memory type	Typical bandwidth	Capacity you can buy	Cost per GB	Best for
System RAM (DDR5)	~80–100 GB/s (desktop); higher on multi-channel servers	16GB to 1.5TB+	Lowest	Holding very large models on CPU, slowly
Unified memory	~270–820 GB/s	16GB to 512GB	Medium	Large models at moderate speed on one box
GPU VRAM (GDDR/HBM)	~1,000–8,000 GB/s	8GB to 288GB per card	Highest	Maximum speed for any model that fits

The practical upshot: fit your model in the fastest memory it will fit in. A 13B model belongs in a 16GB GPU, not spread across 64GB of system RAM. A 671B model has no choice but to live in a large unified-memory pool, in server RAM, or across many GPUs. Everything below is really about matching a model to the right kind of memory, not just enough of it.

Quantization: the dial that sets the footprint

Quantization is the single biggest lever on memory, so it is worth seeing the full dial rather than just the Q4 default. Lower bits per weight means a smaller footprint and faster generation, traded against a gradual loss of quality.

Format	Bits per weight	Memory per 1B params	Quality	When to use
FP16 / BF16	16	~2.0 GB	Full (reference)	Training, and serving when memory is not the constraint
Q8	8	~1.0 GB	Near-lossless	When you have headroom and want maximum fidelity
Q6	6	~0.75 GB	Very close to Q8	A safe step down from Q8
Q5	5	~0.65 GB	Slightly below Q6	A middle ground on tight memory
Q4	4	~0.5 GB	Small, usually unnoticeable drop	The default for local inference
Q3	3	~0.4 GB	Noticeable degradation	Only to squeeze a model that almost fits
Q2	2	~0.3 GB	Heavy degradation	Last resort, often not worth it

Q4 is the sweet spot because it roughly quarters the memory cost of full precision for a quality drop most people cannot feel. Every footprint figure in the tables below assumes Q4 unless noted.
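The footprint column follows directly from bits per weight. The sketch below computes the idealized figure (parameters times bits, divided by 8) for a 70B model. The names are illustrative. Real quantized files also store per-block scaling data, which is presumably why the table's figures for Q5, Q3, and Q2 sit slightly above the pure bits-divided-by-8 value.

```python
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4, "Q3": 3, "Q2": 2}


def weights_gb(params_billions: float, fmt: str) -> float:
    """Idealized weight footprint in GB: parameters x bits / 8, ignoring format overhead."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


for fmt in BITS_PER_WEIGHT:
    print(f"70B at {fmt}: {weights_gb(70, fmt):.1f} GB of weights")
```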

What model footprints actually look like

The reason RAM is the gatekeeper is the sheer spread of model sizes. A small model you can run on a phone and the largest open model a hyperscaler-grade box runs differ by more than two orders of magnitude in memory. Plotted on a normal axis the small models vanish; on a log scale you can see the whole ladder at once.

Local LLM memory footprint at 4-bit, by model
Model	Footprint at Q4 (weights only)
Llama 3.2 (3B)	2 GB
Qwen3 (8B)	5 GB
gpt-oss-20b	13 GB
Qwen3 (32B)	20 GB
Llama 3.3 (70B)	42 GB
gpt-oss-120b	63 GB
DeepSeek R1 (671B)	404 GB

Memory footprint at 4-bit (Q4) for representative open models, from a 3B that runs on a phone to DeepSeek R1's 671B, which consumes about 404GB. The axis is logarithmic because the range spans more than 200x. Footprints are weights only; add headroom for context and the OS. Source: Footprints derived from parameter counts at Q4; DeepSeek R1 figure per TechRadar and MacRumors reporting on the M3 Ultra test.

The system-RAM ladder: what each tier runs, on what device

This ladder is for system memory and Apple-style unified memory, the pool you size when you buy a laptop, a Mac, or a workstation. The VRAM ladder for graphics cards comes next. The table below is the whole guide in one view. “Usable for the model” assumes you leave headroom for the OS and a modest context window. Model footprints assume Q4. Devices are representative, not exhaustive.

RAM	Usable for the model	Largest comfortable model (Q4)	Example models	Typical device	What you can actually do
8 GB	~3–4 GB	3B, up to a tight 7B	Llama 3.2 3B, Qwen3 4B, Gemma 3 4B, Phi	Base MacBook Air, mainstream laptop, high-end phone	Offline chat, summarizing, autocomplete, simple retrieval over a few docs
16 GB	~10–11 GB	8B comfortably, 13–14B tight	Qwen3 8B, Llama 3.1 8B, gpt-oss-20b (MoE)	Mid-range laptop, M-series Air/Pro	A genuinely useful daily assistant, decent coding help, RAG over a document set
24–32 GB	~18–26 GB	14B to 32B dense	Qwen3 32B, Gemma 3 27B, gpt-oss-20b at full quality	RTX 4090 (24GB) / RTX 5090 (32GB), 32GB Mac	Near-frontier-lite quality, agentic coding, longer context windows
48–64 GB	~40–52 GB	70B	Llama 3.3 70B, Qwen 72B	64GB Mac, dual 24GB GPUs	Strong general reasoning, serious local coding, multi-document RAG
96–128 GB	~80–110 GB	120B; 70B at Q8/BF16	gpt-oss-120b (80GB), 70B at higher precision	NVIDIA DGX Spark (128GB), 128GB Mac Studio	Frontier-class open models; fine-tune up to 70B on the DGX Spark
256 GB	~220 GB	200B-class, or several big models at once	Large MoE models, multi-model setups	High-RAM Mac Studio, multi-GPU workstation	Run a 200B model plus tooling, or two 70B models side by side
512 GB+	~440 GB+	405B to 671B	DeepSeek R1/V3 671B, Llama 405B	Mac Studio M3 Ultra (512GB), 8x80GB GPU server	The largest open weights, held entirely in memory

8GB: small models, real uses

This is the floor, and it is more useful than it sounds. After the operating system takes its cut you have roughly 3 to 4GB for a model, which is a 3B to 4B at Q4. Models like Llama 3.2 3B and Qwen3 4B handle summarization, drafting, autocomplete, and simple question-answering over a handful of documents without ever touching the network. What they are not is a reasoning engine: expect them to stumble on multi-step logic and longer context. On 8GB, a small fast model you actually use beats a larger one you cannot load.

16GB: the mainstream sweet spot

Sixteen gigabytes is where local AI stops being a demo. An 8B model (Qwen3 8B is about 5GB at Q4) leaves plenty of room for context and runs quickly on a modern laptop. This tier also unlocks the first genuinely strong option: OpenAI says its gpt-oss-20b runs on edge devices with just 16GB of memory, because its mixture-of-experts design activates only a fraction of its parameters per token. For most people who want a private, capable assistant on the machine they already own, 16GB is the answer.

24 to 32GB: the prosumer GPU tier

This is the high-end consumer graphics card bracket: an RTX 4090 carries 24GB of VRAM and the newer RTX 5090 carries 32GB. It runs 14B to 32B dense models at Q4, which is where open models start to feel close to the commercial frontier for everyday work. A 32B like Qwen3 32B (about 20GB) fits with room for a long context, and agentic coding becomes realistic. If you are choosing hardware specifically to run models, this tier is the best balance of capability and cost for most enthusiasts.

48 to 128GB: 70B models and the personal supercomputer

A 70B model at Q4 needs roughly 40 to 48GB, so 64GB is the entry point for the heavyweight open models like Llama 3.3 70B. Push to 128GB and you reach the most interesting recent category: the personal AI box. NVIDIA’s DGX Spark pairs 128GB of unified memory with 273 GB/s of bandwidth and, per NVIDIA, runs inference on models up to 200 billion parameters and fine-tunes up to 70B. A 128GB Mac Studio reaches the same class. This is also the tier where gpt-oss-120b lives: OpenAI says the 120B version runs within 80GB of memory.

512GB and up: the largest open weights

At the top of the ladder is one headline use case: running the biggest open models in existence. DeepSeek R1 at 671 billion parameters consumes about 404GB even at 4-bit, which is why it needs a 512GB machine to hold it with working room. The remarkable part is that this is now possible on a single desktop. As MacRumors reported, a Mac Studio with an M3 Ultra and 512GB of unified memory runs DeepSeek R1 locally, and a TechRadar reviewer measured it at roughly 17 to 18 tokens per second while drawing under 200 watts. The alternative, a multi-GPU server, costs far more and burns far more power, and is the start of the next ladder.

The VRAM ladder: consumer cards to rack-scale systems

If you run models on a graphics card rather than in system memory, this is the ladder that matters. VRAM is faster but scarcer, so the rungs are smaller and the prices climb steeply. The table runs from an entry consumer card to a full data-center rack that NVIDIA treats as one giant GPU.

VRAM	Example hardware	Class	Largest model (Q4)	Notes
8 GB	RTX 4060, RTX 3050	Consumer	7B, tight	Entry GPU; keep context short
12 GB	RTX 3060, RTX 4070	Consumer	13B	Comfortable small-model card
16 GB	RTX 4080, RTX 5060 Ti 16GB	Consumer	14B with context	Good 8B card with long context
24 GB	RTX 3090, RTX 4090, RX 7900 XTX	Prosumer	32B	The long-time enthusiast standard
32 GB	RTX 5090	Consumer flagship	32B comfortably	~1.8 TB/s, the fastest consumer card
48 GB	RTX 6000 Ada, L40S	Workstation	70B, tight	Single-card 70B becomes possible
96 GB	RTX PRO 6000 Blackwell	Workstation	120B	The most VRAM on a non-data-center card
80–141 GB	A100 / H100 (80GB), H200 (141GB)	Data center	70B at FP16, 100B+ at Q4	HBM, ~2.0–4.8 TB/s bandwidth
192–288 GB	B200 (192GB), B300 Blackwell Ultra (288GB)	Data center flagship	200B+ on a single GPU	B300 is the current production flagship
13.4–20.7 TB	GB200 NVL72 / GB300 NVL72	Rack-scale	Trillion-parameter, served to thousands	72 GPUs wired as one

For most people the story stops at 32GB: an RTX 5090 is the fastest card you can put in a desktop and runs anything up to a 32B model briskly. Step up to workstation cards and the RTX PRO 6000 Blackwell carries 96GB, enough for a 120B model on one card. Above that you are buying data-center silicon: an H100 holds 80GB, an H200 holds 141GB, and a single H200 runs a 70B model at full FP16 precision with room for a long context.

The current top of the single-GPU ladder is NVIDIA’s Blackwell generation. The B200 carries 192GB of HBM3e, and the Blackwell Ultra B300, the flagship in production as of mid-2026, raises that to 288GB at roughly 8 TB/s. Beyond a single chip, NVIDIA stitches 72 GPUs into one rack-scale unit: the GB200 NVL72 pools 13.4 TB of fast GPU memory, and the GB300 NVL72 (built on B300s) pushes past 20 TB. These are the boxes that serve frontier models to millions of users, not desktop hardware, but they are the literal ceiling of the ladder.

When a model does not fit: offloading and the bandwidth cliff

You do not have to fit a model entirely in one kind of memory. Tools like llama.cpp and LM Studio let you split a model, keeping some layers in fast VRAM and spilling the rest into system RAM where the CPU handles them. This is how a 24GB card runs a 70B model at all. The cost is speed, and it is steep.

Time per token grows roughly in proportion to how much of the model lives in slow memory, because every token still has to read those CPU-side weights across the much slower memory bus. The penalty is lopsided: desktop system RAM is ten or more times slower than VRAM, so offloading half the layers does not leave you at half of full-GPU speed; it leaves you at only about twice CPU-only speed. The practical guidance: a model that fits entirely in VRAM is the goal; partial offload is a usable compromise; a model running mostly from system RAM will be slow no matter how fast your GPU is. The one happy exception is mixture-of-experts models, where only a few experts are active per token, so offloading the idle ones hurts far less.
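A rough way to see the bandwidth cliff is to model each token as one full pass over the weights, with the VRAM share read at GPU bandwidth and the rest at system-RAM bandwidth. This is a simplified sketch under that assumption: it ignores compute time, PCIe transfers, and caching, and the function name is illustrative. The 42 GB model and the bandwidth figures come from the tables in this guide. Fractions above about 0.57 (24 of 42 GB) are not reachable on a 24GB card and are included only to show the shape of the curve.

```python
def seconds_per_token(model_gb: float, vram_fraction: float,
                      vram_gb_s: float, ram_gb_s: float) -> float:
    """Time to stream every weight once: fast share from VRAM, the rest from system RAM."""
    fast = model_gb * vram_fraction / vram_gb_s
    slow = model_gb * (1 - vram_fraction) / ram_gb_s
    return fast + slow


MODEL_GB = 42     # Llama 3.3 70B at Q4
VRAM_GB_S = 1008  # RTX 4090
RAM_GB_S = 83     # desktop DDR5

for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
    tokens_s = 1 / seconds_per_token(MODEL_GB, fraction, VRAM_GB_S, RAM_GB_S)
    print(f"{fraction:.0%} of weights in VRAM: ceiling of about {tokens_s:.1f} tokens/s")
```

Under these assumptions the slow share dominates almost immediately, which is why a half-offloaded model behaves much more like a CPU run than a GPU run.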

MoE vs dense: why a trillion-parameter model can run on one node

The headline parameter count can badly mislead you on memory, because of how modern large models are built. A dense model uses every parameter for every token, so a 70B dense model does 70B parameters’ worth of work each step. A mixture-of-experts (MoE) model holds many specialist sub-networks but activates only a few per token: Kimi K2 has about 1 trillion total parameters but only 32B active at a time, and gpt-oss-120b and DeepSeek R1 are MoE too.

The rule that follows is worth memorizing: total parameters set how much memory you need; active parameters set how fast it runs. All the weights must be loaded, so a 1T-parameter MoE still needs roughly 600GB at Q4. But because only 32B are active per token, it generates as quickly as a 32B dense model would, far faster than a 671B dense model. This is exactly why a trillion-parameter model can run on a single 8-GPU node when a much smaller dense model would choke: the memory holds the whole thing, and the speed only ever pays for the active slice.
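The two halves of that rule can be written as two separate calculations. The sketch below is illustrative only: the class and constant names are invented, the 0.6 GB per billion figure is an assumption chosen to match the roughly 404 GB quoted for DeepSeek R1 at Q4, and the speed figure is a bandwidth-only ceiling against a single H100's 3,350 GB/s that ignores compute and multi-GPU communication.

```python
from dataclasses import dataclass

GB_PER_BILLION_Q4 = 0.6  # assumption: ~404 GB / 671B, as in the footprint table


@dataclass
class Model:
    name: str
    total_b: float   # total parameters in billions (sets memory)
    active_b: float  # parameters read per token in billions (sets speed)

    def memory_gb(self) -> float:
        return self.total_b * GB_PER_BILLION_Q4

    def tokens_per_s_ceiling(self, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / (self.active_b * GB_PER_BILLION_Q4)


H100_GB_S = 3350
kimi = Model("Kimi K2 (MoE)", total_b=1000, active_b=32)
dense_70b = Model("70B dense", total_b=70, active_b=70)

for m in (kimi, dense_70b):
    print(f"{m.name}: {m.memory_gb():.0f} GB to load, "
          f"ceiling of about {m.tokens_per_s_ceiling(H100_GB_S):.0f} tokens/s")
```

The MoE model needs far more memory than the dense 70B yet has the higher speed ceiling, because it reads fewer weights per token.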

Beyond 512GB: CPU servers, multi-GPU nodes, and trillion-parameter models

Past half a terabyte, consumer hardware runs out (a maxed Mac Studio stops at 512GB) and you move into two server-shaped options.

CPU servers with a terabyte of RAM. A dual-socket AMD EPYC workstation takes 768GB to 1.5TB of DDR5 across many memory channels, which is enough to hold even the 671B models in higher precision. The trade is speed: running entirely on CPU, builders report DeepSeek R1 671B at roughly 3.5 to 8 tokens per second, depending on quantization and memory channels, on rigs that can cost as little as $2,000 used. It is the cheapest way to touch a frontier-size model, and the slowest. Memory bandwidth, set by the number of populated channels, matters more here than core count.

Multi-GPU nodes. Stack eight data-center cards and the VRAM adds up: 8 x H100 gives 640GB, and 8 x H200 gives about 1.1TB. This is enough to run the largest open weights in fast memory. Per a 2026 GPU sizing cheat sheet, Llama 405B and DeepSeek V3-class models are served on 8-GPU nodes, typically at FP8. Kimi K2, the 1-trillion-parameter MoE, fits on a single 8 x H100 node at Q4 because its weights pack to roughly 620GB, just under the 640GB ceiling. Run the same model at full BF16 precision and it needs well over a terabyte, which is multi-node territory.
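The node arithmetic above is easy to check. In this sketch the 620 GB figure is the Q4 weight size quoted in the text, and the BF16 figure assumes 2 bytes for each of roughly 1,000 billion parameters. The dictionary and function names are illustrative.

```python
NODE_VRAM_GB = {"8 x H100": 8 * 80, "8 x H200": 8 * 141}
WEIGHTS_GB = {"Kimi K2 at Q4": 620, "Kimi K2 at BF16": 1000 * 2}


def headroom_gb(weights_gb: int, node_gb: int) -> int:
    """VRAM left after loading the weights; negative means the model does not fit."""
    return node_gb - weights_gb


for node, capacity in NODE_VRAM_GB.items():
    for workload, need in WEIGHTS_GB.items():
        spare = headroom_gb(need, capacity)
        verdict = f"fits with {spare} GB spare" if spare >= 0 else f"short by {-spare} GB"
        print(f"{workload} on {node} ({capacity} GB): {verdict}")
```

Note how thin the margin is on the H100 node: whatever is left after the weights also has to hold the KV cache, so context length and concurrency are constrained there.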

The trillion-parameter ceiling. At the very top, full-precision frontier models and high-concurrency serving spill across many nodes linked by NVLink and InfiniBand. This is what the NVL72 racks above are for. For an individual, the realistic options are: a CPU server for slow-but-cheap access to a 671B model, or renting an 8-GPU node by the hour in the cloud. Owning the multi-GPU hardware outright is a six-figure decision that only makes sense at sustained, heavy load, which is the same buy-versus-rent math we run for decentralized GPU compute.

Capacity is not speed: the bandwidth catch

Here is the trap that the RAM-per-billion math hides. Having enough memory to load a model only means it will run, not that it will run well. Every token a model generates requires reading its entire active parameter set out of memory, so generation speed is set by memory bandwidth, not capacity. That is why the same DeepSeek R1 that loads on a 512GB Mac Studio generates at a usable-but-deliberate ~17 tokens per second rather than the hundreds you get from a small model on a fast GPU.

The spread is enormous. The slow system RAM that lets a cheap server hold a 671B model moves bytes about fifty times slower than the HBM on a data-center GPU, which is the entire reason that same model crawls on CPU and flies on an H200. Plotted on a log scale, the memory you can afford and the memory that is fast sit at opposite ends.

Memory bandwidth by type, system RAM to data-center GPU
Memory	Bandwidth
Desktop DDR5 (system RAM)	83 GB/s
DGX Spark (unified)	273 GB/s
Apple M3 Ultra (unified)	819 GB/s
RTX 4090 (GDDR6X)	1,008 GB/s
RTX 5090 (GDDR7)	1,792 GB/s
H100 (HBM3)	3,350 GB/s
H200 (HBM3e)	4,800 GB/s
B300 Blackwell Ultra (HBM3e)	8,000 GB/s

Memory bandwidth by memory type, from desktop DDR5 to a Blackwell Ultra B300. Higher is faster generation. The axis is logarithmic. Figures are manufacturer specifications; system RAM varies with channel count. Source: Manufacturer specifications (NVIDIA, Apple, AMD); DGX Spark figure per NVIDIA.
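Since each generated token reads the active weights once, dividing bandwidth by the bytes read per token gives an upper bound on generation speed. The sketch below applies that to a 20 GB dense model (Qwen3 32B at Q4, from the footprint table), which fits in every memory type listed. It is a bandwidth-only ceiling under that assumption; real throughput is lower once compute, KV-cache reads, and software overhead are included.

```python
BANDWIDTH_GB_S = {
    "Desktop DDR5": 83,
    "DGX Spark": 273,
    "Apple M3 Ultra": 819,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
    "H100": 3350,
    "H200": 4800,
    "B300 Blackwell Ultra": 8000,
}

MODEL_GB = 20  # Qwen3 32B at Q4, dense: all weights are read for every token


def tokens_per_s_ceiling(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if memory bandwidth were the only limit."""
    return bandwidth_gb_s / gb_read_per_token


for memory, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = tokens_per_s_ceiling(bandwidth, MODEL_GB)
    print(f"{memory}: at most about {ceiling:.0f} tokens/s")
```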

The economics changed in 2026

A year ago, the pitch for buying a big-memory machine was buy-once, run-free. That math got worse in 2026 for a specific reason: the resource local AI runs on is exactly the one that spiked in price. Memory makers have run the most lucrative shortage in chip history, which you can watch in real time on our memory price tracker, and the cost flowed straight through to devices. On June 25 2026 Apple raised prices across its Mac and iPad line, pushing the Mac Studio M3 Ultra (the box people buy for the largest models) from $3,999 to $5,299. The increases scaled with memory density, which is to say the local-AI tax was the whole story.

That does not kill the case for buying, but it sharpens the question. If your reasons are privacy, offline use, or a heavy steady workload, owning the hardware still wins. If you just want occasional access to a frontier model, the falling price of API tokens makes renting the better near-term math. Size the machine to the largest model you will genuinely use, not the largest the tier could theoretically hold, and check current model specs and sizes before you commit to a memory budget. On timing, the RAM price forecast is blunt: relief is not expected before late 2027, so buy what you need now rather than waiting out the shortage.

Frequently asked questions

How much RAM do I need to run a local LLM?

For a useful general-purpose model, 16GB is the practical minimum and runs 8B models comfortably. 8GB works for small 3B to 4B models. A 70B model needs about 64GB, and the very largest open models (671B) require 512GB.

Can I run a local LLM with 8GB of RAM?

Yes, but only small models. After the operating system takes its share you have roughly 3 to 4GB for the model, which fits a 3B to 4B model at 4-bit quantization. That is enough for chat, summarizing, and autocomplete, but not for heavy reasoning.

Is VRAM or system RAM better for local LLMs?

GPU VRAM is faster because of its higher memory bandwidth, so a model that fits entirely in VRAM generates faster. System RAM (or Apple unified memory) lets you load far larger models for the money, but bandwidth is usually lower, so big models run slower. The ideal is enough fast memory to hold your target model.

What is the cheapest way to run a large model locally?

For the lowest upfront cost, a used dual-socket EPYC server can cost as little as $2,000 and hold a 671B model in system RAM, but only at single-digit tokens per second. For usable speed on very large models, Apple unified memory currently gives the most gigabytes per dollar: a Mac Studio holds models that would otherwise need a multi-GPU server costing several times more. For models up to about 32B, a single high-end consumer GPU like an RTX 5090 is the better value and much faster.

Does quantization hurt quality?

Going from full precision to 8-bit is nearly lossless. 4-bit (Q4) is the common local standard and trades a small, usually unnoticeable quality drop for roughly a quarter of the memory. Below 4-bit the quality loss becomes more visible, so Q4 is the practical floor for most uses.

Can I run a local LLM on CPU only, without a GPU?

Yes. Any model that fits in system RAM will run on the CPU, and tools like llama.cpp support this directly. It is much slower than a GPU because system memory bandwidth is far lower, but it is how very large models run on cheap servers: a dual-socket EPYC box with 768GB to 1.5TB of RAM runs DeepSeek R1 671B at roughly 3.5 to 8 tokens per second on CPU alone.

How do I run a model that is bigger than my VRAM?

Use offloading. Tools like llama.cpp and LM Studio keep as many layers as fit in GPU VRAM and run the rest on the CPU from system RAM. Speed drops steeply with the share of the model that sits in slow memory, so it is a usable compromise rather than a free lunch. Mixture-of-experts models suffer the least because only a few experts are active per token.

How much memory does a trillion-parameter model need?

All the weights must be loaded regardless of how many are active per token, so a 1-trillion-parameter mixture-of-experts model like Kimi K2 needs roughly 600 to 620GB at 4-bit. That fits on a single 8-GPU node (8 x H100 = 640GB) or a workstation with more than a terabyte of RAM. At full BF16 precision the same model needs well over a terabyte and spans multiple nodes.

What is the highest-end GPU for running LLMs in 2026?

For a single chip, NVIDIA Blackwell Ultra (B300) is the production flagship at 288GB of HBM3e, with the new Vera Rubin generation (288GB HBM4) entering production in mid-2026. Rack-scale systems like the GB300 NVL72 pool more than 20TB of GPU memory across 72 GPUs. None of these are desktop hardware; the realistic consumer ceiling is the RTX 5090 (32GB) or a workstation RTX PRO 6000 Blackwell (96GB).

Does context length affect how much RAM I need?

Yes. The KV cache, the model's working memory for the current conversation, grows with context length and can add several gigabytes on a long prompt or document. Budget headroom beyond the model weights, especially if you plan to use long contexts or feed in large files.
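The KV cache answer can be made concrete with the standard size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an unquantized 16-bit cache; those architecture values are not from this article, and other models or a quantized cache will give different numbers. Results are in GiB (2^30 bytes), which is slightly larger than the GB used elsewhere in this guide.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
for context in (8_192, 32_768, 131_072):
    size = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, context_tokens=context)
    print(f"{context:>7} tokens of context: about {size:.0f} GiB of KV cache")
```

For this shape the cache grows linearly at 128 KiB per token, so a long document can cost as much memory as the Q4 weights themselves.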

Sources

OpenAI (2025). Introducing gpt-oss. OpenAI. https://openai.com/index/introducing-gpt-oss/

NVIDIA (2026). NVIDIA DGX Spark (product specifications). NVIDIA. https://www.nvidia.com/en-us/products/workstations/dgx-spark/

NVIDIA (2026). GB200 NVL72 (product specifications). NVIDIA. https://www.nvidia.com/en-us/data-center/gb200-nvl72/

NVIDIA (2026). GTC 2026: Vera Rubin and the next generation of AI. NVIDIA Blog. https://blogs.nvidia.com/blog/gtc-2026-news/

Spheron (2026). GPU Requirements Cheat Sheet 2026. Spheron Blog. https://www.spheron.network/blog/gpu-requirements-cheat-sheet-2026/

Digital Spaceport (2025). How To Run DeepSeek R1 671B Fully Locally On a $2000 EPYC Server. Digital Spaceport. https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/

Saplin, M. (2026). llama.cpp: CPU vs GPU, shared VRAM and Inference Speed. DEV Community. https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

MacRumors (2025). Mac Studio With M3 Ultra Runs Massive DeepSeek R1 AI Model Locally. MacRumors. https://www.macrumors.com/2025/03/17/apples-m3-ultra-runs-deepseek-r1-efficiently/

TechRadar (2025). Apple Mac Studio M3 Ultra workstation can run Deepseek R1 671B AI model entirely in memory using less than 200W. TechRadar Pro. https://www.techradar.com/pro/apple-mac-studio-m3-ultra-workstation-can-run-deepseek-r1-671b-ai-model-entirely-in-memory-using-less-than-200w-reviewer-finds
