Capital & Compute

How Much RAM Do You Need to Run a Local LLM?

How much RAM you need to run a local LLM in 2026: what models 8GB to 512GB can run, the per-billion-parameter math, and the device for each tier.

By Capital & Compute

The single number that decides which local LLM you can run is your RAM, and the math is simple enough to do in your head. A model needs roughly half a gigabyte of memory per billion parameters at 4-bit quantization, the format almost everyone runs locally. So 8GB of RAM tops out near a small 7B model, 32GB reaches a 32B, and DeepSeek R1, one of the largest open models at 671 billion parameters, needs a machine with 512GB to hold it. Above that sit the trillion-parameter models that take a multi-GPU server or a workstation with more than a terabyte of RAM. This guide walks the entire ladder, from an 8GB laptop to a 1.1TB GPU node, covers the difference between system RAM and the much faster VRAM on a graphics card, and tells you which device fits each rung. Everything else is detail on top of that one ratio. In a hurry? The Can I Run This LLM? checker turns this math into a picker: choose your hardware and see which models fit, how fast, and whether owning, renting, or the API is cheaper.

  • ≈0.5 GB: memory per billion parameters at 4-bit (Q4). Add 15 to 30 percent for the context window and overhead, so budget closer to 0.75 GB per billion in practice.
  • 16 GB: enough to run OpenAI gpt-oss-20b, a capable reasoning model, entirely on a mainstream laptop, per OpenAI's own spec.
  • 512 GB: what it takes to hold DeepSeek R1 (671B) fully in memory. A maxed Mac Studio M3 Ultra is the one consumer box that can.

The math: half a gigabyte per billion parameters

A model is a pile of numbers (its parameters, or weights), and memory is just where those numbers sit while the model runs. So the size of the model and the precision of each number set the bill.

At full precision (FP16, the format models are trained in), each parameter takes 2 bytes, so a 7B model needs about 14GB just for weights. Almost nobody runs full precision locally. The standard for local inference is quantization: storing each weight in fewer bits with a small, usually acceptable loss in quality. The common rungs:

  • FP16 (full): 2 bytes per parameter, so ≈2 GB per billion.
  • Q8 (8-bit): ≈1 byte per parameter, ≈1 GB per billion. Near-lossless.
  • Q4 (4-bit): ≈0.5 bytes per parameter, ≈0.5 GB per billion. The practical default, and what every number in this guide assumes unless stated.

Q4 is the one that matters because it is where local LLMs became practical: it roughly quarters the memory cost versus full precision for a quality drop most people cannot feel in everyday use. That is a big part of why local LLMs got good in 2026.

So the working formula is: (usable RAM in GB) ÷ 0.75 ≈ the largest model in billions of parameters you can comfortably run at Q4. A 16GB machine with about 11GB free lands near a 14B model. The rest of this guide turns that into a shopping list.
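
The ratios above are easy to turn into a quick calculator. The sketch below is a rule-of-thumb helper built only from the numbers quoted in this section; the function and constant names are illustrative, not from any library.

```python
# Rule-of-thumb sizing, using only the ratios quoted in this guide.
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}  # weights only
BUDGET_GB_PER_BILLION_Q4 = 0.75  # Q4 weights plus context and overhead


def weights_gb(params_billion: float, fmt: str = "Q4") -> float:
    """Approximate memory for the weights alone, in GB."""
    return params_billion * GB_PER_BILLION[fmt]


def largest_comfortable_model(usable_ram_gb: float) -> float:
    """Largest Q4 model (in billions of parameters) that fits with headroom."""
    return usable_ram_gb / BUDGET_GB_PER_BILLION_Q4


if __name__ == "__main__":
    print(weights_gb(7, "FP16"))                    # 14.0
    print(round(largest_comfortable_model(11), 1))  # 14.7
```

The second result is the 16GB case from the text: about 11GB usable lands just under 15B, so a 14B model is the comfortable ceiling.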

VRAM, system RAM, and unified memory: the three kinds of memory

“How much RAM” hides a question that decides everything about how a local model behaves: which kind of memory. There are three, and they differ less in capacity than in speed.

  • System RAM (DDR5). The sticks in a normal PC or laptop. Cheap and available in huge amounts on a server, but slow by AI standards: a typical dual-channel desktop moves on the order of 80 to 100 GB/s. The CPU reads model weights from it. You can run very large models here, just slowly.
  • GPU VRAM (GDDR or HBM). The memory soldered onto a graphics card. Far faster: an RTX 5090 moves about 1.8 TB/s, and a data-center H200 about 4.8 TB/s, twenty to fifty times a desktop’s RAM. This is why a model that fits entirely in VRAM is so much quicker. The catch is capacity: consumer cards top out at 32GB, and the 80 to 192GB cards cost as much as a car.
  • Unified memory (Apple Silicon, NVIDIA DGX Spark). A single pool the CPU and GPU share, so there is no slow copy between them. Bandwidth sits in the middle: Apple’s M3 Ultra reaches about 819 GB/s, NVIDIA’s DGX Spark about 273 GB/s. This is the trick that lets a Mac hold a model far larger than any consumer GPU can, at a speed that is slower than that GPU but vastly faster than CPU RAM.

| Memory type | Typical bandwidth | Capacity you can buy | Cost per GB | Best for |
|---|---|---|---|---|
| System RAM (DDR5) | ~80–100 GB/s (desktop); higher on multi-channel servers | 16GB to 1.5TB+ | Lowest | Holding very large models on CPU, slowly |
| Unified memory | ~270–820 GB/s | 16GB to 512GB | Medium | Large models at moderate speed on one box |
| GPU VRAM (GDDR/HBM) | ~1,000–4,800 GB/s | 8GB to 192GB per card | Highest | Maximum speed for any model that fits |

The practical upshot: fit your model in the fastest memory it will fit in. A 13B model belongs in a 16GB GPU, not spread across 64GB of system RAM. A 671B model has no choice but to live in unified memory or across many GPUs. Everything below is really about matching a model to the right kind of memory, not just enough of it.
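
"Fit your model in the fastest memory it will fit in" can be written as a small selection rule. This is a minimal sketch: the `MemoryPool` type, the pool list, and the 1,000 GB/s figure for a generic 16GB GPU are illustrative assumptions, while the other capacities and bandwidths are the round numbers quoted in this section.

```python
from typing import NamedTuple, Optional


class MemoryPool(NamedTuple):
    name: str
    capacity_gb: float
    bandwidth_gb_s: float


# Hypothetical options to choose between; the GPU bandwidth is an assumed
# round number from the low end of the VRAM range in the table above.
POOLS = [
    MemoryPool("64GB system RAM", 64, 90),
    MemoryPool("16GB GPU VRAM", 16, 1000),
    MemoryPool("512GB unified memory", 512, 819),
]


def fastest_pool_that_fits(model_gb: float,
                           pools: list[MemoryPool]) -> Optional[MemoryPool]:
    """Return the highest-bandwidth pool that can hold the model, if any."""
    candidates = [p for p in pools if p.capacity_gb >= model_gb]
    return max(candidates, key=lambda p: p.bandwidth_gb_s, default=None)


if __name__ == "__main__":
    # A 13B model budgeted at 0.75 GB per billion is about 9.75GB.
    print(fastest_pool_that_fits(13 * 0.75, POOLS).name)  # 16GB GPU VRAM
    # DeepSeek R1 at about 404GB only fits the unified-memory box.
    print(fastest_pool_that_fits(404, POOLS).name)        # 512GB unified memory
```

The rule deliberately ignores price; it only answers which pool gives the most speed once capacity is satisfied.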

Quantization: the dial that sets the footprint

Quantization is the single biggest lever on memory, so it is worth seeing the full dial rather than just the Q4 default. Lower bits per weight means a smaller footprint and faster generation, traded against a gradual loss of quality.

| Format | Bits per weight | Memory per 1B params | Quality | When to use |
|---|---|---|---|---|
| FP16 / BF16 | 16 | ~2.0 GB | Full (reference) | Training, and serving when memory is not the constraint |
| Q8 | 8 | ~1.0 GB | Near-lossless | When you have headroom and want maximum fidelity |
| Q6 | 6 | ~0.75 GB | Very close to Q8 | A safe step down from Q8 |
| Q5 | 5 | ~0.65 GB | Slightly below Q6 | A middle ground on tight memory |
| Q4 | 4 | ~0.5 GB | Small, usually unnoticeable drop | The default for local inference |
| Q3 | 3 | ~0.4 GB | Noticeable degradation | Only to squeeze a model that almost fits |
| Q2 | 2 | ~0.3 GB | Heavy degradation | Last resort, often not worth it |

Q4 is the sweet spot on this dial: about a quarter of full precision's memory for a quality drop most people cannot feel. Every footprint figure in the tables below assumes Q4 unless noted.
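
The "memory per 1B params" column follows from bits per weight: one billion weights at b bits each is b/8 GB. The sketch below compares that ideal with the table's figures, which run slightly higher at the low-bit end, and uses the table's values to size a model. Names are illustrative.

```python
# Per-billion figures copied from the quantization table in this section.
TABLE_GB_PER_BILLION = {16: 2.0, 8: 1.0, 6: 0.75, 5: 0.65, 4: 0.5, 3: 0.4, 2: 0.3}


def ideal_gb_per_billion(bits: int) -> float:
    """1e9 weights * (bits / 8) bytes each = bits / 8 GB."""
    return bits / 8


def footprint_gb(params_billion: float, bits: int) -> float:
    """Weights-only footprint using the table's per-billion figure."""
    return params_billion * TABLE_GB_PER_BILLION[bits]


if __name__ == "__main__":
    for bits in TABLE_GB_PER_BILLION:
        print(f"{bits:>2}-bit: ideal {ideal_gb_per_billion(bits):.3f} GB/B, "
              f"table {TABLE_GB_PER_BILLION[bits]:.2f} GB/B, "
              f"70B model ~{footprint_gb(70, bits):.1f} GB")
```

Run on a 70B model, this spans about 140GB at 16-bit down to about 21GB at 2-bit, which is the whole dial in one loop.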

What model footprints actually look like

RAM is the gatekeeper because model sizes span such a wide range. A small model that runs on a phone and the largest open models, which need a server-class box, differ by more than two orders of magnitude in memory. Plotted on a linear axis the small models vanish; on a log scale you can see the whole ladder at once.

Figure: Local LLM memory footprint at 4-bit, by model (log-scale dot plot; axis: memory footprint at Q4, GB).

| Model | Footprint at Q4 |
|---|---|
| Llama 3.2 (3B) | ~2 GB |
| Qwen3 (8B) | ~5 GB |
| gpt-oss-20b | ~13 GB |
| Qwen3 (32B) | ~20 GB |
| Llama 3.3 (70B) | ~42 GB |
| gpt-oss-120b | ~63 GB |
| DeepSeek R1 (671B) | ~404 GB |

Memory footprint at 4-bit (Q4) for representative open models, from a 3B that runs on a phone to DeepSeek R1's 671B, which consumes about 404GB. The axis is logarithmic because the range spans more than 200x. Footprints are weights only; add headroom for context and the OS.

Source: Footprints derived from parameter counts at Q4; DeepSeek R1 figure per TechRadar and MacRumors reporting on the M3 Ultra test.

The system-RAM ladder: what each tier runs, on what device

This ladder is for system memory and Apple-style unified memory, the pool you size when you buy a laptop, a Mac, or a workstation. The VRAM ladder for graphics cards comes next. The table below is the whole guide in one view. “Usable for the model” assumes you leave headroom for the OS and a modest context window. Model footprints assume Q4. Devices are representative, not exhaustive.

| RAM | Usable for the model | Largest comfortable model (Q4) | Example models | Typical device | What you can actually do |
|---|---|---|---|---|---|
| 8 GB | ~3–4 GB | 3B, up to a tight 7B | Llama 3.2 3B, Qwen3 4B, Gemma 3 4B, Phi | Base MacBook Air, mainstream laptop, high-end phone | Offline chat, summarizing, autocomplete, simple retrieval over a few docs |
| 16 GB | ~10–11 GB | 8B comfortably, 13–14B tight | Qwen3 8B, Llama 3.1 8B, gpt-oss-20b (MoE) | Mid-range laptop, M-series Air/Pro | A genuinely useful daily assistant, decent coding help, RAG over a document set |
| 24–32 GB | ~18–26 GB | 14B to 32B dense | Qwen3 32B, Gemma 3 27B, gpt-oss-20b at full quality | RTX 4090 (24GB) / RTX 5090 (32GB), 32GB Mac | Near-frontier-lite quality, agentic coding, longer context windows |
| 48–64 GB | ~40–52 GB | 70B | Llama 3.3 70B, Qwen 72B | 64GB Mac, dual 24GB GPUs | Strong general reasoning, serious local coding, multi-document RAG |
| 96–128 GB | ~80–110 GB | 120B; 70B at Q8 | gpt-oss-120b (80GB), 70B at higher precision | NVIDIA DGX Spark (128GB), 128GB Mac Studio | Frontier-class open models; fine-tune up to 70B on the DGX Spark |
| 256 GB | ~220 GB | 200B-class, or several big models at once | Large MoE models, multi-model setups | High-RAM Mac Studio, multi-GPU workstation | Run a 200B model plus tooling, or two 70B models side by side |
| 512 GB+ | ~440 GB+ | 405B to 671B | DeepSeek R1/V3 671B, Llama 405B | Mac Studio M3 Ultra (512GB), 8x80GB GPU server | The largest open weights, held entirely in memory |

8GB: small models, real uses

This is the floor, and it is more useful than it sounds. After the operating system takes its cut you have roughly 3 to 4GB for a model, which is a 3B to 4B at Q4. Models like Llama 3.2 3B and Qwen3 4B handle summarization, drafting, autocomplete, and simple question-answering over a handful of documents without ever touching the network. What they are not is a reasoning engine: expect them to stumble on multi-step logic and longer context. On 8GB, a small fast model you actually use beats a larger one you cannot load.

16GB: the mainstream sweet spot

Sixteen gigabytes is where local AI stops being a demo. An 8B model (Qwen3 8B is about 5GB at Q4) leaves plenty of room for context and runs quickly on a modern laptop. This tier also unlocks the first genuinely strong option: OpenAI says its gpt-oss-20b runs on edge devices with just 16GB of memory, because its mixture-of-experts design activates only a fraction of its parameters per token. For most people who want a private, capable assistant on the machine they already own, 16GB is the answer.

24 to 32GB: the prosumer GPU tier

This is the high-end consumer graphics card bracket: an RTX 4090 carries 24GB of VRAM and the newer RTX 5090 carries 32GB. It runs 14B to 32B dense models at Q4, which is where open models start to feel close to the commercial frontier for everyday work. A 32B like Qwen3 32B (about 20GB) fits with room for a long context, and agentic coding becomes realistic. If you are choosing hardware specifically to run models, this tier is the best balance of capability and cost for most enthusiasts.

48 to 128GB: 70B models and the personal supercomputer

A 70B model at Q4 needs roughly 40 to 48GB, so 64GB is the entry point for the heavyweight open models like Llama 3.3 70B. Push to 128GB and you reach the most interesting recent category: the personal AI box. NVIDIA’s DGX Spark pairs 128GB of unified memory with 273 GB/s of bandwidth and, per NVIDIA, runs inference on models up to 200 billion parameters and fine-tunes up to 70B. A 128GB Mac Studio reaches the same class. This is also the tier where gpt-oss-120b lives: OpenAI says the 120B version runs within 80GB of memory.

512GB and up: the largest open weights

At the top of the ladder is one headline use case: running the biggest open models in existence. DeepSeek R1 at 671 billion parameters consumes about 404GB even at 4-bit, which is why it needs a 512GB machine to hold it with working room. The remarkable part is that this is now possible on a single desktop. As MacRumors reported, a Mac Studio with an M3 Ultra and 512GB of unified memory runs DeepSeek R1 locally, and a TechRadar reviewer measured it at roughly 17 to 18 tokens per second while drawing under 200 watts. The alternative, a multi-GPU server, costs far more and burns far more power, and is the start of the next ladder.

The VRAM ladder: consumer cards to rack-scale systems

If you run models on a graphics card rather than in system memory, this is the ladder that matters. VRAM is faster but scarcer, so the rungs are smaller and the prices climb steeply. The table runs from an entry consumer card to a full data-center rack that NVIDIA treats as one giant GPU.

| VRAM | Example hardware | Class | Largest model (Q4) | Notes |
|---|---|---|---|---|
| 8 GB | RTX 4060, RTX 3050 | Consumer | 7B, tight | Entry GPU; keep context short |
| 12 GB | RTX 3060, RTX 4070 | Consumer | 13B | Comfortable small-model card |
| 16 GB | RTX 4080, RTX 5060 Ti 16GB | Consumer | 14B with context | Good 8B card with long context |
| 24 GB | RTX 3090, RTX 4090, RX 7900 XTX | Prosumer | 32B | The long-time enthusiast standard |
| 32 GB | RTX 5090 | Consumer flagship | 32B comfortably | ~1.8 TB/s, the fastest consumer card |
| 48 GB | RTX 6000 Ada, L40S | Workstation | 70B, tight | Single-card 70B becomes possible |
| 96 GB | RTX PRO 6000 Blackwell | Workstation | 120B | The most VRAM on a non-data-center card |
| 80–141 GB | A100 / H100 (80GB), H200 (141GB) | Data center | 70B at FP16, 100B+ at Q4 | HBM, ~3.3–4.8 TB/s bandwidth |
| 192–288 GB | B200 (192GB), B300 Blackwell Ultra (288GB) | Data center flagship | 200B+ on a single GPU | B300 is the current production flagship |
| 13.4–20.7 TB | GB200 NVL72 / GB300 NVL72 | Rack-scale | Trillion-parameter, served to thousands | 72 GPUs wired as one |

For most people the story stops at 32GB: an RTX 5090 is the fastest card you can put in a desktop and runs anything up to a 32B model briskly. Step up to workstation cards and the RTX PRO 6000 Blackwell carries 96GB, enough for a 120B model on one card. Above that you are buying data-center silicon: an H100 holds 80GB, an H200 holds 141GB, and a single H200 can just hold a 70B model at full FP16 precision (about 140GB of weights), though with almost no room left for context.

The current top of the single-GPU ladder is NVIDIA’s Blackwell generation. The B200 carries 192GB of HBM3e, and the Blackwell Ultra B300, the flagship in production as of mid-2026, raises that to 288GB at roughly 8 TB/s. Beyond a single chip, NVIDIA stitches 72 GPUs into one rack-scale unit: the GB200 NVL72 pools 13.4 TB of fast GPU memory, and the GB300 NVL72 (built on B300s) pushes past 20 TB. These are the boxes that serve frontier models to millions of users, not desktop hardware, but they are the literal ceiling of the ladder.

When a model does not fit: offloading and the bandwidth cliff

You do not have to fit a model entirely in one kind of memory. Tools like llama.cpp and LM Studio let you split a model, keeping some layers in fast VRAM and spilling the rest into system RAM where the CPU handles them. This is how a 24GB card runs a 70B model at all. The cost is speed, and it is steep.

Generation slows roughly in proportion to how much of the model lives in slow memory. Offload half the layers and you lose far more than half the speed, because every token still has to read those CPU-side weights across the much slower memory bus, and that read dominates the time per token. The practical guidance: a model that fits entirely in VRAM is the goal; partial offload is a usable compromise; a model running mostly from system RAM will be slow no matter how fast your GPU is. The one happy exception is mixture-of-experts models, where only a few experts are active per token, so offloading the idle ones hurts far less.
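
The cost of offloading can be estimated with a simple bandwidth-bound model: each token reads every weight once, so the time per token is the VRAM-resident bytes divided by VRAM bandwidth plus the RAM-resident bytes divided by RAM bandwidth. The sketch below assumes a dense model and ignores compute, PCIe transfers, and caching, so treat its output as a rough ceiling rather than a benchmark. The 42GB, 1,008 GB/s, and 83 GB/s inputs are this article's figures for Llama 3.3 70B at Q4, an RTX 4090, and desktop DDR5.

```python
def tokens_per_second(model_gb: float, fraction_in_ram: float,
                      vram_gb_s: float, ram_gb_s: float) -> float:
    """Bandwidth-bound estimate for a dense model split across VRAM and RAM."""
    gpu_time = model_gb * (1 - fraction_in_ram) / vram_gb_s  # seconds per token
    cpu_time = model_gb * fraction_in_ram / ram_gb_s         # seconds per token
    return 1 / (gpu_time + cpu_time)


if __name__ == "__main__":
    # Fractions are illustrative; 0.0 is the all-VRAM ideal for comparison.
    for fraction in (0.0, 0.25, 0.5, 1.0):
        rate = tokens_per_second(42, fraction, vram_gb_s=1008, ram_gb_s=83)
        print(f"{fraction:.0%} in system RAM: ~{rate:.1f} tokens/s")
```

Under these assumptions the estimate falls from about 24 tokens per second fully in VRAM to under 4 with half the weights in system RAM, and to about 2 when everything runs from system RAM. The slow half sets the pace.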

MoE vs dense: why a trillion-parameter model can run on one node

The headline parameter count can badly mislead you on memory, because of how modern large models are built. A dense model uses every parameter for every token, so a 70B dense model does 70B parameters’ worth of work each step. A mixture-of-experts (MoE) model holds many specialist sub-networks but activates only a few per token: Kimi K2 has about 1 trillion total parameters but only 32B active at a time, and gpt-oss-120b and DeepSeek R1 are MoE too.

The rule that follows is worth memorizing: total parameters set how much memory you need; active parameters set how fast it runs. All the weights must be loaded, so a 1T-parameter MoE still needs roughly 600GB at Q4. But because only 32B are active per token, it generates as quickly as a 32B dense model would, far faster than a 671B dense model. This is exactly why a trillion-parameter model can run on a single 8-GPU node when a much smaller dense model would choke: the memory holds the whole thing, and the speed only ever pays for the active slice.
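
The total-versus-active rule can be made concrete with two numbers per model. This is a sketch using the flat 0.5 GB per billion Q4 ratio; the `Model` class and helper names are illustrative. The flat ratio gives about 500GB of weights for a 1T model, while the text quotes roughly 600GB for the real one, so read the memory figure as a lower bound.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions


def memory_gb(m: Model, gb_per_billion: float = 0.5) -> float:
    """Memory is set by TOTAL parameters: every weight must be loaded."""
    return m.total_b * gb_per_billion


def read_per_token_gb(m: Model, gb_per_billion: float = 0.5) -> float:
    """Speed is set by ACTIVE parameters: only these are read per token."""
    return m.active_b * gb_per_billion


if __name__ == "__main__":
    models = [
        Model("Kimi K2 (MoE)", total_b=1000, active_b=32),
        Model("70B dense", total_b=70, active_b=70),
    ]
    for m in models:
        print(f"{m.name}: holds ~{memory_gb(m):.0f} GB, "
              f"reads ~{read_per_token_gb(m):.0f} GB per token")
```

The MoE needs more than ten times the memory of the 70B dense model yet reads less than half as much per token, which is the whole trick.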

Beyond 512GB: CPU servers, multi-GPU nodes, and trillion-parameter models

Past half a terabyte, consumer hardware runs out (a maxed Mac Studio stops at 512GB) and you move into two server-shaped options.

CPU servers with a terabyte of RAM. A dual-socket AMD EPYC workstation takes 768GB to 1.5TB of DDR5 across many memory channels, which is enough to hold even the 671B models in higher precision. The trade is speed: running entirely on CPU, builders report DeepSeek R1 671B at roughly 3.5 to 8 tokens per second, depending on quantization and memory channels, on rigs that can cost as little as $2,000 used. It is the cheapest way to touch a frontier-size model, and the slowest. Memory bandwidth, set by the number of populated channels, matters more here than core count.

Multi-GPU nodes. Stack eight data-center cards and the VRAM adds up: 8 x H100 gives 640GB, and 8 x H200 gives about 1.1TB. This is enough to run the largest open weights in fast memory. Per a 2026 GPU sizing cheat sheet, Llama 405B and DeepSeek V3-class models are served on 8-GPU nodes, typically at FP8. Kimi K2, the 1-trillion-parameter MoE, fits on a single 8 x H100 node at Q4 because its weights pack to roughly 620GB, just under the 640GB ceiling. Run the same model at full BF16 precision and it needs well over a terabyte, which is multi-node territory.
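
The node arithmetic above is simple enough to check in a few lines. This sketch uses the capacities and model sizes quoted in the text and sizes the BF16 case at 2 GB per billion parameters; it counts weights only and ignores KV cache and activations, so real deployments need more headroom. Function names are illustrative.

```python
import math


def node_vram_gb(gpus: int, gb_per_gpu: float) -> float:
    """Pooled VRAM of one node."""
    return gpus * gb_per_gpu


def fits_on_node(model_gb: float, gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in one node's pooled VRAM."""
    return model_gb <= node_vram_gb(gpus, gb_per_gpu)


def nodes_needed(model_gb: float, gb_per_gpu: float, gpus_per_node: int = 8) -> int:
    """Minimum node count for the weights alone."""
    return math.ceil(model_gb / node_vram_gb(gpus_per_node, gb_per_gpu))


if __name__ == "__main__":
    print(node_vram_gb(8, 80), node_vram_gb(8, 141))  # 640 1128
    kimi_q4_gb = 620           # figure quoted in the text
    kimi_bf16_gb = 1000 * 2.0  # 1T parameters at 2 bytes each
    print(fits_on_node(kimi_q4_gb, 8, 80))     # True
    print(fits_on_node(kimi_bf16_gb, 8, 141))  # False
    print(nodes_needed(kimi_bf16_gb, 141))     # 2
```

At Q4 the model leaves only about 20GB spare on an 8 x H100 node, so context length and batch size are tightly constrained even though the weights fit.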

The trillion-parameter ceiling. At the very top, full-precision frontier models and high-concurrency serving spill across many nodes linked by NVLink and InfiniBand. This is what the NVL72 racks above are for. For an individual, the realistic options are: a CPU server for slow-but-cheap access to a 671B model, or renting an 8-GPU node by the hour in the cloud. Owning the multi-GPU hardware outright is a six-figure decision that only makes sense at sustained, heavy load, which is the same buy-versus-rent math we run for decentralized GPU compute.

Capacity is not speed: the bandwidth catch

Here is the trap that the RAM-per-billion math hides. Having enough memory to load a model only means it will run, not that it will run well. Every token a model generates requires reading its entire active parameter set out of memory, so generation speed is set by memory bandwidth, not capacity. That is why the same DeepSeek R1 that loads on a 512GB Mac Studio generates at a usable-but-deliberate ~17 tokens per second rather than the hundreds you get from a small model on a fast GPU.

The spread is enormous. The slow system RAM that lets a cheap server hold a 671B model moves bytes about fifty times slower than the HBM on a data-center GPU, which is the entire reason that same model crawls on CPU and flies on an H200. Plotted on a log scale, the memory you can afford and the memory that is fast sit at opposite ends.
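
Because each token requires reading the active weights once, bandwidth divided by the active bytes gives a rough upper bound on generation speed. The sketch below applies that to the bandwidth figures used in this article. One input is not from this article: the 37B active-parameter count for DeepSeek R1 is taken from DeepSeek's published model description, so treat it as an outside assumption.

```python
# Bandwidth figures (GB/s) as quoted in this article.
BANDWIDTH_GB_S = {
    "Desktop DDR5": 83,
    "DGX Spark": 273,
    "Apple M3 Ultra": 819,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
    "H100": 3350,
    "H200": 4800,
    "B300": 8000,
}


def ceiling_tokens_per_second(active_params_billion: float,
                              bandwidth_gb_s: float,
                              gb_per_billion: float = 0.5) -> float:
    """Upper bound: bandwidth / bytes that must be read per token (Q4 default)."""
    return bandwidth_gb_s / (active_params_billion * gb_per_billion)


if __name__ == "__main__":
    # An 8B dense model at Q4 reads about 4GB per token.
    for name, bandwidth in BANDWIDTH_GB_S.items():
        print(f"{name}: up to ~{ceiling_tokens_per_second(8, bandwidth):.0f} tokens/s")
    # DeepSeek R1 on the M3 Ultra, assuming 37B active parameters.
    print(round(ceiling_tokens_per_second(37, 819), 1))  # 44.3
```

The ceiling for DeepSeek R1 on the M3 Ultra comes out near 44 tokens per second against the measured 17 to 18, a reminder that real throughput sits well below the bandwidth bound. Capacity is ignored here: the 8B example fits everywhere, but larger models would not.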

Figure: Memory bandwidth by type, system RAM to data-center GPU (log-scale dot plot; axis: memory bandwidth, GB/s).

| Memory | Bandwidth |
|---|---|
| Desktop DDR5 (system RAM) | ~83 GB/s |
| DGX Spark (unified) | 273 GB/s |
| Apple M3 Ultra (unified) | 819 GB/s |
| RTX 4090 (GDDR6X) | ~1,008 GB/s |
| RTX 5090 (GDDR7) | ~1,792 GB/s |
| H100 (HBM3) | ~3,350 GB/s |
| H200 (HBM3e) | ~4,800 GB/s |
| B300 Blackwell Ultra (HBM3e) | ~8,000 GB/s |

Memory bandwidth by memory type, from desktop DDR5 to a Blackwell Ultra B300. Higher is faster generation. The axis is logarithmic. Figures are manufacturer specifications; system RAM varies with channel count.

Source: Manufacturer specifications (NVIDIA, Apple, AMD); DGX Spark figure per NVIDIA.

The economics changed in 2026

A year ago, the pitch for buying a big-memory machine was buy-once, run-free. That math got worse in 2026 for a specific reason: the resource local AI runs on is exactly the one that spiked in price. Memory makers have run the most lucrative shortage in chip history, which you can watch in real time on our memory price tracker, and the cost flowed straight through to devices. On June 25, 2026, Apple raised prices across its Mac and iPad line, pushing the Mac Studio M3 Ultra (the box people buy for the largest models) from $3,999 to $5,299. The increases scaled with memory density, which is to say the cost of memory, the local-AI tax, was the whole story.

That does not kill the case for buying, but it sharpens the question. If your reasons are privacy, offline use, or a heavy steady workload, owning the hardware still wins. If you just want occasional access to a frontier model, the falling price of API tokens makes renting the better near-term math. Size the machine to the largest model you will genuinely use, not the largest the tier could theoretically hold, and check current model specs and sizes before you commit to a memory budget. On timing, the RAM price forecast is blunt: relief is not expected before late 2027, so buy what you need now rather than waiting out the shortage.

Frequently asked questions

How much RAM do I need to run a local LLM?
For a useful general-purpose model, 16GB is the practical minimum and runs 8B models comfortably. 8GB works for small 3B to 4B models. To run a 70B model you need about 64GB, and the very largest open models (671B) require 512GB.
Can I run a local LLM with 8GB of RAM?
Yes, but only small models. After the operating system takes its share you have roughly 3 to 4GB for the model, which fits a 3B to 4B model at 4-bit quantization. That is enough for chat, summarizing, and autocomplete, but not for heavy reasoning.
Is VRAM or system RAM better for local LLMs?
GPU VRAM is faster because of its higher memory bandwidth, so a model that fits entirely in VRAM generates faster. System RAM (or Apple unified memory) lets you load far larger models for the money, but bandwidth is usually lower, so big models run slower. The ideal is enough fast memory to hold your target model.
What is the cheapest way to run a large model locally?
The cheapest route by sticker price is a used CPU server: a rig costing as little as $2,000 can hold DeepSeek R1 671B, but it is also the slowest option. For usable speed on very large models, Apple unified memory is the better value: a Mac Studio holds models that would need a multi-GPU server costing several times more. For models up to about 32B, a single high-end consumer GPU like an RTX 5090 is the better value and much faster.
Does quantization hurt quality?
Going from full precision to 8-bit is nearly lossless. 4-bit (Q4) is the common local standard and trades a small, usually unnoticeable quality drop for roughly a quarter of the memory. Below 4-bit the quality loss becomes more visible, so Q4 is the practical floor for most uses.
Can I run a local LLM on CPU only, without a GPU?
Yes. Any model that fits in system RAM will run on the CPU, and tools like llama.cpp support this directly. It is much slower than a GPU because system memory bandwidth is far lower, but it is how very large models run on cheap servers: a dual-socket EPYC box with 768GB to 1.5TB of RAM runs DeepSeek R1 671B at roughly 3.5 to 8 tokens per second on CPU alone.
How do I run a model that is bigger than my VRAM?
Use offloading. Tools like llama.cpp and LM Studio keep as many layers as fit in GPU VRAM and run the rest on the CPU from system RAM. Speed drops roughly in proportion to how much of the model sits in slow memory, so it is a usable compromise rather than a free lunch. Mixture-of-experts models suffer the least because only a few experts are active per token.
How much memory does a trillion-parameter model need?
All the weights must be loaded regardless of how many are active per token, so a 1-trillion-parameter mixture-of-experts model like Kimi K2 needs roughly 600GB at 4-bit. That fits on a single 8-GPU node (8x H100 = 640GB) or a workstation with more than a terabyte of RAM. At full BF16 precision the same model needs well over a terabyte and spans multiple nodes.
What is the highest-end GPU for running LLMs in 2026?
For a single chip, NVIDIA Blackwell Ultra (B300) is the production flagship at 288GB of HBM3e, with the new Vera Rubin generation (288GB HBM4) entering production in mid-2026. Rack-scale systems like the GB300 NVL72 pool more than 20TB of GPU memory across 72 GPUs. None of these are desktop hardware; the realistic consumer ceiling is the RTX 5090 (32GB) or a workstation RTX PRO 6000 Blackwell (96GB).
Does context length affect how much RAM I need?
Yes. The KV cache, the model's working memory for the current conversation, grows with context length and can add several gigabytes on a long prompt or document. Budget headroom beyond the model weights, especially if you plan to use long contexts or feed in large files.
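
For a rough sense of scale, KV cache size can be estimated from a model's architecture: one key and one value vector per layer, per KV head, per token. The sketch below uses the standard formula with an FP16 cache. The example configuration (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the published Llama 3.1 8B config, not from this article, and some runtimes quantize the cache, which would shrink these numbers.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one sequence.

    The leading 2 covers one key tensor plus one value tensor per layer;
    bytes_per_value=2 assumes an unquantized FP16 cache.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed config: 32 layers, 8 KV heads, head_dim 128 (Llama 3.1 8B).
    for tokens in (8_192, 32_768, 131_072):
        print(f"{tokens:>7} tokens: ~{kv_cache_gb(32, 8, 128, tokens):.2f} GB")
```

Under these assumptions the cache is about 1GB at 8K tokens and about 17GB at 128K, so on a long context it can exceed the Q4 weights of the 8B model itself.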

Sources

OpenAI (2025). Introducing gpt-oss. OpenAI. https://openai.com/index/introducing-gpt-oss/

NVIDIA (2026). NVIDIA DGX Spark (product specifications). NVIDIA. https://www.nvidia.com/en-us/products/workstations/dgx-spark/

NVIDIA (2026). GB200 NVL72 (product specifications). NVIDIA. https://www.nvidia.com/en-us/data-center/gb200-nvl72/

NVIDIA (2026). GTC 2026: Vera Rubin and the next generation of AI. NVIDIA Blog. https://blogs.nvidia.com/blog/gtc-2026-news/

Spheron (2026). GPU Requirements Cheat Sheet 2026. Spheron Blog. https://www.spheron.network/blog/gpu-requirements-cheat-sheet-2026/

Digital Spaceport (2025). How To Run DeepSeek R1 671B Fully Locally On a $2000 EPYC Server. Digital Spaceport. https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/

Saplin, M. (2026). llama.cpp: CPU vs GPU, shared VRAM and Inference Speed. DEV Community. https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

MacRumors (2025). Mac Studio With M3 Ultra Runs Massive DeepSeek R1 AI Model Locally. MacRumors. https://www.macrumors.com/2025/03/17/apples-m3-ultra-runs-deepseek-r1-efficiently/

TechRadar (2025). Apple Mac Studio M3 Ultra workstation can run Deepseek R1 671B AI model entirely in memory using less than 200W. TechRadar Pro. https://www.techradar.com/pro/apple-mac-studio-m3-ultra-workstation-can-run-deepseek-r1-671b-ai-model-entirely-in-memory-using-less-than-200w-reviewer-finds
