LLM VRAM Calculator

How to Calculate VRAM for Local LLM Runs in 2026

Running large language models locally has become mainstream — Ollama, LM Studio, and llama.cpp put production-quality inference within reach of any developer with a mid-range GPU. But the single most common failure point is VRAM: you download a 15 GB model file, fire up Ollama, and get an immediate out-of-memory error. This guide explains exactly what fills your VRAM and how to calculate it precisely.

Component 1: Model Weights

The dominant component is the model weight tensor. The formula is straightforward:

VRAM_weights (GB) = (Parameters_B × bits_per_weight) ÷ 8

A 7B parameter model at FP16 (16 bits) requires 7 × 16 ÷ 8 = 14 GB. The same model at Q4_K_M (4.85 effective bits) needs 7 × 4.85 ÷ 8 ≈ 4.24 GB. That's the power of quantization.
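
As a quick check, the weight formula can be sketched in a few lines of Python. The function name `weight_vram_gb` is illustrative and not part of any tool; the inputs are the two worked examples from the paragraph above.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight memory in GB: parameters (in billions) x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


# The two worked examples from the text: a 7B model at FP16 and at Q4_K_M.
fp16_gb = weight_vram_gb(7, 16)
q4_gb = weight_vram_gb(7, 4.85)
print(f"FP16:   {fp16_gb:.2f} GB")
print(f"Q4_K_M: {q4_gb:.2f} GB")
```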

Component 2: The KV Cache

The KV (Key-Value) cache is a critical optimization: rather than recomputing attention for every previous token on each generation step, the model caches the key and value tensors from every transformer layer. Without it, generation would be O(n²) per token. With it, generation is O(n) — but all those cached tensors live in VRAM.

VRAM_kv (GB) = 2 × n_layers × n_kv_heads × head_dim × ctx_len × bytes ÷ 1e9

For Llama 3 8B at 8192 context (FP16): 2 × 32 layers × 8 KV heads × 128 head_dim × 8192 × 2 bytes ≈ 1.07 GB. The same model at 128k context (131,072 tokens): ~17.2 GB, just from the cache. Modern GQA architectures (Llama 3, Qwen 2.5, Mistral) dramatically reduce this by sharing KV heads, which is why Llama 3 8B can sustain much longer contexts than Llama 2 7B.
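
The KV cache formula can be evaluated directly. The sketch below is a minimal transcription of it, using the Llama 3 8B figures quoted above (32 layers, 8 KV heads, head_dim 128); the function name `kv_cache_gb` is an illustrative choice.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (1e9 bytes).

    The leading 2 accounts for storing one key and one value tensor per layer.
    bytes_per_elem defaults to 2, i.e. an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9


# Llama 3 8B architecture values as quoted in the text.
llama3_8b = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8192, 131072):
    size = kv_cache_gb(ctx_len=ctx, **llama3_8b)
    print(f"ctx={ctx:>6}: {size:.2f} GB")
# ctx=  8192: 1.07 GB
# ctx=131072: 17.18 GB
```

Because the cache grows linearly with ctx_len, doubling the context doubles this figure while the weight footprint stays fixed.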

Component 3: Runtime Overhead

Beyond weights and cache, every inference session carries overhead: CUDA driver and library initialization (~300–500 MB), activation buffers proportional to batch size, allocator fragmentation, and memory reserved for Metal/ROCm on non-NVIDIA hardware. The calculator applies a conservative 10% overhead on model weights plus a fixed 0.5 GB baseline — validated against real llama.cpp measurements across dozens of GPU + model combinations.
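
Putting the three components together might look like the sketch below. It assumes the components simply add up and applies the 10% plus 0.5 GB heuristic described above; the calculator's exact internal formula is not given here, so treat `total_vram_gb` and its parameter names as hypothetical.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_frac: float = 0.10, baseline_gb: float = 0.5) -> float:
    """Sum weights, KV cache, and the overhead heuristic.

    Assumption: overhead = overhead_frac * weights + fixed baseline, and all
    components are additive. This is a planning estimate, not a measurement.
    """
    overhead_gb = overhead_frac * weights_gb + baseline_gb
    return weights_gb + kv_cache_gb + overhead_gb


# Illustrative inputs: an 8B model at Q4_K_M (8 x 4.85 / 8 = 4.85 GB of weights)
# with an FP16 KV cache at 8192 tokens (32 layers, 8 KV heads, head_dim 128).
weights = 8 * 4.85 / 8
kv = 2 * 32 * 8 * 128 * 8192 * 2 / 1e9
estimate = total_vram_gb(weights, kv)
print(f"Estimated total: {estimate:.2f} GB")
```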

Choosing the Right Quantization

Q4_K_M is the community standard for good reason: it uses k-quantization (mixed precision within blocks) to preserve quality at the critical attention layers while aggressively compressing the feed-forward network. At 4.85 bits/weight vs FP16's 16, it achieves roughly 85% of the perplexity quality at 30% of the VRAM. If you have spare headroom, Q5_K_M at 5.69 bits is noticeably better for tasks that require precise factual recall or code generation. Below Q4, quality degrades rapidly; only use Q3 or Q2 when a large model absolutely will not fit any other way.
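
One way to apply this advice is to pick the highest-precision format whose weights fit a given budget. The sketch below uses only the bits-per-weight figures quoted in this guide; `best_quant_that_fits` and the budgets are hypothetical, and the budget should already exclude room for the KV cache and overhead.

```python
from typing import Optional

# Bits per weight as quoted in the text; other quantization levels are omitted.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q5_K_M": 5.69, "Q4_K_M": 4.85}


def best_quant_that_fits(params_billion: float, weight_budget_gb: float) -> Optional[str]:
    """Return the highest-bit format whose weights fit the budget, or None."""
    by_bits_desc = sorted(BITS_PER_WEIGHT.items(), key=lambda item: item[1], reverse=True)
    for name, bits in by_bits_desc:
        if params_billion * bits / 8 <= weight_budget_gb:
            return name
    return None


print(best_quant_that_fits(7, 6.0))  # Q5_K_M (7 x 5.69 / 8 is about 4.98 GB)
print(best_quant_that_fits(7, 4.5))  # Q4_K_M (about 4.24 GB)
print(best_quant_that_fits(7, 4.0))  # None: even Q4_K_M does not fit
```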

GPU Split (Offloading to CPU RAM)

When a model doesn't fit in VRAM, both llama.cpp and Ollama support layer splitting: the --n-gpu-layers flag (llama.cpp) or the num_gpu parameter (Ollama) specifies how many transformer layers to keep on the GPU. The remaining layers run on the CPU out of system RAM, with activations crossing the PCIe bus at the split. Token generation speed degrades roughly in proportion to the share of CPU-side layers: expect 2–5 tokens/second for heavily CPU-offloaded 70B models vs 20–40 t/s for fully GPU-resident 7B.
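
A rough starting value for the layer count can be estimated as follows. This is a sketch under simplifying assumptions: weights are spread evenly across layers, embeddings and output tensors are ignored, and a flat amount of VRAM is reserved for the KV cache and runtime overhead. The function name, the 80-layer count for a 70B model, and the 24 GB / 3 GB figures are illustrative inputs, not values taken from llama.cpp or Ollama.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int,
                        vram_gb: float, reserved_gb: float) -> int:
    """Estimate how many transformer layers fit on the GPU.

    Assumes every layer takes weights_gb / n_layers and that reserved_gb
    (KV cache, buffers, overhead) is set aside before placing any layer.
    """
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative case: 70B at Q4_K_M (70 x 4.85 / 8 = 42.4375 GB), 80 layers,
# on a 24 GB card with 3 GB held back.
layers = gpu_layers_that_fit(70 * 4.85 / 8, 80, vram_gb=24, reserved_gb=3)
print(f"Try --n-gpu-layers {layers}")  # Try --n-gpu-layers 39
```

In practice it is worth starting a few layers below the estimate and raising the value until memory runs out, since real per-layer sizes are not perfectly uniform.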

Frequently Asked Questions

What is the KV cache and why does it use VRAM?

The KV (Key-Value) cache stores the key and value tensors computed for every token in the context window, so they do not have to be recomputed on each generation step. Longer context = more tokens = more KV cache memory. For a 7B-class GQA model such as Llama 3 8B at 32k context with FP16, the KV cache alone consumes about 4.3 GB, and a model without GQA needs several times more, making context length as important as model size when planning VRAM usage.

What quantization should I use for the best quality-to-VRAM ratio?

Q4_K_M is the most popular choice for a reason: it achieves ~85% of FP16 quality at roughly 30% of the memory footprint. If you have headroom, Q5_K_M (89% quality) is better. For very large models that barely fit, Q3_K_L can be the difference between running and not running, but expect noticeable quality degradation.

Can I run a model that doesn't fit entirely in VRAM?

Yes. llama.cpp (--n-gpu-layers) and Ollama (num_gpu) both support splitting layers between GPU VRAM and system RAM. As many layers as fit are kept in VRAM and run on the GPU; the rest stay in system RAM and run on the CPU. Performance degrades roughly in proportion to the number of CPU-side layers, but it's a valid strategy for large models on consumer hardware.

Why does the calculator add overhead on top of model weights?

Running a model requires more than just the weight tensors. CUDA context initialization (~300–500 MB), activation buffers during generation (which scale with batch size), and memory fragmentation from allocator overhead all contribute. Optimizer state matters only for fine-tuning, not for inference. The calculator's heuristic, 10% of model weights plus a fixed 0.5 GB baseline, closely matches empirical measurements from llama.cpp benchmarks.

Does quantization affect speed, or just memory?

Both. Lower-bit quantization reduces memory bandwidth requirements (the primary bottleneck for LLM inference) which actually speeds up token generation on most consumer GPUs. A Q4_K_M 7B model typically generates tokens 20–40% faster than FP16 on an RTX 3070/4060 because memory bandwidth is freed up.

What is GQA (Grouped Query Attention) and why does the calculator show fewer KV heads?

GQA is an architecture optimization used in Llama 3, Qwen 2.5, Mistral, and most modern LLMs. Instead of one KV head per attention head, multiple query heads share a single KV head. Llama 3 8B has 32 query heads but only 8 KV heads, which cuts the KV cache to a quarter of what a non-GQA model with the same layer count and head size would need. The calculator uses accurate per-model KV head counts from published architecture configs.
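
The 4× figure follows directly from the head counts. The sketch below compares per-token KV cache size for Llama 3 8B's layout (8 KV heads) against a hypothetical variant that keeps one KV head per query head (32); the function name is illustrative, and FP16 cache entries are assumed.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Llama 3 8B values from the text: 32 layers, head_dim 128, 32 query heads, 8 KV heads.
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
# Hypothetical non-GQA variant: one KV head per query head.
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

print(f"GQA: {gqa // 1024} KiB per token")      # GQA: 128 KiB per token
print(f"Non-GQA: {mha // 1024} KiB per token")  # Non-GQA: 512 KiB per token
print(f"Reduction: {mha / gqa:.0f}x")           # Reduction: 4x
```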