How to Calculate VRAM for Local LLM Runs in 2026
Running large language models locally has become mainstream — Ollama, LM Studio, and llama.cpp put production-quality inference within reach of any developer with a mid-range GPU. But the single most common failure point is VRAM: you download a 15 GB model file, fire up Ollama, and get an immediate out-of-memory error. This guide explains exactly what fills your VRAM and how to calculate it precisely.
Component 1: Model Weights
The dominant component is the model weights. The formula is straightforward: VRAM (GB) = parameters (billions) × bits per weight ÷ 8.
A 7B parameter model at FP16 (16 bits) requires 7 × 16 ÷ 8 = 14 GB. The same model at Q4_K_M (4.85 effective bits) needs 7 × 4.85 ÷ 8 ≈ 4.24 GB. That's the power of quantization.
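The weight arithmetic above can be sketched in a few lines of Python. The function name weight_vram_gb is hypothetical, and "GB" here means 10^9 bytes, matching the worked numbers in the text.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate weight memory in GB (10^9 bytes): params x bits / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # The two cases from the text: a 7B model at FP16 and at Q4_K_M.
    for label, bits in (("FP16", 16.0), ("Q4_K_M", 4.85)):
        print(f"7B @ {label}: {weight_vram_gb(7, bits):.2f} GB")
```

Treat the result as a floor rather than a budget: it covers the weights only, not the cache or runtime overhead.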
Component 2: The KV Cache
The KV (Key-Value) cache is a critical optimization: rather than recomputing attention for every previous token on each generation step, the model caches the key and value tensors from every transformer layer. Without it, generation would be O(n²) per token. With it, generation is O(n) — but all those cached tensors live in VRAM.
For Llama 3 8B at 8192 context (FP16): 2 (keys and values) × 32 layers × 8 KV heads × 128 head_dim × 8192 tokens × 2 bytes = 1,073,741,824 bytes, or about 1.07 GB. The same model at 128k context: about 17 GB, just from the cache. Modern GQA (grouped-query attention) architectures (Llama 3, Qwen 2.5, Mistral) dramatically reduce this by sharing KV heads across query heads, which is why Llama 3 8B can sustain much longer contexts than Llama 2 7B.
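To check the cache arithmetic, the factors listed above can be multiplied out directly. This is a minimal sketch: kv_cache_bytes is a hypothetical helper, the Llama 3 8B shape (32 layers, 8 KV heads, head_dim 128) comes from the text, and the 32-KV-head variant is an assumption standing in for a Llama 2 7B-style model without GQA.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes."""
    # Leading factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    shapes = {
        "GQA, 8 KV heads": dict(n_layers=32, n_kv_heads=8, head_dim=128),
        "MHA, 32 KV heads (assumed)": dict(n_layers=32, n_kv_heads=32, head_dim=128),
    }
    for name, shape in shapes.items():
        for ctx in (8192, 131072):
            size = kv_cache_bytes(context_len=ctx, **shape)
            print(f"{name}, {ctx} tokens: {size / 1e9:.2f} GB")
```

With these inputs, the 8-KV-head shape at 8192 tokens comes to exactly 2^30 bytes (about 1.07 GB), and the 32-KV-head shape needs four times as much at the same context length.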
Component 3: Runtime Overhead
Beyond weights and cache, every inference session carries overhead: CUDA driver and library initialization (~300–500 MB), activation buffers proportional to batch size, allocator fragmentation, and memory reserved for Metal/ROCm on non-NVIDIA hardware. The calculator applies a conservative 10% overhead on model weights plus a fixed 0.5 GB baseline — validated against real llama.cpp measurements across dozens of GPU + model combinations.
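The overhead rule described above (10% of weights plus a fixed 0.5 GB) can be folded into one total estimate. This is a sketch under assumptions: total_vram_gb is a hypothetical name, and simply summing weights, cache, and overhead is a guess at how the pieces combine, not a description of the calculator's actual code.

```python
def total_vram_gb(params_billions: float, bits_per_weight: float,
                  kv_cache_gb: float, overhead_frac: float = 0.10,
                  baseline_gb: float = 0.5) -> float:
    """Weights + KV cache + (fraction of weights + fixed baseline)."""
    weights_gb = params_billions * bits_per_weight / 8
    overhead_gb = weights_gb * overhead_frac + baseline_gb
    return weights_gb + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # Example inputs (assumed): 7B at Q4_K_M with a 1 GB KV cache.
    print(f"{total_vram_gb(7, 4.85, kv_cache_gb=1.0):.2f} GB")
```

Both overhead parameters are exposed as arguments so the estimate can be tightened if your own measurements differ.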
Choosing the Right Quantization
Q4_K_M is the community standard for good reason: it uses k-quantization (mixed precision within blocks) to preserve quality at the critical attention layers while aggressively compressing the feed-forward network. At 4.85 bits/weight vs FP16's 16, it achieves roughly 85% of the perplexity quality at about 30% of the weight memory. If you have spare headroom, Q5_K_M at 5.69 bits is noticeably better for tasks that require precise factual recall or code generation. Below Q4, quality degrades rapidly: only use Q3 or Q2 when a large model absolutely will not fit any other way.
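One way to apply this guidance is to pick the highest-precision format whose weights still fit a given budget. The sketch below uses only the bits-per-weight figures quoted in this guide. The table QUANT_BITS and the function best_quant_that_fits are hypothetical names, and the budget is assumed to be the VRAM left for weights after the cache and overhead are set aside.

```python
# Effective bits per weight, as quoted in the text.
QUANT_BITS = {"FP16": 16.0, "Q5_K_M": 5.69, "Q4_K_M": 4.85}


def best_quant_that_fits(params_billions: float, weight_budget_gb: float):
    """Return the highest-precision format whose weights fit, or None."""
    by_precision = sorted(QUANT_BITS.items(), key=lambda kv: kv[1], reverse=True)
    for name, bits in by_precision:
        if params_billions * bits / 8 <= weight_budget_gb:
            return name
    return None  # Nothing fits: consider a smaller model or CPU offload.


if __name__ == "__main__":
    for budget in (16.0, 6.0, 5.0, 4.0):
        print(f"8B model, {budget} GB for weights -> {best_quant_that_fits(8, budget)}")
```

Q3 and Q2 are left out of the table on purpose, in line with the advice to treat them as a last resort.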
GPU Split (Offloading to CPU RAM)
When a model doesn't fit in VRAM, both llama.cpp and Ollama support layer splitting: the --n-gpu-layers flag (llama.cpp) or the num_gpu parameter (Ollama) specifies how many transformer layers to keep on the GPU. The remaining layers run on the CPU out of system RAM, with intermediate results passed between the two over PCIe. Token generation speed degrades roughly linearly with the proportion of CPU-side layers: expect 2–5 tokens/second for heavily CPU-offloaded 70B models vs 20–40 t/s for fully GPU-resident 7B.
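A rough starting value for the layer count can be estimated by dividing the free VRAM by the per-layer weight size. This is only a sketch: gpu_layers_that_fit is a hypothetical helper, it assumes every layer is the same size (embeddings and output layers are ignored), and the 80-layer count for a 70B model is an assumption for illustration.

```python
import math


def gpu_layers_that_fit(n_layers: int, weights_gb: float,
                        free_vram_gb: float) -> int:
    """Number of layers that fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = weights_gb / n_layers
    fit = math.floor(free_vram_gb / per_layer_gb)
    return max(0, min(n_layers, fit))  # Clamp to the valid range.


if __name__ == "__main__":
    # Assumed example: 70B at Q4_K_M (70 x 4.85 / 8 GB of weights),
    # 80 layers, 20 GB of VRAM left after cache and overhead.
    weights_gb = 70 * 4.85 / 8
    print(gpu_layers_that_fit(80, weights_gb, free_vram_gb=20.0))
```

The result is a candidate value for --n-gpu-layers or num_gpu. Because of the equal-size assumption, it is safer to start a few layers lower and increase while watching actual memory use.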