If you’ve spent any time searching for “best local LLM for X GB VRAM,” you’ve probably noticed the same problem I ran into: almost every guide is a spec-sheet comparison. Parameter counts, benchmark scores, MMLU percentages — all useful in theory, but none of it tells you how a model actually behaves when you ask it something real.
So instead of trusting the spec sheets, I ran a real head-to-head test across 8 local models on my own hardware, using real prompts, and I’m writing up exactly what I found — including the dead ends, the wrong assumptions, and the actual numbers.
The Hardware
- GPU: AMD RX 9060 XT, 16GB VRAM
- CPU: Ryzen 5 5500
- OS: Linux Mint, running headless (no desktop GUI — more on why this matters below)
- Stack: Ollama + Open WebUI, ROCm backend
Finding #1: Removing the GUI actually matters
Most VRAM guides assume you’re running a desktop environment. If you’re building a dedicated AI box, you’re probably not — I stripped the GUI off this machine entirely and access everything through Open WebUI over the network.
That single change moved my usable VRAM ceiling from roughly 11GB to about 13.5GB. The desktop compositor and X server were quietly eating VRAM I could otherwise give to a bigger model or more context. If your AI server doesn’t need a monitor, don’t run a desktop environment on it.
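To put a number on this on your own machine, one option is to read the amdgpu driver’s VRAM counters from sysfs before and after disabling the desktop. This is a minimal sketch, assuming an AMD card exposed as card0 (the index may differ on your system); vram_usage is just an illustrative helper name, not part of any tool.

```shell
# Print used and total VRAM for an amdgpu device, in MiB.
# $1 = sysfs device directory (assumed default: card0; yours may be card1, etc.)
vram_usage() {
  local dev="${1:-/sys/class/drm/card0/device}"
  local used total
  # amdgpu exposes both counters as plain byte counts.
  used=$(cat "$dev/mem_info_vram_used") || return 1
  total=$(cat "$dev/mem_info_vram_total") || return 1
  echo "$(( used / 1048576 )) MiB used of $(( total / 1048576 )) MiB"
}

# Usage (on the AI box itself):
#   vram_usage
#   vram_usage /sys/class/drm/card1/device
```

Run it once with the desktop session active and once headless, with no model loaded in either case; the difference is roughly what the GUI was holding.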
Finding #2: Q6_K quantization is basically an endangered species
Going in, my rule was simple: Q6 or Q8 quantization only, never Q4. The idea is that Q4 throws away enough precision to measurably increase hallucination risk, while Q6/Q8 stay within a percent or two of full precision.
Here’s what I didn’t expect: almost none of the major model families ship an official Q6_K tag on Ollama’s library. I checked eight different models. Here’s what I actually found:
| Model | Official Q6_K on Ollama? | What was actually available |
|---|---|---|
| Phi-4-mini | ❌ No | Q4_K_M or Q8_0 only |
| Phi-4 (14B) | ❌ No | Q4_K_M or Q8_0 only |
| Qwen3.5 (all sizes) | ❌ No | Q4_K_M, Q8_0, BF16 only |
| Gemma 4 (all sizes) | ❌ No | Q4_K_M or QAT variants only |
| Qwen3 14B (official) | ❌ No | Q4_K_M, Q8_0, FP16 only |
| Ministral 3 14B | ❌ No | Q4_K_M only |
| Qwen2.5 14B | ✅ Yes | Real Q6_K tag exists |
| Mistral Nemo 12B | ✅ Yes | Real Q6_K tag exists |
Only two out of eight models had an official mid-tier quant. The rest forced a choice between “too compressed” (Q4) or “no VRAM headroom left” (Q8).
The workaround that actually worked: GGUF is a portable file format — it’s not exclusive to Ollama’s official library. Trusted community quantizers like bartowski and unsloth on Hugging Face publish the full quant ladder (Q5, Q6, Q8) for almost every popular model, and you can pull them straight into Ollama:
ollama pull hf.co/microsoft/phi-4-gguf:Q6_K
ollama pull hf.co/bartowski/gemma-4-12B-it-GGUF:Q6_K
This is how I ended up testing real Q6_K versions of Phi-4 and Gemma 4 12B that don’t officially exist on Ollama’s own library.
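The “Q6 or Q8, never Q4” rule can be written down as a small helper that picks the first acceptable quant from whatever a repo actually publishes. This is only a sketch: pick_quant and hf_model_ref are hypothetical names, and the tag list has to come from you (for example, copied from the repo’s file listing), since the helper does not query Hugging Face itself.

```shell
# Print the first quant in the preference list that is actually available.
# $1    = space-separated preference order, e.g. "Q6_K Q8_0"
# stdin = available quant tags, one per line
pick_quant() {
  local prefs="$1"
  local tags want
  tags=$(cat)
  for want in $prefs; do
    if printf '%s\n' "$tags" | grep -qxF -- "$want"; then
      printf '%s\n' "$want"
      return 0
    fi
  done
  return 1   # nothing acceptable: better to stop than to fall back to Q4
}

# Build the hf.co/<user>/<repo>:<quant> reference used with "ollama pull".
hf_model_ref() {
  printf 'hf.co/%s/%s:%s\n' "$1" "$2" "$3"
}

# Usage:
#   quant=$(printf 'Q4_K_M\nQ6_K\nQ8_0\n' | pick_quant "Q6_K Q8_0") &&
#     ollama pull "$(hf_model_ref bartowski gemma-4-12B-it-GGUF "$quant")"
```

Returning a failure when neither preferred quant exists keeps the “never Q4” rule explicit instead of silently settling for whatever is there.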
Finding #3: KV cache quantization is a genuine free lunch
This one surprised me. Ollama defaults to storing conversation context (the “KV cache”) at full FP16 precision, which eats VRAM fast as conversations get longer. There’s a setting to quantize that cache down to 8-bit instead:
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
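(Set via sudo systemctl edit ollama, which opens an override file for these lines; restart the service afterward with sudo systemctl restart ollama so they take effect.)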
Published benchmarks put the quality cost at a perplexity increase of roughly 0.002 to 0.05, which is genuinely negligible. In exchange, the 8-bit cache roughly halves the VRAM cost of context growth, which means longer conversations before you hit your ceiling. One caveat worth knowing: testing has found that Qwen-family models degrade sharply when the KV cache goes below 8-bit (don’t push to a Q4 cache), but at Q8 Qwen was fine.
It’s not the default in Ollama because flash attention, which the quantized cache depends on, isn’t cleanly supported on every GPU/backend combination. If you know your hardware handles it, there’s no real reason not to turn it on.
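The “roughly halves” figure follows from simple arithmetic on a plain (non-sliding-window) transformer cache: K and V each store layers × KV heads × head dimension values per token. The sketch below assumes that simple layout, and the model dimensions in the usage comment are hypothetical, not taken from any model in this article. The q8_0 size uses GGML’s block layout of 34 bytes per 32 values.

```shell
# Estimate KV cache size in MiB for a plain transformer cache.
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM TOKENS TYPE   (TYPE: f16 or q8_0)
kv_cache_mib() {
  local elems bytes
  # Factor of 2: one K tensor and one V tensor per layer.
  elems=$(( 2 * $1 * $2 * $3 * $4 ))
  case "$5" in
    f16)  bytes=$(( elems * 2 )) ;;
    # q8_0 packs 32 values into 34 bytes (32 int8 values + one fp16 scale).
    q8_0) bytes=$(( elems / 32 * 34 )) ;;
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}

# Usage with hypothetical dimensions (40 layers, 8 KV heads, head dim 128):
#   kv_cache_mib 40 8 128 8192 f16    # prints 1280
#   kv_cache_mib 40 8 128 8192 q8_0   # prints 680
```

The q8_0 result is slightly more than half of FP16 because each block carries its own scale factor. Models with sliding-window or shared-cache layers will come in below what this formula predicts.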
The Actual Test
Spec sheets don’t tell you how a model behaves, so I gave all 8 models the same real-world prompt:
“I’m on Linux, give me the sudo command to check GPU/VRAM usage.”
This is a simple, everyday question — and it turned out to separate the models cleanly. What I was checking for:
- Does it give a correct, real command?
- Does it correctly note that nvidia-smi/rocm-smi usually don’t need sudo at all (the honest answer)?
- Does it invent a tool that doesn’t exist?
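The “invented tool” check is easy to run mechanically: take every command name a model mentions and ask the shell whether it exists. A minimal sketch, with check_tools as a hypothetical helper name:

```shell
# Report which of the given command names can be found on this machine's PATH.
check_tools() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# Usage: pass the tool names exactly as the model wrote them.
#   check_tools rocm-smi nvtop amdinfo2
```

A “missing” result only means the command isn’t installed here, so it’s a prompt to search the distro’s package index before calling it a fabrication.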
Results
| Model | Quant | Result |
|---|---|---|
| Gemma 4 12B | Q6_K | ✅ Correct, caught the sudo nuance, added genuinely useful extras (nvtop, watch -n1) |
| gpt-oss:20b | native MXFP4 | ✅ Correct, caught the sudo nuance, correctly noted Intel iGPUs have no discrete VRAM to track |
| Qwen2.5 14B | Q6_K | ⚠️ Mostly correct, but invented a fake tool: amdxdpmgpuinfo |
| Mistral Nemo 12B | Q6_K | ⚠️ Correct info, but missed the sudo nuance entirely |
| Ministral 3 14B | Q6_K | ⚠️ Well-organized, but invented a fake tool: amdgpu-proctool |
| Qwen3 14B | Q5_K_M | ⚠️ Incorrectly prefixed sudo onto commands that never need it |
| Phi-4 (14B) | Q6_K | ⚠️ Correct commands, but missed the sudo nuance, suggested an overcomplicated workaround |
| Phi-4-mini | Q8_0 | ❌ Claimed no Linux CLI tools exist for checking VRAM (false), invented amdinfo2, cited a March 2023 knowledge cutoff unprompted |
Two real patterns emerged:
- Fabrication risk didn’t track cleanly with model size. Two of the 14B models (Qwen2.5 and Ministral 3) still confidently invented nonexistent tools. Bigger isn’t automatically safer.
- Phi-4-mini was the weakest performer by a wide margin, both on this test and on every other test that night: it repeatedly hedged, hallucinated tools, and cited a stale 2023 knowledge cutoff unprompted. Its much larger sibling, Phi-4 (14B), performed reasonably well on the same prompt.
The Tie-Breaker: Real Regex, Real Log Format
Gemma 4 12B and gpt-oss:20b were the clear top two, so I ran a harder, more decisive test: write a one-line bash pipeline to parse /var/log/syslog for OOM (out-of-memory) kernel kill events, extract the process name and PID, and rank the top 3 offenders by frequency.
The real Linux kernel OOM format looks like this:
Killed process 12345 (chromium) total-vm:1234567kB, anon-rss:123456kB, ...
PID first, then process name in parentheses.
gpt-oss:20b’s regex:
/Killed process ([0-9]+) \(([^\)]+)\)/
This matches the real format exactly.
Gemma 4 12B’s regex:
:[[:space:]]*([a-zA-Z0-9\._-]+) \[([0-9]+)\]
This expects the process name first, then the PID in square brackets — which doesn’t match real OOM syslog output at all. Worse: Gemma’s own written explanation described the format as killed process [process_name] (pid), directly contradicting the code it had just written. The explanation and the code disagreed with each other — a real hallucination under pressure, not just a stylistic weakness.
gpt-oss:20b won the tie-breaker decisively on technical correctness.
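For reference, a pipeline of the kind the prompt asked for might look like the sketch below. It is an illustration built on the Killed process pattern shown above, not either model’s verbatim answer, and it is wrapped in a function (top_oom_offenders, a hypothetical name) rather than written as a single line so each stage can be commented.

```shell
# Rank the top 3 OOM-killed process names in a syslog-style file.
# Output columns (tab-separated): kill count, process name, PIDs killed.
top_oom_offenders() {
  local log="${1:-/var/log/syslog}"
  local tab
  tab=$(printf '\t')
  # Stage 1 (sed):  keep only OOM kill lines, rewritten as "PID name".
  # Stage 2 (awk):  count kills per process name and collect the PIDs.
  # Stage 3 (sort): highest count first, name as tie-breaker.
  # Stage 4 (head): keep the top 3.
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]+)\).*/\1 \2/p' "$log" |
    awk '{
      pid = $1
      sub(/^[0-9]+ /, "")   # what remains in $0 is the process name
      count[$0]++
      pids[$0] = pids[$0] (pids[$0] ? "," : "") pid
    }
    END { for (n in count) print count[n] "\t" n "\t" pids[n] }' |
    sort -t "$tab" -k1,1nr -k2,2 |
    head -n 3
}

# Usage (reading the log may require elevated permissions on some systems):
#   top_oom_offenders /var/log/syslog
```

Splitting on the PID first keeps process names that contain spaces intact, which a plain whitespace split would break.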
Real VRAM Numbers
Once I had a winner, I checked actual VRAM usage at idle (not file size on disk — actual rocm-smi readings):
| Model | VRAM at idle | % of 16GB |
|---|---|---|
| Gemma 4 12B (Q6_K) | ~11GB | 69% |
| gpt-oss:20b (MXFP4) | ~12.8GB | 80% |
Gemma uses about 1.8GB less VRAM, even though gpt-oss’s MoE architecture activates only 3.6B of its 21B total parameters per token. That sparsity helps speed, not memory footprint: all the weights still have to sit in VRAM regardless of how many actually fire per token.
For long-context use specifically, gpt-oss’s own model card states roughly 0.2GB of KV cache per 8,192 tokens (about half that with Q8_0 cache quantization on). On a 16GB card that puts the rough ceiling somewhere between tens of thousands and a couple hundred thousand tokens, depending on how much VRAM is actually free once the weights are loaded. Gemma 4 uses a hybrid sliding-window + shared-KV-cache architecture across layers that’s architecturally even leaner per token, though I don’t have a precise published figure to quote for it directly.
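The context ceiling for gpt-oss can be worked out from the figures already quoted: 0.2GB of cache per 8,192 tokens, a 12.8GB idle footprint, a 16GB card, and the ~13.5GB usable ceiling from Finding #1. The sketch below is back-of-the-envelope only; it uses decimal megabytes, ignores compute buffers, and kv_token_budget is a hypothetical helper name.

```shell
# Estimate how many tokens of KV cache fit in the free VRAM.
# $1 = free VRAM in MB, $2 = KV cache cost in MB per 8,192 tokens
kv_token_budget() {
  echo $(( $1 * 8192 / $2 ))
}

# Usage with the article's numbers for gpt-oss:20b:
#   kv_token_budget 3200 200   # 16GB card minus 12.8GB idle, FP16 cache  -> 131072
#   kv_token_budget 3200 100   # same headroom, Q8_0 cache                -> 262144
#   kv_token_budget 700 200    # 13.5GB usable ceiling minus 12.8GB, FP16 -> 28672
```

Which headroom figure applies depends on how much of the card is really available once everything else is accounted for, so these are bounds to sanity-check against rocm-smi rather than promises.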
Final Verdict
Both models earned a permanent spot on this machine rather than picking one:
- Gemma 4 12B (Q6_K) — better VRAM efficiency, genuinely multimodal (text + image), better everyday accuracy
- gpt-oss:20b (native MXFP4) — won on hard technical correctness (the regex tie-breaker), purpose-built for tool/agentic use
Every other model tested was removed after the comparison. Total time from “let’s see what’s out there” to a confirmed, tested, real-world-validated final answer: one very long evening — but a much more confident result than trusting a spec sheet would have given me.
Key Takeaways for Anyone Doing This Themselves
- If your AI box doesn’t need a display, don’t run a desktop environment on it. Free VRAM for the taking.
- Don’t assume Q6 quantization exists just because it “should.” Check the actual tag list before planning around it — most major labs skip it entirely on the official Ollama library.
- Hugging Face GGUF repos (bartowski, unsloth) are a legitimate escape hatch when the official quant you want doesn’t exist — just verify the source is a known, trusted quantizer first.
- Turn on KV cache quantization. At Q8_0 specifically, the quality cost is close to zero and the VRAM savings are real.
- Test with real prompts, not benchmarks. A simple, honest question like “how do I check my GPU usage” separated 8 models more clearly than any spec sheet could have.
- Bigger doesn’t mean more honest. Some 14B models fabricated tools just as readily as smaller ones — model size didn’t predict hallucination risk in this test.