Finding the Best Local LLM for a 16GB AMD GPU

If you’ve spent any time searching for “best local LLM for X GB VRAM,” you’ve probably noticed the same problem I ran into: almost every guide is a spec-sheet comparison. Parameter counts, benchmark scores, MMLU percentages — all useful in theory, but none of it tells you how a model actually behaves when you ask it something real.

So instead of trusting the spec sheets, I ran a real head-to-head test across 8 local models on my own hardware, using real prompts, and I’m writing up exactly what I found — including the dead ends, the wrong assumptions, and the actual numbers.

The Hardware

GPU: AMD RX 9060 XT, 16GB VRAM
CPU: Ryzen 5 5500
OS: Linux Mint, running headless (no desktop GUI — more on why this matters below)
Stack: Ollama + Open WebUI, ROCm backend

Finding #1: Removing the GUI actually matters

Most VRAM guides assume you’re running a desktop environment. If you’re building a dedicated AI box, you’re probably not — I stripped the GUI off this machine entirely and access everything through Open WebUI over the network.

That single change moved my usable VRAM ceiling from roughly 11GB to about 13.5GB. The desktop compositor and X server were quietly eating VRAM I could otherwise give to a bigger model or more context. If your AI server doesn’t need a monitor, don’t run a desktop environment on it.
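If you want to measure that difference on your own machine, here is a minimal Python sketch that reads VRAM usage straight from the amdgpu driver's sysfs counters. It assumes the card appears at /sys/class/drm/card0/device (the index may differ on your system) and that the driver exposes mem_info_vram_used and mem_info_vram_total as byte counts. The function name is my own.

```python
from pathlib import Path

def read_vram_gib(device_dir="/sys/class/drm/card0/device"):
    """Return (used, total) VRAM in GiB from amdgpu sysfs counters.

    Assumption: device_dir points at an amdgpu device; card0 is only a guess.
    """
    base = Path(device_dir)
    # Each file holds a single integer: a byte count.
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    gib = 1024 ** 3
    return used / gib, total / gib

# Usage on a machine with an amdgpu card:
#   used, total = read_vram_gib()
#   print(f"VRAM in use: {used:.2f} GiB of {total:.2f} GiB")
```

Running it once with the desktop session up and once headless, before loading any model, should show how much VRAM the compositor was holding.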

Finding #2: Q6_K quantization is basically an endangered species

Going in, my rule was simple: Q6 or Q8 quantization only, never Q4. The idea is that Q4 throws away enough precision to measurably increase hallucination risk, while Q6/Q8 stay within a percent or two of full precision.

Here’s what I didn’t expect: almost none of the major model families ship an official Q6_K tag on Ollama’s library. I checked eight different models. Here’s what I actually found:

Model	Official Q6_K on Ollama?	What was actually available
Phi-4-mini	❌ No	Q4_K_M or Q8_0 only
Phi-4 (14B)	❌ No	Q4_K_M or Q8_0 only
Qwen3.5 (all sizes)	❌ No	Q4_K_M, Q8_0, BF16 only
Gemma 4 (all sizes)	❌ No	Q4_K_M or QAT variants only
Qwen3 14B (official)	❌ No	Q4_K_M, Q8_0, FP16 only
Ministral 3 14B	❌ No	Q4_K_M only
Qwen2.5 14B	✅ Yes	Real Q6_K tag exists
Mistral Nemo 12B	✅ Yes	Real Q6_K tag exists

Only two out of eight models had an official mid-tier quant. The rest forced a choice between “too compressed” (Q4) or “no VRAM headroom left” (Q8).
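A quick back-of-the-envelope calculation shows why the middle tier matters so much on this card. The sketch below estimates weight size from parameter count and bits per weight. The bits-per-weight figures are approximate values commonly quoted for llama.cpp quant types, not exact numbers for any specific model, and the estimate ignores KV cache and runtime buffers.

```python
# Approximate bits per weight for common GGUF quant types.
# Assumption: K-quant mixes vary slightly by model, so these are estimates.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

def weights_gb(params_billions, quant):
    """Estimate weight size in GB (10^9 bytes) for a given quant type."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# Compare the three tiers for a 14B model.
for quant in BITS_PER_WEIGHT:
    print(f"14B at {quant}: ~{weights_gb(14, quant):.1f} GB")
```

By this estimate a 14B model needs roughly 8.5GB at Q4_K_M, 11.5GB at Q6_K, and 14.9GB at Q8_0. With about 13.5GB usable, Q6_K is the only tier above Q4 that leaves room for context.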

The workaround that actually worked: GGUF is a portable file format — it’s not exclusive to Ollama’s official library. Trusted community quantizers like bartowski and unsloth on Hugging Face publish the full quant ladder (Q5, Q6, Q8) for almost every popular model, and you can pull them straight into Ollama:

ollama pull hf.co/microsoft/phi-4-gguf:Q6_K
ollama pull hf.co/bartowski/gemma-4-12B-it-GGUF:Q6_K

This is how I ended up testing real Q6_K versions of Phi-4 and Gemma 4 12B that don’t officially exist on Ollama’s own library.

Finding #3: KV cache quantization is a genuine free lunch

This one surprised me. Ollama defaults to storing conversation context (the “KV cache”) at FP16 (16-bit) precision, which eats VRAM fast as conversations get longer. There’s a setting to quantize that cache down to 8-bit instead:

[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

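(Set via sudo systemctl edit ollama, which puts these lines in a drop-in override file. Restart the service with sudo systemctl restart ollama so they take effect.)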

Published benchmarks put the quality cost at a perplexity increase of roughly 0.002 to 0.05, which is negligible in practice. Because Q8_0 stores each cached value in about half the space of FP16, it roughly halves the VRAM cost of context growth, so conversations can run longer before you hit your ceiling. One caveat worth knowing: published testing found that Qwen-family models degrade sharply below 8-bit KV cache (don’t push to a Q4 cache), but at Q8 Qwen was fine.

Ollama leaves this off by default because flash attention, which KV cache quantization depends on, isn’t supported cleanly on every GPU/backend combination. If you know your hardware handles it, there’s no real reason not to turn it on.
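To see where the "roughly half" figure comes from, here is a small sketch of the standard KV cache size formula for a plain full-attention transformer. The model dimensions are hypothetical and chosen only for illustration; they are not taken from any model in this article. Architectures with sliding-window or shared-KV layers will use less than this estimate.

```python
# Bytes per stored element. q8_0 packs 32 int8 values plus one fp16 scale
# into a 34-byte block, so it costs slightly more than 1 byte per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, cache_type="f16"):
    """Estimate KV cache size in GiB, assuming every layer caches every token."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024 ** 3

# Hypothetical dimensions, for illustration only.
dims = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_tokens=8192)
for cache_type in BYTES_PER_ELEMENT:
    size = kv_cache_gib(**dims, cache_type=cache_type)
    print(f"{cache_type}: {size:.2f} GiB for {dims['n_tokens']} tokens")
```

With these made-up dimensions the cache drops from 1.25 GiB to about 0.66 GiB at 8,192 tokens. The ratio (34/64, or about 53%) is the same regardless of model size.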

The Actual Test

Spec sheets don’t tell you how a model behaves, so I gave all 8 models the same real-world prompt:

“I’m on Linux, give me the sudo command to check GPU/VRAM usage.”

This is a simple, everyday question — and it turned out to separate the models cleanly. What I was checking for:

Does it give a correct, real command?
Does it correctly note that nvidia-smi/rocm-smi usually don’t need sudo at all (the honest answer)?
Does it invent a tool that doesn’t exist?

Results

Model	Quant	Result
Gemma 4 12B	Q6_K	✅ Correct, caught the sudo nuance, added genuinely useful extras (`nvtop`, `watch -n1`)
gpt-oss:20b	native MXFP4	✅ Correct, caught the sudo nuance, correctly noted Intel iGPUs have no discrete VRAM to track
Qwen2.5 14B	Q6_K	⚠️ Mostly correct, but invented a fake tool: `amdxdpmgpuinfo`
Mistral Nemo 12B	Q6_K	⚠️ Correct info, but missed the sudo nuance entirely
Ministral 3 14B	Q6_K	⚠️ Well-organized, but invented a fake tool: `amdgpu-proctool`
Qwen3 14B	Q5_K_M	⚠️ Incorrectly prefixed `sudo` onto commands that never need it
Phi-4 (14B)	Q6_K	⚠️ Correct commands, but missed the sudo nuance, suggested an overcomplicated workaround
Phi-4-mini	Q8_0	❌ Claimed no Linux CLI tools exist for checking VRAM (false), invented `amdinfo2`, cited a March 2023 knowledge cutoff unprompted

Two real patterns emerged:

Fabrication risk didn’t track cleanly with model size. Two 14B models (Qwen2.5 and Ministral 3) still confidently invented nonexistent tools. Bigger isn’t automatically safer.
Phi-4-mini was the weakest performer by a wide margin, both on this test and on every other test that night. It repeatedly hedged, hallucinated tools, and cited a stale 2023 knowledge cutoff unprompted, even though its much larger sibling, Phi-4 (14B), performed reasonably well.
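A cheap first check for invented tools is to ask whether the command a model suggests exists on your machine at all. The sketch below uses Python's shutil.which for that. The function name is my own, and a miss only means "go verify this", since a real tool such as nvtop may simply not be installed.

```python
import shutil

def find_missing_tools(tool_names, path=None):
    """Return the names that cannot be found as executables on the search path."""
    # shutil.which returns None when no matching executable is found.
    # path=None means "use the PATH environment variable".
    return [name for name in tool_names if shutil.which(name, path=path) is None]

# Usage: mix real tools with one of the invented names from the results table.
#   print(find_missing_tools(["rocm-smi", "nvtop", "amdinfo2"]))
```

Anything the function reports is worth a quick package search before you trust the rest of the answer.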

The Tie-Breaker: Real Regex, Real Log Format

Gemma 4 12B and gpt-oss:20b were the clear top two, so I ran a harder, more decisive test: write a one-line bash pipeline to parse /var/log/syslog for OOM (out-of-memory) kernel kill events, extract the process name and PID, and rank the top 3 offenders by frequency.

The real Linux kernel OOM format looks like this:

Killed process 12345 (chromium) total-vm:1234567kB, anon-rss:123456kB, ...

PID first, then process name in parentheses.

gpt-oss:20b’s regex:

/Killed process ([0-9]+) \(([^\)]+)\)/

This matches the real format exactly.

Gemma 4 12B’s regex:

:[[:space:]]*([a-zA-Z0-9\._-]+) \[([0-9]+)\]

This expects the process name first, then the PID in square brackets, which doesn’t match real OOM syslog output at all. Worse, Gemma’s own written explanation described the format as killed process [process_name] (pid), which matches neither the real log format nor the regex it had just written. That is a real hallucination under pressure, not just a stylistic weakness.

gpt-oss:20b won the tie-breaker decisively on technical correctness.
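For reference, here is the same parsing logic as a minimal Python sketch. This is not either model's output. It reuses the pattern from the gpt-oss:20b regex above, and the sample log lines are made up for illustration.

```python
import re
from collections import defaultdict

# Kernel OOM format: "Killed process <pid> (<name>) ...".
OOM_PATTERN = re.compile(r"Killed process ([0-9]+) \(([^)]+)\)")

def top_oom_offenders(lines, n=3):
    """Return up to n (name, kill_count, pids) tuples, most frequent first."""
    pids_by_name = defaultdict(list)
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            pid, name = match.groups()
            pids_by_name[name].append(int(pid))
    # Rank process names by how many times each one was killed.
    ranked = sorted(pids_by_name.items(), key=lambda item: len(item[1]), reverse=True)
    return [(name, len(pids), pids) for name, pids in ranked[:n]]

# Made-up sample lines in the real kernel format.
sample = [
    "kernel: Out of memory: Killed process 101 (chromium) total-vm:1234567kB",
    "kernel: Out of memory: Killed process 202 (firefox) total-vm:7654321kB",
    "kernel: Out of memory: Killed process 303 (chromium) total-vm:1234567kB",
    "kernel: usb 1-1: new high-speed USB device",  # unrelated line, ignored
]
for name, count, pids in top_oom_offenders(sample):
    print(f"{count} {name} {pids}")
```

To run it against a real log, pass an open file instead of the sample list, for example open("/var/log/syslog", errors="replace").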

Real VRAM Numbers

Once I had a winner, I checked actual VRAM usage at idle (not file size on disk — actual rocm-smi readings):

Model	VRAM at idle	% of 16GB
Gemma 4 12B (Q6_K)	~11GB	69%
gpt-oss:20b (MXFP4)	~12.8GB	80%

Gemma uses about 1.8GB less VRAM, even though gpt-oss’s MoE architecture activates only 3.6B of its 21B total parameters per token. That sparsity helps speed, not memory footprint: all the weights still have to sit in VRAM regardless of how many actually fire per token.

For long-context use specifically, gpt-oss’s own model card states roughly 0.2GB of KV cache per 8,192 tokens (about half that with Q8_0 cache quantization on). With about 3.2GB left over at the idle figure above, that gives a rough ceiling of well over 100,000 tokens on a 16GB card. Gemma 4 uses a hybrid sliding-window and shared-KV-cache design across layers that should be even leaner per token, though I don’t have a precise published figure to quote for it.
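The arithmetic behind that context ceiling can be sketched in a few lines. It uses only figures quoted in this article (16GB card, about 12.8GB at idle, about 0.2GB of KV cache per 8,192 tokens). It also assumes the idle reading contains no context yet and ignores compute buffers, so treat the result as an optimistic upper bound. The function name is my own.

```python
def max_context_tokens(vram_total_gb, model_idle_gb, kv_gb_per_block, block_tokens=8192):
    """Estimate how many tokens of KV cache fit in the VRAM left after loading the model."""
    headroom_gb = vram_total_gb - model_idle_gb
    # round() rather than int() so float error doesn't shave off a token.
    return round(headroom_gb / kv_gb_per_block * block_tokens)

# Figures from this article; the Q8_0 value is the "about half" estimate.
fp16_tokens = max_context_tokens(16, 12.8, 0.2)
q8_tokens = max_context_tokens(16, 12.8, 0.1)
print(f"FP16 cache: ~{fp16_tokens:,} tokens; Q8_0 cache: ~{q8_tokens:,} tokens")
```

That works out to roughly 131,000 tokens with an FP16 cache and about twice that with Q8_0. The model's own maximum context length still applies on top of whatever VRAM allows.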

Final Verdict

Rather than picking one, I gave both models a permanent spot on this machine:

Gemma 4 12B (Q6_K) — better VRAM efficiency, genuinely multimodal (text + image), better everyday accuracy
gpt-oss:20b (native MXFP4) — won on hard technical correctness (the regex tie-breaker), purpose-built for tool/agentic use

Every other model tested was removed after the comparison. Total time from “let’s see what’s out there” to a confirmed, tested, real-world-validated final answer: one very long evening — but a much more confident result than trusting a spec sheet would have given me.

Key Takeaways for Anyone Doing This Themselves

If your AI box doesn’t need a display, don’t run a desktop environment on it. Free VRAM for the taking.
Don’t assume Q6 quantization exists just because it “should.” Check the actual tag list before planning around it — most major labs skip it entirely on the official Ollama library.
Hugging Face GGUF repos (bartowski, unsloth) are a legitimate escape hatch when the official quant you want doesn’t exist — just verify the source is a known, trusted quantizer first.
Turn on KV cache quantization. At Q8_0 specifically, the quality cost is close to zero and the VRAM savings are real.
Test with real prompts, not benchmarks. A simple, honest question like “how do I check my GPU usage” separated 8 models more clearly than any spec sheet could have.
Bigger doesn’t mean more honest. Some 14B models fabricated tools just as readily as smaller ones — model size didn’t predict hallucination risk in this test.