2026.07.02

Three tests, one winner: what Gemma 4 got right that gpt-oss didn't

Didn’t set out to run a Google-vs-OpenAI bake-off. Just kept testing models on real tasks across two machines, and the pattern kept repeating itself until it was hard to ignore. Three separate tests, three different axes, and nearly the same result each time.

Test 1: the 6GB laptop

Five models on the GX531GM’s GTX 1060, compared on live-measured VRAM use plus a hard OOM-log regex challenge. batiai/gemma4-e2b:q6 used roughly half the VRAM of every Q4 competitor, despite running at a higher-precision quant, and was the only model whose hard-mode answer had valid syntax that would actually execute. Qwen3:8b used the most VRAM and took the longest to respond, and its regex answer had broken shell quoting that wouldn’t run at all. More thinking tokens, worse output.
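
For anyone who wants to reproduce a VRAM comparison like this on an NVIDIA card, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and shells out to its `--query-gpu=memory.used` query; the helper names (`parse_vram_mib`, `read_vram_mib`, `model_footprint_mib`) are made up for illustration, and this is not necessarily how the numbers in this post were captured.

```python
import subprocess


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse nvidia-smi CSV output (noheader, nounits): one MiB value per GPU line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_vram_mib() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def model_footprint_mib(baseline: list[int], loaded: list[int]) -> int:
    """VRAM attributable to the model: (loaded - baseline), summed across GPUs."""
    return sum(after - before for before, after in zip(baseline, loaded))
```

Call `read_vram_mib()` once before loading a model and once after it has answered a prompt, then pass both readings to `model_footprint_mib`. Subtracting the baseline keeps the desktop session's own VRAM use out of the comparison.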

Related post: Fitting a good model into 6GB of VRAM

Test 2: the 16GB desktop

Eight models on the asrock-desktop’s RX 9060 XT, same style of test. Gemma 4 12B (Q6_K) and gpt-oss:20b were the only two that survived the first cut; every other model either missed the sudo nuance or fabricated a tool that doesn’t exist. On the regex tie-breaker, gpt-oss actually won: correct pattern, matched the real syslog format. Gemma’s regex was wrong, and its own explanation of the regex contradicted the code: a real hallucination under pressure. But on VRAM, Gemma used ~1.8GB less at idle, even though gpt-oss’s MoE architecture activates only a fraction of its parameters per token. (Sparse activation cuts compute per token, but the expert weights typically all still have to sit in memory.)
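
The VRAM result is less surprising once the arithmetic is written out. Here is a back-of-the-envelope sketch: resident weight memory scales with total parameters times bits per weight, not with the parameters active per token. The figures below are round placeholders loosely based on the model tags, with nominal bit widths. They are not the real specs of either model, they ignore KV cache and runtime overhead, and they are not expected to reproduce the measured ~1.8GB gap.

```python
def weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate resident weight memory in GiB (weights only, no KV cache or overhead)."""
    return total_params * bits_per_weight / 8 / 2**30


# Placeholder figures for illustration only, NOT measured or official values.
moe_total, moe_active, moe_bits = 20e9, 4e9, 4.0   # hypothetical MoE model
dense_total, dense_bits = 12e9, 6.0                # hypothetical dense model

moe_resident = weight_gib(moe_total, moe_bits)
dense_resident = weight_gib(dense_total, dense_bits)
moe_active_share = moe_active / moe_total  # drives compute per token, not VRAM

print(f"MoE resident weights:   {moe_resident:.1f} GiB "
      f"(active per token: {moe_active_share:.0%})")
print(f"Dense resident weights: {dense_resident:.1f} GiB")
```

With these placeholder inputs the MoE model still needs more memory for its weights, even though only a fifth of them do work on any given token.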

Related post: Finding the Best Local LLM for a 16GB AMD GPU

The regex tie-breaker is the one place gpt-oss won outright. Worth saying plainly, since the other two tests didn’t go its way.
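
A regex answer is one of the few things in these tests that can be checked mechanically: does the pattern compile, and does it match a real log line? Here is a minimal sketch of that check. The sample line and the pattern are made up for illustration (they are not the actual challenge or either model's answer), and Python's `re` is only an approximation if the answer was written for `grep` or another regex dialect.

```python
import re


def check_regex(pattern: str, sample_lines: list[str]) -> dict:
    """Report whether a pattern compiles and which sample lines it matches."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return {"compiles": False, "error": str(exc), "matched": []}
    matched = [line for line in sample_lines if compiled.search(line)]
    return {"compiles": True, "error": None, "matched": matched}


# Hypothetical syslog-style OOM line, written for this example only.
SAMPLE_LINES = [
    "Jul  2 10:15:01 host kernel: Out of memory: Killed process 4242 (ollama)",
    "Jul  2 10:15:02 host systemd[1]: Started Session 12 of user me.",
]

# Hypothetical candidate pattern: capture the PID and the process name.
CANDIDATE = r"Out of memory: Killed process (\d+) \((\S+)\)"

result = check_regex(CANDIDATE, SAMPLE_LINES)
```

A pattern that fails to compile comes back with `compiles` set to `False` and the parser's error message. A pattern that compiles but matches nothing in the sample lines is the quieter failure worth looking for.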

Test 3: design judgment

The least technical, most subjective test: could either model act as a real design collaborator, iterating on critique instead of just answering prompts? Gemma was grounded and non-generic from its very first response. gpt-oss needed three full rounds of direct correction just to stop reaching for template defaults, and even its best idea (round 4) shipped with an unflagged security problem: it exposed live infrastructure state on a public blog without mentioning the risk. Gemma, given the same class of correction, re-derived a better idea from first principles in one round and kept its earlier design decisions consistent while doing it.

Related post: Can a local model actually design something?

The honest pattern

Gemma won or tied in every test except the hard regex tie-breaker on the 16GB box, where gpt-oss’s answer was simply more correct. That’s a real result, not a rounding error, and worth keeping in the story instead of quietly dropping it.

What’s consistent across all three tests isn’t “Gemma is smarter.” It’s that Gemma is more efficient for its size (roughly half the VRAM of the Q4 competitors on the laptop, ~1.8GB leaner than gpt-oss on the desktop) and, on the one test that wasn’t about raw correctness, more careful. It didn’t reach for clichés it had to be walked back from, and once it was pointed at a real problem, it fixed the actual problem instead of patching the symptom.

Sample size of two model families, on my hardware, with my test suite: not a universal verdict. But three tests in, three different kinds of question, and the same model kept coming out ahead more often than not. That’s enough to make Gemma the default, and gpt-oss the thing I reach for when a task specifically needs hard technical precision over everything else.