Wanted to see if either of my two resident models could act as an actual design collaborator — not just answer questions, but iterate on real critique the way a person would. Gave both the identical starting brief I used when building this blog: design plan first, code later, avoid the generic AI-blog defaults.
Same four-stage escalation for both: initial brief, call out any cliché it lands on, push for one genuinely distinctive structural element, then (if it gets that far) flag the live-backend tradeoff that idea usually implies and see if it self-corrects.
gpt-oss:20b
Round 1. Asked for a palette, typography, and home-page layout. Got back a deep-blue-and-gold “SaaS dark mode” palette, Ubuntu Mono paired with Inter, and a landing-page structure — hero banner, service cards, masonry grid. Competent, but it’s the generic tech-product template with homelab-flavored labels stapled on. Nothing in it was actually drawn from the subject matter.
Round 2. Called out the landing-page structure. It fixed that correctly — post list became the main content, hero shrunk to a title. But the palette swap it made at the same time was black-background-plus-bright-green: the other well-known AI-design cliché, “hacker terminal.” Same failure mode, different flavor.
Round 3. Named the cliché directly and pointed back at “copper,” an idea it had mentioned in round 1 and then buried. This time it landed somewhere real: phosphor black, copper/brass/gold, material references to actual hardware. Genuine improvement, though worth noting: I handed it the material reference. It ran with what I gave it; it didn’t come back to copper on its own.
Round 4. Layout was still generic template furniture even with the better palette. Asked for one distinctive element that encodes real information instead of decorating the page. It proposed a live homelab status dashboard — polling service health every 10 seconds via a small API, rendered in the copper palette. First genuinely original idea across all four rounds.
But it came with problems it never flagged itself. It quietly reintroduced a live backend and a JS polling loop to what was supposed to be a static site. It also proposed exposing live internal infrastructure state (CPU%, GPU free memory, direct links to /logs/pi-hole) on a public blog, with zero mention of the security exposure, on a setup where I specifically lock the Docker bridge down with UFW rules. And it contradicted itself two sentences apart (“no generic hero” / “the dashboard is the hero”).
I ended the test here — four rounds in, with an unflagged security tradeoff sitting in its best idea.
Gemma 4 12B
Round 1. Same brief, cold. Came back with “Deep Slate” + circuit amber + success mint — grounded explicitly in hardware materials (anodized aluminum, rack steel, indicator lights), not “Ubuntu terminal colors.” Skipped the landing-page trap entirely — went straight to a dashboard-style, content-first feed, no hero banner or service cards. And it produced a real structural device unprompted: [LOG_042]-style post IDs, encoding sequence rather than decorating. Everything oss needed three corrections to reach, Gemma had on the first pass.
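To make the post-ID device concrete, here is a minimal Python sketch. It assumes IDs are plain sequence numbers zero-padded to three digits, which is what [LOG_042] suggests; the `log_id` helper is a hypothetical name, not something Gemma or the blog's build actually defines.

```python
def log_id(sequence: int, width: int = 3) -> str:
    """Format a post's sequence number as a [LOG_042]-style ID.

    Assumption: posts are numbered in publication order and padded
    to `width` digits; longer numbers are left unpadded.
    """
    if sequence < 0:
        raise ValueError("sequence must be non-negative")
    return f"[LOG_{sequence:0{width}d}]"


print(log_id(42))
```

The point of the device is that the number carries order: a reader can tell where a post sits in the run without reading a date.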
Round 2. Pointed out the remaining generic parts — a conventional header/hero/grid/sidebar skeleton and “Quick Links: Docs, Hardware, Archive” nav copy that could belong to any blog. Gemma proposed replacing the sidebar with a Service Registry: active services (Jellyfin, Nextcloud, Ollama…) grouped by layer (Infrastructure / Media / AI-Compute / Dev-Tools), each with a status dot and a [v1.2]/[Stable] tag — explicitly built to echo the LOG_042 logic from round 1. Real design coherence, not a grab-bag of separate clever ideas.
Round 3. Flagged the same tradeoff oss never caught on its own: a live status registry needs a live backend, which breaks the whole point of a static Hugo build. Gemma didn’t just patch it — it correctly diagnosed why the idea was wrong (“uncanny valley… user expects live data but gets a stale snapshot”) and reframed from state to provenance: a static “Deployment Manifest” showing what’s running, on what stack, and when it was last touched — no polling, no backend, just Hugo content files updated by hand. Kept the same tagging grammar throughout ([ROCm/RX 9060XT] echoing the earlier tags). Resolved cleanly, one round, landing on an idea arguably more homelab-authentic than the live dashboard it replaced.
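The static manifest is easy to picture as hand-maintained data plus a renderer. The Python sketch below is only an illustration under stated assumptions: the `ManifestEntry` type, the `render_manifest` helper, the field names, the service-to-layer mapping, the "Docker" tag, and the sample dates are all hypothetical, not output from either model or the blog's actual Hugo content. It formats text and nothing else: no polling, no network calls.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManifestEntry:
    """One hand-maintained manifest row (all field names are hypothetical)."""
    service: str
    layer: str          # e.g. "Media" or "AI-Compute"
    stack: str          # rendered as a bracketed tag
    last_touched: date  # edited by hand, never measured live


def render_manifest(entries: list[ManifestEntry]) -> str:
    """Group entries by layer and render them as plain text lines."""
    by_layer: dict[str, list[ManifestEntry]] = {}
    for entry in entries:
        by_layer.setdefault(entry.layer, []).append(entry)

    lines: list[str] = []
    for layer, rows in by_layer.items():  # dicts keep insertion order
        lines.append(layer)
        for row in sorted(rows, key=lambda r: r.service):
            lines.append(
                f"  {row.service} [{row.stack}] "
                f"last touched {row.last_touched.isoformat()}"
            )
    return "\n".join(lines)


# Sample data: names echo the post; mapping, dates, and "Docker" are made up.
entries = [
    ManifestEntry("Ollama", "AI-Compute", "ROCm/RX 9060XT", date(2025, 1, 15)),
    ManifestEntry("Jellyfin", "Media", "Docker", date(2025, 1, 10)),
]
print(render_manifest(entries))
```

Because every field is provenance rather than state, nothing here can go stale in a misleading way: a "last touched" date is true until someone edits it, whereas a status dot is only true at the moment it was measured.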
Scorecard
| | gpt-oss:20b | Gemma 4 12B |
|---|---|---|
| Rounds to escape generic-template defaults | 3 | 1 (grounded from the first pass) |
| Rounds to a real signature structural idea | 4 | 1 (LOG_042), refined in 2 (Service Registry) |
| Caught the live-backend/security tradeoff unprompted | No | No |
| Resolved it once flagged | Never reached this test | Cleanly, in one round, with a better idea than what it replaced |
The honest takeaway
Neither model caught the “should real infrastructure state be public” question on its own — that’s the fair, shared finding, not a point in Gemma’s favor. The real difference is what happened after being told. oss’s best idea still had the flaw sitting in it when the test ended. Gemma, given the identical class of correction, didn’t just patch the symptom — it re-derived the right idea from first principles and kept every prior design decision consistent while doing it.
For a 6GB laptop and a 16GB desktop, Gemma’s already earned its spot on raw correctness and VRAM efficiency. This was a different kind of test — open-ended, iterative, judgment-heavy — and it came out ahead there too.