Not every post here ends with an answer. This one’s a log of a bug I haven’t fixed yet, written down so the next session doesn’t start from zero.
The symptom
batiai/gemma4-e2b:q6 on the GX531GM (GTX 1060 6GB) — the model that won the earlier bake-off outright — has started showing inconsistent response latency. Not the model’s own thinking time, which is separate and expected. A silent gap before generation even starts, anywhere from nothing to 40-60+ seconds, with no obvious pattern:
- “hello” → 19s
- next message → 42s
- next message → ~100s total (60s silent gap + 42s visible thinking)
- next message → started immediately
- next message → 60s+ delay again
Not tied to prompt complexity. Even “hello” and “say hello” showed the delay on some attempts and not others in the same session.
What I ruled out, and how
Went through this methodically rather than guessing, checking each theory directly instead of assuming:
- Thermal throttling — GPU stayed 41-61°C throughout, nowhere near Pascal’s ~83-90°C throttle point. Confirmed via
nvidia-smiandsensors. - Power/clock throttling — Plugged into AC. Live-monitored clocks during generation showed correct boost to 1650-1750MHz near the 1911MHz max, 94-100% utilization, power draw pinned at 37-41W against the card’s 40W cap. That’s the GPU working exactly as a maxed-out 40W mobile chip should — not a bug.
- RAM spillover — VRAM stayed in the expected 2.7-5GB range, GPU utilization pinned high (not idle-while-CPU-spins, which would signal spillover), system RAM for the process stayed reasonable. No overflow.
- Model unloading between messages — Real theory, real fix: added
OLLAMA_KEEP_ALIVE=30mto the systemd override, confirmed the process stays resident viapgrep -a llama-server. Didn’t fix the intermittent delay. Ruled out as the sole cause. - Conversation context length — A brand-new chat, 4-6 messages in, well under the 131K context window, still showed the same slow/fast inconsistency. Disproven directly.
- System memory pressure —
free -hshowed 12GB available of 15GB. Some swap usage present but not clearly correlated with the slow instances. - Duplicate/stuck processes — Only one
llama-serverinstance running at any checked time. No zombies competing for the GPU. - Open WebUI’s hidden background calls (title generation, follow-ups, tags) — Checked
docker logs -f open-webuiaround several completions calls. No suspicious flood of extra requests, just normal polling and expected model-list calls. Not strongly supported by what the logs show, though I didn’t run a proper controlled A/B with these settings toggled off — still worth doing.
What’s actually still unknown
Open WebUI’s access log only timestamps when a request completes, not when it arrives. So the gap before a POST /api/chat/completions finishes doesn’t tell me where the time is actually going — network wait, queued behind something, or genuine model compute time. That’s the real missing piece, and it’s a logging gap, not a diagnosis.
Also noticed, unrelated to the bug
The model itself has repeatedly and confidently described itself as “running on a remote server” or lacking “access to the operating system” — when it’s running entirely locally on this laptop’s own GPU. Small local models don’t have real insight into their own deployment. Worth remembering generally: don’t trust a model’s self-report about its own infrastructure, verify independently the way this whole session did with nvidia-smi and pgrep.
Next steps, for whenever I pick this back up
- Get actual request-level timing. Wrap the API call with
curl -wtiming flags (or similar) to isolate time-to-first-byte from real model compute time — separates network/queue delay from the GPU actually working. - Controlled A/B on the Open WebUI settings theory. Explicitly disable Title Auto-Generation, Follow-Up Suggestions, and Tag Generation in Admin Settings, then run a real before/after comparison instead of just eyeballing logs.
- Try a smaller explicit context window instead of the default 131K allocation, to see if a smaller fixed KV cache reduces first-token variance — separate from the already-disproven “growing conversation” theory.
- Check disk I/O contention with
iostat -x 1during a slow instance — the NVMe could be busy with something else (backups, indexing, swap) at the exact moment generation is requested. - Retest after a clean reboot, same time of day. Tonight’s testing came after hours of continuous heavy use — model downloads, repeated GPU load, an overnight session. Worth ruling out slow cumulative state drift that wouldn’t show up in any single snapshot.
Writing this down instead of just remembering it, because half the value of a build log is not having to re-derive “wait, did I already check thermal throttling” three weeks from now.