Five Models, One GTX 1060: The GX531GM Local AI Bake-Off

The Setup

The GX531GM is not a machine anyone would call cutting-edge anymore. An i7-8750H and a GTX 1060 with 6GB of VRAM — a Pascal-generation card that’s been out of the spotlight for years. But it’s still my daily driver for on-the-go local AI, and after a clean Ollama + Open WebUI (v0.10.2) reinstall, it was time to figure out which models this thing can actually run without making me want to throw it across the room.

Before any of the model testing mattered, there was a bug to squash. Every single model was crawling — minutes to respond to a simple “hello,” even though the same model via the ollama run CLI answered instantly. The culprit wasn’t the GPU, Ollama, or Docker networking. It was Open WebUI’s Builtin Tools and Default Features, enabled by default on every model, silently injecting large hidden context and tool schemas into every request. Some models without tool-calling support (Gemma 2, Gemma 4 E2B) threw hard errors. Models that do support tools just choked on the prompt bloat — a “hello” ballooning past 6,000 tokens.

The fix: go into Admin → Settings → Models, click into each model, and turn off every toggle under Builtin Tools and Default Features. That one change turned every model from “unusable” to “responsive.” Lesson learned — check your tool settings before you blame your hardware.

With that sorted, it was time for the actual bake-off: five models, same GPU, same questions, tools toggled on and off to see how each one actually behaves under real conditions.

The Contenders

gemma2:2b-instruct-q4_K_M — The Sprinter

Gemma 2 2B has zero tolerance for tools. Leave a single toggle checked — Memory, Chat History, whatever — and it hard-errors immediately. But turn everything off, and it’s the fastest model of the entire lineup. Instant typing, no hesitation. If you want a model that just answers immediately with no frills, this is it.

llama3.2:3b-instruct-q4_K_M — The Daily Driver

This one earned the top spot for everyday conversational use. Loaded up with a full set of tools checked — File Upload, File Context, Web Search, Builtin Tools, Time & Calculation, Memory, Chat History, Calendar — it handled everything thrown at it. Good memory recall, fast responses, no confusion, no slowdown. It’s currently the model I trust most for daily chat, though I want more time living with it before I fully rely on it for anything important.

batiai/gemma4-e2b:q6 — The Overthinker (In a Good Way)

Gemma 4 E2B is, hands down, the smartest model in this lineup — the best-quality responses of the bunch. But that intelligence comes at a real cost, and the cost scales directly with how many tools you leave checked:

All tools off: fastest configuration for this model
Partial tools checked (File Upload, File Context, Web Search, Time & Calculation, Memory, Chat History, Calendar): about 10-15 seconds to process the question, then another 10-15 seconds of visible “thinking” before it starts answering. Once it starts typing, though, it’s not slow.
Every tool and capability checked: things get rough. About a full minute to process, then 30 seconds to a full minute of thinking, then another 1-2 minutes to actually type out the answer.

That last configuration is technically usable, but the cumulative wait — process, think, type — stacks up to a point where it genuinely starts to cause some anxiety waiting on it. This is the model I reach for when I need work done that I actually trust, not when I need it fast.

qwen3:4b — The One That Didn’t Make It

Qwen3 4B simply doesn’t work on this hardware. It takes 4 to 6 minutes just to load, and even after loading, another 2 to 3 minutes to answer a question — with every single tool toggle turned off. The likely cause is Qwen3’s default reasoning/thinking mode, which doesn’t have a simple off-switch through Ollama. Whatever intelligence it might offer isn’t worth a wait that long on a 6GB Pascal card.

Ministral 3B (bartowski GGUF, Q4_K_M) — The Confidently Wrong One

Ministral handled the same heavy tool config as Llama 3.2 — File Upload, File Context, Web Search, Builtin Tools, Time & Calculation, Memory, Chat History, Calendar — without any of the response-time penalties Gemma 4 suffered. Fully tool-tolerant, fast, no slowdown.

But it has a different problem: it gets facts confused rather than failing to recall them. I asked it a simple question — what’s my name? It had the correct information in memory: my name is Evaristo, and I go by JR. Instead of answering correctly, it swapped the two and told me, “I’m JR, but I go by Evaristo.” It didn’t forget the information. It just reassembled it backwards. That’s a subtler, and in some ways more concerning, failure mode than an outright wrong answer.

What I Learned

Model	Tool Tolerance	Speed	Verdict
gemma2:2b-instruct-q4_K_M	None — hard errors with any tool on	Fastest once tools are off	Great for instant, no-frills answers
llama3.2:3b-instruct-q4_K_M	Full	Fast even loaded up	Current daily driver
batiai/gemma4-e2b:q6	Full, but costly	Scales badly with tools enabled	Best answers, worth the wait for important work
qwen3:4b	N/A	Unusable (minutes to load and respond)	Ruled out entirely on this hardware
Ministral 3B	Full	Fast, no penalty	Confuses recalled facts — use with caution

The biggest takeaway isn’t which model “wins” — it’s that tool configuration matters as much as model choice on limited hardware. A model that’s fast with zero tools enabled can become unusable the moment you turn on Memory and Web Search, and the degradation isn’t linear — it’s a cliff. Architecture and default behavior (like Qwen3’s baked-in reasoning mode) can matter more than raw parameter count.

For now, the game plan on the GX531GM is: Llama 3.2 3B for daily conversation, and Gemma 4 E2B when the answer needs to actually be right — accepting the wait as the cost of doing business. Ministral stays benched until I can dig into why it’s scrambling recalled details. Qwen3 is off the laptop entirely.

More testing to come as trust builds (or doesn’t) with each of these over time.