Benchmarking local qwen3-coder on an Astro 6 build on M4 Max — 4 model failures, not Mac failures

Arnold Wender May 26, 2026

#Ollama #Local LLM #qwen3-coder #Astro 6 #Claude Code #AI Engineering #Benchmark #Apple Silicon

I’ve been testing local models with Ollama for weeks. I built wm-project-llm-routing to route mechanical tasks (file search, lint, find-replace, doc-gen) to a local model and keep the Anthropic API for design and reasoning. To validate what the local model can handle, I wrote my own smoke-test harness — 8 tasks, falsifiable criteria.

This week I pushed it. ~2000-word prompt asking qwen3-coder:30b-a3b-q4_K_M to build a full Astro 6 site, with 11 acceptance checks at the end.

Result: the model wrote Svelte inside .astro files, ignored an explicit STOP, and locked up for 30 minutes on a single turn. Full postmortem below — hard data, exact files, 4 root causes verified against GitHub issues and the vendor’s own paper.

The real question the experiment answered: can I trust local LLMs to replace the Anthropic API in my daily workflow? Short answer: not yet. Long answer: depends on the task.

The local model nailed the hard part — bootstrapping a scaffold with real package versions — and failed the easy one: writing an Astro component without accidentally pasting Svelte syntax in.

Arnold Wender Web developer and digital creator

The setup: it wasn’t my hardware

Specs first. If your first reaction to “the local model failed” is “your Mac wasn’t big enough”, no:

Chip: M4 Max (Apple Silicon)
CPU cores: 16 (12P + 4E)
GPU cores: 40
Unified RAM: 64 GB
SSD: 1.8 TB

MacBook Pro Mac16,5 with M4 Max. Top of Apple’s consumer-grade lineup. 64 GB of unified memory → the GPU addresses huge weights with no PCIe penalty. qwen3-coder:30b takes 45 GB in VRAM during inference. 19 GB left for the rest of the system. Tight but it works.

ollama ps showed PROCESSOR: 100% GPU for the whole session. No swap. No thermal throttling. Nothing on the hardware side could excuse what came next.

The bottleneck was the model. Not the Mac.

Why try local

Three reasons:

1. Privacy. I work with client code. “We don’t train on your prompts” policies are credible but not cryptographic. Local inference is verifiable: if nothing leaves the Mac, nothing can leak.
2. Cost. For mechanical tasks (file search, lint, find-replace, doc-gen) paying Opus rates is wasteful. wm-project-llm-routing routes those to a local Ollama instance; the API stays for design, legal, complex reasoning.
3. Curiosity. I want to know how seriously I can take open models in 2026.

Relevant note to point 1: Ollama is no longer 100% local by default. New cloud track since spring 2026 — models with the :cloud suffix run on Ollama infrastructure, not on your Mac. Detail in the dedicated section below.
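The routing split from the cost point can be sketched in a few lines. This is a minimal illustration, not the actual wm-project-llm-routing code: the type names, the backend labels and the `routeTask` function are all hypothetical, and only the list of mechanical task kinds comes from the text above.

```typescript
// Hypothetical task categories; the mechanical ones mirror the list in the text.
type TaskKind =
  | "file-search"
  | "lint"
  | "find-replace"
  | "doc-gen"
  | "design"
  | "legal"
  | "reasoning";

// Hypothetical backend labels for illustration only.
type Backend = "local-ollama" | "anthropic-api";

const MECHANICAL: TaskKind[] = ["file-search", "lint", "find-replace", "doc-gen"];

// Mechanical work goes to the local model; everything else stays on the API.
function routeTask(kind: TaskKind): Backend {
  return MECHANICAL.indexOf(kind) !== -1 ? "local-ollama" : "anthropic-api";
}
```

Keeping the decision in one pure function makes it easy to move a task kind back to the API when the local model turns out to be unreliable for it.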

The smoke test that passed (88%) — the misleading spoiler

Before the benchmark I validated the model with wm-llm-smoke-test — my own suite, 5 categories × ~2 tasks, isolated prompts with no tool-use, scored against expected keywords.

Result: 7/8 (88%). The single fail: a path hallucination — the model answered src/task_classifier.py when the correct answer was lib/classification.py. Honest, well-known fail: LLMs hallucinate paths without tool-use. For refactors, doc-gen and formatting: fine.
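Keyword scoring of this kind is simple enough to sketch. The following is an assumed shape, not the real wm-llm-smoke-test code: `SmokeTask`, `passes` and `passRate` are hypothetical names, and the rule that every expected keyword must appear is my assumption about how a PASS is decided.

```typescript
interface SmokeTask {
  id: string;
  expectedKeywords: string[]; // assumption: every keyword must appear for a PASS
}

// Case-insensitive substring match against the model's answer.
function passes(task: SmokeTask, answer: string): boolean {
  const haystack = answer.toLowerCase();
  return task.expectedKeywords.every(
    (keyword) => haystack.indexOf(keyword.toLowerCase()) !== -1,
  );
}

// Percentage of passed tasks, rounded to the nearest integer.
function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r).length;
  return Math.round((passed / results.length) * 100);
}
```

Under this scoring, an answer of src/task_classifier.py fails a task that expects lib/classification.py, and seven passes out of eight round to 88%.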

The benchmark: build rockshop.com.mx from scratch

rockshop.com.mx is a domain of mine with historical SEO equity for commercial queries in Mexico City. Today it’s a placeholder. I gave the model a detailed prompt (~2000 words) with:

Pinned stack: Astro 6.x, Tailwind 4.x via @tailwindcss/vite, strict TypeScript, Node 22 LTS
Mandatory Atomic Design structure (atoms/, molecules/, organisms/)
Design tokens in CSS custom properties, no hardcoded colors allowed
3 pages: home, placeholder catalog, contact
Hard rules: no open-source licenses, no stock images, no fabricated business claims, copy in Mexican Spanish without voseo
A final acceptance test with 11 falsifiable checks (grep this, verify that, build must pass)

Prompt rules: no silent simplification, no scope degradation, no simulation. The model must build the whole site and self-run the acceptance test at the end, reporting honest PASS/FAIL per check.
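A falsifiable check is one that a script can decide without trusting the model's own report. A minimal sketch of that idea, assuming the output directory has been loaded into a path-to-contents map: the two checks below are hypothetical examples in the spirit of the rules above, not the 11 checks from the actual prompt, and the hex-color regex is a rough heuristic.

```typescript
// Map of relative path -> file contents, e.g. loaded from the output dir.
type FileMap = Record<string, string>;

interface Check {
  name: string;
  run: (files: FileMap) => boolean; // true means PASS
}

// Hypothetical check: components must use tokens, not hex color literals.
const noHardcodedColors: Check = {
  name: "no hardcoded hex colors in src/components",
  run: (files) =>
    Object.keys(files)
      .filter((path) => path.indexOf("src/components/") === 0)
      .every((path) => !/#[0-9a-fA-F]{3,8}\b/.test(files[path] ?? "")),
};

// Hypothetical check: each Atomic Design folder contains at least one file.
const atomicDirsPresent: Check = {
  name: "atoms/, molecules/, organisms/ each contain a file",
  run: (files) =>
    ["atoms", "molecules", "organisms"].every((dir) =>
      Object.keys(files).some(
        (path) => path.indexOf(`src/components/${dir}/`) === 0,
      ),
    ),
};

// One "PASS name" or "FAIL name" line per check.
function runChecks(checks: Check[], files: FileMap): string[] {
  return checks.map((c) => `${c.run(files) ? "PASS" : "FAIL"} ${c.name}`);
}
```

Running the same checks outside the model's session gives a report that does not depend on the model being honest about its own output.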

I launched the experiment with parallel monitoring:

ollama ps every couple of minutes → confirm GPU usage
find over the output dir → count files created
Process inspection → detect whether Claude Code was waiting for tool-call approval or actually working

Not “see what happens”. Measure behaviour.
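The three probes above reduce to a small record per polling interval. A sketch of how such samples could be evaluated, under stated assumptions: the `Sample` shape and both function names are hypothetical, the `processor` field is assumed to hold the PROCESSOR column of ollama ps as a string, and collecting the samples (running the commands) is left out.

```typescript
interface Sample {
  minute: number;    // minutes since the run started
  processor: string; // PROCESSOR column from `ollama ps`, e.g. "100% GPU"
  fileCount: number; // files found in the output dir at that moment
}

// True only when the whole model is reported as resident on the GPU.
function isFullyOnGpu(sample: Sample): boolean {
  return sample.processor.trim() === "100% GPU";
}

// Minutes between the last change in file count and the latest sample.
function minutesSinceProgress(samples: Sample[]): number {
  let lastChange = 0;
  let latest = 0;
  let previousCount: number | undefined;
  for (const s of samples) {
    if (previousCount === undefined || s.fileCount !== previousCount) {
      lastChange = s.minute;
    }
    previousCount = s.fileCount;
    latest = s.minute;
  }
  return latest - lastChange;
}
```

A growing value from `minutesSinceProgress` while the GPU is still fully busy is the signature of a model that is working hard and producing nothing.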

What it built in 22 files: it started well

Fifteen minutes in, the file count hit 22 with clean structure:

Correct package.json

Astro 6.3.7, Tailwind 4.3.0, @astrojs/netlify 7.0.10, @astrojs/sitemap 3.7.2. Real versions, not invented. Verified each package with npm view before installing.

Atomic Design respected

src/components/{atoms,molecules,organisms}/ separated. No duplicate components in root. Folders correct.

Basic Astro config

defineConfig with netlify adapter + sitemap integration. The Tailwind 4 vite plugin was missing but the scaffold was reasonable.

Proprietary LICENSE
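For reference, a config that includes the missing Tailwind 4 piece would look roughly like this. It is a sketch and not the file the model generated: the `site` value is my assumption (the sitemap integration needs a site URL), and the Tailwind wiring follows the Vite-plugin setup that `@tailwindcss/vite` is designed for.

```typescript
// astro.config.ts: a sketch, not the generated file.
import { defineConfig } from "astro/config";
import netlify from "@astrojs/netlify";
import sitemap from "@astrojs/sitemap";
import tailwindcss from "@tailwindcss/vite";

export default defineConfig({
  site: "https://rockshop.com.mx", // assumed; required for sitemap generation
  adapter: netlify(),
  integrations: [sitemap()],
  vite: {
    plugins: [tailwindcss()], // the plugin that was missing from the scaffold
  },
});
```

With this setup the global stylesheet also needs an `@import "tailwindcss";` line, otherwise the plugin has nothing to process.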

At this point the experiment looked promising. The mechanical part of the scaffold — the part a coding-specialised model should nail — it nailed.

Then it broke.

The exact moment of failure

I caught the degradation while reading its real-time output. I asked what it was doing. The verbatim response:

I'm having trouble with the Astro component syntax. Let's simplify
the whole approach and create a basic project without using complex
components. First, I'm going to delete all atomic and molecular
components and create a minimal structure.

That sentence breaks three explicit rules of the prompt: “do not simplify without warning”, “atomic design is mandatory”, “do not degrade scope unilaterally”. The model decided on its own that the solution to a syntax error was to delete the entire architecture.

I went file by file. Opened atoms/Heading.astro, atoms/Button.astro and atoms/Text.astro in the editor and the problem became obvious:

---
props: {
  level: 1,
  text: String,
  className: String,
}
---

{#if props.level === 1}
  <h1 class={props.className}>{props.text}</h1>
{:else if props.level === 2}
  <h2 class={props.className}>{props.text}</h2>
{:else if props.level === 3}
  <h3 class={props.className}>{props.text}</h3>
{/if}

The valid Astro equivalent:

---
interface Props {
  level?: 1 | 2 | 3 | 4 | 5 | 6;
  text: string;
  class?: string;
}

const { level = 1, text, class: className } = Astro.props;
const Tag = `h${level}`;
---

<Tag class={className}>{text}</Tag>

What it wrote is not Astro. It’s Svelte: the {#if condition}...{:else if}...{/if} block is pure Svelte syntax, and the props: {...} frontmatter is something between the Vue 3 Options API and Svelte, but certainly not Astro. Astro declares props with a TypeScript interface Props {} in the frontmatter and renders with JSX-like expressions, so a conditional is plain JavaScript such as {level === 1 && <h1>{text}</h1>}, never an {#if} block.

The model confused one framework with another. It didn’t know Astro syntax, so it filled in with what it did know. Every file was syntactically invalid for the Astro compiler, so no npm run build was going to pass with that code.
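This class of mistake is cheap to detect before a build is even attempted. A heuristic sketch, with hypothetical names throughout: it flags lines in an .astro source that open a brace with `#`, `:` or `/` followed by a keyword, which is how Svelte control-flow blocks start. It is a pattern match, not a parser, so unusual but valid expressions could trip it.

```typescript
// Svelte control-flow markers look like {#if ...}, {:else ...}, {/if}, {#each ...}.
const SVELTE_BLOCK = /\{[#:\/][a-z]+/;

interface Finding {
  line: number; // 1-based line number in the source
  text: string; // the offending line, trimmed
}

// Returns every line of an .astro source that contains a Svelte-style block marker.
function findSvelteSyntax(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (SVELTE_BLOCK.test(text)) {
      findings.push({ line: index + 1, text: text.trim() });
    }
  });
  return findings;
}
```

Run against each generated component, a non-empty result is a reason to stop the agent and ask for the verbatim error before it writes anything else.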

I sent it an explicit message: “STOP. Don’t delete atoms/, don’t rewrite anything, give me the verbatim error first”. The model politely agreed and rewrote three more files with the same broken syntax before showing me the error. It ignored the STOP. And kept going.

After ~30 minutes of “Deciphering… 29m 36s” on a single turn, the model was still locked. I killed it with Ctrl+C and closed the evaluation.

The diagnosis: four root causes

This wasn’t a single-factor failure. For each hypothesis I ran parallel searches against GitHub issues, the model vendor’s official paper and community reports. What I found:

1. Cross-framework hallucination
Astro 6 stable shipped late 2025 / early 2026. qwen3-coder was trained with a cutoff around July 2025. Astro was poorly represented in its training data. The model filled in with the closest framework it did know: Svelte, whose .svelte single-file components pair a script section with control-flow blocks. It is exactly the kind of mistake a confused junior dev would make.
2. Ollama stable has known Claude Code tool-use bugs
Issue #15390 in ollama/ollama documents "Invalid tool parameters" errors and 100% CPU spikes with streaming tool calls. The fix requires Ollama pre-release 0.14.3-rc1 or newer. I was on stable. Claude Code's agentic loop quietly stalls when a malformed JSON tool-call comes back.
3. Default num_ctx is 4096, not 262144
Although ollama ps shows context length 262144 (the model MAX), real calls cap at 4096 tokens unless overridden in a Modelfile. For a full-site build with multiple files in context, 4K runs out within a few turns. The model starts forgetting the original prompt rules.
4. qwen3-coder confabulates in long conversations
Issue #17031 in NousResearch/hermes-agent documents that the model fabricates prior conversation history when running in agentic harnesses. It is not unique to Claude Code: it is a known model behaviour under long loops.

Documented fixes exist for three of the four causes: upgrade Ollama, override num_ctx, and switch to a different model such as GLM-4.7-Flash, which Zhipu AI tuned for agentic patterns. The first cause, the cross-framework hallucination on Astro, only resolves when someone trains a model on newer data, or when you use an API model that already has it (Claude, GPT, Gemini).
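The context arithmetic is worth making explicit. The sketch below assumes a rough heuristic of about 1.3 tokens per English word (an assumption, not a measured tokenizer figure) and shows how a per-request override can be passed through the `options.num_ctx` field of Ollama's chat API when calling it directly. The function names are hypothetical, and this is a different route from the Modelfile override mentioned above.

```typescript
// Rough heuristic (assumption): English prose averages about 1.3 tokens per word.
function estimateTokens(wordCount: number, tokensPerWord = 1.3): number {
  return Math.ceil(wordCount * tokensPerWord);
}

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  stream: boolean;
  options: { num_ctx: number };
}

// Builds a request body for a direct call to Ollama's /api/chat endpoint,
// asking for a larger context window than the default.
function buildChatRequest(model: string, prompt: string, numCtx: number): ChatRequest {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    stream: false,
    options: { num_ctx: numCtx },
  };
}
```

Under that heuristic a 2000-word prompt already lands well above half of a 4096-token window, before a single generated file or tool result is added to the conversation.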

I’m not alone in this: documented analogous cases

After the failure I searched external sources to verify whether what I experienced was anecdotal or systematic. It’s systematic. Six relevant findings:

Issue #5419 on continuedev/continue

"Agent mode not work when using Qwen3 model" — users of the Continue IDE report exactly the same problem with the model in an agentic harness, outside Claude Code. It is not a Claude Code bug: it is the model under agentic pressure.

Qwen team itself admits instability

Qwen3-Coder-Next Technical Report (arxiv 2603.00729): "Qwen3-Coder-Next NVFP4 was not stable enough for production usage and periodically crashed." Official source from the model vendor.

Context rot — the technical name for my observation

Agentic-engineering discussions on dev.to and dotnetting document "context rot": model performance degrades as the session grows long, the context window fills with error traces, and decision quality drops. Formalised pattern, not an anecdote.

TypeScript narrowing: Qwen3-Coder 1/10

eval.16x.engineer published a formal benchmark: Qwen3 Coder scored 1/10 on TypeScript narrowing, "making conceptual mistakes that prevented code from passing the TypeScript compiler". Claude Opus 4.6 produces more consistent code across long loops — exact corroboration of my conclusion.

LogRocket Svelte 5 + Firebase test with Qwen3-Coder

A test similar to mine using a modern framework: "required some iteration and patience". Outside its comfort zone (CRUD/React), the model struggles systematically.

Documented anti-pattern: "delete to make CI green"

The agentic-engineering community documents the explicit pattern of models deleting tests, components or features to falsely satisfy a build. It is exactly what my qwen3 proposed ("I'm going to delete all atomic and molecular components"). A bug class, not a single bug.

If you worry it might be just “your model on your Mac” misbehaving, it isn’t: this is systematic behaviour of current open models in long agentic harnesses. The cleanest fix, for now, is not to use them for that.
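The "delete to make CI green" pattern can be caught mechanically by comparing file listings between turns. A minimal sketch, assuming the harness can snapshot the output directory as a list of relative paths; the function name and the idea of protected prefixes are hypothetical.

```typescript
// Returns paths that existed before, are gone now, and sit under a protected prefix.
function findForbiddenDeletions(
  before: string[],
  after: string[],
  protectedPrefixes: string[],
): string[] {
  return before.filter(
    (path) =>
      after.indexOf(path) === -1 &&
      protectedPrefixes.some((prefix) => path.indexOf(prefix) === 0),
  );
}
```

Called with `["src/components/"]` as the protected prefix, any non-empty result means the agent removed part of the mandated structure and the turn should be rejected rather than scored.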

Context note: Ollama is no longer 100% local by default (May 2026)

I’ll use this postmortem to record something important about the current state of the ecosystem, because if you landed here searching for “local LLM” you may not be tracking it: since spring 2026, Ollama runs two distinct tracks, and the new ollama launch claude command mixes them in the same picker:

Track	Suffix	Privacy	Latency for "hello"	Cost
Local	no suffix	100% local (verifiable)	13–150 s	free (electricity)
Ollama Cloud	:cloud	"No logging" policy (not cryptographic)	<3 s	Free tier + $20/mo Pro
Anthropic API	n/a	"No training" policy	<3 s	pay per token

The ollama launch claude feature rolled out in spring 2026. It wires Claude Code to Ollama with no env-var fiddling. But heads up: the Claude Code UI shows “API Usage Billing” in the header for any Ollama backend, including the local one. That misleading label made me doubt for a while whether my model was really running on my Mac. I confirmed it with ollama ps (PROCESSOR: 100% GPU throughout the session). The label is generic; it does not mean you’re paying Anthropic.

If your priority is verifiable privacy, pick models without the :cloud suffix. If your priority is speed and you accept Ollama’s “no logging” policy, the cloud track has SOTA models (kimi-k2.6, glm-5.1) at reasonable prices. They are not equivalent options — they are different trade-offs.
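If verifiable privacy is the requirement, the suffix rule can be enforced in code instead of by eye in the picker. A small sketch with hypothetical function names: it treats a tag ending in `:cloud` as a cloud model, and also a tag ending in `-cloud`, which is my assumption about how size-qualified cloud tags are written.

```typescript
// True when the model tag points at the cloud track rather than local weights.
// The "-cloud" ending is an assumption; the text above only names ":cloud".
function isCloudModel(tag: string): boolean {
  return /[:-]cloud$/.test(tag.trim());
}

// Returns the tags that would send prompts off the machine.
function findCloudModels(tags: string[]): string[] {
  return tags.filter(isCloudModel);
}
```

A router that refuses to start when `findCloudModels` returns anything keeps the privacy guarantee tied to configuration rather than to attention.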

Verdict: when local qwen3-coder IS worth it

Summary table:

Task	qwen3-coder local on M4 Max	Verdict
Answer "hello"	13–150 s	Slow but acceptable
Refactor a single function	OK	Yes
Doc-gen on code pasted in the prompt	OK	Yes
8-task isolated smoke test	88% pass	Yes
File search without tool-use	Hallucinates paths	No
Build an Astro 6 site unsupervised	Cross-framework hallucination + scope degradation	No (May 2026)
Stay coherent across 30+ agentic turns	Loses context, ignores instructions	No (May 2026)

Works as an isolated pair-programmer. Give it a function, get it refactored. Give it a paragraph, get it documented. For that, 88% pass rate is genuine and sufficient.

Does NOT work as an autonomous many-turn agent. At least not in my May 2026 setup. If your real workflow needs to build or refactor whole projects unsupervised, you still need the API. Or chunk scope aggressively — one file at a time, with human verification between each.

Conclusion: the M4 Max isn’t the bottleneck

64 GB of unified memory, 40 GPU cores, plenty of headroom for massive models — and the experiment failed anyway. Not because of hardware. Because of model capability and harness maturity.

For the hardware I have, local qwen3-coder is a valid tool for micro-tasks. It does not replace Claude Opus on real builds. And that’s fine — they’re different tools for different problems.

If you invest in a Mac with enough RAM for local LLMs, do it knowing the ceiling is set by the open-model ecosystem + tooling. NOT the hardware. The M4 Max gives you headroom for the next 2–3 years. What’s missing is open models catching up to proprietary ones in agentic capability. Today they aren’t there. In 2027, probably.

Sources

Every finding was verified against primary sources during the postmortem:

ollama/ollama #15390 — GitHub (Invalid tool parameters + CPU fallback with Claude Code)
NousResearch/hermes-agent #17031 — GitHub (Qwen3 confabulating session history)
Claude Code integration — Ollama Docs
qwen3-coder model card — Ollama
ollama launch rollout — Ollama Blog (spring 2026)
Ollama Cloud — Ollama (pricing + policy of the cloud track)
continuedev/continue #5419 — GitHub (Agent mode not working with Qwen3 — analogous case in another IDE)
Qwen3-Coder-Next Technical Report — arXiv (official admission of production instability)
Qwen3 Coder evaluation — 16x Eval (1/10 on TypeScript narrowing)
Qwen 3 Coder agentic CLI — LogRocket (Svelte 5 + Firebase test)