OpenCoder v3.0 — Qwen3.5 Agentic Setup

Date: 2026-03-30
Author: Justin / Comfac IT / CGG R&D & BD
Status: ✅ Working — qwen3.5-4b-instruct executes tools in OpenCode
Part of: Project OpenCoder: AI Independence Initiative

1. Problem Statement

The goal was to run a locally-hosted agentic coding assistant equivalent to Claude Code, using quantized open-source models on consumer hardware (8GB VRAM), served via Ollama and accessed through OpenCode or Qwen Code.

Two tools were tested in parallel:

OpenCode (v1.3.7) — open-source Claude Code equivalent, TUI-based
Qwen Code (v0.13.2) — Alibaba's terminal agent, optimized for Qwen models

2. What Failed and Why

2.1 Qwen 2.5 Coder (7B) — No Tool Execution

Models tested:

qwen2.5-coder:7b-instruct-q4_K_M
qwen-opencode:latest (custom Modelfile, tool dispatch tuned)
qwen-agent-7b:latest (agent-tuned variant)

Symptom: Models generated syntactically correct bash commands as plain text in the response body. OpenCode displayed them but never executed them. The "Build" status showed green (model responded) but no tool was invoked.

Root cause: Qwen 2.5 Coder at 7B does not reliably emit structured tool call JSON. It understands the task and writes the correct command, but outputs it as prose rather than invoking the framework's tool interface. The model lacks robust function-calling at this parameter count.

After extensive testing across multiple sessions and model variants, these were removed from the config entirely.

2.2 Qwen 3.5 (4B/9B) Base — Thinking Mode Interference

Models tested:

qwen3.5:4b
qwen3.5:9b

Symptom: Models showed visible Thinking: blocks in the output and generated commands as text. Same non-execution pattern as 2.5 Coder, but for a different reason.

Root cause: Qwen 3.5 operates in thinking mode by default, generating <think>...</think> content before its final response. When served via Ollama's OpenAI-compatible /v1 endpoint, the thinking block was being passed through in the response content. OpenCode's tool call parser received a response beginning with <think> text and either misidentified it, or the tool call was embedded inside the think block rather than emitted as a clean structured call after it.

Key finding from Qwen3.5 model card:

"Qwen3.5 models operate in thinking mode by default... Qwen3.5 does not officially support the soft switch of Qwen3, i.e., /think and /nothink."

This means the thinking mode cannot be toggled with a simple prompt prefix — it must be disabled at the API parameter level or suppressed via system prompt in a custom Modelfile.

Performance note: qwen3.5:9b took 4–5 minutes to generate a response. Investigation revealed models had been migrated to a mechanical HDD. Benchmarked read speed: approximately 8.3–16 MB/s (vs. 600 MB/s on SSD). A 5–6 GB model file at 16 MB/s ≈ 6 minutes load time. This explained the extreme latency — it was a storage bottleneck, not a model or GPU issue.

Template:Mbox

2.3 Qwen Code — Credential and Path Confusion

Symptom: Failed to switch model to 'openai::ollama'. Missing credentials for modelProviders model 'ollama'.

Root cause (two separate issues):

Issue 1: The env block declaring OLLAMA_API_KEY was missing from settings.json. Qwen Code requires the key to exist even though Ollama doesn't authenticate — it just checks that the env var is set.

Issue 2 — Distrobox path split: Qwen Code running inside a distrobox container reads ~/.qwen/settings.json which resolves to /home/justin/.qwen/ inside the container. The file being edited was at /run/host/home/justin/.qwen/settings.json on the host. These are two different files. Edits to the host path had no effect on what the running container read.

Fix: Write directly to ~/.qwen/settings.json from inside the correct container, or use a heredoc to overwrite cleanly.

Template:Mbox

3. What Worked — The Modelfile Solution

3.1 The Modelfile Approach

Since Qwen3.5 doesn't support /nothink and Ollama doesn't expose chat_template_kwargs at the API level, the solution was to create a custom Ollama Modelfile that:

Bases on the existing qwen3.5:4b (or :9b) pulled model
Sets coding-optimized sampling parameters per the official model card
Injects a SYSTEM prompt that suppresses thinking output and instructs the model to use tools directly

Modelfile for 4B (save as Modelfile.qwen3.5-4b-instruct):

FROM qwen3.5:4b

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.0
PARAMETER num_ctx 32768

SYSTEM """You are a precise coding and system assistant running inside an agentic terminal environment.

CRITICAL RULES:
- Do NOT output <think> or </think> tags
- Do NOT show reasoning steps or internal monologue
- Do NOT explain what you are about to do before doing it
- Respond with tool calls ONLY — no prose, no narration

When asked to perform any file, shell, or system operation:
- Invoke the appropriate tool immediately (bash, write, read, etc.)
- Use the exact tool call format the framework expects
- Execute first, confirm after only if asked

Available tools: bash, file read/write, glob, grep, list, task, webfetch.
Always use tools. Never output commands as plain text."""

Build command:

ollama create qwen3.5-4b-instruct -f Modelfile.qwen3.5-4b-instruct

3.2 Why the Modelfile Made It Work

The Modelfile changes three things simultaneously:

Layer	What changed	Effect
Sampling params	`temperature=0.6`, `top_p=0.95`, `top_k=20`	Matches Qwen3.5 model card recommendation for precise coding tasks — reduces token variance, more deterministic tool call format
System prompt	Explicit instruction to suppress `<think>` tags and use tools	Overrides the default thinking behavior at the prompt level even when API-level `enable_thinking: false` isn't available
Model identity	Registered as a new Ollama model tag	Can be added to OpenCode/Qwen Code config independently, tested in isolation, and versioned separately from the base model

The key insight: Qwen3.5 does support tool calling (45.6 on TIR-Bench). The problem was never capability — it was output format. The Modelfile forces the model into a mode where it emits tool calls in the structured format OpenCode expects, rather than narrating what it would do.

3.3 Confirmed Working

qwen3.5-4b-instruct in OpenCode successfully executed mkdir, echo, file creation, and multi-step mkdir && echo && cat chains
Response time: 7–10 seconds per tool call on 4B (vs 4–5 minutes on 9B from HDD)
Thinking blocks: suppressed — clean tool call output observed

4. Config Management Scripts

Three scripts were produced to manage the model ecosystem. These live in the OpenCoder working directory.

4.1 sync-opencode-models.py

Syncs Ollama's model list into opencode.json under provider.ollama.models (dict format). Preserves custom names, auto-generates names using family/size/capability detection, supports --dry-run and --verbose.

4.2 sync-qwencode-models.py

Same logic but targets ~/.qwen/settings.json under modelProviders.openai[id=ollama].models (array format). Uses NAME_RULES lookup table with 3-letter SFX codes for model role identification. Also writes correct contextLength per model (262144 for Qwen3.5-9B, 32768 for others).

4.3 fix-opencode-model-names.py

One-shot rename pass over an existing opencode.json to apply the Family (Size) - Descriptor · SFX naming convention without a full sync.

Important: These scripts should be run using Claude or another capable agent to keep the NAME_RULES and PINNED_NAMES tables current as new models are pulled.

4.4 Naming Convention

Family (Size) - Descriptor [Quant] · SFX

SFX	Meaning
`AGT`	Agentic / Tool Dispatch
`COD`	Coding focused
`RSN`	Reasoning / Thinking
`VIS`	Vision / Multimodal
`INS`	Instruct (non-thinking)
`GRM`	Grammar / Style utility
`GEN`	General purpose
`DST`	Distilled from larger model

Examples:

Qwen 3.5 (4B) - Instruct No-Think · INS
Qwen 3.5 (9B) - Claude Opus Distill [Q4_K_M] · RSN·DST
Qwen 2.5 Coder (7B) - Tool Dispatch · AGT

5. Infrastructure Notes

5.1 Storage Impact on Model Performance

Storage	Read speed	5GB model load	Usable?
NVMe SSD	~3,000 MB/s	~2 seconds	✅ Ideal
SATA SSD	~600 MB/s	~8 seconds	✅ Good
HDD (observed)	~8–16 MB/s	4–6 minutes	❌ Not viable for interactive use

Action: Always store active Ollama model files on SSD. Use HDD only for cold storage of models not currently in rotation.

5.2 OpenCode vs Qwen Code

Feature	OpenCode	Qwen Code
Config format	`opencode.json` — models as dict	`settings.json` — models as array
Config path	`~/.config/opencode/`	`~/.qwen/`
Distrobox compatible	✅ Yes	⚠️ Path issues — run from host
Burner container workflow	✅ Yes	❌ Not compatible
Ollama provider	`provider.ollama` with npm package	`modelProviders.openai[id=ollama]`
Credentials	Not required for Ollama	Requires `OLLAMA_API_KEY` env var

6. Replication Steps (Clean Setup)

# 1. Pull base model
ollama pull qwen3.5:4b

# 2. Create Modelfile (save as Modelfile.qwen3.5-4b-instruct)
# [see Section 3.1]

# 3. Build instruct variant
ollama create qwen3.5-4b-instruct -f Modelfile.qwen3.5-4b-instruct

# 4. Verify in Ollama
ollama list | grep qwen3.5-4b-instruct

# 5. Quick sanity check — should produce NO <think> tags
ollama run qwen3.5-4b-instruct "create a file called test.txt with hello world"

# 6. Add to opencode.json
# Under provider.ollama.models:
# "qwen3.5-4b-instruct:latest": {
#   "name": "Qwen 3.5 (4B) - Instruct No-Think · INS"
# }

# 7. Run sync script to keep config current
python3 sync-opencode-models.py --verbose

# 8. Launch OpenCode and test
opencode

7. Open Questions

Does Ollama strip <think> tags at the API boundary in newer versions? (Would make the Modelfile SYSTEM prompt partially redundant but harmless)
What is the practical context limit for qwen3.5-4b-instruct on 8GB VRAM before OOM?
Can qwen3.5:9b (on SSD) match 4B tool call reliability with the same Modelfile approach?
Is there a way to run Qwen Code inside the distrobox burner workflow with symlinked config paths?
At what task complexity does the 4B start failing vs needing the 9B?

Document generated: 2026-03-30
Part of: Project OpenCoder v3.0 — CGG R&D & BD Unit

OpenCoder Qwen35 Agentic Setup 260330