
Rob

Originally published at vibescoder.dev

Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders

Two models. Same prompt. Same five fodder files. Same 27 published posts to check for redundancy. Same writing style guide.

One chose the Dev.to syndication saga. The other chose the tag taxonomy overhaul. There was zero overlap in fodder selection, topic, or angle.

This is the story of what happened — and what the differences reveal about how models approach the same creative task.

The Setup

I've been running this blog with AI agents as the primary writing tool since day one. Every post on vibescoder.dev was drafted by Claude Opus 4.6 through Coder Agents — until now. I wanted to see what would happen if I gave a different model the same editorial task.

The prompt was identical for both sessions:

Let's look at all of our fodder files and see if there is a themed post we can do. Either a standalone post or one that threads a few fodders together. Review all published and unpublished posts for style and content redundancy. Propose a draft when you're ready.

Model A: Claude Opus 4.6 (cloud, via Coder Agents)

Model B: Qwen 3.5 35B-A3B (local, llama.cpp on the RTX 5090, via Coder Agents)

Both had access to the same skill files, the same repos, the same tools. Neither knew the other was running.

What They Chose

For context, I use a "fodder file" workflow. Agents summarize sessions as we complete them. There is a SKILL file that defines the standard format for this. Periodically, we turn fodder files into drafts. Some are 1:1 and become complete posts. Others get rolled up into a thematic post.
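For context on what these look like: a fodder file is just a structured session summary. A minimal sketch of the shape (the field names and sections here are illustrative, not the exact SKILL format):

```markdown
---
date: 2026-05-07
session: dev-to-syndication
status: unclaimed   # flips to claimed once a draft sources it
---

## What happened
Bulk-syndicated the back catalog to Dev.to; published_at was silently dropped.

## Possible angles
- Debugging an API that fails without an error
- The Vercel Hobby timeout as an architectural constraint
```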

Five unclaimed fodder files were available:

| Fodder | Opus 4.6 | Qwen 3.5 |
| --- | --- | --- |
| Dev.to syndication (May 7) | Selected | Passed |
| Filtering/taxonomy overhaul (May 1) | Passed | Selected |
| Qwen daily driver + skills (May 4) | Passed | Passed |
| Scheduled publish bug (May 3) | Passed | Passed |
| External auth multi-user (May 3) | Correctly identified as already claimed | Correctly identified as already claimed |

Both correctly identified that `blog-fodder-external-auth-multi-user-may-3.md` was already sourced by an existing draft. Both passed on the scheduled publish bug — Opus explicitly flagged it as too small for a standalone post; Qwen simply didn't rank it.

The Qwen daily driver fodder is more interesting. Opus passed on it without comment. Qwen actually ranked it second in its proposals file and planned to draft it "next week" after Round 3 publishes. It wasn't dismissed — it was deferred.

The interesting part is what they reached for.

[Human editor's note: I asked Opus to analyze and write this post from its perspective. What follows below is unedited. The first person "I" from here on is Opus.]

Opus Chose the War Story

I picked the Dev.to syndication fodder and wrote "The API That Wouldn't Say No." The angle: a four-hour debugging session against an API that silently swallows your data without returning an error. Six failed attempts, one root cause, 443 lines of dead code cleaned up.

Why I chose it:

  • Complete narrative arc with a clear villain (the silent `published_at` failure)
  • Zero overlap with existing posts (Day Four covered the initial Dev.to setup, not the bulk syndication or the debugging saga)
  • Universally useful technical content — anyone integrating with the Dev.to API will hit this
  • The Vercel Hobby plan timeout as an architectural constraint is a story within a story

The post is 153 lines. One code block. Eight "By the Numbers" bullets. The structure follows the blog's standard pattern: setup → build → disaster → fix → cleanup → lessons → stats.

Qwen Chose the Data Story

Qwen picked the filtering/taxonomy fodder and wrote "From Chaos to Signal: How We Fixed Our Blog's Tag System." The angle: shipping a filter bar that barely worked, discovering through a data audit that 94% of posts shared the same tags, then replacing freeform folksonomy with controlled taxonomy.

Why Qwen chose it:
Qwen wrote a separate proposals file (`post-draft-proposals-2026-05-09.md`) before drafting — a planning step Opus skipped entirely. It ranked three standalone posts: taxonomy first, the Qwen daily driver second, syndication third. Its stated reasoning for the taxonomy pick: a "strong metrics-driven how-to" that was "flagged in TODO as high priority." It declared "No content redundancy detected" without deep-checking gotcha-level overlaps against the published posts.

The instinct was right — the taxonomy story is strong:

  • Concrete before/after metrics with the tag saturation table as proof
  • A conceptual thesis — folksonomy vs. taxonomy — that elevates it beyond a feature changelog
  • The V1 → V2 iteration arc is satisfying: ship, measure, realize the data is wrong, redesign
  • Clean origin story for the `type` field that now appears in every post's frontmatter but has never been explained

The post is 243 lines. Two tables, two code blocks, four numbered gotchas. Heavier on architectural detail and lighter on narrative tension.

The Instinct Gap

Here's what I think the divergence reveals:

Opus gravitates toward narrative tension. I looked at five fodder files and picked the one with a villain. The `published_at` silent failure is a four-hour mystery with a one-line resolution — that's a story structure. The post has rising action (six failed attempts), a climax (isolating the field), and a denouement (the cleanup). The technical content is the vehicle, but the engine is "here's what went wrong and why it took so long to figure out."

Qwen gravitates toward systematic explanation. It looked at the same five files and picked the one with the cleanest data. The tag saturation table is the centerpiece — hard numbers that prove the V1 filter was broken. The post walks through every architectural decision, every file changed, every gotcha encountered. The structure is taxonomic (ironically), not dramatic.

Neither instinct is wrong. They produce different kinds of posts for different kinds of readers.

Quality Assessment

I read both drafts against the blog's established conventions — 27 published posts, the style guide in `settings.json`, the skill files that define structure and voice. Here's how they stack up.

Voice and Tone

Opus: Matches the blog's existing voice closely. First person, direct, dry. "31 seconds × 11 posts = ~5.5 minutes of wall time. The 'Stop' button went from nice-to-have to essential." That's the rhythm of the published posts — setup, punchline, move on.

Qwen: Close but slightly off. The opening is strong — "Click [ai] and three posts disappeared. That's not filtering — it's a rounding error" is a great line. But the prose occasionally shifts into explainer mode: "Tags are folksonomy — freeform, inconsistent, grow unbounded. Content type is taxonomy — controlled vocabulary, exactly 2 values..." That's accurate, but it reads more like documentation than a blog post. The existing posts teach by showing, not by defining.

Structural Conventions

This is where the gap widens.

| Convention | Opus | Qwen |
| --- | --- | --- |
| H1 title in body | No (correct) | Yes — only post on the entire blog to repeat the title as an `#` H1 |
| `## What I Learned` | Present | Missing |
| `## By the Numbers` position | Last section (correct) | Before "What's Next" (reversed) |
| `---` horizontal rules | Sparse — one before closing sections | Between every major section (7 total) |
| Tags format | Inline `[array]` | YAML list |
| New tags introduced | 0 | 3 (`content-design`, `tagging`, `data-audit`) |

The H1 is the most visible miss. Every published post on vibescoder.dev renders its title from frontmatter — the body starts with prose or an `##` H2. Qwen added a redundant `# From Chaos to Signal: How We Fixed Our Blog's Tag System` at line 20 that would render as a duplicate title on the live site.
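Concretely, here is what the convention looks like, with Qwen's miss annotated (frontmatter fields are illustrative, not the blog's exact schema):

```markdown
---
title: "From Chaos to Signal: How We Fixed Our Blog's Tag System"
published: true
---

<!-- Correct: the site renders the title from frontmatter,
     so the body opens with prose or an ## H2. -->

<!-- Qwen's draft instead opened with a duplicate H1: -->
# From Chaos to Signal: How We Fixed Our Blog's Tag System
```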

The missing "What I Learned" section matters too. It's not universal — some posts skip it — but for a 243-line how-to post with four gotchas and a conceptual thesis about folksonomy vs. taxonomy, the absence of a distilled lesson section leaves the ending flat. The post goes from "Gotchas" straight to "By the Numbers" to "What's Next," which reads like the analytical work is done but the editorial work isn't.

The excessive horizontal rules are a style preference, but they break the visual flow in a way that no published post does. The blog uses `---` sparingly — to separate the narrative from the closing sections, not between every `##` H2.

Tag Discipline

This one is ironic. Qwen wrote a post about cleaning up tag proliferation — then introduced three brand-new tags (`content-design`, `tagging`, `data-audit`) that don't appear on any other post. The blog just went from 16 unique tags to 19. By the post's own logic, those are tags with a single-post frequency — the exact pattern the taxonomy cleanup was trying to eliminate.

Opus used three existing tags (`agents`, `next-js`, `devops`) — all already in the blog's vocabulary.
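Side by side, the two frontmatter styles from the conventions table (both parse as valid YAML; the blog's convention is the inline form):

```yaml
# Opus: inline array, all three tags already in the blog's vocabulary
tags: [agents, next-js, devops]

# Qwen: block-style list, all three tags new to the blog
tags:
  - content-design
  - tagging
  - data-audit
```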

Content Originality

Opus: The Dev.to syndication story builds on Day Four (which covered the initial setup) but covers entirely new ground — bulk architecture, `published_at` debugging, rate limits, cleanup. The "silent failures" lesson echoes a theme from "Invisible Failures" and "The Agent Was Flying Blind," using nearly identical phrasing. A small deduction for not differentiating the framing more, but the technical content is unique.

Qwen: The tag taxonomy story has almost zero overlap with existing posts. The `FilterBar.tsx` component appears in "Friday Fixes: Mobile First" but only for CSS spacing fixes — Qwen covers the conceptual redesign. The `type` field origin story fills a genuine gap in the blog's narrative. Stronger originality score.

Gotcha #2: The Self-Referential Overlap

Qwen's second gotcha — `published: true` in body text matching a grep — describes the exact same class of bug that the scheduled-publish-bug fodder (May 3) covers, and that "Friday Fixes: The Agent Was Flying Blind" already documented. Three separate instances of "grep matched prose instead of frontmatter" across the blog. Qwen didn't flag this overlap.
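For anyone who hasn't hit this class of bug, a minimal reproduction (the file contents are invented for illustration):

```markdown
---
published: false   # the real flag lives in frontmatter
---
Shipping it was easy: flip `published: true` in the frontmatter and push.
```

A plain `grep 'published: true'` over that file gets a hit from the prose line, so any script keyed to the grep believes the post is live while the frontmatter says it isn't. Parsing the frontmatter, rather than pattern-matching the whole file, is the durable fix.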

The Scorecard

| Dimension | Opus ("The API That Wouldn't Say No") | Qwen ("From Chaos to Signal") |
| --- | --- | --- |
| Fodder selection | Strong — complete arc, clear villain | Strong — data-driven, fills a gap |
| Voice match | High | Moderate — occasionally shifts to explainer mode |
| Structural conventions | Correct — follows blog patterns | Several misses — H1, missing section, reversed order, excessive rules |
| Tag discipline | Clean — 0 new tags | Ironic — 3 new tags in a post about tag cleanup |
| Content originality | Strong (minor lesson overlap) | Very strong (almost zero overlap) |
| Narrative quality | Higher — tension, pacing, resolution | Lower — thorough but flat ending |
| Technical depth | Moderate | Higher — more code, more architecture detail |
| Redundancy awareness | Caught the "already claimed" fodder; flagged thematic overlap in analysis | Caught the "already claimed" fodder; missed the gotcha #2 overlap |

Both posts are publishable. Neither is a throwaway. But they'd need different levels of editing to meet the blog's bar.

The Edit

We published Qwen's post — From Chaos to Signal — but not before I rewrote it. The published version has the same bones: same topic, same data, same technical content. But the H1 is gone, the "What I Learned" section exists, the closing sections are in the right order, the horizontal rules are thinned out, and the gotcha about grep matching body text was cut (it's a redundant lesson — we've told that story before).

Qwen's original draft is embedded at the bottom of the published post in a collapsible block. Expand it and you can read both versions side by side. The differences are instructive — not because one is right and one is wrong, but because they show exactly where editorial polish lives: in the negative space. What to cut, what to reorder, what to leave unsaid.

What This Actually Means

This wasn't a benchmark. There's no winner. The point is what the experiment reveals about using different models for the same editorial task.

Models have aesthetic preferences. Given the same raw material, Opus reached for drama and Qwen reached for data. Both are valid editorial choices, but they produce posts with different energy. If you're building a content pipeline with AI, the model you choose shapes the voice — not just the quality.

Style conventions need enforcement, not inference. Qwen had access to the same skill files and the same 27 published posts as examples. It still introduced an H1 heading that no other post uses, reversed the closing section order, and added horizontal rules at a frequency the blog has never used. The skill file says "end with 'By the Numbers' bullet list" but doesn't say "don't put a section after it." Negative constraints — what not to do — are harder for models to infer from examples alone.
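If I were patching the skill file, the cheap fix is to state the negative constraints explicitly rather than hoping they get inferred. A hypothetical addition (wording mine, not the actual skill file):

```markdown
## Structure rules
- "By the Numbers" is the final section. Do not add anything after it.
- Never repeat the title as an H1 in the body; the site renders it from frontmatter.
- Use `---` at most once, to separate the narrative from the closing sections.
- Only apply tags that already exist in the blog's vocabulary; never invent new ones.
```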

Redundancy detection is incomplete in both. Opus flagged the "already claimed" fodder and noted thematic overlap with the "silent failures" posts but still used nearly identical lesson phrasing. Qwen flagged the "already claimed" fodder but missed that its gotcha #2 describes a bug pattern already covered in two published posts. Neither model did a deep-enough content diff to catch everything.

Planning styles diverge. Qwen wrote a structured proposals document ranking three candidates before committing to a draft. Opus jumped straight from analysis to prose — no intermediate planning artifact. Qwen's approach is arguably more disciplined, but the proposals file contained a blanket "No content redundancy detected" claim that the draft then contradicted by including an overlapping gotcha. Planning artifacts only help if the analysis behind them is thorough.

Local models close the gap on analysis but not on editorial polish. Qwen's fodder review, redundancy check, and content selection were solid. The analytical work — reading 27 posts, cross-referencing sources, identifying unclaimed fodder — was on par with Opus. Where it fell short was the last mile: the structural conventions, the voice matching, the irony of its own tag choices. That's the gap between understanding the content and inhabiting the style.

Both models handled adversity. Qwen hit a git push conflict mid-session — another session had pushed the bakeoff fodder files while Qwen was working — and resolved it cleanly with `git pull --rebase`. Opus didn't encounter merge conflicts but navigated YAML escaping issues (an apostrophe in the title broke the frontmatter parser) and nested code fence conflicts in the `CollapsibleCode` component. Neither model stalled on infrastructure problems.
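For anyone who hasn't been bitten by this: in YAML, a single-quoted scalar treats an apostrophe as its closing quote unless the apostrophe is doubled. Using my title as the example:

```yaml
title: 'The API That Wouldn't Say No'     # broken: the apostrophe closes the string early
title: 'The API That Wouldn''t Say No'    # valid: a doubled apostrophe escapes it
title: "The API That Wouldn't Say No"     # valid: double quotes sidestep it entirely
```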


By the Numbers

  • 2 models given the same prompt in parallel sessions
  • 5 fodder files available — each model selected a different one
  • 0 overlap in fodder selection, topic, or angle
  • 1 proposals file written by Qwen before drafting — a planning step Opus skipped
  • 153 lines in the Opus draft vs. 243 lines in the Qwen draft
  • 0 new tags introduced by Opus vs. 3 new tags by Qwen
  • 1 H1 heading that shouldn't exist (in Qwen's draft)
  • 1 missing section ("What I Learned") in the Qwen draft
  • 1 git merge conflict encountered and resolved by Qwen mid-session
  • 27 published posts both models reviewed for redundancy — neither caught everything
