Ali Afana
I Was About to Rewrite My Chat Router. The Bug Was Two Lines in a Prompt.

Marketing copy mistaken for product catalogs

TL;DR: A customer asked my AI sales bot "what do you have?" and the bot listed product categories the store doesn't sell. My instinct was to rewrite the search router, and I spent twenty minutes gearing up to do exactly that. Then I traced where the hallucinated category list was actually coming from: not the search results, not the database, not the router. It was coming from the store's "About" text — which the system prompt was injecting as Store: ${store.description}. The model read that label as a catalog header and treated the marketing copy as inventory truth. The fix was changing one label string from Store: to About the store (brand voice / background — NOT a product catalog): and adding one CRITICAL rule. Zero changes to the architecture.


The Bug

I run a multi-tenant AI sales chatbot platform. One of the test stores sells men's casual clothing — shirts, pants, the basics. Its description field, the marketing blurb the merchant types on signup, reads something like:

"Modern men's wardrobe. From sharp business shirts to weekend essentials, suits, shoes, and everything in between."

Standard SEO-friendly copy. Reads fine on the storefront page.

A test customer asks the chatbot:

"Hey, what do you have?"

The bot replies:

"We've got a full men's wardrobe — business shirts, weekend essentials, suits, shoes, and everything in between. What are you in the mood for?"

Customer:

"Great, I'll take a suit."

The store has zero suits. Has never sold a suit. The product table has thirty-four rows; none of them are suits. The bot just promised something the catalog can't deliver. The customer escalates, asks for sizing, and now there's a trust break two messages into the conversation.

I have seen this kind of bug before. I had a whole architecture in place to prevent exactly this.


The Architecture I Was Sure I'd Have to Rewrite

When the customer's message hits a generic phrase like "what do you have" or "show me everything," my chat router doesn't call a freeform "describe the store" prompt. It branches into a dedicated path that pulls the actual product table, builds a category breakdown — { "Shirts": 18 items, $20-$60 }, { "Pants": 12 items, $30-$80 } — and feeds that into the response model as the source of truth.
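
Roughly, the shape of that branch looks like this (a simplified sketch; the function and field names are illustrative, not the production code):

// Simplified sketch: illustrative names, not the production router.
const GENERIC_CATALOG_PHRASES = [/what do you have/i, /what do you sell/i, /show me everything/i];

function isGenericCatalogQuery(message) {
  return GENERIC_CATALOG_PHRASES.some((re) => re.test(message));
}

// Build the category breakdown straight from the product table,
// so the response model is grounded in real inventory, not prose.
function buildCategoryOverview(products) {
  const overview = {};
  for (const p of products) {
    const entry = overview[p.category] ?? { items: 0, minPrice: Infinity, maxPrice: 0 };
    entry.items += 1;
    entry.minPrice = Math.min(entry.minPrice, p.price);
    entry.maxPrice = Math.max(entry.maxPrice, p.price);
    overview[p.category] = entry;
  }
  return overview; // e.g. { Shirts: { items: 18, minPrice: 20, maxPrice: 60 }, ... }
}

if (isGenericCatalogQuery(message)) {
  // getProductsForStore is a hypothetical DB helper; the overview it produces
  // is what the response call receives as the source of truth.
  const overview = buildCategoryOverview(await getProductsForStore(store.id));
}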

The architecture is deliberate. I wrote about it before: prompt engineering controls tone, architecture controls behavior. If you want the model to never invent a product, don't beg it not to; give it search results and a tool contract that says "you can only reference what came back from this call." The grounded-LLM playbook.

So when I saw the bot recite suits and shoes for a store that has neither, my first instinct was the obvious one. The architecture must have broken. Either:

  1. The generic-phrase detection isn't firing, so we're falling through to the freeform path where hallucinations are possible.
  2. The category breakdown is returning wrong data — maybe pulling from another store, maybe miscategorizing.
  3. The search results are being clobbered somewhere between the SQL and the response prompt.

I started reading the router code with the intent to rewrite it. I had a branch open and a commit message half-typed before I stopped and did one thing first: I read the actual system prompt that was being sent to the model.


Where the Suits Came From

This is the relevant slice of the response-call system prompt as it was being assembled:

const desc = store.description ? ` Store: ${store.description}` : "";
const typeText = store.store_type ? ` Type: ${store.store_type}.` : "";
const countryText = store.country ? ` Location: ${store.country}.` : "";

const systemPrompt = `
You are the sales assistant for ${store.name}.${desc}${typeText}${countryText}
Search results for "${query}":
${searchResults}
...
`;

Look at the line that builds desc. The label is the word Store: followed by whatever the merchant typed into their description field.

Now look at what the model sees, in order:

You are the sales assistant for Diwan.
Store: Modern men's wardrobe. From sharp business shirts to weekend essentials, suits, shoes, and everything in between.
Type: Clothing & Fashion.
Location: Palestine.
Search results for "what do you have":
{ category_overview: { "Shirts": 18 items, "Pants": 12 items } }
...

The architectural defense — the real category overview — is there, lower in the prompt. It's correct. It's accurate. But two lines above it, there's another block of text labeled Store: listing categories that look like inventory: "shirts," "suits," "shoes."

The model has to decide which of those two sources to trust. The architecture was correct. The labels weren't.

The word Store: is not specific. The model doesn't know it's marketing copy. It reads exactly like the kind of label that introduces an inventory list, because in training data, structured labels followed by category-shaped text usually are inventory lists. Every Shopify product CSV header. Every catalog JSON. The model is doing exactly what its training pulls it toward.

The marketing blurb wasn't being treated as marketing. It was being treated as a catalog because it had been labeled like one.


The Fix: Two Lines

There was no architectural change. The router stayed. The search results stayed. The category-overview path stayed. Two edits to the prompt construction:

Edit one — relabel the injection:

const desc = store.description
  ? ` About the store (brand voice / background — NOT a product catalog): ${store.description}`
  : "";

The model now reads the description with an explicit epistemic frame. This text exists, but it is brand voice. It is not inventory. There is a different source for inventory below.

Edit two — add a CRITICAL rule to the response prompt:

CRITICAL: When the customer asks what you have / what you sell / your
catalog / "شو عندك" / "إيش عندكم" ("what do you have"), list ONLY
categories that appear in the search results. NEVER enumerate categories
from the store background or description text. The background is brand
voice; the search results are inventory truth.

That's the entire fix. Same architecture, same database, same router branches, same tool contract. The bug closed. The bot stopped offering suits the store doesn't sell.
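
For contrast with the render above, this is roughly what the model now sees for the same conversation (abbreviated; the exact wording and where the rule sits in the prompt are illustrative):

You are the sales assistant for Diwan.
About the store (brand voice / background — NOT a product catalog): Modern men's wardrobe. From sharp business shirts to weekend essentials, suits, shoes, and everything in between.
Type: Clothing & Fashion.
Location: Palestine.
CRITICAL: When the customer asks what you have / what you sell / your catalog, list ONLY categories that appear in the search results. NEVER enumerate categories from the store background or description text.
Search results for "what do you have":
{ category_overview: { "Shirts": 18 items, "Pants": 12 items } }
...

The description is still there and still shapes tone, but the only block that reads as inventory is the search results.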


Architecture vs Prompt Is the Wrong Dichotomy

There's a clean-sounding mental model that goes: "if the bug is the model behaving badly, change the architecture; if the bug is the model sounding wrong, change the prompt." I've written and quoted versions of that myself.

It's not wrong, exactly. It's just not the right axis when you're sitting in front of an actual bug, three minutes from typing git checkout -b rewrite-search-router.

A better question to ask first:

Where, in the bytes I send the model, does the wrong information live?

Not "is my architecture sound." Not "is my prompt strict enough." Where, literally, on the screen, are the suits coming from?

In my case, the suits were in the prompt — in a string I'd inserted myself, with a label that the model was perfectly entitled to interpret as a catalog. The architecture was clean. The search was clean. The defense was clean. I just hadn't been careful about what frame I gave the model for each block of context I passed in.

The general pattern, which I now check on every grounded-LLM bug:

  1. Trace the output back to a span of bytes in the prompt. Not metaphorically — literally find the substring the model echoed. Is it from searchResults? From store.description? From an example in a few-shot block? From an old conversation summary you forgot was being passed? (There's a short sketch of this after the list.)
  2. Look at the label that introduces that span. Store: is not a label, it's a noise word. About the store (brand voice / background — NOT a product catalog): is a label. Specificity here is grounding.
  3. Check whether another span in the same prompt contains the correct answer. If yes, the bug is precedence, not absence. The model has both truths in front of it and picked the wrong one because the wrong one had higher epistemic weight from its labeling.
  4. Only then ask if the architecture needs changing. Usually it doesn't.
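
Steps 1 through 3 are mechanical enough to script. A minimal sketch, assuming you keep each context block as a named field before concatenating the prompt (names are illustrative):

// Keep every context block as a labeled span, then ask which span
// an echoed phrase actually came from. Illustrative names only.
const spans = {
  storeDescription: store.description ?? "",
  searchResults: JSON.stringify(searchResults),
  conversationSummary: summary ?? "",
};

function findSourceSpans(echoedPhrase) {
  const needle = echoedPhrase.toLowerCase();
  return Object.entries(spans)
    .filter(([, text]) => text.toLowerCase().includes(needle))
    .map(([label]) => label);
}

findSourceSpans("suits");
// -> ["storeDescription"]: the hallucinated bytes live in the description,
//    not in searchResults, so the bug is labeling and precedence, not retrieval.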

The first time I ran this checklist, the "two-line fix" only existed because I'd already written the architectural defense months earlier. The category-overview path was the truth I needed the model to use. The prompt was just calling something else "Store:" right above it and letting the model decide.


The Inversion

I've published before that prompt engineering controls tone and architecture controls behavior. That's still true. But there's a second half I want to write down, because I keep relearning it:

Architecture builds the truth. The prompt decides whether the model believes it.

You can have a flawless retrieval pipeline, a tool contract, a typed search result, a JSON-mode response constraint — and the model will still output a hallucination if the prompt above the truth says, in any voice, "here's the inventory" while pointing at the wrong block.

The two layers aren't in opposition. They're stacked. Architecture is what you make available to the model. The prompt is how you label what you made available. If the labels are vague, the model fills in the meaning from its training, which usually means it picks the most common interpretation — and the most common interpretation of Store: followed by category-shaped prose is "this is the store's inventory."

When the bug looks architectural, check the prompt. When the bug looks like a prompt problem, check what context is reaching the model. The bug almost always lives at the seam between the two, not inside one of them.


The Takeaway

You don't have to choose between "fix the architecture" and "fix the prompt." That dichotomy will burn afternoons.

Ask one question before you reach for either tool: where, in the bytes I'm sending, does the wrong answer come from?

For me, it was the marketing description. Wearing a catalog label. Sitting two lines above the real catalog. The model wasn't wrong to read it that way. I was wrong to label it that way.

The fix was a string rename. Twenty minutes of diagnosis, two lines in a prompt. The architecture I almost rewrote was already correct.

Top comments (8)

Max Quimby

This is the single most under-appreciated debugging skill for LLM systems: before you touch any code, point at the exact span of text the model is reading and ask whether you would draw the same conclusion from it. I lost a full afternoon last month on a similar one — a system prompt that said "Customer:" before each turn, then later "User profile:" in front of metadata. The model started attributing the user's bio to the customer's intent. Same shape as your Store: label collision.

The label-as-contract framing is gold. We started running a habit where every prompt field that touches the model has to answer two questions explicitly: what kind of thing is this (catalog, voice, history, hint), and what is the model forbidden to do with it. It feels heavy at first but it kills exactly this category of bug. Curious whether you've added eval cases for the regression — once you've seen a "suits" hallucination once, you basically have a free seed for an adversarial test set.

Ali Afana

The "Customer:" / "User profile:" collision is exactly the same shape — different surface, identical bug. Two labeled fields competing for the model's attention, the more specific frame loses. Adding it to my mental catalog.
The two-question contract is the part that generalizes hardest. My article's step 2 ("look at the label that introduces the span") was diagnostic — it tells you where to look once the bug exists. Your contract is prescriptive — it tells you what to write so the bug never gets written in. The "what kind of thing is this" half is what I patched with the rename. The "what is the model forbidden to do with it" half is what I didn't have, and that's the half that kills the bug class, because it forces you to articulate the failure mode at the moment of injection, when the labeling decision is still hot. Xidao's identity-vs-reference section split (thread above) is the structural twin: sections give you the slots, contracts give you the spec for what each slot is allowed to mean.
On eval cases: you're right, and I owe my future self this. No harness yet. But I think the seed isn't literally "suits" — it's the shape "marketing prose containing noun-list category language injected into a labeled field." Generalize that and you have a test class, not a test case. The wrinkle (chewing on it in Xidao's thread) is that the adversarial inputs aren't from attackers — they're from merchants writing well-intentioned SEO copy. The accidentally-adversarial shape is the production distribution, which makes synthetic generation harder than it looks. The real test set has to be drawn from real tenant data.

Xidao

This is a great writeup of a class of bug that is really underappreciated in LLM-powered products. The variable labeling issue you describe — where the model treats any labeled context as structured truth — is something I have hit multiple times in multi-tenant setups.

The fix of renaming the label to explicitly disclaim its purpose is clever and pragmatic. One additional pattern I have found helpful is splitting system context into two clearly separated sections: one for "identity and rules" (always authoritative) and one for "reference data" (explicitly marked as non-authoritative background). Even then, some models will still occasionally blur the boundary, so I tend to add a lightweight output validation layer as a second line of defense.

Do you have a set of adversarial test cases for this kind of context injection ambiguity, or was this caught purely through manual testing? Curious how you approach regression testing for prompt-level bugs like this.

Ali Afana

The identity-and-rules vs reference-data split is cleaner than what I did. I fixed at the label level; you're describing the fix at the section level. The label rename worked because there was one offending injection. The section split is what survives once you're injecting five or six context blocks and any of them could quietly read as authoritative.
Honest answer on testing: this was caught manually — and only because I'd developed a habit of reading the literal prompt the model sees before touching the architecture. The "CRITICAL: list ONLY categories from search results" rule is the regression patch, but it's a static guard in the prompt, not a test suite. No adversarial harness yet. Closest thing is dogfooding against real merchant descriptions in dev mode, which catches some of these and obviously misses the long tail.
One wrinkle I keep getting stuck on with adversarial testing in multi-tenant: the reference data isn't mine. It's whatever the merchant types into their description field. Even with a clean section split, the words in the reference section can be accidentally adversarial — a merchant writes "we have everything from suits to shoes" as marketing copy and the model reads it as inventory. The injection isn't malicious; it's shaped like inventory because Shopify-style descriptions tend to be.
How do you build the adversarial set when the "adversaries" are also your tenants providing well-intentioned data? Pull from real tenant descriptions, or synthetic generation?

Vikrant Shukla

This is the most underrated debugging lesson in LLM systems: "infra-looking" symptoms are almost always prompt or schema issues in disguise. Every time we've been tempted to rip out an LLM router or vector store, a postmortem found the real cause in the prompt contract — ambiguous role descriptions, conflicting few-shots, or an instruction that quietly contradicted a tool spec. Two practices that save us hours now: (1) golden traces — store the exact (prompt, model, params, response) for every failure class and diff against current behavior on each prompt change, and (2) treat prompts as code with versioning, code review, and CI evals. The two-line fix you found will reappear in someone else's system next quarter if it's not enforced structurally.

Ali Afana

"infra-looking symptoms are almost always prompt or schema issues in disguise" — stealing that as the article's missing TL;DR.
What I'm taking from your comment: the 4-step checklist I ended on is the diagnostic layer. It catches the bug when you're staring at it. Golden traces + prompts-as-code is the prevention layer — it stops the next mislabel from reaching prod at all. They stack rather than compete. The article only covers diagnostics because that's where I currently am in my practice. Every postmortem I've published on Dev.to is essentially a failure-class snapshot in prose form — the manual version of what you've automated.
One thing I keep getting stuck on with eval-on-prompt-change for grounded LLMs: what's the diff criterion? Exact response match is too brittle (temperature noise, paraphrase drift). Structural match — "categories mentioned, tools called, hallucinated entities = 0" — feels closer, but it's harder to define generically across domains. How do you draw the line in your golden traces?

Vikrant Shukla

The trick is not picking one criterion — it's splitting the trace into three things and diffing each differently.
Deterministic structure (tool name, arguments, schema, retrieved IDs) → exact match. Most regressions land here before prose even matters.
Grounded content → set-membership against retrieval, not against the golden response. Your suits bug is a one-liner: mentioned_categories ⊆ retrieved_categories. Six to ten invariants like that probably cover most of a catalog product's failure surface.
Prose → rubric-scored judge model as a soft signal, never a gate. Skip embedding cosine; "suits" and "shirts" are neighbors in vector space, which is exactly the distance you need to resolve.
One thing I'd push on regardless: don't diff a single sample. Sample N at prod temperature and track invariant pass rate as a distribution; a prompt edit that moves hallucination from 1% to 6% looks clean on any single diff.
Your checklist already encodes this; CI just needs the span roles (catalog / voice / history / hint).

Ali Afana

mentioned_categories ⊆ retrieved_categories is the article's diagnostic step rewritten as a one-line invariant — exactly the upgrade from "where did the bytes come from" to "is the output bounded by valid sources." Banking that.
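In code, the check I'm picturing is something like this (names hypothetical, nothing wired up yet):

// Hypothetical invariant, not production code:
const mentioned = extractCategories(botReply);                   // e.g. ["Shirts", "Suits"]
const retrieved = Object.keys(searchResults.category_overview);  // e.g. ["Shirts", "Pants"]
const hallucinated = mentioned.filter((c) => !retrieved.includes(c));
// hallucinated.length === 0 is the gate; anything non-empty fails the trace.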
The cosine point is the one that subverts a default I would have reached for. The whole bug class lives at distances embedding similarity treats as "the same" — "suits" and "shirts" are neighbors in vector space because they're neighbors in the world, and that's exactly when the hallucination is most expensive to catch. Rubric-scored judge as soft signal is going on my list.
Distribution-over-samples is the one I haven't lived yet. Provia isn't in front of real merchants at scale, so today's "regression surface" is dogfood plus a small set of seed stores. The 1%-to-6% drift you're describing is invisible from where I'm standing — and that changes the moment the first paying tenant is on, which is when this stops being an article I'm bookmarking and starts being infrastructure I have to actually build.
Reading you next to Xidao's section split and Max's label contracts (other threads here), the convergence is striking — span roles, sections, contracts. Three different vocabularies for the same primitive: every piece of context the model sees needs an explicit role and an explicit forbidden-action. Your CI version is what enforces it at scale; the prompt-level version is what gets you to the starting line.