A user opened a support thread saying their AI consultation had gone unresponsive. Every message they sent came back with an error. Refreshing didn't help. Starting a new tab didn't help. From their side, the conversation was dead.
The product is Codens Green, a PRD management tool where users hold long, iterative conversations with Claude to refine product requirements. Some of those conversations run dozens of turns. This particular one had thirty-something messages of history, all looking normal in the database. The row was there. The user was authenticated. The organization had credits. And yet every new message hit the API and bounced.
By the time we shipped the fix it was three layers deep, and only one of those layers is the "actual" fix. The other two were the kind of belt-and-suspenders you only put on once you've been burned. I want to walk through what we saw, what we tried first (which was wrong), what the real cause turned out to be, and the shape of the patch.
What the 400 BadRequest looked like
The backend log for the failing consultation looked like this on every request:
```
ERROR Failed to generate AI response: Error code: 400
{'type': 'invalid_request_error',
 'message': 'messages.17: text content blocks must be non-empty'}
```
Same error, same index, every time. The user retried, our code retried, the error didn't move. Index 17 was always index 17 because index 17 was sitting in their stored history.
I went down the wrong path first. The error code was 400, which felt like an auth-shaped problem, so I started there. Wrong key? The key was fine, every other org was working. Rate limit? No, this org wasn't anywhere close. Model deprecation? We were on a current model, and other consultations using the exact same model were responding normally. I checked the Anthropic status page. Green across the board. I checked our own credit-deduction logic to make sure we weren't somehow short-circuiting requests. Clean.
About forty minutes in I noticed the messages.17 part of the error and felt stupid. The API was telling me exactly which message in the array it didn't like. I just hadn't read it.
The real cause
I pulled the consultation row, parsed its messages JSON, and walked it. Most messages had a few hundred characters of content. Message 17, an assistant message, had content: "". Empty string. Not whitespace, not null, just empty.
Claude's API rejects requests where any message in the messages array has empty content. That's a hard validation at the boundary, not a soft failure. Which meant: the moment that empty message landed in the consultation's history, every future call was guaranteed to fail, because every future call assembled the full history and sent it back to the API. The conversation had been poisoned by one row.
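The poisoning mechanic can be sketched in a few lines. `validate_messages` below is our illustration of the API-side check described above, not Anthropic's actual code; the history contents are made up:

```python
# Hypothetical sketch of why one empty row poisons every later call.
# validate_messages mimics the API-side validation described above;
# it is our stand-in, not Anthropic's implementation.

def validate_messages(messages: list[dict[str, str]]) -> None:
    for i, msg in enumerate(messages):
        if not msg["content"].strip():
            raise ValueError(
                f"messages.{i}: text content blocks must be non-empty"
            )

# Seventeen healthy turns (indices 0-16), then the poisoned row at index 17.
history = [{"role": "user", "content": "Draft a PRD"} for _ in range(17)]
history.append({"role": "assistant", "content": ""})

# Every future turn replays the full history, so every call trips index 17.
new_turn = {"role": "user", "content": "Add a success metric"}
try:
    validate_messages(history + [new_turn])
except ValueError as e:
    print(e)  # messages.17: text content blocks must be non-empty
```

Note that retrying changes nothing: the failing index is determined by stored state, not by the new request.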
The user couldn't recover from inside the app. Our UI didn't expose a "delete message" affordance for this surface, and even if it did, the broken message was an assistant turn, not theirs to edit. From the user's perspective, the consultation just stopped working. Forever. With no error message that meant anything to them.
This is the worst kind of bug. It only surfaces for users with enough history to have triggered the rare condition that produced the bad row, the dashboards don't flag it (a 400 from Claude looks like an intermittent upstream failure if you don't drill in), and the root cause is invisible because it happened on some earlier request you weren't watching.
How an empty assistant message ever got saved
Once I knew what to look for, the chain was straightforward.
Claude's API occasionally returns a response where the assistant's text_content is empty. I don't have a great theory for why. Could be transient, could be an edge case in their content filtering, could be a race in how we parse content blocks when the response has tool-use blocks but no text blocks. It's rare. I'd guess less than one in ten thousand calls in our traffic. But across enough users and enough turns, "rare" becomes "guaranteed."
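The tool-use theory is easy to see in miniature. The stand-in `Block` dataclass below is an assumption about the shape of the SDK's content blocks, not the real type; it just shows how joining text over a response with no text blocks yields an empty string:

```python
# Sketch of the suspected parsing edge case: a response whose content
# holds only tool-use blocks. Block is a stand-in for the SDK's typed
# content blocks (an assumption, not the real class).
from dataclasses import dataclass

@dataclass
class Block:
    type: str
    text: str = ""

content = [Block(type="tool_use")]  # no text block at all

# Joining the text of text-type blocks over an empty selection gives "".
text_content = "".join(b.text for b in content if b.type == "text")
print(repr(text_content))  # ''
```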
Our previous code did approximately this:
```python
ai_result = await self._claude_client.generate_consultation_response(
    messages=messages,
    title=consultation.title,
    context=consultation.context,
)
ai_response = ai_result["response"]
# ...
consultation.add_assistant_message(ai_response, metadata=ai_metadata)
```
ai_response could be "". Nothing checked. The empty string flowed into add_assistant_message, got appended to the message list, and the entity got persisted. From that point forward, the consultation was permanently broken.
One unchecked write, two days earlier, became a permanent block on the user's account.
The three-layer fix
The patch split into three layers. Each one defends a different boundary, and only the middle one is what I'd call the real fix. The other two are there because the real fix doesn't help users who already have a poisoned row, and because I wanted to bound the failure surface.
Layer 1: filter on the way out
In the Consultation domain entity, get_messages_for_ai() is what assembles the array we send to Claude. The old version included every non-system message. The new version also excludes anything with empty or whitespace-only content:
```python
def get_messages_for_ai(self) -> list[dict[str, str]]:
    return [
        {"role": msg.role.value, "content": msg.content}
        for msg in self.messages
        if msg.role != MessageRole.SYSTEM
        and msg.content
        and msg.content.strip()
    ]
```
This is the layer that unsticks every existing poisoned consultation. We didn't run a data migration. We didn't write a one-shot cleanup script. The filter at read time simply skips the bad row on the way to the API, and the conversation works again. The bad row is still sitting in the DB, but it's never sent anywhere that would reject it.
I want to be honest about what this layer is and isn't. It's defensive. It papers over bad data. It does not prevent the bug from happening again. If you only ship this layer, you keep generating empty rows and keep skipping them, which is fine until something else relies on the history being complete (PRD generation from conversation summary, for instance) and now the user's PRD is missing a turn.
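Here is the same filter as a standalone sketch over plain dicts instead of the domain entity, so its behavior on a poisoned history is concrete (the function name and sample history are illustrative):

```python
# Standalone sketch of the read-time filter in get_messages_for_ai(),
# using plain dicts instead of the Consultation entity (names illustrative).

def messages_for_ai(messages: list[dict[str, str]]) -> list[dict[str, str]]:
    return [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] != "system"
        and m["content"]
        and m["content"].strip()
    ]

history = [
    {"role": "system", "content": "You are a PRD assistant."},
    {"role": "user", "content": "Draft a PRD"},
    {"role": "assistant", "content": ""},     # the poisoned row
    {"role": "assistant", "content": "   "},  # whitespace-only, also dropped
    {"role": "user", "content": "Add metrics"},
]
print(messages_for_ai(history))
# [{'role': 'user', 'content': 'Draft a PRD'}, {'role': 'user', 'content': 'Add metrics'}]
```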
Layer 2: detect on the way in
This is the real fix. In our Claude client wrapper, generate_consultation_response() now refuses to return an empty response at all:
```python
text_content = "".join(
    block.text for block in response.content if block.type == "text"
)
if not text_content.strip():
    raise ValueError("No text content in Claude API response")
```
If Claude hands us back a response with no text blocks (or only empty text blocks), we raise. The caller in AddMessageUseCase already has a try/except around the API call and falls back to a generic "sorry, please try again" message. Crucially, that fallback message goes to the user as a transient response. It does not get persisted as an assistant turn:
```python
try:
    messages = consultation.get_messages_for_ai()
    ai_result = await self._claude_client.generate_consultation_response(...)
    ai_response = ai_result["response"]
except Exception as e:
    logger.error(f"Failed to generate AI response: {e}")
    # "Sorry, an error occurred while generating the AI response. ..."
    ai_response = "申し訳ありません。AIからの応答の生成中にエラーが発生しました。..."
```
Wait, that's not quite right as stated. Look at the existing code and you'll see the fallback message does get persisted via add_assistant_message further down. That's a separate concern we'll come back to. What matters here is that with Layer 2 in place, the assistant message that gets stored on a failed call is either real text or our explicit, non-empty fallback string. It is never "". The DB cannot accumulate another poisoned row from this code path.
If you can only ship one of the three layers, ship this one. Defending at the output boundary, the moment data crosses from "external API response" into "thing we persist," is where bad data deserves to die. Filtering at read time is a workaround. Validating at write time is the fix.
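To make the guard's behavior concrete, here is a minimal self-contained sketch of it: a parser that returns real text or raises, never returns an empty string. As before, `Block` is an assumed stand-in for the SDK's content-block type:

```python
# Minimal sketch of the write-time guard: the response parser refuses to
# hand back an empty assistant turn. Block is a stand-in for the SDK's
# content-block objects (an assumption, not the real type).
from dataclasses import dataclass

@dataclass
class Block:
    type: str
    text: str = ""

def extract_text(content: list[Block]) -> str:
    text = "".join(b.text for b in content if b.type == "text")
    if not text.strip():
        raise ValueError("No text content in Claude API response")
    return text

# A normal response passes through unchanged.
assert extract_text([Block("text", "Here is the PRD.")]) == "Here is the PRD."

# A response with no text blocks raises instead of returning "".
try:
    extract_text([Block("tool_use")])
except ValueError as e:
    print(e)  # No text content in Claude API response
```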
Layer 3: bound the history
This one is technically a separate bug, but I shipped it in the same PR series because the user-visible symptom overlaps. Long consultations were starting to push against the context window, and a few users were seeing failures that looked similar (intermittent API errors on long-running conversations) but had a different cause.
So in AddMessageUseCase, we cap the history we send:
```python
MAX_HISTORY = 40
if len(messages) > MAX_HISTORY:
    messages = messages[-MAX_HISTORY:]
while messages and messages[0]["role"] != "user":
    messages = messages[1:]
```
Forty messages is roughly twenty user/assistant turns. The trailing slice gets the most recent context, which is almost always what matters. The while loop handles a Claude API requirement that conversations must start with a user role. If the slice happens to begin with an assistant message (because we truncated mid-turn), we drop the leading assistants until we find a user message.
Three things to flag about Layer 3. First, twenty turns is a product choice, not a technical limit; we picked it because our consultation UI doesn't show more than that comfortably anyway, and longer histories were producing diminishing returns on AI quality. Second, the first-user-role correction is a Claude-specific constraint. Don't carry this verbatim to a different provider without checking their docs. Third, this layer is unrelated to the empty-message bug. It's bundled in because the failure mode looks adjacent from a triage perspective, and shipping them together meant one round of regression testing instead of two.
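The cap and the leading-role correction can be exercised together with a small cap so the effect is visible (production uses 40; everything else here is illustrative):

```python
# Sketch of the Layer 3 history cap with a smaller MAX_HISTORY so the
# example fits. The leading-role correction is Claude-specific: the
# messages array must start with a user turn.

MAX_HISTORY = 4  # production uses 40

def cap_history(messages: list[dict[str, str]]) -> list[dict[str, str]]:
    if len(messages) > MAX_HISTORY:
        messages = messages[-MAX_HISTORY:]
    while messages and messages[0]["role"] != "user":
        messages = messages[1:]
    return messages

history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]

# The trailing slice starts on an assistant turn ("a1"), which the while
# loop then drops so the array begins with a user message.
capped = cap_history(history)
print([m["role"] for m in capped])  # ['user', 'assistant', 'user']
```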
The migration we didn't write
One thing I want to underline. Layer 1, the read-time filter, accidentally did the work of a data migration without being a data migration. Every existing poisoned consultation in our DB started working again the moment the deploy went out. No SQL to write, no rows to update, no offline job to run. The defensive layer absorbed the historical damage.
That's not always the right tradeoff. If we'd needed downstream consumers (analytics, PRD generation, exports) to see a complete history, leaving bad rows in place would have leaked into those features later. In our case the only consumer that read the bad message was the call to Claude itself, so filtering at read time was sufficient. But it's worth naming the pattern explicitly: a defensive read-side filter can serve as a zero-downtime migration for a class of bad data, as long as you're confident you've enumerated every reader.
What I'd take away
The thing I keep coming back to is that the cause of the user's problem (one empty cell, written two days earlier, somewhere on the request path) had nothing visible in common with the symptom they were experiencing (every new message fails with a 400 today). The signal that mattered was buried in the error message itself, and I spent forty minutes chasing API keys before I read it. Read the error.
The three-layer shape, defend on the way in, defend on the way out, bound the size, is general. It works for any case where you're persisting outputs from an external API and replaying them as inputs. Validate before you persist. Filter before you replay. Cap the surface.
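The three boundaries condense into one small pipeline. Everything below is a hypothetical sketch of the pattern, not our production code; all names are illustrative:

```python
# The general shape: validate before you persist, filter before you
# replay, cap the surface. Hypothetical sketch; names are illustrative.

MAX_HISTORY = 40

def persist_turn(history: list[dict[str, str]], role: str, content: str) -> None:
    # Validate before you persist: never store an empty turn.
    if not content.strip():
        raise ValueError("refusing to persist empty content")
    history.append({"role": role, "content": content})

def replay(history: list[dict[str, str]]) -> list[dict[str, str]]:
    # Filter before you replay: skip anything empty that slipped in anyway.
    msgs = [m for m in history if m["content"].strip()]
    # Cap the surface: bound how much history re-enters the request.
    msgs = msgs[-MAX_HISTORY:]
    while msgs and msgs[0]["role"] != "user":
        msgs = msgs[1:]
    return msgs

history: list[dict[str, str]] = []
persist_turn(history, "user", "Draft a PRD")
history.append({"role": "assistant", "content": ""})  # legacy bad row, bypassing the guard
print(len(replay(history)))  # 1
```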
If you're building anything with Claude: Codens Green is the product we build on this same stack.
Top comments (1)
This is a good example of how small runtime failures can turn into longitudinal system failures once interaction history becomes part of the execution surface.
One poisoned message breaking every future interaction is basically a replay continuity problem.
The interesting part is that the failure was not happening at generation time anymore. It became persistent because the invalid state kept surviving across future execution cycles.
That’s where things like Governance Telemetry, validation boundaries, and replay-safe persistence start becoming important operationally.
Especially in long-running agent systems where historical interaction state continuously re-enters runtime.