
Code Pocket

Originally published at westoeast.com

FAQ schema and AI citation lift: measuring, then attacking, a positive finding

The first time we measured a citation lift from FAQ schema, my reaction was something like "great, write it up." That instinct is exactly how teams ship findings that don't hold. We waited, then we tried to break the finding. Part of it broke. Part of it didn't.

This is the report.

The initial finding

In a 12-client portfolio, across roughly 180 pages that already had FAQ-style content in the visible HTML and to which we added FAQ schema, we measured a 14% relative lift in A+B tier citations over an 8-week window after deployment. The control was an internal A/B-style split: roughly half of comparable pages on the same domains got the schema and half didn't, with assignment based on publication date (the older half got it, the newer half didn't) so the treated group wasn't biased toward fresher content.
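For anyone who hasn't deployed it: FAQ schema is just JSON-LD in the page head that mirrors the FAQ content already visible in the HTML. A minimal sketch of the kind of markup we're talking about (the question and answer text here are placeholders, not client content):

```python
import json

# Illustrative FAQPage JSON-LD; the question/answer text is a placeholder,
# not content from any client page.
faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is generative engine optimization?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Optimizing content so AI answer engines cite it as a source.",
            },
        },
        # one Question object per FAQ item already visible on the page
    ],
}

# The tag that gets injected into the page template alongside the visible FAQ.
print(f'<script type="application/ld+json">{json.dumps(faq_schema)}</script>')
```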

14% looked clean. The confidence interval was wide because the per-page citation counts were small, but the direction was consistent.
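For context on where that width comes from: the lift itself is simple arithmetic on per-page counts, and a bootstrap over those counts is one way to see the uncertainty. This sketch uses synthetic numbers and column names I made up for illustration, not our pipeline or data:

```python
import numpy as np
import pandas as pd

# Hypothetical per-page frame: A+B tier citation counts over the 8-week window,
# plus the date-based treatment flag (~half of comparable pages got schema).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "citations": rng.poisson(1.2, size=180),   # placeholder counts, not client data
    "has_schema": np.repeat([True, False], 90),
})

def relative_lift(data: pd.DataFrame) -> float:
    treated = data.loc[data["has_schema"], "citations"].mean()
    control = data.loc[~data["has_schema"], "citations"].mean()
    return treated / control - 1.0

point = relative_lift(df)

# Bootstrap the interval: with counts this small per page, the resampled lifts
# spread widely, which is where the "wide confidence interval" comes from.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(df), size=len(df))
    boot.append(relative_lift(df.iloc[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"relative lift: {point:+.1%}  (95% bootstrap CI: {lo:+.1%} to {hi:+.1%})")
```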

So we wrote it down and started recommending FAQ schema deployment as part of our standard GEO engagement (the agency I work with has been running GEO engagements since late 2025). And then I asked the team: what's the strongest argument that this finding is wrong?

Attempt 1: Was it really the schema, or was it the content?

Adding FAQ schema isn't a no-op. The pages that got schema had to have FAQ-formatted content. The pages that didn't get schema sometimes had less structured content, even if we'd told ourselves it was "comparable." When we re-coded the pre-schema pages for content structure (independent of schema), we found that about a third of the lift was probably attributable to content cleanup that happened at the same time. Not the schema itself.
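Mechanically, that re-coding turned the question into a stratified comparison: compute the lift only within pages with similar content structure, and see how much of the headline number survives. A sketch of the idea, with a hypothetical structure score and synthetic counts:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: per-page counts plus a content-structure code (0-5)
# assigned without looking at schema status. All values are synthetic.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "citations": rng.poisson(1.2, size=180),
    "has_schema": np.repeat([True, False], 90),
    "structure_score": rng.integers(0, 6, size=180),
})
df["stratum"] = pd.cut(df["structure_score"], bins=[-1, 1, 3, 5], labels=["low", "mid", "high"])

# Lift within each structure stratum, then weighted by stratum size:
# comparing like-with-like on content structure strips out the cleanup effect.
means = df.groupby(["stratum", "has_schema"], observed=True)["citations"].mean().unstack("has_schema")
per_stratum_lift = means[True] / means[False] - 1.0
weights = df["stratum"].value_counts(normalize=True)
print("structure-adjusted lift:", float((per_stratum_lift * weights).sum()))
```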

That dropped the schema-attributable lift to something more like 9-10%. Still positive, but smaller, and with even wider uncertainty.

Attempt 2: Does the lift persist across engines?

We re-ran the breakdown by engine. The lift was strongest in Google AIO (around 18% relative), moderate in ChatGPT with web on (about 11%), small in Perplexity (5-7%), and basically zero in Gemini. The portfolio average of 14% was carried by AIO, which makes intuitive sense: AIO is the most directly continuous with Google's existing structured-data pipeline. The other engines may parse schema, but they don't seem to weight it the same way.
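The per-engine numbers are the same lift calculation split by engine. Roughly something like this, with a hypothetical log format and synthetic counts:

```python
import numpy as np
import pandas as pd

# Hypothetical citation log: one row per (page, engine) over the same 8-week window.
engines = ["google_aio", "chatgpt_web", "perplexity", "gemini"]
rng = np.random.default_rng(2)
log = pd.DataFrame({
    "engine": np.tile(engines, 180),
    "has_schema": np.repeat(np.repeat([True, False], 90), len(engines)),
    "citations": rng.poisson(0.4, size=180 * len(engines)),
})

# Mean citations per arm, per engine, then the ratio: the portfolio-average 14%
# hides exactly this spread.
means = log.groupby(["engine", "has_schema"])["citations"].mean().unstack("has_schema")
lift_by_engine = means[True] / means[False] - 1.0
print(lift_by_engine.sort_values(ascending=False))
```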

So "FAQ schema lifts AI citations by 14%" is true in aggregate and misleading in detail. The honest version is "FAQ schema lifts AI citations primarily on Google AIO, with smaller lifts on ChatGPT, and unclear effects on Perplexity and Gemini."

Attempt 3: Does it survive over time?

Eight weeks is not a long window. We extended the tracking to 20 weeks for the subset of pages where we had clean data, and the AIO lift held steady. The ChatGPT lift compressed (from 11% to about 6%). Perplexity bounced around in a way we can't characterize confidently. Gemini stayed flat. We don't have a clean explanation for the ChatGPT compression. One hypothesis is that ChatGPT's training data ingestion changed over the window; another is that we're just looking at noise.

What we did wrong

We initially reported the 14% number to one client before doing any of the breaking-the-finding work. They made a budget decision partially based on it. That was premature. We've since shared the breakdown with them and the recommendation didn't change materially, but the timeline of how we communicated it wasn't great. The internal process change we made: any portfolio-level finding has to survive at least one structured "how would this be wrong" pass before it goes to a client. That's added about a week to our finding-to-recommendation cycle. It's worth it.

Attempt 4: Are FAQ-rich pages just better pages?

This was the attack I least wanted to run, because it threatened the cleanest part of our finding. The question: are the pages we'd marked up with FAQ schema systematically better than the pages we hadn't, on other dimensions that AI engines might reward?

We did a manual readability and quality audit of the schema-on and schema-off pages, blind to which was which (one team member assigned IDs, another ran the audit without knowing the schema status). The schema-on pages scored modestly higher on readability and structure metrics, on average. Not because of the schema, but because the schema deployment had been done by a team that also tended to do small content polish at the same time.
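The blinding is worth doing as a small script rather than by hand, so the key never sits in the auditor's inbox. Something along these lines (file names and fields are illustrative, not our actual tooling):

```python
import csv
import secrets

# One person runs this and keeps blind_key.csv to themselves; the auditor only
# ever sees audit_sheet.csv, which carries no hint of schema status.
pages = [
    {"url": "https://example.com/pricing-faq", "has_schema": True},
    {"url": "https://example.com/onboarding-guide", "has_schema": False},
    # ... remaining pages in the sample ...
]

rows = [{"audit_id": secrets.token_hex(4), **page} for page in pages]

with open("blind_key.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["audit_id", "url", "has_schema"])
    writer.writeheader()
    writer.writerows(rows)

with open("audit_sheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["audit_id", "url"])
    writer.writeheader()
    writer.writerows([{"audit_id": r["audit_id"], "url": r["url"]} for r in rows])
```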

When we statistically controlled for the audit quality score, the schema-attributable lift shrank again, to something more like 6-7%. Still positive in our sample, but now we were three attempts deep and the original 14% had been cut in half. The honest reporting framing became: "FAQ schema is associated with a citation lift, mostly on Google AIO, with effects in the 6-10% range after controlling for confounds we could identify."
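For the curious, "statistically controlled" here just means a regression with the schema flag and the audit quality score in the same model; the schema coefficient is what's left after the quality score takes its share. The exact model matters less than having the quality score in it. A sketch with statsmodels on synthetic data, using a Poisson regression because the outcome is a count (the model choice and column names are mine, not a description of our exact analysis):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic per-page data: citation counts, schema flag, and the blind audit's
# quality score. A small schema effect is built in so the example has something to find.
rng = np.random.default_rng(3)
flag = np.repeat([1, 0], 90)
quality = rng.normal(3.0, 0.8, size=180)
df = pd.DataFrame({
    "has_schema": flag,
    "quality": quality,
    "citations": rng.poisson(np.exp(-0.3 + 0.1 * flag + 0.2 * quality)),
})

# Log-link Poisson: exp(coefficient) - 1 reads as a relative lift, matching the
# percentages quoted above, with the quality score held constant.
model = smf.poisson("citations ~ has_schema + quality", data=df).fit(disp=0)
lift = np.exp(model.params["has_schema"]) - 1.0
print(f"schema-attributable lift, quality held constant: {lift:+.1%}")
```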

That's a far less marketable sentence than "FAQ schema lifts citations 14%." It's also closer to what we actually know.

What we're still unsure about

We have not run a clean RCT. Our split was based on publication date, which is a proxy for randomization and not a substitute for it. There may be a temporal confound we're not seeing.

We also haven't tested other schema types systematically. Article schema, HowTo schema, Organization schema — we have anecdotes but not data. Don't read this piece as "schema is good." Read it as "FAQ schema, specifically, in this portfolio, did this specific thing, mostly on AIO."

There's a deeper uncertainty: AI engines update their parsing pipelines without telling anyone. A lift we measure today might evaporate in three months if Google AIO changes how it weights structured data, or persist for years if it doesn't. Schema findings have an unknown shelf life. We try to remeasure quarterly on a smaller subset of pages, partly to catch this kind of drift early. We've seen one minor compression already (the ChatGPT effect mentioned above) that may be a precursor.

How we communicate findings to clients now

A practical change that came out of this exercise: our client reports now include a "confidence summary" section that explicitly names the attempts we made to break our own findings, the controls we did and didn't apply, and the range we'd defend versus the point estimate. It's three more paragraphs per report. Most clients read past them. The ones who care notice, and those tend to be the ones whose internal teams catch issues earliest and who are the most useful to work with long-term.

The agency I work with has, I think, gotten more cautious in its language partly because of findings like this one. We say "associated with" more than we used to. We say "in our portfolio, in this window" more than we used to. Some prospective clients prefer the agencies that say "X delivers Y." We've lost some pitches that way. The retention rate on the clients we do sign is, anecdotally, higher than it was when we were sharper-edged in our claims. I can't prove causation on that either.

The thing I want to flag for anyone reading this

If you measure something positive and your first instinct is to publish it, wait. Try to break it. Try harder than is comfortable. We now treat this as a standard part of our research process, partly because we've been embarrassed before by writing up findings that didn't survive replication.

If you've measured a schema effect in your own work and tried to break it the same way, what did you find? I'd genuinely like to know whether our 6-10% adjusted estimate is high, low, or just specific to our client mix.


This field report was published by **westOeast**, a B Corp certified marketing agency working on generative engine optimization for B2B SaaS. The methodology, framework, and data described here come from internal audits at westOeast across our client portfolio in 2025-2026. More field notes at westoeast.com.
