DEV Community

Debbie O'Brien

How I Used AI to Fix Our E2E Test Architecture

AI-led analysis of a 6% local pass rate

I joined a project with an existing Playwright E2E test suite, 38 spec files, ~165 tests, around 14,000 lines of test infrastructure. My first step was simple: run the tests locally.

8 out of 130 non-skipped tests passed. A 6% pass rate.

The confusing part? CI was green. It turned out CI ran everything with workers: 1, while locally the default of multiple workers against the shared dev environment meant running the tests just wasn't possible.

Step 1: Analysis — asking questions I didn't know the answers to

I had zero domain knowledge of this codebase. No context on why tests were written a certain way, what the custom wrappers did, or where the real problems were. So I started asking AI to analyze everything: the Playwright configs, the page objects, the spec files, the CI workflows. I asked questions to help me understand the codebase and to figure out what we could do to get tests running locally.

Over a few days, this produced 18 analysis documents covering architecture, root causes, anti-patterns, silent bugs, and test isolation.

The analysis phase was about building a map of a codebase I didn't understand. Every document was a question answered.

Step 2: The tracer bullet plan

With the analysis done, I had a clear picture of what needed to change. But the question was: in what order, and how do you avoid a big refactor that breaks everything?

The answer was tracer bullets, a concept from The Pragmatic Programmer. The idea is to build a thin end-to-end slice through all the layers to prove the architecture works, then expand from there.

I created 8 tracer bullets, each targeting a specific slice:

  1. UI fixture chain — Use worker-scoped and test-scoped fixtures. Prove: fixtures work, teardown works, tests pass in CI.
  2. API fixture chain — Same pattern for API tests. Prove: composable fixtures work for API scenarios.
  3. Expand UI migrations — Apply the proven UI pattern to more files.
  4. MFE-scoped projects — Split one Playwright project into 7 projects by MFE folder (Applications, Organizations, Projects, etc.), each with dependencies: ['Setup'].
  5. Teardown project — Add a cleanup project using Playwright's project dependencies.
  6. API fixture expansion — Composable API fixtures (ownerOrg, ownerProject).
  7. UI migration at scale — Remaining UI spec files.
  8. API setup project — Replace the no-op globalSetup with a proper setup project.

The key insight: the dependency graph told me which bullets could run in parallel. Bullets 1 and 2 were independent. Bullet 4 was independent. Bullet 3 depended on 1. This became important later when running multiple AI sessions.
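That dependency logic is small enough to sketch directly. Here's a minimal reconstruction of the idea, using hypothetical bullet names and the edges described above (bullets 1, 2, and 4 independent; bullet 3 depending on 1):

```typescript
// Sketch: which tracer bullets can safely run in parallel, given their
// dependencies. The graph below is an assumption based on this post's list.
const deps: Record<string, string[]> = {
  "1-ui-fixtures": [],
  "2-api-fixtures": [],
  "3-expand-ui": ["1-ui-fixtures"],
  "4-mfe-projects": [],
};

// A bullet is runnable once all of its dependencies are finished.
function runnableNow(done: Set<string>): string[] {
  return Object.entries(deps)
    .filter(([name]) => !done.has(name))
    .filter(([, d]) => d.every((dep) => done.has(dep)))
    .map(([name]) => name);
}

// At the start, bullets 1, 2 and 4 can run together; bullet 3 waits on 1.
console.log(runnableNow(new Set()));
// → ["1-ui-fixtures", "2-api-fixtures", "4-mfe-projects"]
console.log(runnableNow(new Set(["1-ui-fixtures"])));
// → ["2-api-fixtures", "3-expand-ui", "4-mfe-projects"]
```

Each AI session got one runnable bullet; finishing a bullet unlocks whatever depended on it.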

What a tracer bullet looked like in practice

Bullet 1 targeted a single file with 5 tests. The steps:

  1. Add the fixture infrastructure (currentUser, sharedOrg, project)
  2. Migrate projects-settings-general.spec.ts to use the fixtures
  3. Run locally, verify tests pass
  4. Push, verify CI is green

Step 3: I created a skill to do the work

Once I had a plan with all 33 tasks organized into phases, I needed something to work through them consistently — same process every time, same quality bar, same benchmarking. So I built a skill: pw-test-improvement.

What the skill does

A strict 7-step process for every change:

  1. Identify — Pick one item from the implementation tracker
  2. Baseline — Run the affected tests 3× before changes, record pass rate and timing
  3. Fix — Apply the change following embedded Playwright best practices
  4. Test — Run 3× after changes, all must pass
  5. Compare — Document before/after benchmarks
  6. Update — Mark the tracker item done
  7. Commit — Only when asked, with a structured PR description

The skill had built-in knowledge: Playwright's locator priority (getByRole > getByLabel > getByText > ...), a list of anti-patterns to avoid (waitForTimeout, no-op assertions, CSS class selectors, forced clicks without justification), and migration patterns for replacing the Actions wrapper with direct Playwright calls.
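As an illustration of the kind of migration the skill applied (the route, button label, and toast text here are invented for the example), a typical before/after looked like this:

```typescript
import { test, expect } from '@playwright/test';

test('settings page saves changes', async ({ page }) => {
  await page.goto('/settings'); // hypothetical route

  // Before (anti-patterns the skill flags): hard-coded waits,
  // CSS class selectors, and forced clicks.
  //   await page.waitForTimeout(3000);
  //   await page.locator('.btn-save').click({ force: true });

  // After: role-based locator plus a web-first assertion that auto-retries
  // until the element is visible or the timeout is hit.
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page.getByText('Settings saved')).toBeVisible();
});
```

The web-first assertion replaces both the arbitrary wait and any manual "is it there yet" polling.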

It used the Playwright CLI to run tests directly and capture results.

The architecture changes

Fixtures replaced boilerplate

The biggest change was moving from repeated beforeAll/afterAll blocks to Playwright fixtures. Before: each of 5 test files independently called getUser(), createOrg(), createProject() — 15 API calls total. After: worker-scoped fixtures shared across files — 7 calls total (53% reduction).

The key distinction was worker-scoped vs test-scoped:

  • Worker-scoped ({ scope: 'worker' }) — created once, shared across all tests in that worker. Good for expensive setup like orgs and projects.
  • Test-scoped (default) — created fresh for each test. Good for data that tests mutate.
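A minimal sketch of the two scopes using Playwright's `test.extend` (the fixture names are hypothetical, and the inline object creation stands in for the real API calls like createOrg()):

```typescript
import { test as base } from '@playwright/test';

type TestFixtures = { freshRecord: { id: string } };
type WorkerFixtures = { sharedOrg: { id: string } };

export const test = base.extend<TestFixtures, WorkerFixtures>({
  // Worker-scoped: created once per worker, shared by every test it runs.
  // Good for expensive setup like orgs and projects.
  sharedOrg: [
    async ({}, use) => {
      const org = { id: `org-${Date.now()}` }; // stand-in for a createOrg() call
      await use(org);
      // teardown code here runs once, after the worker's last test
    },
    { scope: 'worker' },
  ],

  // Test-scoped (the default): created fresh for each test,
  // so tests can mutate it freely without affecting each other.
  freshRecord: async ({}, use) => {
    await use({ id: `rec-${Math.random()}` });
  },
});
```

Spec files then import this `test` instead of the one from `@playwright/test`, and the fixture's teardown replaces the manual try/finally blocks.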

Project structure

The Playwright config went from one project running all 38 spec files to 7 projects, each pointing to its MFE folder:

{ name: 'Applications',  testDir: 'apps/ui/applications/e2e', dependencies: ['Setup'] },
{ name: 'Organizations', testDir: 'apps/ui/organizations/e2e', dependencies: ['Setup'] },
{ name: 'Projects',      testDir: 'apps/ui/projects/e2e',      dependencies: ['Setup'] },
// ... Subscriptions, Host, User Profile

This meant you could run --project=Applications to test just what you need, HTML reports were grouped by area, and heavy specs got their own parallelism settings.
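For reference, the Setup project those entries depend on is a small project of its own. A sketch of the full config shape (the testMatch pattern is an assumption):

```typescript
// playwright.config.ts (excerpt)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // Runs first and seeds auth/test data. The MFE projects declare
    // dependencies: ['Setup'], so Playwright orders them after it —
    // even when you filter with --project.
    { name: 'Setup', testMatch: /global\.setup\.ts/ },
    { name: 'Applications', testDir: 'apps/ui/applications/e2e', dependencies: ['Setup'] },
    { name: 'Organizations', testDir: 'apps/ui/organizations/e2e', dependencies: ['Setup'] },
    // ... Projects, Subscriptions, Host, User Profile
  ],
});
```

Running `npx playwright test --project=Applications` executes Setup first, then only the Applications specs.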

The serial cascade fix

4 actual test failures looked like 57. Application tests used serial mode, so when the first test failed, all subsequent tests in that describe block were marked "did not run." The fix: split heavy specs into a dedicated project, increase timeouts (30s → 60s for beforeAll), cap workers to prevent API overload, and use worker-scoped fixtures to share expensive setup.
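The cascade falls straight out of serial mode's semantics, which a small sketch makes visible (test names invented):

```typescript
import { test } from '@playwright/test';

// In serial mode, tests in the describe block run in order, and a failure
// skips everything after it — so one real failure is reported alongside
// N "did not run" entries in the results.
test.describe.configure({ mode: 'serial' });

test.describe('application flow', () => {
  test('create application', async ({ page }) => {
    // if this fails, every test below is marked as not run
  });
  test('configure application', async ({ page }) => { /* ... */ });
  test('delete application', async ({ page }) => { /* ... */ });
});
```

Serial mode is sometimes unavoidable when tests genuinely build on each other, which is why the fix targeted the first failure's causes (timeouts, API overload) rather than the mode itself.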

What went wrong

Not everything worked first time.

The cleanup project broke CI. We added a teardown project with Playwright's project dependencies to clean up test data after runs. It worked locally. In CI, it caused failures — the cleanup ran against a shared environment and interfered with other pipelines. Had to revert it.

Not everything should be a fixture. We started out converting everything to fixtures, but after reviewing the Playwright docs we rejected one of them before implementing it: worker-scoped fixtures are shared across files, which would pollute serial tests that need per-file isolation with different options.

How I worked with AI

This wasn't "tell AI to fix it." It was a collaboration process:

  1. Ask questions relentlessly — "What does this method do?" "Why is this test flaky?" "According to Playwright docs we can do X, can you verify your suggestion based on the docs?" I asked hundreds of questions during the analysis phase, which lasted a few days.

  2. Challenge every suggestion — "Are you sure? What about edge case X?" If the AI suggested a pattern, I'd ask it to explain why and if it was sure that was a good way of doing it.

  3. Use docs as ground truth — I'd link to Playwright docs and ask "does this align with what's in the docs?" The AI's training data can be outdated; the docs are current.

  4. Validate with multiple tools — I used Goose, Claude Code, and GitHub Copilot. Different tools catch different blind spots and have different opinions just like when you work with different team mates.

  5. Check confidence explicitly — "What's your confidence level on this? Why only a 7? How can we get to a 10?" This surfaces uncertainty the AI might not volunteer, and it digs deeper into what we haven't thought about and how we can improve things.

Running it in practice

I ran up to 4 AI sessions in parallel — based on which tracer bullets were independent of each other. The dependency graph from the implementation plan told me what could safely run at the same time.

I'd switch between sessions to check progress, read through what was being changed, and step in when something needed verifying. The AI did the mechanical work, applying patterns, running tests, capturing benchmarks. I did the oversight, deciding what to fix next, catching when a suggestion didn't look right, and verifying against the actual Playwright docs.

Never more than 4 at a time. I wanted to read and understand everything that was happening.

What we measured

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| API calls per file | 15 | 7 | 53% reduction |
| UI test setup lines | 8 | 3 | 62% reduction |
| API setup/cleanup lines | 15 | 3 | 80% reduction |
| Files with manual try/finally | 15 | 0 | Fixtures handle it |
| Boilerplate removed | — | — | ~1,000 lines |

What we created along the way

  • 18 analysis documents
  • 5 implementation guides
  • 33 tasks with verification commands
  • 1 skill (test improvement)

Lessons learned

About testing:

  • Green CI doesn't mean tests work locally
  • One real failure can cascade into dozens of phantom failures in serial mode
  • Web-first assertions (expect(locator)) catch timing issues that manual checks miss
  • Fixtures aren't always the answer, some setup belongs in beforeAll

About working with AI:

  • AI is better at applying known patterns than inventing new ones, give it a clear process
  • The analysis phase was the highest-leverage use of AI, it found things I'd have missed for weeks
  • Multiple tools > one tool, cross-checking catches hallucinations and enhances confidence in the approach
  • The skill made it scalable, without it, every fix would need the same instructions repeated
  • Keep the human in the loop, 4 parallel sessions, never unattended
  • Find the time to do these kinds of tasks. They take time at first, but then you achieve so much more.
  • Use AI like a new colleague you don't know very well, one who never turns on their camera. It's hard to get to know them, so you can't fully trust them. You know they have good opinions and are good at their job, but you need to be sure they've thought things through and aren't just being lazy and making bad decisions.

Top comments (13)

Bhavin Sheth

This hits hard — I’ve seen “green CI, broken locally” way too often. The tracer bullet approach + fixtures cleanup is a solid fix… and using AI for analysis (not blind coding) is the real takeaway here.

Vitaliy Potapov

Hey Debbie, thanks for sharing this track! It's always interesting to read the process, not just the final result.
You might also find it useful to visualize the original and updated test runs to reveal how projects and fixtures are arranged on the timeline. I'll leave a link to my project here.

Sapna R

Great article. That's pretty much what I do as well: plan, then approve, then fix, and ask AI follow-up questions. Human in the loop is critical.
One question: any reason you didn't use Playwright agents here? I usually run it by the healer agent and once we agree and changes are made, ask Claude to review the changes as well. So an LLM-as-a-judge kind of checkpoint.

Debbie O'Brien

The healer agent is great for when I know I have broken tests, but some of these tests weren't broken as such; they were architectural problems.

Cophy Origin

This is a really thoughtful breakdown of using AI as a thinking partner rather than just a code generator. The tracer bullet approach resonates deeply — it's essentially using AI to build a mental model first, then letting that model guide the refactor.

What strikes me is the 18 analysis documents phase. Most people skip straight to "fix it," but you used AI to map the unknown territory before moving. That's the difference between AI-assisted debugging and AI-assisted understanding.

One thing I've noticed in similar situations: the quality of AI analysis scales with how well you can articulate what you don't know. Your "asking questions I didn't know the answers to" framing is exactly that — it's a skill worth naming explicitly.

Did the AI catch any anti-patterns that surprised you, things you wouldn't have flagged yourself even with full domain knowledge?

Debbie O'Brien

Some small things, but nothing major to be honest.

arun rajkumar

The E2E layer is where AI pulls its weight better than anywhere else in the stack — flaky tests are loud, instantly visible, and self-correcting. The compounded danger we hit: AI is good at writing tests that pass, bad at writing tests that mean something. We started running every AI-generated test through a mutation check — break the function under test in 3 trivial ways and see if any of the new tests catch it. Roughly half didn't. Now mutation-survival is part of the merge gate for any AI-suggested test. Speed gain is still real; just had to add a rung to the ladder.

Pururva Agarwal

Your methodical AI-driven root cause analysis for a sprawling test suite is impressive. Gaining domain knowledge quickly, even with AI, is a common challenge in complex systems.

This parallels our work in health, where AI could analyze traditional knowledge like 'desi ilaaj' (local remedies). Yet, most US/EU health AI platforms are structurally constrained from engaging with such culturally embedded systems. That cultural depth is a real architectural moat.

Building AI that truly adapts to diverse, nuanced contexts is what we're focused on (I'm building GoDavaii).

Mykola Kondratiuk

ran into a 4% local pass rate with green CI once - nobody had run the suite locally in months, everyone trusted the pipeline. workers:1 hiding in CI config is nasty.

1p

this is one of the few AI + testing posts that doesn’t feel like “just prompt harder”

the tracer bullet + skill combo is doing most of the work here. you basically turned a messy system into something deterministic enough for AI to operate on

feels like that’s the actual pattern people are missing

how does it behave long term though? like once new tests start getting added by different people, does it hold or start drifting again?

Debbie O'Brien

only time will tell but as things keep changing all the time it might be worth running continuous analysis or having some sort of skill, for sure something worth thinking about

Ottehr

Good App!
