I asked AI to review its own code last week.
The code had a bug. An edge case. A variable name that made no sense.
The AI's review?
This code is...
Level 5 is the dream! I love that you defined it as a system rather than a moment. Most people forget that the 'review' continues the second the code hits production.
I’m really glad ArkForge brought up the hash-log idea in the comments, but your original point about the 'Human + AI Hybrid' is the real takeaway. AI catches the 'how' (syntax/patterns), but humans catch the 'why' (business logic). Thanks for giving us a vocabulary to explain to our managers why we still need time for manual PR reviews even with all these new tools!
Syed, you just named exactly why I wrote this. Thank you.
AI catches the how. Humans catch the why.
That's the whole thing in one sentence. I'm borrowing that.
And yes, the "vocabulary to explain to managers" piece is real. When someone says "why do we need manual review? AI can do it now," we can say: AI doesn't know your business logic. It doesn't know why this feature exists, why this edge case matters, or why this user behavior is assumed.
Level 5 as a system, not a moment, means review happens before merge AND after. Production is the final reviewer. It always has been.
Thank you for reading and for articulating the takeaway better than I did. 🙌
Great breakdown of AI code review into levels. I like this building-block approach to best practices. I would add that when a code change touches multiple files and systems, it becomes essential to make it easier for other team members to review the code by tagging them exactly on the relevant files/lines they need to focus on, and by providing clear testing instructions in the PR description.
Great addition, Julien.
Tagging reviewers on specific lines plus clear testing instructions in the PR description: that's the difference between "someone look at this" and "here's exactly what changed and how to verify it."
Level 4.5, maybe: human review with intentional coordination.
Thanks for this. 🙌
You are welcome! 4.5 level sounds good :D
Thank you for this valuable read! I think I needed to hear it since most of my colleagues are AI enthusiasts at levels 1-2, and I'm the party pooper who always says "naaah, hold your horses, we cannot trust that what it generated is 100% correct!" 😅
I place myself at level 4, not because of experience and learning from my own mistakes, but rather from the general mistrust. Your article gave me some food for thought, though. Thank you!
Klaudia, thank you for your honest comment.
Being the party pooper isn't a bad thing; it's a superpower. Every team needs someone who says "hold on, let's check that first."
You're at Level 4, whether it's from mistrust or experience. Either way, you're heading in the right direction.
Just remember: there's a difference between "I don't trust AI" and "I know exactly why this code is wrong."
Teams need people like you. Keep going. 🙌
Aaaaw, thank you for your kind words 🥹
I'd push back on "production-ready" as a destination. The AI still has no skin in the game at any level; it doesn't get paged when review misses a race condition. Review quality and accountability are different problems.
Great article! It really highlights what to watch out for when reviewing AI-generated code. In my project we’re still at level 0 — everything is reviewed by a human 😄
Sylwia, thank you. This means a lot coming from you. 🙏
Everything reviewed by a human isn't really level 0. That's the level most teams aspire to but rarely achieve. Human judgment at the end of the pipeline is still irreplaceable.
Thank you for reading and for the kind words. 🙌
Really like the framing.
I kept running into a slightly different problem: the hardest part wasn't reviewing the code - it was understanding what the model changed in the first place. In my case, a "small change" ended up rewriting half the repo.
What helped was defining boundaries before generation (in a spec).
It turned review into "compare against spec" instead of "reverse-engineer the diff".
Feels like this could sit as a "level 0" before the rest.
Kirill, this is a brilliant addition. Thank you. 🙏
Review turned into "compare against spec" instead of "reverse-engineer the diff."
That's the key line. Most of us don't write specs. We prompt vaguely, get vague output, then spend hours trying to figure out what the model actually did. The reverse-engineering tax is real and you've named it perfectly.
Defining boundaries before generation: this is the missing step. Not review after the fact; constraint before the fact. A spec isn't just documentation. It's a contract between you and the AI.
And you're right, this sits before Level 1. Level 0: Spec-First.
Thank you for this; it genuinely made the framework stronger. 🙌
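The "compare against spec" idea can be made mechanical. Here's a minimal sketch of one way to enforce a spec boundary before review even starts; the spec format, file paths, and function name are hypothetical illustrations, not anything from the article:

```python
# Sketch: fail fast when a "small change" touches files the spec never
# authorized. ALLOWED_PATHS would come from the spec written before
# generation; the paths below are made up for illustration.

ALLOWED_PATHS = {"src/auth/login.py", "tests/test_login.py"}

def out_of_scope(changed_files):
    """Return the changed files that fall outside the spec's boundaries."""
    return sorted(set(changed_files) - ALLOWED_PATHS)

changed = ["src/auth/login.py", "src/billing/invoice.py", "README.md"]
violations = out_of_scope(changed)
if violations:
    print("Spec violation - model also touched:", violations)
```

With a check like this in place, review starts from "does the change match the spec?" instead of reverse-engineering a sprawling diff.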
The Level 2 to Level 3 jump highlights a real structural problem: different models produce different errors, but you still have no proof the review actually ran. A cross-model pipeline where Claude catches what GPT missed is stronger, but the review itself is ephemeral. If the CI log says "3 models approved," how do you verify that later during an incident post-mortem?
One pattern that helps: hash the code snapshot + each model's raw review output into an append-only log. Not for compliance theater, but so your on-call team can trace exactly what each reviewer saw and said. Especially useful at Level 5 when incidents feed back into the process -- you need the original review context, not a reconstructed summary.
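The hash-the-snapshot-plus-review pattern described above could look something like this as a minimal sketch, assuming a JSONL append-only file and SHA-256; the field names and function name are illustrative, not from the article:

```python
import hashlib
import json
import time

def log_review(log_path, code_snapshot, model_name, review_text):
    """Append one review record with content hashes, so the exact code
    and raw model output can be replayed during a post-mortem."""
    record = {
        "ts": time.time(),
        "model": model_name,
        "code_sha256": hashlib.sha256(code_snapshot.encode()).hexdigest(),
        "review_sha256": hashlib.sha256(review_text.encode()).hexdigest(),
        "review": review_text,  # raw output, not a reconstructed summary
    }
    with open(log_path, "a") as f:  # append-only: existing lines never change
        f.write(json.dumps(record) + "\n")
    return record
```

The hashes let an on-call engineer verify later that the snapshot and review in the log are byte-for-byte what each model actually saw and said, rather than trusting a CI status line.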
ArkForge, this is the Level 5 detail I didn't include and should have. Thank you.
The review itself is ephemeral.
Yes. That's the hidden gap in every AI review pipeline. You can run the review. You can log the result. But if you can't replay what each model saw and said during an incident, you're debugging blind.
Your append-only log pattern is elegant. Hash the code snapshot plus the raw model output. Not for compliance; for tracing, so when something breaks at 2 AM, your on-call engineer knows whether the AI missed something or the human overrode it.
At Level 5, this isn't optional. The feedback loop only works if you have the original context. Reconstructed summaries lose signal.
I'm adding this. Thank you for the upgrade. 🙌
Strong framing. The part that resonated with me most was Level 3 vs Level 4 — cross-model review catches pattern-level mistakes, but the real production bugs usually show up when business context and invariants are fuzzy. As someone building AI products, I’ve found the best workflow is exactly what you describe: let the models surface candidates, then let a human challenge the assumptions before shipping.
Vic, you've articulated the exact gap I was trying to name. Thank you.
Cross-model review catches pattern-level mistakes. Real bugs come from fuzzy business context.
That's the difference between correctness and relevance. Level 3 tells you if the code works. Level 4 tells you if it should be doing what it's doing.
Models don't know your business invariants. They don't know which edge case will actually happen at 3 AM. They only know patterns.
Surface candidates, then challenge assumptions: that's the workflow. Models generate candidates. Humans stress-test the why.
This is exactly what Human+AI should mean. Not human as backup. Human as gatekeeper of context.
Thanks for this; it's going into my notes. 🙌
"Human as gatekeeper of context" — that framing is exactly right. Models are fast at generating candidates, but they have zero access to the implicit knowledge that lives in your head: the product decisions, the tech debt tradeoffs, the edge case your CEO mentioned once in a Slack thread. That context doesn't exist in any codebase. Level 4 is really about injecting that missing context back into the review loop.
Level 2 is such a trap. You feel responsible because you're using AI to review AI, but it's the same blind spots. I've definitely shipped code that passed self-review and then broke in prod. Now I just assume any AI review is missing something.
Kris, "shipped code that passed self-review and broke in prod" is the line that hurts because it's true.
Level 2 is a trap. Same blind spots. Same confidence. Same 2 AM page.
"Now I just assume any AI review is missing something" isn't pessimism. It's operational wisdom. Once you've been burned, you stop trusting the shortcut.
Thanks for the real-world take. 🙌
Wow. It is amazing, but I keep imagining how AI could be perfect at generating code without a single bug.
That's the dream, right? Perfect code, zero bugs, no review needed.
But here's the thing: AI generates code based on patterns it's seen before. Bugs live where patterns break: edge cases, unusual business logic, things the training data didn't cover.
So AI will get better at avoiding common bugs. But the subtle, context-specific ones? Those will still need human judgment.
The goal isn't perfect AI. It's AI that makes humans more effective at catching what only humans can catch.
Thanks for reading and for imagining the better future. 🙌
A useful framework showing how AI-assisted code review can evolve from casual guessing to reliable, production-grade validation.
Thank you, Laura. "Casual guessing to production-grade validation" is exactly the arc I was trying to capture.
Glad you found it useful. 🙌
Thank you for sharing, Harsh. Beautifully written.
Thank you, Urmila. Glad you liked it.
Recommended!
Appreciate that, thank you! 🙌
Thank you for this valuable read!
Glad you found it valuable; thanks for reading! 🙌