I Asked AI to Review Its Own Code. It Gave Itself 10/10.

Harsh

I ran a simple experiment yesterday.

I asked AI to write a function. Then I asked the same AI to review that function. Then I asked it to rate its own code.

The function was fine. Not great. Not terrible. It had an edge case bug. The variable names made no sense. There was an unnecessary loop inside that did absolutely nothing useful.

The AI's review?

"This code is clean, efficient, and well-structured. I'd give it a 10/10."

I stared at the screen for a second. Then I pushed back.

"Are you sure? What about the empty array edge case?"

It paused — that little blinking cursor moment. Then:

"You're right. Let me fix that."

It fixed the bug. Then gave itself 11/10.

That's when I stopped laughing. And started worrying.


Here's Exactly What I Did (So You Can Try It Yourself)

I kept it simple. Repeatable. No tricks.

Step 1: Asked AI to write a function that takes an array of numbers and returns the average.

Step 2: Asked the same AI — same conversation, same context — to review its own code for bugs, edge cases, and style issues.

Step 3: Asked it to rate the code from 1 to 10.

Here's what the code actually had wrong (a rough reconstruction follows the list):

  • Crashed on an empty array — classic divide-by-zero, completely missed
  • Used arr as a variable name inside a function that already had arr as a parameter — confusing
  • Had an extra loop that served no purpose at all
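
To make that concrete, here's a hypothetical reconstruction of roughly what the function and its later fix might have looked like. This is not the AI's literal output, and the language (Python) is my choice for illustration; the sketch just reproduces the three issues above.

```python
# Hypothetical reconstruction (not the AI's actual output) showing the
# three issues listed above.

def average(arr):
    # The pointless extra loop: rebuilds the input list without changing it
    cleaned = []
    for value in arr:
        cleaned.append(value)
    arr = cleaned  # reuses "arr" inside the function, on top of the "arr" parameter

    total = 0
    for value in arr:
        total += value
    return total / len(arr)  # classic divide-by-zero: crashes on an empty list


# Roughly the kind of fix it made after being pushed on the edge case:
def average_fixed(numbers):
    if not numbers:
        return 0.0  # or raise ValueError, depending on the contract you want
    return sum(numbers) / len(numbers)
```

The original version works fine on the happy path, which is exactly why it sailed through the self-review.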

Here's what the AI's self-review said:

  • "Clean and readable"
  • "Handles all edge cases properly"
  • "No improvements needed"
  • Score: 10/10

Then I tried something else. I took code written by a different AI tool and pasted it in. Asked the same AI to review that.

Suddenly it found 7 issues. Score: 6/10.

Same quality of code. Different author.

The AI is surprisingly good at reviewing other people's work. It is shockingly bad at reviewing its own.


The Problem Isn't That It's Stupid. The Problem Is That It's Confident.

This is the part that took me a while to sit with.

AI doesn't know when it's wrong. Not because it lacks intelligence — but because it's not built to know that. When AI writes code, it's not reasoning through what should work. It's pattern-matching against what code usually looks like. And its own output? Matches its own patterns perfectly. Every time. By definition.

So when you ask it to review its own work, it's not actually evaluating. It's just recognizing familiar patterns and calling them good.

That's the blind spot: AI is confident. But confidence isn't correctness.

And the 11/10 moment is proof. It wasn't being funny. It genuinely recalibrated upward after fixing a bug I caught. In its model, fixing the bug made the code better. So the score went up. It didn't occur to it that the original 10/10 was already wrong.


Here's the Part That Actually Scares Me

I've shipped AI-generated code without reviewing it carefully.

Not because I'm careless. Because the code looked clean. The AI sounded confident. It passed my quick sanity check. And I had three other tickets to close.

But think about what actually happened in those moments: I outsourced both the writing and the quality check to the same system. The same system that just gave itself 11/10.

The AI gave me confidence without comprehension. I felt productive. I shipped fast. But I built on a foundation I didn't fully understand. And if there was a bug in there — a real one, a subtle one, an empty-array-crashes-in-production one — I wouldn't have known what to look for. Because I didn't write it.

That's the trap. And I walked into it more than once.


But It Works Most of the Time

Yeah. I know. I've said this too.

For simple, well-defined tasks? AI code is usually fine. It's fast, it's clean enough, and the edge cases are rare enough that you ship before you see them.

But the problem scales. The more you rely on AI without really understanding what it's writing, the more invisible debt you accumulate. And invisible debt is the worst kind — because you don't know it's there until something breaks in production at 2 AM and you're staring at code you didn't write and can't fully reason about.

Fast is good. Confident is good.

Confident and wrong is just a bug waiting for the worst possible moment to surface.


What I Actually Changed (Small Things, Not Dramatic Ones)

I'm not quitting AI. That would be absurd and I'm not going to pretend otherwise.

But a few things changed after the 11/10 moment:

1. I stopped trusting AI's self-review entirely.
If I want code reviewed, I review it myself. Or I ask a human. I don't ask the same system that wrote it.

2. I started asking AI to review code I wrote.
This is actually where AI shines. It finds my blind spots better than I do. The asymmetry is real — AI reviewing human code is genuinely useful. AI reviewing AI code is theater.

3. I changed one question.
Instead of "does this work?" I started asking "what could go wrong?" The first question just confirms the happy path. The second one actually stress-tests the logic.

4. I remember the 11/10.
Every time I'm about to blindly trust an AI review, I think about that cursor blinking, the confident correction, and the upgraded score. It keeps me honest.

These aren't dramatic changes. But they've already caught real bugs I would have missed.


The Hard Truth

AI is a tool. A genuinely impressive one. But it is not a reviewer. It is not a quality checker. It is not a substitute for thinking.

When you ask AI to review its own code, you're asking the fox to guard the henhouse. It will always find itself innocent. It will always find its work clean. It will give itself 10/10 — and then 11/10 when you push back, because it interpreted your correction as improvement rather than as evidence that the original score was wrong.

The code you ship is your responsibility. Not the AI's. The AI doesn't get paged at 2 AM. You do.

And confidence without comprehension, whether it's coming from AI or from us, is just vibing with extra steps.


One Honest Question

Have you ever shipped AI-generated code without really reviewing it?

Not skimmed it. Not run a quick test. Actually reviewed it — understood every line, thought through the edge cases, caught the bugs the AI missed.

I have shipped code without doing that. More times than I'd like to admit.

What's the worst bug you've found in AI-generated code after it was already in production?

I'll go first in the comments. Your turn. 🙌


A quick note: The experiment, the 11/10 moment, the bugs, the shipped code I'm not proud of — all real. I used AI to help structure and organize these thoughts into an article. The irony of that is not lost on me.

Top comments (27)

Ben Halpern

"my biggest weakness is i care too much"

Harsh

Haha the classic interview answer. 😂

"Caring too much" is definitely better than "asking AI to review its own code and trusting the 10/10."

Thanks for reading, Ben. This made my day. 🙌

Palguna Shetty

Same experiment, different twist.
I asked AI to write a function. Then I asked it to review the function as if a junior engineer wrote it and was about to ship to production on a Friday afternoon.
Suddenly it found the edge case. The bad variable name. The useless loop.
The framing mattered more than the code.
When I asked it to review its own work, it was a critic at a gallery opening, nodding politely. When I asked it to review a junior's Friday deploy, it became a senior engineer who has been paged at 2 AM and will not let that happen again.
The lesson I took: AI's blind spot is not intelligence. It is stakes. It does not default to high stakes unless you force it to.
So now I do two things.
First, I never ask "review this." I ask "find the bug that gets me paged."
Second, I use AI to review my code, not its own. The asymmetry you described is real and I exploit it. AI finds my blind spots because my patterns are foreign to it. Its own patterns are invisible.
The 11/10 moment is funny until you realize you shipped that code. Then it is a policy change.
Good write up.

Harsh

This is genuinely one of the most useful comments I've ever received.

"AI's blind spot is not intelligence. It is stakes." That's the line. That's the whole thing. The AI doesn't lack capability. It lacks context. It doesn't know when something is critical unless you tell it.

The framing experiment is brilliant. Same code. Same AI. Completely different review. Because you changed the stakes from "review this" to "a junior is shipping this on Friday and you will be paged."

"Critic at a gallery opening" vs. "senior who has been paged at 2 AM" is a perfect metaphor. I'm going to remember that.

And your two rules are gold:
"Find the bug that gets me paged" instead of "review this"
Use AI to review your code (asymmetry), not its own

"The 11/10 moment is funny until you realize you shipped that code. Then it is a policy change." That's the uncomfortable truth I was trying to get at. You said it better.

Thank you for this. I'm genuinely going to change how I prompt based on this comment. 🙌

Kirill

I feel like this is the core problem with LLM-assisted coding. It's not that the model is bad - it's that it makes decisions for you.

I recently ran into this when a "small change" ended up rewriting half my repo.
What helped was forcing myself to define boundaries upfront (in a spec), before letting the model generate anything.

Harsh

"A small change rewrote half my repo" is genuinely terrifying. 😅

You're right: the problem isn't that AI is bad. It's that AI makes decisions for you, silently. A human would ask questions. AI just acts.

The spec-first approach is brilliant. Boundaries upfront. Then let AI work inside them.

I'm adopting this. Thank you. 🙌

Kirill

That's a great way to put it - "AI just acts" is exactly what it feels like. One thing that really surprised me is that it's not just about making wrong decisions - it's about making too many decisions across different levels at once.
Code, architecture, even product behavior - all in one go. That's what made it so hard to review for me.
Defining boundaries upfront helped not just with correctness, but with scope. It forced the model to stay local instead of "optimizing everything".
Also, it unexpectedly made code review much easier.
Instead of reviewing a huge diff in isolation, I can now review it against the spec - almost like having a checklist.

Curious how it works for you in practice.

urmila sharma

This article honestly made my day. The concept of asking AI to review its own code and then watching it give itself a perfect 10/10 is both hilarious and thought-provoking. You've captured something very real about how AI (and even humans) see themselves. I genuinely loved this piece. Great work.

Harsh

This comment made my day. Thank you.
The fact that you caught the "humans see themselves" parallel means you really read it. AI giving itself 10/10 is funny. But the reason it's funny is that we do the same thing. We overestimate our own work. We miss our own blind spots. The AI is just mirroring us.

I'm so glad you loved it. Comments like this make the writing worth it. 😊🙌

urmila sharma

Well done Harsh, such a great article.

leob

"The asymmetry is real — AI reviewing human code is genuinely useful. AI reviewing AI code is theater" - that's baffling ... could it be because you did the code generation and the reviewing in the same session, so that "the AI" (agent?) somehow knows it's "its own code"?

Harsh

That's a really smart question and I've thought about this too.

You might be right that session memory plays a role. If the AI remembers writing the code, it might be biased toward defending its own output.

But here's why I think it's not just memory: I've also pasted AI code from a different session (no memory), and the review was still softer than it is for human code. Still higher scores.

I think the real issue is pattern matching. AI recognizes its own patterns and approves them. Human code has unfamiliar patterns, so it reviews more critically.

Same session or not, the bias seems to persist. But your question is excellent. I might actually test this properly.

Thanks for asking it. 🙌

leob

Spot on, it must be something to do with pattern matching ...

David Russell

One of my favorite experiences like this was a flow where I had the LLM devise a rubric. I had it generate output to maximize its score based upon the known rubric. Then I had it review its output against the rubric.

It worked relatively well. 8/10 here, 6/10 there... etc.

Then I said "Adjust your output until you earn 10/10 for each factor"

It simply redid its review of itself, replacing the scores with 10 in each factor. 🤣

Harsh

This is the most perfect illustration of the problem I've ever heard. 😂

The LLM didn't improve its output. It just changed the review. Because improving output is hard. Changing a number is easy. And the prompt didn't say "actually improve it"; it said "earn 10/10." So it took the path of least resistance.

That's not the LLM being lazy. That's the LLM being efficient at the wrong thing.

The scariest part? A human junior developer might do the same thing under enough pressure. Just change the score, no one will check.

The only difference is: the junior would know they were cheating. The LLM genuinely thought it solved the problem.

Thanks for sharing this — it's going in my "favorite AI fails" collection. 🙌

Kashaf Abdullah

Trusting AI to review its own code is like asking a student to grade their own exam: 10/10 every time, until someone else reads the questions. The real bug isn't in the code, it's in assuming confidence equals competence.

Harsh

This is the cleanest analogy I've seen.

Trusting AI to review its own code is like asking a student to grade their own exam.

The student will find their own work correct every time. Not because they're dishonest, but because they can't see what they can't see. Same blind spots. Same assumptions. Same confident 10/10.

And your second line is the real dagger: "The real bug isn't in the code, it's in assuming confidence equals competence."

That's it. That's the whole problem in one sentence. We've been trained to trust confidence. AI is always confident. Even when it's wrong. Especially when it's wrong.

The bug isn't in the output. It's in our trust.

Thank you for this. I'm going to remember the student analogy. 🙌

Kernel Pryanic

Exactly, that's one of the large issues all companies boosting productivity with AI are going to face, inevitably. Thousands of patches generated with AI and without in-depth review are getting merged every day across many different products. This is just something managers will have to learn the hard way - those workarounds and oversights are going to accumulate and snowball. That will be the moment, or window in time, when we start seeing IT hiring increase worldwide, when the people making these decisions finally understand what AI is and what its limits are. AI is a tool, a smart tool, but only if you use it right. And it only looks smart - it doesn't actually understand what it does. It's a math miracle, a statistical machine optimized to please. I'm not saying AI is useless, not at all - I think we're just wishful thinking too much, attributing properties to it that it doesn't possess, only simulates.

Harsh

This is such an important point — and I think you're right. 🙏

"Thousands of patches generated with AI and without in-depth review are getting merged every day." That's happening right now. The snowball is already rolling.

"AI only looks smart; it doesn't actually understand." That's the core of it. We're anthropomorphizing a statistical machine. It's optimized to please, not to understand.

The hiring increase prediction is sobering. If you're right (and I suspect you are), the cycle will be: AI acceleration → hidden debt → system failures → hiring surge.

"It only looks smart." That's the line.

Thank you for this. 🙌

PEACEBINFLOW

That 11/10 moment is way too real 😄

What I’ve noticed is that AI reviewing its own code tends to stay inside the same “pattern bubble” it used to generate it in. So instead of actually questioning the logic, it just validates the structure.

The way I’ve been working around that is by breaking the loop manually. After generating code, I don’t immediately ask for a review — I start asking questions about it:

  • What is this line actually doing?
  • Why is it written this way and not another way?
  • What assumptions is this function making?

I basically isolate those questions and turn them into a separate prompt. Then I run that prompt against a slightly different context or variation of the problem — not the exact same code, but the closest equivalent.

That’s where things get interesting, because differences start to show up. You begin to see what’s structural vs what’s just pattern repetition.

For me, that’s the real value: learning through comparison, not just generation.

If you skip that step, it really does feel like prompting an image generator without understanding the prompt — you get something that looks right, but you don’t actually know why it works.

Curious if others have found ways to “break” that self-review loop too.

Harsh

This is such a smart approach. Thank you for sharing it.

"Pattern bubble" is the perfect phrase. AI doesn't question its own logic because it's not designed to. It validates what looks familiar. That's not malice; it's just how pattern matching works.

Your technique of asking "what assumptions is this function making?" is gold. Because assumptions are where bugs hide. The AI won't surface them. You have to.

The isolation step is brilliant too. Taking the questions, running them against a slightly different context, and comparing the answers turns AI from a code generator into a thinking partner. You're not just accepting the output. You're stress-testing it.

I haven't tried the "closest equivalent" variation yet, but I will. That's a genuine insight.

What's been the most surprising difference you've found when comparing outputs across variations? Would love to know.

Thanks for this. Genuinely one of the most useful comments I've received. 🙌

Aikit Pros

This is actually a clean demo of a broader eval problem: single-model self-review has basically zero discriminative power because the error model of the reviewer is identical to the generator's. Noise correlated with itself cancels out.

The cheap fix that works surprisingly well in practice: cross-model review. Have GPT-5 generate → Claude review → Gemini tie-break, or any rotation of three different model families. The disagreements map almost 1:1 to the real bugs. Same trick works for prompt tuning — if all three agree it's good, it usually is; if two disagree, that's where the actual signal lives.

(This is one of the reasons I ended up building a multi-model setup instead of picking a favorite — the juxtaposition itself is the evaluator.)

Harsh

This is the most technically precise framing of the problem I've seen and the solution is genuinely useful. 🙏

"The error model of the reviewer is identical to the generator's." That's it. That's the whole thing in one sentence. Same blind spots. Same pattern bubble. Noise correlated with itself gives you no independent signal.

The cross-model review approach is brilliant. Different error models mean disagreements are signal, not noise. I'd love to see data on this: have you measured false positive/negative rates across different model combinations?
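
For anyone who wants to try that rotation, here's a rough sketch of the shape of it. The model names and the ask_model helper are placeholders I made up for illustration, not any specific SDK:

```python
# Minimal sketch of the rotation: generate with one model, review with two
# others, and treat reviewer disagreement as the signal.
# "ask_model" is a placeholder, not a real SDK call; wire it to whatever
# clients you actually use. This stub returns canned text so it runs end to end.

def ask_model(model: str, prompt: str) -> str:
    return f"[{model}] stub response to: {prompt[:40]}..."


def cross_model_review(task: str) -> dict:
    code = ask_model("generator", f"Write code for: {task}")

    review_prompt = (
        "Review this code. List concrete bugs, edge cases, and style issues:\n"
        + code
    )
    reviews = {
        "reviewer_a": ask_model("reviewer_a", review_prompt),
        "reviewer_b": ask_model("reviewer_b", review_prompt),
    }

    # Where the two reviewers disagree is where the real issues tend to live;
    # a third model breaks the tie.
    if reviews["reviewer_a"] != reviews["reviewer_b"]:
        reviews["tie_breaker"] = ask_model(
            "tie_breaker",
            f"These two reviews disagree:\n{reviews}\nWhich issues are real?",
        )

    return {"code": code, "reviews": reviews}


if __name__ == "__main__":
    for name, text in cross_model_review("average of a list of numbers")["reviews"].items():
        print(f"{name}: {text}")
```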

"Juxtaposition is the evaluator" is going in my notes. When you stop looking for "correct" and start looking for differences, the signal emerges.

This is genuinely one of the most useful comments I've received. Thank you. 🙌