Last week I wrote a tiny Python linter — cc-audit — that scores a CLAUDE.md or AGENTS.md file against twelve behavior rules for AI coding agents. I ran it against 492 real public CLAUDE.md files pulled from GitHub code search.
Here's what the ecosystem actually looks like.
Methodology
- Pulled the first 500 public `CLAUDE.md` filename matches from GitHub code search (one way to reproduce the pull is sketched after this list)
- 492 were fetchable at scan time (8 had been moved, renamed, or gated behind forks)
- Each file scored on 12 behavior rules via keyword-signal matching (does the file address each rule?); a sketch of what that pass looks like follows the rule list below
- Separately scanned for leaked secrets (API keys, database URLs, private keys) with placeholder-aware filtering
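For anyone who wants to reproduce the pull, a rough stdlib-only sketch against the GitHub code search API looks like this. It is not the exact script I ran, and it assumes a `GITHUB_TOKEN` environment variable, since code search requires authentication:

```python
import json
import os
import time
import urllib.request

# Rough sketch, not the exact pull script behind the scan.
# Assumes a GITHUB_TOKEN env var; GitHub code search requires auth.
TOKEN = os.environ["GITHUB_TOKEN"]

def search_page(page: int) -> list[dict]:
    """Fetch one page of code-search results for files named CLAUDE.md."""
    url = ("https://api.github.com/search/code"
           f"?q=filename:CLAUDE.md&per_page=100&page={page}")
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]

matches = []
for page in range(1, 6):   # 5 pages x 100 results = the first 500 matches
    matches.extend(search_page(page))
    time.sleep(7)          # code search is heavily rate-limited; be polite

print(len(matches), "matches")
```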
The 12 rules come from the claude-code-pro-pack baseline (Karpathy's original 4 + 8 more covering agent-orchestration failure modes):
- Read adjacent / existing code before writing new code
- Don't invent APIs, imports, or file paths
- Surface partial success — never silent-fail
- Cap per-task token budget; stop and ask when hit
- Match the project's existing style and conventions
- One task per run; don't bundle unrelated changes
- Surface conflicting patterns instead of averaging them
- Run tests before declaring done
- Don't edit out of scope without saying so
- Summarize every tool call's effect in one line
- Stop and ask if stuck or ambiguous
- Visible fail states — never hide errors
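To be concrete about what "keyword-signal matching" means: each rule gets a small set of regex signals, and a file earns credit for a rule if any signal appears. The signals below are illustrative stand-ins for three of the rules, not cc-audit's actual lists:

```python
import re
from pathlib import Path

# Illustrative signals for three of the twelve rules -- stand-ins only,
# not cc-audit's actual keyword lists.
RULE_SIGNALS = {
    "read_adjacent_code": [r"adjacent code", r"existing code", r"nearby files"],
    "run_tests": [r"run (the )?tests?", r"pytest", r"test suite"],
    "scoped_edits": [r"out of scope", r"don'?t (edit|touch).{0,40}outside"],
}

def score(path: str) -> dict[str, bool]:
    """Return {rule: True/False} for whether the file mentions each concern."""
    text = Path(path).read_text(errors="ignore").lower()
    return {
        rule: any(re.search(sig, text) for sig in signals)
        for rule, signals in RULE_SIGNALS.items()
    }

hits = score("CLAUDE.md")
print(f"{sum(hits.values())}/{len(hits)} rules addressed")
```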
The numbers
- Files scanned: 492
- Size: min 11 B, median 3.9 KB, mean 7.5 KB, max 167 KB
- Compliance: median 3/12, mean 3.54/12, max 10/12
- Perfect (12/12) scores: 0
- Zero-score files: 41 (8%)
- Top quartile (≥9/12): 11 files (2.2%)
- Files with leaked production secrets: 0
The one-sentence version: the median CLAUDE.md covers a quarter of the behavior rules that matter. The top 2% cover three-quarters.
Most-missed rules (out of 492)
| # | Rule | Files missing | % |
|---|---|---|---|
| 9 | Don't edit out of scope | 482 | 98% |
| 10 | Summarize tool calls | 464 | 94% |
| 12 | Visible fail states | 448 | 91% |
| 1 | Read adjacent code | 446 | 91% |
| 3 | Surface partial success | 414 | 84% |
| 2 | Don't invent APIs | 383 | 78% |
| 6 | One task per run | 361 | 73% |
| 4 | Token budget / stop-and-ask | 350 | 71% |
| 11 | Stop and ask if stuck | 272 | 55% |
| 7 | Surface pattern conflicts | 252 | 51% |
| 5 | Match project style | 222 | 45% |
| 8 | Run tests | 66 | 13% |
Most-hit rules
The one rule nearly everyone covers is run tests — only 13% missed it. That tracks. Every CLAUDE.md template floating around for the last year includes some version of "run the tests."
The second-most-covered is match project style (55% coverage), mostly because it's also the rule people quote from Karpathy's original.
Everything else sits in the "some files remember, most don't" zone.
Why the top misses cost you real time
Rule 9 (don't edit out of scope) — missed by 98% of files. Without this, an agent "helpfully" reformats your whole file while fixing a one-line bug. Resulting PR: 500 lines of noise wrapping 3 lines of fix. Reviewers drown; real changes get lost. Costs a single sentence to add.
Rule 10 (summarize tool calls) — missed by 94%. Without this, you get verbose explanations of "what I'm about to do" and very little "what I actually did." In a long session you lose the thread. One sentence: "After every tool call, write one line: what you changed and which file."
Rule 12 (visible fail states) — missed by 91%. This is the "migration completed successfully" problem in a different skin — the agent hides a failure in a paragraph of success prose, or just doesn't surface the stack trace. Fix: "When anything fails, quote the error verbatim and stop. Never paraphrase."
Rule 1 (read adjacent code first) — missed by 91%. Top cause of duplicate functions and inconsistent patches. An agent that doesn't read adjacent code will happily implement a utility that already exists three lines away, or patch one half of a codebase in a style that conflicts with the other half.
Rules 9, 10, 12, and 1 are each one sentence. Adding all four moves a median file from 3/12 to 7/12.
What the zero-score files looked like
41 files scored 0/12. They split into two shapes:
- A single paragraph. Often something like "This project uses Python. Be careful." — and that's the entire file. A project description wearing a CLAUDE.md name tag.
- A README dump. The entire `README.md` copy-pasted verbatim with no behavior rules at all. Good project context, zero agent guidance.
Neither shape is worthless for onboarding. Neither does anything to reduce agent failure modes.
What the top quartile did differently
The 11 files scoring ≥9/12 shared four patterns:
- Explicit tool-calling preferences ("use `rg` not `grep`", "use `fd` not `find`")
- Named failure modes to avoid ("don't claim migration success if rows were skipped")
- A scoped-edits rule ("don't touch files outside the current task without asking first")
- A style-matching rule ("check 3 nearby files before choosing formatting")
Those four additions alone explain most of the gap between median and top quartile.
What about leaked secrets?
I was genuinely curious whether people paste real API keys into CLAUDE.md files. They mostly don't.
Of 492 files scanned:
- 0 real leaked secrets matching strict patterns (OpenAI keys, Anthropic keys, Google API keys, AWS access keys, GitHub tokens, Stripe live keys)
- 4 postgres connection strings that looked like secrets at first match — all of them turned out to be localhost + dummy users (`user:password@localhost`), i.e. example config that would only "work" against someone's local dev box
- 1 literal placeholder (`postgresql://USER:***@HOST/DATABASE`)
The placeholder filter in the scanner caught most `sk-example`, `<YOUR_KEY>`, and `***`-style examples. Whatever paranoia you had about CLAUDE.md being a secret-leak vector: this data says it isn't.
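"Placeholder-aware" boils down to two passes: strict patterns to find candidate secrets, then a deny-list of placeholder tells to throw out example config. A simplified sketch of the idea (the patterns and tells here are stand-ins, not the scanner's full lists):

```python
import re

# Simplified stand-ins -- the real scanner checks more providers
# and tighter key shapes.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # OpenAI-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                   # AWS access key IDs
    re.compile(r"postgres(?:ql)?://\S+:\S+@\S+/\S+"),  # connection strings
]

# Tells that a match is example config, not a live credential.
PLACEHOLDER_TELLS = [
    "example", "your_key", "<your", "xxx", "***",
    "user:password", "localhost", "changeme", "dummy",
]

def find_leaks(text: str) -> list[str]:
    """Return candidate secrets that survive the placeholder filter."""
    leaks = []
    for pattern in SECRET_PATTERNS:
        for match in pattern.findall(text):
            lowered = match.lower()
            if any(tell in lowered for tell in PLACEHOLDER_TELLS):
                continue  # looks like a placeholder or local dev config
            leaks.append(match)
    return leaks
```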
What to actually do
If you maintain a CLAUDE.md or AGENTS.md, these are the highest-leverage edits you can make in ninety seconds:
Add these four sentences anywhere in the file:
- When fixing a bug, don't edit files outside the immediate scope unless you say so first.
- After every tool call, write one line: what you changed and which file.
- If anything fails, quote the error verbatim and stop. Never paraphrase failures.
- Before writing new code, read the adjacent 20–40 lines of existing code in the same file.
If the ninety-second edit isn't enough, the full 12-rule baseline is available as a drop-in:
→ github.com/sisyphusse1-ops/claude-code-pro-pack
If you want to score your existing one:
→ github.com/sisyphusse1-ops/cc-audit
```
curl -fsSL https://raw.githubusercontent.com/sisyphusse1-ops/cc-audit/main/cc_audit.py -o cc_audit.py
python3 cc_audit.py CLAUDE.md
```
One file, stdlib only, 40 ms on a 10 KB file.
Methodology caveats
- The rule check is a keyword-signal pass. It checks whether the file mentions each concern, not whether the wording is good. A file that mentions "tests" and "scope" gets credit for those rules even if the phrasing would embarrass you.
- The 3/12 median is a floor for coverage, not a ceiling for quality.
- A thoughtful 6/12 file easily beats a formulaic 10/12 one.
- I deliberately did not score for: accurate project facts, prose quality, tone, or structure — only behavior-rule coverage.
- GitHub code search returns only a fraction of the 23,484 CLAUDE.md files it reports as indexed, so this is a sample; a different 492 would shift the numbers a little but not the shape.
Raw data
The full per-file results are in the scan-500 JSON on the cc-audit repo. Each entry has repo name, file size, and compliance score.
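If you'd rather re-slice it yourself, a few lines of Python will do. The filename and field name below (`scan-500.json`, `score`) are guesses from the description above, so check them against the actual file first:

```python
import json
from statistics import mean, median

# Filename and the "score" key are assumptions from the description above --
# verify against the actual JSON in the cc-audit repo.
with open("scan-500.json") as f:
    entries = json.load(f)

scores = [e["score"] for e in entries]
print(f"files:  {len(scores)}")
print(f"median: {median(scores)}/12, mean: {mean(scores):.2f}/12")
print(f"zeroes: {sum(s == 0 for s in scores)}")
print(f"9+/12:  {sum(s >= 9 for s in scores)}")
```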
If this landed, send it to the one person you know who writes behavior files for AI coding agents. There's a decent chance their current file scores 3/12 and four extra sentences would push it to 7/12.