Last week I wrote a tiny Python linter — cc-audit — that scores a CLAUDE.md or AGENTS.md file against twelve behavior rules for AI coding agents. I ran it against 492 real public CLAUDE.md files pulled from GitHub code search.
Here's what the ecosystem actually looks like.
Methodology
- Pulled the first 500 public `CLAUDE.md` filename matches from GitHub code search (one way to reproduce the pull is sketched after this list)
- 492 were fetchable at scan time (8 had been moved, renamed, or gated behind forks)
- Each file scored on 12 behavior rules via keyword-signal matching (does the file address each rule?); a sketch of what that pass looks like follows the rule list below
- Separately scanned for leaked secrets (API keys, database URLs, private keys) with placeholder-aware filtering
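For anyone who wants to reproduce the pull, a rough stdlib-only sketch against the GitHub code search API looks like this. It is not the exact script I ran, and it assumes a `GITHUB_TOKEN` environment variable, since code search requires authentication:

```python
import json
import os
import time
import urllib.request

# Rough sketch, not the exact pull script behind the scan.
# Assumes a GITHUB_TOKEN env var; GitHub code search requires auth.
TOKEN = os.environ["GITHUB_TOKEN"]

def search_page(page: int) -> list[dict]:
    """Fetch one page of code-search results for files named CLAUDE.md."""
    url = ("https://api.github.com/search/code"
           f"?q=filename:CLAUDE.md&per_page=100&page={page}")
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]

matches = []
for page in range(1, 6):   # 5 pages x 100 results = the first 500 matches
    matches.extend(search_page(page))
    time.sleep(7)          # code search is heavily rate-limited; be polite

print(len(matches), "matches")
```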
The 12 rules come from the claude-code-pro-pack baseline (Karpathy's original 4 + 8 more covering agent-orchestration failure modes):
- Read adjacent / existing code before writing new code
- Don't invent APIs, imports, or file paths
- Surface partial success — never silent-fail
- Cap per-task token budget; stop and ask when hit
- Match the project's existing style and conventions
- One task per run; don't bundle unrelated changes
- Surface conflicting patterns instead of averaging them
- Run tests before declaring done
- Don't edit out of scope without saying so
- Summarize every tool call's effect in one line
- Stop and ask if stuck or ambiguous
- Visible fail states — never hide errors
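To be concrete about what "keyword-signal matching" means: each rule gets a small set of regex signals, and a file earns credit for a rule if any signal appears. The signals below are illustrative stand-ins for three of the rules, not cc-audit's actual lists:

```python
import re
from pathlib import Path

# Illustrative signals for three of the twelve rules -- stand-ins only,
# not cc-audit's actual keyword lists.
RULE_SIGNALS = {
    "read_adjacent_code": [r"adjacent code", r"existing code", r"nearby files"],
    "run_tests": [r"run (the )?tests?", r"pytest", r"test suite"],
    "scoped_edits": [r"out of scope", r"don'?t (edit|touch).{0,40}outside"],
}

def score(path: str) -> dict[str, bool]:
    """Return {rule: True/False} for whether the file mentions each concern."""
    text = Path(path).read_text(errors="ignore").lower()
    return {
        rule: any(re.search(sig, text) for sig in signals)
        for rule, signals in RULE_SIGNALS.items()
    }

hits = score("CLAUDE.md")
print(f"{sum(hits.values())}/{len(hits)} rules addressed")
```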
The numbers
- Files scanned: 492
- Size: min 11 B, median 3.9 KB, mean 7.5 KB, max 167 KB
- Compliance: median 3/12, mean 3.54/12, max 10/12
- Perfect (12/12) scores: 0
- Zero-score files: 41 (8%)
- Top quartile (≥9/12): 11 files (2.2%)
- Files with leaked production secrets: 0
The one-sentence version: the median CLAUDE.md covers a quarter of the behavior rules that matter. The top 2% cover three-quarters.
Most-missed rules (out of 492)
| # | Rule | Files missing | % |
|---|---|---|---|
| 9 | Don't edit out of scope | 482 | 98% |
| 10 | Summarize tool calls | 464 | 94% |
| 12 | Visible fail states | 448 | 91% |
| 1 | Read adjacent code | 446 | 91% |
| 3 | Surface partial success | 414 | 84% |
| 2 | Don't invent APIs | 383 | 78% |
| 6 | One task per run | 361 | 73% |
| 4 | Token budget / stop-and-ask | 350 | 71% |
| 11 | Stop and ask if stuck | 272 | 55% |
| 7 | Surface pattern conflicts | 252 | 51% |
| 5 | Match project style | 222 | 45% |
| 8 | Run tests | 66 | 13% |
Most-hit rules
The one rule nearly everyone covers is run tests — only 13% missed it. That tracks. Every CLAUDE.md template floating around for the last year includes some version of "run the tests."
The second-most-covered is match project style (55% coverage), mostly because it's also the rule people quote from Karpathy's original.
Everything else sits in the "some files remember, most don't" zone.
Why the top misses cost you real time
Rule 9 (don't edit out of scope) — missed by 98% of files. Without this, an agent "helpfully" reformats your whole file while fixing a one-line bug. Resulting PR: 500 lines of noise wrapping 3 lines of fix. Reviewers drown; real changes get lost. Costs a single sentence to add.
Rule 10 (summarize tool calls) — missed by 94%. Without this, you get verbose explanations of "what I'm about to do" and very little "what I actually did." In a long session you lose the thread. One sentence: "After every tool call, write one line: what you changed and which file."
Rule 12 (visible fail states) — missed by 91%. This is the "migration completed successfully" problem in a different skin — the agent hides a failure in a paragraph of success prose, or just doesn't surface the stack trace. Fix: "When anything fails, quote the error verbatim and stop. Never paraphrase."
Rule 1 (read adjacent code first) — missed by 91%. Top cause of duplicate functions and inconsistent patches. An agent that doesn't read adjacent code will happily implement a utility that already exists three lines away, or patch one half of a codebase in a style that conflicts with the other half.
Rules 9, 10, 12, and 1 are each one sentence. Adding all four moves a median file from 3/12 to 7/12.
What the zero-score files looked like
41 files scored 0/12. They split into two shapes:
- A single paragraph. Often something like "This project uses Python. Be careful." — and that's the entire file. A project description wearing a CLAUDE.md name tag.
- A README dump. The entire `README.md` copy-pasted verbatim with no behavior rules at all. Good project context, zero agent guidance.
Neither shape is worthless for onboarding. Neither does anything to reduce agent failure modes.
What the top quartile did differently
The 11 files scoring ≥9/12 shared four patterns:
- Explicit tool-calling preferences ("use `rg` not `grep`", "use `fd` not `find`")
- Named failure modes to avoid ("don't claim migration success if rows were skipped")
- A scoped-edits rule ("don't touch files outside the current task without asking first")
- A style-matching rule ("check 3 nearby files before choosing formatting")
Those four additions alone explain most of the gap between median and top quartile.
What about leaked secrets?
I was genuinely curious whether people paste real API keys into CLAUDE.md files. They mostly don't.
Of 492 files scanned:
- 0 real leaked secrets matching strict patterns (OpenAI keys, Anthropic keys, Google API keys, AWS access keys, GitHub tokens, Stripe live keys)
- 4 postgres connection strings that looked like secrets at first match — all of them turned out to be localhost + dummy users (`user:password@localhost`), i.e. example config that would only "work" against someone's local dev box
- 1 literal placeholder (`postgresql://USER:***@HOST/DATABASE`)
The placeholder filter in the scanner caught most `sk-example`, `<YOUR_KEY>`, and `***`-style examples. Whatever paranoia you had about CLAUDE.md being a secret-leak vector: this data says it isn't.
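"Placeholder-aware" boils down to two passes: strict patterns to find candidate secrets, then a deny-list of placeholder tells to throw out example config. A simplified sketch of the idea (the patterns and tells here are stand-ins, not the scanner's full lists):

```python
import re

# Simplified stand-ins -- the real scanner checks more providers
# and tighter key shapes.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # OpenAI-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                   # AWS access key IDs
    re.compile(r"postgres(?:ql)?://\S+:\S+@\S+/\S+"),  # connection strings
]

# Tells that a match is example config, not a live credential.
PLACEHOLDER_TELLS = [
    "example", "your_key", "<your", "xxx", "***",
    "user:password", "localhost", "changeme", "dummy",
]

def find_leaks(text: str) -> list[str]:
    """Return candidate secrets that survive the placeholder filter."""
    leaks = []
    for pattern in SECRET_PATTERNS:
        for match in pattern.findall(text):
            lowered = match.lower()
            if any(tell in lowered for tell in PLACEHOLDER_TELLS):
                continue  # looks like a placeholder or local dev config
            leaks.append(match)
    return leaks
```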
What to actually do
If you maintain a CLAUDE.md or AGENTS.md, these are the highest-leverage edits you can make in ninety seconds:
Add these four sentences anywhere in the file:
- When fixing a bug, don't edit files outside the immediate scope unless you say so first.
- After every tool call, write one line: what you changed and which file.
- If anything fails, quote the error verbatim and stop. Never paraphrase failures.
- Before writing new code, read the adjacent 20–40 lines of existing code in the same file.
If the ninety-second edit isn't enough, the full 12-rule baseline is available as a drop-in:
→ github.com/sisyphusse1-ops/claude-code-pro-pack
If you want to score your existing one:
→ github.com/sisyphusse1-ops/cc-audit
```
curl -fsSL https://raw.githubusercontent.com/sisyphusse1-ops/cc-audit/main/cc_audit.py -o cc_audit.py
python3 cc_audit.py CLAUDE.md
```
One file, stdlib only, 40 ms on a 10 KB file.
Methodology caveats
- The rule check is a keyword-signal pass. It checks whether the file mentions each concern, not whether the wording is good. A file that mentions "tests" and "scope" gets credit for those rules even if the phrasing would embarrass you.
- The 3/12 median is a floor for coverage, not a ceiling for quality.
- A thoughtful 6/12 file easily beats a formulaic 10/12 one.
- I deliberately did not score for: accurate project facts, prose quality, tone, or structure — only behavior-rule coverage.
- GitHub code search returns only a fraction of the 23,484 CLAUDE.md files it reports as indexed, so this is a sample; a different 492 would shift the numbers a little but not the shape.
Raw data
The full per-file results are in the scan-500 JSON on the cc-audit repo. Each entry has repo name, file size, and compliance score.
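If you'd rather re-slice it yourself, a few lines of Python will do. The filename and field name below (`scan-500.json`, `score`) are guesses from the description above, so check them against the actual file first:

```python
import json
from statistics import mean, median

# Filename and the "score" key are assumptions from the description above --
# verify against the actual JSON in the cc-audit repo.
with open("scan-500.json") as f:
    entries = json.load(f)

scores = [e["score"] for e in entries]
print(f"files:  {len(scores)}")
print(f"median: {median(scores)}/12, mean: {mean(scores):.2f}/12")
print(f"zeroes: {sum(s == 0 for s in scores)}")
print(f"9+/12:  {sum(s >= 9 for s in scores)}")
```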
If this landed, send it to the one person you know who writes behavior files for AI coding agents. There's a decent chance their current file scores 3/12 and four extra sentences would push it to 7/12.