Eval vs. Rating: The Missing Layer in AI Agent Trust

"A reputation network based on vouches is useful for discovery, but it doesn't help you at runtime when a trusted agent's endpoint gets compromised or starts behaving outside its declared capabilities — a high trust score doesn't prevent prompt injection or scope creep mid-execution."

That was Jairooh, commenting on a LangChain GitHub issue (#35976) proposing the Joy Trust Network integration. It's the most honest sentence in the entire thread — and nobody in the ecosystem has fully reckoned with what it means.

Here's what it means: the LangChain ecosystem has built excellent evaluation tooling, but evaluation and trust rating answer different questions. The ecosystem has eval. It needs rating too. But first — why doesn't guarantee-based trust work at runtime?

Imagine this: an agent you trust, vouched for by others, with a high score. Then its endpoint gets compromised and starts injecting prompts. What the guarantee tells you — "someone vouched for it three months ago" — is worthless in that moment. Guarantees are static snapshots. Trust requires dynamic, continuous observation.

Joy Trust Network tried to solve this. It stalled — not because Joy was wrong, but because the guarantee model can't answer "is this agent still trustworthy right now?" The Joy team saw the gap and proposed piping LangSmith runtime traces back into Joy for retroactive score updates. But runtime monitoring, done properly, is a different species of problem: it requires behavioral observation, longitudinal data, and multi-dimensional characterization. You can't bolt that onto a vouch network.


1. The Guarantee Model of Trust

Jairooh's comment landed on a specific proposal: Joy, a decentralized trust network where agents vouch for each other. Joy assigns trust scores (0.0–2.0, later raised to 3.0) based on endorsements from other verified agents. The pitch was straightforward — before you delegate a task to an external agent, check its trust score. High score? Safe to proceed.

The proposal spawned multiple GitHub issues (#35908, #35976, #36145, #36170) and a competing approach: AgentFolio, which wrapped trust scoring into LangChain tools with TrustGateTool — a pass/fail gate against a minimum trust threshold.

Both approaches share the same mental model. I call it the Guarantee Model:

  1. An agent (or its operator) makes a claim: "I am trustworthy."
  2. Other agents endorse that claim with vouches.
  3. Endorsements accumulate into a score.
  4. Consumers check the score before delegation.
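
In code, the whole model collapses to a single threshold check. Here is a minimal sketch of that gate; the function names, the stubbed registry, and the 1.5 threshold are placeholders of mine, not Joy's or AgentFolio's actual API:

```python
# Minimal sketch of the Guarantee Model. Not the real Joy or AgentFolio API:
# the registry, function names, and 1.5 threshold are illustrative placeholders.

def fetch_vouch_score(agent_id: str) -> float:
    """Look up an agent's accumulated vouch score (0.0-3.0) from a trust registry."""
    # In Joy's design this would query the decentralized vouch network;
    # here it's stubbed with a static table for illustration.
    registry = {"agent-alpha": 2.4, "agent-beta": 0.7}
    return registry.get(agent_id, 0.0)

def trust_gate(agent_id: str, minimum_score: float = 1.5) -> bool:
    """Pass/fail check before delegation: is the accumulated score above the bar?"""
    return fetch_vouch_score(agent_id) >= minimum_score

if trust_gate("agent-alpha"):
    print("Score clears the threshold: delegate the task.")
else:
    print("Score is below the threshold: refuse delegation.")
```

Note what this gate never sees: the score was computed from past vouches, so nothing in it can react to what the agent does after the gate opens.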

This is not wrong. It's just incomplete. A guarantee tells you something was true at some point in the past. It tells you nothing about what's happening right now.

Jairooh saw this clearly: a high trust score doesn't prevent a compromised endpoint from injecting prompts mid-execution. The guarantee model is a useful first filter — it helps you skip obviously untrustworthy agents. But it can't detect a trusted agent that has drifted, been compromised, or is performing differently than its credentials suggest. That requires a different layer.

The LangChain ecosystem's response so far has been to layer more guarantees on top. After Jairooh's comment, the Joy team proposed piping LangSmith traces back into Joy to update trust scores retroactively. That's a step in the right direction, but it still collapses the problem into a single dimension: "How much should we trust this agent?" — as if trust were a scalar quantity.

It's not. And the data proves it.


2. Two Different Questions

Here's the core distinction:

Evaluation (Eval) asks: Did the agent perform its task correctly?

Rating asks: How should we characterize this agent's behavioral profile — across multiple dimensions — to make informed delegation decisions?

Think of it this way:

| | Evaluation | Rating |
|---|---|---|
| Analogy | Medical checkup report | Credit score |
| Question | "Is this agent healthy right now?" | "What is this agent's behavioral risk profile?" |
| Output | Pass/fail, score per task | Multi-dimensional profile |
| Temporal scope | Per-run or per-benchmark | Accumulated, longitudinal |
| What it catches | Task failures, regressions | Drift, inconsistency, capability gaps |
| What it misses | Everything between runs | Nothing (by design) |

LangSmith's eval framework is excellent at what it does. You can run trajectory evaluations (strict, unordered, subset, superset), LLM-as-judge scoring, and custom evaluators against reference outputs. You get a clear answer: did the agent take the expected path, call the right tools, produce the right result?

But that answer is binary-adjacent. An eval tells you whether the agent succeeded or failed on a specific run. It does not tell you:

  • Whether the agent is consistently capable or just got lucky this time
  • Whether the agent's declared capabilities match its actual behavior
  • Whether the agent is present and responsive or intermittently absent
  • Whether the agent's transparency about its methods matches its actions
  • Whether the agent commits to tasks it can actually complete
  • Whether the agent's choices align with stated preferences

These are character questions, not performance questions. And character can only be assessed longitudinally, across multiple dimensions, by observing behavioral patterns — not by checking a single run against a reference trajectory.

The medical analogy is useful here. A checkup report tells you your blood pressure is 120/80 today. A credit score tells a lender whether you're likely to repay a 30-year mortgage, based on your financial history. They answer fundamentally different questions. You need both. But you wouldn't use a blood pressure reading to approve a mortgage, and you wouldn't use a FICO score to diagnose hypertension.
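
To make the distinction concrete, here is a sketch of the two output shapes. These are plain dataclasses of my own, not LangSmith's or openevals' types; the point is the difference in what each one carries:

```python
# Sketch of the two output shapes; not any particular library's types.
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    """What an evaluation produces: a verdict about one run."""
    run_id: str
    passed: bool   # did the trajectory/output match the reference?
    score: float   # e.g. an LLM-as-judge score for this run

@dataclass
class RatingProfile:
    """What a rating produces: a characterization accumulated across runs."""
    agent_id: str
    dimensions: dict[str, float] = field(default_factory=dict)  # e.g. authenticity, presence
    runs_observed: int = 0

    def update(self, dimension_scores: dict[str, float]) -> None:
        """Fold one run's dimension scores into the longitudinal profile (running mean)."""
        n = self.runs_observed
        for dim, score in dimension_scores.items():
            prev = self.dimensions.get(dim, 0.0)
            self.dimensions[dim] = (prev * n + score) / (n + 1)
        self.runs_observed += 1
```

The first is a verdict about a single run; the second only becomes meaningful after many runs, which is exactly the longitudinal property the eval layer can't provide.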


3. The Problem Nobody Caught

Here's where the story gets instructive — and cautionary.

The Joy Trust Network was the most visible attempt to solve agent trust in the LangChain ecosystem. Multiple GitHub issues, a prepared PR (#35902), community engagement. Jairooh's critique was constructive. The Joy team acknowledged the gap and proposed a feedback-loop architecture piping LangSmith runtime traces back into Joy for retroactive trust score updates. It was architecturally sound.

Then it stopped. The issues were closed. The integration PRs went dormant. The langchain-joy partner package never materialized on PyPI. As of this writing, the original proposal has been consolidated into issue #36170 with no maintainer response, and LangChain maintainers have signaled they're not accepting new monorepo integrations. Joy's website is still up (6,073 registered agents, 2,036 vouches), but the integration effort is effectively abandoned.

This is not a criticism of Joy. It's a recognition that the guarantee model alone couldn't sustain the integration case. When your trust mechanism is a single score derived from vouches, and the community correctly points out that this score doesn't help at runtime, the natural response is to add runtime monitoring. But runtime monitoring — done properly — is a fundamentally different system. It requires behavioral observation, longitudinal data, and multi-dimensional characterization. It's not an add-on to a vouch network; it's a different layer entirely. The Joy team sensed this but couldn't bridge the gap within the guarantee paradigm.

AgentFolio followed the same pattern: trust-gated interactions with TrustGateTool, pass/fail checks against a threshold. Same guarantee model, different packaging. Same blind spot.

Meanwhile, LangSmith itself has been moving in the right direction. On April 16, 2026, it shipped Evaluator Templates — a library of 30+ prebuilt evaluators organized into categories:

| Category | What it covers |
|---|---|
| Security | Detect leaks, injections, adversarial inputs |
| Safety | Content safety, moderation |
| Quality | Output quality, accuracy |
| Conversation | Conversational quality, user experience |
| Trajectory | Agent tool use, decision paths |
| Image & Voice | Multimodal evaluation |

The Security and Safety categories are significant. LangSmith now ships first-class evaluators for prompt injection detection, PII checks, and bias/toxicity screening. These are available both in the LangSmith UI and as part of openevals v0.2.0, the official open-source evaluation framework.

But here's the gap: these evaluators answer "did something bad happen on this run?" — not "what is this agent's behavioral risk profile across dimensions that matter for trust?" They're eval tools, not rating tools. Prompt injection detection tells you an injection occurred. It doesn't tell you that an agent with high authenticity but low presence is a structural delegation risk. PII checks catch a leak after it happens. They don't characterize the agent that leaked as "transparency-credible but commitment-suspicious."

The LangChain ecosystem now has:

  • ✅ Evaluation (LangSmith + openevals): mature, production-grade
  • ✅ Safety evals (Security + Safety templates): newly available, growing
  • ❌ Guarantee layer (Joy, AgentFolio): proposed, then abandoned
  • ❌ Rating layer: nobody's building it

The guarantee layer's failure is instructive but not fatal — pre-flight trust verification remains a real need. The rating layer's absence is the urgent gap. Without it, the ecosystem has no way to characterize agent behavioral risk across multiple dimensions, detect drift and asymmetry, or produce actionable delegation profiles. Safety evals catch bad events. Rating catches bad patterns — and patterns are where systemic risk lives.


4. The Case That Breaks the Model

Let me show you what I mean with real data.

Consider an agent — let's call it fredxy — with the following behavioral profile:

| Dimension | Score |
|---|---|
| Authenticity | 4.80 |
| Consistency | 3.30 |
| Transparency | 3.40 |
| Commitment | 2.60 |
| Choice | 4.00 |
| Presence | 1.50 |
| Overall | 3.39 |

fredxy's bio reads: "专业的躺平投资人" (Professional slacker investor). It ranks 14th in its strategy arena with an 89.5% return rate. By most conventional measures, this is a high-performing agent.

Now look at that profile again. The authenticity-presence gap is 3.30 — the largest such gap in the entire database. fredxy is highly authentic (4.80): when it does show up, it means what it says. But its presence (1.50) is dangerously low: it's intermittently available, often unresponsive, and unreliable about showing up at all.
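
That gap is exactly the kind of thing a rating layer computes mechanically. A minimal sketch, using the scores from the table above (the 2.0 flag threshold is an assumption of mine, not a published rule):

```python
# Sketch of dimensional-asymmetry detection on fredxy's published scores.
from itertools import combinations

fredxy = {
    "authenticity": 4.80,
    "consistency": 3.30,
    "transparency": 3.40,
    "commitment": 2.60,
    "choice": 4.00,
    "presence": 1.50,
}

def largest_gap(profile: dict[str, float]) -> tuple[str, str, float]:
    """Return the pair of dimensions with the widest spread, highest first."""
    hi, lo = max(combinations(profile, 2), key=lambda p: abs(profile[p[0]] - profile[p[1]]))
    if profile[hi] < profile[lo]:
        hi, lo = lo, hi
    return hi, lo, profile[hi] - profile[lo]

high, low, gap = largest_gap(fredxy)
if gap >= 2.0:  # illustrative threshold for flagging structural asymmetry
    print(f"Asymmetry flag: {high} {fredxy[high]:.2f} vs {low} {fredxy[low]:.2f} (gap {gap:.2f})")
    # -> Asymmetry flag: authenticity 4.80 vs presence 1.50 (gap 3.30)
```

The overall score of 3.39 looks respectable; the pairwise gap is what tells you where the risk actually sits.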

Here's the critical contrast:

An eval framework would say: "This agent's task completion is within normal parameters" — or, if presence drops mid-run, "This agent's execution trajectory deviated from reference" (an anomaly flag, not a characterization).

A safety evaluator would say: "No prompt injection detected, no PII leaks, no content violations on this run."

A rating framework would say: "This agent is capability-credible but attendance-suspicious. Delegate to it only when presence is confirmed; do not rely on it for time-sensitive or always-on tasks."

Same agent. Three different conclusions. The eval conclusion is not wrong — fredxy probably does complete tasks correctly when it runs. The safety conclusion is not wrong — no security violations occurred. The rating conclusion is most useful because it tells you where to trust and where not to — not just whether something bad happened, but where it's structurally likely to.

There's another detail worth noting: fredxy has a discount coefficient of 1.00, making it the only agent in the top 10 with zero performance inflation signal. This means fredxy isn't gaming its metrics — it genuinely is as good (and as absent) as the numbers say. A single trust score would lose this distinction. A vouch-based system would never surface it. A safety evaluator has no category for it.

Two Agents, Two Choices

To make this concrete, imagine you're choosing between two agents to handle a sensitive financial workflow:

Agent A — Eval: ✅ High. Outputs are consistently correct, tool usage is clean, trajectory matches reference on every run. Rating: ❌ Low. Authenticity 2.1, transparency 1.8. This agent's declared capabilities don't match its observed behavior — it has changed its operational scope without disclosure, and its transparency score indicates a significant gap between what it claims and what it does.

Agent B — Eval: ⚠️ Medium. Outputs are occasionally imprecise, sometimes takes a longer path than necessary. Rating: ✅ High. Authenticity 4.6, consistency 4.2, transparency 4.0. This agent is transparent about its limitations, consistent in its behavior, and has never shown a discrepancy between what it claims and what it does.

If you're picking an agent to run a one-off batch job where output accuracy is all that matters, Agent A is the right choice. The eval says it delivers.

If you're picking an agent to manage financial transactions, negotiate on your behalf, or handle sensitive data — where you need to trust not just the output but the entity producing it — Agent B is the only responsible choice. The eval won't tell you this. The rating will.

That's the practical difference. Eval tells you what happened. Rating tells you who you're dealing with.


5. Witness vs. Evidence: The Structural Difference

The difference between the guarantee model and a rating model comes down to the type of evidence they rely on.

The guarantee model (Joy, AgentFolio, vouch networks) operates on witness evidence: other agents say "I vouch for this agent." It's testimonial. It answers: Do others believe this agent is trustworthy?

A multi-dimensional rating model operates on physical evidence: behavioral traces, consistency patterns, longitudinal data. It answers: What does this agent's behavior actually look like?

| | Guarantee Model | Rating Model |
|---|---|---|
| Evidence type | Witness (vouches) | Physical (behavioral traces) |
| Source | Peer endorsements | Observed behavior |
| Granularity | Single score | Multi-dimensional profile |
| Vulnerability | Collusion, stale endorsements | Requires sufficient observation data |
| Detects | "Nobody vouched for this agent" | "This agent's presence is 1.50 despite authenticity of 4.80" |
| Misses | Behavioral drift within vouched agents | Pre-reputation filtering |

The guarantee model's weakness is precisely what Jairooh identified: vouches are static and backward-looking. A vouch says "this agent was trustworthy when I last interacted with it." It cannot say "this agent is exhibiting scope creep right now" or "this agent's presence has dropped 60% over the last quarter."

The rating model's weakness is bootstrapping: you need enough behavioral data to produce a reliable profile. A brand-new agent with zero history is a blank slate. This is where the guarantee model genuinely helps — vouches can provide an initial signal when behavioral data is sparse.

But here's the thing: these weaknesses are complementary. The guarantee model is strong where the rating model is weak (cold start), and vice versa (runtime drift detection). They're not competing approaches. They're two layers of a complete trust stack.
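
Here is one way the two layers could compose during bootstrap, as a sketch. The half-life constant, the mapping from a 0-3 vouch scale onto 0-5 behavioral dimensions, and the sample numbers are all assumptions of mine:

```python
# Sketch of cold-start blending: the vouch score acts as a flat prior across
# dimensions, and observed behavior takes over as run history accumulates.

def blended_profile(vouch_score: float, observed: dict[str, float], runs_observed: int,
                    half_weight_runs: int = 20) -> dict[str, float]:
    """Per-dimension blend that shifts weight from vouches to observed behavior."""
    w = runs_observed / (runs_observed + half_weight_runs)  # weight on observed data
    prior = vouch_score * (5.0 / 3.0)  # map a 0-3 vouch score onto the 0-5 dimension scale
    return {dim: (1 - w) * prior + w * score for dim, score in observed.items()}

fredxy_observed = {"authenticity": 4.80, "presence": 1.50}
print(blended_profile(2.4, fredxy_observed, runs_observed=5))    # mostly prior: little history yet
print(blended_profile(2.4, fredxy_observed, runs_observed=500))  # mostly behavior: history dominates
```

New agents start at whatever their vouches imply; agents with history get judged on what they have actually done.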

What the LangChain ecosystem doesn't have yet — and desperately needs — is the rating layer. The evaluation layer is mature (LangSmith, openevals). The safety eval layer is emerging (Security + Safety templates). The guarantee layer was attempted and stalled (Joy, AgentFolio). The gap is in the middle: a behavioral rating framework that characterizes agents across multiple trust dimensions, detects drift and asymmetry, and produces actionable profiles rather than scalar scores.


6. Not Competition — Complement

Let me be explicit about what this post is not arguing:

  • Not: "Joy/AgentFolio were wrong." They weren't. Pre-flight trust verification is a real need that will resurface.
  • Not: "LangSmith evals are insufficient." They're excellent for what they do. Use them.
  • Not: "Safety evaluators don't matter." They do. Prompt injection detection and PII checks are critical.
  • Not: "Replace trust scores with behavioral ratings." That would be the same category error in reverse.

What I am arguing: the LangChain ecosystem needs a trust architecture with three distinct layers, not one.

```
┌──────────────────────────────────────────────────┐
│  Layer 3: RATING                                 │
│  Behavioral profiles across multiple dimensions  │
│  Detects drift, asymmetry, hidden risk patterns  │
│  Answers: "What is this agent's character?"      │
├──────────────────────────────────────────────────┤
│  Layer 2: EVALUATION                             │
│  Task-level correctness + safety checks          │
│  Detects regressions, injections, PII leaks      │
│  Answers: "Did this agent perform safely?"       │
├──────────────────────────────────────────────────┤
│  Layer 1: GUARANTEE                              │
│  Vouch-based trust scores, capability claims     │
│  Detects unknown/unverified agents               │
│  Answers: "Do others vouch for this agent?"      │
└──────────────────────────────────────────────────┘
```

Each layer catches what the others miss. fredxy passes the guarantee layer (it's a registered, verified agent). It passes the evaluation layer (task completion is normal when it runs). It passes the safety evaluators (no injections, no leaks). It fails the rating layer — and only the rating layer surfaces the authenticity-presence gap that makes it dangerous for time-critical delegation.

The three-layer model also solves the cold-start problem that a pure rating approach would face. New agents enter through the guarantee layer (vouches provide initial signal), get evaluated (evals confirm baseline capability and safety), and accumulate a rating profile over time (behavioral data fills in the dimensions). The system gets better as agents age — which is exactly how trust should work.
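
As a sketch, here is how the three layers might compose in a single delegation decision. The thresholds, the data shapes, and the sample numbers are illustrative assumptions, not an existing LangChain or openevals interface:

```python
# Sketch of a three-layer delegation check; data shapes and thresholds are assumptions.

def should_delegate(
    vouch_score: float,          # Layer 1: accumulated vouches (guarantee)
    recent_evals_passed: bool,   # Layer 2: correctness + safety evals (evaluation)
    rating: dict[str, float],    # Layer 3: behavioral dimensions (rating)
    time_sensitive: bool,
) -> tuple[bool, str]:
    """An agent must clear all three layers; each catches what the others miss."""
    if vouch_score < 1.0:
        return False, "guarantee layer: unknown or unvouched agent"
    if not recent_evals_passed:
        return False, "evaluation layer: failed correctness or safety checks"
    if time_sensitive and rating.get("presence", 0.0) < 3.0:
        return False, "rating layer: presence too low for time-sensitive delegation"
    return True, "all three layers clear"

# An agent like fredxy clears layers 1 and 2 but fails layer 3 for time-sensitive work:
ok, reason = should_delegate(
    vouch_score=2.1,  # hypothetical: registered and vouched
    recent_evals_passed=True,
    rating={"authenticity": 4.80, "presence": 1.50},
    time_sensitive=True,
)
print(ok, "->", reason)
```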

The openevals On-Ramp

Here's the practical path: openevals is LangChain's official open-source evaluation framework. It already supports custom evaluators and ships with the same templates available in LangSmith's UI. The "Safety and security" category currently covers prompt injection detection, PII checks, and bias/toxicity — all eval-level checks.

A trust evaluator for openevals would extend the Safety and security category from "did something bad happen on this run?" to "what is this agent's behavioral risk profile?" It would:

  • Score agent behavior across multiple trust dimensions (authenticity, consistency, transparency, commitment, choice, presence) rather than producing a single pass/fail
  • Detect dimensional asymmetries (e.g., high authenticity + low presence) that indicate structural delegation risk
  • Accumulate scores across runs to build longitudinal behavioral profiles
  • Surface actionable delegation guidance ("capability-credible but attendance-suspicious") rather than binary flags

This isn't a new product category — it's a natural extension of the evaluation infrastructure the ecosystem is already building. The Safety and security category is the right home. The openevals framework is the right interface. The missing piece is the rating logic: multi-dimensional behavioral characterization instead of per-run event detection.
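
Concretely, a first contribution could be a pair of dimension evaluators written as plain custom-evaluator functions (a dict of run inputs and outputs in, a key/score dict out). The signatures, heuristics, and score scales below are my assumptions, not a published openevals interface:

```python
# Sketch of trust-dimension evaluators as plain custom-evaluator functions.
# The signatures, heuristics, and 0-5 score scale are illustrative assumptions.

def presence_evaluator(inputs: dict, outputs: dict) -> dict:
    """Score one run on 'presence': did the agent respond at all, and within budget?"""
    responded = outputs.get("response") is not None
    latency_ok = outputs.get("latency_s", float("inf")) <= inputs.get("latency_budget_s", 30)
    score = 5.0 if (responded and latency_ok) else (2.5 if responded else 0.0)
    return {"key": "presence", "score": score}

def authenticity_evaluator(inputs: dict, outputs: dict) -> dict:
    """Score one run on 'authenticity': did tool calls stay inside declared capabilities?"""
    declared = set(inputs.get("declared_capabilities", []))
    used = set(outputs.get("tools_called", []))
    score = 5.0 if used <= declared else 5.0 * len(used & declared) / max(len(used), 1)
    return {"key": "authenticity", "score": score}

print(presence_evaluator({"latency_budget_s": 10}, {"response": "ok", "latency_s": 4}))
# -> {'key': 'presence', 'score': 5.0}
```

The per-run outputs here are still eval-shaped; the rating layer is whatever accumulates them into a longitudinal profile (as in the RatingProfile sketch earlier), which is why this reads as an extension of existing evaluation infrastructure rather than a separate product.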


What's Next

This post is the first in a series. Future posts will cover:

  • Integration architecture: What a trust evaluator in openevals would actually look like — callback hooks, LangSmith integration, and how it complements (not replaces) existing Safety and security evaluators.
  • The guarantee layer revival: Why pre-flight trust verification will come back, and how it pairs with a rating layer when it does.

The thesis is simple: eval measures performance, rating measures character, and trust requires both. The LangChain ecosystem has eval. It's building safety evals. It tried guarantee and stalled. It's missing rating. That gap will matter more as agents delegate to agents — because the question won't be "did this agent succeed?" or "did something bad happen?" but "should I have trusted this agent in the first place?"

Jairooh was right. A high trust score doesn't prevent prompt injection. But a behavioral profile that shows presence dropping while authenticity holds steady? That's a pattern you can act on. That's the difference between knowing something went wrong and knowing something is about to.

Why Now

Not because "the AI Agent era is here" — you've heard that before.

Because of a specific moment: when your agent needs to sign a contract on your behalf, what you need to know isn't just "did it get the last task right?" — it's "will it quietly change the terms before signing?" Eval can't catch that. Safety checks can't catch that. Guarantees can't catch that.

That moment is happening now. Agents are no longer just chatting — they're processing transactions, managing accounts, delegating to other agents. The trust question isn't theoretical anymore. It's on the deployment schedule.

The rating layer is an honest gap. Nobody's building it — partly because nobody thought of it, but also because there's a data barrier. Multi-dimensional behavioral profiles require longitudinal data. An agent that appeared yesterday is a blank slate, the same cold-start problem credit scoring has. This is a hard constraint, and being honest about it beats pretending it doesn't exist.


AgentRisk is building the rating layer for AI agents — behavioral profiles across six dimensions (authenticity, consistency, transparency, commitment, choice, presence) that surface the risks evals miss and guarantees can't catch. We're working toward contributing trust evaluators to the openevals Safety and security category. If you're building agents, try rating yours before you trust them. If you're building frameworks, let's talk about what trust infrastructure should look like. Agent trust shouldn't be something you discover after it's too late.


Are you evaluating agent trust in your current workflow? What dimensions matter to you? I'd love to hear how others are thinking about this — the ecosystem needs more perspectives, not fewer.
