Gokul Jinu
Why we built tag-graph memory for AI agents — and shipped a Python SDK for it

I spent most of last year trying to solve a deceptively narrow problem: how do you give an LLM agent persistent memory that's bounded, predictable, and doesn't blow your token bill?

I tried a lot of things. Vector DBs gave me fuzzy results that were impossible to token-budget. Raw conversation history blew past context windows in 5 turns. "Summarize and re-inject" silently dropped the one fact the agent needed three turns later.

Today I shipped the first Python SDK for what we ended up building — MME (Memory Management Engine). It's a bounded tag-graph memory engine, and it's a different shape of memory than the vector-DB-by-default story you hear everywhere.

This is a writeup of the design choices, why each one matters in production, and what's in the SDK if you want to try it.

The three problems with vector-search agent memory

Vector retrieval is the default answer to "how do I give my agent memory" because embeddings are universal and pgvector / Pinecone / Weaviate are easy to host. But for agent memory specifically (as opposed to RAG over documents), three things keep biting you:

1. You can't token-budget the result

You ask for top-K = 5 documents. You get five chunks back. Each chunk could be 80 tokens or 800 tokens. You don't know until you tokenize the response. So either you over-budget (waste money on every call) or under-budget (truncate mid-sentence and the LLM gets garbage).
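A toy illustration of the problem (the whitespace split is a stand-in for a real tokenizer; the chunks are invented): two equally valid top-5 result sets from the same index can have wildly different token costs, and you only find out after retrieval.

```python
# Hypothetical top-K results: the API contract is "5 chunks",
# not "N tokens". Whitespace counting stands in for a tokenizer.
def naive_token_count(text: str) -> int:
    return len(text.split())

short_chunks = ["I prefer dark chocolate."] * 5
long_chunks = [("A much longer chunk " * 40)] * 5

short_cost = sum(naive_token_count(c) for c in short_chunks)
long_cost = sum(naive_token_count(c) for c in long_chunks)

print(short_cost, long_cost)  # same K, very different token bills
```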

2. Cosine similarity rewards the wrong things

For a question like "what are my food preferences?", a chunk containing the literal phrase "food preferences" beats a chunk that says "I'm allergic to peanuts and I prefer dark chocolate" — even though the second chunk is what you actually want. Embeddings encode lexical similarity at least as much as semantic relevance.

3. There's no learning loop

Every retrieval is independent. The system never improves from "this pack worked for that query" feedback. To improve, you re-train embeddings or re-chunk — both are heavy ops you don't run on every accepted pack.

The shape that worked: a bounded tag-graph

The core idea: instead of embeddings, store memories as structured tag sets, and retrieve by walking a graph from query tags to memory tags.

When you save a memory like "I prefer dark chocolate", MME extracts a small set of structured tags — food, preference, dark_chocolate, food_item — with weights. These tags become nodes in a graph; their co-occurrence creates weighted edges.

When you query "what are my food preferences?", MME does the same tag extraction on the query (yielding seed tags S like food, preference), then walks the graph:

  • From each seed, follow up to M = 32 highest-weight edges
  • Repeat to depth D = 2 with a decay factor α applied per hop
  • Trim the activated tag set to a beam width B = 128
  • Find memories whose tags are in the activated set
  • Score by activation × recency × importance − diversity penalty
  • Pack greedily until the token budget is hit (exact tiktoken count)

The bound is mathematical: O(|S| · M^D) tags activated, hard cap at beam width. In practice this gives p95 latency of 135 ms across a 25-minute soak of 150K requests with 0% errors. (I obsessed over this; the bounds aren't decorative.)
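The walk above can be sketched in a few lines. This is a minimal illustration using the post's parameter names (M, D, α as alpha, B) over an adjacency-dict graph; the real engine's internals may differ.

```python
# Bounded spreading activation: follow top-M edges per tag, decay per
# hop, stop at depth D, hard-cap the activated set at beam width B.
from typing import Dict

def bounded_walk(
    graph: Dict[str, Dict[str, float]],   # {tag: {neighbor: edge_weight}}
    seeds: Dict[str, float],              # seed tags S with activation 1.0
    M: int = 32, D: int = 2, alpha: float = 0.5, B: int = 128,
) -> Dict[str, float]:
    activated = dict(seeds)
    frontier = dict(seeds)
    for _ in range(D):
        next_frontier: Dict[str, float] = {}
        for tag, act in frontier.items():
            # follow only the M highest-weight edges from this tag
            edges = sorted(graph.get(tag, {}).items(),
                           key=lambda kv: kv[1], reverse=True)[:M]
            for nbr, w in edges:
                spread = act * w * alpha
                if spread > activated.get(nbr, 0.0):
                    activated[nbr] = spread
                    next_frontier[nbr] = spread
        # trim to beam width: keep only the B strongest activations
        activated = dict(sorted(activated.items(),
                                key=lambda kv: kv[1], reverse=True)[:B])
        frontier = {t: a for t, a in next_frontier.items() if t in activated}
    return activated

graph = {
    "food": {"preference": 0.9, "dark_chocolate": 0.7},
    "preference": {"dark_chocolate": 0.8},
}
print(bounded_walk(graph, {"food": 1.0, "preference": 1.0}))
```

Each hop touches at most M edges per frontier tag and the trim keeps the set at B, which is where the O(|S| · M^D) bound comes from.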

Why each piece matters

Bounded propagation. Without a depth + beam cap, graph walks degenerate to "activate everything" on dense graphs. The cap means latency is predictable regardless of graph size. This is the single biggest reason MME is production-runnable.

Token-budgeted packs. The packer is a hard constraint, not a soft target. You ask for 1024 tokens, you get ≤ 1024. Items that don't fit are skipped, not truncated. This means you can prompt-engineer with confidence: your context window allocation is real.
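The skip-don't-truncate behavior is easy to sketch. This is an illustrative greedy packer, with whitespace counting standing in for the exact tiktoken count the real packer uses.

```python
# Greedy pack under a hard token budget: items that don't fit are
# skipped whole, never truncated mid-sentence.
from typing import List, Tuple

def pack(scored_items: List[Tuple[float, str]], token_budget: int) -> List[str]:
    packed, used = [], 0
    for _, text in sorted(scored_items, key=lambda it: it[0], reverse=True):
        cost = len(text.split())          # stand-in for exact tokenization
        if used + cost > token_budget:
            continue                      # hard constraint: skip, don't trim
        packed.append(text)
        used += cost
    return packed

items = [
    (0.9, "I'm allergic to peanuts."),
    (0.8, "I prefer dark chocolate."),
    (0.7, "word " * 2000),                # can never fit a 1024-token budget
]
result = pack(items, token_budget=1024)
print(len(result))  # → 2: the oversized item is skipped, not truncated
```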

Online learning. When the agent accepts a pack and the downstream call succeeds, MME updates edge weights via EMA from the feedback signal. After a few hundred accepted packs, the graph self-tunes to your usage patterns. No retraining, no embedding refreshes, no offline pipeline.
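One possible shape for that update, assuming a feedback signal in [0, 1] (1.0 = pack accepted and the downstream call succeeded); the smoothing factor here is illustrative, not the engine's actual value.

```python
# Exponential moving average on an edge weight: each feedback event
# drifts the weight toward the observed signal, no retraining needed.
def ema_update(weight: float, signal: float, beta: float = 0.1) -> float:
    return (1 - beta) * weight + beta * signal

w = 0.5
for _ in range(300):        # a few hundred accepted packs...
    w = ema_update(w, 1.0)  # ...each nudge the edge weight upward
print(w)                    # converges toward 1.0
```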

Online tagging. New memories get tagged at write time by a small LLM-backed tagger that knows the existing tag vocabulary. Tags are reused where possible (the dark_chocolate tag persists across users in the same scope), so the graph densifies as you save more memories — which is what you want.
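The reuse step can be sketched as a normalize-then-lookup against the existing vocabulary. The normalization rule here is illustrative; the real tagger is LLM-backed and smarter than a string transform.

```python
# Vocabulary reuse at write time: "Dark Chocolate" resolves to the
# existing dark_chocolate node instead of minting a near-duplicate.
def resolve_tag(raw: str, vocabulary: set) -> str:
    tag = raw.strip().lower().replace(" ", "_")
    if tag in vocabulary:
        return tag            # reuse: the graph densifies around this node
    vocabulary.add(tag)       # otherwise mint a new node
    return tag

vocab = {"food", "preference", "dark_chocolate"}
print(resolve_tag("Dark Chocolate", vocab))  # → dark_chocolate (reused)
print(len(vocab))                            # → 3, no new node minted
```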

The Python SDK (shipped today)

pip install railtech-mme

Three calls cover 90% of what you'll do:

from railtech_mme import MME

with MME() as mme:
    # Save a memory — auto-tagged on the server
    mme.save("I prefer dark chocolate.")
    mme.save("I'm allergic to peanuts.")

    # Inject — get a token-budgeted pack
    pack = mme.inject(
        "What are my food preferences and allergies?",
        token_budget=1024,
    )
    for item in pack.items:
        print(item.excerpt)

    # Feedback — close the learning loop
    mme.feedback(pack_id=pack.pack_id, accepted=True)

There's a parallel AsyncMME for async stacks, full Pydantic models on every request/response, and an exception taxonomy (MMEAuthError, MMERateLimitError, MMETimeoutError, etc.) so you can write proper error handling.

LangChain integration

The integration is a first-class extra, not a wrapper:

pip install 'railtech-mme[langchain]'
from railtech_mme.langchain import MMEInjectTool, MMESaveTool
from langgraph.prebuilt import create_react_agent

tools = [MMEInjectTool(), MMESaveTool()]
agent = create_react_agent(llm, tools)  # llm: any LangChain chat model

Both tools have proper Pydantic schemas, so the LLM sees clean parameter descriptions when deciding whether to call them. MMEInjectTool returns a token-budgeted pack; MMESaveTool lets the agent persist new memories with optional section/source tags.

What's not yet there (honest beat)

  • The SDK is one day old. v0.1.0 shipped yesterday; v0.1.1 today after end-to-end verification surfaced two real bugs (recent() was crashing on real responses, and the README quickstart was using a paraphrase that didn't activate the cold-start tag graph). Both fixed.
  • Docs are minimal. The README has a quickstart, the dashboard at mme.railtech.io has the Python section, but you'll find missing pieces — please open issues.
  • The backend has been in production for ~6 months, so the server is mature. The Python client is what's new and what I'd love feedback on.
  • LangGraph examples beyond basic tool-binding aren't in the repo yet. They're next.

Why I'm sharing this

Two reasons:

First, I think tag-graph memory is genuinely a different design point than vector search, and I'd like more people to push on it — find where it breaks, find where it shines. The math is in the README; the code is Apache-2.0 on GitHub.

Second, this is launch day for the Python SDK. If you're building agents in Python and you've felt the pain of "my agent doesn't remember things well," I'd love it if you tried it and told me what's clunky about the API.

Happy to answer technical questions in the comments — the bounded retrieval math, the LangChain tool design, why we chose tag-graph over hybrid vector+keyword, or anything about the prod observability stack.

Top comments (2)

PEACEBINFLOW

The token-budgeted packer being a hard constraint rather than a soft target is the design choice that makes the rest of it honest. I've been burned enough times by retrieval systems that return "approximately" what I asked for, and the approximation is always in the direction that breaks something downstream. Truncation mid-sentence is worse than silence. At least silence doesn't look plausible.

What I find myself thinking about is how this inverts the usual memory narrative. The default story in agent memory is "store everything, search later, hope the embedding finds the right stuff." This says "store everything, but retrieve by structure, and cap it exactly." The tag graph is basically a bet that the relationships between tags encode more useful retrieval signal than the semantic proximity of raw text. It's almost an old-school knowledge representation move dressed in modern infrastructure. Graph walking with decay factors and beam width feels closer to what a search engine does than what a vector database does.

The online learning loop is the part I'm most curious about in practice. EMA on edge weights from accept/reject feedback sounds elegant, but I wonder about the cold start behavior on sparse graphs—when there aren't enough accepted packs yet for the tuning to mean much. Does the system fall back gracefully to raw tag overlap, or is there a minimum density threshold before the graph walk starts returning useful results? That feels like the kind of thing that's smooth in a 6-month-old production backend but might surprise someone on day one with a fresh namespace.

Gokul Jinu

Appreciate the thoughtful read — especially the framing of tag-graph retrieval as "KR dressed in modern infra." That's accurate, and it's the part I keep going back and forth on. Vector similarity is genuinely good at "find me something that sounds like this"; the thing I kept wanting was "find me the thing I wrote about X three weeks ago, not something that resembles it." Structure beats semantic proximity when the question is about identity rather than similarity.

On cold start — there's no single density threshold, it's layered:

  • Day 1, empty corpus: tagmaker extracts tags from save #1, so the tag index works immediately. If a save has multiple tags you get co-occurrence edges instantly. At memory #1 you already have seed-tag lookup plus single-hop walks on those edges.
  • Pre-feedback phase: the beam walk runs, but EMA edge-weight learning hasn't kicked in yet, so edges sit at their initial weights. Propagation is effectively "walk the co-occurrence graph at constant weight" — i.e., raw tag overlap with graph expansion, which is the fallback you're asking about. There's no explicit gate; the behavior just naturally converges to that on sparse graphs.
  • IDF scoring layer: this one does have an explicit corpus-size minimum (50 by default; we're running 20 in prod). Below that, IDF estimates are too noisy and activations fall back to uniform weighting. It's the only component with a hard density threshold.
  • EMA-tuned edges: these need a few hundred pack_events before the learned weights meaningfully diverge from the initial values. Until then you're on defaults.
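For concreteness, the IDF gate described above looks roughly like this (the log-IDF formula is the textbook one, not necessarily the engine's exact variant, and the parameter names are mine):

```python
# Corpus-size gate on IDF scoring: below the minimum, document-frequency
# estimates are too noisy, so tag weights fall back to uniform.
import math

def tag_weight(tag_doc_freq: int, corpus_size: int, min_corpus: int = 50) -> float:
    if corpus_size < min_corpus:
        return 1.0  # cold start: uniform weighting for every tag
    return math.log(corpus_size / (1 + tag_doc_freq))

print(tag_weight(tag_doc_freq=3, corpus_size=20))         # below the gate
print(tag_weight(tag_doc_freq=3, corpus_size=200) > 1.0)  # IDF active
```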