If you've ever tried to grep a PDF you scanned six months ago, you already know why paperless-ngx exists. It's a Django + Postgres + Redis application that watches a folder, runs OCR on whatever you drop in, extracts metadata, applies tags, and serves the result through a searchable web UI and a REST API you can actually script against.
We ran a paperless-ngx instance against roughly 1,800 receipts, contracts, and PDFs over the past several weeks to see whether the "self-hosted alternative to Evernote/Dropbox" pitch holds up for developers who'd rather own their data and wire their own automations. Short version: it does, but the operational footprint and the gaps in classification accuracy are worth knowing before you commit a weekend to it.
The stack you're actually running
Paperless-ngx ships as a Docker Compose bundle. The reference deployment runs five containers: the Django webserver, a Redis broker, Postgres (or MariaDB, or SQLite for hobbyist installs), Gotenberg for office-document conversion, and Tika for content extraction. On a small VPS — 2 vCPU, 4 GB RAM — the whole thing idles around 600 MB of memory and spikes during OCR of large scans.
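Abridged, the Compose topology looks roughly like this. Treat it as a sketch of the reference deployment, not a drop-in file: image tags, ports, and volume paths are placeholders you'd take from the project's own docker-compose examples.

```yaml
services:
  broker:
    image: redis:7
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless      # placeholder; generate your own
  gotenberg:
    image: gotenberg/gotenberg:8        # office-document conversion
  tika:
    image: apache/tika:latest           # content extraction
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    depends_on: [db, broker, gotenberg, tika]
    ports: ["8000:8000"]
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
    volumes:
      - ./consume:/usr/src/paperless/consume   # the watched drop folder
```

The consume volume is the host-mounted folder the ingestion pipeline watches; everything else is plumbing between the five services.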
The ingestion pipeline is the part developers care about. You drop a file into the consume/ directory (mounted from the host) and a Celery worker picks it up. The worker detects the file type and routes office docs through Gotenberg/Tika, runs Tesseract OCR on image-only PDFs via the ocrmypdf wrapper, stores both the original and an OCR'd searchable PDF, applies matching rules to assign tags, correspondents, and document types, and indexes the full text in Whoosh for search.
The fact that the OCR'd output is a real searchable PDF — not a sidecar text file — matters because every downstream tool (preview, sharing, printing) gets text selection for free. That's ocrmypdf doing the heavy lifting underneath; paperless-ngx is the orchestration layer.
The classification system uses scikit-learn under the hood. It trains a Naive Bayes classifier on your tagged corpus, so accuracy starts poor and improves as you correct mistakes. Plan to tag a few hundred documents manually before the auto-tagging is worth trusting.
The REST API is documented and covers everything the UI does: uploading documents, querying by tag or correspondent, fetching the OCR text, attaching notes, even triggering reprocessing. There's no separate "admin API" — the same endpoints handle automation and human use. Auth is via token or session, and you can scope tokens to a user.
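As a taste of what scripting against it looks like, here's a minimal sketch using `requests`. The base URL and token are placeholders for your instance; the `query` and `fields` parameters are the ones the documents endpoint accepts.

```python
# Sketch: querying paperless-ngx and pulling OCR text over the REST API.
# BASE_URL and the token are placeholders for your own instance.
import requests

BASE_URL = "http://localhost:8000"                  # assumption: default port
HEADERS = {"Authorization": "Token YOUR_API_TOKEN"}

def doc_url(doc_id: int) -> str:
    """Build the detail URL for one document."""
    return f"{BASE_URL}/api/documents/{doc_id}/"

def search(query: str) -> list[dict]:
    """Full-text search over the OCR'd corpus."""
    r = requests.get(f"{BASE_URL}/api/documents/",
                     params={"query": query}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["results"]

def ocr_text(doc_id: int) -> str:
    """Fetch only the OCR'd content field for one document."""
    r = requests.get(doc_url(doc_id), params={"fields": "content"},
                     headers=HEADERS)
    r.raise_for_status()
    return r.json()["content"]

# Usage against a live instance:
# for doc in search("invoice 2025"):
#     print(doc["id"], doc["title"])
```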
Where it slots into an AI workflow
The interesting question for 2026 isn't "can paperless-ngx replace your scanner software." It's "can it be the document substrate for the LLM tools you're already building." A few patterns we've seen work:
Embedding pipeline source. The REST API exposes /api/documents/{id}/?fields=content which returns the full OCR text. A small worker can poll for documents tagged needs-embedding, push the text into your vector store, then strip the tag. The Whoosh index isn't pretending to be a vector DB, so you keep paperless-ngx for storage and keyword search and use your own embeddings for semantic retrieval.
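A sketch of that worker loop, with the vector-store push left as a callback (the tag ID and `embed_and_store` are assumptions about your setup):

```python
# Sketch: poll for documents tagged "needs-embedding", hand the OCR text
# to your vector store, then strip the marker tag via PATCH.
import requests

BASE_URL = "http://localhost:8000"                  # assumption
HEADERS = {"Authorization": "Token YOUR_API_TOKEN"}
NEEDS_EMBEDDING = 42                                # hypothetical tag ID

def pending_documents() -> list[dict]:
    """All documents still carrying the marker tag."""
    r = requests.get(f"{BASE_URL}/api/documents/",
                     params={"tags__id__all": NEEDS_EMBEDDING},
                     headers=HEADERS)
    r.raise_for_status()
    return r.json()["results"]

def strip_tag(tags: list[int]) -> list[int]:
    """Tag list with the worker's marker tag removed."""
    return [t for t in tags if t != NEEDS_EMBEDDING]

def process(doc: dict, embed_and_store) -> None:
    embed_and_store(doc["id"], doc["content"])      # your vector store
    requests.patch(f"{BASE_URL}/api/documents/{doc['id']}/",
                   json={"tags": strip_tag(doc["tags"])},
                   headers=HEADERS).raise_for_status()
```

Running this on a timer (cron, a Celery beat, whatever you already have) is enough; there's no need for anything event-driven unless latency matters.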
Custom auto-tagging. The built-in classifier is fine for structured stuff — every invoice from one vendor looks like every other one — but falls over on free-form documents. We replaced the classifier for one document type with a call to a small Claude Haiku prompt that returns a JSON list of tags, then PATCHes the document via the API. Cost worked out to roughly $0.0004 per document at current Haiku input pricing. Worth it for the documents where classification matters; overkill for receipts.
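The shape of that replacement, sketched below. `classify` stands in for your model call (the Haiku prompt in our case) and `tag_ids` is an assumed name-to-ID mapping you'd build once from the tags endpoint; neither is part of paperless-ngx itself.

```python
# Sketch: LLM-driven tagging for one document type, applied via PATCH.
# classify() is a stand-in for your model call; tag_ids maps tag names
# to paperless-ngx tag IDs and is assumed to be built elsewhere.
import requests

BASE_URL = "http://localhost:8000"                  # assumption
HEADERS = {"Authorization": "Token YOUR_API_TOKEN"}

def tags_to_ids(names: list[str], tag_ids: dict[str, int]) -> list[int]:
    """Map model-returned tag names onto known IDs, dropping unknowns."""
    return [tag_ids[n] for n in names if n in tag_ids]

def apply_llm_tags(doc: dict, classify, tag_ids: dict[str, int]) -> None:
    names = classify(doc["content"])     # e.g. returns ["legal", "2025"]
    merged = sorted(set(doc["tags"]) | set(tags_to_ids(names, tag_ids)))
    requests.patch(f"{BASE_URL}/api/documents/{doc['id']}/",
                   json={"tags": merged},
                   headers=HEADERS).raise_for_status()
```

Dropping unknown tag names rather than creating them is a deliberate choice: it keeps a hallucinating model from polluting your tag taxonomy.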
Webhook-driven workflows. Recent versions added a workflow engine with conditions and actions, including HTTP webhooks. You can fire a webhook when a document matching certain criteria is consumed, which is the cleanest hook point for downstream automation. Before workflows existed, people polled the API. The polling approach still works and is simpler if you only care about one document type.
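On the receiving end, a webhook handler can be as small as this stdlib sketch. The payload shape depends on how you configure the workflow's webhook action, so treat the parsing as an assumption:

```python
# Minimal webhook receiver sketch, stdlib only. The JSON body shape
# depends on your workflow's webhook configuration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_event(body: bytes) -> dict:
    """Decode the webhook payload; returns {} on malformed input."""
    try:
        return json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return {}

class Hook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = parse_event(self.rfile.read(length))
        print("document event:", event)   # hand off to your pipeline here
        self.send_response(204)           # no body needed in the reply
        self.end_headers()

# To run it:
# HTTPServer(("0.0.0.0", 8080), Hook).serve_forever()
```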
Email ingestion. Paperless-ngx can poll an IMAP account, pull attachments, and consume them. We pointed a dedicated mailbox at it for receipts. Combined with a rule that auto-forwards anything matching common receipt patterns from your main inbox, you get a zero-touch capture pipeline.
The honest limitation: paperless-ngx is not an LLM tool and doesn't pretend to be. There's no built-in "ask your documents" UI. If you want chat-with-your-PDFs, you're building it yourself on top of the API. That's a feature if you care about which model touches your data, and a dealbreaker if you wanted it turnkey.
Self-hosting tradeoffs
The setup cost is real. Reading the install docs, generating secrets, configuring OCR languages, mounting volumes, getting the consume folder permissions right — call it half a day to a full day if you've used Docker Compose before. After that, ongoing maintenance is roughly quarterly: review the changelog, pull new images, run database migrations, check backups.
Backups are where most self-hosted document setups fail. Paperless-ngx ships a document_exporter management command that produces a portable manifest plus the original files. If you only back up the Postgres dump and the media folder, you'll be fine 95% of the time and miserable the 5% of the time you need to migrate to a new instance. Run the exporter on a cron.
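Assuming the reference Compose layout, a nightly crontab entry might look like the following; the compose file path and export directory are placeholders for your own:

```shell
# Nightly export: portable manifest + original files, 03:00 daily.
# Paths are placeholders; ../export should be a mounted volume.
0 3 * * * docker compose -f /opt/paperless/docker-compose.yml \
    exec -T webserver document_exporter ../export
```

Point your existing backup tooling at the export directory and you get restores that don't depend on database-dump compatibility across versions.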
The other tradeoff is search quality. Whoosh is a pure-Python search library, adequate for keyword search across a few thousand documents. It's not Elasticsearch. If you've got hundreds of thousands of pages and want fuzzy matching, ranking experiments, and faceted search out of the box, you'll outgrow it. For personal and small-team archives, it's fine.
Hardware: a Raspberry Pi 4 will run paperless-ngx, but OCR of large color scans will take minutes per document. A small x86 box (N100 mini PC, used SFF desktop, around $200) cuts OCR time to seconds and is what we'd recommend if you're scanning meaningfully.
After a few weeks of daily use, the friction points were predictable. Initial classifier accuracy is poor enough that you'll do a lot of manual tagging in the first month; there's no way around training data. The mobile web UI works but isn't the right capture surface; the Paperless Mobile community app (Android/iOS) is what you actually want for phone scans. Search indexing happens on document save, which means bulk imports of thousands of files briefly pin a CPU, so run large imports during off-hours.
None of these are blockers. They're the cost of running infrastructure you own instead of renting it.