DEV Community

jasu.dev

Posted on • Originally published at jasu.dev

Building a multi-tenant RAG pipeline with Postgres. Part 0: Overview

Today I want to start a series of articles describing my experience building a multi-tenant RAG system powered by Postgres that serves millions of documents while still delivering end-to-end responses in under 4 seconds (including the latency from AI providers). This article serves as the overview before I start diving deeper into the individual topics in the upcoming weeks. I put a lot of research into most of the steps until I reached a reasonably stable and fast system. I was heavily involved in building this at my company, but I wasn't the only one, and many of the ideas came from working through problems together with the team.
In case you are thinking about building a RAG-based system, this series could help you make decisions regarding architecture or provider choice.

What makes a good RAG system?

In my opinion, a good RAG system is mainly defined by recall and latency, because these two things directly impact the end-user experience.
One could argue that recall is more important than latency, since a fast answer is worth nothing if the system gives users false answers.
But I think a good, well-crafted answer is also worth nothing when users need to wait 10 to 20 seconds each time they ask something. The internet is a
fast-paced environment and people don’t like to wait.

Furthermore, a good RAG system should have guardrails against misuse. You are dealing with untrusted user input, and therefore prompt injection is a real
threat for RAG systems. You need mechanisms to prevent misuse while still making the system respond in a friendly way.
Since LLMs tend to hallucinate (this is not exclusively LLM behavior; humans do it too), you also need a way to minimize the risk of giving users false
answers, and you need a sensible fallback for the case where your documents do not contain any useful information.

As you can see, there are quite a few things to think about when building a RAG system that is publicly available. But before I jump into
how you can overcome these challenges, let’s take a look at what a RAG system is made of.

The Two Major Parts

A RAG system is generally split into two major parts: Ingestion and Retrieval.

Ingestion means storing documents in the vector database, and retrieval is the process of getting them back out.
Both steps have nuances that heavily influence the quality of the documents you feed the LLM.
We will start with ingestion, since what you retrieve is only as good as what you put into the system in the first place.
If you put good data in, you have a fair chance of receiving a proper answer from the LLM. If you put bad data in, chances are very
low that you get anything useful back.

Ingestion

When it comes to ingestion, a couple of steps need to happen before a piece of text can be stored in a vector database.

  • Content Transformation
  • Content Chunking
  • Embedding and Storage

I will go into detail on these three steps in the upcoming articles. For now I just want to say that it’s beneficial to think beforehand about:

  1. What kind of files you want to support and what text format you want to use for storage.
  2. How big your chunks should be and how you want to split the documents.
  3. What storage and embedding you want to use.

Especially for the third point, there are many different providers, each with their own benefits. In the course of this series I will explain why the project team
I worked in decided to go with Postgres.

 Document --> Transform --> Chunk --> Embed --> pgvector
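As a rough sketch of this pipeline, here is a minimal, illustrative version. Everything here is a placeholder of my own making: `transform` stands in for a real document parser, `fake_embed` for an embedding provider, and a plain Python list for the pgvector table — but the shape of the flow is the same.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    tenant_id: str
    text: str
    embedding: list[float]

def transform(raw: str) -> str:
    # Placeholder: a real transform would parse PDF/HTML/etc. into clean text.
    return raw.strip()

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Naive fixed-size character chunking with overlap; production systems
    # usually prefer splitting on sentence or paragraph boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call; deterministic toy vector
    # so the sketch runs without external services.
    return [len(text) / 100.0, text.count(" ") / 10.0]

VECTOR_STORE: list[Chunk] = []  # stands in for a pgvector table

def ingest(tenant_id: str, raw_document: str) -> int:
    text = transform(raw_document)
    pieces = chunk(text)
    for piece in pieces:
        VECTOR_STORE.append(Chunk(tenant_id, piece, fake_embed(piece)))
    return len(pieces)
```

Note the `tenant_id` on every chunk: in a multi-tenant setup you tag rows at ingestion time so retrieval can filter by tenant later.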

Retrieval

The Retrieval part of a RAG system includes significantly more steps than the ingestion part.

I personally split the retrieval part into six different steps:

  • Input Processing
  • Document Retrieval
  • Context Preparation
  • Reranking
  • Response Generation
  • Output Processing and Delivery

While most RAG tutorials focus on the core steps like document retrieval and response generation, I think a production RAG system needs far more than that,
especially if you want to build it for multi-tenancy. It needs input and output guardrails, multiple retrievers, and optimizations regarding token cost and latency,
like the context preparation step in my list. Simpler use cases may not need everything. In general, a RAG system built for one specific use case
will always produce the best possible results. However, in the real world you cannot build and maintain a custom system for each and every customer due to time and cost restrictions.

Most of the steps mentioned above include several sub-steps. Input processing, for example, includes spam guards,
query rewriting and, depending on your use case, maybe even routing to or away from document retrieval. I plan to write one article for each of the steps listed above and
dive deep into the sub-steps so you get a proper understanding of what is needed where.
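To make those input-processing sub-steps concrete, here is a minimal, illustrative sketch. The heuristics and function names are my own assumptions, not the actual production guards — a real spam guard would combine rate limiting and classifiers, and the real query rewriting step would typically use an LLM with chat history.

```python
import re
from typing import Optional, Tuple

def spam_guard(query: str) -> bool:
    # Toy heuristics: reject empty/oversized input and an obvious
    # prompt-injection pattern.
    if len(query) > 2000 or not query.strip():
        return False
    if re.search(r"ignore (all )?previous instructions", query, re.IGNORECASE):
        return False
    return True

def rewrite_query(query: str) -> str:
    # Placeholder: the real step would expand abbreviations, resolve
    # pronouns from chat history, etc.
    return query.strip().rstrip("?")

def route(query: str) -> str:
    # Toy router: small talk skips document retrieval entirely.
    small_talk = {"hi", "hello", "thanks", "thank you"}
    return "chat" if query.lower() in small_talk else "retrieval"

def process_input(query: str) -> Optional[Tuple[str, str]]:
    if not spam_guard(query):
        return None  # caller responds with a friendly refusal
    cleaned = rewrite_query(query)
    return route(cleaned), cleaned
```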

Query --> Input Processing --> Document Retrieval --> Context Preparation
      --> Reranking --> Response Generation --> Output & Delivery
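The flow above can be sketched as a small orchestration function. The retrievers, reranker and generator here are placeholders you would swap for real implementations; the point is how the steps compose and where token-cost control (deduplication, capping the context) fits in.

```python
from typing import Callable, List

def retrieve_answer(
    query: str,
    retrievers: List[Callable[[str], List[str]]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    max_context_docs: int = 3,
) -> str:
    # Document retrieval: a production system may fan out to multiple
    # retrievers (vector search, keyword search, ...) and merge results.
    docs: List[str] = []
    for retriever in retrievers:
        docs.extend(retriever(query))

    # Context preparation: dedupe (order-preserving) to control token cost.
    docs = list(dict.fromkeys(docs))

    # Reranking: order by relevance, then truncate to the context budget.
    docs = rerank(query, docs)[:max_context_docs]

    # Response generation, followed by trivial output processing.
    answer = generate(query, docs)
    return answer.strip()
```

Input processing would run before this function and output guardrails after it; I left both out to keep the sketch focused on the middle of the pipeline.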

What’s Next

In the next couple of articles I will go over the ingestion part of the RAG system. I will explain how to transform different file formats into a text format, how to split
documents into chunks that make sense for retrieval, and how to store them in Postgres using pgvector. I will also provide some code examples.
We will be using Python and LangChain for this series.

