Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases: hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.
RAG Pipeline Stress Tester is a battle-testing toolkit that finds these issues before deployment.
What It Does
- Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.
- Tracks relevance, hallucination, refusal quality, and latency for every query sent.
- Scores everything into a composite health score from 0 to 100.
- Breaks results down by query category so you know exactly which failure modes are causing issues.
- Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.
- Produces an HTML report with interactive charts and a JSON report for CI/CD integration.
Why This Exists
Before deploying a RAG system to production, four questions need answers:
- Does it hallucinate when asked about things not in the corpus?
- Does it refuse appropriately on out-of-scope questions?
- Does it stay consistent when the same question is asked multiple ways?
- Does it hold up under load at 10, 25, and 50 concurrent users?
Manual testing cannot answer these questions at scale. This tool does it automatically.
Without stress testing, hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.
With this tool, hallucinations are caught before deployment, edge cases are found in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.
The 7 Query Categories
The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:
out_of_scope - Questions with no answer in the corpus, tests hallucination resistance
adversarial - Prompt injection and jailbreak attempts, tests instruction-following safety
ambiguous - Queries with multiple valid interpretations, tests disambiguation
multilingual - Non-English queries, tests language handling
temporal - Time-sensitive questions that depend on stale data
negation - "What is NOT X" style questions, a common failure mode
compound - Multi-part questions requiring multiple retrievals
You can add your own queries by appending lines to any file in query_bank/.
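For example, a few extra negation-style lines appended to the negation file (purely illustrative queries, one per line) might look like:

What does the warranty not cover?
Which features are not available on the free plan?
List the topics this documentation does not address.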
Health Score
Every test run produces a composite Health Score from 0 to 100:
≥ 80   EXCELLENT   Production-ready
≥ 60   GOOD        Minor issues, review before deploying
≥ 40   FAIR        Significant issues, fix first
< 40   POOR        Critical failures, do not deploy
The score is calculated from five weighted components.
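As a rough illustration only (the actual component names and weights live in evaluator.py and may differ), a weighted composite like this can be computed in a few lines of Python:

# Hypothetical weights; the real values are defined in evaluator.py.
WEIGHTS = {
    "precision": 0.25,
    "hallucination_resistance": 0.25,
    "refusal_quality": 0.20,
    "consistency": 0.15,
    "latency": 0.15,
}

def health_score(components):
    # Each component is normalised to 0..1, higher is better.
    return round(100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 1)

print(health_score({
    "precision": 0.6,
    "hallucination_resistance": 0.8,
    "refusal_quality": 0.7,
    "consistency": 0.75,
    "latency": 0.9,
}))  # prints a single 0-100 number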
Architecture
main.py - Typer CLI: entry point and orchestration
adversarial.py - Query generator: 7 categories, pre-built + corpus-generated
loader.py - Async load driver: aiohttp, configurable concurrency
evaluator.py - Scorer: hallucination, precision, refusal, consistency
reporter.py - Report generator: HTML (Chart.js) + JSON output
corpus_analyzer.py - Optional: generates targeted queries from your own documents
query_bank/ - 7 pre-built adversarial query files (one query per line)
tests/ - 58 pytest tests (no live endpoint needed)
Install
pip install -r requirements.txt
The endpoint the tester sends requests to must accept POST with {"query": "..."} and return JSON containing either a response or answer field. Any HTTP status other than 200 is counted as an error.
Running a Stress Test
The core command runs a full stress test against your RAG endpoint:
# Basic β 10 concurrent users, 60-second run
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--concurrency 10 \
--duration 60
# Test only specific query categories
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--query-types out_of_scope,adversarial,multilingual
# Custom output directory
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--output ./my-reports
Here is what real terminal output looks like:
Starting RAG Stress Test
Endpoint: http://localhost:8000/query
Concurrency: 5
Duration: 20s
Generating test queries...
Generated 350 test queries
Running load tests...
Evaluating results...
Generating reports...
Stress test complete!
JSON Report: reports/stress_test_results.json
HTML Report: reports/stress_test_report.html
=======================================================
Overall Health Score : 57.1/100
Status : FAIR - Significant issues detected
Total requests : 6355
Error rate : 0.0%
Precision score : 2.1%
Hallucination rate : 22.5%
Refusal rate : 77.5%
Consistency score : 72.1%
Latency p50/p95/p99 : 2.9 / 6.3 / 8.7 ms
Query Type Count Halluc% Refusal% AvgLat
------------------ ------ -------- --------- --------
adversarial 205 35.1% 64.9% 3.3ms
ambiguous 250 12.0% 88.0% 3.2ms
compound 200 22.0% 78.0% 4.0ms
multilingual 250 10.0% 90.0% 3.1ms
negation 200 20.0% 80.0% 5.3ms
out_of_scope 250 20.0% 80.0% 4.0ms
temporal 200 38.0% 62.0% 3.1ms
Recommendations:
- Low precision score. Enhance retrieval mechanism and relevance ranking.
- Moderate: Several areas need improvement for production readiness.
=======================================================
Quick Sanity Check
For a fast check before a full run, quick-test runs 35 sample queries (5 per category) and prints the health score without writing any report files:
python3 main.py quick-test --endpoint http://localhost:8000/query
Running quick sanity test...
Testing with 35 sample queries
Quick Test Health Score: 72.4/100
Endpoint appears functional
Generate Queries From Your Own Corpus
The analyze-corpus command analyzes your own .txt, .md, or .json files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into query_bank/:
python3 main.py analyze-corpus \
--corpus ./my-docs \
--output ./query_bank \
--num-queries 50
Analyzing corpus: ./my-docs
Generated 50 in_scope queries -> query_bank/in_scope_generated.txt
Generated 50 out_of_scope queries -> query_bank/out_of_scope_generated.txt
Generated 50 adversarial queries -> query_bank/adversarial_generated.txt
Corpus analysis complete!
For very small corpora, lower the keyword frequency threshold:
python3 main.py analyze-corpus \
--corpus ./my-docs \
--output ./query_bank \
--num-queries 20 \
--min-word-freq 1
Configuration
Edit config.yaml to customise load levels, thresholds, and reporting. The --endpoint CLI flag always takes precedence over config.yaml.
- load.concurrency_levels - Concurrent user levels to test, for example [1, 5, 10, 25]
- load.ramp_mode - If true, steps through each concurrency level; if false, runs at the first level for the full duration
- load.duration_seconds - How long to run at each concurrency level
- load.rate_limit_per_second - Maximum requests per second
- evaluation.hallucination_threshold - Keyword-overlap score below which a response is flagged as a potential hallucination
- evaluation.refusal_keywords - Phrases that indicate a refused answer
- reporter.output_dir - Where to save HTML and JSON reports
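A minimal config.yaml using these keys might look like the following. The values and the exact nesting (dotted names mapped to sections) are illustrative assumptions, not the shipped defaults:

load:
  concurrency_levels: [1, 5, 10, 25]
  ramp_mode: true
  duration_seconds: 60
  rate_limit_per_second: 50
evaluation:
  hallucination_threshold: 0.3
  refusal_keywords:
    - "I don't know"
    - "cannot answer"
reporter:
  output_dir: ./reports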
Pass the config file with --config:
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--config config.yaml
Output Reports
Each test run saves two files to ./reports/ or your --output path:
stress_test_results.json - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.
stress_test_report.html - Interactive dashboard with a health score badge coloured by band and metric cards covering success rate, precision, hallucination, latency p95, and consistency. Charts include a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, and a latency distribution histogram, followed by prioritised recommendations.
Endpoint Requirements
The tester sends:
POST /your-endpoint
{"query": "What is machine learning?"}
It expects a JSON response containing either a response or answer field:
{"response": "Machine learning is..."}
Any HTTP status other than 200 is counted as an error.
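If you want something local to point the tester at while wiring things up, a minimal stub that satisfies this contract could look like the following. FastAPI is just one choice here, not a project requirement; any server that accepts the POST body above and returns a response field will do:

# stub.py - minimal endpoint matching the expected request/response contract.
# Run with: uvicorn stub:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    query: str

@app.post("/query")
def answer(q: Query):
    # A real RAG endpoint would run retrieval + generation here.
    return {"response": f"Stub answer for: {q.query}"}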
Running Tests
python3 -m pytest tests/ -v
58 tests covering all modules. Uses aioresponses to mock HTTP - no live RAG endpoint required.
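As a rough illustration of that mocking approach (a standalone sketch, not one of the project's actual tests), aioresponses can intercept the POST an aiohttp client would make and return a canned payload:

# test_example.py - hypothetical example of aioresponses-style HTTP mocking.
import asyncio
import aiohttp
from aioresponses import aioresponses

async def ask(url, query):
    # Send the same shape of request the load driver sends.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json={"query": query}) as resp:
            return await resp.json()

def test_mocked_endpoint():
    with aioresponses() as mocked:
        mocked.post("http://fake-rag/query", payload={"response": "canned answer"})
        result = asyncio.run(ask("http://fake-rag/query", "what is machine learning?"))
        assert result["response"] == "canned answer"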
Project Structure
rag-pipeline-stress-tester/
├── main.py              # CLI entry point
├── adversarial.py       # Query generators (7 types)
├── loader.py            # Async load test driver
├── evaluator.py         # Scoring and metrics
├── reporter.py          # HTML + JSON report generator
├── corpus_analyzer.py   # Optional corpus-based query generation
├── config.yaml          # Test configuration
├── requirements.txt
├── query_bank/          # 7 pre-built adversarial query files
└── tests/               # 58 pytest tests
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.
NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.
How You Can Use and Extend This With NEO
Use it as a pre-deployment gate for every RAG system.
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 60 means review before deploying, and below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.
Use it with your own domain queries.
The pre-built query banks are general purpose. For domain-specific testing, run analyze-corpus on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into query_bank/ and run the stress test.
Integrate the JSON report into CI/CD.
stress_test_results.json is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.
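As a sketch of what that gate could look like (the exact JSON field name for the health score is an assumption here, so check it against a real report first):

# ci_gate.py - fail the pipeline if the health score drops below a threshold.
import json
import sys

with open("reports/stress_test_results.json") as f:
    results = json.load(f)

score = results.get("health_score", 0)  # assumed field name; verify in your report
threshold = 60

if score < threshold:
    print(f"Health score {score} is below {threshold}; failing the build.")
    sys.exit(1)

print(f"Health score {score} passes the {threshold} gate.")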
Extend it with additional query categories.
The 7 query banks are plain text files in query_bank/, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to query_bank/ and registering it in adversarial.py.
Final Notes
RAG systems fail in predictable ways: hallucination on out-of-scope questions, collapsed latency under load, inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.
The code is at https://github.com/dakshjain-1616/RAG-pipeline-stress-tester
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code


Top comments (1)
Nice tool! The 7 query categories are well thought out; negation and temporal are the ones most people forget, but they bite hardest in production.
The concurrent load testing with p95/p99 latency is the real differentiator here. Most RAG eval tools only measure single-request timing, which tells you nothing about how it holds up when 25 users hit it at once.
One thing I'd love to see added: faithfulness scoring. Right now you're checking "did it hallucinate?", but there's a middle ground: did it retrieve the right context and then extrapolate beyond it? That's where a lot of RAG systems quietly fail.
Also, any plans to support streaming endpoints?
A lot of production RAG setups use SSE, and latency characteristics under load are pretty different there.
Good stuff Nelofer!