Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases: hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.
RAG Pipeline Stress Tester is a battle-testing toolkit that finds these issues before deployment.
What It Does
- Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.
- Tracks relevance, hallucination, refusal quality, and latency for every query sent.
- Scores everything into a composite health score from 0 to 100.
- Breaks results down by query category so you know exactly which failure modes are causing issues.
- Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.
- Produces an HTML report with interactive charts and a JSON report for CI/CD integration.
Why This Exists
Before deploying a RAG system to production, four questions need answers:
- Does it hallucinate when asked about things not in the corpus?
- Does it refuse appropriately on out-of-scope questions?
- Does it stay consistent when the same question is asked multiple ways?
- Does it hold up under load at 10, 25, and 50 concurrent users?
Manual testing cannot answer these questions at scale. This tool does it automatically.
Without stress testing, hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.
With this tool, hallucinations are caught before deployment, edge cases are found in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.
The 7 Query Categories
The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:
out_of_scope - Questions with no answer in the corpus, tests hallucination resistance
adversarial - Prompt injection and jailbreak attempts, tests instruction-following safety
ambiguous - Queries with multiple valid interpretations, tests disambiguation
multilingual - Non-English queries, tests language handling
temporal - Time-sensitive questions that depend on stale data
negation - "What is NOT X" style questions, a common failure mode
compound - Multi-part questions requiring multiple retrievals
You can add your own queries by appending lines to any file in query_bank/.
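For example, a few extra negation-style lines appended to the negation file (purely illustrative queries, one per line) might look like:

What does the warranty not cover?
Which features are not available on the free plan?
List the topics this documentation does not address.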
Health Score
Every test run produces a composite Health Score from 0 to 100:
≥ 80   EXCELLENT   Production-ready
≥ 60   GOOD        Minor issues, review before deploying
≥ 40   FAIR        Significant issues, fix first
< 40   POOR        Critical failures, do not deploy
The score is calculated from five weighted components.
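As a rough illustration only (the actual component names and weights live in evaluator.py and may differ), a weighted composite like this can be computed in a few lines of Python:

# Hypothetical weights; the real values are defined in evaluator.py.
WEIGHTS = {
    "precision": 0.25,
    "hallucination_resistance": 0.25,
    "refusal_quality": 0.20,
    "consistency": 0.15,
    "latency": 0.15,
}

def health_score(components):
    # Each component is normalised to 0..1, higher is better.
    return round(100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 1)

print(health_score({
    "precision": 0.6,
    "hallucination_resistance": 0.8,
    "refusal_quality": 0.7,
    "consistency": 0.75,
    "latency": 0.9,
}))  # prints a single 0-100 number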
Architecture
main.py - Typer CLI: entry point and orchestration
adversarial.py - Query generator: 7 categories, pre-built + corpus-generated
loader.py - Async load driver: aiohttp, configurable concurrency
evaluator.py - Scorer: hallucination, precision, refusal, consistency
reporter.py - Report generator: HTML (Chart.js) + JSON output
corpus_analyzer.py - Optional: generates targeted queries from your own documents
query_bank/ - 7 pre-built adversarial query files (one query per line)
tests/ - 58 pytest tests (no live endpoint needed)
Install
pip install -r requirements.txt
The endpoint the tester sends requests to must accept POST with {"query": "..."} and return JSON containing either a response or answer field. Any HTTP status other than 200 is counted as an error.
Running a Stress Test
The core command runs a full stress test against your RAG endpoint:
# Basic β 10 concurrent users, 60-second run
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--concurrency 10 \
--duration 60
# Test only specific query categories
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--query-types out_of_scope,adversarial,multilingual
# Custom output directory
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--output ./my-reports
Here is what real terminal output looks like:
Starting RAG Stress Test
Endpoint: http://localhost:8000/query
Concurrency: 5
Duration: 20s
Generating test queries...
Generated 350 test queries
Running load tests...
Evaluating results...
Generating reports...
Stress test complete!
JSON Report: reports/stress_test_results.json
HTML Report: reports/stress_test_report.html
=======================================================
Overall Health Score : 57.1/100
Status : FAIR - Significant issues detected
Total requests : 6355
Error rate : 0.0%
Precision score : 2.1%
Hallucination rate : 22.5%
Refusal rate : 77.5%
Consistency score : 72.1%
Latency p50/p95/p99 : 2.9 / 6.3 / 8.7 ms
Query Type Count Halluc% Refusal% AvgLat
------------------ ------ -------- --------- --------
adversarial 205 35.1% 64.9% 3.3ms
ambiguous 250 12.0% 88.0% 3.2ms
compound 200 22.0% 78.0% 4.0ms
multilingual 250 10.0% 90.0% 3.1ms
negation 200 20.0% 80.0% 5.3ms
out_of_scope 250 20.0% 80.0% 4.0ms
temporal 200 38.0% 62.0% 3.1ms
Recommendations:
- Low precision score. Enhance retrieval mechanism and relevance ranking.
- Moderate: Several areas need improvement for production readiness.
=======================================================
Quick Sanity Check
For a fast check before a full run, quick-test runs 35 sample queries (5 per category) and prints the health score without writing any report files:
python3 main.py quick-test --endpoint http://localhost:8000/query
Running quick sanity test...
Testing with 35 sample queries
Quick Test Health Score: 72.4/100
Endpoint appears functional
Generate Queries From Your Own Corpus
The analyze-corpus command analyzes your own .txt, .md, or .json files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into query_bank/:
python3 main.py analyze-corpus \
--corpus ./my-docs \
--output ./query_bank \
--num-queries 50
Analyzing corpus: ./my-docs
Generated 50 in_scope queries -> query_bank/in_scope_generated.txt
Generated 50 out_of_scope queries -> query_bank/out_of_scope_generated.txt
Generated 50 adversarial queries -> query_bank/adversarial_generated.txt
Corpus analysis complete!
For very small corpora, lower the keyword frequency threshold:
python3 main.py analyze-corpus \
--corpus ./my-docs \
--output ./query_bank \
--num-queries 20 \
--min-word-freq 1
Configuration
Edit config.yaml to customise load levels, thresholds, and reporting. The --endpoint CLI flag always takes precedence over config.yaml.
- load.concurrency_levels - Concurrent user levels to test, for example [1, 5, 10, 25]
- load.ramp_mode - If true, steps through each concurrency level; if false, runs at the first level for the full duration
- load.duration_seconds - How long to run at each concurrency level
- load.rate_limit_per_second - Maximum requests per second
- evaluation.hallucination_threshold - Keyword-overlap score below which a response is flagged as a potential hallucination
- evaluation.refusal_keywords - Phrases that indicate a refused answer
- reporter.output_dir - Where to save HTML and JSON reports
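A minimal config.yaml using these keys might look like the following. The values and the exact nesting (dotted names mapped to sections) are illustrative assumptions, not the shipped defaults:

load:
  concurrency_levels: [1, 5, 10, 25]
  ramp_mode: true
  duration_seconds: 60
  rate_limit_per_second: 50
evaluation:
  hallucination_threshold: 0.3
  refusal_keywords:
    - "I don't know"
    - "cannot answer"
reporter:
  output_dir: ./reports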
Pass the config file with --config:
python3 main.py stress-test \
--endpoint http://localhost:8000/query \
--config config.yaml
Output Reports
Each test run saves two files to ./reports/ or your --output path:
stress_test_results.json - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.
stress_test_report.html - Interactive dashboard with a health score badge coloured by band and metric cards covering success rate, precision, hallucination, latency p95, and consistency. Charts include a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, and a latency distribution histogram, followed by prioritised recommendations.
Endpoint Requirements
The tester sends:
POST /your-endpoint
{"query": "What is machine learning?"}
It expects a JSON response containing either a response or answer field:
{"response": "Machine learning is..."}
Any HTTP status other than 200 is counted as an error.
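If you want something local to point the tester at while wiring things up, a minimal stub that satisfies this contract could look like the following. FastAPI is just one choice here, not a project requirement; any server that accepts the POST body above and returns a response field will do:

# stub.py - minimal endpoint matching the expected request/response contract.
# Run with: uvicorn stub:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    query: str

@app.post("/query")
def answer(q: Query):
    # A real RAG endpoint would run retrieval + generation here.
    return {"response": f"Stub answer for: {q.query}"}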
Running Tests
python3 -m pytest tests/ -v
58 tests covering all modules. Uses aioresponses to mock HTTP - no live RAG endpoint required.
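As a rough illustration of that mocking approach (a standalone sketch, not one of the project's actual tests), aioresponses can intercept the POST an aiohttp client would make and return a canned payload:

# test_example.py - hypothetical example of aioresponses-style HTTP mocking.
import asyncio
import aiohttp
from aioresponses import aioresponses

async def ask(url, query):
    # Send the same shape of request the load driver sends.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json={"query": query}) as resp:
            return await resp.json()

def test_mocked_endpoint():
    with aioresponses() as mocked:
        mocked.post("http://fake-rag/query", payload={"response": "canned answer"})
        result = asyncio.run(ask("http://fake-rag/query", "what is machine learning?"))
        assert result["response"] == "canned answer"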
Project Structure
rag-pipeline-stress-tester/
├── main.py              # CLI entry point
├── adversarial.py       # Query generators (7 types)
├── loader.py            # Async load test driver
├── evaluator.py         # Scoring and metrics
├── reporter.py          # HTML + JSON report generator
├── corpus_analyzer.py   # Optional corpus-based query generation
├── config.yaml          # Test configuration
├── requirements.txt
├── query_bank/          # 7 pre-built adversarial query files
└── tests/               # 58 pytest tests
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.
NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.
How You Can Use and Extend This With NEO
Use it as a pre-deployment gate for every RAG system.
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 60 means review before deploying, and below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.
Use it with your own domain queries.
The pre-built query banks are general purpose. For domain-specific testing, run analyze-corpus on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into query_bank/ and run the stress test.
Integrate the JSON report into CI/CD.
stress_test_results.json is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.
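As a sketch of what that gate could look like (the exact JSON field name for the health score is an assumption here, so check it against a real report first):

# ci_gate.py - fail the pipeline if the health score drops below a threshold.
import json
import sys

with open("reports/stress_test_results.json") as f:
    results = json.load(f)

score = results.get("health_score", 0)  # assumed field name; verify in your report
threshold = 60

if score < threshold:
    print(f"Health score {score} is below {threshold}; failing the build.")
    sys.exit(1)

print(f"Health score {score} passes the {threshold} gate.")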
Extend it with additional query categories.
The 7 query banks are plain text files in query_bank/, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to query_bank/ and registering it in adversarial.py.
Final Notes
RAG systems fail in predictable ways: hallucination on out-of-scope questions, collapsed latency under load, inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.
The code is at https://github.com/dakshjain-1616/RAG-pipeline-stress-tester
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code


Top comments (1)
Nice tool! The 7 query categories are well thought out; negation and temporal are the ones most people forget, but they bite hardest in production.
The concurrent load testing with p95/p99 latency is the real differentiator here. Most RAG eval tools only measure single-request timing, which tells you nothing about how it holds up when 25 users hit it at once.
One thing I'd love to see added: faithfulness scoring. Right now you're checking "did it hallucinate?", but there's a middle ground: did it retrieve the right context and then extrapolate beyond it? That's where a lot of RAG systems quietly fail.
Also, any plans to support streaming endpoints?
A lot of production RAG setups use SSE, and latency characteristics under load are pretty different there.
Good stuff Nelofer!