
Mārtiņš Veiss

Self-Hosting AutoBot: A DevOps Deep Dive into Docker Compose, Model Sizing, and Production Ops

You've seen the demos. You want to run AutoBot on your own hardware, your own data, under your own control. Good instinct. Here's the full operational picture — Docker Compose internals, how to match LLM models to your GPU or CPU, and the production habits that keep things stable long-term.

Why Self-Host?

AutoBot's tagline is "Your data. Your AI." That's not marketing copy — it's an architectural choice. When you self-host:

  • Conversations never leave your network
  • You choose which models run (open-weight, cloud API, or a mix)
  • Upgrade timing is yours to control
  • No per-seat pricing surprises

The trade-off is operational responsibility. This post is about making that trade-off comfortable.


Docker Compose Deep Dive

AutoBot ships with a docker-compose.yml that wires together several services. Let's walk through each layer.

Services Overview

services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    depends_on: [chromadb, redis]
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - CHROMA_HOST=chromadb
      - REDIS_URL=redis://redis:6379

  frontend:
    build: ./frontend
    ports: ["3000:3000"]
    depends_on: [backend]

  chromadb:
    image: chromadb/chroma:latest
    volumes:
      - chroma_data:/chroma/chroma

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  chroma_data:
  redis_data:
  ollama_models:

What Each Service Does

backend — FastAPI application. Handles chat sessions, RAG retrieval, fleet management. The OLLAMA_HOST env var points it at your local model server; swap this for an OpenAI-compatible URL to use a cloud LLM instead.

frontend — Next.js UI. Talks only to the backend on port 8000. Stateless — you can restart it without losing anything.

chromadb — Vector database for knowledge bases. Your embedded documents live here. The chroma_data volume is critical — back it up.

redis — Session state and task queues. With --appendonly yes, Redis persists to disk. Losing this volume means losing active session context (but not your knowledge bases).

ollama — Local LLM inference server. Holds downloaded model weights in ollama_models. Models are large (4–70 GB each); this volume is expensive to rebuild.

Networking

All services communicate on the private bridge network that Docker Compose creates for the project. The service names (chromadb, redis, ollama) resolve as hostnames inside that network — that's why the backend config uses http://ollama:11434 rather than localhost.

For a production deployment, consider an explicit network definition:

networks:
  autobot_net:
    driver: bridge

services:
  backend:
    networks: [autobot_net]
  # ... same for all services

This lets you add an Nginx reverse proxy or Traefik on the same network without exposing internal ports.
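A minimal sketch of that setup, assuming a hand-written nginx.conf that proxies to frontend:3000 and backend:8000 (the config file path, version tag, and port choice are illustrative, not part of AutoBot's shipped compose file):

```yaml
services:
  nginx:
    image: nginx:1.27-alpine
    ports: ["443:443"]          # the only service publishing a host port
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    networks: [autobot_net]
    depends_on: [frontend, backend]
```

With this in place you can drop the `ports:` mappings from frontend and backend entirely, so both are reachable only through the proxy.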


Model Sizing to Hardware

This is where most self-hosting guides go wrong — they talk about VPS pricing instead of the actual constraint: inference throughput vs. memory bandwidth.

The Rule of Thumb

A model running entirely in VRAM is fast. A model that spills to RAM (or worse, disk) is slow. Plan your setup so your primary model fits in VRAM with room for the OS and other processes.

| Hardware | VRAM | Practical Model Ceiling |
|---|---|---|
| RTX 3060 | 12 GB | Llama 3 8B (Q4), Mistral 7B |
| RTX 3090 / 4090 | 24 GB | Llama 3 70B (Q4 at the edge), Llama 3 8B (full precision) |
| 2× A100 80 GB | 160 GB | Llama 3 70B (full), most open-weight frontier models |
| CPU only (32 GB RAM) | n/a (system RAM) | Llama 3 8B (Q4, slow) — workable for low-traffic RAG |
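As a sanity check against the table, a rough footprint estimate is just parameter count times bits per weight, padded for KV cache and runtime overhead (the 1.2 padding factor and the ~4.5 effective bits for Q4_K_M are ballpark assumptions, not measured constants):

```python
# Back-of-envelope VRAM estimate for an LLM. Weights dominate, but the
# KV cache and runtime add roughly 20% on top, so we pad by a factor.
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead_factor: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead_factor

# Llama 3 8B at Q4_K_M (~4.5 effective bits per weight):
print(round(vram_estimate_gb(8, 4.5), 1))   # 5.4 -> fits a 12 GB RTX 3060
# Llama 3 8B at full FP16 precision (16 bits per weight):
print(round(vram_estimate_gb(8, 16), 1))    # 19.2 -> needs a 24 GB card
```

The same arithmetic explains why a 70B model needs aggressive quantization or multi-GPU setups before it runs comfortably anywhere.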

Local Ollama vs. Cloud LLM Trade-offs

AutoBot supports both. Here's how to think about the choice:

Local Ollama (default)

  • Zero per-token cost
  • Private by definition
  • Latency depends on your hardware
  • Best for: high-volume internal tools, sensitive data, experimentation

Cloud LLM (OpenAI, Anthropic, etc.)

  • Pay per token
  • Faster for large models you can't run locally
  • Data leaves your network (check your provider's retention policy)
  • Best for: production apps that need frontier model quality without buying GPUs

The OLLAMA_HOST env var makes switching simple. Point it at an OpenAI-compatible endpoint such as https://api.openai.com/v1 (via an OpenAI-compatible wrapper) to route through a cloud provider without touching application code.
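One way to sketch that switch without editing the main file is a compose override (the OPENAI_API_KEY variable name is an assumption here; check AutoBot's docs for the key name it actually reads):

```yaml
# docker-compose.override.yml -- hypothetical sketch for routing the
# backend through a cloud provider; the API key variable is an assumption
services:
  backend:
    environment:
      - OLLAMA_HOST=https://api.openai.com/v1
      - OPENAI_API_KEY=${OPENAI_API_KEY}
```

Compose merges the override automatically, so reverting to local Ollama is just deleting the file and re-running `docker compose up -d`.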

Practical Model Recommendations

For a RAG-heavy knowledge base workload (most AutoBot deployments): a quantized 8B model (Llama 3.1 8B Q4_K_M) hits the sweet spot — fast enough for real-time chat, accurate enough for document retrieval, fits comfortably on a single consumer GPU.

For a multi-agent fleet workload: consider running a smaller model (3B–7B) per agent node and reserving a larger model for orchestration decisions. AutoBot's fleet manager is built to handle per-agent model config.
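As a purely hypothetical sketch of what that split could look like (these field names are invented for illustration and are not AutoBot's actual schema; consult the project docs for the real config format):

```yaml
# Hypothetical fleet config sketch -- field names invented for illustration
fleet:
  agents:
    - name: doc-retriever
      model: llama3.2:3b                 # small, cheap per-request model
    - name: summarizer
      model: llama3.2:3b
  orchestrator:
    model: llama3.1:8b-instruct-q4_K_M   # larger model for routing decisions
```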


Production Tips

Backups

The three volumes that matter:

# ChromaDB — your knowledge bases
docker run --rm \
  -v autobot_chroma_data:/source \
  -v /backup:/backup \
  alpine tar czf /backup/chroma-$(date +%Y%m%d).tar.gz -C /source .

# Redis — session state (BGSAVE is async; check LASTSAVE before copying)
docker exec autobot-redis-1 redis-cli BGSAVE
docker cp autobot-redis-1:/data/dump.rdb /backup/redis-$(date +%Y%m%d).rdb

# Ollama models — large, but painful to re-download
docker run --rm \
  -v autobot_ollama_models:/source \
  -v /backup:/backup \
  alpine tar czf /backup/ollama-$(date +%Y%m%d).tar.gz -C /source .

Run chroma and redis backups daily. Ollama models only change when you pull new ones — back up on change, not on schedule.
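A minimal schedule for the daily pair, as a /etc/cron.d sketch (the script paths are placeholders; wrap the commands above in scripts wherever suits your host):

```shell
# /etc/cron.d/autobot-backup -- daily backups; paths are illustrative
0 2 * * *   root  /opt/autobot/backup-chroma.sh
15 2 * * *  root  /opt/autobot/backup-redis.sh
```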

Upgrades

# Pull latest images
docker compose pull

# Recreate containers (zero-downtime if you add a load balancer)
docker compose up -d --no-deps --build backend frontend

# Full restart (brief downtime)
docker compose down && docker compose up -d

Pin image tags in production (chromadb/chroma:0.5.3 not latest) so upgrades are deliberate, not automatic.
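For example (0.5.3 comes from the text above; treat the other version numbers as placeholders and pin whatever you have actually tested):

```yaml
services:
  chromadb:
    image: chromadb/chroma:0.5.3   # pinned -- upgrades happen when you say so
  redis:
    image: redis:7.2-alpine
  ollama:
    image: ollama/ollama:0.3.12    # placeholder version; pin your tested tag
```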

Monitoring

AutoBot's backend exposes a /health endpoint. Wire it into your monitoring stack:

# Simple cron healthcheck
*/5 * * * * curl -sf http://localhost:8000/health || notify-oncall
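A Compose-native version of the same check, assuming curl exists inside the backend image (swap it for wget or a small Python one-liner if it does not):

```yaml
services:
  backend:
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s
```

With this, `docker ps` shows the container's health state, and an orchestrator or restart policy can act on it directly.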

For metrics, the backend emits structured logs to stdout. Forward them to Loki, Datadog, or whatever you already use:

  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

Watch for these signals:

  • ChromaDB query latency > 2s — index fragmentation or under-resourced container
  • Redis memory approaching limit — set maxmemory and a sensible eviction policy (allkeys-lru)
  • Ollama inference time spiking — model being swapped to RAM; consider reducing context length or switching to a smaller quantization
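The Redis limit from the second bullet can be set directly in the compose file (the 2gb figure is an illustrative placeholder; size it to your host and expected session volume):

```yaml
  redis:
    image: redis:7-alpine
    command: >
      redis-server --appendonly yes
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
```

With allkeys-lru, Redis evicts the least-recently-used keys instead of failing writes once the limit is hit, which degrades gracefully for session-state workloads.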

What's Next

Self-hosting is the start, not the finish. Once you're running in production, the interesting work is building knowledge bases, connecting data sources, and wiring up agents for your specific workflows.

If you want to help make AutoBot better at the infrastructure layer, there are open issues tagged for DevOps contributors:

Good first issues — DevOps label on AutoBot-AI

If AutoBot is saving you money or time on your infra, consider supporting development:

Ko-fi: ko-fi.com/mrveiss

Questions, corrections, or war stories from your own deployment — drop them in the comments.
