DEV Community

hitansu parichha
I Built a Fully Local Iron Man J.A.R.V.I.S. on Gemma 4 — Auto Model Switching, Screen Vision, Wake Word, and 4-Tier Memory

Gemma 4 Challenge: Build With Gemma 4 Submission

"Good evening, Sir. All systems are online."

Six months ago I had a simple idea: stop renting intelligence from the cloud and build something that lives entirely on my MacBook. Something that watches my screen, listens for my voice, manages my files, writes code, and remembers everything — without a single byte leaving the machine. A real J.A.R.V.I.S., not a chatbot wrapper.

Today I'm sharing the result: Project J.A.R.V.I.S. v5.0 — a fully local AI operating system built on Gemma 4, running on a MacBook Pro M4 Pro (48 GB unified memory). No OpenAI API keys. No subscriptions. No data leaving the machine (except when I explicitly flip it to online mode). Five completed phases, 13 specialist agents, a four-tier memory system, live screen vision, wake word detection, and an autonomous complexity router that picks the right model for every single request.

Let me show you exactly how it works — and more importantly, why Gemma 4 made this possible when nothing else could.


Why Gemma 4? The Honest Answer

Before I walk through the architecture, let me justify the model choice — the challenge judging criteria specifically ask for intentional model selection.

I tried this project with other local models first. The problem was always the same: you either got a fast, small model that hallucinated too much on complex tasks, or you got a large model that worked well but took ~3 seconds to respond to "what time is it." Neither is acceptable for an always-on personal OS.

Gemma 4 solved this with its model family structure:

Model                What it is                   RAM on M4 Pro   Role in JARVIS
─────────────────────────────────────────────────────────────────────────────────
gemma4:e4b           4B effective params, MoE     ~10 GB          Always-on backbone (never unloads)
gemma4:26b           26B A4B MoE                  ~18 GB          Code specialist + deep screen vision
qwen3.5:27b-q4_K_M   27B dense (pairing model)    ~16 GB          Planner / Orchestrator / Researcher

Three critical things made Gemma 4 the only viable choice:

1. Native multimodal in the same model. gemma4:26b handles both text and images natively. This means the screen vision agent and code specialist use the exact same loaded model — zero extra RAM for vision capability.

2. The E4B is genuinely good. Most "small" models at 4B parameters are toys. Gemma 4 E4B (4 billion effective parameters via MoE routing) handles routing, auditing, voice triage, passive screen watching, and memory distillation — five separate roles — fast enough that the user never feels latency.

3. The 128K context window. My JARVIS_CORE.md persona file is ~4,000 tokens. It gets prepended to every single agent prompt. With a 128K context, this is trivial. With older models, this would eat 8–16% of the context budget on every call.


The Architecture: 10 Phases, 5 Complete

The full system is designed as 10 phases. Here's where we are:

Phase 1  ✅  Foundation: FastAPI gateway, complexity router, agent registry, dual-mode
Phase 2  ✅  Security: Sandbox executor, audit logs, path/network guards
Phase 3  ✅  Voice Engine: Whisper STT, Kokoro TTS, wake word, conversation loop
Phase 4  ✅  Memory: ChromaDB + Graphiti temporal graph + nightly distiller
Phase 5  ✅  Screen Vision: Passive watcher, deep analysis, proactive suggestions
Phase 6  🔨  Computer Control: PyAutoGUI, browser automation
Phase 7  🔨  Multi-Agent Teams: Parallel specialist delegation
Phase 8  🔨  MCP Skills Library: 8 MCP servers, 500+ tool integrations
Phase 9  🔨  Persona Engine: Emotional state, adaptive tone
Phase 10 🔨  Packaging: dmg installer, auto-update

Let me walk through each completed phase in depth.


Phase 1: The Complexity Router — The Brain Behind Model Switching

This is the feature I'm most proud of. Every message that comes into JARVIS goes through the ComplexityRouter first. It assigns a score from 1 to 10 and routes to the appropriate model automatically.

Score 1-4  → gemma4:e4b       (receptionist — always loaded)
Score 5-7  → qwen3.5:27b      (orchestrator — loaded on demand)
Score 8    → gemma4:26b       (code specialist — loaded on demand)
Score 9-10 → orchestrator + specialist delegation

Here's the actual scoring logic from core_engine/router.py:

# Requires `import re`; the keyword/pattern lists are defined at module level.
def classify(self, message: str, context: str = "") -> dict:
    msg_lower = message.lower().strip()
    word_count = len(msg_lower.split())
    candidate_scores: list[int] = []

    # Rule 1: Very short / greeting → score 1-2
    if word_count < 5 or any(g in msg_lower for g in _LIGHT_GREETINGS):
        candidate_scores.append(2)

    # Rule 2: Light-medium factual patterns → score 3-4
    if any(re.search(p, msg_lower) for p in _LIGHT_MEDIUM_PATTERNS):
        candidate_scores.append(4)

    # Rule 3: Medium planning/research/comms → score 5-6
    if any(kw in msg_lower for kw in _MEDIUM_KEYWORDS):
        candidate_scores.append(5)

    # Rule 4: Code-related keywords → score 7-8
    if any(kw in msg_lower for kw in _CODE_KEYWORDS):
        candidate_scores.append(8)

    # Rule 5: Very long messages → score 9
    if word_count > 200:
        candidate_scores.append(9)

    # Rule 6: Multi-domain "research AND implement" → score 9-10
    if any(re.search(p, msg_lower) for p in _VERY_COMPLEX_MULTI):
        candidate_scores.append(10)

    # `default=1` avoids a ValueError when no rule fires (empty list)
    score = max(candidate_scores, default=1)
    score = max(1, min(10, score))
    ...

This is rule-based with LLM fallback planned for Phase 7. The key insight: rule-based routing is faster and more predictable than asking an LLM to route itself. For an always-on system, latency on the routing decision itself matters.
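Assuming the tier boundaries above, the score-to-model mapping can be sketched as a plain function. This is an illustrative sketch, not the actual router API:

```python
# Illustrative mapping from complexity score to model tier; mirrors the
# routing table above. Names and boundaries are taken from the post.
def pick_model(score: int) -> str:
    if score <= 4:
        return "gemma4:e4b"            # receptionist, always loaded
    if score <= 7:
        return "qwen3.5:27b-q4_K_M"    # orchestrator, loaded on demand
    if score == 8:
        return "gemma4:26b"            # code specialist, loaded on demand
    return "orchestrator+specialists"  # score 9-10: delegation
```

Because the mapping is a pure function of the score, the routing decision itself costs microseconds, which is the point of keeping it rule-based.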

The RAM Guard

The single most important constraint in the system: gemma4:26b (~18 GB) and qwen3.5:27b-q4_K_M (~16 GB) cannot be loaded simultaneously — that's ~34 GB combined, leaving only ~14 GB for the OS on a 48 GB machine. The ModeManager enforces this as a hard rule:

# Large model RAM guard — these two must NEVER coexist
_LARGE_MODEL_A = "gemma4:26b"
_LARGE_MODEL_B = "qwen3.5:27b-q4_K_M"

Before loading either model, the gateway checks which large model (if any) is currently resident and unloads it first. This makes model switching take ~2-3 seconds but prevents OOM crashes entirely.

Offline/Online Dual Mode

Every agent routes through ModeManager, which abstracts the backend:

  • OFFLINE mode: Ollama at localhost:11434
  • ONLINE mode: Vertex AI (Gemini 2.5 Pro/Flash/Flash-Lite) with automatic fallback to Ollama on any Vertex failure

The online model assignment mirrors the offline complexity tiers:

Complexity 8+  → gemini-2.5-pro    (matches gemma4:26b tier)
Complexity 5-7 → gemini-2.5-flash  (matches qwen3.5:27b tier)
Complexity 1-4 → gemini-2.5-flash-lite (matches gemma4:e4b tier)

Privacy rule enforced in both modes: voice_triage always routes to local gemma4:e4b, never to Vertex AI — even in online mode. Voice commands are private.


Phase 2: The Security Sandbox

Every agent action passes through SecurityEnforcer before execution. This isn't optional middleware — it's enforced at the gateway level.

class SecurityEnforcer:
    """
    Central security orchestration layer.
    Coordinates PathGuard, NetworkGuard, and AuditManager.
    """

The security stack:

  • PathGuard — blocks access outside allowed directories; anything at ~/ or above requires explicit allowlisting
  • NetworkGuard — allowlist of permitted domains; blocks all others including internal network calls
  • AuditManager — SHA-256 hash-chained audit log; every action is cryptographically linked to the previous entry. API keys are automatically redacted via regex before logging.
  • PendingAction queue — file deletions require two separate confirmations within a 5-minute expiry window; if both confirmations don't arrive in time, the action is cancelled

The security policy lives in sandbox/jarvis_security.yaml — a human-readable YAML file where you can add rules without touching Python. sudo and admin commands are completely blocked at the policy level, not just the prompt level.
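A PathGuard-style check can be sketched with `pathlib`. The allowlisted roots below are illustrative, not the project's real policy (which lives in the YAML file):

```python
from pathlib import Path

# Illustrative allowlist; the real policy lives in jarvis_security.yaml.
ALLOWED_ROOTS = [Path.home() / "Projects", Path("/tmp/jarvis")]

def path_allowed(candidate: str) -> bool:
    """True only if `candidate` resolves inside an allowlisted root.

    resolve() collapses `..` segments and symlinks, so traversal tricks
    like `~/Projects/../.ssh/id_rsa` are rejected.
    """
    resolved = Path(candidate).expanduser().resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in ALLOWED_ROOTS)
```

Resolving before comparing is the important detail: a naive string-prefix check would pass `~/Projects/../.ssh/id_rsa`.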


Phase 3: The Voice Engine — Wake Word to Spoken Response

The voice pipeline is a full conversation loop, not a single-shot transcription:

Idle state
    ↓ (wake word detected: "Hey Jarvis")
Recording (VAD auto-stops on silence)
    ↓
Transcribing (Whisper large-v3)
    ↓
Processing (ComplexityRouter → Agent → Response)
    ↓
Speaking (Kokoro-82M TTS, audio streamed)
    ↓
Conversation mode (60-second window, no wake word needed for follow-ups)
    ↓ (farewell word OR 60s idle)
Idle state

The farewell word detection is multilingual — the system understands English, Hindi (Devanagari), and Hinglish out of the box:

FAREWELL_WORDS = {
    "goodbye", "bye", "sleep", "stand by", "dismissed",
    # Hindi
    "अलविदा", "सो जाओ", "शुभ रात्रि",
    # Hinglish
    "alvida", "so jao", "bas itna hi",
}

TTS model selection: Kokoro-82M runs at ~15ms per sentence on the M4 Pro's MPS backend. Whisper large-v3 loads lazily on first voice command and stays resident — initial load ~3 seconds, subsequent calls ~200ms for a typical spoken sentence.

The voice session manager uses asyncio throughout. The wake word detector runs in a background thread, but hands off to asyncio.run_coroutine_threadsafe for everything downstream — so the voice pipeline and the FastAPI gateway share the same event loop cleanly.
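The thread-to-loop handoff described above looks roughly like this. The pipeline is stubbed and the names are illustrative, but the `run_coroutine_threadsafe` pattern is the real mechanism:

```python
import asyncio
import threading

async def handle_utterance(text: str) -> str:
    # Stand-in for router -> agent -> TTS; runs on the shared event loop.
    await asyncio.sleep(0)
    return f"processed: {text}"

def wake_word_thread(loop: asyncio.AbstractEventLoop, results: list) -> None:
    # Plain background thread (like the wake word detector): it schedules
    # work on the loop and blocks only itself while waiting for the result.
    future = asyncio.run_coroutine_threadsafe(handle_utterance("hey jarvis"), loop)
    results.append(future.result(timeout=5))

async def main() -> list:
    results: list[str] = []
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=wake_word_thread, args=(loop, results))
    worker.start()
    while worker.is_alive():          # keep the loop free to serve the thread
        await asyncio.sleep(0.01)
    return results
```

The key property: the detector thread blocks on `future.result()`, but the event loop (and therefore the FastAPI gateway) keeps running the whole time.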


Phase 4: The Four-Tier Memory System

This is what separates JARVIS from a standard chatbot. There are four memory tiers:

TIER 1 — PROCEDURAL  : JARVIS_CORE.md (persona, rules, user profile)
                       Injected first in every prompt. KV-cached by Ollama.
                       Cost after first request: ~0ms.

TIER 2 — EPISODIC    : memory_vault/logs/YYYY-MM-DD.log
                       Raw conversation log. Never injected directly.
                       Input for nightly distillation.

TIER 3 — SEMANTIC    : ChromaDB (vector similarity) + Graphiti (temporal graph)
                       Top 5 relevant facts injected silently into every prompt.

TIER 4 — COMPILED WIKI: memory_vault/wiki/
                       Synthesized Markdown knowledge base.
                       Built nightly from Tiers 2 and 3.
                       Human-readable and human-editable.

The GraphitiStore component uses bi-temporal modeling — every fact has both a valid_from and valid_to timestamp:

"User prefers Redux"   → valid_from: Jan 1 | valid_to: Mar 15 (superseded)
"User prefers Zustand" → valid_from: Mar 15 | valid_to: None  (CURRENT)

When JARVIS learns a new contradicting fact, it automatically closes the old one rather than stacking conflicting facts. This means memory gets smarter and more accurate over time — old facts don't poison new queries.

The nightly distillation job (runs at 2 AM on idle system via APScheduler) reads the day's episode log, extracts durable facts, and:

  1. Writes vector embeddings to ChromaDB
  2. Writes episodes to Graphiti with contradiction detection
  3. Updates wiki/user_profile.md with the compiled view

Memory correction commands JARVIS understands naturally:

  • "Jarvis, forget that I use Redux."
  • "Jarvis, what do you know about me?"
  • "Jarvis, do not learn from the next 10 minutes."
  • "Jarvis, show me my coding wiki."

Phase 5: Screen Vision — Watching Without Being Asked

The screen engine runs as a background thread, taking a screenshot every 2 seconds and running it through gemma4:e4b passive analysis. If a suggestion is generated AND the cooldown period has elapsed (default 120 seconds), JARVIS speaks up.

The passive watcher uses a two-tier vision model approach:

  • Passive (gemma4:e4b): Always-on. Fast. Shared model — no extra RAM cost. Detects what app is open, what file is being edited, current context.
  • Deep (gemma4:26b): On-demand. Full multimodal analysis with the same model used for code. Only loaded when the situation requires deeper understanding (complex UI, code review, error diagnosis).

The ScreenVision component returns structured output for every capture:

{
    "description": "TypeScript file auth.ts, async function handleLogin at line 26",
    "app_detected": "vscode",
    "context": "TypeScript file auth.ts line 26",
    "suggestions": ["The handleLogin function is not handling the rejected promise..."],
    "screenshot_b64": "",  # Only populated in deep mode
    "model_used": "gemma4:e4b"
}

The SuggestionEngine ranks suggestions by relevance and enforces the Proactive Suggestion Protocol defined in JARVIS_CORE.md:

  • Maximum one suggestion every 3 minutes
  • Always starts with "Sorry to interrupt, Sir."
  • Always ends with "Shall I?" — never acts without confirmation

The JARVIS_CORE.md Persona File — The Secret Architecture Piece

One piece of the system that isn't obvious from the directory structure: JARVIS_CORE.md is not just a prompt file. It's the KV-cache anchor for the entire system.

When Ollama processes the first request with JARVIS_CORE.md prepended, it caches the key-value attention vectors for those ~4,000 tokens. Every subsequent request that starts with the same JARVIS_CORE.md prefix costs ~0ms for that portion — Ollama serves it from cache.

This is why the file contains the user profile, personality definition, memory architecture, anti-patterns, response format taxonomy (12 response types), wit calibration levels (0-4), and operating rules — all in one place, all cached, all cost-free after the first request.

The response taxonomy is worth highlighting. Every incoming message is classified into one of 12 types:

TYPE 1  — FACTUAL_SIMPLE     TYPE 7  — CODE_DEBUG
TYPE 2  — FACTUAL_LIST       TYPE 8  — TASK_CONFIRM
TYPE 3  — OPINION_ANALYSIS   TYPE 9  — RESEARCH_SUMMARY
TYPE 4  — COMPARISON         TYPE 10 — PLAN_STRATEGY
TYPE 5  — CODE_WRITE         TYPE 11 — CASUAL_CHAT
TYPE 6  — CODE_EXPLAIN       TYPE 12 — SYSTEM_STATUS

Each type has both a text format and a voice format. In voice mode, markdown characters are forbidden — the model is instructed to produce natural spoken transitions ("First... Second... And finally...") instead of bullet points and headers.
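A minimal sketch of that markdown-to-speech cleanup, assuming simple bullet lists; the real behavior is enforced at the prompt level, and this post-processing variant is only illustrative:

```python
import re

SPOKEN = ["First", "Second", "Third", "And finally"]

def to_spoken(markdown: str) -> str:
    # Bullets become spoken ordinals; header/emphasis characters are dropped.
    out, i = [], 0
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ")):
            label = SPOKEN[min(i, len(SPOKEN) - 1)]
            out.append(f"{label}, {stripped[2:]}.")
            i += 1
        else:
            out.append(re.sub(r"[#*_`]+", "", stripped).strip())
    return " ".join(s for s in out if s)
```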


The 13-Agent Registry

Agent                  Model (Offline)          Always-on?   Notes
─────────────────────────────────────────────────────────────────────────────
Receptionist           gemma4:e4b               Yes          Router + simple chat
Manager/Planner        qwen3.5:27b-q4_K_M       On-demand    Plans, coordinates
Code Specialist        gemma4:26b               On-demand    Write/debug/refactor
Screen Vision Passive  gemma4:e4b               Yes          2s scan, shared model
Screen Vision Deep     gemma4:26b               On-demand    Full analysis + control
Browser/Shopping       qwen3.5:27b-q4_K_M       On-demand    Puppeteer MCP
Research               qwen3.5:27b-q4_K_M       On-demand    Brave Search + Firecrawl
Auditor/QA             gemma4:e4b               Yes          Reviews outputs
Memory Distiller       gemma4:e4b               Yes          Nightly 2 AM job
File Manager           qwen3.5:27b-q4_K_M       On-demand    Filesystem MCP
Voice Triage           gemma4:e4b               Yes (ALWAYS LOCAL) Privacy-first
System Control         qwen3.5:27b-q4_K_M       On-demand    OS commands via sandbox
Communication          qwen3.5:27b-q4_K_M       On-demand    Gmail/Slack/Jira

Five agents share gemma4:e4b and stay resident forever — that's the 10 GB base cost that never goes away. The other eight are on-demand, with the RAM guard preventing simultaneous loading of the two large models.


The RAM Budget in Practice

State                         RAM Used    Free (of 48 GB)
────────────────────────────────────────────────────────
macOS baseline                ~19.6 GB    ~28.4 GB
+ gemma4:e4b (always-on)      ~29.6 GB    ~18.4 GB
+ Code task → load 26b        ~37.6 GB    ~10.4 GB  ✅ Safe
+ Planning → unload 26b,
  load qwen3.5:27b             ~35.6 GB    ~12.4 GB  ✅ Safe
⚠️ BLOCKED: both large models ~53.6 GB       N/A    ❌ Hard block

The hard block isn't a warning — the gateway refuses to route to a model if loading it would violate the RAM constraint. The user gets a graceful degradation message and JARVIS falls back to gemma4:e4b for the task.


Getting Started

Prerequisites: macOS with Apple Silicon (M1 or later), Ollama installed, Python 3.11+.

# Clone and enter the project
git clone https://github.com/Hitansu2004/Jarvis
cd Jarvis

# Run setup (creates venv, installs deps, generates .env)
./setup.sh

# Pull the Gemma 4 models
ollama pull gemma4:e4b
ollama pull gemma4:26b

# Start the gateway
source .venv/bin/activate
uvicorn core_engine.gateway:app --reload

# Test it
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Good evening, Jarvis. What can you do?"}'

To enable voice:

# Install audio deps (macOS)
brew install portaudio
pip install pyaudio --break-system-packages

# Start with voice
VOICE_ENABLED=true uvicorn core_engine.gateway:app --reload

Check status and agent list:

curl http://localhost:8000/status
curl http://localhost:8000/agents

The .env.example file documents every configurable parameter across all 10 phases. Start with the defaults — they're tuned for a 48 GB M4 Pro but the complexity thresholds and model assignments are all environment variables.


What Gemma 4 Unlocked That Nothing Else Could

Let me be specific about this, because it's the core of the build:

Gemma 4 E4B as a shared multi-role model. Five separate agents running on one always-loaded model is only possible because E4B is genuinely capable despite its size. Receptionist, Auditor, Voice Triage, Passive Screen Vision, and Memory Distiller all run on it. With any other 4B model I tried, the quality degraded below acceptable for at least two of those roles.

Gemma 4 26B's native vision without a separate model. Screen vision and code review on the same loaded model. This single fact saved ~10 GB of RAM (no separate vision model) and eliminated one entire model-switching operation from the pipeline.

The MoE efficiency. Both Gemma 4 models use Mixture-of-Experts. gemma4:26b has 26B parameters but only 4B are active on any given token. This is why the RAM footprint (~18 GB) is dramatically lower than a 26B dense model would be (~52+ GB). Without MoE, this entire architecture is impossible on consumer hardware.

The 128K context window. JARVIS_CORE.md + memory retrieval + the actual conversation can comfortably fit in context. With a 4K or 8K context window, the persona file alone would crowd out the memory system.


What's Next (Phases 6-10)

Phase 6 brings computer control — PyAutoGUI integration so JARVIS can click, type, and navigate on behalf of the user, with every action requiring confirmation through the security sandbox.

Phase 7 activates multi-agent teams — the orchestrator can spawn parallel specialist agents for complex tasks, with results synthesized back through an Auditor QA pass before delivery.

Phase 8 wires up the MCP Skills Library — 8 MCP servers are already registered in skills_mcp/mcp_registry.json, including Filesystem, GitHub, Puppeteer, Brave Search, Composio (500+ app integrations), and PyAutoGUI. The registry exists now; Phase 8 activates the connections.

Phase 10 is the goal: a .dmg installer that anyone with an Apple Silicon Mac can download and run. No cloud dependencies. No subscriptions. A personal AI OS that's yours.


Why This Matters

There's a philosophical point underneath all the engineering: your AI assistant should not require a corporate server to function.

The voice triage agent — the part of JARVIS that hears your voice commands and decides what to do with them — is hardcoded to gemma4:e4b regardless of the operation mode. Even if you've switched JARVIS to online mode for heavier tasks, voice never leaves the machine. This isn't a setting. It's enforced at the gateway level.

Every conversation logs to a local file. The nightly distillation runs on your own CPU. The temporal knowledge graph lives in memory_vault/kuzu_db/ on your filesystem. Your AI gets smarter over time, and none of that learning ever touches a cloud database.

Gemma 4 made this possible. A capable, efficient, multimodal open model family that runs well on consumer hardware is the technical prerequisite for this entire architecture. The E4B model being genuinely useful is what allows five always-on agents without breaking the RAM budget. The 26B model's native vision support is what makes screen understanding practical. The MoE efficiency is what makes the math work on 48 GB.

If you want to build something similar, the repo is linked below. The .env.example file is extensively documented. The test suite (50+ tests across all five phases) serves as the best architecture documentation.


Built with Python 3.11, FastAPI, Ollama, Gemma 4, PyTorch 2.6 MPS, ChromaDB, Graphiti, Kuzu, Whisper large-v3, Kokoro-82M, and a lot of late-night IST sessions.

Author: Hitansu Parichha | Software Engineer at Nisum Technologies
