I Built a Fully Local Iron Man J.A.R.V.I.S. on Gemma 4 — Auto Model Switching, Screen Vision, Wake Word, and 4-Tier Memory
"Good evening, Sir. All systems are online."
Six months ago I had a simple idea: stop renting intelligence from the cloud and build something that lives entirely on my MacBook. Something that watches my screen, listens for my voice, manages my files, writes code, and remembers everything — without a single byte leaving the machine. A real J.A.R.V.I.S., not a chatbot wrapper.
Today I'm sharing the result: Project J.A.R.V.I.S. v5.0 — a fully local AI operating system built on Gemma 4, running on a MacBook Pro M4 Pro (48 GB unified memory). No OpenAI API keys. No subscriptions. No data leaving the machine (except when I explicitly flip it to online mode). Five completed phases, 13 specialist agents, a four-tier memory system, live screen vision, wake word detection, and an autonomous complexity router that picks the right model for every single request.
Let me show you exactly how it works — and more importantly, why Gemma 4 made this possible when nothing else could.
Why Gemma 4? The Honest Answer
Before I walk through the architecture, let me justify the model choice — because the challenge judging criteria specifically asks for intentional model selection.
I tried this project with other local models first. The problem was always the same: you either got a fast, small model that hallucinated too much on complex tasks, or you got a large model that worked well but took ~3 seconds to respond to "what time is it." Neither is acceptable for an always-on personal OS.
Gemma 4 solved this with its model family structure:
| Model | What it is | RAM on M4 Pro | Role in JARVIS |
|---|---|---|---|
| gemma4:e4b | 4B effective params, MoE | ~10 GB | Always-on backbone (never unloads) |
| gemma4:26b | 26B A4B MoE | ~18 GB | Code specialist + Deep screen vision |
| qwen3.5:27b-q4_K_M | 27B dense (pairing model) | ~16 GB | Planner/Orchestrator/Researcher |
Three critical things made Gemma 4 the only viable choice:
1. Native multimodal in the same model. gemma4:26b handles both text and images natively. This means the screen vision agent and code specialist use the exact same loaded model — zero extra RAM for vision capability.
2. The E4B is genuinely good. Most "small" models at 4B parameters are toys. Gemma 4 E4B (4 billion effective parameters via MoE routing) handles routing, auditing, voice triage, passive screen watching, and memory distillation — five separate roles — fast enough that the user never feels latency.
3. The 128K context window. My JARVIS_CORE.md persona file is ~4,000 tokens. It gets prepended to every single agent prompt. With a 128K context, this is trivial. With older models, this would eat 8–16% of the context budget on every call.
The Architecture: 10 Phases, 5 Complete
The full system is designed as 10 phases. Here's where we are:
Phase 1 ✅ Foundation: FastAPI gateway, complexity router, agent registry, dual-mode
Phase 2 ✅ Security: Sandbox executor, audit logs, path/network guards
Phase 3 ✅ Voice Engine: Whisper STT, Kokoro TTS, wake word, conversation loop
Phase 4 ✅ Memory: ChromaDB + Graphiti temporal graph + nightly distiller
Phase 5 ✅ Screen Vision: Passive watcher, deep analysis, proactive suggestions
Phase 6 🔨 Computer Control: PyAutoGUI, browser automation
Phase 7 🔨 Multi-Agent Teams: Parallel specialist delegation
Phase 8 🔨 MCP Skills Library: 8 MCP servers, 500+ tool integrations
Phase 9 🔨 Persona Engine: Emotional state, adaptive tone
Phase 10 🔨 Packaging: dmg installer, auto-update
Let me walk through each completed phase in depth.
Phase 1: The Complexity Router — The Brain Behind Model Switching
This is the feature I'm most proud of. Every message that comes into JARVIS goes through the ComplexityRouter first. It assigns a score from 1 to 10 and routes to the appropriate model automatically.
Score 1-4 → gemma4:e4b (receptionist — always loaded)
Score 5-7 → qwen3.5:27b (orchestrator — loaded on demand)
Score 8 → gemma4:26b (code specialist — loaded on demand)
Score 9-10 → orchestrator + specialist delegation
Here's the actual scoring logic from core_engine/router.py:
# Excerpt from core_engine/router.py: `re` and the keyword/pattern lists are defined elsewhere in the module
def classify(self, message: str, context: str = "") -> dict:
    msg_lower = message.lower().strip()
    word_count = len(msg_lower.split())
    candidate_scores: list[int] = []

    # Rule 1: Very short / greeting → score 1-2
    if word_count < 5 or any(g in msg_lower for g in _LIGHT_GREETINGS):
        candidate_scores.append(2)
    # Rule 2: Light-medium factual patterns → score 3-4
    if any(re.search(p, msg_lower) for p in _LIGHT_MEDIUM_PATTERNS):
        candidate_scores.append(4)
    # Rule 3: Medium planning/research/comms → score 5-6
    if any(kw in msg_lower for kw in _MEDIUM_KEYWORDS):
        candidate_scores.append(5)
    # Rule 4: Code-related keywords → score 7-8
    if any(kw in msg_lower for kw in _CODE_KEYWORDS):
        candidate_scores.append(8)
    # Rule 5: Very long messages → score 9
    if word_count > 200:
        candidate_scores.append(9)
    # Rule 6: Multi-domain "research AND implement" → score 9-10
    if any(re.search(p, msg_lower) for p in _VERY_COMPLEX_MULTI):
        candidate_scores.append(10)

    # No rule matched → treat as trivial (score 1), then clamp to the 1-10 range
    score = max(candidate_scores, default=1)
    score = max(1, min(10, score))
    ...
This is rule-based with LLM fallback planned for Phase 7. The key insight: rule-based routing is faster and more predictable than asking an LLM to route itself. For an always-on system, latency on the routing decision itself matters.
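To make the routing concrete, here's roughly how the gateway consumes that score. The return shape after the ellipsis is elided above, so the dict key and the helper below are illustrative assumptions, not the actual gateway code:

# Illustrative only: assumes classify() returns a dict with a "score" key
router = ComplexityRouter()

def pick_offline_model(message: str) -> str:
    score = router.classify(message)["score"]
    if score <= 4:
        return "gemma4:e4b"            # receptionist, always resident
    if score <= 7:
        return "qwen3.5:27b-q4_K_M"    # planner/orchestrator, loaded on demand
    if score == 8:
        return "gemma4:26b"            # code specialist, loaded on demand
    return "qwen3.5:27b-q4_K_M"        # 9-10: orchestrator, which then delegates to specialists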
The RAM Guard
The single most important constraint in the system: gemma4:26b (~18 GB) and qwen3.5:27b-q4_K_M (~16 GB) cannot be loaded simultaneously — that's ~34 GB combined, leaving only ~14 GB for the OS on a 48 GB machine. The ModeManager enforces this as a hard rule:
# Large model RAM guard — these two must NEVER coexist
_LARGE_MODEL_A = "gemma4:26b"
_LARGE_MODEL_B = "qwen3.5:27b-q4_K_M"
Before loading either model, the gateway checks which large model (if any) is currently resident and unloads it first. This makes model switching take ~2-3 seconds but prevents OOM crashes entirely.
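Here's a minimal sketch of that guard. The real ModeManager internals aren't shown in this post, so treat the function below as an illustration built on Ollama's documented /api/ps and keep_alive behaviour rather than the actual implementation:

import requests

_LARGE_MODEL_A = "gemma4:26b"
_LARGE_MODEL_B = "qwen3.5:27b-q4_K_M"
OLLAMA = "http://localhost:11434"

def ensure_large_model(target: str) -> None:
    """Evict whichever large model is resident before loading the other one."""
    other = _LARGE_MODEL_B if target == _LARGE_MODEL_A else _LARGE_MODEL_A
    running = {m["name"] for m in requests.get(f"{OLLAMA}/api/ps").json().get("models", [])}
    if other in running:
        # keep_alive=0 asks Ollama to unload the model immediately
        requests.post(f"{OLLAMA}/api/generate", json={"model": other, "keep_alive": 0})
    # A promptless generate call loads the target and keeps it resident for a while
    requests.post(f"{OLLAMA}/api/generate", json={"model": target, "keep_alive": "10m"})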
Offline/Online Dual Mode
Every agent routes through ModeManager, which abstracts the backend:
- OFFLINE mode: Ollama at localhost:11434
- ONLINE mode: Vertex AI (Gemini 2.5 Pro/Flash/Flash-Lite) with automatic fallback to Ollama on any Vertex failure
The online model assignment mirrors the offline complexity tiers:
Complexity 8+ → gemini-2.5-pro (matches gemma4:26b tier)
Complexity 5-7 → gemini-2.5-flash (matches qwen3.5:27b tier)
Complexity 1-4 → gemini-2.5-flash-lite (matches gemma4:e4b tier)
Privacy rule enforced in both modes: voice_triage always routes to local gemma4:e4b, never to Vertex AI — even in online mode. Voice commands are private.
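Putting the tiers and the privacy rule together, the selection logic amounts to something like this (the real ModeManager interface isn't shown here, so the function name and signature are illustrative):

def select_model(agent: str, complexity: int, online: bool) -> str:
    # Privacy rule: voice triage never leaves the machine, even in online mode
    if agent == "voice_triage":
        return "gemma4:e4b"
    if online:
        if complexity >= 8:
            return "gemini-2.5-pro"
        if complexity >= 5:
            return "gemini-2.5-flash"
        return "gemini-2.5-flash-lite"
    if complexity >= 8:
        return "gemma4:26b"
    if complexity >= 5:
        return "qwen3.5:27b-q4_K_M"
    return "gemma4:e4b"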
Phase 2: The Security Sandbox
Every agent action passes through SecurityEnforcer before execution. This isn't optional middleware — it's enforced at the gateway level.
class SecurityEnforcer:
"""
Central security orchestration layer.
Coordinates PathGuard, NetworkGuard, and AuditManager.
"""
The security stack:
- PathGuard — blocks access outside allowed directories; ~/ and above requires explicit allowlisting
- NetworkGuard — allowlist of permitted domains; blocks all others, including internal network calls
- AuditManager — SHA-256 hash-chained audit log; every action is cryptographically linked to the previous entry. API keys are automatically redacted via regex before logging.
- PendingAction queue — file deletions require two separate confirmations, with a 5-minute expiry window. If the user doesn't confirm twice within 5 minutes, the action is cancelled.
The security policy lives in sandbox/jarvis_security.yaml — a human-readable YAML file where you can add rules without touching Python. sudo and admin commands are completely blocked at the policy level, not just the prompt level.
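To make the hash-chaining idea concrete, here's a minimal sketch of how an append-only chained log works. This is not the actual AuditManager code, and the field names are illustrative:

import hashlib, json, time

def append_audit_entry(log_path: str, action: dict, prev_hash: str) -> str:
    """Append one entry whose hash covers both the action and the previous entry's hash."""
    entry = {"ts": time.time(), "action": action, "prev_hash": prev_hash}
    entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["hash"]  # feed into the next call; editing any earlier entry breaks the chain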
Phase 3: The Voice Engine — Wake Word to Spoken Response
The voice pipeline is a full conversation loop, not a single-shot transcription:
Idle state
↓ (wake word detected: "Hey Jarvis")
Recording (VAD auto-stops on silence)
↓
Transcribing (Whisper large-v3)
↓
Processing (ComplexityRouter → Agent → Response)
↓
Speaking (Kokoro-82M TTS, audio streamed)
↓
Conversation mode (60-second window, no wake word needed for follow-ups)
↓ (farewell word OR 60s idle)
Idle state
The farewell word detection is multilingual — the system understands English, Hindi (Devanagari), and Hinglish out of the box:
FAREWELL_WORDS = {
"goodbye", "bye", "sleep", "stand by", "dismissed",
# Hindi
"अलविदा", "सो जाओ", "शुभ रात्रि",
# Hinglish
"alvida", "so jao", "bas itna hi",
}
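Detection itself is just a containment check on the transcript, which works the same for the English, Devanagari, and Hinglish entries (a sketch; the real loop also tracks the 60-second idle timeout):

def is_farewell(transcript: str) -> bool:
    text = transcript.lower().strip()
    return any(word in text for word in FAREWELL_WORDS)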
TTS model selection: Kokoro-82M runs at ~15ms per sentence on the M4 Pro's MPS backend. Whisper large-v3 loads lazily on first voice command and stays resident — initial load ~3 seconds, subsequent calls ~200ms for a typical spoken sentence.
The voice session manager uses asyncio throughout. The wake word detector runs in a background thread, but hands off to asyncio.run_coroutine_threadsafe for everything downstream — so the voice pipeline and the FastAPI gateway share the same event loop cleanly.
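The handoff pattern looks roughly like this; handle_voice_session is a stand-in name for the downstream pipeline, not the actual function:

import asyncio

async def handle_voice_session() -> None:
    ...  # record → transcribe → route → speak, all on the shared event loop

def on_wake_word(loop: asyncio.AbstractEventLoop) -> None:
    """Called from the wake word detector's background thread."""
    # Schedule the async pipeline on the gateway's event loop without blocking the detector
    future = asyncio.run_coroutine_threadsafe(handle_voice_session(), loop)
    future.add_done_callback(lambda f: f.exception())  # retrieve any exception so it isn't silently dropped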
Phase 4: The Four-Tier Memory System
This is what separates JARVIS from a standard chatbot. There are four memory tiers:
TIER 1 — PROCEDURAL : JARVIS_CORE.md (persona, rules, user profile)
Injected first in every prompt. KV-cached by Ollama.
Cost after first request: ~0ms.
TIER 2 — EPISODIC : memory_vault/logs/YYYY-MM-DD.log
Raw conversation log. Never injected directly.
Input for nightly distillation.
TIER 3 — SEMANTIC : ChromaDB (vector similarity) + Graphiti (temporal graph)
Top 5 relevant facts injected silently into every prompt.
TIER 4 — COMPILED WIKI: memory_vault/wiki/
Synthesized Markdown knowledge base.
Built nightly from Tiers 2 and 3.
Human-readable and human-editable.
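Put together, every prompt is assembled in the same order, which is what makes the KV-cache trick (more on that later) work. A sketch of the composition, not the exact gateway code:

def build_prompt(core_persona: str, facts: list[str], conversation: str, user_msg: str) -> str:
    # Tier 1 first: an identical prefix on every request, so Ollama can serve it from KV cache
    memory_block = "\n".join(f"- {fact}" for fact in facts[:5])  # Tier 3: top-5 relevant facts
    return (
        f"{core_persona}\n\n"
        f"[Relevant memory]\n{memory_block}\n\n"
        f"[Conversation]\n{conversation}\n"
        f"User: {user_msg}"
    )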
The GraphitiStore component uses bi-temporal modeling — every fact has both a valid_from and valid_to timestamp:
"User prefers Redux" → valid_from: Jan 1 | valid_to: Mar 15 (superseded)
"User prefers Zustand" → valid_from: Mar 15 | valid_to: None (CURRENT)
When JARVIS learns a new contradicting fact, it automatically closes the old one rather than stacking conflicting facts. This means memory gets smarter and more accurate over time — old facts don't poison new queries.
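Here's what "closing" a superseded fact means in practice. Graphiti's real API is different; this sketch just shows the bi-temporal bookkeeping on a plain list of facts:

from datetime import datetime, timezone

def add_fact(facts: list[dict], subject: str, predicate: str, value: str) -> None:
    """Close any still-open fact for the same subject/predicate, then add the new one."""
    now = datetime.now(timezone.utc)
    for fact in facts:
        if (fact["subject"], fact["predicate"]) == (subject, predicate) and fact["valid_to"] is None:
            fact["valid_to"] = now  # superseded, but kept for history
    facts.append({"subject": subject, "predicate": predicate,
                  "value": value, "valid_from": now, "valid_to": None})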
The nightly distillation job (runs at 2 AM on idle system via APScheduler) reads the day's episode log, extracts durable facts, and:
- Writes vector embeddings to ChromaDB
- Writes episodes to Graphiti with contradiction detection
- Updates wiki/user_profile.md with the compiled view
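Scheduling that job with APScheduler looks roughly like this; the distillation helpers are hypothetical names standing in for the real Phase 4 code:

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

@scheduler.scheduled_job("cron", hour=2, minute=0)
async def nightly_distillation() -> None:
    await distill_episode_log()   # Tier 2 episode log → durable facts
    await update_vector_store()   # ChromaDB embeddings + Graphiti episodes
    await rebuild_wiki()          # Tier 4 compiled wiki view

scheduler.start()  # call this once the event loop is running (e.g. FastAPI startup)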
Memory correction commands JARVIS understands naturally:
- "Jarvis, forget that I use Redux."
- "Jarvis, what do you know about me?"
- "Jarvis, do not learn from the next 10 minutes."
- "Jarvis, show me my coding wiki."
Phase 5: Screen Vision — Watching Without Being Asked
The screen engine runs as a background thread, taking a screenshot every 2 seconds and running it through gemma4:e4b passive analysis. If a suggestion is generated AND the cooldown period has elapsed (default 120 seconds), JARVIS speaks up.
The passive watcher uses a two-tier vision model approach:
- Passive (gemma4:e4b): Always-on. Fast. Shared model — no extra RAM cost. Detects what app is open, what file is being edited, current context.
- Deep (gemma4:26b): On-demand. Full multimodal analysis with the same model used for code. Only loaded when the situation requires deeper understanding (complex UI, code review, error diagnosis).
The ScreenVision component returns structured output for every capture:
{
"description": "TypeScript file auth.ts, async function handleLogin at line 26",
"app_detected": "vscode",
"context": "TypeScript file auth.ts line 26",
"suggestions": ["The handleLogin function is not handling the rejected promise..."],
"screenshot_b64": "", # Only populated in deep mode
"model_used": "gemma4:e4b"
}
The SuggestionEngine ranks suggestions by relevance and enforces the Proactive Suggestion Protocol defined in JARVIS_CORE.md:
- Maximum one suggestion every 3 minutes
- Always starts with "Sorry to interrupt, Sir."
- Always ends with "Shall I?" — never acts without confirmation
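The cooldown gate itself is simple. A sketch with illustrative names; the real SuggestionEngine also ranks candidate suggestions by relevance first:

import time

class SuggestionGate:
    def __init__(self, cooldown_s: float = 180.0):   # "maximum one suggestion every 3 minutes"
        self.cooldown_s = cooldown_s
        self._last_spoken = 0.0

    def maybe_speak(self, suggestion: str | None) -> str | None:
        """Return a phrased suggestion only if one exists and the cooldown has elapsed."""
        if not suggestion or time.monotonic() - self._last_spoken < self.cooldown_s:
            return None
        self._last_spoken = time.monotonic()
        return f"Sorry to interrupt, Sir. {suggestion} Shall I?"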
The JARVIS_CORE.md Persona File — The Secret Architecture Piece
One piece of the system that isn't obvious from the directory structure: JARVIS_CORE.md is not just a prompt file. It's the KV-cache anchor for the entire system.
When Ollama processes the first request with JARVIS_CORE.md prepended, it caches the key-value attention vectors for those ~4,000 tokens. Every subsequent request that starts with the same JARVIS_CORE.md prefix costs ~0ms for that portion — Ollama serves it from cache.
This is why the file contains the user profile, personality definition, memory architecture, anti-patterns, response format taxonomy (12 response types), wit calibration levels (0-4), and operating rules — all in one place, all cached, all cost-free after the first request.
The response taxonomy is worth highlighting. Every incoming message is classified into one of 12 types:
TYPE 1 — FACTUAL_SIMPLE TYPE 7 — CODE_DEBUG
TYPE 2 — FACTUAL_LIST TYPE 8 — TASK_CONFIRM
TYPE 3 — OPINION_ANALYSIS TYPE 9 — RESEARCH_SUMMARY
TYPE 4 — COMPARISON TYPE 10 — PLAN_STRATEGY
TYPE 5 — CODE_WRITE TYPE 11 — CASUAL_CHAT
TYPE 6 — CODE_EXPLAIN TYPE 12 — SYSTEM_STATUS
Each type has both a text format and a voice format. In voice mode, markdown characters are forbidden — the model is instructed to produce natural spoken transitions ("First... Second... And finally...") instead of bullet points and headers.
The 13-Agent Registry
Agent Model (Offline) Always-on? Notes
─────────────────────────────────────────────────────────────────────────────
Receptionist gemma4:e4b Yes Router + simple chat
Manager/Planner qwen3.5:27b-q4_K_M On-demand Plans, coordinates
Code Specialist gemma4:26b On-demand Write/debug/refactor
Screen Vision Passive gemma4:e4b Yes 2s scan, shared model
Screen Vision Deep gemma4:26b On-demand Full analysis + control
Browser/Shopping qwen3.5:27b-q4_K_M On-demand Puppeteer MCP
Research qwen3.5:27b-q4_K_M On-demand Brave Search + Firecrawl
Auditor/QA gemma4:e4b Yes Reviews outputs
Memory Distiller gemma4:e4b Yes Nightly 2 AM job
File Manager qwen3.5:27b-q4_K_M On-demand Filesystem MCP
Voice Triage gemma4:e4b Yes (ALWAYS LOCAL) Privacy-first
System Control qwen3.5:27b-q4_K_M On-demand OS commands via sandbox
Communication qwen3.5:27b-q4_K_M On-demand Gmail/Slack/Jira
Five agents share gemma4:e4b and stay resident forever — that's the 10 GB base cost that never goes away. The other eight are on-demand, with the RAM guard preventing simultaneous loading of the two large models.
The RAM Budget in Practice
State RAM Used Free (of 48 GB)
────────────────────────────────────────────────────────
macOS baseline ~19.6 GB ~28.4 GB
+ gemma4:e4b (always-on) ~29.6 GB ~18.4 GB
+ Code task → load 26b ~37.6 GB ~10.4 GB ✅ Safe
+ Planning → unload 26b,
load qwen3.5:27b ~35.6 GB ~12.4 GB ✅ Safe
⚠️ BLOCKED: both large models ~53.6 GB N/A ❌ Hard block
The hard block isn't a warning — the gateway refuses to route to a model if loading it would violate the RAM constraint. The user gets a graceful degradation message and JARVIS falls back to gemma4:e4b for the task.
Getting Started
Prerequisites: macOS with Apple Silicon (M1 or later), Ollama installed, Python 3.11+.
# Clone and enter the project
git clone https://github.com/Hitansu2004/Jarvis
cd jarvis
# Run setup (creates venv, installs deps, generates .env)
./setup.sh
# Pull the Gemma 4 models
ollama pull gemma4:e4b
ollama pull gemma4:26b
# Start the gateway
source .venv/bin/activate
uvicorn core_engine.gateway:app --reload
# Test it
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Good evening, Jarvis. What can you do?"}'
To enable voice:
# Install audio deps (macOS)
brew install portaudio
pip install pyaudio --break-system-packages
# Start with voice
VOICE_ENABLED=true uvicorn core_engine.gateway:app --reload
Check status and agent list:
curl http://localhost:8000/status
curl http://localhost:8000/agents
The .env.example file documents every configurable parameter across all 10 phases. Start with the defaults — they're tuned for a 48 GB M4 Pro but the complexity thresholds and model assignments are all environment variables.
What Gemma 4 Unlocked That Nothing Else Could
Let me be specific about this, because it's the core of the build:
Gemma 4 E4B as a shared multi-role model. Five separate agents running on one always-loaded model is only possible because E4B is genuinely capable despite its size. Receptionist, Auditor, Voice Triage, Passive Screen Vision, and Memory Distiller all run on it. With any other 4B model I tried, the quality degraded below acceptable for at least two of those roles.
Gemma 4 26B's native vision without a separate model. Screen vision and code review on the same loaded model. This single fact saved ~10 GB of RAM (no separate vision model) and eliminated one entire model-switching operation from the pipeline.
The MoE efficiency. Both Gemma 4 models use Mixture-of-Experts. gemma4:26b has 26B total parameters, but only ~4B are active on any given token, which keeps generation fast enough for an interactive assistant; combined with quantization, its resident footprint is ~18 GB instead of the ~52+ GB an unquantized dense 26B model would need. Without that combination, this entire architecture is impossible on consumer hardware.
The 128K context window. JARVIS_CORE.md + memory retrieval + the actual conversation can comfortably fit in context. With a 4K or 8K context window, the persona file alone would crowd out the memory system.
What's Next (Phases 6-10)
Phase 6 brings computer control — PyAutoGUI integration so JARVIS can click, type, and navigate on behalf of the user, with every action requiring confirmation through the security sandbox.
Phase 7 activates multi-agent teams — the orchestrator can spawn parallel specialist agents for complex tasks, with results synthesized back through an Auditor QA pass before delivery.
Phase 8 wires up the MCP Skills Library — 8 MCP servers are already registered in skills_mcp/mcp_registry.json, including Filesystem, GitHub, Puppeteer, Brave Search, Composio (500+ app integrations), and PyAutoGUI. The registry exists now; Phase 8 activates the connections.
Phase 10 is the goal: a .dmg installer that anyone with an Apple Silicon Mac can download and run. No cloud dependencies. No subscriptions. A personal AI OS that's yours.
Why This Matters
There's a philosophical point underneath all the engineering: your AI assistant should not require a corporate server to function.
The voice triage agent — the part of JARVIS that hears your voice commands and decides what to do with them — is hardcoded to gemma4:e4b regardless of the operation mode. Even if you've switched JARVIS to online mode for heavier tasks, voice never leaves the machine. This isn't a setting. It's enforced at the gateway level.
Every conversation logs to a local file. The nightly distillation runs on your own CPU. The temporal knowledge graph lives in memory_vault/kuzu_db/ on your filesystem. Your AI gets smarter over time, and none of that learning ever touches a cloud database.
Gemma 4 made this possible. A capable, efficient, multimodal open model family that runs well on consumer hardware is the technical prerequisite for this entire architecture. The E4B model being genuinely useful is what allows five always-on agents without breaking the RAM budget. The 26B model's native vision support is what makes screen understanding practical. The MoE efficiency is what makes the math work on 48 GB.
If you want to build something similar, the repo is linked below. The .env.example file is extensively documented. The test suite (50+ tests across all five phases) serves as the best architecture documentation.
Built with Python 3.11, FastAPI, Ollama, Gemma 4, PyTorch 2.6 MPS, ChromaDB, Graphiti, Kuzu, Whisper large-v3, Kokoro-82M, and a lot of late-night IST sessions.
Author: Hitansu Parichha | Software Engineer at Nisum Technologies