Mamoor Ahmad


Building a Fully Offline AI Coding Assistant with Gemma 4, No Cloud Required 🤖

Gemma 4 Challenge: Write about Gemma 4 Submission

Your code never leaves your machine. Your API bill is zero. Your assistant still works on a plane. ✈️

That's the pitch. Here's how to actually build it.

🤔 Why Go Offline in 2026?


Three reasons pushed me (and a lot of other devs) toward local AI:

  1. 💰 Cost. If you're running coding sessions multiple times a day, API bills add up fast. A one-time hardware investment pays for itself in months.

  2. 🔒 Privacy. Some codebases (client work, proprietary algorithms, internal tools) should never touch someone else's server.

  3. ⚡ Resilience. Cloud APIs throttle, go down, and change pricing. A local model just runs.

Gemma 4 finally makes this practical. Previous Gemma generations scored 6.6% on function-calling benchmarks, basically useless for agentic coding. Gemma 4 31B scores 86.4% on the same benchmark. 🤯

That's the jump that makes "local coding assistant" go from toy to tool.


🧰 What You'll Need

⚙️ Hardware

| Model | Min RAM | Recommended | Best For |
| --- | --- | --- | --- |
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE ⭐ | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |

โญ The sweet spot for most developers: 26B MoE on a 24 GB machine. It activates only 3.8B parameters per token (Mixture of Experts), so it's fast โ€” often faster than the bigger 31B despite being "smaller."


📦 Software

  • Ollama or llama.cpp (the local inference runtime)
  • Continue.dev extension for VS Code / JetBrains
  • Optional: Codex CLI for terminal workflows, huggingface-cli for GGUF downloads

🚀 Step 1: Get the Model

Option A: Ollama - The Easy Path ☕

```bash
# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model - this downloads ~16 GB for the 26B MoE
ollama pull gemma4:26b

# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b

# Verify it works 🎉
ollama run gemma4:26b "Write a Python function to merge two sorted lists"
```

That's it. You now have a local AI that can write code. Seriously.
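If you'd rather script against the model than use the CLI, Ollama also exposes an HTTP API on its default port 11434. Here's a minimal Python sketch (standard library only) against the `/api/generate` endpoint; the `gemma4:26b` tag matches the pull command above, and it assumes the Ollama server is running locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt: str, model: str = "gemma4:26b") -> dict:
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_gemma(prompt: str, model: str = "gemma4:26b") -> str:
    """POST the prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `ask_gemma("Write a Python function to merge two sorted lists")` returns the same answer the `ollama run` command prints.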

Option B: llama.cpp โ€” For Power Users ๐Ÿ”ง

llama.cpp gives you more control over quantization, context length, and memory usage. This matters on constrained hardware.

```bash
# Install via Homebrew (macOS)
brew install llama.cpp

# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # NVIDIA
# or: cmake -B build -DGGML_METAL=ON  # Apple Silicon
cmake --build build --config Release -j
```

Download the GGUF file from Hugging Face:

```bash
# 26B MoE Q4 - best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/
```

Start the server with the right flags (every flag here matters ⚠️):

```bash
llama-server \
  -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 \
  -ngl 99 \
  -c 32768 \
  -np 1 \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0
```

🔑 What each flag does:

| Flag | Purpose |
| --- | --- |
| `-ngl 99` | 🚀 Offload all layers to GPU |
| `-c 32768` | 📏 32K context window (increase if you have RAM) |
| `-np 1` | 🎯 Single slot; multiple slots multiply KV cache memory |
| `--jinja` | 🔌 Required for Gemma 4's tool-calling template |
| `-ctk q8_0 -ctv q8_0` | 💾 Quantize KV cache from ~940 MB to ~499 MB |

โš ๏ธ Do NOT use the -hf flag to auto-download โ€” it silently pulls a 1.1 GB vision projector that will OOM on 24 GB machines. Learn from my pain. ๐Ÿ˜…


🔌 Step 2: Connect It to Your Editor

Continue.dev (VS Code / JetBrains) 💻

Continue is an open-source AI code assistant that runs in your IDE. It supports Ollama and llama.cpp out of the box.

Install:

  1. Open VS Code → Extensions → Search "Continue" → Install
  2. Open ~/.continue/config.json (or use the Continue settings UI)

Config for Ollama:

```json
{
  "models": [
    {
      "title": "Gemma 4 26B (Local)",
      "provider": "ollama",
      "model": "gemma4:26b",
      "contextLength": 32768
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 E4B (Autocomplete)",
    "provider": "ollama",
    "model": "gemma4:4b"
  }
}
```

Config for llama.cpp:

```json
{
  "models": [
    {
      "title": "Gemma 4 26B (llama.cpp)",
      "provider": "openai",
      "model": "gemma-4-26b",
      "apiBase": "http://localhost:1234/v1",
      "contextLength": 32768
    }
  ]
}
```

💡 Pro tip: Use the 4B model for tab autocomplete (fast, low memory) and the 26B model for chat/explain/refactor (smarter, slower). This dual-model setup gives you the best of both worlds! 🏆

Codex CLI - Terminal Power Users ⌨️

If you prefer agentic coding from the terminal:

```bash
# Install Codex CLI
npm install -g @openai/codex

# Run with local model
codex --oss -m gemma4:26b

# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1
```

In Codex CLI's config.toml, set:

```toml
[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type
```

โš™๏ธ Step 3: Tune for Your Hardware

🟡 16 GB Machine (MacBook Air M3/M4, Budget Builds)

```bash
# Use the E4B model - still surprisingly capable
ollama pull gemma4:4b

# Or squeeze the 26B MoE with aggressive quantization
ollama pull gemma4:26b-q3_K_M
```

In Continue, lower contextLength to 8192 to save memory.

🔵 24 GB Machine (M4 Pro, RTX 4070/4080) - ⭐ Sweet Spot

The 26B MoE at Q4_K_M fits comfortably:

```bash
# Ollama
ollama pull gemma4:26b

# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 32768 -np 1 --jinja \
  -ctk q8_0 -ctv q8_0
```

🟣 48 GB+ Machine (M4 Max, RTX 4090, Workstations)

Run the 31B Dense for maximum quality:

```bash
ollama pull gemma4:31b

# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 65536 -np 1 --jinja
```

📊 Step 4: Real-World Benchmark

I tested the same coding task across all configurations:

"Write a parse_csv_summary function with error handling, write tests, and run them."


| Config | Quality | Time | Tool Calls | Verdict |
| --- | --- | --- | --- | --- |
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65s | 3 | Type hints, exception chaining, clean |
| 🖥️ 31B Dense (48 GB) | ★★★★☆ | 7 min | 3 | Functional, solid, no cleanup needed |
| ⚡ 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Functional but messy: dead code, retries |
| 📱 E4B (8 GB) | ★★☆☆☆ | 2 min | 15+ | Basic tasks only, struggles with multi-file |

🎯 Key takeaway: The 31B Dense on capable hardware gets close to cloud quality. The 26B MoE is fast and functional but needs more human oversight. The E4B is great for autocomplete, not for agentic coding.
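For context, here's roughly the shape of a passing solution to the benchmark task. This is my own reference sketch, not any model's output; it assumes the CSV arrives as a string and that "summary" means row count, column names, and per-numeric-column means:

```python
import csv
import io

def parse_csv_summary(text: str) -> dict:
    """Summarize a CSV string: row count, column names, and mean of each numeric column."""
    if not text.strip():
        raise ValueError("empty CSV input")
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames:
        raise ValueError("missing header row")
    rows = list(reader)
    means = {}
    for col in reader.fieldnames:
        try:
            values = [float(r[col]) for r in rows]
        except (TypeError, ValueError):
            continue  # skip columns that aren't fully numeric
        if values:
            means[col] = sum(values) / len(values)
    return {"rows": len(rows), "columns": list(reader.fieldnames), "means": means}
```

The interesting part of the benchmark isn't the function itself; it's whether the model also writes the tests, runs them via its Execute tool, and fixes any failures without hand-holding.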

⚡ Speed Comparison

The 26B MoE is deceptively fast. Despite being a "26B" model, it only activates 3.8B parameters per token:

| Model | Speed on M4 Pro | Why |
| --- | --- | --- |
| 🚀 26B MoE | ~52 tok/s | Only reads 1.9 GB/token from memory |
| 🐢 31B Dense | ~10 tok/s | Reads all 31.2B params per token |

The MoE architecture means the model is reading less memory per token, so it flies on bandwidth-limited hardware. 🏎️


🎯 Step 5: Prompt Engineering for Local Models

Local models need better prompting than cloud models. Here are patterns that actually work:

๐Ÿ“ System Prompt Template

You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command

Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal โ€” don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.
Enter fullscreen mode Exit fullscreen mode

💡 Tips That Actually Help

  • ๐ŸŽฏ Be specific about file paths. Local models hallucinate paths more than cloud models. Say src/utils/parser.ts, not "the parser file."
  • ๐Ÿ“‹ One task at a time. Don't ask for a full feature. Ask for "write the function," then "write the tests," then "run the tests."
  • ๐Ÿ“– Provide examples. Show the model what you want with a small example before asking it to generate.
  • ๐Ÿ”ง Use structured output. Gemma 4 supports native JSON output. Use it for tool calls and structured responses.

๐Ÿ› Common Pitfalls (Learn From My Pain)

💥 "Ollama hangs on long prompts"

This is a known Flash Attention bug on Apple Silicon with Gemma 4.

Fix: Use llama.cpp instead, or wait for Ollama v0.20.6+.

💥 "Tool calls land in the wrong field"

Ollama v0.20.3 has a streaming bug that routes Gemma 4 tool-call responses to the reasoning output instead of tool_calls.

Fix: Update to v0.20.5+ or use llama.cpp.

💥 "Out of memory on startup"

If using llama.cpp with -hf flag, it downloads a 1.1 GB vision projector you don't need.

Fix: Use a direct -m path to the GGUF file instead.

💥 "Codex CLI rejects my model"

Set web_search = "disabled" in config: Codex CLI sends a web_search_preview tool type that llama.cpp doesn't recognize.


๐Ÿ—๏ธ Architecture: The Full Offline Stack

Here's what the complete setup looks like:


```
┌──────────────────────────────────────────────┐
│             Your Editor (VS Code)            │
│  ┌────────────────────────────────────────┐  │
│  │         Continue.dev Extension         │  │
│  │  ┌───────────┐    ┌─────────────────┐  │  │
│  │  │ 💬 Chat   │    │ ⚡ Autocomplete │  │  │
│  │  │  Refactor │    │  (E4B model)    │  │  │
│  │  └─────┬─────┘    └────────┬────────┘  │  │
│  └────────┼───────────────────┼───────────┘  │
└───────────┼───────────────────┼──────────────┘
            │                   │
     ┌──────▼───────┐    ┌──────▼──────┐
     │ 🖥️ llama.cpp │    │ 📦 Ollama   │
     │ :1234        │    │ :11434      │
     │ (26B/31B)    │    │ (E4B)       │
     └──────┬───────┘    └──────┬──────┘
            │                   │
     ┌──────▼───────────────────▼──────┐
     │       🔒 Local GPU / CPU        │
     │     No data leaves this box     │
     └─────────────────────────────────┘
```

🤷 When to Use Cloud Instead

Be honest about limitations:

✅ Use Local For:

  • Day-to-day coding, refactoring, explaining code
  • Writing tests, documentation, boilerplate
  • Working with sensitive/proprietary codebases
  • Offline environments (โœˆ๏ธ flights, โ˜• cafes, ๐Ÿข secure facilities)

โŒ Use Cloud For:

  • Complex multi-file architectural changes
  • Tasks requiring reasoning across 10+ files
  • When you need the absolute highest code quality
  • Large-scale codebase migrations

🔮 What's Next

The local AI space is moving fast. Some things to watch:

  • ๐Ÿงฌ Gemma 4 fine-tuning โ€” Use Unsloth to fine-tune on your own codebase. A domain-specific adapter can dramatically improve quality.
  • ๐Ÿ”€ Multi-model pipelines โ€” Route simple tasks to E4B (fast), complex tasks to 26B/31B (smart). The AI router pattern is catching on.
  • ๐Ÿ‘๏ธ Vision + Code โ€” Gemma 4 processes images natively. Feed it a screenshot of a UI, get the code. This is massively underrated.

🎬 The Bottom Line

You don't need a $10K rig. A 24 GB laptop with Gemma 4 26B MoE gives you a coding assistant that:

  • โœ… Handles 80% of daily tasks
  • โœ… Costs nothing per query
  • โœ… Never phones home
  • โœ… Works offline
  • โœ… Keeps your code private

That's not a compromise; that's a paradigm shift. 🚀


All benchmarks were run locally on consumer hardware. No cloud APIs were harmed in the making of this post.


Found this useful? Drop a ❤️ and share it with a friend who's tired of API bills!

Questions? Hit me up in the comments; I'll help you troubleshoot your setup. 👇

