Most model experiments start with a notebook, a benchmark script, or a quick API call.
This one started with a production-shaped question:
Can I swap out an entire model family that is currently serving the default paths through my actual local AI gateway?
Not a side demo. Not a one-off curl. Not "look, it runs."
I mean the real route: the gateway that agents, background jobs, app surfaces, benchmark harnesses, and my own tools already call.
That is the experiment I started with Gemma 4.
This post is the beginning of that story, not the final verdict. I am writing it while the platform is still in the trial window. The follow-up will be more interesting: what stayed stable, what broke under real load, what got rolled back, and what I would keep after a week or two of actual use.
For now, this is the setup: what I changed, why I changed it, and what failed immediately.
The Platform Before The Swap
My local AI stack is built around a gateway I call Forge.
Forge gives callers one OpenAI-ish API surface and handles the messy parts behind it:
- which model should answer this kind of request
- which machine is hosting it
- whether the model is hot, cold, deprecated, or on-demand
- whether a request is chat, vision, embedding, transcription, code, extraction, or something else
- whether a backend is available or should be skipped
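A toy sketch of the shape of those decisions, with made-up role names, hosts, and statuses (this is illustrative, not Forge's actual code):

```python
# Toy sketch of the per-request routing decision a gateway like Forge makes.
# Role names, hosts, and statuses are made up for illustration.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    host: str
    status: str  # "hot", "cold", "deprecated", or "on-demand"

ROUTES = {
    "chat": Route("gemma-4-chat-31b", "furnace", "hot"),
    "vision": Route("gemma-4-multimodal-8b-e4b", "anvil", "on-demand"),
    "prompt-enhance": Route("gemma-4-multimodal-2b-e2b", "furnace", "hot"),
}

def pick_route(kind: str) -> Route:
    """Resolve a request kind to a live backend route."""
    route = ROUTES.get(kind)
    if route is None or route.status == "deprecated":
        raise LookupError(f"no live route for request kind {kind!r}")
    return route
```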
The machines behind it are consumer hardware, not datacenter gear:
| Host | Role |
|---|---|
| Furnace | Primary inference box, AMD Strix Halo, 96 GB unified VRAM allocated to the iGPU |
| Crucible | Secondary AMD box for creative workloads, permissive models, and burst/bulk work |
| Anvil | M4 Mac mini, useful for MLX/Metal paths and lightweight resident services |
Before this experiment, the default local text path was mostly Qwen-family. That was not an accident. Qwen had become the operating baseline because it was predictable enough for a platform, not just impressive in isolation.
I had also tested other models. Devstral2, for example, was interesting enough to onboard and benchmark seriously. The smaller 24B variant was competitive in code scenarios, but it did not become the default path. The 123B model was too slow for the role I needed. That distinction matters:
A model can be good and still not be a good platform default.
That is the bar Gemma 4 had to clear.
Why I Did An In-Place Swap
I could have added Gemma 4 as another optional model and called it a day.
That would have been safer. It also would have taught me much less.
Instead, I treated it like a real migration. For the trial window, Gemma 4 took over the canonical roles that real callers already use.
| Role | Previous route | Trial route |
|---|---|---|
| default chat | qwen3.6-chat-35b-a3b | gemma-4-chat-31b |
| priority chat | qwen3-8b | gemma-4-chat-26b-a4b |
| vision / multimodal | qwen3-vl-30b-a3b | gemma-4-multimodal-8b-e4b |
| prompt enhancement | qwen3-4b | gemma-4-multimodal-2b-e2b |
The old Qwen routes were not deleted. They were marked deprecated with a planned rollback window. That gives me a clean flip-back path if the experiment does not earn its keep.
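In catalog terms, that flip is small. A sketch of what one role entry might look like during the trial window - the field names are mine for illustration, not Forge's actual schema:

```python
# Illustrative catalog entries for the flip; field names are made up.
# The deprecated Qwen routes stay in the catalog so rollback is a
# one-line change, not a redeploy.
CATALOG = {
    "default-chat": {
        "route": "gemma-4-chat-31b",
        "rollback": "qwen3.6-chat-35b-a3b",  # deprecated, kept for flip-back
        "rollback_window_days": 14,          # hypothetical window length
    },
    "priority-chat": {
        "route": "gemma-4-chat-26b-a4b",
        "rollback": "qwen3-8b",
        "rollback_window_days": 14,
    },
}
```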
This is the part I think model posts often skip. A real model migration is not just "can I run it?" It is:
- do I have the right weights on disk?
- does my serving stack understand the architecture?
- can I fit the hot set in memory?
- do my existing aliases and callers still work?
- can I roll back without spelunking through five repos?
- do I have telemetry that will tell me the difference between model failure, gateway failure, and benchmark nonsense?
That last one matters more than I expected.
The First Failure Was Not The Model
The first deploy crashed.
Forge restarted cleanly. The model catalog showed the new Gemma 4 ids. The first smoke request hit the gateway, routed to llama-swap, and came back as a 502.
The useful error was one layer lower:
```
unknown model architecture: 'gemma4'
```
The problem was not Gemma 4 quality. The problem was my serving binary.
My llama.cpp build was from April 1. It was 466 commits behind the branch I needed. The GGUF files declared `general.architecture: gemma4`, and the old build simply did not know what that meant.
So the first chapter of the Gemma 4 experiment was not prompting. It was infrastructure:
- back up the existing build tree
- rebuild llama.cpp with ROCm/HIP for Strix Halo
- verify the new binary recognizes the Gemma 4 architecture
- regression-check the existing Qwen route
- restart the serving layer
- smoke test through the actual gateway, not a side process
Only after that did the model start answering.
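For reference, the smoke test was nothing exotic. A minimal sketch of that kind of check against the gateway's OpenAI-style surface, with a placeholder host, port, and prompt:

```python
# Minimal gateway smoke test: go through the real route, not the backend.
# Host, port, and prompt are placeholders for illustration.
import json
import urllib.request

req = urllib.request.Request(
    "http://forge.local:8080/v1/chat/completions",  # the gateway, not llama-swap
    data=json.dumps({
        "model": "gemma-4-chat-31b",
        "messages": [{"role": "user", "content": "Reply with the single word OK."}],
        "max_tokens": 8,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=60) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```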
That is a useful reminder: "model support" is not a binary property. A model can be downloadable, quantized, and present on disk, and still be unusable because the serving stack is one architecture handler behind.
The Second Failure Was More Interesting
Once Gemma 4 loaded, the first real chat benchmark looked bad.
Not "a little worse than Qwen" bad. Broken bad.
On the initial chat-bench run, gemma-4-chat-31b failed the structured extraction and format-compliance scenarios. It was also slow enough that something was clearly wrong. These were not hard prompts. These were the boring, throughput-oriented tasks that agents and background workers need to complete cleanly.
A direct request showed the issue immediately:
```
<think>
The user is asking a basic arithmetic question...
</think>
2 + 2 = 4
```
The model was spending the answer budget on a reasoning block.
For a human chat UI, visible thinking might be useful. For a benchmark expecting JSON, or an agent expecting a short answer, it is poison. The model can know the right answer and still fail the task because the caller never receives the shape it asked for.
This was familiar. Forge had already solved the same class of problem for Qwen3.
The fix was to make "thinking mode" a gateway policy, not a model identity.
Programmatic callers get this injected by default when the model family needs it:

```json
{
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}
```
Chat UIs can opt back in explicitly:
```json
{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
```
That is the right abstraction for my platform. I do not want every agent, benchmark, worker, and internal tool to remember which local model family wraps output in thinking tokens this week. I want the gateway to know that once.
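A minimal sketch of that policy as gateway middleware - the family list and helper name are mine for illustration, not Forge's actual code:

```python
# Sketch of a gateway-side thinking policy, assuming the llama.cpp-style
# chat_template_kwargs convention shown above. Family names are examples.
THINKING_FAMILIES = ("qwen3", "gemma-4")  # families that emit <think> blocks

def apply_thinking_policy(payload: dict, caller_is_chat_ui: bool) -> dict:
    """Disable thinking for programmatic callers unless they opt in."""
    model = payload.get("model", "")
    if any(model.startswith(family) for family in THINKING_FAMILIES):
        kwargs = payload.setdefault("chat_template_kwargs", {})
        # Chat UIs default to thinking on; everyone else gets it off,
        # and an explicit caller-set value always wins.
        kwargs.setdefault("enable_thinking", caller_is_chat_ui)
    return payload
```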
After that change, the chat benchmarks became meaningful. The three relevant routes - Gemma 4 31B, Gemma 4 26B-A4B, and the displaced Qwen3.6 35B-A3B baseline - reached the same pass-rate shape across the default chat scenarios.
The interesting result was latency. The 26B-A4B route was materially faster than both the dense 31B and the Qwen3.6 baseline on several workloads, while keeping the same pass rate in the corrected harness.
That is the kind of result I care about. Not "model X wins," but "model X belongs in this role."
Vision Exposed A Different Problem
The multimodal side taught a separate lesson.
I added a new VLM benchmark harness and ran the obvious first test. The initial scenario was too weak. It was good enough as a smoke test, but not good enough to tell me which model or host was better.
So I built a more discriminating scenario with three generated fixtures:
- a bar chart that required OCR and chart reasoning
- a code screenshot that required reading a function name and language
- a homelab topology diagram that required identifying the hub and connected nodes
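For a sense of what "generated fixture" means here, the bar-chart case can be sketched in a few lines of matplotlib - the labels and values are arbitrary, the point is known ground truth the harness can grade against:

```python
# Generate a bar-chart fixture with known ground truth for the VLM harness.
# Labels and values are arbitrary; the grader checks OCR and chart reasoning.
import matplotlib
matplotlib.use("Agg")  # headless render, no display needed
import matplotlib.pyplot as plt

labels = ["forge", "furnace", "crucible", "anvil"]
values = [42, 96, 64, 24]

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(labels, values)
ax.set_title("Requests per host (fixture)")
ax.set_ylabel("requests")
fig.tight_layout()
fig.savefig("fixture_bar_chart.png", dpi=150)
```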
Then another problem appeared: concurrency.
Promptfoo's default concurrency sent multiple image-bearing requests at once during a backend startup window, which produced misleading 502s. Some errors appeared to implicate the wrong backend because parallel requests were failing around the same time.
Sequential runs told the truth.
With concurrency set to 1 and the output budget raised, the final VLM run passed cleanly across the tested local routes. The surprising part was not that Gemma 4 could read the images. The surprising part was that an M4 Mac mini running an MLX path was effectively tied with the AMD inference box on this small, practical vision benchmark.
That is not a leaderboard result. It is a routing result.
It tells me Anvil is not just a dev box. For some multimodal work, it is a useful inference target.
Audio Needed A Sidecar
Gemma 4's multimodal story is not just text and images. Audio is part of the interesting surface.
But my normal GGUF plus llama.cpp path did not support Gemma 4 audio input yet. The text and vision path worked through llama-swap. The audio conformer path did not.
So I built it as a sidecar:
- a separate FastAPI worker
- safetensors weights
- HuggingFace Transformers
- ROCm-specific PyTorch wheels
- a Forge route at `/v1/audio_qa`
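A minimal sketch of the sidecar's shape - the 30-second cap and the 413 match the route's behavior described below, while the model call is stubbed because the real worker wires in Transformers on a ROCm PyTorch build:

```python
# Sketch of the audio sidecar's shape, not the actual Forge worker.
# The duration cap and 413 mirror the route's real behavior; the model
# call is a stub standing in for the Transformers inference path.
import io
import wave

from fastapi import FastAPI, HTTPException, UploadFile

MAX_AUDIO_SECONDS = 30  # longer clips fail fast with a 413

app = FastAPI()

def clip_seconds(data: bytes) -> float:
    """Duration of a WAV payload; the real worker handles more formats."""
    with wave.open(io.BytesIO(data)) as w:
        return w.getnframes() / w.getframerate()

@app.post("/v1/audio_qa")
async def audio_qa(audio: UploadFile, prompt: str = "Transcribe this clip."):
    data = await audio.read()
    if clip_seconds(data) > MAX_AUDIO_SECONDS:
        raise HTTPException(status_code=413, detail="audio capped at 30 seconds")
    # Placeholder for the safetensors + Transformers inference call.
    return {"answer": f"(model output for prompt: {prompt!r})"}
```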
That route is intentionally not a replacement for Whisper.
Whisper remains the right tool for long-form transcription. Gemma 4 audio is more interesting for short audio understanding, audio Q&A, emotion or intent questions, and cross-modal prompts that combine audio and an image.
The first useful test was simple: the JFK sample clip. The route returned a good short transcription in under six seconds once warm. A 60-second clip correctly failed fast with a 413 because the audio path is capped at 30 seconds. An audio plus image prompt produced a coherent response grounded in both modalities.
That sidecar is not the end state. It is a bridge. When the standard serving path supports the audio input cleanly, the route can stay and the backend can change.
Again, that is the platform lesson: callers should not care which inference backend made the modality work.
What I Am Actually Testing
The easy version of this post would end with a benchmark table and a confident take.
I do not have that yet, and pretending otherwise would be silly.
What I have is the beginning of a platform trial:
- Gemma 4 is now in the main chat, priority-chat, multimodal, and prompt-enhance roles.
- Qwen is still available as a rollback path.
- Devstral2 remains useful but not a default for this platform.
- Forge now handles thinking-mode policy for both Qwen3 and Gemma 4.
- The benchmark harness is better than it was before the experiment.
- The audio path exists, but it is a sidecar until the normal serving stack catches up.
- The real evaluation is now happening under actual workloads.
The question I care about over the next week is not:
Is Gemma 4 better than Qwen?
The question is:
Which parts of the platform are better with Gemma 4 in the route, and which parts should move back?
That means watching boring things:
- error rates
- 429s and backend saturation
- latency under background jobs
- whether agent outputs stay clean
- whether structured tasks remain reliable
- whether multimodal routes are useful often enough to stay hot
- whether the memory footprint is worth it
- whether fallback behavior is predictable when the boxes are busy
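The per-route rollup behind most of those checks is simple enough to sketch - the log record shape here is made up:

```python
# Per-route error-rate rollup over gateway log records.
# The record shape ({"route": ..., "status": ...}) is illustrative.
from collections import Counter

def error_rates(records: list[dict]) -> dict[str, float]:
    total, errors = Counter(), Counter()
    for record in records:
        total[record["route"]] += 1
        if record["status"] >= 400:
            errors[record["route"]] += 1
    return {route: errors[route] / count for route, count in total.items()}

print(error_rates([
    {"route": "default-chat", "status": 200},
    {"route": "default-chat", "status": 429},
]))  # {'default-chat': 0.5}
```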
This is less glamorous than a benchmark screenshot. It is also where the real answer lives.
The Takeaway So Far
The first day did not teach me that Gemma 4 is universally better. It taught me something more useful:
A model family becomes valuable when the platform can route it intentionally.
The gateway mattered more than any single model call.
Without Forge, I would have been debugging each app and agent separately. With Forge, the migration became a small number of role changes, a serving-stack rebuild, one generalized thinking-policy fix, and a better set of benchmarks.
That is the part I want to keep building toward: a local AI platform where model families can change without rewriting every caller, and where the system learns from real workloads instead of one-off demos.
This is the start of the Gemma 4 trial in my homelab.
In a week or two, I should have the more honest post: what survived real use, what got demoted, and what I would do differently if I were starting the swap again.
