Max Quimby

Posted on • Originally published at computeleap.com

Local AI Just Became the Default: Gemma 4 + omlx on M4

On May 11, 2026, the top story on Hacker News was an essay titled "Local AI needs to be the norm". 1,646 points. 643 comments. The fifth-ranked story the same day was a practitioner walkthrough, "Running local models on an M4 with 24GB memory", and its top-rated reply called Gemma 4 31B "the new baseline… less like a science experiment than any previous local model." At #11 on GitHub trending: jundot/omlx, a Mac inference server managed entirely from the menu bar. 13,600 stars. +455 in a day.

📖 Read the full version with charts and embedded sources on ComputeLeap →

Three independent signals, same news cycle, same thesis. The frame around local AI has changed. The question used to be "can you run it locally?", and the answer was a hobbyist's hedged yes. The question this week is "why isn't local the default?", and the answer comes packaged as a polished menu-bar app running a 31-billion-parameter open model on a $1,599 laptop.

This piece pulls the three threads together: the model floor (Gemma 4 31B), the substrate (Apple Silicon via MLX), and the retail experience (omlx). And it explains why the structural counter-argument to the Anthropic-at-$1T thesis just shipped, quietly, in the same week.

The Frame Shift: From "Can You?" to "Why Isn't It Default?"

The HN #1 essay's argument isn't the obvious one. It's not "you can run LLMs on your old gaming rig now, look how cool." The top-ranked comment redirects the thread away from that hobbyist framing entirely:

πŸ’‘ "This isn't about the local models you're running on your old gaming rig β€” this is about code leveraging." β€” top comment on HN thread #48085821

The author is making a vendor argument: software companies (note-taking apps, IDEs, design tools, productivity SaaS) should be shipping local inference as the default. Cloud round-trips for free-text autocomplete, classification, summarization, and small structured tasks are absurd. They're absurd on latency. They're absurd on privacy. They're absurd on unit economics. And, as of Q2 2026, they're absurd on capability, because the local model can now actually do the job.
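Mechanically, the vendor calculus the essay describes is just a routing policy. A minimal sketch follows; the endpoints, task taxonomy, and token threshold here are hypothetical illustrations, not anything the essay or omlx specifies:

```python
from dataclasses import dataclass

# Hypothetical endpoints: a local omlx-style server and a frontier cloud API.
LOCAL_BASE = "http://localhost:8000/v1"
CLOUD_BASE = "https://api.frontier.example/v1"

# Task kinds the essay argues should never leave the machine.
LOCAL_KINDS = {"autocomplete", "classify", "summarize"}

@dataclass
class Task:
    kind: str           # e.g. "classify", "summarize", "agent"
    prompt_tokens: int  # rough size of the context being sent

def route(task: Task, local_ctx_limit: int = 128_000) -> str:
    """Return the base URL a task should be routed to: local by default,
    cloud only for long-horizon work or contexts the local model can't hold."""
    if task.kind in LOCAL_KINDS and task.prompt_tokens <= local_ctx_limit:
        return LOCAL_BASE
    return CLOUD_BASE
```

The point of the sketch is the default direction: small structured tasks never leave the machine, and the cloud becomes the exception path rather than the first hop.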

Hacker News thread screenshot: 'Local AI needs to be the norm' at 1,646 points and 643 comments (top of HN on 2026-05-11)

View the original HN thread →

The cross-source convergence report for May 11 names this explicitly: "The frame has shifted from 'can you run it locally?' to 'why isn't local the default for X?'" This is the structural counter to the same week's other big AI story: Anthropic's $1–1.2T valuation, 80x annualized. If you believe the Anthropic thesis is in trouble in 2026, the load-bearing question is whether on-device inference is genuinely usable for the median enterprise task. The HN front page just made that argument out loud, with receipts.

The Model Floor: Gemma 4 31B on M4 24GB

The receipt the front page is responding to is HN #5, jola.dev's "Running local models on an M4 with 24GB". 488 points. 146 comments. A boring title and an unboring conclusion.

Hacker News thread screenshot: 'Running local models on an M4 with 24GB memory' at 488 points and 146 comments, with top comments calling Gemma 4 31B the new baseline

View the original HN thread β†’

Read the second-most-upvoted comment on that thread:

"Gemma 4 31B (dense / no MoE) is the new baseline for local models. It performs better than previous attempts like GPT OSS 120B and Nemotron Super 120B on my M5 Max with 128GB RAM. Less like a science experiment than any previous local model." β€” soganess, HN

And the practitioner receipt from thot_experiment in the same thread:

"Q6_K_XL at 128k context yields approximately 800 tokens/second read and 16 tokens/second write. With the proper harness, 31B is more than adequate for a very large portion of tasks. I had Gemma 4 31B independently reverse-engineer a Bluetooth thermometer protocol across multiple turns without human intervention."

That last sentence is the one to dwell on. A multi-turn agentic task (reverse-engineering a wire protocol) completed by a model running on consumer Apple hardware, no cloud round-trip, no API key. The same person elsewhere describes results comparable to Opus 4.7 on some creative tasks. The HN thread is full of these. The "less like a science experiment" line is the soundbite, but the substance is that practitioners are independently posting agentic-task receipts, not just throughput numbers.

Google released Gemma 4 under the marketing tagline "Byte for byte, the most capable open models." The dense 31B is the model that lands. It's the size where M-series Macs with 24–32 GB unified memory hit the sweet spot: large enough to be genuinely useful for agentic work, small enough to run at interactive speeds with room left for the OS, your editor, and a KV cache that actually fits.

πŸ› οΈ For the M4 24 GB envelope specifically: a Q4_K_M quantization of Gemma 4 31B occupies roughly 18–20 GB of unified memory, leaving 4–6 GB for the OS, IDE, browser, and the model's working KV cache. The 26B MoE variant β€” the cousin to the 31B dense flagship β€” runs at a steady ~18 tokens/second on the same hardware according to community benchmarks. The 31B dense is slower per-token but more capable per-token, and the trade lands in the right place for the use cases that matter on a laptop.

This is the first time the dense-31B size class has been credibly the baseline, not the ceiling. It pairs naturally with our 2026 local-AI hardware guide and the Qwen3.6-35B-on-Mac walkthrough. The pattern of the last twelve months has been clear: open-weights models are eating the "good enough for the median enterprise task" tier from below. Gemma 4 31B is just the cleanest example yet.

The Substrate: omlx Turns Apple Silicon Into a Real Inference Server

A capable model is necessary but not sufficient. The retail-experience step is what's been missing β€” and is what shipped this week.

jundot/omlx is an MLX-based LLM inference server with a native macOS menu-bar app (PyObjC, not Electron) that lets you start, stop, swap, and monitor a local inference server without ever opening a terminal. Apache 2.0. 13.6k stars. +455 in a day. Top-15 GitHub trending the week of release.

Mayank Vora tweet: 'Holy shit... Someone built a production-grade LLM inference server that runs entirely on your Mac, persists KV cache across RAM and SSD' (describing omlx)

View the original post on X →

What makes omlx structurally interesting isn't the app; it's the cache. omlx ships a tiered KV cache: a hot tier in RAM, a cold tier on the SSD, block-based with copy-on-write semantics. When a previous prefix comes back (a system prompt, a code repository tree, a long document), it's restored from disk instead of recomputed. Users on X report time-to-first-token dropping from 30–90 seconds down to 1–3 seconds on long contexts after a warm-up. That isn't a marginal speedup. That's a usability regime change for coding agents that pass the same repo tree to the model every turn.
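The mechanism is worth internalizing because it's simple: hash the prompt prefix, keep recent cache entries in RAM, spill evicted ones to SSD, and on a repeat prefix restore instead of recompute. Here's a toy two-tier sketch of the idea; omlx's real cache is block-based with copy-on-write, which this deliberately simplifies away (whole-prefix granularity, pickle for serialization):

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path

class TieredPrefixCache:
    """Toy two-tier KV cache: a hot OrderedDict in RAM, cold pickles on SSD.
    Real servers cache fixed-size token blocks with copy-on-write; this toy
    caches whole prefixes, enough to show the restore-vs-recompute win."""

    def __init__(self, hot_capacity: int = 4):
        self.hot = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(tempfile.mkdtemp(prefix="kvcache-"))

    def _key(self, prefix_tokens: tuple) -> str:
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def put(self, prefix_tokens: tuple, kv_state) -> None:
        key = self._key(prefix_tokens)
        self.hot[key] = kv_state
        self.hot.move_to_end(key)
        if len(self.hot) > self.hot_capacity:
            # Spill the least-recently-used entry to the cold SSD tier.
            old_key, old_val = self.hot.popitem(last=False)
            (self.cold_dir / old_key).write_bytes(pickle.dumps(old_val))

    def get(self, prefix_tokens: tuple):
        key = self._key(prefix_tokens)
        if key in self.hot:                      # RAM hit
            self.hot.move_to_end(key)
            return self.hot[key]
        cold = self.cold_dir / key
        if cold.exists():                        # SSD hit: restore, don't recompute
            kv_state = pickle.loads(cold.read_bytes())
            self.put(prefix_tokens, kv_state)
            return kv_state
        return None                              # miss: caller must prefill
```

A coding agent that resends the same repo tree every turn hits the SSD tier on turn two: restoring serialized state is disk-bandwidth-bound, while recomputing it is prefill-compute-bound, and that gap is the reported 30–90 s versus 1–3 s difference.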

The architecture, from the README:

omlx architecture diagram: FastAPI server feeds an EnginePool with LRU eviction, which feeds a Scheduler (continuous batching via mlx-lm BatchGenerator), which feeds a Cache Stack with three tiers: GPU, Hot RAM, and Cold SSD

FastAPI Server
  → EnginePool (multi-model, LRU eviction, TTL)
    → Scheduler (FCFS + continuous batching via mlx-lm BatchGenerator)
      → Cache Stack (GPU + Hot RAM + Cold SSD tiers)

Continuous batching means concurrent requests don't serialize: a Claude Code session, a Cursor tab, and a Raycast script can all hit the same server and have their tokens interleaved. Multi-model serving means a single omlx process can hold an LLM, a vision-language model, an embedding model, and a reranker simultaneously, evicting the least-recently-used when memory pressure hits.
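The EnginePool half of that is the classic bounded LRU pattern. A minimal sketch, with actual model loading stubbed out as a string (the capacity and loader interface are illustrative, not omlx's API):

```python
from collections import OrderedDict

class EnginePool:
    """Toy multi-model pool: at most `capacity` engines stay resident,
    and the least-recently-used one is evicted under memory pressure."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader                     # model name -> engine object
        self.resident = OrderedDict()

    def acquire(self, name: str):
        if name in self.resident:
            self.resident.move_to_end(name)      # mark as recently used
            return self.resident[name]
        engine = self.loader(name)               # load weights (stubbed here)
        self.resident[name] = engine
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)    # evict the LRU engine
        return engine

pool = EnginePool(capacity=2, loader=lambda name: f"<engine:{name}>")
pool.acquire("gemma-4-31b")
pool.acquire("embedding-small")
pool.acquire("vlm-8b")  # third load evicts gemma-4-31b, the least recently used
```

A real pool would also honor the TTL shown in the diagram and free GPU buffers on eviction; the ordering logic is the part that carries over.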

It is, in short, a production-shaped local inference server, drop-in compatible with both the OpenAI and Anthropic APIs, wrapped in a menu-bar app any non-engineer can run. That combination didn't exist eight weeks ago.

The menu-bar packaging is the retail tell. Local AI is no longer hobbyist. It is, at minimum, installable by someone who would also install Slack. That's a different distribution surface than llama.cpp's CLI.

The Runtime: MLX as "PyTorch for Mac"

Underneath omlx is MLX, Apple's open-source ML framework, and underneath that is the unified-memory architecture that has made Apple Silicon disproportionately good at running large models on consumer hardware. The pitch this week came from Prince Canuma (Arcee, MLX contributor) at AI Engineer, framing MLX as "PyTorch for Mac": real-time vision, sub-100ms TTS, omni image+audio, video generation, all on Apple Silicon:

Watch on YouTube: Prince Canuma, MLX Genmedia at AI Engineer

Watch the full talk on YouTube →

This matters because the runtime story is the part that compounds. Two years ago, "ML on Apple Silicon" meant porting a PyTorch model via a CoreML conversion that lost fidelity at every step. Today it means a first-party Apple framework that the most-starred local-inference servers target natively. The HuggingFace Hub now filters models by GGUF/MLX as a first-class facet. MLX is no longer the alternative path; for the macOS developer surface, it is the path.

The Industry Tell: Ollama Officially Migrates to MLX

The signal that puts this beyond enthusiast territory came from Ollama's official account on X:

Ollama official tweet: 'Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework' (the official MLX migration announcement)

View the original post on X →

Ollama, the project that brought local LLMs to the "I just want to run it" crowd, publicly aligning with MLX is the bellwether move. Ollama doesn't ship a runtime change to chase a fashionable framework. They ship a runtime change because their users are spending real time on Apple Silicon and getting demonstrably better tokens-per-second on MLX paths. That decision is downstream of usage data, not aesthetics. When the default-installation experience for local LLMs migrates to MLX, the macOS developer surface is locked in.

Two days earlier, HuggingFace CEO Clement Delangue announced a local-first push: GGUF/MLX filtering on the Hub across 60,000+ compatible models, plus native trace visualization, plus a "Buckets" S3-like storage layer with Xet dedup explicitly framed as "Git was the wrong abstraction for ML data." Combined: the ecosystem rails are now optimized for local-first model distribution in a way they weren't a quarter ago.

What the Community Is Saying

The practitioner verdicts on omlx and Gemma 4 31B are unusually consistent.

Ivan Fioravanti tweet: 'oMLX is working really well as single machine inference engine for coding agents! Caching is managed perfectly... and oQ quantization delivers great results'

View the original post on X →

Ivan Fioravanti, one of the most rigorous MLX benchmarkers on X and the person who routinely posts inference-server comparison tables, wrote:

"oMLX is working really well as single machine inference engine for coding agents! Caching is managed perfectly (it can use a ton of disk space, be aware!) and oQ quantization delivers great results."

His broader thread on MLX inference engines is candid about the state of the art ("benchmarking is a real mess at the moment… I'm finding many issues under heavy load, wrong perf stats, wrong management of cache mixing parts of prompts from other sessions, OOM, bugs"). omlx stands out in that environment for actually working under coding-agent load. That's a higher bar than "passes a synthetic benchmark." It's the bar a developer tool has to clear to be on every coworker's machine in six months.

Brian Roemmele posted the omlx install workflow as a productivity recommendation. The Chinese-language tech press flagged omlx specifically for its tiered KV cache. r/LocalLLaMA threads on Gemma 4 31B have been consistent: the model finally clears the "actually useful" bar on consumer Macs.

There's also a counter-voice worth flagging. The third comment on the HN #1 thread pushed back: frontier-model capability is still restricted, and previous tools already solved many of the small structured tasks the local-AI argument leans on. Fair. The pattern matters more than any single tool: the gap between "local + good enough" and "frontier API" is closing from below, and the distribution surface for local (menu-bar apps, official Ollama/MLX integration, HF filters) has improved more in 2026 than in the prior two years combined.

What This Means for the API Labs

The convergence report flags a direct disagreement between two clusters this week. Worth reading the two side-by-side.

⚠️ The Anthropic thesis: $1–1.2T valuation post-Q1, 80x annualized, Polymarket pricing Anthropic at 84% best-model-end-of-May and 95% best-coding-model. The API-margin story holds if cloud inference remains structurally superior for the median enterprise task. The local-AI thesis (this piece): if Gemma 4 31B on M4 is genuinely the new baseline, and if omlx-class substrates let any vendor ship local inference inside their product without their users noticing, then the median enterprise task may not require cloud inference at all.

Software vendors stop paying token prices for free-text autocomplete and structured classification. The cloud-API tier compresses to the work that genuinely needs it: long-horizon agents, multi-step reasoning, multimodal generation at the frontier. The cleanest read on which side is right will come from the next Anthropic or OpenAI pricing move. If they cut, they believe the local stack is real and they are defending share. If they hold, they believe the local stack tops out below the workload that matters. The pricing is the proxy for the bet.

Either way, the option value of building on a local-first substrate today has gone up. Twelve months ago that was a constraint. Today it's an architecture choice with material commercial upside. (Related: our deep-dive on the iPhone 17 Pro running a 400B LLM via SSD-to-GPU streaming; same substrate logic, different device class.)

How to Try It This Weekend (5 commands)

For an M-series Mac with 24 GB+ unified memory:

# 1. Install omlx (Homebrew tap or download .dmg from Releases)
brew install --cask omlx

# 2. Launch from menu bar (or `open -a omlx`). The icon lives in your status bar.

# 3. In the omlx admin dashboard (http://localhost:8000/admin),
#    search HuggingFace and one-click-download:
#      mlx-community/gemma-4-31b-it-4bit
#    Loads in ~30s; uses ~18-20 GB unified memory.

# 4. Point your tool at the local OpenAI-compatible endpoint:
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=sk-local-anything

# 5. Drive it from your existing coding agent (Claude Code, Cursor, Aider, etc.)
#    omlx is drop-in compatible with both OpenAI and Anthropic API shapes.

That's it. The first prompt is slow (model load + cold KV cache). The second is interactive. The third, if you're hitting the same repo tree, comes back near-instant from the cold-SSD KV cache restore. The retail experience is now as fast as the cloud one for the warm path, and effectively free for everything after the disk fills.
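If a tool doesn't honor OPENAI_BASE_URL, the endpoint is plain HTTP and you can hit it directly. A stdlib-only sketch that builds the request; the model ID and port are the assumptions from the walkthrough above, and any OpenAI-compatible server should accept this request shape:

```python
import json
import urllib.request

def local_chat(prompt: str,
               base: str = "http://localhost:8000/v1",
               model: str = "mlx-community/gemma-4-31b-it-4bit") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-local-anything",  # local servers ignore the key
        },
    )

# To actually send it (requires a running server):
#   with urllib.request.urlopen(local_chat("Summarize this diff: ...")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```

No SDK, no API key that means anything, no network egress: the whole round-trip stays on the laptop.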

If you hit a wall, the omlx repo has a thorough README, an active discussion on the ml-explore/mlx repo, and a growing X community of practitioners.

The Bottom Line

Local AI didn't become the default this week. But the three things that have to be true for it to become the default (a credible model floor, a polished substrate, and an industry-level distribution signal) were all true in the same news cycle for the first time. Gemma 4 31B is the floor. omlx + MLX is the substrate. Ollama publicly migrating to MLX is the distribution signal.

The interesting question stopped being whether you can run a serious model on your laptop. It is now why your favorite software product is still paying API fees for tasks the laptop can handle just as well. That question is now loud enough to make the front page of Hacker News.

Watch what Anthropic and OpenAI price next. That's the tell.

