Apple Foundation Models: White Paper vs Real API — What Actually Matches Up
So I was thinking about doing a simple bubble classification experiment with Apple’s Foundation Models (FM), and the first thing that becomes obvious is this: the white paper and the actual API surface are describing the same system, but at very different layers of reality.
The technical report is a hopeful read: it describes the model's full capability. But then I exported the FoundationModels top-level Swift doc with developer typings, and it looked sad, since it exposes only a slice of the advertised functionality.
1. What the white paper actually claims
The paper describes a fairly ambitious multimodal system:
- native support for text + image inputs
- vision encoders integrated into the model stack
- reasoning over images, multi-image inputs, and mixed modality prompts
- structured tool-use and JSON-style outputs
- on-device inference with optimized small models (~3B class scale)
- grounding tasks (OCR, region reasoning, visual understanding)
What else could you want? The 3B swiss-army knife is compared head-on against Qwen 2.5, which is impressive, since Qwen is the smartest model you can run on llama.cpp in the sub-9B parameter form factor. (P.S. probably the best alternative at the moment.)
Me, hopeful that image classification is on the table, going by the paper's perspective; the pipeline (and the output schema I'd want, sketched below) would be:
- image → encoded tokens → reasoning → structured output
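The structured-output end of that pipeline, at least, maps onto something real. Here is a minimal sketch of the result type I'd want back, assuming the @Generable and @Guide macros behave as documented; BubbleClassification and its field names are mine, not Apple's.

```swift
import FoundationModels

// Sketch of the structured output for a bubble classifier.
// The type and field names are hypothetical; @Generable / @Guide are real
// FoundationModels macros, but verify exact spellings against your SDK.
@Generable
struct BubbleClassification {
    @Guide(description: "One of: soap, air, speech, other")
    var label: String

    @Guide(description: "Confidence between 0 and 1")
    var confidence: Double
}
```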
2. What the FoundationModels API actually exposes
The API is:
- strongly structured around prompt → response: check
- optimized for deterministic, schema-driven responses: check
- integrated with Apple system frameworks rather than raw model control: check
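And to be fair, the text-only path is pleasant. A minimal round trip reusing the BubbleClassification type from above; the parameter labels here (the instructions: initializer, respond(to:generating:)) are paraphrased from the public docs, so treat them as assumptions to check against the current SDK.

```swift
import Foundation
import FoundationModels

enum ClassifierError: Error { case modelUnavailable }

// Text in, schema-constrained struct out: this part the framework does well.
func classifyDescription(_ text: String) async throws -> BubbleClassification {
    // Bail out early if the on-device model isn't available on this machine.
    guard case .available = SystemLanguageModel.default.availability else {
        throw ClassifierError.modelUnavailable
    }

    let session = LanguageModelSession(
        instructions: "You classify short descriptions of bubbles."
    )
    let response = try await session.respond(
        to: "Classify this description: \(text)",
        generating: BubbleClassification.self
    )
    return response.content
}
```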
But as soon as I hit CMD+F and search for "Image", or anything close to pixel buffers: 0 results. It's just not there as of May 2026, macOS Tahoe 26.2 (see the deliberately hypothetical sketch after this list):
- no explicit image input type in the public API
- no direct “vision prompt” interface
- no raw multimodal session builder
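For contrast, this is roughly the call the white paper's pipeline made me expect to write. Everything marked hypothetical below does not exist in the public API; it's here purely to show the gap, not to document anything real.

```swift
import CoreGraphics
import FoundationModels

// HYPOTHETICAL: the image-carrying parts of this do NOT exist in the public
// FoundationModels API as of macOS Tahoe 26.2. LanguageModelSession is real;
// the `attaching:` image parameter is invented to illustrate what's missing.
func classifyBubble(in image: CGImage) async throws -> BubbleClassification {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Classify the bubble shown in the attached image.",
        attaching: image,                       // <- no such parameter exists
        generating: BubbleClassification.self
    )
    return response.content
}
```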