Apple Foundation Models: White Paper vs Real API — What Actually Matches Up
So I was thinking about doing a simple bubble classification experiment with Apple’s Foundation Models (FM), and the first thing that becomes obvious is this: the white paper and the actual API surface are describing the same system, but at very different layers of reality.
The technical report is a hopeful read: it describes the model's full capability. But then I exported the FoundationModels top-level Swift doc with developer typings, and it looked sad, since it exposes only a slice of the advertised functionality.
1. What the white paper actually claims
The paper describes a fairly ambitious multimodal system:
- native support for text + image inputs
- vision encoders integrated into the model stack
- reasoning over images, multi-image inputs, and mixed modality prompts
- structured tool-use and JSON-style outputs
- on-device inference with optimized small models (~3B class scale)
- grounding tasks (OCR, region reasoning, visual understanding)
What else could you want? The 3B swiss-army knife is compared head-on against Qwen 2.5, which is impressive, since Qwen is the smartest model you can run on llama.cpp in the sub-9B parameter form factor. (P.S. probably the best alternative at the moment.)
Me, hopeful that image classification is on the table, going by the paper's perspective; the pipeline (and the output schema I'd want, sketched below) would be:
- image → encoded tokens → reasoning → structured output
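The structured-output end of that pipeline, at least, maps onto something real. Here is a minimal sketch of the result type I'd want back, assuming the @Generable and @Guide macros behave as documented; BubbleClassification and its field names are mine, not Apple's.

```swift
import FoundationModels

// Sketch of the structured output for a bubble classifier.
// The type and field names are hypothetical; @Generable / @Guide are real
// FoundationModels macros, but verify exact spellings against your SDK.
@Generable
struct BubbleClassification {
    @Guide(description: "One of: soap, air, speech, other")
    var label: String

    @Guide(description: "Confidence between 0 and 1")
    var confidence: Double
}
```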
2. What the FoundationModels API actually exposes
The API is:
- strongly structured around prompt → response: check
- optimized for deterministic, schema-driven responses: check
- integrated with Apple system frameworks rather than raw model control: check
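And to be fair, the text-only path is pleasant. A minimal round trip reusing the BubbleClassification type from above; the parameter labels here (the instructions: initializer, respond(to:generating:)) are paraphrased from the public docs, so treat them as assumptions to check against the current SDK.

```swift
import Foundation
import FoundationModels

enum ClassifierError: Error { case modelUnavailable }

// Text in, schema-constrained struct out: this part the framework does well.
func classifyDescription(_ text: String) async throws -> BubbleClassification {
    // Bail out early if the on-device model isn't available on this machine.
    guard case .available = SystemLanguageModel.default.availability else {
        throw ClassifierError.modelUnavailable
    }

    let session = LanguageModelSession(
        instructions: "You classify short descriptions of bubbles."
    )
    let response = try await session.respond(
        to: "Classify this description: \(text)",
        generating: BubbleClassification.self
    )
    return response.content
}
```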
But as soon as I hit CMD+F and search for "Image", or anything close to pixel buffers: 0 results. It's just not there as of May 2026, macOS Tahoe 26.2 (see the deliberately hypothetical sketch after this list):
- no explicit image input type in the public API
- no direct “vision prompt” interface
- no raw multimodal session builder
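For contrast, this is roughly the call the white paper's pipeline made me expect to write. Everything marked hypothetical below does not exist in the public API; it's here purely to show the gap, not to document anything real.

```swift
import CoreGraphics
import FoundationModels

// HYPOTHETICAL: the image-carrying parts of this do NOT exist in the public
// FoundationModels API as of macOS Tahoe 26.2. LanguageModelSession is real;
// the `attaching:` image parameter is invented to illustrate what's missing.
func classifyBubble(in image: CGImage) async throws -> BubbleClassification {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Classify the bubble shown in the attached image.",
        attaching: image,                       // <- no such parameter exists
        generating: BubbleClassification.self
    )
    return response.content
}
```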