Jay

Originally published at futureagi.com

I Benchmarked the Voice AI Stack in May 2026: What Actually Holds Up in Production

A practical May 2026 breakdown of the best STT, TTS, and voice agent platforms for production LLM voice systems, with latency, cost, and orchestration trade-offs.

Voice agents finally feel like an engineering problem, not a research demo.

The pieces are now fast enough to compose into something that feels natural in production. Streaming STT can sit under 300ms, first audio can show up under 100ms, and fast LLMs can stay in the same budget if you pick carefully.

What changed for me over the last few weeks was not any single model. It was seeing every layer mature at roughly the same time.

This post is my attempt to sort the stack by what actually matters in production, which starts with the shortest possible answer.

TL;DR

If I had to pick one practical stack right now, I would start with Deepgram Nova-3 plus Flux for STT, Cartesia Sonic Turbo for TTS, GPT-5 mini or Gemini 3.1 Flash for the LLM, and Retell AI for orchestration.

That gets you into sub-700ms territory today without forcing a custom build on day one.

Here is the short version by layer:

  • Streaming STT: Deepgram Nova-3
  • STT with built-in intelligence: AssemblyAI Universal-2
  • Highest batch accuracy: Google Cloud Chirp
  • Open-source STT: Whisper Large V3
  • Accent and education-focused STT: ElevenLabs Scribe
  • Turn-taking: Deepgram Flux
  • Lowest-latency TTS: Cartesia Sonic Turbo
  • Best voice quality and cloning: ElevenLabs v3 Multilingual
  • Best instructable voice: OpenAI gpt-4o-mini-tts and gpt-realtime
  • Best emotion control: Hume Octave
  • Best long-form TTS: PlayHT
  • Most practical default voice platform: Retell AI
  • Scale-first voice platform: Vapi
  • Voice-quality-first platform: ElevenLabs Conversational
  • Self-hosted bundle: Deepgram Voice Agent
  • Outbound phone volume: Bland AI

Those picks make more sense once you look at what changed across STT, TTS, and orchestration.

Why voice AI feels different now

For most of 2024 and 2025, voice AI felt uneven.

Whisper Large carried open-source STT, ElevenLabs carried high-end TTS, and everything between them still felt fragile in real deployments.

That changed in 2026 because the stack split into clear optimization axes instead of one fuzzy quality metric.

On the STT side, the trade-off is now mostly between streaming latency, transcript intelligence, and language coverage.

On the TTS side, the split is even cleaner. Latency, naturalness, and emotion control are not the same thing, and treating them as one number is how teams burn months.

On the platform side, orchestration is no longer the hidden tax people ignore until launch week. It is the thing that decides whether your great demo survives production traffic.

That is why I would evaluate the stack one layer at a time, starting with STT.

Best STT models in May 2026

Deepgram Nova-3

If the job is streaming transcription for a voice agent, this is the default I would reach for first.

The main reason is simple. Streaming latency is usually the binding constraint, and Nova-3 is tuned for that shape of workload.

Specs:

  • Streaming WER: 6.84% median across benchmarks
  • Latency: sub-300ms streaming
  • Languages: 30+ for streaming, 50+ for batch
  • Pricing: $0.0077 per minute streaming, $0.0043 per minute batch
  • Pairs naturally with Deepgram Flux for end-of-turn detection
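
To make the streaming shape concrete, here is a minimal sketch against Deepgram's live websocket endpoint. Treat it as a sketch, not production code: the endpoint and response shape follow Deepgram's documented live API as I understand it, and the API key plus `audio_chunks` (any async iterator of raw audio bytes) are placeholders you supply.

```python
# Minimal Deepgram live-streaming sketch over raw websockets.
import asyncio
import json
import websockets

DG_URL = "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"

async def stream_transcripts(audio_chunks):
    # websockets<14 uses extra_headers; 14+ renamed it to additional_headers
    headers = {"Authorization": "Token YOUR_DEEPGRAM_API_KEY"}
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            async for chunk in audio_chunks:  # raw audio bytes from your source
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    tag = "final" if result.get("is_final") else "interim"
                    print(f"[{tag}] {alt['transcript']}")

        await asyncio.gather(sender(), receiver())
```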

Best for:

  • Production voice agents
  • Real-time captioning
  • Live conversational AI

Skip it if:

  • You care more about batch accuracy across broad language coverage
  • You need speech intelligence bundled directly on top of the transcript

That brings up the next case, which is when the transcript is not the real product.

AssemblyAI Universal-2

I think of Universal-2 as the pick for teams that need the transcript plus interpretation.

If your product needs summarization, entity extraction, or sentiment analysis as part of the pipeline, this is where it starts to make sense.

Specs:

  • Streaming WER: 14.5%
  • Bundled intelligence: summarization, entity detection, sentiment
  • Pricing: about $0.0025 per minute base, with additional feature cost
  • Languages: 99+
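
A hedged sketch with the assemblyai Python SDK showing how the bundled intelligence rides along with the transcript. The three config flags are the documented toggles as I know them, but verify field names against the current SDK before relying on them.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

config = aai.TranscriptionConfig(
    summarization=True,       # bundled summary
    entity_detection=True,    # names, dates, account numbers
    sentiment_analysis=True,  # per-sentence sentiment
)
transcript = aai.Transcriber().transcribe("support_call.mp3", config)

print(transcript.summary)
for ent in transcript.entities or []:
    print(ent.entity_type, ent.text)
for result in transcript.sentiment_analysis or []:
    print(result.sentiment, result.text)
```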

Best for:

  • Support call intelligence
  • Compliance review
  • Audit workflows
  • Post-call analytics

Skip it if pure transcript quality is the thing you care about most.

That trade-off matters even more when the workload is asynchronous.

Google Cloud Chirp

If I can wait seconds instead of milliseconds, I care less about streaming behavior and more about broad accuracy and language coverage.

That is the case where Google Cloud Chirp becomes interesting.

Specs:

  • Batch WER: 11.6%
  • Languages: 125+
  • Pricing: varies by tier and language

Best for:

  • Long-form transcription
  • Multi-language batch pipelines
  • Offline transcription jobs

Skip it for live voice agents.

Once you step out of managed APIs, the open-source baseline is still very relevant.

Whisper Large V3

Whisper Large V3 is still the open-source STT reference point.

I have not seen many teams fully replace it unless they specifically need better streaming behavior or managed platform economics.

Specs:

  • WER: 7.4% average across mixed benchmarks
  • Parameters: 1.55B
  • Languages: 99+
  • Self-hosted on commodity inference hardware
  • Usually stronger on European languages than on rarer African and Asian languages
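
The self-hosted baseline is also the shortest code in this post. This uses the openai-whisper package (`pip install openai-whisper`); large-v3 wants a GPU with roughly 10GB of VRAM for comfortable inference.

```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("meeting.wav", language="en")
print(result["text"])

# Segments carry timestamps, useful for captioning or alignment.
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```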

Best for:

  • Self-hosted deployments
  • Privacy-sensitive workloads
  • Edge inference
  • High-volume usage where GPU economics beat API pricing

Skip it if you need sub-300ms streaming or built-in intelligence features.

That still leaves one specialist lane where general-purpose STT is not the best fit.

ElevenLabs Scribe

Scribe makes sense when accent normalization and pronunciation feedback are core to the product.

That is why it fits education, accessibility, and speech feedback workflows better than generic live-agent stacks.

Specs:

  • WER: as low as 5% on excellent-accuracy languages with diarization
  • Strong accent normalization
  • Included through ElevenLabs subscription tiers

Best for:

  • Language learning
  • Accessibility
  • Speaker-labeled transcription
  • Accent-aware products

Skip it if you just need fast general-purpose streaming STT.

There is also one layer people still underestimate, and it matters more than a small WER delta in real calls.

Deepgram Flux

Flux is not just another STT model.

It solves the question of when the user is done speaking and when the system should answer.

In real voice agents, that matters more than many teams expect. I have seen cleaner turn-taking improve perceived quality more than squeezing out a point or two on WER.

Best for:

  • Any turn-based voice agent

Skip it for:

  • Dictation
  • Plain transcription
  • Non-conversational speech workflows
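
Flux's end-of-turn model is proprietary, so there is nothing faithful to sketch. What follows is the naive silence-timeout baseline it replaces, just to show where end-of-turn logic sits in the loop; every name here is illustrative, not a Deepgram API.

```python
import time

class NaiveEndOfTurn:
    """Silence-timeout baseline: declare the turn over after silence_ms of quiet."""

    def __init__(self, silence_ms: int = 700):
        self.silence_ms = silence_ms
        self.last_speech = time.monotonic()

    def on_transcript(self, text: str) -> None:
        # Call this on every interim transcript; any speech resets the clock.
        if text.strip():
            self.last_speech = time.monotonic()

    def turn_is_over(self) -> bool:
        return (time.monotonic() - self.last_speech) * 1000 > self.silence_ms
```

The problem with this baseline is the trade-off baked into `silence_ms`: a short timeout interrupts slow speakers, a long one feels sluggish on short answers. A semantic end-of-turn model is what removes that trade-off.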

Once turn-taking is in place, the next hard decision is TTS, and that is where latency starts dominating the stack budget.

Best TTS models in May 2026

The TTS market is easier to reason about if you stop looking for one winner.

I would break it into three questions. Do I care most about latency, naturalness, or emotion control?

Cartesia Sonic Turbo

If the target is sub-500ms round-trip turn latency, Cartesia Sonic Turbo is the most important TTS option in the category.

The 40ms time-to-first-audio number changes what is possible for the rest of the system.

Specs:

  • TTFA: 40ms on Turbo, 90ms on standard
  • Languages: 15+
  • Voice cloning: supported
  • Streaming output

Best for:

  • Real-time interactive agents
  • Telephony
  • Any workflow where delay changes the user experience

Skip it if the product cares more about voice fidelity than raw speed.

This is the model that surprised me the most. The latency difference is not cosmetic. It changes whether the rest of your system even has room to breathe.
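
The honest way to handle TTFA claims like this is to measure them on your own traffic. Here is a vendor-agnostic probe that times the first chunk from any streaming HTTP TTS endpoint; the URL, headers, and payload are placeholders for whatever your provider documents.

```python
import time
import requests

def measure_ttfa(url: str, headers: dict, payload: dict) -> float:
    """Return milliseconds from request start to the first audio chunk."""
    start = time.monotonic()
    with requests.post(url, headers=headers, json=payload, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                return (time.monotonic() - start) * 1000
    raise RuntimeError("stream ended with no audio")
```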

ElevenLabs v3 Multilingual

ElevenLabs still owns the voice-quality and cloning conversation.

If the voice itself is part of the product, this is still the strongest default.

Specs:

  • Languages: 32+
  • TTFA: sub-100ms
  • Voice cloning: best-in-class
  • Strong emotional depth across multiple languages
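
A minimal sketch with the elevenlabs Python SDK. The voice_id is a placeholder, and the model_id shown is the multilingual v2 identifier I know to exist; substituting the current v3 model id is an assumption you should check against the model list.

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",           # placeholder: any voice from your library
    model_id="eleven_multilingual_v2",  # assumption: swap in the current v3 id
    text="The archive door sealed itself behind her.",
)
with open("line.mp3", "wb") as f:
    for chunk in audio:                 # convert() streams back audio bytes
        f.write(chunk)
```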

Best for:

  • Character voices
  • Creator content
  • Audiobooks
  • Branded voice products

Skip it if you need the lowest possible first-audio latency.

That leads into a different category entirely, which is not just quality but control.

OpenAI gpt-4o-mini-tts and gpt-realtime

This is the instructable voice pick.

If you want the LLM to control pacing, tone, and character directly through prompt instructions, this path is unusually flexible.

Specs:

  • TTFA: around 200ms
  • Languages: 50+
  • Natural-language voice control through GPT-4o audio features

Best for:

  • Context-aware voice agents
  • Dynamic character control
  • Teams already using the OpenAI audio stack

Skip it if your latency target is below 100ms.

I like this category because it makes voice behavior part of prompt design, but you still pay for that flexibility in latency.
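
A minimal sketch with the openai Python SDK as I understand it. The `instructions` field is what makes this category different: delivery is steered in plain language rather than SSML or voice settings.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Your refund went through this morning.",
    instructions="Warm, unhurried support agent. Soften the word 'refund'.",
) as response:
    response.stream_to_file("reply.mp3")
```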

Hume Octave

Hume Octave is the emotion-focused TTS option.

If prosody and emotional expression are the main job, this is where I would look before general-purpose models.

Specs:

  • TTFA: around 150ms
  • Languages: 20+
  • Tuned for emotion accuracy

Best for:

  • Mental health products
  • Character systems
  • Emotion-sensitive voice experiences

Skip it if cost or low latency is the top constraint.

There is also a separate lane for long-form speech, where turn speed matters less than sustained output quality.

PlayHT

PlayHT fits long-form conversational and narrated audio better than ultra-fast turn-by-turn systems.

That matters for podcasts, audiobook workflows, and any agent that has to speak for a while without falling apart.

Specs:

  • TTFA: around 250ms
  • Languages: 30+
  • Tuned for sustained quality over longer outputs

Best for:

  • Podcasts
  • Audiobooks
  • Long-running voice content

There is one more category worth keeping in mind if you need full control over the deployment surface.

Sesame Maya and Miles

Sesame is one of the more useful open-source TTS options right now.

I would mostly think about it for English-heavy self-hosted deployments where commercial TTS pricing does not make sense.

Best for:

  • Self-hosted English TTS
  • Privacy-sensitive systems
  • Cost-sensitive production workloads

Skip it if you need strong multilingual production quality.

Once STT and TTS are settled, the next choice is whether to wire the whole thing yourself or let a platform eat the orchestration pain.

Best voice agent platforms in May 2026

If I do not need a custom orchestration stack on day one, I would start with a platform.

You pay for that abstraction, but you also skip months of retry logic, barge-in handling, monitoring, and call quality debugging.

Retell AI

This is the most practical default for most teams.

The reason is not that it wins every metric. It is that the trade-off looks sane across latency, compliance, and engineering effort.

Specs:

  • End-to-end latency: about 600 to 780ms across third-party benchmarks
  • Pricing: $0.07 per minute
  • Compliance: HIPAA (included) and SOC 2
  • Builder: no-code builder plus developer SDK

Best for:

  • Most production voice-agent use cases
  • Teams that need sub-700ms, not sub-300ms
  • Teams that want managed infrastructure without extra compliance friction

Skip it if:

  • You need sub-500ms end-to-end
  • You need self-hosting

This is where I would start unless scale or control clearly pushes me elsewhere.

Vapi

Vapi is the scale pick.

If the system is heading toward very high call volume or you need multi-channel behavior across voice, SMS, and chat, Vapi becomes much more compelling.

Specs:

  • Volume: 300M+ cumulative calls
  • Uptime: 99.99% SLA
  • Average latency: sub-500ms
  • Channels: voice, SMS, chat

Best for:

  • Large-scale production deployments
  • Multi-channel applications
  • Teams optimizing for infrastructure behavior at scale

Skip it if your traffic is still small enough that managed simplicity matters more than control.

That same split shows up again when voice quality, not orchestration breadth, is the center of the product.

ElevenLabs Conversational

This makes sense when the voice itself is the product.

If I were building a premium voice-first consumer experience, I would at least test this path seriously.

Specs:

  • Built on ElevenLabs v3 TTS
  • Sub-100ms TTS first-audio
  • End-to-end latency depends on STT and LLM choices

Best for:

  • Character voice agents
  • Premium consumer voice products
  • Products where voice quality is part of the brand

Skip it if you need the broadest enterprise feature set or the lowest end-to-end latency.

That leaves the control-heavy path, where self-hosting and bundled stack behavior matter more than polish.

Deepgram Voice Agent

This is the self-hosted or control-first option.

You get Nova-3, Flux, LLM routing, and TTS bundled together in a managed or self-hosted path.

Specs:

  • End-to-end latency: sub-400ms with Flux
  • Bundled components: STT, Flux, LLM routing, TTS
  • Self-hosted deployment: supported

Best for:

  • Teams that want stack control
  • Self-hosted compliance
  • Predictable bundled pricing

Skip it if you want the simplest managed platform experience.

There is still one platform category that stays fairly specialized.

Bland AI

Bland AI fits structured outbound phone workloads.

If the real job is outbound calling at scale, that specialization matters.

Best for:

  • Outbound phone agents
  • Sales operations
  • Structured campaign workflows

Skip it if you are focused on inbound conversational agents.

If you do not want a managed platform at all, open-source is still viable, but the trade-off is very real.

Open-source voice agent options

If you want to own the STT, LLM, and TTS loop yourself, four names come up often:

  • Pipecat
  • LiveKit Agents
  • Daily Bots
  • Cartesia Line

LiveKit Agents is especially relevant if you want first-party adapters across major STT and TTS providers.

The trade-off is simple. You skip platform fees, but you own orchestration, retries, barge-in handling, and the pager when production gets weird.
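
To see what "owning the loop" actually means, here is the skeleton reduced to its moving parts. `stt_stream`, `llm_reply`, `tts_stream`, and `play_audio` are placeholders for whichever providers you wire in; the point is the failure surface, not the calls.

```python
import asyncio

async def agent_loop(stt_stream, llm_reply, tts_stream, play_audio):
    async for user_turn in stt_stream():       # yields text at each end-of-turn
        try:
            reply = await asyncio.wait_for(llm_reply(user_turn), timeout=2.0)
        except asyncio.TimeoutError:
            reply = "Sorry, could you say that again?"  # your retry policy, your pager
        async for audio_chunk in tts_stream(reply):
            # Barge-in handling belongs here: cancel playback when STT hears speech.
            await play_audio(audio_chunk)
```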

The moment you own the full loop, latency math becomes non-negotiable.

Latency budget, the part that decides everything

Natural conversation has a hard latency target.

Sub-300ms is the threshold people chase for truly natural turn-taking, and sub-700ms is the practical bar many production systems can still get away with.

The stack budget usually looks like this:

  • STT: 200 to 300ms
  • LLM inference: 100 to 300ms
  • TTS first audio: 40 to 200ms
  • Orchestration overhead: 50 to 100ms

If you want something close to sub-300ms, the TTS choice becomes structural.

That is why Cartesia Sonic Turbo matters so much. Without a 40ms-class TTS path, the rest of the budget gets tight very fast.

A practical sub-700ms stack usually looks more forgiving. Something like Deepgram Nova-3 plus a fast LLM plus ElevenLabs Flash-class TTS plus normal orchestration overhead is already workable.
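
The sanity check is just arithmetic. Here it is as a sketch using midpoints from the budget above; swap in your own measured numbers.

```python
budget_ms = {
    "stt": 250,              # 200-300ms streaming STT
    "llm_first_token": 200,  # 100-300ms for a fast LLM
    "tts_first_audio": 40,   # 40-200ms depending on the TTS pick
    "orchestration": 75,     # 50-100ms overhead
}
total = sum(budget_ms.values())  # 565ms with these midpoints
print(f"{total}ms total -> {'within' if total <= 700 else 'over'} the 700ms bar")
```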

That is also why picking by model quality screenshots alone leads to bad production decisions.

What 100K minutes per month really costs

Per-minute pricing hides a lot.

By the time a real production stack is running, the cost model depends on whether you chose managed flat pricing, bring-your-own providers, native audio APIs, self-hosting, or an enterprise bundle.

At around 100K minutes per month, the rough shapes look like this:

  • Retell managed: about $7,000
  • Vapi with BYO providers: about $11,500 to $14,000
  • OpenAI Realtime API alone: about $15,000 to $25,000
  • Bland AI tiers: about $6,000 to $9,000
  • Self-hosted mix of Whisper, Claude API, Cartesia, and Pipecat: about $6,000 to $7,500 plus on-call cost
  • Deepgram Voice Agent enterprise: about $7,500 plus enterprise contract terms

The honest part is that total cost of ownership does not line up cleanly with list price.

Under 100K minutes, a flat managed option often wins because engineering time, retry tuning, and compliance work are real costs. Above 1M minutes, bring-your-own or self-hosted economics can flip in your favor.
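
The arithmetic behind the managed number is straightforward. A sketch using the list prices quoted in this post; remember these hide volume discounts, telephony fees, and engineering time.

```python
MINUTES = 100_000

costs = {
    "Retell managed ($0.07/min)": 0.07 * MINUTES,                     # $7,000
    "Deepgram Nova-3 streaming STT ($0.0077/min)": 0.0077 * MINUTES,  # $770
}
for stack, usd in costs.items():
    print(f"{stack}: ${usd:,.0f}/month")
```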

That cost picture only makes sense if you already know what kind of system you are trying to build.

How I would choose

Choose Retell AI if:

  • You want the most practical default
  • Sub-700ms is fine
  • HIPAA matters
  • You want managed infrastructure

Choose Vapi if:

  • You are operating at serious scale
  • You need multi-channel support
  • Infrastructure behavior matters more than simplicity

Choose Deepgram Voice Agent if:

  • You want self-hosting
  • You want tighter stack control
  • Turn-taking and latency are central

Choose ElevenLabs Conversational if:

  • Voice quality is part of the product itself
  • Cloning fidelity matters

Choose Bland AI if:

  • The job is outbound calling
  • The workflow is operational and structured

Roll your own if:

  • You need a very specific stack
  • You need tighter compliance or deployment control
  • You can afford the engineering overhead

That choice gets easier once you avoid the mistakes that keep showing up across voice teams.

Common mistakes I keep seeing

  1. Picking TTS by sound quality and ignoring latency. A beautiful voice does not help if the turn feels slow.

  2. Treating turn-taking as optional. Generic STT is not enough for a good conversational agent.

  3. Ignoring LLM latency. A fast STT path plus a slow model still gives you a slow voice system.

  4. Building from scratch too early. Custom stacks make sense, but only when the platform path is clearly blocking you.

  5. Trusting vendor numbers without replaying your own traffic. Benchmarks are useful, but your accents, noise patterns, prompts, and call flows are what matter.

This is also where evaluation and observability become more important than most teams expect.

The part after launch

Voice agents fail for the same reasons text agents fail, plus a few extra ones.

You still get hallucinations, prompt contamination, bad tool calls, and retry loops. On top of that, you also get accent edge cases, interruptions, barge-in failures, and sentiment drift.

In my own work, I care less about the launch demo and more about whether the system survives real traffic.

That usually means four things:

  • Simulation before release, including accents, interruptions, background noise, and ambiguous prompts
  • Evals that score groundedness, hallucination risk, tool-call accuracy, and voice-specific behavior
  • Runtime guardrails that are fast enough not to blow the latency budget
  • Error clustering so repeated failures show up as patterns, not scattered tickets (a minimal version is sketched below)
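
Error clustering does not need to start sophisticated. A minimal sketch, assuming failures arrive as free-text descriptions: normalize each one into a signature and count repeats. Real systems would embed and cluster, but this surfaces patterns on day one.

```python
import re
from collections import Counter

def signature(failure: str) -> str:
    s = failure.lower()
    s = re.sub(r"\d+", "<num>", s)           # collapse ids and timestamps
    s = re.sub(r"[^a-z<> ]", "", s).strip()  # drop punctuation noise
    return s[:80]

failures = [
    "Tool call timeout after 2000ms on order 8812",
    "Tool call timeout after 2000ms on order 9903",
    "User barge-in ignored during TTS playback",
]
print(Counter(signature(f) for f in failures).most_common())
```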

For voice systems, accent handling, sentiment consistency, and tool-call accuracy are not edge metrics. They are production metrics.

That is why I would not pick a voice stack without thinking about failure analysis at the same time.

Sources I would check while validating a stack

For STT:

  • Deepgram Nova-3 announcement
  • Deepgram Flux docs
  • AssemblyAI Universal-2 research
  • Google Cloud Chirp docs
  • OpenAI Whisper Large V3 model card
  • ElevenLabs Scribe v2 Realtime

For TTS:

  • Cartesia Sonic Turbo docs
  • ElevenLabs Flash v2.5 latency docs
  • OpenAI Realtime API docs and pricing
  • Hume Octave docs
  • PlayHT docs

For platforms and benchmarks:

  • Vapi scale and uptime material
  • Retell latency material and telemetry API
  • Bland AI pricing
  • Deepgram Voice Agent pricing
  • ElevenLabs Conversational compliance requirements
  • Artificial Analysis AA-WER
  • ITU-T G.114

If you want the LLM side of the stack, read Best LLMs May 2026.

If you want the previous monthly snapshots, read Best Voice AI April 2026 and Best Voice AI March 2026.

The monthly changes are now happening fast enough that old voice stack assumptions do not stay true for long.
