Jay

Originally published at futureagi.com

I Benchmarked the Voice AI Stack in May 2026: What Actually Holds Up in Production

A practical May 2026 breakdown of the best STT, TTS, and voice agent platforms for production LLM voice systems, with latency, cost, and orchestration trade-offs.

Voice agents finally feel like an engineering problem, not a research demo.

The pieces are now fast enough to compose into something that feels natural in production. Streaming STT can sit under 300ms, first audio can show up under 100ms, and fast LLMs can stay in the same budget if you pick carefully.

What changed for me over the last few weeks was not any single model. It was seeing every layer mature at roughly the same time.

This post is my attempt to sort the stack by what actually matters in production, which starts with the shortest possible answer.

TL;DR

If I had to pick one practical stack right now, I would start with Deepgram Nova-3 plus Flux for STT, Cartesia Sonic Turbo for TTS, GPT-5 mini or Gemini 3.1 Flash for the LLM, and Retell AI for orchestration.

That gets you into sub-700ms territory today without forcing a custom build on day one.

Here is the short version by layer:

  • Streaming STT: Deepgram Nova-3
  • STT with built-in intelligence: AssemblyAI Universal-2
  • Highest batch accuracy: Google Cloud Chirp
  • Open-source STT: Whisper Large V3
  • Accent and education-focused STT: ElevenLabs Scribe
  • Turn-taking: Deepgram Flux
  • Lowest-latency TTS: Cartesia Sonic Turbo
  • Best voice quality and cloning: ElevenLabs v3 Multilingual
  • Best instructable voice: OpenAI gpt-4o-mini-tts and gpt-realtime
  • Best emotion control: Hume Octave
  • Best long-form TTS: PlayHT
  • Most practical default voice platform: Retell AI
  • Scale-first voice platform: Vapi
  • Voice-quality-first platform: ElevenLabs Conversational
  • Self-hosted bundle: Deepgram Voice Agent
  • Outbound phone volume: Bland AI

Those picks make more sense once you look at what changed across STT, TTS, and orchestration.

Why voice AI feels different now

For most of 2024 and 2025, voice AI felt uneven.

Whisper Large carried open-source STT, ElevenLabs carried high-end TTS, and everything between them still felt fragile in real deployments.

That changed in 2026 because the stack split into clear optimization axes instead of one fuzzy quality metric.

On the STT side, the trade-off is now mostly between streaming latency, transcript intelligence, and language coverage.

On the TTS side, the split is even cleaner. Latency, naturalness, and emotion control are not the same thing, and treating them as one number is how teams burn months.

On the platform side, orchestration is no longer the hidden tax people ignore until launch week. It is the thing that decides whether your great demo survives production traffic.

That is why I would evaluate the stack one layer at a time, starting with STT.

Best STT models in May 2026

Deepgram Nova-3

If the job is streaming transcription for a voice agent, this is the default I would reach for first.

The main reason is simple. Streaming latency is usually the binding constraint, and Nova-3 is tuned for that shape of workload.

Specs:

  • Streaming WER: 6.84% median across benchmarks
  • Latency: sub-300ms streaming
  • Languages: 30+ for streaming, 50+ for batch
  • Pricing: $0.0077 per minute streaming, $0.0043 per minute batch
  • Pairs naturally with Deepgram Flux for end-of-turn detection
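
To make the streaming shape concrete, here is a minimal sketch against Deepgram's live websocket endpoint. Treat it as a sketch, not production code: the endpoint and response shape follow Deepgram's documented live API as I understand it, and the API key plus `audio_chunks` (any async iterator of raw audio bytes) are placeholders you supply.

```python
# Minimal Deepgram live-streaming sketch over raw websockets.
import asyncio
import json
import websockets

DG_URL = "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"

async def stream_transcripts(audio_chunks):
    # websockets<14 uses extra_headers; 14+ renamed it to additional_headers
    headers = {"Authorization": "Token YOUR_DEEPGRAM_API_KEY"}
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def sender():
            async for chunk in audio_chunks:  # raw audio bytes from your source
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    tag = "final" if result.get("is_final") else "interim"
                    print(f"[{tag}] {alt['transcript']}")

        await asyncio.gather(sender(), receiver())
```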

Best for:

  • Production voice agents
  • Real-time captioning
  • Live conversational AI

Skip it if:

  • You care more about batch accuracy across broad language coverage
  • You need speech intelligence bundled directly on top of the transcript

That brings up the next case, which is when the transcript is not the real product.

AssemblyAI Universal-2

I think of Universal-2 as the pick for teams that need the transcript plus interpretation.

If your product needs summarization, entity extraction, or sentiment analysis as part of the pipeline, this is where it starts to make sense.

Specs:

  • Streaming WER: 14.5%
  • Bundled intelligence: summarization, entity detection, sentiment
  • Pricing: about $0.0025 per minute base, with additional feature cost
  • Languages: 99+
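
A hedged sketch with the assemblyai Python SDK showing how the bundled intelligence rides along with the transcript. The three config flags are the documented toggles as I know them, but verify field names against the current SDK before relying on them.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

config = aai.TranscriptionConfig(
    summarization=True,       # bundled summary
    entity_detection=True,    # names, dates, account numbers
    sentiment_analysis=True,  # per-sentence sentiment
)
transcript = aai.Transcriber().transcribe("support_call.mp3", config)

print(transcript.summary)
for ent in transcript.entities or []:
    print(ent.entity_type, ent.text)
for result in transcript.sentiment_analysis or []:
    print(result.sentiment, result.text)
```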

Best for:

  • Support call intelligence
  • Compliance review
  • Audit workflows
  • Post-call analytics

Skip it if pure transcript quality is the thing you care about most.

That trade-off matters even more when the workload is asynchronous.

Google Cloud Chirp

If I can wait seconds instead of milliseconds, I care less about streaming behavior and more about broad accuracy and language coverage.

That is the case where Google Cloud Chirp becomes interesting.

Specs:

  • Batch WER: 11.6%
  • Languages: 125+
  • Pricing: varies by tier and language

Best for:

  • Long-form transcription
  • Multi-language batch pipelines
  • Offline transcription jobs

Skip it for live voice agents.

Once you step out of managed APIs, the open-source baseline is still very relevant.

Whisper Large V3

Whisper Large V3 is still the open-source STT reference point.

I have not seen many teams fully replace it unless they specifically need better streaming behavior or managed platform economics.

Specs:

  • WER: 7.4% average across mixed benchmarks
  • Parameters: 1.55B
  • Languages: 99+
  • Self-hosted on commodity inference hardware
  • Usually stronger on European languages than on rarer African and Asian languages
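
The self-hosted baseline is also the shortest code in this post. This uses the openai-whisper package (`pip install openai-whisper`); large-v3 wants a GPU with roughly 10GB of VRAM for comfortable inference.

```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("meeting.wav", language="en")
print(result["text"])

# Segments carry timestamps, useful for captioning or alignment.
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```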

Best for:

  • Self-hosted deployments
  • Privacy-sensitive workloads
  • Edge inference
  • High-volume usage where GPU economics beat API pricing

Skip it if you need sub-300ms streaming or built-in intelligence features.

That still leaves one specialist lane where general-purpose STT is not the best fit.

ElevenLabs Scribe

Scribe makes sense when accent normalization and pronunciation feedback are core to the product.

That is why it fits education, accessibility, and speech feedback workflows better than generic live-agent stacks.

Specs:

  • WER: as low as 5% on excellent-accuracy languages with diarization
  • Strong accent normalization
  • Included through ElevenLabs subscription tiers

Best for:

  • Language learning
  • Accessibility
  • Speaker-labeled transcription
  • Accent-aware products

Skip it if you just need fast general-purpose streaming STT.

There is also one layer people still underestimate, and it matters more than a small WER delta in real calls.

Deepgram Flux

Flux is not just another STT model.

It solves the question of when the user is done speaking and when the system should answer.

In real voice agents, that matters more than many teams expect. I have seen cleaner turn-taking improve perceived quality more than squeezing out a point or two on WER.

Best for:

  • Any turn-based voice agent

Skip it for:

  • Dictation
  • Plain transcription
  • Non-conversational speech workflows
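
Flux's end-of-turn model is proprietary, so there is nothing faithful to sketch. What follows is the naive silence-timeout baseline it replaces, just to show where end-of-turn logic sits in the loop; every name here is illustrative, not a Deepgram API.

```python
import time

class NaiveEndOfTurn:
    """Silence-timeout baseline: declare the turn over after silence_ms of quiet."""

    def __init__(self, silence_ms: int = 700):
        self.silence_ms = silence_ms
        self.last_speech = time.monotonic()

    def on_transcript(self, text: str) -> None:
        # Call this on every interim transcript; any speech resets the clock.
        if text.strip():
            self.last_speech = time.monotonic()

    def turn_is_over(self) -> bool:
        return (time.monotonic() - self.last_speech) * 1000 > self.silence_ms
```

The problem with this baseline is the trade-off baked into `silence_ms`: a short timeout interrupts slow speakers, a long one feels sluggish on short answers. A semantic end-of-turn model is what removes that trade-off.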

Once turn-taking is in place, the next hard decision is TTS, and that is where latency starts dominating the stack budget.

Best TTS models in May 2026

The TTS market is easier to reason about if you stop looking for one winner.

I would break it into three questions. Do I care most about latency, naturalness, or emotion control?

Cartesia Sonic Turbo

If the target is sub-500ms round-trip turn latency, Cartesia Sonic Turbo is the most important TTS option in the category.

The 40ms time-to-first-audio number changes what is possible for the rest of the system.

Specs:

  • TTFA: 40ms on Turbo, 90ms on standard
  • Languages: 15+
  • Voice cloning: supported
  • Streaming output

Best for:

  • Real-time interactive agents
  • Telephony
  • Any workflow where delay changes the user experience

Skip it if the product cares more about voice fidelity than raw speed.

This is the model that surprised me the most. The latency difference is not cosmetic. It changes whether the rest of your system even has room to breathe.
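
The honest way to handle TTFA claims like this is to measure them on your own traffic. Here is a vendor-agnostic probe that times the first chunk from any streaming HTTP TTS endpoint; the URL, headers, and payload are placeholders for whatever your provider documents.

```python
import time
import requests

def measure_ttfa(url: str, headers: dict, payload: dict) -> float:
    """Return milliseconds from request start to the first audio chunk."""
    start = time.monotonic()
    with requests.post(url, headers=headers, json=payload, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                return (time.monotonic() - start) * 1000
    raise RuntimeError("stream ended with no audio")
```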

ElevenLabs v3 Multilingual

ElevenLabs still owns the voice-quality and cloning conversation.

If the voice itself is part of the product, this is still the strongest default.

Specs:

  • Languages: 32+
  • TTFA: sub-100ms
  • Voice cloning: best-in-class
  • Strong emotional depth across multiple languages
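
A minimal sketch with the elevenlabs Python SDK. The voice_id is a placeholder, and the model_id shown is the multilingual v2 identifier I know to exist; substituting the current v3 model id is an assumption you should check against the model list.

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")

audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",           # placeholder: any voice from your library
    model_id="eleven_multilingual_v2",  # assumption: swap in the current v3 id
    text="The archive door sealed itself behind her.",
)
with open("line.mp3", "wb") as f:
    for chunk in audio:                 # convert() streams back audio bytes
        f.write(chunk)
```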

Best for:

  • Character voices
  • Creator content
  • Audiobooks
  • Branded voice products

Skip it if you need the lowest possible first-audio latency.

That leads into a different category entirely, which is not just quality but control.

OpenAI gpt-4o-mini-tts and gpt-realtime

This is the instructable voice pick.

If you want the LLM to control pacing, tone, and character directly through prompt instructions, this path is unusually flexible.

Specs:

  • TTFA: around 200ms
  • Languages: 50+
  • Natural-language voice control through GPT-4o audio features

Best for:

  • Context-aware voice agents
  • Dynamic character control
  • Teams already using the OpenAI audio stack

Skip it if your latency target is below 100ms.

I like this category because it makes voice behavior part of prompt design, but you still pay for that flexibility in latency.
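
A minimal sketch with the openai Python SDK as I understand it. The `instructions` field is what makes this category different: delivery is steered in plain language rather than SSML or voice settings.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Your refund went through this morning.",
    instructions="Warm, unhurried support agent. Soften the word 'refund'.",
) as response:
    response.stream_to_file("reply.mp3")
```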

Hume Octave

Hume Octave is the emotion-focused TTS option.

If prosody and emotional expression are the main job, this is where I would look before general-purpose models.

Specs:

  • TTFA: around 150ms
  • Languages: 20+
  • Tuned for emotion accuracy

Best for:

  • Mental health products
  • Character systems
  • Emotion-sensitive voice experiences

Skip it if cost or low latency is the top constraint.

There is also a separate lane for long-form speech, where turn speed matters less than sustained output quality.

PlayHT

PlayHT fits long-form conversational and narrated audio better than ultra-fast turn-by-turn systems.

That matters for podcasts, audiobook workflows, and any agent that has to speak for a while without falling apart.

Specs:

  • TTFA: around 250ms
  • Languages: 30+
  • Tuned for sustained quality over longer outputs

Best for:

  • Podcasts
  • Audiobooks
  • Long-running voice content

There is one more category worth keeping in mind if you need full control over the deployment surface.

Sesame Maya and Miles

Sesame is one of the more useful open-source TTS options right now.

I would mostly think about it for English-heavy self-hosted deployments where commercial TTS pricing does not make sense.

Best for:

  • Self-hosted English TTS
  • Privacy-sensitive systems
  • Cost-sensitive production workloads

Skip it if you need strong multilingual production quality.

Once STT and TTS are settled, the next choice is whether to wire the whole thing yourself or let a platform eat the orchestration pain.

Best voice agent platforms in May 2026

If I do not need a custom orchestration stack on day one, I would start with a platform.

You pay for that abstraction, but you also skip months of retry logic, barge-in handling, monitoring, and call quality debugging.

Retell AI

This is the most practical default for most teams.

The reason is not that it wins every metric. It is that the trade-off looks sane across latency, compliance, and engineering effort.

Specs:

  • End-to-end latency: about 600 to 780ms across third-party benchmarks
  • Pricing: $0.07 per minute
  • Compliance: HIPAA (included) and SOC 2
  • Builder: no-code builder plus developer SDK

Best for:

  • Most production voice-agent use cases
  • Teams that need sub-700ms, not sub-300ms
  • Teams that want managed infrastructure without extra compliance friction

Skip it if:

  • You need sub-500ms end-to-end
  • You need self-hosting

This is where I would start unless scale or control clearly pushes me elsewhere.

Vapi

Vapi is the scale pick.

If the system is heading toward very high call volume or you need multi-channel behavior across voice, SMS, and chat, Vapi becomes much more compelling.

Specs:

  • Volume: 300M+ cumulative calls
  • Uptime: 99.99% SLA
  • Average latency: sub-500ms
  • Channels: voice, SMS, chat

Best for:

  • Large-scale production deployments
  • Multi-channel applications
  • Teams optimizing for infrastructure behavior at scale

Skip it if your traffic is still small enough that managed simplicity matters more than control.

That same split shows up again when voice quality, not orchestration breadth, is the center of the product.

ElevenLabs Conversational

This makes sense when the voice itself is the product.

If I were building a premium voice-first consumer experience, I would at least test this path seriously.

Specs:

  • Built on ElevenLabs v3 TTS
  • Sub-100ms TTS first-audio
  • End-to-end latency depends on STT and LLM choices

Best for:

  • Character voice agents
  • Premium consumer voice products
  • Products where voice quality is part of the brand

Skip it if you need the broadest enterprise feature set or the lowest end-to-end latency.

That leaves the control-heavy path, where self-hosting and bundled stack behavior matter more than polish.

Deepgram Voice Agent

This is the self-hosted or control-first option.

You get Nova-3, Flux, LLM routing, and TTS bundled together in a managed or self-hosted path.

Specs:

  • End-to-end latency: sub-400ms with Flux
  • Bundled components: STT, Flux, LLM routing, TTS
  • Self-hosted deployment: supported

Best for:

  • Teams that want stack control
  • Self-hosted compliance
  • Predictable bundled pricing

Skip it if you want the simplest managed platform experience.

There is still one platform category that stays fairly specialized.

Bland AI

Bland AI fits structured outbound phone workloads.

If the real job is outbound calling at scale, that specialization matters.

Best for:

  • Outbound phone agents
  • Sales operations
  • Structured campaign workflows

Skip it if you are focused on inbound conversational agents.

If you do not want a managed platform at all, open-source is still viable, but the trade-off is very real.

Open-source voice agent options

If you want to own the STT, LLM, and TTS loop yourself, four names come up often:

  • Pipecat
  • LiveKit Agents
  • Daily Bots
  • Cartesia Line

LiveKit Agents is especially relevant if you want first-party adapters across major STT and TTS providers.

The trade-off is simple. You skip platform fees, but you own orchestration, retries, barge-in handling, and the pager when production gets weird.
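
To see what "owning the loop" actually means, here is the skeleton reduced to its moving parts. `stt_stream`, `llm_reply`, `tts_stream`, and `play_audio` are placeholders for whichever providers you wire in; the point is the failure surface, not the calls.

```python
import asyncio

async def agent_loop(stt_stream, llm_reply, tts_stream, play_audio):
    async for user_turn in stt_stream():       # yields text at each end-of-turn
        try:
            reply = await asyncio.wait_for(llm_reply(user_turn), timeout=2.0)
        except asyncio.TimeoutError:
            reply = "Sorry, could you say that again?"  # your retry policy, your pager
        async for audio_chunk in tts_stream(reply):
            # Barge-in handling belongs here: cancel playback when STT hears speech.
            await play_audio(audio_chunk)
```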

The moment you own the full loop, latency math becomes non-negotiable.

Latency budget, the part that decides everything

Natural conversation has a hard latency target.

Sub-300ms is the threshold people chase for truly natural turn-taking, and sub-700ms is the practical bar many production systems can still get away with.

The stack budget usually looks like this:

  • STT: 200 to 300ms
  • LLM inference: 100 to 300ms
  • TTS first audio: 40 to 200ms
  • Orchestration overhead: 50 to 100ms

If you want something close to sub-300ms, the TTS choice becomes structural.

That is why Cartesia Sonic Turbo matters so much. Without a 40ms-class TTS path, the rest of the budget gets tight very fast.

A practical sub-700ms stack usually looks more forgiving. Something like Deepgram Nova-3 plus a fast LLM plus ElevenLabs Flash-class TTS plus normal orchestration overhead is already workable.
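
The sanity check is just arithmetic. Here it is as a sketch using midpoints from the budget above; swap in your own measured numbers.

```python
budget_ms = {
    "stt": 250,              # 200-300ms streaming STT
    "llm_first_token": 200,  # 100-300ms for a fast LLM
    "tts_first_audio": 40,   # 40-200ms depending on the TTS pick
    "orchestration": 75,     # 50-100ms overhead
}
total = sum(budget_ms.values())  # 565ms with these midpoints
print(f"{total}ms total -> {'within' if total <= 700 else 'over'} the 700ms bar")
```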

That is also why picking by model quality screenshots alone leads to bad production decisions.

What 100K minutes per month really costs

Per-minute pricing hides a lot.

By the time a real production stack is running, the cost model depends on whether you chose managed flat pricing, bring-your-own providers, native audio APIs, self-hosting, or an enterprise bundle.

At around 100K minutes per month, the rough shapes look like this:

  • Retell managed: about $7,000
  • Vapi with BYO providers: about $11,500 to $14,000
  • OpenAI Realtime API alone: about $15,000 to $25,000
  • Bland AI tiers: about $6,000 to $9,000
  • Self-hosted mix of Whisper, Claude API, Cartesia, and Pipecat: about $6,000 to $7,500 plus on-call cost
  • Deepgram Voice Agent enterprise: about $7,500 plus enterprise contract terms

The honest part is that total cost of ownership does not line up cleanly with list price.

Under 100K minutes, a flat managed option often wins because engineering time, retry tuning, and compliance work are real costs. Above 1M minutes, bring-your-own or self-hosted economics can flip in your favor.
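
The arithmetic behind the managed number is straightforward. A sketch using the list prices quoted in this post; remember these hide volume discounts, telephony fees, and engineering time.

```python
MINUTES = 100_000

costs = {
    "Retell managed ($0.07/min)": 0.07 * MINUTES,                     # $7,000
    "Deepgram Nova-3 streaming STT ($0.0077/min)": 0.0077 * MINUTES,  # $770
}
for stack, usd in costs.items():
    print(f"{stack}: ${usd:,.0f}/month")
```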

That cost picture only makes sense if you already know what kind of system you are trying to build.

How I would choose

Choose Retell AI if:

  • You want the most practical default
  • Sub-700ms is fine
  • HIPAA matters
  • You want managed infrastructure

Choose Vapi if:

  • You are operating at serious scale
  • You need multi-channel support
  • Infrastructure behavior matters more than simplicity

Choose Deepgram Voice Agent if:

  • You want self-hosting
  • You want tighter stack control
  • Turn-taking and latency are central

Choose ElevenLabs Conversational if:

  • Voice quality is part of the product itself
  • Cloning fidelity matters

Choose Bland AI if:

  • The job is outbound calling
  • The workflow is operational and structured

Roll your own if:

  • You need a very specific stack
  • You need tighter compliance or deployment control
  • You can afford the engineering overhead

That choice gets easier once you avoid the mistakes that keep showing up across voice teams.

Common mistakes I keep seeing

  1. Picking TTS by sound quality and ignoring latency. A beautiful voice does not help if the turn feels slow.

  2. Treating turn-taking as optional. Generic STT is not enough for a good conversational agent.

  3. Ignoring LLM latency. A fast STT path plus a slow model still gives you a slow voice system.

  4. Building from scratch too early. Custom stacks make sense, but only when the platform path is clearly blocking you.

  5. Trusting vendor numbers without replaying your own traffic. Benchmarks are useful, but your accents, noise patterns, prompts, and call flows are what matter.

This is also where evaluation and observability become more important than most teams expect.

The part after launch

Voice agents fail for the same reasons text agents fail, plus a few extra ones.

You still get hallucinations, prompt contamination, bad tool calls, and retry loops. On top of that, you also get accent edge cases, interruptions, barge-in failures, and sentiment drift.

In my own work, I care less about the launch demo and more about whether the system survives real traffic.

That usually means four things:

  • Simulation before release, including accents, interruptions, background noise, and ambiguous prompts
  • Evals that score groundedness, hallucination risk, tool-call accuracy, and voice-specific behavior
  • Runtime guardrails that are fast enough not to blow the latency budget
  • Error clustering so repeated failures show up as patterns, not scattered tickets (a minimal version is sketched below)
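
Error clustering does not need to start sophisticated. A minimal sketch, assuming failures arrive as free-text descriptions: normalize each one into a signature and count repeats. Real systems would embed and cluster, but this surfaces patterns on day one.

```python
import re
from collections import Counter

def signature(failure: str) -> str:
    s = failure.lower()
    s = re.sub(r"\d+", "<num>", s)           # collapse ids and timestamps
    s = re.sub(r"[^a-z<> ]", "", s).strip()  # drop punctuation noise
    return s[:80]

failures = [
    "Tool call timeout after 2000ms on order 8812",
    "Tool call timeout after 2000ms on order 9903",
    "User barge-in ignored during TTS playback",
]
print(Counter(signature(f) for f in failures).most_common())
```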

For voice systems, accent handling, sentiment consistency, and tool-call accuracy are not edge metrics. They are production metrics.

That is why I would not pick a voice stack without thinking about failure analysis at the same time.

Sources I would check while validating a stack

For STT:

  • Deepgram Nova-3 announcement
  • Deepgram Flux docs
  • AssemblyAI Universal-2 research
  • Google Cloud Chirp docs
  • OpenAI Whisper Large V3 model card
  • ElevenLabs Scribe v2 Realtime

For TTS:

  • Cartesia Sonic Turbo docs
  • ElevenLabs Flash v2.5 latency docs
  • OpenAI Realtime API docs and pricing
  • Hume Octave docs
  • PlayHT docs

For platforms and benchmarks:

  • Vapi scale and uptime material
  • Retell latency material and telemetry API
  • Bland AI pricing
  • Deepgram Voice Agent pricing
  • ElevenLabs Conversational compliance requirements
  • Artificial Analysis AA-WER
  • ITU-T G.114

If you want the LLM side of the stack, read Best LLMs May 2026.

If you want the previous monthly snapshots, read Best Voice AI April 2026 and Best Voice AI March 2026.

The monthly changes are now happening fast enough that old voice stack assumptions do not stay true for long.
