2026 Roadmap
- Frontier Speech AI models — Universal 3.1 and 3.2 Pro ship in Q2 and Q3 across async and realtime, followed by Universal 4 Pro Realtime in Q4 targeting the lowest end-to-end turn latency and best-in-class voice-agent audio handling. Universal TTS 1 ships in Q3, completing the in-house voice AI stack. Universal 1 Duplex (Preview), our native speech-to-speech model, follows in Q1 2027. Native-language coverage jumps from 6 to 30+, with noise cancellation, streaming PII redaction, and self-hosted realtime.
- Voice AI infrastructure platform — One API for every voice AI model: ours and the best open and community ones. Voice Agent API ships in Q2 with Twilio integration, and the Edge Voice Agent Platform in Q3 adds agent management, session storage, and edge functions. Community Models span Async STT, Streaming STT, LLM Gateway, and TTS, so customers get the right model for every language, domain, and price point without ever leaving the platform.
Async
Speech-to-text for pre-recorded audio.
Upcoming
- The next Universal-3 Pro release. Native-language coverage jumps from 6 to 30+, adding Japanese, Korean, Hindi, Arabic, Turkish, and others. Accuracy improves on the six core languages.
  New languages: Japanese, Vietnamese, Arabic, Dutch, Swedish, Hindi, Norwegian, Finnish, Danish, Urdu, Hebrew.
- Universal-3 Pro transcription roughly twice as fast end-to-end. Phase one is in production and has already cut turnaround time by 30–80% across different audio durations since the model’s release.
- Open-source speech-to-text models served directly through our API, giving immediate access to the best available models for the languages and domains they specialize in.
- Significantly better speaker labeling in noisy and multi-speaker audio. Focuses on the two errors customers report most often: mislabeled short replies like “yeah” and “uh-huh”, and speaker turns that don’t line up with punctuation.
- A follow-up to Universal-3.1 Pro. Focused on prompt-following, proper nouns and named entities, and broader language coverage.
- Recognize the same speaker across different recordings, not just within a single file. Useful for meetings, call centers, and cross-session analytics workflows.
- On-premise deployment of universal-3-pro for regulated environments with strict data-residency requirements.
- The next major accuracy and capability release after Universal-3.2 Pro.
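Cross-recording speaker recognition generally works by comparing speaker embeddings between files. As an illustrative sketch only (not our implementation; the toy vectors, names, and threshold below are made up), matching an embedding from a new recording against enrolled speakers can look like:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_speaker(embedding, known_speakers, threshold=0.75):
    """Return the enrolled speaker whose embedding is most similar,
    or None if no candidate clears the threshold."""
    best_name, best_score = None, threshold
    for name, ref in known_speakers.items():
        score = cosine_similarity(embedding, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Embeddings from two different recordings (toy 3-D vectors; real speaker
# embeddings come from a dedicated model and have hundreds of dimensions).
known = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
print(match_speaker([0.88, 0.12, 0.21], known))  # closest to alice
```

The threshold trades false matches against missed matches; a production system tunes it per use case.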
Recently shipped
- Universal 3 Pro Async Timestamp Improvements — Major improvement to Universal-3 Pro’s timestamp calculation, delivering median precision gains of 15.3% for English and 8.6% for non-English, with P99 improvements of 15.0% and 58.4% respectively.
- Hebrew & Swedish — Major accuracy gains in Hebrew and Swedish via community-model integrations. Word error rates dropped 37% and 47%.
- Medical Mode — An LLM-powered correction pass for medical terminology (drug names, procedures, clinical entities). On our medical benchmark, it achieves a 4.97% error rate versus 7.32% for the next-best vendor. Available as an add-on to Universal-3 Pro in English, Spanish, German, French, Portuguese, and Italian.
- PII Audio Redaction using Silence — Redact PII with silence instead of a beep. Reduces listener fatigue when redacted audio is replayed at scale in call-center and compliance workflows.
- Universal 3 Pro Async — Promptable speech-to-text with natural-language and custom-vocabulary prompts, mid-sentence language switching across six core languages, and audio tagging.
- Improved Short-Audio Diarization — 19% better speaker-count accuracy and 6% lower speaker-attributed word error rate on audio under two minutes.
- Multichannel Diarization — Per-channel speaker labels for multi-microphone recordings. Eliminates crosstalk ambiguity in call-center and meeting audio.
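To make the silence-redaction behavior concrete: instead of overlaying a beep tone over a detected PII span, the corresponding stretch of audio samples is simply zeroed. A minimal sketch, assuming mono PCM samples and spans given in seconds (the function name and signature are illustrative, not our API):

```python
def redact_with_silence(samples, spans, sample_rate=16000):
    """Replace PII spans (start_sec, end_sec) with silence by zeroing
    the samples, instead of overlaying a beep tone."""
    out = list(samples)
    for start_sec, end_sec in spans:
        start = int(start_sec * sample_rate)
        end = min(int(end_sec * sample_rate), len(out))
        for i in range(start, end):
            out[i] = 0
    return out

# One second of fake audio at 8 samples/sec; redact 0.25 s to 0.75 s.
audio = [100, 100, 100, 100, 100, 100, 100, 100]
print(redact_with_silence(audio, [(0.25, 0.75)], sample_rate=8))
# [100, 100, 0, 0, 0, 0, 100, 100]
```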
Realtime
Low-latency streaming speech-to-text for live audio.
Upcoming
- The next universal-realtime-3-pro release, with better noise handling, higher voice-agent accuracy, continuous speaker-labeling gains, and PII redaction. Early English results already beat universal-realtime-3-pro on numbers, medical terms, accented speech, and alphanumerics. Native-language coverage jumps from 6 to 30+, adding Japanese, Korean, Hindi, Arabic, Turkish, and others.
  New languages: Japanese, Vietnamese, Arabic, Dutch, Swedish, Hindi, Norwegian, Finnish, Danish, Urdu, Hebrew.
- Realtime noise suppression for voice agents and telephony. Delivers clean audio into transcription and downstream LLMs so accuracy holds up in real call-center conditions, with no separate preprocessor required.
- PII detection and redaction in the realtime pipeline for HIPAA, PCI, and other compliance-sensitive workloads. Configurable entity types and substitution modes.
- A follow-up to Universal-3.1 Pro Realtime. Focused on prompt-following, proper nouns and named entities, broader language coverage, and closing remaining accuracy gaps that enterprise customers care about.
- A fast, cost-efficient realtime model for notetaking and meeting intelligence. Optimized for long-form audio where throughput, stable speaker labeling, and sustained accuracy over multi-hour sessions matter more than minimizing latency.
- On-premise deployment of universal-realtime-3-pro for regulated environments with strict data-residency requirements.
- The next-generation realtime model for voice agents. Targets the lowest end-to-end turn latency and the strongest handling of voice-agent audio (noise, interruptions, mid-turn hesitation, accented and non-native speech). Instruction-following is strong enough that a single model replaces today’s speech-to-text + LLM + TTS stacks. Multilingual across 15+ native languages and the foundation for our native speech-to-speech architecture.
- A single native end-to-end model that replaces today’s Voice Agent pipeline (speech-to-text, LLM, text-to-speech) with a unified Realtime Speech LLM. Tighter latency, better prosody control, and more natural interruption handling than orchestrated stacks can deliver.
Recently shipped
- Medical Mode — An LLM-powered correction pass for medical terminology. 4.97% error rate versus 7.32% for the next-best vendor. Available in both async and streaming on universal-realtime-3-pro.
- Streaming Diarization v1.5 — Speaker-aware sentence splitting for cleaner segmentation. 4–5% lower word error rate, 56% fewer phantom speakers, and clear gains on the CallHome and AMI speaker-labeling benchmarks.
- Universal 3 Pro Realtime — Realtime speech-to-text with inline streaming speaker labeling, custom vocabulary prompts up to 1,000 words, audio tagging, filler-word control, mid-sentence language switching, and 99+ language support via Whisper routing for long-tail languages. EU region support.
- Whisper Streaming — The first community model in our streaming API, shipped alongside Universal 3 Pro Realtime.
- Edge Routing and Data Zone Endpoints — Global low-latency routing with US/EU data-residency endpoints. No additional charge.
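Filler-word control in the realtime model is a server-side option, but its effect is easy to picture as a post-processing filter over the transcript. A toy sketch (the filler list and function are illustrative, not the API surface):

```python
import re

FILLERS = {"um", "uh", "erm", "hmm"}

def strip_fillers(transcript):
    """Drop standalone filler words from a transcript and re-join the
    remaining tokens. Punctuation attached to a filler goes with it."""
    tokens = transcript.split()
    kept = [t for t in tokens if re.sub(r"[^\w]", "", t).lower() not in FILLERS]
    return " ".join(kept)

print(strip_fillers("Um, I think, uh, we should ship it"))
# "I think, we should ship it"
```

The hosted option works on the recognition output itself, so it can also avoid emitting the filler tokens in partial transcripts rather than cleaning them up afterward.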
Voice Agents
End-to-end Voice Agent API.
Upcoming
- Production release of the Voice Agent API (formerly Speech-to-Speech API). Built on universal-realtime-3-pro, LLM Gateway, and text-to-speech running on self-hosted LiveKit. PCI-certified.
- Direct Twilio SIP and voice connectivity. Phone integration without customer-side LiveKit plumbing.
- Full programmatic control of voice agents, with session data and tool execution running at the edge. Create, update, and version agents as code through a management API, persist and retrieve conversation sessions (events, transcripts, tool calls) through public endpoints, and run webhooks or tool calls at the edge instead of round-tripping to origin.
- Official client libraries, starting with Python and TypeScript.
- A custom turn detection model trained on Universal 3 Pro streaming outputs for market-leading turn detection performance. Reduces false endpointing and improves handling of pauses, hesitations, and overlapping speech in Voice Agent API calls.
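For context on what a trained turn detection model improves on: the naive baseline ends a turn after a fixed run of silent frames, which mistakes mid-sentence pauses for turn ends. A minimal sketch of that baseline (illustrative only, not the shipped model):

```python
def detect_turn_end(frames, silence_ms=500, frame_ms=20):
    """Naive energy-based endpointer: the turn ends once we see enough
    consecutive silent frames. Returns the index of the frame where the
    turn is considered finished, or None if the speaker is still talking.
    A trained turn-detection model replaces this fixed-threshold rule
    with a prediction that also considers what was said, so hesitations
    and mid-sentence pauses do not trigger false endpoints."""
    needed = silence_ms // frame_ms
    run = 0
    for i, is_speech in enumerate(frames):
        run = 0 if is_speech else run + 1
        if run >= needed:
            return i
    return None

# 20 ms frames: speech, then a 600 ms pause (30 silent frames).
frames = [True] * 10 + [False] * 30
print(detect_turn_end(frames))  # fires at frame index 34
```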
Recently shipped
- Voice Agent Preview — First public release of end-to-end voice AI. Combines universal-realtime-3-pro, LLM Gateway, and text-to-speech on LiveKit.
TTS
Text-to-speech built for voice agents.
Upcoming
- A standalone text-to-speech model for production voice workloads. Low time-to-first-byte, voice prompting and customization, and accurate delivery of phone numbers, email addresses, named entities, and other content today’s TTS systems struggle with.
- Open-source text-to-speech models served directly through our API alongside Universal TTS, giving immediate access to the best available voices for the languages, styles, and domains they specialize in.
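Accurate delivery of phone numbers comes down to text normalization: expanding the string into the words to be spoken before synthesis, rather than letting the model read it as one large integer. A toy sketch of one such rule, digit-by-digit reading (illustrative only, not our implementation):

```python
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def verbalize_phone_number(number):
    """Expand a phone number into the words a TTS voice should speak,
    one digit at a time, ignoring separators like dashes and spaces."""
    words = [DIGITS[ch] for ch in number if ch.isdigit()]
    return " ".join(words)

print(verbalize_phone_number("415-555-0132"))
# "four one five five five five zero one three two"
```

Real normalization also has to decide when a digit string is a phone number versus a year, price, or ordinal, which is where rule-based systems tend to break down.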
Speech Understanding
Extract meaning, sentiment, and events from audio.
Upcoming
- Summarization via the LLM Gateway, replacing legacy LeMUR summaries. Quality gains come from routing through frontier models with automatic fallbacks.
- Sharper chapter boundaries and titles via the LLM Gateway. Clearly better topic segmentation on long-form content.
- Accuracy and quality improvements to Speaker ID, Translation, and Custom Formatting. Translation covers live streaming and pre-recorded audio. Enables multilingual workflows where the spoken language differs from the output language.
- Automatic LLM-based transcript correction with no user prompts, generalizing the Medical Mode pattern to any domain.
- Better custom-vocabulary (keyterm) prompting across every supported language, via LLM Gateway post-processing. Closes the quality gap with English for Spanish, German, French, Portuguese, and Italian.
- Detect speaker emotions and emotional shifts in input audio. Distinct from Voice Agents Emotion and Style Tagging, which controls TTS output. Useful for therapy, CX scoring, and compliance monitoring.
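The keyterm improvements above run through LLM Gateway post-processing, but the underlying idea can be pictured with a much simpler baseline: snap near-miss transcript words to the closest supplied keyterm by edit distance. An illustrative sketch only, not the production mechanism:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_keyterms(words, keyterms, max_dist=2):
    """Snap near-miss transcript words to the closest supplied keyterm."""
    out = []
    for w in words:
        best = min(keyterms, key=lambda k: edit_distance(w.lower(), k.lower()))
        out.append(best if edit_distance(w.lower(), best.lower()) <= max_dist else w)
    return out

print(correct_keyterms(["universl", "speech"], ["AssemblyAI", "Universal"]))
# ["Universal", "speech"]
```

An LLM-based pass goes further than this: it uses sentence context to decide whether a substitution is actually warranted, which plain edit distance cannot do.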
LLM Gateway
One API for every major LLM. Built-in fallbacks and audio-first integration.
Upcoming
- Ongoing catalog expansion. The Gateway currently supports 24 models across Anthropic, OpenAI, Google, Qwen, and Kimi. DeepSeek, Mistral, Llama, and Cohere are next, with more open-source models to follow.
- Prompt-cache pass-through for supported providers, so customers keep the cache discount while routing through the Gateway.
- Reasoning (extended-thinking) controls exposed through the Gateway, on models that support them.
- Priority, standard, and flex request tiers for per-request cost and latency control.
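The Gateway's built-in fallback behavior can be pictured as trying models in priority order and moving on when a provider fails. An illustrative sketch (the model names, exception type, and call function are stand-ins, not real Gateway identifiers):

```python
class ModelUnavailable(Exception):
    """Stand-in for a provider outage or rate-limit error."""

def call_with_fallbacks(prompt, models, call_model):
    """Try each model in priority order; return (model, response) for the
    first success. `call_model(model, prompt)` stands in for a provider
    request and is expected to raise ModelUnavailable on failure."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            last_error = exc
    raise last_error

# Simulate the primary provider being down.
def fake_call(model, prompt):
    if model == "provider-a/large":
        raise ModelUnavailable("rate limited")
    return f"{model}: summary of {prompt!r}"

used, reply = call_with_fallbacks(
    "call transcript", ["provider-a/large", "provider-b/large"], fake_call)
print(used)  # "provider-b/large"
```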
Open Benchmarks
Transparent, reproducible benchmarks across Universal and community models.
Upcoming
- A public version of our internal evaluation dashboard, covering 30+ competitors and the community models we serve across dozens of metrics (word error rate, speaker-labeling accuracy, keyterm accuracy, timestamp precision, mid-sentence language switching). Customers can verify accuracy and pick the right model per language and domain.
- Open-source release of the evaluation dashboard and supporting tooling, so customers and researchers can reproduce our benchmarks across all models on their own data.
- A realistic Voice-AI evaluation set of telephony, meeting, and voice-agent audio with proper ground-truth transcripts. The same dataset is used to grade every model on the leaderboard, Universal and community alike. Not a thirty-second YouTube clip or LibriSpeech.
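Word error rate, the headline metric on the leaderboard, is the word-level Levenshtein distance between reference and hypothesis divided by the reference length. A minimal reference implementation (real benchmarks first apply text normalization, which is omitted here):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown socks"))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is one reason normalization choices matter when comparing vendors.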
Developer Experience
The dashboard, accounts, and tooling that make AssemblyAI easy to adopt.
Upcoming
- Invite teammates with role-based access. Organization setup, member management, RBAC, full auth flows, MFA enforcement, account switching, and ownership transfer.
- SAML and OIDC, as a fast-follow to multi-user accounts.
- Guided setup for new accounts, including model selection, API key creation, first-request walkthrough, and best-practice defaults.
- Hard spend caps and team budgets, beyond today’s soft alerts.
- Deeper observability in the dashboard. P50 and P95 turnaround time, webhook delivery statistics, uptime, and latency histograms.
- Programmatic access to billing and usage data.
- Configurable thresholds and cadences, including daily alerts and custom trigger conditions.
- Regulatory-compliant card support for EU PSD2 Strong Customer Authentication requirements.
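The P50 and P95 figures planned for the dashboard are percentiles of per-request turnaround time. Using the nearest-rank definition (one common choice; the sample numbers below are made up), they can be computed as:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Turnaround times in ms; one slow outlier dominates the tail.
latencies_ms = [120, 95, 110, 300, 105, 98, 130, 101, 99, 2500]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
# 105 2500
```

The gap between the two numbers is the point of showing both: P50 describes the typical request, while P95 exposes the tail that users actually complain about.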
Recently shipped
- AssemblyAI Skill for AI Coding Agents — Claude Code, Cursor, and Codex now ship with a native AssemblyAI skill. It gives them accurate knowledge of our API out of the box and cuts hallucinated API usage in agent-generated code.