Aug 6, 2025

Taking the bitter lesson to heart for speech-to-speech models.

Unlocking large-scale conversational data for audio general intelligence.

MetaVoice is building Voice AI that feels indistinguishable from talking to a person, even after years of daily interactions.

Today's Voice AI can't handle a simple 'mm-hmm': a skill humans acquire in infancy. This limitation arises because current systems operate like a walkie-talkie: one party speaks at a time, never both simultaneously. Even more problematic, Voice AIs interrupt unexpectedly, causing conversational chaos. This leads to a frustrating game of conversational ping-pong – where the AI speaks, the user interrupts, and the AI, unable to yield quickly, creates awkward overlaps and repeated interruptions.

These limitations stem directly from core architectural decisions originally optimized for text-based Q&A chatbots. Today's voice AI systems typically rely on a cascaded pipeline (Automatic Speech Recognition / ASR → text-based Large Language Model / LLM → Text-to-Speech / TTS) or use multi-modal speech-text models, retrofitting speech capabilities onto frameworks initially developed for textual inputs. However, Q&A chatbot models are inherently turn-based, a constraint that doesn't naturally apply to human speech. Consequently, the speech data used to train these systems has to be filtered, extracted, or synthetically generated to conform to a turn-taking mold. This severely restricts scalability and limits the model's capability to learn the nuances of human interaction.
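
To make the turn-based constraint concrete, here is a minimal sketch of a cascaded agent. The asr/llm/tts callables are placeholders, not any specific vendor's API; the point is simply that nothing can be spoken until the user's turn has been declared finished and all three stages have run end to end.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedAgent:
    # Placeholder callables standing in for real components.
    asr: Callable[[bytes], str]   # speech -> transcript
    llm: Callable[[str], str]     # transcript -> text reply
    tts: Callable[[str], bytes]   # text reply -> synthesized speech

    def respond(self, finished_user_turn: bytes) -> bytes:
        # This only runs once a VAD has decided the user's turn is over,
        # and the agent cannot listen or yield while the stages execute.
        transcript = self.asr(finished_user_turn)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)
```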

Rich Sutton’s Bitter Lesson tells us that scalable, data-driven methods eventually outperform handcrafted ones. To take this seriously in speech-to-speech modelling, we need to fully embrace Duplex models – architectures that simultaneously listen and speak at every token. Such architectures not only allow for more natural human-like conversational behaviours (such as overlapping speech, backchannels, and polite interruptions) but also enable direct training on raw audio streams.

Yet duplex models face their own unique challenge – training them requires datasets with cleanly separated speaker tracks, something scarcely available publicly. Almost all conversational audio is mixed into a single waveform, making it unsuitable for duplex model training.

In this blog, we share our progress in overcoming this critical data bottleneck by effectively separating audio at scale. This advancement brings us significantly closer to Voice AI agents capable of sustaining multi-year interactions indistinguishable from human interactions.

Problem


Voice Activity Detection (VAD)

Both chained (ASR→LLM→TTS) and multimodal speech-text models are turn-based. They need to be explicitly triggered to start generating a response. For this, they rely heavily on accurate VAD systems to determine when users start or stop speaking. The quality of VAD directly impacts conversational fluidity.

Traditional approaches include:

  • Silence-based VAD: Detects pauses of predetermined lengths. However, it frequently misinterprets naturally occurring speech pauses, such as brief hesitations during postcode recitation or subtle conversational acknowledgments like 'mm-hmm' (see the sketch after this list).
  • Semantic VAD: Incorporates linguistic context to understand natural pauses. Yet it remains limited in handling conversational overlaps, simultaneous speech, or nuanced backchannels due to its reliance on linguistic signals.
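
As a concrete illustration of the first limitation, here is a minimal energy-threshold sketch of silence-based VAD. The frame size, threshold, and 700 ms gap are illustrative values only; the failure mode is that any pause longer than the configured gap, including a hesitation mid-postcode, looks identical to a finished turn.

```python
import numpy as np

def silence_based_vad(audio: np.ndarray, sample_rate: int, frame_ms: int = 30,
                      energy_threshold: float = 1e-4, min_silence_ms: int = 700):
    """Flag 'end of user turn' whenever frame energy stays below a threshold
    for a fixed stretch of time. All constants here are illustrative."""
    frame_len = int(sample_rate * frame_ms / 1000)
    end_of_turn_frames, silent_run = [], 0
    for i in range(len(audio) // frame_len):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        if float(np.mean(frame ** 2)) < energy_threshold:
            silent_run += 1
            if silent_run * frame_ms >= min_silence_ms:
                # A hesitation mid-postcode triggers this just as readily
                # as a genuinely finished turn.
                end_of_turn_frames.append(i)
                silent_run = 0
        else:
            silent_run = 0
    return end_of_turn_frames
```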

Training data

Existing voice models (TTS or speech-to-speech) typically require clean, single-speaker, turn-based datasets. However, such data lacks the nuances of real-world conversations, like overlapping speech, backchannels, and interruptions, which constitute 50% of human dialogue patterns.

Current data preparation methods for training TTS or speech-to-speech models therefore involve extensive filtering or synthetic data generation. The common approaches are:

  1. Training on audiobook-like data: Inherently single-speaker with no conversational turns, leading to unnatural, monotonous speech interactions.
  2. Extracting turn-based data from real-world recordings: Involves diarisation (segmenting audio by speaker), filtering out segments containing speaker overlaps, and conducting extensive quality checks (sketched after this list). This approach significantly reduces usable data and rarely yields clean, complete conversational turns.
  3. Training on synthetic data: Synthetic TTS-generated dialogues offer controlled, clean, turn-based training examples. Whilst scalable, synthetic conversations often lack the nuanced realism and variability present in genuine human interactions. More recent work (like NotebookLM and ElevenLabs V3) explores “dialogue text-to-speech”, where models consider the entire interaction before generating speech. However, the outputs still remain inherently turn-based.
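
To illustrate approach 2, here is a simplified sketch of the overlap-filtering step, operating on hypothetical diarisation output (speaker labels plus timestamps). It shows how a single backchannel can force entire overlapping turns to be discarded.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float

def extract_clean_turns(diarised, min_len: float = 1.0):
    """Keep only segments that do not overlap any other speaker's segment
    and are long enough to be useful. `diarised` is assumed to come from an
    upstream diarisation step; this mirrors the filtering in approach 2."""
    clean = []
    for seg in diarised:
        overlaps = any(
            other.speaker != seg.speaker
            and other.start < seg.end
            and seg.start < other.end
            for other in diarised
        )
        if not overlaps and (seg.end - seg.start) >= min_len:
            clean.append(seg)
    return clean

# Example: a short "mm-hmm" from B overlaps A's turn, so both get dropped.
diarised = [
    Segment("A", 0.0, 4.2),
    Segment("B", 3.8, 4.1),   # backchannel overlapping speaker A
    Segment("B", 4.3, 7.0),
]
print(extract_clean_turns(diarised))  # only the final B turn survives
```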

Similar challenges arise for other conversational components such as semantic VAD, or, in chained systems, components like ASR and text-based LLMs (for example to make their responses more “conversational”).

Collectively, these constraints significantly limit the scalability and realism achievable with existing data preparation methods.

Duplex Models

Traditional chained or multimodal systems don’t respond until a user finishes an entire turn. Duplex models, by contrast, process input audio and generate output concurrently. They decide at every frame whether to speak or remain deliberately silent. Because these frame-level choices serve as implicit turn-taking, duplex models sidestep the need for explicit start/stop logic or VAD.
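
Here is a minimal sketch of what "decide at every frame" means in practice. The frame size, token values, and the trivial decision rule are placeholders, not our architecture; the point is that speaking and staying silent are both ordinary per-frame outputs, so no external VAD or trigger is needed.

```python
import numpy as np

FRAME_MS = 80          # assumed frame duration for this sketch
SILENCE_TOKEN = 0      # stands in for a learned "stay quiet" output
SPEECH_TOKEN = 1       # stands in for a generated audio token

def duplex_step(incoming_frame: np.ndarray, state: int):
    """Toy stand-in for one duplex model step: consume one frame of the
    user's audio and emit either a speech token or the silence token."""
    user_is_speaking = float(np.mean(incoming_frame ** 2)) > 1e-4
    state = state + 1 if user_is_speaking else 0
    # Toy rule: backchannel only after ~2 s of continuous user speech.
    speak_now = state * FRAME_MS >= 2000
    return (SPEECH_TOKEN if speak_now else SILENCE_TOKEN), state

def run_duplex(mic_frames):
    """Listening and generating happen in the same loop, every frame."""
    state = 0
    for frame in mic_frames:
        token, state = duplex_step(frame, state)
        yield token
```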

And because duplex models process audio continuously, they don't depend on turn-segmented datasets. Instead, they require two clean audio channels during training: one for incoming speech and the other for the model's own output. Most publicly available conversational corpora, however, are mono-channel recordings where all speakers are mixed together, making them poorly suited to training duplex architectures.
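
Concretely, a duplex training example can be thought of as a two-channel tensor: one channel the model hears, one channel it is trained to produce. A minimal sketch, assuming you already have per-speaker tracks (for instance from the separation model described below):

```python
import numpy as np

def make_duplex_example(speaker_a: np.ndarray, speaker_b: np.ndarray):
    """Stack two separated tracks into a (2, T) training example:
    channel 0 = audio the model listens to, channel 1 = audio it must emit.
    Assumes both tracks are sample-aligned and share a sampling rate."""
    length = min(len(speaker_a), len(speaker_b))
    return np.stack([speaker_a[:length], speaker_b[:length]])

# A mono recording collapses both speakers into one waveform, so these
# per-channel targets cannot be recovered without separating them first.
```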

Solution

Our approach takes the bitter lesson to heart: extract conversational data at scale to feed a general-purpose duplex model. To do this, we're building a state-of-the-art speech separation model for real-world conversational data to pre-train our duplex model. And there is much more in the works; see the "Upcoming" section below.

Speech separation is the process of pulling apart each speaker’s voice from recordings where multiple people are talking—often at the same time. Human speech is rich and complex. It contains paralinguistic cues (like laughter) and interactional signals (like backchannels) which convey meaning beyond plain speech. See examples.

[Audio example: original mixed audio, separated Speaker 1 track, separated Speaker 2 track]

Currently, most open-source and (the very few) commercial speech separation models can handle plain speech, but struggle to separate the real-world conversational elements we care about. These include:

  • Conversational elements
    • Paralinguistic sounds – like laughter, breaths, sighs
    • Disfluencies – like "uh", "um"
    • Backchannels – like "mm-hmm", "yeah", "right"
    • Overlaps, interruptions, and interjections
  • Long-form conversations
  • High sampling rates – most existing solutions operate at only 8 kHz

To address these gaps, we've developed a speech separation model specifically tailored for natural conversations.

We significantly outperform existing open-source and commercial models in handling conversational elements on dialogues up to 30 seconds in length.


Human Evaluation Pass Rate on Real-World Speech Separation (higher is better):

  • SepReformer: 55%
  • Pyannote.ai: 0%
  • MetaVoice: 92%



[Audio examples (30-second dialogue): original mixed audio plus separated Speaker 1 and Speaker 2 tracks from SepReformer, Pyannote.ai, and MetaVoice]

We’re now extending the model to handle arbitrary-length conversations with the same fidelity and separation quality. Here's an example of our early results; it is by no means perfect yet.

[Audio examples (long-form conversation): original mixed audio plus separated Speaker 1 and Speaker 2 tracks from SepReformer, Pyannote.ai, and MetaVoice]

Upcoming

In future posts, we’ll share insights into how we expand on this pre-training work and address other challenges in building a product-grade duplex model:

  • Architecture - what is the “right” duplex architecture for stable long-form conversations?
  • Post-training - how do we post-train to build a reliable, engaging product with behaviours as programmed by our customers?
  • Evaluation - how do we evaluate these duplex models for natural conversations?
  • Control - what are the right controls for interactions that last hours/days?

Join us

If you’re interested in any of the above challenges, we’re hiring. Check our open roles.
And subscribe to @metavoiceio to get notified when we release more details.