Speech-to-Speech


Speech-to-Speech (S2S) refers to a technology that translates or processes spoken language directly into spoken language without the traditional detour through text. While conventional voice pipelines go through three stages (speech-to-text, then LLM, then text-to-speech), a speech-to-speech model processes audio end-to-end in a single neural network.
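The contrast between the two architectures can be sketched in a few lines. This is an illustrative sketch only: all helper functions are trivial stand-ins (assumptions, not a real model API) that show where information flows through text and where it does not.

```python
# Illustrative sketch only: the helpers below are trivial stand-ins
# for real STT, LLM, TTS, and S2S models.

def transcribe(audio: bytes) -> str:
    """Stand-in STT: pretend the audio decodes to text."""
    return audio.decode("utf-8")

def generate_reply(text: str) -> str:
    """Stand-in LLM: works on plain text only."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in TTS: pretend to re-encode text as audio."""
    return text.encode("utf-8")

def cascaded_pipeline(audio_in: bytes) -> bytes:
    """Classic three-stage pipeline: STT -> LLM -> TTS.
    Paralinguistic cues (tone, laughter, hesitation) are lost
    when the signal passes through plain text."""
    text = transcribe(audio_in)
    reply = generate_reply(text)
    return synthesize(reply)

def speech_to_speech(audio_in: bytes) -> bytes:
    """End-to-end S2S: a single model maps audio directly to audio,
    so prosody and emotion can survive the round trip."""
    # Stand-in for one neural network processing audio end-to-end.
    return b"reply-audio-for:" + audio_in
```

The key structural difference is visible in the signatures: the cascaded pipeline forces everything through a `str` in the middle, while the S2S function never leaves the audio domain.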

This way, even paralinguistic information—such as emotion, tone of voice, laughter, or hesitation—is preserved, details that are typically lost when transcribing speech into text.


Where speech-to-speech excels and where its limitations lie

S2S models excel at short, conversational interactions that require a high degree of naturalness, such as small talk, simple inquiries, or topics similar to those covered in FAQs. They currently perform less well in complex, business-critical processes involving multi-step tool calls, authentication, and backend write operations. In these scenarios, single-model architectures quickly fail due to tool-calling errors or a lack of adherence to rules.
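One practical consequence of this split is a hybrid routing strategy: send short conversational turns to the S2S model and route tool-heavy, business-critical requests to the classic cascaded pipeline. The sketch below is a hedged, minimal illustration of that idea; the keyword list and function names are assumptions for demonstration, not a real product API.

```python
# Hypothetical routing sketch: conversational turns go to S2S,
# requests that touch backend operations go to the cascaded pipeline.

TOOL_KEYWORDS = {"cancel", "order", "refund", "account", "payment"}

def needs_tools(transcript: str) -> bool:
    """Naive intent check: does the request mention backend operations?
    A production system would use a proper intent classifier instead."""
    return any(word in transcript.lower() for word in TOOL_KEYWORDS)

def route(transcript: str) -> str:
    """Pick the architecture per turn."""
    return "cascaded_pipeline" if needs_tools(transcript) else "s2s_model"
```

For example, "Please cancel my order" would be routed to the cascaded pipeline (where tool calls can be audited and validated), while small talk stays with the S2S model for lower latency and more natural speech.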


Frequently Asked Questions (FAQ)

Is speech-to-speech generally superior to the classic pipeline?

Not in general. Speech-to-speech is superior in latency and naturalness, but currently has weaknesses in complex tool invocation, adherence to rules, and auditability.

How does speech-to-speech differ from text-to-speech and speech-to-text?

While text-to-speech (TTS) and speech-to-text (STT) simply convert between written and spoken language, speech-to-speech (S2S) converts an audio input directly into a new audio output. In the process, characteristics such as the speaker's voice, emotion, and intonation can be preserved, or even carried over into another language, without relying on an intermediate step of visible text.

–> Back to BOTwiki - The Chatbot Wiki