Speech-to-Speech


Speech-to-Speech (S2S) refers to a technology that translates or processes spoken language directly into spoken language without the traditional detour through text. While conventional voice pipelines go through three stages (speech-to-text, then LLM, then text-to-speech), a speech-to-speech model processes audio end-to-end in a single neural network.
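The contrast between the two architectures can be sketched in a few lines. This is an illustrative sketch only: all helper functions are trivial stand-ins (assumptions, not a real model API) that show where information flows through text and where it does not.

```python
# Illustrative sketch only: the helpers below are trivial stand-ins
# for real STT, LLM, TTS, and S2S models.

def transcribe(audio: bytes) -> str:
    """Stand-in STT: pretend the audio decodes to text."""
    return audio.decode("utf-8")

def generate_reply(text: str) -> str:
    """Stand-in LLM: works on plain text only."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in TTS: pretend to re-encode text as audio."""
    return text.encode("utf-8")

def cascaded_pipeline(audio_in: bytes) -> bytes:
    """Classic three-stage pipeline: STT -> LLM -> TTS.
    Paralinguistic cues (tone, laughter, hesitation) are lost
    when the signal passes through plain text."""
    text = transcribe(audio_in)
    reply = generate_reply(text)
    return synthesize(reply)

def speech_to_speech(audio_in: bytes) -> bytes:
    """End-to-end S2S: a single model maps audio directly to audio,
    so prosody and emotion can survive the round trip."""
    # Stand-in for one neural network processing audio end-to-end.
    return b"reply-audio-for:" + audio_in
```

The key structural difference is visible in the signatures: the cascaded pipeline forces everything through a `str` in the middle, while the S2S function never leaves the audio domain.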

This way, even paralinguistic information—such as emotion, tone of voice, laughter, or hesitation—is preserved, details that are typically lost when transcribing speech into text.


Where speech-to-speech excels and where its limitations lie

S2S models excel at short, conversational interactions that require a high degree of naturalness, such as small talk, simple inquiries, or topics similar to those covered in FAQs. They currently perform less well in complex, business-critical processes involving multi-step tool calls, authentication, and backend write operations. In these scenarios, single-model architectures quickly fail due to tool-calling errors or a lack of adherence to rules.
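One practical consequence of this split is a hybrid routing strategy: send short conversational turns to the S2S model and route tool-heavy, business-critical requests to the classic cascaded pipeline. The sketch below is a hedged, minimal illustration of that idea; the keyword list and function names are assumptions for demonstration, not a real product API.

```python
# Hypothetical routing sketch: conversational turns go to S2S,
# requests that touch backend operations go to the cascaded pipeline.

TOOL_KEYWORDS = {"cancel", "order", "refund", "account", "payment"}

def needs_tools(transcript: str) -> bool:
    """Naive intent check: does the request mention backend operations?
    A production system would use a proper intent classifier instead."""
    return any(word in transcript.lower() for word in TOOL_KEYWORDS)

def route(transcript: str) -> str:
    """Pick the architecture per turn."""
    return "cascaded_pipeline" if needs_tools(transcript) else "s2s_model"
```

For example, "Please cancel my order" would be routed to the cascaded pipeline (where tool calls can be audited and validated), while small talk stays with the S2S model for lower latency and more natural speech.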


Frequently Asked Questions (FAQ)

Is speech-to-speech generally superior to the classic pipeline?

Not in general. Speech-to-speech is superior in latency and naturalness, but currently has weaknesses in complex tool invocation, adherence to rules, and auditability.

How does speech-to-speech differ from text-to-speech and speech-to-text?

While text-to-speech (TTS) and speech-to-text (STT) simply convert between written and spoken language, speech-to-speech (S2S) converts an audio input directly into a new audio output. In the process, characteristics such as the speaker's voice, emotion, and intonation can be preserved, or even carried over into another language, without relying on an intermediate step of visible text.

–> Back to BOTwiki - The Chatbot Wiki