TTS Latency Report 2025: Google vs. Microsoft voices

As voicebots spread through customer service and telephony, text-to-speech (TTS) latency is becoming a critical factor. Users expect a real-time response and every millisecond counts, especially in telephone conversations where silence feels like failure.

To ensure the optimal performance of our phonebots at BOTfriends, we conducted a comprehensive benchmark in which we compared the TTS voices of Google Cloud and Microsoft Azure for German (de-DE). Our goal: to identify the fastest and most reliable voices for different message types.

Test setup

Each voice was tested in three use cases:

Short message - 1 sentence
Long message - 3-5 sentences
Multiple messages - 3 consecutive short messages

Each test case was run three times per voice, and the average value (in milliseconds) was recorded to reduce the influence of outliers. We analyzed both individual voices and voice types.
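
The procedure above can be sketched roughly as follows. This is a minimal illustration, not the harness used for this report: `benchmark_voice`, the `synthesize` callable and the sample texts are hypothetical names, and the sketch times the complete synthesis call, whereas the report does not state which point of the request was measured or how the three consecutive messages were aggregated. In practice, `synthesize` would wrap a request to the Google Cloud or Azure TTS service.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical sample inputs for the three use cases described above.
TEST_CASES: Dict[str, List[str]] = {
    "short": ["Hallo, wie kann ich Ihnen helfen?"],
    "long": [
        "Vielen Dank für Ihren Anruf. Ich prüfe das gerne für Sie. "
        "Bitte haben Sie einen Moment Geduld."
    ],
    "multiple": ["Einen Moment bitte.", "Ich schaue nach.", "Danke für Ihre Geduld."],
}

RUNS_PER_CASE = 3  # the report averaged three runs per test case

Synthesizer = Callable[[str, str], object]  # (voice name, text) -> audio


def time_call_ms(synthesize: Synthesizer, voice: str, text: str) -> float:
    """Time one synthesis call and return the duration in milliseconds."""
    start = time.perf_counter()
    synthesize(voice, text)
    return (time.perf_counter() - start) * 1000.0


def benchmark_voice(
    synthesize: Synthesizer, voice: str, runs: int = RUNS_PER_CASE
) -> Dict[str, float]:
    """Return the average latency in ms per test case for one voice."""
    results: Dict[str, float] = {}
    for case, messages in TEST_CASES.items():
        run_means = []
        for _ in range(runs):
            # Assumption: a run's latency is the mean latency per message.
            run_means.append(
                mean(time_call_ms(synthesize, voice, m) for m in messages)
            )
        results[case] = mean(run_means)
    return results


if __name__ == "__main__":
    # Stand-in synthesizer so the sketch runs without credentials; replace it
    # with a real SDK call to measure anything meaningful.
    def fake_synthesize(voice: str, text: str) -> bytes:
        return text.encode("utf-8")

    print(benchmark_voice(fake_synthesize, "de-DE-Neural2-G"))
```

Passing the synthesizer in as a callable keeps the timing logic identical for both providers, so differences in the results come from the service rather than from the harness.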

Summary of the test results

Voice provider	Fastest voice type	Slowest voice type
Google	Neural2	Chirp
Microsoft	Neural	DragonHD

Best overall performance

Google's Neural2 voices and Microsoft's Neural voices consistently delivered the lowest latency.
Google's Neural2-G and Standard-D voices performed exceptionally well in all scenarios.
Microsoft's KatjaNeural and KillianNeural stood out for their responsiveness.

Least suitable for real-time use

The Google Chirp3-HD voices had the highest latency, at up to 3.5 seconds for long messages.
Microsoft's DragonHDLatestNeural voices were also slow, at 354 ms or more for short messages.

Detailed results

📊 Google TTS Voice Latency (ms)

Voice type	Short message	Long message	Multiple messages
Standard	159.96	468.83	153.60
Neural2	🥇 101.17	🥇 133.50	🥇 82.67
Wavenet	324.04	951.12	210.37
Chirp	🚨 614.12	🚨 3436.52	🚨 525.82

Top performers:

de-DE-Standard-D - 71.00 ms (short), 103.00 ms (long)
de-DE-Neural2-H - 81.67 ms (short), 154.33 ms (long)
de-DE-Neural2-G - 81.89 ms (multiple messages)

📊 Microsoft TTS Voice Latency (ms)

Voice type	Short message	Long message	Multiple messages
Neural	🥇 104.71	135.52	🥇 113.13
MultilingualNeural	120.00	153.34	163.00
DragonHDLatestNeural	🚨 356.00	403.84	🚨 342.61

Top performers:

de-DE-GiselaNeural - 🥇 59.33 ms (short)
de-DE-KatjaNeural - 64.00 ms (short), 83.33 ms (multiple)
de-DE-KillianNeural - 80.00 ms (long)

Interpretation of the figures

Why latency is important:

Lower latency = faster response time during calls.
High TTS latency causes awkward pauses and impairs the user experience.
A sequence of several short messages mimics the rhythm of a real conversation, which makes the multiple-messages metric especially relevant.

Neural models come out on top:

The neural voices of both providers outperform premium "HD" models such as Chirp and DragonHD in terms of speed.
In telephone-based systems, fast response times matter more than very natural-sounding speech.
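
To put the gap into numbers, the type averages from the tables above can be turned into slowdown factors. The sketch below is only illustrative arithmetic on the published averages; the `slowdown` helper and the dictionary names are hypothetical. On these figures, the Chirp average for long messages is roughly 26 times the Neural2 average, while DragonHDLatestNeural is roughly 3 times slower than Microsoft Neural.

```python
from typing import Dict

Scenario = Dict[str, float]

# Average latency per voice type in milliseconds, copied from the tables above.
GOOGLE: Dict[str, Scenario] = {
    "Neural2": {"short": 101.17, "long": 133.50, "multiple": 82.67},
    "Chirp": {"short": 614.12, "long": 3436.52, "multiple": 525.82},
}
MICROSOFT: Dict[str, Scenario] = {
    "Neural": {"short": 104.71, "long": 135.52, "multiple": 113.13},
    "DragonHDLatestNeural": {"short": 356.00, "long": 403.84, "multiple": 342.61},
}


def slowdown(slow: Scenario, fast: Scenario) -> Scenario:
    """Return how many times slower `slow` is than `fast`, per scenario."""
    return {case: slow[case] / fast[case] for case in fast}


if __name__ == "__main__":
    comparisons = {
        "Google Chirp vs. Neural2": slowdown(GOOGLE["Chirp"], GOOGLE["Neural2"]),
        "Microsoft DragonHD vs. Neural": slowdown(
            MICROSOFT["DragonHDLatestNeural"], MICROSOFT["Neural"]
        ),
    }
    for label, factors in comparisons.items():
        print(label, {case: round(f, 1) for case, f in factors.items()})
```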

Recommendations for voice bot developers

If you are developing voice bots for real-time interactions (e.g. customer service hotlines or IVRs), we strongly recommend the following:

✅ Use these voices for speed:

Google de-DE-Neural2-G / H
Google Standard-D / F
Microsoft KatjaNeural / KillianNeural

❌ Avoid these voices for real-time use:

Google Chirp3-HD-* voices
Microsoft DragonHDLatestNeural voices

These high-latency voices can still be useful in non-interactive use cases or where ultra-high quality matters more than speed.
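
One way to apply these recommendations in a project is to filter voice types against a latency budget. The following is a minimal sketch under stated assumptions: the 200 ms budget, the `VoiceType` class and the `within_budget` function are hypothetical and not part of the benchmark, and the figures are the type averages from the tables above. Type averages can hide fast individual voices (Standard-D, for example, is much faster than the Standard average), so per-voice measurements are preferable where available.

```python
from typing import List, NamedTuple, Sequence


class VoiceType(NamedTuple):
    provider: str
    name: str
    short_ms: float
    long_ms: float
    multiple_ms: float

    @property
    def worst_ms(self) -> float:
        """Highest average latency across the three test cases."""
        return max(self.short_ms, self.long_ms, self.multiple_ms)


# Type averages in milliseconds, copied from the tables in this report.
VOICE_TYPES: List[VoiceType] = [
    VoiceType("Google", "Standard", 159.96, 468.83, 153.60),
    VoiceType("Google", "Neural2", 101.17, 133.50, 82.67),
    VoiceType("Google", "Wavenet", 324.04, 951.12, 210.37),
    VoiceType("Google", "Chirp", 614.12, 3436.52, 525.82),
    VoiceType("Microsoft", "Neural", 104.71, 135.52, 113.13),
    VoiceType("Microsoft", "MultilingualNeural", 120.00, 153.34, 163.00),
    VoiceType("Microsoft", "DragonHDLatestNeural", 356.00, 403.84, 342.61),
]


def within_budget(
    budget_ms: float, voice_types: Sequence[VoiceType] = VOICE_TYPES
) -> List[VoiceType]:
    """Return voice types whose worst-case average stays within the budget,
    fastest first."""
    candidates = [v for v in voice_types if v.worst_ms <= budget_ms]
    return sorted(candidates, key=lambda v: v.worst_ms)


if __name__ == "__main__":
    # Hypothetical budget; pick a value that fits your own call flow.
    for voice in within_budget(200.0):
        print(f"{voice.provider} {voice.name}: worst case {voice.worst_ms:.2f} ms")
```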

Concluding thoughts

Our benchmark clearly shows that not all TTS voices are created equal. With their neural voices, both Google and Microsoft offer high-performance, low-latency options that are suitable for modern phone bots.
At BOTfriends, we strive to deliver fast, natural voice experiences, and tests like this help ensure we are working with the best tools available.
