With the growing use of voicebots in customer service and telephony, text-to-speech (TTS) latency is becoming a critical factor. Users expect a real-time response, and every millisecond counts - especially in telephone conversations, where silence feels like failure.
To ensure the optimal performance of our phonebots at BOTfriends, we conducted a comprehensive benchmark in which we compared the TTS voices of Google Cloud and Microsoft Azure for German (de-DE). Our goal: to identify the fastest and most reliable voices for different message types.
Test setup
Each voice was tested in three use cases:
- Short message - 1 sentence
- Long message - 3-5 sentences
- Multiple messages - 3 consecutive short messages
Each test case was run three times per voice, and the average latency (in milliseconds) was recorded to reduce the impact of outliers. We analyzed both individual voices and voice types.
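The procedure above could be sketched roughly as follows. This is an illustration only, not the harness used for the benchmark: `synthesize` is a hypothetical placeholder for whatever client call returns the audio, and both the measured quantity (wall-clock time until the call returns) and the per-message averaging in the multiple-messages case are assumptions.

```python
import statistics
import time
from typing import Callable, Dict, List

# Hypothetical signature: (voice name, text) -> audio bytes.
# Plug in the real TTS client call of your provider here.
Synthesize = Callable[[str, str], bytes]


def time_call_ms(synthesize: Synthesize, voice: str, text: str) -> float:
    """Return the wall-clock duration of one synthesis call in milliseconds."""
    start = time.perf_counter()
    synthesize(voice, text)
    return (time.perf_counter() - start) * 1000.0


def benchmark_voice(
    synthesize: Synthesize,
    voice: str,
    cases: Dict[str, List[str]],
    runs: int = 3,
) -> Dict[str, float]:
    """Run every test case `runs` times and return the mean latency per case."""
    results: Dict[str, float] = {}
    for case_name, messages in cases.items():
        samples = []
        for _ in range(runs):
            # Assumption: a run with several messages is scored by the
            # average latency per message, sent one after another.
            per_message = [time_call_ms(synthesize, voice, m) for m in messages]
            samples.append(statistics.mean(per_message))
        results[case_name] = statistics.mean(samples)
    return results
```

Calling `benchmark_voice(my_client_call, "de-DE-Neural2-G", cases)` with a `cases` dictionary holding one short message, one long message, and three consecutive short messages would yield one average per use case, matching the three columns reported in the detailed results.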
Summary of the test results
Voice provider | Fastest voice type | Slowest voice type |
---|---|---|
Google | Neural2 | Chirp |
Microsoft | Neural | DragonHD |
Best overall performance
- Google's Neural2 voices and Microsoft's Neural voices consistently delivered the lowest latency.
- Google's Neural2-G and Standard-D voices performed exceptionally well in all scenarios.
- Microsoft's KatjaNeural and KillianNeural stood out for their responsiveness.
Least suitable for real-time use
- The Google Chirp3-HD voices had the highest latency, at up to 3.5 seconds for long messages.
- Microsoft's DragonHDLatestNeural voices were also slow, at 354 ms or more for short messages.
Detailed results
📊 Google TTS Voice Latency (ms)
Voice type | Short message | Long message | Multiple messages |
---|---|---|---|
Standard | 159.96 | 468.83 | 153.60 |
Neural | 🥇 101.17 | 🥇 133.50 | 🥇 82.67 |
Wavenet | 324.04 | 951.12 | 210.37 |
Chirp | 🚨 614.12 | 🚨 3436.52 | 🚨 525.82 |
Top performers:
- de-DE-Standard-D - 71.00 ms (short), 103.00 ms (long)
- de-DE-Neural2-H - 81.67 ms (short), 154.33 ms (long)
- de-DE-Neural2-G - 81.89 ms (multiple messages)
📊 Microsoft TTS Voice Latency (ms)
Voice type | Short message | Long message | Multiple messages |
---|---|---|---|
Neural | 🥇 104.71 | 135.52 | 🥇 113.13 |
MultilingualNeural | 120.00 | 153.34 | 163.00 |
DragonHDLatestNeural | 🚨 356.00 | 403.84 | 🚨 342.61 |
Top performers:
- de-DE-GiselaNeural - 🥇 59.33 ms (short)
- de-DE-KatjaNeural - 64.00 ms (short), 83.33 ms (multiple)
- de-DE-KillianNeural - 80.00 ms (long)
Interpretation of the figures
Why latency is important:
- Lower latency = faster response time during calls.
- High TTS latency causes unpleasant pauses and impairs the user experience.
- Several short consecutive messages mimic the rhythm of a real conversation, which makes the multiple-messages metric especially relevant.
Neural models lead on speed:
- The neural voices of both providers outperform premium "HD" models such as Chirp and DragonHD in terms of speed.
- With telephone-based systems, fast response times outweigh the need for very natural-sounding speech.
Recommendations for voice bot developers
If you are developing voice bots for real-time interactions (e.g. customer service hotlines or IVRs), we strongly recommend the following:
✅ Use these voices for speed:
- Google de-DE-Neural2-G / H
- Google Standard-D / F
- Microsoft KatjaNeural / KillianNeural
❌ Avoid these voices for real-time use:
- Google Chirp3-HD-* Voices
- Microsoft DragonHDLatestNeural voices
These high-latency voices can still be useful in non-interactive use cases or where very high audio quality matters more than speed.
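One way to apply these recommendations is to filter voice types against a latency budget. The sketch below uses the per-type averages from the result tables above. The 200 ms budget in the usage note and the rule of judging each type by its slowest scenario are illustrative assumptions, not part of the benchmark.

```python
from typing import Dict, List, Tuple

# Average latency in ms per (provider, voice type), copied from the tables above.
LATENCY_MS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("Google", "Standard"): {"short": 159.96, "long": 468.83, "multiple": 153.60},
    ("Google", "Neural"): {"short": 101.17, "long": 133.50, "multiple": 82.67},
    ("Google", "Wavenet"): {"short": 324.04, "long": 951.12, "multiple": 210.37},
    ("Google", "Chirp"): {"short": 614.12, "long": 3436.52, "multiple": 525.82},
    ("Microsoft", "Neural"): {"short": 104.71, "long": 135.52, "multiple": 113.13},
    ("Microsoft", "MultilingualNeural"): {"short": 120.00, "long": 153.34, "multiple": 163.00},
    ("Microsoft", "DragonHDLatestNeural"): {"short": 356.00, "long": 403.84, "multiple": 342.61},
}


def within_budget(budget_ms: float) -> List[Tuple[str, str]]:
    """Return voice types whose slowest scenario stays within the budget, fastest first."""
    candidates = []
    for voice_type, scenarios in LATENCY_MS.items():
        worst = max(scenarios.values())  # judge each type by its slowest scenario
        if worst <= budget_ms:
            candidates.append((worst, voice_type))
    return [voice_type for _, voice_type in sorted(candidates)]
```

With an assumed budget of 200 ms, `within_budget(200.0)` returns Google Neural, Microsoft Neural, and Microsoft MultilingualNeural, in that order. Note that these are type-level averages: individual voices such as de-DE-Standard-D were considerably faster than the average for their type.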
Concluding thoughts
Our benchmark clearly shows that not all TTS voices are created equal. With their neural voices, both Google and Microsoft offer high-performance, low-latency options that are suitable for modern phone bots.
At BOTfriends, we strive to deliver fast, natural voice experiences - and tests like this ensure we're working with the best tools available.