As voicebots spread through customer service and telephony, text-to-speech (TTS) latency is becoming a critical factor. Users expect a real-time response, and every millisecond counts - especially in telephone conversations, where silence feels like failure.

To ensure the optimal performance of our phonebots at BOTfriends, we conducted a comprehensive benchmark in which we compared the TTS voices of Google Cloud and Microsoft Azure for German (de-DE). Our goal: to identify the fastest and most reliable voices for different message types.

Test setup

Each voice was tested in three use cases:

  1. Short message - 1 sentence
  2. Long message - 3-5 sentences
  3. Multiple messages - 3 consecutive short messages

Each test case was run three times per voice, and the average latency (in milliseconds) was recorded to reduce the influence of outliers. We analyzed both individual voices and voice types.
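The averaging procedure can be sketched in Python as follows. This is a minimal sketch, not the harness used for this benchmark: the `synthesize` callables, the sample texts, and the decision to average per-message latency in the multiple-message case are all assumptions, since the measurement code is not described here.

```python
import statistics
import time
from typing import Callable, Dict, List

# Hypothetical sample texts for the three use cases (the real test texts are not published).
TEST_CASES: Dict[str, List[str]] = {
    "short": ["Guten Tag, wie kann ich Ihnen helfen?"],
    "long": ["Vielen Dank für Ihren Anruf. Ich habe Ihre Anfrage notiert. "
             "Ein Mitarbeiter meldet sich in Kürze bei Ihnen."],
    "multiple": ["Einen Moment bitte.", "Ich prüfe das.", "Danke für Ihre Geduld."],
}


def mean_latency_ms(synthesize: Callable[[str], bytes],
                    messages: List[str], runs: int = 3) -> float:
    """Average wall-clock time per synthesized message, in milliseconds."""
    samples = []
    for _ in range(runs):
        for text in messages:
            start = time.perf_counter()
            synthesize(text)  # assumed to block until the audio is available
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def benchmark(voices: Dict[str, Callable[[str], bytes]],
              runs: int = 3) -> Dict[str, Dict[str, float]]:
    """Run every test case `runs` times per voice and return the averages."""
    return {
        name: {case: mean_latency_ms(fn, msgs, runs)
               for case, msgs in TEST_CASES.items()}
        for name, fn in voices.items()
    }
```

To use it, each entry in `voices` would wrap one provider call for one voice, for example a function that sends the text to the TTS API and returns the audio bytes.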

Summary of the test results

Voice provider | Fastest voice type | Slowest voice type
Google | Neural2 | Chirp
Microsoft | Neural | DragonHD

Best overall performance

  • The Google Neural2 and Microsoft Neural voices consistently delivered the lowest latency.
  • Google's Neural2-G and Standard-D voices performed exceptionally well in all scenarios.
  • Microsoft's KatjaNeural and KillianNeural stood out for their responsiveness.

Least suitable for real-time use

  • The Google Chirp3-HD voices had the highest latency, at up to 3.5 seconds for long messages.
  • Microsoft's DragonHDLatestNeural voices were also slow, at 354 ms and more for short messages.

Detailed results

📊 Google TTS Voice Latency (ms)

Voice type | Short message | Long message | Multiple messages
Standard | 159.96 | 468.83 | 153.60
Neural2 | 🥇 101.17 | 🥇 133.50 | 🥇 82.67
Wavenet | 324.04 | 951.12 | 210.37
Chirp | 🚨 614.12 | 🚨 3436.52 | 🚨 525.82

Top performers:

  • de-DE-Standard-D - 71.00 ms (short), 103.00 ms (long)
  • de-DE-Neural2-H - 81.67 ms (short), 154.33 ms (long)
  • de-DE-Neural2-G - 81.89 ms (multiple messages)

📊 Microsoft TTS Voice Latency (ms)

Voice type | Short message | Long message | Multiple messages
Neural | 🥇 104.71 | 135.52 | 🥇 113.13
MultilingualNeural | 120.00 | 153.34 | 163.00
DragonHDLatestNeural | 🚨 356.00 | 403.84 | 🚨 342.61

Top performers:

  • de-DE-GiselaNeural - 🥇 59.33 ms (short)
  • de-DE-KatjaNeural - 64.00 ms (short), 83.33 ms (multiple)
  • de-DE-KillianNeural - 80.00 ms (long)

Interpretation of the figures

Why latency is important:

  • Lower latency = faster response time during calls.
  • High TTS latency causes unpleasant pauses and impairs the user experience.
  • Sequences of short messages mimic the rhythm of a real conversation, which makes the multiple-message metric especially relevant.

Neural models come out on top:

  • The neural voices of both providers outperform premium "HD" models such as Chirp and DragonHD in terms of speed.
  • In telephone-based systems, fast response times outweigh the need for very natural-sounding speech.
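To put a number on that speed gap, the per-type averages from the tables above can be turned into slowdown factors. The following is a small illustrative calculation; the helper name `slowdown` is our own and the figures are copied from the result tables.

```python
# Average latency in ms per voice type, from the tables above: (short, long, multiple).
GOOGLE = {
    "Neural2": (101.17, 133.50, 82.67),
    "Chirp": (614.12, 3436.52, 525.82),
}
MICROSOFT = {
    "Neural": (104.71, 135.52, 113.13),
    "DragonHDLatestNeural": (356.00, 403.84, 342.61),
}


def slowdown(fast, slow):
    """How many times slower `slow` is than `fast`, per use case."""
    return tuple(round(s / f, 1) for f, s in zip(fast, slow))


print(slowdown(GOOGLE["Neural2"], GOOGLE["Chirp"]))
# (6.1, 25.7, 6.4)
print(slowdown(MICROSOFT["Neural"], MICROSOFT["DragonHDLatestNeural"]))
# (3.4, 3.0, 3.0)
```

On these averages, Chirp is roughly 6 times slower than Neural2 for short and multiple messages and about 26 times slower for long messages, while DragonHDLatestNeural is about 3 times slower than Microsoft Neural across the board.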

Recommendations for voice bot developers

If you are developing voice bots for real-time interactions (e.g. customer service hotlines or IVRs), we strongly recommend the following:

Use these voices for speed:

  • Google de-DE-Neural2-G / H
  • Google Standard-D / F
  • Microsoft KatjaNeural / KillianNeural

Avoid these voices for real-time use:

  • Google Chirp3-HD-* voices
  • Microsoft DragonHDLatestNeural voices

These high latency voices can still be useful in non-interactive use cases or in cases where ultra-high quality is more important than speed.
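One way to apply these recommendations is to filter voice types against a latency budget. The sketch below uses the per-type averages from the result tables; the 200 ms budget and the function name `within_budget` are illustrative assumptions, not thresholds used in the benchmark.

```python
from typing import Dict, List, Tuple

# (short, long, multiple) average latency in ms per voice type, from the result tables.
LATENCY_MS: Dict[str, Tuple[float, float, float]] = {
    "Google Standard": (159.96, 468.83, 153.60),
    "Google Neural2": (101.17, 133.50, 82.67),
    "Google Wavenet": (324.04, 951.12, 210.37),
    "Google Chirp": (614.12, 3436.52, 525.82),
    "Microsoft Neural": (104.71, 135.52, 113.13),
    "Microsoft MultilingualNeural": (120.00, 153.34, 163.00),
    "Microsoft DragonHDLatestNeural": (356.00, 403.84, 342.61),
}


def within_budget(budget_ms: float) -> List[str]:
    """Voice types whose slowest use case stays within the budget, fastest first."""
    ok = [(max(v), name) for name, v in LATENCY_MS.items() if max(v) <= budget_ms]
    return [name for _, name in sorted(ok)]


print(within_budget(200.0))
# ['Google Neural2', 'Microsoft Neural', 'Microsoft MultilingualNeural']
```

Note that type-level averages can hide fast individual voices: Google Standard fails a 200 ms budget as a type because of its long-message average, even though de-DE-Standard-D was among the top performers.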

Concluding thoughts

Our benchmarking clearly shows that not all TTS voices are created equal. With their neural voice models, both Google and Microsoft offer high-performance, low-latency options that are suitable for modern phone bots.
At BOTfriends, we strive to deliver fast, natural voice experiences - and tests like this ensure we're working with the best tools available.