The Future of Multilingual AI Voice Agents: Lessons From Our TTS Journey

Introduction

Imagine calling a customer service helpline for urgent assistance. A voice agent answers, but it sounds robotic, mispronounces your name, and struggles to understand when you switch from Hindi to English mid-sentence.

In customer-facing applications where clarity, empathy, and multilingual fluency are critical, AI-powered voice agents can’t afford to sound unnatural or disengaged.

That’s why we set out to find the best Text-to-Speech (TTS) model for our multilingual AI voice reception agent. We tested multiple TTS models and ranked them across several critical parameters to select the best fit for humanlike communication, low latency, and cost-efficiency.

Why Empathy Matters in AI Voice Assistants

Let’s be honest: nobody likes talking to a robot. In moments that are often defined by urgency or frustration – like troubleshooting an issue or rescheduling a service – an emotionless voice agent can make the experience even worse.

We realized that natural, expressive speech isn’t just nice to have; it’s a requirement. Our data shows that voice agents with natural-sounding speech have a 15-20% higher customer satisfaction rating compared to robotic-sounding alternatives.

The Multilingual Dilemma

India’s linguistic diversity is vast – customers may start speaking in Hindi but switch to English mid-sentence. A good multilingual AI voice agent needs to handle this effortlessly.

The TTS solution also needed to support future expansion for additional Indian languages, such as Telugu and Tamil, by accommodating language-specific nuances and dialects.

The problem? Many TTS models struggle with fluid language switching, especially when pronouncing technical terms, regional dialects, and names.
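To make the language-switching problem concrete, here is a minimal sketch of how mixed Hindi-English text could be split into per-language runs before synthesis. It assumes the Hindi portion is written in Devanagari and the English portion in Latin script; the function names `script_of` and `segment_by_script` are hypothetical, and this heuristic will not catch romanized Hindi, which is common in practice.

```python
def script_of(ch: str) -> str:
    """Classify a character as Devanagari ('hi'), ASCII Latin ('en'), or neutral."""
    if "\u0900" <= ch <= "\u097f":  # Devanagari Unicode block
        return "hi"
    if ch.isascii() and ch.isalpha():
        return "en"
    return "neutral"  # spaces, digits, punctuation


def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (language_tag, chunk) runs.

    Neutral characters stay attached to whichever run is currently open,
    so joining the chunks back together reproduces the input exactly.
    """
    segments: list[tuple[str, str]] = []
    current_lang = None
    buffer = ""
    for ch in text:
        lang = script_of(ch)
        if lang == "neutral" or lang == current_lang:
            buffer += ch
        elif current_lang is None:
            # First scripted character decides the language of the opening run.
            current_lang = lang
            buffer += ch
        else:
            # Script changed: close the open run and start a new one.
            segments.append((current_lang, buffer))
            current_lang, buffer = lang, ch
    if buffer:
        # Text with no scripted characters defaults to 'en' (an assumption).
        segments.append((current_lang or "en", buffer))
    return segments


if __name__ == "__main__":
    for lang, chunk in segment_by_script("मेरा appointment कल है"):
        print(lang, repr(chunk))
```

Each run could then be routed to a language-appropriate voice or tagged in SSML. A production system would more likely rely on a language-identification model than on script ranges.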

Latency Considerations

Businesses receive thousands of calls daily, and customers have neither the time nor the patience to deal with delayed responses. We needed a TTS model that could maintain low response times even under heavy load, especially during peak hours (typically 9:00 AM to 5:00 PM on weekdays).

The Cost Factor

Innovation is great, but let’s talk about the elephant in the room – cost. Businesses, especially in developing regions, need an AI-powered solution that balances performance with affordability.

It was important that the pricing structure (whether usage-based, per-minute, or subscription) suited our projected call volumes, ensuring economic viability as usage scaled.

Shortlisting Potential TTS Solutions

After surveying multiple TTS providers, we shortlisted four models based on:

- Language support (Hindi, English, Telugu, Tamil, etc.)
- Custom lexicon capabilities, i.e., the ease of adding local names and industry-specific vocabulary
- SSML compatibility, i.e., the ability to control intonation and emphasize key words effectively
- Compliance with relevant data protection standards
- Transparent cost structures suited for high call volumes
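As an illustration of what custom lexicon and SSML support buy you, the sketch below builds an SSML string in which lexicon entries are wrapped in the standard `<sub alias="...">` element and key words in `<emphasis>`. The lexicon contents and the `build_ssml` helper are hypothetical, and matching is done on whole space-separated tokens only, which is a simplification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: written form -> how it should be spoken.
LEXICON = {
    "Dr.": "Doctor",
    "OPD": "O P D",
}


def build_ssml(text: str, lexicon: dict, emphasize: frozenset = frozenset()) -> str:
    """Wrap text in <speak>, substituting lexicon entries and emphasizing key words.

    Tokens are matched exactly after splitting on single spaces, so a word with
    trailing punctuation (e.g. "OPD,") will not match its lexicon entry.
    """
    parts = []
    for token in text.split(" "):
        if token in lexicon:
            alias = quoteattr(lexicon[token])  # adds quotes and escapes the value
            parts.append(f"<sub alias={alias}>{escape(token)}</sub>")
        elif token in emphasize:
            parts.append(f'<emphasis level="strong">{escape(token)}</emphasis>')
        else:
            parts.append(escape(token))  # escape &, <, > so the SSML stays valid XML
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    ssml = build_ssml("Dr. Rao is in OPD today", LEXICON, frozenset({"today"}))
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    print(ssml)
```

Whether a given provider honors `<sub>` and `<emphasis>` is exactly what the SSML compatibility criterion is meant to check, so treat the output as a test input rather than a guarantee of behavior.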

The Final Contenders:

- Google Cloud Text-to-Speech
- ElevenLabs
- Waves by smallest.ai
- Sarvam Text-to-Speech

Evaluation Criteria and Rankings

Our evaluation was structured around several key criteria. For each category, we ranked the shortlisted models based on detailed tests using various domain-specific prompts in multiple languages.

We used a custom-built tool which allowed us to:

- Directly compare pronunciation, clarity, and response times.
- Assess language switching and support without external interference.
- Record consistent performance metrics for each model.

This method provided clear, objective data on which model excelled across our critical criteria.
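The tool itself is not reproduced here, but a comparison harness of this kind can be sketched in a few lines. The version below assumes each provider is wrapped in a callable that takes text and returns audio bytes; those wrappers, the `Result` record, and the function names are all hypothetical stand-ins rather than any provider's real client API.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    provider: str
    prompt: str
    latency_ms: float
    audio_bytes: int


def run_benchmark(
    providers: dict[str, Callable[[str], bytes]],
    prompts: list[str],
) -> list[Result]:
    """Send every prompt to every provider and record wall-clock latency."""
    results = []
    for name, synthesize in providers.items():
        for prompt in prompts:
            start = time.perf_counter()
            audio = synthesize(prompt)
            elapsed_ms = (time.perf_counter() - start) * 1000
            results.append(Result(name, prompt, elapsed_ms, len(audio)))
    return results


def mean_latency(results: list[Result]) -> dict[str, float]:
    """Average latency per provider, in milliseconds."""
    by_provider: dict[str, list[float]] = {}
    for r in results:
        by_provider.setdefault(r.provider, []).append(r.latency_ms)
    return {name: statistics.mean(vals) for name, vals in by_provider.items()}


if __name__ == "__main__":
    # Stand-in synthesizers; real ones would call each provider's API.
    fake_providers = {
        "provider_a": lambda text: b"\x00" * len(text),
        "provider_b": lambda text: b"\x00" * (2 * len(text)),
    }
    results = run_benchmark(fake_providers, ["Hello", "नमस्ते, how can I help?"])
    print(mean_latency(results))
```

Running identical prompts through every provider in the same process keeps network conditions and prompt content constant, which is what makes the resulting numbers comparable.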

1. Voice Quality and Naturalness (Score out of 10)

- Google Cloud (9/10): With SSML customization, it delivers very natural voices in various languages.
- ElevenLabs (9.5/10): Slightly edges out Google Cloud with exceptionally realistic voices for conversations.
- Waves (7.5/10): Offers solid voices but sometimes lacks emotion and appropriate pauses.
- Sarvam (8/10): Sounds realistic for Indian accents but can occasionally be overly robotic.

2. Pronunciation Precision & Custom Lexicons (Score out of 10)

- Google Cloud (9/10): Supports SSML and lexicons, which help it convey emphasis and accurately pronounce challenging Indian words.
- ElevenLabs (8/10): Offers fairly natural pronunciation, though it has limited SSML support.
- Sarvam (8.5/10): Lacks customizability, yet delivers natural pronunciation that suits Indian accents.
- Waves (7/10): Offers no customizability and tends to either rush or drag longer prompts.

3. Latency Under Load (Score out of 10)

- ElevenLabs (9.5/10): Fastest processing; its flash and turbo models achieved 100ms latency in our tests.
- Google Cloud (9/10): Also very fast; however, longer SSML texts cause slight delays compared to ElevenLabs, with an average latency of 123ms.
- Waves (8/10): Records a latency of around 150ms but is subject to hourly request limits.
- Sarvam (8.5/10): Considerably fast, with an average latency of 168ms.
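Averages hide the slow tail, which matters most during peak hours. The sketch below shows one way to measure latency under concurrent load and summarize it with a nearest-rank percentile. It assumes a thread pool is an acceptable way to simulate simultaneous callers, and the `synthesize` callable is a hypothetical wrapper around whichever provider is being tested.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def timed_call(synthesize: Callable[[str], bytes], prompt: str) -> float:
    """Return the latency of a single synthesis call in milliseconds."""
    start = time.perf_counter()
    synthesize(prompt)
    return (time.perf_counter() - start) * 1000


def latencies_under_load(
    synthesize: Callable[[str], bytes],
    prompts: list[str],
    concurrency: int = 8,
) -> list[float]:
    """Issue the prompts from `concurrency` worker threads and collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: timed_call(synthesize, p), prompts))


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    def fake_synthesize(text: str) -> bytes:
        time.sleep(0.01)  # stand-in for a network round trip
        return b""

    samples = latencies_under_load(fake_synthesize, ["test prompt"] * 40)
    print("p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
```

Reporting p95 or p99 alongside the mean makes it easier to see whether a provider with a good average still produces occasional long pauses.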

4. Cost Comparison

Provider	TTS Cost	Full Solution Cost*	Cost at Scale**
Google Cloud	₹1.42/min	₹6.51/min	₹5.77/min
Sarvam	₹1.50/min	₹6.59/min	₹5.79/min
Waves (smallest.ai)	₹1.77/min	₹6.87/min	₹5.98/min
ElevenLabs	₹1.74/min	₹12.84/min	₹10.16/min

*Full solution includes TTS, STT (₹0.50/min), server hosting (₹0.89/min), telephony via Ozonetel (₹0.50/min), and LLM costs using Claude 3.5 Sonnet (₹3.13/min).

**Based on 10,000 minutes monthly with a 7:3 ratio of TTS:STT usage.

The substantial cost difference between ElevenLabs and other providers (nearly 100% premium) was a significant factor in our decision, despite its slight edge in voice quality. For a typical enterprise deployment handling 10,000 minutes monthly, this would represent an additional operational expense of approximately ₹756,000 annually.
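The arithmetic behind these comparisons is easy to reproduce. The sketch below takes the full-solution per-minute figures from the table and the 10,000 minutes monthly volume, and computes the premium and the annual difference. Prices are stored in paise (₹1 = 100 paise) purely to avoid floating-point rounding; the function names are hypothetical.

```python
MINUTES_PER_MONTH = 10_000

# Full-solution cost per minute in paise, taken from the table above.
FULL_COST_PAISE = {
    "Google Cloud": 651,
    "Sarvam": 659,
    "Waves (smallest.ai)": 687,
    "ElevenLabs": 1284,
}


def annual_cost_rupees(provider: str, minutes_per_month: int = MINUTES_PER_MONTH) -> float:
    """Yearly full-solution cost in rupees at a fixed monthly volume."""
    return FULL_COST_PAISE[provider] * minutes_per_month * 12 / 100


def annual_difference_rupees(expensive: str, cheap: str) -> float:
    """How much more `expensive` costs per year than `cheap`, in rupees."""
    return annual_cost_rupees(expensive) - annual_cost_rupees(cheap)


def premium_percent(expensive: str, cheap: str) -> float:
    """Per-minute premium of `expensive` over `cheap`, as a percentage."""
    a, b = FULL_COST_PAISE[expensive], FULL_COST_PAISE[cheap]
    return (a - b) / b * 100


if __name__ == "__main__":
    print(round(premium_percent("ElevenLabs", "Google Cloud"), 1))
    print(annual_difference_rupees("ElevenLabs", "Google Cloud"))
```

With the table's figures, this gives a premium of about 97% and an annual difference of ₹759,600, close to the approximate figure quoted above; the small gap may come from rounding in the published per-minute rates.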

It’s important to note that all cost projections are estimations based on current provider pricing and typical usage patterns. Actual costs may vary based on real-world factors such as:

- Changes in provider pricing structures
- Fluctuations in call durations and complexity
- Variations in the TTS:STT usage ratio
- Volume discounts as usage scales
- Seasonal variations in call volumes

5. Language Support (Score out of 10)

- Google Cloud (9.5/10): Supports English with an Indian accent and all Indian languages with various models and voices.
- Sarvam (9/10): Supports English with an Indian accent and all Indian languages, though only for select voices.
- ElevenLabs (7/10): Of the languages we need, supports English, Hindi, and Tamil, with a range of voices and accents available.
- Waves (6.5/10): Supports only English and Hindi.

6. Scalability (Score out of 10)

- Google Cloud (9/10): Excellent support and ample customizability for improving the model for our specific use case.
- Sarvam (7.5/10): Supports all languages; however, it lacks the customizability to add emotional nuance and texture to conversations.
- ElevenLabs (7/10): Offers frequent updates with various voices available for multiple languages, though pricing and limited Indian language support may affect future expansion.
- Waves (6.5/10): Limited language support and lacks additional features to enhance the voice.
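To see how the individual rankings combine, the sketch below averages the five scored criteria for each provider. The scores are copied from the lists above; the equal weighting is an assumption made for illustration, not the weighting used in the actual decision, and cost is left out because it was not scored out of 10.

```python
# Scores out of 10, in the order: voice quality, pronunciation, latency,
# language support, scalability (copied from the sections above).
SCORES = {
    "Google Cloud": [9.0, 9.0, 9.0, 9.5, 9.0],
    "ElevenLabs": [9.5, 8.0, 9.5, 7.0, 7.0],
    "Sarvam": [8.0, 8.5, 8.5, 9.0, 7.5],
    "Waves": [7.5, 7.0, 8.0, 6.5, 6.5],
}

# ASSUMPTION: every criterion counts equally. Adjust to reflect real priorities.
EQUAL_WEIGHTS = [1.0, 1.0, 1.0, 1.0, 1.0]


def weighted_score(scores: list[float], weights: list[float]) -> float:
    """Weighted average of per-criterion scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def rank_providers(weights: list[float] = EQUAL_WEIGHTS) -> list[tuple[str, float]]:
    """Providers sorted from highest to lowest weighted score."""
    totals = {name: weighted_score(vals, weights) for name, vals in SCORES.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_providers():
        print(f"{name}: {score:.2f}")
```

Under equal weights Google Cloud averages 9.1, ahead of Sarvam (8.3), ElevenLabs (8.2), and Waves (7.1). Changing the weights, for example to favor voice quality, is a quick way to test how sensitive the ranking is to priorities.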

Final Verdict: Google Cloud TTS

After rigorous testing, Google Cloud Text-to-Speech emerged as the best solution due to:

- Optimal Balance of Quality and Cost: While ElevenLabs marginally outperformed in voice quality, Google Cloud delivered excellent naturalness at a substantially lower cost (₹6.51 vs ₹12.84 per minute).
- Superior Indian Language Support: With comprehensive coverage of 12+ Indian languages and dialects, Google Cloud significantly outperformed competitors, particularly for the code-switching between languages that’s common in Indian conversations.
- Custom Lexicon Support: Google’s SSML and lexicon capabilities allow precise control over pronunciation, especially for industry-specific terms and regional names.
- Low Latency Under Stress: Google’s robust infrastructure ensures consistent performance under load, with 123ms average latency proving sufficient for natural conversations.
- Cost-Effectiveness at Scale: For our projected volume of 10,000 minutes monthly, Google Cloud represents potential savings of approximately ₹756,000 annually compared to ElevenLabs.

Next Steps and Future Enhancements

We selected a TTS engine that not only meets but also exceeds our expectations.

Here’s what we are planning next:

- Conduct more demonstrations and gather feedback from both domain experts and users to refine the system.
- Add native Telugu and Tamil voices as part of our ongoing efforts to provide fully localized experiences.
- Continue to update the custom dictionary with new terms and local nuances to maintain accuracy.
- Set up systems to track user satisfaction, call durations, and error rates to ensure continuous improvement.

Through iterative updates using SSML, we aim to further refine our voice agent’s tone, making it even more comforting and engaging over time.

Final Thoughts

Our work with voice agents is just the beginning. As voice AI evolves, it will play a crucial role in reducing administrative burdens and improving customer experiences.

With advancements in SSML, emotion AI, and multilingual NLP, we’re just scratching the surface of what AI-powered voice agents can do. The key isn’t just making AI talk – it’s making it communicate naturally and meaningfully.