Voice

Voice AI for regional languages.

Fine-tuned on real conversational data across Indian languages, with more languages to follow.

A research project by en2.ai - building ultra-light, ultra-fast voice AI models.

Powered by Intel

Training progression

More data, better pronunciation — each iteration trains on more hours of real Marathi speech.

Base 92.9 · v1 94.3 · v2 96.6 · v2.1 97.4 · v3 95.5 · v4 97.9 · v4.1 98.1 · Sarvam target 98.7

Pronunciation accuracy (%) · Sarvam Bulbul shown as commercial target

Listen to the difference

The same 8 sentences across every model iteration. Work in progress — we are building our own dataset and improving voice quality with each run.

Short conversational

आज पुण्यात पाऊस पडतोय, छत्री घेऊन जा.

It's raining in Pune today, take an umbrella.

Stock model · v1 · v2 · v2.1 · v3 · v4 · v4.1 · Sarvam target

How It Works

  • We use datasets like Project Vaani to fine-tune speech-to-text models for automatic speech recognition.
  • We fine-tune text-to-speech models on real speech data collected across a language.
  • The models learn phonemes and prosody from custom recording styles, then generate speech that preserves the voice characteristics while speaking the regional language fluently.
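The steps above can be sketched as a toy data-preparation stage. Everything here is illustrative: `TOY_G2P`, `phonemize`, and `PhonemeSample` are hypothetical names, and a real pipeline would use a full Marathi grapheme-to-phoneme model rather than a four-entry table.

```python
# Illustrative sketch of the fine-tuning data flow: text is mapped to
# phonemes, which the TTS model learns to render with natural prosody.
from dataclasses import dataclass

# Toy grapheme-to-phoneme table for a few Devanagari characters;
# a real pipeline would cover the full Marathi inventory.
TOY_G2P = {"प": "p", "ा": "aː", "ऊ": "uː", "स": "s"}

@dataclass
class PhonemeSample:
    text: str       # original Marathi text
    phonemes: list  # phoneme sequence the TTS model trains on

def phonemize(text: str) -> PhonemeSample:
    """Map each known character to its phoneme; skip unknowns."""
    phonemes = [TOY_G2P[ch] for ch in text if ch in TOY_G2P]
    return PhonemeSample(text=text, phonemes=phonemes)

sample = phonemize("पाऊस")  # "rain" in Marathi
print(sample.phonemes)       # → ['p', 'aː', 'uː', 's']
```

In a real training run, each `PhonemeSample` would be paired with its recorded audio so the model can learn pronunciation and prosody jointly.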

Architecture

The entire stack — model inference, database, and application — runs on a single Intel Granite Rapids server. No GPUs required.

Custom TTS Model

Fine-tuned on Marathi speech, quantized to INT8 with Intel OpenVINO
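The INT8 quantization mentioned above can be illustrated with a minimal, stdlib-only sketch of symmetric per-tensor quantization — the basic arithmetic that tooling like OpenVINO applies per layer, not en2.ai's actual conversion code.

```python
# Minimal sketch of symmetric INT8 weight quantization: each float weight
# is mapped to an 8-bit integer plus one shared scale factor.

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float -> int8 values + scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0  # symmetric scheme uses the [-127, 127] range
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2)
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Storing weights as int8 cuts model size roughly 4x versus float32, which is what lets AMX-equipped Xeon cores run inference efficiently without a GPU.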

Intel Xeon 6 · Granite Rapids

AMX-accelerated matrix ops deliver fast INT8 inference — no GPU hardware needed

WhatsApp Delivery

User sends text via WhatsApp, receives a voice note back in seconds
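The request flow above can be sketched as an asynchronous queue: the webhook enqueues incoming text, a worker synthesizes audio, and the bot replies with a voice note. All names here are hypothetical, and `synthesize` is a stub standing in for the real Marathi TTS model and the WhatsApp API call.

```python
# Hypothetical sketch of the async WhatsApp delivery flow.
import asyncio

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)            # stands in for seconds of CPU inference
    return f"OGG:{text}".encode()     # placeholder for real Opus audio bytes

async def handle_message(queue: asyncio.Queue, text: str):
    """Webhook handler: enqueue the user's text and return immediately."""
    await queue.put(text)

async def worker(queue: asyncio.Queue, replies: list):
    """Background worker: drain the queue and produce voice notes."""
    while True:
        text = await queue.get()
        audio = await synthesize(text)
        replies.append(audio)         # a real bot would POST to the WhatsApp API
        queue.task_done()

async def main():
    queue, replies = asyncio.Queue(), []
    task = asyncio.create_task(worker(queue, replies))
    await handle_message(queue, "आज पुण्यात पाऊस पडतोय")
    await queue.join()                # wait until the voice note is ready
    task.cancel()
    return replies

print(asyncio.run(main()))
```

Because the reply is delivered as a voice note rather than streamed, a few seconds of queueing is invisible to the user — which is exactly what makes CPU inference viable here.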

Why CPU over GPU?

Traditional voice AI requires GPUs for real-time concurrency. But asynchronous use cases like WhatsApp TTS don't need sub-millisecond latency — they need reliable, cost-effective throughput.

Granite Rapids delivers this at a fraction of the cost. One server handles inference, storage, and the application layer — a deployable container optimized for Xeon, faster and cheaper than GPU alternatives. All voice data stays on-premise for data sovereignty, critical for government and enterprise deployments across India.
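The throughput argument can be made concrete with back-of-envelope arithmetic. The 2 seconds per note and 8 parallel workers below are illustrative assumptions, not measured en2.ai numbers.

```python
# Back-of-envelope throughput for the async use case.
SECONDS_PER_NOTE = 2.0   # assumed CPU synthesis time per voice note
WORKERS = 8              # assumed parallel inference processes on one server

notes_per_hour = WORKERS * 3600 / SECONDS_PER_NOTE
print(int(notes_per_hour))  # → 14400 voice notes/hour from one CPU server
```

Under these assumptions a single server clears over fourteen thousand voice notes an hour — far beyond what a WhatsApp bot needs sub-millisecond GPU latency for.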