Voice
Voice AI for regional languages.
Fine-tuned on real conversational data across Indian languages, with more languages to follow.
A research project by en2.ai, building ultra-light, ultra-fast voice AI models.

Training progression
More data, better pronunciation — each iteration trains on more hours of real Marathi speech.

Pronunciation accuracy (%) · Sarvam Bulbul shown as commercial target
Listen to the difference
The same 8 sentences, rendered by every model iteration. This is a work in progress — we are building our own dataset and improving voice quality with each run.
आज पुण्यात पाऊस पडतोय, छत्री घेऊन जा.
It's raining in Pune today, take an umbrella.
How It Works
- We use datasets like Project Vaani to fine-tune speech-to-text models for automatic speech recognition (ASR).
- We fine-tune text-to-speech models on real speech data collected across a language.
- The models learn phonemes and prosody from custom recording styles, then generate speech that preserves the voice characteristics while speaking the regional language fluently.
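As an illustrative sketch of the flow above — every name here (`TOY_G2P`, `g2p`, `synthesize`) is a hypothetical stand-in, not our actual pipeline code:

```python
# Toy grapheme-to-phoneme mapping for a few Devanagari characters.
# A real TTS model learns phonemes and prosody from recorded speech
# rather than from a hand-written table like this.
TOY_G2P = {"आ": "aa", "ज": "j", "प": "p", "ऊ": "uu"}

def g2p(text: str) -> list[str]:
    """Map each known character to a phoneme symbol; skip the rest."""
    return [TOY_G2P[ch] for ch in text if ch in TOY_G2P]

def synthesize(phonemes: list[str]) -> bytes:
    """Stand-in for the TTS model: returns fake 'audio' bytes."""
    return " ".join(phonemes).encode("utf-8")

audio = synthesize(g2p("आज"))
print(audio)  # b'aa j'
```

The real model replaces both steps with learned components, but the shape of the pipeline — text in, phoneme-level representation, waveform out — is the same.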
Architecture
The entire stack — model inference, database, and application — runs on a single Intel Granite Rapids server. No GPUs required.
Custom TTS Model
Fine-tuned on Marathi speech, quantized to INT8 with Intel OpenVINO
Intel Xeon 6 · Granite Rapids
AMX-accelerated matrix ops deliver fast INT8 inference — no GPU hardware needed
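For intuition, here is the arithmetic behind symmetric INT8 quantization, the general scheme that post-training quantization tools such as OpenVINO's are built on. This is illustrative math only, not OpenVINO's API:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to int8 range [-127, 127] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)
print(q)  # [50, -127, 2]
```

Storing 8-bit integers instead of 32-bit floats shrinks the model roughly 4x, and AMX executes the resulting INT8 matrix multiplies natively on the CPU.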
WhatsApp Delivery
User sends text via WhatsApp, receives a voice note back in seconds
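The round trip above can be sketched as a webhook handler feeding an async worker. All names here (`on_incoming_message`, `tts`, `worker`) are hypothetical stand-ins, not the real service code:

```python
import queue

# Incoming messages wait here; nothing in this flow needs
# real-time latency, so a simple queue is enough.
jobs: "queue.Queue[str]" = queue.Queue()

def on_incoming_message(text: str) -> None:
    """Webhook entry point: enqueue the text for asynchronous synthesis."""
    jobs.put(text)

def tts(text: str) -> bytes:
    """Stand-in for INT8 model inference on the Xeon server."""
    return f"OGG:{text}".encode("utf-8")

def worker() -> list[bytes]:
    """Drain the queue and produce voice notes to send back."""
    notes = []
    while not jobs.empty():
        notes.append(tts(jobs.get()))
    return notes

on_incoming_message("It's raining in Pune today")
print(worker())  # [b"OGG:It's raining in Pune today"]
```

Because the user is waiting seconds, not milliseconds, the queue can absorb bursts and the CPU can batch work — which is exactly why this use case fits CPU inference.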
Why CPU over GPU?
Traditional voice AI requires GPUs for real-time concurrency. But asynchronous use cases like WhatsApp TTS don't need sub-millisecond latency — they need reliable, cost-effective throughput.
Granite Rapids delivers this at a fraction of the cost. One server handles inference, storage, and the application layer — a deployable container optimized for Xeon, faster and cheaper than GPU alternatives. All voice data stays on-premise for data sovereignty, critical for government and enterprise deployments across India.
