AI4BHARAT

Machine Transliteration

Indic languages are written in a variety of scripts: the Brahmi family of abugida scripts, Arabic-derived abjad scripts, and even the alphabetic Roman script. This diversity makes it challenging to build input mechanisms that are convenient for typing or creating content across these languages and scripts. Most Indian users are comfortable with the Roman keyboard, so a practical solution is to automatically transliterate romanized input into the native script. To enable this, we at AI4Bharat have undertaken the task of creating large-scale transliteration corpora for Indic languages, along with models that transliterate romanized input into native scripts.

Our Contributions

Bhasha-Abhijnaanam Dataset

Bhasha-Abhijnaanam is a language identification test set covering both native-script and romanized text across 22 Indic languages.

Know More →

📜Aksharantar dataset

The largest publicly available parallel transliteration corpus, containing 26M word pairs spanning 21 languages, mined from Wikidata, Samanantar and IndicCorp. It also includes a challenging and diverse benchmark for evaluating transliteration models.

Know More →

🚀IndicXlit model

A multilingual transformer-based model for transliteration from romanized input to native-language scripts, supporting 21 languages. The model is trained on the Aksharantar corpus and, at the time of its release, was the state-of-the-art open-source model as evaluated on Google's Dakshina benchmark and our Aksharantar benchmark.

Know More →

IndicLID

A model for identifying the language of romanized content on social media. This model will be trained using the large number of romanized Indian-language words in Aksharantar.

Know More →

⌨️IndicXlit Keyboard

A transliteration interface (En-Indic online keyboard) that converts romanized text into Indic text as you type.

Try online →

💻IndicXlit Converter

A transliteration interface that converts Indic text into romanized text and vice versa.

Try online →

📱IndicSwipe

Ongoing research on swipe-based keypad typing for Indic languages on Android devices.

Know More →

Our Partners

Pratham Books

We have partnered with Pratham Books to enable romanized keyboards in their translation interface for low-resource languages such as Bodo, Kashmiri, Konkani, Maithili, Nepali and Urdu.

IndicXlit Workshop

On 28th July, we are conducting a workshop to demonstrate our datasets, models, and applications.

Learn More