AI4BHARAT

IndicXlit

GitHub | Downloads | Paper | Demo | Python Library

IndicXlit is a transformer-based multilingual transliteration model (~11M parameters) that converts Roman script to native script and vice versa. It is trained on the Aksharantar dataset, which at the time of writing (5 May 2022) is the largest publicly available parallel corpus, containing 26 million word pairs spanning 20 Indic languages. IndicXlit supports the following 21 Indic languages:


Assamese (asm) Bengali (ben) Bodo (brx) Gujarati (guj) Hindi (hin) Kannada (kan)
Kashmiri (kas) Konkani (gom) Maithili (mai) Malayalam (mal) Manipuri (mni) Marathi (mar)
Nepali (nep) Oriya (ori) Punjabi (pan) Sanskrit (san) Sindhi (snd) Sinhala (sin)
Tamil (tam) Telugu (tel) Urdu (urd)
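
If a script needs to check a language code before handing text to a transliteration model, the list above can be captured as a small lookup table. The following is only a minimal sketch: `SUPPORTED_LANGUAGES` and `language_name` are hypothetical names introduced here for illustration, not part of the IndicXlit library, and the library itself may expect different language identifiers.

```python
# Three-letter codes and names copied from the language list above.
# NOTE: this table and helper are illustrative assumptions, not IndicXlit API.
SUPPORTED_LANGUAGES = {
    "asm": "Assamese", "ben": "Bengali", "brx": "Bodo",
    "guj": "Gujarati", "hin": "Hindi", "kan": "Kannada",
    "kas": "Kashmiri", "gom": "Konkani", "mai": "Maithili",
    "mal": "Malayalam", "mni": "Manipuri", "mar": "Marathi",
    "nep": "Nepali", "ori": "Oriya", "pan": "Punjabi",
    "san": "Sanskrit", "snd": "Sindhi", "sin": "Sinhala",
    "tam": "Tamil", "tel": "Telugu", "urd": "Urdu",
}


def language_name(code: str) -> str:
    """Return the language name for a three-letter code, ignoring case.

    Raises ValueError if the code is not in the list above.
    """
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[key]


if __name__ == "__main__":
    print(len(SUPPORTED_LANGUAGES), "languages listed")
    print(language_name("tam"))
```

Keeping the codes in one table means a typo such as `hnd` fails early with a clear error instead of surfacing later as a missing-model problem.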

Know more about IndicXlit

Visit the IndicXlit page to learn more about the models, including:

  • Downloading IndicXlit
  • Using the publicly available models
  • IndicXlit accuracy
  • Training IndicXlit
  • Building the Aksharantar dataset

Citing

If you are using any of the resources, please cite the following article:

@article{Madhani2022AksharantarTB,
  title={Aksharantar: Towards building open transliteration tools for the next billion users},
  author={Yash Madhani and Sushane Parthan and Priyanka A. Bedekar and Ruchi Khapra and Vivek Seshadri and Anoop Kunchukuttan and Pratyush Kumar and Mitesh M. Khapra},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.03018}
}

We would like to hear from you if:

  • You are using our resources. Please let us know how you are putting these resources to use.
  • You have any feedback on these resources.

License

The IndicXlit code (and models) are released under the MIT License.

Contributors

Contact

Acknowledgements

We would like to thank EkStep Foundation for their generous grant, which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.