Paper | Huggingface | Benchmarking

Aksharantar is the largest publicly available transliteration dataset for Indic languages, containing 26M Indic-to-English transliteration pairs across 21 languages. Benchmarking results on the Aksharantar test set using the IndicXlit model can be found here. More details regarding Aksharantar can be found in the paper.

Downloads

  • The Aksharantar dataset can be downloaded from the Aksharantar Hugging Face repository.
  • Each language-pair corpus in the Aksharantar dataset is split into training, validation and test subsets. Each subset is a JSONL file in which every line is a single data instance comprising a unique identifier, the native-script word, the English (romanized) word, the transliteration source and a score (where applicable).
  • Individual language-pair download links are provided in the data split below.
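As a minimal sketch, each JSONL subset can be read with the standard `json` module. The field names used below (`unique_identifier`, `native word`, `english word`, `source`, `score`) are assumptions based on the description above, so verify them against the downloaded files:

```python
import json

def read_subset(path):
    """Read one Aksharantar JSONL subset into a list of dicts,
    one dict per transliteration pair (one per line)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                pairs.append(json.loads(line))
    return pairs

# Hypothetical example record mirroring the fields described above;
# the actual key names and values may differ.
example_line = (
    '{"unique_identifier": "hin1", "native word": "नमस्ते", '
    '"english word": "namaste", "source": "Dakshina", "score": null}'
)
record = json.loads(example_line)
print(record["native word"], "->", record["english word"])
```

A whole subset can then be loaded with, e.g., `read_subset("hi-en/test.jsonl")` (path shown for illustration only).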

Data Split

The language-wise splits for Aksharantar are shown in the table below, with the number of word pairs in each subset. The individual download link for each language pair is available via the hyperlink on its name.

| Language pair | Training | Validation | Test |
| --- | --- | --- | --- |
| as-en (4.72 MB) | 179K | 4K | 5531 |
| bn-en (31.5 MB) | 1231K | 11K | 5009 |
| brx-en (0.933 MB) | 36K | 3K | 4136 |
| gu-en (29.5 MB) | 1143K | 12K | 7768 |
| hi-en (31.4 MB) | 1299K | 6K | 5693 |
| kn-en (83.7 MB) | 2907K | 7K | 6396 |
| ks-en (1.1 MB) | 47K | 4K | 7707 |
| kok-en (16.6 MB) | 613K | 4K | 5093 |
| mai-en (6.74 MB) | 283K | 4K | 5512 |
| ml-en (125 MB) | 4101K | 8K | 6911 |
| mni-en (0.313 MB) | 10K | 3K | 4925 |
| mr-en (39.9 MB) | 1453K | 8K | 6573 |
| ne-en (67 MB) | 2397K | 3K | 4133 |
| or-en (9.09 MB) | 346K | 3K | 4256 |
| pa-en (12.1 MB) | 515K | 9K | 4316 |
| sa-en (56 MB) | 1813K | 3K | 5334 |
| sd-en (1.37 MB) | 60K | 8K | 4682 |
| ta-en (92.7 MB) | 3231K | 9K | 4567 |
| te-en (69.1 MB) | 2430K | 8K | 4463 |
| ur-en (17 MB) | 699K | 12K | - |

Change Log

  • 07 May 2022 – The Aksharantar dataset is now available for download.

Contributors

Citing

If you are using any of the resources, please cite the following article:

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, 
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This data is released under the following licensing scheme:

  • Manually collected data: Released under CC-BY license.
  • Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
  • Existing sources: Released under CC0 license.

CC-BY License

CC0 License Statement

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
  • This work is published from: India.

Contact

Acknowledgements

We would like to thank the EkStep Foundation for their generous grant, which helped set up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology for its grant under the National Language Translation Mission (NLTM) to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.