Aksharantar is the largest publicly available transliteration dataset for 21 Indic languages. The corpus has 26M Indic language-English transliteration pairs.
The language-wise splits for Aksharantar are shown in the table below, with the number of word pairs in each split (K denotes thousands; test-set figures are exact counts). Individual download links for each language pair are given against its column header.
Subset | as-en (4.72 MB) | bn-en (31.5 MB) | brx-en (0.933 MB) | gu-en (29.5 MB) | hi-en (31.4 MB) | kn-en (83.7 MB) | ks-en (1.1 MB) | kok-en (16.6 MB) | mai-en (6.74 MB) | ml-en (125 MB) | mni-en (0.313 MB) | mr-en (39.9 MB) | ne-en (67 MB) | or-en (9.09 MB) | pa-en (12.1 MB) | sa-en (56 MB) | sd-en (1.37 MB) | ta-en (92.7 MB) | te-en (69.1 MB) | ur-en (17 MB) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | - | 4682 | 4567 | 4463 |
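As a quick sanity check on the table, the per-language figures can be summed to get an approximate size for each split. The snippet below is a minimal sketch: the counts are copied from the table, the names `LANGS`, `to_count`, and `split_totals` are illustrative and not part of the Aksharantar release, K-suffixed values are assumed to be rounded thousands, and the missing sd-en test entry is skipped.

```python
# Per-language counts copied from the table above.
# Training and validation figures are rounded thousands (K); test figures are exact.
LANGS = "as bn brx gu hi kn ks kok mai ml mni mr ne or pa sa sd ta te ur".split()
TRAIN_K = ("179 1231 36 1143 1299 2907 47 613 283 4101 "
           "10 1453 2397 346 515 1813 60 3231 2430 699").split()
VALID_K = "4 11 3 12 6 7 4 4 4 8 3 8 3 3 9 3 8 9 8 12".split()
TEST = ("5531 5009 4136 7768 5693 6396 7707 5093 5512 6911 "
        "4925 6573 4133 4256 4316 5334 - 4682 4567 4463").split()


def to_count(cell, scale=1):
    """Parse one table cell; '-' means the value is missing."""
    return None if cell == "-" else int(cell) * scale


def split_totals():
    """Sum each split over all language pairs, skipping missing cells."""
    rows = {
        "train": [to_count(c, 1000) for c in TRAIN_K],
        "validation": [to_count(c, 1000) for c in VALID_K],
        "test": [to_count(c) for c in TEST],
    }
    # Every row must have exactly one cell per language pair.
    assert all(len(cells) == len(LANGS) for cells in rows.values())
    return {name: sum(c for c in cells if c is not None)
            for name, cells in rows.items()}


if __name__ == "__main__":
    totals = split_totals()
    for name, count in totals.items():
        print(f"{name:<10} ~{count:,} pairs")
    print(f"{'all':<10} ~{sum(totals.values()):,} pairs")
```

Because the training and validation figures are rounded to the nearest thousand, the resulting totals are approximate and cover only the language pairs listed in the table.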
If you are using any of the resources, please cite the following article:
@misc{madhani2022aksharantar,
title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2022},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This data is released under the following licensing scheme:
CC-BY License
CC0 License Statement