Bhasha-Abhijnaanam Dataset

Paper | Hugging Face | Benchmarking

Bhasha-Abhijnaanam is a language identification test set for native-script and romanized text spanning 22 Indic languages. Benchmarking results on the Bhasha-Abhijnaanam test set using the IndicLID model can be found here. More details regarding Bhasha-Abhijnaanam can be found in the paper.

Downloads

The Bhasha-Abhijnaanam dataset can be downloaded from the Bhasha-Abhijnaanam Hugging Face repository.
The dataset is a JSONL file in which each data instance comprises a unique identifier, native sentence, romanized sentence (if available), language, script, and source.
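
As a rough illustration of working with a JSONL file of this shape, the sketch below parses one JSON object per line and counts instances per language and script. The key names ("unique_identifier", "native sentence", "romanized sentence", "language", "script", "source") are assumptions based on the field description above and may not match the released file exactly; the sample lines are made-up placeholders, not real dataset entries.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_language_and_script(records: Iterable[dict]) -> Counter:
    """Count instances per (language, script) pair."""
    return Counter((r.get("language"), r.get("script")) for r in records)


# Made-up placeholder lines standing in for the real file contents.
# The key names are assumed, not confirmed against the released data.
sample_lines = [
    '{"unique_identifier": "ex1", "native sentence": "placeholder", '
    '"romanized sentence": "placeholder", "language": "Hindi", '
    '"script": "Devanagari", "source": "placeholder"}',
    '{"unique_identifier": "ex2", "native sentence": "placeholder", '
    '"romanized sentence": "", "language": "Hindi", '
    '"script": "Devanagari", "source": "placeholder"}',
    "",
    '{"unique_identifier": "ex3", "native sentence": "placeholder", '
    '"romanized sentence": "placeholder", "language": "Tamil", '
    '"script": "Tamil", "source": "placeholder"}',
]

counts = count_by_language_and_script(iter_records(sample_lines))
for (language, script), n in sorted(counts.items()):
    print(f"{language} ({script}): {n}")

# With a downloaded file (the path here is hypothetical):
#     with open("bhasha-abhijnaanam.jsonl", encoding="utf-8") as f:
#         counts = count_by_language_and_script(iter_records(f))
```

Because a file object is itself an iterable of lines, the same two functions can be applied to the downloaded file without loading it into memory all at once.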

Test Set

The table below shows the number of sentences per language in the native-script and romanized subsets of Bhasha-Abhijnaanam.

Subset	asm	ben	brx	guj	hin	kan	kas (Perso-Arabic)	kas (Devanagari)	kok	mai	mal	mni (Bengali)	mni (Meetei Mayek)	mar	nep	ori	pan	san	sid	tam	tel	urd
Native	1012	5606	1500	5797	5617	5859	2511	1012	1500	2512	5628	1012	1500	5611	2512	1012	5776	2510	2512	5893	5779	5751
Romanized	512	4595	433	4785	4606	4848	450	0	444	439	4617	0	442	4603	423	512	4765	448	0	4881	4767	4741
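
The table gives per-column counts only. A minimal sketch for deriving subset totals and spotting columns without romanized data is shown below; the counts are copied from the table above, and the variable names are purely illustrative.

```python
# Column labels and counts copied from the table above.
columns = [
    "asm", "ben", "brx", "guj", "hin", "kan",
    "kas (Perso-Arabic)", "kas (Devanagari)", "kok", "mai", "mal",
    "mni (Bengali)", "mni (Meetei Mayek)", "mar", "nep", "ori",
    "pan", "san", "sid", "tam", "tel", "urd",
]
native_counts = [
    1012, 5606, 1500, 5797, 5617, 5859, 2511, 1012, 1500, 2512, 5628,
    1012, 1500, 5611, 2512, 1012, 5776, 2510, 2512, 5893, 5779, 5751,
]
romanized_counts = [
    512, 4595, 433, 4785, 4606, 4848, 450, 0, 444, 439, 4617,
    0, 442, 4603, 423, 512, 4765, 448, 0, 4881, 4767, 4741,
]

native = dict(zip(columns, native_counts))
romanized = dict(zip(columns, romanized_counts))

# Totals per subset.
native_total = sum(native.values())
romanized_total = sum(romanized.values())

# Columns for which the table lists no romanized sentences.
no_romanized = [name for name, n in romanized.items() if n == 0]

print("Native total:", native_total)
print("Romanized total:", romanized_total)
print("No romanized data:", ", ".join(no_romanized))
```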

Contributors

Yash Madhani (AI4Bharat, IITM)
Mitesh M. Khapra (AI4Bharat, IITM)
Anoop Kunchukuttan (AI4Bharat, Microsoft)

Citing

If you are using any of the resources, please cite the following article:

@misc{madhani2023bhashaabhijnaanam,
      title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages}, 
      author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan},
      year={2023},
      eprint={2305.15814},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This data is released under the following licensing scheme:

Manually collected data: Released under CC0 license.

CC0 License Statement

We do not own any of the text from which this data has been extracted.
We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Bhasha-Abhijnaanam manually collected data and existing sources.
This work is published from: India.

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)

Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology of the Government of India for their generous grant through the Digital India Bhashini project. We also thank the Centre for Development of Advanced Computing for providing compute time on the Param Siddhi Supercomputer, Nilekani Philanthropies for their generous grant towards building datasets, models, tools and resources for Indic languages, and Microsoft for their grant to support research on Indic languages. We would like to thank Jay Gala and Ishvinder Sethi for their help in coordinating the annotation work. Most importantly, we would like to thank all the annotators who helped create the Bhasha-Abhijnaanam benchmark.