To evaluate language models on Indic languages, we need a robust human-annotated NLU benchmark. IndicXTREME is such a benchmark: it spans 18 Indic languages and includes 9 tasks that can be broadly grouped into sentence classification (5), structure prediction (2), question answering (1), and sentence retrieval (1).
The tasks are as follows:
- IndicCOPA - Dataset - We manually translate the COPA test set into 18 Indic languages to create IndicCOPA
- IndicQA - Dataset - A manually curated cloze-style reading comprehension dataset that can be used for evaluating question-answering models in 11 Indic languages
- IndicXParaphrase - Dataset - A new, multilingual, and n-way parallel dataset for paraphrase detection in 10 Indic languages
- IndicSentiment - Dataset - A new, multilingual, and n-way parallel dataset for sentiment analysis in 13 Indic languages
- IndicXNLI - Dataset - An automatically translated version of XNLI in 11 Indic languages. Created by Aggarwal et al. in this paper
- Naamapadam - Dataset - An NER dataset with manually curated test sets for 9 Indic languages. Created by Mhaske et al. in this paper
- MASSIVE - Dataset - An intent classification and slot-filling dataset built from user queries collected by Amazon Alexa, covering 7 Indic languages. Created by FitzGerald et al. in this paper
- FLORES - Dataset - To evaluate the retrieval capabilities of models, we include the Indic parts of the FLORES-101 dataset, available in 18 Indic languages. Created by the NLLB Team et al. in this paper
For more information about the datasets and the models, please refer to the GitHub repository.
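As an illustration of how these evaluation sets can be consumed, below is a minimal sketch that loads one of them with the Hugging Face `datasets` library. The dataset identifier (`ai4bharat/IndicCOPA`), the config name (`translation-hi`), and the field names are assumptions, so please verify the exact names against the GitHub repository or the Hugging Face Hub before use.

```python
# Minimal usage sketch (not part of the official repository): loading one of the
# IndicXTREME evaluation sets with the Hugging Face `datasets` library.
# The dataset identifier ("ai4bharat/IndicCOPA"), the config name
# ("translation-hi"), and the field names below are assumptions; check the
# GitHub repository or the Hugging Face Hub for the exact names.
from datasets import load_dataset

# Hindi portion of IndicCOPA (the COPA test set manually translated into Hindi)
copa_hi = load_dataset("ai4bharat/IndicCOPA", "translation-hi", split="test")

for example in copa_hi.select(range(3)):
    # A COPA-style example contains a premise, two candidate continuations,
    # a question type ("cause" or "effect"), and the index of the correct choice.
    print(example["premise"])
    print("  1:", example["choice1"])
    print("  2:", example["choice2"])
    print("  question:", example["question"], "| label:", example["label"])
```

The other classification and QA sets can be inspected the same way by swapping in their respective dataset identifiers and language configs.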
Corresponding author: Sumanth Doddapaneni
If you are using any of the resources, please cite the following article:
@article{Doddapaneni2022towards,
title={Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
author={Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreyansh Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
journal={ArXiv},
year={2022},
volume={abs/2212.05409}
}
IndicXTREME is released under this licensing scheme: