The IndicNLG Suite is a collection of datasets for benchmarking Natural Language Generation (NLG) in 11 Indic languages across five diverse NLG tasks. The datasets were created using a combination of website crawling, machine translation, and cleaning based on n-gram counts and regular expressions. Overall, the suite contains about 8.5M examples across all languages and tasks; it is the largest multilingual NLG dataset to date as well as the first of its kind for Indic languages. You can use these datasets to benchmark your own NLG systems.
You can read more about the IndicNLG Suite in our paper (cited below). We benchmarked our own monolingual and multilingual models based on IndicBART and found that they perform on par with or better than baseline models such as mT5.
The datasets and models are available on HuggingFace:
Task | Dataset | Model
Biography Generation | IndicWikiBio | Coming Soon
Headline Generation | IndicHeadlineGeneration | Coming Soon
Sentence Summarization | IndicSentenceSummarization | Coming Soon
Paraphrase Generation | IndicParaphrase | Coming Soon
Question Generation | IndicQuestionGeneration | Coming Soon
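As a rough illustration of how one of these datasets might be pulled for benchmarking, the sketch below maps the task and dataset names listed above to HuggingFace Hub identifiers and loads one language. Several details are assumptions rather than facts stated here: the `ai4bharat` organization prefix, the use of a language code such as `"hi"` as the configuration name, the existence of a `"test"` split, and the helper names `hub_id` and `load_task`, which are hypothetical. Depending on the installed version of the `datasets` library, extra arguments may be needed.

```python
# Task -> dataset name, copied from the list of tasks and datasets above.
INDICNLG_DATASETS = {
    "Biography Generation": "IndicWikiBio",
    "Headline Generation": "IndicHeadlineGeneration",
    "Sentence Summarization": "IndicSentenceSummarization",
    "Paraphrase Generation": "IndicParaphrase",
    "Question Generation": "IndicQuestionGeneration",
}


def hub_id(task, org="ai4bharat"):
    """Build a Hub identifier for a task.

    The default organization name is an assumption, not taken from this page.
    """
    return f"{org}/{INDICNLG_DATASETS[task]}"


def load_task(task, lang, split="test"):
    """Load one language of one task (hypothetical helper).

    Assumes the language code is the dataset's configuration name and that
    the requested split exists. Requires `pip install datasets` and network
    access; the import is local so the mapping above works without it.
    """
    from datasets import load_dataset

    return load_dataset(hub_id(task), lang, split=split)


if __name__ == "__main__":
    # Example: Hindi headline generation (language code is an assumption).
    data = load_task("Headline Generation", "hi")
    print(data)
```

Keeping the task-to-dataset mapping separate from the loader makes it easy to loop over all five tasks when evaluating a single model.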
If you use IndicNLG Suite, please cite the following paper:
@misc{kumar2022indicnlg,
title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
year={2022},
eprint={2203.05437},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Datasets
Different datasets are released under different licenses:
IndicHeadlineGeneration, IndicSentenceSummarization and IndicParaphrase are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IndicWikiBio and IndicQuestionGeneration are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Models
All models are released under the MIT license.