Datasets
Large sentence-level monolingual corpora for 11 languages from two language families (the Indo-Aryan branch and Dravidian) and Indian English, with an average 9-fold increase in size over OSCAR. The corpora were created by crawling content from news articles, magazines, and blog posts.
The largest publicly available collection of parallel corpora for Indic languages, containing ∼46.9M parallel sentences between English and 11 Indic languages, ranging from 142K pairs for English-Assamese to 8.6M pairs for English-Hindi. Of these, 34.6M pairs are newly mined as part of this work.
The largest publicly available parallel transliteration corpus, containing 26M word pairs spanning 21 languages, mined from Wikidata, Samanantar, and IndicCorp. It also contains a challenging and diverse benchmark for evaluating transliteration models.
A benchmark of 6 NLU tasks spanning 11 Indian languages, with standard training and evaluation sets for assessing the natural language understanding capabilities of language models for Indian languages.
A benchmark of various tasks for evaluating the natural language generation capabilities of language models for Indian languages.
Over 6,400 hours of labelled audio across 12 Indian languages, mined and aligned from audio broadcasts and PDF transcripts from All India Radio.
A benchmark of speech tasks for 12 Indian languages, including ASR, speaker verification, speaker identification, language identification, query by example, and keyword detection.
17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains including education, news, technology, and finance.
A benchmark for zero-shot and cross-lingual evaluation of various NLU tasks in multiple Indian languages.
Training and evaluation datasets for named entity recognition in multiple Indian languages.