Large sentence-level monolingual corpora for 11 languages from two language families (Indo-Aryan branch and Dravidian) and Indian English, with an average 9-fold increase in size over OSCAR. These corpora were created by crawling content from news articles, magazines and blog posts.
The largest publicly available parallel corpora collection for Indic languages, containing ∼46.9M parallel sentences between English and 11 Indic languages, ranging from 142K pairs between English-Assamese to 8.6M pairs between English-Hindi. Of these, 34.6M pairs were newly mined as part of this work.
The largest publicly available parallel transliteration corpora containing 26M word pairs spanning 21 languages mined from Wikidata, Samanantar and IndicCorp. It also contains a challenging and diverse benchmark for evaluating transliteration models.
This is a benchmark of 6 NLU tasks spanning 11 Indian languages. It provides standard training and evaluation sets for assessing the natural language understanding capabilities of language models for Indian languages.
This is a benchmark containing various tasks to evaluate the natural language generation capabilities of language models for Indian languages.
Over 6,400 hours of labelled audio across 12 Indian languages mined and aligned from audio broadcasts and PDF transcripts from All India Radio.
A benchmark of speech tasks, including ASR, speaker verification, speaker identification, language identification, query by example, and keyword detection, for 12 Indian languages.
17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains including education, news, technology, and finance.
Benchmark for zero-shot and cross-lingual evaluation of various NLU tasks in multiple Indian languages.