AI4BHARAT

Datasets

IndicCorp

Large sentence-level monolingual corpora for 11 languages from two language families (the Indo-Aryan branch and Dravidian), plus Indian English, with an average 9-fold increase in size over OSCAR. The corpora were created by crawling news articles, magazines, and blog posts.

Samanantar

The largest publicly available collection of parallel corpora for Indic languages, containing ∼46.9M parallel sentence pairs between English and 11 Indic languages, ranging from 142K English-Assamese pairs to 8.6M English-Hindi pairs. Of these, 34.6M pairs are newly mined as part of this work.

Aksharantar

The largest publicly available parallel transliteration corpus, containing 26M word pairs spanning 21 languages, mined from Wikidata, Samanantar, and IndicCorp. It also includes a challenging and diverse benchmark for evaluating transliteration models.

IndicGLUE

A benchmark of 6 NLU tasks spanning 11 Indian languages, with standard training and evaluation sets for assessing the natural language understanding capabilities of language models.

IndicNLG Suite

A benchmark of several tasks for evaluating the natural language generation capabilities of language models for Indian languages.

Shrutilipi

Over 6,400 hours of labelled audio across 12 Indian languages, mined and aligned from All India Radio audio broadcasts and their PDF transcripts.

IndicSUPERB

A benchmark of speech tasks, including automatic speech recognition (ASR), speaker verification, speaker identification, language identification, query by example, and keyword detection, for 12 Indian languages.

Dhwani

17,000 hours of raw speech data for 40 Indian languages, drawn from a wide variety of domains including education, news, technology, and finance.

Coming Soon
IndicXTREME

Benchmark for zero-shot and cross-lingual evaluation of various NLU tasks in multiple Indian languages.

Coming Soon
Naamapadam

Training and evaluation datasets for named entity recognition in multiple Indian languages.
