Shrutilipi – AI4BHĀRAT

Shrutilipi is a labelled ASR corpus obtained by mining parallel audio and text pairs at the document scale from All India Radio news bulletins for 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu. The corpus has over 6400 hours of data across all languages.

Dataset Details

Language	Size (in Hours)
bengali	443
gujarati	460
hindi	1620
kannada	459
malayalam	359
marathi	1015
odia	601
punjabi	94
sanskrit	27
tamil	794
telugu	390
urdu	193
Total	6457

Downloads

The dataset can be downloaded from the links given below –

Download transcripts – Link

The transcripts and audio paths are provided in the fairseq manifest format, which can be used directly to train models with the fairseq library. The download consists of three files:

train.tsv file – Each line contains the relative path to an audio file and the number of frames in that audio, separated by a tab. The file starts with a header line that holds the absolute path to the dataset.

train.wrd (word) file – Each line contains the transcription of the audio file listed on the same line number of the ‘.tsv’ file (ignoring the header in the ‘.tsv’ file).

train.ltr (letter) file – Each line contains the character-level tokenization of the transcription on the same line of the ‘.wrd’ file.
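The relationship between the three files can be sketched in Python as follows. This is a minimal illustration, not part of the released tooling: the function names `read_manifest` and `to_letters` are hypothetical, and the letter convention shown (characters separated by spaces, with `|` marking word boundaries) is the one used by fairseq's wav2vec example scripts, which is assumed rather than confirmed for the Shrutilipi `.ltr` files.

```python
from pathlib import Path


def _lines(path):
    """Read a UTF-8 text file into a list of lines, ignoring the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return f.read().rstrip("\n").split("\n")


def read_manifest(tsv_path, wrd_path):
    """Pair each audio entry of a .tsv manifest with its transcript from the .wrd file.

    Returns a list of (audio_path, num_frames, transcript) tuples.
    """
    tsv = _lines(tsv_path)
    root = Path(tsv[0].strip())  # header line: path to the dataset root
    rows = [line.split("\t") for line in tsv[1:]]
    transcripts = _lines(wrd_path)
    if len(rows) != len(transcripts):
        raise ValueError(
            f"{len(rows)} audio entries but {len(transcripts)} transcripts"
        )
    # Line i of the .wrd file describes line i of the .tsv file after the header.
    return [
        (root / rel_path, int(frames), text)
        for (rel_path, frames), text in zip(rows, transcripts)
    ]


def to_letters(transcript):
    """Character-tokenize a transcript (assumed convention: '|' marks word ends)."""
    return " ".join(transcript.strip().replace(" ", "|")) + " |"
```

Calling `read_manifest("train.tsv", "train.wrd")` returns one tuple per utterance. The length check guards against the two files drifting out of alignment, since the pairing relies on line numbers alone.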

Audio Dataset Format

The audio files for each news bulletin are present in separate folders.
The audio files are stored in WAV format, sampled at 16 kHz.
The audio filenames are numbered by sentence ID within the bulletin, e.g. sent_1.wav
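After downloading, the stated sampling rate can be checked with Python's standard `wave` module. The sketch below assumes the files are uncompressed PCM WAV (the only kind `wave` reads); the helper names `wav_info` and `check_sample_rate` are hypothetical.

```python
import wave

EXPECTED_RATE = 16_000  # 16 kHz, as stated for the corpus audio


def wav_info(path):
    """Return (sample_rate, num_frames, duration_in_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
    return rate, frames, frames / rate


def check_sample_rate(path, expected=EXPECTED_RATE):
    """Raise ValueError if the file is not sampled at the expected rate."""
    rate, _, _ = wav_info(path)
    if rate != expected:
        raise ValueError(f"{path}: expected {expected} Hz, found {rate} Hz")
```

For example, `wav_info("sent_1.wav")` on a one-second clip at 16 kHz would report 16000 frames.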

Folder Structure

data
├── bengali
│   ├── <bulletin-1>
│   │   ├── sent_1.wav
│   │   ├── sent_2.wav
│   │   ├── ...
│   │   └── sent_n.wav
│   ├── <bulletin-2>
│   └── ...
├── gujarati
├── ...
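Given this layout, the sentences of each bulletin can be collected in order with a short script. This is a sketch under the assumption that the files follow the `sent_<id>.wav` pattern shown above; `list_bulletin_audio` is a hypothetical helper name. Sorting uses the numeric ID, because a plain string sort would place `sent_10.wav` before `sent_2.wav`.

```python
import re
from pathlib import Path

# Matches names such as sent_12.wav and captures the numeric sentence ID.
_SENT_ID = re.compile(r"^sent_(\d+)\.wav$")


def list_bulletin_audio(data_root, language):
    """Map each bulletin folder name to its WAV paths, ordered by sentence ID."""
    bulletins = {}
    language_dir = Path(data_root) / language
    for bulletin_dir in sorted(p for p in language_dir.iterdir() if p.is_dir()):
        numbered = []
        for wav in bulletin_dir.iterdir():
            match = _SENT_ID.match(wav.name)
            if match:
                numbered.append((int(match.group(1)), wav))
        numbered.sort(key=lambda pair: pair[0])  # numeric, not lexicographic
        bulletins[bulletin_dir.name] = [path for _, path in numbered]
    return bulletins
```

For example, `list_bulletin_audio("data", "bengali")` returns a dictionary keyed by bulletin folder name.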

Audio Download Links

Language	Download Link
bengali	Link (65 GB)
gujarati	Link (68 GB)
hindi	Link (229 GB)
kannada	Link (63 GB)
malayalam	Link (84 GB)
marathi	Link (123 GB)
odia	Link (74 GB)
punjabi	Link (12 GB)
sanskrit	Link (6 GB)
tamil	Link (107 GB)
telugu	Link (80 GB)
urdu	Link (30 GB)

Model Download Links

Language	Download Link (3.6GB)
bengali	Link
gujarati	Link
hindi	Link
marathi	Link
odia	Link
tamil	Link
telugu	Link

Shrutilipi – Mining Process

The figure below summarizes the key procedure we used to mine audio-text pairs from All India Radio (AIR) documents. For a detailed description of the data mining process, please refer to our paper.

Results on Hindi Benchmarks

Benchmarks	Kathbath-Known	Kathbath-Unknown	Tarini	CommonVoice 6	CommonVoice 7	CommonVoice 8	CommonVoice 9	Avg.
W2V (MUCS)	14.1	14.7	22.7	19.4	19.5	20.7	20.5	18.8
W2V (MUCS + Shrutilipi)	9.4	9.6	19.7	15	13.4	13.9	13.7	13.5
Conf. (MUCS)	17.2	17.7	25.4	20.9	21.4	22.9	22.8	21.2
Conf. (MUCS + Shrutilipi)	15.2	14.9	23.9	19.3	19.1	20	19.9	18.9
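To put the table in relative terms, the change in the Avg. column can be computed as below. This sketch assumes the figures are error rates where lower is better (the table itself does not name the metric); the values are copied from the Avg. column above and `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# (MUCS only, MUCS + Shrutilipi) averages taken from the table above.
averages = {
    "W2V": (18.8, 13.5),
    "Conf.": (21.2, 18.9),
}

for model, (before, after) in averages.items():
    print(f"{model}: {relative_reduction(before, after):.1f}% relative reduction")
```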

Results on Kathbath Unknown Test Set

	bn	gu	hi	mr	or	ta	te	Avg.
Existing	14.4	15	14.7	25.6	31.5	24.1	22.3	21.1
Existing + Shrutilipi	13.4	9.5	9.6	15.7	21.5	19.7	17.7	15.3

Results on MUCS Benchmark

	gu	hi	mr	or	ta	te	Avg.
Existing	17.9	12	13.6	23.3	20.5	16.4	17.3
Existing + Shrutilipi	12.8	11.1	11.4	23	20.7	13.8	15.5

Citing our work

If you are using any of the resources, please cite the following article:

@misc{https://doi.org/10.48550/arxiv.2208.12666,
  doi = {10.48550/ARXIV.2208.12666},
  url = {https://arxiv.org/abs/2208.12666},
  author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.

License

Dataset

The Shrutilipi dataset is released under this licensing scheme:

We do not own any of the raw text and audio from which this dataset has been extracted.
The raw dataset and audio have been crawled from the publicly available website: https://newsonair.gov.in
We license the actual packaging of this data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to the Shrutilipi dataset.
This work is published from: India.

Code and Models

The code and models are released under the MIT License.

Contributors

Kaushal Bhogale
Abhigyan Raman
Tahir Javed
Sumanth Doddapaneni
Anoop Kunchukuttan
Mitesh Khapra
Pratyush Kumar

Contact

Anoop Kunchukuttan (anoop.kunchukuttan@gmail.com)
Mitesh Khapra (miteshk@cse.iitm.ac.in)
Pratyush Kumar (pratyush@cse.iitm.ac.in)

Acknowledgements

We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring human resources as well as cloud resources needed for this work. We would like to thank Megh Makhwana from Nvidia for helping to train Conformer-based ASR models. We would like to thank the EkStep Foundation for providing the Tarini dataset. We would like to thank Janki Nawale and Anupama Sujatha from AI4Bharat for helping to coordinate the annotation task, and we extend our thanks to all the annotators of the AI4Bharat team.