Dhwani is an unlabelled ASR corpus obtained from YouTube and News On AIR news bulletins. The dataset contains raw audio across 40 Indian languages.
Numbers represent hours
For YouTube
YT
├── bengali
│   ├── XXXXXXXXXXX.wav
│   ├── XXXXXXXXXXX.wav
│   ├── XXXXXXXXXXX.wav
│   └── ...
├── gujarati
├── ...
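To check how many files and hours each language contributes in a local copy of the YouTube data, a minimal sketch along the following lines may help. It assumes the layout shown above (one directory per language holding .wav files) and that the files are uncompressed PCM WAV, which is what Python's standard `wave` module can read. The function name `wav_hours_by_language` is hypothetical and not part of any Dhwani tooling.

```python
import wave
from pathlib import Path


def wav_hours_by_language(root):
    """Return {language: (file_count, total_hours)} for a YT-style tree.

    Assumes root/<language>/*.wav, as in the layout above.
    """
    stats = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue
        count, seconds = 0, 0.0
        for wav_path in lang_dir.glob("*.wav"):
            # wave reads uncompressed PCM WAV only; other encodings raise wave.Error.
            with wave.open(str(wav_path), "rb") as wav_file:
                seconds += wav_file.getnframes() / wav_file.getframerate()
            count += 1
        stats[lang_dir.name] = (count, seconds / 3600.0)
    return stats
```

For example, `wav_hours_by_language("YT")` would return a dictionary keyed by language directory name, such as `bengali` or `gujarati`.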
For NOA
NOA
├── Audio
│   ├── assamese
│   │   ├── audio
│   │   │   ├── newsonair.nic.in
│   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-201810107486.mp3
│   │   │   │   ├── NSD-Assamese-Assamese-0705-0715-20181011161537.mp3
│   ├── gujarati
│   ├── ...
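Because the NOA files sit several directories deep, a recursive search is a convenient way to collect them. The sketch below groups .mp3 paths by language, assuming the language name is the first directory below `NOA/Audio` and the .mp3 files lie somewhere beneath it. The function name `mp3_files_by_language` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def mp3_files_by_language(audio_root):
    """Map each language directory under audio_root to its sorted .mp3 paths."""
    audio_root = Path(audio_root)
    files = defaultdict(list)
    for mp3_path in audio_root.rglob("*.mp3"):
        parts = mp3_path.relative_to(audio_root).parts
        if len(parts) < 2:
            # Skip files placed directly under audio_root: no language directory.
            continue
        # Assumption: the first path component below audio_root is the language.
        files[parts[0]].append(mp3_path)
    return {lang: sorted(paths) for lang, paths in files.items()}
```

Calling `mp3_files_by_language("NOA/Audio")` would then give one list of paths per language directory found.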
YouTube urls
NewsOnAir: please crawl the data from https://newsonair.gov.in/
If you are using any of the resources, please cite the following article:
@dataset{
}
We would like to hear from you if:
The Dhwani dataset, models and code are released under the MIT License.
We would like to thank EkStep Foundation for their generous grant which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.