Naamapadam
Naamapadam is the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. For 9 of the 11 languages, it contains more than 400k sentences annotated with at least 100k entities from three standard entity categories (Person, Location and Organization).
The dataset contains train, test and dev splits. We have manually annotated gold standard testsets for 8 languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Telugu.
We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set.
Naamapadam Dataset: Available on our Huggingface repository
IndicNER model: Available on our Huggingface repository
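As a minimal sketch of how the dataset might be loaded with the Hugging Face `datasets` library: the repository ID `ai4bharat/naamapadam` and the use of one configuration per language code are assumptions not stated in this README, and `load_naamapadam` is a hypothetical helper name. Check the repository page for the exact identifiers.

```python
# Language codes used in the statistics tables of this README.
LANGS = ("as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te")


def load_naamapadam(lang, repo_id="ai4bharat/naamapadam"):
    """Load the train/dev/test splits for one language.

    ASSUMPTION: `repo_id` and the per-language configuration name are
    guesses about how the dataset is published, not confirmed here.
    """
    if lang not in LANGS:
        raise ValueError(f"Unknown language code {lang!r}; expected one of {LANGS}")
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, lang)
```

Calling `load_naamapadam("hi")` would then download the Hindi portion, provided the assumed repository ID and configuration name are correct.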
Comparison of Indian language named entity training dataset statistics (total number of named entities). For all datasets, the statistics include only LOC, PER and ORG named entities.
Dataset | as | bn | gu | hi | kn | ml | mr | or | pa | ta | te |
---|---|---|---|---|---|---|---|---|---|---|---|
Naamapadam | 5.0K | 1.6M | 765.5K | 2.1M | 655K | 1.0M | 731.2K | 189.6K | 875.9K | 741.1K | 747.8K |
WikiANN | 443 | 61K | 1.6K | 4.7K | 1.9K | 9.4K | 7.2K | 658 | 1K | 12.5K | 3.4K |
FIRE-2014 | - | 6.1K | - | 3.5K | - | 4.2K | - | - | - | 3.2K | - |
CFILT | - | - | - | 262.1K | - | - | 4.8K | - | - | - | - |
MultiCoNER | - | 9.9K | - | 10.5K | - | - | - | - | - | - | - |
MahaNER | - | - | - | - | - | - | 16K | - | - | - | - |
AsNER | ~6K | - | - | - | - | - | - | - | - | - | - |
Statistics for the Naamapadam dataset. The testsets for as, or and ta are either silver standard or small. Work on creation of larger, manually annotated testsets is in progress for these languages.
Lang. | Sentences (Train) | Sentences (Dev) | Sentences (Test) | Train Org | Train Loc | Train Per | Dev Org | Dev Loc | Dev Per | Test Org | Test Loc | Test Per |
---|---|---|---|---|---|---|---|---|---|---|---|---|
bn | 961.7K | 4.9K | 607 | 340.7K | 560.9K | 725.2K | 1.7K | 2.8K | 3.7K | 207 | 331 | 457 |
gu | 472.8K | 2.4K | 1.1K | 205.7K | 238.1K | 321.7K | 1.1K | 1.2K | 1.6K | 419 | 645 | 673 |
hi | 985.8K | 13.5K | 437 | 686.4K | 731.2K | 767.0K | 9.7K | 10.2K | 10.5K | 257 | 302 | 263 |
kn | 471.8K | 2.4K | 1.0K | 167.5K | 177.0K | 310.5K | 882 | 919 | 1.6K | 291 | 397 | 614 |
ml | 716.7K | 3.6K | 974 | 234.5K | 308.2K | 501.2K | 1.2K | 1.6K | 2.6K | 309 | 482 | 714 |
mr | 455.2K | 2.3K | 1.1K | 164.9K | 224.0K | 342.3K | 868 | 1.2K | 1.8K | 391 | 569 | 696 |
pa | 463.5K | 2.3K | 993 | 235.0K | 289.8K | 351.1K | 1.1K | 1.5K | 1.7K | 408 | 496 | 553 |
te | 507.7K | 2.7K | 861 | 194.1K | 205.9K | 347.8K | 1.0K | 1.0K | 2.0K | 263 | 482 | 607 |
ta | 497.9K | 2.8K | 49 | 177.7K | 281.2K | 282.2K | 1.0K | 1.5K | 1.6K | 26 | 34 | 22 |
as | 10.3K | 52 | 51 | 2.0K | 1.8K | 1.2K | 18 | 5 | 3 | 11 | 7 | 6 |
or | 196.8K | 993 | 994 | 45.6K | 59.4K | 84.6K | 225 | 268 | 386 | 229 | 266 | 431 |
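The Naamapadam row of the comparison table can be cross-checked against the per-category training counts above by summing Org, Loc and Per for each language. The sketch below copies the training figures from the table (in thousands); the names `TRAIN_ENTITIES_K` and `total_train_entities_k` are illustrative only.

```python
# Training-split entity counts in thousands, as (Org, Loc, Per),
# copied from the statistics table above.
TRAIN_ENTITIES_K = {
    "bn": (340.7, 560.9, 725.2),
    "gu": (205.7, 238.1, 321.7),
    "hi": (686.4, 731.2, 767.0),
    "kn": (167.5, 177.0, 310.5),
    "ml": (234.5, 308.2, 501.2),
    "mr": (164.9, 224.0, 342.3),
    "pa": (235.0, 289.8, 351.1),
    "te": (194.1, 205.9, 347.8),
    "ta": (177.7, 281.2, 282.2),
    "as": (2.0, 1.8, 1.2),
    "or": (45.6, 59.4, 84.6),
}


def total_train_entities_k(lang):
    """Return the total number of training entities for `lang`, in thousands."""
    return round(sum(TRAIN_ENTITIES_K[lang]), 1)


print(total_train_entities_k("gu"))  # 765.5
print(total_train_entities_k("or"))  # 189.6
```

For example, Gujarati sums to 765.5K and Oriya to 189.6K, matching the corresponding entries in the comparison table.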
Corresponding authors: Rudra Murthy V, Anoop Kunchukuttan
If you are using any of the resources, please cite the following article:
@misc{mhaske2022naamapadam,
  doi = {10.48550/ARXIV.2212.10168},
  url = {https://arxiv.org/abs/2212.10168},
  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
  publisher = {arXiv},
  year = {2022},
}
Naamapadam is released under this licensing scheme: