IndicCorp was built by discovering and scraping thousands of web sources, primarily news, magazines and books, over several months.
IndicCorp is one of the largest publicly available corpora for Indian languages. It was also used to train our released models, which have obtained state-of-the-art performance on many tasks.
The corpus is a single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized and deduplicated.
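Because the corpus is one sentence per line, it can be streamed rather than loaded into memory at once. The following is a minimal sketch using only the Python standard library; the function name `corpus_stats` is hypothetical, and whitespace splitting is only a rough proxy for however the published token counts were computed.

```python
from typing import Tuple


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (sentence_count, whitespace_token_count) for a one-sentence-per-line file."""
    sentences = 0
    tokens = 0
    # Iterating over the file object reads one line at a time,
    # so memory use stays flat even for very large files.
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            sent = line.strip()
            if not sent:
                continue  # skip blank lines, if any
            sentences += 1
            tokens += len(sent.split())
    return sentences, tokens
```

Calling `corpus_stats` on a downloaded language file returns the two counts as a tuple, which gives a quick sanity check that the download is complete.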
Downloads
Language | # News Articles* | Sentences | Tokens | Link |
---|---|---|---|---|
as | 0.60M | 1.39M | 32.6M | link |
bn | 3.83M | 39.9M | 836M | link |
en | 3.49M | 54.3M | 1.22B | link |
gu | 2.63M | 41.1M | 719M | link |
hi | 4.95M | 63.1M | 1.86B | link |
kn | 3.76M | 53.3M | 713M | link |
ml | 4.75M | 50.2M | 721M | link |
mr | 2.31M | 34.0M | 551M | link |
or | 0.69M | 6.94M | 107M | link |
pa | 2.64M | 29.2M | 773M | link |
ta | 4.41M | 31.5M | 582M | link |
te | 3.98M | 47.9M | 674M | link |
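
The sentence and token columns can be combined to get a rough average sentence length per language. The sketch below does this for three rows copied from the table; the helper `parse_count` is hypothetical and assumes that `M` means million and `B` means billion.

```python
def parse_count(text: str) -> float:
    """Convert a table entry such as '63.1M' or '1.86B' to a number."""
    multipliers = {"M": 1e6, "B": 1e9}
    return float(text[:-1]) * multipliers[text[-1]]


# (sentences, tokens) copied from three rows of the table.
rows = {
    "as": ("1.39M", "32.6M"),
    "hi": ("63.1M", "1.86B"),
    "kn": ("53.3M", "713M"),
}

for lang, (sentences, tokens) in rows.items():
    avg = parse_count(tokens) / parse_count(sentences)
    print(f"{lang}: {avg:.1f} tokens per sentence")
```

Since the table values are rounded, these averages are approximate.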
For processing the corpus into other forms (tokenized, transliterated etc.), you can use the indicnlp library. As an example, the following code snippet can be used to tokenize the corpus:
    from indicnlp.tokenize.indic_tokenize import trivial_tokenize
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

    lang = 'kn'
    input_path = 'kn'
    output_path = 'kn.tok.txt'

    normalizer_factory = IndicNormalizerFactory()
    normalizer = normalizer_factory.get_normalizer(lang)

    def process_sent(sent):
        # Normalize the script, then join the tokens with single spaces.
        normalized = normalizer.normalize(sent)
        processed = ' '.join(trivial_tokenize(normalized, lang))
        return processed

    with open(input_path, 'r', encoding='utf-8') as in_fp, \
            open(output_path, 'w', encoding='utf-8') as out_fp:
        # Iterate over the file object so the corpus is streamed, not loaded whole.
        for line in in_fp:
            sent = line.rstrip('\n')
            toksent = process_sent(sent)
            out_fp.write(toksent)
            out_fp.write('\n')
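The same read, transform, write loop applies to any other per-sentence processing step. As a sketch, the hypothetical helper `transform_corpus` below accepts any function from string to string and streams the file through it. The `collapse_whitespace` function is only a stand-in; in practice a function such as `process_sent` would be passed instead.

```python
from typing import Callable


def transform_corpus(input_path: str, output_path: str,
                     transform: Callable[[str], str]) -> int:
    """Apply `transform` to every line of input_path and write the results.

    Returns the number of lines written.
    """
    written = 0
    with open(input_path, "r", encoding="utf-8") as in_fp, \
            open(output_path, "w", encoding="utf-8") as out_fp:
        for line in in_fp:  # stream line by line
            out_fp.write(transform(line.rstrip("\n")))
            out_fp.write("\n")
            written += 1
    return written


def collapse_whitespace(sent: str) -> str:
    """Placeholder transform: squeeze runs of whitespace into single spaces."""
    return " ".join(sent.split())
```

Keeping the file handling separate from the transform makes it easy to swap in a different processing function without touching the I/O code.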
If you are using IndicCorp, please cite the following article:
@inproceedings{kakwani2020indicnlpsuite,
title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
IndicCorp is released under this licensing scheme: