Indian Languages Corpora Initiative (ILCI) - Phase 2

ILCI Languages, Consortium members and Principal Investigators

The Indian Languages Corpora Initiative (ILCI) has been a welcome move by TDIL to develop national corpora based on a national standard. Phase 1 saw the development of parallel annotated corpora in 12 major Indian languages, including English, with part-of-speech (POS) annotation following India's national standard. That phase, finishing in February 2012, comprises 600,000 annotated sentences averaging 16 words each (9,600,000 annotated words). Phase 2 proposes to include four north-eastern languages (Nepali, Bodo, Assamese, Manipuri) and Kannada, and to add 1,100,000 new sentences: 500,000 in the 5 new languages and 600,000 in the 12 existing languages. The total size of the corpora after Phase 2 is estimated at approximately 27 million parallel annotated and chunked words in the domains of:
1. Health and Tourism (HT)

2. AGriculture and ENTertainment (AGENT)
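
The size figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python and not part of the ILCI project itself: the function and variable names are illustrative, and it assumes that the 16-words-per-sentence average reported for phase 1 also holds for the phase 2 sentences, which the text does not state explicitly.

```python
# Figures quoted in the project description.
PHASE1_SENTENCES = 600_000
PHASE2_NEW_LANGUAGE_SENTENCES = 500_000       # 5 new languages
PHASE2_EXISTING_LANGUAGE_SENTENCES = 600_000  # 12 existing languages

# Stated for phase 1; ASSUMED here to apply to phase 2 as well.
AVG_WORDS_PER_SENTENCE = 16


def corpus_size_estimate():
    """Return sentence and word counts per phase and in total."""
    phase1_words = PHASE1_SENTENCES * AVG_WORDS_PER_SENTENCE
    phase2_sentences = (
        PHASE2_NEW_LANGUAGE_SENTENCES + PHASE2_EXISTING_LANGUAGE_SENTENCES
    )
    phase2_words = phase2_sentences * AVG_WORDS_PER_SENTENCE
    return {
        "phase1_words": phase1_words,
        "phase2_sentences": phase2_sentences,
        "phase2_words": phase2_words,
        "total_sentences": PHASE1_SENTENCES + phase2_sentences,
        "total_words": phase1_words + phase2_words,
    }


if __name__ == "__main__":
    for name, value in corpus_size_estimate().items():
        print(f"{name}: {value:,}")
```

Under that assumption the total comes to 27,200,000 words, which agrees with the estimate of approximately 27 million.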

Languages under ILCI Project
Assamese    Bangla           Bodo
Gujarati    Hindi/English    Kannada
Konkani     Malayalam        Manipuri
Marathi     Nepali           Odia
Punjabi     Tamil            Telugu


