Computational Linguistics R&D at J.N.U., New Delhi

Tagsets and tagged corpora

Computational Linguistics R&D at the Special Centre for Sanskrit Studies, J.N.U., has also focused on developing tagsets for Indian languages. We developed the first Sanskrit tagset as part of the Ph.D. work of Dr. R. Chandrashekar (Ph.D. 2002-2007) under the supervision of Dr. Girish Nath Jha. The tagger developed can be tested by clicking here.

Recently, Microsoft Research India Lab created a generic hierarchical tagset called "IL-POST" for Indian languages. Dr. Girish Nath Jha (along with many eminent scholars) is an author of this tagset. Subsequently, the IL-POST tagset was adapted for Sanskrit by Madhav Gopal, working under Dr. Jha's supervision as part of Computational Linguistics coursework, by incorporating many features from the JNU tagset (R. Chandrashekar 2007). The statistical tagger at MSR India Labs was trained on approximately 6K tagged words, and the following results were obtained.

Training data: 200 sentences (~5.8K words)
Test data: 50 sentences (~1.2K words)
Word-level accuracy: 75.35%
Sentence-level accuracy: 29.3%
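
The page does not describe how the two accuracy figures were computed. Under the usual definitions, word-level accuracy is the fraction of words that receive the correct tag, and sentence-level accuracy is the fraction of sentences in which every word is tagged correctly. The following is a minimal sketch of those definitions, not the MSR evaluation script; the function name `tagging_accuracy` and the tag labels in the example are hypothetical and are not taken from the JNU or IL-POST tagsets.

```python
def tagging_accuracy(gold, predicted):
    """Return (word_accuracy, sentence_accuracy) as fractions in [0, 1].

    gold and predicted are lists of sentences; each sentence is a list of
    tags, aligned word by word.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of sentences")

    total_words = 0
    correct_words = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("a sentence has different lengths in gold and predicted")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_words += len(gold_tags)
        correct_words += matches
        # A sentence counts as correct only if every tag in it matches.
        if matches == len(gold_tags):
            correct_sentences += 1

    word_acc = correct_words / total_words if total_words else 0.0
    sent_acc = correct_sentences / len(gold) if gold else 0.0
    return word_acc, sent_acc


# Hypothetical tags, for illustration only.
gold = [["NOUN", "VERB"], ["NOUN", "ADJ", "VERB"]]
pred = [["NOUN", "VERB"], ["NOUN", "NOUN", "VERB"]]
word_acc, sent_acc = tagging_accuracy(gold, pred)
print(f"word-level: {word_acc:.2%}, sentence-level: {sent_acc:.2%}")
```

Because a single wrong tag makes the whole sentence count as incorrect, sentence-level accuracy is normally much lower than word-level accuracy, which is consistent with the gap between the two figures above.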

JNU Sanskrit tagset
tagset
annotation guidelines
tagged corpora

IL-POST Sanskrit tagset
tagset
annotation guidelines
tagged corpora