Annotation Guidelines for tagging Sanskrit using MSRI-JNU Sanskrit tagset




This guideline describes how to annotate Sanskrit text with Parts-of-Speech (POS) tags according to the hierarchical POS tagset framework designed at the Special Centre for Sanskrit Studies, JNU, New Delhi, following the pattern of the Microsoft Research India (MSRI) tagset for Indic languages. The first attempt at developing a Sanskrit tagset was undertaken by Dr. R. Chandrashekar in his doctoral thesis (JNU, 2002-2007) under the supervision of Dr. Girish Nath Jha. That work informed the finalisation of our tagset. This Sanskrit tagset is thus derived from the MSRI Indic languages tagset and the Sanskrit tagset of Dr. R. Chandrashekar (2007).


The objective of the guidelines is to provide clear instructions for tagging Sanskrit text. This tagset consists of categories, types, and their attributes, which are the three levels of the hierarchy. Categories are the top-level part-of-speech classes such as noun, adjective, particle etc., and are obligatory. Types are the main sub-classes of categories. Attributes are morpho-syntactic features of types, and all of them are optional.


1.         Objective

2.         Structure of the tagset

A.        Categories

B.        Types

C.        Attributes

3.         Description of tags     

            A. Categories and Types:

i. Noun (N)

            1. Common (NC)

            2. Proper (NP)


ii. Verb (V)


iii. Pronoun (P)

            1. Pronominal (PPR)

            2. Reflexive (PRF)

            3. Reciprocal (PRC)

            4. Relative (PRL)

            5. Wh- (PWH)



iv. Nominal modifier (J)

            1. Adjective (JJ)

            2. Quantifier (JQ)


v. Demonstrative (D)

            1. Absolute (DAB)

            2. Relative (DRL)

            3. Wh- (DWH)


vi. Adverb (A)

            1. Manner (AMN)      

            2. Location (ALC)                 


vii. Participle (VB)

        1. Participle Proper (VBP)

        2. Participle Gerundive (VBG)


viii. Particle (C)

            1. Coordinating (CCD)          

            2. Subordinating (CSB)         

            3. Classifier (CCL)                 

            4. Interjection (CIN)

            5. Negative (CNG)

            6. Emphatic (CEM)                            

            7. Other (CX)


ix. Punctuation (PU)


x. Residual (RD)

            1. Foreign word (RDF)

            2. Symbol (RDS)

            3. Others (RDX)


            B. Attributes and their values:

1. Gender (Gen)

            a. Masculine (mas)

            b. Feminine (fem)

            c. Neuter (neu)


2. Number (Num)

            a. Singular (sg)

            b. Dual (dl)

            c. Plural (pl)


3. Person (Per)

            a. First    (1)

            b. Second (2)

            c. Third   (3)


4. Case (Cs)

            a. Nominative (nom)

            b. Accusative (acc)

            c. Instrumental (ins)

            d. Dative (dat)

            e. Ablative (abl)

            f. Genitive (gen)

            g. Locative (loc)

            h. Vocative (voc)


5. Nominal declension {vibhakti} (Vbh)          (case marker)

            a. Prathama (i)

            b. Dwitiya (ii)

            c. Tritiya (iii)

            d. Chaturthi (iv)

            e. Panchami (v)

            f. Shashthi (vi)

            g. Saptami (vii)

            h. Vocative (viii)


6. Tense/Mood (Tns/Mood)

            a. Present (prs)

            b. Aorist (aor)

            c. Imperfect (imprf)

            d. Perfect (prf)

            e. Periphrastic Future (phf)

            f. General Future (gft)

            g. Imperative (imp)

            h. Potential (pot)

            i. Benedictive (ben)

            j. Conditional (cnd)



7. Numeral (Nml)

            a. Ordinal (ord)

            b. Cardinal (crd)

            c. Non-numeral (nnm)


8. Distance (Dist)

            a. Proximal (prx)

            b. Distal (dst)



9. Emphatic (Emph)

            a. Yes   y

            b. No   n



10. Negative (Neg)

            a. Yes   y

            b. No    n



11. Honorificity (Hon)

            a. Yes   y

            b. No    n


            C. Common values for all the attributes:

a. Not-applicable (0): used when no other value is applicable to the category or the relevant morpho-syntactic feature is not available.

b. Undecided or doubtful (x): used when the annotator is unsure about a possible tag.


4. Special cases

5. Conclusion


1. Objective

The goal of this tagset framework is to annotate (tag) Sanskrit text, i.e. to assign to each word the correct part-of-speech tag in the context of the sentence. In this framework we follow the hierarchical and decomposable tagset schema of IL-POSTS. Despite significant work on computational Sanskrit, POS tagging of Sanskrit is still in its infancy. Part-of-speech annotation serves as a fundamental building block for NLP research. This new framework addresses Sanskrit data exclusively.


2. Structure of the tagset

            a. Categories: categories are the primary grammatical classes to which words belong. ‘Grammatical’ here means, roughly, the part of speech by which each individual word is recognized, e.g., noun, verb, adjective etc.

            b. Types: types are the subclasses or finer specification of the categories, which are determined on the basis of either form or function. E.g., common, proper etc. as the subcategory of the category ‘noun’.

            c. Attributes: attributes are the set of basic morpho-syntactic features of a type, like, person, number, gender etc.
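
The three levels described above can be pictured as a nested mapping. The following is only an illustrative sketch: the names `TAGSET` and `category_of` are hypothetical and not part of the guideline, and the codes are transcribed from the outline of this document.

```python
# Category code -> {type code: type name}. Categories without types (V, PU)
# map to an empty dict. Codes follow the outline of this guideline.
TAGSET = {
    "N": {"NC": "Common", "NP": "Proper"},
    "V": {},
    "P": {"PPR": "Pronominal", "PRF": "Reflexive", "PRC": "Reciprocal",
          "PRL": "Relative", "PWH": "Wh-"},
    "J": {"JJ": "Adjective", "JQ": "Quantifier"},
    "D": {"DAB": "Absolute", "DRL": "Relative", "DWH": "Wh-"},
    "A": {"AMN": "Manner", "ALC": "Location"},
    "VB": {"VBP": "Participle Proper", "VBG": "Participle Gerundive"},
    "C": {"CCD": "Coordinating", "CSB": "Subordinating", "CCL": "Classifier",
          "CIN": "Interjection", "CNG": "Negative", "CEM": "Emphatic",
          "CX": "Other"},
    "PU": {},
    "RD": {"RDF": "Foreign word", "RDS": "Symbol", "RDX": "Others"},
}


def category_of(tag):
    """Return the category code for a category or type tag, or None if unknown."""
    if tag in TAGSET:
        return tag
    for category, types in TAGSET.items():
        if tag in types:
            return category
    return None
```

Looking up the category from a type code in this way makes it easy to check that every tag used in a corpus belongs to the hierarchy.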


3. Description of the tags

Always tag the attributes if they are morphologically present. The description of the tags is elaborated below:

3.1. Categories and their Types and 3.2. their Attributes

While marking attributes, we concentrate on form: if an attribute is present in the word morpho-syntactically, we mark it accordingly. While marking types, we must also take function into account. However, this is only a guiding principle and may vary depending on the context.

3.1.1 NOUNS (N)

The types and attributes of a NOUN are –

TYPE                 ATTRIBUTES

Common Noun (NC)  gender, number, case, nominal declension

Proper noun (NP)        gender, number, case, nominal declension

A noun is generally inflected for gender, number and case, and consequently for nominal declension.



Common nouns in this tagset are words that belong to the classes of common noun (person, place or thing), abstract noun (emotions, ideas etc.), collective noun (group of things, animals, or persons), countable and non-countable nouns, nouns in a complex verb, etc.


Sanskrit has only grammatical gender, so in this framework we tag words with their grammatical gender and do not consider their semantic (natural) gender. No definite rules can be laid down for determining the gender of words in Sanskrit; it is best learned from the dictionary or from usage. Certain words are found in more than one gender, and we annotate them according to their meaning, since in such cases the meaning varies with the gender. For example, the word मित्र 'mitra' is found in the masculine (मित्रः 'mitrah'), meaning sun, and in the neuter (मित्रम् 'mitram'), meaning friend. There are also some words whose natural gender and grammatical gender coincide.



In Sanskrit, selecting the value of the number attribute is not a tricky task. Unlike most Indian languages, Sanskrit has three numbers: singular, dual, and plural, and Sanskrit words generally have different inflections for each number. The number of certain words is fixed for all usages; for example, अप् 'ap' (water) is always used in the plural in all its declensions. One needs to be well acquainted with Sanskrit grammar to do this job.



The Paninian grammatical framework acknowledges six cases, but in this framework we assume eight cases for the sake of uniformity across Indian languages and linguistic description, as they do exist in practice. Thus we have incorporated the genitive and vocative cases and given them the full status of a case.

            Case                            Most frequent declension

            a. Nominative (nom)             first

            b. Accusative (acc)             second

            c. Instrumental (ins)           third

            d. Dative (dat)                 fourth

            e. Ablative (abl)               fifth

            f. Genitive (gen)               sixth

            g. Locative (loc)               seventh

            h. Vocative (voc)               eighth
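
The case table above can be read as a lookup from a case value to its most frequent declension value. The sketch below is a hypothetical helper (the names `CASE_TO_DEFAULT_VIBHAKTI` and `default_vibhakti` are not part of the guideline); it only suggests a starting value, which the annotator still has to confirm in context.

```python
# Case value -> most frequent nominal declension (vibhakti) value,
# using the abbreviations listed in the outline of this guideline.
CASE_TO_DEFAULT_VIBHAKTI = {
    "nom": "i",
    "acc": "ii",
    "ins": "iii",
    "dat": "iv",
    "abl": "v",
    "gen": "vi",
    "loc": "vii",
    "voc": "viii",
}


def default_vibhakti(case):
    """Suggest a declension value for a case value.

    '0' (not applicable) is passed through unchanged; any value that is not
    in the table yields 'x' (undecided), so that it gets reviewed by hand.
    """
    if case == "0":
        return "0"
    return CASE_TO_DEFAULT_VIBHAKTI.get(case, "x")
```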


Case recognition in Sanskrit poses no problem provided the annotator has a good practical knowledge of the "Siddhant Kaumudi" by Bhattoji Dixit. Given the sound tradition of learning Sanskrit grammar, we believe there is no need to say anything special here, except that in this framework we mark the instrumental case in constructions like रामेण, रमया etc. that are in the third declension and are subjects of their verbs by the Paninian rule "कर्तृकरणयोस्तृतीया".


Nominal Declension:

Finding the nominal declension value in Sanskrit is easy for anyone with a good grasp of Sanskrit grammar. Generally, nominal declension is determined by the case of its host. Where the same word form is found in several declensions, it becomes more difficult to assign a value. In such cases we look at the possible meaning of the word in context, work out its case, and then decide the declension value. One needs to be well acquainted with Sanskrit grammar to do this job.




When a word denotes a specific name of a person, place, shop, institution, date, day, month, species, etc., or whatever is considered to be a name, it is marked as a proper noun. If a word of some other category is used as a proper noun in a context, it should be marked as a proper noun.

The attributes and their assignment in Proper nouns are the same as those for Common nouns.


3.1.2 VERB (V)

In Sanskrit, only a finite verb is found in a sentence. For the moment we do not distinguish between parasmaipada and atmanepada. The attributes of a verb are number, person, tense/mood, and honorificity. Honorificity has to be marked where a singular entity is treated as plural and takes the plural form of the verb.



3.1.3 PRONOUN (P)

The types and attributes of pronouns are –

TYPES                        ATTRIBUTES

1. Pronominal (PPR)   gender, number, person, case, nominal declension, emphatic, honorificity


2. Reflexive (PRF)                  gender, number, case, nominal declension

3. Reciprocal (PRC)                gender, number, case, nominal declension

4. Relative (PRL)                    gender, number, person, case, nominal declension

5. Wh (PWH)                          gender, number, person, case, nominal declension



Pronominals include all personal pronouns, inclusive pronouns and indefinite pronouns. In the case of indefinite pronouns, the person attribute should be annotated as [3], i.e. third person, as they take the third-person verb form.

Personal pronouns: अहम्, त्वम्, भवान्, सः etc.

Inclusive pronouns: सर्वम्, उभयम् etc.

Indefinite pronouns: कश्चित्, किंस्वित्, etc.


Gender: gender information is morphologically encoded in the pronouns. However, the first person and second person in Sanskrit have no gender information on their own. They may be annotated as not applicable (0). 


Number: number is morphologically marked in pronouns in almost all cases. The default value is singular ‘sg’. Annotate it as dual ‘dl’ or plural ‘pl’ if that number is morphologically present in the word. In the case of inclusive pronouns, the number attribute should be annotated according to the morphological number in which they are used in the sentence.


Person: the person attribute is annotated only in personal pronouns. To assign a person value, look at the person of the verb that the pronoun takes.


Case and Nominal declension marking for pronominals is the same as for nouns.


Emphatic: this attribute is assigned when the pronominals are combined with other pronouns or nouns. For example, सोहं-रघूणामन्वयं वक्ष्ये "that I will describe the race of the Raghus"; ते वयं दमयन्त्यर्थं चरामः पृथिवीमिमाम् "we, of this description, roam over the earth for (in search of) Damayanti". Here सः and ते carry the emphatic information.


Honorificity: honorificity is informed by the use of plural form for singular referent. It can be found in any of the pronominals.

Emphatic and honorificity are tagged as 'y' (present) and 'n' (absent).
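
The rules above for pronominals can be summarised in a small helper. This is only a sketch under stated assumptions: the function name `pronominal_attributes`, its `indefinite` flag and the dictionary keys (taken from the attribute abbreviations in the outline) are illustrative and not prescribed by the guideline.

```python
def pronominal_attributes(gender, number, person, case, vibhakti,
                          emphatic="n", honorific="n", indefinite=False):
    """Build the attribute values for a pronominal (PPR) token.

    Applies two rules stated in the guideline:
    - indefinite pronouns are annotated as third person ('3');
    - first and second person pronouns carry no gender, so gender is '0'.
    """
    if indefinite:
        person = "3"
    if person in ("1", "2"):
        gender = "0"
    return {
        "Gen": gender,
        "Num": number,
        "Per": person,
        "Cs": case,
        "Vbh": vibhakti,
        "Emph": emphatic,
        "Hon": honorific,
    }
```

For instance, `pronominal_attributes("mas", "sg", "1", "nom", "i")` returns a record whose gender value is `0`, whatever gender was passed in.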


REFLEXIVE: A reflexive pronoun is a pronoun that is preceded by the noun or pronoun to which it refers (its antecedent). In Sanskrit, the sense of the reflexive pronoun is expressed by words like आत्मन्, स्वयम् etc.; e.g. राजा स्वयं समरभूमिं जगाम "the king himself went to the battlefield". Their gender and number are determined by their antecedent. For case and declension we have to look at the context.


RECIPROCAL: reciprocity is expressed by the repetition of the pronominal adjectives; e.g. अन्योन्य, इतरेतर, एकैक, and परस्पर. These are generally used in the singular. Their gender is determined by their referents. Their morphological number and nominal declension should be marked and also the case if it is relevant in the context, otherwise mark as not applicable.


RELATIVE (PRL):  A relative pronoun is a pronoun that links two clauses into a single complex clause, e.g. यत् (which), यम् (whom), and या (who-fem).  

Relative pronouns bear the same attributes as those for the pronominals barring emphatic and honorificity.


WH- PRONOUNS (PWH): Wh- Pronouns like कः, किम्, etc. hold the attributes gender, number, person, case, and nominal declension.



3.1.4 NOMINAL MODIFIER (J)

The types and attributes of nominal modifiers are –

TYPES                        ATTRIBUTES

Adjective (JJ)                        gender, number, case, nominal declension, emphatic, negative, honorificity

Quantifier (JQ)            gender, number, case, nominal declension, numeral, emphatic



An adjective modifies a noun; hence it is kept as a type of nominal modifier in the framework. Adjectives are not always followed by nouns, however; they can be used as predicates too. The first kind is called an attributive adjective and the second a predicative adjective. An adjective can function as a noun if it is not followed by a modified noun; in that case it is called an absolute adjective. However, these distinctions make no difference to the attribute set in the tagset, nor do comparative and superlative adjectives. An adjective in Sanskrit is inflected for gender, number, case, and nominal declension; in other words, it agrees with the noun it qualifies.



A quantifier is a word which quantifies the noun, i.e., it expresses the noun’s definite or indefinite number or amount e.g., 

दशम्, तृतीयः, कतिपय, सर्वे

We mark gender and number in these words as in the Sanskrit grammatical tradition. Their case and nominal declension are marked as for adjectives.

Numerals: the values for numerals are cardinal, ordinal and non-numeral. Any word other than cardinals and ordinals is annotated as non-numeral.

Ordinal: quantifiers that denote order

Cardinal: number words

Non-numerals: quantifiers other than cardinals and ordinals, including existential and universal quantifiers, modifiers, etc.

Modifiers can modify a noun as well as a verb. In the latter case, we presume that there is an ellipsis of a noun and hence the construction looks like a verbal modifier.

When a quantifier is not followed by a noun, annotate it as a noun.



3.1.5 DEMONSTRATIVE (D)

Types and attributes of demonstratives are -

TYPES                                    ATTRIBUTES

1. Absolute Demonstrative (DAB)                 gender, number, person, case, nominal declension, distance, honorificity

2. Relative Demonstrative (DRL)                   gender, number, person, case, nominal declension, honorificity

3. Wh- Demonstrative (DWH)                        gender, number, person, case, nominal declension, honorificity


Demonstratives have the same form as pronouns, but distributionally they differ from pronouns, as they are always followed by a noun, adjective or another pronoun.

e.g., इदम् पुस्तकम्, तद्बालकः, सः महाभागः,

For all the attributes in demonstrative, go by the rules stated in the pronouns.


Relative demonstratives are indistinguishable from relative pronouns, except that a demonstrative is followed by a noun, pronoun or adjective. The distance attribute is absent in DRL.

यः पुरुषः, यत्कार्यम्


Wh- demonstratives are indistinguishable from wh- pronouns, except that a demonstrative is followed by a noun, pronoun or adjective. There is no change in the morphological form. E.g.,

केन बालकेन, कः महापुरुषः
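
The distributional test described above (a demonstrative is followed by a noun, pronoun or adjective) can be sketched as a small rule. The helper name `resolve_pronoun_or_demonstrative` is hypothetical, and the pairing of Pronominal (PPR) with Absolute Demonstrative (DAB) is an assumption made for illustration; the guideline itself only pairs the relative and wh- types explicitly.

```python
# Pronoun type -> demonstrative type with the same surface form.
# PPR -> DAB is an assumption; PRL -> DRL and PWH -> DWH follow the text.
PRONOUN_TO_DEMONSTRATIVE = {"PPR": "DAB", "PRL": "DRL", "PWH": "DWH"}

# Type tags of words that may follow a demonstrative:
# nouns, adjectives and pronouns.
FOLLOWERS = {"NC", "NP", "JJ", "PPR", "PRF", "PRC", "PRL", "PWH"}


def resolve_pronoun_or_demonstrative(pronoun_type, next_tag):
    """Return the demonstrative type if the next word is a noun, adjective
    or pronoun; otherwise keep the pronoun type unchanged."""
    if next_tag in FOLLOWERS and pronoun_type in PRONOUN_TO_DEMONSTRATIVE:
        return PRONOUN_TO_DEMONSTRATIVE[pronoun_type]
    return pronoun_type
```

For केन बालकेन, where the following word is a common noun, `resolve_pronoun_or_demonstrative("PWH", "NC")` yields `DWH`. The rule looks only at the immediately following tag, so it is a first approximation that the annotator should verify.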


3.1.6. ADVERB (A)

An adverb belongs to a group of words that modifies the verb, adjective or the sentence.

Types and attributes of adverbs are -

TYPES                                    ATTRIBUTES

Adverbs of Manner (AMN)    [0]

Adverbs of Location (ALC)   [0]

Almost all adverbs in Sanskrit are indeclinable, so for the moment we do not assign them any attributes.



3.1.7. PARTICIPLE (VB)

In this tagset, participles are the kridantas that most often function like a verb, and rarely like an adjective, in a sentence. They inflect for gender, number, case and nominal declension. They may be followed by a finite verb. (The kridantas which behave like a noun or an indeclinable 'avyaya' should not be treated as participles.) The kridantas (primary derivatives) which do not inflect at all are treated as particles.

Types and attributes of participles are -

TYPES                                    ATTRIBUTES

1. Participle Proper (VBP)                  gender, number, case, nominal declension

2. Participle Gerundive (VBG)           gender, number, case, nominal declension

Dr. R. Chandrashekar has termed the krityapratyayaantas gerundives and the rest participles. For the time being we follow him. Examples of gerundives are कार्यम्, कर्तव्यम्, करणीयम् etc., and examples of participles proper are लभमानम्, गच्छत्, उक्तवत्, दृष्टवान्, करिष्यमाण etc.


3.1.8. PARTICLE (C)

A particle is a word that does not belong to one of the main parts of speech, is invariable in form, and typically has grammatical or pragmatic meaning. Most particles are indeclinables ('avyayas').

TYPES                                    ATTRIBUTES

1. Coordinating (CCD)          

2. Subordinating (CSB)         

3. Classifier (CCL)                 

4. Interjection (CIN)              

5. Negative (CNG)

6. Emphatic (CEM)

7. Others (CX)                                   

We don’t assign any attribute for particles. e.g.

ननु\CX, \CCD                     


Coordinating particles are particles which act as conjunctions that link constituents without syntactically subordinating one to the other. They are similar to English 'and', 'or' and 'but'; e.g.,

\CCD, अपिच\CCD etc.


A subordinating particle is a particle that acts as a conjunction linking constructions by making one of them a complement of another. E.g., परम्\CSB, परन्तु\CSB, यत्\CSB, तदपि\CSB etc.


A classifier particle acts as a unit noun, e.g., 500 कोटिः\CCL, इत्यादि\CCL etc.


Words that express emotion are interjections, e.g., बत\CIN, अहो\CIN, हा\CIN, धिक्\CIN, स्वधा\CIN etc.


The indeclinables which are used for negative meaning are treated under this category. For example, \CNG, मा\CNG etc.


The indeclinables which are used for emphasis should be tagged as CEM. E.g.,

एव\CEM etc.


The tag CX is used for all other particles which cannot be grouped under the above-mentioned types. E.g.,

किल\CX, खलु\CX, तु\CX, किंचित्\CX etc.



3.1.9. PUNCTUATION (PU)

The punctuation marks are ‘’, ‘,’, “, ‘;’, ‘?’, ‘!’ . They are tagged as \PU. They do not have any attributes.


3.1.10. RESIDUAL (RD)

Residuals are words that cannot be categorized under any category-type described so far.

The types of Residual are-

TYPES                        ATTRIBUTES

Foreign word (RDF)  

Symbol (RDS)           

Others (RDX)


Residuals do not have any attributes.


Foreign words are words written in any script other than देवनागरी. Borrowed words and foreign names that are not Sanskritized are also treated as foreign words. E.g.,

buildings\RDF, Alexander\RDF the\RDF great\RDF, ˈaɪzək\RDF ˈæzɪˌmɑv\RDF, Исаак\RDF Озимов\RDF etc.


Symbols are characters which are not used as punctuation marks. Devanagari and Roman abbreviations are also treated as symbols, e.g., $\RDS, &\RDS, +\RDS, %\RDS, @\RDS, यू.पी.\RDS etc.


The tag RDX is given to words written in English numerals or in Sanskrit numerals (for the time being we do not have a proper tag for them). E.g., 1352\RDX, 907\RDX, ६४७\RDX etc.
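
The script-based part of the residual rules can be approximated automatically. The sketch below is a rough heuristic under stated assumptions, not the guideline's procedure: the names `PUNCTUATION`, `is_devanagari` and `residual_type` are hypothetical, the punctuation set contains only the marks that are legible in this document, and cases that need lexical knowledge (non-Sanskritized borrowings written in Devanagari, abbreviations such as यू.पी.) are deliberately left to the annotator.

```python
import unicodedata

# Punctuation marks legible in this guideline; they are tagged PU, not RD.
PUNCTUATION = {",", ";", "?", "!"}


def is_devanagari(ch):
    """True if the Unicode character name marks ch as Devanagari."""
    return unicodedata.name(ch, "").startswith("DEVANAGARI")


def residual_type(token):
    """Return 'RDF', 'RDS' or 'RDX' for a token, or None if the token does
    not look like a residual on script grounds alone."""
    if not token or token in PUNCTUATION:
        return None
    # English or Sanskrit (Devanagari) numerals.
    if token.isdigit():
        return "RDX"
    letters = [ch for ch in token if ch.isalpha()]
    if letters:
        # Any Devanagari letter: leave the word to the ordinary categories.
        if any(is_devanagari(ch) for ch in letters):
            return None
        return "RDF"
    # No letters and no digits at all: treat as a symbol.
    if not any(ch.isalnum() for ch in token):
        return "RDS"
    return None
```

With this heuristic, `residual_type("Исаак")` gives `RDF`, `residual_type("६४७")` gives `RDX` and `residual_type("$")` gives `RDS`, while an ordinary Devanagari word such as रामः gives `None`.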




4. Special cases

This is an important and problematic aspect of annotation as well as of the guidelines. We treat a noun as a PROPER NOUN if it denotes a name. Words which precede the name of a person, such as श्री, आचार्यः, महोदयः etc., are considered part of the proper noun and are tagged as proper nouns.


5. Conclusion

This is a guideline for Sanskrit part-of-speech tagging which tries to accommodate the small nuances of annotation in natural languages. The broader scope of the framework is to accommodate all the natural languages of India within a single framework. However, it is almost impossible to capture all the subtleties of a natural language. This guideline is specific to Sanskrit and aims to give clues that help with disambiguation during tagging. This is the first version of the guideline, and it may not capture many of the subtleties encountered in practice. Please point out all the exceptions or examples you find in the corpora; they will be useful in improving the guideline.





                                            महर्षिः दयानन्दः

काठियावाड़प्रान्तस्य\ टंकारानामके\ ग्रामे\ मूलशंकरस्य\ जन्म\ अभवत्\ \PU अस्य\ जनकस्य\ नाम\ कर्षणलालः\ आसीत्\ \PU अष्टमे\ वर्षे\ मूलशंकरस्य\ यज्ञोपवीत-संस्कारः\ अभवत्\ \PU अस्य\ स्मरणशक्तिः\ अद्भुता\ आसीत्\ \PU चतुर्दशे\ वर्षे\ सः\ यजुर्वेदसंहिताम्\ कण्ठस्थाम्\ अकरोत्\ \PU शिवस्य\ महत्तां\ श्रुत्वा\VBP सः\ शिवरात्रौ\ व्रतम्\ अकरोत्\ \PU सः\ तस्यां\ रात्रौ\ \CX सुप्तः\VBP \PU तदा\ALC.0 देवालये\ सः\ अपश्यत्\ यत्\CSB.n शिवमूर्तिं\ परितः\CX मूषकाः\ अकूर्दन्\VBP तत्र\ALC.0 \CCD.n पतितान्\ तण्डुलान्\ अखादन्\ \PU एतत्\ दृष्ट्वा\VBP मूलशंकरस्य\ श्रद्धा\ अनश्यत्\ \PU अयम्\ वृत्तान्तः\ तस्य\ जीवने\ महत्त्वपूर्णः\ आसीत्\ \PU यदा\ALC मूलशंकरः\ षोडशवर्शीयः\ आसीत्\ तदा\ALC तस्य\ चतुर्दशवर्षीया\ अनुजा\ रुग्णा\ अभवत्\ \PU शीघ्रम्\ALC एव\CX सा\ प्राणान्\ अत्यजत्\ \PU यदा\ALC सः\ नवदशवर्शीयः\ अभवत्\ तदा\ALC तस्य\ पितृव्यः\ विषूचिकानामकेन\ रोगेण\ आक्रान्तः\ अभवत्\ मृतः\ \CCD.n \PU शिवलिङ्गस्य\ उपरि\CX मूषकाणाम्\ कूर्दनेन\ जीवनस्य\ \CCD.n अस्थिरतया\ मूलशंकरस्य\ हृदये\ वैराग्यस्य\ भावना\ उत्पन्ना\ अभवत्\ \PU अस्य\ वैराग्यस्य\ भावनां\ ज्ञात्वा\CX परिवारस्य\ सदस्याः\ मूलशंकरस्य\ विवाहस्य\ चिन्तां\ अकुर्वन्\ \PU एतत्\ ज्ञात्वा\CX एकविंशतिवर्षीयः\ मूलशंकरः\ गृहम्\ अत्यजत्\ \PU १८६०तमे\ वर्षे\ नवम्बरमासस्य\ चतुर्दशतारिकायाम्\ सः\ मथुरानगरे\ गुरोः\ विरजानन्दस्य\ समीपम्\CX अगच्छत्\ विद्याध्ययनं\ \CCD.n अकरोत्\ \PU महर्षिः\ दयानन्दः\ सरस्वती\ मुम्बईनगरे\ १८५७तमे\ वर्षे\ विधिपूर्वकम्\AMN आर्यसमाजस्य\ स्थापनाम्\ अकरोत्\ \PU शीघ्रम्\AMN एव\CEM महर्षिदयानन्दस्य\ प्रयत्नैः\ उत्तरभारते\ आर्यसमाजस्य\ प्रसारः\ अभवत्\ \PU आर्यसमाजः\ समाजे\ प्रचलितान्\ अन्धविश्वासान्\ अस्पृश्यताम्\ अविद्यां\ \CCD भारतवर्षाद्\ बहिष्कर्तुं\CX प्रयत्नम्\ अकरोत्\ \PU महर्षिः\ दयानन्दः\ येषाम्\ सिद्धान्तानाम्\ प्रचारम्\ अकरोत्\ ते\ सर्वे\ सत्यार्थप्रकाशनामके\ ग्रन्थे\ उल्लिखिताः\ सन्ति\ \PU सः\ संस्कारविधिम्\ अपि\CEM अलिखत्\ \PU एतावत्\ महत्\ कार्यं\ कृत्वा\CX १८८३तमे\ वर्षे\ एषः\ महर्षिः\ ईश्वरं\ स्मरन्\ स्वदेहम्\ अत्यजत्\ \PU
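
The tagged sample above writes each token as a word, a backslash and a tag, optionally followed by dot-separated attribute values (for example तदा\ALC.0). A minimal reader for that layout might look as follows. This is a sketch: the function names `parse_token` and `parse_line` are hypothetical, and the dot as attribute separator is inferred from the sample rather than stated in the guideline. Tokens whose tag is empty in the sample are returned with `None` as the tag.

```python
def parse_token(token):
    """Split one tagged token into (word, tag, attributes).

    The text after the last backslash is the annotation; its first
    dot-separated field is the tag and the rest are attribute values.
    An empty annotation gives tag None and no attributes.
    """
    word, sep, annotation = token.rpartition("\\")
    if not sep:
        # No backslash at all: an untagged word.
        return token, None, []
    tag, *attributes = annotation.split(".")
    return word, tag or None, attributes


def parse_line(text):
    """Parse a whitespace-separated run of tagged tokens."""
    return [parse_token(token) for token in text.split()]


# Usage on a fragment of the sample text (raw string keeps the backslashes).
for word, tag, attributes in parse_line(r"तदा\ALC.0 देवालये\ सः\ अपश्यत्\ यत्\CSB.n"):
    print(word, tag, attributes)
```

Punctuation entries in the sample appear as a bare `\PU`, which this reader returns with an empty word and the tag `PU`.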