The purpose of the TEI Special Interest Group “Indic Texts” is to allow scholars engaged in the study of Indic texts to develop and document best practices in applying the TEI’s Guidelines to these kinds of texts. To participate, please join the mailing list at

There are several respects in which the applicability of the TEI Guidelines to these texts is less than obvious. These relate to distinctive features of Indic textuality, including:

  • the use of syllabic scripts, and the non-coincidence of grapheme- (syllable-) and word-boundaries;
  • the application of phonotactic rules (sandhi) that further obscures the boundaries between words;
  • extensive compounding;
  • the use of distinctive media and writing supports (such as birch bark, palm leaves, and copper plates);
  • distinctive metrical patterns with different types of caesuras;
  • the prominence of the commentary as a genre, and the depth of intertextual relations this implies;
  • the frequent reuse of texts in other texts, which requires careful and deliberate application of the elements in the TEI "model.quoteLike" class (such as <quote> and <cit>).

The expected outcome of the SIG’s work is a practical guide that analyzes common cases in the markup of Indic texts and proposes ways in which the analytical tools provided by the TEI Guidelines might best be used in these cases, discussing benefits and drawbacks of the solutions possible. Ideally, this guide will become a part of the TEI Guidelines.


Manuscript Transcription

Canonically, an akṣara makes up a single "grapheme," and this is reflected in Unicode representations of Indic scripts, where consonants and independent vowels are encoded first, and vowel markers (and dependent consonants like anusvāraḥ and visargaḥ) are encoded after them as combining characters. Unless marked with a combining vowel character or a cancellation character (virāma), consonants are understood to have an inherent vowel a. The sequence of consonants within conjuncts is also canonically the same as their phonological sequence. Thus in the conjunct "rg", the "r" is encoded before the "g" both in transliteration and in Indic scripts: Devanagari र्ग (U+0930 + U+094D + U+0917) and Kannada ರ್ಗ (U+0CB0 + U+0CCD + U+0C97), although it is rendered on top of the "g" in Devanagari and to the right of the "g" in Kannada.
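The encoding order described above can be checked directly. The following is a minimal Python sketch (the helper name describe is only illustrative) that lists the stored code points of the two conjuncts, independently of how a font renders them:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The conjunct "rg": RA + VIRAMA + GA in both scripts.
devanagari_rga = "\u0930\u094D\u0917"
kannada_rga = "\u0CB0\u0CCD\u0C97"

for label, conjunct in (("Devanagari", devanagari_rga), ("Kannada", kannada_rga)):
    print(label, conjunct)
    for code, name in describe(conjunct):
        print("  ", code, name)
```

In both scripts the letter RA comes first in the stored sequence, followed by the virāma and the letter GA; only the visual placement of the "r" differs.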

Cancelling dependent vowels

In manuscripts, dependent vowel markers can be cancelled, and the consonant is then read with the inherent vowel a.


Encoding this kind of change raises technical problems, whether one is using an Indic script or an alphabetic transliteration system (such as IAST or ISO 15919):

  • In Indic scripts, rendering problems are likely if the cancelled vowel marker is enclosed on its own within a <del> element, since it is a combining character that is thereby separated from its base consonant;
  • In transliteration, the deletion of one vowel must be accompanied by the addition of the inherent vowel, although there is no addition marked as such in the manuscript.

Two possible solutions came up on the mailing list. Both involve the use of the <subst> element, which contains the akṣara that is subject to scribal modification, and within it, the <add> and <del> elements.

The first, and most straightforward, solution is to treat the akṣara, and not the "akṣara part," as the smallest unit of variation in the manuscript, and thus to include the consonant in the <add> and <del> elements. The correction can thus be read as changing "ḷo" into "ḷa".

<subst><del type="cancelled">ḷo</del><add>ḷa</add></subst>

The other option involves putting only vowel markers in the <add> and <del>. This is more precise, but since a dependent "a" cannot actually be represented in Indic scripts, it requires that the transcription be displayed in Roman transliteration (or otherwise some ad-hoc processing will be necessary).

<subst>ḷ<del type="cancelled">o</del><add place="implicit">a</add></subst>

Of course, not all projects will require markup of this granularity.
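Whichever of the two encodings is chosen, a processor should be able to recover the same readings from it. The following is a minimal Python sketch, not a reference implementation: the function name reading is hypothetical, and it assumes un-namespaced fragments in Roman transliteration exactly like the two examples above.

```python
import xml.etree.ElementTree as ET

def reading(xml_fragment, keep):
    """Flatten a <subst> fragment to plain text.

    keep="add" gives the corrected reading, keep="del" the original one.
    """
    drop = "del" if keep == "add" else "add"

    def collect(elem):
        parts = [elem.text or ""]
        for child in elem:
            if child.tag != drop:
                parts.append(collect(child))
            # Text following a child is kept even when the child is dropped.
            parts.append(child.tail or "")
        return "".join(parts)

    return collect(ET.fromstring(xml_fragment))

# The two encodings discussed above.
whole_aksara = '<subst><del type="cancelled">ḷo</del><add>ḷa</add></subst>'
vowel_only = '<subst>ḷ<del type="cancelled">o</del><add place="implicit">a</add></subst>'

for fragment in (whole_aksara, vowel_only):
    print(reading(fragment, "add"), reading(fragment, "del"))
```

Both fragments give "ḷa" as the corrected reading and "ḷo" as the original one, so the choice between them affects the granularity of the record rather than the text that can be extracted.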

Floating consonants

When an orthographically dependent consonant is separated from another consonant, for instance by a binding hole, it can generally be transcribed without any special markup:

śa<space type="binding-hole"/>ḥ

But when the orthographic sequence of consonants differs from the canonical sequence of consonants, this is not possible.


If necessary, the out-of-sequence consonant could be specifically marked as such:

māgg<space type="binding-hole"/><g ref="#floating-r">r</g>

In this case a processor should be able to "fix" the sequence of characters given this markup. But the group advised that this level of markup was not generally required.
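What such a "fix" might look like can be sketched in Python. This is only an illustration under several assumptions that do not come from the list discussion: the function name fix_floating and the wrapping <seg> element are hypothetical; the floating consonant is assumed to belong immediately before the consonant cluster that precedes it; and the vowel inventory is a simplified IAST set that ignores, for instance, the ambiguity of "ḷ".

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Simplified IAST vowel letters (an assumption; adjust per project).
VOWELS = "aāiīuūṛṝeo"
TRAILING_CONSONANTS = re.compile(rf"[^{VOWELS}\s]+$")

def fix_floating(xml_fragment):
    """Return plain text with floating consonants moved into canonical order."""
    root = ET.fromstring(unicodedata.normalize("NFC", xml_fragment))
    out = root.text or ""
    for child in root:
        text = child.text or ""
        if child.tag == "g" and child.get("ref", "").startswith("#floating"):
            # Insert the floating consonant before the trailing consonant run.
            match = TRAILING_CONSONANTS.search(out)
            pos = match.start() if match else len(out)
            out = out[:pos] + text + out[pos:]
        else:
            # Other elements (e.g. the empty <space/>) contribute only their text.
            out += text
        out += child.tail or ""
    return out

example = '<seg>māgg<space type="binding-hole"/><g ref="#floating-r">r</g></seg>'
print(fix_floating(example))
```

For the example above, this moves the "r" in front of "gg", giving "mārgg". A real processor would need a fuller model of the script and transliteration scheme in use.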

Floating vowel markers

Floating vowel markers present a case analogous to that of floating consonants and should probably be encoded similarly if required.

