Computational Linguistics for Sanskrit and Modern Indian Languages

Leaving aside the socio-political arguments for the importance of Sanskrit, the question is: what else do we stand to gain from the study of Sanskrit language and literature, particularly in relation to linguistics and computational linguistics (CL)? This article explores answers to that question.

Sanskrit holds a unique position in the field of CL. Almost all other languages are studied only to produce computational tools and applications for the said languages, but Sanskrit is different. An understanding of the language not only allows us to make better tools but it also opens up the literature in this language, which has much to add to debates on the nature of language. As we venture to make tools and models of language in the hopes of one day building machines that have the capacity to generate and understand language to the same extent as humans, we are often faced with difficult questions concerning how language functions and what its nature is.

Sanskrit Language and Linguistics

The study of language in the Indian tradition began with the desire to maintain the purity of the dominant language, in this case Sanskrit, and to safeguard the authenticity of scripture and of its transmission. The Vedas were the primary scriptures and they were originally transmitted orally. Since the Vedas also contained hymns that were part of rituals, methods to maintain accuracy of speech were necessary, and a sophisticated system of phonetics was employed to codify the language of ritual and keep it free from change. The earliest attempts at this are seen in the 'Padapatha', from around 1000–700 BCE, which extracted individual words from the continuous recitation of the hymns (the 'Samhitapatha') and also morphologically analyzed the words into roots and affixes (Orara 1967). The Pratisakhya, a treatise developed around 600 BCE, took these aspects further and proposed a sufficiently developed grammar, including word types, derivation rules and rules for combining words to form sentences. These techniques have proven their effectiveness by preserving the language of the rituals to this day (Orara 1967). Treatises like these were a part of the Vedangas, or the 'Limbs of the Vedas'. Of the six 'limbs', four deal with the science of language: Shiksha, which is phonetics and phonology; Chhandas, which is meter and rhyme; Vyakarana, which is grammar; and Nirukta, which is etymology and lexicography. The Vedangas were developed to help the student interpret the scriptures correctly (Deshpande 2016, Orara 1967).

While the study of phonetics and metrics helped in the development of language science, the major impetus came from the study of vyakarana and nirukta (Deshpande 2016). Two treatises, one from each of these domains, have come down to us. The Nirukta is a treatise by Yaska on the etymology of rare and ambiguous Vedic words. The morphological analysis of words in terms of roots and suffixes, and the grammatical description of a set of words in terms of parts of speech, were already available in the Padapathas (Orara 1967).

Panini's 'Ashtadhyayi' is a treatise in the vyakarana domain that contains about 4000 sutras. Together they provide an intricate system of rules covering a wide array of linguistic matters, such as the composition of nouns and case relations, the transformation of roots and nouns using suffixes, and accent changes in word-formation and sentence construction (Deshpande 2016). The work of Panini is the most prominent work in the Indian linguistic tradition and has sparked a line of scholarship that continues to this day. Panini is considered to belong to the mid-fourth century BCE or earlier (Cardona 2007:268). He draws on earlier scholarship on language and names many authors, but most of that literature is now lost to us. It is with Panini that 'linguistics', in the sense of the study of language as a separate phenomenon, begins (Deshpande 2016). He moves away from the older tradition by choosing as his material not the text of the Vedas, with its artificial language used to describe rituals, but the spoken language of the priests, i.e., bhasha. He also reduced the identifiable parts of speech to two: verb and non-verb. Panini probably composed his treatise in writing; his use of symbols is taken as evidence for this, and he refers to the visible, and therefore written, aspect of language.

Many commentators further developed his thesis because, although the sutras are both concise and precise, their interpretations can conflict. For example, Patanjali's 'Mahabhashya' discusses Panini's grammar and other commentaries on it, like Katyayana's Varttikas, and deals with technical as well as philosophical issues of Panini's grammar. 'Siddhanta-Kaumudi' is a commentary by Bhattoji Dikshit, a 17th-century grammarian from Maharashtra, who reordered the sutras in a way that makes them easy for a student to follow but distorts the architecture of the system designed by Panini (Orara 1967, Hamilton 2001, Craggs 2011).

The object of analysis for the Paninian tradition was Sanskrit and though the analysis had shifted from the language of the Vedas to the spoken language of the priests and the upper class, Sanskrit was still considered to be a 'higher' language. According to Katyayana and Patanjali, other languages did perform the task of communicating but only Sanskrit had religious merit. Other schools of philosophy borrowed from the Paninian tradition in their effort to defend the Vedas.

No model of language can be complete without a thorough theory of meaning. The Indian philosophers too asked the question, 'How does language produce meaning?'. The area often referred to as 'shabda-bodha', or verbal cognition, is particularly influenced by the Sanskrit grammarians. Most traditions claimed that at every level smaller units of 'meaning' combined to form larger units, e.g., stems combined to form words and words combined to form sentences. For some the whole was equal to the sum of its parts, and for others, the whole was bigger, i.e., the meaning of a sentence was either seen to derive from a simple addition of the smaller units that made up the sentence, or it was posited that with every step, every layer, the meaning of a sentence would be further developed.

Let us consider some of the major schools of thought. The Nyaya-Vaisheshikas were interested in logic, epistemology and ontology, and thus for them a sentence painted a picture of reality. The primary task of the Mimamsakas was interpreting the Vedas, and thus for them meaning was eternal and had no relation to the intention of the speaker, as the Vedas were claimed to be eternal and to have no author. The Buddhists were interested in explaining how language cannot depict reality and thus takes us away from true realization. The differences in the views of each school thus stem from the differences in the position that each took regarding language. The grammarians, on the other hand, were interested in understanding language in terms of cognition. Meaning, according to the grammarians, is a projection of a person's intellect. Bhartrihari, who comes in the Paninian tradition after Patanjali, develops a theory of meaning that deals with cognition and thus, without invoking metaphysical issues like the nature of reality, deals with the concepts we extract from language. For Bhartrihari, the components or units of language described by grammarians have to bear on cognition or communication, and the meaning of a sentence becomes an object in a flash of cognition (Orara 1967, Deshpande 2016).

Contact with Western Thought

It was under the colonial enterprise that the west discovered the Indian tradition of language science. The Orientalist scholars studied the Sanskrit language and ancient texts like the Vedas, and European scholarship in this domain grew as they found much to gain from the research. The Indologists compared the ancient languages of the west, like Greek and Latin, to Sanskrit, and scholars like William Jones declared that the similarities could not be dismissed as mere coincidences. Jones postulated a common ancestor for these languages, and thus was born an entire domain of research called historical linguistics (or comparative linguistics) that analyzed and compared languages in terms of phonology, morphology, syntax, etc., and organized them into families (Craggs 2011). Sanskrit belongs on the node level of the Indo-Aryan language family. While Jones was not influenced by the Ashtadhyayi, the influence of Sanskrit grammarians made him aware of the inadequacy of the Roman alphabet to capture all the sounds of language, and with this was born the idea of the International Phonetic Alphabet. Panini was introduced to the west by Thomas Colebrooke, ten years after Jones' death.

Jones was not the only scholar to study Sanskrit; other European scholars like Max Müller, Friedrich Schlegel, Franz Bopp and Jacob Grimm also studied the language and its literature. It was through the Sanskrit grammarians that the importance of the spoken language as an object of study was recognized by western scholarship, which until then had focused only on languages with a written corpus. All this scholarship had little impact on scholarship in India in the 19th century; the first noteworthy accomplishment of an Indian in this field came from Ramkrishna Gopal Bhandarkar, who in 1877 studied the development of Indo-Aryan vernaculars (Craggs 2011).

The two great figures of modern linguistics, Saussure and Chomsky, also owe some debt to Panini and to the Sanskrit grammarians. Ideas like the signifier and the signified, or the prominence of spoken language, which form the foundation of Saussure's work, can be found in slightly different terms in the Indian tradition. The Paninian structure shows similarity to the dependency parsing mechanism proposed by Lucien Tesnière, and the generative mechanism of Chomsky also has echoes of Paninian formalism.

Today, in India, scholarship in linguistics and Sanskrit has taken a different turn. Study of Sanskrit language and its tradition of language study is important not only to the primary disciplines to which this research belongs but also to the newer areas like natural language processing and artificial intelligence.

Panini's Ashtadhyayi

For computational linguistics, the Paninian tradition has insights to offer in many areas of research. The Ashtadhyayi and the tradition of commentaries that has developed over the ages provide us with perhaps the most comprehensive grammar ever developed for any language. Though knowledge of Sanskrit is necessary to understand and apply this 'grammar', the framework developed is general enough to be applicable to many modern Indian languages as well as languages of other families. The other area of research is the format in which the Ashtadhyayi is presented: not just the use of coded language in the concise 'sutra' format but also Panini's arrangement of rules has much to add to our knowledge of how algorithms, particularly for language processing and generation, are written.

The 4000 or so sutras in the Ashtadhyayi give a model of the Sanskrit language in every aspect—phonology, morphology, syntax and semantics—though it might be problematic to divide the model in these specific domains. The Paninian model is concerned with extracting information out of language and deals with language as a self-contained phenomenon, as much as possible, and therefore makes little appeal to knowledge of the world. This aspect brings it closer to the goals of natural language processing because it is difficult to encode 'common sense' in a machine.

Concepts like 'the verbal root' and 'kaaraka relations' form the foundations of the Paninian model of language. The action represented by a verbal root, or 'dhatu', is claimed to be composed of 'vyapara' (an activity) and 'phala' (a result), thus providing a simple system to analyze the information contained in verbal roots. For example, the action of 'opening a lock' entails a list of sub-actions, like inserting the key, twisting the key, and the moving of the levers, which lead to the final state, the 'phala'. Each action and sub-action is carried out by a different participant, or 'kaaraka', which is related to the objects in specific ways, and the most independent of these participants is called the 'swatantrakarta'. The model has about six kaaraka relations that attempt to define the almost infinite relations possible between states and objects (verbs and nouns). The concept of 'vivaksha' in the Paninian model deals with the point of view or attitude of the speaker. It defines the choice of action and how it is related to the participants, because it is understood that no linguistic utterance states only facts but always includes the attitude and intention of the speaker.
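The lock example can be written down as a small data structure. The following is only a sketch under stated assumptions: the six role names and their English glosses follow the list commonly given in the grammatical tradition rather than anything spelled out in this article, the glosses are approximations, and the class names and the participant 'Rama' are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Karaka(Enum):
    """Six kaaraka roles with approximate English glosses (assumed, not from the article)."""
    KARTA = "agent"
    KARMA = "object"
    KARANA = "instrument"
    SAMPRADANA = "recipient"
    APADANA = "source"
    ADHIKARANA = "locus"


@dataclass
class Action:
    """A verbal root analysed as an activity (vyapara) leading to a result (phala)."""
    dhatu: str
    vyapara: List[str]   # ordered sub-actions that make up the activity
    phala: str           # the final state the activity leads to
    participants: Dict[Karaka, str] = field(default_factory=dict)

    def independent_participant(self) -> Optional[str]:
        """Return the karta, treated here as the most independent participant."""
        return self.participants.get(Karaka.KARTA)


# 'Rama opens the lock with a key' (Rama is a hypothetical participant).
opening = Action(
    dhatu="open",
    vyapara=["insert key", "twist key", "levers move"],
    phala="lock is open",
    participants={
        Karaka.KARTA: "Rama",
        Karaka.KARMA: "lock",
        Karaka.KARANA: "key",
    },
)

print(opening.independent_participant())
```

The point of the sketch is that the verb carries both the process and its outcome, while each noun is attached to the verb through a named role rather than through its position in the sentence.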

Kaaraka relations express semantic as well as syntactic information encoded in a sentence, as they map nominal and verbal elements in a sentence, but this is not the level that describes how a sentence is uttered. Before the rules of phonetics are applied, the model creates 'vibhakti'. The vibhakti for a verb is composed of the markers of tense, aspect and modality, and that of a noun is composed of information like person, gender and number. While the concept of vibhakti is applicable to languages like English, where it will also contain information about the position of the word in the sentence, it is particularly helpful in processing languages with a relatively free word order, like Sanskrit and most modern Indian languages.

Apart from the concepts of language and grammar in the Ashtadhyayi, its formal structure is of great value to computer science research in general and CL in particular. Panini's grammar, like every other CL grammar devised, is a system of finite rules with the potential of generating an infinite number of words and sentences. Panini writes the Ashtadhyayi like an algorithm with three types of statements: definitions, rules, and meta-rules. Definitions are statements that define the terms used to perform the operations stated in the rules, and the meta-rules describe how the given set of rules interact. In the Ashtadhyayi, Sanskrit is described using a form of Sanskrit that is heavily codified: Panini often uses abbreviations and other devices to make the sutras concise and precise, and therefore the language in which the Ashtadhyayi is written has its own vocabulary and syntax.
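To make the three statement types concrete, here is a toy sketch. It uses English plural formation rather than Sanskrit, and everything in it is an assumption made for illustration: the abbreviation 'SIB', the two rules, and the choice of 'the most specific applicable rule wins' as the meta-rule are not taken from the Ashtadhyayi.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Definitions: named classes of symbols that rules refer to by abbreviation.
DEFINITIONS: Dict[str, Set[str]] = {
    "SIB": {"s", "x", "z"},  # hypothetical abbreviation for sibilant-like endings
}


@dataclass
class Rule:
    name: str
    specificity: int                   # higher means narrower scope
    applies: Callable[[str], bool]     # condition under which the rule can fire
    operate: Callable[[str], str]      # the operation the rule performs


# Rules: a general operation and a narrower exception to it.
RULES: List[Rule] = [
    Rule("general-plural", 0, lambda w: True, lambda w: w + "s"),
    Rule(
        "sibilant-plural",
        1,
        lambda w: w.endswith(tuple(DEFINITIONS["SIB"])),
        lambda w: w + "es",
    ),
]


def select(rules: List[Rule], word: str) -> Rule:
    """Meta-rule (assumed): among applicable rules, the most specific one wins."""
    candidates = [r for r in rules if r.applies(word)]
    return max(candidates, key=lambda r: r.specificity)


def pluralize(word: str) -> str:
    return select(RULES, word).operate(word)


print(pluralize("cat"), pluralize("box"))
```

Two rules and one definition already cover an open-ended set of words, which is the sense in which a finite rule system can generate an unbounded number of forms. Changing the meta-rule changes the output without touching the rules themselves.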

The general format of a sutra in Panini is 'X becomes Y in the environment Z', where X, Y, and Z describe linguistic elements like sounds, words, or groups of words. X, Y, or Z are not always explicit in the sutra but sometimes have to be inferred from context or from the preceding sutra. This resembles the feature of natural language called ellipsis, but in Panini it is used as a deliberate device called anuvritti, and a borrowed item is used as long as it remains compatible. The definition of compatibility and other constraints and exceptions are expressed as rules in the Ashtadhyayi. Thus, Panini's grammar contains a list of rules and exceptions to those rules, but in a format that gives the highest precedence to economy.
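The 'X becomes Y in the environment Z' format, together with the carrying over of unstated elements, can be sketched as follows. This is a deliberately simplified model and not a transcription of any actual sutra: the environment is reduced to a single following character, the compatibility check on borrowed items is left out, and the two rules are only loosely modelled on a familiar vowel-junction pattern. The names `Sutra`, `expand` and `rewrite` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sutra:
    """'X becomes Y in the environment Z'; None marks an element left unstated."""
    x: Optional[str]
    y: Optional[str]
    z: Optional[str]  # simplified environment: the character right after X


def expand(sutras: List[Sutra]) -> List[Sutra]:
    """Fill each unstated element from the preceding rule (a simplified anuvritti)."""
    full: List[Sutra] = []
    prev = Sutra(None, None, None)
    for s in sutras:
        cur = Sutra(
            s.x if s.x is not None else prev.x,
            s.y if s.y is not None else prev.y,
            s.z if s.z is not None else prev.z,
        )
        full.append(cur)
        prev = cur
    return full


def rewrite(sutra: Sutra, text: str) -> str:
    """Replace every X that is directly followed by Z with Y."""
    out = []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else ""
        out.append(sutra.y if ch == sutra.x and following == sutra.z else ch)
    return "".join(out)


# The second rule leaves its environment unstated and inherits 'a' from the first.
rules = expand([Sutra("i", "y", "a"), Sutra("u", "v", None)])

print(rewrite(rules[0], "itiapi"))
print(rewrite(rules[1], "madhuatra"))
```

The second rule is shorter on the page than the first, yet after expansion it is just as complete. That is the economy the format aims for: what has already been said is not said again.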

Study of Panini's Ashtadhyayi and the commentaries that follow continues to provide insight into the internal workings of language in a way that is strikingly modern in perspective. As seen before, Panini's grammar views language as a system of symbols, a perspective that Saussure and Chomsky brought to linguistics. Comparison of the generative paradigm and Ashtadhyayi is a vein of research that will be rich with insights.

The framework generally applied to process sentences involves dividing a given string of symbols into its constituents (phonetic, morphological, syntactic or semantic), and this process is called parsing. Identifying the constituents in a sentence computationally, particularly in natural language, requires an extremely robust framework that draws on knowledge of language, and Paninian grammar is apt for providing such a framework. As noted before, the Paninian framework is primarily interested in extracting information from language, and this goal coincides well with the goal of building a parser. The first layer of a parser is morphological analysis, which is provided by vibhakti relations and is comparable to the layer generally called 'local word grouping'. This layer extracts information like the type of the constituent (noun, verb or adjective) and the other markers related to the type, like tense, modality, person, gender, etc. The next step in processing is the core parser, which provides kaaraka relations and word-sense disambiguation, both of which are handled, to some extent, in the Ashtadhyayi in a way that has close analogues in computation. The task of building parsers based on Panini is an ongoing project, and multiple alternative frameworks are employed to achieve it. The frameworks based on the Ashtadhyayi are applied to build parsers not only for Sanskrit but also for modern Indian languages, many of which belong to the same family as Sanskrit, and for English, which is quite different from Sanskrit.
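The two layers described above can be sketched as a two-stage pipeline. This is a toy illustration, not the design of any existing Paninian parser: the hyphenated input format, the marker table and the function names are assumptions, and a real system would not map markers to roles one-to-one, since the mapping depends on the verb.

```python
from typing import Dict, List, Tuple

# Toy marker-to-role table (assumed for illustration only).
MARKER_TO_KARAKA: Dict[str, str] = {"ne": "karta", "ko": "karma", "se": "karana"}


def group_words(sentence: str) -> List[Tuple[str, str]]:
    """Stage 1, local word grouping: split each token into (stem, marker)."""
    groups = []
    for token in sentence.split():
        stem, _, marker = token.partition("-")
        groups.append((stem, marker))
    return groups


def assign_karakas(groups: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Stage 2, core parse: attach each marked noun to the verb by its role."""
    verbs = [stem for stem, marker in groups if marker == ""]
    if len(verbs) != 1:
        raise ValueError("this sketch expects exactly one unmarked verb")
    relations = {MARKER_TO_KARAKA[marker]: stem for stem, marker in groups if marker}
    return verbs[0], relations


def parse(sentence: str) -> Tuple[str, Dict[str, str]]:
    return assign_karakas(group_words(sentence))


# The same three participants in two different orders give the same analysis.
print(parse("ram-ne tala-ko chabi-se khola"))
print(parse("chabi-se tala-ko ram-ne khola"))
```

Because the roles are read off the markers rather than off word position, reordering the nouns leaves the analysis unchanged, which is why this style of parsing suits languages with relatively free word order.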

Despite centuries of commentaries and modern applications based on Panini's grammar, our understanding of the grammar is neither complete nor thorough. Even today research on the Ashtadhyayi continues, with an effort to understand how the rules interact and what the rules say about the working of language. This research goes beyond the study of Sanskrit and affects both domains, linguistics and computer science. The simple fact that this text, written somewhere around 500 BCE, continues to provide us with answers as well as questions is in itself humbling for all those who assume that the traditional schools of thought have nothing to give to the modern world. Apart from research in the Paninian tradition and the study of the text of the Ashtadhyayi for application in natural language processing and machine translation, another research goal is building a simulation of the Ashtadhyayi.

Computational Linguistics for Sanskrit and Modern Indian Languages

Research in computational linguistics in India has many goals, relating to Sanskrit as well as Indian languages. The first major area is of course machine translation, for Sanskrit and for Indian languages. Other areas of research in computational linguistics include designing natural language processing tools and building lexical resources for Sanskrit and Indian languages. Let us look at these goals in some detail.

The goal of creating a fully automated machine translation system is far from realization. The process of translating text from one language to another is complex and mired in problems of all sorts. The obvious problems of processing natural language, like resolving ambiguity, are intensified as the system has to grapple with two languages: the input language and the target language. Apart from that, a major problem lies in building a system that has access to world knowledge, without which the task of translation becomes almost impossible. These problems are dealt with by employing statistical processes, machine learning algorithms or, most recently, neural networks. Even with these, machine translation without human intervention is a goal that will require a lot more research in areas ranging from linguistics and language philosophy to AI and machine learning. Current systems are capable of translating basic sentences from some languages, but as complexity increases, accuracy decreases. Also, the set of possible language pairs in India is huge, and the field therefore has rich potential for further research.

Meanwhile, a possible solution to the problem of machine translation is provided by a system called Anusaraka. This is not a machine translation system but what is termed a 'language accessor'. The system uses a parser based on Paninian grammar. Its output is not grammatical but is comprehensible, and thus it can be used by people who have a basic understanding of the target language but need to translate text from one language to another. Consider, for example, a scholar who needs to access a certain text in Sanskrit but has only basic knowledge of Sanskrit. In this case Anusaraka is a tool that the scholar can use to help in the process of translation.

The goal of a pure MT system is neither lofty nor impossible but until such a system is designed, creative solutions to the problems will open new areas of research.

Building tools that perform basic tasks like morphological analysis or part-of-speech tagging for Sanskrit and the large number of Indian languages forms a large area of research in computational linguistics. These tools are based on linguistic research in multiple Indian languages and provide resources to scholars in different domains. Along with parsing tools, creation of corpora for Sanskrit and Indian languages is a project taken up by many institutes. Building an annotated corpus, a set of texts that are selected and structured, for Indian languages is a gigantic task. This corpus is a great resource for scholars to test new tools and for linguistic analysis.

An important task of computational linguistics research is building text-to-speech engines, which are based on research in phonetics; the tools created as a result are useful not only as standalone projects but also as input for larger systems. Another research task is optical character recognition, which is vital in building our database of digital texts. Handwriting recognition also becomes important in this task. Along with digital libraries, building searchable online databases of texts in Sanskrit and in Indian languages is also a task that universities take up. These databases become resources for other scholars and students.

Apart from these research areas, an important task for researchers is making sure that the next generation is prepared to continue the research, and for that the building of educational tools is of utmost importance. This includes creating teaching aids, learning tools and online resources, along with lectures and textbooks. These tools include language teaching for Indian languages and Sanskrit. The educational tools make use of the vast amount of research done by scholars in various fields of linguistics and computational linguistics, so that information and knowledge are not only accessible but also engaging.

References

Cardona, George. 2007. Panini: A Survey of Research. Delhi: Motilal Banarsidass Publishers.

Craggs, Laura. 2011. 'Indian Language Traditions and their Influence on Modern Linguistics'. Vernaculum.

Deshpande, Madhav. 2016. Language and Testimony in Classical Indian Philosophy, Stanford Encyclopedia of Philosophy. Online at https://plato.stanford.edu/archives/fall2016/entries/language-india (viewed on January 10, 2018).

Orara, E. de Guzman. 1967. 'An Account of Ancient Indian Grammatical Studies Down to Patanjali's Mahabhashya: Two Traditions'. Asian Studies: 369–76.

Pranjal Koranne