How can you make a computer understand, or process, or generate, human language? Natural Language Processing or Computational Linguistics is a field of research which deals with that question and other issues associated with it. The aim here is to understand the fundamentals of NLP, and, in an allied article, investigate how NLP has been applied to Sanskrit.
In his seminal paper, Alan Turing proposed a definition of artificial intelligence based on the ability of a machine to effectively process natural language, i.e., language used by humans. For a machine to pass the Turing Test it needs to hold a conversation with a human being and fool the human into thinking that the machine is human. The idea of developing for computers the ability to process and generate natural language took root in the first half of the 20th century.
NLP or Computational Linguistics has two basic goals. The first is the use of natural language for Human-Computer Interaction (HCI), i.e., using everyday spoken language while using a machine. When we use Google to search for cute cat pictures or that particular article about climate change, we are using natural language to interact with a computer. Another aspect of HCI is what is called a dialogue system, where a computational system is designed to hold conversations with humans. Joseph Weizenbaum's ELIZA is an example from the mid-1960s. It was a system designed to imitate a therapist, and people almost believed it to be a real therapist, much to the exasperation of its creator. Today a lot of websites use dialogue systems to interact with customers. An analogous system is a question-answering system.
The other important goal of NLP is Machine Translation (MT), which uses computation as a tool for translating speech or text from one language to another. The early systems were based on word-for-word translation and often gave unexpected results. One such system translated the sentence 'The spirit is willing, but the flesh is weak' into Russian, and when the Russian version was translated back into English, the sentence became 'The vodka is strong, but the meat is rotten'.
This field of study has always been interdisciplinary, utilizing research from fields like linguistics, mathematics, logic, philosophy, electronics, psychology, and now from neuroscience.
Knowledge of Language
Any robust NLP system cannot function without 'knowledge of language', i.e., an understanding of how natural language functions and what it is made of. Knowledge of language comes from the scientific study of language—linguistics.
In the 19th century, language had been studied primarily in terms of its change through time, which is called a diachronic approach to studying language. For the discipline to come to a point where theories of language itself were developed, a shift in perspective was necessary. Ferdinand de Saussure was the pioneer who shifted the focus of study to language as it exists in the present, i.e., the synchronic approach. Saussure also shifted the focus from written language to spoken language and gave the latter precedence. Saussure's theories form the basis of our understanding of language. His ideas were presented in a posthumously published work based on notes from his lectures, titled (in its English translation) Course in General Linguistics and first published in French in 1916.
Saussure studied language as a system, and the basic component of this system is called a sign. A sign is any form of physical marker that carries information or meaning. The physical form is called a signifier and the meaning it relates to is called the signified. A vital contribution of Saussure is the insight that the relationship between signifier and signified is arbitrary. There is nothing in the signifier that ties it to the signified; that relationship is established only through unconscious social agreement and not reason. Language is thus studied as a system of signs and the relationships between signs. Saussure also makes a distinction between language as it is actually used by speakers, which he called 'parole', and language as a system, called 'langue'. So, the study of language behaviour or use is the study of 'parole', and the study of language as a system is the study of 'langue'. Saussure's contributions gave birth to linguistics as a distinct discipline, not simply part of philosophy or anthropology.
There are now many branches of linguistics. Psycholinguistics is the study of language in the mind, neurolinguistics is the study of language in the brain, and sociolinguistics is the study of language in use in society. Computational linguists use knowledge of language for computer applications or use knowledge of computation to study language itself. At the basis of all this is an understanding of what language is. In the 1960s Charles F. Hockett gave a comprehensive account of the characteristics of language, which he termed 'design features of language'. As pointed out by Saussure, arbitrariness is a design feature of language. The fact that language can be broken down into discrete units of analysis is also a design feature, as is 'displacement', which allows humans to talk about things that are not physically present during the act of speaking. Human language is capable of producing novel utterances (productivity), but it is also learned and taught (learnability) and it is transmitted culturally (cultural transmission). Other features are reflexiveness, i.e., the fact that we can use language to talk about language, and 'prevarication', which is the ability to lie or deceive.
After Saussure the most important contribution in linguistics came from Noam Chomsky whose theories form the backbone of a lot of work in NLP. Chomsky provides a detailed account of language as a system. The central idea of the Chomskian paradigm is 'universal grammar' (UG). The hypothesis is that all language users are born with the apparatus to produce language and therefore all languages work using the same underlying processes. In one of the incarnations of generative grammar, the processes that are common to all languages are called principles and those which vary from language to language are called parameters.
Like Saussure, Chomsky also makes a distinction between language that is actually used (language performance) and language as a system (language competence). What linguists in the Chomskian model study is mostly language competence. This language competence is the result of UG, which is also why children acquire language so fast. In the older model, language was 'learned', in the terms of the behaviorist school. The thesis, as propagated for language by B.F. Skinner, boils down to the notion that children learn language through imitation. Chomsky's first claim to fame came from refuting Skinner and showing how children learn language in a specific series of steps, as if aspects of UG were turned on in their brains one by one.
The theoretical claim of UG is that every linguistic utterance is generated by certain rules; the work done by linguists using this paradigm is therefore called generative linguistics. Generative linguistics works towards uncovering the underlying rules of language, and the idea that computers can mimic these 'algorithms' to generate or process language is vital for NLP.
Components of Language
In the simplest form of analysis, a natural language has the following levels: sounds, words, sentences and meaning. While this may seem basic, there are competing theories for analyzing and understanding each layer.
The primary aspect of a language is sound. The study of how these sounds are produced and received, and of the other physical properties of sounds in language, is called phonetics. Phonology, though dependent on phonetics, is the study of the organization of sounds in languages in terms of their grammatical properties or their properties as units of meaning.
Accurate transcription of sounds used in spoken language is an important aspect of linguistics, and the International Phonetic Alphabet (IPA), designed by the International Phonetic Association, is used by linguists to transcribe the distinctive sounds in spoken language. The smallest distinctive unit of sound is called a phoneme, a concept that was introduced by a Polish linguist named Jan Baudouin de Courtenay in 1876. The classification of the IPA is based on where in the articulatory system the sound is produced and the manner in which it is produced.
Linguists study the patterns in which phonemes combine to form linguistic utterances and these theories form the foundation for NLP applications like speech recognition, speech synthesis, or text-to-speech conversion, etc.
Morphology is a branch of linguistics that studies the patterns of word formation in a language and across languages. The smallest distinctive unit in a word that contributes either semantic content or grammatical function is called a morpheme. Some words are made out of a single morpheme and some words are formed through a combination of morphemes. Again, what is important are the rules or patterns for combining morphemes.
Morphological analysis for an NLP system is done by parsing. The parsing algorithm describes how an utterance should be processed. In simple terms, morphology provides us with the rules to obtain the spoken form or surface form of a word from the root form, based on rules which take into account elements like number, gender, tense, etc. A parsing algorithm handles the conversion of a root word to the correct surface form, or converts a surface form back to its root word. A morphological analysis also provides part-of-speech tagging, i.e., it determines if a given word is a noun, a verb, an adjective, etc.
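The two directions described above, root form to surface form and surface form back to root, can be illustrated with a minimal Python sketch. It covers only regular English noun plurals, and the tiny lexicon, the feature labels ('sg', 'pl') and the function names are assumptions made for illustration, not part of any real morphological analyzer.

```python
VOWELS = set("aeiou")

# Hypothetical lexicon of noun roots; a real system would use a large dictionary.
LEXICON = {"cat", "city", "box", "church", "boy"}


def generate(root, number):
    """Root form plus a number feature ('sg' or 'pl') -> surface form."""
    if number == "sg":
        return root
    # consonant + y  ->  drop the y and add 'ies' (city -> cities)
    if root.endswith("y") and len(root) > 1 and root[-2] not in VOWELS:
        return root[:-1] + "ies"
    # sibilant endings take 'es' (box -> boxes, church -> churches)
    if root.endswith(("s", "x", "ch", "sh")):
        return root + "es"
    return root + "s"


def analyze(surface):
    """Surface form -> sorted list of (root, part of speech, number) analyses.

    Analysis is done by generation: every root in the lexicon is inflected
    and compared with the input, so the same rules serve both directions.
    """
    analyses = []
    for root in LEXICON:
        for number in ("sg", "pl"):
            if generate(root, number) == surface:
                analyses.append((root, "N", number))
    return sorted(analyses)


print(generate("city", "pl"))   # cities
print(analyze("churches"))      # [('church', 'N', 'pl')]
```

Because the analyzer returns the category 'N' along with the root, it also hints at how morphological analysis can feed part-of-speech tagging. A word outside the lexicon simply yields an empty list.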
Recognizing phonemes and morphemes and implementing the rules of combinations for these is just the beginning of either constructing or analyzing a coherent linguistic utterance. Perhaps the most important element of analysis is the sentence. The branch of linguistics that studies the structure of a sentence is called syntax.
A parsing algorithm is implemented for analyzing the structure and relationship of words in a sentence. The algorithm describes rules for correct methods of combination based on a multitude of factors like word order, agreement marking, part of speech, etc. A major chunk of the effort in the Chomskian paradigm is focused on producing a grammar that is applicable to all or most languages. A set of rules that describes how constituents are related to each other without relating them to knowledge of the outside world is called a context-free grammar (CFG). Here is an example of an extremely simplified CFG.
S --> NP VP
VP --> V NP/PP
PP --> P NP
NP --> D N
This states that a sentence is made of a noun phrase and a verb phrase; a verb phrase is made of a verb followed by either a noun phrase or a preposition phrase; a preposition phrase is made of a preposition and a noun phrase; and a noun phrase is made of a determiner and a noun. So, for example, a CFG analysis of a sentence like 'I sleep on the floor' would be:
S --> NP VP
NP --> N
VP --> V PP
PP --> P NP
NP --> D N
N --> I; floor
P --> on
D --> the
V --> sleep
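The analysis above can be reproduced mechanically. The following Python sketch encodes the same toy grammar and lexicon (with the preposition phrase labelled PP and the bare-noun rule NP --> N added, as in the analysis) and searches top-down for every tree that covers the whole sentence. The data layout and the function names are assumptions made for illustration; this is a naive parser that works only because the toy grammar has no left-recursive rules.

```python
# Phrase-structure rules: each symbol maps to its alternative expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "PP": [["P", "NP"]],
    "NP": [["D", "N"], ["N"]],
}

# Lexical rules: which words each part of speech can rewrite to.
LEXICON = {"N": {"I", "floor"}, "P": {"on"}, "D": {"the"}, "V": {"sleep"}}


def expand(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand_sequence(symbol, rhs, tokens, start, [])


def expand_sequence(symbol, rhs, tokens, pos, children):
    """Match the symbols in `rhs` one after another, collecting subtrees."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in expand(rhs[0], tokens, pos):
        yield from expand_sequence(symbol, rhs[1:], tokens, nxt, children + [child])


def parse(sentence):
    """Return all parse trees of the sentence rooted in S."""
    tokens = sentence.split()
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]


for tree in parse("I sleep on the floor"):
    print(tree)
```

Each tree is a nested tuple whose first element is the category label. A string the grammar cannot derive, such as 'sleep I', yields an empty list.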
Even the most sophisticated CFGs cannot analyze every sentence, even in the most prominent varieties of languages. Newer theories have tackled different issues like active-passive transformations, interrogative sentences, etc.
The question 'What is meaning?' is philosophically troubling and seems almost impossible to answer. Linguistic theories attempt to produce a scientific account of the nature of meaning. Within the Saussurean model the arbitrariness of language is established, but there are still patterns and rules for constructing meaning through relationships between components of language and through relationships of components of language to the outside world. The branch of linguistics which deals with this aspect of language is semantics. Semantics has many branches. Lexical semantics deals with the meaning of words, pragmatics is the study of meaning in context, and the study of meaning across multiple connected sentences is called discourse analysis.
Any NLP system needs to formalize meaning in its system. For example, if we are constructing a dialogue system then answering the question 'Which restaurant near me has good Chinese?' requires a knowledge of meaning of each word and what it refers to in the world. The system also needs to understand the context in which Chinese is used and process it in terms of food. For understanding the use of pronouns which refer to each other across multiple sentences, a knowledge of discourse is necessary.
A vital part of any NLP system is the way in which it deals with ambiguity. Natural language is always ambiguous: every aspect can mean more than one thing. This is difficult to deal with in a machine, for which everything is either true or false, right or wrong, one or zero.
Consider this simple sentence, 'I made her duck'. What does it mean? Does it mean that I made some woman lower the upper half of her body to save her from an oncoming projectile? Does it mean that I cooked a bird called duck that belonged to her, or that I cooked a duck for her? There are different types of ambiguity in this sentence. The first lies in the meaning of the word 'made' and the word 'duck', which can be considered a morphological or lexical ambiguity. The second lies in the pronoun 'her', which can either point to a person or show possession; this is a pragmatic, discursive or referential ambiguity. Consider another sentence, 'I saw a boy on the hill with a telescope'. Now, did I see a boy on the hill who stood with a telescope near him, or did I use a telescope to see a boy on the hill, or was I on the hill with a telescope when I saw a boy?
The process of solving these problems of natural language in terms of a machine is called 'disambiguation' and it needs to be carried out on every level of linguistic analysis for any NLP system to function.
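To see why disambiguation is needed even at the level of syntax, the following Python sketch counts how many distinct parse trees a small grammar assigns to 'I saw a boy on the hill with a telescope'. The grammar, the lexicon and the function name are assumptions chosen for illustration; the counting itself uses the standard CYK chart technique, which requires every rule to have either two symbols or a single word on its right-hand side.

```python
from collections import defaultdict

# Binary rules (parent, left child, right child). A preposition phrase may
# attach either to a noun phrase or to a verb phrase, which creates ambiguity.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "D", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Lexical rules: word -> categories. The pronoun 'I' is treated as a full NP.
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "a": ["D"], "the": ["D"],
    "boy": ["N"], "hill": ["N"], "telescope": ["N"],
    "on": ["P"], "with": ["P"],
}


def count_parses(sentence, root="S"):
    """Count the distinct parse trees of `sentence` rooted in `root`."""
    tokens = sentence.split()
    n = len(tokens)
    # chart[(i, j)][A] = number of trees with label A spanning tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for category in LEXICAL.get(token, []):
            chart[(i, i + 1)][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]


print(count_parses("I saw a boy with a telescope"))              # 2
print(count_parses("I saw a boy on the hill with a telescope"))  # 5
```

Under this toy grammar the longer sentence has five structures, which include the three readings listed above; the remaining ones differ only in what the second preposition phrase attaches to. A grammar alone cannot choose among them, which is where disambiguation comes in.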
Algorithms and Models
An algorithm is basically a series of steps to solve a problem. It is comparable to a cooking recipe that a mindless machine can follow and get expected results. Problems of language processing are complex, ambiguous and often not really problems. These problems are addressed by models of language based on linguistics, and algorithms derived from research in mathematics, computer science, etc. The most common models and algorithms used in NLP are state machines, rule systems, logic, probabilistic models, etc.
If one considers a list of words as one ‘state’ and a syntactically correct sentence as another ‘state’, then a state machine is a model that describes how one can go from the first state to the last through a series of intermediate states. Related to this is the model of rule systems, or formal grammars, devised to transform a list of phonemes and morphemes into words, or a list of words into sentences. First-order logic, borrowing from philosophy, mathematics, linguistics and computer science, is used to create formal models of knowledge and meaning, and is applied to problems in semantics, pragmatics, and discourse analysis in NLP systems. Probabilistic models are most handy for solving ambiguity problems in NLP, but state machines and grammars can both be augmented with probabilistic models and used in almost every NLP application.
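A state machine of the simplest kind, a finite-state acceptor, can be sketched in a few lines of Python. The machine below reads a sequence of part-of-speech tags and moves from state to state; the sequence is accepted if the machine ends in an accepting state. The state names, the tag set and the transition table are hypothetical and cover only sentences of the shape 'the dog sleeps' or 'the dog sees the cat'.

```python
# Transition table: (current state, input tag) -> next state.
TRANSITIONS = {
    ("start", "D"): "det",          # determiner opens the subject
    ("start", "N"): "subject",      # bare noun or pronoun as subject
    ("det", "N"): "subject",
    ("subject", "V"): "verb",
    ("verb", "D"): "object_det",    # determiner opens the object
    ("object_det", "N"): "object",
}

# The machine may stop after an intransitive verb or after a full object.
ACCEPTING = {"verb", "object"}


def accepts(tags):
    """Return True if the tag sequence leads from 'start' to an accepting state."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:   # no transition defined: reject immediately
            return False
    return state in ACCEPTING


print(accepts(["D", "N", "V"]))            # True
print(accepts(["D", "N", "V", "D", "N"]))  # True
print(accepts(["V", "N"]))                 # False
```

Attaching a probability to each transition, instead of simply allowing or forbidding it, is one way such a machine can be augmented with a probabilistic model.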
Latest developments in computer science use machine-learning tools to solve NLP problems. Instead of relying only on models specifying rules and systems based on knowledge of language and logic, ML algorithms sift through a large amount of data to generate a solution or complete a task.
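One of the simplest data-driven models is a bigram model, which estimates the probability of a word given the previous word purely by counting pairs in a corpus. The sketch below is only an illustration: the four-sentence corpus is invented, the function names are assumptions, and real systems train on far more data and smooth the estimates so that unseen pairs do not get zero probability.

```python
from collections import Counter

# Invented toy corpus; a real model would be trained on millions of sentences.
CORPUS = [
    "i made her duck",
    "i made her dinner",
    "i saw her duck",
    "she made dinner",
]


def train_bigrams(sentences):
    """Count how often each word occurs as a context and each word pair occurs."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]  # sentence boundary markers
        contexts.update(tokens[:-1])                    # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_prob(prev, word, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


contexts, bigrams = train_bigrams(CORPUS)
print(bigram_prob("her", "duck", contexts, bigrams))     # 2 of the 3 uses of 'her'
print(bigram_prob("made", "dinner", contexts, bigrams))  # 1 of the 3 uses of 'made'
```

Nothing in this code knows any rule of grammar; whatever the model 'knows' about which word tends to follow which comes entirely from the data it was given.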
Some Early NLP Systems
The earliest incarnations of NLP systems were based on a very basic philosophy of language and were deployed primarily for Human-Computer Interaction (HCI). These systems did not perform any real parsing, i.e., analysis of constituents at all levels of a linguistic utterance, but were focused on solving problems or answering questions.
In 1966-67 Joseph Weizenbaum at MIT developed ELIZA, a computer model of a psychologist that fooled people into believing that they were indeed talking to a human being. The system had no comprehension of what it ‘said’ but was only designed to parrot out certain strings based on keywords present in the input. It did tricks like turning the input back into a question or giving generic output like ‘What do you feel about it?’. This is perhaps the earliest developed dialogue system.
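The keyword technique described above can be sketched in Python as follows. This is not Weizenbaum's original script: the three rules and the response templates are hypothetical, and the sketch leaves out the pronoun swapping ('my' to 'your') that a fuller imitation would need.

```python
import re

# Hypothetical keyword rules: (pattern to look for, response template).
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]

# Generic output used when no keyword is found in the input.
DEFAULT = "What do you feel about it?"


def respond(utterance):
    """Return a canned response built from the first matching keyword rule."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo part of the user's own words back inside the template.
            return template.format(*match.groups())
    return DEFAULT


print(respond("I feel sad."))            # Why do you feel sad?
print(respond("My mother hates me"))     # Tell me more about your mother.
print(respond("The weather is nice"))    # What do you feel about it?
```

The program never represents what 'sad' or 'mother' means; it only rearranges strings, which is exactly why such a system has no comprehension of what it says.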
LUNAR was a successful question-answering system designed in the 1970s, based on the then latest techniques in artificial intelligence. It was designed to answer questions about moon-rock samples. It used the Chomskian framework to perform syntactic analysis and also performed semantic analysis for information retrieval. It answered simple queries like ‘What is the average concentration of iron in ilmenite?’. Answering these simple questions required a robust system to handle database queries and language parsing, which is why it was an important milestone in the development of NLP.
In 1970 a dialogue system called SHRDLU was built to converse with a human being and perform simple tasks in a tabletop world made of blocks. While the system was crude and simplistic in its application, this simulated world of blocks managed by a simulated eye and hand was influential for the field of NLP. It integrated semantic parsing, knowledge representation, information retrieval, a dialogue system and a question-answering system in a very effective way.
State of the Art
Today the world of language and speech processing has come a long way. Some of the examples of cutting edge developments in NLP are:
- Conversational agents on travel websites
- Voice activation systems used by astronauts as well as car drivers
- Use of speech recognition by video companies for effective search
- Google’s cross-language information retrieval and translation systems
- Educational publishers using NLP to process and grade assignments
- Interactive virtual agents used as teaching aids
- Text analysis tools for marketers
- Systems that can generate news reports, poetry, and novels