Understanding textual language: an overview

Billions of pages of textual language exist today (2008) on the Internet.  Surely, virtually all text ever written in man’s history will someday be accessible in this way.  This is remarkable and unprecedented.  The question that this project tries to answer is, To what extent can all this textual language, both now available and yet to be available, be understood by computer software?  This of course simply raises the question, What does it mean for a machine to understand textual language?

A child begins to understand natural language at a very early age, and a year or two after birth this understanding is ordinarily so complete, and the child sufficiently developed physically, that spoken language can be created spontaneously, instinctively.  The child begins to speak. 

All of this learning is intimately associated with the child’s burgeoning understanding of the world itself, and in particular with senses and desires: a wish to move, a wish to do things, an instinct to mimic, a wish to discover and understand, a wish to have one’s desires fulfilled.  The words themselves and the mapping of the child’s world in the brain somehow become tightly linked, in fact almost inseparable.  So, in what sense could a machine be said to understand these things, sensations and feelings for which it has no sensory inputs, no sense of self or desire, no human emotions whatsoever?

This project then does not concern the question of teaching a machine to understand in the way that a person understands, but rather is one of constructing a useful intermediary form of communication among humans who, through text, share interests and the desire to interact.  It then resembles the words of a book.  And certainly the book itself does not in any sense understand the human sensations themselves.  It is simply a way of understanding what other people have written, many of whom—and perhaps those most important—are no longer alive. 

Were it possible to develop such a system, the question that naturally arises is, What could be done with it?  Specific applications were considered in the initial thinking about this project: intelligent, conversational search queries; automatic and adaptive dictionary creation; and sophisticated, idiomatic language translation, to mention only the most obvious.  The more fruitful avenue, however, is probably to put thoughts of applications aside and concentrate on more fundamental research questions until it can be more clearly determined just what might be done, the mode of doing it, and whether there could be sufficient success even to warrant such speculations.

Meanwhile, the challenge of simply better understanding natural language, on which so much has been written and so much is still undetermined, seems quite an interesting project on its own, even with no specific goal in mind.  So that is the state of the project as it stands.

This project, Understanding Textual Language, has two close precedents: it is very close to what has sometimes been termed artificial intelligence, a by-now somewhat hackneyed term, and it has strong bonds, of course, to the study of linguistics.  Insofar as artificial intelligence is concerned, the system would, if successful, pass the Turing test, proposed by Alan Turing and succinctly phrased as follows:

a human judge engages in a natural language conversation with one human and one machine, each of them try to appear human; if the judge cannot reliably tell which is which, then the machine is said to pass the test… [the three actors are assumed to be only in textual communication]

This quotation should give us pause: is this project achievable?  Hours, years, decades have been spent by very smart people working on this effort, one way or another.  Nevertheless, timing is everything, and problems once insurmountable have been surmounted.  As to linguistics, veritable mountains of books have been written on this subject and thousands upon thousands of relevant experiments carried out.  So one question that needs to be answered is this: To what extent should these precedents guide the direction of this research?

There is a certain tension between what has gone before and what we can learn by starting quite fresh.  It is important that we find a proper balance: to ignore everything that has gone before would seem to be, in its way, silly, like once again reinventing the wheel.  Yet sometimes what has gone before can slyly guide one into blind alleys.  To make new breakthroughs in our knowledge of language, and to use them to advantage in this project, let us not take past research as canonical but instead try to gain new insights.

It would seem desirable for now to downplay as much as possible the computer aspects of the system, in particular any specific implementation, database construction, computer language and the like.  While the design of a “web crawler” might be very interesting on its own, it is certainly not something that ought to concern us now; the project is a long way from thinking about detailed implementations.

Nevertheless, it is certainly not practical to ignore implementation considerations entirely, since proof of theory and, should we be lucky, applications themselves ought eventually to flow from this research.  Certainly, at some point, certain hypotheses of the system will need to be implemented in code in order to test them.  But first we need to have viable hypotheses to test, and it seems judicious at this moment that the less there is now, in the beginning, to slow us down in the primary endeavor of finding them, the better.

This rough outline of the project leaves many of the basic notions of the project unsatisfactorily explained:  Precisely what do we mean by the term understanding?  What sorts of objects acquire understanding in nature?  What is language and how is it acquired by humans?  How are language and understanding related?  How might a software system be structured to actually understand text, whatever that means?

Considerable effort has been expended on just these questions, in the hope that by unraveling these entangled issues we can begin to understand more clearly not only what it is that we wish to do, but just how we can go about doing it: What are the pieces?  How do they relate to one another and work together?  The outcome of this effort, the best answers to these questions that we have now, has been outlined in three successive essays which delve in some detail into these mysteries.  They include numerous diagrams, links to thoughts by earlier researchers, and references to common lexical terminology:

1. Understanding

What does understanding—the word itself—mean?  There are two conflicting answers to this question.  What are these schools of thought?  Past thinking about each of them is explored here, some new thinking is added, and we adopt one of them here and explore it, discarding the other.

2. Language

What is the difference between communication and language, and what is the extent of sub-human language?  What is the relationship between understanding and language and, importantly, between sensory memory and words themselves, the means with which we express our thoughts?

3. Elements of the system

We describe various options for how the elements of the system can be packaged, and then we discuss the issues that present themselves in the design of the most significant pieces of the system.  Other elements are outlined roughly; unfortunately, some very roughly indeed.