Pronoun and Proper Noun Identification and Analysis:

Some simple heuristic approaches.

For CPSC 485 under Pete Peterson
James Dean Palmer
450-35-5462
August, 1998

Parsing natural language is an incredibly difficult task in computer science. Great headway has been made in this field by a number of researchers, each exploring different heuristics and solutions to a problem that has no perfect solution.

One such project is the RELATUS Natural Language System (http://www.ai.mit.edu/people/jcma/jcma.html). The RELATUS system provides statistics and representations of texts up to 600 pages in length. The RELATUS emphasis is on an interpretive approach to the perception and meaning of content. This emphasis on an interpretive rather than formulative approach to AI processes comes from work in a field called hermeneutics, the branch of philosophy which explores the understanding and interpretation of texts.

In doing my preliminary research for this project, I found a number of exciting projects in natural language research and some excellent literature on the field. In particular, "Hermeneutics: From Textual Explication to Computer Understanding?" explores the origins and direction of the field. The particular natural language problem I am tackling in this project is considerably simpler than the tasks at the center of natural language research, but that does not mean it is an easy problem, or even a solvable one.

The particular approach I took is the more classical AI approach to natural language parsing: take a number of metrics and then use a fusion model to weight those metrics heuristically. A more modern approach would be to write a program that models the text and its relationships itself and then creates its own heuristic. That would have been a much more complicated and time-consuming piece of software!

Implementation Details

As was required for this assignment, the entire application is written in ANSI C. I did take the liberty of using C++-style comments, but the code is not C++. One of the purposes of this project was to demonstrate C proficiency, so all of the algorithms I wrote were written without referencing any existing code.

Since I am a bit of a C++ fan, the Tlist structure and functions (I want to call it a class) use naming conventions very similar to the list and deque STL templates. In fact, converting this code from its current C style to C++ would be very simple.

The Tlist structure and its data are abstracted by a series of functions that perform operations on that data (like private variables in C++). Functions for iterating through the list in an extremely efficient manner are also provided (Tlist_next_until_tail and Tlist_previous_until_head). Some of the functions in my Tlist are never used but are provided for completeness.
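To illustrate, a minimal sketch of what such a list and its iteration function might look like is shown below. Only the names Tlist and Tlist_next_until_tail come from this report; the node layout, the current-pointer iterator, and the function body are my own assumptions, not the actual project code.

```c
#include <stddef.h>

/* Hypothetical doubly linked list node; field names are assumptions. */
typedef struct Tnode {
    void *data;
    struct Tnode *next;
    struct Tnode *prev;
} Tnode;

/* Hypothetical Tlist: head/tail pointers plus an iterator position. */
typedef struct Tlist {
    Tnode *head;
    Tnode *tail;
    Tnode *current;   /* iterator position */
    int size;
} Tlist;

/* Advance the iterator one node toward the tail; return the new node's
   data, or NULL once the tail has been reached. */
void *Tlist_next_until_tail(Tlist *list)
{
    if (list->current == NULL || list->current->next == NULL)
        return NULL;
    list->current = list->current->next;
    return list->current->data;
}
```

Hiding the node type behind functions like this is what lets the rest of the program treat Tlist as an abstract type, much like a C++ class with private members.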

The Tword_list functions then extend Tlist with special functions for manipulating the Tword data structure. The Tword_list is the list of words that are passed to the program for parsing. The Tword data structure contains several lists which are populated with metrics concerning word relationships, but only one function uses these lists. I had planned to implement more metrics that would require these lists, but those features had to be cut in order to turn the project in before the end of the semester.

The Tlookup_list functions also extend Tlist with special functions for manipulating the Tlookup data structure. This list contains all the words provided in the lookup table. The lookup table data file uses the following format:

word-type gender number word-name

Where word-type can be "pronoun", "possessive", "article", "verb", or "join". Gender can be "male", "female", "neutral", "thing", or "none". Number can be "singular", "plural", or "none". The word-name is the actual word you are defining. This lookup table provides a lot of flexibility in defining the grammar you are trying to decipher.
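For example, a few entries in a lookup table data file might look like the following. These particular lines are illustrative samples I have constructed from the format description above, not an excerpt from the project's actual data file:

```
pronoun male singular he
pronoun female singular she
pronoun neutral plural they
possessive male singular his
article none none the
```

Each line supplies one word, so extending the grammar is just a matter of appending lines to the file.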

And finally, the main.c file provides the actual heuristics that are used to determine the ranking of candidate words.

Metrics

The metrics that this program computes as inputs to the weighted heuristic are straightforward, exact measurements of word relationships. These measurements are performed on all candidate proper nouns and pronouns.

The first metric I defined was word proximity, which is defined as the pronoun position minus the candidate word position. The absolute value of this measurement is the distance of the pronoun from the candidate word.

The second metric I defined was sentence proximity, which is defined as the pronoun sentence position (given as sentence position, not word position) minus the candidate word sentence position. The absolute value of this measurement is the distance in sentences of the pronoun from the candidate word.

The third metric I defined was gender agreement. This is true if the two words are of the same gender, and false if the two words are not of the same gender or if this information is not known.

The fourth metric I defined was number agreement. This is true if the two words are of the same number, and is false if they are not.
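The four metrics above can be sketched in C as follows. The Tword structure here is a reduced stand-in with assumed field names; only the metric definitions themselves come from this report:

```c
#include <stdlib.h>

/* Assumed enumerations matching the lookup table's gender/number values. */
typedef enum { GENDER_MALE, GENDER_FEMALE, GENDER_NEUTRAL,
               GENDER_THING, GENDER_NONE } Tgender;
typedef enum { NUMBER_SINGULAR, NUMBER_PLURAL, NUMBER_NONE } Tnumber;

/* Reduced stand-in for the report's Tword; field names are assumptions. */
typedef struct {
    int word_pos;      /* position in the document, counted in words     */
    int sentence_pos;  /* position in the document, counted in sentences */
    Tgender gender;
    Tnumber number;
} Tword;

/* Metric 1: word proximity -- absolute distance, in words. */
int word_proximity(const Tword *pronoun, const Tword *candidate)
{
    return abs(pronoun->word_pos - candidate->word_pos);
}

/* Metric 2: sentence proximity -- absolute distance, in sentences. */
int sentence_proximity(const Tword *pronoun, const Tword *candidate)
{
    return abs(pronoun->sentence_pos - candidate->sentence_pos);
}

/* Metric 3: gender agreement -- false when genders differ or are unknown. */
int gender_agreement(const Tword *a, const Tword *b)
{
    return a->gender != GENDER_NONE && a->gender == b->gender;
}

/* Metric 4: number agreement -- true when the two numbers match. */
int number_agreement(const Tword *a, const Tword *b)
{
    return a->number == b->number;
}
```

Each metric is an exact, cheap computation over a candidate pair, which is what makes it practical to score every proper noun against every pronoun.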

In addition to these metrics, each word also has a word position, a sentence position, and a type (if available from the lookup table). These variables, along with the metrics defined above, make up the elements of the heuristic this program employs.

Heuristics

The initial approach in formulating a good heuristic is to emulate the cognitive process that humans use to understand pronoun/proper-noun relationships. Then, based on the performance of this heuristic, we can modify it to get better real-world results.

What I found in my own investigations was that different heuristics worked better for different text samples. The writing style itself often dictates how we look for word relationships. Some of the more advanced natural language analysis systems I mentioned at the beginning of this paper take this particular approach. They analyze the document and then based on the model they formulate heuristics to understand the model.

My own program is not nearly so advanced, so hand-tuning the heuristic used for a particular piece of text is imperative to get really good performance. The actual metrics for a given sample are often very similar within the same document, which makes a non-hard-coded heuristic much more valuable.
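One way such a hand-tuned fusion of the metrics might look is sketched below. The function name and the weight values are purely illustrative assumptions on my part; the report only says that the metrics are weighted heuristically, not what the weights are. Here a lower score means a better candidate, so distances add penalties and agreements subtract them:

```c
/* Hypothetical weighted fusion of the four metrics. The weights are
   illustrative hand-tuned values, not the report's actual numbers. */
double score_candidate(int word_dist, int sent_dist,
                       int gender_agrees, int number_agrees)
{
    double score = 0.0;
    score += 1.0 * word_dist;           /* nearer words are preferred      */
    score += 5.0 * sent_dist;           /* crossing sentences costs more   */
    if (gender_agrees) score -= 20.0;   /* agreement is strong evidence    */
    if (number_agrees) score -= 10.0;
    return score;
}
```

Hand tuning then amounts to adjusting these four coefficients until the lowest-scoring candidate matches human judgment on the sample text.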

Conclusions

I found this project an excellent mental exercise. I only wish I had more time to devote to exploring even more heuristic and word-relationship techniques. A heuristic-developing engine definitely seems like the best approach for obtaining more reliable results across a number of non-homogeneous documents. There are also a number of other metrics I could have implemented, but a project like this can never really be "done". We can simply get as close as necessary to meet our design goals, since perfection is impossible.

I am on the whole happy with my progress on this subject, but improvement is always possible. Since artificial intelligence is an area I am extremely interested in, I am sure the experience and knowledge I have gained from this project will carry over to future endeavors.

References

1. Hearst. "TileBars: Visualization of Term Distribution Information in Full Text Information Access." In Proceedings of CHI 95, May 1995.

2. Mallery, Hurwitz and G. Duffy. "Hermeneutics." The Encyclopedia of Artificial Intelligence, New York: John Wiley & Sons, 1987.

3. Mallery. "Semantic Content Analysis: A New Methodology for The RELATUS Natural Language Environment." Artificial Intelligence and International Politics, V. Hudson, ed., Boulder: Westview Press, 1991.

4. T. Winograd. "What Does It Mean to Understand Natural Language?" Cognitive Science 4, 209-241 (1980).