Natural Language Pronoun Identification

Objective

The project deliverable will be an application written in C that takes a file of natural-language text as input and produces a report as output. The report will be a human-readable analysis of each pronoun used in the text file, and will attempt to identify the proper noun to which each pronoun refers. The goal of this project is not to create an application that is 100% correct or even 60% correct, but to write an application that makes reasonable choices with the information that it has.

Motivation

The first motivation for this project is to demonstrate proficiency in C programming to satisfy my remaining requirements for credit in CPSC 120. The second motivation is to explore problems related to natural language parsing.

Design

The task of parsing natural language is non-trivial and inherently ambiguous. Even humans with a lifetime of experience in speaking and understanding natural language often make intelligent "guesses" about what the language they hear or read means. A natural language parsing program must therefore emulate the same systems or heuristics that humans use to understand speech.

This project limits the scope of the problem to matching pronouns with the proper nouns to which they refer. For clarity, I should note that the project is also limited to American English.

The approach that I will be taking is to use several heuristics to score the relationship between each pronoun and each proper noun it might refer to. The resulting analysis of the text input will include what the program finds to be the most likely relationships based on the heuristics I provide. The scoring will either be a statistical percentage estimating the likelihood of the relationship or, more likely, an arbitrary rating system based on the mechanics of the heuristics used. The application will also score each proper noun separately in an attempt to determine whether it is a person, place, event or thing.
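
As a rough illustration of the arbitrary rating option, the sketch below combines per-heuristic scores into a single rating and picks the highest-rated proper noun for a pronoun. Nothing here is final: the struct candidate type, the weights table, the number of heuristics, and the function names are all placeholders that I am assuming for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: three heuristics, each producing an
   integer score where a higher value means a more likely match. */
#define NUM_HEURISTICS 3

/* Hypothetical record: one proper noun being considered as the
   referent of a single pronoun. */
struct candidate {
    const char *noun;
    int scores[NUM_HEURISTICS];
};

/* Placeholder weights; real values would come from experimentation. */
static const int weights[NUM_HEURISTICS] = {3, 2, 1};

/* Combine the per-heuristic scores into one rating (weighted sum). */
int total_rating(const struct candidate *c)
{
    assert(c != NULL);
    int total = 0;
    for (size_t i = 0; i < NUM_HEURISTICS; i++)
        total += weights[i] * c->scores[i];
    return total;
}

/* Return the highest-rated candidate, or NULL if there are none.
   Ties are resolved in favor of the earlier candidate. */
const struct candidate *best_candidate(const struct candidate *cands, size_t n)
{
    const struct candidate *best = NULL;
    int best_rating = 0;
    for (size_t i = 0; i < n; i++) {
        int rating = total_rating(&cands[i]);
        if (best == NULL || rating > best_rating) {
            best = &cands[i];
            best_rating = rating;
        }
    }
    return best;
}
```

A weighted sum is only one possible way to combine the heuristics. It has the advantage that each heuristic can be developed and tuned on its own before the results are merged.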

I plan to do some preliminary research on other natural language parsing projects, and based on this research I will choose the heuristics to implement. Heuristics I plan to explore include context (using prepositions and sentence structure to establish relationships), spatial proximity (using word order to establish relationships), and lookup tables (to identify proper nouns based on a limited amount of world information). These heuristics will then be synthesized into a single system that generates the human-readable report.
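
To make two of these heuristics more concrete, here is a minimal sketch of what a lookup table and a spatial proximity score might look like. The table entries, the category names, the 20-word window, and the decision to consider only proper nouns that appear before the pronoun are all assumptions made for this example; the final design may differ after the preliminary research.

```c
#include <stddef.h>
#include <string.h>

/* Categories the report tries to assign to each proper noun. */
enum category { CAT_UNKNOWN, CAT_PERSON, CAT_PLACE, CAT_EVENT, CAT_THING };

/* Hypothetical lookup table entry: a small piece of world information. */
struct known_noun {
    const char *word;
    enum category cat;
};

/* Tiny illustrative table; a real one would be much larger. */
static const struct known_noun known_nouns[] = {
    {"Alice", CAT_PERSON},
    {"Bob",   CAT_PERSON},
    {"Paris", CAT_PLACE},
};

/* Lookup heuristic: return the category of a known proper noun,
   or CAT_UNKNOWN if the word is not in the table. */
enum category lookup_category(const char *word)
{
    size_t n = sizeof known_nouns / sizeof known_nouns[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(known_nouns[i].word, word) == 0)
            return known_nouns[i].cat;
    }
    return CAT_UNKNOWN;
}

/* Assumed window: nouns this many words or more before the pronoun
   receive no proximity score. */
#define PROXIMITY_WINDOW 20

/* Proximity heuristic: indices are word positions in the text.
   A noun closer to (and before) the pronoun scores higher. Nouns at
   or after the pronoun score 0 in this simplified version. */
int proximity_score(size_t noun_index, size_t pronoun_index)
{
    if (noun_index >= pronoun_index)
        return 0;
    size_t distance = pronoun_index - noun_index;
    if (distance >= PROXIMITY_WINDOW)
        return 0;
    return (int)(PROXIMITY_WINDOW - distance);
}
```

For example, a proper noun two words before a pronoun would score 18 under this scheme, while one 25 words back would score 0. The category returned by the lookup could then be used to rule out unlikely pairings, such as matching "she" to a place.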