Automatic Strategies for Sense Disambiguation
A broad category of disambiguation schemes is to start with a similarity metric that gives a score between every two synsets and seeks to maximize that score across all of the possible synsets.
When inquisitive readers encounter a word they don't know or a word that is used in a way that is unfamiliar to them, they look up the term in a dictionary. Confronted with multiple options for a term, they find that definition (or gloss) that best fits the context the word appeared in. Using this intuition, Mike Lesk developed an elegant scheme whereby the sense of a word that has the largest number of overlaps the other senses is chosen.
This work has been extended by Siddharth Patwardhan and Ted Pedersen to use context vectors for each of the words in a WordNet gloss and connected synsets (via a number of other WordNet linkages) to form a basis for a new similarity measure.
Philip Resnik introduced the idea of adding the idea of information content to synsets in order to add a quantitative measure of distance between synsets separate from just the topological separation within WordNet. Resnik took a corpus, and for every word appearing in the corpus, adds one over the number of senses for every sense of that word and all of its hypernym parents.
As the number of words increases and the number of possible senses increases, there is additional computational overhead in trying the ensemble of senses that maximizes the similarity measure under investigation. Cowie, Guthrie, and Guthrie created a disambiguation strategy that sampled from the possible ensembles using simulated annealing.
Abe and Li
Abe and Li ignore the semantic heirarchy for the purposes of calculating the actual probability of individual synsets (and use a simplification that each synset is equally likely, which is reminiscent of Resnik's information content calculation). The probabilities are calculated directly from the data, which means the synsets are simply a large multinomial. Instead, the hierarchy comes into play in selecting a cut of the tree that satisfies an MDL. Any synsets that are not included in the cut are equally likely (based on the probability of the last included parent).
Hiding a Semantic Hierarchy in a Markov Model
Although Abney and Light focused on the problem of disambiguation in the context of selectional restrictions, their approach is very similar to our approach. They train, for every part of speech relationship that links to a noun, to create a hidden Markov model that would generate the word from a sense.
They were unable to improve upon Resnik's approach of using class-based measures, and discovered the problems of smoothing and weighting by length.
Rada Mihalcea creates a very large graphical model by creating nodes for each synset that could represent a word. Then, edges connect related synsets from one word to the next, and then the PageRank algorithm is run to determine a ranking for each word based on the induced stationary distribution.
The question of which links are drawn is interesting. At first, Mihalcea used existing links in WordNet, taking care not to connect nodes corresponding to the same word the weights were initially seeded by the Lesk similarity. A later improvement was to use all possible connections, but throwing out those with no Lesk similarity.