Wordnet plus documentmatch

From CSWiki
Revision as of 20:25, 5 March 2006 by Ezubaric (talk | contribs)


1. Vary "internal" parameters, including:

  • (a) which semantic relations to use in computing extended synsets (including possibly composite relations, and relations "sliced" by part of speech);
  • (b) what attenuation factor to use for each relation;
  • (c) how much weight to assign "honorary" synsets (i.e., those corresponding to words in the document but not in WordNet).
Dave: For (b) and (c), we should fit these numbers from the data (unless there are too many parameters).
Christiane: I think the idea is to explore semantic relations one at a time, then see whether there are differences among the relations, both within and across POS (i.e., hyponymy among nouns is not likely to produce the same effect as hyponymy among verbs). Then try composite relations...
Christiane: About (c): these are likely to be named entities. They are related to common nouns like "president" and "Asian country" which tend to serve as anaphors to named entities in documents. Not clear what weight to assign to them.
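A minimal sketch of how the pieces of item 1 fit together, assuming a toy relation graph: activation spreads outward from a word's synset, multiplied at each step by a per-relation attenuation factor, and words outside WordNet get a single "honorary" synset with a fixed weight. The graph, the attenuation factors, and the honorary weight below are all invented for illustration, not taken from the project.

```python
def extended_activation(word, relations, attenuation, honorary_weight=0.5):
    """Return a dict synset -> activation for one word.

    relations: dict mapping synset -> list of (related_synset, relation_name)
    attenuation: dict mapping relation_name -> factor in (0, 1)
    Words absent from the lexicon get a single "honorary" synset.
    """
    if word not in relations:                 # not in WordNet: honorary synset
        return {("honorary", word): honorary_weight}
    activation = {word: 1.0}                  # the word's own synset
    frontier = [(word, 1.0)]
    while frontier:
        syn, act = frontier.pop()
        for nbr, rel in relations.get(syn, []):
            new_act = act * attenuation[rel]
            if new_act > activation.get(nbr, 0.0):   # keep strongest path
                activation[nbr] = new_act
                frontier.append((nbr, new_act))
    return activation

# Toy graph: "dog" is a hyponym of "canine", which is a hyponym of "animal".
toy_relations = {
    "dog": [("canine", "hypernym")],
    "canine": [("animal", "hypernym")],
    "animal": [],
}
toy_attenuation = {"hypernym": 0.5}
# extended_activation("dog", ...) yields dog 1.0, canine 0.5, animal 0.25
```

Sliced or composite relations (item 1a) would just be additional keys in the attenuation table; fitting the factors from data, as Dave suggests, would treat that table as the parameter vector.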

2. Normalization issues for creating the vector associated with a given document. Notably, we can insist (or not) that every word in a document contribute "activations" to the vector that sum to one.
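The per-word normalization choice in item 2 can be sketched as follows: each word token's activation pattern is optionally rescaled to sum to one before being added into the document vector, so words with long activation tails do not dominate. The function name and toy activations are ours.

```python
from collections import defaultdict

def document_vector(word_activations, normalize_per_word=True):
    """word_activations: one dict per word token, mapping synset -> activation."""
    vec = defaultdict(float)
    for act in word_activations:
        total = sum(act.values())
        # with normalization, every word contributes exactly 1.0 of mass
        scale = 1.0 / total if (normalize_per_word and total > 0) else 1.0
        for synset, a in act.items():
            vec[synset] += a * scale
    return dict(vec)

doc = [{"dog": 1.0, "canine": 0.5, "animal": 0.25},   # activations for "dog"
       {"animal": 1.0}]                                # activations for "animal"
```

With normalization the vector's total mass equals the token count (here 2.0); without it, words with richer extended synsets carry more mass (here 2.75), which is exactly the interaction with distance measures that Dave flags below.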

3. Alternative measures of proximity between vectors, e.g., cosine, dot-product, correlation, Minkowski distance (including L2)

Dave: We should think about how the different normalization choices affect the different distance measures we plan to try; our choices about vector construction and distance measure should be tied together. Along the same lines, I'd like to propose that we try symmetrized KL divergence, which is used in the modern-day IR community in the "language modeling approach" to retrieval.
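Two of the candidates from item 3, cosine and Dave's symmetrized KL, sketched on sparse dicts. KL requires probability distributions, so the KL version normalizes both vectors and adds a small smoothing constant (the constant is our assumption, not something decided in the discussion):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sym_kl(u, v, eps=1e-9):
    """Symmetrized KL divergence KL(p||q) + KL(q||p), after smoothing
    and renormalizing both vectors into distributions."""
    keys = set(u) | set(v)
    p = {k: u.get(k, 0.0) + eps for k in keys}
    q = {k: v.get(k, 0.0) + eps for k in keys}
    zp, zq = sum(p.values()), sum(q.values())
    p = {k: x / zp for k, x in p.items()}
    q = {k: x / zq for k, x in q.items()}
    return sum(p[k] * math.log(p[k] / q[k]) + q[k] * math.log(q[k] / p[k])
               for k in keys)
```

Note the tie to item 2: sym_kl renormalizes internally, so it quietly discards whatever document-level normalization we chose, whereas dot product and Minkowski distances do not.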

4. Matching documents on the basis of the matrix of proximities that emerges from the analysis of vectors. How do we exploit the univocal match between document pieces?
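One reading of the "univocal match" in item 4 is a one-to-one assignment between the pieces of two documents. A greedy sketch under that assumption, taking the highest remaining proximity at each step (the Hungarian algorithm would give the optimal assignment; this simplification is ours):

```python
def greedy_match(prox):
    """prox[i][j] = proximity between piece i of doc A and piece j of doc B.
    Returns a list of (i, j, score) pairs with each i and each j used once."""
    cells = sorted(((prox[i][j], i, j)
                    for i in range(len(prox))
                    for j in range(len(prox[0]))), reverse=True)
    used_i, used_j, pairs = set(), set(), []
    for score, i, j in cells:
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            pairs.append((i, j, score))
    return pairs

# Toy proximity matrix: piece 0 of A matches piece 0 of B strongly;
# the univocal constraint then forces piece 1 onto its second-best option.
prox = [[0.9, 0.1],
        [0.8, 0.7]]
```

The matched scores (or their sum) could then serve as the document-level proximity.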

5. Use of evocation (the big enchilada). Should evocation be rendered binary and then treated like one more semantic relation, or used quantitatively from the start? How much should it weigh compared to other semantic relations?

Jordan: I think that it's going to be particularly important for bridging the rift between POS networks and for filling out the really sparse POS networks (e.g., adjectives and, to a lesser extent, verbs).
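The first option in item 5, sketched under invented numbers: threshold the quantitative evocation scores into a binary relation, whose edges can then get an attenuation factor like any other relation in item 1. Both the scores and the threshold are illustrative assumptions.

```python
def binarize_evocation(scores, threshold=0.5):
    """scores: dict (synset_a, synset_b) -> evocation strength in [0, 1].
    Returns the set of pairs kept as binary 'evokes' edges."""
    return {pair for pair, s in scores.items() if s >= threshold}

# Invented scores: "coffee" strongly evokes "morning", barely "geology".
toy_scores = {("coffee", "morning"): 0.8,
              ("coffee", "geology"): 0.05}
```

The quantitative alternative would skip the threshold and use the score itself in place of a fixed attenuation factor.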

6. Use of part-of-speech tagging. Is preprocessing for POS sufficiently accurate to improve the assignment of words to synsets, or do we perform better by ignoring POS?

Research directions:

From Dave: I wouldn't want Rob's comment about us defining a kernel to slip by; this is an important connection. And I think that a number of the above issues are decided for us under a probabilistic framework. However, that is not necessarily a good thing!
From Dan: Is there a plausible mapping from Chinese characters to synsets? (I recognize from Jordan's remark that the mapping won't be one-one.) Given the language competence in our group, it might be feasible to try our document-matching algorithm in Chinese. Of course, first we should get it to work in English!
From Jordan: Polysemous characters are less of a problem than polysemous English words (I think ... Xiaojuan should correct me otherwise), but there are already mappings from Chinese concepts to English WordNet (http://rocling.iis.sinica.edu.tw/CLCLP/Vol8-2/paper%203.pdf), which we would want to use instead of any ad hoc system we would throw together.
From Dan: Because Chinese is written in ideograms, it struck me as likely to be less polysemous than English. That's why it might be nice to try our method there (after English ...). I wasn't thinking of creating our own mapping from characters to WordNet but rather hoping that one already exists. I'll have a look at the link Jordan provided.
From Christiane: We can also start less exotic. There are well-developed wordnets in several European languages.
From Dan: To the extent that our algorithm relies on POS-tagging and "glomming" (or was it "globbing"?), we are exploiting word-order information in the document. But there is so much order-information yet to be exploited! We are still close to treating documents as a "bag of words." (This is the big burrito.)
    A part of Wordnet_plus