WNImage:Jonathan's Lab Notes

From CSWiki
Revision as of 18:56, 8 May 2006 by Jcone (talk | contribs)



  • Take another look over the results. They look slightly better but are still crummy. One reason for this is that the cache only has those words that appear in the captions. A good match might not be in any of the captions. However, any possible match has to be within 3 steps of one of the caption words. Examining only those Xsynsets speeds things up immensely.
  • Still not much better. Many results are multiples of 17.0. The reason for this is that each word is reachable via any of the 17 link types. Therefore, we are putting an inordinately high weight on words that appear in the captions (since they automatically get a score of 17!).
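
A minimal sketch of the "within 3 steps" pruning described above, assuming the links can be viewed as a plain adjacency map. The graph `LINKS`, the function `candidates_within`, and the synset names are made up for illustration and are not the actual WNImage code or data.

```python
from collections import deque

# Hypothetical toy link graph: synset -> list of (link_type, neighbor).
# The real graph would come from WordNet; these entries are illustrative only.
LINKS = {
    "dog": [("hypernym", "canine")],
    "canine": [("hypernym", "carnivore"), ("hyponym", "dog")],
    "carnivore": [("hypernym", "mammal"), ("hyponym", "canine")],
    "mammal": [("hypernym", "animal"), ("hyponym", "carnivore")],
    "animal": [("hyponym", "mammal")],
}


def candidates_within(seeds, links, max_steps=3):
    """Return {synset: fewest steps from any seed} for synsets within max_steps."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand past the step limit
        for _link_type, nxt in links.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Only these synsets would need to be scored for a caption containing "dog".
reachable = candidates_within(["dog"], LINKS, max_steps=3)
print(sorted(reachable.items(), key=lambda kv: kv[1]))
```

Breadth-first search records the shortest step count for each synset, so anything not in the returned map can be skipped when ranking.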


  • Looked at the initial results. Not so good. There are a couple of problems:
    • There are many unrelated synsets. When this occurs, the top synsets are simply equal to one of the caption keywords (and have a score of 1.0, meaning they match only that keyword). Hopefully this will be mitigated with more relations.
    • The other problem is disambiguation of polysemous words. For example, consider a caption with the words [white,red]. While you might expect these to naturally fire the {color} synset, it turns out that {person} gets a much higher score. For example, this could be a picture of E.B. White and Lenin. Because words like white, which can also be proper names, belong to many synsets, you get an explosion of weight associated with that word. In other words, because our scheme simply adds together the score of each sense of a word, this is essentially equivalent to expanding our bag of words to include each possible sense. So for [white], you end up with a crowd of people named White in the bag of words. Normalization might be one solution, but it greatly dilutes the value of [white]; that is, normalization doesn't realize that white here refers to the color so much as it attenuates the magnitude of each sense of white out of existence.
    • A better solution for the aforementioned problem might be to select the sense of white which gives the highest score within the Xsynset we are testing. Implement this. Looks a bit better.
    • Start implementing the traversal of links other than hypernym. Regenerate results.
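
The difference between adding up every sense and keeping only the best sense per word can be sketched as follows. The weights in `WEIGHTS` are invented to reproduce the [white,red] situation described above, and the function names are hypothetical; the real scores come from the link traversal.

```python
# Made-up match weights. WEIGHTS[candidate][word] holds one weight per sense
# of that word that falls under the candidate Xsynset.
WEIGHTS = {
    "color": {"white": [1.0], "red": [1.0]},
    "person": {"white": [0.6, 0.6, 0.6], "red": [0.6]},  # several people named White
}


def score_sum(candidate, caption):
    """Original scheme: every sense of every caption word contributes."""
    return sum(sum(WEIGHTS[candidate].get(word, [])) for word in caption)


def score_best_sense(candidate, caption):
    """Alternative: each caption word contributes only its best-matching sense."""
    return sum(max(WEIGHTS[candidate].get(word, []), default=0.0) for word in caption)


def rank(caption, scorer):
    """Candidate Xsynsets ordered from highest to lowest score."""
    return sorted(WEIGHTS, key=lambda c: scorer(c, caption), reverse=True)


caption = ["white", "red"]
print(rank(caption, score_sum))         # summing senses favours "person"
print(rank(caption, score_best_sense))  # best sense per word favours "color"
```

With these toy numbers the many senses of white no longer pile up under {person}, since each word can count at most once per candidate.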


  • Updated the code to generate a weighted Xsynset.
  • Pickled everything to make it more compact and robust.
  • Wrote a brute force synset ranker.
  • It might be extremely slow. Think about using something like an HMM to get approximate rankings.
  • Read up on HMM. Also consider a simple gradient descent algorithm. Since the optimal synset must be connected to one of the keywords, we just start walking from each of the keywords.
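
One way the "start walking from each of the keywords" idea might look is a greedy uphill walk over the link graph. This is a sketch under assumptions: `NEIGHBORS` and `SCORE` are toy data, `climb` and `best_synset` are hypothetical names, and on a discrete graph the procedure is really hill climbing, so it can stop at a local optimum rather than the true best synset.

```python
# Hypothetical toy data: undirected links between synsets and a precomputed
# score per synset. In practice the score would be computed for the caption.
NEIGHBORS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["dog", "carnivore"],
    "feline": ["cat", "carnivore"],
    "carnivore": ["canine", "feline", "mammal"],
    "mammal": ["carnivore"],
}
SCORE = {"dog": 1.0, "cat": 1.0, "canine": 1.2, "feline": 1.2,
         "carnivore": 2.0, "mammal": 1.5}


def climb(start, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors.get(current, []), key=score, default=current)
        if score(best) <= score(current):
            return current  # local optimum: no strictly better neighbor
        current = best


def best_synset(keywords, neighbors, score):
    """Run one walk per caption keyword and keep the best endpoint."""
    return max((climb(k, neighbors, score) for k in keywords), key=score)


winner = best_synset(["dog", "cat"], NEIGHBORS, SCORE.__getitem__)
print(winner)
```

Each walk only scores the synsets next to its current position, which is what would make it cheaper than the brute force ranker; the price is that the result is approximate.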


  • Updated the database to use the new structures.
  • Regenerated the database.
  • Updated the web documentation.