WNImage:Jonathan's Lab Notes

From CSWiki
Revision as of 02:02, 13 May 2006 by Jcone (talk | contribs)



  • Fix the results webpages to incorporate different k-means parameters.
  • Implement a chi^2 scorer for each of the illustrated synsets. The results look fairly promising, with many erroneously assigned images getting low scores.
  • Sort a number of things to ensure that the results can be more easily compared.
  • Fix a bug in the k-means sorter.
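A minimal sketch of what the chi^2 scorer above might look like (all names and the data shapes are my own assumptions; note the orientation here is divergence, so an erroneously assigned image gets a *high* statistic, the inverse of the goodness score described in the notes):

```python
def chi2_score(observed, expected):
    """Pearson chi^2 statistic between an image's observed caption-word
    counts and the expected counts for a synset's images.

    Both arguments are dicts mapping a word to its count; words with a
    non-positive expected count are skipped."""
    score = 0.0
    for word, exp in expected.items():
        if exp <= 0:
            continue
        obs = observed.get(word, 0)
        score += (obs - exp) ** 2 / exp
    return score

# A well-matched image tracks the synset's profile (statistic near 0);
# an erroneously assigned one diverges sharply.
profile = {"dog": 4.0, "grass": 2.0, "ball": 1.0}
assert chi2_score({"dog": 4, "grass": 2, "ball": 1}, profile) == 0.0
assert chi2_score({"car": 5}, profile) == 7.0
```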


  • Finish the k-means sorter. It now only assigns an image to a synset if it is the only member of its cluster. Make a webpage with the results.
  • Meeting
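The singleton-cluster rule from the finished k-means sorter could be sketched like this (the data shape is assumed: `labels` maps each image to the cluster id k-means assigned it):

```python
from collections import Counter

def singleton_assignments(labels):
    """Keep only the images that are the sole member of their k-means
    cluster; everything else is left unassigned."""
    sizes = Counter(labels.values())
    return {img: c for img, c in labels.items() if sizes[c] == 1}

# img1 and img2 share cluster 0, so neither is assigned.
labels = {"img1": 0, "img2": 0, "img3": 1, "img4": 2}
assert singleton_assignments(labels) == {"img3": 1, "img4": 2}
```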


  • Made webpages/documented the latest results. They're starting to shape up.
  • Wrote the script to parse the results and based on those results, assign images to synsets. For now, just pick the top synset.
  • Make webpages for this set of results.
  • Start work on a k-means method for grouping candidate synsets. This should make it easier (in principle) to determine which synsets are in the "top tier" for a given image.
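One way the k-means grouping of candidate synsets could work is a 1-D 2-means over the scores, taking the higher-mean cluster as the "top tier". A sketch under those assumptions (names are mine, not the actual code):

```python
def top_tier(scored, iters=20):
    """1-D k-means with k=2 over candidate-synset scores; return the
    synsets whose scores fall in the higher-mean cluster (the "top
    tier").  `scored` is a list of (synset, score) pairs."""
    vals = [v for _, v in scored]
    lo, hi = min(vals), max(vals)
    for _ in range(iters):
        low_grp = [v for v in vals if abs(v - lo) <= abs(v - hi)]
        high_grp = [v for v in vals if abs(v - hi) < abs(v - lo)]
        new_lo = sum(low_grp) / len(low_grp) if low_grp else lo
        new_hi = sum(high_grp) / len(high_grp) if high_grp else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids converged
        lo, hi = new_lo, new_hi
    return [s for s, v in scored if abs(v - hi) < abs(v - lo)]

scored = [("person.n.01", 9.0), ("adult.n.01", 8.5),
          ("entity.n.01", 1.0), ("object.n.01", 0.5)]
assert top_tier(scored) == ["person.n.01", "adult.n.01"]
```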


  • Remove duplicates from the morphy-ized captions; otherwise, words that appear in both singular and plural forms carry excessive weight.
  • Update the webpage with the meeting notes.
  • Do not follow ANTONYM links. Also don't give scores to ATTRIBUTE links since they point to adjectives!
  • Generate the webpage with randomized (using seed 8675309) images.
  • Implement weighting colors less and weighting based on depth in the hierarchy.
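The deduplication step above might look like the following (a sketch: `morphy` stands in for WordNet's morphological analyzer, which returns a word's base form or None; the toy dict below is illustrative):

```python
def dedupe_lemmas(caption_words, morphy):
    """Map each caption word to its base form and drop duplicates, so
    singular and plural occurrences of a word count only once."""
    seen, out = set(), []
    for w in caption_words:
        base = morphy(w) or w  # fall back to the word itself
        if base not in seen:
            seen.add(base)
            out.append(base)
    return out

toy_morphy = {"dogs": "dog", "cats": "cat"}.get
assert dedupe_lemmas(["dog", "dogs", "cats"], toy_morphy) == ["dog", "cat"]
```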


  • Take another look over the results. They look slightly better but are still crummy. One reason for this is that the cache only has those words that appear in the captions, so a good match might not be in any of the captions. However, any possible match has to be within 3 steps of one of the caption words. Examining only those Xsynsets speeds things up immensely.
  • Still not much better. Many results are multiples of 17.0. The reason for this is that each word is reachable via any of the 17 link types. Therefore, we are putting an inordinately high weight on words that appear in the captions (since they automatically get a score of 17!).
  • pywordnet also seems to be giving errors when trying to access certain synsets by offset. I will need to debug this later....
  • The results are better now, but still not great. Take a break and write a script to create webpages with the results.
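The "within 3 steps" restriction amounts to a breadth-first expansion from the caption words' synsets. A sketch, with `neighbors` standing in for whatever link-following function the real code uses (the toy hypernym chain below is illustrative):

```python
from collections import deque

def nearby_synsets(seed_synsets, neighbors, max_steps=3):
    """BFS out from the caption words' synsets, following WordNet
    links (`neighbors` maps a synset to its linked synsets), and
    collect everything reachable within `max_steps`.  Restricting the
    candidate Xsynsets to this neighborhood avoids scoring all of
    WordNet."""
    seen = set(seed_synsets)
    frontier = deque((s, 0) for s in seed_synsets)
    while frontier:
        syn, depth = frontier.popleft()
        if depth == max_steps:
            continue  # don't expand past the step limit
        for nxt in neighbors.get(syn, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

links = {"dog": ["canine"], "canine": ["carnivore"],
         "carnivore": ["placental"], "placental": ["mammal"]}
# "mammal" is 4 steps out, so it is excluded.
assert nearby_synsets(["dog"], links) == {"dog", "canine", "carnivore", "placental"}
```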


  • Looked at the initial results. Not so good. There are a couple of problems:
    • There are many unrelated synsets. When this occurs, the top synsets are simply equal to one of the caption keywords (and have a score of 1.0, meaning they match only that keyword). Hopefully this will be mitigated with more relations.
    • The other problem is disambiguation of polysemous words. For example, consider a caption with the words [white,red]. While you might expect these to naturally fire the {color} synset, it turns out that {person} gets a much higher score; for example, this could be a picture of E.B. White and Lenin. Because words like white, which can also be proper names, belong to many synsets, you get an explosion of weight associated with that word. In other words, because our scheme simply adds together the score of each sense of a word, it is essentially equivalent to expanding our bag of words to include each possible sense. So for [white], you end up with a crowd of people named White in the bag of words. Normalization might be one solution, but it greatly dilutes the value of [white]: normalization doesn't realize that white here refers to the color; it merely attenuates the magnitude of each sense of white out of existence.
    • A better solution for the aforementioned problem might be to select the sense of white which gives the highest score within the Xsynset we are testing. Implement this. Looks a bit better.
    • Start implementing the traversal of links other than hypernym. Regenerate results.
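The best-sense fix described above replaces the sum over senses with a max within the candidate Xsynset. A sketch (the sense names and scores below are illustrative, not real WordNet identifiers):

```python
def best_sense_score(senses, score):
    """Score a polysemous caption word within a candidate Xsynset by
    its single best sense, rather than summing over all senses (which
    lets the many people named White drown out the color sense)."""
    return max(score(s) or 0.0 for s in senses)

# Toy per-sense scores of "white" within a {color}-centered Xsynset:
in_color_xsynset = {"white(color)": 1.0,
                    "white(E.B. White)": 0.1,
                    "white(surname)": 0.1}
senses = list(in_color_xsynset)
assert best_sense_score(senses, in_color_xsynset.get) == 1.0
assert sum(in_color_xsynset.values()) > 1.0  # summing inflates the score
```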


  • Updated the code to generate a weighted Xsynset.
  • Pickled everything to make it more compact and robust.
  • Wrote a brute force synset ranker.
  • It might be extremely slow. Think about using something like an HMM to get approximate rankings.
  • Read up on HMMs. Also consider a simple gradient-descent algorithm: since the optimal synset must be connected to one of the keywords, we just start walking from each of the keywords.
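The keyword-walking idea above amounts to a greedy hill-climb over the link graph. A sketch under assumed data shapes (the toy chain below is illustrative; the real code would walk pywordnet links):

```python
def greedy_walk(start, neighbors, score, max_steps=10):
    """Greedy ascent: from a caption keyword's synset, repeatedly move
    to the best-scoring neighbor and stop at a local maximum.  Running
    this from every keyword gives cheap approximate rankings, since
    the optimal synset must be connected to one of the keywords."""
    current = start
    for _ in range(max_steps):
        options = neighbors.get(current, [])
        if not options:
            break
        best = max(options, key=score)
        if score(best) <= score(current):
            break  # local maximum reached
        current = best
    return current

links = {"white": ["chromatic_color"], "chromatic_color": ["color"],
         "color": ["visual_property"]}
scores = {"white": 1.0, "chromatic_color": 2.0,
          "color": 3.0, "visual_property": 2.0}
assert greedy_walk("white", links, scores.get) == "color"
```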


  • Updated the database to use the new structures.
  • Regenerated the database.
  • Updated the web documentation.