From CSWiki
Revision as of 17:31, 27 June 2006 by Ezubaric (talk | contribs)


CVS Access

The files are in the wnp repository, under ldawn. To access it, follow the instructions here. You'll need to be approved by the repository owner, JBG.

You'll also need GSL installed. If you don't have root on a machine and can't add to the normal include directory, look at the "make jbg" entry in the Makefile to see how to point to a different directory. In MSVC, see this gsl-discuss post (http://www.sourceware.org/ml/gsl-discuss/2004-q2/msg00000.html) to get GSL linked up.


You'll also need some data files, which can be found here (http://www.cs.princeton.edu/~jbg/wn/ldawn/). They also require some libraries from the py-evo-feat directory in the wnp archive; to make those available, add that directory to your Python path.

Program Files

  • mixture.cpp
    Creates a mixture model of topic walks; still not working completely
  • generateReport.py
    Given the stem of the inference synset files (e.g. "inf-synset."), writes an accuracy report to report.out.
  • LDAWN.cpp
    The main file, from which all other functions are called
  • WN.cpp
    Reads in the WordNet information and serves as the basis for the topic walks
  • TopicWalk.cpp
    The topic walk parameters that exist on top of the WN class
  • Path.cpp
    An individual path through WN that ends in a synset

Data Files

  • bnc-par.dat
    The BNC corpus split into paragraphs. Words occurring fewer than 10 times were excluded. Paragraphs with fewer than five terms were also excluded, because some headers were counted as paragraphs, although the terms in those paragraphs still counted toward the word frequencies. Uses bnc-par
  • semcor-par.dat
    The SemCor corpus split into paragraphs. Uses the same vocab and word files as bnc-par.dat.
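The filtering described for bnc-par.dat can be sketched roughly as below. This is only an illustration, not the script that produced the data files: the function name filterCorpus, the Paragraph typedef, and the two threshold parameters are hypothetical, and it assumes that paragraph length is measured after rare words have been removed, which the description above does not state.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: a paragraph is a list of word tokens.
typedef std::vector<std::string> Paragraph;

// Drops words seen fewer than minWordCount times, then drops paragraphs
// left with fewer than minTerms terms. Tokens in paragraphs that are
// later dropped still contribute to the word counts.
std::vector<Paragraph> filterCorpus(const std::vector<Paragraph>& paragraphs,
                                    int minWordCount,
                                    std::size_t minTerms) {
    // Pass 1: count every token in every paragraph, short ones included.
    std::map<std::string, int> counts;
    for (std::size_t p = 0; p < paragraphs.size(); ++p) {
        for (std::size_t w = 0; w < paragraphs[p].size(); ++w) {
            ++counts[paragraphs[p][w]];
        }
    }

    // Pass 2: remove rare words, then remove paragraphs that are too short.
    // Assumption: length is checked after the rare words are gone.
    std::vector<Paragraph> kept;
    for (std::size_t p = 0; p < paragraphs.size(); ++p) {
        Paragraph filtered;
        for (std::size_t w = 0; w < paragraphs[p].size(); ++w) {
            const std::string& word = paragraphs[p][w];
            if (counts[word] >= minWordCount) {
                filtered.push_back(word);
            }
        }
        if (filtered.size() >= minTerms) {
            kept.push_back(filtered);
        }
    }
    return kept;
}
```

With the thresholds given above, the call would be filterCorpus(paragraphs, 10, 5).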

Output Files

  • name.entropyN
    The entropy after the Nth round
  • name.alpha
    The alpha parameter of the model
  • name.beta
    The beta parameter of the model
  • name.walkN
    The Nth topic parameters of the TopicWalk
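As a small illustration of the naming pattern above, the numbered files could be named as follows. The helper numberedFileName is hypothetical and is not taken from LDAWN.cpp; it also assumes that "name" is the model name and that N is appended with no padding, as the list above suggests.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds "<name>.<kind><N>", e.g. "five.entropy3".
// `kind` would be "entropy" or "walk" for the numbered files listed above.
std::string numberedFileName(const std::string& modelName,
                             const std::string& kind,
                             int n) {
    std::ostringstream out;
    out << modelName << '.' << kind << n;
    return out.str();
}
```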

How to Run

I'll add more soon. Until then, after compiling with make, run ./ldawn -help to see all the options.

It's easier just to show examples.

  • ./ldawn -modelName five -numTopics 5

Runs the LDA topic walk with five topics and writes the output to "five".


Conditional Probability on SemCor paragraphs

This took longer than expected to get working. Apart from the usual bugs and missteps, there was one particular problem that took me forever to root out. There is apparently some inconsistency in how STL implementations handle queries to empty vectors. I was using MSVC for debugging, and everything worked fine. But when I used gcc, I was getting some odd assertion failures.
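One way to avoid depending on platform behaviour here is to check for emptiness before touching any element. The sketch below assumes the problem involved element access on an empty std::vector, which the notes above do not confirm; such access through front() or operator[] is undefined behaviour in C++, so two compilers can legitimately behave differently. The helper name firstOr is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: returns the first element of `items`,
// or `fallback` when the vector is empty, so that front() is
// never called on an empty vector.
template <typename T>
T firstOr(const std::vector<T>& items, const T& fallback) {
    return items.empty() ? fallback : items.front();
}
```

For indexed access, std::vector::at is a checked alternative: it throws std::out_of_range on an invalid index on every conforming implementation, which makes this kind of bug show up the same way on both platforms.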

Apparently, the following conditions were causing a problem:

  • If a word appeared in a synset that was the parent of another synset in which the word also appears
  • If the first link of the parent duplicate synset goes in the direction of the child duplicate synset
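The two conditions can be checked on a toy hierarchy as sketched below. The Synset struct and both function names are hypothetical and much simpler than the classes in WN.cpp and Path.cpp. The sketch also assumes that "parent" means a direct parent and that "first link" means the first entry in the parent's list of children.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified synset; not the representation used in WN.cpp.
struct Synset {
    std::set<std::string> words;  // words that belong to this synset
    std::vector<int> children;    // indices of child synsets, in link order
};

// Condition 1: `word` is in synset `parent` and in one of its direct children.
bool wordInParentAndChild(const std::vector<Synset>& synsets,
                          int parent,
                          const std::string& word) {
    const Synset& p = synsets.at(parent);  // at() rejects bad indices
    if (p.words.count(word) == 0) {
        return false;
    }
    for (std::size_t i = 0; i < p.children.size(); ++i) {
        if (synsets.at(p.children[i]).words.count(word) > 0) {
            return true;
        }
    }
    return false;
}

// Condition 2 (under the assumptions above): the parent contains `word`
// and its first link leads directly to a child that also contains it.
bool firstLinkLeadsToDuplicate(const std::vector<Synset>& synsets,
                               int parent,
                               const std::string& word) {
    const Synset& p = synsets.at(parent);
    if (p.words.count(word) == 0 || p.children.empty()) {
        return false;  // guard before calling front()
    }
    return synsets.at(p.children.front()).words.count(word) > 0;
}
```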

And of course, this only occurred in a word with tons of possible paths, so it took forever to figure out what was happening with parallel debugging on two platforms. Such a situation happened in this document:


    A part of Wordnet_plus