The files are in the wnp repository, under ldawn. To access it, follow the instructions here. You'll need to be approved by the repository owner, JBG. You'll also need GSL installed. If you don't have root on a machine and can't add to the normal include directory, look at the "make jbg" entry in the Makefile to see how to point to a different directory. In MSVC, you'll need to look at [this http://www.sourceware.org/ml/gsl-discuss/2004-q2/msg00000.html] to get GSL linked up.
You'll also need some data files, which can be found [here http://www.cs.princeton.edu/~jbg/wn/ldawn/]. They also require some libraries from the py-evo-feat directory in the wnp archive, which can be accessed by adding that directory to the Python path.
- Creates a mixture model of topic walks; still not working completely
- Given the stem of inference synsets (e.g. "inf-synset."), creates a report on the accuracy, report.out.
- The main file, from which all other functions are called
- Reads in the WordNet information and serves as the basis for the topic walks
- The topic walk parameters that exist on top of the WN class
- An individual path through WN that ends in a synset
- The BNC corpus split into paragraphs. Words occurring fewer than 10 times were excluded, as were paragraphs with fewer than five terms (although those terms were counted toward the frequency ... this was done because some headers were counted as paragraphs). Uses bnc-par
- The SemCor corpus split into paragraphs. Uses the same vocab and word files as bnc-par.dat.
- The entropy after the Nth round
- The alpha parameter of the model
- The beta parameter of the model
- The Nth topic parameters of the TopicWalk
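The filtering described above for the BNC paragraph data can be sketched roughly as follows. This is only an illustration, not the code that produced the data files: filter_corpus and the Paragraph alias are hypothetical names, and it assumes that paragraph length is measured on the raw paragraph, before rare words are removed. The thresholds of 10 and 5 come from the description above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: a paragraph is just a list of terms.
using Paragraph = std::vector<std::string>;

// Sketch of the two-pass filter (not the original preprocessing code).
std::vector<Paragraph> filter_corpus(const std::vector<Paragraph>& paragraphs,
                                     std::size_t min_count = 10,
                                     std::size_t min_len = 5) {
    // Pass 1: count every term in every paragraph, including the short
    // paragraphs that are dropped in pass 2.
    std::map<std::string, std::size_t> freq;
    for (const Paragraph& p : paragraphs) {
        for (const std::string& w : p) {
            ++freq[w];
        }
    }

    // Pass 2: drop short paragraphs, then drop rare words from the rest.
    std::vector<Paragraph> kept;
    for (const Paragraph& p : paragraphs) {
        // Assumption: length is checked on the raw paragraph.
        if (p.size() < min_len) {
            continue;
        }
        Paragraph out;
        for (const std::string& w : p) {
            if (freq[w] >= min_count) {
                out.push_back(w);
            }
        }
        kept.push_back(out);  // may be empty if every word was rare
    }
    return kept;
}
```

Counting before dropping means a word can stay in the vocabulary even if some of its occurrences were in header-like paragraphs that were thrown away.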
How to Run
I'll add more soon. Until then, after compiling with make, run ./ldawn -help to see all the options.
It's easier just to show examples.
- ./ldawn -modelName five -numTopics 5
Run the LDA topic walk with five topics and write the output to "five"
Conditional Probability on SemCor paragraphs
This took longer than expected to get working. Apart from the usual bugs/missteps, there was one particular problem that took me forever to root out. There apparently is some inconsistency with how the STL handles queries to empty vectors. I was using MSVC for debugging, and everything worked fine. But when I used gcc, I was getting some odd assertion breaks.
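For what it's worth, calling front() or using [0] on an empty std::vector is undefined behavior in C++, which is one plausible reason a build on one platform can seem fine while another trips an assertion. A minimal sketch of guarding that kind of query is below. The names try_first and at_throws_on_empty are hypothetical and are not taken from the ldawn source.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy the first element into `out` only if one exists.
// Returns false (and leaves `out` untouched) when the vector is empty.
bool try_first(const std::vector<int>& items, int& out) {
    if (items.empty()) {
        return false;  // never call front() on an empty vector
    }
    out = items.front();
    return true;
}

// at() is the bounds-checked accessor: it throws std::out_of_range instead
// of invoking undefined behavior, so the failure is the same on every compiler.
bool at_throws_on_empty(const std::vector<int>& items) {
    try {
        (void)items.at(0);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

Either an explicit empty() check or at() makes the failure look the same under MSVC and gcc, which would have made the parallel debugging a lot less painful.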
Apparently, the following conditions were causing a problem:
- If a word appeared in a synset that was the parent of another synset that also contained that word
- If the parent duplicate synset's first link went in the direction of the child duplicate synset
And of course, this only occurred in a word with tons of possible paths, so it took forever to figure out what was happening with parallel debugging on two platforms. Such a situation happened in this document: