Difference between revisions of "LDAWN"

From CSWiki
(Conditional Probability on SemCor paragraphs)
(Next Steps)
 
(20 intermediate revisions by 4 users not shown)
Line 1: Line 1:
== CVS Access ==
 
The files are in the repository under ldawn.  The repository is named wnp.  To access it, follow the instructions [http://cvs.cs.princeton.edu here].  You'll need to be approved by the repository owner, who is JBG.  You'll also need [http://www.gnu.org/software/gsl/ gsl] installed.  If you don't have root on a machine and can't add to the normal include directory, look at the "make jbg" entry in the Makefile to see how to point to a different directory.  In MSVC, you'll need to look at [http://www.sourceware.org/ml/gsl-discuss/2004-q2/msg00000.html this] to get GSL linked up.
 
  
== Files ==
 
  
You'll also need some data files, which can be found [http://www.cs.princeton.edu/~jbg/wn/ldawn/ here].  They also require some libraries from the py-evo-feat directory in the wnp archive, which you can access by adding that directory to the Python path.
+
== Next Steps ==
  
=== Program Files ===
+
* Using all of BNC - faster?
 +
* Error analysis
 +
* Write
  
*;mixture.cpp: Creates a mixture model of topic walks; still not working completely
+
== Error Analysis ==
  
*;generateReport.py: Given the stem of inference synsets (e.g. "inf-synset."), creates a report on the accuracy, report.out. 
+
* This method's error on mistakes <= 12.5, McCarthy ~14 (need to check to make sure path is being computed the same way)
  
*;LDAWN.cpp: The main file, from which all other functions are called
+
== CVS Access ==
 
+
The files are in the repository under ldawn.  The repository is named wnp.  To access it, follow the instructions [http://cvs.cs.princeton.edu here]. You'll need to be approved by the repository owner, who is JBG. You'll also need [http://www.gnu.org/software/gsl/ gsl] installed. If you don't have root on a machine and can't add to the normal include directory, look at the "make jbg" entry in the Makefile to see how to point to a different directory.  In MSVC, you'll need to look at [http://www.sourceware.org/ml/gsl-discuss/2004-q2/msg00000.html this] to get GSL linked up.
*;WN.cpp: Reads in the WordNet information and serves as the basis for the topic walks
 
 
 
*;TopicWalk.cpp: The topic walk parameters that exist on top of the WN class
 
 
 
*;Path.cpp: An individual path through WN that ends in a synset
 
 
 
=== Data Files ===
 
 
 
*;bnc-par.dat: The BNC corpus split into paragraphs.  Words occurring fewer than 10 times were excluded, as were paragraphs with fewer than five terms (although their terms still counted toward the word frequencies; this was done because some headers were counted as paragraphs).  Uses bnc-par
 
 
 
*;semcor-par.dat: The SemCor corpus split into paragraphs.  Uses the same vocab and word files as bnc-par.dat.
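
The filtering described for bnc-par.dat can be sketched roughly as follows. This is only an illustration, not the script that produced the data file: the function name and the toy input are made up, and it assumes the five-term test is applied after rare words have been removed, which the description above does not spell out.

```python
from collections import Counter

MIN_WORD_COUNT = 10      # words seen fewer times than this are dropped
MIN_PARAGRAPH_TERMS = 5  # paragraphs with fewer terms than this are dropped


def filter_paragraphs(paragraphs):
    """Return (vocab, kept_paragraphs) for a list of tokenized paragraphs.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Count every token, including those in paragraphs that are dropped later.
    counts = Counter(token for par in paragraphs for token in par)
    vocab = {word for word, n in counts.items() if n >= MIN_WORD_COUNT}

    kept = []
    for par in paragraphs:
        # Assumption: the length test is applied after rare words are removed.
        terms = [token for token in par if token in vocab]
        if len(terms) >= MIN_PARAGRAPH_TERMS:
            kept.append(terms)
    return vocab, kept


if __name__ == "__main__":
    toy = [["dog"] * 6 + ["rare"], ["dog"] * 4, ["cat"] * 3]
    vocab, kept = filter_paragraphs(toy)
    print(sorted(vocab), len(kept))
```

In the toy input, "dog" reaches the 10-occurrence threshold only because the short second paragraph still counts toward its frequency, even though that paragraph itself is dropped.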
 
 
 
=== Output Files ===
 
 
 
*;name.entropyN: The entropy after the Nth round
 
 
 
*;name.alpha: The alpha parameter of the model
 
 
 
*;name.beta: The beta parameter of the model
 
 
 
*;name.walkN: The Nth topic parameters of the TopicWalk
 
 
 
== How to Run ==
 
 
 
I'll add more soon. Until then, after compiling with make, run ./ldawn -help to see all the options.
 
 
 
It's easier just to show examples.
 
 
 
* ./ldawn -modelName five -numTopics 5
 
 
 
Run the LDA topic walk with five topics and write the output to "five"
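
If you want to try several topic counts, a small wrapper along these lines may help. It is a sketch, not part of the repository: it uses only the two flags shown in the example above, and the helper names and the "topicsK" model names are invented for illustration.

```python
import subprocess


def build_command(model_name, num_topics, binary="./ldawn"):
    """Build the argument list for one run, using only -modelName and -numTopics."""
    return [binary, "-modelName", model_name, "-numTopics", str(num_topics)]


def run_sweep(topic_counts):
    """Run ldawn once per topic count; raises if any run exits with an error."""
    for k in topic_counts:
        # Hypothetical naming scheme: output for k topics is written to "topicsK".
        subprocess.run(build_command(f"topics{k}", k), check=True)


if __name__ == "__main__":
    # Print the commands rather than running them.
    for k in (5, 10, 20):
        print(" ".join(build_command(f"topics{k}", k)))
```

Calling build_command("five", 5) reproduces the example command above; run_sweep assumes the compiled ldawn binary is in the current directory.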
 
 
 
== Experiments ==
 
 
 
=== Conditional Probability on SemCor paragraphs ===
 
 
 
This took longer than expected to get working.  Apart from the usual bugs/missteps, there was one particular problem that took me forever to root out.  There apparently is some inconsistency in how STL implementations handle queries to empty vectors.  I was using MSVC for debugging, and everything worked fine.  But when I used gcc, I was getting some odd assertion failures.
 
 
 
Apparently, the following conditions were causing a problem:
 
* If a word appeared in a synset that was the parent of another synset that also contained that word
 
* If the parent duplicate synset's first link goes in the direction of the child duplicate synset
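
One way to look for words that meet both conditions is sketched below. This is not the code in WN.cpp; it is a toy reading of the two conditions, assuming a dictionary from each word to the set of synsets containing it and a dictionary from each synset to its ordered list of child links. Both structures, the function name, and the synset ids are hypothetical.

```python
def risky_words(word_synsets, children):
    """Return (word, parent, child) triples that match both conditions.

    word_synsets: dict mapping a word to the set of synset ids containing it.
    children: dict mapping a synset id to its ordered list of child synset ids.
    Both are hypothetical stand-ins for whatever the WN class stores.
    """
    found = []
    for word, synsets in sorted(word_synsets.items()):
        for parent in sorted(synsets):
            links = children.get(parent, [])
            if not links:
                # No outgoing links, so there is no "first link" to inspect.
                continue
            first = links[0]
            # The first link leads to a child synset that also contains the word.
            if first in synsets:
                found.append((word, parent, first))
    return found


if __name__ == "__main__":
    word_synsets = {"bank": {"s1", "s2"}, "river": {"s3"}}
    children = {"s1": ["s2", "s3"], "s2": []}
    print(risky_words(word_synsets, children))
```

The sketch treats "goes in the direction of" as "points directly at" the child synset; if the child duplicate can sit further down the hierarchy, the check would need to follow the first link transitively.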
 
 
 
And of course, this only occurred in a word with tons of possible paths, so it took forever to figure out what was happening with parallel debugging on two platforms.  Such a situation happened in this document:
 
 
 
[[Image:LDAWN-par-1519.jpg|thumb|A SemCor paragraph that caused problems in the testing phase.]]
 
  
  
 
     A part of [[Wordnet_plus]]
 

Latest revision as of 23:30, 12 December 2006


Next Steps

  • Using all of BNC - faster?
  • Error analysis
  • Write

Error Analysis

  • This method's error on mistakes <= 12.5, McCarthy ~14 (need to check to make sure path is being computed the same way)

CVS Access

The files are in the repository under ldawn. The repository is named wnp. To access it, follow the instructions here. You'll need to be approved by the repository owner, who is JBG. You'll also need gsl installed. If you don't have root on a machine and can't add to the normal include directory, look at the "make jbg" entry in the Makefile to see how to point to a different directory. In MSVC, you'll need to look at [http://www.sourceware.org/ml/gsl-discuss/2004-q2/msg00000.html this] to get GSL linked up.


    A part of Wordnet_plus