Difference between revisions of "Topic wsd"

Revision as of 13:23, 15 February 2007

Feb 13

Created class to strip documents from BNC using one of three tags:

  • s - too small
  • p - right size, excludes speech
  • bncdoc - too big, variable

It then parses the paragraph with minipar.
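A minimal sketch of the tag-based stripping step, assuming the corpus files are well-formed XML and that the three tags appear literally as `s`, `p`, and `bncdoc` (the real files may differ in case or nesting). The function name `extract_units` and the sample document are made up for illustration; the minipar step is not shown.

```python
import xml.etree.ElementTree as ET


def extract_units(xml_text, tag):
    """Return the whitespace-normalized text of every <tag> element."""
    root = ET.fromstring(xml_text.strip())
    units = []
    # iter() walks the whole tree, including the root element itself.
    for element in root.iter(tag):
        # itertext() gathers text from all nested elements;
        # split/join collapses runs of whitespace and newlines.
        text = " ".join("".join(element.itertext()).split())
        if text:
            units.append(text)
    return units


# Hypothetical sample, not taken from the BNC.
SAMPLE = """
<bncdoc>
  <p><s>The power went out.</s> <s>The parser died.</s></p>
  <p><s>It was restarted.</s></p>
</bncdoc>
"""

if __name__ == "__main__":
    for tag in ("s", "p", "bncdoc"):
        print(tag, len(extract_units(SAMPLE, tag)))
```

Switching the `tag` argument changes the unit size handed to the parser, which is the trade-off listed above: sentences are small, the whole document is large, and paragraphs sit in between.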

Tried getting topic WSD to work with JCN, using Pedersen's IC file and the 3.0 structure, but it gave horrible results.

Feb 14

Power went out, so the parser died. Thought that I should also be doing stemming and writing out LDA counts. Stemming will be:

  • See if morphy or the Porter stemmer have suggestions
  • If a suggestion is in WordNet and the original string isn't, keep the suggestion and count it
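The stemming rule above could be sketched roughly as follows. This is only an illustration: `toy_in_wordnet` and `toy_strip_suffix` are hypothetical stand-ins for a real WordNet lookup and for morphy or the Porter stemmer, and two behaviors are assumptions not stated in the notes (an original string already in WordNet is counted as-is, and a token with no WordNet match is dropped).

```python
from collections import Counter


def normalize(token, in_wordnet, suggesters):
    """Pick the form of `token` to count, or None if nothing matches."""
    # Assumption: an original string that is already in WordNet is kept.
    if in_wordnet(token):
        return token
    # Otherwise keep the first suggestion that is in WordNet.
    for suggest in suggesters:
        for candidate in suggest(token):
            if in_wordnet(candidate):
                return candidate
    # Assumption: tokens with no WordNet match are not counted.
    return None


def count_tokens(tokens, in_wordnet, suggesters):
    """Count the normalized forms, e.g. as input for LDA."""
    counts = Counter()
    for token in tokens:
        kept = normalize(token, in_wordnet, suggesters)
        if kept is not None:
            counts[kept] += 1
    return counts


# Toy stand-ins (hypothetical), used only to make the sketch runnable.
TOY_LEXICON = {"parser", "count", "die"}


def toy_in_wordnet(word):
    return word in TOY_LEXICON


def toy_strip_suffix(word):
    """Propose candidates by stripping a few common suffixes."""
    return [word[:-len(s)] for s in ("s", "ed", "d")
            if word.endswith(s) and len(word) > len(s)]


if __name__ == "__main__":
    tokens = ["parsers", "parser", "counted", "died", "xyz"]
    print(count_tokens(tokens, toy_in_wordnet, [toy_strip_suffix]))
```

Passing the lookup and the suggesters in as functions keeps the rule itself independent of which stemmer is tried first.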

Feb 15

Goals:

  • Get BNC parsing started (office)

Added a stoplist and now exclude words shorter than length 3. Ran into a problem with the program's use of isWord, which was taking string arguments rather than lists; added an assert to catch it.

  • Go back to using Pedersen (laptop)
  • Cluster WordNet (cluster)
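The stoplist, the length cutoff, and the assert from the Feb 15 note might look something like the sketch below. The stoplist contents, the names `filter_tokens` and `is_word`, and the `lexicon` parameter are all hypothetical; the sketch also assumes the bug was a bare string being passed where a list of tokens was expected, which Python would otherwise iterate character by character without complaint.

```python
# Hypothetical stoplist; the actual list is not given in the notes.
STOPLIST = frozenset({"the", "and", "was", "for"})
MIN_LENGTH = 3


def is_word(tokens, lexicon):
    """Return one boolean per token: is it in the lexicon?"""
    # Guard against a bare string, which would be treated as a
    # sequence of single characters.
    assert isinstance(tokens, list), (
        "expected a list of tokens, got %s" % type(tokens).__name__)
    return [token in lexicon for token in tokens]


def filter_tokens(tokens, stoplist=STOPLIST, min_length=MIN_LENGTH):
    """Drop stoplisted words and words shorter than min_length."""
    assert isinstance(tokens, list), "expected a list of tokens"
    return [token for token in tokens
            if len(token) >= min_length and token.lower() not in stoplist]


if __name__ == "__main__":
    tokens = ["The", "parser", "is", "on", "and", "died"]
    kept = filter_tokens(tokens)
    print(kept)
    print(is_word(kept, {"parser"}))
```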