Topic wsd

From CSWiki
Revision as of 23:59, 27 February 2007 by Ezubaric (talk | contribs)

Feb 13

Created a class to pull document units out of the BNC, splitting on one of three tags:

  • s - too small
  • p - right size, excludes speech
  • bncdoc - too big, variable

It then parses the paragraph with minipar.
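
A minimal sketch of the tag-based splitting, not the actual class. It assumes the BNC file is available as well-formed XML (the SGML release would need a different reader), and `extract_units` and `SAMPLE` are made-up names. The minipar step is left out.

```python
import xml.etree.ElementTree as ET


def extract_units(xml_text, tag="p"):
    """Return the whitespace-normalised text of every <tag> element."""
    root = ET.fromstring(xml_text)
    units = []
    for elem in root.iter(tag):
        # itertext() walks nested elements, so a <p> includes its <s> children.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            units.append(text)
    return units


# Tiny invented sample in the shape of a BNC document.
SAMPLE = """<bncdoc><text>
<p><s>The cat sat.</s> <s>It purred.</s></p>
<p><s>Dogs bark.</s></p>
</text></bncdoc>"""

paragraphs = extract_units(SAMPLE, "p")    # one unit per paragraph
sentences = extract_units(SAMPLE, "s")     # one unit per sentence
whole_doc = extract_units(SAMPLE, "bncdoc")  # a single large unit
```

Switching the `tag` argument reproduces the trade-off in the list above: `s` gives many small units, `bncdoc` gives one large unit per file.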

Tried getting topic WSD to work with JCN from Pedersen's IC file and 3.0 structure, but it gave horrible results.

Feb 14

Power went out, so the parser died. Thought that I should also be doing stemming and writing out LDA counts. Stemming will be:

  • See if morphy or the Porter stemmer have suggestions
  • If it's in WordNet and the original string isn't, keep it and count it
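
The stemming rule above could be sketched as follows. The lexicon is a plain set standing in for WordNet, and the two toy stemmers stand in for morphy and the Porter stemmer; `choose_form` is a made-up name. What to do when neither the word nor any stem is in the lexicon is not settled in the notes, so returning the original string here is an assumption.

```python
def choose_form(word, lexicon, stemmers):
    """Pick the string to count for `word`.

    Keep the original if the lexicon has it. Otherwise take the first
    stemmer suggestion that the lexicon has. If nothing matches, fall
    back to the original string (an assumption, see the lead-in).
    """
    if word in lexicon:
        return word
    for stem in stemmers:
        candidate = stem(word)
        if candidate and candidate in lexicon:
            return candidate
    return word


# Toy stand-ins for morphy and the Porter stemmer.
def strip_s(word):
    return word[:-1] if word.endswith("s") else None


def strip_ing(word):
    return word[:-3] if word.endswith("ing") else None


LEXICON = {"dog", "bus", "walk"}  # stand-in for WordNet lemmas
STEMMERS = [strip_s, strip_ing]

forms = [choose_form(w, LEXICON, STEMMERS) for w in ["dogs", "bus", "walking", "xyzzy"]]
```

Checking the original string first keeps words such as "bus" from being stemmed to something that is not a word.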

Feb 15


  • Get BNC parsing started (office)

Added a stoplist and excluded words shorter than 3 characters. Hit a problem with the program using isWord taking string arguments rather than lists; added an assert. Started running all the jobs on the office computer.
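
A rough sketch of the filter and the type guard. The stoplist contents and the names `is_word` and `filter_tokens` are invented for illustration, and the sketch assumes the mix-up was a bare string arriving where a list of tokens was expected (in Python that silently iterates over characters).

```python
STOPLIST = frozenset({"the", "and", "of", "was"})  # illustrative only
MIN_LENGTH = 3


def is_word(token):
    """True if the token is long enough and not on the stoplist."""
    return len(token) >= MIN_LENGTH and token.lower() not in STOPLIST


def filter_tokens(tokens):
    """Keep only the tokens that pass is_word."""
    # A bare string would be iterated character by character, so fail loudly.
    assert not isinstance(tokens, str), "expected a list of tokens, got a string"
    return [t for t in tokens if is_word(t)]


kept = filter_tokens(["The", "cat", "is", "on", "mats"])
```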

  • Go back to using Pedersen (laptop)

The interpreter is working, but the offsets don't match up to the answer file. Will debug tomorrow.

  • Cluster WordNet (cluster)

Nothing done on this. Had an idea of using Huffman codes, but that doesn't make sense without sense frequency. Need to change clustering method to attach less frequent cluster to more frequent one (at the root? or at the most linked to synset?)

Feb 16

Need to regenerate answer file from Semcor after discovering that 2.1 and 2.0 aren't getting along together. There's a new Semcor from Rada, so that should help. Also concerned that a bad mapping might have hurt LDAWN.

Mapping was rather difficult (well, more annoying than difficult); took all day.

Feb 19

Something messed up my office computer. Need to rebuild the filesystem. Hopefully the parsing was not lost. Discovered that using the new Semcor file is not as easy as I thought; will need to create a new vocab, clean up the code, etc.

Restarted the parsing (after power outage).

Feb 20

Running 1 / 10 on cycles. Need to use disk space more efficiently, so using the LDA assignments and the original dat file to do the topic frequencies.

Discovered that new semcor splitting program isn't quite working:

6006:1 9656:1 41913:3 17850:1 46014:1 12223:1 39328:1 53189:1 46022:1 7115:1 35277:1 39888:1 9720:2 59347:1 980:1 20952:1 40921:1 19930:1 46046:1 49633:1 55272:1 17387:1 59884:1 14589:2 17909:2 53238:1 1016:1 11258:2 55551:2 26110:1 14847:100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Means that something is messed up.
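
A quick sanity check for lines like the one above could look like this. It assumes each field is `id:count` with non-negative integers; the `max_count` threshold and the name `find_bad_entries` are arbitrary choices for the sketch.

```python
def find_bad_entries(line, max_count=10**6):
    """Return the fields of a sparse count line that look corrupt."""
    bad = []
    for field in line.split():
        term_id, sep, count = field.partition(":")
        if not sep or not term_id.isdigit() or not count.isdigit():
            bad.append(field)  # not of the form <int>:<int>
        elif int(count) > max_count:
            bad.append(field)  # implausibly large count
    return bad


good_line = "6006:1 9656:1 41913:3"
broken_line = "6006:1 14847:1" + "0" * 40

good_report = find_bad_entries(good_line)
broken_report = find_bad_entries(broken_line)
```

Running this over every line of the output file would flag a runaway count without having to read the file by eye.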

Wrote script to automatically create topic files trained on bnc and then applied to semcor.

Feb 24

The parser finished; created raw count files.

Feb 25

Fixed parser output and redid it. Created a settings file as a central place for setting the location of various files.

Feb 26

Fixed semcor output. Still need to try it in LDAWN. After getting parallelize to work and starting it running, need to write a program to merge the results. The cached file needs to contain:

  • setting names
  • num topics

Feb 27

Had an extra initial line at the start of Semcor file, which prevented LDA from running and messed up doc numbers. Is now fixed and rerunning LDA. Found a problem in the convert BNC file that caused the voc numbers to be larger than they should be (perhaps because words appear across parts of speech?).
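
To illustrate the part-of-speech guess only (it is not confirmed as the cause): if vocabulary ids are keyed on (word, tag) pairs instead of on the word alone, the same string gets several ids and the numbers run higher. The token list and `build_vocab` are made up for the example.

```python
def build_vocab(tokens, key):
    """Assign consecutive ids to tokens in order of first appearance."""
    vocab = {}
    for tok in tokens:
        k = key(tok)
        if k not in vocab:
            vocab[k] = len(vocab)
    return vocab


# (word, part-of-speech) pairs; "run" appears as both verb and noun.
tokens = [("run", "v"), ("run", "n"), ("dog", "n"), ("run", "v")]

by_word = build_vocab(tokens, key=lambda t: t[0])  # one id per string
by_word_pos = build_vocab(tokens, key=lambda t: t)  # one id per (string, tag)
```

Comparing `len(by_word)` with `len(by_word_pos)` on the real data would show whether this accounts for the gap.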

Argh! Might have deleted some stuff from wnp accidentally on topic directory. Really need to clean that up and finish the ahzs runs.

Right now, measures that give zero similarity muck things up because they discard all other information, and we're basically guessing at random.
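
A toy illustration of the effect, under the assumption that a sense's score is the product of its similarities to the context words (the notes do not say how scores are actually combined). The sense labels, numbers, and the `floor` smoothing are all invented for the sketch.

```python
import math


def score_senses(sim_table, floor=0.0):
    """Multiply each sense's similarities, clipping them from below at `floor`."""
    return {
        sense: math.prod(max(s, floor) for s in sims)
        for sense, sims in sim_table.items()
    }


# Similarity of each candidate sense to three context words.
# The second context word has zero similarity to every sense.
sims = {
    "bank#1": [0.8, 0.0, 0.6],
    "bank#2": [0.1, 0.0, 0.2],
}

raw = score_senses(sims)                   # every sense collapses to 0.0
smoothed = score_senses(sims, floor=1e-3)  # the other evidence still ranks the senses
```

With `floor=0.0` all senses tie at zero, so picking one is a coin flip; a small floor is one possible way to keep the remaining evidence.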

Made cluster scripts, running them for one topic.