Topic WSD

Feb 13

Created class to strip documents from BNC using one of three tags:

  • s - too small
  • p - right size, excludes speech
  • bncdoc - too big, variable

It then parses each paragraph with Minipar.
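
For reference, a minimal sketch of the stripping step, assuming the XML edition of the BNC; the element names match the tags above, but the parsing details and the file name are illustrative:

  import xml.etree.ElementTree as ET

  def extract_units(bnc_file, tag="p"):
      """Yield the flattened text of each <tag> element (s, p, or bncdoc)."""
      tree = ET.parse(bnc_file)
      for elem in tree.iter(tag):
          # itertext() concatenates the text of all descendant elements
          yield " ".join(" ".join(elem.itertext()).split())

  for paragraph in extract_units("A00.xml"):    # hypothetical BNC document
      print(paragraph)                          # would be piped to the parser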

Tried getting topic WSD to work with JCN from Pedersen's IC file and 3.0 structure, but it gave horrible results.

Feb 14

Power went out, so the parser died. Thought that I should also be doing stemming and writing out LDA counts. The stemming rule will be (sketched after the list):

  • See if morphy or the Porter stemmer have suggestions
  • If it's in WordNet and the original string isn't, keep it and count it
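
A sketch of that rule, using NLTK's morphy and Porter stemmer as stand-ins for whatever actually gets run:

  from nltk.corpus import wordnet as wn
  from nltk.stem import PorterStemmer

  porter = PorterStemmer()

  def normalize(word):
      # If the original string is already in WordNet, leave it alone.
      if wn.synsets(word):
          return word
      # Otherwise keep a suggestion only if it is a WordNet word.
      for candidate in (wn.morphy(word), porter.stem(word)):
          if candidate and wn.synsets(candidate):
              return candidate
      return word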

Feb 15

Goals:

  • Get BNC parsing started (office)

Added a stoplist and am now excluding words shorter than three characters. Found a problem where isWord was being given string arguments rather than lists; added an assert to catch it (see the sketch below). Started running all the jobs on the office computer.
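
Roughly what the filter and the assert look like; the stoplist contents are placeholders, and the real isWord signature may differ:

  STOPLIST = {"the", "of", "and", "to", "a"}   # placeholder entries

  def isWord(tokens):
      # The bug: passing a string iterates characters, not tokens.
      assert not isinstance(tokens, str), "isWord expects a list of tokens"
      return [t for t in tokens
              if len(t) >= 3 and t.lower() not in STOPLIST]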

  • Go back to using Pedersen (laptop)

The interpreter is working, but the offsets don't match up to the answer file. Will debug tomorrow.

  • Cluster WordNet (cluster)

Nothing done on this. Had an idea of using Huffman codes, but that doesn't make sense without sense frequencies. Need to change the clustering method to attach the less frequent cluster to the more frequent one (at the root? or at the most-linked-to synset?)

Feb 16

Need to regenerate the answer file from Semcor after discovering that 2.1 and 2.0 aren't getting along. There's a new Semcor from Rada, so that should help. Also concerned that a bad mapping might have hurt LDAWN.

Mapping was rather difficult (well, more annoying than difficult); took all day.

Feb 19

Something messed up my office computer; need to rebuild the filesystem. Hopefully the parsing was not lost. Discovered that using the new Semcor file is not as easy as I thought; will need to create a new vocab, clean up the code, etc.

Restarted the parsing (after power outage).

Feb 20

Running 1 / 10 on cycles. Need to use disk space more efficiently, so I'm using the LDA assignments and the original dat file to compute the topic frequencies.
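
A sketch of that bookkeeping, assuming lda-c style files (one line per document; dat lines are "N word:count ...", assignment lines are "N word:topic ..."); the exact formats here are an assumption:

  from collections import defaultdict

  def topic_frequencies(dat_file, assign_file):
      """freq[topic][word] = total count of word tokens assigned to topic."""
      freq = defaultdict(lambda: defaultdict(int))
      with open(dat_file) as dat, open(assign_file) as assign:
          for dat_line, assign_line in zip(dat, assign):
              counts = {}
              for pair in dat_line.split()[1:]:
                  w, c = pair.split(":")
                  counts[int(w)] = int(c)
              for pair in assign_line.split()[1:]:
                  w, t = pair.split(":")
                  freq[int(t)][int(w)] += counts[int(w)]
      return freq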

Discovered that the new Semcor splitting program isn't quite working:

6006:1 9656:1 41913:3 17850:1 46014:1 12223:1 39328:1 53189:1 46022:1 7115:1 35277:1 39888:1 9720:2 59347:1 980:1 20952:1 40921:1 19930:1 46046:1 49633:1 55272:1 17387:1 59884:1 14589:2 17909:2 53238:1 1016:1 11258:2 55551:2 26110:1 14847:100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

That last count means something is messed up.

Wrote a script to automatically create topic files trained on the BNC and then applied to Semcor.

Feb 24

The parser finished; created raw count files.

Feb 25

Fixed the parser output and redid it. Created a settings file as a central place for setting the locations of various files.

Feb 26

Fixed the Semcor output. Still need to try it in LDAWN. After getting parallelize to work and starting it running, I need to write a program to merge the results. The cached file needs to contain (sketched after the list):

  • setting names
  • num topics
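
A sketch of the cache layout that list implies, so a stale cache can be detected on load; the key names are illustrative:

  import pickle

  def write_cache(path, data, setting_names, num_topics):
      with open(path, "wb") as f:
          pickle.dump({"settings": setting_names,
                       "num_topics": num_topics,
                       "data": data}, f)

  def read_cache(path, setting_names, num_topics):
      with open(path, "rb") as f:
          cached = pickle.load(f)
      if (cached["settings"], cached["num_topics"]) != (setting_names, num_topics):
          return None                      # stale cache; recompute
      return cached["data"]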

Feb 27

Had an extra initial line at the start of the Semcor file, which prevented LDA from running and messed up the doc numbers. It's now fixed, and LDA is rerunning. Found a problem in the BNC conversion file that caused the vocab numbers to be larger than they should be (perhaps because words appear across parts of speech?).

Argh! Might have accidentally deleted some stuff from wnp in the topic directory. Really need to clean that up and finish the ahzs runs.

Right now, measures that give zero similarity muck things up: a zero discards all other information, and we're basically guessing at random.

Made cluster scripts, running them for one topic.

Feb 28

The disambiguation scripts' memory use was causing a problem, so I rewrote them to read only a bit of the file at a time (this works because the access is linear). It screws up the last word of each chunk, but that's only once every 10,000 docs, so it's not a big deal (though it should be fixed).
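
The rewritten read, roughly; the chunk size is illustrative, and the word straddling a chunk boundary is the one that gets mangled:

  def chunked_lines(path, chunk=10000):
      """Yield successive lists of lines; only valid for linear access."""
      with open(path) as f:
          buf = []
          for line in f:
              buf.append(line)
              if len(buf) == chunk:
                  yield buf
                  buf = []
          if buf:
              yield buf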

Also need to make sure NNPs like "location" and "person" are excluded.

Worried about a race condition in the cluster scripts when writing the pickle file (it won't be a problem on subsequent runs).
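
One standard way to dodge that race, should it bite: write to a temp file in the same directory and rename it into place, since the rename is atomic on a local filesystem:

  import os, pickle, tempfile

  def atomic_pickle_dump(obj, path):
      fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
      with os.fdopen(fd, "wb") as f:
          pickle.dump(obj, f)
      os.replace(tmp, path)   # atomic within the same filesystem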

March 1

Okay, fuckups came to light galore. Nevertheless, a quasi-random sample of words was actually disambiguated using the system:

Accuracy vs. number of topics:

  Topics   Accuracy (%)
  1        44.6
  5        44.6
  10       44.2
  25       43.5

Problems remain:

  • Make sure the bad vocab didn't infect any of the files I'm currently using (i.e. delete all pickle files and start from scratch)
  • Parallelize seems to be missing more words now that I've tried to fix it so the single-topic version works
  • Merge (merge.py) looks for files that aren't there
  • Need to remove "location," "person," etc.

March 2

Upon finding that the existing JCN implementation didn't quite work (differences between 2.0 and 2.1 couldn't be fixed by mapping), I redid JCN so it can use arbitrary counts. This might be useful when we use topic-specific similarity.
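
The reworked similarity boils down to the standard Jiang-Conrath formula over whatever counts get plugged in; a sketch, assuming nonzero cumulative counts:

  import math

  def jcn(c1, c2, lcs, counts, total):
      """counts[c] = cumulative frequency of concept c; lcs is the lowest
      common subsumer of c1 and c2."""
      ic = lambda c: -math.log(counts[c] / total)   # information content
      dist = ic(c1) + ic(c2) - 2 * ic(lcs)
      return 1.0 / dist if dist > 0 else float("inf")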

March 3

Long loading times for the parallelized version are silly; it doesn't need to load in the vocab distributions for all topics. Parallelize is missing fewer words, but it's not perfect. The implementation is messy and needs to be rewritten. Started JCN runs, but Curtis just started a huge number of jobs.

March 4

Google interview (and worrying about it) took up most of the morning.

Checked finishing rates of the cluster scripts; they mostly seem to work, but the default's files have been changed. Rerunning those. Created a script to finish off stragglers. Merge seems to mostly find the files, but some words are still not found.

Wrote code to produce better output files for error analysis. Not debugged, but written.

March 5

Debugged the merge file and created a script that throws out group/location/person. Compiled a bunch of statistics on accuracy, but it doesn't improve as the number of topics increases. Possible reasons:

  • Not enough differentiation across senses
  • Assignment problem: should merge take a weighted sum across topics? (Doing this would require reworking parallelize to keep all assignments with prob > x.)
  • Something that Miro pointed out that I don't quite understand. Dave proposed exponentiating the syntactic and semantic terms and weighting by topic enrichment; one reading is sketched after this list.
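
A sketch covering the last two bullets, not anything implemented: raise each topic's semantic and syntactic terms to tunable powers, weight by the topic probability, and sum. All names and the exponents a, b are illustrative:

  def merged_score(scores_by_topic, p_topics, a=1.0, b=1.0):
      """scores_by_topic[t] = (sem, syn) for one candidate sense;
      p_topics[t] is the topic probability for the document."""
      return sum(p_topics[t] * (sem ** a) * (syn ** b)
                 for t, (sem, syn) in scores_by_topic.items())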

Tried removing the syntactic component; it does have an impact (but not a good one).

March 6

Made JCN topic-specific; had debugging problems when the IC counts didn't change across topics. Need to debug tonight. After running the topic-specific algorithm with 1 and 10 topics, it doesn't seem to have an impact.

March 7

Discovered some problems:

  • BNC dat file wasn't lemmatized; could lead to problems with topics and similarity
  • Lin file was only for nouns; should be fixed now

Running the exponential idea now.

March 8

Realized I'm a dumbass and wasn't normalizing. Miro was right, and I'm a dipshit. So, yeah, rerunning. Maybe we'll get somewhere now.

March 9

Sick. Did, however, talk with Edo about some evocation ideas. Should try those out soon.

March 10

Still sick, but found math error in normalization.

March 11

Fixed a problem with loading pickles. Reading (ineffectively) for the reading group presentation.

March 12

The count files have a problem that makes them difficult to read. Creating convertToDat.py on the office computer in the wnp/BNC directory to deal with it; if stripBNC is run again, it should be checked to see if it still does this. This will require rerunning LDA, which might take some time.
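
What convertToDat.py needs to emit, assuming the lda-c data format (one document per line: "M term_1:count_1 ... term_M:count_M", with M the number of unique terms); the input layout here is a guess:

  from collections import Counter

  def convert(infile, outfile):
      """Assumes input has one "word_id count" pair per line, with a blank
      line between documents."""
      def flush(doc, fout):
          pairs = " ".join("%d:%d" % wc for wc in sorted(doc.items()))
          fout.write("%d %s\n" % (len(doc), pairs))
      with open(infile) as fin, open(outfile, "w") as fout:
          doc = Counter()
          for line in fin:
              if line.strip():
                  word, count = line.split()
                  doc[int(word)] += int(count)
              elif doc:
                  flush(doc, fout)
                  doc = Counter()
          if doc:
              flush(doc, fout)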

Talked with Jonathan about the data files. Looks to be sorta messy for CLIR (as they aren't lemmatized).

To do tonight:

  • Get LDA rerunning started
  • Get beta-gamma linked system running (perhaps adding a new model flag)
  • Get a start on parsing data files

First two started, nothing on the third. Discovered I wasn't using the broad Lin after all, and that the gamma was being used incorrectly. Maybe we're finally getting somewhere. Thus, the numbers below are pretty pointless. Rerunning again with the right syntactic similarity.

Accuracy vs. number of topics:

  Topics   Accuracy (%)
  1        49.2
  5        50.4
  10       48.7
  25       48.4
  50       48.7
  100      48.5

March 13

Okay, created a tagger and lemmatizer for the CLIR data. It's running now. It produces the same output format as what I was using for Semcor.
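
A sketch of the tag-then-lemmatize pass, using NLTK as a stand-in for whatever was actually written (assumes the relevant NLTK models and WordNet data are installed):

  import nltk
  from nltk.corpus import wordnet as wn
  from nltk.stem import WordNetLemmatizer

  TAG_MAP = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}
  lemmatizer = WordNetLemmatizer()

  def lemmatized(sentence):
      for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
          pos = TAG_MAP.get(tag[0], wn.NOUN)   # map Penn tags to WordNet POS
          yield lemmatizer.lemmatize(word.lower(), pos)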

March 14

Okay, choosing Beta first gives us:

Accuracy vs. number of topics:

  Topics   Accuracy (%)
  1        45.11
  5        45.14
  10       45.0
  25       43.0*
  50       44.1
  100      44.0

Not the greatest. So now I'm redoing the exponential thing again, which might improve now that the syntactic similarity is going in the right order.

March 15

Prepared for machine learning group / celebrated Irene's birthday. Deleted find stragglers ... need to make sure everything is in CVS.

March 16

Created parameters for the various models (exp., gamma multiplier, etc.); will run them all and compare. Need to get back to the hacky version.

March 17

Recreated the find-stragglers file; used it to diagnose the problem of scripts ending prematurely due to pickle problems (a value exception caused by disk problems).

March 18

Got the final results; they didn't look too good. Created a script to dump output to dot files. It looks like the normalization is not quite kosher. Will have to redo the 1/5 runs.

March 19

Fixed a logic error that had screwed up normalization. Also added the ability to get accuracy by POS. Rerunning the experiments.

March 21

Moved normalization out of the inner loop; should be faster now.

March 22

Tried applying a softmax to the betas, which seemed too uniform. It didn't work out; it seemed to make things even more uniform.
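
For the record, the softmax that was tried, in its numerically stabilized form; when the betas sit close together it does flatten everything toward uniform, which matches what happened. The temperature knob is illustrative (below 1 it would sharpen instead):

  import math

  def softmax(betas, temperature=1.0):
      m = max(betas)                        # subtract the max for stability
      exps = [math.exp((b - m) / temperature) for b in betas]
      z = sum(exps)
      return [e / z for e in exps]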

March 23

Found out I was using the wrong Lin similarity; now fixed and rerunning experiments. Should make it POS-specific rather than taking the max across parts of speech for words that appear as both. Tested out WN 1.6; it has to be switched via symbolic link, but that shouldn't be a problem when running on CLIR.

March 25

Tried a variety of runs with JCN but didn't get much beyond 48%, so I'm using jcn-eb5 for the CLIR submission.

March 26

Rewrote paper for EMNLP and got CLIR submission in.

March 27

Since the backward usage of gamma might have caused the higher accuracies, I'm now trying that. Also trying no gamma at all (the original clunky idea). There are some words that aren't getting synset assignments; I'll need to debug that. I have hopes for the symmetric idea. If that doesn't pan out directly, I need to try reversing the order of the similarity (or taking the average).

March 28

Will try the backward gamma in the morning, running a bunch of experiments. Am also shuffling the synsets to prevent any order effect. Bad parallelization led to the missing synsets. Made the graph colored.

April 1

Determined that the bad gamma file coupled with the first-sense heuristic was causing the problem. Without the mess, jcn-eb seems to work the best. Not sure if it will still be worth doing. Verb and adjective similarities don't do much, so perhaps it would be good to get something like Lesk working here.

In creating the files for all words, I noticed the following:

  • multiword expressions (esp.) aren't in the vocab, which seems to say the lemmatization isn't working as expected
  • some of the words in the all-words files are missing from the vocab. Shouldn't be too bad as long as they're in the gamma file (it will mess up my alpha, though, which implies maybe going back to Perl is a good idea).

April 4

Submitted the all-words run. Not sure how well it went. Topic assignment didn't seem to be working at the merge stage, so I just went with what was there at parallelize and hoped it was right.  :(

Gotten lax about updating the wiki. The following has been done:

  • Added Perl support (tried JCN and Vector, didn't help ... should try Lesk)
  • Made ad hoc support for all words, formatted files correctly (I hope)
  • Tried the non-topic and topicalized versions of algorithms

Needs to be done:

  • Use beta rather than assignments for topic probability
  • Marginalize over topics (thus requiring me to load theta)?
  • Better similarity, esp. for verbs
  • Test multiple runs of LDA

April 8

Wrote new LDA class to read in beta and gamma.
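
A sketch of the class, assuming lda-c output conventions (final.beta holds one row per topic of log p(word|topic); final.gamma holds one row per document of variational Dirichlet parameters):

  class LDAModel:
      def __init__(self, beta_file, gamma_file):
          self.log_beta = self._read_matrix(beta_file)   # topics x vocab
          self.gamma = self._read_matrix(gamma_file)     # docs x topics

      @staticmethod
      def _read_matrix(path):
          with open(path) as f:
              return [[float(x) for x in line.split()] for line in f]

      def theta(self, doc):
          """Mean of the variational posterior over topics for one doc."""
          g = self.gamma[doc]
          return [x / sum(g) for x in g]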

April 9

Started debugging the LDA class, but it needs new LDA files, so I need to change the setup of what's going on.

April 10

Worked on ImageNet in the morning; realized that parallelize and merge need to be reworked.

April 11

Rewrote parallelize and merge; parallelize is debugged (I think). Also found bug in the LDA module; fixed.

UNRESOLVED:

  • WNsim assumes the old topic structure, but this isn't causing any problems because it can simply look at the old pickles.
  • Background is broken, so going to raw betas again
  • Probabilities might be messed up ... don't know if it's the result of the big alphas; anyhow, running jcn-rb5 and seeing what happens

Got results back from the contest; they suck, but I don't know why.

April 12

Started to try to run experiments, but Tim had a big deadline, so I backed off.

April 13

Working on writing.

April 14

Since the pickle files for larger numbers of topics are gone, I redid the information content.

April 15

Fixed the overflow problem. Background probabilities weren't smoothed at all, and the Dirichlet prior on the vocab multinomials resulted in overflow when I was (in essence) dividing by zero. There was also a problem in background and frequency using different representations for words. Should be fixed. One remaining problem is the problem of zero-frequency words in the vocab that don't get assigned (even a smoothed) frequency in LDA.