Difference between revisions of "Topic wsd"
Revision as of 17:56, 11 April 2007
- 1 Feb 13
- 2 Feb 14
- 3 Feb 15
- 4 Feb 16
- 5 Feb 19
- 6 Feb 20
- 7 Feb 24
- 8 Feb 25
- 9 Feb 26
- 10 Feb 27
- 11 Feb 28
- 12 March 1
- 13 March 2
- 14 March 3
- 15 March 4
- 16 March 5
- 17 March 6
- 18 March 7
- 19 March 8
- 20 March 9
- 21 March 10
- 22 March 11
- 23 March 12
- 24 March 13
- 25 March 14
- 26 March 15
- 27 March 16
- 28 March 17
- 29 March 18
- 30 March 19
- 31 March 21
- 32 March 22
- 33 March 23
- 34 March 25
- 35 March 26
- 36 March 27
- 37 March 28
- 38 April 1
- 39 April 4
- 40 April 8
- 41 April 9
Created class to strip documents from BNC using one of three tags:
- s - too small
- p - right size, excludes speech
- bncdoc - too big, variable
It then parses the paragraph with minipar.
Tried getting topic WSD to work with JCN from Pedersen's IC file and 3.0 structure, but it gave horrible results.
Power went out, so the parser died. Thought that I should also be doing stemming and writing out LDA counts. Stemming will be:
- See if morphy or the Porter stemmer have suggestions
- If it's in WordNet and the original string isn't, keep it and count it
- Get BNC parsing started (office)
Added a stoplist and excluded words shorter than three characters. Had a problem with the program calling isWord, which takes string arguments rather than lists; added an assert. Started running all the jobs on the office computer.
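A minimal sketch of the filter described above, assuming isWord takes a single string. The stoplist contents and the name is_word are placeholders for illustration; the real stoplist is not recorded here.

```python
# Hypothetical stoplist; the real one is not recorded in this log.
STOPLIST = frozenset({"the", "and", "for", "with", "that", "this"})
MIN_LENGTH = 3  # words shorter than three characters are excluded


def is_word(word, stoplist=STOPLIST, min_length=MIN_LENGTH):
    """Return True if a single token survives the stoplist and length filters."""
    # A list passed by mistake would otherwise be measured by its number of
    # elements, not its characters, so fail loudly on anything but a string.
    assert isinstance(word, str), "is_word expects a single string, not a list"
    return len(word) >= min_length and word.lower() not in stoplist


tokens = ["The", "parser", "of", "BNC", "died"]
kept = [t for t in tokens if is_word(t)]
```

Calling is_word(tokens) instead of looping over the tokens trips the assert immediately, which is the kind of mistake the assert is meant to catch.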
- Go back to using Pedersen (laptop)
The interpreter is working, but the offsets don't match up to the answer file. Will debug tomorrow.
- Cluster WordNet (cluster)
Nothing done on this. Had an idea of using Huffman codes, but that doesn't make sense without sense frequency. Need to change clustering method to attach less frequent cluster to more frequent one (at the root? or at the most linked to synset?)
Need to regenerate answer file from Semcor after discovering that 2.1 and 2.0 aren't getting along together. There's a new Semcor from Rada, so that should help. Also concerned that a bad mapping might have hurt LDAWN.
Mapping was rather difficult (well, more annoying than difficult); took all day.
Something messed up my office computer. Need to rebuild the filesystem. Hopefully the parsing was not lost. Discovered that using the new Semcor file is not as easy as I thought; will need to create a new vocab, clean up the code, etc.
Restarted the parsing (after power outage).
Running 1 / 10 on cycles. Need to use disk space more efficiently, so using the LDA assignments and the original dat file to do the topic frequencies.
Discovered that new semcor splitting program isn't quite working:
6006:1 9656:1 41913:3 17850:1 46014:1 12223:1 39328:1 53189:1 46022:1 7115:1 35277:1 39888:1 9720:2 59347:1 980:1 20952:1 40921:1 19930:1 46046:1 49633:1 55272:1 17387:1 59884:1 14589:2 17909:2 53238:1 1016:1 11258:2 55551:2 26110:1 14847:100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Means that something is messed up.
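A quick sanity check would catch lines like this before they reach LDA. This is only a sketch: it assumes each field is a term_id:count pair, and the vocab_size and max_count limits are made-up thresholds, not values from the actual pipeline.

```python
def suspicious_entries(line, vocab_size, max_count=10_000):
    """Return the (term_id, count) pairs in one data line that look corrupted."""
    bad = []
    for field in line.split():
        if ":" not in field:
            continue  # skip any field that is not an id:count pair
        term_text, count_text = field.split(":", 1)
        term_id, count = int(term_text), int(count_text)
        # Flag ids outside the vocabulary and counts that are zero or absurd.
        if not 0 <= term_id < vocab_size or not 0 < count <= max_count:
            bad.append((term_id, count))
    return bad


# A shortened stand-in for the broken line: one count has dozens of digits.
broken_line = "6006:1 9656:1 41913:3 14847:1" + "0" * 50
flagged = suspicious_entries(broken_line, vocab_size=60_000)
```

Running this over every line of a dat file and printing the line numbers with non-empty results should be enough to locate where the splitting program goes wrong.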
Wrote script to automatically create topic files trained on bnc and then applied to semcor.
The parser finished; created raw count files.
Fixed parser output, redid it. Created settings file as a central place for setting the location of various files.
Fixed semcor output. Still need to try it in LDAWN. After getting parallelize to work and start it running, need to write a program to merge the results. The cached file needs to contain:
- setting names
- num topics
Had an extra initial line at the start of Semcor file, which prevented LDA from running and messed up doc numbers. Is now fixed and rerunning LDA. Found a problem in the convert BNC file that caused the voc numbers to be larger than they should be (perhaps because words appear across parts of speech?).
Argh! Might have deleted some stuff from wnp accidentally on topic directory. Really need to clean that up and finish the ahzs runs.
Right now, measures that give zero similarity muck things up: a zero discards all other information, and we're basically guessing at random.
Made cluster scripts, running them for one topic.
Memory for the disambiguation scripts was causing a problem, so rewrote them to only read a bit of the file at a time (this works because the access is linear). Screws up the last word of each run, but it's every 10,000 docs, so it's not a big deal (but should be fixed).
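A minimal sketch of reading a file a bit at a time, assuming the input is line-oriented (one document per line). The function name and batch size are placeholders. Since every line is consumed exactly once, nothing should be dropped at a batch boundary; whether this addresses the actual last-word bug is unverified, since the cause is not recorded here.

```python
from itertools import islice


def read_in_batches(lines, batch_size=10_000):
    """Yield lists of at most batch_size lines from any iterable of lines."""
    iterator = iter(lines)
    while True:
        # islice pulls the next batch_size items without loading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Works on an open file handle the same way as on this small list.
batches = list(read_in_batches(["d1", "d2", "d3", "d4", "d5"], batch_size=2))
```

With a real file this would be used as `for batch in read_in_batches(open_handle): ...`, keeping only one batch in memory at a time.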
Also need to make sure NNP like "location" and "person" are excluded.
Worried about race condition on cluster scripts when writing the pickle file (won't be a problem on subsequent runs).
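One common way to avoid readers seeing a half-written pickle is to write to a temporary file and rename it into place. This is a sketch of that general technique, not what the cluster scripts currently do; the function name is a placeholder, and the atomicity of the rename on the cluster's shared filesystem is an assumption that would need checking.

```python
import os
import pickle
import tempfile


def write_pickle_atomically(obj, path):
    """Write obj to path so other processes never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on
    # one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
        os.replace(temp_path, path)  # replaces any existing file in one step
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

If two jobs race, both write complete files and the last rename wins, which is harmless as long as they compute the same contents.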
Okay, fuckups came to light galore. Nevertheless, a quasi-random sample of words were actually disambiguated using the system:
- Make sure the bad vocab didn't infect any of the files I'm currently using (i.e. delete all pickle files and start from scratch)
- Parallelize seems to be missing more words now that I tried to fix it so that the single topic version would work
- Merge looks for files that aren't there
- Need to remove "location," "person," etc.
Upon finding that the existing JCN implementation didn't quite work (differences between 2.0 and 2.1 couldn't be fixed by mapping), I redid JCN so it can use arbitrary counts. This might be useful when we use topic-specific similarity.
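For reference, a toy sketch of JCN computed from arbitrary counts, using the standard definition (distance = IC(a) + IC(b) - 2 * IC(lcs), similarity = 1 / distance). The five-node taxonomy and the counts are invented for illustration and have nothing to do with the real WordNet data or the actual implementation.

```python
import math

# Invented toy taxonomy: child -> parent, with "entity" as the root.
PARENT = {"entity": None, "animal": "entity", "dog": "animal",
          "cat": "animal", "rock": "entity"}


def ancestors(node):
    """Return the node followed by its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def information_content(counts):
    """Propagate counts up the taxonomy and return IC = -log p per node."""
    totals = dict.fromkeys(PARENT, 0.0)
    for node, count in counts.items():
        for ancestor in ancestors(node):
            totals[ancestor] += count
    root_total = totals["entity"]
    # Nodes with zero total count get no IC entry at all.
    return {n: -math.log(t / root_total) for n, t in totals.items() if t > 0}


def jcn_similarity(a, b, ic):
    """Jiang-Conrath similarity: 1 / (IC(a) + IC(b) - 2 * IC(lcs))."""
    b_ancestors = set(ancestors(b))
    lcs = next(n for n in ancestors(a) if n in b_ancestors)
    distance = ic[a] + ic[b] - 2 * ic[lcs]
    return float("inf") if distance == 0 else 1.0 / distance


ic_general = information_content({"dog": 4, "cat": 4, "rock": 8})
ic_topic = information_content({"dog": 1, "cat": 7, "rock": 8})
```

Swapping ic_general for ic_topic changes the dog/cat similarity, which is the hook a topic-specific similarity would use.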
Long loading times for the parallelized version are silly; the jobs don't need to load in the vocab distributions for all topics. Parallelize is missing fewer words, but it's not perfect. The implementation is messy and needs to be rewritten. Started JCN runs, but Curtis just started a huge number of jobs.
Google interview (and worrying about it) took up most of the morning.
Checked finishing rates of cluster scripts; mostly seem to work, but default's files have been changed. Rerunning those. Created a script to finish off stragglers. Merge seems to mostly find the files, but some words are still not found.
Wrote code to have better output files for error analysis. Not debugged, but written.
Debugged merge file and created script so that it throws out group/location/person. Compiled a bunch of statistics on accuracy, but it doesn't improve as the number of topics increases. Possible reasons:
- Not enough differentiation across senses
- Assignment problem; should merge take a weighted sum across topics? (Doing this would require reworking parallelize to handle all assignments with prob > x.)
- Something that Miro pointed out that I don't quite understand. Dave proposed exponentiating syntactic and semantic terms and weighting by topic enrichment.
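A sketch of the weighted-sum idea from the list above: instead of trusting the single assigned topic, sum each sense's score over all topics whose probability exceeds a threshold. The data layout, the min_prob threshold, and the sense labels are all assumptions for illustration.

```python
def merge_across_topics(topic_probs, sense_scores, min_prob=0.05):
    """Pick the sense with the highest topic-weighted score.

    topic_probs:  {topic: probability of that topic for this word}
    sense_scores: {topic: {sense: score of that sense under the topic}}
    """
    totals = {}
    for topic, prob in topic_probs.items():
        if prob <= min_prob:
            continue  # ignore topics with prob <= x
        for sense, score in sense_scores.get(topic, {}).items():
            totals[sense] = totals.get(sense, 0.0) + prob * score
    if not totals:
        return None  # no topic passed the threshold
    return max(totals, key=totals.get)


# Hypothetical scores for one ambiguous word under two topics.
probs = {0: 0.6, 1: 0.4}
scores = {0: {"sense_a": 0.2, "sense_b": 0.8},
          1: {"sense_a": 0.9, "sense_b": 0.1}}
best = merge_across_topics(probs, scores)
```

Here sense_b wins with 0.6 * 0.8 + 0.4 * 0.1 = 0.52 against 0.48 for sense_a.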
Tried removing syntactic component; it does have an impact (but not a good one).
Made JCN topic specific; had debugging problems when the IC counts didn't change across topics. Need to debug tonight. After running topic specific algorithm on 1 and 10 topics, it doesn't seem to have an impact.
Discovered some problems:
- BNC dat file wasn't lemmatized; could lead to problems with topics and similarity
- Lin file was only for nouns; should be fixed now
Running the exponential idea now.
Realized I'm a dumbass and wasn't normalizing. Miro was right, and I'm a dipshit. So, yeah, rerunning. Maybe we'll get somewhere now.
Sick. Talked, however, with Edo about some evocation ideas. Should try that out soon.
Still sick, but found math error in normalization.
Fixed problem with loading pickles. Reading (ineffectively) for reading group presentation.
The count files have a problem that makes them difficult to read. Creating convertToDat.py on office computer in wnp/BNC directory to deal with it; if stripBNC is run again, its output should be checked to see if it still does this. This will require rerunning LDA, which might take some time.
Talked with Jonathan about data files. Looks to be sorta messy for CLIR (as they aren't lemmatized).
To do tonight:
- Get LDA rerunning started
- Get beta-gamma linked system running (perhaps adding a new model flag)
- Get a start on parsing data files
First two started, nothing on third. Discovered I wasn't using the broad Lin after all, and that the gamma was being used incorrectly. Maybe we're finally getting somewhere. Thus, the numbers below are pretty pointless. Rerunning again with the right syntactic similarity.
Okay, created a tagger and lemmatizer for the CLIR data. It's running now. Makes the same output as what I was using for semcor.
Okay, choosing Beta first gives us:
Not the greatest. So now I'm redoing the exponential thing again, which might improve now that the syntactic similarity is going in the right order.
Prepared for machine learning group / celebrated Irene's birthday. Deleted find stragglers ... need to make sure everything is in CVS.
Created parameters for the various models (exp. gamma multiplier, etc.); will run them all and compare. Need to get back to hacky version.
Recreated the find stragglers file; used to diagnose problem of scripts ending prematurely due to pickle problems (a value exception caused by disk problems).
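A rough sketch of what a stragglers check can look like: a job counts as unfinished if its output pickle is missing or cannot be loaded. The one-pickle-per-job layout and the .pickle file naming are assumptions, not the actual script.

```python
import os
import pickle


def find_stragglers(job_names, output_dir):
    """Return the jobs whose output pickle is missing or unreadable."""
    stragglers = []
    for name in job_names:
        path = os.path.join(output_dir, name + ".pickle")
        try:
            with open(path, "rb") as handle:
                pickle.load(handle)
        except (OSError, EOFError, pickle.UnpicklingError, ValueError):
            # Missing file, truncated file, or corrupted contents.
            stragglers.append(name)
    return stragglers
```

The returned names can be fed straight back into whatever script submits jobs to the cluster.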
Got the final results; didn't look too good. Created script to dump output to dot files. It's looking like the normalization is not quite kosher. Will have to redo the 1/5 runs.
Fixed a logic error that had screwed up normalization. Also allowed the ability to get accuracy based on POS. Rerunning the experiments.
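A minimal sketch of accuracy broken down by part of speech, assuming each evaluated word is available as a (pos, predicted sense, gold sense) triple. The record layout and labels are placeholders.

```python
from collections import defaultdict


def accuracy_by_pos(records):
    """Return {pos: fraction correct} for (pos, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pos, predicted, gold in records:
        total[pos] += 1
        correct[pos] += int(predicted == gold)
    return {pos: correct[pos] / total[pos] for pos in total}


# Hypothetical records: two nouns (one right, one wrong) and one verb.
example = [("n", "s1", "s1"), ("n", "s2", "s1"), ("v", "s3", "s3")]
by_pos = accuracy_by_pos(example)
```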
Moved normalization out of inner loop, should be faster now.
Tried applying softmax to the betas, which seemed too uniform. Didn't work out; if anything, it made things more uniform.
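One possible explanation, assuming the betas fed to the softmax were probabilities in [0, 1]: the inputs then differ by at most 1, so no two softmax outputs can differ by more than a factor of e, however peaked the inputs were. The example values below are made up.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def softmax(values):
    """Standard softmax, shifted by the max for numerical stability."""
    peak = max(values)
    return normalize([math.exp(v - peak) for v in values])


betas = [0.9, 0.09, 0.01]  # hypothetical, already a peaked distribution
flattened = softmax(betas)
```

For these values the largest entry drops from 0.9 to a little over half, and the smallest rises from 0.01 to over a fifth.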
Found out I was using wrong Lin similarity. Now fixed and rerunning experiments. Should make it POS specific rather than taking max across parts of speech for those which are both. Tested out WN 1.6; has to be changed via symbolic link, but shouldn't be a problem when running on CLIR.
Tried a variety of runs on jcn, didn't get much beyond 48%, so using jcn-eb5 for the CLIR submission.
Rewrote paper for EMNLP and got CLIR submission in.
Thought the backward usage of gamma might have caused the higher accuracies, so I'm now trying that. Also trying no gamma at all (the original clunky idea). There are some words that aren't getting synset assignments; I'll need to debug that. I have hopes for the symmetric idea. If that doesn't pan out directly, I need to try reversing the order of the similarity (or taking the average).
Will try the backward gamma in the morning, running a bunch of experiments. Am also shuffling the synsets to prevent any order effect. Bad parallelization led to the missing synsets. Made graph colored.
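A sketch of why shuffling helps, assuming the sense is picked with an argmax over (synset, score) pairs: without the shuffle, ties always go to whichever synset is listed first, which quietly reintroduces a first-sense bias. Names and scores are placeholders.

```python
import random


def pick_sense(scores, rng):
    """Pick the highest-scoring synset, breaking ties in random order.

    scores: list of (synset, score) pairs.
    rng:    a random.Random instance, so runs can be made repeatable.
    """
    candidates = list(scores)  # copy so the caller's list is not reordered
    rng.shuffle(candidates)
    # max returns the first maximal element, which is now a random one.
    return max(candidates, key=lambda pair: pair[1])[0]


tied = [("synset_1", 0.5), ("synset_2", 0.5), ("synset_3", 0.1)]
rng = random.Random(0)
picks = {pick_sense(tied, rng) for _ in range(200)}
```

Over repeated calls both tied synsets get chosen, and the lower-scoring one never is.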
Determined that the bad gamma file coupled with the first sense heuristic was causing the problem. Without the mess, jcn-eb seems to work the best. Not sure if it will still be worth doing. Verbs and adjective similarities don't do much, so perhaps it would be good to get something like Lesk working here.
In creating the files for all words, I noticed the following:
- multiword expressions (esp.) aren't in the vocab, which seems to say the lemmatization isn't working as expected
- some of the all words files are missing from the vocab. Shouldn't be too bad as long as they're in the gamma file (will mess up my alpha, though, which implies maybe going back to perl is a good idea).
Submitted all words. Not sure how well it went. Topic assignment didn't seem to be working at the merge stage, so I just went with what was there at parallelize, and hoped it was right. :(
Gotten lax about updating the wiki. The following has been done:
- Added Perl support (tried JCN and Vector, didn't help ... should try Lesk)
- Made ad hoc support for all words, formatted files correctly (I hope)
- Tried the non-topic and topicalized versions of algorithms
Needs to be done:
- Use beta rather than assignments for topic probability
- Marginalize over topics (thus requiring me to load theta)?
- Better similarity, esp. for verbs
- Test multiple runs of LDA
Wrote new LDA class to read in beta and gamma.
Started debugging LDA class, but needs new LDA files, so need to change the setup of what's going on.