Created class to strip documents from BNC using one of three tags:
- s - too small
- p - right size, excludes speech
- bncdoc - too big, variable
It then parses the paragraph with minipar.
Tried getting topic WSD to work with JCN from Pedersen's IC file and 3.0 structure, but it gave horrible results.
Power went out, so parser died. Thought that I should also be doing stemming and writing out LDA counts. Stemming will be:
- See if morphy or the Porter stemmer have suggestions
- If it's in WordNet and the original string isn't, keep it and count it
- Get BNC parsing started (office)
Added a stoplist and now exclude words shorter than length 3. Problem with program using isWord taking string arguments rather than lists; added an assert. Started running all the jobs on the office computer.
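A minimal sketch of the stemming and filtering rules described above. The toy lexicon and the suffix stripper are hypothetical stand-ins for WordNet and for morphy or the Porter stemmer; the names `is_word`, `strip_s`, and `normalize` are made up for illustration. Two assumptions are baked in: the assert guards against a list being passed where a single string is expected (the log does not say which way the mismatch ran), and tokens with no usable suggestion are dropped.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-ins: a real run would query WordNet and use morphy / Porter.
LEXICON = {"dog", "run", "house", "parse"}
STOPLIST = {"the", "and", "for"}
MIN_LENGTH = 3


def is_word(token: str) -> bool:
    """Return True if the token is in the (toy) lexicon."""
    # Assumed direction of the bug: a list passed where one string is expected.
    assert isinstance(token, str), "is_word expects a single string, not a list"
    return token in LEXICON


def strip_s(token: str) -> Optional[str]:
    """Toy suffix stripper standing in for morphy or the Porter stemmer."""
    return token[:-1] if token.endswith("s") else None


def normalize(
    token: str, stemmers: Iterable[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Return the form to count, or None if the token should be dropped."""
    token = token.lower()
    if len(token) < MIN_LENGTH or token in STOPLIST:
        return None
    if is_word(token):
        return token
    for stem in stemmers:
        candidate = stem(token)
        # Keep a suggestion only if it is a known word and the original is not.
        if candidate and is_word(candidate):
            return candidate
    return None  # assumption: unknown tokens with no usable stem are dropped
```

For example, `normalize("Houses", [strip_s])` lowercases the token, fails the lexicon lookup, and then accepts the stemmer's suggestion `"house"`.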
- Go back to using Pedersen (laptop)
The interpreter is working, but the offsets don't match up to the answer file. Will debug tomorrow.
- Cluster WordNet (cluster)
Nothing done on this. Had an idea of using Huffman codes, but that doesn't make sense without sense frequency. Need to change clustering method to attach less frequent cluster to more frequent one (at the root? or at the most linked to synset?)
Need to regenerate answer file from Semcor after discovering that 2.1 and 2.0 aren't getting along together. There's a new Semcor from Rada, so that should help. Also concerned that a bad mapping might have hurt LDAWN.
Mapping was rather difficult (well, more annoying than difficult); took all day.
Something messed up my office computer. Need to rebuild the filesystem. Hopefully parsing was not lost. Discovered that using the new Semcor file is not as easy as I thought; will need to create a new vocab, clean up the code, etc.
Restarted the parsing (after power outage).
Running 1 / 10 on cycles. Need to use disk space more efficiently, so using the LDA assignments and the original dat file to do the topic frequencies.
Discovered that new semcor splitting program isn't quite working:
6006:1 9656:1 41913:3 17850:1 46014:1 12223:1 39328:1 53189:1 46022:1 7115:1 35277:1 39888:1 9720:2 59347:1 980:1 20952:1 40921:1 19930:1 46046:1 49633:1 55272:1 17387:1 59884:1 14589:2 17909:2 53238:1 1016:1 11258:2 55551:2 26110:1 14847:100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Means that something is messed up.
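A sanity check along these lines would catch that kind of line automatically. This is only a sketch: it assumes each token is an `id:count` pair as in the line above, and both the count threshold and the function names are made up for illustration.

```python
from typing import List, Tuple

MAX_PLAUSIBLE_COUNT = 10_000  # assumed threshold; tune to the corpus


def parse_pairs(line: str) -> List[Tuple[int, int]]:
    """Parse whitespace-separated `id:count` tokens into integer pairs."""
    pairs = []
    for token in line.split():
        term_id, _, count = token.partition(":")
        # int() raises ValueError on a malformed token, which is also worth knowing.
        pairs.append((int(term_id), int(count)))
    return pairs


def suspicious_pairs(line: str, vocab_size: int) -> List[Tuple[int, int]]:
    """Return the pairs whose id or count falls outside the plausible range."""
    return [
        (term_id, count)
        for term_id, count in parse_pairs(line)
        if not (0 <= term_id < vocab_size) or not (0 < count <= MAX_PLAUSIBLE_COUNT)
    ]
```

Running `suspicious_pairs` over every line of the split output and printing any non-empty result would point straight at the documents that need regenerating.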
Wrote script to automatically create topic files trained on bnc and then applied to semcor.
The parser finished; created raw count files.
Fixed parser output, redid it. Created settings file as a central place for setting the location of various files.
Fixed semcor output. Still need to try it in LDAWN. After getting parallelize to work and starting it running, need to write a program to merge the results. The cached file needs to contain:
- setting names
- num topics
Had an extra initial line at the start of Semcor file, which prevented LDA from running and messed up doc numbers. Is now fixed and rerunning LDA. Found a problem in the convert BNC file that caused the voc numbers to be larger than they should be (perhaps because words appear across parts of speech?).
Argh! Might have deleted some stuff from wnp accidentally on topic directory. Really need to clean that up and finish the ahzs runs.
Right now, measures that give zero similarity muck things up because they discard all other information, and we're basically guessing at random.
Made cluster scripts, running them for one topic.
Memory for the disambiguation scripts was causing a problem, so rewrote them to only read a bit of the file at a time (this works because the access is linear). Screws up the last word of each run, but it's every 10,000 docs, so it's not a big deal (but should be fixed).
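A minimal sketch of the read-a-bit-at-a-time idea, assuming one document per line (an assumption about the file layout; the function name is hypothetical). Because chunks are cut only on line boundaries, nothing straddles two chunks under that assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_chunks(
    lines: Iterable[str], chunk_size: int = 10_000
) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size lines from a line iterator."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Passing an open file handle as `lines` keeps only one chunk in memory at a time, which is all a linear pass needs.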
Also need to make sure NNP like "location" and "person" are excluded.
Worried about race condition on cluster scripts when writing the pickle file (won't be a problem on subsequent runs).
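One common way to sidestep this kind of race is to write to a temporary file and rename it into place. The sketch below is not what the cluster scripts do; the function name is hypothetical, and rename atomicity on a networked filesystem is an assumption that would need checking.

```python
import os
import pickle
import tempfile
from typing import Any


def write_pickle_atomically(obj: Any, path: str) -> None:
    """Write obj to path so readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temporary file in the target directory so the rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
        # Atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If two jobs write the same cache at once, the last rename wins, but a reader always sees one complete pickle rather than a half-written one.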
Okay, fuckups came to light galore. Nevertheless, a quasi-random sample of words were actually disambiguated using the system:
- Make sure the bad vocab didn't infect any of the files I'm currently using (i.e., delete all pickle files and start from scratch)
- Parallelize seems to be missing more words now that I tried to fix it so that the single topic version would work
- Merge looks for files that aren't there
- Need to remove "location," "person," etc.
Upon finding that the existing JCN implementation didn't quite work (differences between 2.0 and 2.1 couldn't be fixed by mapping), I redid JCN so it can use arbitrary counts. This might be useful when we use topic-specific similarity.
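For reference, a minimal sketch of Jiang-Conrath similarity computed from arbitrary counts. The hierarchy is a toy child-to-parent map standing in for WordNet hypernyms, all names are hypothetical, and this is not the actual reimplementation. Counts are propagated to every ancestor, converted to information content IC(c) = -log p(c), and the similarity is 1 / (IC(a) + IC(b) - 2 IC(lcs)).

```python
import math
from typing import Dict, List, Optional

# Toy hierarchy (child -> parent); a real run would walk WordNet hypernyms.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "rock": "entity",
}


def ancestors(node: str) -> List[str]:
    """Return the path from node up to the root, node included."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def information_content(counts: Dict[str, float]) -> Dict[str, float]:
    """Propagate raw counts up the hierarchy and convert them to -log p."""
    totals = {node: 0.0 for node in PARENT}
    for node, count in counts.items():
        for ancestor in ancestors(node):
            totals[ancestor] += count
    grand_total = sum(counts.values())
    # Nodes with zero mass get no IC entry; looking them up raises KeyError.
    return {n: -math.log(t / grand_total) for n, t in totals.items() if t > 0}


def jcn_similarity(a: str, b: str, ic: Dict[str, float]) -> float:
    """Jiang-Conrath similarity: 1 / (IC(a) + IC(b) - 2 * IC(lcs))."""
    b_ancestors = set(ancestors(b))
    # First shared node on the way up from a is the lowest common subsumer.
    lcs = next(n for n in ancestors(a) if n in b_ancestors)
    distance = ic[a] + ic[b] - 2.0 * ic[lcs]
    return math.inf if distance == 0 else 1.0 / distance
```

Since `information_content` takes any count dictionary, a topic-specific variant only needs a different `counts` argument per topic.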
Long loading times for parallelized version are silly; they don't need to load in the vocab distributions for all topics. Parallelize is missing fewer words, but not perfect. Messy implementation and needs to be rewritten. Started JCN runs, but Curtis just started a huge number of jobs.
Google interview (and worrying about it) took up most of the morning.
Checked finishing rates of cluster scripts; mostly seem to work, but default's files have been changed. Rerunning those. Created a script to finish off stragglers. Merge seems to mostly find the files, but some words are still not found.
Wrote code to have better output files for error analysis. Not debugged, but written.
Debugged merge file and created script so that it throws out group/location/person. Compiled a bunch of statistics on accuracy, but it doesn't improve as the number of topics increases. Possible reasons:
- Not enough differentiation across senses
- Assignment problem; should merge take a weighted sum across topics (doing this would require reworking parallelize for all assignments with prob > x)
- Something that Miro pointed out that I don't quite understand. Dave proposed exponentiating syntactic and semantic terms and weighting by topic enrichment.
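The weighted-sum option for merge could look something like the sketch below. It assumes per-topic sense scores are already available for every topic whose assignment probability exceeds a threshold x (here `min_prob`); the function name, the data layout, and the renormalization over kept topics are all illustrative choices, not the existing merge code.

```python
from collections import defaultdict
from typing import Dict


def merge_sense_scores(
    topic_probs: Dict[int, float],
    sense_scores: Dict[int, Dict[str, float]],
    min_prob: float = 0.05,
) -> Dict[str, float]:
    """Combine per-topic sense scores, weighting each topic by its probability."""
    # Keep only topics above the threshold that actually have scores.
    kept = {
        t: p for t, p in topic_probs.items() if p > min_prob and t in sense_scores
    }
    total = sum(kept.values())
    if total == 0:
        return {}
    merged: Dict[str, float] = defaultdict(float)
    for topic, prob in kept.items():
        for sense, score in sense_scores[topic].items():
            # Renormalize so the kept topic weights sum to one.
            merged[sense] += (prob / total) * score
    return dict(merged)
```

The predicted sense is then the highest-scoring key of the merged dictionary, instead of the best sense under the single most probable topic.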
Tried removing syntactic component; it does have an impact (but not a good one).
Made JCN topic specific; had debugging problems when the IC counts didn't change across topics. Need to debug tonight. After running topic specific algorithm on 1 and 10 topics, it doesn't seem to have an impact.
Discovered some problems:
- BNC dat file wasn't lemmatized; could lead to problems with topics and similarity
- Lin file was only for nouns; should be fixed now
Running the exponential idea now.
Realized I'm a dumbass and wasn't normalizing. Miro was right, and I'm a dipshit. So, yeah, rerunning. Maybe we'll get somewhere now.
Sick. Talked, however, with Edo about some evocation ideas. Should try that out soon.
Still sick, but found math error in normalization.
Fixed problem with loading pickles. Reading (ineffectively) for reading group presentation.
Count files have a problem that makes them difficult to read. Creating convertToDat.py on office computer in wnp/BNC directory to deal with it; if stripBNC is run again, should be checked to see if it still does this. This will require rerunning LDA, which might take some time.
Talked with Jonathan about data files. Looks to be sorta messy for CLIR (as they aren't lemmatized).
To do tonight:
- Get LDA rerunning started
- Get beta-gamma linked system running (perhaps adding a new model flag)
- Get a start on parsing data files
First two started, nothing on third. Discovered I wasn't using the broad Lin after all, and that the gamma was being used incorrectly. Maybe we're finally getting somewhere. Thus, the numbers below are pretty pointless. Rerunning again with the right syntactic sim.
Okay, created a tagger and lemmatizer for the CLIR data. It's running now. Makes the same output as what I was using for semcor.
Okay, choosing Beta first gives us:
Not the greatest. So now I'm redoing the exponential thing again, which might improve now that the syntactic similarity is going in the right order.