Created class to strip documents from BNC using one of three tags:
- s - too small
- p - right size, excludes speech
- bncdoc - too big, variable
It then parses the paragraph with minipar.
Tried getting topic WSD to work with JCN from Pedersen's IC file and 3.0 structure, but it gave horrible results.
Power went out, so parser died. THought that I should also be doing stemming and writing out LDA counts. Stemming will be:
- See if morphy or the Porter stemmer have suggestions
- If it's in WordNet and the original string isn't, keep it and count it
- Get BNC parsing started (office)
Added a stoplist and excluding words of less than length 3. Problem with program using isWord taking string arguments rather than lists, added an assert. Started running all the jobs on office computer.
- Go back to using Pedersen (laptop)
The interpreter is working, but the offsets don't match up to the answer file. Will debug tomorrow.
- Cluster WordNet (cluster)
Nothing done on this. Had an idea of using Huffman codes, but that doesn't make sense without sense frequency. Need to change clustering method to attach less frequent cluster to more frequent one (at the root? or at the most linked to synset?)
Spent all day (literally) converting semcor to the right format. This should also work for LDA/LDAWN, which might be hurt by missing synsets. So, I'll try both that and the original McCarthy thing. Takes any combination of POS, so it should be fairly flexible.