TOPICSROOT/corpora/bnc

From CSWiki
Revision as of 14:54, 4 October 2007 by Jcone

This directory contains files pertaining to the BNC (British National Corpus). Please refer to libs/corpora/bnc for libraries useful for dealing with this data. The files in this directory are:

  • bnc-indices.dat. This contains the byte offset of each sentence in bnc-roles.dat. This is particularly useful when one wants to work with only a subset of the corpus without having to parse the entire file.
  • bnc-roles.dat. This contains the BNC data in the following format. The first line is an integer giving the number of sentences in the corpus. The sentences then follow in order. The first line of each sentence gives the number of words in that sentence, followed by one line per word. Each word line consists of an integer representing the synset (-1 if unknown), then the word id, which refers to the lemmatized form of the word, then the number of roles for the word, and finally the roles themselves.
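
The format described above can be read with a few lines of code. The following is only a minimal sketch and not the code in libs/corpora/bnc. It assumes that bnc-roles.dat is plain text, that the fields on a word line are separated by whitespace, that each role is a single token, and that an offset from bnc-indices.dat points at the word-count line of a sentence. The layout of bnc-indices.dat itself is not described here, so the sketch takes an offset as a plain integer. The names Word, read_sentence, read_corpus and read_sentence_at are hypothetical.

```python
import io
from typing import BinaryIO, List, NamedTuple


class Word(NamedTuple):
    synset: int       # -1 if the synset is unknown
    word_id: int      # id of the lemmatized form of the word
    roles: List[str]  # kept as raw tokens; their encoding is not specified


def read_sentence(stream: BinaryIO) -> List[Word]:
    """Read one sentence starting at the stream's current position."""
    n_words = int(stream.readline().decode("utf-8"))
    words = []
    for _ in range(n_words):
        # Assumed layout of a word line: synset word_id n_roles role_1 ... role_n
        fields = stream.readline().decode("utf-8").split()
        synset, word_id, n_roles = (int(f) for f in fields[:3])
        roles = fields[3:3 + n_roles]
        if len(roles) != n_roles:
            raise ValueError("word line has fewer roles than declared")
        words.append(Word(synset, word_id, roles))
    return words


def read_corpus(stream: BinaryIO) -> List[List[Word]]:
    """Read the whole file: a sentence count, then each sentence in turn."""
    n_sentences = int(stream.readline().decode("utf-8"))
    return [read_sentence(stream) for _ in range(n_sentences)]


def read_sentence_at(stream: BinaryIO, offset: int) -> List[Word]:
    """Jump to a byte offset (assumed to start a sentence) and read it."""
    stream.seek(offset)
    return read_sentence(stream)


if __name__ == "__main__":
    # Made-up sample data in the assumed layout, not real BNC content.
    first = b"2\n-1 10 0\n5 11 2 A0 A1\n"
    second = b"1\n7 12 1 V\n"
    sample = io.BytesIO(b"2\n" + first + second)
    print(read_corpus(sample))
    # The second sentence starts after the count line and the first sentence.
    print(read_sentence_at(sample, len(b"2\n") + len(first)))
```

The file is opened in binary mode in this sketch so that seek works directly on byte offsets. With a real file, the stream would come from open(path, "rb") and the offset from bnc-indices.dat.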