TOPICSROOT/corpora/bnc

This directory contains files pertaining to the BNC (British National Corpus). Please refer to libs/corpora/bnc for libraries useful for dealing with this data. The files in this directory are:

  • bnc-indices.dat. This contains the byte offset of each sentence in bnc-roles.dat. This is particularly useful when one wants to deal with only a subset of the corpus without having to parse the entire thing.
  • bnc-roles.dat. This contains the BNC data in the following format. The first line is an integer giving the number of sentences in the corpus. The sentences then follow in order. The first line of each sentence gives the number of words in that sentence, and is followed by one line per word. Each word line consists of an integer representing the synset (-1 if unknown), then the word id, which refers to the lemmatized form of the word, then the number of roles for the word, and finally the roles themselves.
  • sorted-word.dat.dict. This file lists the lemmas that appear in the corpus, sorted by frequency. The word id in bnc-roles.dat indexes a line in this file.
  • word-counts.dat. The word count for each of the lemmas in sorted-word.dat.dict.
  • word-index.dat. Each line consists of a list of numbers representing the sentences that contain the word indexed on that same line in sorted-word.dat.dict. To quickly fetch all the sentences containing a word, one need only fetch the appropriate line from this file, then convert each of the sentence indices into a byte offset using bnc-indices.dat.
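
The lookup described above (word id, then sentence indices, then byte offsets, then sentences) might be sketched in Python roughly as follows. This is only an illustration under several assumptions that this page does not confirm: every field is a whitespace-separated decimal integer, word ids and sentence indices count from 0, and the offsets from bnc-indices.dat have already been loaded into a list. The function names read_sentence and sentences_containing are made up for this example; the libraries in libs/corpora/bnc are the place to look for the real interface.

```python
import io
from typing import BinaryIO, Iterator, List, Sequence, Tuple

# One word line: (synset, word_id, num_roles, role_fields).
# synset is -1 when unknown. role_fields holds whatever follows the
# role count; how a single role is encoded is not documented here,
# so the fields are kept as raw integers (assumption).
Word = Tuple[int, int, int, List[int]]


def read_sentence(stream: BinaryIO) -> List[Word]:
    """Read one sentence starting at the stream's current position."""
    num_words = int(stream.readline())
    words = []
    for _ in range(num_words):
        fields = [int(tok) for tok in stream.readline().split()]
        synset, word_id, num_roles = fields[:3]
        words.append((synset, word_id, num_roles, fields[3:]))
    return words


def sentences_containing(
    word_id: int,
    word_index_lines: Sequence[str],  # lines of word-index.dat
    offsets: Sequence[int],           # byte offsets from bnc-indices.dat
    roles_stream: BinaryIO,           # bnc-roles.dat opened in binary mode
) -> Iterator[List[Word]]:
    """Yield every sentence that word-index.dat lists for word_id."""
    # Assumes word ids and sentence indices are 0-based.
    for token in word_index_lines[word_id].split():
        roles_stream.seek(offsets[int(token)])
        yield read_sentence(roles_stream)
```

With the real files, roles_stream would be something like open("bnc-roles.dat", "rb"). Binary mode matters because the offsets are byte offsets, and seeking to arbitrary positions in a text-mode file is not reliable.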

Stuff Jordan needs to document

  • resnik-counts.dat
  • sorted-role.dat.dict
  • synset-counts.dat