TOPICSROOT/corpora/wikipedia
Latest revision as of 15:29, 4 October 2007
The directory contains the following files:
- enwiki-latest-pages-articles.xml. This (LARGE) XML file is the downloaded data after bzip2 decompression. The markup is fairly self-explanatory.
- index. A raw list of byte offsets for each article in the corpus. This file is generated by generate-offsets.sh. Do not use it directly; it is only meant as input to process-offsets.sh.
- index.processed. A cleaned-up version of index, listing the byte offset of each article in the corpus. It is generated by process-offsets.sh. Each line consists of a byte offset and an article name, separated by a colon.
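
As a rough illustration of how index.processed might be used to pull a single article out of the XML file without scanning all of it, here is a minimal Python sketch. It rests on several assumptions that this page does not confirm: that each offset is a decimal integer, that it points at the start of the article's `<page>` element, and that the article ends at the next `</page>`. The function names load_index and read_article are hypothetical helpers, not part of the scripts above.

```python
def load_index(lines):
    """Parse 'offset:article name' lines into a dict of name -> byte offset."""
    offsets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first colon only, since article names may themselves
        # contain colons (e.g. "Help:Contents").
        offset, name = line.split(":", 1)
        # Assumption: offsets are written as decimal integers.
        offsets[name] = int(offset)
    return offsets


def read_article(xml_file, offset, end_marker=b"</page>"):
    """Seek to a byte offset in a binary file and read through end_marker.

    Assumption: the offset marks the start of the article's <page> element
    and the article ends on the first line containing </page>.
    """
    xml_file.seek(offset)
    chunks = []
    for raw_line in xml_file:
        chunks.append(raw_line)
        if end_marker in raw_line:
            break
    return b"".join(chunks).decode("utf-8", errors="replace")
```

To use it, open index.processed in text mode and pass the file object to load_index, then open enwiki-latest-pages-articles.xml in binary mode ("rb") and pass it to read_article along with the looked-up offset. Binary mode matters because the offsets are counted in bytes, not characters.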