TOPICSROOT/corpora/wikipedia


This directory contains the data files for Wikipedia[1]. The files are a snapshot of Wikipedia taken on Sept 10, 2007. WNPCVSROOT/libs/corpora/wikipedia contains useful tools for manipulating the archive.

The directory contains the following files:

  • enwiki-latest-pages-articles.xml. This (LARGE) XML file is the downloaded dump after bzip2 decompression. The markup is fairly self-explanatory.
  • index. A list of byte offsets for each article in the corpus, generated by generate-offsets.sh. Do not use this file directly; it is only meant as input to process-offsets.sh.
  • index.processed. A cleaned version of index, generated by process-offsets.sh. Each line consists of a byte offset and an article name, separated by a colon.
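
As a rough illustration of how index.processed might be used, the Python sketch below loads the index and reads one article out of the XML file by seeking to its byte offset. It rests on several assumptions not confirmed by this page: that the offset comes first on each line and is a decimal number, that it points at the start of the article's <page> element, and that the article ends at the next </page> tag. The helper names load_index and read_article are hypothetical and are not part of the tools in WNPCVSROOT/libs/corpora/wikipedia.

```python
import io


def load_index(lines):
    """Parse 'offset:title' lines into a dict mapping title -> byte offset.

    Assumption: the offset is the first field and is a decimal integer.
    """
    offsets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first colon only, since article names such as
        # "Wikipedia:About" contain colons themselves.
        offset, title = line.split(":", 1)
        offsets[title] = int(offset)
    return offsets


def read_article(xml_file, offset, end_tag=b"</page>"):
    """Return the bytes from `offset` through the first line containing end_tag.

    `xml_file` must be opened in binary mode so that seek() works on
    byte offsets. Assumption: the offset marks the start of a <page> element.
    """
    xml_file.seek(offset)
    chunks = []
    for raw_line in xml_file:
        chunks.append(raw_line)
        if end_tag in raw_line:
            break
    return b"".join(chunks)
```

With the real files, usage would look something like opening index.processed as text and enwiki-latest-pages-articles.xml with mode "rb", then calling read_article(xml, load_index(index_file)["Some Title"]). Seeking avoids scanning the whole (LARGE) XML file for each lookup.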