This file contains tools for manipulating the Wikipedia corpus in TOPICSROOT/corpora/wikipedia. It defines the following classes:


  • Page - Represents a single article. It defines the following methods:
    • init - Takes a file descriptor and reads in the lines of the current article. The file descriptor should be positioned at the start of the article. After the object is created, the descriptor points to the end of the article.
    • getLinks - Parses the article for links. By default it ignores links in different namespaces. It also ignores links within templates.
    • str - Returns an XML representation of the article that can be reparsed to yield the same structure.
  • Index - Represents the master index for a corpus. It defines the following methods:
    • init - Takes a file descriptor for the index file and one for the articles XML file. The index file should be generated with the scripts described below.
    • fetchPage - Fetches the given article, adjusting the case of the title if necessary. Raises KeyError if the article does not exist.
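
The link-filtering behaviour described for getLinks can be illustrated with a minimal sketch. This is not the library's actual implementation: the function names, the regular expressions, and the include_namespaces parameter are assumptions made for illustration, and treating any target that contains a colon as a namespaced link is a deliberate simplification.

```python
import re

# Matches a {{...}} template with no nested braces. Applied repeatedly, it
# strips nested templates from the inside out.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
# Matches [[target]] or [[target|label]] and captures the target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def strip_templates(text):
    """Remove {{...}} templates, including nested ones."""
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub("", text)
    return text


def get_links(text, include_namespaces=False):
    """Return link targets in order of appearance, skipping templates.

    Simplification (assumption): any target containing ':' is treated as
    belonging to another namespace, e.g. 'Category:Languages'.
    """
    links = []
    for match in LINK_RE.finditer(strip_templates(text)):
        target = match.group(1).strip()
        if not target:
            continue
        if ":" in target and not include_namespaces:
            continue
        links.append(target)
    return links
```

Stripping templates before matching links is what keeps links such as those inside citation templates out of the result. A real parser would need a list of known namespace prefixes so that titles that legitimately contain a colon are not dropped.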

__main__ functionality

When run from the command line, the library extracts a subgraph of the corpus using a breadth-first search (BFS). The command-line options are:

  • --index - The index file. This should be set to /n/fs/topics/corpora/wikipedia/index.processed.
  • --pages - The pages file. This should be set to /n/fs/topics/corpora/wikipedia/en-wiki-latest-pages-articles.xml.
  • --start_page - The root node of the BFS walk.
  • --max_depth - The maximum depth to traverse from the starting node.
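
The depth-limited walk controlled by --start_page and --max_depth can be sketched as follows. This is only an illustration of the technique, not the library's code: bfs_subgraph and get_neighbors are hypothetical names, and get_neighbors stands in for whatever combination of fetchPage and getLinks the real script uses. The sketch assumes that a page whose lookup raises KeyError is recorded with no outgoing links.

```python
from collections import deque


def bfs_subgraph(start_page, get_neighbors, max_depth):
    """Return {page: [linked pages]} for pages within max_depth of start_page.

    get_neighbors(page) is a hypothetical callable that returns the titles
    linked from page and raises KeyError if the page does not exist.
    """
    subgraph = {}
    visited = {start_page}
    queue = deque([(start_page, 0)])
    while queue:
        page, depth = queue.popleft()
        try:
            neighbors = list(get_neighbors(page))
        except KeyError:
            # Assumption: a missing page is kept as a node with no links.
            neighbors = []
        subgraph[page] = neighbors
        if depth >= max_depth:
            # Pages at the depth limit are recorded but not expanded.
            continue
        for neighbor in neighbors:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return subgraph
```

With this design, the link lists of pages at the depth limit may mention pages that are not themselves keys of the result. Whether the real tool prunes such edges is not stated here.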

Additional Tools

There are additional tools in the wikipedia subdirectory.

  • generate-offsets.sh - Generates an index of byte position offsets for each article in the corpus.
  • process-offsets.sh - Cleans the index generated by generate-offsets.sh.
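
The idea behind a byte-offset index can be sketched in Python, even though the tools above are shell scripts whose contents and output format are not shown here. The sketch assumes a dump in which each article starts on a line containing <page>, has a <title> element on a single line, and ends on a line containing </page>. The names build_offsets and read_page are hypothetical.

```python
import re

TITLE_RE = re.compile(rb"<title>(.*?)</title>")


def build_offsets(xml_file):
    """Map each article title to the byte offset of its <page> line.

    xml_file must be opened in binary mode so that tell() reports bytes.
    """
    offsets = {}
    page_start = None
    while True:
        position = xml_file.tell()
        line = xml_file.readline()
        if not line:
            break
        if b"<page>" in line:
            page_start = position
        elif page_start is not None:
            match = TITLE_RE.search(line)
            if match:
                offsets[match.group(1).decode("utf-8")] = page_start
                page_start = None
    return offsets


def read_page(xml_file, offset):
    """Seek to offset and return the raw bytes up to and including </page>."""
    xml_file.seek(offset)
    lines = []
    for line in iter(xml_file.readline, b""):
        lines.append(line)
        if b"</page>" in line:
            break
    return b"".join(lines)
```

Once the offsets are known, fetching an article takes one seek and a short read instead of a scan of the whole dump, which is presumably why the index is built ahead of time.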