This file contains tools for manipulating the Wikipedia corpus stored in TOPICSROOT/corpora/wikipedia.
- Page - Represents a single article. It defines the following methods:
  - __init__ - Takes a file descriptor and reads in the lines for the current article. The file descriptor should be positioned at the start of the article. After the object is created, the descriptor points to the end of the article.
  - getLinks - Parses the article for links. By default it ignores links to other namespaces. It also ignores links inside templates.
  - __str__ - Returns an XML representation of the article that can be reparsed to yield the same structure.
- Index - Represents the master index for a corpus. It defines the following methods:
  - __init__ - Takes a file descriptor for the index file and one for the articles XML file. The index file should be generated with the scripts described below.
  - fetchPage - Fetches the article with the given title, adjusting the case of the title if necessary. Raises KeyError if the article does not exist.
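
The link filtering described for getLinks can be sketched roughly as follows. This is an illustrative sketch only, not the module's actual implementation. It assumes standard wikitext syntax ([[Target|label]] links and {{...}} templates), and the names get_links, strip_templates, LINK_RE and the include_namespaces parameter are hypothetical. It also treats any colon in a link target as a namespace separator, which is a simplification.

```python
import re

# Hypothetical pattern (not taken from the module): captures the target of
# [[Target]], [[Target|label]] or [[Target#section|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def strip_templates(text):
    """Return text with {{...}} templates removed, including nested ones."""
    out = []
    depth = 0  # current template nesting depth
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i) and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


def get_links(text, include_namespaces=False):
    """Return link targets found outside templates, in order of appearance.

    Assumption: a colon in the target marks a namespace (e.g. "Category:X").
    This also skips ordinary titles that happen to contain a colon.
    """
    links = []
    for match in LINK_RE.finditer(strip_templates(text)):
        target = match.group(1).strip()
        if not target:
            continue
        if not include_namespaces and ":" in target:
            continue
        links.append(target)
    return links


if __name__ == "__main__":
    sample = (
        "See [[Python (programming language)|Python]] and "
        "[[Category:Languages]] {{cite|[[Hidden]]}} plus [[Guido van Rossum]]."
    )
    print(get_links(sample))
    # prints ['Python (programming language)', 'Guido van Rossum']
```

Removing templates before matching keeps the link pattern simple, at the cost of one extra pass over the article text.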