Computer-supported collation with CollateX


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 4.0 International License]


The terms below are listed roughly in the order in which they are introduced in the workshop.

Witness
A witness is a version of a work that is being compared to other versions. Examples of witnesses used in this workshop are manuscript copies of the Partonopeus de Blois text and different editions of Charles Darwin’s Origin of species. We discuss witnesses in Unit 3 of the workshop.
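In the Python version of CollateX, each witness is registered with a siglum and its text. A minimal sketch (the sigla "A" and "B" and the one-line witness texts are invented for illustration):

```python
# A minimal sketch of collating two witnesses with the Python version of
# CollateX; the sigla ("A", "B") and witness texts are invented examples.
from collatex import *

collation = Collation()
collation.add_plain_witness("A", "The gray koala")  # siglum, witness text
collation.add_plain_witness("B", "The grey koala")

alignment_table = collate(collation)  # returns the default alignment table
print(alignment_table)
```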
Segment, segmentation
CollateX performance is sensitive to the volume of data: both the quality of the output and the speed with which it is generated are better with shorter texts. For that reason, users typically subdivide large texts into smaller units, which we call segments. Each segment can then be processed as a separate collation task. Examples of segments are chapters in modern prose (such as the Darwin files), acts or scenes in plays (as in the Julien play), and line groups or lines of verse (as in the Partonopeus de Blois files). We discuss segmentation in Unit 3 of the workshop.
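One way to implement this is to run one collation task per segment, as in the sketch below, where the chapter lists are hypothetical stand-ins for real witness data:

```python
# Sketch: collate a long text one segment (here, one chapter) at a time.
# The chapter lists below are hypothetical stand-ins for real witness data.
from collatex import *

witness_a_chapters = ["First chapter of witness A", "Second chapter of witness A"]
witness_b_chapters = ["First chapter of witness B", "Second chapter of witness B"]

for number, (a, b) in enumerate(zip(witness_a_chapters, witness_b_chapters), start=1):
    collation = Collation()  # one independent collation task per segment
    collation.add_plain_witness("A", a)
    collation.add_plain_witness("B", b)
    print("Chapter", number)
    print(collate(collation))
```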
Token, tokenization
Collation compares the individual words of the witnesses and tries to align them. Tokenization is the process of dividing a witness text into words (also called word tokens) so that they can be compared. A word token is not necessarily what a human would consider a word: by default CollateX treats punctuation marks as word tokens, so that the abbreviation A’dam (for Amsterdam) is separated into three tokens (A, ’, and dam). The user can override the default behavior by specifying an alternative tokenization strategy (such as splitting only on white space, treating punctuation as part of the adjacent word, etc.), as in the sketch below. We discuss tokenization in Unit 6 of the workshop.
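Because CollateX accepts pretokenized input (see JSON and Normalization below), an alternative tokenization strategy can be implemented outside CollateX itself. The following sketch shows a hypothetical whitespace-only tokenizer that produces CollateX-style word tokens:

```python
# Sketch of a hypothetical whitespace-only tokenizer that produces
# CollateX-style word tokens; under this strategy "A'dam" remains a single
# token, and punctuation stays attached to the adjacent word.
def tokenize(text):
    return [{"t": token} for token in text.split()]

print(tokenize("We visited A'dam last year."))
# [{'t': 'We'}, {'t': 'visited'}, {'t': "A'dam"}, {'t': 'last'}, {'t': 'year.'}]
```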
JSON
JSON (JavaScript object notation) is a data interchange format commonly used in web-services applications. Some features of CollateX accept only JSON input or produce only JSON output.
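For example, CollateX expects pretokenized witness input as a JSON object with a witnesses list; each witness has an id (its siglum) and a tokens list, and each token carries its literal text in a t property. The texts below are invented for illustration:

```python
# The overall shape of CollateX JSON witness input: a "witnesses" list in
# which each witness has an "id" (its siglum) and a "tokens" list, and each
# token carries its literal text in a "t" property. Invented example texts.
json_input = {
    "witnesses": [
        {"id": "A", "tokens": [{"t": "The"}, {"t": "gray"}, {"t": "koala"}]},
        {"id": "B", "tokens": [{"t": "The"}, {"t": "grey"}, {"t": "koala"}]},
    ]
}
```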
Normalization
By default CollateX compares literal word tokens, which may include variation that the developer does not consider meaningful. For example, upper- and lower-case letters are treated as different. If the user creates a shadow n (normalized) property alongside the literal t property of the JSON word-token object, CollateX will compare the n values instead of the literal string values of the word tokens. We discuss normalization in Unit 6 of the workshop.
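The sketch below builds tokens whose n values are lowercased versions of their t values, so that The and the are treated as matches. Passing the pretokenized JSON directly to collate() follows recent versions of the Python library; older versions used a separate entry point for JSON input:

```python
# Sketch: each token keeps its literal reading in "t" and a lowercased
# shadow value in "n"; CollateX then matches on the "n" values, so "The"
# and "the" align even though their literal strings differ.
from collatex import *

def tokenize(text):
    return [{"t": token, "n": token.lower()} for token in text.split()]

json_input = {
    "witnesses": [
        {"id": "A", "tokens": tokenize("The grey koala")},
        {"id": "B", "tokens": tokenize("the gray koala")},
    ]
}

# segmentation=False keeps adjacent matching tokens from being merged,
# so the table reports the alignment token by token.
print(collate(json_input, segmentation=False))
```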
Matching
The process of collation involves identifying which word tokens in one witness should be considered matches for word tokens in the other witnesses. Real texts of any appreciable length typically involve the repetition of word tokens, so matching involves not only identifying which tokens are equivalent, but also determining which of the equivalent tokens should be considered matches (correspondences) on the textual level. Matching is discussed in Unit 7 of the workshop.
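The following sketch (with invented witness texts) collates two witnesses in which the word the occurs twice, so CollateX must decide which occurrences correspond:

```python
# Sketch (invented witness texts): "the" occurs twice in each witness, so
# CollateX must decide which occurrences correspond; the alignment table
# makes its decision visible token by token.
from collatex import *

collation = Collation()
collation.add_plain_witness("A", "the cat chased the dog")
collation.add_plain_witness("B", "the dog chased the cat")
print(collate(collation, segmentation=False))
```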
Variant graph
The internal collation object that CollateX constructs is called a variant graph. It can be visualized in SVG (scalable vector graphics); see the example at The data model: variant graphs in the main CollateX documentation. Variant graphs of any appreciable size or complexity can be difficult for a human to read, but they may be useful for debugging a collation.
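In the Python version of CollateX, the SVG visualization can be requested with output="svg". This relies on a Graphviz installation, and in a Jupyter notebook the graph is rendered inline; the witness texts below are invented examples:

```python
# Sketch: request the variant graph as SVG from the Python version of
# CollateX (depends on Graphviz; a Jupyter notebook renders the graph
# inline). Witness texts are invented examples.
from collatex import *

collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The grey koala")
collate(collation, output="svg")
```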
Alignment table, ranks
The variant graph can be serialized as a sequence of sets of collated tokens, called ranks, which can be rendered as an alignment table. The table can be oriented vertically or horizontally. In a horizontal alignment table, which is the default, the rows represent the witnesses and the columns represent the ranks. In a vertical table, the rows represent ranks and the columns represent witnesses. Alignment tables are a common user-friendly visualization of patterns of agreement across witnesses. In traditional paper publication of critical editions, an alignment table is sometimes called an interlinear collation.
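In the Python version, the layout parameter selects the orientation (witness texts invented for illustration):

```python
# Sketch: the same collation serialized as a horizontal table (the default,
# rows are witnesses) and as a vertical table (rows are ranks).
from collatex import *

collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The grey koala")
print(collate(collation))                     # horizontal (default)
print(collate(collation, layout="vertical"))  # vertical
```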
TEI parallel segmentation
The TEI (Text Encoding Initiative) recommends three different strategies for encoding variation across witnesses. The most popular of these is parallel segmentation, described in the Parallel segmentation method portion of the TEI guidelines.
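In the Python version of CollateX, TEI output can be requested with output="tei"; points of variation are reported inline as app elements containing rdg readings, following the parallel segmentation method (witness texts below are invented):

```python
# Sketch: request TEI output from the Python version of CollateX; points of
# variation are reported inline as <app> elements containing <rdg> readings.
from collatex import *

collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The grey koala")
print(collate(collation, output="tei"))
```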
Critical apparatus
A critical apparatus is a set of typographic conventions for publishing a compact report of variation across witnesses. Typically one version of the text (the copy text), which may be a real witness or a reconstruction by the editors, is printed in full, and variants from other witnesses (control texts) are listed in footnotes. Because digital editions may lack pagination, adapting the traditional critical-edition layout to digital publication requires deciding where the footnotes should go.