Computer-supported
collation with CollateX
Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
The terms below are listed roughly in the order in which they are introduced
in the workshop.
- Witness
- A witness is a version of a work that is being compared to other
versions. Examples of witnesses used in this workshop are manuscript copies of the
Partonopeus de Blois text or
different editions of Charles
Darwin’s Origin of Species. We discuss witnesses in Unit 3 of
the workshop.
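As a minimal sketch, witnesses are registered in the Python collatex library like this (the sample sentences are invented for illustration):

    from collatex import *

    # Create a collation and add two witnesses, each identified by a siglum
    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox jumped over the lazy dog.")
    collation.add_plain_witness("B", "The brown fox jumped over the dog.")

    # Collate and print the default alignment table
    print(collate(collation))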
- Segment, segmentation
- CollateX performance is sensitive to the volume of data, and the quality of the
output and the speed with which it is generated are better with shorter texts. For
that reason, users typically subdivide large texts into smaller units, which we call
segments. Each segment can then be processed as a separate collation
task. Examples of segments are chapters in modern prose (such as the Darwin files),
acts or scenes in plays (as in the Julien play), and line groups or
lines of verse (as in the Partonopeus de Blois files). We discuss segmentation in
Unit 3 of the workshop.
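CollateX does not segment texts for you. Here is a sketch of one possible workflow, assuming witnesses that mark chapter breaks with an invented CHAPTER delimiter:

    from collatex import *

    # Hypothetical helper: split a witness into segments on a delimiter
    def segments(text, delimiter="CHAPTER"):
        return [s.strip() for s in text.split(delimiter) if s.strip()]

    witness_a = "CHAPTER It was the best of times CHAPTER It was the worst of times"
    witness_b = "CHAPTER It was the best of tymes CHAPTER It was the worst of times"

    # Collate each pair of corresponding segments as a separate, smaller task
    for seg_a, seg_b in zip(segments(witness_a), segments(witness_b)):
        collation = Collation()
        collation.add_plain_witness("A", seg_a)
        collation.add_plain_witness("B", seg_b)
        print(collate(collation))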
- Token, tokenization
- Collation compares the individual words of the witnesses and tries to align them.
Tokenization is the process of dividing a witness text into words (also called
word tokens) so that they can be compared. A word token is not
necessarily what a human would consider a word. For example, by default CollateX
treats punctuation marks as word tokens, so that the abbreviation A’dam (for
Amsterdam) is separated into three tokens (A, ’, and dam). The user can override
the default behavior by specifying an alternative tokenization strategy (such as
splitting only on white space, treating punctuation as part of the adjacent word,
etc.). We discuss tokenization in Unit 6
of the workshop.
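As a sketch, a custom tokenization can be supplied as pretokenized JSON-style input, where each token object carries a t (text) property; here A’dam is deliberately kept as a single token:

    from collatex import *

    # Pretokenized witnesses: CollateX uses our tokens instead of its own splitter
    json_input = {
        "witnesses": [
            {"id": "A", "tokens": [{"t": "We"}, {"t": "visited"}, {"t": "A’dam"}]},
            {"id": "B", "tokens": [{"t": "We"}, {"t": "visited"}, {"t": "Amsterdam"}]}
        ]
    }
    print(collate(json_input))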
- JSON
JSON (JavaScript Object Notation) is a data interchange format commonly used in
web-services applications. Some features of CollateX accept only JSON input or
produce only JSON output.
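For example, the Python collatex library can return its results as JSON (a sketch; the witness texts are invented):

    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox")
    collation.add_plain_witness("B", "The brown fox")

    # Request the collation result serialized as JSON instead of a plain-text table
    print(collate(collation, output="json"))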
- Normalization
- By default CollateX compares literal word tokens, which may include variation that
the developer does not consider meaningful. For example, by default upper- and
lower-case letters are treated as different. If the user creates a shadow n
property for the JSON word token object, CollateX will compare the n values
instead of the literal string values of the word tokens.
We discuss normalization in Unit 6 of the workshop.
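A sketch of the shadow n property in pretokenized input; CollateX aligns on the n values, so the tokens The and the match, while the output preserves the literal t readings:

    from collatex import *

    json_input = {
        "witnesses": [
            {"id": "A", "tokens": [{"t": "The", "n": "the"}, {"t": "End", "n": "end"}]},
            {"id": "B", "tokens": [{"t": "the", "n": "the"}, {"t": "end", "n": "end"}]}
        ]
    }
    # Tokens are compared by their normalized "n" values, not their literal "t" values
    print(collate(json_input))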
- Matching
The process of collation involves identifying which word tokens in the witnesses
should be considered matches for word tokens in other witnesses. Real texts of any
appreciable length typically involve the repetition of word tokens, so matching
involves not only identifying which tokens are equivalent, but also determining
which of the equivalent tokens should be considered matches (correspondences) on
the textual level. Matching is
discussed in Unit 7 of the workshop.
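By default tokens match only when their (normalized) values are identical. As a sketch, the Python collatex library also supports fuzzy matching through the near_match parameter, which requires turning segmentation off:

    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The big gray koala")
    collation.add_plain_witness("B", "The big grey koala")

    # near_match lets close-but-not-identical tokens (gray/grey) align
    print(collate(collation, near_match=True, segmentation=False))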
- Variant graph
- The internal collation object that CollateX constructs is called a variant
graph. It can be visualized in SVG (scalable vector graphics); see the
example at The data model: variant
graphs in the main CollateX documentation. Variant graphs of any appreciable
size or complexity can be difficult for a human to read, but they may be useful for
debugging a collation.
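A sketch of rendering the variant graph as SVG (this assumes Graphviz is installed; in a Jupyter notebook the graph displays inline):

    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox")
    collation.add_plain_witness("B", "The brown fox")

    # Render the internal variant graph as SVG for inspection or debugging
    collate(collation, output="svg")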
- Alignment table, ranks
The variant graph can be serialized as an alignment table: a sequence of sets of
collated tokens, called ranks. The table can be oriented vertically or horizontally. In a
horizontal alignment table, which is the default, the rows represent the witnesses
and the columns represent the ranks. In a vertical table, the rows represent ranks
and the columns represent witnesses. Alignment tables are a common user-friendly
visualization of patterns of agreement across witnesses. In traditional paper
publication of critical editions, an alignment table is sometimes called an
interlinear collation.
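A sketch of both orientations in the Python collatex library:

    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox")
    collation.add_plain_witness("B", "The brown fox")

    # Default: horizontal table (rows are witnesses, columns are ranks)
    print(collate(collation))

    # Vertical table (rows are ranks, columns are witnesses)
    print(collate(collation, layout="vertical"))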
- TEI parallel segmentation
- The TEI (Text Encoding Initiative) recommends three different strategies for
encoding variation across witnesses. The most popular of these is parallel
segmentation, described in the Parallel segmentation method portion of the TEI
guidelines.
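As a sketch, the Python collatex library can serialize a collation in TEI parallel segmentation form:

    from collatex import *

    collation = Collation()
    collation.add_plain_witness("A", "The quick brown fox")
    collation.add_plain_witness("B", "The brown fox")

    # Serialize the result as TEI parallel-segmentation XML (app/rdg elements)
    print(collate(collation, output="tei"))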
- Critical apparatus
A critical apparatus is a set of typographic conventions for publishing a compact
report of variation across witnesses. Typically one version of the text (the
copy text), which may be a real witness or a reconstruction by the
editors, is printed in full, and variants from other witnesses (control
texts) are listed in footnotes. Because digital editions may lack
pagination, adapting traditional critical-edition layout to digital publication
requires deciding where the footnotes should go.