Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified:
This tutorial explains how to convert XML to JSON using Python 3. JSON (JavaScript object notation) is a data interchange format commonly used in web-services applications. Some features of CollateX accept only JSON input or produce only JSON output, which means that the user may have to convert input data in other formats to JSON, and may also have to convert CollateX output in JSON to other formats. This tutorial assumes that you are familiar with XML, but not necessarily with JSON. It also assumes that you are familiar with using CollateX from within Python, at least at the level covered in our CollateX-Python tutorial.
CollateX can read input in several formats, including plain text and XML, but if you need to perform your own tokenization (divide the text into words differently from the default) or normalize your word tokens (see the separate discussion in the Normalization unit), the input into CollateX has to be in a CollateX-specific JSON format. The first topic discussed in the present unit, then, addresses how to transform your XML or plain-text input into JSON so that you can control the tokenization and incorporate normalized shadows of your word tokens into the CollateX input. CollateX will then use the normalized shadows to determine when tokens should be considered the same, but it will output the original text from the witness.
CollateX can produce output in several formats, but the native output format is a particular JSON configuration, and certain CollateX operations (including the collation of input with user-specified normalization shadows) can produce only JSON output. This means that the user must then convert the JSON output of CollateX into a more useful form (e.g., TEI, other XML, HTML). The second topic discussed in this unit, then, addresses how to convert CollateX JSON output into XML, which can then be converted to other XML (such as TEI) or to HTML with XSLT.
The XML-to-JSON and JSON-to-XML conversions require an understanding of regular expressions, XSLT, and Python. In this tutorial the XSLT is executed from within Python, which means that we are constrained to use only XSLT 1.0 and XPath 1.0 features, because there is no robust Python support for the current versions of the XSLT and XPath standards. (It is, alternatively, possible to perform the XSLT transformations separately and connect the XSLT and Python processes with I/O pipelining.)
This section describes how to convert XML input (of any type) to JSON for use in CollateX.
CollateX can process pretokenized JSON input with:
collate_pretokenized_json(json_input, output='json')
where json_input is the pretokenized input structure described below (as explained later in this unit, a Python dictionary).
The input JSON file should have the following shape (example from http://collatex.net/doc/#json-input):
{
  "witnesses": [
    {
      "id": "A",
      "tokens": [
        { "t": "A", "ref": 123 },
        { "t": "black", "adj": true },
        { "t": "cat", "id": "xyz" }
      ]
    },
    {
      "id": "B",
      "tokens": [
        { "t": "A" },
        { "t": "white", "adj": true },
        { "t": "kitten.", "n": "cat" }
      ]
    }
  ]
}
JSON is a data structure that consists of a hierarchical (nested) arrangement of JSON objects, each of which has a property name and a value. This is comparable to an XML hierarchy, where elements have a name (comparable to a JSON property name) and contents (comparable to the value of the JSON property). JSON does not have anything analogous to the XML distinction between child elements and attributes; both are expressed as embedded JSON objects with property names and values.
JSON syntax works as follows:
A JSON object is surrounded by curly braces.
Each JSON object has one or more name : value pairs. The name is separated from the value by a colon, and if the JSON object has multiple name : value pairs, those are separated from one another by commas.
The name in a name : value pair must be a string. Strict JSON requires double quotation marks around strings; the Python dictionaries that we use to represent JSON accept matching single or double quotation marks.
The value in a name : value pair may be any of several data types, the most important of which, for our purposes, are strings and lists.
String values are likewise enclosed in quotation marks (double in strict JSON; matching single or double in Python).
For our purposes, lists are lists of JSON objects. The entire list is enclosed in square brackets and the individual list items (JSON objects surrounded by curly braces) are separated from one another by commas.
Whitespace (new-line characters and indentation) is for human legibility and is not informational from a JSON perspective. Although JSON supports other data types, such as numbers and booleans, for the purpose of converting XML to JSON these can be treated as if they were strings.
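These rules can be seen in action with Python's standard json module. The following sketch (the variable names are ours, not part of CollateX) parses a serialized JSON string into a hierarchical structure and serializes it back:

```python
import json

# A JSON serialization as a Python string: one object with a "witnesses"
# property whose value is a list of objects
serialized = '{"witnesses": [{"id": "A", "tokens": [{"t": "cat"}]}]}'

# json.loads() turns the string into the hierarchical structure
data = json.loads(serialized)
print(data['witnesses'][0]['id'])  # -> A

# json.dumps() serializes the hierarchy back to a string; indentation
# is purely for human legibility and carries no information
print(json.dumps(data, indent=2))
```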
In the example above, the root of the JSON input document (comparable to the XML document node) has a single witnesses property (comparable to an XML root element), the value of which is a list of witnesses, each of which is a JSON object (comparable to the children of an XML root element). In this example there are two witnesses. Each witness, in turn, has two properties: the value of the id property is a string that identifies the witness (e.g., a manuscript siglum), and the value of the tokens property is a list of word tokens. Each word token is a JSON object that has an obligatory t property, which contains the string value of the token. All other properties of tokens are optional. The only property that CollateX uses other than t is the n (normalized) property. As described in our separate normalization tutorial, if an n property is specified on the JSON object, CollateX will identify correspondences during collation by comparing n property values. If there is no n property, CollateX compares t properties to identify correspondences. Other properties are for the convenience of the user; CollateX passes them through to the output, but does not use them directly.
The preceding means that to collate JSON input, which is necessary if you want to perform your own tokenization or if you want to use normalization, you must create the JSON structure illustrated above, providing at least a t property for every word token.
As described above, JSON is a hierarchical format, where curly braces delimit JSON objects that may contain other JSON objects. What the human user sees, though, is a linear sequence of individual characters, which we can understand as a serialization of the inherent hierarchical JSON structure as a string. Python can represent either the actual hierarchy (using a Python dictionary) or the serialization (using a Python string). These will look the same to the human user when they are printed to the screen, because the screen displays one character at a time and therefore serializes the hierarchy as part of the rendering process, but internally the dictionary and string representations are different. At present CollateX can accept JSON input only as a Python dictionary, and not as a string. Support for stringified JSON may be added at a later date, but this tutorial assumes that when you create JSON input for CollateX, you must create it as a Python dictionary.
The structure of a Python dictionary is isomorphic with the hierarchical JSON structure, which means that a Python dictionary is capable of representing hierarchical JSON objects directly. As described above, JSON objects are name : value pairs, where the name is a string and the value may be any of several datatypes, including JSON objects, lists, strings, integers, etc. Meanwhile, Python dictionaries consist of key : value pairs that have properties analogous to the name : value properties of JSON objects. This means that converting your XML to JSON for input into CollateX can be thought of as converting your XML to a Python dictionary, which can then be used by CollateX as a JSON representation without further transformation.
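As a concrete sketch, the example input shown earlier can be built directly as a Python dictionary; note that Python's True corresponds to JSON's true, and that json.dumps() is needed only if we want to see the serialization:

```python
import json

# The CollateX input structure as a Python dictionary;
# dictionary keys correspond to JSON property names
json_input = {
    'witnesses': [
        {'id': 'A', 'tokens': [
            {'t': 'A', 'ref': 123},
            {'t': 'black', 'adj': True},
            {'t': 'cat', 'id': 'xyz'},
        ]},
        {'id': 'B', 'tokens': [
            {'t': 'A'},
            {'t': 'white', 'adj': True},
            {'t': 'kitten.', 'n': 'cat'},
        ]},
    ]
}

# The dictionary can be handed to CollateX as is; serializing it
# shows the equivalent JSON (Python True becomes JSON true)
print(json.dumps(json_input, indent=2))
```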
If your input is in XML and you want to pass it to CollateX as JSON (which is necessary if you want to do your own tokenization or normalization), you have to convert your XML to the specific JSON format described above. The complexity of your XML will vary from project to project (this is true even of projects that use TEI, since TEI XML may take different forms), and it is not possible to write a generic Python script that will convert any arbitrary XML (or even any arbitrary TEI) to the JSON format needed for CollateX. This means that you will have to write your own Python code, which knows about your specific XML elements, to perform that transformation. There are, however, three basic XML patterns you are likely to encounter when processing this type of transformation, and we discuss each of them here. If you know how to process these three types of XML patterns, you can write Python code to deal with your specific XML.
This tutorial assumes that all processing (transformation of the input XML to JSON, collation of the JSON with CollateX, transformation of the output JSON to XML) will take place in a single Python program. Alternatively, it is also possible to perform the XML-to-JSON transformation as a separate preprocessing step, using an XSLT transformation tool such as Saxon. Processing the XML with Saxon instead of in Python has the advantage of making XPath 2.0 and XSLT 2.0 resources available (Python supports only XPath 1.0 and XSLT 1.0), along with the possible disadvantage of splitting your collation workflow into separate processes. This tutorial performs all transformations within Python, using XPath 1.0 and XSLT 1.0 strategies.
Simple XML might make no use of internal markup. For example, we might have two witnesses that look like:
<l id="13" n="13">Li salaus se torne al serain</l>
and
<l id="13" n="13">Li solaus se torne al serain</l>
These and the following examples are taken from witnesses A and B, respectively, of Partonopeus de Blois, available at the Oxford Text Archive. The second word tokens differ (salaus vs solaus), but otherwise the readings are the same, and there is no markup inside the <l> elements.
More complex XML might include markup within an individual word token. For example:
<l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
and
<l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
In this example there are both character-level differences and markup differences, and the markup differences are confined to a single word token. Sometimes the token-level markup affects an entire token (e.g., et vs <abbrev>et</abbrev>) and sometimes it affects only part of a token (e.g., maint vs mai<abbrev>n</abbrev>t). Sometimes the variation affects both the character-level data content and the markup (e.g., Ki vs Q<abbrev>ui</abbrev>).
Still more complex XML might have markup that spans more than one token, e.g.:
<l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
and
<l id="1116" n="1024">Nus clers ne v<abbrev>os</abbrev> poroit desc<abbrev>ri</abbrev>re</l>
Here the <crease> element in Witness A begins in the middle of one word token, spans the following two, and ends in the middle of the next. If we are ultimately to treat each word token as a separate XML element (for example, as a separate table cell in an HTML rendering of an alignment table), we will need to translate the spanning element to an alternative representation in order to avoid an overlapping hierarchy conflict.
Each of the preceding types of XML can be represented in JSON in the ways described below. The procedure for incorporating normalized shadow tokens is discussed in a separate unit.
In some cases when you are collating witnesses that have already been tagged in TEI or other XML, the source documents may already contain markup that can be used for alignment purposes. For example, the division of the text in the examples above into <l> elements makes it possible to treat each <l> as a separate collation task, which reduces the likelihood of misaligning tokens that repeat in different lines. In the samples above, the @id attributes match across witnesses, which means that the texts have already been aligned coarsely during initial markup, and the task for CollateX is to refine the alignment so that it operates at the level of the individual word token, instead of the entire <l> element. This type of prealignment is not required, but if it is present in your documents, exploiting it will improve the speed and accuracy of the collation process.
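As an illustrative sketch of exploiting such prealignment (this sketch uses the standard-library xml.etree.ElementTree instead of the lxml library used later in this tutorial, and two abbreviated witnesses), we can map each witness's <l> elements by @id and then process matching lines together:

```python
import xml.etree.ElementTree as ET

witnessA = '<lg wit="A"><l id="21">Laloete cante damor</l><l id="22">Si estrine laube del jor</l></lg>'
witnessB = '<lg wit="B"><l id="21">Laloete chante damor</l><l id="22">Sin estrine laube dou jor</l></lg>'

def lines_by_id(xml):
    """Map each @id value to the text content of its <l> element"""
    root = ET.fromstring(xml)
    return {l.get('id'): l.text for l in root.iter('l')}

a, b = lines_by_id(witnessA), lines_by_id(witnessB)

# Each pair of lines that shares an @id is a separate collation task,
# which keeps repeated words in different lines from being misaligned
for line_id in sorted(set(a) & set(b)):
    print(line_id, '|', a[line_id], '|', b[line_id])
```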
We assume in this tutorial that we want to tokenize the contents of the witnesses (in the preceding examples, the contents of the <l> elements) on whitespace. One complication that arises in tokenizing XML on whitespace is that not all whitespace characters in XML represent word boundaries. For example, in:
<l id="13" n="13">Li salaus se torne al serain</l>
there is whitespace inside the start tag that delimits the attributes; the only whitespace we care about for tokenization purposes is whitespace inside text() nodes. A second complication is that if we do the tokenization within Python, we are limited to XPath 1.0 and XSLT 1.0, which means that we cannot use several powerful features of XPath 2.0 and XSLT 2.0 that would be helpful in this context (e.g., tokenize(), matches(), <xsl:analyze-string>, <xsl:for-each-group>, the except operator, and the << operator).
We cope with these challenges by performing the tokenization with XSLT, rather than with Python regular expression matching against a string, because it is easier to distinguish whitespace in text() nodes from whitespace inside tags using XML-aware processing. We do the processing in two passes: first we replace all whitespace sequences inside text() nodes with empty milestone <w/> tags, and then, on a second pass, we convert those milestones into wrapper <w> elements around each word token.
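Before turning to the XSLT implementation, here is a minimal stand-alone sketch of the idea (standard-library ElementTree; unlike the XSLT version, it handles only a line with no child elements): splitting the text() content on whitespace never touches the whitespace between attributes, and each resulting token is wrapped in a <w> element:

```python
import xml.etree.ElementTree as ET

line = ET.fromstring('<l id="13" n="13">Li salaus se torne al serain</l>')

# Split only the text() content on whitespace; the whitespace inside
# the start tag (between attributes) is never part of a text() node
tokens = line.text.split()

# Wrap each token in a <w> element, preserving the line's attributes
wrapped = ET.Element('l', line.attrib)
for token in tokens:
    w = ET.SubElement(wrapped, 'w')
    w.text = token

print(ET.tostring(wrapped, encoding='unicode'))
```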
As a simplified example, let’s assume that we have two two-line witnesses to the Partonopeus de Blois text, as follows:
<l id="21" n="21">Laloete cante damor</l>
<l id="22" n="22">Si estrine laube del jor</l>
and
<l id="21" n="19">Laloete chante damor</l>
<l id="22" n="20">Sin estrine laube dou jor</l>
This is a simplified example that focuses on creating JSON that can serve as input into CollateX. Real-world examples will be more complicated in the following ways:
First, your documents will probably contain markup above the level of the lines being collated (<text>, <body>, <div>, <lg>, etc.); your Python script can read the entire document and distinguish what to ignore, what to pass through unchanged, and what to collate with CollateX.
Second, if your documents use <l> tags but lack the @id attributes that provide coarse prealignment here, you have to align the entire texts at once with CollateX, since without the @id attributes you cannot determine which line in one witness is supposed to be collated against which line in another. In that case the <l> tags represent a case of internal markup that crosses word boundaries (type 3 above), which is discussed below.
Here is a Python script that can collate this input using CollateX, along with the output of CollateX. We discuss converting the JSON output to XML separately below.
#!/usr/bin/env python3
"""
example1.py
Author: David J. Birnbaum (djbpitt@gmail.com; http://www.obdurodon.org)
Acknowledgements: Revised with the help of Ronald Dekker
First version: 2015-06-25

Collates minimal TEI input <l> elements without internal markup
"""
from collatex import *
from lxml import etree
import re, json


class Witness:
    """An instance of Witness is the etree representation of a witness"""
    def __init__(self, xml):
        self.xml = etree.XML(xml)

    def ids(self):
        return [int(i) for i in self.xml.xpath('//@id')]

    def minId(self):
        return min(self.ids())

    def maxId(self):
        return max(self.ids())

    def lById(self, id):
        """word-tokenized <l> of a witness by @id

        not called directly; used by words(), below
        """
        self.id = str(id)
        line = Line(self.xml, self.id)
        return line.tokenized()

    def words(self, id):
        """word tokens with <w> wrappers removed"""
        self.id = str(id)
        wrappedWords = self.lById(self.id).xpath('//w')
        for w in wrappedWords:
            yield Word(w).unwrap()

    def siglum(self):
        # xpath() returns a list, even if there's just one object
        return str(self.xml.xpath('//lg/@wit')[0])

    def generate_tokens(self, lineId):
        words = self.words(lineId)
        currentTokens = []
        for word in words:
            wordToken = {}
            wordToken['t'] = word
            currentTokens.append(wordToken)
        return currentTokens


class Line:
    """An instance of Line is an <l> in a witness with a specified @id"""
    # The two XSLT transformations are class properties
    # The first replaces whitespace in the input with <w/> milestone tags
    # The second transforms the milestones to wrappers
    xsltAddW = etree.XML('''
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens -->
        <xsl:template match="add | sic | crease">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(., '&#x0A;', ' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input, ' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformAddW = etree.XSLT(xsltAddW)
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)

    def __init__(self, witness, id):
        self.witness = witness
        self.id = id

    def fullLine(self):
        # xpath() returns a list, even if there's just one object
        return self.witness.xpath('//l[@id =' + self.id + ']')[0]

    def lineString(self):
        # lineString() is for diagnosis, and is not used in production
        return etree.tostring(self.fullLine(), encoding='unicode')

    def tokenized(self):
        self.withMilestones = Line.transformAddW(self.fullLine())
        self.withWrappers = Line.transformWrapW(self.withMilestones)
        return self.withWrappers


class Word:
    """An instance of Word is a word token in a line"""
    unwrapRegex = re.compile(r'<.*?>\s*(.*)\s*</.*?>')

    def __init__(self, w):
        self.w = w
        self.stringified = etree.tostring(self.w, encoding='unicode')

    def unwrap(self):
        """Remove <w> tags from around word token"""
        return Word.unwrapRegex.match(self.stringified).group(1)


class WitnessSet:
    """An instance of WitnessSet is the set of witnesses being collated

    In these examples, the witnesses are individual <l> elements, not entire mss
    """
    def __init__(self, witnesses):
        self.witnesses = witnesses

    def get_line_ids(self):
        witnessMin = min([witness.minId() for witness in self.witnesses])
        witnessMax = max([witness.maxId() for witness in self.witnesses])
        return range(witnessMin, witnessMax + 1)

    def generate_block(self, lineId):
        block = {}
        witnesses = []
        for witness in self.witnesses:
            currentWitness = {}
            currentWitness['id'] = witness.siglum()
            currentWitness['tokens'] = witness.generate_tokens(lineId)
            witnesses.append(currentWitness)
        block['witnesses'] = witnesses
        return block

    def generate_blocks_by_line(self):
        for lineId in self.get_line_ids():
            block = self.generate_block(lineId)
            yield block


def main():
    witnessA = """
    <lg wit="A">
        <l id="21" n="21">Laloete cante damor</l>
        <l id="22" n="22">Si estrine laube del jor</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="21" n="19">Laloete chante damor</l>
        <l id="22" n="20">Sin estrine laube dou jor</l>
    </lg>
    """
    # witnessC and witnessD are not used in this example
    witnessC = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessD = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collation = collate_pretokenized_json(block)
        print(collation)


if __name__ == "__main__":
    main()
If you run this, the output is:
+---+---------+--------+-------+
| A | Laloete | cante  | damor |
| B | Laloete | chante | damor |
+---+---------+--------+-------+

+---+-----+---------+-------+-----+-----+
| A | Si  | estrine | laube | del | jor |
| B | Sin | estrine | laube | dou | jor |
+---+-----+---------+-------+-----+-----+
Word tokens that contain no markup can be collated on their string value, but if we want to recognize that, for example, maint and mai<abbrev>n</abbrev>t are a perfect string match once we ignore the markup, we can create an n property for the JSON word token object that strips the internal markup. The changes to the preceding Python script that perform that normalization are the following: we define two regular expressions in the Word class, puncRegex and tagRegex, to match punctuation and tags, respectively, and we use those plus the str.lower() method in a new normalizeToken() method to create the normalized shadow token. As noted above, if we specify an n property in our JSON word token, CollateX will use that property to perform matching and alignment, but it will return the original non-normalized token in the alignment table.

#!/usr/bin/env python3
"""
example2.py
Author: David J. Birnbaum (djbpitt@gmail.com; http://www.obdurodon.org)
Acknowledgements: Revised with the help of Ronald Dekker
First version: 2015-06-25

Collates TEI input <l> elements with token-internal markup
Based on example1.py
"""
from collatex import *
from lxml import etree
import re, json, string


class Witness:
    """An instance of Witness is the etree representation of a witness"""
    def __init__(self, xml):
        self.xml = etree.XML(xml)

    def ids(self):
        return [int(i) for i in self.xml.xpath('//@id')]

    def minId(self):
        return min(self.ids())

    def maxId(self):
        return max(self.ids())

    def lById(self, id):
        """word-tokenized <l> of a witness by @id

        not called directly; used by words(), below
        """
        self.id = str(id)
        line = Line(self.xml, self.id)
        return line.tokenized()

    def words(self, id):
        """word tokens as Word objects (still wrapped in <w> tags)"""
        self.id = str(id)
        wrappedWords = self.lById(self.id).xpath('//w')
        for w in wrappedWords:
            yield Word(w)

    def siglum(self):
        # xpath() returns a list, even if there's just one object
        return str(self.xml.xpath('//lg/@wit')[0])

    def generate_tokens(self, lineId):
        words = self.words(lineId)
        currentTokens = []
        for word in words:
            wordToken = {}
            wordToken['t'] = word.unwrap()
            wordToken['n'] = word.normalizeToken()
            currentTokens.append(wordToken)
        return currentTokens


class Line:
    """An instance of Line is an <l> in a witness with a specified @id"""
    # The two XSLT transformations are class properties
    # The first replaces whitespace in the input with <w/> milestone tags
    # The second transforms the milestones to wrappers
    xsltAddW = etree.XML('''
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens -->
        <xsl:template match="add | sic | crease">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(., '&#x0A;', ' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input, ' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformAddW = etree.XSLT(xsltAddW)
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)

    def __init__(self, witness, id):
        self.witness = witness
        self.id = id

    def fullLine(self):
        # xpath() returns a list, even if there's just one object
        return self.witness.xpath('//l[@id =' + self.id + ']')[0]

    def lineString(self):
        # lineString() is for diagnosis, and is not used in production
        return etree.tostring(self.fullLine(), encoding='unicode')

    def tokenized(self):
        self.withMilestones = Line.transformAddW(self.fullLine())
        self.withWrappers = Line.transformWrapW(self.withMilestones)
        return self.withWrappers


class Word:
    """An instance of Word is a word token in a line"""
    unwrapRegex = re.compile(r'<.*?>\s*(.*)\s*</.*?>')
    puncRegex = re.compile('[' + string.punctuation + ']+')
    tagRegex = re.compile(r'<.*?>')

    def __init__(self, w):
        self.w = w
        self.stringified = etree.tostring(self.w, encoding='unicode')

    def unwrap(self):
        """Remove <w> tags from around word token"""
        return Word.unwrapRegex.match(self.stringified).group(1)

    def normalizeToken(self):
        """Create shadow 'n' property: lowercase, strip tags, strip punctuation"""
        return Word.puncRegex.sub('', Word.tagRegex.sub('', self.stringified.lower()))


class WitnessSet:
    """An instance of WitnessSet is the set of witnesses being collated

    In these examples, the witnesses are individual <l> elements, not entire mss
    """
    def __init__(self, witnesses):
        self.witnesses = witnesses

    def get_line_ids(self):
        witnessMin = min([witness.minId() for witness in self.witnesses])
        witnessMax = max([witness.maxId() for witness in self.witnesses])
        return range(witnessMin, witnessMax + 1)

    def generate_block(self, lineId):
        block = {}
        witnesses = []
        for witness in self.witnesses:
            currentWitness = {}
            currentWitness['id'] = witness.siglum()
            currentWitness['tokens'] = witness.generate_tokens(lineId)
            witnesses.append(currentWitness)
        block['witnesses'] = witnesses
        return block

    def generate_blocks_by_line(self):
        for lineId in self.get_line_ids():
            block = self.generate_block(lineId)
            yield block


def main():
    witnessA = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collation = collate_pretokenized_json(block)
        print(collation)


if __name__ == "__main__":
    main()
If you run this, the output is:
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
| A | Ki                   | maint                  | <abbrev>et</abbrev> | el | pere | et                  | el | fis |
| B | Q<abbrev>ui</abbrev> | mai<abbrev>n</abbrev>t | <abbrev>et</abbrev> | el | pere | <abbrev>et</abbrev> | el | fis |
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+

+---+----------------------+----+-------+------+-------+
| A | Ki                   | ma | doune | soie | merci |
| B | Q<abbrev>ui</abbrev> | ma | done  | soie | merci |
+---+----------------------+----+-------+------+-------+
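The normalization step can also be sketched in isolation with the standard re and string modules (this stand-alone version adds re.escape() around the punctuation characters as a defensive measure; the function name normalize is ours):

```python
import re, string

tagRegex = re.compile(r'<.*?>')  # any XML tag
puncRegex = re.compile('[' + re.escape(string.punctuation) + ']+')  # runs of punctuation

def normalize(stringified_token):
    """Create the shadow 'n' value for a serialized word token:
    lowercase, strip tags, then strip punctuation"""
    return puncRegex.sub('', tagRegex.sub('', stringified_token.lower()))

print(normalize('mai<abbrev>n</abbrev>t'))  # -> maint
print(normalize('Q<abbrev>ui</abbrev>'))    # -> qui
print(normalize('kitten.'))                 # -> kitten
```

With the n values identical, CollateX can align mai<abbrev>n</abbrev>t with maint while still reporting the original marked-up token in its output.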
If we want to visualize the alignment table in HTML or other XML, we have to avoid overlapping elements. In the preceding example there is no risk of overlap because the markup that we retain in the word tokens is wholly confined to a single token. But in an example like:
<l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
we cannot safely create output with the <crease> start tag inside one word token and the end tag inside another, since if we were to do that and then wrap each word token in tags, we would create an overlap situation that violates XML well-formedness constraints. To deal with this issue we identify the tags that may span word tokens and convert those to empty milestone tags. In the first XSLT stylesheet embedded in the Python script above, we've flagged <add>, <sic>, and <crease> to be treated in this way. If we change the input XML in the script's main() function to include an example of <crease> that spans word tokens:
    witnessA = """
    <lg wit="A">
        <l id="1115" n="1097">La cambre est de marbre porfire</l>
        <l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="1115" n="1023">La cambre est d<abbrev>un</abbrev> marbre porfire</l>
        <l id="1116" n="1024">Nus clers ne v<abbrev>os</abbrev> poroit desc<abbrev>ri</abbrev>re</l>
    </lg>
    """
the output is:
+---+----+--------+-----+----------------------+--------+---------+
| A | La | cambre | est | de                   | marbre | porfire |
| B | La | cambre | est | d<abbrev>un</abbrev> | marbre | porfire |
+---+----+--------+-----+----------------------+--------+---------+

+---+-----+--------------------------+----+----------------------+-------------------------+---------------------------+
| A | Nus | cler<crease n="start"/>s | ne | vos                  | poroit<crease n="end"/> | desc<abbrev>ri</abbrev>re |
| B | Nus | clers                    | ne | v<abbrev>os</abbrev> | poroit                  | desc<abbrev>ri</abbrev>re |
+---+-----+--------------------------+----+----------------------+-------------------------+---------------------------+
In the Python script above, we use the collate_pretokenized_json() function of CollateX to perform collation of the JSON input we've prepared, and by default it creates output in the form of a horizontal alignment table. The only type of output that CollateX can create directly from JSON input other than an alignment table is JSON. For example, if we specify output='json', the JSON output of the preceding collation looks like:
{ "table": [ [ [{ "n": "ki", "t": "Ki" }], [{ "n": "maint", "t": "maint" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "pere", "t": "pere" }], [{ "n": "et", "t": "et" }], [{ "n": "el", "t": "el" }], [{ "n": "fis", "t": "fis" }] ], [ [{ "n": "qui", "t": "Q<abbrev>ui<\/abbrev>" }], [{ "n": "maint", "t": "mai<abbrev>n<\/abbrev>t" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "pere", "t": "pere" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "fis", "t": "fis" }] ] ], "witnesses": [ "A", "B" ] }
The JSON output has two properties, table and witnesses. The value of the witnesses property is a list of strings that are the witness identifiers (A and B in the example above). The value of the table property is a list, the items of which are themselves lists. Each of those sublists itself contains sub-sublists, each of which contains only one item, a JSON object that represents a single word token. The JSON object returns all of the properties that were present on input, which in our example are just t and n, but if you wanted to include other information from the input (e.g., a line number) and retrieve it on output, you could assign it to an arbitrary JSON property (other than t and n, which are reserved property names that have predefined meanings) and it would pass through the collation process unchanged and be reported in the output.
If we eventually want to output some specific type of XML (such as HTML, TEI, or something else), we can start by converting the JSON output of the collation process to a simple XML that observes the same hierarchy as the JSON. We can then convert the simple XML to our desired final form with XSLT. For the alignment table:
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
| A | Ki                   | maint                  | <abbrev>et</abbrev> | el | pere | et                  | el | fis |
| B | Q<abbrev>ui</abbrev> | mai<abbrev>n</abbrev>t | <abbrev>et</abbrev> | el | pere | <abbrev>et</abbrev> | el | fis |
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
our simple XML will have the following structure:
<block>
   <witnesses>
      <i>A</i>
      <i>B</i>
   </witnesses>
   <table>
      <i>
         <i><i><t>Ki</t><n>ki</n></i></i>
         <i><i><t>maint</t><n>maint</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>pere</t><n>pere</n></i></i>
         <i><i><t>et</t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>fis</t><n>fis</n></i></i>
      </i>
      <i>
         <i><i><t>Q<abbrev>ui</abbrev></t><n>qui</n></i></i>
         <i><i><t>mai<abbrev>n</abbrev>t</t><n>maint</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>pere</t><n>pere</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>fis</t><n>fis</n></i></i>
      </i>
   </table>
</block>
The following modifications to the Python script create the XML output:
# Dictionary to XML conversion (converted to Python 3) based on
# http://code.activestate.com/recipes/577882-convert-a-nested-python-data-structure-to-xml/
def data2xml(d, name='data'):
    r = etree.Element(name)
    return buildxml(r, d)

def buildxml(r, d):
    if isinstance(d, dict):
        for k, v in d.items():
            s = etree.SubElement(r, k)
            buildxml(s, v)
    elif isinstance(d, tuple) or isinstance(d, list):
        for v in d:
            s = etree.SubElement(r, 'i')
            buildxml(s, v)
    elif isinstance(d, str):
        r.text = d
    else:
        r.text = str(d)
    return r

def main():
    witnessA = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collationTable = collate_pretokenized_json(block)
        collation = collate_pretokenized_json(block, output='json')
        # unescape the markup inside <t> values, which etree.tostring() escaped
        print(re.sub('&amp;', '&',
                     re.sub('&gt;', '>',
                            re.sub('&lt;', '<',
                                   etree.tostring(data2xml(json.loads(collation), name='block'),
                                                  pretty_print=True, encoding='unicode')))))
        print(collationTable)
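As a quick sanity check, the data2xml()/buildxml() recipe can be run in isolation on a toy structure shaped like the collation JSON. This sketch uses the standard library's xml.etree.ElementTree, which shares the Element/SubElement/tostring API that the recipe relies on (the tutorial's script uses lxml for its pretty_print option):

```python
import xml.etree.ElementTree as etree

# Self-contained copy of the recipe above, for experimentation.
def data2xml(d, name='data'):
    r = etree.Element(name)
    return buildxml(r, d)

def buildxml(r, d):
    if isinstance(d, dict):
        for k, v in d.items():
            s = etree.SubElement(r, k)
            buildxml(s, v)
    elif isinstance(d, (tuple, list)):
        for v in d:
            s = etree.SubElement(r, 'i')
            buildxml(s, v)
    else:
        r.text = d if isinstance(d, str) else str(d)
    return r

# A toy structure shaped like the collation JSON above:
# two witnesses, one aligned column each.
data = {"witnesses": ["A", "B"],
        "table": [[[{"t": "Ki", "n": "ki"}]],
                  [[{"t": "Qui", "n": "qui"}]]]}
print(etree.tostring(data2xml(data, name='block'), encoding='unicode'))
# <block><witnesses><i>A</i><i>B</i></witnesses><table><i><i><i><t>Ki</t><n>ki</n></i></i></i><i><i><i><t>Qui</t><n>qui</n></i></i></i></table></block>
```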
The XML output can be converted to an HTML table with:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"> <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/> <xsl:strip-space elements="*"/> <xsl:template match="/"> <html> <head> <title>Collation output</title> </head> <body> <h1>Collation output</h1> <xsl:apply-templates select="//table"/> </body> </html> </xsl:template> <xsl:template match="t"> <td> <xsl:apply-templates/> </td> </xsl:template> <xsl:template match="*"> <xsl:element name="{name()}"> <xsl:apply-templates/> </xsl:element> </xsl:template> <xsl:template match="table/i"> <tr> <th> <xsl:value-of select="//witnesses/i[position() eq count(current()/preceding-sibling::i) + 1]"/> </th> <xsl:apply-templates/> </tr> </xsl:template> <xsl:template match="table/i/i"> <xsl:apply-templates/> </xsl:template> <xsl:template match="table/i/i/i"> <xsl:apply-templates select="t"/> </xsl:template> <xsl:template match="abbrev"> <xsl:text>(</xsl:text> <xsl:apply-templates/> <xsl:text>)</xsl:text> </xsl:template> </xsl:stylesheet>
with the following output:
+---+-------+---------+------+----+------+------+----+-----+
| A | Ki    | maint   | (et) | el | pere | et   | el | fis |
| B | Q(ui) | mai(n)t | (et) | el | pere | (et) | el | fis |
+---+-------+---------+------+----+------+------+----+-----+