CollateX and XML, Part 2

David J. Birnbaum, 2015-06-29

This example collates a single line of XML from four witnesses. In Part 1 we spelled out the details step by step in a way that would not be used in a real project, but that made it easy to see how each step moves toward the final result. In Part 2 we employ three classes (WitnessSet, Line, Word) to make the code more extensible and adaptable.

The sample input is still a single line for four witnesses, given as strings within the Python script. This time, though, the witness identifier (siglum) is given as an attribute on the XML input line.

Load libraries. Unchanged from Part 1.

In [1]:
from collatex import *
from lxml import etree
import json,re

The WitnessSet class represents all of the witnesses being collated. The generate_json_input() method returns a Python dictionary whose structure matches the JSON input that CollateX expects for pretokenized witnesses.

At the moment each witness contains just one line (<l> element), so the entire witness is treated as a line. In future parts of this tutorial, the lines will be processed individually, segmenting the collation task into subtasks that collate just one line at a time.

In [2]:
class WitnessSet:
    def __init__(self,witnessList):
        self.witnessList = witnessList
    def generate_json_input(self):
        json_input = {}
        witnesses = []
        json_input['witnesses'] = witnesses
        for witness in self.witnessList:
            # each witness is currently a single <l> element
            line = Line(witness)
            witnessData = {}
            witnessData['id'] = line.siglum()
            witnessData['tokens'] = line.tokens()
            # without this append the 'witnesses' list would stay empty
            witnesses.append(witnessData)
        return json_input
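
For orientation, the shape of the structure that generate_json_input() builds can be sketched with plain dictionaries. This is only an illustration: build_witness() is a hypothetical helper that is not part of the tutorial code, and lowercasing stands in for the fuller normalization done by the Word class.

```python
import json

def build_witness(siglum, words):
    """Hypothetical helper: one witness entry with an id and a list of tokens."""
    # "t" holds the token text and "n" its normalized form; lowercasing is
    # only a stand-in for the normalization that the Word class performs
    return {'id': siglum, 'tokens': [{'t': w, 'n': w.lower()} for w in words]}

# Hand-typed words for witness D, which contains no markup
json_input = {'witnesses': [build_witness('D', ['E', 'cil', 'i', 'partent', 'sulement'])]}
print(json.dumps(json_input, indent=2))
```

The top-level 'witnesses' list holds one such entry per witness, in the order in which the witnesses were passed to WitnessSet.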

The Line class contains methods applied to individual lines (note that each witness in this part of the tutorial consists of only a single line). The XSLT stylesheets and the functions to use them have been moved into the Line class, since they apply to individual lines. The siglum() method returns the manuscript identifier and the tokens() method returns a list of JSON objects, one for each word token.

With a witness that contained more than one line, the siglum would be a property of the witness and the tokens would be a property of each line of the witness. In this part of the tutorial, since each witness has only one line, the siglum is recorded as an attribute of the line, rather than of an XML ancestor that contains all of the lines of the witness.

In [3]:
class Line:
    # Stage 1: insert an empty <w/> milestone before each word
    addWMilestones = etree.XML("""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens
        -->
        <xsl:template match="add | sic | crease ">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="note"/>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(.,'&#x0a;',' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:when test="starts-with($input,' ')">
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring($input,2)"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input,' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    """)
    transformAddW = etree.XSLT(addWMilestones)
    # Stage 2: wrap everything between consecutive milestones in <w>...</w>
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)
    def __init__(self,line):
        self.line = line
    def siglum(self):
        # the siglum is the value of the wit attribute on the <l> element
        return str(etree.XML(self.line).xpath('/l/@wit')[0])
    def tokens(self):
        # run both transformations, then build one token per <w> element
        wrapped = Line.transformWrapW(Line.transformAddW(etree.XML(self.line)))
        return [Word(token).createToken() for token in wrapped.xpath('//w')]
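
The tooFar variable in xsltWrapW selects, for each milestone, the next milestone and everything after it, and the count() comparison keeps only the following siblings that are not in that set. The same grouping logic can be sketched in plain Python over a flattened list. This is a simplified illustration rather than the tutorial's method: the list, the 'W' marker, and the function name are all assumptions made for the example.

```python
# A flattened stand-in for the children of <l> after milestones are inserted:
# the string 'W' marks a <w/> milestone; every other item is content.
siblings = ['W', '<abbrev>Et</abbrev>', 'cil', 'W', 'i', 'W', 'partent', 'W', 'seulement']

def group_between_milestones(nodes, marker='W'):
    """For each milestone, join the nodes that follow it up to the next milestone."""
    groups = []
    for position, node in enumerate(nodes):
        if node != marker:
            continue
        following = nodes[position + 1:]
        # everything from the next milestone onward is "too far"
        too_far_start = following.index(marker) if marker in following else len(following)
        groups.append(''.join(following[:too_far_start]))
    return groups

words = group_between_milestones(siblings)
print(words)
```

The last milestone has no successor, so nothing is "too far" and all of its following siblings are kept, which mirrors the case in which tooFar is an empty node-set in the XSLT.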

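The Word class contains methods that apply to individual words. unwrap() and normalize() are helpers intended for internal use (they are private by intent, although not marked with a leading underscore); createToken() uses them to return a JSON object with the "t" (token text) and "n" (normalized form) properties for a word token.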

In [4]:
class Word:
    unwrapRegex = re.compile('<w>(.*)</w>')
    stripTagsRegex = re.compile('<.*?>')
    def __init__(self,word):
        self.word = word
    def unwrap(self):
        return Word.unwrapRegex.match(etree.tostring(self.word,encoding='unicode')).group(1)
    def normalize(self):
        return Word.stripTagsRegex.sub('',self.unwrap().lower())
    def createToken(self):
        token = {}
        token['t'] = self.unwrap()
        token['n'] = self.normalize()
        return token
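
To see what the two regular expressions do without running the XSLT, they can be applied to a hand-written string. The sketch below assumes that the serialized word looks like the sample string, which stands in for the output of etree.tostring(); the variable names are chosen for the example.

```python
import re

unwrap_regex = re.compile('<w>(.*)</w>')   # captures everything inside the outer <w>
strip_tags_regex = re.compile('<.*?>')     # non-greedy: matches one tag at a time

# Hand-written stand-in for a serialized <w> element
serialized = '<w>p<abbrev>er</abbrev>dent</w>'

t = unwrap_regex.match(serialized).group(1)   # original text, markup retained
n = strip_tags_regex.sub('', t.lower())       # lowercased, markup removed
token = {'t': t, 'n': n}
print(token)
```

The "t" value keeps the inline markup so that it can be reported in the output, while the "n" value is the form that is compared during alignment.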

Create XML data and assign to a witnessSet variable

In [5]:
A = """<l wit='A'><abbrev>Et</abbrev>cil i partent seulement</l>"""
B = """<l wit='B'><abbrev>Et</abbrev>cil i p<abbrev>er</abbrev>dent ausem<abbrev>en</abbrev>t</l>"""
C = """<l wit='C'><abbrev>Et</abbrev>cil i p<abbrev>ar</abbrev>tent seulema<abbrev>n</abbrev>t</l>"""
D = """<l wit='D'>E cil i partent sulement</l>"""

witnessSet = WitnessSet([A,B,C,D])

Generate JSON from the data and examine it

In [6]:
json_input = witnessSet.generate_json_input()
print(json_input)
{'witnesses': [{'tokens': [{'t': '<abbrev>Et</abbrev>cil', 'n': 'etcil'}, {'t': 'i', 'n': 'i'}, {'t': 'partent', 'n': 'partent'}, {'t': 'seulement', 'n': 'seulement'}], 'id': 'A'}, {'tokens': [{'t': '<abbrev>Et</abbrev>cil', 'n': 'etcil'}, {'t': 'i', 'n': 'i'}, {'t': 'p<abbrev>er</abbrev>dent', 'n': 'perdent'}, {'t': 'ausem<abbrev>en</abbrev>t', 'n': 'ausement'}], 'id': 'B'}, {'tokens': [{'t': '<abbrev>Et</abbrev>cil', 'n': 'etcil'}, {'t': 'i', 'n': 'i'}, {'t': 'p<abbrev>ar</abbrev>tent', 'n': 'partent'}, {'t': 'seulema<abbrev>n</abbrev>t', 'n': 'seulemant'}], 'id': 'C'}, {'tokens': [{'t': 'E', 'n': 'e'}, {'t': 'cil', 'n': 'cil'}, {'t': 'i', 'n': 'i'}, {'t': 'partent', 'n': 'partent'}, {'t': 'sulement', 'n': 'sulement'}], 'id': 'D'}]}
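
A quick check of the tokenization before collating is to list the normalized readings for each witness. The sketch below uses hand-copied data for witnesses A and D only, taken from the output above; summarize() is a hypothetical helper, not part of the tutorial code.

```python
# Two of the four witnesses, copied from the output above
json_sample = {'witnesses': [
    {'id': 'A', 'tokens': [{'t': '<abbrev>Et</abbrev>cil', 'n': 'etcil'},
                           {'t': 'i', 'n': 'i'},
                           {'t': 'partent', 'n': 'partent'},
                           {'t': 'seulement', 'n': 'seulement'}]},
    {'id': 'D', 'tokens': [{'t': 'E', 'n': 'e'},
                           {'t': 'cil', 'n': 'cil'},
                           {'t': 'i', 'n': 'i'},
                           {'t': 'partent', 'n': 'partent'},
                           {'t': 'sulement', 'n': 'sulement'}]},
]}

def summarize(json_input):
    """Map each siglum to the list of its normalized ("n") readings."""
    return {w['id']: [token['n'] for token in w['tokens']] for w in json_input['witnesses']}

summary = summarize(json_sample)
for siglum, readings in summary.items():
    print(siglum, len(readings), ' '.join(readings))
```

Witness A yields four tokens and witness D five, because the sample input for A has no whitespace between the abbreviation and "cil", so "etcil" is treated as a single word.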

Collate and output the results as a plain-text alignment table, as JSON, and as colored HTML

In [7]:
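# plain-text alignment table with one column per witness
collationText = collate_pretokenized_json(json_input,output='table',layout='vertical')
print(collationText)
# the same alignment as JSON, with one row of cells per witness
collationJSON = collate_pretokenized_json(json_input,output='json')
print(collationJSON)
collationHTML2 = collate_pretokenized_json(json_input,output='html2')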
|          A           |          B           |          C           |    D     |
| <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c | <abbrev>Et</abbrev>c |    E     |
|          il          |          il          |          il          |          |
|          -           |          -           |          -           |   cil    |
|          i           |          i           |          i           |    i     |
|       partent        | p<abbrev>er</abbrev> | p<abbrev>ar</abbrev> | partent  |
|                      |         dent         |         tent         |          |
|      seulement       | ausem<abbrev>en</abb | seulema<abbrev>n</ab | sulement |
|                      |        rev>t         |        brev>t        |          |
{"table": [[[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "seulement", "t": "seulement"}]], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "perdent", "t": "p<abbrev>er</abbrev>dent"}], [{"n": "ausement", "t": "ausem<abbrev>en</abbrev>t"}]], [[{"n": "etcil", "t": "<abbrev>Et</abbrev>cil"}], [null], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "p<abbrev>ar</abbrev>tent"}], [{"n": "seulemant", "t": "seulema<abbrev>n</abbrev>t"}]], [[{"n": "e", "t": "E"}], [{"n": "cil", "t": "cil"}], [{"n": "i", "t": "i"}], [{"n": "partent", "t": "partent"}], [{"n": "sulement", "t": "sulement"}]]], "witnesses": ["A", "B", "C", "D"]}
etcil etcil etcil e
- - - cil
i i i i
partent perdent partent partent
seulement ausement seulemant sulement
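
The JSON output can be processed further with the standard json module. As a minimal sketch, the code below locates the aligned columns in which the witnesses disagree. It assumes the layout shown in the JSON output above (one row of cells per witness, with [null] marking a gap); the sample string is a trimmed, hand-copied version of that output that keeps only the "n" properties, and readings() is a hypothetical helper.

```python
import json

# Trimmed copy of the JSON output above ("t" properties omitted for brevity)
collation_json = '''{"table": [
  [[{"n": "etcil"}], [null], [{"n": "i"}], [{"n": "partent"}], [{"n": "seulement"}]],
  [[{"n": "etcil"}], [null], [{"n": "i"}], [{"n": "perdent"}], [{"n": "ausement"}]],
  [[{"n": "etcil"}], [null], [{"n": "i"}], [{"n": "partent"}], [{"n": "seulemant"}]],
  [[{"n": "e"}], [{"n": "cil"}], [{"n": "i"}], [{"n": "partent"}], [{"n": "sulement"}]]
], "witnesses": ["A", "B", "C", "D"]}'''

collation = json.loads(collation_json)

def readings(cell):
    """Join the normalized tokens in one cell; a [null] cell is shown as '-'."""
    return ' '.join(token['n'] for token in cell if token is not None) or '-'

# The table holds one row per witness; zip(*rows) regroups it by aligned column
columns = list(zip(*collation['table']))
variant_columns = []
for index, column in enumerate(columns):
    values = [readings(cell) for cell in column]
    if len(set(values)) > 1:
        variant_columns.append(index)
    print(index, dict(zip(collation['witnesses'], values)))
print('columns with variation:', variant_columns)
```

Only the column containing "i" is uniform across all four witnesses; every other column shows at least one divergent reading or a gap.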