Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified:
This tutorial explains how to convert XML to JSON using Python 3. JSON (JavaScript object notation) is a data interchange format commonly used in web-services applications. Some features of CollateX accept only JSON input or produce only JSON output, which means that the user may have to convert input data in other formats to JSON, and may also have to convert CollateX output in JSON to other formats. This tutorial assumes that you are familiar with XML, but not necessarily with JSON. It also assumes that you are familiar with using CollateX from within Python, at least at the level covered in our CollateX-Python tutorial.
CollateX can read input in several formats, including plain text and XML, but if you need to perform your own tokenization (divide the text into words differently from the default) or normalize your word tokens (see the separate discussion in the Normalization unit), the input into CollateX has to be in a CollateX-specific JSON format. The first topic discussed in the present unit, then, addresses how to transform your XML or plain-text input into JSON so that you can control the tokenization and incorporate normalized shadows of your word tokens into the CollateX input. CollateX will then use the normalized shadows to determine when tokens should be considered the same, but it will output the original text from the witness.
CollateX can produce output in several formats, but the native output format is a particular JSON configuration, and certain CollateX operations (including the collation of input with user-specified normalization shadows) can produce only JSON output. This means that the user must then convert the JSON output of CollateX into a more useful form (e.g., TEI, other XML, HTML). The second topic discussed in this unit, then, addresses how to convert CollateX JSON output into XML, which can then be converted to other XML (such as TEI) or to HTML with XSLT.
The XML-to-JSON and JSON-to-XML conversions require an understanding of regular expressions, XSLT, and Python. In this tutorial the XSLT is executed from within Python, which means that we are constrained to use only XSLT 1.0 and XPath 1.0 features, because there is no robust Python support for the current versions of the XSLT and XPath standards. (It is, alternatively, possible to perform the XSLT transformations separately and connect the XSLT and Python processes with I/O pipelining.)
This section describes how to convert XML input (of any type) to JSON for use in CollateX.
CollateX can process pretokenized JSON input with:
collate_pretokenized_json(json_input, output='json')
where json_input is the pretokenized input structure described below (as explained later in this unit, a Python dictionary).
The input JSON file should have the following shape (example from http://collatex.net/doc/#json-input):
{
  "witnesses": [
    {
      "id": "A",
      "tokens": [
        { "t": "A", "ref": 123 },
        { "t": "black", "adj": true },
        { "t": "cat", "id": "xyz" }
      ]
    },
    {
      "id": "B",
      "tokens": [
        { "t": "A" },
        { "t": "white", "adj": true },
        { "t": "kitten.", "n": "cat" }
      ]
    }
  ]
}
JSON is a data structure that consists of a hierarchical (nested) arrangement of JSON objects, each of which has a property name and a value. This is comparable to an XML hierarchy, where elements have a name (comparable to a JSON property name) and contents (comparable to the value of the JSON property). JSON does not have anything analogous to the XML distinction between child elements and attributes; both are expressed as embedded JSON objects with property names and values.
JSON syntax works as follows:
A JSON object is surrounded by curly braces.
Each JSON object has one or more name : value pairs. The name is separated from the value by a colon, and if the JSON object has multiple name : value pairs, those are separated from one another by commas.
The name in a name : value pair must be a string. Strict JSON requires double quotation marks around strings; the Python dictionaries that we use to represent JSON accept matching single or double quotation marks.
The value in a name : value pair may be any of several data types, the most important of which, for our purposes, are strings and lists.
String values are likewise enclosed in quotation marks (double in strict JSON; matching single or double in Python).
For our purposes, lists are lists of JSON objects. The entire list is enclosed in square brackets and the individual list items (JSON objects surrounded by curly braces) are separated from one another by commas.
Whitespace (new-line characters and indentation) is for human legibility and is not informational from a JSON perspective. Although JSON supports other data types, such as numbers and booleans, for the purpose of converting XML to JSON these can be treated as if they were strings.
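These rules can be seen in action with Python's standard json module. The following sketch (the variable names are ours, not part of CollateX) parses a serialized JSON string into a hierarchical structure and serializes it back:

```python
import json

# A JSON serialization as a Python string: one object with a "witnesses"
# property whose value is a list of objects
serialized = '{"witnesses": [{"id": "A", "tokens": [{"t": "cat"}]}]}'

# json.loads() turns the string into the hierarchical structure
data = json.loads(serialized)
print(data['witnesses'][0]['id'])  # -> A

# json.dumps() serializes the hierarchy back to a string; indentation
# is purely for human legibility and carries no information
print(json.dumps(data, indent=2))
```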
In the example above, the root of the JSON input document (comparable to the XML document node) has a single witnesses property (comparable to an XML root element), the value of which is a list of witnesses, each of which is a JSON object (comparable to the children of an XML root element). In this example there are two witnesses. Each witness, in turn, has two properties: the value of the id property is a string that identifies the witness (e.g., a manuscript siglum), and the value of the tokens property is a list of word tokens. Each word token is a JSON object that has an obligatory t property, which contains the string value of the token. All other properties of tokens are optional. The only property that CollateX uses other than t is the n (normalized) property. As described in our separate normalization tutorial, if an n property is specified on the JSON object, CollateX will identify correspondences during collation by comparing n property values. If there is no n property, CollateX compares t properties to identify correspondences. Other properties are for the convenience of the user; CollateX passes them through to the output, but does not use them directly.
The preceding means that to collate JSON input, which is necessary if you want to perform your own tokenization or if you want to use normalization, you must create the JSON structure illustrated above, providing at least a t property for every word token.
As described above, JSON is a hierarchical format, where curly braces delimit JSON objects that may contain other JSON objects. What the human user sees, though, is a linear sequence of individual characters, which we can understand as a serialization of the inherent hierarchical JSON structure as a string. Python can represent either the actual hierarchy (using a Python dictionary) or the serialization (using a Python string). These will look the same to the human user when they are printed to the screen, because the screen displays one character at a time and therefore serializes the hierarchy as part of the rendering process, but internally the dictionary and string representations are different. At present CollateX can accept JSON input only as a Python dictionary, and not as a string. Support for stringified JSON may be added at a later date, but this tutorial assumes that when you create JSON input for CollateX, you must create it as a Python dictionary.
The structure of a Python dictionary is isomorphic with the hierarchical JSON structure, which means that a Python dictionary is capable of representing hierarchical JSON objects directly. As described above, JSON objects are name : value pairs, where the name is a string and the value may be any of several datatypes, including JSON objects, lists, strings, integers, etc. Meanwhile, Python dictionaries consist of key : value pairs that have properties analogous to the name : value properties of JSON objects. This means that converting your XML to JSON for input into CollateX can be thought of as converting your XML to a Python dictionary, which can then be used by CollateX as a JSON representation without further transformation.
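As a concrete sketch, the example input shown earlier can be built directly as a Python dictionary; note that Python's True corresponds to JSON's true, and that json.dumps() is needed only if we want to see the serialization:

```python
import json

# The CollateX input structure as a Python dictionary;
# dictionary keys correspond to JSON property names
json_input = {
    'witnesses': [
        {'id': 'A', 'tokens': [
            {'t': 'A', 'ref': 123},
            {'t': 'black', 'adj': True},
            {'t': 'cat', 'id': 'xyz'},
        ]},
        {'id': 'B', 'tokens': [
            {'t': 'A'},
            {'t': 'white', 'adj': True},
            {'t': 'kitten.', 'n': 'cat'},
        ]},
    ]
}

# The dictionary can be handed to CollateX as is; serializing it
# shows the equivalent JSON (Python True becomes JSON true)
print(json.dumps(json_input, indent=2))
```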
If your input is in XML and you want to pass it to CollateX as JSON (which is necessary if you want to do your own tokenization or normalization), you have to convert your XML to the specific JSON format described above. The complexity of your XML will vary from project to project (this is true even of projects that use TEI, since TEI XML may take different forms), and it is not possible to write a generic Python script that will convert any arbitrary XML (or even any arbitrary TEI) to the JSON format needed for CollateX. This means that you will have to write your own Python code, which knows about your specific XML elements, to perform that transformation. There are, however, three basic XML patterns you are likely to encounter when processing this type of transformation, and we discuss each of them here. If you know how to process these three types of XML patterns, you can write Python code to deal with your specific XML.
This tutorial assumes that all processing (transformation of the input XML to JSON, collation of the JSON with CollateX, transformation of the output JSON to XML) will take place in a single Python program. Alternatively, it is also possible to perform the XML-to-JSON transformation as a separate preprocessing step, using an XSLT transformation tool such as Saxon. Processing the XML with Saxon instead of in Python has the advantage of making XPath 2.0 and XSLT 2.0 resources available (Python supports only XPath 1.0 and XSLT 1.0), along with the possible disadvantage of splitting your collation workflow into separate processes. This tutorial performs all transformations within Python, using XPath 1.0 and XSLT 1.0 strategies.
Simple XML might make no use of internal markup. For example, we might have two witnesses that look like:
<l id="13" n="13">Li salaus se torne al serain</l>
and
<l id="13" n="13">Li solaus se torne al serain</l>
These and the following examples are taken from witnesses A and B, respectively, of Partonopeus de Blois, available at the Oxford Text Archive. The second word tokens differ (salaus vs solaus), but otherwise the readings are the same, and there is no markup inside the <l> elements.
More complex XML might include markup within an individual word token. For example:
<l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
and
<l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
In this example there are both character-level differences and markup differences, and the markup differences are confined to a single word token. Sometimes the token-level markup affects an entire token (e.g., et vs <abbrev>et</abbrev>) and sometimes it affects only part of a token (e.g., maint vs mai<abbrev>n</abbrev>t). Sometimes the variation affects both the character-level data content and the markup (e.g., Ki vs Q<abbrev>ui</abbrev>).
Still more complex XML might have markup that spans more than one token, e.g.:
<l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
and
<l id="1116" n="1024">Nus clers ne v<abbrev>os</abbrev> poroit desc<abbrev>ri</abbrev>re</l>
Here the <crease> element in Witness A begins in the middle of one word token, spans the following two, and ends in the middle of the next. If we are ultimately to treat each word token as a separate XML element (for example, as a separate table cell in an HTML rendering of an alignment table), we will need to translate the spanning element to an alternative representation in order to avoid an overlapping hierarchy conflict.
Each of the preceding types of XML can be represented in JSON in the ways described below. The procedure for incorporating normalized shadow tokens is discussed in a separate unit.
In some cases when you are collating witnesses that have already been tagged in TEI or other XML, the source documents may already contain markup that can be used for alignment purposes. For example, the division of the text in the examples above into <l> elements makes it possible to treat each <l> as a separate collation task, which reduces the likelihood of misaligning tokens that repeat in different lines. In the samples above, the @id attributes match across witnesses, which means that the texts have already been aligned coarsely during initial markup, and the task for CollateX is to refine the alignment so that it operates at the level of the individual word token, instead of the entire <l> element. This type of prealignment is not required, but if it is present in your documents, exploiting it will improve the speed and accuracy of the collation process.
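As an illustrative sketch of exploiting such prealignment (this sketch uses the standard-library xml.etree.ElementTree instead of the lxml library used later in this tutorial, and two abbreviated witnesses), we can map each witness's <l> elements by @id and then process matching lines together:

```python
import xml.etree.ElementTree as ET

witnessA = '<lg wit="A"><l id="21">Laloete cante damor</l><l id="22">Si estrine laube del jor</l></lg>'
witnessB = '<lg wit="B"><l id="21">Laloete chante damor</l><l id="22">Sin estrine laube dou jor</l></lg>'

def lines_by_id(xml):
    """Map each @id value to the text content of its <l> element"""
    root = ET.fromstring(xml)
    return {l.get('id'): l.text for l in root.iter('l')}

a, b = lines_by_id(witnessA), lines_by_id(witnessB)

# Each pair of lines that shares an @id is a separate collation task,
# which keeps repeated words in different lines from being misaligned
for line_id in sorted(set(a) & set(b)):
    print(line_id, '|', a[line_id], '|', b[line_id])
```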
We assume in this tutorial that we want to tokenize the contents of the witnesses (in the preceding examples, the contents of the <l> elements) on whitespace. One complication that arises in tokenizing XML on whitespace is that not all whitespace characters in XML represent word boundaries. For example, in:
<l id="13" n="13">Li salaus se torne al serain</l>
there is whitespace inside the start tag that delimits the attributes; the only whitespace we care about for tokenization purposes is whitespace inside text() nodes. A second complication is that if we do the tokenization within Python, we are limited to XPath 1.0 and XSLT 1.0, which means that we cannot use several powerful features of XPath 2.0 and XSLT 2.0 that would be helpful in this context (e.g., tokenize(), matches(), <xsl:analyze-string>, <xsl:for-each-group>, the except operator, and the << operator).
We cope with these challenges by performing the tokenization with XSLT, rather than with Python regular expression matching against a string, because it is easier to distinguish whitespace in text() nodes from whitespace inside tags using XML-aware processing. We do the processing in two passes: first we replace all whitespace sequences inside text() nodes with empty milestone <w/> tags, and then, on a second pass, we convert those milestones into wrapper <w> elements around each word token.
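Before turning to the XSLT implementation, here is a minimal stand-alone sketch of the idea (standard-library ElementTree; unlike the XSLT version, it handles only a line with no child elements): splitting the text() content on whitespace never touches the whitespace between attributes, and each resulting token is wrapped in a <w> element:

```python
import xml.etree.ElementTree as ET

line = ET.fromstring('<l id="13" n="13">Li salaus se torne al serain</l>')

# Split only the text() content on whitespace; the whitespace inside
# the start tag (between attributes) is never part of a text() node
tokens = line.text.split()

# Wrap each token in a <w> element, preserving the line's attributes
wrapped = ET.Element('l', line.attrib)
for token in tokens:
    w = ET.SubElement(wrapped, 'w')
    w.text = token

print(ET.tostring(wrapped, encoding='unicode'))
```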
As a simplified example, let’s assume that we have two two-line witnesses to the Partonopeus de Blois text, as follows:
<l id="21" n="21">Laloete cante damor</l>
<l id="22" n="22">Si estrine laube del jor</l>
and
<l id="21" n="19">Laloete chante damor</l>
<l id="22" n="20">Sin estrine laube dou jor</l>
This is a simplified example that focuses on creating JSON that can serve as input into CollateX. Real-world examples will be more complicated in the following ways:
First, your documents will probably contain markup above the level of the lines being collated (<text>, <body>, <div>, <lg>, etc.); your Python script can read the entire document and distinguish what to ignore, what to pass through unchanged, and what to collate with CollateX.
Second, if your documents use <l> tags but lack the @id attributes that provide coarse prealignment here, you have to align the entire texts at once with CollateX, since without the @id attributes you cannot determine which line in one witness is supposed to be collated against which line in another. In that case the <l> tags represent a case of internal markup that crosses word boundaries (type 3 above), which is discussed below.
Here is a Python script that can collate this input using CollateX, along with the output of CollateX. We discuss converting the JSON output to XML separately below.
#!/usr/bin/env python3
"""
example1.py
Author: David J. Birnbaum (djbpitt@gmail.com; http://www.obdurodon.org)
Acknowledgements: Revised with the help of Ronald Dekker
First version: 2015-06-25

Collates minimal TEI input <l> elements without internal markup
"""
from collatex import *
from lxml import etree
import re, json


class Witness:
    """An instance of Witness is the etree representation of a witness"""
    def __init__(self, xml):
        self.xml = etree.XML(xml)

    def ids(self):
        return [int(i) for i in self.xml.xpath('//@id')]

    def minId(self):
        return min(self.ids())

    def maxId(self):
        return max(self.ids())

    def lById(self, id):
        """word-tokenized <l> of a witness by @id

        not called directly; used by words(), below
        """
        self.id = str(id)
        line = Line(self.xml, self.id)
        return line.tokenized()

    def words(self, id):
        """word tokens with <w> wrappers removed"""
        self.id = str(id)
        wrappedWords = self.lById(self.id).xpath('//w')
        for w in wrappedWords:
            yield Word(w).unwrap()

    def siglum(self):
        # xpath() returns a list, even if there's just one object
        return str(self.xml.xpath('//lg/@wit')[0])

    def generate_tokens(self, lineId):
        words = self.words(lineId)
        currentTokens = []
        for word in words:
            wordToken = {}
            wordToken['t'] = word
            currentTokens.append(wordToken)
        return currentTokens


class Line:
    """An instance of Line is an <l> in a witness with a specified @id"""
    # The two XSLT transformations are class properties
    # The first replaces whitespace in the input with <w/> milestone tags
    # The second transforms the milestones to wrappers
    xsltAddW = etree.XML('''
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens -->
        <xsl:template match="add | sic | crease">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(., '&#x0A;', ' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input, ' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformAddW = etree.XSLT(xsltAddW)
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)

    def __init__(self, witness, id):
        self.witness = witness
        self.id = id

    def fullLine(self):
        # xpath() returns a list, even if there's just one object
        return self.witness.xpath('//l[@id =' + self.id + ']')[0]

    def lineString(self):
        # lineString() is for diagnosis, and is not used in production
        return etree.tostring(self.fullLine(), encoding='unicode')

    def tokenized(self):
        self.withMilestones = Line.transformAddW(self.fullLine())
        self.withWrappers = Line.transformWrapW(self.withMilestones)
        return self.withWrappers


class Word:
    """An instance of Word is a word token in a line"""
    unwrapRegex = re.compile(r'<.*?>\s*(.*)\s*</.*?>')

    def __init__(self, w):
        self.w = w
        self.stringified = etree.tostring(self.w, encoding='unicode')

    def unwrap(self):
        """Remove <w> tags from around word token"""
        return Word.unwrapRegex.match(self.stringified).group(1)


class WitnessSet:
    """An instance of WitnessSet is the set of witnesses being collated

    In these examples, the witnesses are individual <l> elements, not entire mss
    """
    def __init__(self, witnesses):
        self.witnesses = witnesses

    def get_line_ids(self):
        witnessMin = min([witness.minId() for witness in self.witnesses])
        witnessMax = max([witness.maxId() for witness in self.witnesses])
        return range(witnessMin, witnessMax + 1)

    def generate_block(self, lineId):
        block = {}
        witnesses = []
        for witness in self.witnesses:
            currentWitness = {}
            currentWitness['id'] = witness.siglum()
            currentWitness['tokens'] = witness.generate_tokens(lineId)
            witnesses.append(currentWitness)
        block['witnesses'] = witnesses
        return block

    def generate_blocks_by_line(self):
        for lineId in self.get_line_ids():
            block = self.generate_block(lineId)
            yield block


def main():
    witnessA = """
    <lg wit="A">
        <l id="21" n="21">Laloete cante damor</l>
        <l id="22" n="22">Si estrine laube del jor</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="21" n="19">Laloete chante damor</l>
        <l id="22" n="20">Sin estrine laube dou jor</l>
    </lg>
    """
    # witnessC and witnessD are not used in this example
    witnessC = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessD = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collation = collate_pretokenized_json(block)
        print(collation)


if __name__ == "__main__":
    main()
If you run this, the output is:
+---+---------+--------+-------+
| A | Laloete | cante  | damor |
| B | Laloete | chante | damor |
+---+---------+--------+-------+

+---+-----+---------+-------+-----+-----+
| A | Si  | estrine | laube | del | jor |
| B | Sin | estrine | laube | dou | jor |
+---+-----+---------+-------+-----+-----+
Word tokens that contain no markup can be collated on their string value, but if we want to recognize that, for example, maint and mai<abbrev>n</abbrev>t are a perfect string match once we ignore the markup, we can create an n property for the JSON word token object that strips the internal markup. The changes to the preceding Python script that perform that normalization are the following: we define two regular expressions in the Word class, puncRegex and tagRegex, to match punctuation and tags, respectively, and we use those plus the str.lower() method in a new normalizeToken() method to create the normalized shadow token. As noted above, if we specify an n property in our JSON word token, CollateX will use that property to perform matching and alignment, but it will return the original non-normalized token in the alignment table.

#!/usr/bin/env python3
"""
example2.py
Author: David J. Birnbaum (djbpitt@gmail.com; http://www.obdurodon.org)
Acknowledgements: Revised with the help of Ronald Dekker
First version: 2015-06-25

Collates TEI input <l> elements with token-internal markup
Based on example1.py
"""
from collatex import *
from lxml import etree
import re, json, string


class Witness:
    """An instance of Witness is the etree representation of a witness"""
    def __init__(self, xml):
        self.xml = etree.XML(xml)

    def ids(self):
        return [int(i) for i in self.xml.xpath('//@id')]

    def minId(self):
        return min(self.ids())

    def maxId(self):
        return max(self.ids())

    def lById(self, id):
        """word-tokenized <l> of a witness by @id

        not called directly; used by words(), below
        """
        self.id = str(id)
        line = Line(self.xml, self.id)
        return line.tokenized()

    def words(self, id):
        """word tokens as Word objects (still wrapped in <w> tags)"""
        self.id = str(id)
        wrappedWords = self.lById(self.id).xpath('//w')
        for w in wrappedWords:
            yield Word(w)

    def siglum(self):
        # xpath() returns a list, even if there's just one object
        return str(self.xml.xpath('//lg/@wit')[0])

    def generate_tokens(self, lineId):
        words = self.words(lineId)
        currentTokens = []
        for word in words:
            wordToken = {}
            wordToken['t'] = word.unwrap()
            wordToken['n'] = word.normalizeToken()
            currentTokens.append(wordToken)
        return currentTokens


class Line:
    """An instance of Line is an <l> in a witness with a specified @id"""
    # The two XSLT transformations are class properties
    # The first replaces whitespace in the input with <w/> milestone tags
    # The second transforms the milestones to wrappers
    xsltAddW = etree.XML('''
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens -->
        <xsl:template match="add | sic | crease">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="text()">
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(., '&#x0A;', ' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input, ' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformAddW = etree.XSLT(xsltAddW)
    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group> as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)

    def __init__(self, witness, id):
        self.witness = witness
        self.id = id

    def fullLine(self):
        # xpath() returns a list, even if there's just one object
        return self.witness.xpath('//l[@id =' + self.id + ']')[0]

    def lineString(self):
        # lineString() is for diagnosis, and is not used in production
        return etree.tostring(self.fullLine(), encoding='unicode')

    def tokenized(self):
        self.withMilestones = Line.transformAddW(self.fullLine())
        self.withWrappers = Line.transformWrapW(self.withMilestones)
        return self.withWrappers


class Word:
    """An instance of Word is a word token in a line"""
    unwrapRegex = re.compile(r'<.*?>\s*(.*)\s*</.*?>')
    puncRegex = re.compile('[' + string.punctuation + ']+')
    tagRegex = re.compile(r'<.*?>')

    def __init__(self, w):
        self.w = w
        self.stringified = etree.tostring(self.w, encoding='unicode')

    def unwrap(self):
        """Remove <w> tags from around word token"""
        return Word.unwrapRegex.match(self.stringified).group(1)

    def normalizeToken(self):
        """Create shadow 'n' property: lowercase, strip tags, strip punctuation"""
        return Word.puncRegex.sub('', Word.tagRegex.sub('', self.stringified.lower()))


class WitnessSet:
    """An instance of WitnessSet is the set of witnesses being collated

    In these examples, the witnesses are individual <l> elements, not entire mss
    """
    def __init__(self, witnesses):
        self.witnesses = witnesses

    def get_line_ids(self):
        witnessMin = min([witness.minId() for witness in self.witnesses])
        witnessMax = max([witness.maxId() for witness in self.witnesses])
        return range(witnessMin, witnessMax + 1)

    def generate_block(self, lineId):
        block = {}
        witnesses = []
        for witness in self.witnesses:
            currentWitness = {}
            currentWitness['id'] = witness.siglum()
            currentWitness['tokens'] = witness.generate_tokens(lineId)
            witnesses.append(currentWitness)
        block['witnesses'] = witnesses
        return block

    def generate_blocks_by_line(self):
        for lineId in self.get_line_ids():
            block = self.generate_block(lineId)
            yield block


def main():
    witnessA = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collation = collate_pretokenized_json(block)
        print(collation)


if __name__ == "__main__":
    main()
If you run this, the output is:
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
| A | Ki                   | maint                  | <abbrev>et</abbrev> | el | pere | et                  | el | fis |
| B | Q<abbrev>ui</abbrev> | mai<abbrev>n</abbrev>t | <abbrev>et</abbrev> | el | pere | <abbrev>et</abbrev> | el | fis |
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+

+---+----------------------+----+-------+------+-------+
| A | Ki                   | ma | doune | soie | merci |
| B | Q<abbrev>ui</abbrev> | ma | done  | soie | merci |
+---+----------------------+----+-------+------+-------+
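The normalization step can also be sketched in isolation with the standard re and string modules (this stand-alone version adds re.escape() around the punctuation characters as a defensive measure; the function name normalize is ours):

```python
import re, string

tagRegex = re.compile(r'<.*?>')  # any XML tag
puncRegex = re.compile('[' + re.escape(string.punctuation) + ']+')  # runs of punctuation

def normalize(stringified_token):
    """Create the shadow 'n' value for a serialized word token:
    lowercase, strip tags, then strip punctuation"""
    return puncRegex.sub('', tagRegex.sub('', stringified_token.lower()))

print(normalize('mai<abbrev>n</abbrev>t'))  # -> maint
print(normalize('Q<abbrev>ui</abbrev>'))    # -> qui
print(normalize('kitten.'))                 # -> kitten
```

With the n values identical, CollateX can align mai<abbrev>n</abbrev>t with maint while still reporting the original marked-up token in its output.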
If we want to visualize the alignment table in HTML or other XML, we have to avoid overlapping elements. In the preceding example there is no risk of overlap because the markup that we retain in the word tokens is wholly confined to a single token. But in an example like:
<l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
we cannot safely create output with the <crease> start tag inside one word token and the end tag inside another, since if we were to do that and then wrap each word token in tags, we would create an overlap situation that violates XML well-formedness constraints. To deal with this issue we identify the tags that may span word tokens and convert those to empty milestone tags. In the first XSLT stylesheet embedded in the Python script above, we've flagged <add>, <sic>, and <crease> to be treated in this way. If we change the input XML in the script's main() function to include an example of <crease> that spans word tokens:
    witnessA = """
    <lg wit="A">
        <l id="1115" n="1097">La cambre est de marbre porfire</l>
        <l id="1116" n="1098">Nus cler<crease>s ne vos poroit</crease> desc<abbrev>ri</abbrev>re</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="1115" n="1023">La cambre est d<abbrev>un</abbrev> marbre porfire</l>
        <l id="1116" n="1024">Nus clers ne v<abbrev>os</abbrev> poroit desc<abbrev>ri</abbrev>re</l>
    </lg>
    """
the output is:
+---+----+--------+-----+----------------------+--------+---------+
| A | La | cambre | est | de                   | marbre | porfire |
| B | La | cambre | est | d<abbrev>un</abbrev> | marbre | porfire |
+---+----+--------+-----+----------------------+--------+---------+

+---+-----+--------------------------+----+----------------------+-------------------------+---------------------------+
| A | Nus | cler<crease n="start"/>s | ne | vos                  | poroit<crease n="end"/> | desc<abbrev>ri</abbrev>re |
| B | Nus | clers                    | ne | v<abbrev>os</abbrev> | poroit                  | desc<abbrev>ri</abbrev>re |
+---+-----+--------------------------+----+----------------------+-------------------------+---------------------------+
In the Python script above, we use the collate_pretokenized_json() function of CollateX to perform collation of the JSON input we've prepared, and by default it creates output in the form of a horizontal alignment table. The only type of output that CollateX can create directly from JSON input other than an alignment table is JSON. For example, if we specify output='json', the JSON output of the preceding collation looks like:
{ "table": [ [ [{ "n": "ki", "t": "Ki" }], [{ "n": "maint", "t": "maint" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "pere", "t": "pere" }], [{ "n": "et", "t": "et" }], [{ "n": "el", "t": "el" }], [{ "n": "fis", "t": "fis" }] ], [ [{ "n": "qui", "t": "Q<abbrev>ui<\/abbrev>" }], [{ "n": "maint", "t": "mai<abbrev>n<\/abbrev>t" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "pere", "t": "pere" }], [{ "n": "et", "t": "<abbrev>et<\/abbrev>" }], [{ "n": "el", "t": "el" }], [{ "n": "fis", "t": "fis" }] ] ], "witnesses": [ "A", "B" ] }
The JSON output has two properties, table and witnesses. The value of the witnesses property is a list of strings that are the witness identifiers (A and B in the example above). The value of the table property is a list, the items of which are themselves lists. Each of those sublists itself contains sub-sublists, each of which contains only one item, a JSON object that represents a single word token. The JSON object returns all of the properties that were present on input, which in our example are just t and n, but if you wanted to include other information from the input (e.g., a line number) and retrieve it on output, you could assign it to an arbitrary JSON property (other than t and n, which are reserved property names that have predefined meanings) and it would pass through the collation process unchanged and be reported in the output.
If we eventually want to output some specific type of XML (such as HTML, TEI, or something else), we can start by converting the JSON output of the collation process to a simple XML that observes the same hierarchy as the JSON. We can then convert the simple XML to our desired final form with XSLT. For the alignment table:
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
| A | Ki                   | maint                  | <abbrev>et</abbrev> | el | pere | et                  | el | fis |
| B | Q<abbrev>ui</abbrev> | mai<abbrev>n</abbrev>t | <abbrev>et</abbrev> | el | pere | <abbrev>et</abbrev> | el | fis |
+---+----------------------+------------------------+---------------------+----+------+---------------------+----+-----+
our simple XML will have the following structure:
<block>
   <witnesses>
      <i>A</i>
      <i>B</i>
   </witnesses>
   <table>
      <i>
         <i><i><t>Ki</t><n>ki</n></i></i>
         <i><i><t>maint</t><n>maint</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>pere</t><n>pere</n></i></i>
         <i><i><t>et</t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>fis</t><n>fis</n></i></i>
      </i>
      <i>
         <i><i><t>Q<abbrev>ui</abbrev></t><n>qui</n></i></i>
         <i><i><t>mai<abbrev>n</abbrev>t</t><n>maint</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>pere</t><n>pere</n></i></i>
         <i><i><t><abbrev>et</abbrev></t><n>et</n></i></i>
         <i><i><t>el</t><n>el</n></i></i>
         <i><i><t>fis</t><n>fis</n></i></i>
      </i>
   </table>
</block>
The following modifications to the Python script create the XML output:
# Dictionary to XML conversion (converted to Python 3) based on
# http://code.activestate.com/recipes/577882-convert-a-nested-python-data-structure-to-xml/
def data2xml(d, name='data'):
    r = etree.Element(name)
    return buildxml(r, d)

def buildxml(r, d):
    if isinstance(d, dict):
        for k, v in d.items():
            s = etree.SubElement(r, k)
            buildxml(s, v)
    elif isinstance(d, tuple) or isinstance(d, list):
        for v in d:
            s = etree.SubElement(r, 'i')
            buildxml(s, v)
    elif isinstance(d, str):
        r.text = d
    else:
        r.text = str(d)
    return r

def main():
    witnessA = """
    <lg wit="A">
        <l id="8" n="8">Ki maint <abbrev>et</abbrev> el pere et el fis</l>
        <l id="9" n="9">Ki ma doune soie merci</l>
    </lg>
    """
    witnessB = """
    <lg wit="B">
        <l id="8" n="8">Q<abbrev>ui</abbrev> mai<abbrev>n</abbrev>t <abbrev>et</abbrev> el pere <abbrev>et</abbrev> el fis</l>
        <l id="9" n="9">Q<abbrev>ui</abbrev> ma done soie merci</l>
    </lg>
    """
    # get the numeric range of @id values as witnessMin and witnessMax
    witnessATree = Witness(witnessA)
    witnessBTree = Witness(witnessB)
    witnessSet = WitnessSet([witnessATree, witnessBTree])
    # treat each <l> (by @id) as a separate collation block
    for block in witnessSet.generate_blocks_by_line():
        # Uncomment the following line to see the JSON input to CollateX
        # print(json.dumps(block, indent=2))
        collationTable = collate_pretokenized_json(block)
        collation = collate_pretokenized_json(block, output='json')
        # unescape the markup inside <t> values, which etree.tostring() escaped
        print(re.sub('&amp;', '&',
                     re.sub('&gt;', '>',
                            re.sub('&lt;', '<',
                                   etree.tostring(data2xml(json.loads(collation), name='block'),
                                                  pretty_print=True, encoding='unicode')))))
        print(collationTable)
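As a quick sanity check, the data2xml()/buildxml() recipe can be run in isolation on a toy structure shaped like the collation JSON. This sketch uses the standard library's xml.etree.ElementTree, which shares the Element/SubElement/tostring API that the recipe relies on (the tutorial's script uses lxml for its pretty_print option):

```python
import xml.etree.ElementTree as etree

# Self-contained copy of the recipe above, for experimentation.
def data2xml(d, name='data'):
    r = etree.Element(name)
    return buildxml(r, d)

def buildxml(r, d):
    if isinstance(d, dict):
        for k, v in d.items():
            s = etree.SubElement(r, k)
            buildxml(s, v)
    elif isinstance(d, (tuple, list)):
        for v in d:
            s = etree.SubElement(r, 'i')
            buildxml(s, v)
    else:
        r.text = d if isinstance(d, str) else str(d)
    return r

# A toy structure shaped like the collation JSON above:
# two witnesses, one aligned column each.
data = {"witnesses": ["A", "B"],
        "table": [[[{"t": "Ki", "n": "ki"}]],
                  [[{"t": "Qui", "n": "qui"}]]]}
print(etree.tostring(data2xml(data, name='block'), encoding='unicode'))
# <block><witnesses><i>A</i><i>B</i></witnesses><table><i><i><i><t>Ki</t><n>ki</n></i></i></i><i><i><i><t>Qui</t><n>qui</n></i></i></i></table></block>
```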
The XML output can be converted to an HTML table with:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"> <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/> <xsl:strip-space elements="*"/> <xsl:template match="/"> <html> <head> <title>Collation output</title> </head> <body> <h1>Collation output</h1> <xsl:apply-templates select="//table"/> </body> </html> </xsl:template> <xsl:template match="t"> <td> <xsl:apply-templates/> </td> </xsl:template> <xsl:template match="*"> <xsl:element name="{name()}"> <xsl:apply-templates/> </xsl:element> </xsl:template> <xsl:template match="table/i"> <tr> <th> <xsl:value-of select="//witnesses/i[position() eq count(current()/preceding-sibling::i) + 1]"/> </th> <xsl:apply-templates/> </tr> </xsl:template> <xsl:template match="table/i/i"> <xsl:apply-templates/> </xsl:template> <xsl:template match="table/i/i/i"> <xsl:apply-templates select="t"/> </xsl:template> <xsl:template match="abbrev"> <xsl:text>(</xsl:text> <xsl:apply-templates/> <xsl:text>)</xsl:text> </xsl:template> </xsl:stylesheet>
with the following output:
+---+-------+---------+------+----+------+------+----+-----+
| A | Ki    | maint   | (et) | el | pere | et   | el | fis |
| B | Q(ui) | mai(n)t | (et) | el | pere | (et) | el | fis |
+---+-------+---------+------+----+------+------+----+-----+