Many complex Unicode characters can be expressed in more than one way. For example, an uppercase A with a ring above can be expressed as a single Unicode character, the angstrom sign (U+212B), or as a combination of a base uppercase A (U+0041) followed by a combining ring above (U+030A). The documents you need to collate may contain alternative representations that you would like to treat as identical for collation purposes. To do that, you can create a shadow "n" property in your pretokenized JSON (see Unit 5 of this workshop) and normalize the strings. Here’s how to do that.
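For context, a single pretokenized token pairs the literal reading ("t") with its normalized shadow ("n"). The "t" and "n" property names follow CollateX's pretokenized JSON format; the sample reading here is invented for illustration:

```python
import unicodedata

reading = "A\u030a"  # decomposed: A + combining ring above

# "t" preserves the witness text exactly; "n" is the normalized
# form that CollateX compares when aligning tokens.
token = {"t": reading, "n": unicodedata.normalize("NFC", reading)}
print(token["n"] == "\u00c5")  # True: NFC composes to U+00C5 (Å)
```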

To show that the two representations differ at the underlying byte level but look alike to a human, we create two variables, a and b, assign one representation to each, and print them so that we can compare how they look.

In [1]:
a = "\u212b"
print(a)
In [2]:
b = "\u0041\u030a"
print(b)
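Because the two strings render alike, it can help to inspect them at the codepoint level. A quick sketch using the standard-library `unicodedata.name()` function shows what each string actually contains:

```python
import unicodedata

a = "\u212b"        # single character: angstrom sign
b = "\u0041\u030a"  # two characters: A + combining ring above

# Same appearance, different codepoints.
print(len(a), [unicodedata.name(c) for c in a])
# 1 ['ANGSTROM SIGN']
print(len(b), [unicodedata.name(c) for c in b])
# 2 ['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']
```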

If we check these two values for equality, Python tells us that they are not equal:

In [5]:
a == b

We can use the unicodedata.normalize() function to normalize both variables and then compare them. When we do that, the normalized versions are equal:

In [6]:
import unicodedata
unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
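NFC is not the only option: `unicodedata.normalize()` also accepts 'NFD', 'NFKC', and 'NFKD'. As a sketch (reusing the variables from above), either a composed or a decomposed form works for comparison, as long as both sides are normalized the same way:

```python
import unicodedata

a = "\u212b"        # angstrom sign
b = "\u0041\u030a"  # A + combining ring above

# NFC composes; NFD decomposes. Both make a and b compare equal.
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Under NFC, both representations compose to U+00C5 (Å).
print(unicodedata.normalize("NFC", a) == "\u00c5")  # True
```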

We’ve performed this normalization by itself so that we can examine the results. In a CollateX context, you would incorporate Unicode normalization as part of the process of generating an "n" property for your pretokenized JSON tokens. See Unicode Standard Annex #15, "Unicode Normalization Forms," for more information about Unicode normalization forms.
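Putting the pieces together, here is a sketch of how normalization might slot into building pretokenized input. The "witnesses", "id", "tokens", "t", and "n" property names follow CollateX's pretokenized JSON format; the witness readings are invented for illustration:

```python
import unicodedata

def make_token(reading):
    """Pair the literal reading ("t") with an NFC-normalized shadow ("n")."""
    return {"t": reading, "n": unicodedata.normalize("NFC", reading)}

# Two witnesses spell the same word with different Unicode representations.
witness_a = ["\u212bngstr\u00f6m"]    # angstrom sign, precomposed ö
witness_b = ["A\u030angstro\u0308m"]  # combining ring, combining diaeresis

collatex_input = {
    "witnesses": [
        {"id": "A", "tokens": [make_token(w) for w in witness_a]},
        {"id": "B", "tokens": [make_token(w) for w in witness_b]},
    ]
}

# The "t" values differ at the codepoint level, but the "n" values
# match, so the tokens can be treated as the same reading.
t1 = collatex_input["witnesses"][0]["tokens"][0]
t2 = collatex_input["witnesses"][1]["tokens"][0]
print(t1["t"] == t2["t"], t1["n"] == t2["n"])  # False True
```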
