The modern text processing pipeline: Segmentation

This blog post is part of a series:

This is the start of the technical overview of what goes on in a modern text processing pipeline. At each step of the pipeline, we will be discussing what the standards say, what real implementations do, and what I did in kb_text_shape.

The algorithms

For our purposes, text segmentation is the first part of the text processing pipeline[1]. As was discussed previously, the goal of segmentation is to split an unstructured blob of Unicode text into a sequence of substrings, called “runs”, where each run has a single direction and writing system.

The Unicode consortium describes various algorithms you can apply to Unicode text in documents called “annexes”. Even though one of them is titled “Unicode Text Segmentation”, it is not enough to cover the whole of segmentation. The relevant annexes are:

  - UAX #9: Unicode Bidirectional Algorithm
  - UAX #14: Unicode Line Breaking Algorithm
  - UAX #24: Unicode Script Property
  - UAX #29: Unicode Text Segmentation[2]

(Note: I will be using “segmentation” and “breaking” interchangeably. Strictly speaking, you can consider segmentation to mean the finding of runs, and breaking to mean the finding of breaks that separate the runs. For all intents and purposes, the two are equivalent.)

Grapheme, word, sentence and line breaking are all described in a similar fashion. Each algorithm maps Unicode codepoints to input classes, and then describes rules that apply to those classes. This strategy of mapping a large set of possible inputs (in this case, Unicode codepoints) to a smaller set (in this case, input classes) is very common in standard Unicode algorithms like these[3]. Each algorithm describes its own input classes, meaning that a single Unicode codepoint will map to a grapheme break class, a word break class, a sentence break class and a line break class, all of which are distinct and cannot be used interchangeably. While there is some tedium associated with this multiplicity of classes, they do end up being quite different from one another in practice. Furthermore, having the tightest possible set of input classes for each algorithm provides a few benefits, such as reducing lookup table size.

Extracting Unicode properties

Some input classes have straightforward descriptions, like Grapheme_Cluster_Break=CR being assigned to the single codepoint U+000D[4]. Many others, however, reference character properties like “General_Category” or “Emoji_Modifier”. These properties can be found in the Unicode Character Database (often abbreviated to UCD), which is a very fancy name for a directory of text files containing semicolon-separated values. Microsoft also publishes their own extension to the UCD, and all properties defined by Microsoft take precedence over Unicode's. The Microsoft files were published as part of the Universal Shaping Engine specification, which is about shaping and not segmentation, but, realistically speaking, you will want the classes from both passes to line up anyway, so you should use the same ones for both.

Once parsed, you can bake the properties into any lookup-friendly format you so choose. I recommend two-stage tables, as they are fast and very simple. In kb_text_shape, I use two-stage tables for all properties (for an example, see kbts_GetUnicodeLineBreakClass). The block size for each table is chosen by brute-force optimization.
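To make the idea concrete, here is a minimal sketch of a two-stage table in C. The names, block size and toy data (covering only CR and LF, for codepoints below U+1000) are made up for illustration; real tables are generated from the UCD, and the actual kb_text_shape equivalent is kbts_GetUnicodeLineBreakClass.

```c
#include <assert.h>
#include <stdint.h>

/* Toy line break classes. */
enum { LB_OTHER = 0, LB_CR = 1, LB_LF = 2 };

/* Stage 2: the actual class data, one 256-entry block per DISTINCT block.
   Block 0 covers U+0000..U+00FF; block 1 is the shared all-OTHER block. */
static const uint8_t Stage2[2][256] = {
  { [0x0A] = LB_LF, [0x0D] = LB_CR }, /* everything else defaults to LB_OTHER */
  { 0 },
};

/* Stage 1: maps (Codepoint >> 8) to a block id. Identical blocks are
   deduplicated, which is where the memory savings come from. */
static const uint8_t Stage1[16] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };

static uint8_t GetLineBreakClass(uint32_t Codepoint)
{
  /* Two dependent loads: block id, then the value inside that block. */
  return Stage2[Stage1[Codepoint >> 8]][Codepoint & 0xFF];
}
```

The block size (here, 256) trades stage-1 size against deduplication opportunities, which is exactly the knob the brute-force optimization mentioned above is turning.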

Implementation

The segmentation algorithms are described by pattern-matching on infinitely-long strings. As such, a naive implementation would require access to the full input text in advance. This causes a few issues: going over the text multiple times is slow, random access is awkward when the encoding is variable-width, as UTF-8 and UTF-16 are, and the resulting API is worse, because the user may only want to find breaks up to a certain point. For all of these reasons, it was important to me to find streaming implementations of all of the segmentation algorithms for kb_text_shape.

Script breaking

Locating script breaks is easy: codepoints whose script property (UCD: Scripts.txt) is “Common” or “Inherited”, or that have no script property at all, do not change the current script. Any other value sets the current script, which causes a potential script break. The one special case we have to handle is that closing brackets inherit the script values of their matching opening brackets (UCD: BidiBrackets.txt). Thankfully, the consortium's algorithm limits the bracket matching depth to 64, so this is representable with a fixed-size stack without issue.
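As a sketch of the core loop, assuming a made-up two-script classifier standing in for the Scripts.txt data and omitting the bracket stack for brevity, script segmentation might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Toy script classifier; SCRIPT_COMMON stands in for Common/Inherited/none. */
typedef enum { SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_GREEK } script;

static script GetScript(unsigned int Codepoint)
{
  if((Codepoint >= 'A' && Codepoint <= 'Z') || (Codepoint >= 'a' && Codepoint <= 'z'))
    return SCRIPT_LATIN;
  if(Codepoint >= 0x0370 && Codepoint <= 0x03FF)
    return SCRIPT_GREEK;
  return SCRIPT_COMMON; /* digits, punctuation, spaces, ... */
}

/* Streaming script segmentation: "Common" codepoints never change the
   current script; any other script sets it, and a change is a break.
   Writes the index where each new run starts and returns the run count. */
static size_t FindScriptRuns(const unsigned int *Text, size_t Count, size_t *RunStarts)
{
  size_t RunCount = 0;
  script Current = SCRIPT_COMMON;
  for(size_t Index = 0; Index < Count; ++Index)
  {
    script Script = GetScript(Text[Index]);
    if(Script != SCRIPT_COMMON && Script != Current)
    {
      if(Current != SCRIPT_COMMON)
        RunStarts[RunCount++] = Index; /* script change: potential break */
      Current = Script;
    }
  }
  return RunCount;
}
```

Note that the loop only needs the current codepoint and one enum's worth of state, which is what makes the streaming formulation so natural here.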

Direction breaking

Direction breaking spends a great deal of time discussing the explicit bidirectional formatting characters. These are:

  - U+202A LEFT-TO-RIGHT EMBEDDING (LRE) and U+202B RIGHT-TO-LEFT EMBEDDING (RLE)
  - U+202D LEFT-TO-RIGHT OVERRIDE (LRO) and U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
  - U+202C POP DIRECTIONAL FORMATTING (PDF)
  - U+2066 LEFT-TO-RIGHT ISOLATE (LRI), U+2067 RIGHT-TO-LEFT ISOLATE (RLI) and U+2068 FIRST STRONG ISOLATE (FSI)
  - U+2069 POP DIRECTIONAL ISOLATE (PDI)

As far as I know, these do require access to the entire source text. However, their use is niche and officially discouraged, so I did not implement them for kb_text_shape, and the rest of the rules can absolutely be implemented in a streaming fashion. If you ignore the explicit formatting characters, you also get to ignore all of the embedding and isolate machinery. The direction breaking algorithm then starts to look a lot like script breaking: for each character, resolve its direction (either left-to-right, right-to-left, or neutral); non-neutral directions set the current direction, possibly creating a direction break.

Resolving directions requires pattern matching. Rules are given in order of precedence, meaning they are supposed to be read as a big if-else chain, with the corresponding priority this implies. This kind of structure is reused for all the other segmentation algorithms, too. For our streaming implementation to work, we need these rules to be implementable with a fixed amount of state, which we will briefly analyze here.

In the context of a streaming segmenter, at any point in the input text, we can assume we have seen all of the preceding characters. This makes backward-looking rules simple to implement. In the case of W1, we simply store the last bidirectional class we have seen. In the case of W2, we set a bit when we see an AL, and clear it when we see an R or L. Forward-looking rules, on the other hand, require more adaptation effort. For the first sub-rule of W5, a potentially-infinite sequence of ETs is handled by keeping track of the first ET seen. When the sequence ends, it then becomes a matter of resolving either a bunch of ENs or a bunch of ETs all at once. Note that this includes keeping track of the W6 and W7 cases involving those classes of characters, so multiple rules might need to be resolved at the same time. The key insight here is that there is a statically-known number of rules, those rules all require bounded state, and rules are applied top-to-bottom, meaning that rule application does not recurse. For all of these reasons, it is possible to make a streaming implementation by working through each case.
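To illustrate how little state the backward-looking rules need, here is a hedged sketch of W1 and W2 on a toy subset of the bidirectional classes. The names are hypothetical, and start-of-sequence handling is simplified to a neutral initial state; this is not how kb_text_shape spells it.

```c
#include <assert.h>

/* Toy subset of the bidirectional character classes used by rules W1/W2. */
typedef enum { BIDI_L, BIDI_R, BIDI_AL, BIDI_EN, BIDI_AN, BIDI_NSM, BIDI_ON } bidi_class;

typedef struct
{
  bidi_class Previous;  /* last resolved class, for W1 */
  int LastStrongWasAL;  /* a single bit of state, for W2 */
} w_state;

/* Resolve one character, streaming. W1: a nonspacing mark takes the class
   of the preceding character. W2: a European number whose last preceding
   strong class was AL becomes an Arabic number. */
static bidi_class ResolveW1W2(w_state *State, bidi_class Class)
{
  if(Class == BIDI_NSM) Class = State->Previous;                  /* W1 */
  if(Class == BIDI_EN && State->LastStrongWasAL) Class = BIDI_AN; /* W2 */

  if(Class == BIDI_AL) State->LastStrongWasAL = 1;
  else if(Class == BIDI_L || Class == BIDI_R) State->LastStrongWasAL = 0;
  State->Previous = Class;
  return Class;
}
```

Because the rules are applied in order, W1's rewrite happens before W2 reads the class, so an NSM following an AL correctly counts as a strong AL for subsequent numbers.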

Grapheme breaking

Grapheme breaking can be approached in the same way as direction breaking. Alternatively, as the Unicode consortium points out in Unicode Text Segmentation, a finite state machine is another good way to implement the algorithm, and is likely faster. If you do end up going for the FSM, I recommend writing it by hand, simply because there are a lot of benefits to choosing which state gets assigned to which integer. In kb_text_shape, for example, exit states that indicate that a break exists one character back also encode the state to reset to inside of the exit state itself:

kbts_u8 GraphemeBreakState = kbts_GraphemeBreakTransition[GraphemeBreakClass][State->GraphemeBreakState];
switch(GraphemeBreakState)
{
case KBTS_GRAPHEME_BREAK_STATE_b01: KBTS_BREAK2(KBTS_BREAK_FLAG_GRAPHEME, 1, 0); GraphemeBreakState = KBTS_GRAPHEME_BREAK_STATE_START; break;
case KBTS_GRAPHEME_BREAK_STATE_b0: KBTS_BREAK(KBTS_BREAK_FLAG_GRAPHEME, 0); GraphemeBreakState = KBTS_GRAPHEME_BREAK_STATE_START; break;

case KBTS_GRAPHEME_BREAK_STATE_b1:
case KBTS_GRAPHEME_BREAK_STATE_b1toCR:
case KBTS_GRAPHEME_BREAK_STATE_b1toL:
case KBTS_GRAPHEME_BREAK_STATE_b1toLVxV:
case KBTS_GRAPHEME_BREAK_STATE_b1toLVTxT:
case KBTS_GRAPHEME_BREAK_STATE_b1toIndicConsonantxIndicLinker:
case KBTS_GRAPHEME_BREAK_STATE_PADDING0: // Padding values are just here to help the compiler.
case KBTS_GRAPHEME_BREAK_STATE_PADDING1:
case KBTS_GRAPHEME_BREAK_STATE_b1toExtendedPictographic:
case KBTS_GRAPHEME_BREAK_STATE_PADDING2:
case KBTS_GRAPHEME_BREAK_STATE_PADDING3:
case KBTS_GRAPHEME_BREAK_STATE_b1toRI:
case KBTS_GRAPHEME_BREAK_STATE_b1toSKIP:
  KBTS_BREAK(KBTS_BREAK_FLAG_GRAPHEME, 1);
  GraphemeBreakState -= KBTS_GRAPHEME_BREAK_STATE_b1;
}

In any case, an FSM for grapheme breaking ends up being quite small, both in terms of the transition table and the code that uses it.

Word and line breaking

A lot of the contextual rules for word and line breaking can overlap with one another, which leads to a combinatorial explosion that makes a state machine implementation infeasible. Instead, we implement these rules in a similar way to direction breaking, albeit with considerably more complexity this time.

Since we have many more rules, special-casing every piece of state would be exceedingly tedious, error-prone and complex. By looking at the rules, we observe that:

  1. Sets of input classes, using the operator |, increase matching complexity.
  2. Repeats, using the operator *, increase matching complexity.
  3. Excluding repeats, rules involve no more than 4 characters.
  4. Rules are given in order of precedence.
  5. Precedence between two rules of the same type (either break or no-break) is meaningless, because the outcome will be the same regardless of which one is applied.

We take advantage of (3) by storing a 4-character history of input classes, and using it as our primary means of matching. Since there are few input classes overall, we can pack the entire history into a single integer. This allows us to match on the entire history at once. Rules that are shorter than 4 characters can simply mask off part of the history. This is the resulting code for matching three-character word break rules:

switch(WordBreakHistory & 0xFFFFFF)
{
  KBTS_C3(ALnep, ML, ALnep): KBTS_C3(ALnep, ML, ALep): KBTS_C3(ALnep, ML, HL):
  KBTS_C3(ALnep, MNL, ALnep): KBTS_C3(ALnep, MNL, ALep): KBTS_C3(ALnep, MNL, HL):
  KBTS_C3(ALnep, SQ, ALnep): KBTS_C3(ALnep, SQ, ALep): KBTS_C3(ALnep, SQ, HL):
  KBTS_C3(ALep, ML, ALnep): KBTS_C3(ALep, ML, ALep): KBTS_C3(ALep, ML, HL):
  KBTS_C3(ALep, MNL, ALnep): KBTS_C3(ALep, MNL, ALep): KBTS_C3(ALep, MNL, HL):
  KBTS_C3(ALep, SQ, ALnep): KBTS_C3(ALep, SQ, ALep): KBTS_C3(ALep, SQ, HL):
  KBTS_C3(HL, ML, ALnep): KBTS_C3(HL, ML, ALep): KBTS_C3(HL, ML, HL):
  KBTS_C3(HL, MNL, ALnep): KBTS_C3(HL, MNL, ALep): KBTS_C3(HL, MNL, HL):
  KBTS_C3(HL, SQ, ALnep): KBTS_C3(HL, SQ, ALep): KBTS_C3(HL, SQ, HL):
  KBTS_C3(HL, DQ, HL):
  KBTS_C3(NM, MN, NM): KBTS_C3(NM, MNL, NM): KBTS_C3(NM, SQ, NM):
    WordUnbreaks |= KBTS_WORD_BREAK_BITS(0, 1) | KBTS_WORD_BREAK_BITS(0, 2); break;
}

Since the entire matching logic uses switch statements, we can trivially handle the operator | by duplicating every possible case inside of the switch. This solves (1).

Using (5), we assign a single level of precedence to a block of no-break rules followed by break rules, with the implicit behavior that no-break rules have priority over break rules of the same level. This drastically reduces the number of levels of precedence, enough so that we can encode every possible rule application into a single integer:

// Word breaks.
// We buffer 3 characters for word breaks.
// Each character gets 3 bits (padded to 4) representing 3 levels of priority.
#define KBTS_WORD_BREAK_BITS(Priority, Position) (((1 << ((Priority) + 1)) - 1) << ((Position) * 4))

The single-integer representation allows us to handle precedence for free: a higher level of precedence will also affect all of the lower levels by using bitmasks (this is where the (- 1) comes from in the macro above). Additionally, breaks are accumulated into a different integer than no-breaks, which makes application order truly meaningless. The final results are obtained by &-ing off the no-break bits from the break bits.
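Here is a small worked example of the two-integer scheme; `WORD_BREAK_BITS` mirrors the macro above, while the `Breaks` helper is a made-up name for the final masking step.

```c
#include <assert.h>
#include <stdint.h>

/* Each buffered character gets a 4-bit slot. A rule of priority P sets bits
   0..P of its slot, so a higher-priority rule also covers all lower levels;
   this is what the (- 1) in the macro achieves. */
#define WORD_BREAK_BITS(Priority, Position) \
  ((((uint32_t)1 << ((Priority) + 1)) - 1) << ((Position) * 4))

/* Breaks and no-breaks accumulate into separate integers, so the order in
   which same-level rules fire does not matter. A position breaks if, after
   masking the no-break bits off the break bits, anything is left in its slot. */
static int Breaks(uint32_t BreakBits, uint32_t UnbreakBits, int Position)
{
  uint32_t Slot = (uint32_t)0xF << (Position * 4);
  return ((BreakBits & ~UnbreakBits) & Slot) != 0;
}
```

Because both masks saturate downward, `BreakBits & ~UnbreakBits` is nonzero in a slot exactly when the strongest break rule outranks the strongest no-break rule, so precedence really does come for free.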

Finally, (2) is handled using similar tricks to direction breaking. There is no general solution to these; they are all special-cased.

I have shown examples of word breaking here because they are simpler and more succinct, but the same approach generalizes well to line breaking. Line breaking differentiates between mandatory breaks (also called hard line breaks) and allowed breaks (also called soft line breaks); we solve this with more bitmasking, although you could also use a third integer that you would then merge at the end, just like no-breaks.

In kb_text_shape, the final code is all in kbts_BreakAddCodepoint_.


  1. Depending on their needs, some programs may want to sanitize or normalize input, or even change encoding, before segmentation. These passes will not be discussed, as they deal with concerns like security or internal string representations, which don't directly contribute to getting text on the screen.
  2. The annex also describes sentence breaking, but I have yet to find any use for it in any context whatsoever. As such, I will not be covering it.
  3. Input set remapping like this is also widely used in parsers in general.
  4. In Unicode parlance, “U+” is the prefix used for codepoints, which are written in hexadecimal. For some reason, they have also decided to pad everything to use at least 4 digits, even though codepoints are 21-bit, so it doesn't even make their representation fixed-width or anything. For all intents and purposes, you can replace it with “0x” in C, which makes “U+000D” simply 0xD.