The modern text processing pipeline: Overview

My text segmentation and shaping library was just released. The announcement received a lot of attention, which I am very grateful for, but a certain number of people seemed confused about what the library is or does, or how they could use it in their own projects. I will try to give a bird's-eye view of the pieces that make up a modern text processing pipeline, and, by the end of the article, you should hopefully have a better idea of where my library fits.

Text encoding

Before we even get to processing anything, we have to define what we mean by text. Back in the TrueType days, some fonts had glyph tables that facilitated certain encodings, like ASCII, UTF-16 or Shift JIS, and, beyond that, all of the font's functionality was specified within the font itself.

While OpenType fonts reuse the same glyph tables to map text data to glyph indices, shaping requires knowing the Unicode properties for each character we want to display. These properties indicate things like “is this character a punctuation mark?” or “what side does this attaching vowel typically attach from?”. These tables only exist for Unicode and are published by the Unicode Consortium, so, practically speaking, some form of Unicode encoding is required. Harfbuzz and kb_text_shape.h both expect the user to provide Unicode codepoints, with optional UTF-8 input support[1].

Segmentation

Text segmentation divides the input text into substrings called “runs”. Runs are sequences of characters that share a common direction (left-to-right or right-to-left) as well as a common script. Additionally, a “hard line break”, like an explicit newline or carriage return character, signals the end of a run[2]. Fundamentally, there is nothing more to it than that; at minimum, a program may decide to break its text by direction, script and hard line breaks. Programs are, of course, free to subdivide their text further depending on their needs. A terminal, for instance, might want to segment its text by graphemes, which represent individual visible units of text, or by soft line breaks for more complex scripts[3]. All of these checks require knowing certain properties about each character, which can be found in the Unicode Character Database. As such, segmentation happens on Unicode codepoints.
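
As a sketch of the idea, the following toy segmenter splits a codepoint buffer into runs by script and hard line breaks. The two-block property lookup and the rule that neutral characters inherit the surrounding run's script are simplifications I made for illustration; a real segmenter reads the full Unicode Character Database and resolves direction with the Unicode bidirectional algorithm.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { SCRIPT_COMMON, SCRIPT_LATIN, SCRIPT_ARABIC } Script;

// Toy property lookup covering only two scripts; a real segmenter gets these
// properties from the Unicode Character Database. Direction follows from the
// script here (Latin is left-to-right, Arabic right-to-left).
static Script script_of(uint32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return SCRIPT_LATIN;
    if (cp >= 0x0600 && cp <= 0x06FF) return SCRIPT_ARABIC; // Arabic block
    return SCRIPT_COMMON; // digits, spaces, punctuation: join the current run
}

// Writes the end index (exclusive) of each run into 'ends' and returns the
// run count. A run ends after a hard line break ('\n') or before a script
// change; the input text itself is never modified.
size_t segment_runs(const uint32_t *cps, size_t n, size_t *ends) {
    if (n == 0) return 0;
    size_t count = 0;
    Script cur = SCRIPT_COMMON;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] == '\n') {             // hard break: run ends after the newline
            ends[count++] = i + 1;
            cur = SCRIPT_COMMON;
            continue;
        }
        Script s = script_of(cps[i]);
        if (s == SCRIPT_COMMON) continue; // neutrals inherit the run's script
        if (cur != SCRIPT_COMMON && s != cur)
            ends[count++] = i;            // script change: a new run starts here
        cur = s;
    }
    if (count == 0 || ends[count - 1] != n)
        ends[count++] = n;
    return count;
}
```

Note that the segmenter only emits boundaries, matching the observation below that segmentation never rewrites the text.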

It is worth noting that segmentation only produces bounds between runs. It does not modify the input text in any way. In practice, the user of a segmentation API is free to either convert their entire text buffer to a codepoint buffer in advance, or simply decode their text encoding of choice (like UTF-8) to codepoints on the fly and pass those to the segmenter. The latter, of course, necessitates that the segmentation can be done in a streaming fashion, which some libraries (like kb_text_shape) do provide.
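
To make the on-the-fly decoding concrete, here is a minimal UTF-8 decoder that yields one codepoint at a time, suitable for feeding a streaming segmenter. It is a sketch: it validates continuation bytes but, unlike a strict decoder, does not reject overlong encodings or surrogate codepoints.

```c
#include <stddef.h>
#include <stdint.h>

// Decodes one UTF-8 sequence starting at 's' (buffer length 'n'); stores the
// codepoint in *cp and returns the number of bytes consumed, or 0 on invalid
// or truncated input.
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp) {
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }  // ASCII fast path
    size_t len;
    uint32_t c;
    if      ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                              // stray continuation or invalid lead byte
    if (n < len) return 0;                      // truncated sequence
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    // malformed continuation byte
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}
```

A caller would loop over its byte buffer, advancing by the returned length and handing each codepoint to the segmenter, so no intermediate codepoint buffer is needed.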

Shaping

Text shaping is the first step in the text processing pipeline that requires a font file. The runs produced by segmentation are shaped against a font. First, the Unicode codepoints are mapped to glyph indices, as mentioned in the “Text encoding” section. Then, font features are applied to the glyph indices.
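
The codepoint-to-glyph step can be pictured as a search over sorted codepoint ranges, which is roughly how an OpenType 'cmap' format 4 subtable works. The ranges and glyph ids below are invented for illustration; real values come from the font.

```c
#include <stddef.h>
#include <stdint.h>

// Toy character-to-glyph map in the spirit of a 'cmap' format 4 subtable:
// sorted codepoint ranges, each mapping onto consecutive glyph ids.
typedef struct { uint32_t first, last; uint16_t glyph; } CmapRange;

static const CmapRange ranges[] = {
    { 0x0020, 0x007E, 3   },  // Basic Latin starts at glyph 3 (made up)
    { 0x0600, 0x06FF, 150 },  // Arabic block starts at glyph 150 (made up)
};

// Binary search over the ranges; glyph 0 (.notdef) means "not in the font".
uint16_t map_codepoint(uint32_t cp) {
    size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if      (cp < ranges[mid].first) hi = mid;
        else if (cp > ranges[mid].last)  lo = mid + 1;
        else return (uint16_t)(ranges[mid].glyph + (cp - ranges[mid].first));
    }
    return 0;
}
```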

Substitution features are applied first and produce a new glyph sequence. For instance, a ligature may consume 2 glyphs and replace them with a single new glyph. A typical example of a ligature most programmers will be familiar with is replacing a hyphen followed by a greater-than sign, ->, with a single arrow glyph. A multiple substitution does the opposite: it consumes a single glyph and produces several. Once substitution is done, positioning features are applied. Positioning features use the same kind of matching logic as substitution features, but they modify glyph positions instead of glyph indices. An OpenType font is full of both of these types of lookups and is free to use them however the font designer pleases. They constitute most of the work of shaping[4].
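
As an illustration of the substitution half, here is a toy in-place ligature pass in the spirit of a GSUB ligature lookup, using the arrow example above. The glyph ids are invented; a real shaper matches against lookup tables read from the font rather than hard-coded pairs.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical glyph ids for illustration only.
enum { GLYPH_HYPHEN = 16, GLYPH_GREATER = 33, GLYPH_ARROW = 210 };

// One toy ligature pass: every (hyphen, greater-than) pair is replaced in
// place by a single arrow glyph. Returns the new glyph count, which is why
// substitution is said to produce a new glyph sequence.
size_t apply_arrow_ligature(uint16_t *glyphs, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && glyphs[i] == GLYPH_HYPHEN && glyphs[i + 1] == GLYPH_GREATER) {
            glyphs[out++] = GLYPH_ARROW;  // 2 glyphs consumed, 1 produced
            i += 2;
        } else {
            glyphs[out++] = glyphs[i++];  // no match: copy through
        }
    }
    return out;
}
```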

Why, then, do runs have to have a uniform script and direction? Simply put, script uniformity is required because font feature selection for any given run depends on its script, so mixing two scripts in a single run risks applying the wrong features to the wrong glyphs. Direction uniformity, on the other hand, is required because mixing directions makes it impossible to reason about the distance between any two glyphs, so the font's matching logic breaks down. To illustrate this, consider a sentence that mixes left-to-right and right-to-left text:

دينيس ريتشي فاش كان خدام ف مختبرات بيل، مابين 1972 و 1973

This sentence starts from the right and is read right-to-left. When “1972” is reached, the first logical character is the digit “1”, which is the leftmost digit! Even though “1” is “next to” the letter “ن”, the visual distance between them is huge, which makes their logical distance meaningless.

Rasterization

Shaping outputs glyph indices, which can be used to extract a visual description of each glyph. It also outputs glyph positions, so all we have left to do is to transform these visual descriptions into something we can display on a screen.

Text rasterization is a very well-studied topic, and many solutions exist. The core problem is the rasterization of curved outlines into a pixel grid, which has been the subject of research papers for a long time. I will not go into details here, as the problem space is much larger than the other steps of the pipeline, and existing resources are much better.

Layout

Text shaping only gives us glyph positions on a single, infinitely long line. While this is sufficient for file names, buttons, and other common UI elements, it is obviously not enough to typeset a document. Text layout is the process of wrapping lines to fit a certain width, and optionally making the text “fit better” through a number of adjustments, like justification and hyphenation. These adjustments are typically very dependent on the script being processed. While Latin layout is straightforward enough, Arabic justification, for instance, is notoriously complex.
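
The core of line wrapping can be sketched as a greedy pass over shaped word widths. This is a simplification under stated assumptions: each word's total advance is already known from shaping, and hyphenation, justification, and words wider than the line are ignored.

```c
#include <stddef.h>

// Greedy line breaking over shaped words: 'advances' holds the total advance
// width of each word (as produced by shaping), 'space' the width of one
// inter-word space. Writes the index of the first word of each line into
// 'starts' and returns the line count.
size_t wrap_lines(const float *advances, size_t words, float space,
                  float max_width, size_t *starts) {
    if (words == 0) return 0;
    size_t lines = 0;
    float width = 0.0f;
    starts[lines++] = 0;                 // the first word always starts a line
    for (size_t i = 0; i < words; i++) {
        float add = (width > 0.0f ? space : 0.0f) + advances[i];
        if (width > 0.0f && width + add > max_width) {
            starts[lines++] = i;         // word i does not fit: start a new line
            width = advances[i];
        } else {
            width += add;
        }
    }
    return lines;
}
```

A smarter layout engine (TeX-style) would instead minimize badness over the whole paragraph, but the interface, widths in and break positions out, stays the same.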

So, where does kb_text_shape fit?


  1. Harfbuzz also supports UTF-16 and Latin-1 input, both of which are easy to translate to Unicode codepoints.

  2. In practice, line breaks are more complicated to check for than simply looking for single characters. It does turn out, however, that a newline character (ASCII 10) is a guaranteed hard line break.

  3. This strategy of dividing simple scripts by grapheme and complex scripts by soft line breaks is used by the refterm terminal renderer.

  4. Complex scripts require special-cased passes that are not part of the font file, but part of the shaper itself. However, the kind of work performed by these passes is not so fundamentally different from the application of font features, and is not worth describing separately in this article.