<?xml version="1.0" encoding="UTF-8"?><rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>New Road, Old Way</title><link>https://newroadoldway.com</link><atom:link href="https://newroadoldway.com/feed.xml" rel="self" type="application/rss+xml" /><description>Jimmy Lefevre's blog</description><language>en-us</language><lastBuildDate>Tue, 02 Jul 2025 00:00:00 +0000</lastBuildDate><item><title>The modern text processing pipeline: Segmentation</title><pubDate>Tue, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://newroadoldway.com/text2.html</guid><description><p>This blog post is part of a series:</p><ul><li><a href="./text0.html">The modern text processing pipeline: Preface</a></li><li><a href="./text1.html">The modern text processing pipeline: Overview</a></li><li><strong>The modern text processing pipeline: Segmentation</strong></li></ul><p>This is the start of the technical overview of what goes on in a modern text processing pipeline. At each step of the pipeline, we will be discussing what the standards say, what real implementations do, and what I did in kb_text_shape.</p><h3>The algorithms</h3><p>For our purposes, text segmentation is the first part of the text processing pipeline<sup id="fnr1"><a rel="footnote" href="#fn1">1</a></sup>. <a href="./text1.html">As was discussed previously</a>, the goal of segmentation is to split an unstructured blob of Unicode text into a sequence of substrings, called &quot;runs&quot;, where each run has a single direction and writing system.</p><p>The Unicode consortium describes various algorithms you can apply to Unicode text in documents called &quot;annexes&quot;. Even though one of them is titled &quot;Unicode Text Segmentation&quot;, it is not enough to cover the whole of segmentation. 
The relevant annexes are:</p><ul><li><a href="https://www.unicode.org/reports/tr29/">Unicode Text Segmentation</a> for word and grapheme breaking<sup id="fnr2"><a rel="footnote" href="#fn2">2</a></sup>,</li><li><a href="https://www.unicode.org/reports/tr14/">Unicode Line Breaking Algorithm</a> for line breaking,</li><li><a href="https://www.unicode.org/reports/tr24/">Unicode Script Property</a> for script breaking,</li><li><a href="https://www.unicode.org/reports/tr9/">Unicode Bidirectional Algorithm</a> for direction breaking.</li></ul><p>(Note: I will be using &quot;segmentation&quot; and &quot;breaking&quot; interchangeably. Strictly speaking, you can consider segmentation to mean the finding of runs, and breaking to mean the finding of breaks that separate the runs. For all intents and purposes, the two are equivalent.)</p><p>Grapheme, word, sentence and line breaking are all described in a similar fashion. Each algorithm maps Unicode codepoints to <em>input classes</em>, and then describes rules that apply to those classes. This strategy of mapping a large set of possible inputs (in this case, Unicode codepoints) to a smaller set (in this case, input classes) is very common in standard Unicode algorithms like these<sup id="fnr3"><a rel="footnote" href="#fn3">3</a></sup>. Each algorithm describes its own input classes, meaning that a single Unicode codepoint will map to a grapheme break class, a word break class, a sentence break class and a line break class, all of which are distinct and cannot be used interchangeably. While there is some tedium associated with this multiplicity of classes, they do end up being quite different from one another in practice. 
Furthermore, having the tightest possible set of input classes for each algorithm provides a few benefits, such as reducing lookup table size.</p><h3>Extracting Unicode properties</h3><p>Some input classes have straightforward descriptions, like Grapheme_Cluster_Break=CR being assigned to the single codepoint U+000D<sup id="fnr4"><a rel="footnote" href="#fn4">4</a></sup>. Many others, however, reference character properties like &quot;General_Category&quot; or &quot;Emoji_Modifier&quot;. These properties can be found in the <a href="https://www.unicode.org/ucd/">Unicode Character Database</a> (often abbreviated to UCD), which is a very fancy name for a directory of text files containing semicolon-separated values. Microsoft also publishes their own extension to the UCD, which you can find <a href="https://github.com/microsoft/font-tools/tree/main/USE">here</a>. All properties defined by Microsoft take precedence over Unicode's. The Microsoft files were published as part of the <a href="https://learn.microsoft.com/en-us/typography/script-development/use">Universal Shaping Engine</a> specification, which is about shaping and not segmentation, but, realistically speaking, you will want the classes from both passes to line up anyway, so you should use the same ones for both. Once parsed, you can bake the properties into any lookup-friendly format you so choose. I recommend <a href="https://www.strchr.com/multi-stage_tables">two-stage tables</a>, as they are fast and very simple. In <a href="https://github.com/JimmyLefevre/kb/blob/main/kb_text_shape.h">kb_text_shape</a>, I use two-stage tables for all properties (for an example, see <span style="font-family:monospace;">kbts_GetUnicodeLineBreakClass</span>). The block size for each table is chosen by brute-force optimization.</p><h3>Implementation</h3><p>The segmentation algorithms are described by pattern-matching on infinitely-long strings. 
As such, a naive implementation would require access to the full input text in advance. This causes a few issues: going over the text multiple times is slow, it is annoying when the encoding is variable-sized like UTF-8 and UTF-16 are, and the resulting API is worse because maybe the user only wants to find breaks up to a certain point. For all of these reasons, it was important to me to find streaming implementations of all of the segmentation algorithms for kb_text_shape.</p><h4>Script breaking</h4><p>Locating script breaks is easy: codepoints that have a script property (UCD: Scripts.txt) of &quot;Common&quot;, &quot;Inherited&quot; or that do not have a script property at all do not change the current script. Any other value sets the current script, which causes a potential script break. The one special case we have to handle is that closing brackets inherit the script values of their matching opening brackets (UCD: BidiBrackets.txt). Thankfully, the consortium's algorithm limits the bracket matching depth to 64, so this is representable with a fixed-size stack without issue.</p><h4>Direction breaking</h4><p>Direction breaking spends a great deal of time discussing the explicit bidirectional formatting characters. These are:</p><ul><li>U+202A Left-to-right embedding</li><li>U+202B Right-to-left embedding</li><li>U+202D Left-to-right override</li><li>U+202E Right-to-left override</li><li>U+202C Pop directional formatting</li><li>U+2066 Left-to-right isolate</li><li>U+2067 Right-to-left isolate</li><li>U+2068 First strong isolate</li><li>U+2069 Pop directional isolate</li></ul><p>As far as I know, these do require access to the entire source text. However, their use is niche and officially discouraged, so I did not implement them for kb_text_shape, and the rest of the rules can absolutely be implemented in a streaming fashion. If you ignore the explicit formatting characters, you also get to ignore all of the embedding and isolate machinery. 
The direction breaking algorithm then starts to look a lot like script breaking: for each character, resolve its direction (either left-to-right, right-to-left, or neutral); non-neutral directions set the current direction, possibly creating a direction break.</p><p>Resolving directions requires pattern matching. Rules are given in order of precedence, meaning they are supposed to be read as a big if-else chain, with the corresponding priority this implies. This kind of structure is reused for all the other segmentation algorithms, too. For our streaming implementation to work, we need these rules to be implementable with a fixed amount of state, which we will briefly analyze here.</p><p>In the context of a streaming segmenter, at any point in the input text, we can assume we have seen all of the preceding characters. This makes backward-looking rules simple to implement. In the case of <a href="https://www.unicode.org/reports/tr9/#W1">W1</a>, we simply store the last bidirectional class we have seen. In the case of <a href="https://www.unicode.org/reports/tr9/#W2">W2</a>, we set a bit when we see an AL, and clear it when we see an R or L. Forward-looking rules, on the other hand, require more adaptation effort. For the first sub-rule of <a href="https://www.unicode.org/reports/tr9/#W5">W5</a>, a potentially-infinite sequence of ETs is stored by keeping track of the first ET seen. When the sequence ends, it then becomes a matter of either resolving a bunch of ENs or a bunch of ETs all at once. Note that this includes keeping track of the <a href="https://www.unicode.org/reports/tr9/#W6">W6</a> and <a href="https://www.unicode.org/reports/tr9/#W7">W7</a> cases involving those classes of characters, so multiple rules might need to be resolved all at once. The key insight here is that there is a statically known number of rules, those rules all require bounded state, and rules are applied top-to-bottom, meaning that rule application does not recurse. 
For all of these reasons, it is possible to make a streaming implementation by working through each case.</p><h4>Grapheme breaking</h4><p>Grapheme breaking can be approached in the same way as direction breaking. Alternatively, as the Unicode consortium points out in <a href="https://www.unicode.org/reports/tr29/">Unicode Text Segmentation</a>, a finite state machine is another good way to implement the algorithm, and is likely faster. If you do end up going for the FSM, I recommend <a href="https://nothings.org/computer/lexing.html">writing it by hand</a>, simply because there are a lot of benefits to choosing which state gets assigned to which integer. In kb_text_shape, for example, exit states that indicate that a break exists one character back also encode the state to reset to inside of the exit state itself:</p><pre><code>kbts_u8 GraphemeBreakState = kbts_GraphemeBreakTransition[GraphemeBreakClass][State-&gt;GraphemeBreakState];
switch(GraphemeBreakState)
{
case KBTS_GRAPHEME_BREAK_STATE_b01: KBTS_BREAK2(KBTS_BREAK_FLAG_GRAPHEME, 1, 0); GraphemeBreakState = KBTS_GRAPHEME_BREAK_STATE_START; break;
case KBTS_GRAPHEME_BREAK_STATE_b0: KBTS_BREAK(KBTS_BREAK_FLAG_GRAPHEME, 0); GraphemeBreakState = KBTS_GRAPHEME_BREAK_STATE_START; break;

case KBTS_GRAPHEME_BREAK_STATE_b1:
case KBTS_GRAPHEME_BREAK_STATE_b1toCR:
case KBTS_GRAPHEME_BREAK_STATE_b1toL:
case KBTS_GRAPHEME_BREAK_STATE_b1toLVxV:
case KBTS_GRAPHEME_BREAK_STATE_b1toLVTxT:
case KBTS_GRAPHEME_BREAK_STATE_b1toIndicConsonantxIndicLinker:
case KBTS_GRAPHEME_BREAK_STATE_PADDING0: // Padding values are just here to help the compiler.
case KBTS_GRAPHEME_BREAK_STATE_PADDING1:
case KBTS_GRAPHEME_BREAK_STATE_b1toExtendedPictographic:
case KBTS_GRAPHEME_BREAK_STATE_PADDING2:
case KBTS_GRAPHEME_BREAK_STATE_PADDING3:
case KBTS_GRAPHEME_BREAK_STATE_b1toRI:
case KBTS_GRAPHEME_BREAK_STATE_b1toSKIP:
  KBTS_BREAK(KBTS_BREAK_FLAG_GRAPHEME, 1);
  GraphemeBreakState -= KBTS_GRAPHEME_BREAK_STATE_b1;
}
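</code></pre><p>As a minimal sketch of that numbering trick: the toy state names below are hypothetical and far fewer than the real table's, and the only assumption carried over from the code above is that every &quot;break one character back&quot; exit state equals its resume state plus a fixed base, so a single subtraction recovers the state to continue from.</p><pre><code>#include &lt;stdint.h&gt;

// Toy numbering (hypothetical, not kb_text_shape's): regular states come first,
// and each exit state is laid out as TOY_B1 plus the state to resume from.
enum
{
  TOY_START = 0,
  TOY_AFTER_CR = 1,
  TOY_B1 = 2,                           // Break one character back, resume in TOY_START.
  TOY_B1_TO_CR = TOY_B1 + TOY_AFTER_CR  // Break one character back, resume in TOY_AFTER_CR.
};

// Returns 1 if *State is an exit state, and rewrites *State to the state to resume from.
// Returns 0 and leaves *State untouched otherwise.
static int ToyResolveExitState(uint8_t *State)
{
  if(*State &gt;= TOY_B1)
  {
    *State -= TOY_B1; // One subtraction handles every exit state; no per-state lookup.
    return 1;
  }
  return 0;
}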
</code></pre><p>In any case, an FSM for grapheme breaking ends up being quite small, both in terms of the transition table and the code that uses it.</p><h4>Word and line breaking</h4><p>A lot of the contextual rules for word and line breaking can overlap with one another, which leads to a combinatorial explosion that makes a state machine implementation infeasible. Instead, we implement these rules in a similar way to direction breaking, albeit with much more complexity to handle this time.</p><p>Since we have many more rules, special-casing every piece of state would be exceedingly tedious, error-prone and complex. By looking at the rules, we observe that:</p><ol><li>Sets of input classes, using the operator <span style="font-family:monospace;">|</span>, increase matching complexity.</li><li>Repeats, using the operator <span style="font-family:monospace;">*</span>, increase matching complexity.</li><li>Excluding repeats, rules involve no more than 4 characters.</li><li>Rules are given in order of precedence.</li><li>Precedence between two rules of the same type (either break or no-break) is meaningless, because the outcome will be the same regardless of which one is applied.</li></ol><p>We take advantage of (3) by storing a 4-long history of input classes, and using it as our primary means of matching. Since there are few input classes overall, we pack the entire history in a single integer. This allows us to match on the entire history at once. Rules that are shorter than 4 characters can simply mask off part of the history. This is the resulting code for matching three-character word break rules:</p><pre><code>switch(WordBreakHistory &amp; 0xFFFFFF)
{
  KBTS_C3(ALnep, ML, ALnep): KBTS_C3(ALnep, ML, ALep): KBTS_C3(ALnep, ML, HL):
  KBTS_C3(ALnep, MNL, ALnep): KBTS_C3(ALnep, MNL, ALep): KBTS_C3(ALnep, MNL, HL):
  KBTS_C3(ALnep, SQ, ALnep): KBTS_C3(ALnep, SQ, ALep): KBTS_C3(ALnep, SQ, HL):
  KBTS_C3(ALep, ML, ALnep): KBTS_C3(ALep, ML, ALep): KBTS_C3(ALep, ML, HL):
  KBTS_C3(ALep, MNL, ALnep): KBTS_C3(ALep, MNL, ALep): KBTS_C3(ALep, MNL, HL):
  KBTS_C3(ALep, SQ, ALnep): KBTS_C3(ALep, SQ, ALep): KBTS_C3(ALep, SQ, HL):
  KBTS_C3(HL, ML, ALnep): KBTS_C3(HL, ML, ALep): KBTS_C3(HL, ML, HL):
  KBTS_C3(HL, MNL, ALnep): KBTS_C3(HL, MNL, ALep): KBTS_C3(HL, MNL, HL):
  KBTS_C3(HL, SQ, ALnep): KBTS_C3(HL, SQ, ALep): KBTS_C3(HL, SQ, HL):
  KBTS_C3(HL, DQ, HL):
  KBTS_C3(NM, MN, NM): KBTS_C3(NM, MNL, NM): KBTS_C3(NM, SQ, NM):
    WordUnbreaks |= KBTS_WORD_BREAK_BITS(0, 1) | KBTS_WORD_BREAK_BITS(0, 2); break;
}
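</code></pre><p>For readers who want to see the packing itself, here is a minimal sketch under stated assumptions: the class values and every <span style="font-family:monospace;">Toy</span>-prefixed name are hypothetical, and I assume one 8-bit slot per class with the newest class in the low byte, which is merely consistent with the <span style="font-family:monospace;">0xFFFFFF</span> mask above rather than taken from the library.</p><pre><code>#include &lt;stdint.h&gt;

// Hypothetical class values; the real word break classes are derived from the UCD.
enum { TOY_OTHER = 0, TOY_LETTER = 1, TOY_MIDLETTER = 2 };

// Shift the history up by one 8-bit slot and append the newest class.
// With a 32-bit integer, the class from 4 characters ago falls off the top.
static uint32_t ToyPushClass(uint32_t History, uint8_t Class)
{
  return (History &lt;&lt; 8) | Class;
}

// Pack three classes, oldest first, into a key comparable against the masked history.
#define TOY_C3(A, B, C) (((uint32_t)(A) &lt;&lt; 16) | ((uint32_t)(B) &lt;&lt; 8) | (uint32_t)(C))

// Matches &quot;Letter MidLetter Letter&quot; against the three most recent classes.
// The mask drops the oldest slot, which a three-character rule does not care about.
static int ToyMatchesLetterMidLetterLetter(uint32_t History)
{
  return (History &amp; 0xFFFFFF) == TOY_C3(TOY_LETTER, TOY_MIDLETTER, TOY_LETTER);
}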
</code></pre><p>Since the entire matching logic uses <span style="font-family:monospace;">switch</span> statements, we can trivially handle the operator <span style="font-family:monospace;">|</span> by duplicating every possible case inside of the switch. This solves (1).</p><p>Using (5), we assign a single level of precedence to a block of no-break rules followed by break rules, with the implicit behavior that no-break rules have priority over break rules of the same level. This drastically reduces the number of levels of precedence, enough so that we can pack every possible rule application into a single integer:</p><pre><code>// Word breaks.
// We buffer 3 characters for word breaks.
// Each character gets 3 bits (padded to 4) representing 3 levels of priority.
#define KBTS_WORD_BREAK_BITS(Priority, Position) (((1 &lt;&lt; ((Priority) + 1)) - 1) &lt;&lt; ((Position) * 4))
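</code></pre><p>To make the bit layout concrete, here is a small sketch around the macro above. <span style="font-family:monospace;">ToyResolveWordBreaks</span> is a hypothetical helper, not a kb_text_shape function, and the sketch assumes that breaks and no-breaks live in two separate integers and that a break is reported whenever any of its bits survives the merge.</p><pre><code>#include &lt;stdint.h&gt;

// Same macro as above: sets the bits for levels 0..Priority in the 4-bit slot of Position.
#define KBTS_WORD_BREAK_BITS(Priority, Position) (((1 &lt;&lt; ((Priority) + 1)) - 1) &lt;&lt; ((Position) * 4))

// Worked values:
//   KBTS_WORD_BREAK_BITS(0, 1) == 0x010  (level 0 only, position 1)
//   KBTS_WORD_BREAK_BITS(2, 0) == 0x007  (levels 0..2, position 0)
//   KBTS_WORD_BREAK_BITS(1, 2) == 0x300  (levels 0..1, position 2)

// Hypothetical merge step: a break bit survives only where no no-break bit covers it.
// A no-break at level N therefore cancels breaks at levels 0..N, including its own level.
static uint32_t ToyResolveWordBreaks(uint32_t Breaks, uint32_t Unbreaks)
{
  return Breaks &amp; ~Unbreaks;
}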
</code></pre><p>The single-integer representation allows us to handle precedence for free: a higher level of precedence will also affect all of the lower levels by using bitmasks (this is where the (- 1) comes from in the macro above). Additionally, breaks are accumulated into a different integer than no-breaks, which makes application order truly meaningless. The final results are obtained by <span style="font-family:monospace;">&amp;</span>-ing off the no-break bits from the break bits.</p><p>Finally, (2) is handled using similar tricks to direction breaking. There is no general solution to these; they are all special-cased.</p><p>I have shown examples of word breaking here because they are simpler and more succinct, but the same approach generalizes well to line breaking. Line breaking differentiates between mandatory breaks (also called hard line breaks) and allowed breaks (also called soft line breaks); we solve this with more bitmasking, although you could also use a third integer that you would then merge at the end, just like no-breaks.</p><p>In kb_text_shape, the final code is all in <span style="font-family:monospace;">kbts_BreakAddCodepoint_</span>.</p><hr /><ol><li id="fn1">Depending on their needs, some programs may want to sanitize or <a href="https://unicode.org/reports/tr15/">normalize</a> input, or even change encoding, before segmentation. These passes will not be discussed, as they deal with concerns like security or internal string representations, which don't directly contribute to getting text on the screen.<sup><a rel="footnote" href="#fnr1">Return</a></sup></li><li id="fn2">The annex also describes sentence breaking, but I have yet to find any use for it in any context whatsoever. 
As such, I will not be covering it.<sup><a rel="footnote" href="#fnr2">Return</a></sup></li><li id="fn3">Input set remapping like this is also widely used in parsers in general.<sup><a rel="footnote" href="#fnr3">Return</a></sup></li><li id="fn4">In Unicode parlance, &quot;U+&quot; is the prefix used for codepoints, which are written in hexadecimal. For some reason, they have also decided to pad everything to use at least 4 digits, even though codepoints are 21-bit, so it doesn't even make their representation fixed-width or anything. For all intents and purposes, you can replace it with &quot;0x&quot; in C, which makes &quot;U+000D&quot; simply 0xD.<sup><a rel="footnote" href="#fnr4">Return</a></sup></li></ol></description></item><item><title>The modern text processing pipeline: Overview</title><pubDate>Fri, 20 Jun 2025 00:00:00 +0000</pubDate><guid>https://newroadoldway.com/text1.html</guid><description><p>This blog post is part of a series:</p><ul><li><a href="./text0.html">The modern text processing pipeline: Preface</a></li><li><strong>The modern text processing pipeline: Overview</strong></li><li><a href="./text2.html">The modern text processing pipeline: Segmentation</a></li></ul><p><a href="https://github.com/JimmyLefevre/kb/tree/main?tab=readme-ov-file#kb_text_shapeh">My text segmentation and shaping library was just released</a>. The announcement received a <em>lot</em> of attention, which I am very grateful for, but a certain number of people seemed confused as to what the library was or did, or how they could use it in their own projects. I will try to give a bird's-eye view of the pieces that make up a modern text processing pipeline, and, by the end of the article, you should hopefully have a better idea of where my library fits.</p><h3>Text encoding</h3><p>Before we even get to processing anything, we have to define what we mean by text. 
Back in the TrueType days, some fonts had glyph tables that facilitated certain encodings, like ASCII, UTF-16 or <a href="https://en.wikipedia.org/wiki/Shift_JIS">SHIFT JIS</a>, and, beyond that, all of the font's functionality was specified within the font itself.</p><p>While OpenType fonts reuse the same glyph tables to map text data to glyph indices, shaping requires knowing the Unicode properties for each character we want to display. These properties indicate things like &quot;is this character a punctuation mark?&quot; or &quot;what side does this attaching vowel typically attach from?&quot;. These tables only exist for Unicode and are <a href="https://www.unicode.org/ucd/">published by the Unicode consortium</a>, so, practically speaking, some form of Unicode encoding is required. Harfbuzz and kb_text_shape.h both expect the user to provide Unicode codepoints, with optional UTF-8 input support<sup id="fnr1"><a rel="footnote" href="#fn1">1</a></sup>.</p><h3>Segmentation</h3><p>Text segmentation divides the input text into several substrings called &quot;runs&quot;. <strong>Runs are sequences of characters that share a common <a href="https://www.unicode.org/reports/tr9/">direction</a></strong> (left-to-right or right-to-left) <strong>as well as a common script</strong>. Additionally, a &quot;hard line break&quot;, like an explicit newline or carriage return character, signals the end of a run<sup id="fnr2"><a rel="footnote" href="#fn2">2</a></sup>. Fundamentally, there is nothing more to it than that; at minimum, a program may decide to break its text by direction, script and hard line breaks. Programs are, of course, free to subdivide their text further depending on their needs. A terminal, for instance, might want to try to segment its text by <em>graphemes</em>, which represent individual visible units of text, or soft line breaks for more complex scripts<sup id="fnr3"><a rel="footnote" href="#fn3">3</a></sup>. 
All of these checks require knowing certain properties about each character, <a href="https://www.unicode.org/ucd/">which can be found in the Unicode Character Database</a>. As such, segmentation happens on Unicode codepoints.</p><p>It is worth noting that segmentation only produces <em>bounds</em> between runs. It does not modify the input text in any way. In practice, the user of a segmentation API is free to either convert their entire text buffer to a codepoint buffer in advance, or simply decode their text encoding of choice (like UTF-8) to codepoints on the fly and pass those to the segmenter. The latter, of course, necessitates that the segmentation can be done in a streaming fashion, which some libraries (like kb_text_shape) do provide.</p><h3>Shaping</h3><p>Text shaping is the first step in the text processing pipeline that requires a font file. The runs that are output by segmentation are fed into a font file. First, the Unicode codepoints are mapped to glyph indices, as mentioned in the &quot;Text encoding&quot; section. Then, font features are applied to the glyph indices.</p><p>Substitution features are applied first and produce a new glyph sequence. For instance, a <a href="https://learn.microsoft.com/en-us/typography/opentype/spec/gsub#LS">ligature</a> may consume 2 glyphs and replace them with a singular new glyph. A typical example of a ligature most programmers will be familiar with is replacing a hyphen followed by a greater-than sign, <span style="font-family:monospace;">-&gt;</span>, by a singular arrow glyph. In contrast, a <a href="https://learn.microsoft.com/en-us/typography/opentype/spec/gsub#MS">multiple substitution</a> performs the opposite: it consumes a single glyph and produces several. Once substitution is done, positioning features are applied. Positioning features use the same kind of matching logic as substitution features, but they modify glyph positions instead of glyph indices. 
An OpenType font is full of both of these types of lookups, and is free to use them however the font designer pleases. They constitute most of the work of shaping<sup id="fnr4"><a rel="footnote" href="#fn4">4</a></sup>.</p><p>Why, then, do runs have to have a uniform script and direction? Simply enough, script uniformity is required because <strong>font feature selection for any given run depends on its script</strong>, so mixing two scripts in a single run risks applying the wrong features to the wrong glyphs. Direction uniformity, on the other hand, is required because <strong>mixing directions makes it impossible to reason about the distance between any two glyphs</strong>, so the font's matching logic breaks down. To illustrate this, consider a sentence that mixes left-to-right and right-to-left text:</p><p>دينيس ريتشي فاش كان خدام ف مختبرات بيل، مابين 1972 و 1973</p><p>This sentence starts from the right and is read right-to-left. When &quot;1972&quot; is reached, the first logical character is the digit &quot;1&quot;, which is the leftmost digit! Even though &quot;1&quot; is &quot;next to&quot; the letter &quot;ن&quot;, the visual distance between them is huge, which makes their logical distance meaningless.</p><h3>Rasterization</h3><p>Shaping outputs glyph indices, which can be used to extract a visual description of each glyph. It also outputs glyph positions, so all we have left to do is to transform these visual descriptions into something we can display on a screen.</p><p>Text rasterization is a very well-studied topic, and <a href="https://freetype.org/">many</a> <a href="https://github.com/Chlumsky/msdfgen">solutions</a> <a href="https://sluglibrary.com/">exist</a>. The core problem is the rasterization of curved outlines into a pixel grid, which has been the subject of <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2005/01/p1000-loop.pdf">research papers</a> for a long time. 
I will not go into details here, as the problem space is much larger than the other steps of the pipeline, and existing resources are much better.</p><h3>Layout</h3><p>Text shaping only gives us glyph positions on a single, infinitely long line. While this is fine and sufficient for file names, buttons, and other common UI elements, it is obviously not enough to typeset a document. Text layout is the process of wrapping lines to fit a certain width, and optionally making the text &quot;fit better&quot; through a bunch of adjustments, like justification and hyphenation. These adjustments are typically very dependent on the script being processed. While Latin layout is straightforward enough, Arabic justification, for instance, is <a href="https://research.reading.ac.uk/typoarabic/on-arabic-justification-part-2-software-implementations/">notoriously complex</a>.</p><h3>So, where does [X library] fit?</h3><ul><li><a href="https://github.com/harfbuzz/harfbuzz">Harfbuzz</a> is a shaping library. From <a href="https://harfbuzz.github.io/what-harfbuzz-doesnt-do.html">its own website</a>: <blockquote><p>HarfBuzz won't help you with bidirectionality.</p><p>HarfBuzz won't help you with text that contains different font properties.</p><p>HarfBuzz won't help you with line breaking, hyphenation, or justification.</p></blockquote></li><li><a href="https://github.com/fribidi/fribidi">fribidi</a> is a segmentation library. It only implements direction breaking.</li><li><a href="https://icu.unicode.org/">ICU</a> is a very complex and featureful library. However, it is restricted to offering Unicode functionality, and, as we have seen, text shaping and rasterization both require OpenType functionality. As such, for the purposes of this article, it performs segmentation.</li><li><a href="https://www.gtk.org/docs/architecture/pango">Pango</a> is a high-level do-it-all library. 
It can do all of the steps described in this article.</li><li><a href="https://github.com/JimmyLefevre/kb?tab=readme-ov-file#kb_text_shapeh">kb_text_shape</a> performs both segmentation (through <span style="font-family:monospace;">kbts_Break</span>) and shaping (through <span style="font-family:monospace;">kbts_Shape</span>).</li></ul><hr /><ol><li id="fn1"><p>Harfbuzz also supports UTF-16 and Latin1 input, both of which are easy to translate back to Unicode codepoints.<sup><a rel="footnote" href="#fnr1">Return</a></sup></p></li><li id="fn2"><p>In practice, line breaks are <a href="https://www.unicode.org/reports/tr14/">more complicated</a> to check for than simply looking for singular characters. It does turn out, however, that a newline character (ASCII 10) is a guaranteed hard line break.<sup><a rel="footnote" href="#fnr2">Return</a></sup></p></li><li id="fn3"><p>This strategy of dividing simple scripts by grapheme and complex scripts by soft line breaks is used by the <a href="https://github.com/cmuratori/refterm/blob/8a560d49c23a9945e3d9e497a971f8250190aa4c/refterm_example_terminal.c#L427C13-L427C16">refterm</a> terminal renderer.<sup><a rel="footnote" href="#fnr3">Return</a></sup></p></li><li id="fn4"><p>Complex scripts require special-coded passes that are not part of the font file, but part of the shaper itself. 
However, the kind of work that is performed by these passes is not so fundamentally different from the application of font features, and, as such, is not worth describing in this article.<sup><a rel="footnote" href="#fnr4">Return</a></sup></p></li></ol></description></item><item><title>The modern text processing pipeline: Preface</title><pubDate>Thu, 24 Apr 2025 00:00:00 +0000</pubDate><guid>https://newroadoldway.com/text0.html</guid><description><p>This blog post is part of a series:</p><ul><li><strong>The modern text processing pipeline: Preface</strong></li><li><a href="./text1.html">The modern text processing pipeline: Overview</a></li><li><a href="./text2.html">The modern text processing pipeline: Segmentation</a></li></ul><p>It is easy to find high quality resources on text rendering. It is a well-studied subject, with <a href="https://nothings.org/gamedev/rasterize/">many</a> <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2005/01/p1000-loop.pdf">openly</a> <a href="https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf">documented</a> <a href="https://inria.hal.science/hal-00821839">techniques</a>, and there are many places one can read and ask about their implementations in detail. In contrast, text layout is significantly less documented. The two main and official resources on text layout are the <a href="https://www.unicode.org/standard/standard.html">Unicode standard</a> and the <a href="https://learn.microsoft.com/en-us/typography/opentype/spec/">OpenType specification</a> (along with the OpenType script development specs, which are available on the same website as the main spec), none of which accurately describes the behavior of actual text layout implementations in the wild, let alone addresses the problems that come up when trying to implement one of these systems. 
Since I have been working on text layout for a while now, and I believe such essential and ubiquitous systems should not be opaque, I will try to document the modern text processing pipeline as it exists at the time of writing to the best of my ability.</p><p>This series of articles will be written from an implementer's point of view; I will describe what actually happens in practice, and how that diverges from the official documentation available online in such a way that implementers and maintainers should hopefully spend less time on reverse engineering than I did.</p><p>As is often the case for lesser-known technical subjects, when text layout comes up in online discussions, it is often <a href="https://danluu.com/cocktail-ideas/">misrepresented</a> or <a href="https://danluu.com/sounds-easy/">outright dismissed</a>. Before we discuss any technical details, I would like to address some unfair preconceived notions that I have read again and again.</p><blockquote><p><a href="https://freetype.org/">FreeType</a>/<a href="https://github.com/nothings/stb/blob/master/stb_truetype.h">stb_truetype</a> is all you need.</p></blockquote><p>This is only true for languages that are trivial to display. FreeType and stb_truetype use classic, unextended TrueType fonts, which include a 1:1 character-to-glyph table and a kerning table for adjusting neighboring glyphs. In theory, these features are enough for scripts like Latin, Greek, Cyrillic and Chinese, so it seems like, for known, fixed combinations of source text, pre-selected fonts, and supported languages, one might get away with a simpler setup. 
Not only does this make it impossible to support ligatures of any kind, possibly worsening the typographic quality on scripts that don't need them and definitely making it impossible to support those that do, such as Arabic and Indic scripts, ignoring the OpenType feature set in its entirety also means ignoring <a href="https://web.archive.org/web/20020501015345/http://www.adobe.com/aboutadobe/pressroom/pressreleases/200204/200204opentype.html">the past two decades of font design</a>. Even Latin fonts are likely to make extensive use of OpenType features now.</p><blockquote><p><a href="https://github.com/harfbuzz/harfbuzz">Harfbuzz</a> is all you need.</p></blockquote><p>This is only true for single-line, single-script, single-style text. <a href="https://harfbuzz.github.io/what-harfbuzz-doesnt-do.html">As the Harfbuzz manual itself points out</a>, text shaping is just one step in the text processing pipeline, and, in general, it does not make sense to use it on its own. To give a simple, common example found in everyday use, Arabic and Hebrew often intermix English with another (right-to-left) script, which HarfBuzz does not handle<sup id="fnr1"><a rel="footnote" href="#fn1">1</a></sup>.</p><blockquote><p>This is too hard/complicated, can't I just support simple languages like English?</p></blockquote><p><a href="https://www.ethnologue.com/insights/ethnologue200/">Less than a fifth of the world population speaks English</a>. Of the 20 most spoken languages on Earth, 11 use complex writing systems<sup id="fnr2"><a rel="footnote" href="#fn2">2</a></sup>. India, whose writing systems are particularly complex<sup id="fnr3"><a rel="footnote" href="#fn3">3</a></sup>, is a 1.4B person market. I find it both irresponsible and immature of you, the English-speaking programmer, to restrict access to your software to anyone but those who speak your native tongue just because you think something is too hard. 
Technology should accommodate people, not the other way around, and, on a global scale, people overwhelmingly do not speak English.</p><p>With that said, I certainly agree that text shaping could be much simpler than it is today. OpenType is an old spec that has grown organically for decades to try to accommodate more and more scripts with the same basic mechanisms, and the result is a collection of ad-hoc solutions that, as an implementer, you just have to know about, lest some weird corner case completely break your layout. (I believe this is also why the OpenType script development spec is generally wrong; it most likely hasn't been updated to document all of the organic changes that have happened to Microsoft's implementation over the years.) I have more sympathy for OpenType, and Microsoft's and Apple's implementations thereof, which are actual products used by millions of people every day, than I have for the Unicode consortium, whose only job is to standardize behavior, and whose standards are broadly disconnected from the reality of text processing<sup id="fnr4"><a rel="footnote" href="#fn4">4</a></sup>. We will get to see exactly how this divide manifests itself in future posts.</p><hr /><ol><li id="fn1"><p>HarfBuzz does, however, have an ad-hoc fix for handling digits, which should be displayed left-to-right, within Arabic sentences, which should be read right-to-left. This fix simply exists to make up for bad usage of their API; in the text processing world, the digit sequence is considered an independent run of text, which should be shaped on its own.<sup><a rel="footnote" href="#fnr1">Return</a></sup></p></li><li id="fn2"><p>OpenType considers a writing system &quot;complex&quot; if it requires a different implementation than Latin. 
Of the 20 most spoken languages in the world, the ones using a complex writing system are Hindi, Standard Arabic, Bengali, Indonesian, Urdu, Nigerian Pidgin, Egyptian Arabic, Marathi, Vietnamese, Telugu and Hausa.<sup><a rel="footnote" href="#fnr2">Return</a></sup></p></li><li id="fn3"><p>Most of the script-specific complexity in OpenType comes from Indic scripts. Their implementation has had two revisions, the latter of which became the basis for the Universal Shaping Engine, a generic implementation for every other complex script that had not yet been handled by the standard. Amusingly, the Universal Shaping Engine is <em>still</em> simpler than the Indic shaper.<sup><a rel="footnote" href="#fnr3">Return</a></sup></p></li><li id="fn4"><p>For your reading pleasure, here is some Unicode awesomeness: <a href="https://github.com/harfbuzz/harfbuzz/issues/524#issuecomment-333528227">missing character properties</a>, <a href="https://github.com/harfbuzz/harfbuzz/issues/2116#issuecomment-1166173708">underspecified behavior</a>, <a href="https://github.com/harfbuzz/harfbuzz/issues/2017#issuecomment-785420619">questionable recommended mark ordering</a>, <a href="https://github.com/harfbuzz/harfbuzz/issues/1518#issuecomment-882131242">questionable normalization</a>.<sup><a rel="footnote" href="#fnr4">Return</a></sup></p></li></ol></description></item></channel></rss>