The modern text processing pipeline: Preface

It is easy to find high-quality resources on text rendering. It is a well-studied subject, with many openly documented techniques, and there are many places where one can read and ask about their implementations in detail. In contrast, text layout is significantly less documented. The two main official resources on text layout are the Unicode standard and the OpenType specification (along with the OpenType script development specs, which are available on the same website as the main spec), neither of which accurately describes the behavior of actual text layout implementations in the wild, let alone addresses the problems that come up when trying to implement one of these systems. Since I have been working on text layout for a while now, and I believe such essential and ubiquitous systems should not be opaque, I will try to document, to the best of my ability, the modern text processing pipeline as it exists at the time of writing.

This series of articles is written from an implementer's point of view: I will describe what actually happens in practice, and how that diverges from the official documentation available online, so that implementers and maintainers will hopefully spend less time on reverse engineering than I did.

As is often the case for lesser-known technical subjects, when text layout comes up in online discussions, it tends to be misrepresented or outright dismissed. Before we discuss any technical details, I would like to address some unfair preconceived notions that I have read again and again.

FreeType/stb_truetype is all you need.

This is only true for languages that are trivial to display. FreeType and stb_truetype rely on the features of classic, unextended TrueType fonts, namely a 1:1 character-to-glyph table and a kerning table for adjusting neighboring glyphs. In theory, these features are enough for scripts like Latin, Greek, Cyrillic and Chinese, so for known, fixed combinations of source text, pre-selected fonts, and supported languages, it may seem like one can get away with this simpler setup. However, it makes ligatures of any kind impossible, which may lower the typographic quality of scripts that don't need them and definitely rules out scripts that do, such as Arabic and the Indic scripts. Ignoring the OpenType feature set in its entirety also means ignoring the past two decades of font design: even Latin fonts are likely to make extensive use of OpenType features now.
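
To make the limitation concrete, here is a minimal sketch of what a layout built only on a 1:1 character-to-glyph table and a pair kerning table looks like. The tables (`CMAP`, `ADVANCE`, `KERN`), their values, and the function `naive_layout` are all hypothetical toy data invented for illustration; they do not come from any real font or from the FreeType or stb_truetype APIs.

```python
# Hypothetical toy font tables, invented for illustration only.
CMAP = {"f": 1, "i": 2, "A": 3, "V": 4}      # character -> glyph id (1:1)
ADVANCE = {1: 300, 2: 250, 3: 650, 4: 650}   # glyph id -> advance width
KERN = {(3, 4): -80}                         # (left glyph, right glyph) -> adjustment


def naive_layout(text):
    """Return [(glyph_id, x_position), ...] using only cmap and kerning."""
    # One glyph per character; unknown characters map to glyph 0 (.notdef).
    glyphs = [CMAP.get(ch, 0) for ch in text]
    positions = []
    x = 0
    for index, glyph in enumerate(glyphs):
        if index > 0:
            # Pair kerning can only nudge a glyph relative to its left neighbor.
            x += KERN.get((glyphs[index - 1], glyph), 0)
        positions.append((glyph, x))
        x += ADVANCE.get(glyph, 0)
    return positions
```

Whatever the input, this kind of loop emits exactly one glyph per character, in input order. There is no step at which "f" followed by "i" could be replaced by a single ligature glyph, or at which glyphs could be reordered or substituted based on their context, which is what the scripts mentioned above require.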

HarfBuzz is all you need.

This is only true for single-line, single-script, single-style text. As the HarfBuzz manual itself points out, text shaping is just one step in the text processing pipeline, and, in general, it does not make sense to use it on its own. To give a simple, common example from everyday use, Arabic and Hebrew text often mixes the native right-to-left script with left-to-right English, and HarfBuzz does not handle such mixed-direction text by itself¹.
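
As a rough illustration of the kind of work that has to happen before shaping, the sketch below splits a string into runs of uniform direction, each of which could then be handed to a shaper separately. It is a deliberate simplification and not the Unicode Bidirectional Algorithm: it only looks at strong bidi classes from Python's `unicodedata` module and lets every other character (spaces, punctuation, digits) inherit the direction of the preceding strong character. The function names `strong_direction` and `split_direction_runs` are my own and are assumptions made for this example.

```python
import unicodedata

# Strong bidi classes; everything else is treated as neutral in this sketch.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def strong_direction(ch):
    """Return 'rtl' or 'ltr' for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in RTL_CLASSES:
        return "rtl"
    if bidi_class in LTR_CLASSES:
        return "ltr"
    return None


def split_direction_runs(text, base="ltr"):
    """Split text into (direction, substring) runs of uniform direction.

    Simplification: neutral characters inherit the direction of the
    preceding strong character, or `base` at the start of the text.
    """
    directions = []
    current = base
    for ch in text:
        current = strong_direction(ch) or current
        directions.append(current)

    runs = []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current run at the end of text or on a direction change.
        if i == len(text) or directions[i] != directions[start]:
            runs.append((directions[start], text[start:i]))
            start = i
    return runs
```

For example, calling `split_direction_runs` on an English sentence containing a Hebrew word yields three runs: left-to-right, right-to-left, then left-to-right again. A real pipeline also has to split on script, font, and style boundaries, and then reorder the shaped runs for display.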

This is too hard/complicated, can't I just support simple languages like English?

Less than a fifth of the world's population speaks English. Of the 20 most spoken languages on Earth, 11 use complex writing systems². India, whose writing systems are particularly complex³, is a market of 1.4 billion people. I find it both irresponsible and immature of you, the English-speaking programmer, to restrict access to your software to those who speak your native tongue just because you think something is too hard. Technology should accommodate people, not the other way around, and, on a global scale, people overwhelmingly do not speak English.

With that said, I certainly agree that text shaping could be much simpler than it is today. OpenType is an old spec that has grown organically for decades to try to accommodate more and more scripts with the same basic mechanisms, and the result is a collection of ad-hoc solutions that, as an implementer, you just have to know about, lest some weird corner case completely break your layout. (I believe this is also why the OpenType script development spec is generally wrong; it most likely hasn't been updated to document all of the organic changes that have happened to Microsoft's implementation over the years.) I have more sympathy for OpenType, and for Microsoft's and Apple's implementations thereof, which are actual products used by millions of people every day, than I have for the Unicode consortium, whose only job is to standardize behavior, and whose standards are broadly disconnected from the reality of text processing⁴. We will get to see exactly how this divide manifests itself in future posts.

¹ HarfBuzz does, however, have an ad-hoc fix for handling digits, which should be displayed left-to-right, within Arabic sentences, which should be read right-to-left. This fix simply exists to make up for bad usage of its API; in the text processing world, the digit sequence is considered an independent run of text, which should be shaped on its own.
² OpenType considers a writing system “complex” if it requires a different implementation than Latin. Of the 20 most spoken languages in the world, the ones using a complex writing system are Hindi, Standard Arabic, Bengali, Indonesian, Urdu, Nigerian Pidgin, Egyptian Arabic, Marathi, Vietnamese, Telugu and Hausa.
³ Most of the script-specific complexity in OpenType comes from Indic scripts. Their implementation has had two revisions, the latter of which became the basis for the Universal Shaping Engine, a generic implementation for every other complex script that had not yet been handled by the standard. Amusingly, the Universal Shaping Engine is still simpler than the Indic shaper.
⁴ For your reading pleasure, here is some Unicode awesomeness: missing character properties, underspecified behavior, questionable recommended mark ordering, and questionable normalization.