In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible unit. A single letter or symbol can be a cluster of its own. Other clusters correspond to longer subsequences of the input code points — such as a ligature or conjunct form — and require the shaper to ensure that the cluster is not broken during the shaping process.
A cluster is distinct from a grapheme, which is the smallest functional unit of a writing system or script.
The definitions of the two terms are similar. However, clusters are only relevant for script shaping and glyph layout. In contrast, graphemes are a property of the underlying script, and are of interest when client programs implement orthographic or linguistic functionality.
For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine — even though the two original, underlying letters remain separate graphemes.
HarfBuzz is concerned with clusters, not with graphemes, although client programs using HarfBuzz may still need to work with graphemes for other purposes.
During the shaping process, there are several shaping operations that may merge adjacent characters (for example, when two code points form a ligature or a conjunct form and are replaced by a single glyph) or split one character into several (for example, when decomposing a code point through the ccmp feature). Operations like these alter clusters; HarfBuzz tracks the changes to ensure that no clusters get lost or broken during shaping.
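The effect of merging and splitting on cluster values can be illustrated with a small toy model. This is a sketch, not HarfBuzz code: the `Glyph` class and the `initial_glyphs`, `merge`, and `split` helpers are hypothetical names invented for this example, and glyph names stand in for real glyph IDs. It assumes that each character starts with its own index as its cluster value, that a merged glyph takes the minimum cluster value of its inputs, and that glyphs produced by a split inherit the cluster value of the original.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str     # illustrative glyph name, standing in for a glyph ID
    cluster: int  # cluster value; here, the index of a source character


def initial_glyphs(text):
    """Assumption: before shaping, each character is its own cluster."""
    return [Glyph(ch, i) for i, ch in enumerate(text)]


def merge(glyphs, start, end, name):
    """Replace glyphs[start:end] with one glyph (e.g. a ligature).

    The merged glyph takes the minimum cluster value of its inputs.
    """
    cluster = min(g.cluster for g in glyphs[start:end])
    return glyphs[:start] + [Glyph(name, cluster)] + glyphs[end:]


def split(glyphs, index, names):
    """Replace one glyph with several (e.g. a decomposition).

    Every resulting glyph inherits the original cluster value.
    """
    cluster = glyphs[index].cluster
    pieces = [Glyph(n, cluster) for n in names]
    return glyphs[:index] + pieces + glyphs[index + 1:]


# "afi": the letters f and i merge into a single ligature glyph.
ligature = merge(initial_glyphs("afi"), 1, 3, "fi")
ligature_clusters = [(g.name, g.cluster) for g in ligature]

# U+00E9 (e with acute) is decomposed into a base glyph and a mark glyph.
decomposed = split(initial_glyphs("\u00e9"), 0, ["e", "acute"])
decomposed_clusters = [(g.name, g.cluster) for g in decomposed]
```

In both cases the cluster values still point back to positions in the original text, which is what allows a client to relate output glyphs to input characters after shaping.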
HarfBuzz records cluster information independently from how shaping operations affect the individual glyphs returned in an output buffer. Consequently, a client program using HarfBuzz can utilize the cluster information to implement features such as:
Correctly positioning the cursor within a shaped text run, even when characters have formed ligatures, been composed or decomposed, been reordered, or undergone other shaping operations.
Correctly highlighting a text selection that includes some, but not all, of the characters in a word.
Applying text attributes (such as color or underlining) to part, but not all, of a word.
Generating output document formats (such as PDF) with embedded text that can be fully extracted.
Determining the mapping between input characters and output glyphs, such as which glyphs are ligatures.
Performing line-breaking, justification, and other line-level or paragraph-level operations that must be done after shaping is complete, but which require examining character-level properties.
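As a rough illustration of how a client might use cluster values for the character-to-glyph mapping and cursor positioning tasks listed above, consider the following sketch. It does not call HarfBuzz; `cluster_ranges` and `caret_positions` are hypothetical helpers written for this example. It assumes a left-to-right run in which cluster values are character indices in non-decreasing order, and it divides a cluster's advance evenly among its characters, which is only an approximation of where a caret inside a ligature should go.

```python
def cluster_ranges(clusters, text_length):
    """Group glyphs by cluster value.

    Returns a list of (char_start, char_end, glyph_indices) tuples, where
    the half-open range [char_start, char_end) covers the characters that
    belong to the cluster.
    """
    runs = []
    for glyph_index, cluster in enumerate(clusters):
        if runs and runs[-1][0] == cluster:
            runs[-1][1].append(glyph_index)
        else:
            runs.append((cluster, [glyph_index]))

    ranges = []
    for i, (start, glyph_indices) in enumerate(runs):
        # A cluster ends where the next one begins, or at the end of the text.
        end = runs[i + 1][0] if i + 1 < len(runs) else text_length
        ranges.append((start, end, glyph_indices))
    return ranges


def caret_positions(clusters, advances, text_length):
    """Return the x offset of the caret before each character, plus the end.

    Assumption: the advance of a multi-character cluster is shared equally
    by its characters.
    """
    positions = []
    x = 0.0
    for start, end, glyph_indices in cluster_ranges(clusters, text_length):
        width = sum(advances[g] for g in glyph_indices)
        count = end - start
        for k in range(count):
            positions.append(x + width * k / count)
        x += width
    positions.append(x)
    return positions


# "afi" shaped into two glyphs: "a" (cluster 0) and an "fi" ligature (cluster 1).
ranges = cluster_ranges([0, 1], 3)
carets = caret_positions([0, 1], [500, 600], 3)
```

Here the second cluster covers two characters but only one glyph, which identifies that glyph as a ligature; the caret between "f" and "i" is placed halfway across it. A font may supply more precise ligature caret positions, which this sketch ignores.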