Working with HarfBuzz clusters: HarfBuzz Manual

When you add text to a HarfBuzz buffer, each code point must be assigned a cluster value.

This cluster value is an arbitrary number; HarfBuzz uses it only to distinguish between clusters. Many client programs will use the index of each code point in the input text stream as the cluster value. This is for the sake of convenience; the actual value does not matter.
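
The two most common conventions for initial cluster values can be sketched in a few lines. This is a plain Python illustration, not HarfBuzz API: the function names are hypothetical, and the choice between code point indices and UTF-8 byte offsets is an assumption about what a client might find convenient.

```python
def clusters_by_code_point_index(text):
    """Pair each code point with its index in the text (0, 1, 2, ...)."""
    return [(ord(ch), index) for index, ch in enumerate(text)]


def clusters_by_utf8_offset(text):
    """Pair each code point with the byte offset of its UTF-8 encoding."""
    pairs = []
    offset = 0
    for ch in text:
        pairs.append((ord(ch), offset))
        offset += len(ch.encode("utf-8"))
    return pairs


# U+00E9 takes two bytes in UTF-8, so the second code point starts at byte 2.
print(clusters_by_code_point_index("\u00e9a"))  # [(233, 0), (97, 1)]
print(clusters_by_utf8_offset("\u00e9a"))       # [(233, 0), (97, 2)]
```

Either scheme yields monotonically increasing, distinct values, which is all that the rest of this section relies on.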

Some of the shaping operations performed by HarfBuzz — such as reordering, composition, decomposition, and substitution — may alter the cluster values of some characters. The final cluster values in the buffer at the end of the shaping process will indicate to client programs which subsequences of glyphs represent a cluster and, therefore, must not be separated.

In addition, client programs can query the final cluster values to discern other potentially important information about the glyphs in the output buffer (such as whether or not a ligature was formed).

For example, if the initial sequence of cluster values was:

0,1,2,3,4

and the final sequence of cluster values is:

0,0,3,3

then there are two clusters in the output buffer: the first cluster includes the first two glyphs, and the second cluster includes the third and fourth glyphs. It is also evident that a ligature or conjunct has been formed, because there are fewer glyphs in the output buffer (four) than there were code points in the input buffer (five).
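
The reading of the final cluster values described above can be sketched as follows. This is an illustrative Python helper operating on a plain list of cluster values; the function name is hypothetical and it does not call HarfBuzz.

```python
from itertools import groupby


def group_glyphs_by_cluster(final_clusters):
    """Return (cluster_value, [glyph indices]) for each run of equal values."""
    runs = groupby(enumerate(final_clusters), key=lambda pair: pair[1])
    return [(value, [index for index, _ in run]) for value, run in runs]


initial_clusters = [0, 1, 2, 3, 4]
final_clusters = [0, 0, 3, 3]

print(group_glyphs_by_cluster(final_clusters))  # [(0, [0, 1]), (3, [2, 3])]

# Fewer glyphs than code points hints that a ligature or conjunct was formed.
print(len(final_clusters) < len(initial_clusters))  # True
```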

Although client programs using HarfBuzz are free to assign initial cluster values in any manner they choose to, HarfBuzz does offer some useful guarantees if the cluster values are assigned in a monotonic (either non-decreasing or non-increasing) order.

For buffers in the left-to-right (LTR) or top-to-bottom (TTB) text flow direction, HarfBuzz will preserve the monotonic property: client programs are guaranteed that monotonically increasing initial cluster values will be returned as monotonically increasing final cluster values.

For buffers in the right-to-left (RTL) or bottom-to-top (BTT) text flow direction, the directionality of the buffer itself is reversed for final output as a matter of design. Therefore, HarfBuzz inverts the monotonic property: client programs are guaranteed that monotonically increasing initial cluster values will be returned as monotonically decreasing final cluster values.
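
A client that assigns monotonically increasing initial values could verify the guarantee for each direction with a check like the one below. This is a sketch under that assumption; the direction strings and function names are hypothetical stand-ins, not HarfBuzz identifiers.

```python
def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


def is_non_increasing(values):
    return all(a >= b for a, b in zip(values, values[1:]))


def follows_expected_order(final_clusters, direction):
    """Check final cluster values, assuming increasing initial values."""
    if direction in ("LTR", "TTB"):
        return is_non_decreasing(final_clusters)
    if direction in ("RTL", "BTT"):
        return is_non_increasing(final_clusters)
    raise ValueError("unknown direction: " + direction)


print(follows_expected_order([0, 0, 3, 3], "LTR"))  # True
print(follows_expected_order([3, 3, 0, 0], "RTL"))  # True
```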

Client programs can adjust how HarfBuzz handles clusters during shaping by setting the cluster_level of the buffer. HarfBuzz offers three levels of clustering support for this property:

Level 0 is the default.

The distinguishing feature of level 0 behavior is that, at the beginning of processing the buffer, all code points that are categorized as marks, modifier symbols, or Emoji extended pictographic modifiers, as well as the Zero Width Joiner and Zero Width Non-Joiner code points, are assigned the cluster value of the closest preceding code point from a different category.

In essence, whenever a base character is followed by a mark character or a sequence of mark characters, those marks are reassigned to the same initial cluster value as the base character. This reassignment is referred to as "merging" the affected clusters. This behavior is based on the Grapheme Cluster Boundary specification in Unicode Technical Report 29.

This cluster level is suitable for code that also uses HarfBuzz cluster values as an approximation of Unicode grapheme cluster boundaries.
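
The initial merge can be modeled roughly as shown below. This is a simplified Python sketch, not HarfBuzz's implementation: the `is_continuation` predicate is a deliberately narrow, assumed stand-in (combining marks, skin-tone emoji modifiers, ZWJ, and ZWNJ), and HarfBuzz's own classification is more detailed.

```python
import unicodedata

ZWNJ, ZWJ = 0x200C, 0x200D


def is_continuation(code_point):
    """Assumed approximation of the code points that attach to a base."""
    if code_point in (ZWNJ, ZWJ):
        return True
    if 0x1F3FB <= code_point <= 0x1F3FF:  # emoji skin-tone modifiers
        return True
    return unicodedata.category(chr(code_point)).startswith("M")


def merge_level0(code_points, clusters):
    """Give each continuation the cluster value of its preceding base."""
    merged = list(clusters)
    base_cluster = None
    for index, code_point in enumerate(code_points):
        if is_continuation(code_point) and base_cluster is not None:
            merged[index] = base_cluster
        else:
            base_cluster = merged[index]
    return merged


# "e" + COMBINING ACUTE ACCENT + "x": the accent joins the cluster of "e".
print(merge_level0([0x65, 0x301, 0x78], [0, 1, 2]))  # [0, 0, 2]
```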

Client programs can specify level 0 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES.

Level 1 adjusts the level 0 behavior slightly to produce better results. Level 1 clustering is therefore recommended for code that does not need to remain backward compatible with older HarfBuzz behavior.

Level 1 differs from level 0 by not merging the clusters of marks and other modifier code points with the preceding "base" code point's cluster. By preserving the separate cluster values of these marks and modifier code points, script shapers can perform additional operations that might lead to improved results (for example, coloring mark glyphs differently than their base).

Client programs can specify level 1 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS.

Level 2 differs significantly in how it treats cluster values. In level 2, HarfBuzz never merges clusters.

This difference can be seen most clearly when HarfBuzz processes ligature substitutions and glyph decompositions. In level 0 and level 1, ligatures and glyph decomposition both involve merging clusters; in level 2, neither of these operations triggers a merge.
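
A toy model of a ligature substitution shows where the two behaviors diverge. It rests on two assumptions that are not stated in the text above: that a merged cluster takes the minimum of the incoming cluster values, and that without merging the ligature glyph keeps the cluster value of its first component. The function is a hypothetical Python sketch and ignores the other bookkeeping a real shaper performs.

```python
def ligate(clusters, start, end, merge):
    """Replace glyphs [start, end) with one ligature glyph.

    merge=True models levels 0 and 1 (assumed rule: minimum of the range).
    merge=False models level 2 (assumed rule: first component's value).
    """
    components = clusters[start:end]
    new_value = min(components) if merge else components[0]
    return clusters[:start] + [new_value] + clusters[end:]


# With increasing input values, both models agree.
print(ligate([0, 1, 2, 3, 4], 1, 3, merge=True))   # [0, 1, 3, 4]
print(ligate([0, 1, 2, 3, 4], 1, 3, merge=False))  # [0, 1, 3, 4]

# They differ once the components' values are out of order, for example
# after an earlier reordering step.
print(ligate([0, 2, 1, 3], 1, 3, merge=True))   # [0, 1, 3]
print(ligate([0, 2, 1, 3], 1, 3, merge=False))  # [0, 2, 3]
```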

Client programs can specify level 2 behavior for a buffer by setting its cluster_level to HB_BUFFER_CLUSTER_LEVEL_CHARACTERS.

As mentioned earlier, client programs using HarfBuzz often assign initial cluster values in a buffer by reusing the indices of the code points in the input text. This gives a sequence of cluster values that is monotonically increasing (for example, 0,1,2,3,4).

It is not required that the cluster values in a buffer be monotonically increasing. However, if the initial cluster values in a buffer are monotonic and the buffer is configured to use cluster level 0 or 1, then HarfBuzz guarantees that the final cluster values in the shaped buffer will also be monotonic. No such guarantee is made for cluster level 2.

In levels 0 and 1, HarfBuzz implements the following conceptual model for cluster values:

If the sequence of input cluster values is monotonic, the sequence of output cluster values will also be monotonic.
Each cluster value represents a single cluster.
Each cluster contains one or more glyphs and one or more characters.

In practice, this model offers several benefits. Assuming that the initial cluster values were monotonically increasing and distinct before shaping began, then, in the final output:

All adjacent glyphs having the same final cluster value belong to the same cluster.
Each character belongs to the cluster that has the highest cluster value not larger than its initial cluster value.
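
The second rule amounts to a lookup for the largest final cluster value that does not exceed a character's initial value. A minimal Python sketch, assuming distinct and monotonically increasing initial values as stated above (the function name is hypothetical):

```python
from bisect import bisect_right


def cluster_for_character(initial_cluster, final_clusters):
    """Return the highest final cluster value <= the initial cluster value."""
    values = sorted(set(final_clusters))
    position = bisect_right(values, initial_cluster)
    if position == 0:
        raise ValueError("no final cluster value is small enough")
    return values[position - 1]


initial_clusters = [0, 1, 2, 3, 4]
final_clusters = [0, 0, 3, 3]

# Characters 0, 1 and 2 fall in cluster 0; characters 3 and 4 in cluster 3.
print([cluster_for_character(c, final_clusters) for c in initial_clusters])
# [0, 0, 0, 3, 3]
```

Sorting the final values first means the same lookup works for right-to-left buffers, where the final values arrive in decreasing order.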