Setting buffer properties

Buffers containing input characters still need several properties set before HarfBuzz can shape their text correctly.

Initially, all buffers are set to the HB_BUFFER_CONTENT_TYPE_INVALID content type. After adding text, the buffer should be set to HB_BUFFER_CONTENT_TYPE_UNICODE instead, which indicates that it contains un-shaped input characters. After shaping, the buffer will have the HB_BUFFER_CONTENT_TYPE_GLYPHS content type.

hb_buffer_add_utf8() and the other UTF functions set the content type of their buffer automatically. If you are reusing a buffer, however, you may want to check its state with hb_buffer_get_content_type(buf). If necessary, you can set the content type with

      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);

to prepare for shaping.

Buffers also need to carry information about the script, language, and text direction of their contents. You can set these properties individually:

      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));

However, since these properties are often the same across multiple text runs, you can also save them in an hb_segment_properties_t for reuse:

      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &savedprops);
      hb_buffer_set_segment_properties (buf2, &savedprops);

HarfBuzz also provides getter functions to retrieve a buffer's direction, script, and language properties individually.

HarfBuzz recognizes four text directions in hb_direction_t: left-to-right (HB_DIRECTION_LTR), right-to-left (HB_DIRECTION_RTL), top-to-bottom (HB_DIRECTION_TTB), and bottom-to-top (HB_DIRECTION_BTT). For the script property, HarfBuzz uses identifiers based on the ISO 15924 standard. For languages, HarfBuzz uses tags based on the IETF BCP 47 standard.

Helper functions are provided to convert character strings into the necessary script and language tag types.

Two additional buffer properties to be aware of are the "invisible glyph" and the replacement code point. The replacement code point is inserted into buffer output in place of any invalid code points encountered in the input. By default, it is the Unicode REPLACEMENT CHARACTER code point, U+FFFD "�". You can change this with

      hb_buffer_set_replacement_codepoint(buf, replacement);

passing in the replacement Unicode code point as the replacement parameter.

The invisible glyph replaces all output glyphs that are invisible. By default, the font's glyph for the standard space character U+0020 is used; you can change this (for example, when using a font that provides script-specific spaces) with

      hb_buffer_set_invisible_glyph(buf, replacement_glyph);

Note that the replacement_glyph parameter takes the glyph ID of the replacement you wish to use, not a Unicode code point.

HarfBuzz supports a few additional flags you might want to set on your buffer under certain circumstances. The HB_BUFFER_FLAG_BOT and HB_BUFFER_FLAG_EOT flags tell HarfBuzz that the buffer represents the beginning or end (respectively) of a text element (such as a paragraph or other block). Knowing this allows HarfBuzz to apply certain contextual font features when shaping, such as initial or final variants in connected scripts.

HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES tells HarfBuzz not to hide glyphs with the Default_Ignorable property in Unicode. This property designates control characters and other non-printing code points, such as joiners and variation selectors. Normally HarfBuzz replaces them in the output buffer with zero-width space glyphs (using the "invisible glyph" property discussed above); setting this flag causes them to be printed, which can be helpful for troubleshooting.

Conversely, setting the HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES flag tells HarfBuzz to remove Default_Ignorable glyphs from the output buffer entirely. Finally, setting the HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE flag tells HarfBuzz not to insert the dotted-circle glyph (U+25CC, "◌"), which is normally inserted into buffer output when broken character sequences are encountered (such as combining marks that are not attached to a base character).