.Dd 2021-12-22
.Dt LIBGRAPHEME 7
.Os
.Sh NAME
.Nm libgrapheme
.Nd unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification.
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are made up of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine the codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, the assumption quickly breaks down,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte-level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_next_character_break 3
.Sh STANDARDS
.Nm
is compliant with the Unicode 14.0.0 specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the range 0 to 127, assigns the
number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express all
writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
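.Pp
How a codepoint such as 196 ends up as bytes can be illustrated with a
minimal sketch of a UTF-8 encoder.
The function name
.Fn utf8_encode
is hypothetical and not part of
.Nm ,
which documents its own encoder in
.Xr grapheme_encode_utf8 3 .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of libgrapheme: encode one codepoint as
 * UTF-8 into buf (which must hold at least 4 bytes) and return the
 * number of bytes written, or 0 if cp cannot be encoded (surrogate or
 * beyond 0x10FFFF).
 */
size_t
utf8_encode(uint_least32_t cp, unsigned char *buf)
{
	assert(buf != NULL);

	if (cp < 0x80) {
		/* identical to ASCII: one byte */
		buf[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			/* surrogates are not valid in UTF-8 */
			return 0;
		}
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}

	/* outside of Unicode's code space */
	return 0;
}
```

Under these assumptions, the codepoint 0x41 is encoded as the single
byte 0x41, just like in ASCII, while 0xC4 becomes the two bytes 0xC3 0x84.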
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) was facing a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side-note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence (which can also have the length 1) of codepoints that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adopt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
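.Pp
The mismatch between the byte, codepoint and grapheme cluster levels can
be made concrete with a small sketch.
The helper
.Fn utf8_count_codepoints
is hypothetical, not part of
.Nm ,
and assumes that its input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of libgrapheme: count the codepoints in
 * a NUL-terminated string that is assumed to be valid UTF-8, by
 * counting every byte that is not a continuation byte (10xxxxxx).
 */
size_t
utf8_count_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);

	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80) {
			/* lead byte or ASCII byte: a new codepoint starts */
			n++;
		}
	}

	return n;
}
```

The sequence 0x41 0x308 0x304 is encoded in UTF-8 as the five bytes
0x41 0xCC 0x88 0xCC 0x84.
A byte count thus yields 5 and the helper above yields 3, yet the user
perceives a single character, so neither number is suitable for aligning
text on a grid.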
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
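.Pp
To give a feeling for what a break decision between two codepoints looks
like, consider the following toy sketch.
It is explicitly not an implementation of UAX #29: it knows a single
rule, namely that codepoints from the Combining Diacritical Marks block
(0x300 to 0x36F) continue the current cluster, and it gets many other
cases wrong (line endings, Hangul, emojis, joiners, etc.).
All names are hypothetical and not part of
.Nm ;
the real interface is documented in
.Xr grapheme_is_character_break 3 .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy stand-in for the real algorithm, NOT UAX #29 compliant: it only
 * knows that codepoints from the Combining Diacritical Marks block
 * (0x300 to 0x36F) attach to the preceding codepoint.
 */
static int
toy_is_break(uint_least32_t prev, uint_least32_t next)
{
	(void)prev; /* the real rules inspect both sides and keep state */

	return !(next >= 0x300 && next <= 0x36F);
}

/* Count clusters by counting the breaks between adjacent codepoints. */
size_t
toy_count_clusters(const uint_least32_t *cp, size_t len)
{
	size_t i, n;

	assert(cp != NULL || len == 0);

	if (len == 0) {
		return 0;
	}

	/* a non-empty sequence has at least one cluster */
	for (n = 1, i = 1; i < len; i++) {
		if (toy_is_break(cp[i - 1], cp[i])) {
			n++;
		}
	}

	return n;
}
```

Applied to the sequence 0x41 0x308 0x304, the sketch finds no break and
reports one cluster, which matches the expected result for this
particular input.
The need for the complete ruleset and its lookup tables arises from all
the cases such a shortcut does not cover.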
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt