grapheme cluster utility library

libgrapheme.7 (4754B)

      1 .Dd 2020-10-12
      2 .Dt LIBGRAPHEME 7
      3 .Os
      4 .Sh NAME
      5 .Nm libgrapheme
      6 .Nd grapheme cluster library
      7 .Sh SYNOPSIS
      8 .In grapheme.h
      9 .Sh DESCRIPTION
     10 The
     11 .Nm
     12 library provides functions to properly separate a string into
     13 user-perceived characters
     14 .Dq ( grapheme clusters ,
     15 see
     16 .Sx MOTIVATION )
     17 using the Unicode grapheme cluster breaking algorithm (UAX #29).
     18 .Pp
     19 You can either count the length (in bytes) of the grapheme cluster at
     20 the beginning of a UTF-8-encoded string (see
     21 .Xr grapheme_bytelen 3 )
     22 or determine if a grapheme cluster breaks between two Unicode code
     23 points (see
     24 .Xr grapheme_boundary 3 ) ,
     25 while a safe UTF-8 decoder and encoder are provided for the latter purpose (see
     26 .Xr grapheme_cp_decode 3
     27 and
     28 .Xr grapheme_cp_encode 3 ) .
     29 .Sh SEE ALSO
     30 .Xr grapheme_boundary 3 ,
     31 .Xr grapheme_bytelen 3 ,
     32 .Xr grapheme_cp_decode 3 ,
     33 .Xr grapheme_cp_encode 3
     34 .Sh STANDARDS
     35 .Nm
     36 is compliant with the Unicode 13.0.0 specification.
     37 .Sh MOTIVATION
     38 The idea behind every character encoding scheme like ASCII or Unicode
     39 is to express abstract characters (which can be thought of as shapes
     40 making up a written language). ASCII, for instance, which comprises the
     41 range 0 to 127, assigns the number 65 (0x41) to the abstract character
     42 .Sq A .
     43 This number is called a
     44 .Dq code point ,
     45 and all code points of an encoding make up its so-called
     46 .Dq code space .
     47 .Pp
     48 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
     49 first 128 code points are identical to ASCII's. The additional code
     50 points are needed as Unicode's goal is to express all writing systems
     51 of the world. To give an example, the abstract character
     52 .Sq \[u00C4]
     53 is not expressible in ASCII, given no ASCII code point has been assigned
     54 to it. It can be expressed in Unicode, though, with the code point 196
     55 (0xC4).
     56 .Pp
     57 One may assume that this process is straightforward, but as more and
     58 more code points were assigned to abstract characters, the Unicode
     59 Consortium (which defines the Unicode standard) was facing a problem:
     60 Many (mostly non-European) languages have such a large number of
     61 abstract characters that it would exhaust the available Unicode code
     62 space if one tried to assign a code point to each abstract character. The
     63 solution to that problem is best introduced with an example: Consider
     64 the abstract character
     65 .Sq \[u01DE] ,
     66 which is
     67 .Sq A
     68 with an umlaut and a macron added to it. In this sense, one can consider
     69 .Sq \[u01DE]
     70 as a two-fold modification (namely
     71 .Dq add umlaut
     72 and
     73 .Dq add macron )
     74 of the
     75 .Dq base character
     76 .Sq A .
     77 .Pp
     78 The Unicode Consortium adopted this idea by assigning code points to
     79 modifications. For example, the code point 0x308 represents adding an
     80 umlaut and 0x304 represents adding a macron, and thus, the code point
     81 sequence
     82 .Dq 0x41 0x308 0x304 ,
     83 namely the base character
     84 .Sq A
     85 followed by the umlaut and macron modifiers, represents the abstract
     86 character
     87 .Sq \[u01DE] .
     88 As a side note, the single code point 0x1DE was also assigned to
     89 .Sq \[u01DE] ,
     90 which is a good example of the fact that there can be multiple
     91 representations of a single abstract character in Unicode.
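The two representations can be made concrete by encoding them as UTF-8. The following is a minimal sketch, not part of libgrapheme: the helpers utf8_encode and utf8_encode_seq are hypothetical names introduced only for illustration, and the library's own encoder is documented in grapheme_cp_encode(3).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libgrapheme): encode one code point
 * as UTF-8 into out[] and return the number of bytes written (1 to 4),
 * or 0 for surrogates and values above 0x10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates are not encodable */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Hypothetical helper: encode n code points into out[] (capacity
 * outlen) and return the total number of bytes written, or 0 if a
 * code point is invalid or the buffer is too small.
 */
size_t
utf8_encode_seq(const uint32_t *cp, size_t n, unsigned char *out,
                size_t outlen)
{
	unsigned char buf[4];
	size_t i, k, len, off = 0;

	for (i = 0; i < n; i++) {
		len = utf8_encode(cp[i], buf);
		if (len == 0 || len > outlen - off)
			return 0;
		for (k = 0; k < len; k++)
			out[off + k] = buf[k];
		off += len;
	}
	return off;
}
```

Encoding the single code point 0x1DE yields the two bytes C7 9E, whereas the sequence 0x41 0x308 0x304 yields the five bytes 41 CC 88 CC 84. A byte-wise comparison therefore treats the two representations as different strings, even though both stand for the same abstract character.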
     92 .Pp
     93 Expressing a single abstract character with multiple code points solved
     94 the code space exhaustion problem, and the concept has been greatly
     95 expanded since its first introduction (emojis, joiners, etc.). A sequence
     96 (which can also have the length 1) of code points that belong together
     97 this way and represent an abstract character is called a
     98 .Dq grapheme cluster .
     99 .Pp
    100 In many applications it is necessary to count the number of
    101 user-perceived characters, i.e. grapheme clusters, in a string. A good
    102 example of this is a terminal text editor, which needs to properly align
    103 characters on a grid. This is pretty simple with ASCII strings, where you
    104 just count the number of bytes (as each byte is a code point and each
    105 code point is a grapheme cluster). With Unicode strings, it is a common
    106 mistake to simply adopt the ASCII approach and count the number of code
    107 points. This is wrong, as, for example, the sequence
    108 .Dq 0x41 0x308 0x304 ,
    109 while made up of 3 code points, is a single grapheme cluster and
    110 represents the user-perceived character
    111 .Sq \[u01DE] .
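The mismatch between the three counts can be sketched in a few lines of C. The helper count_code_points below is a hypothetical name, not part of libgrapheme, and it assumes its input is valid UTF-8. It counts code points by skipping UTF-8 continuation bytes, which is the naive approach described above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libgrapheme): count the code points
 * in a NUL-terminated string that is assumed to be valid UTF-8 by
 * counting every byte that is not a continuation byte (10xxxxxx).
 */
size_t
count_code_points(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

For the string "A\xCC\x88\xCC\x84", which is the UTF-8 encoding of 0x41 0x308 0x304, strlen() returns 5 and count_code_points() returns 3, yet a user perceives exactly one character. Neither count is suitable for aligning text on a grid.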
    112 .Pp
    113 The proper way to segment a string into user-perceived characters
    114 is to segment it into its grapheme clusters by applying the Unicode
    115 grapheme cluster breaking algorithm (UAX #29). It is based on a complex
    116 ruleset and lookup tables and determines if a grapheme cluster ends or
    117 is continued between two code points. Libraries like ICU, which also
    118 offer this functionality, are often bloated, incorrect, difficult to
    119 use or not statically linkable. The motivation behind
    120 .Nm
    121 is to make grapheme cluster handling suck less and abide by the UNIX
    122 philosophy.
    123 .Sh AUTHORS
    124 .An Laslo Hunhold Aq Mt