libgrapheme

unicode string library
git clone git://git.suckless.org/libgrapheme
libgrapheme.7 (5643B)


.Dd 2022-08-26
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification.
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are made up of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine the codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides functions to work with them at the byte level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level functions.
.Pp
Every documented function's manual page provides a self-contained
example illustrating its possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode 14.0.0 specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the range 0 to 127, assigns the
number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express all
writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given that no ASCII codepoint has been
assigned to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
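.Pp
As a minimal sketch of how such a codepoint ends up as one or more bytes,
the following C function encodes a single codepoint as UTF-8.
The name
.Fn utf8_encode
is a hypothetical helper chosen for illustration.
It is not part of
.Nm ,
which provides
.Xr grapheme_encode_utf8 3
for this purpose.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libgrapheme): encode the codepoint
 * cp as UTF-8 into buf and return the number of bytes written, or 0
 * if cp is a surrogate or lies outside the Unicode code space.
 */
size_t
utf8_encode(uint_least32_t cp, unsigned char buf[4])
{
	if (cp < 0x80) {
		/* 0xxxxxxx: identical to ASCII */
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		/* 110xxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		/* surrogates cannot be encoded in UTF-8 */
		return 0;
	}
	if (cp < 0x10000) {
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}

	/* beyond 0x10FFFF, the end of the code space */
	return 0;
}
```

Under these assumptions, the codepoint 0x41 is encoded as the single byte
0x41, just like in ASCII, while the codepoint 0xC4 is encoded as the two
bytes 0xC3 0x84.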
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) faced a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
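.Pp
To make the two representations concrete, the following sketch stores
both as codepoint arrays and compares them element by element.
The names
.Va precomposed ,
.Va decomposed
and
.Fn codepoints_equal
are hypothetical and not part of
.Nm .

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* the single codepoint assigned to the abstract character */
const uint_least32_t precomposed[] = { 0x1DE };

/* base character A followed by the umlaut and macron modifiers */
const uint_least32_t decomposed[] = { 0x41, 0x308, 0x304 };

/*
 * Hypothetical helper (not part of libgrapheme): naive comparison that
 * only succeeds if both sequences consist of the same codepoints.
 */
bool
codepoints_equal(const uint_least32_t *a, size_t alen,
                 const uint_least32_t *b, size_t blen)
{
	size_t i;

	if (alen != blen) {
		return false;
	}
	for (i = 0; i < alen; i++) {
		if (a[i] != b[i]) {
			return false;
		}
	}

	return true;
}
```

A naive comparison like this one treats the two sequences as different,
even though both represent the same abstract character.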
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence of codepoints (which can also have the length 1) that belong
together in this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adopt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
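.Pp
The difference between these counts can be illustrated with a small
sketch.
The function
.Fn utf8_count_codepoints
is a hypothetical helper that is not part of
.Nm .
It assumes a well-formed, NUL-terminated UTF-8 string and counts every
byte that is not a continuation byte.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libgrapheme): count the codepoints
 * in a well-formed, NUL-terminated UTF-8 string. Each codepoint
 * contributes exactly one byte that is not a continuation byte
 * (10xxxxxx), so counting those bytes counts the codepoints.
 * No validation is performed.
 */
size_t
utf8_count_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != 0; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80) {
			n++;
		}
	}

	return n;
}
```

For the UTF-8 encoding of the sequence above, namely the bytes
0x41 0xCC 0x88 0xCC 0x84,
.Xr strlen 3
yields 5 and this helper yields 3, while the number of grapheme clusters
is 1.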
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines whether
a grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt dev@frign.de