cat << EOF
.Dd ${MAN_DATE}
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd Unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification in regard to character, word, sentence and
line segmentation and case detection and conversion.
.Pp
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
Additionally, it is a
.Dq freestanding
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library.
This makes it easy to use in bare-metal environments.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_is_lowercase 3 ,
.Xr grapheme_is_lowercase_utf8 3 ,
.Xr grapheme_is_titlecase 3 ,
.Xr grapheme_is_titlecase_utf8 3 ,
.Xr grapheme_is_uppercase 3 ,
.Xr grapheme_is_uppercase_utf8 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_word_break_utf8 3 ,
.Xr grapheme_to_lowercase 3 ,
.Xr grapheme_to_lowercase_utf8 3 ,
.Xr grapheme_to_titlecase 3 ,
.Xr grapheme_to_titlecase_utf8 3 ,
.Xr grapheme_to_uppercase 3 ,
.Xr grapheme_to_uppercase_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode ${UNICODE_VERSION} specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the
range 0 to 127, assigns the number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express
all writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) was facing a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence of codepoints (which can also have the length 1) that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adopt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt dev@frign.de
EOF