libgrapheme is an extremely simple freestanding C99 library providing
utilities for properly handling strings according to the latest
Unicode standard 17.0.0. It offers fully Unicode-compliant

* __grapheme cluster__ (i.e. user-perceived character) __segmentation__
* __word segmentation__
* __sentence segmentation__
* detection of permissible __line break opportunities__
* __case detection__ (lower-, upper- and title-case)
* __case conversion__ (to lower-, upper- and title-case)

on UTF-8 strings and codepoint arrays, both of which can also be
NUL-terminated.

The necessary lookup-tables are automatically generated from the Unicode
standard data (contained in the tarball) and heavily compressed. Over
10,000 automatically generated conformance tests and over 150 unit tests
ensure conformance and correctness.

There is no complicated build-system involved and it's all done using
one POSIX-compliant Makefile. All you need is a C99 compiler, since
the lookup-table generators and compressors, which are only run at
build-time, are also written in C99.
The resulting library is freestanding and thus does not even depend on a
standard library being present at runtime, making it a suitable choice
for bare metal applications.

It is also way smaller and much faster than the other established Unicode
string libraries (ICU, GNU's libunistring, libutf8proc).

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

	git clone https://git.suckless.org/libgrapheme

Download
--------
libgrapheme follows the [semantic versioning](https://semver.org/) scheme.

* [libgrapheme-3.0.0](//dl.suckless.org/libgrapheme/libgrapheme-3.0.0.tar.gz) (2025-12-24)
* [libgrapheme-2.0.2](//dl.suckless.org/libgrapheme/libgrapheme-2.0.2.tar.gz) (2022-11-02)
* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.0.0.tar.gz) (2021-12-22)


Getting Started
---------------
Configuring and installing libgrapheme via

	./configure
	make install

installs the header grapheme.h and both the static library
libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in
the respective folders. The conformance and unit tests can be run with

	make test

and comparative benchmarks against libutf8proc (which is the only Unicode
library compliant enough to allow a meaningful comparison) can be run with

	make benchmark

You can access the manual [here](man/) or via libgrapheme(7) by typing

	man libgrapheme

and looking at the referenced pages, e.g.
[grapheme\_next\_character\_break_utf8(3)](man/grapheme_next_character_break_utf8.3/).
Each page contains code examples and an extensive description. To give
one example that is also given in the manuals, the following code
separates a given string 'Tëst 👨‍👩‍👦 🇺🇸 नी நி!'
into its user-perceived characters:

	#include <grapheme.h>
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* UTF-8 encoded input */
		char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
		          "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
		          "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
		          "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
		size_t ret, len, off;

		printf("Input: \"%s\"\n", s);

		/* print each grapheme cluster with byte-length */
		printf("grapheme clusters in NUL-delimited input:\n");
		for (off = 0; s[off] != '\0'; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}
		printf("\n");

		/* do the same, but this time string is length-delimited */
		len = 17;
		printf("grapheme clusters in input delimited to %zu bytes:\n", len);
		for (off = 0; off < len; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, len - off);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}

		return 0;
	}

This code can be compiled with

	cc (-static) -o example example.c -lgrapheme

and the output is

	Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
	grapheme clusters in NUL-delimited input:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	18 bytes | 👨‍👩‍👦
	 1 bytes |  
	 8 bytes | 🇺🇸
	 1 bytes |  
	 6 bytes | नी
	 1 bytes |  
	 6 bytes | நி
	 1 bytes | !

	grapheme clusters in input delimited to 17 bytes:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	11 bytes | 👨‍👩

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated, overscoped and/or
incorrect solutions for Unicode string handling (ICU, GNU's
libunistring, libutf8proc, etc.), motivating more hackers to properly
handle Unicode strings in their projects and allowing this even in
embedded applications.

The problem can be easily seen when looking at the sizes of the respective
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring
(libunistring.a) is around 2MB, which is unacceptable for static
linking. Both take many minutes to compile even on a good computer and
require a lot of dependencies, including Python for ICU. On
the other hand libgrapheme (libgrapheme.a) only weighs in at around 300K
and is compiled (including Unicode data parsing and compression) in
under a second, requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc and libunistring, are incorrect by
basing their API on assumptions that haven't been true for years
(e.g. offering stateless grapheme cluster segmentation even though the
underlying algorithm is not stateless). As an additional factor,
libutf8proc's UTF-8-decoder is unsafe, as it allows overlong encodings
that can be easily used for exploits.

While ICU and libunistring offer a lot of functions and the weight mostly
comes from locale-data provided by the Unicode standard, which is applied
implementation-specifically (!)
for some things, the same standard always
defines a sane 'default' behaviour as an alternative in such cases that
is satisfying in 99% of the cases and which you can rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine when a word begins and ends. The
defaults provided by the standard, though, already do a great job
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Author
------
* Laslo Hunhold (dev@frign.de)

Please contact me if you have information that could be added to this page.