![libgrapheme](libgrapheme.svg)

libgrapheme is an extremely simple freestanding C99 library providing
utilities for properly handling strings according to the latest
Unicode standard 17.0.0. It offers fully Unicode compliant

* __grapheme cluster__ (i.e. user-perceived character) __segmentation__
* __word segmentation__
* __sentence segmentation__
* detection of permissible __line break opportunities__
* __case detection__ (lower-, upper- and title-case)
* __case conversion__ (to lower-, upper- and title-case)

on UTF-8 strings and codepoint arrays, both of which can also be
NUL-terminated.

The necessary lookup tables are automatically generated from the Unicode
standard data (contained in the tarball) and heavily compressed. Over
10,000 automatically generated conformance tests and over 150 unit tests
ensure conformance and correctness.

There is no complicated build system involved; everything is done using
one POSIX-compliant Makefile. All you need is a C99 compiler, given that
the lookup-table generators and compressors, which are only run at
build time, are also written in C99.
The resulting library is freestanding and thus not even dependent on a
standard library being present at runtime, making it a suitable choice
for bare-metal applications.

It is also much smaller and faster than the other established Unicode
string libraries (ICU, GNU's libunistring, libutf8proc).

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

	git clone https://git.suckless.org/libgrapheme

Download
--------
libgrapheme follows the [semantic versioning](https://semver.org/) scheme.

* [libgrapheme-3.0.0](//dl.suckless.org/libgrapheme/libgrapheme-3.0.0.tar.gz) (2025-12-24)
* [libgrapheme-2.0.2](//dl.suckless.org/libgrapheme/libgrapheme-2.0.2.tar.gz) (2022-11-02)
* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.0.0.tar.gz) (2021-12-22)

Getting Started
---------------
Configuring and installing libgrapheme via

	./configure
	make install

installs the header grapheme.h and both the static library
libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in
the respective folders. The conformance and unit tests can be run with

	make test

and comparative benchmarks against libutf8proc (the only other Unicode
library compliant enough to allow a meaningful comparison) can be run with

	make benchmark

You can access the manual [here](man/) or via libgrapheme(7) by typing

	man libgrapheme

and looking at the pages referenced there, e.g.
[grapheme\_next\_character\_break\_utf8(3)](man/grapheme_next_character_break_utf8.3/).
Each page contains code examples and an extensive description. To give
one example that is also given in the manuals, the following code
separates the string 'Tëst 👨‍👩‍👦 🇺🇸 नी நி!'
into its user-perceived characters:

	#include <grapheme.h>
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* UTF-8 encoded input */
		char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
		          "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
		          "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
		          "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
		size_t ret, len, off;

		printf("Input: \"%s\"\n", s);

		/* print each grapheme cluster with byte-length */
		printf("grapheme clusters in NUL-delimited input:\n");
		for (off = 0; s[off] != '\0'; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}
		printf("\n");

		/* do the same, but this time string is length-delimited */
		len = 17;
		printf("grapheme clusters in input delimited to %zu bytes:\n", len);
		for (off = 0; off < len; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, len - off);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}

		return 0;
	}

This code can be compiled with

	cc -o example example.c -lgrapheme

(add -static to link against the static library) and the output is

	Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
	grapheme clusters in NUL-delimited input:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	18 bytes | 👨‍👩‍👦
	 1 bytes |  
	 8 bytes | 🇺🇸
	 1 bytes |  
	 6 bytes | नी
	 1 bytes |  
	 6 bytes | நி
	 1 bytes | !

	grapheme clusters in input delimited to 17 bytes:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	11 bytes | 👨‍👩

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated, overscoped and/or
incorrect solutions for Unicode string handling (ICU, GNU's
libunistring, libutf8proc, etc.), motivating more hackers to properly
handle Unicode strings in their projects and allowing this even in
embedded applications.

The problem can easily be seen when looking at the sizes of the respective
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring
(libunistring.a) is around 2MB. Both take many minutes to compile even on
a good computer, and ICU depends on Python, among others. On the other hand,
libgrapheme (libgrapheme.a) only weighs in at around 400K and is compiled
(including Unicode data parsing and compression) in under a second,
requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc, are incorrect because their API is based
on assumptions that haven't been true for years (e.g. offering stateless
grapheme cluster segmentation even though the underlying algorithm is
not stateless). In addition, libutf8proc's UTF-8 decoder is unsafe, as
it allows overlong encodings that can easily be used for exploits.
While libunistring has expanded its API, offering e.g.
u8\_grapheme\_next() and u8\_grapheme\_prev(), which are standard-conformant,
its API still contains functions that are not explicitly deprecated and
assume an older data model, for instance uc\_is\_grapheme\_break().

ICU and libunistring offer a lot of functions, and most of their weight
comes from locale data provided by the Unicode standard, which is applied
implementation-specifically (!) for some things. For such cases, however,
the same standard always defines a sane 'default' behaviour that is
satisfactory in 99% of cases and that you can rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine where a word begins and ends. The
defaults provided by the standard, though, already do a great job
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Author
------
* Laslo Hunhold (dev@frign.de)

Please contact me if you have information that could be added to this page.