![libgrapheme](libgrapheme.svg)

libgrapheme is an extremely simple freestanding C99 library providing
utilities for properly handling strings according to the latest
Unicode standard 15.0.0. It offers fully Unicode-compliant

* __grapheme cluster__ (i.e. user-perceived character) __segmentation__
* __word segmentation__
* __sentence segmentation__
* detection of permissible __line break opportunities__
* __case detection__ (lower-, upper- and title-case)
* __case conversion__ (to lower-, upper- and title-case)

on UTF-8 strings and codepoint arrays, both of which can also be
NUL-terminated.

The necessary lookup-tables are automatically generated from the Unicode
standard data (contained in the tarball) and heavily compressed. Over
10,000 automatically generated conformance tests and over 150 unit tests
ensure conformance and correctness.

There is no complicated build-system involved and it's all done using
one POSIX-compliant Makefile. All you need is a C99 compiler, given that
the lookup-table-generators and compressors, which are only run at
build-time, are also written in C99.
The resulting library is freestanding and thus does not even depend on a
standard library being present at runtime, making it a suitable choice
for bare metal applications.

It is also way smaller and much faster than the other established Unicode
string libraries (ICU, GNU's libunistring, libutf8proc).

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

	git clone https://git.suckless.org/libgrapheme

Download
--------
libgrapheme follows the [semantic versioning](https://semver.org/) scheme.

* [libgrapheme-2.0.2](//dl.suckless.org/libgrapheme/libgrapheme-2.0.2.tar.gz) (2022-11-02)
* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.0.0.tar.gz) (2021-12-22)


Getting Started
---------------
Automatically configuring and installing libgrapheme via

	./configure
	make install

will install the header grapheme.h and both the static library
libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in
the respective directories. The conformance and unit tests can be run with

	make test

and comparative benchmarks against libutf8proc (the only other Unicode
library compliant enough to allow a meaningful comparison) can be run with

	make benchmark

You can access the manual [here](man/) or via libgrapheme(7) by typing

	man libgrapheme

and following the pages it refers to, e.g.
[grapheme\_next\_character\_break\_utf8(3)](man/grapheme_next_character_break_utf8.3/).
Each page contains code examples and an extensive description. As one
example, which is also given in the manuals, the following code
separates the string 'Tëst 👨‍👩‍👦 🇺🇸 नी நி!'
into its user-perceived characters:

	#include <grapheme.h>
	#include <stdint.h>
	#include <stdio.h>
	
	int
	main(void)
	{
		/* UTF-8 encoded input (string literals are read-only) */
		const char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
		                "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
		                "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
		                "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
		size_t ret, len, off;
	
		printf("Input: \"%s\"\n", s);
	
		/* print each grapheme cluster with byte-length */
		printf("grapheme clusters in NUL-delimited input:\n");
		for (off = 0; s[off] != '\0'; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}
		printf("\n");
	
		/* do the same, but this time string is length-delimited */
		len = 17;
		printf("grapheme clusters in input delimited to %zu bytes:\n", len);
		for (off = 0; off < len; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off, len - off);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}
	
		return 0;
	}

This code can be compiled (optionally passing -static before -o to link
statically) with

	cc -o example example.c -lgrapheme

and the output is

	Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
	grapheme clusters in NUL-delimited input:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	18 bytes | 👨‍👩‍👦
	 1 bytes |  
	 8 bytes | 🇺🇸
	 1 bytes |  
	 6 bytes | नी
	 1 bytes |  
	 6 bytes | நி
	 1 bytes | !
	
	grapheme clusters in input delimited to 17 bytes:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	11 bytes | 👨‍👩
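
The byte counts in this output follow directly from how many bytes UTF-8
needs per codepoint. As a minimal sketch (the helpers `utf8_len` and
`utf8_total_len` are hypothetical names made up for this illustration and
are not part of libgrapheme), the arithmetic can be reproduced like this.
It also shows why limiting the input to 17 bytes cuts the family emoji
short: the first five clusters take 6 bytes, leaving 11 bytes, which is
exactly man, zero-width joiner and woman.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes UTF-8 needs to encode the
 * codepoint cp, or 0 if cp lies beyond U+10FFFF. Surrogates are not
 * treated specially in this sketch.
 */
static size_t
utf8_len(uint_least32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp < 0x10000)
		return 3;
	if (cp <= 0x10FFFF)
		return 4;
	return 0;
}

/* Hypothetical helper: summed UTF-8 byte length of n codepoints. */
static size_t
utf8_total_len(const uint_least32_t *cp, size_t n)
{
	size_t i, sum = 0;

	for (i = 0; i < n; i++)
		sum += utf8_len(cp[i]);

	return sum;
}

/*
 * Man, ZWJ, woman, ZWJ, boy (U+1F468 U+200D U+1F469 U+200D U+1F466)
 * needs 4 + 3 + 4 + 3 + 4 = 18 bytes; its first three codepoints
 * alone need 4 + 3 + 4 = 11 bytes. The flag (U+1F1FA U+1F1F8) needs
 * 4 + 4 = 8 bytes.
 */
```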

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated, overscoped and/or
incorrect solutions for Unicode string handling (ICU, GNU's
libunistring, libutf8proc, etc.), motivating more hackers to properly
handle Unicode strings in their projects and allowing this even in
embedded applications.

The problem can be easily seen when looking at the sizes of the respective
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring
(libunistring.a) is around 2MB, which is unacceptable for static
linking. Both take many minutes to compile even on a good computer and
require a lot of dependencies, including Python for ICU. On
the other hand, libgrapheme (libgrapheme.a) only weighs in at around 300K
and is compiled (including Unicode data parsing and compression) in
under a second, requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc and libunistring, are incorrect because
they base their API on assumptions that haven't been true for years
(e.g. offering stateless grapheme cluster segmentation even though the
underlying algorithm is not stateless). In addition,
libutf8proc's UTF-8-decoder is unsafe, as it allows overlong encodings
that can be easily used for exploits.
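
An overlong encoding stores a codepoint in more bytes than necessary,
e.g. '/' (U+002F) as the two bytes 0xC0 0xAF instead of the single byte
0x2F, which lets it slip past checks that only look for the short form.
The following is a minimal sketch of a decoder that rejects such
sequences. It is not the actual code of libgrapheme or libutf8proc: the
name `strict_decode_utf8`, the use of U+FFFD as the error value and the
choice to consume one byte on error are all assumptions made for this
example.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_CP UINT32_C(0xFFFD) /* assumed error value for this sketch */

/*
 * Hypothetical helper: decode one codepoint from at most len bytes at
 * s, store it in *cp and return the number of bytes consumed. On
 * malformed input *cp is set to INVALID_CP and one byte is consumed
 * (zero bytes if len is 0).
 */
static size_t
strict_decode_utf8(const unsigned char *s, size_t len, uint_least32_t *cp)
{
	/* smallest codepoint that legitimately needs 1, 2, 3 or 4 bytes */
	static const uint_least32_t mincp[] = { 0x0, 0x80, 0x800, 0x10000 };
	uint_least32_t c;
	size_t n, i;

	*cp = INVALID_CP;
	if (len == 0)
		return 0;

	/* the lead byte determines the sequence length and the top bits */
	if (s[0] < 0x80) {
		n = 1;
		c = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		c = s[0] & 0x07;
	} else {
		return 1; /* stray continuation byte or invalid lead byte */
	}
	if (n > len)
		return 1; /* sequence is cut off by the length limit */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 1; /* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	/* reject overlong forms, surrogates and values beyond U+10FFFF */
	if (c < mincp[n - 1] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
		return 1;

	*cp = c;
	return n;
}

/*
 * With this check the overlong pair 0xC0 0xAF decodes to INVALID_CP
 * (one byte consumed), while the plain byte 0x2F decodes to U+002F.
 */
```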

ICU and libunistring offer a lot of functions, and most of their weight
comes from locale-data provided by the Unicode standard, which is applied
implementation-specifically (!) for some things. The same standard,
however, always defines a sane 'default' behaviour as an alternative in
such cases, which is satisfying in 99% of the cases and which you can
rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine where a word begins and ends. The
defaults provided by the standard, though, already do a great job
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Author
------
* Laslo Hunhold (dev@frign.de)

Please contact me if you have information that could be added to this page.