![libgrapheme](libgrapheme.svg)

libgrapheme is an extremely simple C99 library providing utilities for
properly handling Unicode strings made up of user-perceived characters
('grapheme clusters') according to the Unicode standard. While it provides
convenience functions to operate on UTF-8-encoded strings, libgrapheme
can also be used with any other encoding.

The necessary lookup-tables and test-data are automatically generated
from the Unicode standard data, so the library is both correct and
validated against the standard's own test cases. A specialized
'Heisenstate' state-handling, combined with O(log(n)) binary search on
the lookup-tables and data-recycling, provides great
processing-performance in the order of millions of codepoints per
second.

There is no complicated build-system involved: everything is done using
one POSIX-compliant Makefile. All you need is a C99 compiler, because
the data-generators are also written in C99.

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated and overscoped solutions
for Unicode string handling (ICU, GNU's libunistring, etc.), motivating
more hackers to properly handle Unicode strings in their projects and
allowing this even in embedded applications.

The problem can be easily seen when looking at the sizes of the respective
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring
(libunistring.a) is around 2MB, which is unacceptable for static
linking. Both take many minutes to compile even on a good computer and
require a lot of dependencies, including Python for ICU. On
the other hand, libgrapheme (libgrapheme.a) only weighs in at around 40KB
and is compiled (including Unicode data parsing) in a fraction of a
second, requiring nothing but a C99 compiler and make(1).

ICU and libunistring offer a lot of functions, and most of their weight
comes from locale-data provided by the Unicode standard, which is applied
implementation-specifically (!) for some things. In such cases, however,
the same standard always defines a sane 'default' behaviour as an
alternative that is satisfying in 99% of the cases and which you can
rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine where a word begins and ends. The
defaults provided by the standard, though, already do a good job
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Handling user-perceived characters is not locale-dependent, though, and
does not require locale-data.

Getting Started
---------------
Installing libgrapheme installs the header grapheme.h and both the
static library libgrapheme.a and the dynamic library libgrapheme.so in
their respective folders. Access the manual under libgrapheme(7) by typing

	man libgrapheme

and looking at the pages it refers to, e.g. grapheme\_next\_character\_break(3).
Each page contains code-examples and an extensive description. To give
one example, which is also found in the manuals, the following code
separates a given string 'Tëst 👨‍👩‍👦 🇺🇸 नी நி!'
into its user-perceived characters:

	#include <grapheme.h>
	#include <stdint.h>
	#include <stdio.h>
	
	int
	main(void)
	{
		/* UTF-8 encoded input */
		const char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
		                "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
		                "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
		                "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
		size_t ret, off;
	
		printf("Input: \"%s\"\n", s);
	
		/* advance by the byte-length of each user-perceived character */
		for (off = 0; s[off] != '\0'; off += ret) {
			ret = grapheme_next_character_break(s + off, SIZE_MAX);
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
		}
	
		return 0;
	}

This code can be compiled with the following command (add -static to
link against libgrapheme.a statically)

	cc -o example example.c -lgrapheme

and the output is

	Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	18 bytes | 👨‍👩‍👦
	 1 bytes |  
	 8 bytes | 🇺🇸
	 1 bytes |  
	 6 bytes | नी
	 1 bytes |  
	 6 bytes | நி
	 1 bytes | !

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

	git clone https://git.suckless.org/libgrapheme

Download
--------
* [libgrapheme-1](//dl.suckless.org/libgrapheme/libgrapheme-1.tar.gz) (2021-12-22)

Author
------
* Laslo Hunhold (dev@frign.de)

Please contact me if you find information that could be added to this page.