sites

public wiki contents of suckless.org
git clone git://git.suckless.org/sites

commit c0322961a34af28595d3f6e21f92d5af3313063e
parent 97acbace106b469625f5c6a9363f5ddbe49199d6
Author: Laslo Hunhold <dev@frign.de>
Date:   Thu,  6 Oct 2022 22:08:10 +0200

Update libgrapheme-page and add manuals

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diffstat:
M libs.suckless.org/libgrapheme/index.md | 135 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
A libs.suckless.org/libgrapheme/man/grapheme_decode_utf8\(3\)/index.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_encode_utf8\(3\)/index.md | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_character_break\(3\)/index.md | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_lowercase\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_lowercase_utf8\(3\)/index.md | 38 ++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_titlecase\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_titlecase_utf8\(3\)/index.md | 38 ++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_uppercase\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_is_uppercase_utf8\(3\)/index.md | 38 ++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_character_break\(3\)/index.md | 42 ++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_character_break_utf8\(3\)/index.md | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_line_break\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_line_break_utf8\(3\)/index.md | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break\(3\)/index.md | 40 ++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break_utf8\(3\)/index.md | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_word_break\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_next_word_break_utf8\(3\)/index.md | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_lowercase\(3\)/index.md | 40 ++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_lowercase_utf8\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_titlecase\(3\)/index.md | 40 ++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_titlecase_utf8\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_uppercase\(3\)/index.md | 40 ++++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/grapheme_to_uppercase_utf8\(3\)/index.md | 39 +++++++++++++++++++++++++++++++++++++++
A libs.suckless.org/libgrapheme/man/libgrapheme\(7\)/index.md | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
25 files changed, 1372 insertions(+), 53 deletions(-)

diff --git a/libs.suckless.org/libgrapheme/index.md b/libs.suckless.org/libgrapheme/index.md @@ -1,60 +1,61 @@ ![libgrapheme](libgrapheme.svg) -libgrapheme is an extremely simple C99 library providing utilities for -properly handling Unicode strings made up of user-perceived characters -('grapheme clusters') according to the Unicode standard. While providing -convenience functions to operate on UTF-8-encoded strings, you can also -use libgrapheme for any other encoding as well. - -The necessary lookup-tables and test-data are automatically generated -from the Unicode standard data, ensuring correctness and validation. -A specialized 'Heisenstate' state-handling combined with -O(log(n))-binary-search on the lookup-tables and data-recycling provides -great processing-performance in the order of millions of codepoints per -second. +libgrapheme is an extremely simple freestanding C99 library providing +utilities for properly handling strings according to the latest +Unicode standard 15.0.0. It offers fully Unicode compliant + +* __grapheme cluster__ (i.e. user-perceived character) __segmentation__ +* __word segmentation__ +* __sentence segmentation__ +* detection of permissible __line break opportunities__ +* __case detection__ (lower-, upper- and title-case) +* __case conversion__ (to lower-, upper- and title-case) + +on UTF-8 strings and codepoint arrays, which both can also be +null-terminated. + +The necessary lookup-tables are automatically generated from the Unicode +standard data (contained in the tarball) and heavily compressed. Over +10,000 automatically generated conformance tests and over 150 unit tests +ensure conformance and correctness. There is no complicated build-system involved and it's all done using -one POSIX-compliant Makefile. All you need is a C99 compiler, because -the data-generators are also written in C99. +one POSIX-compliant Makefile. 
All you need is a C99 compiler, given +the lookup-table-generators and compressors are also written in C99. +The resulting library is freestanding and thus not even dependent on a +standard library to be present at runtime. -Motivation ----------- -The goal of this project is to be a suckless and statically linkable -alternative to the existing bloated, complicated and overscoped solutions -for Unicode string handling (ICU, GNU's libunistring, etc.), motivating -more hackers to properly handle Unicode strings in their projects and -allowing this even in embedded applications. +Development +----------- +You can [browse](//git.suckless.org/libgrapheme) the source code +repository or get a copy with the following command: -The problem can be easily seen when looking at the sizes of the respective -libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a, -libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring -(libunistring.a) is around 2MB, which is unacceptable for static -linking. Both take many minutes to compile even on a good computer and -require a lot of dependencies, including Python for ICU. On -the other hand libgrapheme (libgrapheme.a) only weighs in at around 40K -and is compiled (including Unicode data parsing) in fractions of a -second, requiring nothing but a C99 compiler and make(1). + git clone https://git.suckless.org/libgrapheme -While ICU and libunistring offer a lot of functions and the weight mostly -comes from locale-data provided by the Unicode standard, which is applied -implementation-specifically (!) for some things, the same standard always -defines a sane 'default' behaviour as an alternative in such cases that -is satisfying in 99% of the cases and which you can rely on. +Download +-------- +libgrapheme follows the semantic versioning scheme. -For some languages, for instance, it is necessary to have a dictionary -on hand to always accurately determine when a word begins and ends. 
The -defaults provided by the standard, though, already do a good job -respecting the language's boundaries in the general case and are not too -taxing in terms of performance. +* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.tar.gz) (2021-12-22) -Handling user-perceived characters is not locale-dependent, though, and -does not require locale-data. Getting Started --------------- -Installing libgrapheme will install the header grapheme.h and both the -static library libgrapheme.a and the dynamic library libgrapheme.so in -the respective folders. Access the manual under libgrapheme(7) by typing +Installing libgrapheme via + + make install + +will install the header grapheme.h and both the static library +libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in +the respective folders. The conformance and unit tests can be run with + + make test + +and comparative benchmarks against libutf8proc can be run with + + make benchmark + +You can access the manual via libgrapheme(7) by typing man libgrapheme @@ -109,16 +110,44 @@ and the output is 6 bytes | நி 1 bytes | ! -Development ------------ -You can [browse](//git.suckless.org/libgrapheme) the source code -repository or get a copy with the following command: - git clone https://git.suckless.org/libgrapheme +Motivation +---------- +The goal of this project is to be a suckless and statically linkable +alternative to the existing bloated, complicated, overscoped and/or +incorrect solutions for Unicode string handling (ICU, GNU's +libunistring, libutf8proc, etc.), motivating more hackers to properly +handle Unicode strings in their projects and allowing this even in +embedded applications. 
-Download --------- -* [libgrapheme-1](//dl.suckless.org/libgrapheme/libgrapheme-1.tar.gz) (2021-12-22) +The problem can be easily seen when looking at the sizes of the respective +libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a, +libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring +(libunistring.a) is around 2MB, which is unacceptable for static +linking. Both take many minutes to compile even on a good computer and +require a lot of dependencies, including Python for ICU. On +the other hand libgrapheme (libgrapheme.a) only weighs in at around 300K +and is compiled (including Unicode data parsing and compression) in +under a second, requiring nothing but a C99 compiler and POSIX make(1). + +Some libraries, like libutf8proc and libunistring, are incorrect by +basing their API on assumptions that haven't been true for years +(e.g. offering stateless grapheme cluster segmentation even though the +underlying algorithm is not stateless). As an additional factor, +libutf8proc's UTF-8-decoder is unsafe, as it allows overlong encodings +that can be easily used for exploits. + +While ICU and libunistring offer a lot of functions and the weight mostly +comes from locale-data provided by the Unicode standard, which is applied +implementation-specifically (!) for some things, the same standard always +defines a sane 'default' behaviour as an alternative in such cases that +is satisfying in 99% of the cases and which you can rely on. + +For some languages, for instance, it is necessary to have a dictionary +on hand to always accurately determine when a word begins and ends. The +defaults provided by the standard, though, already do a great job +respecting the language's boundaries in the general case and are not too +taxing in terms of performance. 
Author ------ diff --git a/libs.suckless.org/libgrapheme/man/grapheme_decode_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_decode_utf8\(3\)/index.md @@ -0,0 +1,80 @@ + GRAPHEME_DECODE_UTF8(3) Library Functions Manual GRAPHEME_DECODE_UTF8(3) + + NAME + grapheme_decode_utf8 – decode first codepoint in UTF-8-encoded string + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_decode_utf8(const char *str, size_t len, uint_least32_t *cp); + + DESCRIPTION + The grapheme_decode_utf8() function decodes the first codepoint in the + UTF-8-encoded string str of length len. If the UTF-8-sequence is invalid + (overlong encoding, unexpected byte, string ends unexpectedly, empty + string, etc.) the decoding is stopped at the last processed byte and the + decoded codepoint set to GRAPHEME_INVALID_CODEPOINT. + + If cp is not NULL the decoded codepoint is stored in the memory pointed + to by cp. + + Given NUL has a unique 1 byte representation, it is safe to operate on + NUL-terminated strings by setting len to SIZE_MAX (stdint.h is already + included by grapheme.h) and terminating when cp is 0 (see EXAMPLES for an + example). + + RETURN VALUES + The grapheme_decode_utf8() function returns the number of processed bytes + and 0 if str is NULL or len is 0. If the string ends unexpectedly in a + multibyte sequence, the desired length (that is larger than len) is + returned. + + EXAMPLES + /* cc (-static) -o example example.c -lgrapheme */ + #include <grapheme.h> + #include <inttypes.h> + #include <stdio.h> + + void + print_cps(const char *str, size_t len) + { + size_t ret, off; + uint_least32_t cp; + + for (off = 0; off < len; off += ret) { + if ((ret = grapheme_decode_utf8(str + off, + len - off, &cp)) > (len - off)) { + /* + * string ended unexpectedly in the middle of a + * multibyte sequence and we have the choice + * here to possibly expand str by ret - len + off + * bytes to get a full sequence, but we just + * bail out in this case. 
+ */ + break; + } + printf("%"PRIxLEAST32"\n", cp); + } + } + + void + print_cps_nul_terminated(const char *str) + { + size_t ret, off; + uint_least32_t cp; + + for (off = 0; (ret = grapheme_decode_utf8(str + off, + SIZE_MAX, &cp)) > 0 && + cp != 0; off += ret) { + printf("%"PRIxLEAST32"\n", cp); + } + } + + SEE ALSO + grapheme_encode_utf8(3), libgrapheme(7) + + AUTHORS + Laslo Hunhold <dev@frign.de> + + suckless.org 2022-10-06 suckless.org diff --git a/libs.suckless.org/libgrapheme/man/grapheme_encode_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_encode_utf8\(3\)/index.md @@ -0,0 +1,87 @@ + GRAPHEME_ENCODE_UTF8(3) Library Functions Manual GRAPHEME_ENCODE_UTF8(3) + + NAME + grapheme_encode_utf8 – encode codepoint into UTF-8 string + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_encode_utf8(uint_least32_t cp, char *str, size_t len); + + DESCRIPTION + The grapheme_encode_utf8() function encodes the codepoint cp into a + UTF-8-string. If str is not NULL and len is large enough it writes the + UTF-8-string to the memory pointed to by str. Otherwise no data is + written. + + RETURN VALUES + The grapheme_encode_utf8() function returns the length (in bytes) of the + UTF-8-string resulting from encoding cp, even if len is not large enough + or str is NULL. 
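The contract stated under RETURN VALUES (the byte length is always returned, while data is written only when the buffer is present and large enough) can be illustrated with a self-contained sketch. Note that `sketch_encode_utf8` is a hypothetical stand-in written for this note and is not libgrapheme's implementation; mapping surrogates and out-of-range values to U+FFFD is likewise an assumption made here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in, NOT libgrapheme's code: encode cp as UTF-8 and
 * always return the number of bytes the encoding needs. Bytes are only
 * written if str is not NULL and len is large enough.
 */
size_t
sketch_encode_utf8(uint_least32_t cp, char *str, size_t len)
{
	unsigned char buf[4];
	size_t need;

	/* assumption: invalid codepoints are replaced by U+FFFD */
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
		cp = 0xFFFD;
	}

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		need = 1;
	} else if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		need = 2;
	} else if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		need = 3;
	} else {
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		need = 4;
	}

	/* write nothing at all unless the whole sequence fits */
	if (str != NULL && len >= need) {
		memcpy(str, buf, need);
	}

	return need;
}
```

Because the return value does not depend on the buffer, a caller can first call the function with a NULL buffer to measure, allocate, and then call it again to write.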
+
+    EXAMPLES
+         /* cc (-static) -o example example.c -lgrapheme */
+         #include <grapheme.h>
+         #include <stddef.h>
+         #include <stdlib.h>
+
+         size_t
+         cps_to_utf8(const uint_least32_t *cp, size_t cplen, char *str, size_t len)
+         {
+                 size_t i, off, ret;
+
+                 for (i = 0, off = 0; i < cplen; i++, off += ret) {
+                         if ((ret = grapheme_encode_utf8(cp[i], str + off,
+                                                         len - off)) > (len - off)) {
+                                 /* buffer too small */
+                                 break;
+                         }
+                 }
+
+                 return off;
+         }
+
+         size_t
+         cps_bytelen(const uint_least32_t *cp, size_t cplen)
+         {
+                 size_t i, len;
+
+                 for (i = 0, len = 0; i < cplen; i++) {
+                         len += grapheme_encode_utf8(cp[i], NULL, 0);
+                 }
+
+                 return len;
+         }
+
+         char *
+         cps_to_utf8_alloc(const uint_least32_t *cp, size_t cplen)
+         {
+                 char *str;
+                 size_t len, i, ret, off;
+
+                 len = cps_bytelen(cp, cplen);
+
+                 if (!(str = malloc(len + 1))) { /* + 1 for the NUL byte */
+                         return NULL;
+                 }
+
+                 for (i = 0, off = 0; i < cplen; i++, off += ret) {
+                         if ((ret = grapheme_encode_utf8(cp[i], str + off,
+                                                         len - off)) > (len - off)) {
+                                 /* buffer too small */
+                                 break;
+                         }
+                 }
+                 str[off] = '\0';
+
+                 return str;
+         }
+
+    SEE ALSO
+         grapheme_decode_utf8(3), libgrapheme(7)
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_character_break\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_character_break\(3\)/index.md
@@ -0,0 +1,69 @@
+    GRAPHEME_IS_CHARACTER_BREAK(3)        Library Functions Manual
+
+    NAME
+         grapheme_is_character_break – test for a grapheme cluster break between
+         two codepoints
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_character_break(uint_least32_t cp1, uint_least32_t cp2,
+             uint_least16_t *state);
+
+    DESCRIPTION
+         The grapheme_is_character_break() function determines if there is a
+         grapheme cluster break (see libgrapheme(7)) between the two codepoints
+         cp1 and cp2. By specification this decision depends on a state that can
+         at most be completely reset after detecting a break and must be reset
+         every time one deviates from sequential processing.
+
+         If state is NULL grapheme_is_character_break() behaves as if it was
+         called with a fully reset state.
+
+    RETURN VALUES
+         The grapheme_is_character_break() function returns true if there is a
+         grapheme cluster break between the codepoints cp1 and cp2 and false if
+         there is not.
+
+    EXAMPLES
+         /* cc (-static) -o example example.c -lgrapheme */
+         #include <grapheme.h>
+         #include <stdint.h>
+         #include <stdio.h>
+         #include <string.h>
+
+         int
+         main(void)
+         {
+                 uint_least16_t state = 0;
+                 uint_least32_t s1[] = ..., s2[] = ...; /* two input arrays */
+                 size_t i;
+
+                 for (i = 0; i + 1 < sizeof(s1) / sizeof(*s1); i++) {
+                         if (grapheme_is_character_break(s1[i], s1[i + 1], &state)) {
+                                 printf("break in s1 at offset %zu\n", i);
+                         }
+                 }
+                 memset(&state, 0, sizeof(state)); /* reset state */
+                 for (i = 0; i + 1 < sizeof(s2) / sizeof(*s2); i++) {
+                         if (grapheme_is_character_break(s2[i], s2[i + 1], &state)) {
+                                 printf("break in s2 at offset %zu\n", i);
+                         }
+                 }
+
+                 return 0;
+         }
+
+    SEE ALSO
+         grapheme_next_character_break(3), grapheme_next_character_break_utf8(3),
+         libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_character_break() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_lowercase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_lowercase\(3\)/index.md
@@ -0,0 +1,39 @@
+    GRAPHEME_IS_LOWERCASE(3)   Library Functions Manual   GRAPHEME_IS_LOWERCASE(3)
+
+    NAME
+         grapheme_is_lowercase – check if codepoint array is lowercase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_lowercase(const uint_least32_t *str, size_t len,
+             size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_lowercase() function checks if the codepoint array str is
+         lowercase and writes the length of the matching lowercase-sequence to the
+         integer pointed to by caselen, unless caselen is set to NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the codepoint array str is interpreted to be NUL-terminated and
+         processing stops when a codepoint with the value 0 is encountered.
+
+         For UTF-8-encoded input data grapheme_is_lowercase_utf8(3) can be used
+         instead.
+
+    RETURN VALUES
+         The grapheme_is_lowercase() function returns true if the codepoint array
+         str is lowercase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_lowercase_utf8(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_lowercase() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_lowercase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_lowercase_utf8\(3\)/index.md
@@ -0,0 +1,38 @@
+    GRAPHEME_IS_LOWERCASE_UTF8(3)        Library Functions Manual
+
+    NAME
+         grapheme_is_lowercase_utf8 – check if UTF-8-encoded string is lowercase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_lowercase_utf8(const char *str, size_t len, size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_lowercase_utf8() function checks if the UTF-8-encoded
+         string str is lowercase and writes the length of the matching lowercase-
+         sequence to the integer pointed to by caselen, unless caselen is set to
+         NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the UTF-8-encoded string str is interpreted to be NUL-terminated and
+         processing stops when a NUL-byte is encountered.
+
+         For non-UTF-8 input data grapheme_is_lowercase(3) can be used instead.
+
+    RETURN VALUES
+         The grapheme_is_lowercase_utf8() function returns true if the
+         UTF-8-encoded string str is lowercase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_lowercase(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_lowercase_utf8() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_titlecase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_titlecase\(3\)/index.md
@@ -0,0 +1,39 @@
+    GRAPHEME_IS_TITLECASE(3)   Library Functions Manual   GRAPHEME_IS_TITLECASE(3)
+
+    NAME
+         grapheme_is_titlecase – check if codepoint array is titlecase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_titlecase(const uint_least32_t *str, size_t len,
+             size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_titlecase() function checks if the codepoint array str is
+         titlecase and writes the length of the matching titlecase-sequence to the
+         integer pointed to by caselen, unless caselen is set to NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the codepoint array str is interpreted to be NUL-terminated and
+         processing stops when a codepoint with the value 0 is encountered.
+
+         For UTF-8-encoded input data grapheme_is_titlecase_utf8(3) can be used
+         instead.
+
+    RETURN VALUES
+         The grapheme_is_titlecase() function returns true if the codepoint array
+         str is titlecase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_titlecase_utf8(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_titlecase() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_titlecase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_titlecase_utf8\(3\)/index.md
@@ -0,0 +1,38 @@
+    GRAPHEME_IS_TITLECASE_UTF8(3)        Library Functions Manual
+
+    NAME
+         grapheme_is_titlecase_utf8 – check if UTF-8-encoded string is titlecase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_titlecase_utf8(const char *str, size_t len, size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_titlecase_utf8() function checks if the UTF-8-encoded
+         string str is titlecase and writes the length of the matching titlecase-
+         sequence to the integer pointed to by caselen, unless caselen is set to
+         NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the UTF-8-encoded string str is interpreted to be NUL-terminated and
+         processing stops when a NUL-byte is encountered.
+
+         For non-UTF-8 input data grapheme_is_titlecase(3) can be used instead.
+
+    RETURN VALUES
+         The grapheme_is_titlecase_utf8() function returns true if the
+         UTF-8-encoded string str is titlecase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_titlecase(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_titlecase_utf8() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_uppercase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_uppercase\(3\)/index.md
@@ -0,0 +1,39 @@
+    GRAPHEME_IS_UPPERCASE(3)   Library Functions Manual   GRAPHEME_IS_UPPERCASE(3)
+
+    NAME
+         grapheme_is_uppercase – check if codepoint array is uppercase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_uppercase(const uint_least32_t *str, size_t len,
+             size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_uppercase() function checks if the codepoint array str is
+         uppercase and writes the length of the matching uppercase-sequence to the
+         integer pointed to by caselen, unless caselen is set to NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the codepoint array str is interpreted to be NUL-terminated and
+         processing stops when a codepoint with the value 0 is encountered.
+
+         For UTF-8-encoded input data grapheme_is_uppercase_utf8(3) can be used
+         instead.
+
+    RETURN VALUES
+         The grapheme_is_uppercase() function returns true if the codepoint array
+         str is uppercase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_uppercase_utf8(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_uppercase() is compliant with the Unicode 15.0.0
+         specification.
+
+    AUTHORS
+         Laslo Hunhold <dev@frign.de>
+
+    suckless.org                    2022-10-06                    suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_is_uppercase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_is_uppercase_utf8\(3\)/index.md
@@ -0,0 +1,38 @@
+    GRAPHEME_IS_UPPERCASE_UTF8(3)        Library Functions Manual
+
+    NAME
+         grapheme_is_uppercase_utf8 – check if UTF-8-encoded string is uppercase
+
+    SYNOPSIS
+         #include <grapheme.h>
+
+         size_t
+         grapheme_is_uppercase_utf8(const char *str, size_t len, size_t *caselen);
+
+    DESCRIPTION
+         The grapheme_is_uppercase_utf8() function checks if the UTF-8-encoded
+         string str is uppercase and writes the length of the matching uppercase-
+         sequence to the integer pointed to by caselen, unless caselen is set to
+         NULL.
+
+         If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+         the UTF-8-encoded string str is interpreted to be NUL-terminated and
+         processing stops when a NUL-byte is encountered.
+
+         For non-UTF-8 input data grapheme_is_uppercase(3) can be used instead.
+
+    RETURN VALUES
+         The grapheme_is_uppercase_utf8() function returns true if the
+         UTF-8-encoded string str is uppercase, otherwise false.
+
+    SEE ALSO
+         grapheme_is_uppercase(3), libgrapheme(7)
+
+    STANDARDS
+         grapheme_is_uppercase_utf8() is compliant with the Unicode 15.0.0
+         specification.
+ + AUTHORS + Laslo Hunhold <dev@frign.de> + + suckless.org 2022-10-06 suckless.org diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_character_break\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_character_break\(3\)/index.md @@ -0,0 +1,42 @@ + GRAPHEME_NEXT_CHARACTER_BREAK(3) Library Functions Manual + + NAME + grapheme_next_character_break – determine codepoint-offset to next + grapheme cluster break + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_next_character_break(const uint_least32_t *str, size_t len); + + DESCRIPTION + The grapheme_next_character_break() function computes the offset (in + codepoints) to the next grapheme cluster break (see libgrapheme(7)) in + the codepoint array str of length len. If a grapheme cluster begins at + str this offset is equal to the length of said grapheme cluster. + + If len is set to SIZE_MAX (stdint.h is already included by grapheme.h) + the string str is interpreted to be NUL-terminated and processing stops + when a codepoint with the value 0 is encountered. + + For UTF-8-encoded input data grapheme_next_character_break_utf8(3) can be + used instead. + + RETURN VALUES + The grapheme_next_character_break() function returns the offset (in + codepoints) to the next grapheme cluster break in str or 0 if str is + NULL. + + SEE ALSO + grapheme_is_character_break(3), grapheme_next_character_break_utf8(3), + libgrapheme(7) + + STANDARDS + grapheme_next_character_break() is compliant with the Unicode 15.0.0 + specification. 
+ + AUTHORS + Laslo Hunhold <dev@frign.de> + + suckless.org 2022-10-06 suckless.org diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_character_break_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_character_break_utf8\(3\)/index.md @@ -0,0 +1,77 @@ + GRAPHEME_NEXT_CHARACTER_BREAK_UTF8(3) Library Functions Manual + + NAME + grapheme_next_character_break_utf8 – determine byte-offset to next + grapheme cluster break + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_next_character_break_utf8(const char *str, size_t len); + + DESCRIPTION + The grapheme_next_character_break_utf8() function computes the offset (in + bytes) to the next grapheme cluster break (see libgrapheme(7)) in the + UTF-8-encoded string str of length len. If a grapheme cluster begins at + str this offset is equal to the length of said grapheme cluster. + + If len is set to SIZE_MAX (stdint.h is already included by grapheme.h) + the string str is interpreted to be NUL-terminated and processing stops + when a NUL-byte is encountered. + + For non-UTF-8 input data grapheme_is_character_break(3) and + grapheme_next_character_break(3) can be used instead. + + RETURN VALUES + The grapheme_next_character_break_utf8() function returns the offset (in + bytes) to the next grapheme cluster break in str or 0 if str is NULL. 
+ + EXAMPLES + /* cc (-static) -o example example.c -lgrapheme */ + #include <grapheme.h> + #include <stdint.h> + #include <stdio.h> + + int + main(void) + { + /* UTF-8 encoded input */ + char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0" + "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0" + "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0" + "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!"; + size_t ret, len, off; + + printf("Input: \"%s\"\n", s); + + /* print each grapheme cluster with byte-length */ + printf("grapheme clusters in NUL-delimited input:\n"); + for (off = 0; s[off] != '\0'; off += ret) { + ret = grapheme_next_character_break_utf8(s + off, SIZE_MAX); + printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off, ret); + } + printf("\n"); + + /* do the same, but this time string is length-delimited */ + len = 17; + printf("grapheme clusters in input delimited to %zu bytes:\n", len); + for (off = 0; off < len; off += ret) { + ret = grapheme_next_character_break_utf8(s + off, len - off); + printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off, ret); + } + + return 0; + } + + SEE ALSO + grapheme_next_character_break(3), libgrapheme(7) + + STANDARDS + grapheme_next_character_break_utf8() is compliant with the Unicode 15.0.0 + specification. + + AUTHORS + Laslo Hunhold <dev@frign.de> + + suckless.org 2022-10-06 suckless.org diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_line_break\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_line_break\(3\)/index.md @@ -0,0 +1,39 @@ + GRAPHEME_NEXT_LINE_BREAK(3) Library Functions Manual + + NAME + grapheme_next_line_break – determine codepoint-offset to next possible + line break + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_next_line_break(const uint_least32_t *str, size_t len); + + DESCRIPTION + The grapheme_next_line_break() function computes the offset (in + codepoints) to the next possible line break (see libgrapheme(7)) in the + codepoint array str of length len. 
+ + If len is set to SIZE_MAX (stdint.h is already included by grapheme.h) + the string str is interpreted to be NUL-terminated and processing stops + when a codepoint with the value 0 is encountered. + + For UTF-8-encoded input data grapheme_next_line_break_utf8(3) can be used + instead. + + RETURN VALUES + The grapheme_next_line_break() function returns the offset (in + codepoints) to the next possible line break in str or 0 if str is NULL. + + SEE ALSO + grapheme_next_line_break_utf8(3), libgrapheme(7) + + STANDARDS + grapheme_next_line_break() is compliant with the Unicode 15.0.0 + specification. + + AUTHORS + Laslo Hunhold <dev@frign.de> + + suckless.org 2022-10-06 suckless.org diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_line_break_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_line_break_utf8\(3\)/index.md @@ -0,0 +1,75 @@ + GRAPHEME_NEXT_LINE_BREAK_UTF8(3) Library Functions Manual + + NAME + grapheme_next_line_break_utf8 – determine byte-offset to next possible + line break + + SYNOPSIS + #include <grapheme.h> + + size_t + grapheme_next_line_break_utf8(const char *str, size_t len); + + DESCRIPTION + The grapheme_next_line_break_utf8() function computes the offset (in + bytes) to the next possible line break (see libgrapheme(7)) in the + UTF-8-encoded string str of length len. + + If len is set to SIZE_MAX (stdint.h is already included by grapheme.h) + the string str is interpreted to be NUL-terminated and processing stops + when a NUL-byte is encountered. + + For non-UTF-8 input data grapheme_next_line_break(3) can be used instead. + + RETURN VALUES + The grapheme_next_line_break_utf8() function returns the offset (in + bytes) to the next possible line break in str or 0 if str is NULL. 
+
+ EXAMPLES
+      /* cc (-static) -o example example.c -lgrapheme */
+      #include <grapheme.h>
+      #include <stdint.h>
+      #include <stdio.h>
+
+      int
+      main(void)
+      {
+          /* UTF-8 encoded input */
+          char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
+                    "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
+                    "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
+                    "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
+          size_t ret, len, off;
+
+          printf("Input: \"%s\"\n", s);
+
+          /* print each possible line with byte-length */
+          printf("possible lines in NUL-delimited input:\n");
+          for (off = 0; s[off] != '\0'; off += ret) {
+              ret = grapheme_next_line_break_utf8(s + off, SIZE_MAX);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+          printf("\n");
+
+          /* do the same, but this time string is length-delimited */
+          len = 17;
+          printf("possible lines in input delimited to %zu bytes:\n", len);
+          for (off = 0; off < len; off += ret) {
+              ret = grapheme_next_line_break_utf8(s + off, len - off);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+
+          return 0;
+      }
+
+ SEE ALSO
+      grapheme_next_line_break(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_next_line_break_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break\(3\)/index.md
@@ -0,0 +1,40 @@
+ GRAPHEME_NEXT_SENTENCE_BREAK(3)    Library Functions Manual
+
+ NAME
+      grapheme_next_sentence_break – determine codepoint-offset to next
+      sentence break
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_next_sentence_break(const uint_least32_t *str, size_t len);
+
+ DESCRIPTION
+      The grapheme_next_sentence_break() function computes the offset (in
+      codepoints) to the next sentence break (see libgrapheme(7)) in the
+      codepoint array str of length len.
+      If a sentence begins at str this offset is equal to the length of said sentence.
+
+      If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the string str is interpreted to be NUL-terminated and processing stops
+      when a codepoint with the value 0 is encountered.
+
+      For UTF-8-encoded input data grapheme_next_sentence_break_utf8(3) can be
+      used instead.
+
+ RETURN VALUES
+      The grapheme_next_sentence_break() function returns the offset (in
+      codepoints) to the next sentence break in str or 0 if str is NULL.
+
+ SEE ALSO
+      grapheme_next_sentence_break_utf8(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_next_sentence_break() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_sentence_break_utf8\(3\)/index.md
@@ -0,0 +1,77 @@
+ GRAPHEME_NEXT_SENTENCE_BREAK_UTF8(3)   Library Functions Manual
+
+ NAME
+      grapheme_next_sentence_break_utf8 – determine byte-offset to next
+      sentence break
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_next_sentence_break_utf8(const char *str, size_t len);
+
+ DESCRIPTION
+      The grapheme_next_sentence_break_utf8() function computes the offset (in
+      bytes) to the next sentence break (see libgrapheme(7)) in the
+      UTF-8-encoded string str of length len. If a sentence begins at str this
+      offset is equal to the length of said sentence.
+
+      If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the string str is interpreted to be NUL-terminated and processing stops
+      when a NUL-byte is encountered.
+
+      For non-UTF-8 input data grapheme_next_sentence_break(3) can be used
+      instead.
+
+ RETURN VALUES
+      The grapheme_next_sentence_break_utf8() function returns the offset (in
+      bytes) to the next sentence break in str or 0 if str is NULL.
+
+ EXAMPLES
+      /* cc (-static) -o example example.c -lgrapheme */
+      #include <grapheme.h>
+      #include <stdint.h>
+      #include <stdio.h>
+
+      int
+      main(void)
+      {
+          /* UTF-8 encoded input */
+          char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
+                    "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
+                    "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
+                    "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
+          size_t ret, len, off;
+
+          printf("Input: \"%s\"\n", s);
+
+          /* print each sentence with byte-length */
+          printf("sentences in NUL-delimited input:\n");
+          for (off = 0; s[off] != '\0'; off += ret) {
+              ret = grapheme_next_sentence_break_utf8(s + off, SIZE_MAX);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+          printf("\n");
+
+          /* do the same, but this time string is length-delimited */
+          len = 17;
+          printf("sentences in input delimited to %zu bytes:\n", len);
+          for (off = 0; off < len; off += ret) {
+              ret = grapheme_next_sentence_break_utf8(s + off, len - off);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+
+          return 0;
+      }
+
+ SEE ALSO
+      grapheme_next_sentence_break(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_next_sentence_break_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_word_break\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_word_break\(3\)/index.md
@@ -0,0 +1,39 @@
+ GRAPHEME_NEXT_WORD_BREAK(3)        Library Functions Manual
+
+ NAME
+      grapheme_next_word_break – determine codepoint-offset to next word break
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_next_word_break(const uint_least32_t *str, size_t len);
+
+ DESCRIPTION
+      The grapheme_next_word_break() function computes the offset (in
+      codepoints) to the next word break (see libgrapheme(7)) in the codepoint
+      array str of length len. If a word begins at str this offset is equal to
+      the length of said word.
+
+      If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the string str is interpreted to be NUL-terminated and processing stops
+      when a codepoint with the value 0 is encountered.
+
+      For UTF-8-encoded input data grapheme_next_word_break_utf8(3) can be used
+      instead.
+
+ RETURN VALUES
+      The grapheme_next_word_break() function returns the offset (in
+      codepoints) to the next word break in str or 0 if str is NULL.
+
+ SEE ALSO
+      grapheme_next_word_break_utf8(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_next_word_break() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_next_word_break_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_next_word_break_utf8\(3\)/index.md
@@ -0,0 +1,75 @@
+ GRAPHEME_NEXT_WORD_BREAK_UTF8(3)   Library Functions Manual
+
+ NAME
+      grapheme_next_word_break_utf8 – determine byte-offset to next word break
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_next_word_break_utf8(const char *str, size_t len);
+
+ DESCRIPTION
+      The grapheme_next_word_break_utf8() function computes the offset (in
+      bytes) to the next word break (see libgrapheme(7)) in the UTF-8-encoded
+      string str of length len. If a word begins at str this offset is equal
+      to the length of said word.
+
+      If len is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the string str is interpreted to be NUL-terminated and processing stops
+      when a NUL-byte is encountered.
+
+      For non-UTF-8 input data grapheme_next_word_break(3) can be used instead.
+
+ RETURN VALUES
+      The grapheme_next_word_break_utf8() function returns the offset (in
+      bytes) to the next word break in str or 0 if str is NULL.
+
+ EXAMPLES
+      /* cc (-static) -o example example.c -lgrapheme */
+      #include <grapheme.h>
+      #include <stdint.h>
+      #include <stdio.h>
+
+      int
+      main(void)
+      {
+          /* UTF-8 encoded input */
+          char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
+                    "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
+                    "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
+                    "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
+          size_t ret, len, off;
+
+          printf("Input: \"%s\"\n", s);
+
+          /* print each word with byte-length */
+          printf("words in NUL-delimited input:\n");
+          for (off = 0; s[off] != '\0'; off += ret) {
+              ret = grapheme_next_word_break_utf8(s + off, SIZE_MAX);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+          printf("\n");
+
+          /* do the same, but this time string is length-delimited */
+          len = 17;
+          printf("words in input delimited to %zu bytes:\n", len);
+          for (off = 0; off < len; off += ret) {
+              ret = grapheme_next_word_break_utf8(s + off, len - off);
+              printf("%2zu bytes | %.*s\n", ret, (int)ret, s + off);
+          }
+
+          return 0;
+      }
+
+ SEE ALSO
+      grapheme_next_word_break(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_next_word_break_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_lowercase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_lowercase\(3\)/index.md
@@ -0,0 +1,40 @@
+ GRAPHEME_TO_LOWERCASE(3)   Library Functions Manual   GRAPHEME_TO_LOWERCASE(3)
+
+ NAME
+      grapheme_to_lowercase – convert codepoint array to lowercase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_lowercase(const uint_least32_t *src, size_t srclen,
+          uint_least32_t *dest, size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_lowercase() function converts the codepoint array src to
+      lowercase and writes the result to dest up to destlen, unless dest is set
+      to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the codepoint array src is interpreted to be NUL-terminated and
+      processing stops when a codepoint with the value 0 is encountered.
+
+      For UTF-8-encoded input data grapheme_to_lowercase_utf8(3) can be used
+      instead.
+
+ RETURN VALUES
+      The grapheme_to_lowercase() function returns the number of codepoints in
+      the array resulting from converting src to lowercase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_lowercase_utf8(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_lowercase() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_lowercase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_lowercase_utf8\(3\)/index.md
@@ -0,0 +1,39 @@
+ GRAPHEME_TO_LOWERCASE_UTF8(3)      Library Functions Manual
+
+ NAME
+      grapheme_to_lowercase_utf8 – convert UTF-8-encoded string to lowercase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_lowercase_utf8(const char *src, size_t srclen, char *dest,
+          size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_lowercase_utf8() function converts the UTF-8-encoded
+      string src to lowercase and writes the result to dest up to destlen,
+      unless dest is set to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the UTF-8-encoded string src is interpreted to be NUL-terminated and
+      processing stops when a NUL-byte is encountered.
+
+      For non-UTF-8 input data grapheme_to_lowercase(3) can be used instead.
+
+ RETURN VALUES
+      The grapheme_to_lowercase_utf8() function returns the number of bytes in
+      the array resulting from converting src to lowercase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_lowercase(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_lowercase_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_titlecase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_titlecase\(3\)/index.md
@@ -0,0 +1,40 @@
+ GRAPHEME_TO_TITLECASE(3)   Library Functions Manual   GRAPHEME_TO_TITLECASE(3)
+
+ NAME
+      grapheme_to_titlecase – convert codepoint array to titlecase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_titlecase(const uint_least32_t *src, size_t srclen,
+          uint_least32_t *dest, size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_titlecase() function converts the codepoint array src to
+      titlecase and writes the result to dest up to destlen, unless dest is set
+      to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the codepoint array src is interpreted to be NUL-terminated and
+      processing stops when a codepoint with the value 0 is encountered.
+
+      For UTF-8-encoded input data grapheme_to_titlecase_utf8(3) can be used
+      instead.
+
+ RETURN VALUES
+      The grapheme_to_titlecase() function returns the number of codepoints in
+      the array resulting from converting src to titlecase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_titlecase_utf8(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_titlecase() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_titlecase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_titlecase_utf8\(3\)/index.md
@@ -0,0 +1,39 @@
+ GRAPHEME_TO_TITLECASE_UTF8(3)      Library Functions Manual
+
+ NAME
+      grapheme_to_titlecase_utf8 – convert UTF-8-encoded string to titlecase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_titlecase_utf8(const char *src, size_t srclen, char *dest,
+          size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_titlecase_utf8() function converts the UTF-8-encoded
+      string src to titlecase and writes the result to dest up to destlen,
+      unless dest is set to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the UTF-8-encoded string src is interpreted to be NUL-terminated and
+      processing stops when a NUL-byte is encountered.
+
+      For non-UTF-8 input data grapheme_to_titlecase(3) can be used instead.
+
+ RETURN VALUES
+      The grapheme_to_titlecase_utf8() function returns the number of bytes in
+      the array resulting from converting src to titlecase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_titlecase(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_titlecase_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_uppercase\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_uppercase\(3\)/index.md
@@ -0,0 +1,40 @@
+ GRAPHEME_TO_UPPERCASE(3)   Library Functions Manual   GRAPHEME_TO_UPPERCASE(3)
+
+ NAME
+      grapheme_to_uppercase – convert codepoint array to uppercase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_uppercase(const uint_least32_t *src, size_t srclen,
+          uint_least32_t *dest, size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_uppercase() function converts the codepoint array src to
+      uppercase and writes the result to dest up to destlen, unless dest is set
+      to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the codepoint array src is interpreted to be NUL-terminated and
+      processing stops when a codepoint with the value 0 is encountered.
+
+      For UTF-8-encoded input data grapheme_to_uppercase_utf8(3) can be used
+      instead.
+
+ RETURN VALUES
+      The grapheme_to_uppercase() function returns the number of codepoints in
+      the array resulting from converting src to uppercase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_uppercase_utf8(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_uppercase() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/grapheme_to_uppercase_utf8\(3\)/index.md b/libs.suckless.org/libgrapheme/man/grapheme_to_uppercase_utf8\(3\)/index.md
@@ -0,0 +1,39 @@
+ GRAPHEME_TO_UPPERCASE_UTF8(3)      Library Functions Manual
+
+ NAME
+      grapheme_to_uppercase_utf8 – convert UTF-8-encoded string to uppercase
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+      size_t
+      grapheme_to_uppercase_utf8(const char *src, size_t srclen, char *dest,
+          size_t destlen);
+
+ DESCRIPTION
+      The grapheme_to_uppercase_utf8() function converts the UTF-8-encoded
+      string src to uppercase and writes the result to dest up to destlen,
+      unless dest is set to NULL.
+
+      If srclen is set to SIZE_MAX (stdint.h is already included by grapheme.h)
+      the UTF-8-encoded string src is interpreted to be NUL-terminated and
+      processing stops when a NUL-byte is encountered.
+
+      For non-UTF-8 input data grapheme_to_uppercase(3) can be used instead.
+
+ RETURN VALUES
+      The grapheme_to_uppercase_utf8() function returns the number of bytes in
+      the array resulting from converting src to uppercase, even if destlen is
+      not large enough or dest is NULL.
+
+ SEE ALSO
+      grapheme_to_uppercase(3), libgrapheme(7)
+
+ STANDARDS
+      grapheme_to_uppercase_utf8() is compliant with the Unicode 15.0.0
+      specification.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org
diff --git a/libs.suckless.org/libgrapheme/man/libgrapheme\(7\)/index.md b/libs.suckless.org/libgrapheme/man/libgrapheme\(7\)/index.md
@@ -0,0 +1,122 @@
+ LIBGRAPHEME(7)       Miscellaneous Information Manual       LIBGRAPHEME(7)
+
+ NAME
+      libgrapheme – Unicode string library
+
+ SYNOPSIS
+      #include <grapheme.h>
+
+ DESCRIPTION
+      The libgrapheme library provides functions to properly handle Unicode
+      strings according to the Unicode specification in regard to character,
+      word, sentence and line segmentation and case detection and conversion.
+
+      Unicode strings are made up of user-perceived characters (so-called
+      “grapheme clusters”, see MOTIVATION) that are composed of one or more
+      Unicode codepoints, which in turn are encoded in one or more bytes in an
+      encoding like UTF-8.
+
+      There is a widespread misconception that it is enough to simply
+      determine codepoints in a string and treat them as user-perceived
+      characters to be Unicode compliant. While this may work in some cases,
+      this assumption quickly breaks, especially for non-Western languages and
+      decomposed Unicode strings where user-perceived characters are usually
+      represented using multiple codepoints.
+
+      Despite this complicated multilevel structure of Unicode strings,
+      libgrapheme provides methods to work with them at the byte-level (i.e.
+      UTF-8 ‘char’ arrays) while also offering codepoint-level methods.
+      Additionally, it is a “freestanding” library (see ISO/IEC 9899:1999
+      section 4.6) and thus does not depend on a standard library. This makes
+      it easy to use in bare metal environments.
+
+      Every documented function's manual page provides a self-contained example
+      illustrating the possible usage.
+
+ SEE ALSO
+      grapheme_decode_utf8(3), grapheme_encode_utf8(3),
+      grapheme_is_character_break(3), grapheme_is_lowercase(3),
+      grapheme_is_lowercase_utf8(3), grapheme_is_titlecase(3),
+      grapheme_is_titlecase_utf8(3), grapheme_is_uppercase(3),
+      grapheme_is_uppercase_utf8(3), grapheme_next_character_break(3),
+      grapheme_next_character_break_utf8(3), grapheme_next_line_break(3),
+      grapheme_next_line_break_utf8(3), grapheme_next_sentence_break(3),
+      grapheme_next_sentence_break_utf8(3), grapheme_next_word_break(3),
+      grapheme_next_word_break_utf8(3), grapheme_to_lowercase(3),
+      grapheme_to_lowercase_utf8(3), grapheme_to_titlecase(3),
+      grapheme_to_titlecase_utf8(3), grapheme_to_uppercase(3),
+      grapheme_to_uppercase_utf8(3)
+
+ STANDARDS
+      libgrapheme is compliant with the Unicode 15.0.0 specification.
+
+ MOTIVATION
+      The idea behind every character encoding scheme like ASCII or Unicode is
+      to express abstract characters (which can be thought of as shapes making
+      up a written language). ASCII for instance, which comprises the range 0
+      to 127, assigns the number 65 (0x41) to the abstract character ‘A’. This
+      number is called a “codepoint”, and all codepoints of an encoding make up
+      its so-called “code space”.
+
+      Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
+      first 128 codepoints are identical to ASCII's. The additional codepoints
+      are needed as Unicode's goal is to express all writing systems of the
+      world. To give an example, the abstract character ‘Ä’ is not expressible
+      in ASCII, given no ASCII codepoint has been assigned to it. It can be
+      expressed in Unicode, though, with the codepoint 196 (0xC4).
+
+      One may assume that this process is straightforward, but as more and more
+      codepoints were assigned to abstract characters, the Unicode Consortium
+      (that defines the Unicode standard) was facing a problem: Many (mostly
+      non-European) languages have such a large number of abstract characters
+      that it would exhaust the available Unicode code space if one tried to
+      assign a codepoint to each abstract character. The solution to that
+      problem is best introduced with an example: Consider the abstract
+      character ‘Ǟ’, which is ‘A’ with an umlaut and a macron added to it. In
+      this sense, one can consider ‘Ǟ’ as a two-fold modification (namely “add
+      umlaut” and “add macron”) of the “base character” ‘A’.
+
+      The Unicode Consortium adopted this idea by assigning codepoints to
+      modifications. For example, the codepoint 0x308 represents adding an
+      umlaut and 0x304 represents adding a macron, and thus, the codepoint
+      sequence “0x41 0x308 0x304”, namely the base character ‘A’ followed by
+      the umlaut and macron modifiers, represents the abstract character ‘Ǟ’.
+      As a side-note, the single codepoint 0x1DE was also assigned to ‘Ǟ’,
+      which is a good example of the fact that there can be multiple
+      representations of a single abstract character in Unicode.
+
+      Expressing a single abstract character with multiple codepoints solved
+      the code space exhaustion-problem, and the concept has been greatly
+      expanded since its first introduction (emojis, joiners, etc.). A sequence
+      (which can also have the length 1) of codepoints that belong together
+      this way and represents an abstract character is called a “grapheme
+      cluster”.
+
+      In many applications it is necessary to count the number of user-
+      perceived characters, i.e. grapheme clusters, in a string. A good
+      example of this is a terminal text editor, which needs to properly align
+      characters on a grid. This is pretty simple with ASCII-strings, where
+      you just count the number of bytes (as each byte is a codepoint and each
+      codepoint is a grapheme cluster). With Unicode-strings, it is a common
+      mistake to simply adopt the ASCII-approach and count the number of
+      codepoints. This is wrong, as, for example, the sequence “0x41 0x308 0x304”,
+      while made up of 3 codepoints, is a single grapheme cluster and
+      represents the user-perceived character ‘Ǟ’.
+
+      The proper way to segment a string into user-perceived characters is to
+      segment it into its grapheme clusters by applying the Unicode grapheme
+      cluster breaking algorithm (UAX #29). It is based on a complex ruleset
+      and lookup-tables and determines if a grapheme cluster ends or is
+      continued between two codepoints. Libraries like ICU and libunistring,
+      which also offer this functionality, are often bloated, not correct,
+      difficult to use or not reasonably statically linkable.
+
+      Analogously, the standard provides algorithms to separate strings by
+      words, sentences and lines, convert cases and compare strings.
+      The motivation behind libgrapheme is to make Unicode handling suck less and
+      abide by the UNIX philosophy.
+
+ AUTHORS
+      Laslo Hunhold <dev@frign.de>
+
+ suckless.org                     2022-10-06                     suckless.org