Refactor libgrapheme.7 - libgrapheme - unicode string library

commit 51eca9eff65def13d1370e32dad2988731d38e7d
parent cdb1c28f7f51e53d95bdb763ba2e3034fbe2585f
Author: Laslo Hunhold <dev@frign.de>
Date:   Sat, 10 Oct 2020 18:56:47 +0200

Refactor libgrapheme.7

It read more like a rant and didn't get to the point of what a manual
should do: Provide an overview. Still, I felt like adding a few
paragraphs on the motivation and added a section "BACKGROUND" for this
purpose.

The other manual pages will follow accordingly.

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diffstat:
M man/libgrapheme.7  | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------

1 file changed, 79 insertions(+), 27 deletions(-)
diff --git a/man/libgrapheme.7 b/man/libgrapheme.7
@@ -1,38 +1,90 @@
-.Dd 2020-03-26
+.Dd 2020-10-10
 .Dt LIBGRAPHEME 7
 .Os suckless.org
 .Sh NAME
 .Nm libgrapheme
-.Nd grapheme cluster utility library
+.Nd grapheme cluster detection library
+.Sh SYNOPSIS
+.In grapheme.h
 .Sh DESCRIPTION
+The
 .Nm
-is a C library for working with grapheme clusters. What are grapheme
-clusters? In C, one usually uses 8-Bit unsigned integers (chars) to
-store strings, and many people assume that one such char represents
-one visible character in a printed output.
+library provides functions to properly count characters
+.Dq ( grapheme clusters )
+in Unicode strings using the Unicode grapheme
+cluster breaking algorithm (UAX #29).
 .Pp
-This is not true and only holds for encodings that map numbers from
-0-255 to characters. Modern Unicode maps numbers ('code points') far
-larger than that to characters. A common encoding to represent such
-code points is UTF-8. A common misunderstanding is that a code
-point represents a single printed character, which is not correct.
-Instead, Unicode has a concept of so called 'grapheme clusters', which
-are a set of one or more code points that in total make up one printed
-character.
-.Pp
-To put it shortly: To count printed characters in a string, it is
-neither enough to just count the chars nor to count the UTF-8 code points.
-Instead, what is necessary is to apply a complex ruleset, specified
-by Unicode, to determine if a set of code points belongs together in the
-form of a grapheme cluster, which then counts as a single character.
-.Pp
-.Nm
-is a suckless response to the bloated ecosystem of grapheme cluster
-handling (e.g. ICU) and provides a simple interface for this complex
-concept. The rules are automatically downloaded from unicode.org
-and parsed and automatic testing is performed based on tests provided
-by Unicode.
+You can either count the characters in a UTF-8-encoded string (see
+.Xr grapheme_len 3 )
+or determine if a grapheme cluster breaks between two code points (see
+.Xr grapheme_boundary 3 ) ,
+while a safe UTF-8 decoder and encoder is provided for the latter purpose (see
+.Xr grapheme_cp_decode 3
+and
+.Xr grapheme_cp_encode 3 ) .
 .Sh SEE ALSO
+.Xr grapheme_boundary 3 ,
+.Xr grapheme_cp_decode 3 ,
+.Xr grapheme_cp_encode 3 ,
 .Xr grapheme_len 3
+.Sh STANDARDS
+.Nm
+is compliant with the Unicode 13.0.0 specification.
+.Sh MOTIVATION
+The idea behind every character encoding scheme like ASCII or Unicode
+is to assign numbers to abstract characters. ASCII for instance, which
+comprises the range 0 to 127, assigns the number 65 (0x41) to the
+character
+.Sq A .
+This number is called a
+.Dq code point ,
+and all code points of an encoding make up its so-called
+.Dq code space .
+.Pp
+Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
+first 128 code points are identical to ASCII's. The additional code
+points are needed as Unicode's goal is to express all writing systems
+of the world. To give an example, the character
+.Sq \[u00C4]
+is not expressible in ASCII, as it lacks a code point for it. It can be
+expressed in Unicode, though, as the code point 196 (0xC4) has been
+assigned to it.
+.Pp
+At some point, when more and more characters were assigned to code
+points, the Unicode Consortium (that defines the Unicode standard)
+noticed a problem: Many languages have much more complex characters,
+for example
+.Sq \[u01DE]
+(Unicode code point 0x1DE), which is an
+.Sq A
+with an umlaut and a macron, and it gets much more complicated in some
+non-European languages. Instead of assigning a code point to each
+modification of a
+.Dq base character
+(like
+.Sq A
+in this example here), they started introducing modifiers, which are
+code points that would not correspond to characters but would modify a
+preceding
+.Dq base
+character. For example, the code point 0x308 adds an umlaut and the
+code point 0x304 adds a macron, so the code point sequence
+.Dq 0x41 0x308 0x304
+represents the character
+.Sq \[u01DE] ,
+just like the single code point 0x1DE.
+.Pp
+In many applications, it is necessary to count the number of characters
+in a string. This is pretty simple with ASCII strings, where you just
+count the number of bytes. With Unicode strings, it is a common mistake
+to simply adapt the ASCII approach and count the number of code points,
+since, for example, the sequence
+.Dq 0x41 0x308 0x304 ,
+while made up of 3 code points, represents only a single character.
+.Pp
+The proper way to count the number of characters in a Unicode string
+is to apply the Unicode grapheme cluster breaking algorithm (UAX #29)
+that is based on a complex ruleset and determines if a grapheme cluster
+ends or is continued between two code points.
 .Sh AUTHORS
 .An Laslo Hunhold Aq Mt dev@frign.de
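
The counting problem described in the MOTIVATION section of the patch can be illustrated with a small, self-contained C sketch. It deliberately does not call libgrapheme. The names count_bytes, count_code_points, count_clusters_naive and decode are hypothetical, and the last counter treats only the combining diacritical marks U+0300..U+036F as extending a preceding character, which is a crude stand-in for the full UAX #29 ruleset and not a replacement for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bytes of a NUL-terminated string (the ASCII approach). */
size_t
count_bytes(const char *s)
{
	size_t n = 0;

	while (s[n] != '\0')
		n++;
	return n;
}

/*
 * Count code points in a UTF-8 string by counting every byte that is
 * not a continuation byte (10xxxxxx). Assumes well-formed input.
 */
size_t
count_code_points(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Minimal UTF-8 decoder for illustration only: it stores the decoded
 * code point in *cp and returns the number of bytes consumed. Broken
 * sequences yield U+FFFD. Overlong forms and surrogates are NOT
 * rejected, so this is not a safe decoder.
 */
static size_t
decode(const unsigned char *s, uint32_t *cp)
{
	size_t i, len;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		*cp = s[0] & 0x1F;
		len = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		*cp = s[0] & 0x0F;
		len = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		*cp = s[0] & 0x07;
		len = 4;
	} else {
		*cp = 0xFFFD;
		return 1;
	}

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80) {
			/* missing continuation byte (possibly NUL) */
			*cp = 0xFFFD;
			return i;
		}
		*cp = (*cp << 6) | (s[i] & 0x3F);
	}
	return len;
}

/*
 * Naive character count (assumption: only U+0300..U+036F extend the
 * preceding character). Real grapheme cluster breaking has many more
 * rules.
 */
size_t
count_clusters_naive(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	uint32_t cp;
	size_t n = 0;

	while (*s != '\0') {
		s += decode(s, &cp);
		/* the first code point always starts a character */
		if (n == 0 || cp < 0x0300 || cp > 0x036F)
			n++;
	}
	return n;
}
```

For the UTF-8 string "A\xCC\x88\xCC\x84", which encodes the code point sequence 0x41 0x308 0x304 from the manual, these functions return 5, 3 and 1 respectively. The precomposed form "\xC7\x9E" (code point 0x1DE) gives 2, 1 and 1.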

	libgrapheme unicode string library
	git clone git://git.suckless.org/libgrapheme