commit 5998352d2d2e6e37531548f8e986abae5ff8ef02
parent dd15fea026c3e0b389381ae8cc08e0f39fa1a8f7
Author: Laslo Hunhold <dev@frign.de>
Date:   Tue, 25 Oct 2022 13:20:47 +0200

Implement the Unicode Bidirectional Algorithm (UAX #9)

To be frank, I had never heard of this until I started learning more
about Unicode, but it is an absolute must for all languages that go
from right to left (Hebrew, Arabic, Farsi, etc.) and for any case where
you mix RTL and LTR languages.

The Unicode Bidirectional Algorithm is the normative procedure you
apply to a string to obtain embedding levels, which can then be used to
reorder the string such that you obtain the proper reading direction.
The central aspect is that strings are always stored in logical order
(the order in which they are read and typed) in memory and only
reordered for presentation on the screen.

Currently, the main implementations of the algorithm are ICU and GNU
fribidi, and as usual it's pretty convoluted to use them. There are
many memory allocations, kitchen-sink-madness and legacy cruft, but the
demand is there (there's even a bidi-patch for dwm[0]).

What's special about this implementation? There are no memory
allocations at runtime. The user provides an array of 32-bit integers,
which is then filled with the embedding levels. The levels themselves
only range from -1 to 125 (by the standard!) and would fit in a signed
8-bit integer, but the algorithm naturally needs a scratchpad to store
processing data.

A complication of the algorithm is that, at some point, you have to
break the paragraph into lines, and the line breaks affect the level
determination. GNU fribidi and ICU make this very complicated and hard
to understand.

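To illustrate how embedding levels turn into a visual order, here is a
minimal sketch of the reversal step (rule L2 of UAX #9): going from the
highest level down to the lowest odd level, every contiguous run of
characters at that level or higher is reversed. This is not the code
from this commit. The names reorder_line() and reverse_range() are
hypothetical, and the sketch assumes the levels of one line are already
fully resolved and non-negative (no -1 entries).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the elements of buf in the inclusive index range [lo, hi]. */
static void
reverse_range(uint_least32_t *buf, size_t lo, size_t hi)
{
	while (lo < hi) {
		uint_least32_t tmp = buf[lo];
		buf[lo++] = buf[hi];
		buf[hi--] = tmp;
	}
}

/*
 * Hypothetical helper: reorder one line of codepoints in place.
 * lev[i] is the resolved embedding level of cp[i]; the level array
 * itself is left untouched.
 */
void
reorder_line(uint_least32_t *cp, const int_least8_t *lev, size_t len)
{
	int max = 0, min_odd = -1, l;
	size_t i, start;

	/* find the highest level and the lowest odd level on the line */
	for (i = 0; i < len; i++) {
		assert(lev[i] >= 0); /* assumption of this sketch */
		if (lev[i] > max) {
			max = lev[i];
		}
		if (lev[i] % 2 == 1 && (min_odd < 0 || lev[i] < min_odd)) {
			min_odd = lev[i];
		}
	}
	if (min_odd < 0) {
		return; /* no odd level: the line is purely LTR */
	}

	/* rule L2: reverse each run of level >= l, from max down to min_odd */
	for (l = max; l >= min_odd; l--) {
		i = 0;
		while (i < len) {
			if (lev[i] < l) {
				i++;
				continue;
			}
			start = i;
			while (i < len && lev[i] >= l) {
				i++;
			}
			reverse_range(cp, start, i - 1);
		}
	}
}
```

Writing RTL letters in upper case, the logical string "abc DEF GHI"
with levels 0 for "abc " and 1 for "DEF GHI" comes out as
"abc IHG FED". Reversing only the codepoints while leaving the levels in
place is sufficient, because a run at level l or higher keeps the same
extent for every lower level processed afterwards.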
The API is not final as you see it here, but the final process will be
(each number corresponding to a function):

	1) "preprocessing" the string up to the part where the algorithm
	   does not depend on the line breaks
	2) determining line embedding levels for a line (by specifying
	   the preprocessed data buffer and an output level-buffer)
	3) reordering a line (by specifying the preprocessed data buffer
	   and an output string that is allowed to be the input string)

Conformance is obviously a large priority: There are literally over a
million automatic conformance tests for the bidirectional algorithm,
split across the files BidiTest.txt and BidiCharacterTest.txt, that are
automatically parsed into the header gen/bidirectional-test.h.

Currently, only BidiTest.txt is used for tests (all of which we pass),
because bracket pairs have not been implemented yet. This and (maybe)
Arabic shaping are what is left to be implemented, but this here is
already a big step.

One more note: Yes, the data files are very large, but they compress
down very well and the tarball stays below 800K. It's very important to
me that there's no need to pull any data from the web for compilation
or testing, for obvious reasons.

[0]:https://dwm.suckless.org/patches/bidi/

Signed-off-by: Laslo Hunhold <dev@frign.de>

Diff is too large, output suppressed.