STATUS (7064B)

1 Optimisation progress for libzahl. Benchmarks are done in the 2 range 1 to 4097 bits. So far all benchmarks on libzahl are 3 done with the following combinations of cc and libc: 4 5 gcc + glibc 6 gcc + musl 7 clang + glibc 8 9 Benchmarks on the other libraries are done with gcc and glibc. 10 11 All benchmarks are done on an x86-64 (specifically an Intel 12 Core 2 Quad CPU Q9300), without any extensions turned on 13 during compilation, and without any use of extensions in 14 assembly code. The benchmarks are performed with Linux as 15 the OS's kernel with 50 µs timer slack, and the benchmarking 16 processes are fixed to one CPU. 17 18 19 The following functions are probably implemented optimally: 20 21 zset .................... always fastest 22 zseti(a, +) ............. tomsfastmath is faster 23 zseti(a, -) ............. tomsfastmath is faster 24 zsetu ................... tomsfastmath is faster 25 zswap ................... always fastest 26 zzero ................... always fastest (shared with gmp) 27 zsignum ................. always fastest (shared with gmp) 28 zeven ................... always fastest (shared with gmp) 29 zodd .................... always fastest (shared with gmp) 30 zeven_nonzero ........... always fastest (shared with gmp) 31 zodd_nonzero ............ always fastest (shared with gmp) 32 zbtest .................. always fastest 33 zsave ................... always fastest 34 zload ................... always fastest 35 36 37 The following functions are probably implemented optimally, but 38 depends on other functions or call-cases for better performance: 39 40 zneg(a, b) .............. always fastest 41 zabs(a, b) .............. always fastest 42 ztrunc(a, b, c) ......... always fastest 43 zbset(a, b, 1) .......... always fastest 44 zbset(a, b, 0) .......... always fastest 45 zbset(a, b, -1) ......... always fastest 46 zsplit .................. alternating with gmp for fastest, but gmp is a bit faster on average 47 48 49 The following functions are probably implemented close to 50 optimally, further optimisation should not be a priority: 51 52 zadd_unsigned ........... fastest after ~140 (depends on cc and libc) compared against zadd too 53 ztrunc(a, a, b) ......... fastest until ~100, then 77 % (gcc) or 68 % (clang) of tomsfastmath 54 zbset(a, a, 1) .......... always fastest 55 zbset(a, a, 0) .......... always fastest 56 zbset(a, a, -1) ......... always fastest (only marginally faster than gmp with clang) 57 zlsb .................... always fastest <<suspicious>> 58 zlsh .................... not too fast anymore 59 zand .................... fastest after ~400, tomsfastmath before 60 zor ..................... fastest after ~1150, tomsfastmath before 61 zxor .................... alternative with gmp after ~700, tomsfastmath before (musl), a bit slow with glibc 62 znot .................... always fastest 63 64 65 The following functions require structural changes for 66 further optimisations: 67 68 zneg(a, a) .............. always fastest (shared with gmp (gcc)) 69 zabs(a, a) .............. 34 % (clang) or 8 % (gcc) of tomsfastmath 70 71 72 The following functions are probably implemented optimally 73 or close to optimally, except they contain some code that 74 should not be necessary after some bugs have been fixed: 75 76 zbits ................... always fastest 77 zcmpi(a, +) ............. always fastest 78 zcmpi(a, -) ............. always fastest 79 zcmpu ................... always fastest 80 81 82 It may be possible optimise the following functions 83 further: 84 85 zadd .................... fastest after ~90 (clang), ~260 (gcc+musl), or ~840 (gcc+glibc) (gcc+glibc is slow) 86 zcmp .................... almost always fastest (musl), almost always slowest (glibc) <<suspicious (clang)>> 87 zcmpmag ................. always fastest <<suspicious, see zcmp>> 88 89 90 The following functions could be optimised further: 91 92 zrsh .................... gmp is almost always faster; also tomsfastmath after ~3000 (gcc+glibc) or ~2800 (clang) 93 zsub_unsigned ........... always fastest (compared against zsub too) 94 zsub .................... always fastest (slower with gcc+glibc than gcc+musl or clang) 95 96 97 The following functions could probably be optimised further, 98 but their performance can be significantly improved by 99 optimising their dependencies: 100 101 zmul .................... slowest 102 zsqr .................... slowest 103 zstr_length(a, 10) ...... gmp is faster (clang is faster than gcc, musl is faster than glibc) 104 zstr(a, b, n) ........... slowest 105 106 107 musl has more stable performance than glibc. clang is better at 108 inlining than gcc. (Which is better at optimising must be judged 109 on a per-function basis.) 110 111 112 113 {{{ [out of date legacy area, this being phased out] 114 Optimisation progress for libzahl, compared to other big integer 115 libraries. These comparisons are for 152-bit integers. Functions 116 in parenthesis the right column are functions that needs 117 optimisation to improve the peformance of the function in the 118 left column. Double-parenthesis means there may be a better way 119 to do it. Inside square-brackets, there are some comments on 120 multi-bit comparisons. 121 122 zgcd .................... 21 % of gmp (zcmpmag) 123 zmodmul(big mod) ........ slowest ((zmul, zmod)) 124 zmodsqr(big mod) ........ slowest ((zmul, zmod)) 125 zmodmul(tiny mod) ....... slowest ((zmul)) 126 zmodsqr(tiny mod) ....... slowest ((zmul)) 127 zpow .................... slowest (zmul, zsqr) 128 zpowu ................... slowest (zmul, zsqr) 129 zmodpow ................. slowest (zmul, zsqr. zmod) 130 zmodpowu ................ slowest (zmul, zsqr, zmod) 131 zsets ................... 13 % of gmp 132 zrand(default uniform) .. 51 % of gmp 133 zptest .................. slowest (zrand, zmodpow, zsqr, zmod) 134 zdiv(big denum) ......... tomsfastmath is faster (zdivmod) 135 zmod(big denum) ......... fastest (zdivmod) 136 zdivmod(big denum) ...... fastest 137 zdiv(tiny denum) ........ slowest 138 zmod(tiny denum) ........ slowest 139 zdivmod(tiny denum) ..... slowest 140 }}} 141 142 143 144 Note, some corresponding functions are not implemented in 145 some other libraries. In such cases, they have been implemented 146 in the translation layers (found under bench/). Those 147 implementations are often suboptimal, but probably in style 148 with what you would write if you need that functionality. 149 Note further, if for example, you want do perform addition 150 and you know that your operands are non-negative, you would 151 choose zadd_unsigned in libzahl, but if you are using a 152 library that does not have the corrsponding function, you are 153 better of with the regular addition (zadd). This is however 154 reflected in the comment column. 155 156 Also note, TomsFastMath does not support arbitrarily large 157 integers, the limit is set at compile-time, which gives is 158 a significant performance advantage. Furthermore, no failure 159 check is done with GMP. Additionally, hebimath has been 160 excluded from these comparison because it is not fully 161 operational. 162 163 Also note, NOT does not mean the same thing in all libraries, 164 for example in GMP it means (-x - 1), thus, znot does not 165 use GMP's NOT in the translations layer. 166 167 168 The following optimisation flags have been tested: 169 170 -O0 ...................... Bad 171 -O1 ...................... Bad 172 -O2 ...................... Not so good 173 -O3 ...................... Good 174 -fno-builtin ............. Bad