In the quest to get the most security and performance out of Tor, I've done a lot of custom compiling. In this post I'll talk about the performance gains I've seen.
The Setup:
I run our Tor server on a low-power, 4-core Intel Xeon D-2123IT CPU running at 2.2 GHz.
As far as the operating system is concerned, it is a "Skylake-avx512" CPU.
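One way to sanity-check this is to look at the CPU feature flags the OS reports. The sketch below is only an illustration: it assumes a Linux-style, space-separated flags line (as found in /proc/cpuinfo), and the helper name and the sample string are made up for this example. The five AVX-512 subsets listed are the ones GCC enables for -march=skylake-avx512; other operating systems report CPU features in a different format.

```python
# AVX-512 subsets that GCC's -march=skylake-avx512 enables on top of skylake.
SKYLAKE_AVX512_FLAGS = {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}


def missing_avx512_flags(flags_line):
    """Return the AVX-512 flags absent from a flags line (hypothetical helper)."""
    present = {flag.lower() for flag in flags_line.split()}
    return sorted(SKYLAKE_AVX512_FLAGS - present)


# Made-up sample input, not copied from the real server.
sample = "fpu sse4_2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
print(missing_avx512_flags(sample))          # nothing missing for this sample
print(missing_avx512_flags("fpu avx avx2"))  # all five subsets missing
```

If the list comes back empty, compiling with -march=skylake-avx512 should at least not produce instructions the CPU cannot run.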
The Compile:
Originally I compiled Tor with "-march=skylake" set in the Makefile and configured it with "./configure --with-openssl-dir=/openssl-path".
Later I realized I had not compiled with AVX-512 support, so I changed the flag to "-march=skylake-avx512".
You will also need to make sure OpenSSL is compiled with at least "./Configure -march=skylake-avx512 enable-ec_nistp_64_gcc_128".
The Question:
Did it really make much of a difference? How much does it help or hurt? Does the kind of math that Tor performs take advantage of the AVX features? Let's see!
The Performance Test:
The tor bench performance testing tool (tor /src/test/bench) was used for these tests.
Note: I ran each test only once instead of averaging multiple runs, because I was looking only for large differences.
I have highlighted anything that was more than a 5% increase or decrease in performance.
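For anyone who wants to repeat this with several runs, here is a minimal sketch of the comparison. It is not the script used for this post: the function names are made up for illustration, and it simply averages the timings for each build and checks the change against the same 5% cut-off. The sample values are the single-run digestmap_get numbers from the table in this post.

```python
from statistics import mean

THRESHOLD_PERCENT = 5.0  # same cut-off used for highlighting in this post


def percent_change(baseline, candidate):
    """Signed change in time relative to the baseline; negative means faster."""
    return (candidate - baseline) / baseline * 100.0


def is_notable(baseline_runs, candidate_runs, threshold=THRESHOLD_PERCENT):
    """Average each list of timings, then compare the change to the threshold."""
    change = percent_change(mean(baseline_runs), mean(candidate_runs))
    return abs(change) > threshold, change


# Single-run values from the dmap digestmap_get row (ns).
flag, change = is_notable([87.79], [64.83])
print(flag, round(change, 2))  # True -26.15
```

With more than one run per build, just pass longer lists; the averaging step is the only part that changes.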
The Most Important Tests:
Please note that the performance numbers are measured in us (microseconds), ns (nanoseconds), and ms (milliseconds).
The most important performance numbers are around
The Conclusion:
If your CPU supports the AVX512 extensions it is a
What is AVX-512?
(From wikipedia) AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 (Knights Landing)[1] and Skylake-X CPUs; this includes the Core-X series (excluding the Core i5-7640X and Core i7-7740X), as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series.[2]
EDIT! NEW DATA from 2021 May 24
I've updated the Tor server OS to FreeBSD 13 to support KTLS and OSSL. I then recompiled Tor, re-ran the benchmark, and added new columns comparing AVX-512 vs. KTLS.
What are KTLS and OSSL, and why are they making such a huge improvement?
The kernel now supports in-kernel framing and encryption of Transport Layer Security (TLS) data on TCP sockets for TLS versions 1.0 through 1.3. Transmit offload via in-kernel crypto drivers is supported for MtE cipher suites using AES-CBC as well as AEAD cipher suites using AES-GCM. Receive offload via in-kernel crypto drivers is supported for AES-GCM cipher suites for TLS 1.2. Using KTLS requires the use of a KTLS-aware userland SSL library. The OpenSSL library included in the base system does not enable KTLS support by default, but support can be enabled by building with the WITH_OPENSSL_KTLS option.
Test Name | skylake | skylake-avx512 | Difference (avx512 vs. skylake) | Percentage (avx512 vs. skylake) | KTLS | Difference (KTLS vs. avx512) | Percentage (KTLS vs. avx512) |
dmap: digestset_probably_contains | 127.91 | 70.37 | -57.54 ns | 44.98% Faster | 109.07 | +38.7 ns | 55% Slower |
dmap: digestmap_set | 101.17 | 71.30 | -29.87 ns | 29.52% Faster | 86.47 | +15.17 ns | 21% Slower |
dmap: digestmap_get | 87.79 | 64.83 | -22.96 ns | 26.15% Faster | 72.63 | +7.8 ns | 12% Slower |
dmap: digestset_add | 136.86 | 72.70 | -64.16 ns | 46.88% Faster | 116.94 | +44.24 ns | 60% Slower |
onion_TAP: Client-side part 1 | 1006.4394 | 1026.0896 | +19.6502 ns | 1.95% Slower | 146.24 | -879.8496 ns | 702% Faster |
onion_TAP: Server-side key right | 2956.9824 | 3027.5585 | +70.5761 ns | 2.38% Slower | 364.61 | -2662.9485 ns | 831% Faster |
onion_TAP: Server-side key wrong | 4068.7480 | 4177.2812 | +108.5332 ns | 2.66% Slower | 487.41 | -3689.8712 ns | 857% Faster |
onion_TAP: Client-side part 2 | 917.4394 | 933.0117 | +15.5723 ns | 1.69% Slower | 119.68 | -813.3317 ns | 784% Faster |
onion_ntor: 25519 boff: Client part 1 | 103.1357 | 101.2304 | -1.9053 us | 01.84% Faster | 74.57 | -26.6604 us | 35% Faster |
onion_ntor: 25519 boff: Server side | 316.1425 | 310.2880 | -5.8545 us | 01.85% Faster | 227.77 | -82.518 us | 36% Faster |
onion_ntor: 25519 boff: Client part 2 | 212.8984 | 209.0576 | -3.8408 us | 01.80% Faster | 155.32 | -53.7376 us | 34% Faster |
onion_ntor: 25519 bon: Client part 1 | 33.6640 | 32.6943 | -0.9697 us | 02.88% Faster | 23.08 | -9.6143 us | 41% Faster |
onion_ntor: 25519 bon: Server side | 246.8164 | 241.8593 | -4.9571 us | 02.00% Faster | 176.26 | -65.5993 us | 37% Faster |
onion_ntor: 25519 bon: Client part 2 | 212.8955 | 209.0898 | -3.8057 us | 01.78% Faster | 153 | -56.0898 us | 36% Faster |
ed25519-donna off: verify signature | 209.05 | 190.92 | -18.13 us | 08.67% Faster | 176.88 | -14.04 us | 08% Faster |
ed25519-donna on: verify signature | 87.31 | 81.85 | -5.46 us | 06.25% Faster | 67.38 | -14.47 us | 24% Faster |
ed25519-donna off: gen pub key | 55.89 | 60.22 | +4.33 us | 07.74% Slower | 46.01 | -14.21 us | 30% Faster |
ed25519-donna on: gen pub key | 25.43 | 24.15 | -1.28 us | 05.03% Faster | 19.62 | -4.53 us | 23% Faster |
dh: Complete DH handshakes | 3.7169 | 3.7855 | +0.0686 ms | 01.84% Slower | 0.49 | -3.2955 ms | 772% Faster |
ecdh_p256: Complete ECDH P-256 | 0.5149 | 0.5203 | +0.0054 ms | 01.04% Slower | 0.19 | -0.3303 ms | 273% Faster |
ecdh_p224: Complete ECDH P-224 | 0.3239 | 0.3317 | +0.0078 ms | 02.40% Slower | 2.13 | +1.79 ms | 642% Slower |
md_parse: Microdescriptor parse | 17452.62 | 13226.09 | -4226.53 ns | 24.21% Faster | 11877.52 | -1348.57 ns | 11% Faster |
crypto_strongest_rand(16) | 12051.07 | 3192.13 | -8858.94 ns | 377% Faster | | | |
sha512 (2048) | 7344.62 | 4228.52 | -3116.10 ns | 173% Faster | | | |
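A note on reading the percentage columns: the table appears to mix two conventions. Most of the AVX-512 rows match "percent of the old time saved", while the last two rows (and the large KTLS gains) match "old time as a percentage of the new time". The sketch below, with function names made up for illustration, computes both from the skylake and skylake-avx512 columns so the rows can be compared on the same footing.

```python
def time_reduction_percent(old, new):
    """Percent of the old time that was saved, e.g. 100 ns -> 50 ns is 50%."""
    return (old - new) / old * 100.0


def speed_ratio_percent(old, new):
    """Old time as a percentage of the new time, e.g. 100 ns -> 50 ns is 200%."""
    return old / new * 100.0


# (skylake, skylake-avx512) timings in ns, copied from the table above.
rows = {
    "digestset_probably_contains": (127.91, 70.37),
    "crypto_strongest_rand(16)": (12051.07, 3192.13),
    "sha512 (2048)": (7344.62, 4228.52),
}

for name, (old, new) in rows.items():
    reduction = round(time_reduction_percent(old, new), 2)
    ratio = round(speed_ratio_percent(old, new), 1)
    print(f"{name}: {reduction}% less time, {ratio}% speed ratio")
```

Under the first convention no row can exceed 100%, so a figure like "377% Faster" only makes sense as a ratio: the new build takes a bit over a quarter of the original time.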
I don't fully understand why P-224 is so much slower while P-256 is faster.
I also don't understand why dmap performance with KTLS falls back to near the level of the non-AVX-512 compile.
I'm going to see what Tor uses the most, and ask some Tor developers to speculate on which elements are the most influential to overall performance.