Bug #12794
Closed
ZFS support for vectorized algorithms on x86 (HW support)
100%
Description
This is the second phase of the work to port the ZFS HW-accelerated raidz
calculation code from OpenZFS. This work depends on tickets 12668 and 12793.
Files
Related issues
Updated by Joshua M. Clulow over 2 years ago
- Related to Bug #12668: ZFS support for vectorized algorithms on x86 (initial support) added
Updated by Joshua M. Clulow over 2 years ago
- Related to Bug #12793: kernel FPU support added
Updated by Jerry Jelinek over 2 years ago
I ran a variety of different tests.
1) zfs test suite - this includes the raidz tests added in 12668 which test all of the algorithms in the code. The new sse2, ssse3 and avx2 algorithms are now tested.
2) I ran the subset of the zfs test suite that exercises raidz in a continuous loop for 24 hours as a stress test.
3) I ran a heavy fio write load onto a raidz to force kernel FPU usage for parity generation while simultaneously running the raidz_test generation code, which generates a heavy user-land FPU load. I ran the two together for half an hour and verified that there were no failures in any of the user-level tests. The goal was to test that concurrent kernel/user-level FPU usage is safe.
4) I ran this on a system where I had a copy of a set of files in a raidz2. I first failed one disk (using zinject) and diff-verified the files against a golden copy in a separate file system. I then repeated this test (after clearing the ARC) with 2 failed disks. The files were correctly reconstructed from parity in both cases.
5) For performance testing, the "raidz_test" program includes a benchmark mode. I have attached a complete set of output. Since this code runs through all of the algorithms and all of the permutations, there are a lot of results. This is a repeat of the performance testing from 12668, but now includes the FPU algorithms.
I ran this on a machine whose CPU supports all 3 of the new HW algorithms (sse2, ssse3 and avx2).
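As background for test 4 and for the gen_p / gen_pq names in the results, here is a minimal sketch of what single and double parity generation and reconstruction compute. It is a byte-at-a-time toy, not the illumos or OpenZFS code: the function names, the column layout, and the choice of coefficient 2^i for column i are assumptions made for illustration, and the real implementations process many bytes per instruction and may order coefficients differently.

```python
# Toy P/Q parity over GF(2^8). Illustrative only; all names are hypothetical.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse of a nonzero byte: a^254 == a^-1 in GF(2^8)."""
    return gf_pow(a, 254)


def gen_pq(cols):
    """Return (P, Q): P is the XOR of all columns, Q the sum of 2^i * D_i."""
    size = len(cols[0])
    p, q = bytearray(size), bytearray(size)
    for i, col in enumerate(cols):
        coeff = gf_pow(2, i)
        for k, byte in enumerate(col):
            p[k] ^= byte
            q[k] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def reconstruct_one(cols, x, p):
    """Rebuild column x from P and the surviving columns (cols[x] is ignored)."""
    out = bytearray(p)
    for i, col in enumerate(cols):
        if i != x:
            for k, byte in enumerate(col):
                out[k] ^= byte
    return bytes(out)


def reconstruct_two(cols, x, y, p, q):
    """Rebuild columns x and y from P, Q and the survivors."""
    pxy, qxy = bytearray(p), bytearray(q)
    for i, col in enumerate(cols):
        if i in (x, y):
            continue
        coeff = gf_pow(2, i)
        for k, byte in enumerate(col):
            pxy[k] ^= byte                   # leaves Dx ^ Dy
            qxy[k] ^= gf_mul(coeff, byte)    # leaves 2^x*Dx ^ 2^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, qxy[k] ^ gf_mul(gy, pxy[k])) for k in range(len(p)))
    dy = bytes(pxy[k] ^ dx[k] for k in range(len(p)))
    return dx, dy


if __name__ == "__main__":
    data = [bytes((i * 17 + k) & 0xFF for k in range(16)) for i in range(8)]
    p, q = gen_pq(data)
    assert reconstruct_one(data, 3, p) == data[3]
    assert reconstruct_two(data, 1, 6, p, q) == (data[1], data[6])
```

The inner loops over bytes are what the sse2, ssse3 and avx2 implementations replace with wide vector operations, which is why those implementations need the kernel FPU support from 12793.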
Here is a summary of some key comparisons at 4K, 128K and 1M recordsize. In general all of the FPU algorithms are much better than the original code on generation at 128K and 1M. At the small recordsize (4K) they are on par with the original for raidz1 and slower than it for raidz2/3. On reconstruction (see the attached output), all of the FPU algorithms are always the same or much better.
In these results, the columns are the following and the 'total_bw' column (2nd to last) provides the relevant comparison.
impl, math, dcols, iosize, disk_bw, total_bw, iter
raidz1 generation
original, gen_p, 8, 4096, 108.577368, 977.196313, 1048576
scalar, gen_p, 8, 4096, 105.701704, 951.315332, 1048576
sse2, gen_p, 8, 4096, 107.799976, 970.199786, 1048576
ssse3, gen_p, 8, 4096, 107.669024, 969.021218, 1048576
avx2, gen_p, 8, 4096, 108.621007, 977.589063, 1048576
original, gen_p, 8, 131072, 248.801670, 2239.215031, 32768
scalar, gen_p, 8, 131072, 354.875480, 3193.879323, 32768
sse2, gen_p, 8, 131072, 377.592469, 3398.332219, 32768
ssse3, gen_p, 8, 131072, 378.103709, 3402.933384, 32768
avx2, gen_p, 8, 131072, 404.689481, 3642.205326, 32768
original, gen_p, 8, 1048576, 475.440396, 4278.963560, 4096
scalar, gen_p, 8, 1048576, 1091.816247, 9826.346221, 4096
sse2, gen_p, 8, 1048576, 1353.059128, 12177.532150, 4096
ssse3, gen_p, 8, 1048576, 1366.004388, 12294.039488, 4096
avx2, gen_p, 8, 1048576, 1534.082664, 13806.743972, 4096

raidz2
original, gen_pq, 8, 4096, 106.136765, 1061.367650, 1048576
scalar, gen_pq, 8, 4096, 53.154873, 531.548732, 1048576
sse2, gen_pq, 8, 4096, 53.959130, 539.591303, 1048576
ssse3, gen_pq, 8, 4096, 54.129877, 541.298770, 1048576
avx2, gen_pq, 8, 4096, 54.395867, 543.958667, 1048576
original, gen_pq, 8, 131072, 195.754027, 1957.540271, 32768
scalar, gen_pq, 8, 131072, 293.189424, 2931.894238, 32768
sse2, gen_pq, 8, 131072, 459.507476, 4595.074761, 32768
ssse3, gen_pq, 8, 131072, 459.897220, 4598.972202, 32768
avx2, gen_pq, 8, 131072, 530.460067, 5304.600671, 32768
original, gen_pq, 8, 1048576, 312.232121, 3122.321213, 4096
scalar, gen_pq, 8, 1048576, 458.227846, 4582.278458, 4096
sse2, gen_pq, 8, 1048576, 1028.868958, 10288.689577, 4096
ssse3, gen_pq, 8, 1048576, 1025.170281, 10251.702811, 4096
avx2, gen_pq, 8, 1048576, 1397.466808, 13974.668081, 4096

raidz3
original, gen_pqr, 8, 4096, 103.737739, 1141.115133, 1048576
scalar, gen_pqr, 8, 4096, 35.388557, 389.274123, 1048576
sse2, gen_pqr, 8, 4096, 36.030771, 396.338479, 1048576
ssse3, gen_pqr, 8, 4096, 36.085041, 396.935455, 1048576
avx2, gen_pqr, 8, 4096, 36.302696, 399.329651, 1048576
original, gen_pqr, 8, 131072, 105.744226, 1163.186483, 32768
scalar, gen_pqr, 8, 131072, 160.649974, 1767.149711, 32768
sse2, gen_pqr, 8, 131072, 320.647502, 3527.122526, 32768
ssse3, gen_pqr, 8, 131072, 318.813970, 3506.953670, 32768
avx2, gen_pqr, 8, 131072, 399.040193, 4389.442126, 32768
original, gen_pqr, 8, 1048576, 132.985046, 1462.835507, 4096
scalar, gen_pqr, 8, 1048576, 210.713554, 2317.849091, 4096
sse2, gen_pqr, 8, 1048576, 596.334562, 6559.680178, 4096
ssse3, gen_pqr, 8, 1048576, 600.951422, 6610.465643, 4096
avx2, gen_pqr, 8, 1048576, 953.593970, 10489.533666, 4096
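To make the comparison on total_bw easier to read, the rows can be turned into ratios against the "original" implementation for the same math and iosize. The following is a small sketch of that calculation; the helper names are hypothetical, and the embedded rows are the raidz2 4K and 1M results copied from the tables above.

```python
# Compute total_bw ratios relative to the "original" implementation.
# Helper names are illustrative; the rows are copied from the raidz2 results.

ROWS = """\
original, gen_pq, 8, 4096, 106.136765, 1061.367650, 1048576
scalar, gen_pq, 8, 4096, 53.154873, 531.548732, 1048576
sse2, gen_pq, 8, 4096, 53.959130, 539.591303, 1048576
ssse3, gen_pq, 8, 4096, 54.129877, 541.298770, 1048576
avx2, gen_pq, 8, 4096, 54.395867, 543.958667, 1048576
original, gen_pq, 8, 1048576, 312.232121, 3122.321213, 4096
scalar, gen_pq, 8, 1048576, 458.227846, 4582.278458, 4096
sse2, gen_pq, 8, 1048576, 1028.868958, 10288.689577, 4096
ssse3, gen_pq, 8, 1048576, 1025.170281, 10251.702811, 4096
avx2, gen_pq, 8, 1048576, 1397.466808, 13974.668081, 4096
"""


def parse_rows(text):
    """Parse 'impl, math, dcols, iosize, disk_bw, total_bw, iter' lines."""
    rows = []
    for line in text.strip().splitlines():
        impl, math, dcols, iosize, disk_bw, total_bw, iters = (
            field.strip() for field in line.split(",")
        )
        rows.append({
            "impl": impl,
            "math": math,
            "dcols": int(dcols),
            "iosize": int(iosize),
            "disk_bw": float(disk_bw),
            "total_bw": float(total_bw),
            "iter": int(iters),
        })
    return rows


def ratios_vs_original(rows):
    """Map (impl, math, iosize) to total_bw divided by the original's total_bw."""
    baseline = {
        (r["math"], r["iosize"]): r["total_bw"]
        for r in rows
        if r["impl"] == "original"
    }
    return {
        (r["impl"], r["math"], r["iosize"]):
            r["total_bw"] / baseline[(r["math"], r["iosize"])]
        for r in rows
    }


if __name__ == "__main__":
    for key, ratio in sorted(ratios_vs_original(parse_rows(ROWS)).items()):
        print(key, f"{ratio:.2f}x")
```

For these rows the ratios come out at roughly 0.5x for every new implementation at 4K, and at roughly 1.5x (scalar), 3.3x (sse2, ssse3) and 4.5x (avx2) at 1M.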
Updated by Jerry Jelinek over 2 years ago
- File fpu_bench_new.txt added
Updated by Electric Monk over 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit f91a454727d8e1cd4bbbe2d4efd2754590298697
commit f91a454727d8e1cd4bbbe2d4efd2754590298697
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: 2020-07-15T12:43:07.000Z

    12794 ZFS support for vectorized algorithms on x86 (HW support)
    Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
Updated by Marcel Telka over 2 years ago
- Related to Bug #12968: curthread swtch-ing while the kernel is using the FPU added