Feature #16423

closed

Import fletcher-4 algorithms from OpenZFS

Added by Andy Fiddaman 26 days ago. Updated 16 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug: oxide:stlouis#541

Description

A colleague was doing a 4K random read benchmark on an illumos system, and generated a flame graph that shows we are spending a lot of time calculating Fletcher-4 checksums:

Quoth he:

It looks like fletcher_4_incremental_native operates one byte at a time

OpenZFS includes optimized implementations for SSE / AVX2 / AVX-512. We should explore bringing them into our system.
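
For reference, the Fletcher-4 recurrence used by ZFS can be sketched as below: four 64-bit running sums, updated once per 32-bit word of input. This is a minimal illustration only, not the illumos or OpenZFS source; the type `f4_sums_t` and the function `f4_scalar_update` are hypothetical names chosen for this sketch. Because each sum depends on the one before it, a scalar loop advances one word per step, whereas vectorized variants generally keep several independent sets of sums and combine them at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for this sketch: the four running 64-bit sums. */
typedef struct {
	uint64_t a, b, c, d;
} f4_sums_t;

/*
 * Fold `size` bytes (which must be a multiple of 4) into `s`, one
 * 32-bit word per step, reading each word in host byte order.
 */
static void
f4_scalar_update(f4_sums_t *s, const void *buf, size_t size)
{
	const unsigned char *p = buf;

	assert(size % sizeof (uint32_t) == 0);
	for (size_t i = 0; i < size; i += sizeof (uint32_t)) {
		uint32_t w;

		memcpy(&w, p + i, sizeof (w));	/* alignment-safe load */
		s->a += w;
		s->b += s->a;
		s->c += s->b;
		s->d += s->c;
	}
}
```

Feeding the three words 1, 2, 3 through this loop from zeroed sums gives a=6, b=10, c=15, d=21.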


Files

clipboard-202403261138-c0g5v.png (261 KB), Andy Fiddaman, 2024-03-26 11:38 AM
Actions #1

Updated by Andy Fiddaman 25 days ago

I've imported the following commits from OpenZFS:

1eeb4562a72ab29345572609e1e4315ecd26c5a1
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date:   Wed Dec 9 15:34:16 2015 -0800

    Implementation of AVX2 optimized Fletcher-4

0dab2e84fcecff2806287efacb7c6205f346f69d
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue Jun 28 13:31:21 2016 -0700

    Vectorized fletcher_4 must be 128-bit aligned

35a76a0366372d89a0f1ac3cebd5bc7646aadec3
Author: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Date:   Thu Jun 23 23:32:40 2016 -0400

    Implementation of SSE optimized Fletcher-4

70b258fc962fd40673b9a47574cb83d8438e7d94
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Wed Jul 6 13:42:04 2016 +0200

    Fletcher4 implementation using avx512f instruction set

fc897b24b2efafccb5c9e915b81dc5f797673e72
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Tue Jul 12 17:50:54 2016 +0200

    Rework of fletcher_4 module

37f520db2d19389deb2a68065391ae2b229c6b50
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Fri Sep 23 03:52:29 2016 +0200

    Fletcher4: Incremental using SIMD

5bf703b8f381b6a8a89a2c251ba04dc9db59bcd6
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sun Sep 25 00:56:22 2016 +0200

    Fletcher4: save/reload implementation context

7f3194932d22c667026aff1b263ceaa1ebd012ee
Author: Romain Dolbeau <romain.github@dolbeau.name>
Date:   Fri Nov 4 18:53:03 2016 +0100

    Add superscalar fletcher4

2fe36b0bfb80a4955f6ff42b2448f432223f6011
Author: David Quigley <david.quigley@intel.com>
Date:   Wed Feb 1 10:34:22 2017 -0700

    Use fletcher_4 routines natively with `abd_iterate_func()`

0b2a642351f375cb9be3d2569a0ac0417340c741
Author: Romain Dolbeau <romain.dolbeau@atos.net>
Date:   Wed Oct 30 20:26:14 2019 +0100

    Add AVX512BW variant of fletcher

83b698dc42bf9ff06aa025c625eca39c9785f3e1
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Sun Dec 6 09:57:20 2020 -0800

    Reduce fletcher4 and raidz benchmark times

59493b63c18ea223857066218d6a58b67eb88159
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Mon Dec 5 14:00:34 2022 -0500

    Micro-optimize fletcher4 calculations

78289b84589e632d87504df6a9c63b5ac694d2f9
Author: Attila Fülöp <attila@fueloep.org>
Date:   Tue Mar 14 17:45:28 2023 +0100

    zcommon: Refactor FPU state handling in fletcher4

0375465536c9790a7fb3e1ac94412fef91076e3f
Author: Romain Dolbeau <romain@dolbeau.org>
Date:   Mon Apr 26 21:42:42 2021 +0200

    Fix AVX512BW Fletcher code on AVX512-but-not-BW machines

0ad5f4344238b548e2240a405d418f7af9290623
Author: Rich Ercolani <rincebrain@gmail.com>
Date:   Fri Mar 24 13:29:19 2023 -0400

    Drop lying to the compiler in the fletcher4 code

dc03fa3092472c40bf1b6c7d7ea3170e3ffa9e38
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sun Sep 25 10:35:12 2016 +0200

    Fletcher4: Init in libzfs_init()

    All users of fletcher4 methods must call `fletcher_4_init()/_fini()`
    There's no benchmarking overhead when called from user-space.

616fa7c02b0cc373f011998f56ed53bb37742d13
Author: Tim Chase <tim@chase2k.com>
Date:   Tue Nov 29 15:47:05 2016 -0600

    zstreamdump needs to initialize fletcher 4 support

    Otherwise, the checksum function pointer isn't initialized.

And made a couple of additional adjustments:

- Implement benchmark kstats in a way that works on illumos, since we don't
have table-based kstats;
- Separate out the FPU use flags between native and byteswap operations, to
support cases where the fastest native and fastest byteswap implementations
differ in their FPU use.
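
The second adjustment above can be illustrated with a small sketch. It assumes a benchmark table with one throughput figure per direction and a single flag saying whether the implementation needs FPU/SIMD state; every name here (`f4_bench_t`, `f4_fastest_t`, `f4_pick_fastest`) is hypothetical and does not reflect the actual illumos data structures. The point is only that the winner is chosen independently for native and byteswap, so the FPU flag has to be recorded per direction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark record for one implementation. */
typedef struct {
	const char *name;
	uint64_t native_bw;	/* measured native throughput */
	uint64_t byteswap_bw;	/* measured byteswap throughput */
	int uses_fpu;		/* non-zero if FPU/SIMD state is needed */
} f4_bench_t;

/* Hypothetical result: one winner and one FPU flag per direction. */
typedef struct {
	const f4_bench_t *native;
	const f4_bench_t *byteswap;
	int native_uses_fpu;
	int byteswap_uses_fpu;
} f4_fastest_t;

static f4_fastest_t
f4_pick_fastest(const f4_bench_t *impls, size_t n)
{
	f4_fastest_t f = { NULL, NULL, 0, 0 };

	assert(impls != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (f.native == NULL ||
		    impls[i].native_bw > f.native->native_bw)
			f.native = &impls[i];
		if (f.byteswap == NULL ||
		    impls[i].byteswap_bw > f.byteswap->byteswap_bw)
			f.byteswap = &impls[i];
	}
	/* Track FPU use separately, since the two winners may differ. */
	if (f.native != NULL)
		f.native_uses_fpu = f.native->uses_fpu;
	if (f.byteswap != NULL)
		f.byteswap_uses_fpu = f.byteswap->uses_fpu;
	return (f);
}
```

With a single shared flag, a table in which a SIMD routine wins for native but a scalar routine wins for byteswap would force one of the two paths to be handled incorrectly with respect to FPU state.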

With these in place, the microbenchmarks report AVX2 as roughly 4 times
faster than the baseline (scalar) on an AMD Zen3 processor, and superscalar4
as a gain of just under 30%.

gimlet-sn06 # kstat -p zfs::fletcher_4_bench: | sort -n +1
zfs:0:fletcher_4_bench:class    misc
zfs:0:fletcher_4_bench:crtime   96.452515986
zfs:0:fletcher_4_bench:snaptime 187.892179504
zfs:0:fletcher_4_bench:scalar_byteswap  5840770469
zfs:0:fletcher_4_bench:sse2_byteswap    6071121346
zfs:0:fletcher_4_bench:scalar_native    6203086825
zfs:0:fletcher_4_bench:superscalar_byteswap     6671237898
zfs:0:fletcher_4_bench:superscalar4_byteswap    7297386075
zfs:0:fletcher_4_bench:superscalar_native       7349402488
zfs:0:fletcher_4_bench:superscalar4_native      7972499341
zfs:0:fletcher_4_bench:ssse3_byteswap   11991868764
zfs:0:fletcher_4_bench:sse2_native      13019238768
zfs:0:fletcher_4_bench:ssse3_native     13054258861
zfs:0:fletcher_4_bench:avx2_byteswap    23675409613
zfs:0:fletcher_4_bench:avx2_native      24785051257
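
The speedup figures can be checked directly from the Zen3 values above. The sketch below assumes that larger kstat values mean higher throughput, which is consistent with the ordering shown; the macro names and `bench_ratio` are hypothetical helpers for this illustration.

```c
#include <stdint.h>

/* Values copied from the Zen3 kstat output above. */
#define	ZEN3_SCALAR_NATIVE		6203086825ULL
#define	ZEN3_SUPERSCALAR4_NATIVE	7972499341ULL
#define	ZEN3_AVX2_NATIVE		24785051257ULL

/* Ratio of a candidate's benchmark value to the baseline's. */
static double
bench_ratio(uint64_t candidate, uint64_t baseline)
{
	return ((double)candidate / (double)baseline);
}
```

`bench_ratio(ZEN3_AVX2_NATIVE, ZEN3_SCALAR_NATIVE)` comes out just under 4.0, and `bench_ratio(ZEN3_SUPERSCALAR4_NATIVE, ZEN3_SCALAR_NATIVE)` lands between 1.28 and 1.29.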

Running ztest overnight cycling through all of the algorithms has not
shown any issues, and zpool scrub is not reporting any errors. I still
have to run the ZFS test suite and find a system that supports AVX512.

Actions #2

Updated by Andy Fiddaman 25 days ago

Stats from a Zen4 system, which has AVX512F and AVX512BW:

zfs:0:fletcher_4_bench:ssse3_native     6230629479
zfs:0:fletcher_4_bench:scalar_byteswap  6411794155
zfs:0:fletcher_4_bench:sse2_native      6457002237
zfs:0:fletcher_4_bench:scalar_native    7680820400
zfs:0:fletcher_4_bench:sse2_byteswap    8216794820
zfs:0:fletcher_4_bench:superscalar_native       8805834299
zfs:0:fletcher_4_bench:superscalar4_byteswap    9541479527
zfs:0:fletcher_4_bench:superscalar_byteswap     9737211839
zfs:0:fletcher_4_bench:superscalar4_native      10290581962
zfs:0:fletcher_4_bench:avx512f_byteswap 13255192149
zfs:0:fletcher_4_bench:ssse3_byteswap   15079708314
zfs:0:fletcher_4_bench:avx2_byteswap    27089237242
zfs:0:fletcher_4_bench:avx512bw_byteswap        27373879316
zfs:0:fletcher_4_bench:avx2_native      27513452576
zfs:0:fletcher_4_bench:avx512f_native   31721560248
zfs:0:fletcher_4_bench:avx512bw_native  31756114965

I ran a modified ztest on this system that ran only the ztest_fletcher and ztest_fletcher_incr tests, cycling through the different algorithms and checking the output against the original scalar algorithm.

Workload summary:

  Calls      Time   Function
  -----      ----   --------
    305    10m21s   ztest_fletcher
    314     9m24s   ztest_fletcher_incr
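
The kind of cross-check described above can be sketched as follows. This is not the ztest code; it is a minimal illustration in which a candidate routine is fed the buffer in chunks, carrying its sums between calls, and the result is compared with a one-shot pass of a scalar reference. The names `f4_ck_t`, `f4_update_fn`, `f4_ref_update` and `f4_cross_check` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical checksum state: four running 64-bit sums. */
typedef struct {
	uint64_t a, b, c, d;
} f4_ck_t;

/* Hypothetical signature shared by every implementation under test. */
typedef void (*f4_update_fn)(f4_ck_t *, const unsigned char *, size_t);

/* Scalar reference: one 32-bit host-order word per step. */
static void
f4_ref_update(f4_ck_t *s, const unsigned char *p, size_t size)
{
	assert(size % sizeof (uint32_t) == 0);
	for (size_t i = 0; i < size; i += sizeof (uint32_t)) {
		uint32_t w;

		memcpy(&w, p + i, sizeof (w));
		s->a += w;
		s->b += s->a;
		s->c += s->b;
		s->d += s->c;
	}
}

/*
 * Return 1 if `impl`, run incrementally over pieces of at most `chunk`
 * bytes, matches a single pass of the scalar reference; 0 otherwise.
 */
static int
f4_cross_check(f4_update_fn impl, const unsigned char *buf, size_t size,
    size_t chunk)
{
	f4_ck_t want = { 0, 0, 0, 0 };
	f4_ck_t got = { 0, 0, 0, 0 };

	assert(chunk != 0 && chunk % 4 == 0 && size % 4 == 0);
	f4_ref_update(&want, buf, size);
	for (size_t off = 0; off < size; off += chunk) {
		size_t n = (size - off < chunk) ? size - off : chunk;

		impl(&got, buf + off, n);
	}
	return (want.a == got.a && want.b == got.b &&
	    want.c == got.c && want.d == got.d);
}
```

Passing each implementation in turn as `impl`, with varying buffer and chunk sizes, gives a simple way to confirm that both the one-shot and incremental paths agree with the scalar result.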

DTrace also confirms that we're hitting all of the algorithms.

# dtrace -n 'pid$target::fletcher*:entry{@[probefunc] = count()}' -p `pgrep -n ztest`
dtrace: description 'pid$target::fletcher*:entry' matched 48 probes
dtrace: pid 574 has exited

  fletcher_4_fini                                                   2
  fletcher_2_incremental_native                                   124
  fletcher_2_native                                               124
  fletcher_init                                                   124
  fletcher_4_impl_set                                            3328
  fletcher_4_avx2_fini                                         129324
  fletcher_4_avx2_init                                         129324
  fletcher_4_superscalar4_fini                                 129325
  fletcher_4_superscalar4_init                                 129325
  fletcher_4_superscalar_init                                  129325
  fletcher_4_superscalar_fini                                  129326
  fletcher_4_scalar_init                                       139175
  fletcher_4_scalar_fini                                       139176
  fletcher_4_avx512f_fini                                      258649
  fletcher_4_avx512f_init                                      258649
  fletcher_4_sse2_init                                         258650
  fletcher_4_sse2_fini                                         258654
  fletcher_4_avx512bw_byteswap                                 294187
  fletcher_4_avx2_byteswap                                     297096
  fletcher_4_avx512f_byteswap                                  311135
  fletcher_4_incremental_byteswap                              313377
  fletcher_4_incremental_native                                313377
  fletcher_4_avx2_native                                       318699
  fletcher_4_superscalar_byteswap                              322435
  fletcher_4_superscalar4_native                               322558
  fletcher_4_sse2_byteswap                                     330096
  fletcher_4_superscalar4_byteswap                             340785
  fletcher_4_ssse3_byteswap                                    345538
  fletcher_4_byteswap                                          351707
  fletcher_4_native                                            351712
  fletcher_4_superscalar_native                                387153
  fletcher_4_scalar_native                                     392469
  fletcher_4_scalar_byteswap                                   412981
  fletcher_4_avx512f_native                                    558356
  fletcher_4_sse2_native                                       675024
Actions #3

Updated by Electric Monk 25 days ago

  • Gerrit CR set to 3386
Actions #4

Updated by Andy Fiddaman 24 days ago

I ran the ZFS test suite on a system running DEBUG bits with these
changes applied. The results are excellent considering the known problems
with the test suite.

Results Summary
PASS    1259
FAIL       8
SKIP      24

I confirmed that all of the skips are related to TRIM tests, which makes sense
as I ran this in a propolis VM with emulated NVMe drives that do not support
TRIM.

% grep -F '[SKIP]' 20240328T130614/log | cut -d/ -f5- | cut -d\  -f1
functional/cli_root/zpool_trim/setup
functional/cli_root/zpool_trim/zpool_trim_attach_detach_add_remove
functional/cli_root/zpool_trim/zpool_trim_import_export
functional/cli_root/zpool_trim/zpool_trim_multiple
functional/cli_root/zpool_trim/zpool_trim_neg
functional/cli_root/zpool_trim/zpool_trim_offline_export_import_online
functional/cli_root/zpool_trim/zpool_trim_online_offline
functional/cli_root/zpool_trim/zpool_trim_partial
functional/cli_root/zpool_trim/zpool_trim_rate
functional/cli_root/zpool_trim/zpool_trim_rate_neg
functional/cli_root/zpool_trim/zpool_trim_secure
functional/cli_root/zpool_trim/zpool_trim_split
functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_neg
functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_pos
functional/cli_root/zpool_trim/zpool_trim_suspend_resume
functional/cli_root/zpool_trim/zpool_trim_unsupported_vdevs
functional/cli_root/zpool_trim/zpool_trim_verify_checksums
functional/cli_root/zpool_trim/zpool_trim_verify_trimmed
functional/trim/setup
functional/trim/autotrim_integrity
functional/trim/autotrim_config
functional/trim/autotrim_trim_integrity
functional/trim/trim_integrity
functional/trim/trim_config

The failures are below, annotated with illumos issue numbers:

  • functional/cli_root/zpool_import/zpool_import_missing_003_pos (#16431)
  • functional/delegate/zfs_allow_010_pos (#6568)
  • functional/delegate/zfs_allow_012_neg (#6568)
  • functional/projectquota/projectspace_004_pos
  • functional/refreserv/refreserv_004_pos (#16274)
  • functional/rsend/rsend_008_pos (#12033)
  • functional/rsend/send_encrypted_freeobjects
  • functional/vdev_zaps/vdev_zaps_007_pos (#12035)

For the failing tests that don't have issues logged, a search of other illumos
issues shows them also failing in other test runs. I'll log issues for
them.

Actions #5

Updated by Electric Monk 16 days ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 0886dcadf4b2cd677c3b944167f0d16ccb243616

commit  0886dcadf4b2cd677c3b944167f0d16ccb243616
Author: Andy Fiddaman <illumos@fiddaman.net>
Date:   2024-04-05T11:03:35.000Z

    16423 Import fletcher-4 algorithms from OpenZFS
    Portions contributed by: Attila Fülöp <attila@fueloep.org>
    Portions contributed by: Brian Behlendorf <behlendorf1@llnl.gov>
    Portions contributed by: David Quigley <david.quigley@intel.com>
    Portions contributed by: Gvozden Nešković <neskovic@gmail.com>
    Portions contributed by: Jinshan Xiong <jinshan.xiong@intel.com>
    Portions contributed by: Rich Ercolani <rincebrain@gmail.com>
    Portions contributed by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Portions contributed by: Romain Dolbeau <romain.github@dolbeau.name>
    Portions contributed by: Tim Chase <tim@chase2k.com>
    Portions contributed by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
    Reviewed by: Rich Lowe <richlowe@richlowe.net>
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
