sd timeout/retry settings are absurd
The sd command timeout of 60 seconds with up to 15 retries is just silly. Not only does this make it highly likely that an otherwise healthy system will trigger the ZFS deadman timer and panic, it also makes it likely that the system will be unusable for an extended period if the operation being retried is a read. Worse, there is simply no reason to believe that modern disk drives need this sort of heroic effort.
We will adjust the defaults as follows:
- un_retry_count from 5 down to 2
- un_cmd_timeout from 60 (seconds) down to 10
- un_victim_retry_count from 10 down to 1
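
As a rough illustration of why this matters, the worst-case time a single command can occupy the driver can be estimated from these three values. This is only a sketch: `worst_case_seconds` is a hypothetical helper, and it assumes that every attempt runs to the full timeout and that ordinary and victim retries simply add up, which is how the "up to 15 retries" figure above is read here.

```c
/*
 * Hypothetical helper, not part of sd: estimate the worst-case number of
 * seconds one command can take, assuming each attempt runs to the full
 * timeout and ordinary and victim retries are additive.
 */
int
worst_case_seconds(int cmd_timeout, int retry_count, int victim_retry_count)
{
	/* One initial attempt plus every permitted retry. */
	int attempts = 1 + retry_count + victim_retry_count;

	return (cmd_timeout * attempts);
}
```

Under those assumptions, `worst_case_seconds(60, 5, 10)` gives 960 seconds for the current defaults, while `worst_case_seconds(10, 2, 1)` gives 40 seconds for the proposed ones.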
SD_IO_TIME is the hard-coded default command timeout in seconds. It is used for commands executed during sd attach only, prior to having information about the target type.
sd:sd_io_time is set by default to SD_IO_TIME. However, it can be tuned using /etc/system so that the tuned value is used once the initial probes are done. A small code change will be introduced so that optical devices continue to use SD_IO_TIME; while this is suboptimal (this should always be tunable), we neither use nor recommend optical devices, so changing behaviour here could only harm users.
sd.conf allows setting retries-timeout, retries-reset, retries-busy, and retries-notready. We will set these to 1, except for retries-notready which we will leave at the default value (disks will not return this condition except during spinup).
un_victim_retry_count is hardcoded to twice the value of un_retry_count; however, it is set during attach after un_retry_count has been initialised to its default value but before the value is overridden by sd.conf. So it currently ends up being 10, always. We will introduce another sd.conf setting, retries-victim, to override this explicitly and set it to 2.
Given existing code, lowering the default un_retry_count to 2 will also automatically reduce the busy retry count from 5 to 2 and the reset retry limit from 2 to 1.
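
The attach-time ordering and the derived counts described above can be modelled roughly as follows. This is a simplified sketch, not the actual sd code: `struct unit_model` and both functions are hypothetical, and the relationships for the busy and reset counts (equal to, and half of, the retry count) are assumptions chosen to match the figures quoted in this issue.

```c
/* Simplified stand-in for the per-unit state; not the real sd structure. */
struct unit_model {
	int retry_count;
	int victim_retry_count;
	int busy_retry_count;
	int reset_retry_count;
};

/*
 * Attach-time defaults. The derived counts are computed once, here, from
 * whatever retry_count is at this moment.
 */
void
model_attach_defaults(struct unit_model *un, int default_retry_count)
{
	un->retry_count = default_retry_count;
	un->victim_retry_count = 2 * un->retry_count;
	un->busy_retry_count = un->retry_count;		/* assumed relation */
	un->reset_retry_count = un->retry_count / 2;	/* assumed relation */
}

/*
 * Later configuration override. Only retry_count changes; the values
 * derived during attach are not recomputed.
 */
void
model_conf_override(struct unit_model *un, int retry_count)
{
	un->retry_count = retry_count;
}
```

In this model, attaching with a default of 5 and then overriding the retry count to 2 leaves the victim count at 10, which is the behaviour an explicit retries-victim setting is meant to address. Attaching with a default of 2 instead yields a busy count of 2 and a reset count of 1.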