dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark in which 32 threads create 2 million files in the same directory, on a machine with 16 CPU cores, and with a workaround for the object allocation issue described at https://github.com/zfsonlinux/zfs/issues/4636, I observed poor performance (~30,000 file creations per second). I noticed that `dmu_tx_hold_zap()` was using about 30% of all CPU, and was calling `dnode_hold()` 7 times on the same object (the ZAP object that is being held).
`dmu_tx_hold_zap()` keeps a hold on the `dnode_t` the entire time it is running, in `dmu_tx_hold_t:txh_dnode`, so it would be nice to use the `dnode_t` that we already have in hand, rather than repeatedly calling `dnode_hold()`. To do this, we need to pass the `dnode_t` down through all the intermediate calls that `dmu_tx_hold_zap()` makes, making these routines take the `dnode_t*` rather than an `objset_t*` and a `uint64_t object` number. In particular, the following routines will need to have analogous `*_by_dnode()` variants created:
This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs beyond that decreased performance due to lock contention.) The CPU time used by `dmu_tx_hold_zap()` was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds.
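To illustrate the `*_by_dnode()` pattern proposed above, here is a minimal toy model in C. It does not use the real ZFS types or functions: every `toy_*` name is hypothetical, the linear lookup merely stands in for the cost of `dnode_hold()`, and the split into one top-level hold plus six helper calls is an assumption chosen only to reproduce the count of 7 observed above. The idea is that the by-dnode variant does the real work, while the object-number variant becomes a thin wrapper that holds, delegates, and releases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for dnode_t and objset_t; NOT the real ZFS structures. */
typedef struct toy_dnode {
	uint64_t object;	/* object number */
	int holds;		/* outstanding holds */
} toy_dnode_t;

typedef struct toy_objset {
	toy_dnode_t *dnodes;
	size_t count;
	unsigned long lookups;	/* how many times we looked up by number */
} toy_objset_t;

/* Models the expensive path: find the dnode by number and take a hold. */
static toy_dnode_t *
toy_dnode_hold(toy_objset_t *os, uint64_t object)
{
	os->lookups++;
	for (size_t i = 0; i < os->count; i++) {
		if (os->dnodes[i].object == object) {
			os->dnodes[i].holds++;
			return (&os->dnodes[i]);
		}
	}
	return (NULL);
}

static void
toy_dnode_rele(toy_dnode_t *dn)
{
	assert(dn->holds > 0);
	dn->holds--;
}

/* By-dnode variant: does the work on a dnode the caller already holds. */
static uint64_t
toy_helper_by_dnode(toy_dnode_t *dn)
{
	return (dn->object * 2);	/* placeholder for real work */
}

/* Object-number variant: thin wrapper that holds, delegates, releases. */
static uint64_t
toy_helper(toy_objset_t *os, uint64_t object)
{
	toy_dnode_t *dn = toy_dnode_hold(os, object);
	if (dn == NULL)
		return (0);
	uint64_t result = toy_helper_by_dnode(dn);
	toy_dnode_rele(dn);
	return (result);
}

/* Before: every helper call looks the object up again. */
static int
toy_tx_hold_old(toy_objset_t *os, uint64_t object, int nhelpers)
{
	toy_dnode_t *dn = toy_dnode_hold(os, object);
	if (dn == NULL)
		return (-1);
	for (int i = 0; i < nhelpers; i++)
		(void) toy_helper(os, object);
	toy_dnode_rele(dn);
	return (0);
}

/* After: the dnode held at the top is passed down to every helper. */
static int
toy_tx_hold_new(toy_objset_t *os, uint64_t object, int nhelpers)
{
	toy_dnode_t *dn = toy_dnode_hold(os, object);
	if (dn == NULL)
		return (-1);
	for (int i = 0; i < nhelpers; i++)
		(void) toy_helper_by_dnode(dn);
	toy_dnode_rele(dn);
	return (0);
}
```

In this model, calling `toy_tx_hold_old()` with six helper calls performs seven lookups, while `toy_tx_hold_new()` performs one. Keeping the object-number entry points as wrappers means existing callers that do not have a dnode in hand continue to work unchanged.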