Bug #7339
Closed
hang in lufs_unsnarf() due to zfs on files on ufs
100%
Description
When running the zfs test suite, sometimes it hangs, and we see that the hang is in UFS:
stack pointer for thread ffffff037d4e7780: ffffff000cebfbc0
[ ffffff000cebfbc0 _resume_from_idle+0xf4() ]
  ffffff000cebfbf0 swtch+0x18a()
  ffffff000cebfc20 cv_wait+0x89(ffffff0bf313a914, ffffff0bf313a8e0)
  ffffff000cebfc70 lufs_unsnarf+0x63(ffffff1435c43a00)
  ffffff000cebfd60 ufs_unmount+0x6a2(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfd90 fsop_unmount+0x1b(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfde0 dounmount+0x73(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfe30 umount2_engine+0x96(ffffff0337a8c4a8, 400, ffffff033a6301f0, 1)
  ffffff000cebfeb0 umount2+0x163(8066500, 400)
  ffffff000cebff00 _sys_sysenter_post_swapgs+0x237()

> ffffff1435c43a00::print -a ufsvfs_t vfs_log->un_logmap->mtm_taskq_sync_count
ffffff0bf313a910 vfs_log->un_logmap->mtm_taskq_sync_count = 0xffffffff
The mtm_taskq_sync_count has gone negative, which should never happen. The intention is that it should be decremented by top_issue_sync if top_issue_sync() is being called from the taskq_dispatch(system_taskq) in top_begin_async(). However, it is actually decremented by top_issue_sync if it's being called by any system_taskq thread.
top_issue_sync() is called from all sorts of ufs routines. Therefore if we ever call into ufs from a system_taskq, it could end up incorrectly decrementing mtm_taskq_sync_count, causing it to go negative, and then we get this hang when trying to unmount the UFS filesystem.
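The accounting problem described above can be modeled in a few lines of C. This is a minimal sketch, not the UFS code: `model_map_t` and every `model_*` function are hypothetical names, and both the increment at dispatch time and the "wait until the count is zero" condition are assumptions inferred from the description and the stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the logmap; only the counter is modeled. */
typedef struct model_map {
	int taskq_sync_count;	/* models mtm_taskq_sync_count */
} model_map_t;

/*
 * Models top_begin_async() accounting for a sync it is about to dispatch.
 * Assumption: the real code increments the counter before dispatching.
 */
static void
model_begin_async_dispatch(model_map_t *m)
{
	assert(m != NULL);
	m->taskq_sync_count++;
}

/*
 * Models the buggy behavior: the decrement is keyed on "am I running on
 * a system_taskq thread?" rather than on "was I dispatched by
 * top_begin_async()?".
 */
static void
model_buggy_issue_sync(model_map_t *m, bool on_system_taskq)
{
	/* ... the actual sync work would happen here ... */
	if (on_system_taskq)
		m->taskq_sync_count--;
}

/*
 * Models the unmount-side wait. Assumption: unmount sleeps for as long
 * as the counter is nonzero.
 */
static bool
model_unmount_would_block(const model_map_t *m)
{
	return (m->taskq_sync_count != 0);
}
```

In this model, one dispatch followed by one taskq-side call leaves the counter at zero. A second call from an unrelated system_taskq thread drives it to -1, after which `model_unmount_would_block()` stays true because nothing ever brings the counter back to zero.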
We don't see this consistently because, although UFS calls top_issue_sync() from many places, it does so only when certain other conditions are met, e.g. in top_end_async():
	/*
	 * Generate a sync op if the log, logmap, or deltamap are heavily used.
	 * Unless we are possibly holding any VM locks, since if we are holding
	 * any VM locks and we issue a top_end_sync(), we could deadlock.
	 */
	if ((mtm->mtm_activesync == 0) &&
	    !(mtm->mtm_closed & TOP_SYNC) &&
	    (deltamap_need_commit(ul->un_deltamap) ||
	    logmap_need_commit(mtm) ||
	    ldl_need_commit(ul)) &&
	    (topid != TOP_GETPAGE)) {
		top_issue_sync(ufsvfsp);
	}
The functional/cli_root/zpool_import tests can trigger this, because they create zfs storage pools on top of files in a ufs filesystem. vdev_file_io_start() uses the system_taskq to operate on the underlying files (on UFS in this case):
VERIFY3U(taskq_dispatch(system_taskq, vdev_file_io_strategy, bp, TQ_SLEEP), !=, 0);
It's reasonable for ZFS to use the system_taskq. UFS needs to keep better track of whether top_issue_sync is being called from the taskq_dispatch in top_begin_async, vs. being called synchronously.
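One way to keep that distinction, sketched here under stated assumptions rather than taken from the committed change, is to give the dispatched path its own entry point. Only that entry point touches the counter, so the shared sync routine behaves the same whichever thread runs it. All names below (`model_map_t`, `model_issue_sync`, `model_issue_from_taskq`, `model_begin_async_dispatch`) are hypothetical, and locking and condition-variable signalling are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the logmap; only the counter is modeled. */
typedef struct model_map {
	int taskq_sync_count;	/* models mtm_taskq_sync_count */
} model_map_t;

/*
 * Shared sync work, callable from any context, including a system_taskq
 * thread doing unrelated work such as file vdev I/O. It never touches
 * the counter.
 */
static void
model_issue_sync(model_map_t *m)
{
	assert(m != NULL);
	/* ... the actual sync work would happen here ... */
}

/*
 * Entry point used only as the dispatch target of the async path.
 * It does the shared work, then undoes the increment made at dispatch.
 */
static void
model_issue_from_taskq(model_map_t *m)
{
	model_issue_sync(m);
	assert(m->taskq_sync_count > 0);
	m->taskq_sync_count--;
	/* Real code would wake any waiter here once the count hits zero. */
}

/*
 * Models the async path: account for the pending sync, then run the
 * dispatch target. The dispatch is synchronous here for simplicity.
 */
static void
model_begin_async_dispatch(model_map_t *m)
{
	assert(m != NULL);
	m->taskq_sync_count++;
	model_issue_from_taskq(m);
}
```

With this split, a direct call to `model_issue_sync()` leaves the counter unchanged, and each dispatch is balanced by exactly one decrement, so the counter cannot go below zero in this model.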
Updated by Electric Monk about 7 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 31d4cf520a749c6f68cc90540de415399538680b
commit 31d4cf520a749c6f68cc90540de415399538680b
Author: Matthew Ahrens <mahrens@delphix.com>
Date: 2016-09-07T03:21:43.000Z

    7339 hang in lufs_unsnarf() due to zfs on files on ufs
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
    Approved by: Robert Mustacchi <rm@joyent.com>