Bug #7339

hang in lufs_unsnarf() due to zfs on files on ufs

Added by Daniel Kimmel over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2016-08-30
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

When running the zfs test suite, sometimes it hangs, and we see that the hang is in UFS:

stack pointer for thread ffffff037d4e7780: ffffff000cebfbc0
[ ffffff000cebfbc0 _resume_from_idle+0xf4() ]
  ffffff000cebfbf0 swtch+0x18a()
  ffffff000cebfc20 cv_wait+0x89(ffffff0bf313a914, ffffff0bf313a8e0)
  ffffff000cebfc70 lufs_unsnarf+0x63(ffffff1435c43a00)
  ffffff000cebfd60 ufs_unmount+0x6a2(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfd90 fsop_unmount+0x1b(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfde0 dounmount+0x73(ffffff0337a8c4a8, 400, ffffff033a6301f0)
  ffffff000cebfe30 umount2_engine+0x96(ffffff0337a8c4a8, 400, ffffff033a6301f0, 1)
  ffffff000cebfeb0 umount2+0x163(8066500, 400)
  ffffff000cebff00 _sys_sysenter_post_swapgs+0x237()
> ffffff1435c43a00::print -a ufsvfs_t vfs_log->un_logmap->mtm_taskq_sync_count
ffffff0bf313a910 vfs_log->un_logmap->mtm_taskq_sync_count = 0xffffffff

The mtm_taskq_sync_count has gone negative, which should never happen. The intention is that top_issue_sync() decrements it only when it is called via the taskq_dispatch(system_taskq) in top_begin_async(). However, top_issue_sync() actually decrements it whenever it is called by any system_taskq thread.

top_issue_sync() is called from many UFS routines. Therefore, if we ever call into UFS from a system_taskq thread, it can incorrectly decrement mtm_taskq_sync_count, causing it to go negative, and then we get this hang when trying to unmount the UFS filesystem.
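The accounting problem can be illustrated with a small userland model. This is only a sketch, not the UFS code: the struct, the model_* function names, and the on_system_taskq flag are hypothetical, the counter is assumed to be 32 bits wide, and the unmount wait is assumed to block until the count returns to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the logmap; only the counter is modeled. */
struct model_logmap {
	uint32_t taskq_sync_count;	/* models mtm_taskq_sync_count */
};

/* Models top_begin_async() counting a sync it hands to the system taskq. */
static void
model_dispatch_sync(struct model_logmap *mtm)
{
	assert(mtm != NULL);
	mtm->taskq_sync_count++;
}

/*
 * Models the problematic check: the counter is decremented whenever the
 * caller runs on the system taskq, whether or not a dispatch was counted.
 */
static void
model_issue_sync_buggy(struct model_logmap *mtm, bool on_system_taskq)
{
	if (on_system_taskq)
		mtm->taskq_sync_count--;
}

/* Models the unmount wait condition (assumed: wait until the count is 0). */
static bool
model_unmount_would_block(const struct model_logmap *mtm)
{
	return (mtm->taskq_sync_count != 0);
}
```

In this model, one model_dispatch_sync() followed by one model_issue_sync_buggy(&mtm, true) leaves the counter at zero. A further model_issue_sync_buggy(&mtm, true), standing in for an unrelated system_taskq thread that calls into UFS, wraps the counter to 0xffffffff, and model_unmount_would_block() then stays true because no later decrement brings it back to zero.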

We don't see this consistently because, although UFS can call top_issue_sync() from many places, it only does so when certain other conditions are met, e.g. in top_end_async():

    /*
     * Generate a sync op if the log, logmap, or deltamap are heavily used.
     * Unless we are possibly holding any VM locks, since if we are holding
     * any VM locks and we issue a top_end_sync(), we could deadlock.
     */
    if ((mtm->mtm_activesync == 0) &&
        !(mtm->mtm_closed & TOP_SYNC) &&
        (deltamap_need_commit(ul->un_deltamap) ||
        logmap_need_commit(mtm) ||
        ldl_need_commit(ul)) &&
        (topid != TOP_GETPAGE)) {
        top_issue_sync(ufsvfsp);
    }

The functional/cli_root/zpool_import tests can trigger this, because they create zfs storage pools on top of files in a ufs filesystem. vdev_file_io_start() uses the system_taskq to operate on the underlying files (on UFS in this case):

    VERIFY3U(taskq_dispatch(system_taskq, vdev_file_io_strategy, bp,
        TQ_SLEEP), !=, 0);

It's reasonable for ZFS to use the system_taskq. UFS needs to keep better track of whether top_issue_sync() is being called from the taskq_dispatch() in top_begin_async(), vs. being called synchronously.
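One way to keep that distinction, sketched below as a userland model, is to give the dispatched path its own entry point and do the decrement only there, so that direct callers never touch the counter. This is an illustration under assumptions, not necessarily what the eventual fix does: all model_* names are hypothetical, and the taskq dispatch is replaced by a direct function call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the logmap; only the counter is modeled. */
struct model_logmap {
	uint32_t taskq_sync_count;	/* models mtm_taskq_sync_count */
};

/* The sync work itself. It never touches the counter. */
static void
model_issue_sync(struct model_logmap *mtm)
{
	(void) mtm;	/* the actual sync work is omitted in this model */
}

/*
 * The only function handed to the taskq. Because it is reached solely
 * through a counted dispatch, it alone may decrement the counter.
 */
static void
model_issue_sync_from_taskq(void *arg)
{
	struct model_logmap *mtm = arg;

	model_issue_sync(mtm);
	assert(mtm->taskq_sync_count > 0);
	mtm->taskq_sync_count--;
}

/* Models top_begin_async(): count the dispatch, then "dispatch" it. */
static void
model_begin_async(struct model_logmap *mtm)
{
	mtm->taskq_sync_count++;
	/* Stands in for a taskq dispatch of model_issue_sync_from_taskq. */
	model_issue_sync_from_taskq(mtm);
}
```

With this split, a call to model_issue_sync() from any thread, including an unrelated system_taskq thread, leaves the counter unchanged, and every increment in model_begin_async() is paired with exactly one decrement.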

History

#1

Updated by Electric Monk over 3 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

git commit 31d4cf520a749c6f68cc90540de415399538680b

commit  31d4cf520a749c6f68cc90540de415399538680b
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   2016-09-07T03:21:43.000Z

    7339 hang in lufs_unsnarf() due to zfs on files on ufs
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
