Project

General

Profile

Actions

Bug #4997

closed

kstat_install()/header_kstat_snapshot() race induces panic

Added by Robert Mustacchi about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
kernel
Start date:
2014-07-11
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

The marlin agent was reading kstat data when we panicked at:

ffffff01f5292a60 bcopy+0x194()
ffffff01f5292c40 read_kstat_data+0x102(ffffff01f5292e68, 80475e8, 100001)
ffffff01f5292c80 kstat_ioctl+0x83(200000000, 4b02, 80475e8, 100001, 
ffffff4366da4d70, ffffff01f5292e68)
ffffff01f5292cc0 cdev_ioctl+0x39(200000000, 4b02, 80475e8, 100001, 
ffffff4366da4d70, ffffff01f5292e68)
ffffff01f5292d10 spec_ioctl+0x60(ffffff43157edb00, 4b02, 80475e8, 100001, 
ffffff4366da4d70, ffffff01f5292e68, 0)
ffffff01f5292da0 fop_ioctl+0x55(ffffff43157edb00, 4b02, 80475e8, 100001, 
ffffff4366da4d70, ffffff01f5292e68, 0)
ffffff01f5292ec0 ioctl+0x9b(d, 4b02, 80475e8)
ffffff01f5292f10 _sys_sysenter_post_swapgs+0x149()

The following is Bryan's analysis:

We died trying to read the kstat_headers:

> fffffffffbc15a50::print kstat_t
{
    ks_crtime = 0
    ks_next = 0
    ks_kid = 0
    ks_module = [ "unix" ]
    ks_resv = 0
    ks_instance = 0
    ks_name = [ "kstat_headers" ]
    ks_type = 0
    ks_class = [ "kstat" ]
    ks_flags = 0x3
    ks_data = 0
    ks_ndata = 0x1642
    ks_data_size = 0xfff70
    ks_snaptime = 0x4d8ac8cee4d68
    ks_update = header_kstat_update
    ks_private = 0
    ks_snapshot = header_kstat_snapshot
    ks_lock = kstat_chain_lock
}

Note the data size: 0xfff70. The kbuf that we allocated is 0xffffffa9e28a9000
but we died at 0xffffffa9e29a9028 (which is unmapped). This is at
offset 0x100028 from the base of the kbuf – which is 0xb8 beyond our
buffer and denotes that we were off by (at least) 2 in our calculation.
This count is determined in header_kstat_update() and actually copied into
kbuf in header_kstat_snapshot() – and kstat_chain_lock is held across the
two calls to assure that this list doesn't change. However, with the
fix for OS-2532 (pushed up as illumos#4246), there is another condition
for being on the list: KSTAT_FLAG_INVALID must be cleared. And while
the contents of the list can't change without kstat_chain_lock held,
the state of KSTAT_FLAG_INVALID can: it's cleared in kstat_install()
without the kstat_chain_lock held. If this is the case, however, we
would expect to have caught the other thread red handed: after clearing
KSTAT_FLAG_INVALID in kstat_install(), we kstat_rele() the kstat – which
requires acquiring the lock. But the kstat_chain_lock itself is held
by the panicking thread and appears to not have waiters:

> kstat_chain_lock/K
kstat_chain_lock:
kstat_chain_lock:               ffffff50b3da8760 

But given the narrowness of the race, we know that the racing thread
was on CPU while we were holding kstat_chain_lock – and that the panicking
thread itself was on CPU when it panicked. Looking through the on-cpu
threads reveals the thread that we're racing:

> ffffff435cbc3be0::findstack
stack pointer for thread ffffff435cbc3be0: ffffff0204739940
  ffffff0204739970 apix_setspl+0x17()
  ffffff02047399b0 apix_intr_exit+0x24()
  ffffff0204739a00 0xffffff42ecdf5540()
  ffffff0204739a60 apix_do_interrupt+0xfe()
  ffffff0204739a70 _interrupt+0xba()
  ffffff0204739b90 mutex_delay_default+0xc()
  ffffff0204739c00 mutex_vector_enter+0xcc()
  ffffff0204739c30 kstat_rele+0x1f()
  ffffff0204739c70 kstat_install+0x94()
  ffffff0204739cc0 zone_zfs_kstat_create+0x183()
  ffffff0204739cf0 zone_kstat_create+0xa0()
  ffffff0204739db0 zone_create+0x8b2()
  ffffff0204739f00 zone+0x1d4()
  ffffff0204739f10 sys_syscall+0x17a()

This therefore is indeed the race: on the first pass through the list, we
didn't count this kstat because it was marked KSTAT_FLAG_INVALID, but on the
second pass we copied it because the KSTAT_FLAG_INVALID had been cleared.
The fix should be straightforward: either we need to acquire (and drop) the
kstat_chain_lock to clear the KSTAT_FLAG_INVALID flag from kstat_install(),
or we need to always count the KSTAT_FLAG_INVALID kstats in
header_kstat_update() because that will always give us more room than
we will need in header_kstat_snapshot().

Actions #1

Updated by Electric Monk about 7 years ago

  • Status changed from New to Closed

git commit c9bbee954bfabfc1fec46cc4c3753a98eb0f8264

commit  c9bbee954bfabfc1fec46cc4c3753a98eb0f8264
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date:   2014-07-14T18:09:59.000Z

    4997 kstat_install()/header_kstat_snapshot() race induces panic
    Reviewed by: Richard Lowe <richlowe@richlowe.net>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Garrett D'Amore <garrett@damore.org>

Actions

Also available in: Atom PDF