Bug #5592
closedNULL pointer dereference in dsl_prop_notify_all_cb()
100%
Description
The fix for Issue #5531 didn't fully close the race in the dsl_prop callback code. It is still possible for the dsl_prop callback entry to be referencing a dsl_dataset_t object which is in the process of being evicted. In the core analyzed below, the dsl_dataset_t in the callback entry ("stale entry") was able to get a hold on the bonus buffer for that dataset object, but this is because a new dsl_dataset_t had been instantiated, referencing the same bonus buffer.
The stack trace from the crash:
$ mdb vmcore.0 mdb: warning: dump is from SunOS 5.11 joyent_20150202T140745Z; dcmds and macros may not match kernel implementation Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs mpt idm sd crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ipc ] > $c mutex_enter+0xb() dmu_buf_rele+0x15(0, fffffffff7e00720) dsl_dataset_rele+0x19(ffffff0340e250c0, fffffffff7e00720) dsl_prop_notify_all_cb+0xbb(ffffff0254120900, ffffff02d13aa9c0, 0) dmu_objset_find_dp+0x108(ffffff0254120900, 366, fffffffff7d742f0, 0, 2) dsl_prop_notify_all+0x28(ffffff02654dbac0) dsl_dir_rename_sync+0x2d1(ffffff00111a9b40, ffffff0260db3680) dsl_sync_task_sync+0x10a(ffffff00111a9a60, ffffff0260db3680) dsl_pool_sync+0x28b(ffffff0254120900, 1558) spa_sync+0x27e(ffffff0254d3b000, 1558) txg_sync_thread+0x227(ffffff0254120900) thread_start+8()
dsl_dataset_rele() tried to release a NULL ds_dbuf. Verify:
> ffffff0340e250c0 ::print dsl_dataset_t dsl_dbuf ds_dbuf = 0
Yup. Any chance the callback list is corrupted?
> ffffff02d13aa9c0 ::print dsl_dataset_t ds_dir | ::print dsl_dir_t dd_prop_cbs | ::walk list | ::print dsl_prop_cb_record_t cbr_ds cbr_ds = 0xffffff0353ff4380 cbr_ds = 0xffffff0353ff4380 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0 cbr_ds = 0xffffff0340e250c0
This dsl_datset_t is still registered multiple times on this dsl_dir. Looks valid.
Was this datset evicted?
> ffffff0340e250c0 ::print dsl_dataset_t ds_dbu ds_dbu = { dbu_tqent = { tqent_next = 0 tqent_prev = 0 tqent_func = dsl_dataset_evict tqent_arg = 0xffffff0340e250c0 tqent_un = { tqent_bucket = 1 tqent_flags = 0x1 } tqent_thread = 0 tqent_cv = { _opaque = 0 } } dbu_evict_func = dsl_dataset_evict }
The dsl_dataset_t's dbuf user was processed by the eviction taskq, so yes.
We had a hold on the bonus buffer, so the bonus buffer must still be around. Let's find it by object number.
> ffffff0340e250c0 ::print dsl_dataset_t ds_object ds_object = 0x369
dsl_prop_notify_all_cb() has the dsl pool as its first argument. Find the dnode for this dataset via the meta objset for this pool.
> $c mutex_enter+0xb() dmu_buf_rele+0x15(0, fffffffff7e00720) dsl_dataset_rele+0x19(ffffff0340e250c0, fffffffff7e00720) dsl_prop_notify_all_cb+0xbb(ffffff0254120900, ffffff02d13aa9c0, 0) dmu_objset_find_dp+0x108(ffffff0254120900, 366, fffffffff7d742f0, 0, 2) dsl_prop_notify_all+0x28(ffffff02654dbac0) dsl_dir_rename_sync+0x2d1(ffffff00111a9b40, ffffff0260db3680) dsl_sync_task_sync+0x10a(ffffff00111a9a60, ffffff0260db3680) dsl_pool_sync+0x28b(ffffff0254120900, 1558) spa_sync+0x27e(ffffff0254d3b000, 1558) txg_sync_thread+0x227(ffffff0254120900) thread_start+8() > ffffff0254120900 ::print dsl_pool_t dp_meta_objset | ::print objset_t os_dnodes | ::walk list | ::print -a dnode_t dn_object ffffff0382298350 dn_object = 0x370 ffffff03822980d0 dn_object = 0x36f ffffff0291a3cab0 dn_object = 0x36e ffffff0354b68cd0 dn_object = 0x36d ffffff0354b68a50 dn_object = 0x36c ffffff0289735058 dn_object = 0x36b ffffff02774e7b20 dn_object = 0x36a ffffff02774450b0 dn_object = 0x369 ...
dn_object is 32 bytes into the dnode. Adjust the address for object 0x369 and verify we did our math correctly:
> ffffff0277445090 ::print dnode_t dn_object dn_object = 0x369 > ffffff0277445090 ::print dnode_t dn_type dn_type = 0t16 (DMU_OT_DSL_DATASET)
Yup.
Is the bonus buffer properly held and ineligible for eviction?
> ffffff0277445090 ::print dnode_t dn_bonus | ::print dmu_buf_impl_t db_holds db_holds = { db_holds.rc_count = 0x2 } > ffffff0277445090 ::print dnode_t dn_bonus | ::print dmu_buf_impl_t db_dirtycnt db_dirtycnt = 0
Holds > dirties. Yes.
This is why dsl_dataset_try_addref()->dmu_buf_try_addref() succeeded on this bonus buffer.
Is the user the same dsl_dataset_t as in the callback?
> ffffff0277445090 ::print dnode_t dn_bonus | ::print dmu_buf_impl_t db_user db_user = 0xffffff02d13aa9c0
Nope.
Is it a valid dsl_dataset_t?
> 0xffffff02d13aa9c0 ::print dsl_dataset_t { ds_dbu = { dbu_tqent = { tqent_next = 0 tqent_prev = 0 tqent_func = 0 tqent_arg = 0 tqent_un = { tqent_bucket = 0 tqent_flags = 0 } tqent_thread = 0 tqent_cv = { _opaque = 0 } } dbu_evict_func = dsl_dataset_evict } ds_dir = 0xffffff02654dbac0 ds_dbuf = 0xffffff0272b00c90 ds_object = 0x369 ds_fsid_guid = 0xfcb0034fdd43f2 ds_is_snapshot = 0 (0) ds_prev = 0xffffff0353ff4380 ...
Yup.
So we evicted one dsl_dataset_t for this dataset and almost immediately re-referenced the same dataset. This created a new dsl_dataset_t holding the same bonus buffer as the old one. dsl_prop_notify_all_cb() caught the bonus buffer in the "stale" dsl_dataset_t before it was cleared by dsl_dataset_evict() and got a valid hold, but this isn't enough to ensure a dsl_dataset_t is valid. We have to hold the bonus buffer and verify that the dbuf user is the dsl_dataset_t we're looking for, otherwise it could be being evicted.
Updated by Justin Gibbs about 8 years ago
- Due date set to 2015-02-04
- % Done changed from 0 to 90
- Estimated time set to 2.00 h
Review available at: https://reviews.csiden.org/r/155/
Updated by Electric Monk about 8 years ago
- Status changed from New to Closed
- % Done changed from 90 to 100
git commit 9d47dec0481d8cd53b2c1053c96bfa3f78357d6a
commit 9d47dec0481d8cd53b2c1053c96bfa3f78357d6a Author: Justin T. Gibbs <justing@spectralogic.com> Date: 2015-02-04T22:01:54.000Z 5592 NULL pointer dereference in dsl_prop_notify_all_cb() Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com>