Bug #12262
Suboptimal vmem hash table slows down boot
Description
On some machines booting a custom install image over PXE, we noticed a very long delay (about 30 minutes) in the middle of the boot. The delay is caused by an extremely slow i40e_m_stop() with this stack:
[0]> ::stack
kmdb_enter+0xb()
debug_enter+0x5d(fffffffff7944132)
apix`apic_nmi_intr+0x95(0, fffffffffbc73770)
av_dispatch_nmivect+0x34(fffffffffbc73770)
nmiint+0x152()
vmem_hash_delete+0x6d(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
vmem_xfree+0x4b(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
vmem_free+0x23(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
contig_free+0xdc(ffffc1338dfe8000, 800)
kfreea+0x65(ffffc1338dfe8000)
i_ddi_mem_free+0x35(ffffc1338dfe8000, ffffc1338dd75c80)
ddi_dma_mem_free+0x27(ffffc1338d7d2368)
i40e`i40e_free_dma_buffer+0x48(ffffc1338d7d2358)
i40e`i40e_free_tx_dma+0x4d(ffffc1338c263000)
i40e`i40e_free_ring_mem+0x67(ffffc1338be5f000, 0)
i40e`i40e_stop+0x207(ffffc1338be5f000, 1)
i40e`i40e_m_stop+0x3b(ffffc1338be5f000)
mac`mac_stop+0x6c(ffffc1338c317810)
dls`dls_close+0xe4(ffffc1338c6c8e48)
dld`dld_str_detach+0x5f(ffffc1338c6c8e48)
dld`dld_str_close+0x6a(ffffc1338c20fa48)
dld`dld_close+0x20(ffffc1338c20fa48)
qdetach+0x9b(ffffc1338c20fa48, 1, 40000003, ffffc13382359db0, 0)
strclose+0x2c1(ffffc1338c210b80, 40000003, ffffc13382359db0)
specfs`device_close+0xb9(ffffc1338c210c80, 40000003, ffffc13382359db0)
specfs`spec_close+0x17b(ffffc1338c210c80, 40000003, 1, 0, ffffc13382359db0, 0)
fop_close+0x61(ffffc1338c210c80, 40000003, 1, 0, ffffc13382359db0, 0)
ldi_close+0x93(ffffc1338bd92aa8, 3, ffffc13382359db0)
strplumb`getmacaddr+0x174(ffffc13380a327f8, fffffffffbc73fb0)
strplumb`matchmac+0xa4(ffffc13380a327f8, fffffffffbc74338)
walk_devs+0x56(ffffc13380a327f8, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13380a32aa0, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13380a32d48, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc133809fb550, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13382341d58, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc133721fad50, fffffffff7a11d70, fffffffffbc74338, 1)
ddi_walk_devs+0x1e(ffffc133721fad50, fffffffff7a11d70, fffffffffbc74338)
strplumb`strplumb_get_netdev_path+0x89()
strplumb`resolve_boot_path+0xcd()
strplumb`strplumb+0x31()
stubs_common_code+0x51()
main+0x47b()
_locore_start+0x90()
[0]>
Once we installed our custom illumos build from that slowly booting install image, all subsequent boots were as fast as usual.
Updated by Marcel Telka 12 months ago
Root cause
The problem happens because i40e_m_stop() needs to call ddi_dma_mem_free() many times: approximately 0x100 * (0x400 + 0x400 + 0x600) = ca. 1 million times. Every ddi_dma_mem_free() usually turns into one vmem_free() call via contig_free(), and then into one vmem_hash_delete() call. The vmem_hash_delete() function is the actual bottleneck, as can be seen here:
[0]> ::bp i40e`i40e_m_stop
[0]> :c
...
kmdb: stop at i40e`i40e_m_stop
kmdb: target stopped at:
i40e`i40e_m_stop:       pushq %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x3
[0]> dls_link_rele::bp
[0]> :c
... after ca. 30 minutes ...
kmdb: stop at dls`dls_link_rele
kmdb: target stopped at:
dls`dls_link_rele:      pushq %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x66d76ba0c
[0]>
Yes, in our example the vmem_hash_delete() function had to iterate in its while loop 27606301196 (2.76e10) times! Honestly, I'm a bit surprised it took only 30 minutes :-).
Why is the heap_arena so slow? Because during boot the vmem_hash_rescale() optimizer does not run yet:
380	void
381	main(void)
382	{
...
516		/*
517		 * Initialize Solaris Audit Subsystem
518		 */
519		audit_init();
520
521		/*
522		 * Plumb the protocol modules and drivers only if we are not
523		 * networked booted, in this case we already did it in rootconf().
524		 */
525		if (netboot == 0 && iscsiboot_prop == NULL)
526			(void) strplumb();
...
601		/*
602		 * After mp_init(), number of cpus are known (this is
603		 * true for the time being, when there are actually
604		 * hot pluggable cpus then this scheme would not do).
605		 * Any per cpu initialization is done here.
606		 */
607		kmem_mp_init();
608		vmem_update(NULL);
...
The i40e driver is loaded at line 526 via strplumb(). Then, once all drivers are loaded, the hash optimizer is started at line 608 via the vmem_update() call, but i40e_m_stop() is called while we are still in strplumb().
Updated by Marcel Telka 12 months ago
Proposed solution
With the root cause known, the obvious question arises: why do we not call vmem_update(NULL) before strplumb() to make sure all vmem arenas are properly optimized as soon as possible?
The short answer is: I do not know.
The longer answer is: I was unable to find any reason why we need to defer the vmem_update(NULL) call until so late in the boot process. So the easiest fix should be to simply move the vmem_update(NULL) call before the strplumb() call.
To confirm how such a modification changes our case (and to see whether it possibly causes any regression), I patched the live code as follows:
[0]> main+0x170::dis -n 2
main+0x161: call -0x2253a6 <post_startup>
main+0x166: movl $0x1,+0x2910c0(%rip) <swaploaded>
main+0x170: call -0xa8eb5 <audit_init>
main+0x175: movl +0x28846d(%rip),%edx <netboot>
main+0x17b: testl %edx,%edx
[0]> main+0x171/W 000CBB7B
main+0x171: 0xfff5714b = 0xcbb7b
[0]> main+0x170::dis -n 2
main+0x161: call -0x2253a6 <post_startup>
main+0x166: movl $0x1,+0x2910c0(%rip) <swaploaded>
main+0x170: call +0xcbb7b <vmem_update>
main+0x175: movl +0x28846d(%rip),%edx <netboot>
main+0x17b: testl %edx,%edx
[0]>
[0]> main+0x2ba::dis -n 2
main+0x2b3: call -0x15508 <kmem_mp_init>
main+0x2b8: xorl %edi,%edi
main+0x2ba: call +0xcba31 <vmem_update>
main+0x2bf: call -0x1c6254 <clock_tick_init_post>
main+0x2c4: movq +0x19d165(%rip),%rdx <mp_init_tbl>
[0]> main+0x2ba/W 90909090
main+0x2ba: 0xcba31e8 = 0x90909090
[0]> main+0x2be/v 90
main+0x2be: 0 = 0x90
[0]> main+0x2ba::dis -n 6
main+0x2a2: xorl %edi,%edi
main+0x2a4: call -0x22de89 <start_other_cpus>
main+0x2a9: movl $0x4,%edi
main+0x2ae: call -0x1f2ec3 <lgrp_init>
main+0x2b3: call -0x15508 <kmem_mp_init>
main+0x2b8: xorl %edi,%edi
main+0x2ba: nop
main+0x2bb: nop
main+0x2bc: nop
main+0x2bd: nop
main+0x2be: nop
main+0x2bf: call -0x1c6254 <clock_tick_init_post>
main+0x2c4: movq +0x19d165(%rip),%rdx <mp_init_tbl>
[0]>
So I simply replaced the audit_init() call at line 519 with the vmem_update() call and removed the vmem_update() call from line 608. Then I measured how it behaves:
[0]> ::bp i40e`i40e_m_stop
[0]> :c
...
kmdb: stop at i40e`i40e_m_stop
kmdb: target stopped at:
i40e`i40e_m_stop:       pushq %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0
[0]> dls_link_rele::bp
[0]> :c
... after a few seconds ...
kmdb: stop at dls`dls_link_rele
kmdb: target stopped at:
dls`dls_link_rele:      pushq %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x37bff
[0]> ::delete 1
[0]> ::delete 2
[0]> :c
...
openindiana console login:
Far better results.
Updated by Marcel Telka 12 months ago
Testing Done
Without this change the boot process on affected machines was stuck for about 30 minutes in the i40e driver. With the fix the delay is hardly noticeable.
I also confirmed that this change does not introduce any regression on tested machines.
Updated by Electric Monk 12 months ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit 948761701ebc8a0cf820a973780b2fbb931c4390
commit 948761701ebc8a0cf820a973780b2fbb931c4390
Author: Marcel Telka <marcel@telka.sk>
Date:   2020-02-05T15:41:03.000Z

    12262 Suboptimal vmem hash table slows down boot
    Reviewed by: Stefan Behrens <sbehrens@giantdisaster.de>
    Reviewed by: Gergő Mihály Doma <domag02@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Marcel Telka 6 months ago
- Related to Bug #12958: i40e allocates large amounts of DMA added