Project

General

Profile

Bug #12262

Suboptimal vmem hash table slows down boot

Added by Marcel Telka 10 months ago. Updated 10 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
kernel
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We noticed on some machines booting custom install image over PXE very long delay (about 30 minutes) in the middle of the boot. The delay is caused by extremely slow i40e_m_stop() with this stack:

[0]> ::stack
kmdb_enter+0xb()
debug_enter+0x5d(fffffffff7944132)
apix`apic_nmi_intr+0x95(0, fffffffffbc73770)
av_dispatch_nmivect+0x34(fffffffffbc73770)
nmiint+0x152()
vmem_hash_delete+0x6d(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
vmem_xfree+0x4b(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
vmem_free+0x23(fffffffffbccd2e0, ffffc1338dfe8000, 1000)
contig_free+0xdc(ffffc1338dfe8000, 800)
kfreea+0x65(ffffc1338dfe8000)
i_ddi_mem_free+0x35(ffffc1338dfe8000, ffffc1338dd75c80)
ddi_dma_mem_free+0x27(ffffc1338d7d2368)
i40e`i40e_free_dma_buffer+0x48(ffffc1338d7d2358)
i40e`i40e_free_tx_dma+0x4d(ffffc1338c263000)
i40e`i40e_free_ring_mem+0x67(ffffc1338be5f000, 0)
i40e`i40e_stop+0x207(ffffc1338be5f000, 1)
i40e`i40e_m_stop+0x3b(ffffc1338be5f000)
mac`mac_stop+0x6c(ffffc1338c317810)
dls`dls_close+0xe4(ffffc1338c6c8e48)
dld`dld_str_detach+0x5f(ffffc1338c6c8e48)
dld`dld_str_close+0x6a(ffffc1338c20fa48)
dld`dld_close+0x20(ffffc1338c20fa48)
qdetach+0x9b(ffffc1338c20fa48, 1, 40000003, ffffc13382359db0, 0)
strclose+0x2c1(ffffc1338c210b80, 40000003, ffffc13382359db0)
specfs`device_close+0xb9(ffffc1338c210c80, 40000003, ffffc13382359db0)
specfs`spec_close+0x17b(ffffc1338c210c80, 40000003, 1, 0, ffffc13382359db0, 0)
fop_close+0x61(ffffc1338c210c80, 40000003, 1, 0, ffffc13382359db0, 0)
ldi_close+0x93(ffffc1338bd92aa8, 3, ffffc13382359db0)
strplumb`getmacaddr+0x174(ffffc13380a327f8, fffffffffbc73fb0)
strplumb`matchmac+0xa4(ffffc13380a327f8, fffffffffbc74338)
walk_devs+0x56(ffffc13380a327f8, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13380a32aa0, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13380a32d48, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc133809fb550, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc13382341d58, fffffffff7a11d70, fffffffffbc74338, 1)
walk_devs+0xc6(ffffc133721fad50, fffffffff7a11d70, fffffffffbc74338, 1)
ddi_walk_devs+0x1e(ffffc133721fad50, fffffffff7a11d70, fffffffffbc74338)
strplumb`strplumb_get_netdev_path+0x89()
strplumb`resolve_boot_path+0xcd()     
strplumb`strplumb+0x31()              
stubs_common_code+0x51()              
main+0x47b()                          
_locore_start+0x90()                  
[0]>

Once we installed our custom illumos build from that slowly booting install image all subsequent boots were fast as usual.


Related issues

Related to illumos gate - Bug #12958: i40e allocates large amounts of DMAClosed

Actions
#1

Updated by Marcel Telka 10 months ago

Root cause

The problem is happening because i40e_m_stop() needs to call ddi_dma_mem_free() many times - approximately 0x100*(0x400+0x400+0x600) = ca 1M times. Every ddi_dma_mem_free() usually turns to one vmem_free() call via contig_free() and then to one vmem_hash_delete() call. The vmem_hash_delete() function is the actual bottleneck as could be seen here:

[0]> ::bp i40e`i40e_m_stop
[0]> :c

...

kmdb: stop at i40e`i40e_m_stop
kmdb: target stopped at:
i40e`i40e_m_stop:       pushq  %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x3
[0]> dls_link_rele::bp
[0]> :c

... after ca 30 minutes ...

kmdb: stop at dls`dls_link_rele
kmdb: target stopped at:
dls`dls_link_rele:      pushq  %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x66d76ba0c
[0]>

Yes, in our example the vmem_hash_delete() function had to iterate in the while loop 27606301196 (2.76e10) times! Honestly, I'm a bit surprised it took 30 minutes only :-).

Why is the heap_arena so slow? Because during boot the vmem_hash_rescale() optimizer does not run yet:

380void
381main(void)
382{

...

516    /*
517     * Initialize Solaris Audit Subsystem
518     */
519    audit_init();
520
521    /*
522     * Plumb the protocol modules and drivers only if we are not
523     * networked booted, in this case we already did it in rootconf().
524     */
525    if (netboot == 0 && iscsiboot_prop == NULL)
526        (void) strplumb();

...

601    /*
602     * After mp_init(), number of cpus are known (this is
603     * true for the time being, when there are actually
604     * hot pluggable cpus then this scheme  would not do).
605     * Any per cpu initialization is done here.
606     */
607    kmem_mp_init();
608    vmem_update(NULL);

...

The i40e driver is loaded at line 526 via strplumb(). Then, once all drivers are loaded, the hash optimizer is started at line 608 - via the vmem_update() call, but i40e_m_stop() is called while still in strplumb().

#2

Updated by Marcel Telka 10 months ago

Proposed solution

With the root cause known the obvious question arises: Why we do not call vmem_update(NULL) before strplumb() to make sure all vmem arenas are properly optimized as soon as possible?

The short answer is: I do not know.

The longer answer is: I was unable to find any reason why we need to defer the vmem_update(NULL) call until so late in the boot process. So the easiest fix should be just to move the vmem_update(NULL) call before the strplumb() call.

To confirm how such modification changes our case (and to see whether it possibly causes any regression) I patched the live code as follows:

[0]> main+0x170::dis -n 2
main+0x161:                     call   -0x2253a6        <post_startup>
main+0x166:                     movl   $0x1,+0x2910c0(%rip)     <swaploaded>
main+0x170:                     call   -0xa8eb5 <audit_init>
main+0x175:                     movl   +0x28846d(%rip),%edx     <netboot>
main+0x17b:                     testl  %edx,%edx
[0]> main+0x171/W 000CBB7B
main+0x171:     0xfff5714b      =       0xcbb7b
[0]> main+0x170::dis -n 2
main+0x161:                     call   -0x2253a6        <post_startup>
main+0x166:                     movl   $0x1,+0x2910c0(%rip)     <swaploaded>
main+0x170:                     call   +0xcbb7b <vmem_update>
main+0x175:                     movl   +0x28846d(%rip),%edx     <netboot>
main+0x17b:                     testl  %edx,%edx
[0]>
[0]>
[0]>
[0]>
[0]>
[0]> main+0x2ba::dis -n 2
main+0x2b3:                     call   -0x15508 <kmem_mp_init>
main+0x2b8:                     xorl   %edi,%edi
main+0x2ba:                     call   +0xcba31 <vmem_update>
main+0x2bf:                     call   -0x1c6254        <clock_tick_init_post>
main+0x2c4:                     movq   +0x19d165(%rip),%rdx     <mp_init_tbl>
[0]> main+0x2ba/W 90909090
main+0x2ba:     0xcba31e8       =       0x90909090
[0]> main+0x2be/v 90
main+0x2be:     0       =       0x90
[0]> main+0x2ba::dis -n 6
main+0x2a2:                     xorl   %edi,%edi
main+0x2a4:                     call   -0x22de89        <start_other_cpus>
main+0x2a9:                     movl   $0x4,%edi
main+0x2ae:                     call   -0x1f2ec3        <lgrp_init>
main+0x2b3:                     call   -0x15508 <kmem_mp_init>
main+0x2b8:                     xorl   %edi,%edi
main+0x2ba:                     nop    
main+0x2bb:                     nop    
main+0x2bc:                     nop    
main+0x2bd:                     nop    
main+0x2be:                     nop    
main+0x2bf:                     call   -0x1c6254        <clock_tick_init_post>
main+0x2c4:                     movq   +0x19d165(%rip),%rdx     <mp_init_tbl>
[0]>

So I simply replaced the audit_init() call at line 519 by the vmem_update() call and removed the vmem_update() call from line 608. Then I measured how it behaves:

[0]> ::bp i40e`i40e_m_stop
[0]> :c

...

kmdb: stop at i40e`i40e_m_stop
kmdb: target stopped at:
i40e`i40e_m_stop:       pushq  %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0
[0]> dls_link_rele::bp
[0]> :c

... after few seconds ...

kmdb: stop at dls`dls_link_rele
kmdb: target stopped at:
dls`dls_link_rele:      pushq  %rbp
[0]> heap_arena::print|::print vmem_t vm_kstat.vk_lookup.value.ui64
vm_kstat.vk_lookup.value.ui64 = 0x37bff
[0]> ::delete 1
[0]> ::delete 2
[0]> :c

...

openindiana console login:

Far better results.

#4

Updated by Marcel Telka 10 months ago

Testing Done

Without this change the boot process on affected machines was stuck for about 30 minutes in the i40e driver. With the fix the delay is hardly noticeable.

I also confirmed that this change does not introduce any regression on tested machines.

#5

Updated by Marcel Telka 10 months ago

  • Status changed from In Progress to Pending RTI
#6

Updated by Electric Monk 10 months ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit 948761701ebc8a0cf820a973780b2fbb931c4390

commit  948761701ebc8a0cf820a973780b2fbb931c4390
Author: Marcel Telka <marcel@telka.sk>
Date:   2020-02-05T15:41:03.000Z

    12262 Suboptimal vmem hash table slows down boot
    Reviewed by: Stefan Behrens <sbehrens@giantdisaster.de>
    Reviewed by: Gergő Mihály Doma <domag02@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

#7

Updated by Marcel Telka 5 months ago

  • Related to Bug #12958: i40e allocates large amounts of DMA added

Also available in: Atom PDF