Bug #3705

stack overflow due to zfs lz4 compression

Added by Matthew Ahrens over 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
-
Category:
zfs - Zettabyte File System
Start date:
2013-04-09
Due date:
% Done:
100%
Estimated time:
Difficulty:
Bite-size
Tags:
needs-triage

Description

::status

debugging crash dump vmcore.0 (64-bit) from delphix
operating system: 5.11 os-build-723 (i86pc)
image uuid: 750af9ed-92a2-4c56-fa05-c249baf5ef0b
panic message: BAD TRAP: type=8 (#df Double fault) rp=ffffff1158771f10 addr=0
dump content: kernel pages only

$C

ffffff007b7e40a0 dtrace_dynvar+0x45(ffffff1164a22cf0, 3, ffffff007b7e4150, 8,
1, ffffff007b7e4360, ffffff1164a22cb8)
ffffff007b7e4280 dtrace_dif_emulate+0x661()
ffffff007b7e4460 dtrace_probe+0x415(36ed, ffffff1157d3ef30, ffffff007b7e8c40,
63, 80, 0)
ffffff007b7e4470 lockstat_wrapper+0x21()
ffffff007b7e44a0 disp_lock_exit_high+0x32(ffffff1157d3ef30)
ffffff007b7e44c0 thread_transition+0x26(ffffff007ae33c40)
ffffff007b7e4520 dispdeq+0x177(ffffff007ae33c40)
ffffff007b7e4590 disp_getbest+0x1da(ffffff1157d3ef30)
ffffff007b7e4620 disp_getwork+0x2a5(ffffff1158a3a580)
ffffff007b7e4670 disp+0x1b0()
ffffff007b7e46a0 swtch+0xba()
ffffff007b7e46c0 preempt+0xec()
ffffff007b7e46f0 kpreempt+0x98(1)
ffffff007b7e4720 sys_rtt_common+0x1ba(ffffff007b7e4730)
ffffff007b7e4730 _sys_rtt_ints_disabled+8()
ffffff007b7e4860 bzero+0x3a7()
ffffff007b7e8900 LZ4_compress64kCtx+0x7a(0, ffffff0114f4c000, ffffff1157d20004,
2000, 1bfc)
ffffff007b7e8930 real_LZ4_compress+0x47(ffffff0114f4c000, ffffff1157d20004,
2000, 1bfc)
ffffff007b7e8980 lz4_compress+0x36()
ffffff007b7e89c0 zio_compress_data+0x89()
ffffff007b7e8a30 zio_write_bp_init+0x1c4(ffffff119c8c0bd8)
ffffff007b7e8a70 zio_execute+0x88(ffffff119c8c0bd8)
ffffff007b7e8b30 taskq_thread+0x2d0(ffffff115d9466d8)
ffffff007b7e8b40 thread_start+8()

Note that the stack is 20k, and LZ4_compress64kCtx() uses 16k of stack. 1k is
used by the 13 frames between LZ4 and dtrace. dtrace_probe() and
dtrace_dif_emulate() each use 480 bytes of stack. dtrace_dynvar() is trying to
use 240 bytes of stack, but overflows.

Fundamentally, the 4k of stack (the amount left after LZ4 takes its 16k) is
insufficient for the rest of the system. I would suggest that we limit LZ4 to
8k of stack space, or use kmem_alloc() for the hashtable, or use a thread-specific buffer. In any case, we should leave at least 12k for the rest of the stack.
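To make the kmem_alloc() option concrete, here is a minimal sketch (not an actual patch; struct refTables, LZ4_64KLIMIT, and the fail-soft behavior follow the stock lz4.c naming and are illustrative placeholders):

static int
real_LZ4_compress(const char *source, char *dest, int isize, int osize)
{
        /* Heap-allocate the hash-table context instead of a 16k stack buffer. */
        void *ctx = kmem_zalloc(sizeof (struct refTables), KM_NOSLEEP);
        int result;

        /* Out of memory: return 0 so the caller treats the block as incompressible. */
        if (ctx == NULL)
                return (0);

        if (isize < LZ4_64KLIMIT)
                result = LZ4_compress64kCtx(ctx, source, dest, isize, osize);
        else
                result = LZ4_compressCtx(ctx, source, dest, isize, osize);

        kmem_free(ctx, sizeof (struct refTables));
        return (result);
}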


Files

lz4_compression_bench.ods (32.2 KB) Sašo Kiselkov, 2013-04-09 08:26 PM
lz4_compression_bench_new.ods (30.1 KB) Sašo Kiselkov, 2013-04-18 11:21 AM
lz4_heap.patch (817 Bytes) Sašo Kiselkov, 2013-04-18 11:21 AM
lz4_heap.patch (1.22 KB) Sašo Kiselkov, 2013-04-18 11:24 AM

History

#1

Updated by Sašo Kiselkov over 6 years ago

I've run a quick batch of tests on the proposed scenarios; results are attached (look only at the green rows; the remainder were done with a different kernel, so they are probably not comparable to this run).

As can be seen, if we were to do a simple hot-fix, I'd go with changing STACKLIMIT so that LZ4 uses only 8k of stack. Even though this reduces compression by about 1.8% on the large block sizes, speed remains unaffected (there is even a slight increase on decompression, probably due to the small caches on my test system). The other solution, using kmem_alloc, maintains the compression ratio but is very expensive in compression performance (which drops to nearly lzjb levels).
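For reference, the 8k variant is purely a hash-table-size change: the 64k-input compressor keeps a uint16_t hash table of (1 << (HASH_LOG + 1)) entries on the stack, so lowering HASH_LOG (a.k.a. COMPRESSIONLEVEL) by one halves the stack cost. A sketch with macro names as in the stock lz4.c (treat the exact identifiers as illustrative):

/*
 * Stack cost of the 64k-variant hash table as a function of HASH_LOG:
 *   HASH_LOG = 12: (1 << 13) * sizeof (uint16_t) = 16k  (current)
 *   HASH_LOG = 11: (1 << 12) * sizeof (uint16_t) =  8k  (the 8k hot-fix)
 */
#define COMPRESSIONLEVEL        11      /* was 12 */
#define HASH_LOG                COMPRESSIONLEVEL
#define HASHLOG64K              (HASH_LOG + 1)
#define HASH64KTABLESIZE        (1U << HASHLOG64K)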

#2

Updated by Sašo Kiselkov over 6 years ago

Do you think the slab allocator could give better performance? We're dealing with relatively few large fixed-size objects here, so keeping a few of them around in some slab cache might make sense... but I'm totally just guessing now...

#3

Updated by Matthew Ahrens over 6 years ago

I agree that going with 8k of stack is the best solution for now. Longer term, I'd like to investigate why kmem_alloc() decreases performance so much.

#4

Updated by Rich Lowe over 6 years ago

On 32-bit systems, 8K is still two-thirds of the stack, isn't it?
(Presumably the existing code just croaks immediately?)

#5

Updated by Sašo Kiselkov over 6 years ago

On 32-bit systems we can always just fall back to a heap allocator. It's the 64-bit systems that we consider performance-sensitive at the moment.

#6

Updated by Rich Lowe over 6 years ago

Right, I wasn't suggesting you care about their performance, just that a reduction to 8K would probably still carry a heavy risk of blowing the stack there. Falling back to allocation would be great.
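A compile-time version of that fallback could simply key heap mode off the data model; a sketch (LZ4_HEAPMODE is a hypothetical knob here, while _ILP32/_LP64 are the standard illumos data-model macros):

#if defined(_ILP32)
#define LZ4_HEAPMODE    1       /* 32-bit: stacks too small, always allocate */
#else
#define LZ4_HEAPMODE    0       /* 64-bit: keep the (capped) hash table on-stack */
#endif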

#7

Updated by Sašo Kiselkov over 6 years ago

So it appears that I was indeed wrong about my original assumptions about heap-allocation performance. It turns out I was running a DEBUG kernel build with kmem debugging enabled, which was a significant drag on the results. I realized this only recently, while figuring out why my kernel builds were so slow while testing some ixgbe changes I had made.

I've redone the tests at 128k block sizes for all of our major compression algos and can see no drop in performance when using heap allocation (in fact, there's a slight increase in compression, but it could just be noise).

The attached patch also enables hardware bitcounts in LZ4. For some reason I had thought these were an issue in the kernel, but it turns out I was wrong again. Enabling hardware bitcounts improves performance somewhat.
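For context, the hardware bitcount path replaces LZ4's software De Bruijn bit-scan with a count-trailing-zeros instruction; a little-endian 64-bit sketch (the GCC builtin is assumed, and LZ4_FORCE_SW_BITCOUNT is the stock opt-out macro):

/*
 * Given the XOR of two 8-byte words (known to be non-zero), the number of
 * equal low-order bytes is the trailing-zero count divided by 8. On x86,
 * __builtin_ctzll compiles to a single bsf/tzcnt instruction; defining
 * LZ4_FORCE_SW_BITCOUNT selects the portable De Bruijn fallback instead.
 */
#if !defined(LZ4_FORCE_SW_BITCOUNT)
static inline int
LZ4_NbCommonBytes(uint64_t diff)
{
        return (__builtin_ctzll(diff) >> 3);    /* bit count to byte count */
}
#endif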

#8

Updated by Sašo Kiselkov over 6 years ago

Minor update: I forgot to remove the comments around the STACKLIMIT lines. Attached is a fix for it.

#9

Updated by Christopher Siden over 6 years ago

  • Status changed from Feedback to Closed
commit d8fa96c
Author: Sašo Kiselkov <skiselkov.ml@gmail.com>
Date:   Wed Apr 24 09:45:43 2013

    3705 stack overflow due to zfs lz4 compression
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Christopher Siden <christopher.siden@delphix.com>
#10

Updated by Richard Yao about 6 years ago

The ZFSOnLinux project changed this code to use SLAB allocation when it was imported. Linux has 8KB stacks, so stack allocation never worked for us. It might be a good idea to switch to SLAB allocation in Illumos. That should perform better than using kmem_zalloc.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/lz4.c
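A sketch of that approach with the illumos kmem cache API (the lz4_cache name and the init/fini hook points mirror the ZFSOnLinux code; treat them as assumptions for an illumos port):

/*
 * One cache of fixed-size LZ4 contexts; the compress path then swaps
 * kmem_zalloc()/kmem_free() for kmem_cache_alloc()/kmem_cache_free(),
 * so freed contexts are recycled rather than returned to the allocator.
 */
static kmem_cache_t *lz4_cache;

void
lz4_init(void)
{
        lz4_cache = kmem_cache_create("lz4_ctx", sizeof (struct refTables),
            0, NULL, NULL, NULL, NULL, NULL, 0);
}

void
lz4_fini(void)
{
        if (lz4_cache != NULL) {
                kmem_cache_destroy(lz4_cache);
                lz4_cache = NULL;
        }
}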
