Bug #3673


core dumping is abysmally slow

Added by Robert Mustacchi over 9 years ago. Updated over 9 years ago.


The actual problem here – the reason why so many of the core dumps are
incomplete – is a core dumping performance problem, compounded by an
observability problem.

Currently, when we're dumping a segment as part of a core dump, we write
core_chunk pages – and then we sleep for core_delay_usec microseconds.
With the default core_chunk pages of 32 ("a reasonable value so that we don't
overwhelm the VM system with a monstrously large single write and cause
pageout to stop running") and core_delay_usec of 10,000, this defaults to 32
pages times 4K per page every 10 milliseconds – limiting core dumping speed
to the very strange number of 13 MB/second. A process with an entirely
reasonable 1G of address space will therefore take 78 seconds to dump a core;
a process with a (less reasonable but still possible) 80G of address space
will take an eye-watering 105 minutes to dump core. While limiting the
size of each write might make some sense, throttling them to delay for ten
milliseconds between them does not. Why sleep? This was done putatively to
"let pageout run", which is presumably a reference (albeit a cryptic one) to
the fact that pageout (by default) runs at priority 98 and kpreemptpri is
either 99 (SmartOS) or 100 (illumos). So when a core dumping process is doing
work core dumping in the kernel, pageout cannot preempt it – or at least,
such would be the theory. (That is, would be the theory if the comments
expanded on their fear – which they don't.) In practice, the core dumping
thread is doing intensive I/O, and as such is very likely to give the CPU up
and/or hit a kpreempt check naturally. And of course, on an SMP system (which
is to say, every system) the only threads that could truly be starved would be
those bound to the CPU upon which the core dumping thread was running – a
dubious proposition. In short, the concern over the need to "let pageout run"
during a core dump seems to be a bogeyman (or, as the Haitians say, Uncle
Gunnysack); core dumping should be allowed to proceed as quickly as it can.

That said, there is a second performance problem here: when a fatal signal is
received during a core dump (as it almost always will be for a dump that is
going to take 105 minutes to complete), we abort writing the current segment,
but then reattempt (and subsequently abort) the next segment. The problem is
that this segment abort occurs after having written core_chunk pages (128K
given current defaults). If one has thousands of segments, this alone can
result in hundreds of megabytes of entirely wasted effort.

Both of these problems need to be fixed: we need to write core dumps as
quickly as we can, but then immediately abort (entirely) upon receipt of a
fatal signal. To make this case more readily identifiable, information beyond
a simple errno should be recorded: for the segment that was being written
when the signal was received, we should additionally record the siginfo of the
fatal signal. This will allow us to see via mdb that the process was in fact
killed during core dump processing – and by whom.

#1

Updated by Robert Mustacchi over 9 years ago

  • Status changed from New to Resolved
  • % Done changed from 90 to 100

Resolved in f971a3462face662ae8ef220a18a98354d625d54.
