imc driver blew up on missing channel
I was given a panic that a user hit in the imc driver:
> ::status
debugging crash dump vmcore.0 (64-bit) from
operating system: 5.11 omnios-r151034-b35d9a8b4a (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
illumos-kvm - heads/r151034-0-g889916e
heads/r151034-0-gb35d9a8b4a
image uuid: 93ea4e7b-043b-40bf-ac7b-8a5539824f20
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe007b8ba950 addr=30 occurred in module "imc" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffe007b8baaa0 imc_fill_dimms+0x91(fffffe594d6c8000, fffffe594d6c8bc8, fffffe594d6c8c00)
fffffe007b8bab20 imc_fill_data+0xeb(fffffe594d6c8000)
fffffe007b8bab50 imc_attach_complete+0x69(fffffe594d6c8000)
fffffe007b8bac00 taskq_thread+0x2cd(fffffe594dd9dbb8)
fffffe007b8bac10 thread_start+0xb()
If we do some digging here:
> fffffe594d6c8000::print imc_t imc_gen_data igd_max_dimms
imc_gen_data = imc_gen_data_has_brd
> fffffe594d6c8000::print imc_t imc_gen_data->igd_max_dimms
imc_gen_data->igd_max_dimms = 0x3
> fffffe594d6c8c00::print imc_channel_t ich_desc
ich_desc = 0
The surprising thing is that this channel has no memory controller device to describe it. Do the other channels have devices, or is this the only one that is missing?
> fffffe594d6c8000::print imc_t imc_sockets.isock_imcs.icn_channels.ich_desc
imc_sockets.isock_imcs.icn_channels.ich_desc = 0
> fffffe594d6c8000::print imc_t imc_sockets.isock_imcs.icn_channels.ich_desc
imc_sockets.isock_imcs.icn_channels.ich_desc = 0xfffffe594c1a8440
> fffffe594d6c8000::print imc_t imc_sockets.isock_imcs.icn_channels[ ].ich_desc
imc_sockets.isock_imcs.icn_channels.ich_desc = 0xfffffe594c1a8240
> fffffe594d6c8000::print imc_t imc_sockets.isock_imcs.icn_channels.ich_desc
imc_sockets.isock_imcs.icn_channels.ich_desc = 0xfffffe594c1a8380
This is quite peculiar. On this system we have no information for channel 0 at all! It's not clear what would cause this to happen; however, if we look in prtconf, we don't see an imc stub related to this channel at all. It's not clear how the CPU could end up without all of its devices enumerated. Regardless, the driver should check that the channels it found are the ones it expects, in order. Systems generally have 2 or 4 channel devices available, so if we find some channels but they are not the first ones, as in this case, we should fail to attach.
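The check described above can be sketched roughly as follows. This is a minimal user-land illustration and not the code from the actual fix: `sketch_channel_t`, `sc_desc`, and `sketch_channels_valid()` are hypothetical names standing in for the driver's `imc_channel_t` and its `ich_desc` member. The sketch also assumes that finding no channels at all is not treated as an error here, since the text only calls out the case where some channels were found but not the first ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the driver's per-channel state. Only the
 * descriptor pointer matters for this check.
 */
typedef struct sketch_channel {
	const void *sc_desc;	/* NULL when no device was found */
} sketch_channel_t;

/*
 * Return true when every populated channel sits in a contiguous run that
 * starts at index 0. A populated channel that follows a missing one (for
 * example, channel 0 missing while channels 1-3 are present) is rejected,
 * so a caller can fail attach instead of dereferencing a NULL descriptor
 * later on.
 */
static bool
sketch_channels_valid(const sketch_channel_t *chans, size_t nchans)
{
	bool seen_gap = false;

	assert(chans != NULL || nchans == 0);

	for (size_t i = 0; i < nchans; i++) {
		if (chans[i].sc_desc == NULL) {
			seen_gap = true;
			continue;
		}

		/* A device showed up after a hole: not the first ones. */
		if (seen_gap)
			return (false);
	}

	return (true);
}
```

With the layout seen in this dump (first descriptor NULL, the next three populated) the function returns false, whereas a system with only its first two channels populated passes.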
With this fix, the user no longer sees the panic.
Updated by Robert Mustacchi 10 months ago
This was tested in two ways:
1. Andy was able to arrange for the user who hit the panic to test this and confirm that they no longer saw the panic.
2. Patrick was able to boot this on an Ivy Bridge based system and verified that the imc driver still worked in the normal case.
Updated by Electric Monk 10 months ago
- Status changed from New to Closed
- % Done changed from 90 to 100
commit 7d91603476b740ff8f4c917d71ee5884ab39cb60
Author: Robert Mustacchi <firstname.lastname@example.org>
Date: 2020-07-21T22:50:52.000Z
12966 imc driver blew up on missing channel
Reviewed by: Andy Fiddaman <email@example.com>
Reviewed by: Igor Kozhukhov <firstname.lastname@example.org>
Reviewed by: Paul Winder <email@example.com>
Approved by: Dan McDonald <firstname.lastname@example.org>