Bug #12966

closed

imc driver blew up on missing channel

Added by Robert Mustacchi 11 months ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

I was given a panic that a user hit in the imc driver:

> ::status
debugging crash dump vmcore.0 (64-bit) from 
operating system: 5.11 omnios-r151034-b35d9a8b4a (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
illumos-kvm - heads/r151034-0-g889916e
heads/r151034-0-gb35d9a8b4a

image uuid: 93ea4e7b-043b-40bf-ac7b-8a5539824f20
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe007b8ba950 addr=30 occurred in module "imc" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffe007b8baaa0 imc_fill_dimms+0x91(fffffe594d6c8000, fffffe594d6c8bc8, fffffe594d6c8c00)
fffffe007b8bab20 imc_fill_data+0xeb(fffffe594d6c8000)
fffffe007b8bab50 imc_attach_complete+0x69(fffffe594d6c8000)
fffffe007b8bac00 taskq_thread+0x2cd(fffffe594dd9dbb8)
fffffe007b8bac10 thread_start+0xb()

If we do some digging here:

> fffffe594d6c8000::print imc_t imc_gen_data igd_max_dimms
imc_gen_data = imc_gen_data_has_brd
> fffffe594d6c8000::print imc_t imc_gen_data->igd_max_dimms
imc_gen_data->igd_max_dimms = 0x3
> fffffe594d6c8c00::print imc_channel_t ich_desc
ich_desc = 0

The surprising thing is that this channel has no memory controller device to describe it. If we look at the other channels, do they have devices, or is this the only one missing?

> fffffe594d6c8000::print imc_t imc_sockets[0].isock_imcs[0].icn_channels[0].ich_desc
imc_sockets[0].isock_imcs[0].icn_channels[0].ich_desc = 0
> fffffe594d6c8000::print imc_t imc_sockets[0].isock_imcs[0].icn_channels[1].ich_desc
imc_sockets[0].isock_imcs[0].icn_channels[1].ich_desc = 0xfffffe594c1a8440
> fffffe594d6c8000::print imc_t imc_sockets[0].isock_imcs[0].icn_channels[2].ich_desc
imc_sockets[0].isock_imcs[0].icn_channels[2].ich_desc = 0xfffffe594c1a8240
> fffffe594d6c8000::print imc_t imc_sockets[0].isock_imcs[0].icn_channels[3].ich_desc
imc_sockets[0].isock_imcs[0].icn_channels[3].ich_desc = 0xfffffe594c1a8380

This is quite peculiar. On this system we have no information for channel 0 at all! It's not clear what would cause this to happen; however, if we look in prtconf, we don't see an imc stub for it at all. It's also not clear how the CPU could end up without all of its devices enumerated. Regardless, we should check that the channels we found are the ones we expect, in order. Systems generally have 2 or 4 channel devices available, so if we found some channels but they're not the first ones, as in this case, we should fail to attach.
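As a rough illustration of the kind of check described above, here is a minimal, standalone sketch. It is not the actual change made to the imc driver: `sketch_channel_t` and `sketch_channels_valid()` are hypothetical names, and only the `ich_desc` pointer from the debugger output is modeled. The idea is to reject any layout where a populated channel follows an unpopulated one, or where no channel was found at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the driver's per-channel state. Only the
 * descriptor pointer seen in the dump (ich_desc) is modeled here.
 */
typedef struct sketch_channel {
	void *ich_desc;	/* NULL when no device was found for this channel */
} sketch_channel_t;

/*
 * Return true only if the discovered channels are the first ones: at
 * least one channel has a descriptor, and no populated channel follows
 * an unpopulated one.
 */
static bool
sketch_channels_valid(const sketch_channel_t *chans, size_t nchans)
{
	size_t nfound = 0;
	bool seen_gap = false;

	assert(chans != NULL);

	for (size_t i = 0; i < nchans; i++) {
		if (chans[i].ich_desc == NULL) {
			seen_gap = true;
		} else if (seen_gap) {
			/* A later channel exists but an earlier one is missing. */
			return (false);
		} else {
			nfound++;
		}
	}

	return (nfound > 0);
}
```

With the layout from the dump (channel 0 has no descriptor, channels 1 through 3 do), this sketch returns false, which is the point at which an attach routine could fail cleanly instead of dereferencing a NULL descriptor later.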

With this fix, the user no longer sees the panic.
