FMA's "fmd" gets events with a NULL pointer in 'em
Pardon the lengthy explanation. I'm recalling this from memory and the flaky hardware in question is not attached right now.
- SOMETHING (a 10GigE board in my case) can't attach to its driver, or even gets "retired".
- An event is generated, and it eventually arrives at fmd's fme_undiagnosable().
- The "struct fme" contains a list of "observations", which are "struct event". These observations are *SUPPOSED* to contain an fmd_event_t "ffep". The problem is that these observations have their ffep set to NULL!
- This eventually leads to a NULL pointer dereference deeper down the stack: fme_undiagnosable -> fmd_case_add_ereport -> fmd_case_insert_event -> fmd_event_hold -> BOOM.
- One other bit of the fme.c code has a check for a NULL ffep. I tried replicating this check in fme_undiagnosable(), but the panic then happens later, in an area where the data structure is so corrupt that I can only assume checking for a NULL ffep wasn't a good idea.
- I know little to nothing about FMA and fmd. But a very flaky system (it's possible there's a BIOS bug that may cause the 10GigE board not to be recognized, though prtconf showed a 10GigE board) may generate events that, instead of causing faults, cause FMA crashes. That can't be good.
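
To make the failure mode concrete, here is a minimal sketch of the kind of NULL-ffep guard described above. It is not the real fmd or eversholt code: the struct layouts, the `next` link, the `refs` counter, and the names `event_hold()` and `hold_observations()` are all simplified stand-ins I made up for illustration. Only `ffep`, `observations`, `struct event`, `struct fme`, and `fmd_event_t` come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; NOT the real fmd/eversholt definitions. */
typedef struct fmd_event {
	unsigned refs;			/* hypothetical reference count */
} fmd_event_t;

struct event {
	fmd_event_t *ffep;		/* may be NULL, per this report */
	struct event *next;		/* hypothetical list link */
};

struct fme {
	struct event *observations;	/* head of the observation list */
};

/* Stand-in for fmd_event_hold(): it dereferences its argument. */
void
event_hold(fmd_event_t *ep)
{
	assert(ep != NULL);	/* the real code would fault here instead */
	ep->refs++;
}

/*
 * Walk the observations and hold only those that carry an event.
 * Returns the number of observations skipped because ffep was NULL.
 */
size_t
hold_observations(struct fme *fmep)
{
	size_t skipped = 0;
	struct event *ep;

	for (ep = fmep->observations; ep != NULL; ep = ep->next) {
		if (ep->ffep == NULL) {
			skipped++;	/* nothing to hold; do not dereference */
			continue;
		}
		event_hold(ep->ffep);
	}
	return (skipped);
}
```

As noted above, a guard like this was not enough in my test: it only avoids the first dereference, and the panic moved further along. The open question is why the observations have a NULL ffep in the first place.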