Project

General

Profile

Bug #9118

fmadm reports faults without message summary

Added by Rob Johnston over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2018-02-15
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

The following scenario (where fmadm reports a large number of phantom FMD-8000-11 diagnoses) was seen on several systems in the Joyent Public Cloud (some information redacted):

# fmadm faulty -sg 
--------------- ------------------------------------ -------------- --------- 
TIME EVENT-ID MSG-ID SEVERITY 
--------------- ------------------------------------ -------------- --------- 
Jan 01 1970 0782c245-2720-6e46-a07c-9febc1aaa033 FMD-8000-11 Minor 
Jan 01 1970 f9e07d2f-e5c4-4d31-bac2-8d892b692fd5 FMD-8000-11 Minor 
Jan 01 1970 f8c1ac0b-b3f6-c3b9-b852-e6312cb37432 FMD-8000-11 Minor 
Jan 01 1970 f70f2a39-b15d-c4c2-8478-a618429d3f01 FMD-8000-11 Minor 
Jan 01 1970 f32c7fe0-ebfe-eb52-e72c-9ee120332489 FMD-8000-11 Minor 
Jan 01 1970 ed27ce09-b892-4f6a-de5f-a4bcc35809eb FMD-8000-11 Minor 
Jan 01 1970 c9670bd0-8387-c88a-9dd1-942612616d59 FMD-8000-11 Minor 
Jan 01 1970 a1865b9e-27d9-e57d-de5d-9b5259462cf1 FMD-8000-11 Minor 
Jan 01 1970 9d78de76-fe2e-6622-ced8-9d453861a3f5 FMD-8000-11 Minor 
Jan 01 1970 8b7367b7-2d38-e477-d635-b58a1ebd7373 FMD-8000-11 Minor 
Jan 01 1970 7a3891c8-ce82-6c94-c36a-f51849721565 FMD-8000-11 Minor 
Jan 01 1970 73d582a5-ba3f-675e-8bef-e48963df53c9 FMD-8000-11 Minor 
Jan 01 1970 73ac25e0-3f13-cba9-8d0b-c42533b1ccfc FMD-8000-11 Minor 
Jan 01 1970 6edef667-4b85-4ece-bb70-bf6c5520ccb6 FMD-8000-11 Minor 
Jan 01 1970 5223789c-d479-c05e-d955-b2e3a5f5aadc FMD-8000-11 Minor 
Jan 01 1970 346bf262-b5e7-4599-ff8d-ac0307ec549e FMD-8000-11 Minor 
Jan 01 1970 33cb0424-9a15-e8c4-89fe-8ec84f1cfead FMD-8000-11 Minor 
Jan 01 1970 2d130654-086c-cdf7-b462-8ae2c1a9a9ee FMD-8000-11 Minor 
Jan 01 1970 2141f471-8ee3-62a1-ac1c-e9e4bb786fb2 FMD-8000-11 Minor 
Jan 01 1970 10e41eb1-dea2-46f4-d318-a3529d4bb6f0 FMD-8000-11 Minor 

# fmdump -v -u 0782c245-2720-6e46-a07c-9febc1aaa033 
TIME UUID SUNW-MSG-ID EVENT

# fmadm faulty -u 10e41eb1-dea2-46f4-d318-a3529d4bb6f0 
--------------- ------------------------------------ -------------- --------- 
TIME EVENT-ID MSG-ID SEVERITY 
--------------- ------------------------------------ -------------- --------- 
Jan 01 1970 10e41eb1-dea2-46f4-d318-a3529d4bb6f0 FMD-8000-11 Minor

Host : (redacted) 
Platform : ProLiant-SL4540-Gen8
Chassis_id : (redacted)
Product_sn :

Description : A Solaris Fault Manager component generated a diagnosis for which 
no message summary exists. Refer to 
http://illumos.org/msg/FMD-8000-11 for more information.

Response : The diagnosis has been saved in the fault log for examination by 
Sun.

Impact : The fault log will need to be manually examined using fmdump(1M) 
in order to determine if any human response is required.

Action : Use fmdump -v -u <EVENT-ID> to view the diagnosis result. Run 
pkgchk -n SUNWfmd to ensure that fault management software is 
installed properly.

Soe notes from looking at the fmd process core:

fmd is holding on to a bunch of cases. They are all in the FMD_CASE_CLOSED state and they all have the same diagcode (FMD-8000-11):

::fmd_case

ADDR STATE REF DATA UUID
838f020 CLOSE 1 0 f70f2a39-b15d-c4c2-8478-a618429d3f01
838f340 CLOSE 1 0 9d78de76-fe2e-6622-ced8-9d453861a3f5
838f2a0 CLOSE 1 0 73ac25e0-3f13-cba9-8d0b-c42533b1ccfc
847ae88 CLOSE 1 0 2d130654-086c-cdf7-b462-8ae2c1a9a9ee
838f660 CLOSE 1 0 f32c7fe0-ebfe-eb52-e72c-9ee120332489
838f480 CLOSE 1 0 0782c245-2720-6e46-a07c-9febc1aaa033
838f8e0 CLOSE 1 0 c9670bd0-8387-c88a-9dd1-942612616d59
838f520 CLOSE 1 0 f9e07d2f-e5c4-4d31-bac2-8d892b692fd5
838f160 CLOSE 1 0 7a3891c8-ce82-6c94-c36a-f51849721565
847ad48 CLOSE 1 0 ed27ce09-b892-4f6a-de5f-a4bcc35809eb
838f840 CLOSE 1 0 2141f471-8ee3-62a1-ac1c-e9e4bb786fb2
838f3e0 CLOSE 1 0 33cb0424-9a15-e8c4-89fe-8ec84f1cfead
838f980 CLOSE 1 0 a1865b9e-27d9-e57d-de5d-9b5259462cf1
838f7a0 CLOSE 1 0 f8c1ac0b-b3f6-c3b9-b852-e6312cb37432
838f5c0 CLOSE 1 0 8b7367b7-2d38-e477-d635-b58a1ebd7373
838f0c0 CLOSE 1 0 6edef667-4b85-4ece-bb70-bf6c5520ccb6
838f700 CLOSE 1 0 10e41eb1-dea2-46f4-d318-a3529d4bb6f0
847ade8 CLOSE 1 0 73d582a5-ba3f-675e-8bef-e48963df53c9
847af28 CLOSE 1 0 346bf262-b5e7-4599-ff8d-ac0307ec549e
838f200 CLOSE 1 0 5223789c-d479-c05e-d955-b2e3a5f5aadc

FMD-8000-11 is defect.sunos.fmd.nodiagcode, which is diagnosed when there’s a problem looking up the message content for a given diagnosis. That lookup can fail if:

The message content doesn’t exist in any of the FM dictionaries (usr/src/cmd/fmd/dicts/)
OR
The list.suspect event has no suspects, making it impossible to compute the diagcode, which is hashed from the set of fault/defect event class(es) in the suspect list.

Also, none of the cases have valid timestamps. This is why fmadm(1m) shows the time as epoch. Question is, why is the timestamp not valid?

struct timeval ci_tv =
{ time_t tv_sec = 0 suseconds_t tv_usec = 0 }

int ci_tv_valid = 0

Somewhere up the stack we must not be checking if ci_tv_valid, so we're interpreting the zero-valued timeval and interpreting it as 0 secs since epoch. Probably would be better to say "UNKNOWN" for somesuch.

Root Cause

Here is what is happening:

kernel generates DIMM-related ereport
fmd receives the ereport and sends it to the eft plugin's fmdo_recv entry point (fme_ereport_receive())
fme_ereport_receive() determines that the FMRI of the error detector doesn't match anything in hc-scheme topology and silently fails, but leaves the case lying around unsolved, effectively turning it into a zombie case.

The specific code in question is this bit from usr/src/cmd/fm/modules/common/fme.c

static struct fme *
newfme(const char *e0class, const struct ipath *e0ipp, fmd_hdl_t *hdl,
    fmd_case_t *fmcase, fmd_event_t *ffep, nvlist_t *nvl)
{
    struct cfgdata *cfgdata;
    int init_size;
    extern int alloc_total();
    nvlist_t *detector = NULL;
    char *pathstr;
    char *arg;

    /*
     * First check if e0ipp is actually in the topology so we can give a
     * more useful error message.
     */
    ipathlastcomp(e0ipp);
    pathstr = ipath2str(NULL, e0ipp);
    cfgdata = config_snapshot();
    platform_units_translate(0, cfgdata->cooked, NULL, NULL,
        &detector, pathstr);
    FREE(pathstr);
    structconfig_free(cfgdata->cooked);
    config_free(cfgdata);
    if (detector == NULL) {
        /* See if class permits silent discard on unknown component. */
        if (lut_lookup(Ereportenames_discard, (void *)e0class, NULL)) {
            out(O_ALTFP|O_VERB2, "Unable to map \"%s\" ereport " 
                "to component path, but silent discard allowed.",
                e0class);

/* THE FIX: should call fmd_case_close() HERE before returning */

        } else {
            Undiag_reason = UD_VAL_BADEVENTPATH;
            (void) nvlist_lookup_nvlist(nvl, FM_EREPORT_DETECTOR,
                &detector);
            arg = ipath2str(e0class, e0ipp);
            publish_undiagnosable(hdl, ffep, fmcase, detector, arg);
            FREE(arg);
        }
        return (NULL);
    }

Normally when an ereport is received that involves hardware that isn't in topology, then fmd will diagnose a defect event, indicating that it received error telemetry that is undiagnosable. The reason we instead silently discard these ereports is due to entries like the following in usr/src/cmd/fm/eversholt/files/i386/i86pc/intel.esc:

event ereport.cpu.intel.quickpath.mem_ce@
    chip/memory-controller {within(12s)}, discard_if_config_unknown=1;

The "discard_if_config_unknown" parameter cases eft to silently discard the ereport if we can't find the detector in topology.

History

#1

Updated by Rob Johnston over 1 year ago

Testing

To test the change I built a platform image with the fix and booted it on magma. Then I fired up an fmsim environment and replayed the ereport telemetry captured from the JPC system:

# fmsim -d /var/tmp/fmsim -i
fmsim: creating simulation world /var/tmp/fmsim ... done.
fmsim: populating /var ... done.
fmsim: populating /usr/lib/fm from / ... 8432 blocks
fmsim: populating /usr/platform/i86pc/lib/fm from / ... 1568 blocks
fmsim: populating /usr/lib/locale/en_US.UTF-8 from / ... 320 blocks
fmsim: populating /usr/sbin from / ... 272 blocks
fmsim: adding customizations: done.
fmsim: generating script ... done.
fmsim: simulation 5648 running fmd(1M) version 1.2
fmd: [ loading modules ... fmd: ext-event-transport: another fmd is active on channel com.sun:fm:user_priv_highval
fmd: failed to load /var/tmp/fmsim/usr/lib/fm/fmd/plugins/ext-event-transport.so: client requested that module execution abort
fmd: failed to load /var/tmp/fmsim/usr/lib/fm/fmd/plugins/ip-transport.so: client handle wasn't initialized by _fmd_init
fmd: failed to load /var/tmp/fmsim/usr/lib/fm/fmd/plugins/disk-monitor.so: client handle wasn't initialized by _fmd_init
fmd: failed to load /var/tmp/fmsim/usr/lib/fm/fmd/plugins/sp-monitor.so: client handle wasn't initialized by _fmd_init
done ]
fmd: [ awaiting events ]
fmsim: rpc adm requests can rendezvous at 1073741824
fmsim: injectors should use channel com.sun:fm:fmd5648
fmsim: debuggers should attach to PID 5665

# fminject -c com.sun:fm:fmd5648 ./errlog 
fminject: record [54] (ereport.io.scsi.cmd.disk.dev.rqs.merr) timestamp is out of order: advancing event time to 58569bc4.3b09b011
fminject: record [56] (ereport.io.scsi.cmd.disk.recovered) timestamp is out of order: advancing event time to 58569bc9.250a9fc0
fminject: record [64] (ereport.io.scsi.cmd.disk.tran) timestamp is out of order: advancing event time to 5a6365fa.1aa39f08
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.l2cache ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.l2cache ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.io.scsi.cmd.disk.dev.rqs.merr ... done
sending event ereport.io.scsi.cmd.disk.dev.rqs.merr ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.io.scsi.cmd.disk.dev.rqs.derr ... done
sending event ereport.io.scsi.cmd.disk.recovered ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_ce ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.cpu.intel.quickpath.mem_redundant ... done
sending event resource.fm.fmd.clock.addhrtime ... done
sending event ereport.io.scsi.cmd.disk.dev.serr ... done
sending event ereport.io.scsi.cmd.disk.tran ... done

Prior to the fix, after replaying the telemetry, fmd would be left with a few dozen zombie cases stuck in an unsolved state. With the fix, the cases that were created for each DIMM-related ereport get closed after eft determines that it can't proceed with diagnosis. This was confirmed both by attaching to fmd with mdb and walking the list of cases, and by dumping the eft checkpoint file and verifying there were no checkpoint structures for unsolved cases:

[root@magma /var/tmp]# ./dump-fmd-ckpt fmsim/var/fm/fmd/ckpt/eft/eft 
==========
FCF Header
==========
fcfh_ident.id_magic = 0x7f, F, C, F
fcfh_ident.id_model = ILP32
fcfh_ident.id_encoding = LSB
fcfh_ident.id_version = 1
fcfh_flags = 0x0
fcfh_hdrsize = 64 bytes
fcfh_secsize = 32 bytes
fcfh_secnum = 8
fcfh_secoff = 64 bytes
fcfh_filesz = 782 bytes
fcfh_cgen = 118

=============
SECTION 0 (addr: 80627d0)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
none       1     0     0     140    0    

=============
SECTION 1 (addr: 80627f0)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
events     8     0     0x28  140    0x28 

=============
SECTION 2 (addr: 8062810)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
events     8     0     0x28  168    0x28 

=============
SECTION 3 (addr: 8062830)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
serd       8     0     0x18  190    0x30 

=============
SECTION 4 (addr: 8062850)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
buffer     4     0     0     1c0    0x70 

=============
SECTION 5 (addr: 8062870)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
bufs       4     0     0x8   230    0x8  

=============
SECTION 6 (addr: 8062890)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
module     4     0     0     238    0x14 

fcfm_name = eft (0x77)
fcfm_path = /var/tmp/fmsim/usr/lib/fm/fmd/plugins/eft.so (0x7b)
fcfm_desc = eft diagnosis engine (0xa8)
fcfm_vers = 1.16 (0xbd)

=============
SECTION 7 (addr: 80628b0)
=============
TYPE       ALIGN FLAGS ENTSZ OFF    SIZE 
strtab     1     0     0     24c    0xc2 

#2

Updated by Rob Johnston over 1 year ago

As part of debugging this issue, I wrote the following tool for dumping an FMD module checkpoint file:

https://github.com/rejohnst/illumos-toolbox/tree/master/fma/dump-fmd-ckpt

#3

Updated by Electric Monk over 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit a3f6a2a475fa4c811f7440bedffcf6acabbe1e0e

commit  a3f6a2a475fa4c811f7440bedffcf6acabbe1e0e
Author: Rob Johnston <rob.johnston@joyent.com>
Date:   2018-02-20T15:24:47.000Z

    9118 fmadm reports faults without message summary
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Yuri Pankov <yuripv@yuripv.net>
    Approved by: Dan McDonald <danmcd@joyent.com>

Also available in: Atom PDF