Feature #12500

fmd: add support for write-only transport modules

Added by Rob Johnston 7 months ago. Updated 7 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

For more background, see chapter 4 of the Fault Manager Programmer’s Reference Manual.

Most of the operations of the illumos Fault Management Daemon (fmd) are driven by a set of plugin modules which implement the logic for diagnosing and responding to fault events. fmd also supports a special class of plugins called transport modules. Transport modules allow for FM events (faults, ereports, etc.) to be transferred into and out of an fmd instance.

Today, fmd supports two types of transport modules: read-only and read-write.

Read-only transport modules take error and fault telemetry from a non-FMA source, convert it into FM events, and then post those events to the local fmd instance. An example of this is the disk-transport module, which converts SMART telemetry into FM disk ereports.

Read-write transport modules are used to transfer FMA events between two instances of fmd (presumably running in different fault domains). The ip-transport module in the illumos source provides a reference implementation of a read-write transport module. I’m not aware of any uses of read-write transport modules in the illumos community, but as a practical example, the ip-transport module is used on Oracle X64 systems to transfer FM events between the fmd on the host system and the fmd running on the service processor.

Another potential use-case for transport modules would be to transfer FM events from fmd on an illumos system to an outside, non-FMA-aware software entity for processing and/or storage. But the module API for transport modules today only supports sending events to endpoints that implement the FM transport protocol (i.e. another fmd instance).

One could approximate this by writing a response agent plugin that subscribes to FM events and then transfers those events off-system. But the behaviour and API around the per-module event queues for non-transport modules are not really designed for this use-case. For non-transport modules (diagnosis engines and response agents), once an event has been posted to the module via the fmdo_recv() entry point and that entry point returns, the event is dequeued from the module's event queue and is not posted again. If, for some reason, the module were unable to deliver the event, it would have to either drop the event or block in the entry point until it could deliver it, both of which are undesirable behaviours.
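To make the limitation above concrete, here is a small stand-alone model in C. It is not fmd code and does not use the fmd module API: the queue type, the function names, and the `delivered` out-parameter are all hypothetical, and the out-parameter exists only so the model can count losses (a real receive entry point reports nothing back to the daemon). The model assumes that an event is dequeued as soon as the receive callback returns, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not fmd code: a module queue of integer event IDs. */
#define	MODEL_QLEN	8

typedef struct model_queue {
	int mq_events[MODEL_QLEN];
	size_t mq_head;		/* index of the next event to post */
	size_t mq_count;	/* number of events enqueued so far */
	size_t mq_lost;		/* events dequeued without being delivered */
} model_queue_t;

/*
 * Stand-in for a non-transport receive entry point. The 'delivered' flag is
 * bookkeeping for this model only; the dispatcher never acts on it.
 */
typedef void (*model_recv_f)(int event, bool *delivered);

static void
model_enqueue(model_queue_t *q, int event)
{
	assert(q->mq_count < MODEL_QLEN);
	q->mq_events[q->mq_count++] = event;
}

/*
 * Post each queued event exactly once. The event is dequeued as soon as the
 * callback returns, whether or not the callback managed to deliver it.
 */
static void
model_dispatch(model_queue_t *q, model_recv_f recv)
{
	while (q->mq_head < q->mq_count) {
		bool delivered = false;

		recv(q->mq_events[q->mq_head], &delivered);
		q->mq_head++;		/* unconditional dequeue */
		if (!delivered)
			q->mq_lost++;
	}
}

/* Example receiver whose endpoint is unreachable for even event IDs. */
static void
model_flaky_recv(int event, bool *delivered)
{
	*delivered = (event % 2) != 0;
}
```

In this model the only way for `model_flaky_recv()` to avoid losing an event would be to block inside the callback until delivery succeeds, which stalls the rest of the queue.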

On the other hand, in the event of a failure to deliver a given event, the API for transport modules provides mechanisms to ask fmd to either:

a) keep the event in the queue so that it can be posted to the module again to retry delivery

b) temporarily suspend the module's event queue (new events continue to be enqueued, but are not posted to the transport module)

The combination of these mechanisms allows transport modules to sanely handle both transient and prolonged outages of the remote endpoint.
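The two mechanisms above can be sketched with a second stand-alone model in C. Again, this is not fmd code: the `xq_*` names, the status values, and the rule "suspend after three consecutive failures" are assumptions made for illustration, and the real entry points and return codes are those documented in chapter 4 of the manual. The sketch only shows how retry (a) and suspend (b) combine so that no event is dropped and nothing blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	XQ_LEN	8

/* Hypothetical send results, modelled on option (a) above. */
typedef enum xq_status {
	XQ_SENT,	/* delivered; the event may be dequeued */
	XQ_RETRY	/* not delivered; keep it queued and post it again */
} xq_status_t;

typedef struct xq {
	int xq_events[XQ_LEN];
	size_t xq_head;		/* index of the next event to post */
	size_t xq_count;	/* number of events enqueued so far */
	bool xq_suspended;	/* option (b): posting is paused */
} xq_t;

typedef xq_status_t (*xq_send_f)(xq_t *q, int event);

/* Enqueueing is allowed even while the queue is suspended. */
static bool
xq_enqueue(xq_t *q, int event)
{
	assert(q != NULL);
	if (q->xq_count == XQ_LEN)
		return (false);
	q->xq_events[q->xq_count++] = event;
	return (true);
}

/*
 * One dispatch pass. Events are posted in order until the queue is empty,
 * the queue is suspended, or a send asks for a retry. A retried event stays
 * at the head and is posted again on the next pass. Returns the number of
 * events delivered in this pass.
 */
static size_t
xq_dispatch(xq_t *q, xq_send_f send)
{
	size_t sent = 0;

	while (!q->xq_suspended && q->xq_head < q->xq_count) {
		if (send(q, q->xq_events[q->xq_head]) == XQ_RETRY)
			break;
		q->xq_head++;
		sent++;
	}
	return (sent);
}

static void
xq_resume(xq_t *q)
{
	q->xq_suspended = false;
}

/* Simulated remote endpoint state, for the example sender only. */
static bool xq_endpoint_up = true;
static int xq_failures = 0;

/*
 * Example sender: retry on a transient failure, and suspend the queue once
 * three consecutive attempts have failed (treated as a prolonged outage).
 */
static xq_status_t
xq_example_send(xq_t *q, int event)
{
	(void) event;
	if (xq_endpoint_up) {
		xq_failures = 0;
		return (XQ_SENT);
	}
	if (++xq_failures >= 3)
		q->xq_suspended = true;
	return (XQ_RETRY);
}
```

A caller would run `xq_dispatch()` periodically and call `xq_resume()` once it detects that the endpoint is reachable again; events enqueued during the outage are then delivered in their original order.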

Thus this ticket is to extend fmd’s transport module API to support the notion of a write-only transport, where the sending endpoint is an fmd instance and the receiving endpoint is some non-FMA-aware entity. In this case, the sending side of the transport will configure the event subscriptions on behalf of the receiving side, and fmd will not post transport control events to the transport module.

#1

Updated by Rob Johnston 7 months ago

  • Description updated (diff)
#2

Updated by Rob Johnston 7 months ago

A preliminary fix has been pushed to the '12500' branch of https://github.com/rejohnst/illumos-gate
