Project

General

Profile

Actions

Bug #14419

closed

iprb transmit watchdog somewhat overzealous

Added by Joshua M. Clulow 7 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

The iprb driver attempts to detect a transmit stall by keeping the transmit time of the last outstanding frame submitted to the controller. Each time we submit a command to the controller, we update tx_wdog with the current time. We also update tx_wdog each time we reclaim a command that the controller has completed. In iprb_periodic(), we check to make sure this value is not more than 15 seconds in the past, and if it is we reset the controller.

Unfortunately, we only seem to trigger the reclaim routine when we're trying to transmit another packet. If the link is not particularly busy, and no data is transmitted for 15 seconds, it seems like we will spuriously reset the controller which causes a variety of other issues on a system that would have otherwise been working correctly.

A report from the field suggests that by pinging the computer with iprb from another computer, the problem goes away; presumably because we are sending an echo reply each second and keeping the watchdog at bay.

One way to fix this is to call iprb_cmd_reclaim() in the periodic routine, right before we check the watchdog value, to ensure that things are moving along. Indeed, an iprb with this patch applied improves the behaviour.

Another question is: in iprb_cmd_submit(), we set the CB_CMD_I flag on the submitted command iff we believe it is the last available command; i.e., if we have dispatched 128 commands to the controller and heard nothing back. I can't see how this would trigger a reclaim, though, unless there was already another packet about to be sent out -- which in some cases on idle systems there may not be.


Related issues

Related to illumos gate - Bug #14078: null pointer dereference crashes from iprb nicsDuplicate

Actions
Actions #1

Updated by Joshua M. Clulow 7 months ago

  • Description updated (diff)
Actions #2

Updated by Joshua M. Clulow 7 months ago

  • Related to Bug #14078: null pointer dereference crashes from iprb nics added
Actions #3

Updated by Electric Monk 7 months ago

  • Gerrit CR set to 1964
Actions #4

Updated by Robert Mustacchi 7 months ago

  • Assignee set to Joshua M. Clulow
Actions #5

Updated by Joshua M. Clulow 7 months ago

Testing Notes

I built the iprb module with this change and gave it to a user on IRC who confirms that it makes networking work as they expect on their machine.

I have also done the regular RTI build with the same diff, though I have not installed the bits on a machine with the hardware as it is a rare and old part, and I don't have one available myself.

Actions #6

Updated by Electric Monk 7 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit ab3f6e90e6b1d8edee27c66beb8e53bc6033fb2a

commit  ab3f6e90e6b1d8edee27c66beb8e53bc6033fb2a
Author: Joshua M. Clulow <josh@sysmgr.org>
Date:   2022-01-25T20:49:19.000Z

    14419 iprb transmit watchdog somewhat overzealous
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF