Bug #365

network stack hangs on heavy IO

Added by Alessandro Gervaso over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: kernel
Target version: -
Difficulty: Medium
Start date: 2010-10-22
Due date:
% Done: 0%
Spent time: -
Tags: needs-triage

Description

During heavy file transfers over NFS or CIFS, the network hangs and the machine eventually becomes completely unresponsive.
A hard reboot is needed, because after a "fast-reboot" the network still does not work and the two e1000 interfaces are missing.
The NICs are two on-board igb ports and a dual-port e1000 card, all configured in a single link aggregation with two VLANs on top.

I've had this problem with NexentaStor (both commercial and community editions) and with OpenIndiana b147, and I can reproduce it.
Sometimes it runs for hours before hanging, but often it hangs after only a few minutes.

Let me know if I can provide more information.

diag.txt (2.1 kB) Alessandro Gervaso, 10/22/2010 08:22 am

dladm.txt (525 Bytes) Alessandro Gervaso, 10/22/2010 08:22 am

fmadm.txt (3 kB) Alessandro Gervaso, 10/24/2010 02:00 pm

prtconf.txt (62.5 kB) Alessandro Gervaso, 10/24/2010 02:00 pm

History

Updated by Garrett D'Amore over 3 years ago

If you can provide precise steps to reproduce, then I can begin the work to analyze it further.

I'm not entirely sure it's a software bug, for one. It's entirely possible that the lockup is due to thermal or other issues on your system, for example. If I can reproduce it on another instance of the hardware, then I've got a good chance of analyzing it.

I'd also like fmadm -v faulty, and prtconf -vp output.
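
For anyone gathering the same diagnostics, here is a minimal sketch that runs the two commands named above and saves their output to text files. The command arguments are copied verbatim from this comment (adjust the order if your `fmadm` expects options after the subcommand). The `collect` helper, the `diag` output directory, and the file names are illustrative assumptions, not part of any tool.

```python
import subprocess
from pathlib import Path

# Commands as quoted in the comment above; argument order copied verbatim.
# Output file names are an assumption chosen for illustration.
COMMANDS = {
    "fmadm.txt": ["fmadm", "-v", "faulty"],
    "prtconf.txt": ["prtconf", "-vp"],
}


def collect(commands, out_dir):
    """Run each command and save its combined stdout/stderr to out_dir/<name>.

    Returns a dict mapping file name to exit code, or None when the
    executable could not be found.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
            )
            text, code = proc.stdout, proc.returncode
        except FileNotFoundError:
            text, code = f"command not found: {argv[0]}\n", None
        (out / name).write_text(text)
        results[name] = code
    return results


if __name__ == "__main__":
    for name, code in collect(COMMANDS, "diag").items():
        print(name, "exit code:", code)
```

Both commands usually need to be run as root to produce complete output.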

Updated by Alessandro Gervaso over 3 years ago

Hi Garrett,
the machine was tested (on Linux) by the vendor for about a week before shipment, so the hardware should be OK.

There are really no specific steps to reproduce the problem: just start a file transfer via CIFS, NFS or rsync and the network will hang after a while, within a few minutes in the "better" case and after a few hours in the worst.
The same thing also happened while the output of a computation running on another node was being written.
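
Since the only trigger described is sustained transfer load, a small load generator may make the test more repeatable. The sketch below is an assumption-laden illustration, not a confirmed reproducer: it writes data in fixed-size chunks to a directory (for example a client-side mount of the server's share; `/mnt/share` is a hypothetical path) and prints timestamped progress. If the network hangs hard, the write simply blocks and the last line printed gives the approximate time of the hang.

```python
import os
import sys
import time


def write_load(target_dir, total_bytes, chunk_size=1 << 20, report_every=256 << 20):
    """Write total_bytes into target_dir/loadtest.bin in chunk_size pieces.

    Prints a timestamped progress line roughly every report_every bytes.
    Returns (bytes_written, elapsed_seconds, slowest_chunk_seconds).
    """
    path = os.path.join(target_dir, "loadtest.bin")
    chunk = os.urandom(chunk_size)
    written = 0
    next_report = report_every
    slowest = 0.0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            t0 = time.monotonic()
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())  # push the data out instead of leaving it cached
            slowest = max(slowest, time.monotonic() - t0)
            written += n
            if written >= next_report:
                print(f"{time.ctime()}: {written} bytes written", flush=True)
                next_report += report_every
    return written, time.monotonic() - start, slowest


if __name__ == "__main__":
    # Usage: python load.py /mnt/share 100   (directory, size in GiB)
    target = sys.argv[1]
    gib = int(sys.argv[2])
    print(write_load(target, gib << 30))
```

A large slowest-chunk time without a full hang may also be worth noting in the report, although a hang that takes hours to appear will still need a correspondingly large transfer size.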

After the hang, the only traffic I see on the interfaces is LACP control packets, like "LACPv1, length 110".

I can provide any additional information, and SSH access too if necessary.

Thanks.

Updated by Garrett D'Amore over 3 years ago

  • Status changed from New to Feedback

It would appear that your system has taken a few faults on the PCI express bridge that hosts the pair of "igb" interfaces.

The fault could be a fault in one of the igb devices, or in the PCI express bridge itself.

In any case, I recommend removing the igb interfaces from the aggregation and trying again.

I'm placing this bug in feedback state.

Let me know if it occurs again after doing that, or if you are able to reproduce this problem on another system.

At the moment, I believe the problem is hardware in nature.

Updated by Alessandro Gervaso over 3 years ago

I've done some further testing over the past few weeks.

I ran the same network traffic without the aggregation, using only the two igb interfaces: the machine hung.
I disabled the onboard igb interfaces in /etc/system and used the e1000g interfaces in a link aggregation: the machine hung.
I used the two e1000g interfaces without the aggregation: the machine hung.

So I suspected the problem was in the dual-port e1000 card I had fitted in the machine. I replaced it with a spare quad-port e1000 card I had lying around, and it has been working for over 14 hours so far; the igb driver is still disabled.

I should still try a link aggregation using only the two igb interfaces, without any add-on card, to isolate the problem. Could it be related to the PCIe slot used for the add-on card?

Updated by Garrett D'Amore over 3 years ago

  • Status changed from Feedback to Closed

The problem could indeed be the PCI slot.

In any case, this does not appear to be a software bug, but rather problematic or defective hardware. I'm closing it.
