Actions
Bug #13359
closedmlxcx_update_link_state can race against mlxcx_register_mac
Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:
100%
Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
Description
After the changes to make mlxcx do all the commands during attach asynchronously, we now have a much wider window for us to get and try to handle a link state change notification before mlxcx_register_mac has finished setting up mlx_mac_hdl.
If this happens, we get a null deref:
panic[cpu53]/thread=fffffb04508c6c20: BAD TRAP: type=e (#pf Page fault) rp=fffffb04508c69f0 addr=9c occurred in module "mac" due to a NULL pointer dereference sched: #pf Page fault Bad kernel fault at addr=0x9c pid=0, pc=0xfffffffffbbd4f71, sp=0xfffffb04508c6ae0, eflags=0x10247 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de> cr2: 9c cr3: 14000000 cr8: 0 rdi: 0 rsi: 1 rdx: 1 rcx: 1 r8: fffffdc594029dc8 r9: 0 rax: 0 rbx: fffffdc592dafc80 rbp: fffffb04508c6b10 r10: 1 r11: fffffb04508c6c20 r12: fffffdc5714f5dc0 r13: fffffdc5714f5dc0 r14: fffffdc592dafc80 r15: fffffdc579897508 fsb: 0 gsb: fffffdc56f15e000 ds: 4b es: 4b fs: 0 gs: 1c3 trp: e err: 2 rip: fffffffffbbd4f71 cs: 30 rfl: 10247 rsp: fffffb04508c6ae0 ss: 38 Warning - stack not written to the dump buffer fffffb04508c68f0 unix:die+c6 () fffffb04508c69e0 unix:trap+11c1 () fffffb04508c69f0 unix:cmntrap+e9 () fffffb04508c6b10 mac:mac_link_update+1 () fffffb04508c6b50 mlxcx:mlxcx_link_state_task+4c () fffffb04508c6c00 genunix:taskq_thread+2cd () fffffb04508c6c10 unix:thread_start+b ()
This also exposes the fact that currently, the locking on the mlxcx_t is probably inadequate to cover this case and cases like it. We might get away with just a if (mlxp->mlx_mac_hdl != NULL)
in mlxcx_update_link_state
on x86, but if we ever run on another less-well-ordered CPU that probably won't cut it.
Updated by Alex Wilson over 1 year ago
- Gerrit CR set to 1100
Testing done so far on cr#1100, ps#2:
- Built a SmartOS PI with updated mlxcx
- Booted on a box with a connectx-5 10 times in a row (this box previously would hit the null ptr deref about one in every 10 boots on average), created an aggr over the mlxcx interfaces and brought up an IP to check it passes traffic still
- Booted on a box with a connectx-4 Lx, created aggr and checked IP
- Ran illumos nightly with patch on omnios r151036
Updated by Electric Monk about 1 year ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 80d1a7bde98a8ab2881940a6fe6775073564f253
commit 80d1a7bde98a8ab2881940a6fe6775073564f253 Author: Alex Wilson <alex@cooperi.net> Date: 2021-04-06T15:38:25.000Z 13359 mlxcx_update_link_state can race against mlxcx_register_mac 13370 mlxcx_intr_n doing redundant check on mleqe_event_type Reviewed by: Robert Mustacchi <rm@fingolfin.org> Reviewed by: Andy Fiddaman <andy@omnios.org> Reviewed by: Paul Winder <paul@winder.uk.net> Approved by: Dan McDonald <danmcd@joyent.com>
Actions