Bug #13966
Closed
strong IP hostmodel breaks DHCP
100%
Description
When the ipv4 hostmodel property is set to strong (see ipadm(1M)), the initial DHCP lease succeeds, but if the link flaps for any reason, dhcpagent is unable to send any subsequent requests. With dhcpagent logging set to debug, you see:
2021-07-22T21:32:31.000000+00:00 localhost-empty-hostname /sbin/dhcpagent[127]: [ID 490758 daemon.error] send_pkt_internal: cannot send REQUEST packet to server (will retry in 4 seconds): Network is unreachable
repeated (with increasing backoff intervals).
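To confirm the backoff pattern from a longer log, the retry delays can be pulled out of the messages. This is a minimal sketch; the regular expression is inferred from the single sample line above, and the helper name `retry_intervals` is hypothetical.

```python
import re

# Pattern inferred from the one dhcpagent message quoted in this ticket.
RETRY_RE = re.compile(
    r"cannot send (?P<type>\w+) packet to server "
    r"\(will retry in (?P<secs>\d+) seconds\): (?P<err>.+)$"
)


def retry_intervals(lines):
    """Return the retry delays (in seconds) found in the given log lines."""
    delays = []
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            delays.append(int(match.group("secs")))
    return delays


sample = (
    "2021-07-22T21:32:31.000000+00:00 localhost-empty-hostname "
    "/sbin/dhcpagent[127]: [ID 490758 daemon.error] send_pkt_internal: "
    "cannot send REQUEST packet to server (will retry in 4 seconds): "
    "Network is unreachable"
)
print(retry_intervals([sample]))
```

A steadily growing list of delays with no successful send in between would match the behaviour described here.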
The most immediate effect of this is that the default gateway is never re-added after the link is restored. Doing a little investigation with truss(1), the relevant sequence appears to be:
so_socket(PF_INET, SOCK_DGRAM, IPPROTO_IP, 0x00000000, SOV_XPG4_2) = 4
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, 0x08047438, 4, SOV_DEFAULT) = 0
bind(4, 0x0804743C, 16, SOV_XPG4_2) = 0
    AF_INET name = 0.0.0.0 port = 68
setsockopt(4, ip, 66, 0x0804748C, 4, SOV_DEFAULT) = 0
setsockopt(4, ip, 69, 0x0807E2C4, 4, SOV_DEFAULT) = 0
setsockopt(4, ip, 67, 0x0804748B, 1, SOV_DEFAULT) = 0
setsockopt(4, ip, 65, 0x08047490, 4, SOV_DEFAULT) = 0
ioctl(0, SIOCGLIFFLAGS, 0x08047494) = 0
stat("/etc/default/dhcpagent", 0x08046DD4) = 0
fstat(6, 0x08046454) = 0
time() = 1626989551
getpid() = 127 [1]
putmsg(6, 0x080464F8, 0x08046504, 0) = 0
fstat(6, 0x08046454) = 0
putmsg(6, 0x080464F8, 0x08046504, 0) = 0
fstat(6, 0x08046454) = 0
time() = 1626989551
getpid() = 127 [1]
putmsg(6, 0x080464F8, 0x08046504, 0) = 0
stat("/etc/default/dhcpagent", 0x08046F44) = 0
fstat(6, 0x080465C4) = 0
time() = 1626989551
getpid() = 127 [1]
putmsg(6, 0x08046668, 0x08046674, 0) = 0
open("/etc/hostname.admin0", O_RDONLY) Err#2 ENOENT
fstat(6, 0x080465C4) = 0
time() = 1626989551
getpid() = 127 [1]
putmsg(6, 0x08046668, 0x08046674, 0) = 0
fstat(6, 0x080468A4) = 0
time() = 1626989551
getpid() = 127 [1]
putmsg(6, 0x08046948, 0x08046954, 0) = 0
sendto(4, "010106\01D03 ~ A\080\0\0".., 300, 32768, 0x0807E668, 16) Err#128 ENETUNREACH
    AF_INET to = 255.255.255.255 port = 67
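truss prints the four IP-level socket options in the trace by number only. The sketch below maps those numbers to names. The values are as I recall them from the illumos <netinet/in.h> header and should be treated as assumptions to check against the header on the affected system; the helper name `option_name` is hypothetical.

```python
# Assumed option numbers (from memory of illumos <netinet/in.h>); verify
# against the header on the target system before relying on them.
IP_LEVEL_OPTIONS = {
    0x41: "IP_BOUND_IF",       # bind the socket to an interface index
    0x42: "IP_UNSPEC_SRC",     # allow an unspecified source address
    0x43: "IP_BROADCAST_TTL",  # TTL to use for broadcast packets
    0x45: "IP_DHCPINIT_IF",    # DHCP-specific receive behaviour
}


def option_name(optnum):
    """Translate a numeric IP-level option from truss output to a name."""
    return IP_LEVEL_OPTIONS.get(optnum, f"unknown ({optnum})")


# The four setsockopt(4, ip, N, ...) calls, in the order they appear above.
for optnum in (66, 69, 67, 65):
    print(optnum, option_name(optnum))
```

If the mapping holds, the one-byte argument to option 67 in the trace is consistent with a TTL value, and the other three take four-byte arguments.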
Performing ipadm disable-addr -t addrobj; ipadm enable-addr -t addrobj seems to unstick things, but obviously isn't useful unless one has console access to the server.
Updated by Jason King about 2 years ago
With a bit more troubleshooting, it appears ip_select_route() calls ip_verify_src_on_ill() with a src address of :: (all zeros), and this fails.
Still unknown is how/why the verification succeeds the first time (or after disabling/re-enabling the address).
Updated by Dan McDonald about 2 years ago
I performed a subset test using a VMware Fusion VM:
1.) Capture "netstat -rnc" and "echo '::ire -v | mdb -k'"
2.) Use VMware Fusion to disconnect the network adapter and then reconnect it.
3.) Repeat step 1
I saw no obvious differences between the two routing tables.
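Comparing the two captures by eye is error-prone, so a small diff helper can do it instead. This is only a sketch: the routing-table text below is hypothetical sample data (not output from this ticket), and `routing_diff` is a hypothetical helper name.

```python
import difflib


def routing_diff(before, after):
    """Return unified-diff lines between two captured routing tables."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


# Hypothetical captures, saved before and after the link flap.
before = "default 192.0.2.1 UG\n192.0.2.0 192.0.2.10 U"
after = "192.0.2.0 192.0.2.10 U"

for line in routing_diff(before, after):
    print(line)
```

An empty result means the two captures are identical, which corresponds to the observation reported here.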
I'd highly recommend following the bouncing DHCP packet in your failure case, and then again in the disable-addr/enable-addr case. See how ip_select_route() handles things afterwards.
Updated by Jason King about 2 years ago
With more troubleshooting: ipadm disable-addr -t <obj> resets the IP of the interface to 0.0.0.0/0, and ip_verify_src_on_ill() succeeds in this instance. This makes sense: after the link comes up, the DHCP request is from 0.0.0.0 to 255.255.255.255, so when the interface address is 0.0.0.0/0 the source address check succeeds.
However, when the link goes down and then comes back up, the interface retains its previous address until it receives different information from the DHCP server. The request still originates from 0.0.0.0, which is then not an address on the interface, so it fails the strong hostmodel check.
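The two cases can be illustrated with a toy model of the check. This is a deliberately simplified sketch and not the logic of ip_verify_src_on_ill(); it only assumes that the strong hostmodel requires the packet's source address to be one of the addresses configured on the egress link. The function name and the 192.0.2.10 lease address are hypothetical.

```python
from ipaddress import IPv4Address


def strong_src_check(src, addrs_on_link):
    """Toy model: pass only if src is configured on the egress link."""
    configured = {IPv4Address(addr) for addr in addrs_on_link}
    return IPv4Address(src) in configured


# After disable-addr/enable-addr the interface address is 0.0.0.0, so a
# DHCP request sourced from 0.0.0.0 passes the check.
print(strong_src_check("0.0.0.0", ["0.0.0.0"]))

# After a link flap the old lease (hypothetical 192.0.2.10) is still
# configured, so a request sourced from 0.0.0.0 fails the check.
print(strong_src_check("0.0.0.0", ["192.0.2.10"]))
```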
dhcpagent does set some DHCP-specific IP options; how they impact source selection is still to be determined. My suspicion (this still needs to be verified) is that in the non-DHCP case, when INADDR_ANY is used as the source address, an interface (and the corresponding IP address) is selected for the source IP prior to reaching this check. In the DHCP case, however, that selection is perhaps bypassed (which seems reasonable: the requests need to egress on a specific link controlled by DHCP rather than letting the kernel pick), and the request then runs afoul of the later hostmodel check.
From further investigation, it appears the undocumented IP_UNSPEC_SRC option is what allows dhcpagent to bypass the step where a source IP address is chosen when INADDR_ANY is used for the source address.
Updated by Jason King about 2 years ago
- Subject changed from strong IP hostmodel breaks DHCP to strong IP hostmodel breaks DHCP
Updated by Jason King about 2 years ago
- I configured the network interface to DHCP, rebooted to verify it boots with a DHCP address
- I verified a default route was present for the interface
- I then disconnected the interface in VMware
- I verified the default route entry had been removed
- I then reconnected the interface in VMware
- I verified the default route had been re-added.
I then repeated the test with hostmodel=strong, and verified the default route was not re-added after reconnecting the interface.
Next I tested the fix by booting a BE with the fix and repeating the same tests. In this instance, the default route was re-added with both hostmodel=weak and hostmodel=strong.
Additionally, to attempt to verify the 'normal' hostmodel behavior (when IP_UNSPEC_SRC is not enabled on the socket), I created a small test program (attached) that let me create a regular TCP connection from a specific source IP as well as a specific interface (using the ifindex value). I then added an additional interface to the VM in VMware (so that I would have two potential source interfaces).
Starting with a stock OmniOS VM, I then tried to send a packet out the 'wrong' interface (I specified the source IP of interface A while choosing the ifindex of interface B). With hostmodel=weak, this succeeds; with hostmodel=strong, this fails with test: connect: Cannot assign requested address. I then tried to send a packet out the 'correct' interface (using the source IP and ifindex of interface A), which succeeded.
I then repeated the test on the BE with the change and saw the same results.
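The attached test program is not reproduced in this ticket. As a rough idea of what such a test involves, here is a minimal sketch (not the attached program) that opens a TCP connection from a chosen source IP and, optionally, a chosen interface index. It assumes the interface is selected with the IP_BOUND_IF socket option and that its value on illumos is 0x41 when Python's socket module does not export it; `connect_from` is a hypothetical helper name.

```python
import socket

# Assumption: IP_BOUND_IF is 0x41 on illumos. Use the platform's own value
# if the socket module happens to export one.
IP_BOUND_IF = getattr(socket, "IP_BOUND_IF", 0x41)


def connect_from(src_ip, dst_ip, dst_port, ifindex=None, timeout=5.0):
    """Open a TCP connection from src_ip, optionally bound to ifindex."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)
        if ifindex is not None:
            # Pin the socket to one interface, independent of the source IP.
            sock.setsockopt(socket.IPPROTO_IP, IP_BOUND_IF, ifindex)
        sock.bind((src_ip, 0))  # fixed source address, ephemeral port
        sock.connect((dst_ip, dst_port))
        return sock
    except OSError:
        sock.close()
        raise
```

The 'wrong interface' case would then look something like `connect_from(addr_of_A, target, 80, ifindex=socket.if_nametoindex(name_of_B))`, where `addr_of_A`, `target`, and `name_of_B` are placeholders for the VM's actual values. The failure reported under hostmodel=strong would surface as an OSError.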
Updated by Electric Monk about 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit d48f713ac62c289b9e5d2c48645e62f4e644fe2a
commit d48f713ac62c289b9e5d2c48645e62f4e644fe2a
Author: Jason King <jason.brian.king@gmail.com>
Date: 2021-07-30T16:35:12.000Z

    13966 strong IP hostmodel breaks DHCP
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Updated by Jason King about 2 years ago
As a bit of a followup (in case someone else hits it), it was discovered that this bug also prevents ping -i <interface> ip from working (e.g. trying to ping the gateway on a subnet using the interface on the same subnet returns a network not available error). The fix for this bug also fixes ping -i.