Bug #13966

strong IP hostmodel breaks DHCP

Added by Jason King 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

When the IPv4 hostmodel property is set to strong (see ipadm(1M)), the initial DHCP lease succeeds, but if the link flaps for any reason, dhcpagent is unable to send any subsequent requests. With dhcpagent logging set to debug, you see:

2021-07-22T21:32:31.000000+00:00 localhost-empty-hostname /sbin/dhcpagent[127]: [ID 490758 daemon.error] send_pkt_internal: cannot send REQUEST packet to server (will retry in 4 seconds): Network is unreachable

repeated (with increasing backoff intervals).

The most immediate effect of this is that the default gateway is never re-added after the link is restored. A little investigation with truss(1) suggests the relevant sequence is:

so_socket(PF_INET, SOCK_DGRAM, IPPROTO_IP, 0x00000000, SOV_XPG4_2) = 4
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, 0x08047438, 4, SOV_DEFAULT) = 0
bind(4, 0x0804743C, 16, SOV_XPG4_2)             = 0
        AF_INET  name = 0.0.0.0  port = 68
setsockopt(4, ip, 66, 0x0804748C, 4, SOV_DEFAULT) = 0
setsockopt(4, ip, 69, 0x0807E2C4, 4, SOV_DEFAULT) = 0
setsockopt(4, ip, 67, 0x0804748B, 1, SOV_DEFAULT) = 0
setsockopt(4, ip, 65, 0x08047490, 4, SOV_DEFAULT) = 0
ioctl(0, SIOCGLIFFLAGS, 0x08047494)             = 0
stat("/etc/default/dhcpagent", 0x08046DD4)      = 0
fstat(6, 0x08046454)                            = 0
time()                                          = 1626989551
getpid()                                        = 127 [1]
putmsg(6, 0x080464F8, 0x08046504, 0)            = 0
fstat(6, 0x08046454)                            = 0
putmsg(6, 0x080464F8, 0x08046504, 0)            = 0
fstat(6, 0x08046454)                            = 0
time()                                          = 1626989551
getpid()                                        = 127 [1]
putmsg(6, 0x080464F8, 0x08046504, 0)            = 0
stat("/etc/default/dhcpagent", 0x08046F44)      = 0
fstat(6, 0x080465C4)                            = 0
time()                                          = 1626989551
getpid()                                        = 127 [1]
putmsg(6, 0x08046668, 0x08046674, 0)            = 0
open("/etc/hostname.admin0", O_RDONLY)          Err#2 ENOENT
fstat(6, 0x080465C4)                            = 0
time()                                          = 1626989551
getpid()                                        = 127 [1]
putmsg(6, 0x08046668, 0x08046674, 0)            = 0
fstat(6, 0x080468A4)                            = 0
time()                                          = 1626989551
getpid()                                        = 127 [1]
putmsg(6, 0x08046948, 0x08046954, 0)            = 0
sendto(4, "010106\01D03 ~ A\080\0\0".., 300, 32768, 0x0807E668, 16) Err#128 ENETUNREACH
        AF_INET  to = 255.255.255.255  port = 67
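The numeric setsockopt() options in the trace (66, 69, 67, 65) can be decoded with a small lookup like the one below. This is only a sketch: the values are assumed from my reading of the illumos <netinet/in.h>, and ip_opt_name() is a hypothetical helper, so check the header on your own system before relying on it.

```c
#include <stddef.h>

/*
 * ASSUMPTION: these values come from my reading of the illumos
 * <netinet/in.h>. Other platforms use different numbers.
 * ip_opt_name() is a hypothetical helper, not part of any library.
 */
struct ip_opt {
	int		value;
	const char	*name;
};

static const struct ip_opt ip_opts[] = {
	{ 0x41, "IP_BOUND_IF" },	/* 65: bind socket to an ifindex */
	{ 0x42, "IP_UNSPEC_SRC" },	/* 66: use unspecified source address */
	{ 0x43, "IP_BROADCAST_TTL" },	/* 67: TTL to use for broadcasts */
	{ 0x45, "IP_DHCPINIT_IF" },	/* 69: DHCP-specific receive behaviour */
};

/* Map a numeric IPPROTO_IP option, as printed by truss, to its name. */
const char *
ip_opt_name(int value)
{
	size_t i;

	for (i = 0; i < sizeof (ip_opts) / sizeof (ip_opts[0]); i++) {
		if (ip_opts[i].value == value)
			return (ip_opts[i].name);
	}
	return ("unknown");
}
```

Read against the trace under that assumption, the socket has IP_UNSPEC_SRC, IP_DHCPINIT_IF, IP_BROADCAST_TTL and IP_BOUND_IF set, in that order, before the failing sendto().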

Performing ipadm disable-addr -t addrobj; ipadm enable-addr -t addrobj seems to unstick things, but obviously isn't useful unless one has console access to the server.


Files

test.c (1.99 KB) Jason King, 2021-07-26 05:01 PM
Actions #1

Updated by Jason King 3 months ago

With a bit more troubleshooting, it appears ip_select_route() calls ip_verify_src_on_ill() with a src address of :: (all zeros), and this fails.

Still unknown is how or why this verification succeeds the first time (or after disabling and re-enabling the address).

Actions #2

Updated by Dan McDonald 3 months ago

I performed a subset test using a VMware Fusion VM:

1.) Capture "netstat -rnc" and "echo '::ire -v | mdb -k'"

2.) Use VMware Fusion to disconnect the network adapter and then reconnect it.

3.) Repeat step 1

I saw no obvious differences between the two routing tables.

I'd highly recommend following the bouncing DHCP packet in your failure case, and then again in the disable-addr/enable-addr case. See how ip_select_route() handles things afterwards.

Actions #3

Updated by Jason King 3 months ago

More troubleshooting shows that ipadm disable-addr -t <obj> resets the IP of the interface to 0.0.0.0/0, and ip_verify_src_on_ill() succeeds in this instance. This makes sense -- after the link comes up, the DHCP request is from 0.0.0.0 to 255.255.255.255, so when the interface address is 0.0.0.0/0 the source address check succeeds.

However, when the link goes down and then comes back up, the interface retains its previous address until it receives different information from the DHCP server. The request still originates from 0.0.0.0, which no longer matches the interface address and so fails the strong hostmodel check.

dhcpagent does set some DHCP-specific IP options; how they impact source selection is still to be determined. My suspicion (i.e. this still needs to be verified) is that in the non-DHCP case, when INADDR_ANY is used as the source address, an interface (and the corresponding IP address) is selected for the source IP prior to reaching this check. In the DHCP case, however, that selection step is perhaps being bypassed (which seems reasonable -- the requests need to egress on the specific link controlled by DHCP rather than letting the kernel pick), and the request then runs afoul of the later hostmodel check.

-- From further investigation, it appears the undocumented IP_UNSPEC_SRC option is what allows dhcpagent to bypass the step where a source IP address is chosen when INADDR_ANY is used for the source address.
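The behaviour described in this comment can be summarised with a toy model. It is not the illumos code: toy_src_ok() and hostmodel_t are hypothetical names, and the real ip_verify_src_on_ill() works on kernel structures rather than bare addresses. The model only captures the observation that, with IP_UNSPEC_SRC leaving the source at 0.0.0.0, the strong check passes when the interface address is also 0.0.0.0 and fails when the interface still holds its old lease.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: a toy model of the ticket, not kernel code. */
typedef enum { HOSTMODEL_WEAK, HOSTMODEL_STRONG } hostmodel_t;

/*
 * Returns true if a packet with source address `src` may leave an
 * interface whose configured address is `ill_addr` (host byte order).
 * ASSUMPTION: weak never rejects, strong requires an exact match.
 */
bool
toy_src_ok(hostmodel_t model, uint32_t src, uint32_t ill_addr)
{
	if (model == HOSTMODEL_WEAK)
		return (true);
	return (src == ill_addr);
}
```

With an illustrative lease of 192.168.1.10 (0xC0A8010A), toy_src_ok(HOSTMODEL_STRONG, 0, 0xC0A8010A) is false, which corresponds to the link-flap case, while toy_src_ok(HOSTMODEL_STRONG, 0, 0) is true, which corresponds to the state after disable-addr/enable-addr.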

Actions #4

Updated by Jason King 3 months ago

  • Assignee set to Jason King
Actions #5

Updated by Jason King 3 months ago

  • Subject changed from strong IP hostmodel breaks DHCP to strong IP hostmodel breaks DHCP
Actions #6

Updated by Jason King 3 months ago

To test this, I first recreated the issue on a stock OmniOS VM (starting with hostmodel=weak):
  1. I configured the network interface for DHCP and rebooted to verify that it boots with a DHCP address.
  2. I verified a default route was present for the interface.
  3. I then disconnected the interface in VMware.
  4. I verified the default route entry had been removed.
  5. I then reconnected the interface in VMware.
  6. I verified the default route had been re-added.

I then repeated the test with hostmodel=strong, and verified the default route was not re-added after reconnecting the interface.

Next I tested the fix by booting a BE with the fix and repeated the same tests. In this instance, the default route was re-added with both hostmodel=weak and hostmodel=strong.

Additionally, to attempt to verify the 'normal' hostmodel behavior (when IP_UNSPEC_SRC is not enabled on the socket), I created a small test program (attached) that let me create a regular TCP connection from a specific source IP as well as a specific interface (using the ifindex value). I then added an additional interface to the VM in VMware (so that I would have two potential source interfaces).

Starting with a stock OmniOS VM, I then tried to send a packet out the 'wrong' interface (I specified the source IP of interface A while choosing the ifindex of interface B). With hostmodel=weak, this succeeds; with hostmodel=strong, it fails with test: connect: Cannot assign requested address. I then tried to send a packet out the 'correct' interface (using the source IP and ifindex of interface A), which succeeded.

I then repeated the test on the BE with the change and saw the same results.
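For readers without the attachment, a program of the kind described above might look roughly like the sketch below. It is not the attached test.c: connect_from() is a hypothetical helper, and the use of IP_BOUND_IF to pin the interface is an assumption (the option is compiled in only where the headers define it, so on other platforms the ifindex argument is ignored).

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a TCP connection to dst_ip:port using src_ip
 * as the source address and, where IP_BOUND_IF exists, the interface with
 * index ifindex (0 means "do not pin"). Returns the socket fd, or -1 with
 * errno set by the failing call (errno is not set for a bad address string).
 */
int
connect_from(const char *src_ip, unsigned int ifindex,
    const char *dst_ip, uint16_t port)
{
	struct sockaddr_in src, dst;
	int fd, saved;

	memset(&src, 0, sizeof (src));
	memset(&dst, 0, sizeof (dst));
	src.sin_family = dst.sin_family = AF_INET;
	dst.sin_port = htons(port);

	/* Reject malformed addresses before creating a socket. */
	if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1 ||
	    inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
		return (-1);

	if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return (-1);

#ifdef IP_BOUND_IF
	/* ASSUMPTION: IP_BOUND_IF takes an interface index. */
	if (ifindex != 0 && setsockopt(fd, IPPROTO_IP, IP_BOUND_IF,
	    &ifindex, sizeof (ifindex)) != 0)
		goto fail;
#else
	(void) ifindex;
#endif

	if (bind(fd, (struct sockaddr *)&src, sizeof (src)) != 0 ||
	    connect(fd, (struct sockaddr *)&dst, sizeof (dst)) != 0)
		goto fail;

	return (fd);

fail:
	saved = errno;		/* close() may overwrite errno */
	(void) close(fd);
	errno = saved;
	return (-1);
}
```

Calling it with the source IP of interface A and the ifindex of interface B is the 'wrong interface' case from the test above; on illumos the program would typically need to be linked with -lsocket -lnsl.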

Actions #7

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

commit  d48f713ac62c289b9e5d2c48645e62f4e644fe2a
Author: Jason King <jason.brian.king@gmail.com>
Date:   2021-07-30T16:35:12.000Z

    13966 strong IP hostmodel breaks DHCP
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>

Actions #8

Updated by Jason King 3 months ago

As a bit of a followup (in case someone else hits it), it was discovered that this bug also prevents ping -i <interface> ip from working (e.g. trying to ping the gateway on a subnet using the interface on the same subnet returns a network not available error). The fix for this bug also fixes ping -i.
