Bug #13839
closedsockfs should improve its inode generation
100%
Description
I had a system where netstat -u
was showing two different processes listening on port 80. While one of these processes was expected to do so, the other was not (thus surprise). Running pfiles
against each process however did not show the unexpected process listening on port 80. Looking closer, it turned out both listening sockets had the same major/minor number and the same inode value:
# ls -li /proc/254/fd/19 /proc/233/fd/6 40812 s--------- 0 root root 0 Jan 1 1970 /proc/233/fd/6 40812 s--------- 0 root root 0 Jan 1 1970 /proc/254/fd/19 # pfiles 254 ... 19: S_IFSOCK mode:0666 dev:563,0 ino:40812 uid:0 gid:0 rdev:0,0 O_RDWR|O_NONBLOCK FD_CLOEXEC SOCK_STREAM SO_REUSEADDR,SO_SNDBUF(400000),SO_RCVBUF(400000) sockname: AF_INET6 :: port: 80 ... # pfiles 233 ... 6: S_IFSOCK mode:0666 dev:563,0 ino:40812 uid:0 gid:0 rdev:0,0 O_RDWR|O_NONBLOCK FD_CLOEXEC SOCK_STREAM SO_REUSEADDR,SO_SNDBUF(400000),SO_RCVBUF(400000) sockname: AF_INET x.x.x.x port: 2380 ...
Looking closer, using mdb, I verified that each file_t
and vnode_t
were different:
> 0t233::pid2proc | ::pfiles -f ! awk 'NR==1 { print } $2 == 6 { print }' FILE FD FLAG VNODE OFFSET CRED CNT fffffe16f7d93750 6 83 fffffe17091bf580 0 fffffe16e440b178 1 > 0t254::pid2proc | ::pfiles -f ! awk 'NT==1 { print} $2 == 19 { print }' fffffe17076e28e0 19 83 fffffe1707254540 0 fffffe16e440b008 1
Then looking at how inode values are assigned to sockfs vnodes, in prgetfdinfo()
, we see
if (VOP_GETATTR(vp, &vattr, 0, file_cred, NULL) == 0) {
fdinfo->pr_major = getmajor(vattr.va_fsid);
fdinfo->pr_minor = getminor(vattr.va_fsid);
fdinfo->pr_rmajor = getmajor(vattr.va_rdev);
fdinfo->pr_rminor = getminor(vattr.va_rdev);
fdinfo->pr_ino = (ino64_t)vattr.va_nodeid;
We then see the following in socket_vop_getattr()
:
/*
* Calculate the amount of bitshift to a sonode pointer which will
* still keep it unique. See below.
*/
if (sonode_shift == 0)
sonode_shift = highbit(sizeof (struct sonode));
ASSERT(sonode_shift > 0);
so = VTOSO(vp);
...
/*
* If the va_nodeid is > MAX_USHORT, then i386 stats might fail.
* So we shift down the sonode pointer to try and get the most
* uniqueness into 16-bits.
*/
vap->va_nodeid = ((ino_t)so >> sonode_shift) & 0xFFFF;
vap->va_nlink = 0;
vap->va_size = 0;
Looking more in mdb:
> sonode_shift.18423/d sonode_shift.18423: sonode_shift.18423: 10 > fffffe1707254540::print vnode_t v_data v_data = 0xfffffe17027db010 > fffffe17091bf580::print vnode_t v_data v_data = 0xfffffe17027db290
Right shifting those values 10 bits yields the same lower 16-bits, causing the same inode value.
We should utilize a better strategy for generating inode numbers for sockfs vnodes.
Updated by Andy Fiddaman about 2 years ago
At first glance, the strategy seems like it might be reasonable if we need to generate a 16-bit ID, but it is shifting by one bit too many.
> ::sizeof struct sonode sizeof (struct sonode) = 0x290 > ::sizeof struct sonode | =j 1010010000 | | | | | +------ bit 4 mask 0x010 | +--------- bit 7 mask 0x080 +----------- bit 9 mask 0x200 > 0xfffffe17027db010 >> 9 =x 3ed8 > 0xfffffe17027db290 >> 9 =x 3ed9
Updated by Jason King about 2 years ago
My guess is that the original author probably forgot that highbit()
returns values 1-32, and 0 for no bits set, so the result is off by one (to use as a shift value).
However I'm not sure that the 16-bit inode requirement has been true since at least the mid-90s, and likely systems like a SysV vestige (and likely was never a Solaris limitation). I suspect at minimum using 32-bits should be fine.
Updated by Jason King almost 2 years ago
To test, I tried (using mdb) to identify a number of sockets that without the fix, would generate identical inode values:
> ::walk socket_cache | ::print struct sonode so_vnode | ::eval .>>0t10&ffff=J ! sort | uniq -d 26c3 488 5695 a9c8 bb67 e81a e974 fc1e fc35 fd68 fdf2
(the idea being to take the address of the socket's vnode_t
, shift right 10 bits, then sort the addresses and print the duplicate values).
Picking one of the values (fc1e) -- at the time of testing #13867 hadn't been integrated (it was the testing of this issue that led to the discovery #13867 ), so a duplicated value that represented two TCP sockets was chosen.
Using mdb, the expected inode values (after the fix) were obtained:
> ::walk socket_cache s| ::print struct sonode so_vnode | ::eval .>>0t10&ffff=J | ::grep .==fc1e | ::eval <s=J | ::eval .>>0t9&ffffffff=U 193198170 193198171 >
Then their existence was confirmed via a search in /proc
:
# find /proc -type s \( -inum 193198170 -o -inum 193198171 \) /proc/284/fd/17 /proc/284/fd/18
Examining the process with pfiles
confirmed they were two different sockets listening on two different ports:
# pfiles 284 ... 17: S_IFSOCK mode:0666 dev:563,0 ino:193198171 uid:0 gid:0 rdev:0,0 O_RDWR|O_NONBLOCK FD_CLOEXEC SOCK_STREAM SO_REUSEADDR,SO_SNDBUF(400000),SO_RCVBUF(400000) sockname: AF_INET6 :: port: 8443 18: S_IFSOCK mode:0666 dev:563,0 ino:193198170 uid:0 gid:0 rdev:0,0 O_RDWR|O_NONBLOCK FD_CLOEXEC SOCK_STREAM SO_REUSEADDR,SO_SNDBUF(400000),SO_RCVBUF(400000) sockname: AF_INET6 :: port: 443 ...
Finally, running netstat -aun
confirmed that the output was now correct:
# netstat -aun | grep 443 *.8443 *.* root 284 xxxxxxx 0 0 400000 0 LISTEN *.443 *.* root 284 xxxxxxx 0 0 400000 0 LISTEN
Updated by Electric Monk almost 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit e7434800e5d3d9760f83932621a63a9b102dee76
commit e7434800e5d3d9760f83932621a63a9b102dee76 Author: Jason King <jason.brian.king@gmail.com> Date: 2021-06-24T20:29:15.000Z 13839 sockfs should improve its inode generation Reviewed by: Andy Fiddaman <andy@omnios.org> Approved by: Joshua M. Clulow <josh@sysmgr.org>