Bug #1101
closed
NFS service fails to start after hostname change
100%
Description
If no share exists before the hostname is changed, nfsd does not start.
root@illumos-gate:~# share
root@illumos-gate:~# ps ax | grep nfs
100943 pts/1 R 0:00 grep nfs
root@illumos-gate:~#
root@illumos-gate:~# hostname illumos-gate2
root@illumos-gate2:~# share /testpool/NFS40
root@illumos-gate2:~# share
- /testpool/NFS40 rw ""
root@illumos-gate2:~# ps ax | grep nfs
100960 ? S 0:00 /usr/lib/nfs/nfsmapid
100965 ? S 0:00 /usr/lib/nfs/lockd
100993 pts/1 S 0:00 grep nfs
root@illumos-gate2:~#
root@illumos-gate2:~# svcs -x
svc:/network/nfs/server:default (NFS server)
 State: maintenance since Fri Jun 10 08:19:19 2011
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
   See: http://sun.com/msg/SMF-8000-KS
   See: nfsd(1M)
   See: /var/svc/log/network-nfs-server:default.log
Impact: This service is not running.

svc:/network/nfs/status:default (NFS status monitor)
 State: maintenance since Fri Jun 10 08:19:19 2011
Reason: Restarting too quickly.
   See: http://sun.com/msg/SMF-8000-L5
   See: statd(1M)
   See: /var/svc/log/network-nfs-status:default.log
Impact: This service is not running.
If a share already exists, everything works fine.
Updated by Igor Vasilyev about 11 years ago
root@illumos-gate:~# uname -a
SunOS illumos-gate 5.11 30.05.11 i86pc i386 i86pc Solaris
The system was built from illumos-gate as of 30.05.2011.
/var/svc/log/network-nfs-server:default.log
[ Jun 14 07:45:35 Enabled. ]
[ Jun 14 07:45:35 Executing start method ("/lib/svc/method/nfs-server start"). ]
couldn't register datagram_v MOUNTVERS
/lib/svc/method/nfs-server: mountd failed with 1
[ Jun 14 07:45:36 Method "start" exited with status 95. ]
[ Jun 14 07:45:36 Stopping for maintenance due to service_request. ]
[ Jun 14 07:45:36 Stopping for maintenance due to service_request. ]
Updated by sham pavman over 10 years ago
- Assignee set to sham pavman
- % Done changed from 0 to 20
Updated by sham pavman over 10 years ago
- % Done changed from 20 to 30
Hi,
After investigating this issue, I have a few observations to share.
The issue arises only when there is a temporary hostname change and does not occur otherwise.
After a reboot the hostname change is reverted anyway, and since this kind of change is not made often, I feel it's best that this issue be 'noted' rather than 'fixed'.
Any opinions?
Shampavman
Updated by Marcel Telka almost 9 years ago
sham pavman wrote:
Any opinions?
It should be fixed.
Updated by Marcel Telka over 7 years ago
- Status changed from New to In Progress
- Assignee changed from sham pavman to Marcel Telka
Updated by Marcel Telka over 7 years ago
- Subject changed from nfs does not start after change hostname to NFS service fails to start after hostname change
Updated by Marcel Telka over 7 years ago
Root cause:
Some RPC servers (for example, rpcbind and lockd) bind locally using the special HOST_SELF host name. Before the actual binding, this special host name is translated to a host address using the netdir_getbyname() call, which for the Local Transport Provider (IOW: ticlts, ticots, and ticotsord) returns addresses composed of the machine hostname and the service name. This is why we see the machine hostname in the rpcinfo output. For example, on the il-bld machine, I have this:
$ rpcinfo | grep il-bld
    100000    4    ticots    il-bld.rpc             rpcbind    superuser
    100000    3    ticots    il-bld.rpc             rpcbind    superuser
    100000    4    ticotsord il-bld.rpc             rpcbind    superuser
    100000    3    ticotsord il-bld.rpc             rpcbind    superuser
    100000    4    ticlts    il-bld.rpc             rpcbind    superuser
    100000    3    ticlts    il-bld.rpc             rpcbind    superuser
    100021    1    ticotsord il-bld.lockd           nlockmgr   1
    100021    2    ticotsord il-bld.lockd           nlockmgr   1
    100021    3    ticotsord il-bld.lockd           nlockmgr   1
    100021    4    ticotsord il-bld.lockd           nlockmgr   1
$
When an RPC client tries to connect to such an RPC server over the Local Transport Provider, it first translates the HOST_SELF_CONNECT special host name to a host address using netdir_getbyname(). Again, this yields the real hostname (il-bld in the example above) and the service name (rpc in the example above), separated by a dot.
This works well in the usual cases, but it causes a problem when the hostname changes. After the hostname change, rpcbind still listens on the old address, but when a client tries to connect to it, its netdir_getbyname() call returns an address containing the new hostname, so the client is unable to connect.
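The mismatch can be illustrated with a small, portable C sketch. It does not call netdir_getbyname(); the helpers compose_addr() and client_can_reach() are hypothetical and only model the "hostname.service" address format described above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not the straddr.c code: compose a loopback
 * transport address in the "<hostname>.<service>" form described in
 * the root cause analysis.  Returns 1 on success, 0 if buf is too small.
 */
static int
compose_addr(char *buf, size_t len, const char *hostname, const char *service)
{
	int n = snprintf(buf, len, "%s.%s", hostname, service);

	return (n >= 0 && (size_t)n < len);
}

/*
 * Hypothetical model of the failure: the server bound its address while
 * the machine was called server_hostname; the client composes its target
 * address from client_hostname (the current hostname).  Returns 1 if the
 * two addresses match, 0 otherwise.
 */
static int
client_can_reach(const char *server_hostname, const char *client_hostname,
    const char *service)
{
	char bound[256];
	char target[256];

	if (!compose_addr(bound, sizeof (bound), server_hostname, service) ||
	    !compose_addr(target, sizeof (target), client_hostname, service))
		return (0);

	return (strcmp(bound, target) == 0);
}
```

With an unchanged hostname, client_can_reach("il-bld", "il-bld", "rpc") returns 1. If the server bound as "test" and the hostname is then changed to "newname", client_can_reach("test", "newname", "rpc") returns 0, because "test.rpc" and "newname.rpc" differ.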
The fix:
There are basically two ways to fix this:
1) Restart rpcbind once the hostname changes. In practice this means restarting the rpc/bind SMF service. That would also restart all dependent services (lockd, nfsd, mountd, etc.), which is undesirable because it causes a short service interruption. In addition, to implement this reliably, we would need to restart the rpc/bind service automatically after every hostname change (this would require a change in the hostname(1) command, and in uname -S as well). This is too complex, so it is not a viable option.
2) The other possibility is to make sure rpcbind does not bind to an address that contains the hostname. This would require changing the netdir_getbyname() implementation for the Local Transport Provider to return something more generic, for example "localhost", instead of the current hostname.
We can see this in the straddr.c source file:
418     /*
419      * Unless /etc/netconfig has been altered, the only transport
420      * that will use straddr.so is loopback. In this case, we
421      * always return our nodename if that's what we were passed,
422      * or we fail (note that we'd like to return a constant like
423      * "localhost" so that changes to the machine name won't cause
424      * problems, but things like autofs actually assume that we're
425      * using our nodename).
426      */
427
428     if ((strcmp(token, HOST_SELF_BIND) == 0) ||
429         (strcmp(token, HOST_SELF_CONNECT) == 0) ||
430         (strcmp(token, HOST_ANY) == 0) ||
431         (myname != NULL && (strcmp(token, myname) == 0))) {
432             if (myname == NULL)
433                     return (0);
434
435             (void) strcpy(hostbuf, myname);
436             return (1);
437     }
This is the actual code where we return the hostname instead of something generic. The comment at lines 419-425 says that we would like to return something like "localhost" here, but some services (like autofs) are not prepared for such a change. Fortunately, this comment is obsolete, since the autofs implementation was changed several years ago to use doors instead of RPC.
This solution sounds better, but it needs a careful implementation and testing to make sure we do not break anything.
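As a rough illustration of the second approach, the check could return a constant instead of the nodename. This is only a sketch under assumptions, not the change that was eventually committed: the function name resolve_loopback_host(), the LOOPBACK_NAME constant, the explicit buffer length, and the token values defined here are all stand-ins so the example compiles outside illumos (the real tokens come from <netdir.h>).

```c
#include <string.h>

/*
 * Illustrative stand-ins only; the real definitions live in <netdir.h>
 * and may have different values.
 */
#define	HOST_SELF_BIND		"\\1"
#define	HOST_SELF_CONNECT	"\\2"
#define	HOST_ANY		"\\3"

/* Assumed constant returned in place of the machine's nodename. */
#define	LOOPBACK_NAME		"localhost"

/*
 * Hypothetical variant of the straddr.c check: the special tokens, the
 * constant itself, and our own nodename all resolve to LOOPBACK_NAME,
 * so the resulting address no longer depends on the current hostname.
 * Returns 1 and fills hostbuf on a match, 0 otherwise.
 */
static int
resolve_loopback_host(const char *token, const char *myname, char *hostbuf,
    size_t len)
{
	if ((strcmp(token, HOST_SELF_BIND) == 0) ||
	    (strcmp(token, HOST_SELF_CONNECT) == 0) ||
	    (strcmp(token, HOST_ANY) == 0) ||
	    (strcmp(token, LOOPBACK_NAME) == 0) ||
	    (myname != NULL && (strcmp(token, myname) == 0))) {
		/* Refuse to overflow the caller's buffer. */
		if (strlen(LOOPBACK_NAME) >= len)
			return (0);

		(void) strcpy(hostbuf, LOOPBACK_NAME);
		return (1);
	}

	return (0);
}
```

In this sketch a server that binds with HOST_SELF_BIND while the nodename is "test" and a client that connects with HOST_SELF_CONNECT after the nodename became "newname" both get "localhost", so their addresses still match. Whether the nodename itself should keep being accepted as a token is a design choice assumed here, not something stated in the thread.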
Updated by Marcel Telka over 7 years ago
- Category changed from nfs - NFS server and client to lib - userland libraries
Updated by Marcel Telka over 7 years ago
Simple manifestation of the problem:
# hostname test
# rpcinfo
   program version netid     address                service    owner
    100000    4    ticots    test.rpc               rpcbind    superuser
    100000    3    ticots    test.rpc               rpcbind    superuser
    100000    4    ticotsord test.rpc               rpcbind    superuser
    100000    3    ticotsord test.rpc               rpcbind    superuser
    100000    4    ticlts    test.rpc               rpcbind    superuser
    100000    3    ticlts    test.rpc               rpcbind    superuser
    100000    4    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    3    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    2    tcp       0.0.0.0.0.111          rpcbind    superuser
    100000    4    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    3    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    2    udp       0.0.0.0.0.111          rpcbind    superuser
    100000    4    tcp6      ::.0.111               rpcbind    superuser
    100000    3    tcp6      ::.0.111               rpcbind    superuser
    100000    4    udp6      ::.0.111               rpcbind    superuser
    100000    3    udp6      ::.0.111               rpcbind    superuser
    100234    1    ticotsord <\\000\\000\\000       gssd       superuser
    100155    1    ticots    ?\\000\\000\\000       smserverd  superuser
    100155    1    ticotsord B\\000\\000\\000       smserverd  superuser
    100155    1    tcp       0.0.0.0.241.87         smserverd  superuser
    100155    1    tcp6      ::.165.150             smserverd  superuser
    100134    1    ticotsord I\\000\\000\\000       ktkt_warnd superuser
    100011    1    ticlts    L\\000\\000\\000       rquotad    superuser
    100011    1    udp       0.0.0.0.178.182        rquotad    superuser
    100011    1    udp6      ::.232.122             rquotad    superuser
    100169    1    ticlts    T\\000\\000\\000       -          superuser
    100169    1    ticotsord V\\000\\000\\000       -          superuser
    100169    1    ticots    X\\000\\000\\000       -          superuser
# hostname newname
# rpcinfo
rpcinfo: can't contact rpcbind: RPC: Rpcbind failure - RPC: Failed (unspecified error)
#
IOW, once the hostname changes (using the hostname command), rpcbind stops working properly. As a result, all RPC servers (for example mountd, lockd, statd) started after the hostname change fail to register with rpcbind.
Updated by Marcel Telka over 7 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk over 7 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 30 to 100
git commit 6935f61b0d202f1b87f0234824e4a6ab88c492ac
commit 6935f61b0d202f1b87f0234824e4a6ab88c492ac
Author: Marcel Telka <marcel.telka@nexenta.com>
Date: 2015-03-13T18:35:03.000Z

1101 NFS service fails to start after hostname change
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Updated by Juan Jose Presa Rodal over 5 years ago
Hi, I have a custom legacy app that looks up rpc at `hostname`.rpc and doesn't work with localhost.rpc. If I use an old straddr.so library, everything works. Is there any way to force the new straddr.so to bind with the old behaviour?
Updated by Marcel Telka over 5 years ago
Juan Jose Presa Rodal wrote:
Hi, I have a custom legacy app that looks up rpc at `hostname`.rpc and doesn't work with localhost.rpc. If I use an old straddr.so library, everything works. Is there any way to force the new straddr.so to bind with the old behaviour?
I think the problem is in the application.
Do you have sources for the application?
Since I'm not sure I fully understand your issue...
... do you have some simple testcase (in C) showing the problem?