Bug #1101

NFS service fails to start after hostname change

Added by Igor Vasilyev over 8 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
lib - userland libraries
Start date:
2011-06-10
Due date:
% Done:

100%

Estimated time:
Difficulty:
Tags:
needs-triage

Description

If there is no share before the hostname is changed, nfsd does not start.

root@illumos-gate:~# share
root@illumos-gate:~# ps ax | grep nfs
 100943 pts/1    R  0:00 grep nfs
root@illumos-gate:~# 
root@illumos-gate:~# hostname illumos-gate2
root@illumos-gate2:~# share /testpool/NFS40 
root@illumos-gate2:~# share
-               /testpool/NFS40   rw   ""  
root@illumos-gate2:~# ps ax | grep nfs
 100960 ?        S  0:00 /usr/lib/nfs/nfsmapid
 100965 ?        S  0:00 /usr/lib/nfs/lockd
 100993 pts/1    S  0:00 grep nfs
root@illumos-gate2:~# 
root@illumos-gate2:~# svcs -x
svc:/network/nfs/server:default (NFS server)
 State: maintenance since Fri Jun 10 08:19:19 2011
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
   See: http://sun.com/msg/SMF-8000-KS
   See: nfsd(1M)
   See: /var/svc/log/network-nfs-server:default.log
Impact: This service is not running.

svc:/network/nfs/status:default (NFS status monitor)
 State: maintenance since Fri Jun 10 08:19:19 2011
Reason: Restarting too quickly.
   See: http://sun.com/msg/SMF-8000-L5
   See: statd(1M)
   See: /var/svc/log/network-nfs-status:default.log
Impact: This service is not running.

If a share already exists, everything works fine.

History

#1

Updated by Igor Vasilyev over 8 years ago

root@illumos-gate:~# uname -a
SunOS illumos-gate 5.11 30.05.11 i86pc i386 i86pc Solaris

The system was built from illumos-gate as of 30.05.2011.

/var/svc/log/network-nfs-server:default.log

[ Jun 14 07:45:35 Enabled. ]
[ Jun 14 07:45:35 Executing start method ("/lib/svc/method/nfs-server start"). ]
couldn't register datagram_v MOUNTVERS/lib/svc/method/nfs-server: mountd failed with 1
[ Jun 14 07:45:36 Method "start" exited with status 95. ]
[ Jun 14 07:45:36 Stopping for maintenance due to service_request. ]
[ Jun 14 07:45:36 Stopping for maintenance due to service_request. ]

#2

Updated by sham pavman about 8 years ago

  • Assignee set to sham pavman
  • % Done changed from 0 to 20

#3

Updated by sham pavman almost 8 years ago

  • % Done changed from 20 to 30

Hi,

After investigating this issue, I have a few observations to share.
The issue arises only when there is a temporary hostname change and does not occur otherwise.
A reboot reverts this hostname change anyway, and because this kind of change is not made often, I feel it is best that this issue be 'noted' rather than 'fixed'.

Any opinions?

Shampavman

#4

Updated by Marcel Telka about 6 years ago

sham pavman wrote:

Any opinions?

It should be fixed.

#5

Updated by Marcel Telka over 4 years ago

  • Status changed from New to In Progress
  • Assignee changed from sham pavman to Marcel Telka

#6

Updated by Marcel Telka over 4 years ago

  • Subject changed from nfs does not start after change hostname to NFS service fails to start after hostname change

#7

Updated by Marcel Telka over 4 years ago

Root cause:

Some RPC servers (for example, rpcbind and lockd) bind locally using the special HOST_SELF host name. Before the actual binding, this special host name is translated to the host address using the netdir_getbyname() call, which for the Local Transport Provider (IOW: ticlts, ticots, and ticotsord) returns addresses composed of the machine hostname and the service name. This is why we see the machine hostname in the rpcinfo output. For example, on the il-bld machine, I have this:

$ rpcinfo | grep il-bld
    100000    4    ticots    il-bld.rpc          rpcbind    superuser
    100000    3    ticots    il-bld.rpc          rpcbind    superuser
    100000    4    ticotsord il-bld.rpc          rpcbind    superuser
    100000    3    ticotsord il-bld.rpc          rpcbind    superuser
    100000    4    ticlts    il-bld.rpc          rpcbind    superuser
    100000    3    ticlts    il-bld.rpc          rpcbind    superuser
    100021    1    ticotsord il-bld.lockd        nlockmgr   1
    100021    2    ticotsord il-bld.lockd        nlockmgr   1
    100021    3    ticotsord il-bld.lockd        nlockmgr   1
    100021    4    ticotsord il-bld.lockd        nlockmgr   1
$

When an RPC client tries to connect to such an RPC server over the Local Transport Provider, it first translates the special HOST_SELF_CONNECT host name to the host address using netdir_getbyname(). Again, this yields the real hostname (il-bld in the example above) and the service name (rpc in the example above), separated by a dot.

This works well in the usual cases, but it causes an issue when the hostname changes. After the hostname change, rpcbind still listens on the old address, but when a client tries to connect to it, the client's netdir_getbyname() returns the new hostname in the address and the client is unable to connect.
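
The mismatch described above can be illustrated with a small simulation. This is only a sketch, not the libnsl code: compose_lt_addr() and lt_addr_matches() are hypothetical helpers that mimic the "hostname.service" address format seen in the rpcinfo output, and the buffer sizes are arbitrary.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libnsl): build a loopback transport
 * address in the "<hostname>.<service>" form shown by rpcinfo.
 * Returns 0 on success, -1 if the result does not fit into buf.
 */
int
compose_lt_addr(const char *host, const char *service, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s.%s", host, service);

    return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/*
 * Hypothetical helper: returns 1 if a client that composes its target
 * address from client_host would reach a server that bound its address
 * while the hostname was server_host, and 0 otherwise.
 */
int
lt_addr_matches(const char *server_host, const char *client_host,
    const char *service)
{
    char bound[256];    /* address the server bound at startup */
    char target[256];   /* address the client computes now */

    if (compose_lt_addr(server_host, service, bound, sizeof (bound)) != 0 ||
        compose_lt_addr(client_host, service, target, sizeof (target)) != 0)
        return (0);

    return (strcmp(bound, target) == 0);
}
```

With the hostname unchanged, lt_addr_matches("test", "test", "rpc") returns 1. After the hostname changes, lt_addr_matches("test", "newname", "rpc") returns 0, which corresponds to the client being unable to reach rpcbind.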

The fix:

There are basically two ways to fix this:

1) Restart rpcbind once the hostname changes. This actually means restarting the rpc/bind SMF service. This would also restart all dependent services (lockd, nfsd, mountd, etc.), which is undesirable since it would cause a short service interruption. In addition, to implement this reliably, we would need to restart the rpc/bind service automatically after every hostname change (this would require a change in the hostname(1) command, and in uname -S as well). This is too complex, so it is not a viable option.

2) The other possible way is to make sure rpcbind does not bind to an address that contains the hostname. This would require changing the netdir_getbyname() implementation for the Local Transport Provider to return something more generic, for example "localhost", instead of the current hostname.

We can see this in the straddr.c source file:

418    /*
419     *    Unless /etc/netconfig has been altered, the only transport
420     *    that will use straddr.so is loopback.  In this case, we
421     *    always return our nodename if that's what we were passed,
422     *    or we fail (note that we'd like to return a constant like
423     *    "localhost" so that changes to the machine name won't cause
424     *    problems, but things like autofs actually assume that we're
425     *    using our nodename).
426     */
427
428    if ((strcmp(token, HOST_SELF_BIND) == 0) ||
429        (strcmp(token, HOST_SELF_CONNECT) == 0) ||
430        (strcmp(token, HOST_ANY) == 0) ||
431        (myname != NULL && (strcmp(token, myname) == 0))) {
432        if (myname == NULL)
433            return (0);
434
435        (void) strcpy(hostbuf, myname);
436        return (1);
437    }

This is the actual code where we return the hostname instead of something generic. The comment at lines 419-425 says that we would like to return something like "localhost" here, but some services (like autofs) are not prepared for such a change. Fortunately, this comment is obsolete, since the autofs implementation was changed several years ago to use doors instead of RPC.

This solution sounds better, but it will need a proper and careful implementation and testing to make sure we do not break anything.
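
One possible shape of option 2 is sketched below. It is a standalone illustration under stated assumptions, not the change that was eventually committed: sketch_resolve_host() is a hypothetical function, and the SKETCH_* token values are placeholders standing in for the HOST_SELF_BIND, HOST_SELF_CONNECT, and HOST_ANY constants from <netdir.h>.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder token values for this sketch only (assumptions). */
#define SKETCH_HOST_SELF_BIND     "<self-bind>"
#define SKETCH_HOST_SELF_CONNECT  "<self-connect>"
#define SKETCH_HOST_ANY           "<any>"

/* Constant name returned in place of the machine hostname. */
#define SKETCH_GENERIC_HOST       "localhost"

/*
 * Hypothetical variant of the check quoted above: accept the same tokens
 * as before, but copy a constant name into hostbuf so that the resulting
 * address no longer depends on the current hostname.
 * Returns 1 if the token was resolved, 0 otherwise.
 */
int
sketch_resolve_host(const char *token, const char *myname, char *hostbuf,
    size_t len)
{
    if ((strcmp(token, SKETCH_HOST_SELF_BIND) == 0) ||
        (strcmp(token, SKETCH_HOST_SELF_CONNECT) == 0) ||
        (strcmp(token, SKETCH_HOST_ANY) == 0) ||
        (myname != NULL && (strcmp(token, myname) == 0))) {
        /* Refuse to overflow the caller's buffer. */
        if (strlen(SKETCH_GENERIC_HOST) >= len)
            return (0);

        (void) strcpy(hostbuf, SKETCH_GENERIC_HOST);
        return (1);
    }

    return (0);
}
```

In this sketch a server that binds before a hostname change and a client that connects after it both resolve to the same name, so the composed addresses stay equal. Whether the machine's own nodename should still be accepted as a token is a design choice that would need the careful testing mentioned above.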

#8

Updated by Marcel Telka over 4 years ago

  • Category changed from nfs - NFS server and client to lib - userland libraries

#9

Updated by Marcel Telka over 4 years ago

Simple manifestation of the problem:

# hostname
test
# rpcinfo
   program version netid     address             service    owner
    100000    4    ticots    test.rpc            rpcbind    superuser
    100000    3    ticots    test.rpc            rpcbind    superuser
    100000    4    ticotsord test.rpc            rpcbind    superuser
    100000    3    ticotsord test.rpc            rpcbind    superuser
    100000    4    ticlts    test.rpc            rpcbind    superuser
    100000    3    ticlts    test.rpc            rpcbind    superuser
    100000    4    tcp       0.0.0.0.0.111       rpcbind    superuser
    100000    3    tcp       0.0.0.0.0.111       rpcbind    superuser
    100000    2    tcp       0.0.0.0.0.111       rpcbind    superuser
    100000    4    udp       0.0.0.0.0.111       rpcbind    superuser
    100000    3    udp       0.0.0.0.0.111       rpcbind    superuser
    100000    2    udp       0.0.0.0.0.111       rpcbind    superuser
    100000    4    tcp6      ::.0.111            rpcbind    superuser
    100000    3    tcp6      ::.0.111            rpcbind    superuser
    100000    4    udp6      ::.0.111            rpcbind    superuser
    100000    3    udp6      ::.0.111            rpcbind    superuser
    100234    1    ticotsord <\000\000\000       gssd       superuser
    100155    1    ticots    ?\000\000\000       smserverd  superuser
    100155    1    ticotsord B\000\000\000       smserverd  superuser
    100155    1    tcp       0.0.0.0.241.87      smserverd  superuser
    100155    1    tcp6      ::.165.150          smserverd  superuser
    100134    1    ticotsord I\000\000\000       ktkt_warnd superuser
    100011    1    ticlts    L\000\000\000       rquotad    superuser
    100011    1    udp       0.0.0.0.178.182     rquotad    superuser
    100011    1    udp6      ::.232.122          rquotad    superuser
    100169    1    ticlts    T\000\000\000       -          superuser
    100169    1    ticotsord V\000\000\000       -          superuser
    100169    1    ticots    X\000\000\000       -          superuser
# hostname newname
# rpcinfo
rpcinfo: can't contact rpcbind: RPC: Rpcbind failure - RPC: Failed (unspecified error)
#

IOW, once the hostname changes (using the hostname command), rpcbind stops working properly. As a result, all RPC servers (for example mountd, lockd, statd) started after the hostname change fail to register with rpcbind.

#10

Updated by Marcel Telka over 4 years ago

  • Status changed from In Progress to Pending RTI

#11

Updated by Electric Monk over 4 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 30 to 100

git commit 6935f61b0d202f1b87f0234824e4a6ab88c492ac

commit  6935f61b0d202f1b87f0234824e4a6ab88c492ac
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   2015-03-13T18:35:03.000Z

    1101 NFS service fails to start after hostname change
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Dan McDonald <danmcd@omniti.com>

#12

Updated by Juan Jose Presa Rodal almost 3 years ago

Hi, I have a custom legacy app that looks up rpc at `hostname`.rpc and does not work with localhost.rpc. If I use an old straddrs.so library, everything works. Is there any way to force the new straddrs.so to bind with the old behaviour?

#13

Updated by Marcel Telka almost 3 years ago

Juan Jose Presa Rodal wrote:

Hi, I have a custom legacy app that looks up rpc at `hostname`.rpc and does not work with localhost.rpc. If I use an old straddrs.so library, everything works. Is there any way to force the new straddrs.so to bind with the old behaviour?

I think the problem is in the application.
Do you have sources for the application?

Since I'm not sure I fully understand your issue...
... do you have some simple testcase (in C) showing the problem?
