NTP fails on boot
The NTP service fails on boot, and also when the Internet connection fails. When this happens, the ntpq command shows no remote servers, like this:
$ ntpq -p
No association ID's returned
It reports this even when servers are configured in /etc/inet/ntp.conf. The service returns to normal after it is restarted manually.
This happens because NTP requires a name service to resolve host names to IP addresses. On boot, the NTP service starts before the DNS service is available. The same thing happens when the name service fails because the Internet connection has gone down. The solution may be a new SMF dependency of the NTP service on the name service.
Updated by Gary Mills about 1 year ago
My first task was to reproduce the problem on a test server. Doing this turned out to be difficult and frustrating. I kept getting unexpected results. Finally, I disabled the name services cache (svc:/system/name-service-cache:default). It had been invalidating my reboot tests.
The name service cache (NSCD) caches names and their IP addresses. In the case of NTP server pools, it must have cached all the DNS A-records and their TTLs in a persistent store. It's usually unreliable, because it only caches names that have been recently used. In my reboot tests, the names had always been used recently, so they were always cached. As soon as I disabled that service, I began to get the results I had been expecting.
It was easy to reproduce the problem, after that. Any NTP server name that required DNS resolution failed. I tried various ntp.conf configurations, including the one that led to this report. For that one, I used NTP pools with the server directive, as in the NTP documents and man pages. At boot, there was about a 13-second delay between the time the NTP service came online and the time the DNS service came online. As a result, NTP always failed to resolve time server names. It did work after it was restarted.
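The timing gap described above can be illustrated with a minimal Python sketch that keeps retrying a lookup until the resolver answers or a deadline passes. This is only an illustration of the ordering problem, not the fix adopted in this report and not how ntpd behaves; the function name wait_for_dns and its parameters are hypothetical.

```python
import socket
import time

def wait_for_dns(host, timeout=30.0, interval=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep):
    """Retry a name lookup until it succeeds or `timeout` seconds elapse.

    Hypothetical helper: `resolve` and `sleep` are injectable so the
    logic can be exercised without a network.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(host, 123)      # 123 is the NTP port
            return True
        except OSError:             # socket.gaierror is a subclass of OSError
            if time.monotonic() >= deadline:
                return False
            sleep(interval)
```

With a timeout longer than the 13-second gap, a call such as wait_for_dns("0.pool.ntp.org") would be expected to succeed once the DNS service is online, whereas a single lookup at service start fails.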
I also tried the new pool directive, expecting it to work. It didn't, probably because it did not retry the DNS queries for the NTP server pools. After an SMF restart, it did work, and did add associations for the time servers that were part of the NTP pools. The use of this new directive is not documented, probably because the NTP documentation lags behind the software. I found out about it here:
I used these restrict and pool directives in my /etc/inet/ntp.conf file:
restrict default limited kod ignore
restrict -6 default limited kod ignore
restrict source limited kod
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap nopeer
restrict 127.0.0.1
restrict -6 ::1
pool 0.pool.ntp.org iburst
pool 1.pool.ntp.org iburst
pool 2.pool.ntp.org iburst
pool 3.pool.ntp.org iburst
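To check which entries in a configuration like this one depend on DNS, a short Python sketch can list the server, pool, and peer targets that are not IP literals. The function name names_needing_dns is hypothetical, the sample text is made up for illustration (time.example.org is not a real server), and the parsing is simplified: it assumes the target is the first argument that does not start with a dash.

```python
import ipaddress

def names_needing_dns(conf_text):
    """Return server/pool/peer targets that are host names, not IP literals."""
    names = []
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()   # drop comments, tokenize
        if len(fields) < 2 or fields[0] not in ("server", "pool", "peer"):
            continue
        # Skip address-family qualifiers such as -4 or -6 (assumed syntax).
        args = [f for f in fields[1:] if not f.startswith("-")]
        if not args:
            continue
        try:
            ipaddress.ip_address(args[0])       # IP literal: no DNS needed
        except ValueError:
            names.append(args[0])
    return names

sample = """\
restrict 127.0.0.1
pool 0.pool.ntp.org iburst
server 192.168.0.1
server -4 time.example.org iburst
"""
print(names_needing_dns(sample))   # ['0.pool.ntp.org', 'time.example.org']
```

Every name this returns is one that fails to resolve if NTP starts before the name service is online.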
Now that I have reproduced the problem on a test server, my next task is to test a dependency on:
Updated by Gary Mills about 1 year ago
In the end, all I had to do was to add a third dependency to the SMF manifest. It's this one:
The first test I did was to reboot the system. NTP worked immediately after login, because the service svc:/network/ntp:default waited until svc:/milestone/name-services:default came online. That new behavior was a result of the new dependency.
The second test was to disconnect the Ethernet cable for five minutes, and then reconnect it. Nine services restarted automatically on reconnect, including svc:/milestone/name-services:default but not including the original two dependencies of the NTP service. This added dependency caused NTP to restart, as shown in the SMF log:
[ Nov 24 15:25:27 Stopping because dependency activity requires stop. ]
[ Nov 24 15:25:27 Executing stop method (:kill). ]
[ Nov 24 15:25:27 Executing start method ("/lib/svc/method/ntp start"). ]
[ Nov 24 15:25:27 Method "start" exited with status 0. ]
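A restart of this kind can be confirmed from a script by scanning the service log for the dependency-triggered stop. The following Python sketch parses entries in the bracketed format shown above; the helper names parse_smf_log and restarted_by_dependency are hypothetical, and the pattern is inferred only from these four entries, so other SMF log lines may not match it.

```python
import re

# One entry looks like: [ Nov 24 15:25:27 message text ]
ENTRY = re.compile(r"\[ (\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (.*?) \]")

def parse_smf_log(text):
    """Return (timestamp, message) pairs for each bracketed entry."""
    return [(m.group(1), m.group(2)) for m in ENTRY.finditer(text)]

def restarted_by_dependency(text):
    """True if any entry reports a stop caused by dependency activity."""
    return any(msg.startswith("Stopping because dependency activity")
               for _, msg in parse_smf_log(text))

log = """\
[ Nov 24 15:25:27 Stopping because dependency activity requires stop. ]
[ Nov 24 15:25:27 Executing stop method (:kill). ]
[ Nov 24 15:25:27 Executing start method ("/lib/svc/method/ntp start"). ]
[ Nov 24 15:25:27 Method "start" exited with status 0. ]
"""
entries = parse_smf_log(log)
print(len(entries), restarted_by_dependency(log))   # 4 True
```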
So, this one addition to the SMF manifest fixed both of the problems listed in this bug report.