Project

General

Profile

Bug #9705

NTP client crashing

Added by Harry Liebel about 1 year ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
OI-Userland
Target version:
Start date:
2018-08-07
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

Hi.

I installed/enabled the NTP client on OpenIndiana. It usually starts when I start it by hand. Sometimes it will be in maintenance mode after a reboot and I can clear it to start it again. Even after running for a short while it will fail and go into maintenance mode again.

I'm using an up-to-date OpenIndiana. I did a TextInstall in a VM and only added tmux and enabled NTP. I updated everything to the latest as of 7Aug18.

I just enabled ntp using 'svcadm enable ntp' without changing any configuration.

harry@vultr-oi:~$ date
 6 August 2018 at 23:57:04 AEST

harry@vultr-oi:~$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp2.ds.network 202.46.177.18    2 u   12   64   37   44.955  -22.107  17.201
*45.76.113.31 (d 17.253.66.125    2 u   18   64   37    0.428  -20.438  15.068
+y.ns.gin.ntt.ne 249.224.99.213   2 u   18   64   37    1.156  -22.878  16.041
+ntp3.ds.network 27.124.125.250   3 u   18   64   37   44.871  -20.451  16.153

...
...

harry@vultr-oi:~$ ntpq -p
ntpq: read: Connection refused

harry@vultr-oi:~$ svcs -xv ntp
svc:/network/ntp:default (Network Time Protocol (NTP) Version 4)
 State: maintenance since  7 August 2018 at 00:24:01 AEST
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: man -M /usr/share/man -s 1M ntpd
   See: man -M /usr/share/man -s 4 ntp.conf
   See: man -M /usr/share/man -s 1M ntpq
   See: /var/svc/log/network-ntp:default.log
Impact: This service is not running.

$ grep ntpd /var/adm/messages*
/var/adm/messages.0:Aug  6 23:52:16 vultr-oi ntpd[512]: [ID 702911 ntp.error] /jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed
/var/adm/messages.0:Aug  6 23:52:16 vultr-oi ntpd[512]: [ID 702911 ntp.error] exiting (due to assertion failure)
/var/adm/messages.0:Aug  7 00:09:13 vultr-oi ntpd[1150]: [ID 702911 ntp.error] /jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed
/var/adm/messages.0:Aug  7 00:09:13 vultr-oi ntpd[1150]: [ID 702911 ntp.error] exiting (due to assertion failure)
/var/adm/messages.0:Aug  7 00:13:48 vultr-oi ntpd[1201]: [ID 702911 ntp.error] /jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed
/var/adm/messages.0:Aug  7 00:13:48 vultr-oi ntpd[1201]: [ID 702911 ntp.error] exiting (due to assertion failure)
/var/adm/messages.0:Aug  7 00:24:01 vultr-oi ntpd[1218]: [ID 702911 ntp.error] /jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed
/var/adm/messages.0:Aug  7 00:24:01 vultr-oi ntpd[1218]: [ID 702911 ntp.error] exiting (due to assertion failure)

Related issues

Related to OpenIndiana Distribution - Feature #9983: Replace ntp with ntpsecNew2018-11-16

Actions

History

#1

Updated by Harry Liebel about 1 year ago

Sorry, put it in the wrong section should be in the OI tracker. Can I move it?

#2

Updated by Marcel Telka about 1 year ago

  • Project changed from illumos gate to OpenIndiana Distribution
#3

Updated by Michal Nowak about 1 year ago

I think I've seen this bug on SmartOS in Linux KVM before, usually after resuming from suspension to memory.

Anyway, we should update to 4.2.8p11, although I haven't seen anything encouraging in p11 changelog.

Filed upstream bug http://bugs.ntp.org/show_bug.cgi?id=3522.

#4

Updated by Harry Liebel about 1 year ago

I ran ntp by itself and got a stacktrace.

root@vultr-oi:~# pstack core
core 'core' of 1659:    /usr/lib/inet/ntpd -n
 fffffd7feede6a0a _lwp_kill () + a
 fffffd7feed78ee0 raise (6) + 20
 fffffd7feed51618 abort () + 98
 0000000000442139 ???????? ()
 00000000004c52da isc_assertion_failed () + a
 00000000004c1cb4 ???????? ()
 00000000004c1730 get_free_recv_buffer ()
 000000000043a0b3 input_handler () + 9f3
 00000000004b9cd2 sigio_handler () + 52
 fffffd7feeddf836 __sighndlr () + 6
 fffffd7feedd24db call_user_handler (16, 0, fffffd7fffdeec40) + 1db
 fffffd7feedd284e sigacthandler (16, 0, fffffd7fffdeec40) + 10e
 --- called from signal handler with signal 22 (SIGPOLL) ---
 fffffd7feede68ba __sigsuspend () + a
 00000000004ba1dd wait_for_signal () + 4d
 0000000000443191 ntpdmain () + f41
 00000000004434b9 ???????? ()
 000000000042fa43 _start_crt () + 83
 000000000042f9a8 _start () + 18

#5

Updated by Michal Nowak about 1 year ago

I prepared a pull request to oi-userland (https://github.com/OpenIndiana/oi-userland/pull/4363) but before that integrates to hipster, you may want to try it from a p5p archive. If you know how to install it (I don't...), you can find it in the pull request. Feedback welcome.

#6

Updated by Michal Nowak about 1 year ago

Nope v4.2.8p11 is failing the very same way.

#7

Updated by Geoff Weiss about 1 year ago

I applied the patches in #4363. ntp no longer goes into maintenance mode. Up until this fix, ntp would dump core after running for around 5-8 minutes. The ntp service has been up for just over three days now and the ntp svc log has no new entries.

#8

Updated by Michal Nowak about 1 year ago

Thanks for the confirmation Geoff! Can you confirm Harry?

#9

Updated by Harry Liebel about 1 year ago

Hi Michal.

Apologies for the late response.

I was trying to get the install from p5p working, but couldn't get the dependency versions right. Is there a trick to it?

So I went down the path to setup a local build environment to try the patch.

I setup a OI-1804 Text install into a virtualbox VM. After enabling NTP on the fresh install it failed as expected.

After updating the VM and setting up the build environment the current (8p10) NTP fails much slower for me. This caused some confusion while I was trying to ensure I could actually replicate the bug before applying your patch.

I have now built and installed the updated version (8p11) of NTP (after a bit of learning and a few false starts).

$ ntpq --help
ntpq - standard NTP query program - Ver. 4.2.8p11
$ ntpdc --help
ntpdc - vendor-specific NTPD control program - Ver. 4.2.8p11

I still see the failure:

Aug 15 02:18:48 oi1804-vm ntpd[1690]: [ID 702911 ntp.error] /export/home/harry/work/oi-userland/components/network/ntp/ntp-4.2.8p11/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed
Aug 15 02:18:48 oi1804-vm ntpd[1690]: [ID 702911 ntp.error] exiting (due to assertion failure)

Harry

#10

Updated by Udo Grabowski about 1 year ago

I get the same problem when using ntpdate with more than 10 arguments. Someone decided
that 10 hosts are enough (does not happen in old ):

ro sunth7 /var/crash/crashcore # ntpdate q timehost imksuns{11,12,13,14,15,16,17,18,19,20,21}
/jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE0) pf
>pptail || pptail pf->pptail) failed.
Abort (core dumped)

ro sunth7 /var/crash/crashcore # pstack core-ntpdate-1535118141.16780
core 'core-ntpdate-1535118141.16780' of 16780: ntpdate -q timehost imksuns11 imksuns12 imksuns13 imksuns14 imksuns15
ffffdd7fefa76a0a _lwp_kill () + a
ffffdd7fefa08ee0 raise (6) + 20
ffffdd7fef9e1618 abort () + 98
000000000041b9bf ???? ()
000000000041a814 ???? ()
000000000041a290 get_free_recv_buffer ()
000000000040f33d input_handler () + 11d
000000000041062d ntpdatemain () + 129d
000000000040ed63 _start_crt () + 83
000000000040ecc8 _start () + 18

#11

Updated by Michal Nowak about 1 year ago

  • Category set to OI-Userland
  • Status changed from New to In Progress
  • Assignee set to Michal Nowak
  • Target version set to Hipster

This bug seems to be fixed in 4.2.8p12:

$ ntpdate -q {0,1,2,3}.{cz,ba,at,de,dk}.pool.ntp.org
server 37.157.198.150, stratum 2, offset -0.001707, delay 0.03090
server 80.211.195.36, stratum 2, offset -0.001676, delay 0.03139
...
server 77.68.158.228, stratum 2, offset -0.001885, delay 0.06885
server 85.204.137.77, stratum 2, offset -0.001784, delay 0.05257
server 92.246.24.228, stratum 1, offset -0.001474, delay 0.06552
24 Aug 21:31:40 ntpdate[13265]: adjust time server 193.171.23.163 offset -0.002182 sec

I'll update the PR accordingly, would be cool if anyone can confirm it's fixed.

#12

Updated by Geoff Weiss about 1 year ago

I updated to p12. The ntp service has been stable for over four hours. i do use a customized ntp.conf file; I don't use the default ntp servers provided for openindiana (this may explain why p11 was stable for me).

Geoff

#13

Updated by Harry Liebel about 1 year ago

Hi.

I've been trying to test it out but I am stuck. On my VPS I did a pkg update to get the new ntp as it went into OI proper recently.

If I check the installed version I get version p12:

harry@vultr-oi:~$ pkg info ntp
             Name: service/network/ntp
          Summary: Network Time Protocol Daemon v4
      Description: Network Time Protocol v4, NTP Daemon and Utilities
         Category: System/Services
            State: Installed
        Publisher: openindiana.org
          Version: 4.2.8.12
           Branch: 2018.0.0.0
   Packaging Date: August 28, 2018 at 06:02:39 AM
Last Install Time: August 30, 2018 at 04:55:04 AM
             Size: 4.66 MB
             FMRI: pkg://openindiana.org/service/network/ntp@4.2.8.12-2018.0.0.0:20180828T060239Z
      Project URL: http://www.ntp.org/
       Source URL: http://archive.ntp.org/ntp4/ntp-4.2/ntp-4.2.8p12.tar.gz

harry@vultr-oi:~$ pkg list -v ntp
FMRI                                                                         IFO
pkg://openindiana.org/service/network/ntp@4.2.8.12-2018.0.0.0:20180828T060239Z i--

But if I run ntp I get version p10:

harry@vultr-oi:~$ ntpdate -v
30 Aug 15:05:10 ntpdate[1471]: ntpdate 4.2.8p10@1.3728-o Sun Feb 11 07:50:49 UTC 2018 (1)

Is there something wrong with the package or with my machine? I've run a pkg verify and it all checks out. I've also done an uninstall and reinstall of ntp but is does the same.

The result is that I still get the failure, but probably because my machine is failing to update. What am I doing wrong?

Thanks,
Harry

#14

Updated by Geoff Weiss about 1 year ago

I can echo that after a pkg update, which removes my version of p12, I also see the same result of ntpdate -v . The ntp service crashes immediately now, instead of staying up for ~5 minutes.

Sorry, I have no time to troubleshoot the contents of this package.

#15

Updated by Michal Nowak about 1 year ago

Harry, Geoff you have the right package, the version the binary reports is wrong.

Can you post the error, please?

#16

Updated by Harry Liebel about 1 year ago

Hi Michal.

The build/package looks suspect. The assert error message also claims to be p10.

harry@vultr-oi:~$ pkg list -v ntp
FMRI                                                                         IFO
pkg://openindiana.org/service/network/ntp@4.2.8.12-2018.0.0.0:20180828T060239Z i--
harry@vultr-oi:~$
harry@vultr-oi:~$
harry@vultr-oi:~$ ntpdate -q {0,1,2,3}.{cz,ba,at,de,dk}.pool.ntp.org
/jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE(((void *)0) == pf->pptail || pptail == pf->pptail) failed.
Abort (core dumped)
harry@vultr-oi:~$
harry@vultr-oi:~$ pstack core 
core 'core' of 1781:    ntpdate -q 0.cz.pool.ntp.org 0.ba.pool.ntp.org 0.at.pool.ntp.org 0.de.
 fffffd7fef026fba _lwp_kill () + a
 fffffd7feefb9330 raise (6) + 20
 fffffd7feef91998 abort () + 98
 000000000041b9bf ???????? ()
 000000000041a814 ???????? ()
 000000000041a290 get_free_recv_buffer ()
 000000000040f33d input_handler () + 11d
 000000000041062d ntpdatemain () + 129d
 000000000040ed63 _start_crt () + 83
 000000000040ecc8 _start () + 18
#17

Updated by Harry Liebel about 1 year ago

Looking at the manifest files for ntp p10 and p12 to see what the hashes look like...

I'm going through:
http://pkg.openindiana.org/hipster/en/advanced_search.shtml?token=ntp&show=p&sav=1&rpp=50&v=&action=Advanced+Search

http://pkg.openindiana.org/hipster/manifest/0/service%2Fnetwork%2Fntp%404.2.8.10%2C5.11-2018.0.0.0%3A20180211T131908Z

http://pkg.openindiana.org/hipster/manifest/0/service%2Fnetwork%2Fntp%404.2.8.12%2C5.11-2018.0.0.0%3A20180828T060239Z

The hashes for these files are the same:
usr/sbin/ntpdate
usr/lib/inet/ntpd
usr/sbin/ntp-keygen

And doing a 'strings' on ntpdate does have the 4.2.8p10 version number in there so it's not pulled from a library..

But for other files are not.

Perhaps a misbehaving clean on the build machine? Or ccache..?

#18

Updated by Harry Liebel about 1 year ago

If i build ntp p10 and then build p12 in the same tree without doing a clean the hashes for ntpdate and the other binaries stay the same in the published manifest. If I remove the prototype directory, touch the Makefile and build again the hashes are different.. Does that mean the prototype directory is not updated properly or perhaps something that triggers after that? I dont know enough about the build system to know what I'm supposed to see..

#19

Updated by Michal Nowak about 1 year ago

Problem with our build system was resolved, may I Harry ask you to re-test with latest ntp from our repos, please?

#20

Updated by Harry Liebel about 1 year ago

Hi Michal.

Repo version looks to be p12 now. I've updated and done the ntpdate test and that looks good. I've enabled ntp on my VPS again and looking good so far. Thanks :)

#21

Updated by Harry Liebel about 1 year ago

Is there something in the build we can improve to prevent the need for human intervention when this is updated again in the future?

#22

Updated by Harry Liebel about 1 year ago

Still working without any problems. I think we can close this one for now :) Thanks!

#23

Updated by Michal Nowak about 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100
  • Tags deleted (needs-triage)
#24

Updated by Michal Nowak 11 months ago

  • Status changed from Closed to In Progress
  • Assignee deleted (Michal Nowak)
  • % Done changed from 100 to 0

It actually hasn't been fixed by the latest update. I and other see it. Cloned the original bug: http://bugs.ntp.org/show_bug.cgi?id=3552.

#25

Updated by Michal Nowak 10 months ago

#26

Updated by Geoff Weiss 9 months ago

I resynced my ntp build with userland, and was able to replicate ntp crashing after a few minutes. I don't have a solution to prevent the assertion fail when the service starts on boot, but here is a change to oi-userland/components/network/ntp to prevent the service from going into maintenance a few minutes after being started:

--- Makefile.old        2018-12-14 13:08:03.214030850 +0000
+++ Makefile    2019-01-03 11:54:26.317608534 +0000
@@ -53,7 +53,7 @@

 COMPONENT_PREP_ACTION = ( cd $(@D) && autoreconf --force --install )

-CFLAGS +=      -std=c99 -D_XOPEN_SOURCE=600 -D__EXTENSIONS__
+CFLAGS +=      -std=c99 -D__EXTENSIONS__

 CONFIGURE_BINDIR.64 = $(USRSBINDIR)
 CONFIGURE_SBINDIR.64 = $(USRSBINDIR)

I prefer the solution of having ntpsec available, though.

#27

Updated by Michal Nowak 9 months ago

Thanks for the suggestion, Geoff.

Can you describe, what happens with the process (and service) after assertion fails (when compiled without _XOPEN_SOURCE=600)?

#28

Updated by r a 9 months ago

I looked at replacing the default ntp with ntp-4.2.8p12 and checking the logs for ntp and used
CC=gcc CFLAGS='-m64 -O3' LDFLAGS=-m64 PKG_CONFIG_PATH=/usr/lib/pkgconfig ./configure --prefix=/usr --mandir=/usr/share/man --bindir=/usr/lib/inet --sysconfdir=/etc/inet --localstatedir=/var/ntp --enable-shared=yes --enable-static=no --with-crypto --enable-solarisprivs --enable-ipv6 --enable-dst-minutes --enable-thread-support --without-sntp --without-ntpsnmpd

4 Jan 13:22:45 ntpd1854: /jenkins/jobs/oi-userland/workspace/components/network/ntp/ntp-4.2.8p10/libntp/recvbuff.c:326: REQUIRE0) pf->pptail || pptail pf->pptail) failed
4 Jan 13:22:45 ntpd1854: exiting (due to assertion failure)

After replacing with ntp-4.2.8p12
4 Jan 14:50:00 ntpd19341: Listen and drop on 0 v6wildcard [::]:123
4 Jan 14:50:00 ntpd19341: Listen and drop on 1 v4wildcard 0.0.0.0:123
4 Jan 14:50:00 ntpd19341: Listen normally on 2 lo0 127.0.0.1:123
4 Jan 14:50:00 ntpd19341: Listen normally on 3 vboxnet0 192.168.56.1:123
4 Jan 14:50:00 ntpd19341: Listen normally on 4 e1000g1 172.16.26.110:123
4 Jan 14:50:00 ntpd19341: Listen normally on 5 lo0 [::1]:123
4 Jan 14:50:00 ntpd19341: Listen normally on 6 e1000g1 [fe80::215:17ff:fe91:7e46%3]:123
4 Jan 14:50:00 ntpd19341: Listen normally on 7 e1000g1 [2a02:8010:d00d:1:8563:3393:d477:5981]:123
4 Jan 14:50:00 ntpd19341: Listen normally on 8 e1000g1 [fd22:d18a:1556:1:6723:1023:d1cc:1355]:123
4 Jan 14:50:00 ntpd19341: Listening on routing socket on fd #29 for interface updates
4 Jan 14:50:00 ntpd19341: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
4 Jan 14:50:00 ntpd19341: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
4 Jan 14:54:26 ntpd19341: recvbuff.c:326: REQUIRE0) pf->pptail || pptail pf->pptail) failed

The ecvbuff.c:326: REQUIRE0) pf->pptail || pptail pf->pptail) failed refers to ntp-4.2.8p12/libntp/recvbuff.c line 326, column 5 which relates the following code

#ifdef NTP_DEBUG_LISTS_H
void
check_gen_fifo_consistency(void *fifo) {
gen_fifo *pf;
gen_node *pthis;
gen_node **pptail;

pf = fifo;
REQUIRE((NULL pf->phead && NULL pf->pptail) ||
(NULL != pf->phead && NULL != pf->pptail));

pptail = &pf->phead;
for (pthis = pf->phead;
pthis != NULL;
pthis = pthis->link)
if (NULL != pthis->link)
pptail = &pthis->link;
REQUIRE(NULL  pf->pptail || pptail  pf->pptail);
}
#endif /* NTP_DEBUG_LISTS_H */

Not sure why this code causes a problem.

#29

Updated by Michal Nowak 9 months ago

Good point r a. We can disable the offending code by disabling debugging code via --disable-debugging.

Hence: https://github.com/OpenIndiana/oi-userland/pull/4695. Whoever can please build it yourself and test before it gets to the distribution.

Geoff: I think you are right, with your suggested change ntpd crashes but the ntp service regenerates. Will see how the one without debugging behaves.

#30

Updated by Geoff Weiss 9 months ago

I'm happy with the changes in PR4695. The ntp service does not crash any more. The service is now stable, even after initial start during boot.

#31

Updated by Michal Nowak 9 months ago

  • Status changed from In Progress to Closed
  • Assignee set to Michal Nowak
  • % Done changed from 0 to 100

Now we have service/network/ntp@4.2.8.12,5.11-2018.0.0.1:20190107T160118Z, that should fix it for everyone. Let us know otherwise.

Also available in: Atom PDF