Bug #14305
Sync the tod chip following ntp_adjtime(MOD_FREQUENCY)
Status: Closed
% Done: 100%
Description
We encountered a problem in OmniOS when using the chrony
(https://chrony.tuxfamily.org/) time synchronisation daemon.
One report of this problem is tracked at https://github.com/omniosorg/omnios-build/issues/2576
chrony's author commented:
I suspect this is caused by the kernel resetting or adjusting the system clock when it drifts too much from the real time clock. chronyd has no idea something else is touching the system clock, which completely breaks its control loop.
To confirm that, please try it with the kernel parameter dosynctodr set to 0. If it works, we'll need to figure out what chronyd needs to do to suppress the kernel adjustment.
Manually setting dosynctodr to 0 did indeed resolve the problem (replicated across several systems).
It appears, therefore, that chrony and the kernel are competing for control of the software clock, with the kernel adjusting it back. We see this in the chrony logs as a sudden time offset of up to 2 seconds. Since chrony, leaving stepping aside, controls time only by adjusting the frequency offset, it quickly drives the frequency offset to the maximum in one direction or the other, and never recovers.
(The configuration as shipped in OmniOS means that chrony will never step the time after the initial start-up period.)
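To put rough numbers on why a frequency-only correction cannot keep up with a jump of this size, here is a small sketch. The function names and the 500 ppm limit used in the note below are assumptions chosen for illustration; they are not taken from chrony or from the kernel.

```c
/*
 * Hypothetical helper: clamp a requested frequency correction (in ppm)
 * to the range [-max_ppm, +max_ppm].
 */
double
sk_clamp_ppm(double wanted_ppm, double max_ppm)
{
	if (wanted_ppm > max_ppm)
		return (max_ppm);
	if (wanted_ppm < -max_ppm)
		return (-max_ppm);
	return (wanted_ppm);
}

/*
 * Hypothetical helper: seconds of slewing needed to remove an offset of
 * offset_s seconds when the clock rate is changed by rate_ppm.
 * rate_ppm must be non-zero.
 */
double
sk_slew_seconds(double offset_s, double rate_ppm)
{
	double off = offset_s < 0 ? -offset_s : offset_s;
	double rate = rate_ppm < 0 ? -rate_ppm : rate_ppm;

	return (off / (rate * 1e-6));
}
```

With an assumed limit of 500 ppm, removing a 2 second offset over 64 seconds would need 31250 ppm, which `sk_clamp_ppm(31250.0, 500.0)` clamps to the limit, and `sk_slew_seconds(2.0, 500.0)` comes out at roughly 4000 seconds. If the kernel moves the clock again within that window, the frequency offset simply stays pinned at the limit.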
The kernel tod_needsync variable (introduced in Solaris 2.6 afaik) was added to remove the requirement for setting the dosynctodr kernel variable to 0 whenever running time synchronisation software. tod_needsync is set to 1 for every call into adjtime() and for every call into ntp_adjtime() that changes the clock offset. This causes the hardware clock to be kept synchronised to the software clock, which is being managed by NTP software. However, it is not set for clock frequency offset adjustments, such as those made by chrony, and this seems like an oversight.
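As a rough user-space model of the behaviour described above (not the actual kernel source), the decision can be sketched as follows. The `SK_` constants mirror the usual `<sys/timex.h>` mode bits but are redefined locally, and `sk_tod_needsync_after()` and its `sync_on_frequency` switch are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

/* Local copies of the ntp_adjtime() mode bits, for a self-contained sketch. */
#define	SK_MOD_OFFSET		0x0001	/* caller changes the clock offset */
#define	SK_MOD_FREQUENCY	0x0002	/* caller changes the frequency offset */

/*
 * Hypothetical model: return the value tod_needsync would have after an
 * ntp_adjtime() call with the given mode bits, starting from 0.
 *
 * sync_on_frequency == false models the behaviour described in this bug
 * (only offset changes request a tod sync); true models the proposed
 * behaviour, where frequency changes request one as well.
 */
int
sk_tod_needsync_after(unsigned int modes, bool sync_on_frequency)
{
	unsigned int mask = SK_MOD_OFFSET;

	if (sync_on_frequency)
		mask |= SK_MOD_FREQUENCY;

	return ((modes & mask) != 0 ? 1 : 0);
}
```

In this model a frequency-only call, `sk_tod_needsync_after(SK_MOD_FREQUENCY, false)`, leaves the flag at 0, which corresponds to the chrony case; passing `true` for the second argument sets it to 1.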
I've updated the kernel on some test machines to do this synchronisation, and chrony appears completely stable.
Updated by Andy Fiddaman 8 months ago
I've tested this change by running ntpsec and chrony separately on a patched system and confirming that the clock remains in sync over time, including recovery after an upstream host becomes unreachable (via blackhole routes) and after manual clock steps made with `date` and with the `step` action in chrony.
I also monitored the timedelta and tod_needsync variables with dtrace during calls to tod_set(), either periodic or as a direct result of a call into ntp_adjtime(), to confirm that things were behaving as I expected, and that the tod clock is being synchronised with the software clock rather than the other way around.
Updated by Electric Monk 8 months ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit c538cdc56b01e46a42335d3f8d315c44ecf91ac9
commit c538cdc56b01e46a42335d3f8d315c44ecf91ac9
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date: 2021-12-17T21:54:50.000Z

    14305 Sync the tod chip following ntp_adjtime(MOD_FREQUENCY)
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>