[ntp:hackers] NTP code and documentation wiggles

David L. Mills mills at udel.edu
Thu Jan 10 19:13:09 UTC 2008


Guys,

1. The documentation cleanup project continues. The howtos and reference 
clock pages have been cleared of broken links, HTML syntax errors and 
clearly evident errors. The Windows howto was a challenge, but I gave up 
on the Solaris howto. Some pages have been reformatted to the style used 
elsewhere in the collection. Some of the options pages have been cleaned 
up. The monitoring options page was a mess; it took me a day to clear 
the weeds, not to mention the broken and misleading prose.

2. Brian pointed out a problem with the initial frequency training 
scheme and the Solaris kernel. Close inspection revealed other ways the 
training could be improved. The result is a reliable estimate of initial 
frequency well within 1 PPM with no more residual offset error than if 
the frequency file was present and accurate.

3. The rate grooming and KoD code works too well. While it is not 
possible for a compatible client to overrun a compatible server, it is 
possible to cheat by restarting a client shortly after completing an 
iburst, which does result in a KoD. This turns out to be a little 
irritating during testing. I modified the received KoD response to set 
the headway to the max rather than to disable the association. There may 
still be wiggle room in this area.

4. Stale certificates and leapsecond values can be a problem with 
Autokey. The common case is when a trusted host key or certificate is 
refreshed and eventually all dependent servers and clients need to 
refresh their certificate trails. As there is no revocation capability 
in Autokey, a certificate is valid until its lifetime expires. 
Therefore, both the old and new certificates, as well as all 
certificates signed by the old and new host keys, remain valid until 
they expire.
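
To make the consequence concrete, the following Python sketch checks validity by lifetime alone, which is all that is possible without revocation. The Cert type, its field names and the numeric times are made up for illustration and do not reflect the Autokey data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str
    not_before: int  # start of validity, seconds on some common timescale
    not_after: int   # end of validity, same timescale


def is_valid(cert: Cert, now: int) -> bool:
    # With no revocation list to consult, the lifetime is the only test.
    return cert.not_before <= now <= cert.not_after


# A refreshed certificate does not invalidate the one it replaces.
old = Cert("host", not_before=0, not_after=1000)
new = Cert("host", not_before=500, not_after=1500)
```

With these made-up values, both old and new pass the check at any time between 500 and 1000, and only new passes after 1000. A client holding the old trail keeps accepting it until it expires.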

The only way to make sure media values are current is to restart the 
association. The current code does this once per week for each 
association; this updates the certificate cache and leapsecond values 
if the downstratum servers have refreshed media.

Still thinking about the leapsecond issue. This turns out to be much 
harder than you might imagine. It is the only issue that requires an 
absolute epoch. Everywhere else the result is only a nudge one way or 
another relative to the current epoch. The number of seconds to the absolute 
epoch shifts one way or another as the clock is nudged and stepped, especially 
near the leap epoch. The current code does a good job of consensus, 
voting, etc., but arming and executing the leap requires that the local 
clock be stable within at least a few seconds before the leap.

There is also a question when kernel assist is not available at the 
time of the leap. The current code simply sets the clock back one second 
on the assumption the actual kernel clock reading routine will not be 
stepped back for any cause unless by more than one second, which is what my 
kernel code does. However, it is doubtful that other kernels have 
adopted that approach. Note that it is impractical to use adjtime(), as 
its slew rate would continue for over a half hour.
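
The "over a half hour" figure can be checked with simple arithmetic. The Python sketch below assumes a slew rate of 500 PPM, which is a common adjtime() rate and is consistent with the figure above, but the actual rate depends on the kernel. The function and constant names are illustrative.

```python
def slew_duration_s(offset_s: float, slew_rate_ppm: float) -> float:
    """Seconds needed to amortize offset_s at a constant slew rate in PPM."""
    # A rate of R PPM removes R microseconds of offset per second of slewing.
    return abs(offset_s) * 1e6 / slew_rate_ppm


ASSUMED_SLEW_PPM = 500.0  # assumption; varies by kernel

leap_slew_s = slew_duration_s(1.0, ASSUMED_SLEW_PPM)
leap_slew_min = leap_slew_s / 60.0
```

Under that assumption a one-second leap takes 2000 s to slew out, a little over 33 minutes, during which the clock disagrees with UTC by up to a second.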

Dave



