[ntp:hackers] What to do when the offset is WAYTOOBIG
jlevine at boulder.nist.gov
Thu Apr 19 20:01:06 PDT 2007
>Your NIST servers, at least those I have checked, show only the ACTS
>(lockclock) reference source and none of the other NIST servers.
>However, late in your message you cite a scenario where multiple
>external sources might indeed disagree. This is what the selection
>algorithm is for.
Yes. The comparison of the various servers and the decision about
declaring one unhealthy is done by a process that is outside of both
NTP and LOCKCLOCK. As far as NTP is concerned, it is running with the
local clock as the reference with a fudge parameter set to show ACTS
synchronization. The LOCKCLOCK process calls ACTS periodically and
adjusts the clock if the responses seem to be sensible.
(The local clock is also used by a number of other daemons that
provide non-NTP time services, but that is another story.)
LOCKCLOCK will declare the local system unhealthy if the ACTS call
fails or if it is too noisy (comparison of consecutive 1-second
data points) or if the OTM (on-time marker) does not change from * to #,
indicating that the ACTS server is not correcting for the telephone delay.
The local system will trigger a pager alarm if any of these things happens.
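A minimal sketch of the three checks described above, in Python. This is not the LOCKCLOCK code: the ActsCall record, the check_call function, and the MAX_STEP_S noise threshold are all hypothetical names and values chosen for illustration, since the message does not give the real limits.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical noise limit between consecutive 1-second data points.
# The real threshold is not stated in the message.
MAX_STEP_S = 0.005


@dataclass
class ActsCall:
    """Result of one (hypothetical) call to ACTS."""
    connected: bool          # False if the call failed outright
    offsets_s: List[float]   # measured offset at consecutive 1-second points
    otm_chars: List[str]     # on-time marker character seen with each point


def otm_changed(otm_chars: List[str]) -> bool:
    """True if a '*' marker is later followed by a '#' marker."""
    if "*" not in otm_chars:
        return False
    first_star = otm_chars.index("*")
    return "#" in otm_chars[first_star + 1:]


def check_call(call: ActsCall) -> Optional[str]:
    """Return the reason for declaring the system unhealthy, or None if okay."""
    if not call.connected or not call.offsets_s:
        return "call failed"
    # Compare each data point with the one that follows it.
    steps = [abs(b - a) for a, b in zip(call.offsets_s, call.offsets_s[1:])]
    if any(step > MAX_STEP_S for step in steps):
        return "too noisy"
    if not otm_changed(call.otm_chars):
        return "OTM did not change from * to #"
    return None
```

Any non-None result would correspond to setting the unhealthy flag and triggering the pager alarm.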
The supervisory process runs on the server in Boulder and uses
NTP requests to all of the other servers to check the servers the
same way that the users do. In addition, it uses my version of the
finger daemon to ask each server for its internal status parameters. This
process can also signal a server to set itself as unhealthy if its
time disagrees with the time of the other servers, even if its internal
checks seem to be okay. This shouldn't happen, of course, because the
internal checks should have caught the problem first. However, ...
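A minimal sketch of the cross-server comparison, in Python. The message does not say how the supervisory process decides that one server is out of line, so this assumes a simple rule: a server is flagged when its measured offset differs from the median of all the other servers by more than a limit. The function name find_outliers and the MAX_DEVIATION_S value are hypothetical.

```python
from statistics import median
from typing import Dict, List

# Hypothetical limit on disagreement with the other servers, in seconds.
MAX_DEVIATION_S = 0.010


def find_outliers(offsets_s: Dict[str, float],
                  limit_s: float = MAX_DEVIATION_S) -> List[str]:
    """Return the names of servers whose offset disagrees with the rest.

    offsets_s maps a server name to the offset the supervisor measured
    for it with an ordinary NTP request.
    """
    flagged = []
    for name, offset in offsets_s.items():
        others = [v for k, v in offsets_s.items() if k != name]
        if not others:
            continue  # nothing to compare a lone server against
        if abs(offset - median(others)) > limit_s:
            flagged.append(name)
    return sorted(flagged)
```

A server returned by this function would be signaled to set itself unhealthy, regardless of what its own internal checks report.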
Since I have 19 servers and they are checked every hour, there are
400+ checks per day, and there is usually about 1 false alarm
per day due to a network glitch of some kind. I am willing to
tolerate this false alarm rate because I am a compulsive neurotic to
start with and because the alarms go only to me. This would probably
not be a good solution if we had a really professional operation
where the alarms went to a network operations center.
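The false alarm rate implied by these figures can be worked out directly. The sketch below takes the numbers in the message at face value (19 servers, one check per server per hour, about 1 false alarm per day); the variable names are only for illustration.

```python
SERVERS = 19
CHECKS_PER_SERVER_PER_DAY = 24   # one check every hour
FALSE_ALARMS_PER_DAY = 1         # "about 1", per the message

# 19 * 24 = 456 checks per day across all servers.
checks_per_day = SERVERS * CHECKS_PER_SERVER_PER_DAY

# Fraction of individual checks that raise a false alarm.
false_alarm_rate = FALSE_ALARMS_PER_DAY / checks_per_day

print(f"{checks_per_day} checks/day, "
      f"false alarm rate about {false_alarm_rate:.2%} per check")
```

So roughly one check in 456 raises a false alarm, which is tolerable for a single recipient but adds up quickly if every alarm has to be handled by staff.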
The bottom line is that the worst that can happen is that some
server is declared unhealthy, possibly by mistake, by a daemon
process that got a little trigger-happy. The local server will get a
chance to redeem itself on the next call to ACTS, which resets the
flag to healthy if everything seems to be okay internally. If there is a
persistent
network error of some kind then the supervisory daemon will set the
server unhealthy again on its next cycle and that will trigger yet
another pager alarm.
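The interplay between the two processes can be sketched as a small state holder, in Python. The class and method names are hypothetical; the sketch only models the behavior described above: the supervisory daemon can set the flag unhealthy on any cycle, and the next ACTS call resets it to healthy if the internal checks pass.

```python
class HealthFlag:
    """Toy model of one server's health flag and its pager alarms."""

    def __init__(self) -> None:
        self.healthy = True
        self.pages_sent = 0

    def supervisor_cycle(self, looks_bad: bool) -> None:
        """External check: set unhealthy and page if the server looks bad."""
        if looks_bad:
            self.healthy = False
            self.pages_sent += 1

    def acts_call(self, internal_ok: bool) -> None:
        """Internal check: a clean ACTS call restores the healthy flag."""
        if internal_ok:
            self.healthy = True
        else:
            self.healthy = False
            self.pages_sent += 1


flag = HealthFlag()
flag.supervisor_cycle(looks_bad=True)   # network glitch: flagged by mistake
flag.acts_call(internal_ok=True)        # server redeems itself
flag.supervisor_cycle(looks_bad=True)   # persistent error: flagged again
```

After this sequence the flag is unhealthy and two pager alarms have been sent, one for each supervisory cycle that found a problem.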
One of the reasons for this complex chain of processes is that
I tend to add yet another process when I think of another
neat thing to do instead of modifying the existing processes to
include this new neat thing. (In addition, many of these ideas
really originated in the software that controls the primary clock
ensemble, and my left hand has shamelessly stolen code that
my right hand has written for another purpose.)
Apart from the fancy mumbo jumbo, I think the real difference
between our two approaches is that my system is willing
to set a system to unhealthy at any time whereas my understanding of
NTP is that it will never do this once the clock has
been declared healthy following a startup. I think that is what
pushes you to have NTP exit on a failure.