[ntp:hackers] What to do when the offset is WAYTOOBIG

Judah Levine jlevine at boulder.nist.gov
Thu Apr 19 20:01:06 PDT 2007


Hello,


>Your NIST servers, at least those I have checked, show only the ACTS 
>(lockclock) reference source and none of the other NIST servers. 
>However, late in your message you cite a scenario where multiple 
>external sources might indeed disagree. This is what the selection 
>algorithm is for.

    Yes. The comparison of the various servers and the decision about 
declaring one unhealthy is done by a process that is outside of both 
NTP and LOCKCLOCK. As far as NTP is concerned, it is running with the 
local clock as the reference with a fudge parameter set to show ACTS 
synchronization. The LOCKCLOCK process calls ACTS periodically and 
adjusts the clock if the responses seem to be sensible.
(The local clock is also used by a number of other daemons that 
provide non-NTP time services, but that is another story.)
LOCKCLOCK will declare the local system unhealthy if the ACTS call 
fails, if it is too noisy (based on a comparison of consecutive 1-second
data points), or if the OTM (on-time marker) does not change from * to #,
which indicates that the ACTS server is not correcting for the telephone
delay. The local system will trigger a pager alarm if any of these things
happens.
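
The three LOCKCLOCK checks described above can be sketched roughly as
follows. This is only an illustration and not the actual LOCKCLOCK code:
the function name, the argument layout, and the 5 ms noise limit are all
assumptions made for the example.

```python
from typing import Sequence


def acts_call_looks_healthy(
    call_succeeded: bool,
    offsets_s: Sequence[float],
    otm_chars: Sequence[str],
    max_step_s: float = 0.005,  # hypothetical noise limit, not a NIST value
) -> bool:
    """Return True only if all three checks described in the text pass."""
    # Check 1: the call itself must have worked and produced data.
    if not call_succeeded or len(offsets_s) < 2:
        return False

    # Check 2: consecutive 1-second data points must agree closely.
    steps = [abs(b - a) for a, b in zip(offsets_s, offsets_s[1:])]
    if max(steps) > max_step_s:
        return False

    # Check 3: the on-time marker must switch from '*' to '#'.
    return "*#" in "".join(otm_chars)
```

A caller would mark the system unhealthy and raise the pager alarm
whenever this returns False.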
     The supervisory process runs on the server in Boulder and uses 
ntp requests to all of the other servers to check the servers the 
same way that the users do. In addition, it uses my version of the 
finger daemon to ask each server for its internal status parameters. This
process can also signal a server to set itself as unhealthy if its 
time differs too much from the time of the other servers, even if its 
internal checks seem to be okay. This shouldn't happen, of course, because the 
internal checks should have caught the problem first. However, ...
Since I have 19 servers and they are checked every hour, there are 
200+ checks per day, and there is usually about 1 false alarm
per day due to a network glitch of some kind. I am willing to 
tolerate this false alarm rate because I am a compulsive neurotic to
start with and because the alarms go only to me. This would probably 
not be a good solution if we had a really professional operation
where the alarms went to a network operations center.
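
One way to picture the cross-check made by the supervisory process is
to compare each server's measured offset with the median of all the
others. This is a sketch under assumptions, not the real supervisory
code: the use of a median, the function name, and the 50 ms tolerance
are all invented for the example.

```python
from statistics import median
from typing import Dict, List


def servers_to_flag(offsets_s: Dict[str, float],
                    tolerance_s: float = 0.05) -> List[str]:
    """Return the servers whose offset is far from the median of the rest.

    offsets_s maps a server name to its measured offset in seconds.
    tolerance_s is a hypothetical limit chosen for illustration.
    """
    flagged = []
    for name, offset in offsets_s.items():
        others = [v for k, v in offsets_s.items() if k != name]
        if not others:
            continue  # nothing to compare a lone server against
        if abs(offset - median(others)) > tolerance_s:
            flagged.append(name)
    return sorted(flagged)
```

With offsets of 1 ms, -2 ms, 0 ms and 250 ms for four servers, only the
last one would be flagged. A single lost or delayed packet can produce
the same symptom, which is one way a network glitch turns into a false
alarm.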
    The bottom line is that the worst that can happen is that some 
server is declared unhealthy, possibly by mistake, by a daemon 
process that got a little trigger-happy. The local server will get a 
chance to redeem itself on the next call to ACTS, which resets the 
flag to healthy if everything seems to be okay internally. If there is a 
persistent network error of some kind, then the supervisory daemon will 
set the
server unhealthy again on its next cycle and that will trigger yet 
another pager alarm.
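
The set-and-redeem cycle described above amounts to a small state
machine. The sketch below is illustrative only: the class and method
names are hypothetical, and the pager is reduced to a counter.

```python
class HealthFlag:
    """Toy model of one server's health flag and its pager alarms."""

    def __init__(self) -> None:
        self.healthy = True
        self.pages_sent = 0

    def supervisor_flags_unhealthy(self) -> None:
        # The supervisory daemon may do this at any time, even by mistake.
        self.healthy = False
        self.pages_sent += 1

    def acts_call_finished(self, internal_checks_ok: bool) -> None:
        # Each ACTS call resets the flag according to the internal checks.
        self.healthy = internal_checks_ok
        if not internal_checks_ok:
            self.pages_sent += 1
```

A mistaken flag therefore lasts only until the next good ACTS call,
while a persistent problem keeps producing one alarm per supervisory
cycle.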
      One of the reasons for this complex chain of processes is that 
I tend to add yet another process when I think of another
neat thing to do instead of modifying the existing processes to 
include this new neat thing. (In addition, many of these ideas
really originated in the software that controls the primary clock 
ensemble, and my left hand has shamelessly stolen code that
my right hand has written for another purpose.)
     Apart from the fancy mumbo jumbo, I think the real difference 
between our two approaches is that my system is willing
to set a system to unhealthy at any time whereas my understanding of 
NTP is that it will never do this once the clock has
been declared healthy following a start up. I think that is what 
pushes you to have NTP exit on a failure.

Best wishes,

Judah Levine




