[ntp:hackers] What to do when the offset is WAYTOOBIG
David L. Mills
mills at udel.edu
Thu Apr 19 18:24:52 PDT 2007
Your NIST servers, at least those I have checked, show only the ACTS
(lockclock) reference source and none of the other NIST servers.
However, late in your message you cite a scenario where multiple
external sources might indeed disagree. This is what the selection
algorithm is for.
There are two hypotheses:
1. The local clock hardware is defective, as determined by the
(adaptive) sigma threshold. The prudent course is to suspend service and
trigger the beeper.
2. One or more of the remote servers are defective, as determined by the
clock selection algorithm. If there is no majority clique, the prudent
course is to continue service but increase dispersion, so that dependent
clients can mitigate with other sources.
A NIST primary server might swing the likelihood ratio to favor (1), as
the ACTS service is ordinarily very robust and not affected by other
network traffic. However, a higher stratum server might swing the ratio
to favor (2) on the basis that Internet congestion and reliability are
much worse than telephone congestion and reliability. The same thing
might be advisable for primary servers using reference sources other
than ACTS.
The usual indication that something is drastically wrong with the
hardware is when the frequency pegs at +-500 PPM or when the wander
spikes. If so, select hypothesis (1). In the case of steps, all
chronometric data are purged and the client starts over from scratch. If
this cycle repeats, select hypothesis (1).
In all other cases where a majority clique exists or all sources have
become unreachable or non-selectable, select hypothesis (2).
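
As a rough illustration of the decision rule above, here is a minimal
Python sketch. The names ClockState, select_hypothesis, repeated_steps
and max_step_cycles are hypothetical and are not taken from the ntpd
sources; only the 500 PPM limit comes from the text. How a wander spike
is detected is left to the caller.

```python
from dataclasses import dataclass

MAX_FREQ_PPM = 500.0  # frequency limit named in the text


@dataclass
class ClockState:
    freq_ppm: float       # current frequency correction, in PPM
    wander_spike: bool    # True if the wander estimate has spiked (caller decides)
    repeated_steps: int   # consecutive step-and-restart cycles (hypothetical counter)


def select_hypothesis(state: ClockState, max_step_cycles: int = 1) -> int:
    """Return 1 (local hardware suspect) or 2 (remote servers suspect)."""
    pegged = abs(state.freq_ppm) >= MAX_FREQ_PPM
    # One step followed by a clean restart is tolerated; a repeat is not.
    step_loop = state.repeated_steps > max_step_cycles
    if pegged or state.wander_spike or step_loop:
        return 1  # suspend service and raise the alarm
    return 2      # continue service but increase dispersion
```

Under these assumptions a clock pegged at -500 PPM, or one that has
stepped twice in a row, lands in hypothesis (1); everything else falls
through to hypothesis (2).
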
We could add an adaptive threshold for the sigma, which in the NTP
design is estimated from the exponentially averaged first-order
frequency differences. This would be effective only if a single
reference clock was selected.
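
To make the idea concrete, the following Python sketch keeps an
exponentially averaged RMS of first-order frequency differences and
tests each new difference against a multiple of it. The class name
WanderEstimator, the averaging constant and the multiplier k are all
assumptions chosen for illustration; they are not the values or the
code used by ntpd.

```python
import math


class WanderEstimator:
    """Exponentially averaged RMS of first-order frequency differences."""

    def __init__(self, avg_const: float = 4.0, k: float = 3.0):
        self.avg_const = avg_const  # averaging constant (assumed value)
        self.k = k                  # threshold multiplier (assumed value)
        self.sigma = 0.0            # current wander estimate, in PPM
        self.prev_freq = None       # previous frequency sample, in PPM

    def update(self, freq_ppm: float) -> bool:
        """Feed one frequency sample; return True if it exceeds k * sigma."""
        if self.prev_freq is None:
            self.prev_freq = freq_ppm
            return False
        diff = freq_ppm - self.prev_freq
        self.prev_freq = freq_ppm
        # Test against the threshold before folding the new sample in,
        # so an outlier cannot raise the threshold it is judged by.
        exceeded = self.sigma > 0.0 and abs(diff) > self.k * self.sigma
        # Exponential average of the squared difference.
        var = self.sigma ** 2
        var += (diff ** 2 - var) / self.avg_const
        self.sigma = math.sqrt(var)
        return exceeded
```

Because the threshold tracks the clock's own recent behavior, a quiet
oscillator gets a tight limit and a noisy one a loose limit, which is
the sense in which the threshold is adaptive.
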
Judah Levine wrote:
>> I am watching five clocks. Three of them say 1200, two say 1300 and my
>> clock says 1400.
>> Since the majority of clocks I watch say 1200, I conclude the real
>> time is 1220, but that is beyond my panic limit of one hour.
> I would have looked at this differently. I would have evaluated the
> error of my clock from the times of the remote clocks I was monitoring
> using the sigma of my clock as a metric -- the average
> prediction/correction error over some previous time interval. Since
> the sigma of a typical system is on the order of milliseconds, I would
> have concluded that something is really broken here -- the prediction
> errors are hours, not milliseconds. I would not have concluded that
> the time was 1220, because I trust my local clock to be within some
> small multiple of its historical prediction error. That might not be
> correct, but it is my first-order working hypothesis. Based on the
> evidence at hand, I have no way of deciding who is right, except that
> something is clearly broken. So I set my clock to unhealthy and do not
> adjust it. If the problem really is in the remote clocks, then this
> strategy is optimum. If the problem is in my clock then I have limited
> the damage by telling my customers not to use it. (The act of setting
> the clock unhealthy triggers a pager alarm in the NIST servers, but
> that is outside of the scope of NTP).
>
> Since my strategy uses the historical prediction error of the local
> clock as a way of evaluating the responses of the remote systems, I
> only need to query a single external server. I accept its response if
> its time difference is within some reasonable value of what my
> historical sigma has been. My system would query a second server if
> this test fails, but that might not help here, since none of the
> queries would pass this test. The fact that a number of external
> servers agreed would not by itself override my sigma test. As I
> mentioned above, this situation would trigger an alarm.
>
> The weakness with my algorithm comes when the servers disagree by
> something on the order of my prediction sigma. That is sticky because
> I can't say for sure whether it is a glitch or a conforming event.
> Depending on the details, I can follow the wrong pied piper here.
> Judah Levine
> Time and Frequency Division
> NIST Boulder
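
The sigma test described in the quoted message can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the function name sigma_test, the multiplier k, and the convention of
passing offsets in query order are hypothetical, and the message does
not say what "some reasonable value" of sigma actually is.

```python
from typing import Optional, Sequence, Tuple


def sigma_test(offsets_s: Sequence[float], sigma_s: float,
               k: float = 5.0) -> Tuple[bool, Optional[float]]:
    """Return (healthy, accepted_offset) for offsets given in query order.

    The first offset within k * sigma of the local clock is accepted.
    If none passes, the clock is marked unhealthy and is not adjusted,
    no matter how many of the remote servers agree with each other.
    """
    limit = k * sigma_s  # k is an assumed multiplier, not from the message
    for offset in offsets_s:
        if abs(offset) <= limit:
            return True, offset
    return False, None
```

With the five-clock example and a sigma of 2 ms, the offsets are two
hours and one hour, so sigma_test([-7200.0, -7200.0, -7200.0, -3600.0,
-3600.0], 0.002) fails every query and the clock is left alone and
flagged unhealthy.
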