[ntp:hackers] What to do when the offset is WAYTOOBIG

David L. Mills mills at udel.edu
Thu Apr 19 18:24:52 PDT 2007


Judah,

Your NIST servers, at least those I have checked, show only the ACTS 
(lockclock) reference source and none of the other NIST servers. 
However, late in your message you cite a scenario where multiple 
external sources might indeed disagree. This is what the selection 
algorithm is for.

There are two hypotheses:

1. The local clock hardware is defective, as determined by the 
(adaptive) sigma threshold. The prudent course is to suspend service and 
trigger the beeper.

2. One or more of the remote servers are defective, as determined by the 
clock selection algorithm. If there is no majority clique, the prudent 
course is to continue service but increase dispersion, so that dependent 
clients can mitigate with other sources.

A NIST primary server might swing the likelihood ratio to favor (1), as 
the ACTS service is ordinarily very robust and not affected by other 
network traffic. However, a higher stratum server might swing the ratio 
to favor (2), on the basis that the Internet is more prone to congestion 
and less reliable than the telephone network. The same thing might be 
advisable for primary servers using reference sources other than ACTS.

The usual indication that something is drastically wrong with the 
hardware is when the frequency pegs at +-500 PPM or when the wander 
spikes. If so, select hypothesis (1). In the case of steps, all 
chronometric data are purged and the client starts over from scratch. If 
this cycle repeats, select hypothesis (1).

In all other cases where a majority clique exists or all sources have 
become unreachable or non-selectable, select hypothesis (2).
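
The selection rules above can be sketched in a few lines of Python. This 
is only an illustration of the decision logic as described in this 
message, not code from the NTP distribution; the names ClockState, 
Hypothesis and choose_hypothesis are hypothetical, and how a "wander 
spike" or a "repeated step cycle" is detected is assumed to be decided 
elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Hypothesis(Enum):
    LOCAL_CLOCK_DEFECTIVE = 1     # suspend service and trigger the beeper
    REMOTE_SERVERS_DEFECTIVE = 2  # continue service, increase dispersion


# Frequency limit quoted in the text; a pegged frequency sits at this bound.
FREQ_LIMIT_PPM = 500.0


@dataclass
class ClockState:
    """Hypothetical summary of the local discipline state."""
    freq_ppm: float        # current frequency correction, in PPM
    wander_spike: bool     # True if the wander estimate has spiked
    repeated_steps: bool   # True if the step/purge/restart cycle has repeated


def choose_hypothesis(state: ClockState) -> Hypothesis:
    """Pick hypothesis (1) on signs of hardware trouble, else hypothesis (2)."""
    if abs(state.freq_ppm) >= FREQ_LIMIT_PPM:
        return Hypothesis.LOCAL_CLOCK_DEFECTIVE
    if state.wander_spike or state.repeated_steps:
        return Hypothesis.LOCAL_CLOCK_DEFECTIVE
    # All other cases, including unreachable or non-selectable sources.
    return Hypothesis.REMOTE_SERVERS_DEFECTIVE
```

For example, a state with freq_ppm=-500.0 selects hypothesis (1), while 
a state with freq_ppm=12.3 and no spike or repeated steps selects 
hypothesis (2).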

We could add an adaptive threshold for the sigma, which in the NTP 
design is estimated from the exponentially averaged first-order 
frequency differences. This would be effective only if a single 
reference clock was selected.
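
A minimal sketch of such an adaptive threshold follows, assuming sigma 
is kept as the square root of an exponentially averaged squared 
first-order frequency difference. The averaging weight, the multiplier 
k, the warm-up count and the class name WanderEstimator are all 
assumptions made for illustration; none of them are taken from the NTP 
code.

```python
import math


class WanderEstimator:
    """Tracks sigma from first-order frequency differences (illustrative)."""

    def __init__(self, weight: float = 0.25, k: float = 5.0, warmup: int = 4):
        self.weight = weight    # assumed exponential averaging weight
        self.k = k              # assumed threshold multiplier on sigma
        self.warmup = warmup    # differences to absorb before testing
        self.variance = 0.0     # averaged squared frequency difference
        self.count = 0          # number of differences absorbed so far
        self.last_freq = None   # previous frequency sample, in PPM

    @property
    def sigma_ppm(self) -> float:
        return math.sqrt(self.variance)

    def update(self, freq_ppm: float) -> bool:
        """Feed one frequency sample; return True if it looks like a spike."""
        if self.last_freq is None:
            self.last_freq = freq_ppm
            return False
        diff = freq_ppm - self.last_freq
        self.last_freq = freq_ppm
        if self.count >= self.warmup and abs(diff) > self.k * self.sigma_ppm:
            # Do not fold the outlier into the average, so one spike
            # does not inflate the threshold for later samples.
            return True
        self.variance += self.weight * (diff * diff - self.variance)
        self.count += 1
        return False
```

Fed a frequency that alternates between 10.00 and 10.01 PPM, the 
estimator settles on a sigma of roughly 0.01 PPM, so a later jump to 
12.0 PPM exceeds k times sigma and is reported as a spike.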

Dave

Judah Levine wrote:

> Hello,
>
>> I am watching five clocks. Three of them say 1200, two say 1300 and my 
>> clock says 1400.
>> Since the majority of clocks I watch say 1200, I conclude the real 
>> time is 1220, but that is beyond my panic limit of one hour.
>
>
>     I would have looked at this differently. I would have evaluated 
> the error of my clock from the
> times of the remote clocks I was monitoring using the sigma of my 
> clock as a metric -- the average
> prediction/correction error over some previous time interval. Since 
> the sigma of a typical system
> is on the order of milliseconds, I would have concluded that something 
> is really broken here --
> the prediction errors are hours, not milliseconds. I would not have 
> concluded that the time was 1220,
> because I trust my local clock to be within some small multiple of its 
> historical prediction error.  That
> might not be correct, but it is my first-order working hypothesis. 
> Based on the evidence at hand, I
> have no way of deciding who is right, except that something is clearly 
> broken. So I set my clock
> to unhealthy and do not adjust it. If the problem really is in the 
> remote clocks, then this strategy is
> optimum. If the problem is in my clock then I have limited the damage 
> by telling my customers not
> to use it. (The act of setting the clock unhealthy triggers a pager 
> alarm in the NIST servers, but that
> is outside of the scope of NTP).
>     Since my strategy uses the historical prediction error of the 
> local clock as a way of evaluating
> the responses of the remote systems, I only need to query a single 
> external server. I accept its
> response if its time difference is within some reasonable value of 
> what my historical sigma has been. My
> system would query a second server if this test fails, but that might 
> not help here, since none of the queries
> would pass this test. The fact that a number of external servers 
> agreed would not by itself override my
> sigma test. As I mentioned above, this situation would trigger an alarm.
>     The weakness with my algorithm comes when the servers disagree by 
> something on the order of
> my prediction sigma. That is sticky because I can't say for sure 
> whether it is a glitch or a conforming
> event. Depending on the details, I can follow the wrong pied piper here.
>
> Judah Levine
> Time and Frequency Division
> NIST Boulder
>
>


