[ntp:questions] strange behaviour of ntp peerstats entries.

Unruh unruh-spam at physics.ubc.ca
Fri Jan 25 20:24:32 UTC 2008


Brian Utterback <brian.utterback at sun.com> writes:

>Unruh wrote:
>> "David L. Mills" <mills at udel.edu> writes:

>>> You might not have noticed a couple of crucial issues in the clock 
>>> filter code.
>> 
>> I did notice them all. Thus my caveat. However, throwing away 80% of the
>> precious data you have seems excessive.

>But it isn't really. It would be if there were no correlation between
>the delay and the error, but there is a correlation. If the sampling
>were completely random, then you would want to use all of the samples
>to determine the correct offset, by averaging or some such method.
>But since the error in the sample is correlated to the size of the
>delay, using samples with greater delay and thus greater error just
>increases the error of the final result, not lessens it. Since the
>clocks involved also slew between samples, we want to use the newest
>sample with the smallest delay.

I understand the reason for the decision, I am just very uncomfortable with
it.
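For concreteness, the selection rule under discussion can be sketched roughly
like this (a simplified model of the clock filter, my own sketch and not the
actual ntpd code; the window of 8 matches the shift register size):

```python
# Simplified sketch of the clock-filter idea under discussion (not ntpd's
# actual code): keep the last 8 (offset, delay) samples and trust only the
# sample with the smallest round-trip delay, on the theory that delay
# correlates with offset error.
from collections import deque

WINDOW = 8  # the shift register holds the last 8 samples

def filtered_offset(samples):
    """Pick the offset of the minimum-delay sample in the window."""
    window = deque(samples, maxlen=WINDOW)  # only the newest 8 survive
    best = min(window, key=lambda s: s[1])  # s = (offset, delay)
    return best[0]

# Example: the low-delay sample wins even though most samples disagree.
samples = [(0.010, 0.20), (0.003, 0.05), (0.012, 0.25), (0.011, 0.22)]
print(filtered_offset(samples))  # -> 0.003
```

Note that all the other offsets in the window contribute nothing at all to
the result, which is exactly the point at issue.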

Since part of the design goal of ntp is not to overload the net (by which
it seems to mean that one packet every minute is "too much"), it seems
pretty cavalier to then throw away something like 80% of the data you get.
While I certainly understand the design decision, it is not at all clear
that your model is correct. One model is that the excess delay is
always on the outgoing leg. Another is that it is random -- sometimes out,
sometimes in, sometimes both. This latter case IS statistical, and a very
brief test with ntp (in which I printed out p_offset and p_del as
well as what ntp does, so I can directly compare what happens) seems to
indicate that the noise introduced by delay on the one system I happened to
test is statistical, not biased. Clearly a longer-delay data point
is worth less, but it is surely not worthless, as ntp assumes.
This becomes especially critical if the clock drift jumps around, because
it is obviously impossible to see that if you are not looking.
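To make the "worth less but not worthless" point concrete, here is a toy
experiment of my own construction (an invented symmetric delay model, not a
measurement of ntp, and the inverse-delay weighting is ad hoc): when the
excess delay splits randomly between the two legs, the induced offset error
is zero-mean, so down-weighting long-delay samples rather than discarding
them can still extract information from them.

```python
# Toy experiment for the "worth less, not worthless" argument (my own
# construction, not ntp code): if the extra delay splits randomly between
# the two legs, the induced offset error is zero-mean, so a down-weighted
# average over all samples is also a usable estimator, not just the
# minimum-delay sample.
import random

random.seed(1)
TRUE_OFFSET = 0.0
BASE_DELAY = 0.010  # 10 ms symmetric base path (assumed)

def make_samples(n):
    samples = []
    for _ in range(n):
        extra = random.expovariate(1 / 0.020)  # random queueing delay
        frac = random.random()                 # fraction on the outgoing leg
        # Asymmetry between the two legs shifts the measured offset.
        offset = TRUE_OFFSET + (frac - 0.5) * extra
        samples.append((offset, BASE_DELAY + extra))
    return samples

def min_delay_estimate(samples):
    return min(samples, key=lambda s: s[1])[0]

def weighted_estimate(samples):
    # Weight each sample by the inverse of its excess delay (ad hoc choice).
    ws = [1 / (d - BASE_DELAY + 1e-3) for _, d in samples]
    return sum(w * o for w, (o, _) in zip(ws, samples)) / sum(ws)

samples = make_samples(64)
print(abs(min_delay_estimate(samples)), abs(weighted_estimate(samples)))
```

Both estimators land close to the true offset here; the interesting cases
are the ones where the minimum-delay sample is itself unlucky or stale.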

Note that the situation can arise that one waits many more than 8
samples for another usable one. Say sample i is a good one and remains the
best for the next 7 polls. Sample i+7 is slightly worse than sample i and
thus is not picked as it comes in. But the samples that follow are all worse
than it. Thus it remains the filtered one, yet it is never used because it
was not the best when it came in. This situation could persist for a long
time, meaning that ntp suddenly has no data to work with for many many
poll intervals. Surely using sample i+7 is far better than not using any
data for that length of time. And this could happen again. Now, since the
delays are presumably random variables, the chance of this happening is
not great (although under a gradually worsening network the
chance is not that small), but since one runs ntp for millions or
billions of samples, the chance of it happening sometime becomes large.
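The starvation scenario can be played out in a toy model. This is my own
sketch of the behaviour described above, under the simplified rule that a
sample is only used if it is the window minimum at the moment it arrives;
it is not the ntpd source:

```python
# Toy model of the starvation scenario (my reading of the behaviour, not
# ntpd code): a new sample is "used" only if it is the minimum-delay
# sample of the 8-sample window at the moment it arrives.
from collections import deque

def polls_without_fresh_data(delays):
    window = deque(maxlen=8)
    starved = 0
    longest = 0
    for d in delays:
        window.append(d)
        if d == min(window):   # best on arrival -> used
            starved = 0
        else:                  # the window minimum is a stale sample
            starved += 1
            longest = max(longest, starved)
    return longest

# A gradually worsening network: one good sample, then slowly rising delays.
delays = [0.010] + [0.011 + 0.001 * i for i in range(20)]
print(polls_without_fresh_data(delays))  # -> 20
```

On this monotonically worsening trace every later sample arrives worse than
something already in the window, so under this rule no fresh data is used
for the entire run.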


Another question: why was clock_phi chosen as 15 PPM? Is this a crude
attempt to disable this filtering once the poll period becomes comparable
to the Allan intercept? (I.e., with a poll interval of 60000 s, the current
sample will always be the one chosen.) Is there some theory behind this
choice of 15 PPM and behind the way in which the "metric" is aged, or was
it just a guess? I would expect that the "correct" figure would depend on
the actual delay spectrum of the connection.
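For what it's worth, the arithmetic behind the 60000 s remark. As I
understand the aging, each sample's metric grows by roughly clock_phi
seconds per second of age (treat the exact formula here as my assumption,
not a quote from the ntp source):

```python
# Back-of-the-envelope for the clock_phi remark. Each sample's metric is
# assumed to grow by CLOCK_PHI seconds per second of age, so with a long
# enough poll interval the aging term swamps any realistic delay
# difference and the newest sample always wins. (The exact metric is my
# assumption, not quoted from the ntp source.)
CLOCK_PHI = 15e-6   # 15 PPM

def metric(delay, age):
    return delay / 2 + CLOCK_PHI * age

poll = 60000                          # seconds
old = metric(delay=0.001, age=poll)   # very low delay, but one poll old
new = metric(delay=0.5, age=0)        # terrible delay, but fresh
print(old, new)   # aging term: 15e-6 * 60000 = 0.9 s, dwarfing both delays
assert new < old  # the fresh sample always wins at this poll interval
```

At that poll interval the 0.9 s aging penalty is an order of magnitude
larger than any plausible round-trip delay, so the filter degenerates to
"always take the newest sample".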

 




