[ntp:questions] strange behaviour of ntp peerstats entries.

Unruh unruh-spam at physics.ubc.ca
Mon Jan 28 02:19:44 UTC 2008


mayer at ntp.isc.org (Danny Mayer) writes:

>Unruh wrote:
>> mayer at ntp.isc.org (Danny Mayer) writes:
>> 
>>> Unruh wrote:
>>>> Brian Utterback <brian.utterback at sun.com> writes:
>>>>
>>>>> Unruh wrote:
>>>>>> "David L. Mills" <mills at udel.edu> writes:
>>>>>>> You might not have noticed a couple of crucial issues in the clock 
>>>>>>> filter code.
>>>>>> I did notice them all. Thus my caveat. However, throwing away 80% of the
>>>>>> precious data you have seems excessive.
>>>> Note that the situation can arise that one can wait many more than 8
>>>> samples for another one. Say sample i is a good one and remains the best
>>>> for the next 7 tries. Sample i+7 is slightly worse than sample i and thus
>>>> it is not picked as it comes in. But the following samples are all worse
>>>> than it. Thus it remains the filtered one, but is never used because it
>>>> was not the best when it came in. This situation could keep going for a
>>>> long time, meaning that ntp suddenly has no data to do anything with for
>>>> many, many poll intervals. Surely using sample i+7 is far better than not
>>>> using any data for that length of time.
>> 
>>> On the contrary, it's better not to use the data at all if it's suspect. 
>>> ntpd is designed to continue to work well even in the event of losing 
>>> all access to external sources for extended periods.
>> 
>>>> And this could happen again. Now, since the
>>>> delays are presumably random variables, the chances of this happening are
>>>> not great (although under conditions of a gradually worsening network the
>>>> chances are not that small), but since one is running ntp for millions or
>>>> billions of samples, the chances of this happening sometime become large.
>>>>
>> 
>>> There are quite a few ntpd servers which are isolated and once an hour 
>>> use ACTS to fetch good time samples. This is not rare at all.
>> 
>> And then promptly throw them away because they do not satisfy the minimum
>> condition? No, it is not "best" to throw away data no matter how suspect.
>> Data is a precious commodity and should be thrown away only if you are damn
>> sure it cannot help you. For example, let's say that the change in delay is
>> .1 of the variance of the clock. The max extra noise that delay can cause
>> is about .01. Yet NTP will chuck it. Now if the delay is 100 times the
>> variance, sure, chuck it. It probably cannot help you. The delay is a random
>> process, non-gaussian admittedly, and its effect on the time is also a
>> random process, usually much closer to gaussian. And why was the figure of
>> 8 chosen (the best of the last 8 tries)? Why not 10000, or 3? I suspect it
>> came off the top of someone's head: let's not throw away too much stuff,
>> since it would make ntp unusable, but let's throw away some to feel
>> virtuous. Sorry for being sarcastic, but I would really like to know what
>> the justification was for throwing so much data away.

>No, 8 was chosen after a lot of experimentation to ensure the best 
>results over a wide range of configurations. Dave has adjusted these 
>numbers over the years and he's the person to ask.


OK. The usual comment is that you throw away about 40% of the data using
the median filter (e.g. looking at the shm refclock program, where that
40% figure is attributed to him, and in ntp as well). But here one is
throwing away over 80% (i.e. keeping less than 1/6 of the data).
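
To be concrete about what I mean by "the best of the last 8": as I read it,
only the lowest-delay sample of the eight in the shift register is ever a
candidate, and it is only actually used if it arrived after the sample used
on the previous pass. The following is just a sketch of that reading, not
the real clock_filter() (which also weighs dispersion and jitter); the names
are mine:

/*
 * Sketch of the selection rule as I read it: the lowest-delay sample of
 * the last 8 is the only candidate, and it is used only if it is newer
 * than the sample used last time.  A low-delay sample that hangs around
 * therefore blocks all later, slightly worse samples from being used.
 */
#include <stddef.h>

#define NSTAGE 8

struct sample {
        double offset;
        double delay;
        double epoch;          /* arrival time of the sample */
};

/* return the chosen sample, or NULL if this poll yields nothing usable */
static const struct sample *
filter_select(const struct sample shift[NSTAGE], double last_used_epoch)
{
        const struct sample *best = &shift[0];
        int i;

        for (i = 1; i < NSTAGE; i++)
                if (shift[i].delay < best->delay)
                        best = &shift[i];

        if (best->epoch <= last_used_epoch)
                return NULL;   /* best sample was already used; discard */

        return best;
}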
Running a very quick test on one system on my lan, I find that this changes
the variance of the offsets by about 10%, i.e. it makes only a marginal
difference to the variance. (And yes, there is a fair amount of correlation
between the offset fluctuation and the delay fluctuation: correlation
coefficient about .5.) Actually the main thing this seems to do is to make
the variance in the delay times small, not the variance in the offset.
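
(In case anyone wants to reproduce the correlation number: I mean nothing
fancier than a plain Pearson correlation over the offset and delay columns.
A throwaway program along the lines of the one below does it, assuming the
peerstats data has already been cut down to two columns, offset then delay,
on stdin; the exact field numbers depend on your own layout.)

#include <stdio.h>
#include <math.h>

/*
 * Pearson correlation of offset vs delay, read as "offset delay" pairs
 * from stdin, e.g. something like: awk '{print $5, $6}' peerstats | ./a.out
 */
int
main(void)
{
        double off, del;
        double so = 0, sd = 0, soo = 0, sdd = 0, sod = 0;
        long n = 0;

        while (scanf("%lf %lf", &off, &del) == 2) {
                so += off;  sd += del;
                soo += off * off;  sdd += del * del;
                sod += off * del;
                n++;
        }
        if (n < 2)
                return 1;

        double cov = sod / n - (so / n) * (sd / n);
        double vo  = soo / n - (so / n) * (so / n);
        double vd  = sdd / n - (sd / n) * (sd / n);

        printf("n=%ld  r=%.3f\n", n, cov / sqrt(vo * vd));
        return 0;
}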

I am also a little bit surprised that it is the delay that is used and not
the total roundtrip time. As I read it, the delay is (t4-t3+t2-t1), i.e. it
does not take into account the delay within the far machine (as the total
roundtrip t4-t1 would), but only the propagation delay. I would expect that
the former might even be more important than the latter, but that is a pure
guess, i.e. no measurements on even one system to back it up.
Now it may be that on that rocky road to Manila the propagation delay is by
far the most important, but on a modern lan, especially with a low
propagation delay of hundreds of usec rather than hundreds of msec, I wonder.
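
To spell out what I mean by the two quantities (this is just the standard
on-wire arithmetic as I read it, with t1 the client transmit, t2 the server
receive, t3 the server transmit and t4 the client receive time):

/*
 * delay  = (t4 - t1) - (t3 - t2) = t4 - t3 + t2 - t1
 * offset = ((t2 - t1) + (t3 - t4)) / 2
 *
 * so the time spent inside the far machine (t3 - t2) is subtracted out
 * and only the propagation component survives, whereas the raw round
 * trip t4 - t1 would include it.
 */
static void
onwire(double t1, double t2, double t3, double t4,
       double *delay, double *offset)
{
        *delay  = (t4 - t1) - (t3 - t2);
        *offset = ((t2 - t1) + (t3 - t4)) / 2.0;
}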

I munged ntp's record_peer_stats to also print out p_off and p_del (i.e.
the immediate offset and delay of the current packet) and counted up in the
output how often peer->off and p_off differ from each other, indicating a
thrown-away packet of data. They differed 83% of the time.
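
The counting step itself was nothing clever; a sketch of the kind of thing
that works is below, assuming the munged line carries both the filtered
offset and the packet's p_off in known columns (FILT_COL and RAW_COL are
placeholders here; set them to whatever your own modification prints):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define FILT_COL 5      /* assumed column of the filtered offset */
#define RAW_COL  9      /* assumed column of the per-packet p_off */

int
main(void)
{
        char line[512];
        long total = 0, differ = 0;

        while (fgets(line, sizeof(line), stdin) != NULL) {
                double filt = 0, raw = 0;
                int col = 0;
                char *tok = strtok(line, " \t\n");

                while (tok != NULL) {
                        col++;
                        if (col == FILT_COL)
                                filt = atof(tok);
                        else if (col == RAW_COL)
                                raw = atof(tok);
                        tok = strtok(NULL, " \t\n");
                }
                if (col < RAW_COL)
                        continue;       /* short or garbled line */
                total++;
                if (fabs(filt - raw) > 1e-9)
                        differ++;       /* the current packet was not used */
        }
        if (total > 0)
                printf("%ld of %ld polls (%.0f%%) did not use the current packet\n",
                       differ, total, 100.0 * differ / total);
        return 0;
}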




