[ntp:questions] ntpd wedged again

A C agcarver+ntp at acarver.net
Mon Feb 13 21:43:00 UTC 2012


On 2/13/2012 01:53, David Lord wrote:
> A C wrote:
>> On 2/12/2012 16:38, David Lord wrote:
>>> A C wrote:
>>>> I'm not exactly sure what happened but I may have caught the system in
>>>> the act of trying to run away. Below is the last two lines from the
>>>> peers log for one of the peers and below that is the output of ntpq
>>>> showing what happened to that peer immediately after. Note the offset.
>>>> This somewhat matches what I had been seeing previously where my
>>>> system would be reporting sane numbers and then very suddenly go
>>>> haywire.
>>>>
>>>> 55969 77832.471 67.23.181.241 9324 0.006359155 0.081262776 0.001634487
>>>> 0.008837966
>>>> 55969 77897.557 67.23.181.241 9324 399326.864663492 0.000122070
>>>> 0.001346708 399326.861631493
>>>>
>>>>
>>>> remote refid st t when poll reach delay offset jitter
>>>>
>>>> ==============================================================================
>>>>
>>>>
>>>> 127.127.22.0 .PPS. 0 l 5 16 377 0.000 0.095 0.122
>>>> 127.127.28.0 .GPSD. 4 l 36 128 377 0.000 6.399 10.748
>>>> -67.223.228.71 18.26.4.105 2 u 4 64 377 95.580 -3.863 14.410
>>>> +74.118.152.85 69.36.224.15 2 u 20 64 377 41.318 0.492 3.692
>>>> 67.23.181.241 128.4.1.1 2 u 53 64 377 0.122 3993268 3993268
>>>> +130.207.165.28 130.207.244.240 2 u 7 64 377 77.872 2.837 0.328
>>>> *131.144.4.10 130.207.244.240 2 u 1 64 377 84.182 1.434 10.710
>>>>
>>>>
>>>> After a couple minutes things went back to normal as if nothing had
>>>> happened:
>>>>
>>>> 55969 78417.519 67.23.181.241 9024 0.005904355 0.081140840 0.005271221
>>>> 0.000546629
>>>> 55969 78545.470 67.23.181.241 9324 0.006413290 0.081305361 0.002089938
>>>> 0.011478833
>>>>
>>>>
>>>> remote refid st t when poll reach delay offset jitter
>>>> ==============================================================================
>>>>
>>>>
>>>> 127.127.22.0 .PPS. 0 l 15 16 377 0.000 0.092 0.122
>>>> 127.127.28.0 .GPSD. 4 l 94 128 377 0.000 11.432 8.820
>>>> -67.223.228.71 18.26.4.105 2 u 58 64 377 95.580 -3.863 8.613
>>>> +74.118.152.85 69.36.224.15 2 u 9 64 377 41.034 0.488 3.691
>>>> -67.23.181.241 128.4.1.1 2 u 44 64 377 81.141 5.904 0.547
>>>> +130.207.165.28 130.207.244.240 2 u 62 64 377 78.036 3.062 0.628
>>>> *131.144.4.10 130.207.244.240 2 u 59 64 377 84.182 1.434 10.717
>>>>
>>>>
>>>> It's the fist time that I've actually been able to see this happen
>>>> while staring at the monitor. I'm not certain but I suspect that the
>>>> last few times this happened where the system did run away the peer
>>>> that experienced the strange behavior happened to also be the selected
>>>> system peer. I can't explain the behavior though because everything
>>>> looks fine on my system otherwise and the remote server also seems to
>>>> be fine given that its offset was normal just before the update.
>>>
>>> What remote server settings do you have?
>>>
>>> The distances from your sources also don't fit with
>>> having a short poll interval unless you know for
>>> certain the routes are symmetric and with constant
>>> delay.
>>>
>>> Apart from that other sites have their own problems
>>> so when any of my selected sources goes bad it is
>>> usually replaced with a better source.
>>>
>>> Ntpd usually handles the bad sources fairly well.
>>>
>>
>> I posted the entire configuration in my previous email. Other than
>> iburst I have no settings for the servers. No min or max poll. I'm not
>> selecting the 64 second poll interval, ntpd is.
>
> That's normal for ntpd starting up and if there is no good
> drift file it can take many hours to establish. As the drift
> value is adjusted the poll intervals normally increase and
> finer adjustments are made. With a stable temperature I see
> poll intervals of 1024.


It has been over 24 hours now of ntpd running (hooray, new record) but 
the poll times are still at 64.  The refclock seems to be clamping it 
but I don't know why.  The PPM adjustment has been fairly stable for the 
last 12 hours.  It hovers around -78.40 and drifts up and down by a few 
pp-billion as the room temperature changes.

>>
>> In this case ntpd successfully deselected the errant source and then
>> accepted it again after the strange behavior was noted. But I contend
>> that this may have happened with the system peer and ntpd may nto have
>> been so graceful about the sudden jump.
>
> I use 'tos minsane 3' in my ntp.conf and have occasionally
> seen my stratum jump to 16 when there has been a problem
> with my internet connection.

A stratum change I can see if something has gone wrong but this is not 
just a stratum change but a complete skewing of the clock by many 
seconds.  Initially it appears the skew is just an error internal to 
ntpd where it only *thinks* the clock is off by many seconds.  However, 
my suspicion is that if this error happens to fall upon the clock that 
has been christened the system peer then it causes havoc and really 
*does* skew the system clock to try and compensate for a perceived 
drift.  Unfortunately that sends the algorithm off the rails and the 
entire clock ends up slammed in one direction ticking way off as is 
evidenced by restarting ntpd shortly after the fatal flaw and observing 
it forcing a clock step of many seconds.

>
>>
>> PS: I get a fuzz report once every two hours, is this what is supposed
>> to be in the code or is it supposed to report fuzz when it is detected?
>
> Sorry I've never had a fuzz report. If you are running
> ntp-dev it may have code that gives extra information.
>

Yeah, Dave hart already described the fuzz code to me.  I was more 
curious if it was a cyclic thing or if it was reported on an as detected 
basis.


More information about the questions mailing list