[ntp:questions] ntpd wedged again

A C agcarver+ntp at acarver.net
Sat Feb 11 19:02:24 UTC 2012


On 2/11/2012 06:51, Dave Hart wrote:
> On Sat, Feb 11, 2012 at 09:21, A C<agcarver+ntp at acarver.net>  wrote:
>> So ntpd has been behaving reasonably well with the snprintf fix.  I had good
>> results with only internet servers.  My PPS and SHM refclocks were set to
>> noselect.
>>
>> I removed the noselect on the PPS refclock and left flag3 set to zero (no
>> kernel discipline).
>>
>> Everything seemed fine and then:
>>
>>> Sat Feb 11 01:12:10 PST 2012
>>>      remote           refid      st t when poll reach   delay   offset   jitter
>>> ================================================================================
>>> x127.127.22.0    .PPS.            0 l    -   16  377    0.000  -111.40  351.464
>>>  127.127.28.0    .GPSD.           4 l   49  128  377    0.000  -14655.  2814.64
>>>  169.229.70.201  169.229.128.214  3 u  103  512  377   39.347  -9274.2  6597.61
>>>  72.14.179.211   127.67.113.92    2 u   79  512  377   57.746  -14699.  10685.0
>>>  24.124.0.251    132.236.56.250   3 u  521  512  377   77.930  -9835.0  7451.10
>>>  130.207.165.28  130.207.244.240  2 u  153  512  377   79.131  -9155.6  6554.15
>>>  131.144.4.10    130.207.244.240  2 u  142  512  377   86.537  -9102.3  6526.3
>
> Did you forget to mention you commented out the NMEA refclock at the
> same time you removed noselect from the atom/PPS and SHM drivers?
>
> I am a bit tired right now, so forgive me for latching onto a nit
> rather than the juicy part, but I want to be as clear as possible.
> You say everything was fine until you made some changes, without
> specifying the previous state, and when I try to infer what that
> earlier state was based on the two changes, I'm left with a setup with
> no refclocks, which is obviously not particularly comparable.  I'm
> also hesitating to point a finger at the gpsd+SHM combo, particularly
> because I suspect it's racy especially on non-x86 systems and have on
> my to-do list rewriting it to use a safer shared memory access
> protocol...
>
> So first, let's be clear about what you're reporting.  Was the change
> from 3 refclock drivers with 2 marked noselect to 2 selectable
> drivers?

No problem.  SHM has been disabled by noselect for a while.  It is still
disabled by noselect (but not commented out, so I can still observe its
relative offset).  During this week's snprintf testing, ATOM was also
disabled by noselect (again so I could continue to observe its relative
offset), which left only the internet servers (five total) as my time
sources.

For an entire week I ran with ATOM and SHM in noselect and things looked
fine.  Offsets for all internet servers settled down to 1-2 ms and the
reported ATOM offset also stayed in that same range without straying
away (again, this is only the reported offset; the clock wasn't being
used because it was still noselect).

I removed the noselect from ATOM only (not SHM) so now I had the 
internet servers (five) plus ATOM.  Everything looked fine for a few 
hours after I restarted ntpd with ATOM enabled again (allowed to be 
selected).  But after a few hours, the clock went crazy and started 
slewing very quickly.  When I restarted ntpd, it had to step the clock 
backwards by 16.6 seconds to bring it into agreement.  The clock gained 
16 seconds in a matter of about 5 minutes (the amount of time I let ntpd 
run in this crazy state).
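
As a rough sanity check on those figures, the sketch below converts the
16.6 second step and the roughly 5 minute window into an average rate
error in parts per million.  It assumes the clock was close to correct
when the runaway began and that "about 5 minutes" means 300 seconds;
both are approximations, and the variable names are only illustrative.
The 500 ppm figure is ntpd's documented maximum slew rate, included
for comparison.

```python
# Figures taken from the message: a 16.6 s backward step was needed
# after ntpd had run in the bad state for about 5 minutes.
step_seconds = 16.6
elapsed_seconds = 5 * 60  # assumption: "about 5 minutes" taken as 300 s

# Average rate error implied by those two numbers, in parts per million.
implied_ppm = step_seconds / elapsed_seconds * 1e6

# ntpd limits its own slewing to 500 ppm.
MAX_SLEW_PPM = 500.0

print(f"implied rate error: {implied_ppm:.0f} ppm")
print(f"ratio to the 500 ppm slew limit: {implied_ppm / MAX_SLEW_PPM:.0f}x")
```

Under those assumptions the implied rate is on the order of 55,000 ppm,
far beyond anything ntpd's normal slewing could produce, so the gain
is unlikely to be ordinary discipline-loop behavior.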


