[ntp:questions] ntpd wedged again

unruh unruh at invalid.ca
Sat Feb 11 20:09:16 UTC 2012


On 2012-02-11, A C <agcarver+ntp at acarver.net> wrote:
> On 2/11/2012 06:51, Dave Hart wrote:
>> On Sat, Feb 11, 2012 at 09:21, A C <agcarver+ntp at acarver.net> wrote:
>>> So ntpd has been behaving reasonably well with the snprintf fix.  I had good
>>> results with only internet servers.  My PPS and SHM refclocks were set to
>>> noselect.
>>>
>>> I removed the noselect on the PPS refclock and left flag3 set to zero (no
>>> kernel discipline).
>>>
>>> Everything seemed fine and then:
>>>
>>>> Sat Feb 11 01:12:10 PST 2012
>>>>      remote           refid      st t when poll reach   delay   offset   jitter
>>>> ==============================================================================
>>>> x127.127.22.0    .PPS.            0 l    -   16  377    0.000  -111.40  351.464
>>>>  127.127.28.0    .GPSD.           4 l   49  128  377    0.000  -14655.  2814.64
>>>>  169.229.70.201  169.229.128.214  3 u  103  512  377   39.347  -9274.2  6597.61
>>>>  72.14.179.211   127.67.113.92    2 u   79  512  377   57.746  -14699.  10685.0
>>>>  24.124.0.251    132.236.56.250   3 u  521  512  377   77.930  -9835.0  7451.10
>>>>  130.207.165.28  130.207.244.240  2 u  153  512  377   79.131  -9155.6  6554.15
>>>>  131.144.4.10    130.207.244.240  2 u  142  512  377   86.537  -9102.3   6526.3
>>
>> Did you forget to mention you commented out the NMEA refclock at the
>> same time you removed noselect from the atom/PPS and SHM drivers?
>>
>> I am a bit tired right now, so forgive me for latching onto a nit
>> rather than the juicy part, but I want to be as clear as possible.
>> You say everything was fine until you made some changes, without
>> specifying the previous state, and when I try to infer what that
>> earlier state was based on the two changes, I'm left with a setup with
>> no refclocks, which is obviously not particularly comparable.  I'm
>> also hesitating to point a finger at the gpsd+SHM combo, particularly
>> because I suspect it's racy especially on non-x86 systems and have on
>> my to-do list rewriting it to use a safer shared memory access
>> protocol...
>>
>> So first, let's be clear about what you're reporting.  Was the change
>> from 3 refclock drivers with 2 marked noselect to 2 selectable
>> drivers?
>
> No problem.  SHM has been disabled by noselect for a while.  It is still 
> currently disabled by noselect (but not commented out so I can still 
> observe its relative offset).  During the snprintf testing from this 
> week, ATOM has also been disabled by noselect (also so I could continue 
> to observe its relative offset) so I was left with only the internet 
> servers (five total) as my time sources.
>
> For an entire week I ran with ATOM and SHM in noselect and things looked 
> fine.  Offsets for all internet servers settled down to 1-2ms and the 
> reported ATOM offset also stayed in that same range without straying 
> away (again, this is reported offset but the clock wasn't being used 
> because it was still noselect).
>
> I removed the noselect from ATOM only (not SHM) so now I had the 
> internet servers (five) plus ATOM.  Everything looked fine for a few 
> hours after I restarted ntpd with ATOM enabled again (allowed to be 
> selected).  But after a few hours, the clock went crazy and started 
> slewing very quickly.  When I restarted ntpd, it had to step the clock 
> backwards by 16.6 seconds to bring it into agreement.  The clock gained 
> 16 seconds in a matter of about 5 minutes (the amount of time I let ntpd 
> run in this crazy state).

16 sec in 5 min is over 50,000 PPM (16/300 is about 5.3%). It is hard to
see how ntpd could do that unless it was stepping like mad (one of the
problems with the highly non-linear stepping that ntpd likes to do). It is
possible to make the clock slew at that rate with adjtimex, using the tick
value adjustment, but ntpd does not do that (chrony does use it).
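
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The function names are made up for illustration, the 16 s and
5 min figures are the approximate ones quoted above, and the 500 PPM
constant is ntpd's documented maximum slew rate, not something measured
on this system.

```python
# Rough drift-rate arithmetic; function names are illustrative only
# and are not part of ntpd or any NTP tool.

NTPD_MAX_SLEW_PPM = 500.0  # ntpd's documented maximum slew rate


def drift_ppm(gained_seconds, elapsed_seconds):
    """Return the average clock rate error in parts per million."""
    return gained_seconds / elapsed_seconds * 1e6


def seconds_to_slew(offset_seconds, slew_ppm=NTPD_MAX_SLEW_PPM):
    """Return how long slewing at slew_ppm takes to absorb offset_seconds."""
    return offset_seconds / (slew_ppm * 1e-6)


if __name__ == "__main__":
    # Approximate figures from the report: 16 s gained in about 5 minutes.
    rate = drift_ppm(16.0, 5 * 60)
    print(f"observed rate: {rate:.0f} PPM")
    print(f"ratio to slew limit: {rate / NTPD_MAX_SLEW_PPM:.0f}x")
    print(f"time to slew 16 s at the limit: {seconds_to_slew(16.0) / 3600:.1f} h")
```

The observed rate comes out at roughly 53,000 PPM, about a hundred times
the slew limit, and removing 16 s by slewing at the limit would take
close to nine hours. That is why an ordinary slew seems an unlikely
explanation for what was seen.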
 


