[ntp:questions] ntpd wedged again
unruh at invalid.ca
Sat Feb 11 20:09:16 UTC 2012
On 2012-02-11, A C <agcarver+ntp at acarver.net> wrote:
> On 2/11/2012 06:51, Dave Hart wrote:
>> On Sat, Feb 11, 2012 at 09:21, A C<agcarver+ntp at acarver.net> wrote:
>>> So ntpd has been behaving reasonably well with the snprintf fix. I had good
>>> results with only internet servers. My PPS and SHM refclocks were set to
>>> noselect.
>>> I removed the noselect on the PPS refclock and left flag3 set to zero (no
>>> kernel discipline).
>>> Everything seemed fine and then:
>>>> Sat Feb 11 01:12:10 PST 2012
>>>>  remote          refid           st t when poll reach   delay   offset
>>>> x127.127.22.0    .PPS.            0 l    -   16   377   0.000  -111.40
>>>>  127.127.28.0    .GPSD.           4 l   49  128   377   0.000  -14655.
>>>>  18.104.22.168   22.214.171.124   3 u  103  512   377  39.347  -9274.2
>>>>  126.96.36.199   127.67.113.92    2 u   79  512   377  57.746  -14699.
>>>>  188.8.131.52    184.108.40.206   3 u  521  512   377  77.930  -9835.0
>>>>  220.127.18.11   18.104.22.168    2 u  153  512   377  79.131  -9155.6
>>>>  22.214.126.124  126.96.188.199   2 u  142  512   377  86.537  -9102.3
>> Did you forget to mention you commented out the NMEA refclock at the
>> same time you removed noselect from the atom/PPS and SHM drivers?
>> I am a bit tired right now, so forgive me for latching onto a nit
>> rather than the juicy part, but I want to be as clear as possible.
>> You say everything was fine until you made some changes, without
>> specifying the previous state, and when I try to infer what that
>> earlier state was based on the two changes, I'm left with a setup with
>> no refclocks, which is obviously not particularly comparable. I'm
>> also hesitating to point a finger at the gpsd+SHM combo, particularly
>> because I suspect it's racy especially on non-x86 systems and have on
>> my to-do list rewriting it to use a safer shared memory access
>> So first, let's be clear about what you're reporting. Was the change
>> from 3 refclock drivers with 2 marked noselect to 2 selectable
> No problem. SHM has been disabled by noselect for a while. It is still
> currently disabled by noselect (but not commented out so I can still
> observe its relative offset). During the snprintf testing from this
> week, ATOM has also been disabled by noselect (also so I could continue
> to observe its relative offset) so I was left with only the internet
> servers (five total) as my time sources.
> For an entire week I ran with ATOM and SHM in noselect and things looked
> fine. Offsets for all internet servers settled down to 1-2ms and the
> reported ATOM offset also stayed in that same range without straying
> away (again, this is reported offset but the clock wasn't being used
> because it was still noselect).
> I removed the noselect from ATOM only (not SHM) so now I had the
> internet servers (five) plus ATOM. Everything looked fine for a few
> hours after I restarted ntpd with ATOM enabled again (allowed to be
> selected). But after a few hours, the clock went crazy and started
> slewing very quickly. When I restarted ntpd, it had to step the clock
> backwards by 16.6 seconds to bring it into agreement. The clock gained
> 16 seconds in a matter of about 5 minutes (the amount of time I let ntpd
> run in this crazy state).
16 sec in 5 min is roughly 53,000 PPM. It is hard to see how ntpd could do
that, unless it was stepping like mad (one of the problems with the highly
non-linear stepping that ntp likes to do). It is possible to make the
clock slew at that rate by using adjtimex, the tick value adjustment, but
ntpd does not do that (chrony does use it).
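
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The function names are made up for illustration, and the 500 PPM
figure is ntpd's usual maximum slew rate, which I am assuming applies to
this build. It only reproduces the numbers reported above: a 16 s gain
over 5 minutes, and the 16.6 s step that was needed on restart.

```python
# Back-of-the-envelope check of the drift rate discussed in this thread.
# Assumption: ntpd's maximum slew rate is the usual 500 PPM.
# The function names are hypothetical, chosen only for this sketch.

NTPD_MAX_SLEW_PPM = 500.0


def rate_ppm(gained_seconds, elapsed_seconds):
    """Clock error accumulated per elapsed second, in parts per million."""
    return gained_seconds / elapsed_seconds * 1e6


def seconds_to_slew(offset_seconds, slew_ppm=NTPD_MAX_SLEW_PPM):
    """Wall-clock time needed to remove an offset at a constant slew rate."""
    return offset_seconds / (slew_ppm * 1e-6)


if __name__ == "__main__":
    observed = rate_ppm(16.0, 5 * 60)  # 16 s gained in about 5 minutes
    print("observed rate: %.0f PPM" % observed)
    print("ratio to max slew: %.0fx" % (observed / NTPD_MAX_SLEW_PPM))
    hours = seconds_to_slew(16.6) / 3600.0  # the 16.6 s step on restart
    print("hours to slew 16.6 s at 500 PPM: %.1f" % hours)
```

The observed rate comes out more than a hundred times the slew limit, and
removing 16.6 s by slewing alone at 500 PPM would take on the order of
nine hours, so ordinary slewing cannot account for a 16 s gain in 5
minutes.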