[ntp:questions] Strange time instability (ntpd 4.2.2p3)

linux at horizon.com linux at horizon.com
Thu Oct 26 18:54:28 UTC 2006


I've been playing with recent Linux kernels, and the standard kernel
just upgraded to the NTP nanokernel, so there's been a significant
overhaul and a chance for bugs to creep in.

But before I go pointing fingers over there, I'd like to understand some
bizarre data in the stats files.

I've got ntpd 4.2.2p3 with three time sources: refclock #29 (Palisade),
refclock #22 (PPS), and a LAN peer with its own GPS clock.  All give
consistent offsets in the peerstats file.

The peerstats offset starts out at about 200 to 350 us, follows an
exponential decay curve over 1100-1200 seconds to -200 to -350 us, then,
before it totally flattens out, starts heading back toward positive error.

But what's weird is that, even though there's a healthy and valid
PPS peer, the offset in the loopstats file does something quite
different.

It starts out in the same place, but then exponentially decays to 0.
Then it jumps up to match the peerstats overshoot, and they both
abruptly reverse direction.

Here's an attempted rendering in ASCII art.  The o points are from
peerstats, and the x points are from loopstats, and the *s are where
they coincide:

*                     *                     *
                   o                     o   
 x               o     x               o     
                o                     o      
 ox            o       ox            o       
   x                     x
     x        o            x        o        
__o____ x_x_____________o____ x_x____________
             o     x x             o     x x 
   o            x        o            x      
              x                     x
    o       ox            o       ox         
     o                     o                 
      o     x               o     x          
        o                     o              
           *                     *           

(There are, of course, a lot more points on each curve in reality.)

Also unlike my picture, the amplitudes are rather chaotic.
And occasionally, the peerstats curve decays more slowly than
the loopstats one, so there's a second pulse in the same direction.

But in all cases, the loopstats offset starts out matching the peerstats
one, then decays to near zero while the peerstats offset does something
similar, but with what appears to be a non-zero linear term added.

Then it notices that the peer offset is different and jumps to match it.
Repeat ad nauseam.
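
One way to quantify the decoupling described above is to subtract, for each
loopstats sample, the most recent peerstats offset of a chosen peer. The
following is a minimal sketch, not something from the original post. It
assumes the usual field order (loopstats: MJD, seconds past midnight, offset;
peerstats: MJD, seconds past midnight, peer address, status, offset) and that
both files are in chronological order. The function names are hypothetical.

```python
from bisect import bisect_right


def parse_loopstats(lines):
    """Return [(time_s, offset_s)] from loopstats lines.

    Assumed field order: MJD, seconds past midnight, offset (s), ...
    """
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        samples.append((int(fields[0]) * 86400 + float(fields[1]),
                        float(fields[2])))
    return samples


def parse_peerstats(lines, addr):
    """Return [(time_s, offset_s)] for one peer address from peerstats lines.

    Assumed field order: MJD, seconds, address, status, offset (s), ...
    """
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[2] != addr:
            continue  # other peers, or malformed lines
        samples.append((int(fields[0]) * 86400 + float(fields[1]),
                        float(fields[4])))
    return samples


def divergence(loop, peer):
    """Pair each loop sample with the latest peer sample at or before it.

    Returns [(time_s, loop_offset - peer_offset)]. Both inputs must be
    sorted by time, which holds for stats files written in order.
    """
    peer_times = [t for t, _ in peer]
    result = []
    for t, loop_offset in loop:
        i = bisect_right(peer_times, t) - 1
        if i < 0:
            continue  # no peer sample yet at this time
        result.append((t, loop_offset - peer[i][1]))
    return result
```

For the PPS refclock (type 22) the peer address would be of the form
127.127.22.u. If the call graph further down is read correctly, the
differences should stay near zero whenever the PPS peer is selected.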


Now, I can see kernel misbehaviour giving the ntpd control loop fits
trying to stabilize it, but given that *all* of the peers unanimously
agree on the local clock offset, how does the loopstats offset get
decoupled from the peers like that?

Currently, even if the kernel is completely insane, I don't see how
the loopstats offset field can do that.

The call graph appears to be (ntpd/ntp_proto.c and a bit of ntp_loopfilter.c)

process_packet
-> clock_filter(peer, p_offset)
   - Find best offset
   - Set peer->offset
   - Popcorn noise filter
   -> clock_select
      - Make list of peers
      - Intersection
      - Clustering
      - Find sys_prefer, and sys_pps (if any)
      - If sys_pps
        - sys_peer = sys_pps; sys_offset = sys_pps->offset
      - Else if sys_prefer
        - sys_peer = sys_prefer; sys_offset = sys_prefer->offset
      - Else
        - Find sys_peer (with clock-hopping hold-down)
        - Combining to set sys_offset
      -> clock_update
         -> local_clock(sys_offset)
            -> record_loop_stats(offset)
-> record_peer_stats(peer->offset)

That sure looks like, assuming there is a sys_pps peer, sys_offset must be
sys_pps->offset, and they should never be different.  Even ignoring the
pps logic, there are only three peers configured, and they all agree on
what time it is.  How can record_loop_stats log an offset that doesn't
correspond to *any* of the peers?
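
The priority order read off the call graph can be written down as a toy
model. This is a sketch of that reading only, not ntpd's code: the Peer
class, its fields, and select_sys_offset are hypothetical names, and the
final branch uses a plain mean and the first survivor as placeholders for
ntpd's actual combining and sys_peer selection.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Peer:
    name: str
    offset: float            # seconds, as set by clock_filter
    is_pps: bool = False     # hypothetical flag: this peer is sys_pps
    is_prefer: bool = False  # hypothetical flag: this peer is sys_prefer


def select_sys_offset(survivors: Sequence[Peer]) -> Tuple[Peer, float]:
    """Return (sys_peer, sys_offset) using the priority in the call graph."""
    if not survivors:
        raise ValueError("no survivors to select from")

    # 1. A PPS peer, if present, wins outright.
    pps = next((p for p in survivors if p.is_pps), None)
    if pps is not None:
        return pps, pps.offset

    # 2. Otherwise a preferred peer wins outright.
    prefer = next((p for p in survivors if p.is_prefer), None)
    if prefer is not None:
        return prefer, prefer.offset

    # 3. Otherwise combine. Placeholder: unweighted mean, first survivor
    #    as sys_peer. ntpd's real combining step is more involved.
    mean = sum(p.offset for p in survivors) / len(survivors)
    return survivors[0], mean
```

In every branch of this model the result lies within the range of the
survivors' offsets, which is why a loopstats offset matching none of three
agreeing peers is surprising.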

Maxpoll for all of them is clamped very low, so 1000 seconds is far
longer than a sample can live in the history buffer.
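
The arithmetic behind that claim can be sketched as follows. It assumes an
8-stage clock filter shift register (the value of NTP_SHIFT in ntpd) and one
new sample per poll interval of 2**poll seconds. The post does not state the
actual maxpoll setting, so the exponents used below are illustrative only.

```python
FILTER_STAGES = 8  # assumed size of the clock filter shift register


def max_sample_age(poll_exponent, stages=FILTER_STAGES):
    """Seconds until a sample is shifted out of the filter.

    Assumes exactly one new sample arrives every 2**poll_exponent seconds.
    """
    return stages * (1 << poll_exponent)


# Illustrative poll exponents (not taken from the post): 4 is 16 s, 6 is 64 s.
for exponent in (4, 6):
    print(exponent, max_sample_age(exponent))
```

Under these assumptions a sample lasts 128 s at poll exponent 4 and 512 s at
poll exponent 6, both well short of the roughly 1000-second cycle.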

So what is going on here?
Advice is appreciated.

Linux 2.6.19-rc2, amd64 uniprocessor, ntp 4.2.2p3, Acutime 2000 GPS
receiver.


