[ntp:questions] NTP clients not syncing up to servers?

Ted Beatie ted at permabit.com
Fri Oct 7 18:10:53 UTC 2005


I realize that subject is a bit vague.  Unfortunately, the more I dive
into the documentation for NTP, the more I'm convinced that it's magic.
This message will be long, but I'm not sure what information is or is
not relevant.

What we're trying to do;

  We deploy systems at customer facilities that have both "gateway" and
  "storage-node" machines; the gateways connect to the rest of the
  customer site and to the storage-nodes, and the storage-nodes connect
  only to one another and to the gateways.  We'd like each gateway to
  sync to either a customer-supplied NTP server or external NTP servers,
  as well as to the other gateways, and the storage-nodes to sync to the
  gateways (and trust them completely).

The setup;

  The gateways have 2 or 3 interfaces; one goes to the internal LAN, and
  the other one or two go to private back-end switches.  The ntp.conf on
  the gateways looks like this;

    driftfile /var/lib/ntp/ntp.drift
    statsdir /var/log/ntpstats/
    statistics loopstats peerstats clockstats
    filegen loopstats file loopstats type day enable
    filegen peerstats file peerstats type day enable
    filegen clockstats file clockstats type day enable

    server <one or more servers, external or internal>
    server <one or more other gateways, using the back-end addresses>

  The storage-nodes have 2 interfaces, each of which goes to back-end
  switches.  The ntp.conf on the storage-nodes looks like this;

    driftfile /var/lib/ntp/ntp.drift
    statsdir /var/log/ntpstats/
    statistics loopstats peerstats clockstats
    filegen loopstats file loopstats type day enable
    filegen peerstats file peerstats type day enable
    filegen clockstats file clockstats type day enable

    server <two or more gateways, using the back-end addresses>

  Furthermore, /etc/init.d/ntp has been modified on the storage-nodes so
  that the daemon is started with the -g flag.

The problem;

  It doesn't seem to work reliably;

    gateway1:~# date;for i in 2 51 52 53 54; do ssh -1 10.123.123.$i date;done
    Fri Oct  7 11:10:36 EDT 2005
    Fri Oct  7 11:10:35 EDT 2005
    Fri Oct  7 11:18:27 EDT 2005
    Fri Oct  7 11:14:05 EDT 2005
    Fri Oct  7 11:08:26 EDT 2005
    Fri Oct  7 11:22:15 EDT 2005
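
  To put numbers on that skew, here is a minimal Python sketch, offered
  as an illustration only, that subtracts the gateway1 reading from each
  of the others.  The host labels are an assumption based on the order
  of the loop above, and since the ssh calls run one after another, a
  second or so of each difference is just command latency.

```python
from datetime import datetime

# Readings copied from the output above. The host labels are assumed from
# the loop order: gateway1 first, then 10.123.123.2, .51, .52, .53, .54.
READINGS = [
    ("gateway1", "11:10:36"),
    ("10.123.123.2", "11:10:35"),
    ("10.123.123.51", "11:18:27"),
    ("10.123.123.52", "11:14:05"),
    ("10.123.123.53", "11:08:26"),
    ("10.123.123.54", "11:22:15"),
]


def skew_seconds(readings):
    """Return {host: seconds ahead (+) or behind (-) the first host}."""
    fmt = "%H:%M:%S"
    reference = datetime.strptime(readings[0][1], fmt)
    return {
        host: int((datetime.strptime(stamp, fmt) - reference).total_seconds())
        for host, stamp in readings
    }


if __name__ == "__main__":
    for host, skew in skew_seconds(READINGS).items():
        print(f"{host:15s} {skew:+5d} s")
```

  The host at .2 agrees with gateway1 to within a second, while the
  hosts at .51 through .54 are off by between roughly two and twelve
  minutes (+471, +209, -130 and +699 seconds).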

  I'm unsure what's going on, or how to diagnose it.  It looks like
  everything is communicating properly;

    On the gateway (time 11:10 above);

 gateway1:~# ntpq -c pe localhost
      remote           refid      st t when poll reach   delay   offset  jitter
 ==============================================================================
 <internal NTP server> <>          2 u   38   64  377    0.278  -1565.0   4.599
 <other gateway>       <>          4 u  928 1024  377    0.114  -1135.6   2.005
 <other gateway>       <>         16 u  805 1024    0    0.000    0.000 4000.00

    On one of the storage-nodes (time 11:14 above);

 storage-node2:~# ntpq -c pe localhost
      remote           refid      st t when poll reach   delay   offset  jitter
 ==============================================================================
 <gateway 1>           <>          3 u   40   64  377    0.136  -207752   3.355
 <gateway 2>           <>          4 u   35   64  377    0.123  -208891   4.606
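
  The delay, offset and jitter columns are in milliseconds, so the
  storage-node offsets above are about -208 seconds.  The Python sketch
  below, an illustration only, pulls the offsets out of a billboard and
  compares them with ntpd's documented default step (0.128 s) and panic
  (1000 s) thresholds.  The function names are made up for this example.

```python
STEP_THRESHOLD_S = 0.128    # ntpd default step threshold, in seconds
PANIC_THRESHOLD_S = 1000.0  # ntpd default panic threshold, in seconds

# Storage-node billboard copied from the output above.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 <gateway 1>           <>          3 u   40   64  377    0.136  -207752   3.355
 <gateway 2>           <>          4 u   35   64  377    0.123  -208891   4.606
"""


def peer_offsets(billboard):
    """Return [(remote, offset_seconds)] for each peer row of the billboard."""
    rows = []
    for line in billboard.splitlines():
        fields = line.split()
        # Skip the header and the ===== separator. A peer row has at least
        # ten columns and ends with delay, offset, jitter in milliseconds.
        if len(fields) < 10 or fields[-1] == "jitter":
            continue
        # Everything before the last nine columns is the remote name
        # (the placeholder names here contain a space).
        remote = " ".join(fields[:-9])
        rows.append((remote, float(fields[-2]) / 1000.0))
    return rows


def classify(offset_s):
    """Describe an offset relative to the two default thresholds."""
    size = abs(offset_s)
    if size > PANIC_THRESHOLD_S:
        return "beyond panic threshold"
    if size > STEP_THRESHOLD_S:
        return "beyond step threshold"
    return "within slew range"


if __name__ == "__main__":
    for remote, offset in peer_offsets(SAMPLE):
        print(f"{remote:12s} {offset:10.3f} s  {classify(offset)}")
```

  For the storage-node billboard this reports both gateways at roughly
  -208 s, which is past the step threshold but well inside the panic
  threshold.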

  Working through the debugging techniques, I noticed that the tally
  code for every peer is a space, so I dug deeper and found;

    gateway1:~# ntpq -c as localhost
    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
    1 47900  9014   yes   yes  none    reject   reachable  1
    2 47901  9014   yes   yes  none    reject   reachable  1
    3 47902  8000   yes   yes  none    reject

    storage-node2:~# ntpq -c as localhost
    ind assID status  conf reach auth condition  last_event cnt
    ===========================================================
    1 16076  9064   yes   yes  none    reject   reachable  6
    2 16077  9064   yes   yes  none    reject   reachable  6
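
  The status column is a hexadecimal peer status word.  The Python
  sketch below decodes it under the layout given in the NTP
  documentation for peer status words (five flag bits, three selection
  bits, a four-bit event count and a four-bit event code).  Treat the
  bit positions as an assumption to check against the documentation for
  the ntpd version in use; only the two flags that show up in this
  output are decoded, and the function name is hypothetical.

```python
# Assumed layout of the 16-bit peer status word:
#   bits 15-11  status flags
#   bits 10-8   selection (0 is shown as "reject" by ntpq)
#   bits 7-4    event count
#   bits 3-0    last event code
FLAG_CONFIGURED = 0x8000
FLAG_REACHABLE = 0x1000


def decode_peer_status(word_hex):
    """Split a hex status word such as '9014' into its assumed fields."""
    word = int(word_hex, 16)
    return {
        "configured": bool(word & FLAG_CONFIGURED),
        "reachable": bool(word & FLAG_REACHABLE),
        "selection": (word >> 8) & 0x7,
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,
    }


if __name__ == "__main__":
    for status in ("9014", "9064", "8000"):
        print(status, decode_peer_status(status))
```

  Under this layout 9014 and 9064 decode to configured, reachable,
  selection 0 and last event code 4, with 1 and 6 events respectively.
  8000 decodes to configured but not reachable, which is consistent with
  the reach value of 0 for the third peer in the gateway billboard.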

  So obviously the responses are getting rejected, but I'm not clear on
  why.  Here is the gateway's view of what should theoretically be its
  upstream internal NTP server;

gateway1:~# ntpq -c 'rv 47900' localhost
status=9014 reach, conf, 1 event, event_reach,
srcadr=sfprinters.us.babcockbrown.com, srcport=123, dstadr=10.16.4.150,
dstport=123, leap=00, stratum=2, precision=-7, rootdelay=0.000,
rootdispersion=9733.109, refid=bigbird.babcockbrown.com, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, flash=00 ok, keyid=0,
offset=-1562.833, delay=0.230, dispersion=11.665, jitter=7.127,
reftime=c6f0cea2.4526e978  Fri, Oct  7 2005  6:38:26.270,
org=c6f11314.3e76c8b4  Fri, Oct  7 2005 11:30:28.244,
rec=c6f11315.d120d130  Fri, Oct  7 2005 11:30:29.816,
xmt=c6f11315.d10cc35c  Fri, Oct  7 2005 11:30:29.816,
filtdelay=     0.31    0.31    0.36    0.30    0.23    0.39    0.27    0.37,
filtoffset= -1572.7 -1562.2 -1567.0 -1572.8 -1562.8 -1567.7 -1573.4 -1563.1,
filtdisp=      7.83    8.82    9.78   10.74   11.68   12.64   13.60   14.58

  And looking at one of the gateways from the storage-node;

storage-node2:~# ntpq -c 'rv 16076' localhost
status=9064 reach, conf, 6 events, event_reach,
srcadr=10.123.123.1, srcport=123, dstadr=10.123.123.52, dstport=123,
leap=00, stratum=3, precision=-16, rootdelay=0.244,
rootdispersion=9734.558, refid=10.16.4.1, reach=377, unreach=0, hmode=3,
pmode=4, hpoll=6, ppoll=6, flash=00 ok, keyid=0, offset=-207768.629,
delay=0.118, dispersion=3.779, jitter=3.407,
reftime=c6e82e68.10c63f14  Fri, Sep 30 2005 17:36:40.065,
org=c6f11375.feae6c8f  Fri, Oct  7 2005 11:32:05.994,
rec=c6f11445.c40d2c38  Fri, Oct  7 2005 11:35:33.765,
xmt=c6f11445.c3fec13b  Fri, Oct  7 2005 11:35:33.765,
filtdelay=     0.19    0.14    0.12    0.14    0.14    0.15    0.14    0.14,
filtoffset= -207770 -207769 -207768 -207767 -207766 -207765 -207763 -207762,
filtdisp=      0.03    1.01    1.97    2.91    3.90    4.88    5.85    6.83
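
  The reftime, org, rec and xmt values are NTP timestamps: hexadecimal
  seconds since 1900-01-01 UTC, a dot, then a 32-bit binary fraction of
  a second.  As an illustration, this Python sketch converts them to UTC
  datetimes so they can be subtracted; ntpq prints the same instants in
  local time (EDT here).  The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def ntp_to_datetime(stamp):
    """Convert 'ssssssss.ffffffff' (hex seconds.fraction since 1900) to UTC."""
    sec_hex, frac_hex = stamp.split(".")
    whole = int(sec_hex, 16)
    # The fraction is a count of 1/2**32 seconds; scale it to microseconds.
    micros = int(frac_hex, 16) * 1_000_000 // 2**32
    return NTP_EPOCH + timedelta(seconds=whole, microseconds=micros)


if __name__ == "__main__":
    # Values copied from the storage-node output above.
    reftime = ntp_to_datetime("c6e82e68.10c63f14")
    rec = ntp_to_datetime("c6f11445.c40d2c38")
    print("reftime:", reftime.isoformat())
    print("rec:    ", rec.isoformat())
    print("rec - reftime:", rec - reftime)
```

  Applied to the storage-node output, rec minus reftime comes to about
  6 days 18 hours: the reftime that gateway1 reports is from Sep 30,
  while the reply was received on Oct 7.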

  I've also seen a flash value of 80.  Now it appears to be 00.

I'm at a loss here.  What we want is for the gateways to get their time
from either upstream external NTP sources or internal sources, or to
just accept their own time; the storage-nodes should get their time from
the gateways and believe them no matter what the skew.

How can I go about figuring out where to go from here?  Thanks in
advance for any help.


--
Ted Beatie                         Permabit, Inc.             ted at permabit.com
Sr. Systems Engineer       One Kendall Sq, Cambridge, MA       +1-617-995-9317



