[ntp:questions] solaris 9, ntpdate

Chuck Young see.why at austin.rr.com
Tue Jul 14 09:25:27 UTC 2009


Thanks for the reply.

On Monday 13 July 2009 22:19, Danny Mayer wrote:
> Chuck Young wrote:
<...>
> > It's difficult for me to speak authoritatively to exactly what happened
> > as it was 2 weeks ago and I've only been recently called in to
> > troubleshoot.  In summary, 2 solaris 9 servers I help admin recently had
> > a pretty hairy problem with clocks.  There was a power outage in the data
> > center housing these boxes.  When they came back up, their clocks got
> > badly mis-set.  Around 4 hours later, the local time server (stratum 2
> > below) responsible apparently got back on track (or further off track?);
> > I believe xntpd quit when the remote and local clocks got widely off.
> >
<...>
> > Here is the rough architecture:
> >
> >
> > strat 1:           NIST          |           RH
> >                   /    \         |          /    \
> >              (data center 1)     |     (data center 2)
> >                 /        \       |        /        \
> > strat 2:    cisco-1    cisco-2   |    cisco-3    cisco-4
> >
> > (all clients poll first local then remote strat 2 servers)
> >                \V/   \V/   \V/   |     \V/   \V/   \V/
> > strat 3:    client client client |   client client client
<...>
> > Here is the pertinent shell code in the stock solaris 9 init script
> > (edited, will post the whole script if asked):
> >
> > <snip from /etc/init.d/xntpd>
<...>
> >         if [ -n "$ARGS" ]; then
> >                 # Wait until date is close before starting xntpd
> >                 (/usr/sbin/ntpdate $ARGS; sleep 2; /usr/lib/inet/xntpd) &
<...>
> >
> > My understanding from this script, the conf file, the log, and what I
> > know about ntp (which I've some experience with over the years) is that,
> > on start, solaris executed 'ntpdate -s -w 192.168.201.117 192.168.201.119
> > 1.2.3.117 1.2.3.119'.  One or more stratum 2 sources were way off.  The
> > date was then incorrectly set, and xntpd started.  4 hours later (maybe -
> > or perhaps the clock was getting reset by some huge amount?), xntpd quit
> > because offsets were so large.
<...>
> This looks like a big mess. Don't use ntpdate especially in your

Yes, I understand ntpdate vs. ntpd.  The fact is that every init script I've
looked at on Solaris 7, 8, 9, SuSE, and Red Hat Linux uses ntpdate to set the
clock initially.  This is commonly done to speed up boot times.

I'm not talking here about anything I've coded; I'm describing stock installs
across a few flavors of Unix.  I was kind of surprised at this, because the
docs on these same boxes clearly describe ntpdate as deprecated, or nearly so.
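
For what it's worth, if I'm reading the newer docs right, the suggested
replacement for that one-shot use is ntpd's own one-shot mode, something
along the lines of (a v4 option, so it wouldn't help the Solaris 9 boxes):

    ntpd -g -q    # -q: set the clock once and exit; -g: allow a large first step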

> environment. ntpdate just sets the time. You need to be using ntpd which
> will discipline the clock. If you are using a very old version of ntpd

Yes, these all use ntpd; the Solaris boxes use version 3.  It was ntpd that
ended up shutting down - I believe this happened when the stratum 2 servers
changed their times (probably got rebooted again) and ended up more than
1000s off from the clients.  That's my guess.
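
If I understand it right, that 1000s figure is the daemon's built-in "panic"
threshold, which is why it exits rather than stepping.  On a v4 build there
is a knob for it, though as far as I can tell the Solaris 9 xntpd has no
equivalent:

    # ntp.conf, v4 only: disable the 1000s panic exit so the daemon
    # steps the clock instead of dying
    tinker panic 0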

These OSes are not that old, e.g., CentOS 5.3.  One of the Solaris 9 boxes
that borked is a recent, fully patched Solaris 9 install, which is, by the
way, still supported.  But note that the init script on that box is still:

"
# Copyright (c) 1996-1997 by Sun Microsystems, Inc.
"

> which it what it sounds like, download and build the latest stable
> version of ntpd from the download web site. When you install it, use
> ntpd -gN which will allow you to set the time based on the consensus of
> the NTP servers you have configured. Add to each server line iburst and
> ntpd will be able to synchronize in about 15-20 seconds (though that can
> depend on a number of factors).
>

Neither 'iburst' nor '-gN' is supported by the NTP v3, Solaris 9 Sun
implementation of xntp.  I'll look into building from source, but that has
various problems associated with it.
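
If we do build v4 from source, my reading of your suggestion is a server
line per source plus those two flags - roughly:

    # ntp.conf, v4 syntax; iburst sends a burst of packets at startup so
    # the daemon syncs in seconds rather than minutes
    server <stratum-2-server> iburst

started as

    ntpd -g -N    # -g: allow one large initial step; -N: run at high priority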

> You also are using two external primary servers but this causes problems
> because the stratum 2 servers cannot decide which one is more reliable.

I think my ASCII diagram did more harm than good.

Each data center has two stratum 2 servers - that is, four stratum 2 servers
total, across two DCs.  Each of these stratum 2 servers syncs to exactly one
upstream stratum 1 server, with each pair of colocated stratum 2 servers
syncing to the SAME stratum 1 server.  Again, each stratum 2 server only
talks to a single stratum 1 server, so I don't see how there can be any
problem with "stratum 2 servers" deciding "which one [that is, which stratum
1 server] is more reliable".
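
To make that concrete, each stratum 3 client points at all four stratum 2
servers, local DC pair first - roughly these lines in ntp.conf (same
addresses the init script hands to ntpdate; I'm assuming the 192.168 pair is
the local one, given the ordering):

    server 192.168.201.117    # stratum 2, local DC
    server 192.168.201.119    # stratum 2, local DC
    server 1.2.3.117          # stratum 2, remote DC
    server 1.2.3.119          # stratum 2, remote DC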

What *could* be happening is that the clients at stratum 3 were seeing
different times from different pairs of stratum 2 servers.  But here is what
makes that hard to understand: why do the logfiles on different clients that
borked in a near-identical manner claim that these Very Wrong Times came
from TWO DIFFERENT stratum 2 servers, at DIFFERENT colos, hence syncing to
DIFFERENT stratum 1 servers?

That says to me that there is something in *common*, on the *client side*
(they are both crufty Sun implementations of the daemon on Solaris 9), that
caused them to read broken upstream stratum 2 info in an identical manner.
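
My next step is to pull the peer status off both borked clients and compare
notes - something like this (Solaris 9 paths, from memory):

    /usr/sbin/ntpq -p              # which peers the daemon sees and selects
    /usr/sbin/xntpdc -p            # same info via the v3 daemon's own query tool
    grep xntpd /var/adm/messages   # the step / offset complaints in syslog

and see exactly what each daemon thought of its four sources when things
went sideways.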

> You need at least a minimum of 3 NTP servers and preferably 4 in case
> one of them becomes unavailable.
>

What stratum are you discussing?  "(P)referably 4" stratum 2 servers?  Or 
stratum 1?  We run exactly 4 stratum 2 servers right now.

> I also notice that you are using cisco devices for part of the network.
> However, cisco's principal use is for routing and switching and not for
> an ntp server, so you should be sure that it's reliable. For your

I'm blissfully ignorant of the ciscos but have already brought that up with 
staff.  ntp.org doesn't say anything particularly bad about ciscos, however:

http://support.ntp.org/bin/view/Support/CiscoNTP

> particular situation you should pay money and buy a number of refclocks
> for internal use. It's cheap for critical financial DB servers.
> Configure those on some of your older servers so that they are the only
> thing that those servers are doing and then point all your critical
> client systems to those servers. Make sure you have at least 3 and
> configure a few external servers in case for some reason those become
> unavailable.
>

Are you suggesting buying precision chronometers and building stratum 0 
delivery in the enterprise?
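
If so, and just so I follow the mechanics: as I understand it, a refclock
shows up in ntp.conf as a pseudo-address in the 127.127.t.u range (driver
type t, unit u), e.g. something like this for a GPS/NMEA unit (a sketch from
the docs, not anything we run today):

    server 127.127.20.0            # NMEA GPS refclock driver, unit 0
    fudge  127.127.20.0 refid GPS  # label it GPS in the peer billboards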

thx
cy


