[ntp:questions] solaris 9, ntpdate

Danny Mayer mayer at ntp.org
Tue Jul 14 03:19:00 UTC 2009


Chuck Young wrote:
> Hello,
> 
> I've spent quite a bit of time diving into both Solaris docs and ntp.org docs 
> and am unable to come up with answers to what follows.
> 
> It's difficult for me to speak authoritatively to exactly what happened as it 
> was 2 weeks ago and I've only been recently called in to troubleshoot.  In 
> summary, 2 solaris 9 servers I help admin recently had a pretty hairy problem 
> with clocks.  There was a power outage in the data center housing these 
> boxes.  When they came back up, their clocks got badly mis-set.  Around 4 
> hours later, the local time server (stratum 2 below) responsible apparently 
> got back on track (or further off track?); I believe xntpd quit when the 
> remote and local clocks got widely off.
> 
> The boxes that failed are enterprise critical DB servers and time is very 
> important to our business (revenue).  I'm interested in:
> 
> 1) Resources to facilitate further research
> 2) More robust failover
> 3) Immediate notification of clock trouble
> 
> I can code this in sh or perl myself but would naturally like to do as little 
> as possible, leverage existing solutions, etc.  These are production boxes 
> with no real dev space for me to tinker in so KISS is the order of the day.  
> Gory details follow; sorry for the long post.
> 
> Here is the rough architecture:
> 
>                                       |
> strat 1:           NIST               |            RH
>                   /    \              |           /  \
>              (data center 1)          |     (data center 2)
>                 /        \            |        /        \
> strat 2:    cisco-1    cisco-2        |    cisco-3    cisco-4
>                                       |
> (all clients poll first local then remote strat 2 servers)
>             \V/    \V/    \V/         |    \V/    \V/    \V/
> strat 3:   client client client       |   client client client
> 
> (I've pointed out to colleagues that the strat 2 servers should be peered; at 
> the time of this meltdown they weren't).
> 
> Here is some pertinent info from one host (other is very similar):
> 
> [root@aegir ccy]# uname -a
> SunOS aegir 5.9 Generic_122300-04 sun4u sparc SUNW,Sun-Fire-V440
> [root@aegir ccy]# ntptrace 192.168.201.117
> cisco-1.mydomain.com: stratum 2, offset -0.004321, synch distance 0.03790
> time-a.nist.gov: stratum 1, offset 0.003697, synch distance 0.00000, refid 
> 'ACTS'
> [root@aegir ccy]# ntptrace 192.168.201.119
> cisco-2.mydomain.com: stratum 2, offset -0.000124, synch distance 0.03850
> time-a.nist.gov: stratum 1, offset -0.001194, synch distance 0.00000, refid 
> 'ACTS'
> [root@aegir ccy]# cat /etc/inet/ntp.conf
> server 192.168.201.117
> server 192.168.201.119
> server 1.2.3.117
> server 1.2.3.119
> [root@aegir ccy]#
> 
> The saga of the clock breaking can be seen in this sequence 
> of /var/adm/messages:
> 
> Jun 23 07:53:52 aegir genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
> Mar  1 00:04:25 aegir ntpdate[187]: [ID 774510 daemon.notice] step time server 
> 1.2.3.119 offset -514799367.013772 sec
> Mar  1 00:04:27 aegir xntpd[306]: [ID 702911 daemon.notice] xntpd 3-5.93e Mon 
> Sep 20 15:47:11 PDT 1999 (1)
> Mar  1 00:04:28 aegir xntpd[306]: [ID 301315 daemon.notice] tickadj = 5, tick 
> = 10000, tvu_maxslew = 495, est. hz = 100
> Mar  1 00:04:28 aegir xntpd[306]: [ID 798731 daemon.notice] using kernel 
> phase-lock loop 0041
> Mar  1 00:04:33 aegir pseudo: [ID 129642 kern.info] pseudo-device: vol0
> Mar  1 00:04:33 aegir genunix: [ID 936769 kern.info] vol0 is /pseudo/vol@0
> Mar  1 00:08:45 aegir xntpd[306]: [ID 261039 daemon.error] time error 
> 514799368.113322 is way too large (set clock manually)
> 
> The other box shows similar, but not identical, issues:
> 
> Mar  1 00:10:35 sif ntpdate[263]: [ID 774510 daemon.notice] step time server 
> 192.168.201.117 offset -514798002.353877 sec
> Mar  1 00:10:38 sif xntpd[497]: [ID 702911 daemon.notice] xntpd 3-5.93e Mon 
> Sep 20 15:47:11 PDT 1999 (1)
> 
> Note here that ntpdate did the damage; also that the polled date "Sep 20 
> 15:47:11 PDT 1999" was not the set date "Mar  1 00:04:27" (year???).  2 
> different time servers at two different colos are shown in the respective 
> "ntpdate" entries, both with similar horribly wrong offsets (there may have 
> been and probably were extant network outages as the boxes came back online).
> 
> Here is the pertinent shell code in the stock solaris 9 init script (edited, 
> will post the whole script if asked):
> 
> <snip from /etc/init.d/xntpd>
> 'start')
>         [ -f /etc/inet/ntp.conf ] || exit 0
> 
>         ARGS=`/usr/bin/cat /etc/inet/ntp.conf | /usr/bin/nawk '
>         BEGIN {
>             first = 1
>         }
> <...>
>         /^server|^peer/ {
>             if (first) {
>                 first = 0
>                 printf("-s -w")
>             }
>             printf(" %s", $2)
>             next
>         }
>         '`
>         if [ -n "$ARGS" ]; then
>                 # Wait until date is close before starting xntpd
>                 (/usr/sbin/ntpdate $ARGS; sleep 2; /usr/lib/inet/xntpd) &
>         else
>                 /usr/lib/inet/xntpd &
>         fi
>         ;;
> <...>
> esac
> exit 0
> </snip from /etc/init.d/xntpd>
> 
> My understanding from this script, the conf file, the log, and what I know 
> about ntp (which I've some experience with over the years) is that, on start, 
> solaris executed 'ntpdate -s -w 192.168.201.117 192.168.201.119 1.2.3.117 
> 1.2.3.119'.  One or more stratum 2 sources were way off.  The date was then 
> incorrectly set, and xntpd started.  4 hours later (maybe - or perhaps the 
> clock was getting reset by some huge amount?), xntpd quit because offsets 
> were so large.
> 
> I'd like to understand what happened to mitigate against a recurrence but 
> naturally am mostly interested in solutions.  Any advice appreciated.  I've 
> consulted all pertinent solaris 9 sysadmin docs, the 8 year old solaris xntp 
> "blueprint", the entire ntp.org FAQ, much of its official documentation.  Of 
> course I could be missing something obvious.  Again, sorry for long post and 
> TIA.

This looks like a big mess. Don't use ntpdate, especially in your
environment. ntpdate just sets the time once; you need to be running ntpd,
which will discipline the clock continuously. Your logs show xntpd
3-5.93e, built in 1999, which is very old, so download and build the
latest stable version of ntpd from the download web site. When you
install it, start it with ntpd -gN, which will allow it to set the time
based on the consensus of the NTP servers you have configured. Add iburst
to each server line and ntpd will be able to synchronize in about 15-20
seconds (though that can depend on a number of factors).
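
As a rough sketch of the iburst change (not a tested procedure), the
filter below appends iburst to every server line that does not already
have it. The function name add_iburst and the output path are
placeholders of mine, and on Solaris 9 you may need nawk in place of awk.

```sh
#!/bin/sh
# add_iburst: hypothetical helper. Reads an ntp.conf on stdin and writes it
# to stdout with "iburst" appended to each server line that lacks it.
add_iburst() {
    awk '
        /^server[ \t]/ && !/iburst/ { print $0 " iburst"; next }
        { print }
    '
}

# Usage sketch (paths are assumptions; review the result before installing):
#   add_iburst < /etc/inet/ntp.conf > /tmp/ntp.conf.new
#   ntpd -gN    # -g permits one large initial correction, -N raises priority
```

All other lines pass through untouched, so the output can be compared
against the original before anything is replaced.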

You are also using two external primary servers, which causes problems
because the stratum 2 servers cannot decide which one is more reliable
when the two disagree. You need at least 3 NTP servers, and preferably 4
in case one of them becomes unavailable.
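
For any host whose sources are listed in an ntp.conf, a quick sanity
check along these lines can flag a configuration with too few servers.
This is only a sketch; count_servers and enough_servers are names I made
up, and it counts configured lines, not servers that are actually
reachable.

```sh
#!/bin/sh
# count_servers: hypothetical helper. Prints the number of server lines
# found in an ntp.conf read from stdin.
count_servers() {
    awk '/^server[ \t]/ { n++ } END { print n + 0 }'
}

# enough_servers MIN: succeeds only if stdin holds at least MIN server lines.
enough_servers() {
    [ "`count_servers`" -ge "$1" ]
}

# Usage sketch (the path is an assumption):
#   enough_servers 3 < /etc/inet/ntp.conf || echo "fewer than 3 servers"
```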

I also notice that you are using Cisco devices as your stratum 2
servers. However, Cisco's principal use is routing and switching, not
serving NTP, so you should be sure that they are reliable. For your
particular situation you should pay money and buy a number of refclocks
for internal use. The cost is small for critical financial DB servers.
Configure those on some of your older servers so that serving time is
the only thing those servers are doing, and then point all your critical
client systems to those servers. Make sure you have at least 3, and
configure a few external servers in case the internal ones become
unavailable for some reason.
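
On the request in your original message for immediate notification of
clock trouble, a minimal sketch of a check you could run from cron
follows. It assumes ntpq is installed alongside the daemon and that the
offset, in milliseconds, is the ninth column of the peer lines printed by
ntpq -pn. The name check_offset, the 100 ms limit and the mailx call are
illustrative only, and on Solaris 9 you may need nawk in place of awk.

```sh
#!/bin/sh
# check_offset LIMIT_MS: hypothetical helper. Reads `ntpq -pn` output on
# stdin and succeeds only if a peer is currently selected (its line starts
# with "*") and the absolute offset in column 9 is within LIMIT_MS.
check_offset() {
    limit_ms=$1
    awk -v limit="$limit_ms" '
        /^\*/ {
            found = 1
            off = $9 + 0
            if (off < 0) off = -off
            if (off > limit) bad = 1
        }
        END { exit (found && !bad) ? 0 : 1 }
    '
}

# Usage sketch (assumes ntpq is on PATH and mail delivery to root works):
#   ntpq -pn | check_offset 100 ||
#       echo "clock trouble on `hostname`" | mailx -s "NTP alert" root
```

The check also fails when no peer is selected at all, which covers the
case where the daemon has exited or lost all of its servers.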

Danny

