[ntp:questions] solaris 9, ntpdate

Chuck Young see.why at austin.rr.com
Wed Jul 8 08:04:29 UTC 2009


Hello,

I've spent quite a bit of time digging through both solaris docs and ntp.org 
docs and am unable to come up with answers to the questions below.

It's difficult for me to speak authoritatively to exactly what happened as it 
was 2 weeks ago and I've only been recently called in to troubleshoot.  In 
summary, 2 solaris 9 servers I help admin recently had a pretty hairy problem 
with clocks.  There was a power outage in the data center housing these 
boxes.  When they came back up, their clocks got badly mis-set.  Around 4 
hours later, the local time server (stratum 2 below) responsible apparently 
got back on track (or further off track?); I believe xntpd quit when the 
remote and local clocks got widely off.

The boxes that failed are enterprise critical DB servers and time is very 
important to our business (revenue).  I'm interested in:

1) Resources to facilitate further research
2) More robust failover
3) Immediate notification of clock trouble
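
As a rough illustration of item 3, here is a minimal sketch of a check that 
could run from cron.  It assumes 'ntpq -p' prints two header lines followed 
by one line per peer, with the offset (in milliseconds) in field 9 and a '*' 
prefixed to the peer currently selected for sync.  The function name 
check_peers and the 500 ms threshold are hypothetical, not anything from the 
stock solaris tooling.

```shell
#!/bin/sh
# Sketch only: flag a missing sync peer or an oversized offset.
# Assumption: input looks like 'ntpq -p' output (2 header lines, then one
# line per peer, offset in ms in field 9, '*' marks the selected peer).

AWK=${AWK:-awk}    # on solaris 9 set AWK=nawk; the old awk lacks -v

# check_peers LIMIT_MS
# Reads 'ntpq -p' style text on stdin, prints a one-line verdict,
# returns 0 when healthy and 1 otherwise.
check_peers() {
    $AWK -v limit="$1" '
        NR <= 2 { next }                  # skip column header and ruler
        substr($1, 1, 1) == "*" {         # the peer xntpd is synced to
            found = 1
            off = $9 + 0
            if (off < 0) off = -off       # compare magnitude only
        }
        END {
            if (!found) {
                print "ALERT: no selected peer"
                exit 1
            }
            if (off > limit) {
                print "ALERT: offset " off " ms exceeds " limit " ms"
                exit 1
            }
            print "OK: offset " off " ms"
        }'
}
```

From cron this could be wired up as something like 
"ntpq -p | check_peers 500 || logger -p daemon.err clock trouble", with the 
logger call swapped for whatever paging or mail mechanism is already in 
place.  Note that once xntpd has exited, ntpq itself fails, so the "no 
selected peer" branch also covers a dead daemon.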

I can code this in sh or perl myself but would naturally like to do as little 
as possible, leverage existing solutions, etc.  These are production boxes 
with no real dev space for me to tinker in so KISS is the order of the day.  
Gory details follow; sorry for the long post.

Here is the rough architecture:

                               |
strat 1:         NIST          |            RH
                /    \         |          /    \
           (data center 1)     |     (data center 2)
             /          \      |       /          \
strat 2:  cisco-1   cisco-2    |    cisco-3   cisco-4
                               |
(all clients poll first local then remote strat 2 servers)
           \V/    \V/    \V/   |  \V/    \V/    \V/
strat 3:  client client client | client client client

(I've pointed out to colleagues that the strat 2 servers should be peered; at 
the time of this meltdown they weren't).

Here is some pertinent info from one host (other is very similar):

[root@aegir ccy]# uname -a
SunOS aegir 5.9 Generic_122300-04 sun4u sparc SUNW,Sun-Fire-V440
[root@aegir ccy]# ntptrace 192.168.201.117
cisco-1.mydomain.com: stratum 2, offset -0.004321, synch distance 0.03790
time-a.nist.gov: stratum 1, offset 0.003697, synch distance 0.00000, refid 'ACTS'
[root@aegir ccy]# ntptrace 192.168.201.119
cisco-2.mydomain.com: stratum 2, offset -0.000124, synch distance 0.03850
time-a.nist.gov: stratum 1, offset -0.001194, synch distance 0.00000, refid 'ACTS'
[root@aegir ccy]# cat /etc/inet/ntp.conf
server 192.168.201.117
server 192.168.201.119
server 1.2.3.117
server 1.2.3.119
[root@aegir ccy]#

The saga of the clock breaking can be seen in this sequence 
of /var/adm/messages:

Jun 23 07:53:52 aegir genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
Mar  1 00:04:25 aegir ntpdate[187]: [ID 774510 daemon.notice] step time server 1.2.3.119 offset -514799367.013772 sec
Mar  1 00:04:27 aegir xntpd[306]: [ID 702911 daemon.notice] xntpd 3-5.93e Mon Sep 20 15:47:11 PDT 1999 (1)
Mar  1 00:04:28 aegir xntpd[306]: [ID 301315 daemon.notice] tickadj = 5, tick = 10000, tvu_maxslew = 495, est. hz = 100
Mar  1 00:04:28 aegir xntpd[306]: [ID 798731 daemon.notice] using kernel phase-lock loop 0041
Mar  1 00:04:33 aegir pseudo: [ID 129642 kern.info] pseudo-device: vol0
Mar  1 00:04:33 aegir genunix: [ID 936769 kern.info] vol0 is /pseudo/vol@0
Mar  1 00:08:45 aegir xntpd[306]: [ID 261039 daemon.error] time error 514799368.113322 is way too large (set clock manually)

The other box shows similar, but not identical, issues:

Mar  1 00:10:35 sif ntpdate[263]: [ID 774510 daemon.notice] step time server 192.168.201.117 offset -514798002.353877 sec
Mar  1 00:10:38 sif xntpd[497]: [ID 702911 daemon.notice] xntpd 3-5.93e Mon Sep 20 15:47:11 PDT 1999 (1)

Note here that ntpdate did the damage.  Also note that the date in the xntpd 
line, "Sep 20 15:47:11 PDT 1999", is not the set date "Mar  1 00:04:27"; it 
appears to be part of xntpd's version banner rather than a polled time, and 
the log timestamps themselves carry no year (year???).  2 different time 
servers at two different colos are shown in the respective "ntpdate" entries, 
both with similar horribly wrong offsets (there may have been, and probably 
were, network outages in progress as the boxes came back online).
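
The offset itself gives a handle on the missing year.  The sketch below (the 
function name split_offset is hypothetical) splits the aegir offset into days 
and h:m:s.  It uses POSIX $(( )) arithmetic, which I'm assuming is not 
available in the stock solaris 9 /bin/sh, so it would need ksh or 
/usr/xpg4/bin/sh there.

```shell
#!/bin/sh
# Sketch only: break a clock step (whole seconds, sign dropped) into
# days and hours:minutes:seconds.

# split_offset SECONDS
split_offset() {
    secs=$1
    days=$(( secs / 86400 ))          # whole days in the offset
    rem=$(( secs % 86400 ))           # leftover seconds within the last day
    printf '%d days %02d:%02d:%02d\n' \
        "$days" $(( rem / 3600 )) $(( rem % 3600 / 60 )) $(( rem % 60 ))
}

split_offset 514799367    # prints: 5958 days 07:49:27
```

Stepping back 5958 days and 07:49:27 from the last good timestamp in the log, 
Jun 23 07:53:52 (2009), lands on Mar  1 00:04:25 1993, which matches the 
timestamp on the ntpdate line.  So the clock was apparently set to early on 
March 1, 1993.  If I recall correctly, that is also the date Cisco IOS starts 
counting from when it boots without a valid clock, which would fit the 
stratum 2 boxes having rebooted in the same power outage, but I have not 
confirmed that on these devices.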

Here is the pertinent shell code in the stock solaris 9 init script (edited, 
will post the whole script if asked):

<snip from /etc/init.d/xntpd>
'start')
        [ -f /etc/inet/ntp.conf ] || exit 0

        ARGS=`/usr/bin/cat /etc/inet/ntp.conf | /usr/bin/nawk '
        BEGIN {
            first = 1
        }
<...>
        /^server|^peer/ {
            if (first) {
                first = 0
                printf("-s -w")
            }
            printf(" %s", $2)
            next
        }
        '`
        if [ -n "$ARGS" ]; then
                # Wait until date is close before starting xntpd
                (/usr/sbin/ntpdate $ARGS; sleep 2; /usr/lib/inet/xntpd) &
        else
                /usr/lib/inet/xntpd &
        fi
        ;;
<...>
esac
exit 0
</snip from /etc/init.d/xntpd>

My understanding from this script, the conf file, the log, and what I know 
about ntp (which I've had some experience with over the years) is that, on 
start, solaris executed 'ntpdate -s -w 192.168.201.117 192.168.201.119 
1.2.3.117 1.2.3.119'.  One or more stratum 2 sources were way off.  The date 
was then incorrectly set, and xntpd started.  4 hours later (maybe - or 
perhaps the clock was getting reset by some huge amount?), xntpd quit because 
the offsets were so large.
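
One possible guard against this sequence is to query the servers first and 
refuse to step when they disagree wildly with the local clock.  In the aegir 
log the local clock still read Jun 23 just before the step, so a check like 
this would have caught it.  The sketch assumes 'ntpdate -q' prints one line 
per server in the form "server ADDR, stratum N, offset X, delay Y" and 
reports stratum 0 for servers that did not answer; the function name 
offsets_sane and the one hour limit are hypothetical.

```shell
#!/bin/sh
# Sketch only: sanity-check server offsets before letting ntpdate step.
# Assumption: each input line of interest looks like
#   server 192.168.201.117, stratum 2, offset -0.004321, delay 0.02573

AWK=${AWK:-awk}    # on solaris 9 set AWK=nawk; the old awk lacks -v

# offsets_sane MAX_SECONDS
# Reads 'ntpdate -q' style text on stdin.  Returns 0 only if at least one
# responding server was seen and every such offset is within MAX_SECONDS.
offsets_sane() {
    $AWK -v max="$1" '
        $1 == "server" && $3 == "stratum" && $5 == "offset" {
            if ($4 + 0 == 0) next     # assumed: stratum 0 means no reply
            seen++
            off = $6 + 0              # numeric conversion drops the comma
            if (off < 0) off = -off
            if (off > max) bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }'
}
```

In the init script this might be used as 
"/usr/sbin/ntpdate -q SERVERS | offsets_sane 3600" ahead of the real ntpdate 
call, where SERVERS stands for the address list without the "-s -w" flags.  
The trade-off is that a box whose own hardware clock is badly wrong would 
also be blocked from a legitimate step, so a failed check should raise an 
alert rather than be skipped silently.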

I'd like to understand what happened in order to guard against a recurrence, 
but naturally I am mostly interested in solutions.  Any advice appreciated.  
I've consulted all pertinent solaris 9 sysadmin docs, the 8 year old solaris 
xntp "blueprint", the entire ntp.org FAQ, and much of its official 
documentation.  Of course I could be missing something obvious.  Again, sorry 
for the long post and TIA.

cy


