[ntp:hackers] What to do when the offset is WAYTOOBIG

Brian Utterback brian.utterback at sun.com
Mon Apr 16 11:51:56 PDT 2007


I am in the final stages of porting NTPv4 to Solaris Nevada, and I have
some issues I would like to get resolved.

One of these is the behavior of ntpd when it gets an offset that is
too large. For most users, we want to accept a large offset once at
boot up and never again. That's great, we have a flag for that.
However, what do we do if the offset is calculated as being too large
after that?

The current behavior is to exit. This has been discussed at length
on this list. Dave advocates this behavior because it indicates an
error, and (in his words) "requires human paws" to deal with it.

Unfortunately, Solaris comes with a service "restarter" such that
services that stop are restarted. This re-executes the startup
script, starting ntpd with the same flags as before, allowing that
"one time step" behavior, and now not only do we not require the
human paws, we have allowed the step.

Not good, especially in light of our premise that something is wrong.

Now, I could hack at ntpd and smf and stop the restarting, but we
have had numerous requests to disable the exit behavior too. In case
that is not clear, many of our customers do not want the clock to
step to an obviously incorrect value, and many of them do not want
ntpd to exit, needing manual intervention to restart.

Furthermore, it occurs to me that the problem is as likely to be
upstream in the NTP network as at the local system.

So, by my reckoning, it would be far preferable to not do the step,
but also not exit. We prevent the clock from undergoing the step, but
allow upstream problems to be fixed without requiring manual
intervention.
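
To make the proposal concrete, here is a minimal sketch of the decision
I have in mind. It is not ntpd code: the names ClockPolicy, Action and
panic_threshold are hypothetical, the 1000-second value is only
illustrative, and I am assuming the one-time startup allowance is used
up by the first update, whether or not that update was large.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SLEW = "slew"      # acceptable offset: discipline the clock as usual
    STEP = "step"      # oversized offset, accepted once at startup
    IGNORE = "ignore"  # oversized offset later on: refuse it, keep running


@dataclass
class ClockPolicy:
    # Hypothetical limit in seconds, for illustration only.
    panic_threshold: float = 1000.0
    # True until the one-time startup allowance has been consumed.
    startup_step_allowed: bool = True

    def decide(self, offset: float) -> Action:
        """Pick an action for one measured offset (in seconds)."""
        too_big = abs(offset) > self.panic_threshold
        # Assumption: the first update consumes the allowance either way.
        first_update = self.startup_step_allowed
        self.startup_step_allowed = False

        if not too_big:
            return Action.SLEW
        if first_update:
            return Action.STEP
        # Proposed change: neither step nor exit. Stay up so that the
        # daemon recovers by itself once the upstream problem is fixed.
        return Action.IGNORE


policy = ClockPolicy()
for sample in (5000.0, 5000.0, 0.01):
    print(sample, policy.decide(sample).value)
```

The only difference from the current behavior is the last branch, where
today's ntpd would exit. Everything else, including how a refused offset
should be logged or reported, is left open here.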

Imagine a network with a thousand clients and a handful of servers.
Suppose the servers pick up bogus time via a systematic error (a bug
in the handling of daylight saving time, say; purely hypothetical of
course, since real vendors wouldn't have a bug in their DST handling
code, right?) and serve that bogus time downstream to the clients.
The clients all exit, and now someone has to log in to all thousand
clients to get things going again.

Now, I have some thoughts on how to implement this, but I wanted to get
your thoughts as to whether or not this approach is a good one. What
say you all, yea or nay?


-- 
blu

"Remember 'A Thousand Points of Light'? With a network, we now have
a thousand points of failure."
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom

