[ntp:questions] question regarding NTP configuration for clusters, and "cluster time" stability

Unruh unruh-spam at physics.ubc.ca
Tue Oct 6 17:04:35 UTC 2009


"rotordyn at yahoo.com" <rotordyn1 at gmail.com> writes:

>On Oct 1, 5:42 am, Ian Dobbie <ian.dob... at bioch.ox.ac.uk> wrote:
>> "rotor... at yahoo.com" <rotord... at gmail.com> writes:

>> How about setting up the system so that you have, say, 3 machines
>> synced to UTC, then have other machines off these. The 3 main
>> machines peer each other. Then have the local clock on these machines
>> have a relatively high stratum level, say 5, so if they lose outside
>> sync then the rest of the cluster will still follow the 3, keeping
>> your cluster in sync even though it might drift from UTC.

>That sounds like what orphan mode does. I didn't quite get the
>mechanism, and David Mills explained it to me as:

>>>  If a core server fails, the other core servers continue as
>>> before. If its sources fail, it pops to the orphan stratum
>>> and nobody believes it. If all sources fail, the core servers
>>> all operate at stratum 5, which means the clients will
>>> elect only one of them; the others will be disregarded.

>This feature solves the problem of a core server losing its UTC
>sources.

>> In my experience unless there is something seriously wrong with your
>> system after a bit of drift correction machines will stay within 1s of
>> UTC for months. This is assuming they stay turned on.

>Yes, these are servers, so they're on for years, barring any
>power disruption.

>One added wrinkle that perhaps I haven't emphasized enough:

>Our current implementation uses the reference ntpd internally,
>but without external UTC sources. So while it is stable, it can
>drift over time. (I assume this is the situation referred to as
>a "time island"?)  We now want to allow external UTC sources,
>but we have to ensure that the collective internal "cluster time"
>remains stable, even if the external UTC sources do bad things,
>such as jumping a minute or two. Our testing shows this can
>cause our nodes' clocks to diverge from each other.

>For example:  Internal node A polls the external UTC source
>B and gets the time, then B jumps 5 minutes, and then internal
>node C gets the time. At this point, node C could start slewing
>its clock to converge on the new time from B, moving it away

That is why you do not use ONE external source; you use at least 4, so
that such bad behaviour from one does not destroy your system. You have
the same problem internally already: what happens if your internal node A
jumps by 5 min? What do the rest of your internal machines do?
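
To make the point about multiple sources concrete, here is a minimal
Python sketch of why one bad source out of four can be outvoted while a
single source cannot. It is a simplified median-based filter, NOT ntpd's
actual selection algorithm; the function names, the source names, and the
0.128 s tolerance are assumptions chosen purely for illustration.

```python
from statistics import median


def reject_outliers(offsets, tolerance=0.128):
    """Keep the sources whose offset lies within `tolerance` seconds of the median.

    `offsets` maps a (hypothetical) source name to its measured offset in seconds.
    This is an illustrative filter, not the intersection algorithm ntpd uses.
    """
    mid = median(offsets.values())
    return {name: off for name, off in offsets.items() if abs(off - mid) <= tolerance}


def combined_offset(offsets, tolerance=0.128):
    """Average the surviving offsets, or return None if no strict majority agrees."""
    if not offsets:
        return None
    survivors = reject_outliers(offsets, tolerance)
    # Require more than half of all sources to agree before trusting the result.
    if len(survivors) * 2 <= len(offsets):
        return None
    return sum(survivors.values()) / len(survivors)


if __name__ == "__main__":
    # Four sources, one of which has jumped by 5 minutes (300 s).
    print(combined_offset({"src1": 0.004, "src2": -0.002, "src3": 0.001, "src4": 300.0}))
    # A single source that has jumped: nothing is available to contradict it.
    print(combined_offset({"only": 300.0}))
    # Two sources that disagree: no majority, so no usable answer.
    print(combined_offset({"a": 0.0, "b": 300.0}))
```

With four sources the 300 s outlier is dropped and the combined offset
stays near 1 ms. With a single source the jump is accepted as is, and
with two disagreeing sources there is no majority to decide between them.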


>from A. And raising the polling frequency has limits, and we
>found that on very busy servers, we could not raise it enough
>to eliminate this particular "failure mode."  (I use quotes
>because it is only a failure in the context of our requirements,
>not in the functional behavior of the daemon.)

What do you do if A suffers from that failure mode now?



>I appreciate everyone's input. Having used ntpd for years, but
>only in common configurations, I hadn't appreciated the
>complexity of it...

>thanks,
>tim



