[ntp:questions] [Bug 177] Clock stepping messes up frequency. (fwd)

David L. Mills mills at udel.edu
Wed Oct 22 16:49:23 UTC 2003


Guys,

I am cross-posting this and an earlier message from bugzilla, since the
"bug" is unresolved and the issue should be considered by the general
user population.

The issue has to do with recovery following a holdover period when the
local clock driver disciplines the clock in the absence of external
synchronization sources. The report is that, if the local clock has
drifted outside the step threshold, the correction introduces an
unwanted frequency surge. The frequency surge is a secondary effect
resulting from an engineered response to a possibly corrupted frequency
file. However, this is not the issue I am concerned about. Rather, I'm
concerned about how the offset could have drifted greater than the step
threshold in the first place.

In general, the clock discipline maintains the frequency within one
part-per-million (PPM) once the algorithm has settled down in a few
hours after restart. Therefore, the local clock should be able to
free-run for periods of about 36 hours without the offset exceeding the
default step threshold of 128 ms. One report suggests that, when the
sources are once again reachable via dial-up connection, a residual
offset greater than the step threshold is created. I can't confirm this
here, but there may be something I have missed. Note that similar
scenarios happen all the time with the HF radio drivers when the signals
are lost for a day or more and the residual correction when the signals
are again found is only a few milliseconds.
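
The arithmetic behind the holdover estimate can be checked with a short
sketch. The Python below is only an illustration, not code from the
daemon; the function names are made up here, and it assumes a constant
residual frequency error with no temperature wander. The step threshold
is passed as a parameter, and the run evaluates both 125 ms and 128 ms.

```python
def drift_seconds(freq_error_ppm, elapsed_s):
    """Offset accumulated by a free-running clock with a constant frequency error."""
    return freq_error_ppm * 1e-6 * elapsed_s


def holdover_hours(freq_error_ppm, step_threshold_s):
    """Hours of free-running before the offset reaches the step threshold."""
    return step_threshold_s / (freq_error_ppm * 1e-6) / 3600.0


if __name__ == "__main__":
    # One PPM sustained for a full day (86400 s).
    print(f"1 PPM for one day: {drift_seconds(1.0, 86400) * 1000:.1f} ms")
    # Holdover time for two threshold values and several frequency errors.
    for threshold_s in (0.125, 0.128):
        for ppm in (1.0, 10.0, 50.0):
            hours = holdover_hours(ppm, threshold_s)
            print(f"{threshold_s * 1000:.0f} ms threshold, {ppm:g} PPM: {hours:.1f} h")
```

At 1 PPM the threshold is reached only after roughly a day and a half,
while at 50 PPM it is reached in well under an hour, so an offset beyond
the threshold after a short outage points to a frequency error far
larger than the discipline normally leaves behind.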

So, I am looking for reports that, following an extended outage and
whether or not the local clock driver is used, the local clock has
drifted beyond the step threshold. I can't reproduce this here,
either in simulation or practice, unless something serious is broken in
the hardware or operating system. One possibility to consider is that
sleep/suspend mode could have torqued the clock during the holdover
interval. Other reports relating to this issue would be much appreciated.

Dave

"David L. Mills" wrote:
> 
> Sirs,
> 
> I do not have the context for this message; the cited saved_offset is
> not a variable in the current code. This message does not belong on this
> list; it belongs at hackers or better yet the newsgroup.
> 
> First, note that a step completely erases all past information. This was
> the considered opinion of the Privacy and Security Research Group some
> years ago and conforms to the intended model described on the NTP
> project page. I believe the context as described in an earlier message
> is where the local clock driver was used as holdover and resulted in a
> large offset when a synchronized source is re-found. However, the
> behavior is much more general and will happen whether or not a holdover
> is used. It will happen when the client is first synchronized, then
> loses all sources for a period during which the clock drifts outside the
> step threshold. Upon re-discovery the daemon assumes that the large
> error is due to initial frequency error and corrects it accordingly.
> This is probably not the ideal choice; but, read on.
> 
> In the vast number of possible cases the daemon refines the frequency
> with error well below one PPM, which equates to about 86 ms per day and
> well within the step threshold. So, in the case cited how did the offset
> error get so large? The radio drivers routinely lose the signal for
> periods up to a day or more and the error upon resynchronization is
> typically less than a few milliseconds.
> 
> Having said that, the state machine should be a little more clever about
> the S_FREQ state and avoid that state if previously synchronized. This
> is not strictly conformant to the PSRG principles and definitely would
> incite serious revolt from the Digital folks that designed DTSS. Those
> folks would not accept frequency steering at all, since it involves
> remembering something from one poll to the next.
> 
> The fix is trivial; the proof that it works in all conceivable cases is
> dreadful and considered possible only in simulation.
> 
> Dave
> 
> bugzilla at ntp.org wrote:
> >
> > http://bugzilla.ntp.org/show_bug.cgi?id=177
> >
> > ------- Additional Comments From bruckman at ntp.org  2003-10-17 15:54 -------
> > No. The saved_offset was saved while the peers were still reachable, so it's
> > not suspect, and if you fail to use the saved_offset while any kind of drift
> > exists, you're definitely going to overcorrect for the drift.
> >
> > Observe that the peers are still being polled, and the intersection and
> > selection algorithms are still at work, while you're in frequency mode, so it's
> > not a random packet that caps the frequency mode interval, triggering the step
> > and frequency change, rather, it's a "good" packet, so you're not any more
> > vulnerable to follow a falseticker then, than you are normally -- albeit doing
> > so would have a more profound and immediate effect. The worst that can happen,
> > however, is that you'll sync to the falseticker for a while, then when it's
> > finally detected and eliminated, you'll go back into frequency mode for another
> > 900 seconds, and take another shot.
> >
> > A better way to protect against a single falseticker, than to mess with the
> > frequency mode correction, is with the "tos" directive in "ntp.conf". E.g.:
> > "tos minsane 2". With that, packets from any source which isn't consistent with
> > at least one other would have no effect on the loop at all.
> >
> > ------- You are receiving this mail because: -------
> > You are on the CC list for the bug, or are watching someone who is.


