[ntp:hackers] More on the clock discipline algorithm

David L. Mills mills at udel.edu
Sun Oct 19 18:03:02 PDT 2003


Frederick,

Prior experience (see my Trans. Networking paper and compare with the
current discipline algorithm) suggests we don't do things halfway.
Either one way or the other. The present design expects residual jitter
to be relatively small in the vast majority of cases, in fact much less
than the step threshold; if it is large, the cause is probably a
defective frequency file. The point I'm not getting in the discussion
here is why the large offset occurs in the first place and whether a
legitimate offset this large can occur unless something else is broken.

The discipline is purposely blind to source selection, even if the local
clock is a kludge, and the expectation is that most sources will be
reasonably close to each other in the normal case, surely much less than
the step threshold after the clock filters. So, the reasoning goes that,
if a large offset exists with multiple servers, the offsets between
those servers will be only moderate and the large offset must be due to
the frequency file or something equally terrible. If the local clock was
stepped in TSET and is beyond the step threshold 900 seconds later in
SPIK, something must be seriously wrong, most likely a defective
frequency file.

There is another complicating factor not evident here. When only two
sources are available and the confidence intervals do not overlap, no
majority clique can exist and the clock will not be disciplined.
However, the design of the local clock interface is that only two
possibilities exist, one where the ordinary sources discipline the clock
and the other where no sources are available and the clock free-runs at
the last disciplined frequency. Frederick raises the issue that under
some dialup conditions a large transient may exist upon initial
connection. I assume the iburst mode is in use, in which case the clock
filter should have waxed that transient, if not the popcorn spike
suppressor. Is the
problem that these mechanisms don't work in some cases? I'd like to see
a peerstats/loopstats plot.
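
To make the two-source case concrete, here is a minimal sketch, assuming
each source's confidence interval is its offset plus or minus a distance
term (root distance, in NTP terms). The function name and the sample
numbers are illustrative only and are not taken from ntpd.

```python
def intervals_overlap(offset_a, dist_a, offset_b, dist_b):
    """Return True if [offset - dist, offset + dist] of both sources intersect."""
    lo = max(offset_a - dist_a, offset_b - dist_b)  # highest lower bound
    hi = min(offset_a + dist_a, offset_b + dist_b)  # lowest upper bound
    return lo <= hi


# Two sources that roughly agree: intervals intersect, a clique can form.
close_pair = intervals_overlap(0.010, 0.005, 0.012, 0.005)

# Two sources 200 ms apart with 5 ms distances: no intersection, so with
# only two sources no majority exists and the clock is not disciplined.
far_pair = intervals_overlap(0.000, 0.005, 0.200, 0.005)

print(close_pair, far_pair)
```

With only two sources, both must fall in the intersection to form a
majority, which is why a single disagreement leaves nothing to
discipline the clock.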

The local clock breaks these assumptions, of course, but even in that
case some explanation must be offered how the local clock got so far off
from its sources during holdover. Unless some compelling argument can be
found for this, I sway toward the view that large offsets are most
likely due to a broken frequency file. While the experiments I can do
here may not accurately reflect Frederick's case, those I can do show
the discipline switches to TSET following the holdover and coasts
through the stepout interval as designed. Here's the rub. If the offset
is due to a transient and the frequency is in fact close to correct,
then sometime during the stepout interval it would be expected that a
sample less than the step threshold would show up and reset the stepout
timer. If this were the case the discipline would with high probability
never get to the stepout threshold. The fact the report is otherwise
tells me the offset must be persistent and not just a dialup transient.
This really does need to be explained.
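
The stepout argument can be sketched as a toy timer. This is a
simplified model, not the ntpd state machine: the 900 s stepout comes
from the discussion, the 0.128 s step threshold is assumed to be the
default, and the function name and the 64 s sample spacing are made up
for illustration.

```python
STEP_THRESHOLD = 0.128  # seconds; assumed default step threshold
STEPOUT = 900.0         # seconds; the stepout interval from the discussion


def step_time(samples):
    """Return the time at which a step would occur, or None if it never does.

    samples is an iterable of (time_s, offset_s) pairs in time order.
    """
    spike_start = None
    for t, offset in samples:
        if abs(offset) < STEP_THRESHOLD:
            spike_start = None       # an in-range sample resets the timer
        elif spike_start is None:
            spike_start = t          # first out-of-range sample starts it
        elif t - spike_start >= STEPOUT:
            return t                 # offset persisted for the whole stepout
    return None


# Persistent 500 ms offset, one sample every 64 s.
persistent = [(64 * i, 0.5) for i in range(20)]

# Same run, but one sample at t=512 falls below the threshold.
transient = [(t, 0.01 if t == 512 else 0.5) for t, _ in persistent]

print(step_time(persistent), step_time(transient))
```

In this model a single in-range sample is enough to keep the timer from
expiring, so reaching the stepout threshold implies the offset was out
of range for every sample in the interval.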

Dave

Frederick Bruckman wrote:
> 
> On Sun, 19 Oct 2003, David L. Mills wrote:
> 
> > The scenario develops following the holdover when the local clock offset
> > happens to be greater than the step threshold. Normally, this doesn't
> > happen even if the holdover is a day or more, since the holdover offset
> > due to frequency drift is normally less than one PPM (86 ms per day), so
> > either the holdover period is very long or something has broken the
> > discipline loop. Either is unlikely in common cases as confirmed by the
> > radio drivers that have to deal with frequent holdovers up to a day.
> 
> The problem is not limited to the local clock driver. It's just more
> evident with a local clock.
> 
> The same thing you describe happens much more often with hosts that
> are synchronized over a dial-up. In most cases, the "spike" mode
> correction seems to be at fault. If we're not going to trust the
> frequency, and we're going to switch into frequency mode for the
> stepout period anyway, why step the clock (after a spike)? If the
> frequency mode isn't judged adequate to correct for any frequency
> error, maybe we could "split the difference", and let the spike
> correction only be half of the calculated amount, with the idea that
> the frequency mode correction will straighten it out?
> 
> > Enter now the reported frequency surge after holdover. The problem here
> > may be an insane frequency file or some very large twitch in the
> > intrinsic frequency offset, an argument between the kernel frequency and
> > daemon frequency or, as may be the case in the report, a gradual drift
> > of time offset beyond the step threshold with only modest frequency
> > error.
> 
> There's an error in the calculation for certain avenues into frequency
> mode. If you always get into S_FREQ the assumed way, through S_SPIKE,
> one interval later, S_TSET, then one interval later, S_FREQ, there's
> no problem, because rstclock() with zero as the last argument sets
> fp_offset to zero just as last_time is reset. Other paths through
> the state machine, disconnected operation in particular, can lead
> fp_offset to represent the change in offset over a much greater
> interval than the time since last_time. In that case, the calculated
> frequency change at the end of frequency mode overcorrects.
> 
> I propose a fix for that in bug #177:
> 
>     http://bugzilla.ntp.org/attachment.cgi?id=75&action=view
> 
> Basically, the fix involves saving fp_offset every time last_time is
> reset, so that no matter how we get into frequency mode, we'll have
> enough information to precisely average the drift that occurred (only)
> over the stepout interval.
> 
> > The rules of the game are that in case of time step all past history is
> > expunged. Security and reliability insist that no prior time values be
> > believed, so the daemon starts from scratch. But, should the daemon
> > assume the step was due to a frequency or time error or both? To recover
> > from the baddest case of corrupt frequency file, it assumes a frequency
> > error, which of course leads to the reported behavior. The daemon could
> > simply step the time and leave the frequency alone, but this invites the
> > corrupt frequency file hazard.
> 
> Yes, exactly. The same bug that affects frequency mode exists in spike
> mode, too, and would seemingly exacerbate the problem, but my proposed
> fix, as a practical matter, doesn't help much. Rather, it lets you
> track the short-term network transient better, so it actually makes it
> worse. (The fix to the subsequent frequency mode correction always
> seems to iron it out, though).
> 
> > Prior instances of simulation have proven the clock discipline a nasty
> > beast to tame, given the wide range of evil failures/misconfigures it
> > might have to cope with. I've simulated all the cases I can imagine in
> > the present design, but maybe the state machine should be modified. One
> > or another refinement might be:
> >
> > 1. Discipline the frequency directly only when the frequency file is not
> > present. No change from the present behavior. On the other hand, if it
> > is and the apparent frequency correction is greater than the machine can
> > cope with, give up with a message to the log. In the reported case the
> > daemon would exit, but since the frequency discipline during holdover is
> > very likely defective, this might be the correct action to take. The
> > user would have to take remedial action requiring hand massage or
> > removal of the frequency file.
> >
> > 2. The above action but ignore current frequency and start all over from
> > the beginning. This would "fix" the reported problem and instantly cause
> > another one since the daemon would take another fifteen minutes to
> > regain sanity.
> >
> > 3. Ignore the problem but find out why after holdover the time offset is
> > so large.
> >
> > 4. There may be other ways to "fix" the problem, but they have to obey
> > the ground rules and work correctly in multiple contrived simulations. I
> > invite a volunteer to do this, but be advised it's a lot of work.
> 
> I'll take half an order of (1), half an order of (2), and a full order
> of (3)...
> 
> As I said, even after you fix the frequency mode bug, the spike mode
> issue still exists. I believe there's just not enough information, in
> the brief interval of non-disciplined operation that precedes the
> spike, to make an accurate frequency correction. If the (new,
> improved) frequency mode correction proves adequate to fix any error,
> then we don't need to correct the frequency at spike time. That would
> be the preferred solution, since it would only seem to hurt clocks
> that are way off anyhow for the one fifteen minute interval. If it
> needs a little help for whopping frequency errors, we could, as a
> compromise, "split the difference", and undercorrect the spike by
> one-half.
> 
> Frederick
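
A small numeric sketch of the overcorrection Frederick describes,
assuming frequency mode estimates the frequency error as a change in
offset divided by the time since last_time. The 50 PPM drift, the
one-hour disconnected period, and the helper names are hypothetical;
only the 900 s stepout comes from the thread.

```python
TRUE_DRIFT_PPM = 50.0   # hypothetical constant frequency error
DISCONNECTED = 3600.0   # hypothetical time before last_time is reset, seconds
STEPOUT = 900.0         # stepout interval from the thread, seconds


def offset_at(t):
    """Time offset in seconds accumulated after t seconds of constant drift."""
    return TRUE_DRIFT_PPM * 1e-6 * t


def freq_ppm(offset_change, elapsed):
    """Frequency estimate in PPM: change in offset over elapsed time."""
    return offset_change / elapsed * 1e6


now = DISCONNECTED + STEPOUT

# Mismatched: the offset change spans the whole 4500 s, but the divisor
# is only the 900 s since last_time was reset.
mismatched = freq_ppm(offset_at(now) - offset_at(0.0), STEPOUT)

# Matched: the offset was saved when last_time was reset, so numerator
# and divisor cover the same 900 s.
matched = freq_ppm(offset_at(now) - offset_at(DISCONNECTED), STEPOUT)

print(round(mismatched, 3), round(matched, 3))
```

Under these assumptions the mismatched estimate comes out near 250 PPM
against a true 50 PPM, while saving the offset at the moment last_time
is reset recovers the true figure.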


