[ntp:hackers] More on the clock discipline algorithm

Frederick Bruckman fredb at immanent.net
Sun Oct 19 11:31:05 PDT 2003


On Sun, 19 Oct 2003, David L. Mills wrote:

> The scenario develops following the holdover when the local clock offset
> happens to be greater than the step threshold. Normally, this doesn't
> happen even if the holdover is a day or more, since the holdover offset
> due to frequency drift is normally less than one PPM (86 ms per day), so
> either the holdover period is very long or something has broken the
> discipline loop. Either is unlikely in common cases as confirmed by the
> radio drivers that have to deal with frequent holdovers up to a day.
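
For scale, the 86 ms per day figure quoted above is easy to check. This is only a back-of-the-envelope sketch: it assumes a constant frequency error and the default 128 ms step threshold (the threshold value is not stated in this thread), and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

# Assumption: the default step threshold of 128 ms; not given in this thread.
STEP_THRESHOLD_MS = 128.0


def drift_ms_per_day(freq_error_ppm):
    """Offset (ms) accumulated in one day at a constant frequency error."""
    return freq_error_ppm * 1e-6 * SECONDS_PER_DAY * 1e3


def days_to_reach(threshold_ms, freq_error_ppm):
    """Days of undisciplined holdover before the offset reaches threshold_ms."""
    return threshold_ms / drift_ms_per_day(freq_error_ppm)


if __name__ == "__main__":
    print("drift at 1 PPM: %.1f ms/day" % drift_ms_per_day(1.0))
    print("days to reach threshold at 1 PPM: %.2f"
          % days_to_reach(STEP_THRESHOLD_MS, 1.0))
```

Under those assumptions a 1 PPM error needs roughly a day and a half of holdover to reach the threshold, which fits the remark that either the holdover is very long or something else is wrong.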

The problem is not limited to the local clock driver. It's just more
evident with a local clock.

The same thing you describe happens much more often with hosts that
are synchronized over a dial-up link. In most cases, the "spike" mode
correction seems to be at fault. If we're not going to trust the
frequency, and we're going to switch into frequency mode for the
stepout period anyway, why step the clock (after a spike)? If the
frequency mode isn't judged adequate to correct for any frequency
error, maybe we could "split the difference", and let the spike
correction only be half of the calculated amount, with the idea that
the frequency mode correction will straighten it out?

> Enter now the reported frequency surge after holdover. The problem here
> may be an insane frequency file or some very large twitch in the
> intrinsic frequency offset, an argument between the kernel frequency and
> daemon frequency or, as may be the case in the report, a gradual drift
> of time offset beyond the step threshold with only modest frequency
> error.

There's an error in the calculation for certain avenues into frequency
mode. If you always get into S_FREQ the assumed way, through S_SPIKE,
then one interval later S_TSET, then one interval later S_FREQ, there's
no problem, because rstclock() with zero as the last argument sets
fp_offset to zero just as last_time is reset. Other paths through
the state machine, disconnected operation in particular, can leave
fp_offset representing the change in offset over a much greater
interval than the time since last_time. In that case, the calculated
frequency change at the end of frequency mode overcorrects.
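
To make the overcorrection concrete, here is a toy model, not ntpd's actual code. It assumes the frequency correction is simply the offset change divided by the elapsed time since last_time; the drift rate, the disconnected period and the helper name are invented for illustration.

```python
def freq_correction_ppm(offset_change_s, elapsed_s):
    """Frequency correction (PPM) implied by an offset change over elapsed_s."""
    return offset_change_s / elapsed_s * 1e6


# Hypothetical numbers, chosen only to show the effect.
TRUE_DRIFT_PPM = 5.0           # real frequency error of the clock
STEPOUT_S = 900.0              # fifteen-minute stepout interval
DISCONNECTED_S = 6 * 3600.0    # time spent offline before frequency mode

# Assumed path: the offset reference and last_time are reset together, so
# the offset change and the divisor cover the same interval.
matched = freq_correction_ppm(TRUE_DRIFT_PPM * 1e-6 * STEPOUT_S, STEPOUT_S)

# Other path: the offset change spans the disconnected period plus the
# stepout, but is still divided by the stepout alone.
total_s = DISCONNECTED_S + STEPOUT_S
mismatched = freq_correction_ppm(TRUE_DRIFT_PPM * 1e-6 * total_s, STEPOUT_S)

if __name__ == "__main__":
    print("matched intervals:    %.1f PPM" % matched)
    print("mismatched intervals: %.1f PPM" % mismatched)
```

In this model the error factor is just the ratio of the two intervals, so the longer the disconnected period, the worse the overcorrection.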

I propose a fix for that in bug #177:

    http://bugzilla.ntp.org/attachment.cgi?id=75&action=view

Basically, the fix involves saving fp_offset every time last_time is
reset, so that no matter how we get into frequency mode, we'll have
enough information to precisely average the drift that occurred (only)
over the stepout interval.
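
The idea can be sketched like this. It is an illustration of the bookkeeping only, not the patch attached to the bug; the class, its method names and the base_offset field are hypothetical.

```python
class DriftTracker:
    """Toy bookkeeping: remember the offset whenever last_time is reset."""

    def __init__(self):
        self.last_time = 0.0
        self.base_offset = 0.0   # hypothetical field: offset saved at reset

    def reset(self, now, offset):
        # Save the reference offset at the same moment as the timestamp,
        # so both always describe the same interval.
        self.last_time = now
        self.base_offset = offset

    def freq_ppm(self, now, offset):
        """Average drift (PPM) since the last reset, and only since then."""
        elapsed = now - self.last_time
        if elapsed <= 0:
            raise ValueError("no time has elapsed since the last reset")
        return (offset - self.base_offset) / elapsed * 1e6


if __name__ == "__main__":
    drift = 5e-6                      # 5 PPM, an invented value
    tracker = DriftTracker()
    t_reset = 6 * 3600.0              # offline for six hours, then reset
    tracker.reset(t_reset, drift * t_reset)
    t_end = t_reset + 900.0           # end of the stepout interval
    print("%.2f PPM" % tracker.freq_ppm(t_end, drift * t_end))
```

Because the saved offset and last_time always move together, the estimate no longer depends on which path through the state machine led to frequency mode.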

> The rules of the game are that in case of time step all past history is
> expunged. Security and reliability insist that no prior time values be
> believed, so the daemon starts from scratch. But, should the daemon
> assume the step was due to a frequency or time error or both? To recover
> from the baddest case of corrupt frequency file, it assumes a frequency
> error, which of course leads to the reported behavior. The daemon could
> simply step the time and leave the frequency alone, but this invites the
> corrupt frequency file hazard.

Yes, exactly. The same bug that affects frequency mode exists in spike
mode, too, and would seemingly exacerbate the problem, but my proposed
fix, as a practical matter, doesn't help much. Rather, it lets you
track the short-term network transient better, so it actually makes it
worse. (The fix to the subsequent frequency mode correction always
seems to iron it out, though.)

> Prior instances of simulation have proven the clock discipline a nasty
> beast to tame, given the wide range of evil failures/misconfigures it
> might have to cope with. I've simulated all the cases I can imagine in
> the present design, but maybe the state machine should be modified. One
> or another refinements might be:
>
> 1. Discipline the frequency directly only when the frequency file is not
> present. No change from the present behavior. On the other hand, if it
> is and the apparent frequency correction is greater than the machine can
> cope with, give up with a message to the log. In the reported case the
> daemon would exit, but since the frequency discipline during holdover is
> very likely defective, this might be the correct action to take. The
> user would have to take remedial action requiring hand massage or
> removal of the frequency file.
>
> 2. The above action but ignore current frequency and start all over from
> the beginning. This would "fix" the reported problem and instantly cause
> another one since the daemon would take another fifteen minutes to
> regain sanity.
>
> 3. Ignore the problem but find out why after holdover the time offset is
> so large.
>
> 4. There may be other ways to "fix" the problem, but they have to obey
> the ground rules and work correctly in multiple contrived simulations. I
> invite a volunteer to do this, but be advised it's a lot of work.

I'll take half an order of (1), half an order of (2), and a full order
of (3)...

As I said, even after you fix the frequency mode bug, the spike mode
issue still exists. I believe there's just not enough information, in
the brief interval of non-disciplined operation that precedes the
spike, to make an accurate frequency correction. If the (new,
improved) frequency mode correction proves adequate to fix any error,
then we don't need to correct the frequency at spike time. That would
be the preferred solution, since it would only seem to hurt clocks
that are way off anyhow, and only for the one fifteen-minute interval. If it
needs a little help for whopping frequency errors, we could, as a
compromise, "split the difference", and undercorrect the spike by
one-half.
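
A rough way to compare the three options (full correction, no correction, half) is to look at the frequency error each leaves for frequency mode to clean up. This is a sketch with invented numbers; the function names and the fraction parameter are not from ntpd.

```python
def spike_adjust_ppm(calculated_ppm, fraction=0.5):
    """Frequency change applied at spike time: a fraction of the estimate."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return calculated_ppm * fraction


def residual_ppm(true_error_ppm, calculated_ppm, fraction):
    """Frequency error left over for the later frequency mode correction."""
    return true_error_ppm - spike_adjust_ppm(calculated_ppm, fraction)


if __name__ == "__main__":
    # Hypothetical case: a real error of 20 PPM, but the brief interval
    # before the spike yields a noisy estimate of 60 PPM.
    true_error, estimate = 20.0, 60.0
    for label, fraction in (("full", 1.0), ("half", 0.5), ("none", 0.0)):
        print("%s: residual %+.1f PPM"
              % (label, residual_ppm(true_error, estimate, fraction)))
```

With a poor estimate, the half correction leaves a smaller residual than the full one; with a good estimate, it leaves half the error behind. Either way the later frequency mode correction has to finish the job.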

Frederick


