[ntp:hackers] More on the clock discipline algorithm

peda at sectra.se peda at sectra.se
Mon Oct 20 03:09:04 PDT 2003


Hello,

I realise that I don't understand many of the things that are discussed 
here, but as far as I understand this case has not been covered:

A holdover (which I interpret as a period during which nothing but the
local clock is available) can easily be implicitly longer than one day, if
you take into account a period when the computer is switched off. My
system had an offset as large as 2.5 seconds after being switched off over
the weekend. So, if I switch on my computer and for some reason don't have
access to time (bad GPS reception, no network, whatever), the local clock
is selected. Now, I have a correct frequency file (well, it's not way off
anyway), and still the frequency goes haywire when I do get a correct
clock.

I have this ntp.conf:

tinker panic 0
logfile D:\NTPData\ntp.log
server 127.127.31.0 mode 128 # Patched driver, should not matter
server 127.127.1.1
driftfile D:\NTPData\ntp.drift
statsdir D:\NTPData\ntp.stats\
statistics loopstats

ntp.log from this 'morning', when I pulled the GPS cable on purpose before
starting ntpd and reinserted it just after 10:13:54:

20 Oct 10:10:41 NTP[732]: refclock_jupiter: Enable output mode failed
20 Oct 10:10:41 NTP[732]: refclock_jupiter: time_pps_getcap failed: No error
20 Oct 10:10:41 NTP[732]: frequency initialized 12.902 PPM from D:\NTPData\ntp.drift
20 Oct 10:12:51 NTP[732]: refclock_jupiter: Enable output mode failed
20 Oct 10:13:54 NTP[732]: synchronized to LOCAL(1), stratum=5
20 Oct 10:16:04 NTP[732]: jupiter_receive: 12 chan ver 03.00, 01/29/01 (0003)
20 Oct 10:20:24 NTP[732]: synchronized to GPS_JUPITER(0), stratum=0
20 Oct 10:35:35 NTP[732]: time reset -2.476448 s
20 Oct 10:35:35 NTP[732]: frequency error -2553 PPM exceeds tolerance 500 PPM
20 Oct 10:39:53 NTP[732]: synchronized to LOCAL(1), stratum=5
20 Oct 10:41:00 NTP[732]: synchronized to GPS_JUPITER(0), stratum=0
20 Oct 10:57:02 NTP[732]: time reset +0.619402 s
20 Oct 11:01:16 NTP[732]: synchronized to LOCAL(1), stratum=5
20 Oct 11:01:19 NTP[732]: synchronized to GPS_JUPITER(0), stratum=0
20 Oct 11:38:49 NTP[732]: time reset -0.204193 s
20 Oct 11:43:08 NTP[732]: synchronized to LOCAL(1), stratum=5
20 Oct 11:43:10 NTP[732]: synchronized to GPS_JUPITER(0), stratum=0

and here's the corresponding loopstats file:

52932 29699.193 0.000000000 12.902000 0.000003815 0.000000 6
52932 29768.207 0.000000000 12.902000 0.000003815 0.000000 6
52932 29834.220 0.000000000 12.902000 0.000003815 0.000000 6
52932 29899.233 0.000000000 12.902000 0.000003815 0.000000 6
52932 29963.246 0.000000000 12.902000 0.000003815 0.000000 6
52932 30938.451 0.000000000 -500.000000 0.000062879 1283.133902 4
52932 31193.888 0.000000000 -500.000000 0.000054488 1111.226555 6
52932 31257.868 0.000000000 -500.000000 0.000047227 962.350426 6
52932 32221.572 0.000000000 182.913133 0.148916330 900.656063 4
52932 32476.284 0.000000000 182.913133 0.128965325 779.991031 6
52932 32479.628 0.003375425 182.913535 0.112216584 675.492047 6
52932 32542.647 -0.008263933 182.882011 0.098265503 584.993273 6
52932 32605.666 -0.017353830 182.816845 0.086775394 506.619037 6
52932 32669.677 -0.027662828 182.712969 0.077752832 438.744959 6
52932 32735.685 -0.037655714 182.569324 0.071089434 379.964287 6
52932 32800.682 -0.046451039 182.392127 0.065503156 329.058737 6
52932 32866.690 -0.055434939 182.180660 0.060699985 284.973245 6
52932 32929.707 -0.062989692 181.940373 0.056403052 246.794099 6
52932 32993.713 -0.070035411 181.677383 0.052461338 213.730000 6
52932 33057.720 -0.076855147 181.384204 0.048885768 185.095667 6
52932 33122.734 -0.083328414 181.066332 0.045568275 160.297629 6
52932 33186.740 -0.089286194 180.725732 0.042470330 138.821923 6
52932 33251.737 -0.094919139 180.363644 0.039604085 120.223448 6
52932 33317.740 -0.100253092 179.981209 0.036951282 104.116736 6
52932 33383.748 -0.105218342 179.579833 0.034518442 90.167962 6
52932 33449.749 -0.109834633 179.160847 0.032267858 78.088026 6
52932 33515.755 -0.114122289 178.725505 0.030165154 67.626565 6
52932 33580.766 -0.118045353 178.275198 0.028182914 58.566756 6
52932 33645.761 -0.121678580 177.811031 0.026312485 50.720829 6
52932 33708.772 -0.124952944 177.334373 0.024527793 43.926173 6
52932 34729.965 0.000000000 12.975849 0.028719474 90.556958 4
52932 34988.804 0.000000000 12.975849 0.024871794 78.424626 6
52932 35054.655 -0.011172658 12.933229 0.021539621 67.917722 6
52932 35180.635 -0.010052780 12.894881 0.018661950 58.818476 6
52932 35243.646 -0.009380653 12.859655 0.016182901 50.938297 6
52932 35308.638 -0.008755783 12.826777 0.014053136 44.113862 6
52932 35373.651 -0.008265858 12.795245 0.012224130 38.203729 6
52932 35436.641 -0.007646755 12.766075 0.010662501 33.085403 6
52932 35499.652 -0.007178503 12.739119 0.009322554 28.652802 6
52932 35563.654 -0.006714397 12.713906 0.008167120 24.814058 6
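
For readers who want to plot or inspect these records, here is a minimal Python sketch of a parser. It assumes the usual loopstats layout (modified Julian day, seconds past midnight UTC, clock offset in seconds, frequency in PPM, jitter in seconds, wander in PPM, time constant). The names `LoopSample` and `parse_loopstats_line` are made up for this illustration and are not part of ntpd.

```python
from datetime import date, timedelta
from typing import NamedTuple

class LoopSample(NamedTuple):
    day: date            # calendar date derived from the MJD field
    seconds: float       # seconds past midnight UTC
    offset_s: float      # clock offset, seconds
    freq_ppm: float      # frequency (drift compensation), PPM
    jitter_s: float      # RMS jitter, seconds
    wander_ppm: float    # RMS frequency wander, PPM
    time_constant: int   # clock discipline time constant

MJD_EPOCH = date(1858, 11, 17)  # MJD 0

def parse_loopstats_line(line: str) -> LoopSample:
    """Split one whitespace-separated loopstats record into typed fields."""
    mjd, secs, offset, freq, jitter, wander, tc = line.split()
    return LoopSample(
        MJD_EPOCH + timedelta(days=int(mjd)),
        float(secs), float(offset), float(freq),
        float(jitter), float(wander), int(tc),
    )

# The first record written after the second step in the log above.
sample = parse_loopstats_line(
    "52932 32221.572 0.000000000 182.913133 0.148916330 900.656063 4"
)
print(sample.day, sample.freq_ppm)
```

Read this way, the frequency column shows the excursion directly: 12.902 PPM from the drift file, clamped at -500 PPM after the first step, about 183 PPM after the second, and back near 12.9 PPM only after the third.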

So, more than 30 minutes to settle down, and that is IMHO not acceptable.
And that was only because I was lucky this time. If one of the steps had
produced a frequency that was wrong but not wildly so, it would have taken
longer to recover. Frederick Bruckman wrote a patch that makes this
problem go away. I like this patch, but I'm ignorant of a great many
things in ntp_loopfilter.c, so the relevance of my opinion is questionable.

I can't see what I can do differently, please enlighten me. I want to be
able to fall back on the local clock, and I sometimes don't have other
time sources available. Being a couple of seconds off on boot and not
having a GPS solution is not that odd, is it? At least not if you're
thinking laptop.

Sincerely,
Peter Ekberg






"David L. Mills" <mills at udel.edu>
Sent by: hackers-bounces at ntp.org
2003-10-20 03:03

 
        To:     hackers at ntp.org
        cc: 
        Subject:        Re: [ntp:hackers] More on the clock discipline algorithm


Frederick,

Prior experience (see my Trans. Networking paper and compare with
current discipline algorithm) suggests we don't do things half way.
Either one way or the other. The present design expects residual jitter
to be relatively small in the vast majority of cases, in fact, much less
than the step threshold, and if large is probably due to a defective
frequency file. The point I'm not getting in the discussion here is why
the large offset in the first place and whether a legitimate offset this
large can occur unless something else is broken.

The discipline is purposely blind to source selection, even if the local
clock is a kludge, and the expectation is that most sources will be
reasonably close to each other in the normal case, surely much less than
the step threshold after the clock filters. So, the reasoning goes that,
if a large offset exists with multiple servers, the offsets between
those servers will be only moderate and the large offset must be due to
the frequency file or something equally terrible. If the local clock was
stepped in TSET and is beyond the step threshold 900 seconds later in
SPIK, something must be seriously wrong, most likely a defective
frequency file.
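
The arithmetic behind this reasoning can be sketched roughly in Python. This is only an illustration, not ntpd's code: the function name `implied_frequency_ppm` is invented here, the 900 s value is the nominal stepout quoted above, and the offset is the first step reported in Peter's log.

```python
def implied_frequency_ppm(offset_s: float, interval_s: float) -> float:
    """Frequency error that would explain offset_s accumulating over interval_s."""
    return offset_s / interval_s * 1e6

TOLERANCE_PPM = 500.0   # tolerance quoted in the log message

# First step in the log: -2.476448 s, assumed here to have built up
# over the nominal 900 s stepout interval.
ppm = implied_frequency_ppm(-2.476448, 900.0)
exceeds = abs(ppm) > TOLERANCE_PPM
print(f"{ppm:.1f} PPM, exceeds tolerance: {exceeds}")
```

The log reports -2553 PPM rather than this figure, which would correspond to an interval of roughly 970 s. Either way, an offset of seconds attributed to a few hundred seconds of drift lands far outside the 500 PPM tolerance, which is why the discipline concludes the frequency must be at fault.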

There is another complicating factor not evident here. When only two
sources are available and the confidence intervals do not overlap, no
majority clique can exist and the clock will not be disciplined.
However, the design of the local clock interface is that only two
possibilities exist, one where the ordinary sources discipline the clock
and the other where no sources are available and the clock free-runs at
the last disciplined frequency. Frederick raises the issue that under
some dialup conditions a large transient may exist upon initial
connection. I assume the iburst mode is in use, in which case the clock
filter should have waxed that transient if not the popcorn spike. Is the
problem that these mechanisms don't work in some cases? I'd like to see
a peerstats/loopstats plot.
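
The two-source case can be illustrated with a simplified Python sketch. It is a stand-in for the idea of a majority clique, not ntpd's actual intersection algorithm: each source is reduced to a hypothetical (low, high) confidence interval in seconds, and `has_majority_clique` is a name invented for this example.

```python
def has_majority_clique(intervals):
    """Return True if more than half of the intervals share a common point.

    intervals: list of (low, high) bounds, one per source, in seconds.
    """
    events = []
    for low, high in intervals:
        events.append((low, 0))    # 0 sorts before 1, so touching intervals overlap
        events.append((high, 1))
    events.sort()

    best = depth = 0
    for _, kind in events:
        if kind == 0:              # an interval opens
            depth += 1
            best = max(best, depth)
        else:                      # an interval closes
            depth -= 1
    return best > len(intervals) / 2

# Two sources about 2.4 s apart: no overlap, so no majority.
disagree = has_majority_clique([(-0.01, 0.01), (2.40, 2.50)])
# Two sources that agree to within their bounds.
agree = has_majority_clique([(-0.01, 0.02), (0.00, 0.03)])
print(disagree, agree)
```

With only two sources, a majority means both, so any gap between the two intervals leaves nothing to discipline the clock with.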

The local clock breaks these assumptions, of course, but even in that
case some explanation must be offered how the local clock got so far off
from its sources during holdover. Unless some compelling argument can be
found for this, I sway toward the view that large offsets are most
likely due to a broken frequency file. While the experiments I can do
here may not accurately reflect Frederick's case, those I can do show
the discipline switches to TSET following the holdover and coasts
through the stepout interval as designed. Here's the rub. If the offset
is due to a transient and the frequency is in fact close to correct,
then sometime during the stepout interval it would be expected that a
sample less than the step threshold would show up and reset the stepout
timer. If this were the case the discipline would with high probability
never get to the stepout threshold. The fact the report is otherwise
tells me the offset must be persistent and not just a dialup transient.
This really does need to be explained.
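
The stepout argument can be sketched as a small Python model. It is a deliberately simplified reading of the description above, not the state machine in ntp_loopfilter.c: the 0.128 s step threshold is ntpd's documented default, the 900 s stepout is the value quoted in this thread, and `would_step` is a hypothetical helper.

```python
STEP_THRESHOLD = 0.128   # seconds; ntpd's documented default step threshold
STEPOUT = 900.0          # seconds; stepout interval quoted in this thread

def would_step(samples, step_threshold=STEP_THRESHOLD, stepout=STEPOUT):
    """Return the time at which a step would occur, or None if it never does.

    samples: iterable of (time_s, offset_s) pairs in time order.
    """
    spike_start = None   # time of the first sample in the current run of large offsets
    for t, offset in samples:
        if abs(offset) < step_threshold:
            spike_start = None        # small sample: the stepout timer is reset
        elif spike_start is None:
            spike_start = t           # first large sample: the timer starts
        elif t - spike_start >= stepout:
            return t                  # large offsets persisted for the whole stepout
    return None

# A transient: two large samples, then the offset collapses.
transient = [(0, 2.4), (64, 2.4), (128, 0.01), (192, 0.0)]
# A persistent offset sampled every 64 s.
persistent = [(i * 64, 2.4) for i in range(20)]
print(would_step(transient), would_step(persistent))
```

In this model a single sample under the threshold is enough to prevent the step, so a step at the end of the stepout interval implies that every sample in between stayed large.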

Dave

