[ntp:questions] drift value very large and very unstable

Andy Helten andy.helten at dot21rts.com
Fri Mar 7 15:13:14 UTC 2008


Andy wrote:
> The good news is that "new ntp.conf" appears to work!  This is the first
> configuration that has produced reasonable results, granted it could
> still be a fluke since the drift was rather unpredictable (but _always_
> settled near +/-500ppm).  The bad news is that we _require_ some of the
> commands removed from ntp.conf (at least burst and step).  After letting
> ntp run with the "new ntp.conf" for at least 16 hours, the drift had
> stabilized around 33ppm:
>
> sbc1 root 1->ntpq -crv
> assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
> event_peer/strat_chg,
> version="ntpd 4.2.4p0 at 1.1472 Tue Jan  8 16:23:44 UTC 2008 (1)",
> processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
> stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.272,
> peer=13451, refid=BTFP,
> reftime=c0320bd4.c1843a15  Thu, Mar  7 2002 10:55:00.755, poll=4,
> clock=c0320bd5.6dfc379d  Thu, Mar  7 2002 10:55:01.429, state=4,
> offset=0.029, frequency=33.562, jitter=0.002, noise=0.002,
> stability=0.001
>
>
> This test ran with the previously problematic Redhawk kernel and all of
> the same hardware.  To further isolate the problem, I've added the
> 'burst' command back into ntp.conf, removed the drift file, and
> restarted ntp.
>
> Andy
>   

This may seem like a long email, but it mostly consists of two cut &
paste jobs interspersed with brilliant analysis.  The two cut & paste
chunks are from two different runs of ntp, one that shows ntp working
correctly and one that shows it "failing".  The _only_ difference
between the two runs is that time stepping was left to default behavior
in the working run and time stepping was disabled in the "failing" run. 
So, please don't be discouraged by the length of the email, you may find
it an intriguing read...


As I mentioned in a previous email, I was going to run ntp while adding
back in some of the features I removed.  After doing this, at least
superficially, I've isolated the problem to time step being disabled. 
It doesn't matter whether I specify 'tinker step 0' in ntp.conf or use
the '-x' argument on the command line.  Both result in drift that
approaches 500ppm.  First the working configuration.  Below is ntpq
output and the ntp.conf file after running just over 12 hours with step
_enabled_ in which everything works correctly.  Below the ntpq/ntp.conf
information is the beginning and ending of the stats.loop file for the
same run.

Keep in mind, NTP runs perfectly on this same set of hardware when
running the 2.6.9-42 linux kernel with time stepping *disabled*.  If the
true drift of this system is around 33ppm (two different runs with
stepping disabled have settled near 33ppm), then would the clock offset
ever get larger than 128ms and require a step?  In fact, the answer
seems to be "no, a time step is never even required".  As proof,
stats.loop shows the largest offset to be 0.023634 seconds and that is
the third entry in the file.  The offset only goes down after that,
eventually achieving microsecond accuracy.  Yes, this is with time
stepping *enabled*, but still the point is that no step was even needed
to keep time accurately and to establish a reasonable (but apparently
accurate) drift value.


/*****************************************/
ntpq for working configuration, stepping enabled
/*****************************************/
sbc1 root 3->ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset 
jitter
==============================================================================
*GPS_BANC(0)     .BTFP.           0 l    8   16  377    0.000   -0.006  
0.001
ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 19400  9614   yes   yes  none  sys.peer   reachable  1
ntpq> rv &1
assID=19400 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
flash=00 ok, keyid=0, ttl=64, offset=-0.006, delay=0.000,
dispersion=0.105, jitter=0.001,
reftime=cb7bc62f.16499b2e  Fri, Mar  7 2008  8:48:31.087,
org=cb7bc62f.16499b2e  Fri, Mar  7 2008  8:48:31.087,
rec=cb7bc62f.164a14e6  Fri, Mar  7 2008  8:48:31.087,
xmt=cb7bc62f.16490801  Fri, Mar  7 2008  8:48:31.087,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   -0.01   -0.01   -0.01   -0.01   -0.01   -0.01   -0.01   -0.01,
filtdisp=      0.00    0.20    0.21    0.23    0.24    0.26    0.27    0.47
ntpq> rv
assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
event_peer/strat_chg,
version="ntpd 4.2.4p0 at 1.1472 Tue Jan  8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=0.037,
peer=19400, refid=BTFP,
reftime=cb7bc634.1649ad18  Fri, Mar  7 2008  8:48:36.087, poll=4,
clock=cb7bc635.182e76f5  Fri, Mar  7 2008  8:48:37.094, state=4,
offset=-0.003, frequency=33.551, jitter=0.002, noise=0.002,
stability=0.001
ntpq> quit
sbc1 root 10->ps x|grep ntpd
 9554 ?        Ss     0:00 ntpd -c /etc/ntp_debug.conf
sbc1 root 4->cat /etc/ntp_debug.conf
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable

restrict default nomodify notrap noquery
restrict 127.0.0.1

driftfile /var/lib/ntp/drift

tinker panic 0

server  127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
tos orphan 6

/*****************************************/
stats.loop for working configuration, stepping enabled
/*****************************************/
54532 8012.087 0.014105000 0.000 0.004986901 0.000000 6
54532 8922.088 0.023610000 10.445 0.007191059 3.692976 6
54532 8937.087 0.023634000 10.465 0.006726625 3.454469 6
54532 8952.088 0.023633000 10.484 0.006292182 3.231368 6
54532 8970.087 0.023631000 10.512 0.005885797 3.022683 6
54532 8986.087 0.023628000 10.535 0.005505659 2.827473 6
54532 9004.087 0.023625000 10.559 0.005150073 2.644872 6
54532 9021.088 0.023622000 10.582 0.004817452 2.474065 6
54532 9036.088 0.023620000 10.605 0.004506314 2.314291 5
54532 9053.087 0.023616000 10.699 0.004215271 2.165075 5
54532 9068.087 0.023292000 10.781 0.003944691 2.025449 5
54532 9085.088 0.022913000 10.875 0.003692351 1.894924 5
54532 9101.088 0.022565000 10.961 0.003456068 1.772800 4
54532 9119.090 0.022186000 11.344 0.003235634 1.663816 4
54532 9136.087 0.021169000 11.688 0.003047943 1.561096 4
54532 9151.087 0.020285000 11.977 0.002868169 1.463843 4
54532 9167.088 0.019392000 12.273 0.002701435 1.373317 4
54532 9182.087 0.018601000 12.539 0.002542389 1.288048 4
54532 9197.088 0.017851000 12.793 0.002392922 1.208199 4
54532 9213.087 0.017095000 13.055 0.002254277 1.133948 4
54532 9228.088 0.016425000 13.289 0.002121941 1.063943 4
54532 9246.090 0.015669000 13.559 0.002002814 0.999779 4
54532 9262.088 0.015036000 13.789 0.001886778 0.938751 4
54532 9278.088 0.014437000 14.008 0.001777582 0.881520 4
54532 9293.087 0.013905000 14.207 0.001673378 0.827590 4
54532 9310.088 0.013337000 14.422 0.001578123 0.777857 4
54532 9326.087 0.012832000 14.617 0.001486967 0.730888 4
54532 9344.088 0.012297000 14.832 0.001403735 0.687889 4
54532 9359.088 0.011876000 15.000 0.001321477 0.646196 4
54532 9377.088 0.011401000 15.195 0.001247494 0.608393 4
54532 9392.087 0.011025000 15.352 0.001174473 0.571774 4
54532 9409.087 0.010623000 15.527 0.001107767 0.538445 4
54532 9427.088 0.010225000 15.703 0.001045725 0.507488 4
54532 9442.087 0.009909000 15.844 0.000984548 0.477309 4

    <snip>

54532 49946.087 0.000006000 33.555 0.000001419 0.001337 4
54532 49963.088 0.000006000 33.555 0.000001369 0.001251 4
54532 49979.087 0.000008000 33.555 0.000001526 0.001170 4
54532 49996.087 0.000008000 33.555 0.000001467 0.001095 4
54532 50012.087 0.000010000 33.555 0.000001538 0.001024 4
54532 50030.088 0.000012000 33.555 0.000001646 0.000958 4
54532 50045.088 0.000013000 33.555 0.000001576 0.000896 4
54532 50063.088 0.000015000 33.555 0.000001635 0.000838 4
54532 50079.087 0.000016000 33.555 0.000001570 0.000784 4
54532 50095.087 0.000017000 33.555 0.000001554 0.000733 4
54532 50110.088 0.000019000 33.555 0.000001618 0.000686 4
54532 50126.087 0.000022000 33.555 0.000001775 0.000642 4
/*****************************************/


Now here are some ntpq and stats.loop values for the exact same
hardware/software configuration as above, except with stepping disabled
via 'tinker step 0'.  There was no reboot between these runs, only the
tinker step was added back into ntp.conf and the drift file was
deleted.  This test was allowed to run for just over two hours and the
drift value was still increasing, but experience with this setup
indicates the drift was not going to come back down if the test were
allowed to run longer.  The stats.loop output shows the beginning of the
file, shows where offset reaches it's maximum, and then shows the end of
the file.  As you can see, the offset max is 0.093575450 seconds, so no
time step is required nor is one taken.  Yet, the drift runs out of control.


/*****************************************/
ntpq for working configuration, stepping enabled
/*****************************************/
sbc1 root 27->ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset 
jitter
==============================================================================
*GPS_BANC(0)     .BTFP.           0 l    3   16  377    0.000   34.313  
0.323
ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 52112  9614   yes   yes  none  sys.peer   reachable  1
ntpq> rv &1
assID=52112 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=GPS_BANC(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
stratum=0, precision=-21, rootdelay=0.000, rootdispersion=0.000,
refid=BTFP, reach=377, unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=10,
flash=00 ok, keyid=0, ttl=64, offset=34.313, delay=0.000,
dispersion=0.017, jitter=0.323,
reftime=cb7b0304.fe74557e  Thu, Mar  6 2008 18:55:48.993,
org=cb7b0304.fe74557e  Thu, Mar  6 2008 18:55:48.993,
rec=cb7b0304.fe74aa8a  Thu, Mar  6 2008 18:55:48.993,
xmt=cb7b0304.fe74063f  Thu, Mar  6 2008 18:55:48.993,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   34.31   34.28   34.25   34.21   34.18   34.15   33.75   33.72,
filtdisp=      0.00    0.02    0.03    0.05    0.06    0.08    0.26    0.27
ntpq> rv
assID=0 status=0444 leap_none, sync_uhf_clock, 4 events,
event_peer/strat_chg,
version="ntpd 4.2.4p0 at 1.1472 Tue Jan  8 16:23:44 UTC 2008 (1)",
processor="i686", system="Linux/2.6.18.8-RedHawk-4.2-trace", leap=00,
stratum=1, precision=-20, rootdelay=0.000, rootdispersion=34.818,
peer=52112, refid=BTFP,
reftime=cb7b0304.fe74aa8a  Thu, Mar  6 2008 18:55:48.993, poll=4,
clock=cb7b0310.9f487fd7  Thu, Mar  6 2008 18:56:00.622, state=4,
offset=34.313, frequency=368.414, jitter=0.323, noise=0.952,
stability=0.526
ntpq>
ntpq> quit
sbc1 root 28->date
Thu Mar  6 18:56:35 EST 2008

sbc1 root 29->cat /etc/ntp_debug.conf
# Debug stuff
statistics clockstats peerstats loopstats
statsdir /var/lib/ntp/log/
filegen clockstats file stats.clock type pid link enable
filegen peerstats file stats.peer type pid link enable
filegen loopstats file stats.loop type pid link enable

restrict default nomodify notrap noquery
restrict 127.0.0.1

driftfile /var/lib/ntp/drift

tinker step 0

server  127.127.16.0 prefer mode 2 minpoll 4 burst # Symmetricom BC635
tos orphan 6
sbc1 root 30->


/*****************************************/
stats.loop for working configuration, stepping enabled
/*****************************************/
54531 78843.994 0.000851046 0.000 0.000300891 0.000000 6
54531 79747.994 0.030705963 33.578 0.026475415 11.871458 6
54531 79765.994 0.031304524 33.611 0.024766387 11.104738 6
54531 79780.994 0.031796892 33.640 0.023167488 10.387536 6
54531 79797.994 0.032358327 33.672 0.021672109 9.716657 6
54531 79815.994 0.032955986 33.708 0.020273503 9.089109 7
54531 79833.994 0.033550503 33.717 0.018965291 8.502084 7
54531 79849.994 0.034074021 33.725 0.017741370 7.952972 7
54531 79865.994 0.034603163 33.733 0.016596587 7.439324 7
54531 79882.994 0.035167590 33.742 0.015525968 6.958851 7
54531 79898.994 0.035694285 33.751 0.014524407 6.509410 8
54531 79916.994 0.036289215 33.753 0.013587967 6.088996 8
54531 79932.994 0.036820295 33.755 0.012711766 5.695734 8
54531 79950.994 0.037417341 33.758 0.011892642 5.327871 8
54531 79965.994 0.037910326 33.760 0.011125913 4.983767 9
54531 79981.994 0.038438966 33.760 0.010409017 4.661888 9
54531 79998.994 0.039003336 33.761 0.009738788 4.360796 9
54531 80013.994 0.039500613 33.762 0.009111498 4.079152 9
54531 80030.994 0.040060894 33.762 0.008525328 3.815697 8
54531 80046.994 0.040590507 33.765 0.007976912 3.569258 8
54531 80063.994 0.041151342 33.767 0.007464352 3.338735 7
54531 80079.994 0.041677784 33.777 0.006984742 3.123103 7
54531 80096.994 0.042239126 33.788 0.006536642 2.921397 7
54531 80113.994 0.042804284 33.799 0.006117732 2.732720 6
54531 80131.994 0.043397711 33.845 0.005726459 2.556278 6
54531 80149.994 0.043991734 33.892 0.005360728 2.391238 6
54531 80166.994 0.044557325 33.938 0.005018487 2.236855 5
54531 80183.994 0.045117876 34.120 0.004698547 2.093385 5
54531 80201.994 0.045714207 34.317 0.004400142 1.959410 5
54531 80216.994 0.046208335 34.482 0.004119662 1.833791 5
54531 80233.994 0.046769870 34.671 0.003858701 1.716664 4
54531 80250.994 0.047327945 35.394 0.003614874 1.625964 4
54531 80268.994 0.047925094 36.125 0.003387989 1.542768 4
54531 80285.994 0.048483893 36.865 0.003175326 1.466640 4
54531 80300.994 0.048980676 37.565 0.002975434 1.394102 4
54531 80317.994 0.049539934 38.321 0.002790278 1.331168 4
54531 80333.994 0.050070750 39.085 0.002616804 1.274155 4
54531 80350.994 0.050633071 39.858 0.002455857 1.222764 4
54531 80368.994 0.051225187 40.640 0.002306763 1.176702 4

    <snip>

54531 81596.994 0.091688138 120.566 0.000560630 1.336847 4
54531 81614.994 0.092277665 121.974 0.000564323 1.345953 4
54531 81631.994 0.092832915 123.390 0.000563197 1.354974 4
54531 81647.994 0.093355741 124.815 0.000558310 1.363858 4
54531 81665.994 0.093944241 126.248 0.000562173 1.372754 4
54531 81680.994 0.094438060 127.599 0.000554090 1.370047 4
54531 81698.994 0.095025116 129.049 0.000558317 1.380290 4
54531 81713.994 0.095517171 130.416 0.000550471 1.378559 4
54531 81729.994 0.095041514 131.866 0.000541684 1.387719 4
54531 81745.994 0.094563218 133.309 0.000534172 1.394739 4
54531 81760.994 0.094058731 134.654 0.000530552 1.388682 4
54531 81775.994 0.093548859 135.992 0.000528012 1.382476 4
54531 81791.994 0.093575450 137.420 0.000493999 1.388228 4
54531 81808.994 0.093133732 138.841 0.000487771 1.392381 4
54531 81825.994 0.092695262 140.256 0.000481884 1.395154 4
54531 81843.994 0.092282190 141.664 0.000473829 1.396781 4
54531 81860.994 0.091845491 143.065 0.000469349 1.397366 4
54531 81876.994 0.091866474 144.467 0.000439098 1.397917 4
54531 81893.994 0.090927669 145.855 0.000528087 1.396613 4
54531 81909.994 0.090950463 147.242 0.000494046 1.395513 4
54531 81926.994 0.090506990 148.623 0.000488011 1.393711 4

    <snip>

54531 86165.994 0.032871409 368.915 0.001025820 0.522859 4
54531 86182.994 0.033430722 369.425 0.000979730 0.521283 4
54531 86198.994 0.033955849 369.943 0.000935071 0.520889 4
54531 86215.994 0.032518157 370.440 0.001011648 0.517866 4
54531 86232.994 0.033075338 370.944 0.000966597 0.516237 4
54531 86248.994 0.033601583 371.457 0.000923113 0.515799 4
54531 86264.994 0.032128572 371.947 0.001008385 0.512674 4
54531 86282.994 0.032720790 372.447 0.000966217 0.511019 4
54531 86298.994 0.032747589 372.946 0.000903863 0.509617 4
54531 86314.994 0.032778294 373.446 0.000845556 0.508444 4
54531 86332.994 0.031872478 373.933 0.000853321 0.505733 4
54531 86349.994 0.032433287 374.428 0.000822466 0.504391 4
/*****************************************/


So, the summary is that drift goes to 500ppm when stepping is disabled
but runs normally when stepping is enabled and both situations never
require a time step.  This makes no sense to me.  By the way, as
mentioned previously, we require that time does not step backward due to
a problem in some commercial software that cannot currently tolerate
time moving backwards.

Quite frankly, I don't think it's unreasonable that a system require
time to monotonically increase.  Clearly this isn't the first system
that requires such behavior (i.e. time step disable was not added for
me).  I understand it takes 14 days to recover from an offset of 600
seconds, but I also understand that if we have an offset of more than
10ms in this system, then something isn't working correctly.  I'm going
to be bold and say that we simply will _never_ have an offset of 600
seconds in this system.  If we do, they will have a recovery procedure
that involves rebooting the system, which will force a quick sync during
startup.  If they continue to have a problem, it will be fixed, most
likely by swapping hardware until the problem is fixed or flying someone
in to work on the system.

To summarize, we really need to disable time stepping to keep time from
moving backwards.  Maybe the commercial software will be fixed before
this problem is solved, but I don't want to rely on that and, even then,
monotonically increasing time may remain a requirement.

Andy





More information about the questions mailing list