[ntp:hackers] Weird initial frequency calculations.

David L. Mills mills at udel.edu
Tue Aug 26 21:44:52 UTC 2008


Brian,

I assume you left enough time between running ntpdate for adjtime() to 
amortize the residual offset.

There were two problems, now fixedin ntp-dev. First, in fixing a 
reported problem where the leapsecond values were not signed when first 
synchronized, I created another problem. When the OpenSSL library is 
available, but a crypto command is not present, ntpd cheerfully 
segfaulted when the clock was first set.

The other problem was a simple bug that, under suitably bizarre 
circumstances, the frequency was reported correctly to the protostats, 
but not actually set. I repeared your experiment both with and without 
kernel support and it behaved exactly as intended.

By the way, you are very much working the hard way. All you need is to 
verify from the protostats file that the expected events took place at 
the expected times and then look at those times in the loopstats file.

I did not observe a local clock update between when first set and when 
the frequency was set. A number of clock_filter events was recorded, but 
none of these caused a clock update.

Dave

Brian Utterback wrote:

> As requested, I repeated the experiments, this time alternating 
> between the kernel and daemon loops. I ran this repeatedly for 24 
> hours, setting the frequency in the kernel manually to 0 and using 
> ntpdate to set the offset to nearly zero before starting ntpd. Also 
> before starting ntpd, I removed the drift file. I then ran ntpd for 30 
> minutes and gathered the all the relevant stats files. This is all on 
> a system known to have about a -26 ppm drift.
>
> Here is what I found:
>
> Looking at the loopstats file from the kernel loop, the first line of 
> all the files looks like this:
>
> 54699 8056.790 0.009085000 1.707 0.003211998 0.603457 6
>
> This is apparently the first offset calculation. The frequency 
> recorded on this first line is in the range of -0.963 to 2.902. The 
> second line always has the offset equal to zero. This second line 
> occurs between 0 and 1388 seconds after the first line. There is a 
> strong correlation between the length of time between the first and 
> second lines and the accuracy of the frequency set in the second line. 
>  Depending on the time between the first and second entries, there 
> were between 0 and 6 additional entries in the loopstats file. The 
> last entry in the loopstats files had frequencies in the range of -1.8 
> to -37.8. So, the inital frequency calculation is inconsistent at best.
>
> Now, looking at the daemon loop, I see entirely different results:
>
> As before, the entry appears to be the inital offset adjustment, but 
> unlike the kernel loop, the frequency on this line is always zero. As, 
> as before, the second line appears to be caused by the frequency 
> calculation, with the offset always zero and occurring 0 to 1305 
> seconds after the first line. However, the frequency is always zero as 
> well. It appears that in the daemon loop, the inital frequency 
> calculation is consistently failing.  As a result, the final entry in 
> the these loopstats is in the range -0.085 to -0.399.
>
> David L. Mills wrote:
>
>> Brian,
>>
>> I gather from your note that you stopped ntpd once each hour and:
>> 1. Removed the frequency file.
>> 2. Started the daemon.
>> 3. Recorded the frequency between 3 and 26 minutes afger that.
>>
>> Your printout apparently was in the direct_freq() routine in 
>> ntp_loopfilter.
>> You should note from the code that the direct_freq() routine is 
>> called from two places only and that only when actually calculating 
>> the frequency after the inital measurement interval.
>>
>> So, how is it you show a a call when less than 15 minutes remain 
>> before the only call?
>>
>> I'm not getting any traction here. Take it step by step. Record 
>> loopstats with the kernel disabled and then another run with it 
>> enabled. COmpare the frequency and then go from there. Remember, a 
>> loopstats entry is produced exactly and only when the clock offset 
>> and freqeuency are updated.
>>
>> Dave
>>
>> Brian Utterback wrote:
>>
>>> I explained this before. The frequency numbers I posted in the first 
>>> message in this thread did not represent a timeline. It represented 
>>> the values that ntpd set the frequency to in the calc_freq routine 
>>> in ntp_loopfilter.c So, the frequency was not "moving toward a 
>>> correct negative value and suddenly show a single positive value". 
>>> This was a table of frequency settings that ntpd made each one 
>>> between 3 minutes and 26 minutes after ntpd was restarted. I sorted 
>>> the table by time so that the correlation between the time and 
>>> accuracy was more evident. This was apparently a mistake, since it 
>>> has led to quite a lot of confusion. Each line of the table is an 
>>> event that happened one or more hours before or after all of the 
>>> other lines.
>>>
>>> Here is the same table, without sorting and leaving the timestamps 
>>> intact:
>>>
>>> Jul 29 22:18:42 freq=-22073
>>> Jul 29 23:03:23 freq=615
>>> Jul 30 00:19:46 freq=-25569
>>> Jul 30 01:05:36 freq=1578
>>> Jul 30 02:18:48 freq=-24193
>>> Jul 30 03:03:28 freq=-751
>>> Jul 30 04:05:44 freq=-7642
>>> Jul 30 05:03:32 freq=-4135
>>> Jul 30 06:03:31 freq=-8245
>>> Jul 30 07:21:02 freq=-26722
>>> Jul 30 08:19:52 freq=-23362
>>> Jul 30 09:20:52 freq=-22816
>>> Jul 30 10:23:17 freq=-27761
>>> Jul 30 11:18:45 freq=-26337
>>> Jul 30 12:05:45 freq=1642
>>> Jul 30 13:03:37 freq=-3478
>>> Jul 30 14:03:40 freq=-6324
>>> Jul 30 15:26:34 freq=-27200
>>> Jul 30 16:03:43 freq=2062
>>> Jul 30 17:03:42 freq=-3551
>>> Jul 30 18:03:43 freq=-4969
>>> Jul 30 19:03:41 freq=-5407
>>> Jul 30 20:03:48 freq=-3894
>>> Jul 30 21:03:45 freq=-7290
>>> Jul 30 22:03:47 freq=-4073
>>> Jul 30 23:04:49 freq=-6159
>>> Jul 31 00:07:06 freq=-4963
>>> Jul 31 01:04:53 freq=-2309
>>> Jul 31 02:03:49 freq=-6067
>>> Jul 31 03:03:52 freq=-3325
>>> Jul 31 04:19:11 freq=-22233
>>> Jul 31 05:03:53 freq=-1238
>>> Jul 31 06:20:15 freq=-26172
>>> Jul 31 07:21:23 freq=-19315
>>> Jul 31 08:03:59 freq=-838
>>> Jul 31 09:03:56 freq=-1556
>>> Jul 31 10:03:56 freq=-2672
>>> Jul 31 11:04:00 freq=-2594
>>>
>>>
>>> In each case, ntpd was restarted on the hour, so the minutes and 
>>> seconds are the actual time as well as the time since ntpd was 
>>> restarted.
>>>
>>> I hope that clears up the confusion.
>>>
>>> David L. Mills wrote:
>>>
>>>> Brian,
>>>>
>>>> In the thread below I cite three samples from you data where the 
>>>> reported frequency was moving toward a correct negative value and 
>>>> suddenl showed a single positive value. There was no external 
>>>> signal from ntpd to cause that and no apparent other cause why it 
>>>> would do that. You have reported anomalistic frequency behavior 
>>>> before, so far as I know yet unexplained. One experiment you did 
>>>> was completely at odds with the expected behavior of the 
>>>> microkernel that cannot be duplicated with the machines available 
>>>> here. You have mentioned spread spectrum clocking, so far as I know 
>>>> yet unresolved. Your data show anomalistic behavior that cannot be 
>>>> duplicated here.
>>>>
>>>> The anomalistic behavior evident in your data needs to be explained 
>>>> before looking into possibly related other effects. As I have said 
>>>> before, the most likely explanation is overflow somwhere in the 
>>>> frequency calculation. So far as I know, this does not happen in 
>>>> other kernels or in the Solaris kernel with the machines available 
>>>> here.
>>>>
>>>> It would help and save a lot of sweat here if Solaris would upgrade 
>>>> to the nanokernel so at least all machines would be on common 
>>>> ground. By the way, all Solaris machines here are running with 
>>>> 64-bit kernele.
>>>>
>>>> Dave
>>>>
>>>> Brian Utterback wrote:
>>>>
>>>>> Okay, but as I said before, I do not understand what you mean by 
>>>>> anomalistic frequency bumps.
>>>>>
>>>>> David L. Mills wrote:
>>>>>
>>>>>> Brian,
>>>>>>
>>>>>> Stop. I'm not going to consider anything further until an 
>>>>>> explanation of the anomalistic frequency bumps I cited is 
>>>>>> offered. These did not involve the daemon and apparently happened 
>>>>>> all by themselves. Until the cause of these glitches has been 
>>>>>> determined, I'm not going to worry about other behaviors that 
>>>>>> might well be dependent on the cause. Again, the behavior you 
>>>>>> report is not apparent on two Solaris 10 machines here, one an 
>>>>>> Ultra 60 and another a Blade 1500.
>>>>>>
>>>>>> As I said previously, it is necessary to isolate the 
>>>>>> ntp_adjtime() and adjtime() behavior to see if the serious 
>>>>>> frequency wobble is due the kernel discipline.
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>> Brian Utterback wrote:
>>>>>>
>>>>>>> The frequency is not jumping around on its own. As I said, the 
>>>>>>> frequency is set to zero via ntptime before ntpd is started. 
>>>>>>> Since there is no drift file, ntpd also sets the frequency to 0 
>>>>>>> at start up.   Then, 195 seconds into the run, ntp_adjtime is 
>>>>>>> called with an offset to correct. This call results in the 
>>>>>>> frequency being recalculated in the kernel.
>>>>>>>
>>>>>>> What I see is basically 4 sets of calls to ntp_adjtime at 
>>>>>>> varying points of time after startup. The timeline goes like this:
>>>>>>>
>>>>>>> At time 0, ntp_adjtime is called twice, once to set the offset=0 
>>>>>>> and
>>>>>>> once to set the frequency=0.
>>>>>>>
>>>>>>> About 195 seconds later, ntp_adjtime is called to set an offset. 
>>>>>>> This results in a recalculation of the frequency by the kernel.
>>>>>>>
>>>>>>> Sometime later, varying from 2 seconds to 1000 seconds, 
>>>>>>> ntp_adjtime is called to set the frequency to some value. 
>>>>>>> Immediately after that, the offset is set to 0.
>>>>>>>
>>>>>>> Sometime later, varying from 2 seconds more to 1000 seconds, 
>>>>>>> ntp_adjtime is called with a new offset, which again 
>>>>>>> recalculates the frequency.
>>>>>>>
>>>>>>> I have no problem with the sequence of events here, it is what 
>>>>>>> would be expected. The problem is that when the last two 
>>>>>>> intervals are very small, the frequencies set and then 
>>>>>>> recalculated are wildly inaccurate. When those two intervals are 
>>>>>>> long, they are much closer to the correct values.
>>>>>>>
>>>>>>> So, the two questions I have are these:
>>>>>>>
>>>>>>> 1. Does ntpd take into account the frequency set in the kernel 
>>>>>>> after the first real offset is set when it makes its subsequent 
>>>>>>> frequency calculation?
>>>>>>>
>>>>>>> 2. Why do the last two intervals vary so much and what can we do 
>>>>>>> about it?
>>>>>>>    David L. Mills wrote:
>>>>>>>
>>>>>>>> Brian,
>>>>>>>>
>>>>>>>> The fact that the frequency jumps around when the daemon is not 
>>>>>>>> disciplining the kernel is a red flag. If the kernel is 
>>>>>>>> disabled and the daemon loop with adjtime() works okay, there 
>>>>>>>> is big trouble in the microkernel code. If the serious wobble 
>>>>>>>> continues, there is something evil in the hardware.
>>>>>>>>
>>>>>>>> What is so alarming is that the kernel frequency is compouted 
>>>>>>>> as an integral of the offsets, which should be slowly 
>>>>>>>> decreasing to zero. This is purely a mathematical operation and 
>>>>>>>> should not be affected by hardware glitch. Unless, that is, 
>>>>>>>> there is a discontinuity due to integer overflow or something 
>>>>>>>> like that. I've never seen this happen on any Sun or Alpha 
>>>>>>>> since 1992. Remember, we still haven't resolved the glitch you 
>>>>>>>> reported some time back, and that did appear to be instablity 
>>>>>>>> in the feedback loop due to spread spectrum clocking.
>>>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> Brian Utterback wrote:
>>>>>>>>
>>>>>>>>> Sorry, you are right, it is ppb. I have been looking at them 
>>>>>>>>> so much I was just automatically doing the conversion in my head.
>>>>>>>>>
>>>>>>>>> I need a little more instruction. What do you want me to do? I 
>>>>>>>>> got the part about setting disable ntp. Then what?
>>>>>>>>>
>>>>>>>>> David L. Mills wrote:
>>>>>>>>>
>>>>>>>>>> Brian,
>>>>>>>>>>
>>>>>>>>>> Those frequency numbers are 1000 times your previously 
>>>>>>>>>> measured value. If you are using the kernel frequency 
>>>>>>>>>> varuable directly, its units are PPB.
>>>>>>>>>>
>>>>>>>>>> If you removed the frequency file, the bumps I cited could 
>>>>>>>>>> not be due to anything but the kernel discipline, since the 
>>>>>>>>>> daemon does not again discipline the clock until 15 minutes 
>>>>>>>>>> into the run. The kernel isn't doing anything except an 
>>>>>>>>>> exponential decay of the initial uupdate and its integral for 
>>>>>>>>>> the frequency, so the frequency should be very even with not 
>>>>>>>>>> even small bumps.
>>>>>>>>>>
>>>>>>>>>> This makes it even more important to verify the kernel 
>>>>>>>>>> discipline is working correctly. You should fire up the 
>>>>>>>>>> machine with disable ntp command in the configuration file 
>>>>>>>>>> and see if the adjtime() syscall results in the same behavior 
>>>>>>>>>> you show.
>>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>> Brian Utterback wrote:
>>>>>>>>>>
>>>>>>>>>>  
>>>>>>>>>>
>>>>>>>>>>> The units of the frequency shown are ppm. I forgot to 
>>>>>>>>>>> mention that the ntp.drift file is removed after stopping 
>>>>>>>>>>> the previous ntpd and before starting the next. In each 
>>>>>>>>>>> case, the ntpd program was allowed to run for one hour 
>>>>>>>>>>> before starting the next cycle, which was started 
>>>>>>>>>>> immediately, so the offset present at the start was whatever 
>>>>>>>>>>> it was when ntpd exited. Looking at the offsets at the 
>>>>>>>>>>> beginning of each run and the end of each previous run, the 
>>>>>>>>>>> offset was generally in the 3-10ms range, but was as high as 
>>>>>>>>>>> 45ms.
>>>>>>>>>>>
>>>>>>>>>>> I am not sure what you are saying about the spikes, unless 
>>>>>>>>>>> you are misunderstanding the data. This table represents 
>>>>>>>>>>> hourly runs done over a period of almost 2 days. The output 
>>>>>>>>>>> was then sorted to provide a visual correlation between the 
>>>>>>>>>>> length of time spent before setting the frequency and the 
>>>>>>>>>>> accuracy thereof. The table does not represent a timeline at 
>>>>>>>>>>> all. If you did not misunderstand and still think that there 
>>>>>>>>>>> is an issue with spikes, then please elaborate, because I 
>>>>>>>>>>> don't understand what you are saying.
>>>>>>>>>>>
>>>>>>>>>>> David L. Mills wrote:
>>>>>>>>>>>
>>>>>>>>>>>  
>>>>>>>>>>>
>>>>>>>>>>>> Brian,
>>>>>>>>>>>>
>>>>>>>>>>>> You don't say what the units are nor what the offset 
>>>>>>>>>>>> presented was. You apparently used a special program. You 
>>>>>>>>>>>> don't need to do that, as the residual offset and frequecy 
>>>>>>>>>>>> are copied to the daemon for loopstats. I assume you set 
>>>>>>>>>>>> the ntp.drift value to zero, in which case the kernel will 
>>>>>>>>>>>> be initialized to zero as well. I'm running Solaris 10 on 
>>>>>>>>>>>> several machines, but not the one of you cite.
>>>>>>>>>>>>
>>>>>>>>>>>> See minutes 3:43, 536 and 5:45. The loop can't do that 
>>>>>>>>>>>> unless hit with a large spike. If the units are in PPB, the 
>>>>>>>>>>>> convergence makes sense, but way too fast. It should take 
>>>>>>>>>>>> many hours to get there.
>>>>>>>>>>>>
>>>>>>>>>>>> I suggest you do the experiment again, but disable the 
>>>>>>>>>>>> kernel. Watch the loopstats; the kernel should do the same 
>>>>>>>>>>>> thing. Just for whims, verify the kernel time constant is 2 
>>>>>>>>>>>> and have maxpoll 6.
>>>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>> Brian Utterback wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>  
>>>>>>>>>>>>
>>>>>>>>>>>>> While further investigating the initial frequency 
>>>>>>>>>>>>> calculations, I got some weird results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I did was take a system that had a known drift of 
>>>>>>>>>>>>> approximately -28 ppm. I then made a script that ran the 
>>>>>>>>>>>>> p120 ntpd, after first setting the system frequency and 
>>>>>>>>>>>>> offset to 0 using ntptime. I then traced the calls to 
>>>>>>>>>>>>> ntp_adjtime, looking for the call that set the kernel 
>>>>>>>>>>>>> frequency after the initial frequency calculation. I used 
>>>>>>>>>>>>> the same four servers each time, all reporting time to 
>>>>>>>>>>>>> within less than a millisecond or so of each other. I did 
>>>>>>>>>>>>> not configure iburst.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I found was that ntpd set the frequency in 3 to 7 
>>>>>>>>>>>>> minutes, or 18 to 26 minutes. The frequency set in the 3 
>>>>>>>>>>>>> to 7 minute cases were terrible approximations while those 
>>>>>>>>>>>>> in the 18 to 26 mintue range were pretty good, or at least 
>>>>>>>>>>>>> in the ballpark.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, I don't why there is a difference, but something weird 
>>>>>>>>>>>>> is going on in how ntpd is doing this initial frequency 
>>>>>>>>>>>>> calculation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the data I have collected. The first column is 
>>>>>>>>>>>>> the time since ntpd started up, and the second column is 
>>>>>>>>>>>>> the frequency the kernel was being set to:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 03:23  freq=615
>>>>>>>>>>>>> 03:28  freq=-751
>>>>>>>>>>>>> 03:31  freq=-8245
>>>>>>>>>>>>> 03:32  freq=-4135
>>>>>>>>>>>>> 03:37  freq=-3478
>>>>>>>>>>>>> 03:40  freq=-6324
>>>>>>>>>>>>> 03:41  freq=-5407
>>>>>>>>>>>>> 03:42  freq=-3551
>>>>>>>>>>>>> 03:43  freq=-4969
>>>>>>>>>>>>> 03:43  freq=2062
>>>>>>>>>>>>> 03:45  freq=-7290
>>>>>>>>>>>>> 03:47  freq=-4073
>>>>>>>>>>>>> 03:48  freq=-3894
>>>>>>>>>>>>> 03:49  freq=-6067
>>>>>>>>>>>>> 03:52  freq=-3325
>>>>>>>>>>>>> 03:53  freq=-1238
>>>>>>>>>>>>> 03:56  freq=-1556
>>>>>>>>>>>>> 03:56  freq=-2672
>>>>>>>>>>>>> 03:59  freq=-838
>>>>>>>>>>>>> 04:00  freq=-2594
>>>>>>>>>>>>> 04:49  freq=-6159
>>>>>>>>>>>>> 04:53  freq=-2309
>>>>>>>>>>>>> 05:36  freq=1578
>>>>>>>>>>>>> 05:44  freq=-7642
>>>>>>>>>>>>> 05:45  freq=1642
>>>>>>>>>>>>> 07:06  freq=-4963
>>>>>>>>>>>>> 18:42  freq=-22073
>>>>>>>>>>>>> 18:45  freq=-26337
>>>>>>>>>>>>> 18:48  freq=-24193
>>>>>>>>>>>>> 19:11  freq=-22233
>>>>>>>>>>>>> 19:46  freq=-25569
>>>>>>>>>>>>> 19:52  freq=-23362
>>>>>>>>>>>>> 20:15  freq=-26172
>>>>>>>>>>>>> 20:52  freq=-22816
>>>>>>>>>>>>> 21:02  freq=-26722
>>>>>>>>>>>>> 21:23  freq=-19315
>>>>>>>>>>>>> 23:17  freq=-27761
>>>>>>>>>>>>> 26:34  freq=-27200
>>>>>>>>>>>>>
>>>>>>>>>>>>>         
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> hackers mailing list
>>>>>>>>>>>> hackers at lists.ntp.org
>>>>>>>>>>>> https://lists.ntp.org/mailman/listinfo/hackers
>>>>>>>>>>>>       
>>>>>>>>>>>
>>>>>>>>>>>     
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> hackers mailing list
>>>>>>>>>> hackers at lists.ntp.org
>>>>>>>>>> https://lists.ntp.org/mailman/listinfo/hackers
>>>>>>>>>>   
>>>>>>>>>
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> hackers mailing list
>>>>>>>> hackers at lists.ntp.org
>>>>>>>> https://lists.ntp.org/mailman/listinfo/hackers
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> hackers mailing list
>>>>>> hackers at lists.ntp.org
>>>>>> https://lists.ntp.org/mailman/listinfo/hackers
>>>>>
>>>>>
>>>> _______________________________________________
>>>> hackers mailing list
>>>> hackers at lists.ntp.org
>>>> https://lists.ntp.org/mailman/listinfo/hackers
>>>
>>>
>>
>> _______________________________________________
>> hackers mailing list
>> hackers at lists.ntp.org
>> https://lists.ntp.org/mailman/listinfo/hackers
>
>



More information about the hackers mailing list