[ntp:hackers] ntp p110 and setting frequency and offset still off.

Brian Utterback Brian.Utterback at Sun.COM
Sun Feb 3 15:57:46 UTC 2008


There are two ways forward. One is to port the nano-kernel. This is on
the roadmap, but it will take some time and effort and a lot of testing.

The other is to try to find the problem in the current kernel. Since the
xntpd code doesn't seem to have the same problem, it seems to me that
some assumption somewhere is out of whack.

I need help on both approaches. I have the nano-kernel distribution, but
I am not clear on what I have to do and how to test it with the included
simulator. The requirements are that the kernel can be 32-bit or 64-bit,
and may have either a 10ms or 1ms tick. And of course it is all
multi-processor. Can I just compile everything in the directory as
64-bit? Is the stuff that will ultimately go into the kernel contained
in the ktime.c file?

I can trace through the calculations in the current micro-kernel. I just
need to have a methodology. I can set up any initial values in the
kernel and then do a test with ntptime. I just need more specifics about
what numbers it has and what they should be.
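
To make the test concrete, here is a minimal sketch of how I would reduce
a series of observed offsets to the two numbers Dave uses as the
criterion in the quoted mail below: the time at which the offset first
crosses zero, and the overshoot as a percentage of the initial offset.
The function name measure_response and the sample data are made up for
illustration; the samples are not measurements from any real system.

```python
def measure_response(times, offsets):
    """Return (zero_crossing_time, overshoot_percent) for a step response.

    times   -- sample times in seconds, in increasing order
    offsets -- clock offsets at those times (any unit, used consistently)
    The first sample is taken as the initial offset.
    """
    initial = offsets[0]
    if initial == 0:
        raise ValueError("initial offset must be nonzero")
    crossing = None
    peak = 0.0  # largest excursion past zero, as a fraction of the initial offset
    for i in range(1, len(offsets)):
        # Normalize so the initial offset is +1 regardless of its sign.
        prev = offsets[i - 1] / initial
        cur = offsets[i] / initial
        if crossing is None and prev > 0 and cur <= 0:
            # Linear interpolation between the two samples that bracket zero.
            frac = prev / (prev - cur)
            crossing = times[i - 1] + frac * (times[i] - times[i - 1])
        if crossing is not None:
            peak = max(peak, -cur)
    return crossing, 100.0 * peak


# Made-up samples: a 90 ms initial offset observed every 100 s.
sample_times = [0, 100, 200, 300, 400]
sample_offsets_ms = [90.0, 45.0, -9.0, -4.5, 0.0]
crossing, overshoot = measure_response(sample_times, sample_offsets_ms)
print("zero crossing at about %.0f s, overshoot about %.1f percent"
      % (crossing, overshoot))
```

If the offset never crosses zero in the recorded interval, the function
returns None for the crossing time and 0.0 for the overshoot.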

As I said before, it is relatively easy to build OpenSolaris kernels
these days. If you have a system available, I can provide test kernels,
or I can help set up a build environment there.

I wish I could do this without help, but I lack the expertise. And I
really need to fix this if we want to get NTP up to date in Solaris.

David L. Mills wrote:
> Brian,
>
> I have really bad news. I cranked up ntpd on Solaris with initial 
> transient as described and found completely unacceptable behavior. 
> Initially, it looked like the Solaris kernel did not correctly scale the 
> time constant, so I changed that to the current scaling factor. That was 
> a disaster. Then, in an attempt to fit the time constant to the expected 
> behavior, the kernel became completely unstable. The best advice I can 
> give is to immediately discontinue using the kernel and for the current 
> NTPv4 build to disable it by default.
>
> It seems that when the incidental offsets are small and no large 
> disruptions are present the kernel behaves okay, as evidenced by 
> pogo.udel.edu, an Ultra 5. However, deacon.udel.edu, a Blade 1500, is 
> patently unstable when sent a large spike.
>
> I looked at the source you linked, but it is so obscure I can't make out 
> what it does. Certainly it has nothing to do with either the microkernel 
> or nanokernel sources, but that is not a bad thing. You did not reveal a 
> link to the seconds overflow code that computes the frequency offset for 
> the next second, but Solaris might not have used that. And, that is not 
> a bad thing either.
>
> There are two aspects of a type-II feedback control system: the phase 
> gain and the frequency gain. These two are intimately entwined and must 
> be in a certain ratio as described in RFC 1305 and das Buch. Clearly, 
> the Solaris kernel does not comply, at least in the current Blade 1500 
> and likely in previous kernels. All I can say is abandon the kernel and 
> use only the daemon loop.
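
To illustrate why the ratio of the two gains matters, here is a minimal
sketch of a generic textbook type-II loop, stepped once per second. It
is not the Solaris, microkernel, or nanokernel code, and the gain values
are illustrative assumptions, not the constants from RFC 1305. With
phase gain kp and frequency gain ki, the continuous-time equivalent has
damping factor kp / (2 * sqrt(ki)), so changing one gain without the
other changes the overshoot.

```python
import math


def damping_factor(kp, ki):
    """Damping factor of the continuous-time equivalent of the loop."""
    return kp / (2.0 * math.sqrt(ki))


def simulate(kp, ki, initial_offset, seconds):
    """Step a generic type-II loop once per second; return the offset history."""
    offset = initial_offset
    freq = 0.0  # accumulated frequency correction, in offset units per second
    history = [offset]
    for _ in range(seconds):
        freq += ki * offset            # frequency gain integrates the offset
        offset -= kp * offset + freq   # phase correction plus frequency correction
        history.append(offset)
    return history


def overshoot_percent(history):
    """Overshoot past zero as a percentage of a positive initial offset."""
    return max(0.0, -min(history)) / history[0] * 100.0


# Illustrative gains only: the same phase gain with two frequency gains.
for kp, ki in [(1.0 / 256, 1.0 / 2**20), (1.0 / 256, 1.0 / 2**14)]:
    hist = simulate(kp, ki, 90.0, 20000)
    print("damping %.2f: overshoot %.1f percent"
          % (damping_factor(kp, ki), overshoot_percent(hist)))
```

In this model the first pair of gains gives a small overshoot and the
second, with the frequency gain too large for the phase gain, gives a
much larger one. Whether something like this is what is wrong in the
Solaris kernel is exactly what needs to be determined.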
>
> Dave
>
> Brian Utterback wrote:
>
>   
>> You said that the Solaris kernel is broken. If we can isolate what is 
>> wrong, then we can get it fixed.
>>
>> As you noted, the constants in the timex.h file have changed from 
>> shifts to multiplies. As near as I can tell, the kernel was likewise 
>> changed to multiplies. I have no idea why that was done. I was hoping 
>> you could tell me, since you were in contact with the one who did the 
>> implementation in Solaris, Jan Brittenson.
>>
>> Is there any way I can test the kernel without running ntpd? I suspect 
>> that there is a way to do it with ntptime. For instance, I could 
>> create an offset of 90ms, then use ntptime to set the offset, time 
>> constant, and frequency, and then use ntpdate to observe the offset 
>> over time. That's what you meant when you talked about crossing zero 
>> and overshooting, right? The offset?
>>
>> As I said in my last email, I am very concerned about the bogus 
>> offsets being fed to the kernel by ntpd. Whether or not the kernel is 
>> broken, it would appear that ntpd is feeding the kernel bad data.
>>
>> Thank you for the help here. I really need to get this nailed down one 
>> way or another. I am ready to integrate NTPv4 into Nevada, but it has 
>> to work better than it does currently.
>>
>> If you want to see the actual implementation, it is here:
>>
>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/clock.c#100 
>>
>>
>> The implementation for the syscall is here:
>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/syscall/ntptime.c#177 
>>
>>
>> And the timex.h:
>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/timex.h#72 
>>
>>
>> David L. Mills wrote:
>>
>>     
>>> Brian,
>>>
>>> There are two problems here. The first is the humongous overshoot, way 
>>> over 50 percent. The second is that the time constant increases way 
>>> too fast. You can isolate the issues by setting maxpoll 6. For proper 
>>> behavior, the loop should cross zero in about 2500-3000 s and 
>>> overshoot about 6 percent. If it does not, the loop parameters are 
>>> broken.
>>>
>>> The second issue may be due to the kernel estimate of the clock 
>>> jitter, which is passed back to the daemon for use in the time 
>>> constant control algorithm. Start at maxpoll 6 with the daemon loop 
>>> and loopstats enabled and notice the clock jitter decrease until it 
>>> stabilizes. Then run with the kernel loop at maxpoll 6 and compare 
>>> with the daemon data. I suspect the kernel jitter will be much worse. 
>>> In that case the algorithm will find the offset much smaller than the 
>>> jitter, which is the signal to increase the time constant.
>>>
>>> I did confirm the nanokernel performs as expected, but the Solaris 10 
>>> kernel is badly broken. At time constant 6 (kernel time constant 2) 
>>> and a 90-ms initial offset, the loop crossed zero in about 500 s and 
>>> overshot 50 percent in about 1300 s. Clearly something is broken.
>>>
>>> By far the best way to fix this is to upgrade to the nanokernel; 
>>> however, it could be that the wrong timex.h file is in use. Check the 
>>> SHIFT_... constants in your file against the timex.h file in the 
>>> microkernel distribution.
>>>
>>> I did check that file in Solaris 10 and found a nasty surprise. All 
>>> those SHIFT_... constants had been changed. Unless the kernel is far 
>>> different than left here, that file is busted. Somebody has changed 
>>> what was intended as a shift to a multiplier. That works if the 
>>> kernel code is changed to suit.
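
To spell out the shift-versus-multiplier point, here is a small sketch.
The names SHIFT_EXAMPLE and SCALE_EXAMPLE and their values are
hypothetical, chosen only for illustration; they are not the actual
constants from either timex.h. The point is that the header and the code
have to agree on which convention is in use: a shift count of n and a
scale factor of 2**n give the same result only when each is used with
the matching operation.

```python
SHIFT_EXAMPLE = 6                    # hypothetical shift count: divide by 2**6
SCALE_EXAMPLE = 1 << SHIFT_EXAMPLE   # same gain expressed as a factor (64)


def scale_by_shift(value, shift):
    """Scale an integer the way shift-based code does."""
    return value >> shift


def scale_by_divide(value, factor):
    """Scale an integer the way factor-based code does."""
    return value // factor


offset_us = 90000  # a 90 ms offset expressed in microseconds

# Matching header and code: both conventions give the same answer.
print(scale_by_shift(offset_us, SHIFT_EXAMPLE))
print(scale_by_divide(offset_us, SCALE_EXAMPLE))

# Mismatch 1: the code still shifts, but the header now holds the factor.
print(scale_by_shift(offset_us, SCALE_EXAMPLE))

# Mismatch 2: the code divides, but the header still holds the shift count.
print(scale_by_divide(offset_us, SHIFT_EXAMPLE))
```

In the first mismatch the correction collapses to zero, and in the
second the gain is far too large. Either kind of disagreement between
the header and the kernel code would change the loop gains.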
>>>
>>> Dave
>>>
>>>       
>>>> snip
>>>>         
>
