[ntp:questions] strange time behaviour

Hal Murray hal-usenet at ip-64-139-1-69.sjc.megapath.net
Tue May 26 08:20:13 UTC 2009


>No, no HPET clock here. Clocksources are: tsc, pit, jiffies.

>The application is light-weight and doesn't do any heavy things on the
>kernel.
>For the 13 minutes and 32 seconds the clock slowed down, the behaviour
>was monotonic: losing 14 or sometimes 13 seconds per minute: just as
>if the pit had been reprogrammed for another frequency.

>I see three possibilities, where the problem could be: ntp, the linux-
>kernel or hardware. Since I can only upgrade ntp without interrupting
>the working system, which is nearly as bad as unscheduled
>interruptions, I would prefer the problem being in ntp. Would anybody
>recommend me upgrading ntp?

Does anybody have any ideas for where 13 minutes comes from?


I think you win the contest for the strangest bug so far this year.
Please let us know what the problem was if/when you find it.

I don't see any reason to be suspicious of ntp.  On the other hand,
if everything else is hard to change, sure, update to the latest
version.  (I'd wait a few days/weeks.  There is a release that
should happen real-soon-now.  Or help test the almost-release.)
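
For scale, here is a rough sketch of the arithmetic behind that hunch.
Losing 13 or 14 seconds per minute is a rate error of more than
200,000 ppm, while ntpd normally limits its frequency correction to
500 ppm.  The figures come from the report quoted above; the helper
names are made up for illustration, and the total-loss figure assumes
the 13 minutes 32 seconds was measured in real time.

```python
# Rough arithmetic only; names are hypothetical, numbers are from the report.
NTP_MAX_SLEW_PPM = 500          # ntpd's usual frequency-correction limit
DURATION_S = 13 * 60 + 32       # length of the slow episode: 812 seconds


def rate_error_ppm(lost_seconds, per_seconds=60):
    """Rate error in parts per million for a clock that loses
    lost_seconds in every per_seconds of real time."""
    return lost_seconds / per_seconds * 1_000_000


def total_lost(duration_s, lost_per_minute):
    """Seconds lost over duration_s of real time (assumes the
    duration was measured against a correct clock)."""
    return duration_s * lost_per_minute / 60


for lost in (13, 14):
    ppm = rate_error_ppm(lost)
    print(f"losing {lost} s/min = {ppm:,.0f} ppm, "
          f"about {ppm / NTP_MAX_SLEW_PPM:.0f}x the slew limit, "
          f"roughly {total_lost(DURATION_S, lost):.0f} s lost in total")
```

An error that large is far outside anything ntpd would apply by
slewing, which is consistent with looking at the kernel or the
hardware first.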

If I were tasked with fixing this...

First question.  How much is the fix worth?  How much do you think
it will cost to track it down?  Would it be cheaper to reboot the
systems every 40 days?

I'd start by getting a system in the lab that can/should have
the same bug so you can hack on it without screwing up operational
systems.

The next step would be to get the fine print on the hardware specs.

There has been a lot of work on the timekeeping area of the Linux kernel
recently.  I'd probably update to a recent kernel.

Then I'd scan all the code, partly to make sure I understood everything,
and partly looking for bugs.

48 days is a long time.  I'd probably dig out a couple of old junker
systems and set them up as close as possible to the failing system.
This is the different-hardware test case.  Might as well start them
ticking now.
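
When a bug shows up only after weeks of uptime, one cheap check is to
see how long a fixed-width tick counter takes to wrap.  This is only a
sketch: the tick rates below are common Linux HZ settings, not values
taken from the failing system, and nothing here shows that the 48 days
is a counter wrap.

```python
# Sketch: how long a free-running counter of a given width takes to wrap.
# The tick rates are assumptions (common Linux HZ values), not measurements.
SECONDS_PER_DAY = 86_400


def wrap_period_days(bits, ticks_per_second):
    """Days until a counter of `bits` width overflows at the given rate."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY


for hz in (100, 250, 1000):
    days = wrap_period_days(32, hz)
    print(f"32-bit counter at {hz} ticks/s wraps after {days:.1f} days")
```

If one of the periods lands close to the observed uptime, that counter
is a good place to start reading code.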

One thing that might help is to see if you can make the problem
happen in a few minutes rather than a few months.  I'd look
into hacking the kernel to initialize some of the time-keeping
variables so they were about to overflow.  And maybe add a bunch
of debugging counters and some way to read them.
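
The kernel already plays this trick with its own tick counter: jiffies
starts at a value that wraps about five minutes after boot, so
wraparound bugs show up early.  The toy model below shows the idea in
user space.  It is a sketch, not kernel code: the 32-bit width, the
HZ value and all the names are assumptions for illustration.  It
starts a counter one second before overflow and uses a debugging
counter to record how often a naive comparison disagrees with a
wrap-safe one.

```python
# Toy model of a 32-bit tick counter started just before overflow.
# HZ and all names here are illustrative assumptions.
HZ = 1000
MASK32 = 0xFFFFFFFF


def initial_ticks(seconds_before_wrap, hz=HZ):
    """Starting value that overflows after seconds_before_wrap seconds."""
    return (-seconds_before_wrap * hz) & MASK32


def time_after(a, b):
    """Wrap-safe test: is tick value a later than b?
    Equivalent to checking that the signed 32-bit value of (b - a) is negative."""
    return ((b - a) & MASK32) >= 0x80000000


def naive_after(a, b):
    """Buggy test that ignores wraparound."""
    return a > b


def count_disagreements(start, n_ticks, timeout):
    """Run n_ticks ticks from start.  On each tick, set a deadline
    `timeout` ticks ahead and count the ticks on which the naive and
    wrap-safe comparisons give different answers."""
    now = start
    disagreements = 0   # the debugging counter
    for _ in range(n_ticks):
        deadline = (now + timeout) & MASK32
        if naive_after(deadline, now) != time_after(deadline, now):
            disagreements += 1
        now = (now + 1) & MASK32
    return disagreements


print("started at zero:       ", count_disagreements(0, 2 * HZ, 10))
print("started 1 s before wrap:", count_disagreements(initial_ticks(1), 2 * HZ, 10))
```

Started at zero, the model would have to run for the full wrap period
before the naive comparison misbehaved.  Started one second before the
wrap, the bad comparisons show up within two seconds of simulated time.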

-- 
These are my opinions, not necessarily my employer's.  I hate spam.



