[ntp:questions] Re: Strange scattergram
sandler at ujf.cas.cz
Thu May 18 18:14:05 UTC 2006
> Karel Sandler wrote:
>> "Brian Utterback" wrote:
>>> Karel Sandler wrote:
>>>> to better know the network quality between my S2 and its peers, the
>>>> wedge scattergrams were used. That strange one has the upper limb
>>>> prolonged down to the left side of the head. Only few samples (status
>>>> 9314) are located there but their existence is beyond all my
>>>> Hope my S2 works well, only one plot (one peer from seven) shows up
>>>> that feature - lx.ujf.cas.cz/ntp-lx/antitail.png .
>>>> K. Sandler
>>> I wouldn't worry too much about the continuation of the line down and
>>> towards the left. That is probably due to the fact that the system's
>>> clock is in the process of slewing and is in the "overshoot" part of
>>> the phase locked loop. Of much greater concern is the complete asymmetry
>>> of the wedge plot. The typical cause for having the wedge plot bunched
>>> up towards one limb or the other is an asymmetric network path between
>>> the client and the server. However, looking at the offset plots
>>> at http://lx.ujf.cas.cz/ntp-lx it looks to me like your ntpd is not
>>> disciplining the frequency at all. Your system clock runs fast and
>>> ntpd adjusts it back in what look to me to be about a two hour cycle.
>>> I think that is why all of the offsets are positive in the wedge plot,
>>> because your clock runs fast.
>> Thanks, Brian, for your response. As I understand, you mean that due to a
>> lack of discipline datagrams can sometimes get false timestamps. Well,
> That's not exactly what I mean.
> In theory, ntpd is not only supposed to correct the offsets, but also
> adjust the clock frequency so that the clock does not run fast or slow,
> but stays in sync for long periods of time, meaning that the measured
> offset is always very small, and should be effectively zero. So,
> for any given measurement of offset, it should be nearly zero, but
> due to random errors, it might be plus or minus some error value.
> The error value is bounded by the round trip time. Hence the wedge
> plots. If ntpd is working, then the offset at any given point in time
> is close to zero, and random errors will give offsets of up to the
> delay. Thus plotting delay versus offset should produce a cloud in
> the form of a wedge.
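
To make the relation between offset, delay and the wedge shape concrete, here is a minimal Python sketch of the standard four-timestamp NTP calculation. The function names and the simulated one-way delays are illustrative assumptions, not ntpd code, and the sign convention assumed here is server time minus client time. In this model the offset error equals half the difference between the two one-way delays, so it can never exceed half the round trip delay, which is what bounds the wedge.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client transmit time (org), t2: server receive time (rec),
    t3: server transmit time (xmt), t4: client receive time.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate(true_offset, out_delay, back_delay, proc=0.0, t1=100.0):
    """Hypothetical helper: build the four timestamps for one exchange.

    Assumes server_clock = client_clock + true_offset and that the
    one-way delays (in seconds) are known, which is never true in practice.
    """
    t2 = t1 + out_delay + true_offset   # server clock at arrival
    t3 = t2 + proc                      # server clock at reply
    t4 = t3 - true_offset + back_delay  # client clock at arrival
    return ntp_offset_delay(t1, t2, t3, t4)


# Symmetric path: the measured offset matches the true offset (zero here).
sym_offset, sym_delay = simulate(0.0, 0.010, 0.010)

# Extra 20 ms on the outbound leg only: the measured offset is pushed
# positive by half the extra delay, i.e. along one limb of the wedge.
asym_offset, asym_delay = simulate(0.0, 0.030, 0.010)
```

Plotting many such (delay, offset) pairs with random extra delay on either leg gives the wedge; extra delay on only one leg fills only one limb.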
> Now, if the network path is not symmetric, the measured offset is
> perturbed. The offset will be systematically greater or less than
> it should be. Since this kind of asymmetry often affects all of your
> servers or none of them, it may be impossible to detect this in
> the short term, since the offsets from all the servers will be perturbed
> However, there is one small exception. The random errors are generally
> introduced at each network hop. So, for an asymmetric path, there will
> tend to be more errors introduced on one leg than the other. The
> errors themselves take the form of small delays, so this means that
> one leg will vary more and will tend to perturb the offset measurement
> more toward one direction (either positive or negative) than the other.
> This has the effect of making the wedge plot asymmetric, with more
> points along one or the other limb of the graph. The huff puff
> algorithm keeps track of this data over a long period of time and
> if it notices this kind of asymmetry, it calculates a correction
> factor to bring the wedge plot back into symmetry.
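
The idea of such a correction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not ntpd's implementation: the class name, the window length and the rule "pull the offset toward zero by half the delay in excess of the minimum seen" are assumptions chosen to show the principle of using a long-term minimum delay as the reference.

```python
from collections import deque


class HuffPuffSketch:
    """Illustrative minimum-delay correction (not ntpd's actual code)."""

    def __init__(self, window=8):
        # Recent delay samples; the window length is an arbitrary assumption.
        self.delays = deque(maxlen=window)

    def correct(self, offset, delay):
        """Return the offset with the excess-delay bias removed."""
        self.delays.append(delay)
        min_delay = min(self.delays)
        # Delay above the minimum is assumed to sit on one leg only,
        # so it biases the offset by half of that excess.
        excess = (delay - min_delay) / 2.0
        if offset > 0:
            return offset - excess
        return offset + excess


hp = HuffPuffSketch()
first = hp.correct(0.0, 0.020)      # establishes the 20 ms minimum delay
second = hp.correct(0.010, 0.040)   # 20 ms excess delay, offset pulled back
third = hp.correct(-0.005, 0.030)   # 10 ms excess delay on the other limb
```

A single shared minimum, as in this sketch, corresponds to the global correction described above; a per-server variant would keep one such object per peer.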
> One limiting factor of the huff puff at this time, though, is that
> it only calculates a global correction, which is fine if the problem
> is affecting all your servers. If not, huff puff could make things
> worse. Luckily, it seems that this is the most common case. It could
> be modified to do the calculations on a per server basis, but that
> is not the case today.
> However, this all assumes that ntpd is doing its job. In your case, it
> appears that it is not. There is a pronounced stair step to your
> offsets over time. The offset increases for about two hours, then
> ntpd kicks in and slews the clock to bring it back down to zero, then
> it increases again. If ntpd were adjusting the clock frequency correctly,
> then this would not be happening.
> So, the offsets are starting at zero, and then increasing, and then
> going back quickly to zero. This means that the offsets are all
> skewed toward the positive side of the plot. Because the clock is
> running at the wrong speed, this also distorts the calculated offset
> because the org timestamp and the rec timestamp for a packet are at
> slightly different timescales. Normally, the change is so small
> as to be negligible, but in your case it is pretty systematic.
> So, because much of the offset is not due to errors, but due to
> systematic frequency problems, the relation between delay and offset
> is reduced. Furthermore, since ntpd slews the clock to correct offsets,
> this has the effect of exacerbating the problem of the changing time
> scales. This, I believe, is the source of the little tail, down
> and to the left, lying outside the wedge.
> As you said this is only affecting the single system, this further
> makes me think that this is a problem that ntpd is having disciplining
> the clock frequency, rather than an asymmetric path problem as Dr.
> Mills has suggested.
> One suggestion I have is to stop ntpd, remove the drift file, reboot (or
> zero out the kernel adjustments with ntptime if you know how) and
> let ntpd recalculate the correct drift. Incorrect drifts have been
> known to cause odd effects and it is better to start from scratch,
> rather than start with a drift that is incorrect.
> Hope that helps.
Thanks, Brian, for this thorough explanation. Certainly, I can restart the
server (or only ntpd, and zero out the kernel adjustments with ntptime or
adjtimex). The server has been up for five weeks since the last kernel
upgrade (FC3) and the original drift file has been changed by ntpd many
times. I think that any original drift value was forgotten a long time ago.
I agree with your statements about the pronounced stair steps visible on the
offset curve. They are clearly visible, along with the frequencies, on the
loopstats plot (the URL is in my previous mail). It seems to me that ntpd
mostly does not adjust the frequency until the offset has changed by
nearly 1 ms. This took about two hours, i.e. the frequency was 0.1-0.2 ppm
off. I don't know why. Then, as you wrote, ntpd kicks in and brings that 1 ms
offset back in less than half an hour. Although this ~0.5 ppm value is large,
I don't see how it influences the org and rec timestamps, which are at most
tens of ms apart (only the offset value, not the delay, can be influenced).
Maybe the whole set of eight polls is somewhat disturbed. Certainly, the
functioning of ntpd on this server is not satisfactory.
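
As a quick check of the figures above, the rates follow directly from the offset change divided by the elapsed time. The short Python sketch below uses a hypothetical helper name and the round numbers quoted in this mail (1 ms, two hours, half an hour); it is only the arithmetic, not a measurement.

```python
def drift_ppm(offset_change_s, interval_s):
    """Average frequency error, in parts per million, implied by an
    offset change (seconds) accumulated over an interval (seconds)."""
    return offset_change_s / interval_s * 1e6


# 1 ms accumulated over about two hours: roughly 0.14 ppm.
accumulate_ppm = drift_ppm(1e-3, 2 * 3600)

# The same 1 ms removed within half an hour: roughly 0.56 ppm.
# A shorter correction time would give a correspondingly larger value.
correct_ppm = drift_ppm(1e-3, 30 * 60)
```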
Another server of mine, with almost the same hardware, OS and ntpd version
(distribution.rpm), shows similar offset behaviour. I shall try to compile the
tarball there; I hope that helps.