[ntp:questions] Re: Strange scattergram
brian.utterback at sun.removeme.com
Thu May 18 13:49:21 UTC 2006
Karel Sandler wrote:
> "Brian Utterback" wrote:
>> Karel Sandler wrote:
>>> to better know the network quality between my S2 and its peers, the wedge
>>> scattergrams were used. That strange one has the upper limb prolonged
>>> down to the left side of the head. Only few samples (status 9314) are
>>> located there but their existence is beyond all my understanding.
>>> Hope my S2 works well, only one plot (one peer from seven) shows up that
>>> feature - lx.ujf.cas.cz/ntp-lx/antitail.png .
>>> K. Sandler
>> I wouldn't worry too much about the continuation of the line down and
>> towards the left. That is probably due to the fact that the system's
>> clock is in the process of slewing and is in the "overshoot" part of
>> the phase locked loop. Of much greater concern is the complete asymmetry
>> of the wedge plot. The typical cause for having the wedge plot bunched
>> up towards one limb or the other is an asymmetric network path between
>> the client and the server. However, looking at the offset plots
>> at http://lx.ujf.cas.cz/ntp-lx it looks to me like your ntpd is not
>> disciplining the frequency at all. Your system clock runs fast and
>> ntpd adjusts it back in what look to me to be about a two hour cycle.
>> I think that is why all of the offsets are positive in the wedge plot,
>> because your clock runs fast.
> Thanks, Brian, for your response. As I understand it, you mean that due to a
> lack of discipline, datagrams can sometimes get false timestamps. Well, but
That's not exactly what I mean.
In theory, ntpd is not only supposed to correct the offsets, but also
adjust the clock frequency so that the clock does not run fast or slow,
but stays in sync for long periods of time, meaning that the measured
offset is always very small, and should be effectively zero. So,
for any given measurement of offset, it should be nearly zero, but
due to random errors, it might be plus or minus some error value.
The error value is bounded by the round trip time. Hence the wedge
plots. If ntpd is working, then the offset at any given point in time
is close to zero, and random errors will give offsets of up to the
delay. Thus plotting delay versus offset should produce a cloud in the
form of a wedge.
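
To make the wedge concrete, here is a minimal Python sketch. It assumes
the standard NTP on-wire calculation of offset and delay from the four
packet timestamps, a perfectly disciplined client (true offset zero),
and made-up one-way delays (a 10 ms floor plus up to 20 ms of random
queueing each way). The function names and numbers are illustrative
only. Under these assumptions the measured offset never exceeds half
the measured delay, which is what gives the scatter its wedge shape.

```python
import random

def ntp_sample(true_offset, out_delay, back_delay, server_hold=0.0):
    """Return (offset, delay) computed from the four NTP timestamps.

    true_offset is server clock minus client clock, in seconds.
    """
    t1 = 0.0                              # client transmit (client clock)
    t2 = t1 + out_delay + true_offset     # server receive (server clock)
    t3 = t2 + server_hold                 # server transmit (server clock)
    t4 = t3 - true_offset + back_delay    # client receive (client clock)
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

def simulate(n=1000, base=0.010, jitter=0.020, seed=1):
    """Samples for a synchronized client with random delay on both legs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        out = base + rng.uniform(0.0, jitter)    # hypothetical outbound delay
        back = base + rng.uniform(0.0, jitter)   # hypothetical return delay
        samples.append(ntp_sample(0.0, out, back))
    return samples

samples = simulate()
# Every point lies inside the wedge |offset| <= delay / 2.
inside_wedge = all(abs(off) <= dly / 2.0 + 1e-12 for off, dly in samples)
```

Plotting delay against offset for these samples should give a symmetric
wedge whose apex sits at the minimum delay and at zero offset.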
Now, if the network path is not symmetric, the measured offset is
perturbed. The offset will be systematically greater or less than
it should be. Since this kind of asymmetry often affects all of your
servers or none of them, it may be impossible to detect this in
the short term, since the offsets from all the servers will be perturbed alike.
However, there is one small exception. The random errors are generally
introduced at each network hop. So, for an asymmetric path, there will
tend to be more errors introduced on one leg than the other. The
errors themselves take the form of small delays, so this means that
one leg will vary more and will tend to perturb the offset measurement
more toward one direction (either positive or negative) than the other.
This has the effect of making the wedge plot asymmetric, with more
points along one limb or the other of the graph. The huff puff
algorithm keeps track of this data over a long period of time and
if it notices this kind of asymmetry, it calculates a correction
factor to bring the wedge plot back into symmetry.
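
The effect of uneven jitter on the two legs can be sketched in a few
lines of Python. This is not the huff puff algorithm itself, only an
illustration of the asymmetry it is meant to detect. It assumes both
clocks agree exactly, so the measured offset is just half the
difference of the one-way delays, and it assumes exponentially
distributed queueing delays with made-up means. All names and numbers
are hypothetical.

```python
import random

def measured_offset(out_delay, back_delay):
    # With both clocks in perfect agreement, the standard NTP offset
    # calculation reduces to half the difference of the one-way delays.
    return (out_delay - back_delay) / 2.0

def positive_fraction(out_jitter, back_jitter, n=5000, base=0.010, seed=2):
    """Fraction of samples that land on the positive limb of the wedge."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(n):
        # expovariate takes a rate, so 1 / mean gives the wanted mean delay.
        out = base + rng.expovariate(1.0 / out_jitter)
        back = base + rng.expovariate(1.0 / back_jitter)
        if measured_offset(out, back) > 0:
            positive += 1
    return positive / n

symmetric = positive_fraction(0.005, 0.005)   # same jitter on both legs
lopsided = positive_fraction(0.020, 0.002)    # outbound leg far noisier
```

With equal jitter, about half of the points fall on each limb. With the
noisier outbound leg, the large majority pile up along one limb, which
is the lopsided wedge described above.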
One limiting factor of the huff puff at this time, though, is that
it only calculates a global correction, which is fine if the problem
is affecting all your servers. If not, huff puff could make things
worse. Luckily, a problem that affects all servers seems to be the most
common case. Huff puff could be modified to do the calculations on a
per-server basis, but that is not the case today.
However, this all assumes that ntpd is doing its job. In your case, it
appears that it is not. There is a pronounced stair step to your
offsets over time. The offset increases for about two hours, then
ntpd kicks in and slews the clock to bring it back down to zero, then
it increases again. If ntpd were adjusting the clock frequency correctly,
then this would not be happening.
So, the offsets are starting at zero, and then increasing, and then
going back quickly to zero. This means that the offsets are all
skewed toward the positive side of the plot. Because the clock is
running at the wrong speed, this also distorts the calculated offset
because the org timestamp and the rec timestamp for a packet are at
slightly different timescales. Normally, the change is so small
as to be negligible, but in your case it is pretty systematic.
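
A rough Python sketch of that stair-step pattern follows. It assumes a
constant, uncorrected frequency error (10 ppm here, a made-up value), a
64 second poll interval (also an assumption), and a correction every
two hours that is modelled as an instant reset to zero rather than a
real slew. The sign of the accumulated offset is simply taken as
positive, following the plots described above.

```python
def sawtooth_offsets(freq_error_ppm, poll_s=64, cycle_s=7200, cycles=4):
    """Offsets in seconds seen at each poll when a constant frequency
    error accumulates and is wiped out every cycle_s seconds."""
    offsets = []
    for t in range(0, cycles * cycle_s, poll_s):
        since_correction = t % cycle_s    # seconds since the last reset
        offsets.append(freq_error_ppm * 1e-6 * since_correction)
    return offsets

def ramp_ppm(offsets, poll_s, first, last):
    """Frequency error in ppm implied by the slope between two polls
    that lie on the same ramp."""
    return (offsets[last] - offsets[first]) / ((last - first) * poll_s) * 1e6

offsets = sawtooth_offsets(10.0)          # 10 ppm is a hypothetical value
mean_offset = sum(offsets) / len(offsets)
implied_ppm = ramp_ppm(offsets, 64, 0, 100)
```

Every offset in this model is zero or positive, so the average sits
well above zero instead of scattering around it, and the slope of each
ramp gives back the frequency error that should have been removed.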
So, because much of the offset is not due to errors, but due to
systematic frequency problems, the relation between delay and offset
is reduced. Furthermore, since ntpd slews the clock to correct offsets,
this has the effect of exacerbating the problem of the changing time
scales. This, I believe, is the source of the little tail, down
and to the left, lying outside the wedge.
Since you said this is only affecting the single system, this further
makes me think that this is a problem that ntpd is having disciplining
the clock frequency, rather than an asymmetric path problem as Dr.
Mills has suggested.
One suggestion I have is to stop ntpd, remove the drift file, reboot (or
zero out the kernel adjustments with ntptime if you know how) and
let ntpd recalculate the correct drift. Incorrect drifts have been
known to cause odd effects and it is better to start from scratch,
rather than start with a drift that is incorrect.
Hope that helps.
Roses are #FF0000, Violets are #0000FF. All my base are belong to you.
Brian Utterback - OP/N1 RPE, Sun Microsystems, Inc.