[ntp:questions] NTP Support (Was 'What does "Max Distance Exceeded"...')
Joseph Gwinn
joegwinn at comcast.net
Mon Mar 16 14:27:33 UTC 2009
We have moved from the meaning of status code 9514 to the more general
issue of how NTP shall be supported, so I've collected the relevant
threads below.
===================================================
> At 11:19 PM -0400 3/15/09, Danny Mayer wrote:
> Joseph Gwinn wrote:
>
> >>> The FAQ has to be the place for such explanations.
> >> I'm not sure if this qualifies as an FAQ as I don't recall that it has
> >> come up before. FAQ stands for Frequently Asked Questions.
> >
> > RAQ then? Rarely Asked Questions
> >
> > Seriously, I can't believe that I'm the only person in history to be
> > perplexed by these status codes, and those little three-word summaries
> > are a bit telegraphic.
> >
> > Joe Gwinn
> >
>
> You aren't the only one. These questions have been asked before by a
> number of people. In fact I had to look at this at one point when I was
> getting these codes. Of course I just looked at the source code and
> never looked for documentation.
>
> I will tell you that this is a combination of bits so it's not just a
> number. Each bit represents a test code that failed so you have quite a
> bit to look at.
I do know how the status code is structured, and wrote a Mathematica
program to automate the decoding. (I use Mathematica to generate the
co-plots of loopstats and peerstats data, collect statistics, etc.)
What I didn't know was that the definitions of the code bits had changed
between v3 and v4. I'll have to dig into the old documentation and see
if this code was affected.
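
For illustration, here is a minimal Python sketch of that kind of decoding (not the Mathematica program mentioned above). It assumes that the status word is printed in hexadecimal and that it is a set of independent bit flags, as described earlier in the thread. The bit names and the v3/v4 tables below are hypothetical placeholders, not the real NTP definitions; they only show how one bit position can carry a different meaning in each version.

```python
# Hypothetical bit tables, one per protocol version.
# These names are placeholders, NOT the real NTP definitions.
BIT_NAMES = {
    "v3": {0: "test_a", 1: "test_b", 2: "test_c", 3: "test_d"},
    "v4": {0: "test_a", 1: "test_x", 2: "test_c", 3: "test_d"},
}


def parse_hex(text):
    """Convert a status word printed as hex (an assumption) to an int."""
    return int(text, 16)


def decode_status(word, version):
    """Return the names of the bits set in `word`, lowest bit first.

    Bits with no entry in the chosen table are reported as 'bitN'
    rather than silently dropped.
    """
    names = BIT_NAMES[version]
    result = []
    bit = 0
    while word >> bit:
        if (word >> bit) & 1:
            result.append(names.get(bit, f"bit{bit}"))
        bit += 1
    return result


if __name__ == "__main__":
    word = parse_hex("9514")
    for version in ("v3", "v4"):
        print(version, decode_status(word, version))
```

Keeping one table per version means the same word can be decoded under either set of definitions and the results compared side by side.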
There is little chance that I will have the time to read enough NTP
source code to make sense of it, sufficient to be able to come to
reliable conclusions. I'm a system engineer, and time is one issue of
many in a system.
More generally, it's hopeless to expect the world's sysadmins to read
NTP code (or any other kind of code). They just don't have the time,
and are responsible for far too many different kinds of box for it to be
practical. But a major part of making something reliable in practice is
making it possible for a harried sysadmin to nonetheless get it right.
(I'm not a sysadmin, but work with many sysadmins. They spend lots of
time fighting fires, and are of necessity jacks of all trades, masters
of none.)
Silently mutating code definitions sounds like a blunder to me. NTP is
used on tens to hundreds of millions of computers worldwide. There will
never be a pure v4 world. In fact there will still be v3 around when
v5 is being introduced. So, if new kinds of status are needed, invent
new codes to suit, but do not change the meanings of the codes that are
already widely used. In other words, do not undermine your existing
base.
The Internet folk had the same issue with IPv6, and they concluded that
IPv4 was too deeply embedded to ever eliminate, and that there was never
going to be a "flag day" when a worldwide changeover would happen.
Thus, IPv4 and IPv6 had to coexist and interoperate forever, and so IPv6
was designed to support this.
==========================================================
> To: mayer at ntp.org
> From: Joe Gwinn <joegwinn at comcast.net>
> Subject: Re: [ntp:questions] What exactly does "Maximum Distance Exceded"
> mean?
> Cc: questions at lists.ntp.org
> Bcc: gwinn at raytheon.com
> X-Attachments:
>
> Status code values fixed.
>
> At 10:47 PM -0400 3/15/09, Danny Mayer wrote:
> Joseph Gwinn wrote:
> > Hmm. OK, but I think that we've kind of run off the rails. Let me
> > summarize:
> >
> > 1. Sun Microsystems' current behavior is not the issue, as I'm loading
> > old software from an old CD onto old computer hardware, hardware that
> > cannot support a newer version of Solaris than v9.
> >
> > One of these old Solaris boxes did work with NTPv3 running an even older
> > version of Solaris, with no 9514 codes, deepening the mystery.
> >
>
> The trouble here is that those codes are *very likely* to have
> changed between V3 and V4 since there was a large rewrite between the
> two. That's why looking at the source code is necessary to get you the
> help you need.
As discussed in my other reply, mutating codes is a blunder. It's a
good-news bad-news thing. The good news is that NTP has succeeded on an
unimagined scale. The bad news is that because of that scale, one must
be *very* respectful of NTP's existing base, and it *can* be
constraining.
> > The fact that this obsolete system can most likely support NTPv4 is
> > worth investigation, though.
> >
> > 2. I think that what's happening is that I'm doing something dumb, and
> > I bet that there is no real difference in how NTPv3 or NTPv4 would react
> > to this faux pas, whatever it turns out to be. Nor is source code
> > research needed or requested.
> >
> > 3. The original question was how to interpret a specific status code,
> > 9514. I read the explanation in the documentation, but became no wiser
> > for it. Thus my question.
>
> Which is why you need to look at the source code. Documentation isn't
> always clear or definitive but the source code will tell you.
Reading source code simply cannot be required just to get the definitions
of status codes, even if the documentation has to give one definition
per NTP version. NTP is used on hundreds of millions of computers. Are
we expecting that every time someone gets an unexpected code they either
have to read the source code, or pay someone to read it for them? I'm
sorry, but that cannot work.
> > If there isn't an NTP FAQ entry on this, there probably should be. Our
> > sysadmins were flummoxed by the cloud of 9514 codes, and they are far
> > too busy to undertake a research project. (The deeper problem is that
> > some managers believe that NTP is plug and play, which isn't quite true.)
> >
>
> Mostly it is, but there are always mysteries like this.
Yes.
Joe