[ntp:questions] NTP Support (Was 'What does "Max Distance Exceeded"...')

Joseph Gwinn joegwinn at comcast.net
Mon Mar 16 14:27:33 UTC 2009


We have moved from the meaning of status code 9514 to the more general 
issue of how NTP shall be supported, so I've collected the relevant 
threads below.

===================================================

> At 11:19 PM -0400 3/15/09, Danny Mayer wrote:
> Joseph Gwinn wrote:
> 
> >>> The FAQ has to be the place for such explanations.
> >> I'm not sure if this qualifies as an FAQ as I don't recall that it has
> >> come up before.  FAQ stands for Frequently Asked Questions.
> >
> > RAQ then?  Rarely Asked Questions
> >
> > Seriously, I can't believe that I'm the only person in history to be
> > perplexed by these status codes, and those little three-word summaries
> > are a bit telegraphic.
> >
> > Joe Gwinn
> >
> 
> You aren't the only one. These questions have been asked before by a
> number of people. In fact I had to look at this at one point when I was
> getting these codes. Of course I just looked at the source code and
> never looked for documentation.
> 
> I will tell you that this is a combination of bits so it's not just a
> number. Each bit represents a test code that failed so you have quite a
> bit to look at.

I do know how the status code is structured, and wrote a Mathematica
program to automate the decoding. (I use Mathematica to generate the
co-plots of loopstats and peerstats data, collect statistics, etc.)
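
For readers without Mathematica, the decoding step can be sketched in a
few lines of Python.  This is a minimal sketch under stated assumptions,
not the program mentioned above: it assumes that 9514 is the hexadecimal
16-bit peer status word as logged in peerstats, and that the word splits
into fields of 5, 3, 4 and 4 bits (status, select, event count, event
code), which is the layout I recall from RFC 1305 Appendix B.  The names
PeerStatus, split_status_word and set_bits are hypothetical.  The sketch
deliberately attaches no meanings to the field values, since those are
exactly what differs between versions.

```python
from typing import List, NamedTuple


class PeerStatus(NamedTuple):
    """Fields of a 16-bit status word (assumed 5/3/4/4 bit layout)."""
    status: int  # top 5 bits
    select: int  # next 3 bits
    count: int   # next 4 bits
    code: int    # low 4 bits


def split_status_word(hex_word: str) -> PeerStatus:
    """Split a 16-bit status word, given as hex text, into its four fields."""
    word = int(hex_word, 16)
    if not 0 <= word <= 0xFFFF:
        raise ValueError(f"not a 16-bit word: {hex_word!r}")
    return PeerStatus(
        status=(word >> 11) & 0x1F,
        select=(word >> 8) & 0x07,
        count=(word >> 4) & 0x0F,
        code=word & 0x0F,
    )


def set_bits(value: int, width: int) -> List[int]:
    """Return the positions (0 = least significant) of the bits set in value."""
    return [bit for bit in range(width) if value & (1 << bit)]


fields = split_status_word("9514")
print(fields)                      # the four raw field values
print(set_bits(fields.status, 5))  # which status bits are set
```

What each field value or bit means has to come from the documentation
for the NTP version that produced the word.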

What I didn't know was that the definitions of the code bits had changed 
between v3 and v4.  I'll have to dig into the old documentation and see 
if this code was affected.

There is little chance that I will have the time to read enough NTP
source code to understand it well enough to reach reliable
conclusions.  I'm a system engineer, and time is one issue of many in
a system.

More generally, it's hopeless to expect the world's sysadmins to read 
NTP code (or any other kind of code).  They just don't have the time, 
and are responsible for far too many different kinds of box for it to be 
practical.  But a major part of making something reliable in practice is 
making it possible for a harried sysadmin to nonetheless get it right.  
(I'm not a sysadmin, but work with many sysadmins.  They spend lots of 
time fighting fires, and are of necessity jacks of all trades, masters 
of none.)


Silently mutating code definitions sounds like a blunder to me.  NTP is
used on tens to hundreds of millions of computers worldwide.  There will
never be a pure v4 world.  In fact there will still be v3 around when
v5 is being introduced.  So, if new kinds of status are needed, invent
new codes to suit, but do not change the meanings of the codes that are
already widely used.  In other words, do not undermine your existing
base.

The Internet folk had the same issue with IPv6, and they concluded that 
IPv4 was too deeply embedded to ever eliminate, and that there was never 
going to be a "flag day" when a worldwide changeover would happen.  
Thus, IPv4 and IPv6 had to coexist and interoperate forever, and so IPv6 
was designed to support this.

==========================================================

> To: mayer at ntp.org
> From: Joe Gwinn <joegwinn at comcast.net>
> Subject: Re: [ntp:questions] What exactly does "Maximum Distance Exceded"  
> mean?
> Cc: questions at lists.ntp.org
> Bcc: gwinn at raytheon.com
> X-Attachments: 
> 
> Status code values fixed.
> 
> At 10:47 PM -0400 3/15/09, Danny Mayer wrote:
> Joseph Gwinn wrote:
> > Hmm.  OK, but I think that we've kind of run off the rails.  Let me
> > summarize: 
> >
> > 1.  Sun Microsystems' current behavior is not the issue, as I'm loading
> > old software from an old CD onto old computer hardware, hardware that
> > cannot support a newer version of Solaris than v9. 
> >
> > One of these old Solaris boxes did work with NTPv3 running an even older
> > version of Solaris, with no 9514 codes, deepening the mystery.
> >
> 
> The trouble here is that those codes are *very likely* to have
> changed between V3 and V4 since there was a large rewrite between the
> two. That's why looking at the source code is necessary to get you the
> help you need.

As discussed in my other reply, mutating codes is a blunder.   It's a 
good-news bad-news thing.  The good news is that NTP has succeeded on an 
unimagined scale.  The bad news is that because of that scale, one must 
be *very* respectful of NTP's existing base, and it *can* be 
constraining.


> > The fact that this obsolete system can most likely support NTPv4 is
> > worth investigation, though.
> >
> > 2.  I think that what's happening is that I'm doing something dumb, and
> > I bet that there is no real difference in how NTPv3 or NTPv4 would react
> > to this faux pas, whatever it turns out to be.  Nor is source code
> > research needed or requested. 
> >
> > 3.  The original question was how to interpret a specific status code,
> > 9514.  I read the explanation in the documentation, but became no wiser
> > for it.  Thus my question. 
> 
> Which is why you need to look at the source code. Documentation isn't
> always clear or definitive but the source code will tell you.

Reading source code simply cannot be a requirement for getting the
definitions of status codes, even if the documentation has to give one
definition per NTP version.  NTP is used on hundreds of millions of
computers.  Are we expecting that every time someone gets an unexpected
code they either have to read the source code, or pay someone to read it
for them?  I'm sorry, but that cannot work.


> > If there isn't a NTP FAQ entry on this, there probably should be.  Our
> > sysadmins were flummoxed by the cloud of 9514 codes, and they are far
> > too busy to undertake a research project.  (The deeper problem is that
> > some managers believe that NTP is plug and play, which isn't quite true.)
> >
> 
> Mostly it is, but there are always mysteries like this.

Yes.

Joe



