[ntp:questions] Failure of NIST Time Servers

jlevine jlevine at boulder.nist.gov
Mon Jun 6 18:07:41 UTC 2011


I am writing to reply to the previous comments:

1. I have no comment on whether the failure and the response to it
were "professional."

2. The failure was directly caused by a double hardware failure -- a
problem in Boulder and an unrelated problem in Fort Collins at
essentially the same time. Now that the problem has occurred, it is
trivially easy to design an algorithm that would have detected it, but
it would have been much more difficult to do so beforehand. The fact
that the system did not have this particular algorithm to deal with
this particular double failure is not a bug in the usual sense of that
word.

3. We have a lot of safeguards built into the systems and we have a
lot of experience running them. We have been running time servers for
about 20 years, and I can't remember the last failure of this
magnitude. I have made some changes to deal with the problem of the
relative unreliability of the backup systems in Fort Collins, and this
particular problem will not happen again. But it would be foolish of
me to promise that the system is perfect and that some other failure,
*with equally serious impact*, will never happen again. We have more
than 100 computers in the network and lots of ancillary stuff, and it
would be foolish and simplistic of me to guarantee that I (or anyone
else) have thought of every possible hardware failure.

4. The "unhealthy" flag in NTP (both leap second bits set) is a copy
of an internal private kernel parameter. This parameter can be set by
a number of internal check processes (which are outside of NTP and
independent of it) and it can also be set from Boulder if the central
controller detects a problem. A complete failure of the ACTS system
would have set the unhealthy flag unconditionally, but the partial
failure that actually occurred may not have done so. The same kernel
parameter is used to control the status parameters of the other
non-NTP services that we provide.

5. Since hardware failures are probably inevitable in a network system
of the size and complexity of the NIST service, a fair question is
whether the failure can be limited and its impact contained or ideally
made invisible to the users. The failure affected 11 of the 35
physical time servers that I operate. So the glass starts out about
1/3 empty and 2/3 full. About half of the 11 physical servers were
transmitting the unhealthy status and should not have caused any
problems for users who parse the flags. That leaves only the other 5
or 6 of the 35 servers that could have caused problems. So, even
during the worst failure in my memory, the glass is about 1/7 empty
and 6/7 full.
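
As an illustration of what "parsing the flags" in points 4 and 5 can
mean on the client side, here is a minimal sketch in Python. It is not
part of the NIST software. It assumes the standard NTP packet layout,
in which the leap indicator occupies the top two bits of the first
header byte and the value with both bits set marks the server as
unsynchronized. The function names and the sample packets are
hypothetical.

```python
NTP_HEADER_LEN = 48        # minimum length of an NTP packet header, in bytes
LI_UNSYNCHRONIZED = 0b11   # both leap indicator bits set ("unhealthy")


def leap_indicator(packet: bytes) -> int:
    """Return the 2-bit leap indicator from the first byte of an NTP packet."""
    if len(packet) < NTP_HEADER_LEN:
        raise ValueError("packet is shorter than an NTP header")
    # First byte layout: LI (2 bits) | version (3 bits) | mode (3 bits)
    return (packet[0] >> 6) & 0b11


def server_reports_unhealthy(packet: bytes) -> bool:
    """True if the reply has both leap bits set and should be ignored."""
    return leap_indicator(packet) == LI_UNSYNCHRONIZED


# Hypothetical server replies (version 4, mode 4), zero-filled after byte 0.
healthy_reply = bytes([0x24]) + bytes(47)    # LI = 0
unhealthy_reply = bytes([0xE4]) + bytes(47)  # LI = 3

for name, reply in (("healthy", healthy_reply), ("unhealthy", unhealthy_reply)):
    print(name, "-> discard" if server_reports_unhealthy(reply) else "-> use")
```

A client that performs a check of this kind would have discarded the
replies from the servers that were transmitting the unhealthy status.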

Judah Levine
Time and Frequency Division
NIST Boulder



