[ntp:questions] Proposed NTP solution for a network
bmwjason at bmwlt.com
Tue Mar 3 05:27:11 UTC 2009
Below is a description of our environment and my thoughts on a
resilient and precise NTP configuration for it. All comments, suggestions,
etc. are welcome, indeed requested. I am not a software type, rather networks
and hardware, so please keep that in mind with comments and questions.
Three locations: A, B, & C. Locations A and B are datacenters, C is a
business office with back-office processing and long-term storage.
A and B are within 10-15 miles of each other near NYC, and C is about
1200 miles from A and B.
All three sites are interconnected in a mesh IP network with dual OC-3
connections from each site -- the network is highly resilient although
perhaps not as fast as we might like. A and B additionally have a GigE
connection between them for host-host communication, database
updates/backups, command & control, etc.
Locations A and B have a large community of Suse 10.x Enterprise
servers, each with a very stringent requirement to keep its time closely
"in sync" with the other hosts at that site, as well as with those at the
other site. Absolute accuracy (i.e. "true time") is not as important as
"precision" (that is, all the hosts should be within a few tens of
microseconds of each other, but they could be as much as a few hundred
microseconds off of UTC).
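
To make the distinction between "precision" and accuracy concrete, here is a minimal Python sketch. The host names, offset values, and the two numeric limits are all assumptions chosen for illustration; only the general shape of the requirement comes from the description above.

```python
# Hypothetical per-host clock offsets from UTC, in microseconds.
# Host names and values are made up for illustration.
offsets_us = {
    "hostA1": 212.0,
    "hostA2": 230.5,
    "hostB1": 221.0,
    "hostB2": 205.5,
}

# Assumed limits, loosely matching "a few tens" and "a few hundred" microseconds.
MAX_SPREAD_US = 50.0
MAX_UTC_ERROR_US = 300.0


def spread_us(offsets):
    """Largest disagreement between any two hosts (the 'precision')."""
    values = list(offsets.values())
    return max(values) - min(values)


def worst_utc_error_us(offsets):
    """Largest absolute offset from UTC (the accuracy)."""
    return max(abs(v) for v in offsets.values())


def meets_requirement(offsets):
    """True when both the host-to-host spread and the UTC error are in bounds."""
    return (spread_us(offsets) <= MAX_SPREAD_US
            and worst_utc_error_us(offsets) <= MAX_UTC_ERROR_US)
```

In this made-up data every host is more than 200 microseconds off UTC, yet the hosts agree with each other to within 25 microseconds, which is the situation described as acceptable.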
Stepping the clock during prime hours (0700 - 2000) would be very bad
for the transaction record (transactions are very time sensitive).
Stepping is less of a concern between 2000 and 0700.
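
As a rough illustration of why stepping versus slewing matters here, the sketch below estimates how long an offset takes to slew away. It assumes ntpd's documented defaults (a 128 ms step threshold and a maximum slew rate of 500 ppm); the function names are hypothetical, and a real ntpd will generally converge more slowly than this best-case figure.

```python
# Assumed ntpd defaults: offsets above 128 ms are stepped, smaller ones are
# slewed at no more than 500 ppm (500 microseconds per second).
# Running ntpd with -x raises the step threshold to 600 s.
STEP_THRESHOLD_S = 0.128
MAX_SLEW_PPM = 500.0


def seconds_to_slew(offset_s, slew_ppm=MAX_SLEW_PPM):
    """Best-case time to remove `offset_s` by slewing at `slew_ppm`."""
    return abs(offset_s) / (slew_ppm * 1e-6)


def would_step_by_default(offset_s):
    """True if the offset exceeds the default step threshold."""
    return abs(offset_s) > STEP_THRESHOLD_S
```

For example, a 100 ms offset would not be stepped by default, but removing it by slewing takes on the order of 200 seconds at best.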
Each client at A and B has multiple GigE connections to the LANs.
The timestamps on transactions should be traceable (i.e. we may need to
provide to regulators information on the source, accuracy, and precision
of the timestamp of any transaction).
Each of A, B, and C has a dedicated NTP appliance (same make and model,
with differing manufacturing dates -- I have since learned that maybe we
should mix up the make/model, but "one thing at a time"), with
integrated GPS receiver and antenna on the roof. Each site also has
access to the Internet.
Note that each NTP appliance can output PPS, but the hosts have no
method to receive the PPS (blade servers in an enclosure, and all
available expansion slots on each individual blade are in use). In
addition, there is no provision on the enclosure to accept a PPS or
other time source for distribution to the individual blades using a
The current configuration has all the A and B clients synchronizing with the
NTP appliance at B. The NTP appliance at A has suffered an antenna
fault, which is being repaired, but even after it is back on-line, the
software group wants all hosts to sync to a single NTP appliance. The
NTP appliance at C is new and not yet integrated into the solution -- part
of the reason for this message.
From reading this newsgroup, the wiki
(http://www.ntp.org/ntpfaq/NTP-a-faq.htm) and of course
http://www.ntp.org/, this is what I think the hardware configuration
should be:
1. Reference clocks: GPS receivers in the NTP appliance are Stratum 0.
2. Stratum 1 level: Each Appliance has an output at Stratum 1 via the
Ethernet connection. Each appliance should be a peer to the other
appliances (symmetric active/passive) as discussed at
http://www.eecis.udel.edu/~mills/ntp/html/assoc.html#symact. This would
enable the appliance to lose the reference source and still be useful to
the Stratum 2 servers that are clients of these appliances.
3. Stratum 2: One server at each of the three locations, each
referencing each of the three NTP appliances. Each would also peer with
the other two servers. This will enable the datacenters to keep the
local hosts synchronized even if the other sites are unreachable (the
servers at A can continue to process transactions even without
connectivity to B and C, for example).
4. Clients: All clients at location A would sync to the local server
(prefer) and to the server at location B. All clients at B would sync to
the A server (prefer) and to the local server at B. All clients at
location C would sync to their local server (prefer) and to the server
at location A. Thus each client would have a choice of two Stratum 2
servers, each of which is trusted and peered with one another. In
addition, this makes the clients at A and B likely, although not
guaranteed, to use the same server for their time.
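
The layout in items 3 and 4 can be sketched as ntp.conf lines. The Python below only renders text; the host names are hypothetical placeholders, and the lines use nothing beyond the standard `server`, `peer`, and `prefer` keywords. It is a sketch of the proposal as described, not a tested configuration.

```python
# Hypothetical host names; none of these appear in the original description.
APPLIANCES = {"A": "gps-a.example.com", "B": "gps-b.example.com", "C": "gps-c.example.com"}
SERVERS = {"A": "ntp-a.example.com", "B": "ntp-b.example.com", "C": "ntp-c.example.com"}

# Item 4: (preferred server site, second server site) for clients at each site.
CLIENT_SOURCES = {"A": ("A", "B"), "B": ("A", "B"), "C": ("C", "A")}


def stratum2_conf(site):
    """ntp.conf lines for the Stratum 2 server at `site` (item 3)."""
    # Reference all three appliances, then peer with the other two servers.
    lines = [f"server {APPLIANCES[s]}" for s in sorted(APPLIANCES)]
    lines += [f"peer {SERVERS[s]}" for s in sorted(SERVERS) if s != site]
    return lines


def client_conf(site):
    """ntp.conf lines for an ordinary client at `site` (item 4)."""
    preferred, second = CLIENT_SOURCES[site]
    return [f"server {SERVERS[preferred]} prefer", f"server {SERVERS[second]}"]
```

Calling `client_conf("A")` and `client_conf("B")` yields the same two lines, which is what makes clients at both datacenters likely to follow the same server.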
A. Does the above architecture fit with best practices? Suggestions
for improvement? It seems to fit with Section 188.8.131.52 at
B. I'm unclear where, or if, "orphan" mode should be used on the
servers. Should it be configured at all? What will be the advantage
either way? Oh, some more research
(http://support.ntp.org/bin/view/Support/OrphanMode) shows that orphan
mode is not available in the version we are running.
ntpdc 4.2.0a@1.1196-r Thu Jun 29 17:48:04 UTC 2006 (1)
Is the use of orphan mode advantageous enough to update the NTPd on 200+
C. This configuration cannot get past the "survivor" problem where, with
three servers, if one fails then the other two cannot find a majority
(see http://www.ntp.org/ntpfaq/NTP-s-algo-real.htm, section 5.3.2). So
that leads to either trusting an Internet host, adding another receiver,
or using a source at an interconnected sister-company in Europe. So,
should the servers also have a trusted Internet-based time source? The
nature of our business makes the Internet inherently untrusted for a
number of reasons, and the need for traceable time sources is one of them.
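
The majority problem in question C can be illustrated with a simplified interval-overlap count. This is a sketch in the spirit of the intersection idea, not ntpd's actual selection code; the function names and the sample intervals are assumptions. Each source is modelled as an interval (its offset minus and plus its error bound), and a majority exists when more than half of the intervals share a common point.

```python
def largest_agreeing_group(intervals):
    """Return the largest number of (low, high) intervals sharing a common point."""
    events = []
    for low, high in intervals:
        events.append((low, 0))   # 0 = interval opens; sorts before a close at the same point
        events.append((high, 1))  # 1 = interval closes
    events.sort()
    best = current = 0
    for _, kind in events:
        if kind == 0:
            current += 1
            best = max(best, current)
        else:
            current -= 1
    return best


def has_majority(intervals):
    """True when strictly more than half of the sources agree."""
    return largest_agreeing_group(intervals) > len(intervals) / 2
```

With three sources of which one is wrong, e.g. `[(-1, 1), (0, 2), (10, 12)]`, two still overlap and form a majority. With only two sources that disagree, e.g. `[(-1, 1), (10, 12)]`, neither can be outvoted, which is why a fourth source helps once one of three has failed.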
D. Does it make sense, because the time precision is so important, to
use servers for the Stratum 2 level that are unencumbered by other
processes? Or should one of the existing 8-core blades be sufficient,
perhaps using processor affinity for the NTP process?
E. The "precision" requirement leads me to think that I need all clients
at a site to be receiving time from the _same_ server, whether that is
the local server or not. How can I ensure this requirement is met?
F. I'm sure I'm forgetting some questions, and perhaps need more
education about this, please help me to understand.
G. What have I missed, or gotten confused?
Oh, and I have Dr. Mills' book on order; it should arrive in a few days.