[ntp:questions] Significant clock skew in cluster environment

metallurgist at airpost.net
Mon Jun 23 14:50:18 UTC 2008

Hello all.

I'd appreciate some help.  I'm the de facto admin for a small research
cluster at an academic institution.  All hosts run GNU/Linux with a
2.6.22-family kernel, and the ntp version in use is 4.2.4p4.  My campus
runs two ntp servers, and my cluster's headnode uses both of them as its
time sources.  The internal cluster nodes then use the headnode as their
only ntp time source, because they have no route to the internet; only
the headnode does.

I'm seeing a problem where the internal cluster nodes develop significant
clock skew over time.  By "significant" I mean up to 700 seconds over
two weeks of uptime.  I am measuring this with "ntpq -p" and reading
the offset field.  The only cause I can think of is that some of the
machines, including the headnode, are configured to use the Linux
"ondemand" CPU frequency governor.  Those machines have older AMD
Opteron 246/248 processors, which support dynamic frequency management.
However, I also have nodes with older AMD Athlon processors that do not
use dynamic frequency management, and they exhibit the same problem.
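
To put the 700-second figure into the units that ntpd reports, here is a
rough Python sketch that converts an accumulated offset into an average
frequency error in parts per million (PPM).  The function name is made up
for illustration, and the calculation assumes the clock was never stepped
or otherwise corrected during the two weeks, which is a simplification.

```python
def drift_ppm(offset_seconds: float, elapsed_seconds: float) -> float:
    """Average frequency error, in PPM, implied by an accumulated offset."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return offset_seconds / elapsed_seconds * 1_000_000


TWO_WEEKS = 14 * 24 * 3600  # 1,209,600 seconds

# Worst case described above: 700 seconds of skew over two weeks of uptime.
worst_case_ppm = drift_ppm(700, TWO_WEEKS)
print(f"{worst_case_ppm:.1f} PPM")
```

Under that assumption the worst-case nodes are off by roughly 579 PPM on
average, which is the same order of magnitude as the 500 PPM tolerance
that ntpd mentions in the headnode's log.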

Additionally, on the headnode the ntpd syslog output contains messages
like:

  ntpd[5642]: frequency error 509 PPM exceeds tolerance 500 PPM

But there are no such log entries on any of the internal nodes.
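
For anyone wanting to pull these messages out of syslog, a minimal Python
sketch follows.  The regular expression is written against the single
message quoted above, so other ntpd versions may word it differently, and
the helper names are my own.  The second function only models a clock
left completely uncorrected at a constant frequency error; it says
nothing about how much of that error ntpd actually removes.

```python
import re

# Pattern for the ntpd message quoted above (assumed wording, one sample only).
FREQ_ERROR_RE = re.compile(
    r"frequency error (-?\d+(?:\.\d+)?) PPM exceeds tolerance (\d+(?:\.\d+)?) PPM"
)


def parse_frequency_error(line: str):
    """Return (error_ppm, tolerance_ppm), or None if the line does not match."""
    match = FREQ_ERROR_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def seconds_gained(ppm: float, elapsed_seconds: float) -> float:
    """Offset accumulated by a clock running at a constant frequency error."""
    return ppm * 1e-6 * elapsed_seconds


line = "ntpd[5642]: frequency error 509 PPM exceeds tolerance 500 PPM"
error_ppm, tolerance_ppm = parse_frequency_error(line)
per_day = seconds_gained(error_ppm, 24 * 3600)
print(f"{error_ppm:g} PPM uncorrected is about {per_day:.1f} s/day")
```

A 509 PPM error left uncorrected corresponds to about 44 seconds per day.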

Is dynamic processor frequency control known to negatively affect ntp?
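
To confirm which governor each node is really using, something like the
following Python sketch could be run on every host.  It assumes the usual
Linux sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor);
the function name and the optional root argument are only there for
illustration and testing.

```python
import glob
import os


def read_governors(sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU directory name (e.g. 'cpu0') to its current cpufreq governor.

    CPUs without a cpufreq directory (no frequency scaling driver loaded)
    are simply absent from the result.
    """
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu_name = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as handle:
            governors[cpu_name] = handle.read().strip()
    return governors


if __name__ == "__main__":
    found = read_governors()
    if not found:
        print("no cpufreq governors found (frequency scaling not active?)")
    for cpu, governor in found.items():
        print(f"{cpu}: {governor}")
```

Comparing the output between the Opteron and Athlon nodes would at least
show whether the affected machines have frequency scaling in common.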

If that is not the cause, here are the basic contents of my ntp.conf
files.  None of these machines runs an onboard firewall, and ntpd is
started through the init system.

On the head node:

Two sets of server directives in the form:

  server a.b.c.d iburst
  restrict a.b.c.d nomodify notrap nopeer noquery

where a.b.c.d is one of the campus ntp servers' IP addresses.
Thereafter there is:

  restrict default ignore
  restrict h.i.j.k mask l.m.n.o nomodify nopeer notrap

where h.i.j.k and l.m.n.o are the network address and netmask, correctly
defined to allow all the internal cluster hosts to query this machine,
followed by:

  restrict h.i.j.p mask

where h.i.j.p is the head node's internal cluster IP address.
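
As a sanity check on the address/mask pair in the restrict line, a small
Python sketch using the standard ipaddress module can confirm that a
given node falls inside the permitted range.  The addresses below are
documentation-range placeholders, not my real ones, and the function name
is hypothetical.

```python
import ipaddress


def covered_by_restrict(host: str, network: str, netmask: str) -> bool:
    """True if host falls inside the network/netmask pair of a restrict line."""
    # strict=True raises ValueError if the network address has host bits set.
    net = ipaddress.ip_network(f"{network}/{netmask}", strict=True)
    return ipaddress.ip_address(host) in net


# Hypothetical values standing in for h.i.j.k, l.m.n.o and the node addresses.
NETWORK, NETMASK = "192.0.2.0", "255.255.255.0"
for node in ("192.0.2.10", "192.0.2.254", "198.51.100.7"):
    print(node, covered_by_restrict(node, NETWORK, NETMASK))
```

A node that prints False here would be caught by "restrict default
ignore" instead of the more permissive line.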

On the internal cluster nodes (all use the identical file):

  server h.i.j.p iburst

where h.i.j.p is the headnode's IP address, followed by:

  restrict default ignore 
  restrict h.i.j.p mask nopeer

Thanks for any help.
  metallurgist at airpost.net
