[ntp:questions] sanity checking a time comparison
rick.jones2 at hp.com
Wed Jun 8 19:09:25 UTC 2011
I've been messing about with an "sFlow to RRD" utility that takes
interface counter samples from sFlow PDUs and shoves them into an
RRD. (links to sFlow and RRD below) I face the age-old "He with more
than one clock does not know the time" dilemma - I have the time the
sFlow PDU arrived at my collector (via asking for message arrival
timestamping by the kernel/stack via SO_TIMESTAMP not via a
gettimeofday() call after a recvfrom()). There is also the
"sysUpTime" field of the sFlow PDU header, which is milliseconds since
the sFlow agent started.
Presently, I am using the stack timestamp on the collector system
(which is syncing time via NTP), and handwaving away the network
delays - treating them as a more or less constant skew error.
However, I may not always have that luxury: I may have to consume
sFlow PDUs which have passed through the guts of other applications,
which has then gotten me interested in the stability and accuracy of
the sysUpTime field.
So, I took two switches (not necessarily those of my employer, I try
to have a broad view), configured them to send me sFlow counter
samples, and then over 24 to 72 hours captured via tcpdump the sFlow
PDUs. I did not enable time synchronization on either switch - I
wanted to see just how bad it might be. I took the tcpdump timestamp
(this is on my NTP-synced collector, so presumably that is advancing
"accurately" over the long term) and sysUpTime from the first PDU,
then looked at those from the last PDU in the trace and found that
sysUpTime moved away from time on my system by one part in not quite
200000 for one of them and one part in 404 for the other. "No
worries," I thought, "all that means is I need to tell the switches to
sync their time via NTP." So I did. I told them to sync time with my
collector system. Now,
presumably, the clocks on the switches over the long term should not
drift all that far from that of my collector, perhaps oscillate back
and forth. Rather than run tcpdump again, I hacked my sFlow to RRD
utility to keep sFlow agent state, and report on the difference
between how far time had advanced on my collector vs how far time had
advanced on the switches, and after not quite 24 hours I am seeing:
agent switch1 subagent 1 cum_pdu_time delta 70799652322 (usec)\
cum_uptime delta 70800000000 (usec) elapsed diff -347678 (usec) \
seqno delta 1
agent switch2 subagent 1 cum_pdu_time delta 70809506020 (usec) \
cum_uptime delta 70985000000 (usec) elapsed diff -175493980 (usec) \
seqno delta 1
"cum_pdu_time" is time as seen on my collector, "cum_uptime" is from
the sysUpTime field of the sFlow PDUs. My switch1 still seems to
differ by one part in ~200K, and switch2 by one part in ~400.
Both switches claim they are syncing their time with my collector
system. If I accept that at face value, about the only thing I can
think of is that the sFlow agents are not basing their "time" on the
NTP-synced time on the switches, but on something else. Perhaps
assuming their timer is firing every N units of time and adding N to
the sysUpTime, but the timer is really firing every M units of time.
Any thoughts among the chrono-gods as to what I might do to verify (or
refute) that theory? I can run an ntpq against switch1:
raj at tardy:~$ ntpq -p switch1
     remote          refid       st t when poll reach   delay  offset  jitter
*collector      secthrobsurty     2 -   36   64   377   3.664   0.433   0.000
but switch2 isn't running a "full" NTP - it may just be doing an
SNTP/ntpdate kind of thing.
It is not a question of half full or empty - the glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...