[ntp:questions] Seeing large jitter and delay in recent ntpd and recent Solaris
fran.horan at jhuapl.edu
Tue Nov 8 20:29:55 UTC 2011
When running recent versions of ntpd and with a recent Solaris patch, I see large delay and jitter in ntpq based output, and in the stats files. I am running in a high performance configuration where I'm trying to minimize offset.
I think it is a Solaris problem in asynchronous i/o processing, where select() is erroneously returning zero return values under certain conditions.
My ‘high performance’ configuration is running ntpd bound to processor 0 on an SMP, and at highest priority. My poll rate is 16 sec. And I’m running ntpq locally so ntpq is communicating with ntpd over the loopback interface. Most of these conditions are required to create the problem. So running time share, or running ntpq remotely, or not binding to processor 0 and then there is no problem. I haven’t tried other poll rates but its not likely that matters.
I did a lot of runtime tracing on ntpd using the built-in tracing printouts as well as printouts I added myself. I see that after ntpd services the ntpq requests over the loopback interface, and then services its next timer alarm interrupt, it sends a time synch packet out to its server as expected. (The server is on the same LAN in the same room, not on the internet.)
A SIGPOLL interrupt happens soon after as expected, as the return packet comes in. What then catches my eye is the select() call in the signal handler returns with return value and file descriptors set to zero. These should be nonzero. This is why I think there is a problem in Solaris. So with these zero return values from select(), the signal handler does not read out the packet from socket, and returns. The packet stays in the socket and gets stale.
Next there are more timer expiration events, and being at a poll rate of 16, not too much later I see another outbound packet get sent to the server in the room. Then I see the SIGPOLL interrupt and the interrupt handler run. The handler reads two packets out of the socket, the stale one and then the new one that came in behind it. My trace shows a printout of 'flash header 002', and '2' means 'bogus' so I think ntpd has detected something is bad about this stale packet that was stuck in the socket for a long time. However, I think this stale packet is what may be driving up the delay and jitter values I see in ntpq and the stats files. In rawstats, I see the t4 timestamp being really late for what I think is this event.
The signal handler makes the select() call with a timeout of 0, so the call returns immediately. As an experiement, I tried putting in non-zero timeouts, first 100 microsec and then 1 microsec. With non-zero timeouts, select() returns look good, non-zero return values. I looked at a handful of events, and left the timeout at 1 microsec. My colleague let it run for a while and he reported back it was better but eventually he started to see problems. Anyway, this is a hint about what may be happening in Solaris.
Some things I tried fiddling with in Solaris that did not matter whether they were enabled or disabled were TCP Fusion and hi_res tick.
I passed this to a couple Oracle reps about a week ago and haven’t heard anything back. I’ll try to post this to an Oracle Solaris support group, and try to submit an official trouble report to Oracle on it as well.
Some important details on the runtime machine and versions:
Hardware: Oracle SPARC T3 server, SMP multiprocessor, 32 Gig RAM
Operating System: Solaris 10, 8/11
Solaris patch where we started to see a problem: 144500-19
ntpd versions: I saw the problem and did my tracing on v4.2.6. Then I went back to version 4.2.5 which was recently delivered with Solaris and see the problem there as well. My colleague went back even further and he reports that 4.2.2 has the problem, but then 4.2.0 and back does not.
I’ll let you know if I hear back from Oracle. If you have ideas please let me know. We are not stuck. We can run older versions, or not bind to processor 0 to avoid the problem and still stay within the spec we are trying to meet.
More information about the questions