[ntpwg] NTPv4 SNMP MIB draft

David L. Mills mills at udel.edu
Tue Jan 24 22:42:29 UTC 2006


Guys,

I have a few nitpicks, which are inline below. First, however, I would 
like to take a closer look at what is going on here and at how the MIB 
is expected to be used. There are two customers of these data: a human, 
and some sort of AI program that can distil reports and log events.

Think about it this way. The standardized monitoring programs and 
statistics recording features have been found very useful over the 
years. Let's say one of the goals of the MIB is to allow remote 
reconstruction of the data produced by ntpq and recorded by the filegen 
facility. So, a litmus test is to ask what MIB queries do I need to 
produce a local copy of the loopstats, clockstats, cryptostats, 
peerstats, sysstats and rawstats records? Can I reconstruct all or most 
of the ntpq billboards from MIB queries?

Another litmus test is what kind of script could be written, possibly a 
cron job and/or driven by traps, that could collect the filegen data and 
look for spikes and extreme frequency wobbles. For example, use the 
rawstats data to generate a wedge scattergram once per day. Experience 
here suggests that is the most useful summary statistic of them all.
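
A rough sketch of such a script, in Python, might look like the 
following. It assumes each loopstats record carries MJD, seconds past 
midnight, offset (s), frequency (PPM) and jitter (s) in that order; the 
function names and the threshold values are illustrative only and are 
not part of ntpd or the draft.

```python
from statistics import median

def parse_loopstats(line):
    """Parse one loopstats record (field order is an assumption, see above)."""
    fields = line.split()
    return {
        "mjd": int(fields[0]),
        "second": float(fields[1]),
        "offset": float(fields[2]),    # seconds
        "freq_ppm": float(fields[3]),  # parts per million
        "jitter": float(fields[4]),    # seconds
    }

def find_spikes(records, factor=5.0):
    """Return records whose |offset| exceeds factor times the median |offset|."""
    magnitudes = [abs(r["offset"]) for r in records]
    if not magnitudes:
        return []
    limit = factor * median(magnitudes)
    return [r for r in records if abs(r["offset"]) > limit]

def find_wobbles(records, max_step_ppm=0.5):
    """Return consecutive record pairs whose frequency differs by more than max_step_ppm."""
    return [(a, b) for a, b in zip(records, records[1:])
            if abs(b["freq_ppm"] - a["freq_ppm"]) > max_step_ppm]
```

The same records would have to be rebuilt from MIB queries for the 
remote case, which is the point of the litmus test.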

I still have trouble with the data types. A REAL is I assume an IEEE 
floating double; a floating single is not very useful. But, if a display 
string is included, either a human reads it directly or an AI program 
simply parses it, with due regard for precision and headroom. So, why do 
we need the binary representation in the first place? If the binary 
representation is needed for simple programs in a PIC, for example, then 
what is the PIC going to do about the actual value, if no more precise 
than the display string?
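
To put numbers on the precision point, here is a small Python sketch 
that rounds a value through IEEE single and double formats and through 
a decimal string. The sample value and the nine-digit format are 
assumptions chosen for illustration.

```python
import struct

def roundtrip_single(x):
    """Round x to IEEE 754 single precision and back."""
    return struct.unpack("!f", struct.pack("!f", x))[0]

def roundtrip_double(x):
    """Round x to IEEE 754 double precision and back."""
    return struct.unpack("!d", struct.pack("!d", x))[0]

def roundtrip_string(x, digits=9):
    """Format x as a decimal display string and parse it again."""
    return float(f"{x:.{digits}f}")

offset = 1.000000001                 # one second plus one nanosecond
single = roundtrip_single(offset)    # the nanosecond is lost
double = roundtrip_double(offset)    # the nanosecond survives
text = roundtrip_string(offset)      # the nanosecond survives here too
```

With enough digits in the display string, the parsed value matches the 
double, which is the case made above for not needing both.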

Dave

ntpSrvTimeResolution OBJECT-TYPE
    SYNTAX      DisplayString
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "string describing the time resolution of the running NTP
         implementation"
    -- e.g. "100ns"
    -- depends on the NTP implementation and the underlying OS. The
    -- current resolution should be used, so if the OS only supports
    -- 10ms and ntpd is capable of 1ns, the 10ms should be advertised
    ::= { ntpSrvInfo  5 }

xxx You need to use these terms in their technical sense. Resolution is 
the number of significant bits in a clock reading, which could be one 
nanosecond, while precision is the minimum increment that can be 
distinguished, which could be as much as a clock tick of 10 ms.
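
One way to see the difference is to measure the precision directly. The 
Python sketch below reads a clock repeatedly and keeps the smallest 
increment it observes; the readings have nanosecond resolution, yet the 
increment can be far larger. The function names are hypothetical, and 
the floor used for the power-of-two exponent is an assumption about the 
rounding convention. On a fine-grained clock the result mostly reflects 
the time taken to read the clock.

```python
import math
import time

def measure_precision(clock=time.time_ns, samples=1000):
    """Smallest positive increment seen between successive clock readings."""
    smallest = None
    last = clock()
    for _ in range(samples):
        now = clock()
        while now == last:          # wait for the clock to change
            now = clock()
        # Ignore backward steps; keep the smallest forward increment.
        if now > last and (smallest is None or now - last < smallest):
            smallest = now - last
        last = now
    return smallest

def as_log2_seconds(increment_ns):
    """Express an increment in nanoseconds as a power-of-two exponent of seconds."""
    return math.floor(math.log2(increment_ns * 1e-9))
```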
   
ntpSrvTimeResolutionVal OBJECT-TYPE
    SYNTAX      Integer32
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "time resolution in integer format"
    -- ntpSrvTimeResolution in Integer format
    -- shows the resolution based on 1 second, e.g. "1ms" translates to 1000
    ::= { ntpSrvInfo  6 }
   
xxx I don't understand why you need this, as the exact value can be 
computed from the string. Why should the managed object do this, which 
might be specific to each agent?
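
For illustration, the conversion is a few lines on the manager side. 
This Python sketch assumes the string is a decimal number followed by 
one of the units s, ms, us or ns, as in the examples in the draft; the 
function names are hypothetical.

```python
import re

UNITS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}

def parse_time_string(text):
    """Convert a string such as "100ns" or "0.032 ms" to seconds."""
    m = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*(s|ms|us|ns)\s*", text)
    if m is None:
        raise ValueError(f"unrecognized time string: {text!r}")
    return float(m.group(1)) * UNITS[m.group(2)]

def per_second(text):
    """Integer form described in the draft: "1ms" becomes 1000."""
    return round(1.0 / parse_time_string(text))
```
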
--
-- Section 2: Current NTP status (dynamic information)
--
ntpSrvStatus     OBJECT IDENTIFIER  ::= { ntpSnmp 1 }

ntpSrvStatusCurrentState OBJECT-TYPE
    SYNTAX      DisplayString
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "actual status of NTP as a string"
    --- possible strings:
    --- "not running" : NTP is not running
    --- "not synchronized" : NTP is not synchronized to any time source
    ---     (stratum = 16)
    --- "sync to local" : NTP is synchronized to own local clock
    ---     (degraded reliability)
    --- "sync to refclock" : NTP is synchronized to a local hardware
    ---     refclock (e.g. GPS)
    --- "sync to remote server" : NTP is synchronized to a remote NTP
    ---     server ("upstream" server)
    ::= { ntpSrvStatus 1 }

xxx I don't know what you have in mind here. Is there an agent in the 
monitored machine that knows when ntpd is not running and when it is? 
The ntpd itself doesn't know when it is not running. The "not 
synchronized" can be determined from the system peer association ID, but 
not from the stratum. What you want to know is whether the 
synchronization distance is above or below the distance threshold (1 s 
configurable). Until the threshold is crossed the client is by 
definition synchronized. The machine is synchronized to a reference 
clock if running at stratum one. Which kind of source it is can be read 
from the reference identifier.

ntpSrvStatusCurrentStateVal OBJECT-TYPE
    SYNTAX      INTEGER {
                    notRunning(0),
                    notSynchronized(1),
                    syncToLocal(2),
                    syncToRefclock(3),
                    syncToRemoteServer(4),
                    unknown(99)
                }
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "current state of the NTP as integer value"
    -- see ntpSrvStatusCurrentState
    DEFVAL { 99 }
    ::= { ntpSrvStatus 2 }

xxx I don't know how to code this.

ntpSrvStatusStratum OBJECT-TYPE
    SYNTAX      INTEGER (1..16)
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "own stratum value"
    -- should be stratum of syspeer + 1 (or 16 if no syspeer)
    DEFVAL { 99 }
    ::= { ntpSrvStatus 3 }

xxx I don't understand the + 1. The stratum displayed should be the 
stratum assigned by the algorithms. The stratum does not go to 16 (0 on 
the wire) when all sources go away. It is not intended as a 
synchronization status indicator.

ntpSrvStatusActiveRefclockId OBJECT-TYPE
    SYNTAX      INTEGER ( 0..99999 )
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "the association ID of the current syspeer"
    DEFVAL { 99 }
    ::= { ntpSrvStatus 4 }

xxx If this displays as zero, the machine has no active sources, but 
continues to be valid as a server until the synchronization distance has 
crossed the threshold.
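
Taken together with the comments on the status string and the stratum, 
the rule could be sketched in Python roughly as follows. The class and 
field names are hypothetical, the 1 s default is the threshold quoted 
above, and the case of a daemon that has never been synchronized since 
startup is deliberately not handled.

```python
from dataclasses import dataclass

@dataclass
class SystemVars:
    sys_peer_assoc_id: int   # 0 means no system peer is currently selected
    stratum: int
    root_distance: float     # synchronization distance, in seconds

def classify(v, distance_threshold=1.0):
    """Return (state, has_active_source) from the system variables."""
    has_source = v.sys_peer_assoc_id != 0
    # Synchronization is decided by the distance, not by the stratum.
    if v.root_distance >= distance_threshold:
        return ("not synchronized", has_source)
    if v.stratum == 1:
        return ("sync to refclock", has_source)
    return ("sync to remote server", has_source)
```

Note that a zero association ID with a small distance still classifies 
as synchronized, which matches the remark that the machine remains 
valid as a server until the threshold is crossed.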

ntpSrvStatusActiveRefclockName OBJECT-TYPE
    SYNTAX      DisplayString
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "The hostname/descriptive name of the current refclock selected
         as syspeer"
    -- e.g. "ntp1.ptb.de" or "GPS" or "DCFi" ...
    -- maybe something like "RefClk(8)" = "hardware clock using driver
    -- 8" would be nice
    ::= { ntpSrvStatus 5 }

xxx The reference identifier is intended to serve this purpose. The 
reference implementation purposely does not try to resolve a host name, 
as with IPv6 only a hash is available.

ntpSrvStatusActiveOffset OBJECT-TYPE
    SYNTAX      DisplayString
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Time offset to the current selected refclock as string"
   -- including unit, e.g. "0.032 ms" or "1.232 s"
    ::= { ntpSrvStatus 6 }

xxx This is the system clock offset available in the mode-6 protocol.

ntpSrvStatusActiveRefclockOffsetVal OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Time offset between the current selected refclock and time of
         NTP in milliseconds"
    DEFVAL { 0 }
    ::= { ntpSrvStatus 7 }

xxx This value is of course in IEEE floating double format, but could 
just as readily be converted from the display string by the agent.

ntpSrvStatusFrequency OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Frequency Drift of the NTP server"
    DEFVAL { 0 }
    ::= { ntpSrvStatus 8 }

xxx I suggest using the term frequency offset, not drift, as drift 
ordinarily refers to what we call wander.

ntpSrvStatusNumberOfRefclocks OBJECT-TYPE
    SYNTAX      INTEGER (0..99)
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Number of refclocks configured in the NTP "
    DEFVAL { 0 }
    ::= { ntpSrvStatus 9 }

xxx I don't see this as useful. What you probably want is the number of 
configured associations and the number of survivors of the mitigation 
algorithms. It doesn't matter how many refclocks there are, just whether 
one of them has control of the clock discipline and that is evident from 
the stratum.

ntpSrvStatusAuthKeyId OBJECT-TYPE
    SYNTAX      INTEGER ( 0..1024 )
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Authentication Key ID of active refclock is active "
    -- xxxTODOxxx Check docs :"How many keys are allowed?"
    DEFVAL { 0 }
    ::= { ntpSrvStatus 10 }

xxx Refclocks are not normally authenticated; remote servers and peers 
can be, but each can have a different key. Ordinarily, you don't care 
about the maximum number of keys; the reference implementation can 
allocate probably many thousands before running out of memory.

ntpSrvStatusServiceUptime OBJECT-TYPE
    SYNTAX      TimeTicks
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Uptime of NTP service"
    -- time since ntpd was (re-)started
    DEFVAL { 0 }
    ::= { ntpSrvStatus 11 }

xxx This is available in the sysstats monitoring data and could be 
available to SNMP as well. Take a look at the data recorded; it is 
similar to the data now used by NIST. That's the stuff I want for 
logging purposes.

xxx Missing: the basic performance data I watch are the offset, root 
distance, frequency and jitter.

--
-- Section 3: Status of all currently mobilized associations
-- 

ntpSrvAssociations     OBJECT IDENTIFIER  ::= { ntpSnmp 3 }

ntpSrvAssocTable OBJECT-TYPE
    SYNTAX        SEQUENCE OF ntpAssociation
    MAX-ACCESS  read-only
    DESCRIPTION
      "Table of currently mobilized associations"
    ::= { ntpSrvAssociations 1 }

xxx Easy.
       
ntpSrvAssociation   SEQUENCE {
    ntpSrvAssocId                 Integer32,
    ntpSrvAssocName            DisplayString,
    ntpSrvAssocAddress            DisplayString,
    ntpSrvAssocOffset            DisplayString,
    ntpSrvAssocStratum            INTEGER,
    ntpSrvAssocPollInterval        INTEGER,
    ntpSrvAssocTimeToNextPoll    INTEGER,
    ntpSrvAssocReachability        INTEGER,
    ntpSrvStatusAssocOffsetVal    REAL,
    ntpSrvStatusAssocJitterVal    REAL,
    ntpSrvStatusAssocDelayVal    REAL   
   
}

xxx Associations don't have names; the only reliable handle is the 
association ID. They are distinguished by source IP address and source 
port (only). The time to the next poll is not available, as it can be 
changed on the fly and cannot be predicted. The performance variables 
are offset, delay, dispersion and jitter. The status variables are 
stratum, time since last update, reachability register and poll 
interval. However, the first diagnostics I look at are the flash bits.
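
As a sketch only, a table row built from the variables listed above 
might look like this in Python. The class and field names are 
hypothetical. The poll interval is held as a power-of-two exponent of 
seconds, and a flash word of zero is taken to mean that no sanity test 
failed; both are assumptions about how an agent would expose the data.

```python
from dataclasses import dataclass

@dataclass
class Association:
    assoc_id: int          # the only reliable handle
    src_address: str       # source IP address
    src_port: int          # source port
    offset: float          # seconds
    delay: float           # seconds
    dispersion: float      # seconds
    jitter: float          # seconds
    stratum: int
    since_update: float    # seconds since last update
    reach: int             # 8-bit reachability register
    poll_exponent: int     # poll interval as log2 seconds
    flash: int             # flash bits, zero when no test failed

    def poll_interval(self):
        """Poll interval in seconds."""
        return 1 << self.poll_exponent

    def is_clean(self):
        """True when no flash bit is set."""
        return self.flash == 0
```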

ntpSrvAssocId     OBJECT-TYPE
   SYNTAX     Integer32 ( 0..99999 )
   MAX-ACCESS  read-only
   DESCRIPTION
      "Association ID"
  ::= { ntpSrvAssociation 1 }
 
ntpSrvAssocName   OBJECT-TYPE
   SYNTAX     DisplayString
   MAX-ACCESS read-only
   DESCRIPTION
      "Hostname or other descriptive name for association"
   ::= { ntpSrvAssociation 2 }
  
ntpSrvAssocAddress   OBJECT-TYPE
   SYNTAX    DisplayString
   MAX-ACCESS read-only
   DESCRIPTION
      "IP address (IPv4 or IPv6) of association OR refclock driver ID"
      -- contains IP address of uni/multi/broadcast associations or
      -- a refclock driver ID like "127.127.1.0" for other associations
    ::= { ntpSrvAssociation 3 }

ntpSrvAssocOffset OBJECT-TYPE
    SYNTAX      DisplayString
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Time offset to the association as string"
   -- including unit, e.g. "0.032 ms" or "1.232 s"
    ::= { ntpSrvAssociation 4 }

ntpSrvAssocStratum OBJECT-TYPE
    SYNTAX      INTEGER (1..16)
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "stratum level of the association"
    -- should be stratum of the association's syspeer + 1 (or 16 if no
    -- syspeer)
    DEFVAL { 99 }
    ::= { ntpSrvAssociation 5 }

ntpSrvAssocPollInterval OBJECT-TYPE
    SYNTAX      INTEGER
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "polling interval for the association in seconds"
    -- reflects the number of seconds between two consecutive polls
    -- can be typically one of the following:
    --  64, 128, 256, 512 or 1024
    DEFVAL { 99 }
    ::= { ntpSrvAssociation 6 }


ntpSrvAssocTimeToNextPoll OBJECT-TYPE
    SYNTAX      INTEGER
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "number of seconds until next poll"
    -- reflects the number of seconds between two successive polls
    DEFVAL { 99 }
    ::= { ntpSrvAssociation 7 }

ntpSrvAssocReachability OBJECT-TYPE
    SYNTAX      INTEGER (0..255)
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "results of the last 8 polls in decimal notation"
    -- reflects the results of the last 8 polls as a decimal value where
    -- each of the 8 bits will be set to 1 if the corresponding poll was
    -- successful (i.e. the host was reached and replied) or it will be
    -- set to a value of 0 if the host did not reply.
    -- The last result is represented by the first bit
    -- Examples:
    --   Decimal 239 = Binary 11101111 = the last three polls were
    --     successful, before that there was one failed attempt and
    --     another four successful tries
    --   Decimal    7 = Binary 00000111 = the last five polls failed
    --   Decimal 252 = Binary 11111100 = the last six polls were successful
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 8 }

xxx Why isn't this a bit string? Or, do you expect the agent to convert 
to eye candy?
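
If the register is delivered as raw bits, the manager can do its own 
decoding. The Python sketch below assumes the register shifts left on 
each poll so that the most recent result sits in the least significant 
bit; the function names are hypothetical.

```python
def decode_reach(reach):
    """Return the last eight poll results, newest first, as booleans."""
    if not 0 <= reach <= 0xFF:
        raise ValueError("reach register is 8 bits wide")
    return [bool((reach >> i) & 1) for i in range(8)]

def reach_octal(reach):
    """Format the register as three octal digits."""
    return format(reach, "03o")
```

Under that assumption a value of 7 reads as the three most recent polls 
answered and the five before them unanswered.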

ntpSrvStatusAssocOffsetVal OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Time offset to the association in milliseconds"
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 9 }

xxx The display units for ntpq are in milliseconds for time offsets and 
PPM for frequency offsets. There is no need to do this for SNMP reals; 
seconds and seconds/second would be more appropriate.
 
ntpSrvStatusAssocJitterVal OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Jitter in milliseconds"
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 10 }

ntpSrvStatusAssocDelayVal OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Network delay in milliseconds"
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 11 }

ntpSrvStatusAssocFilterEntries OBJECT-TYPE
    SYNTAX        INTEGER
    MAX-ACCESS  read-only
    DESCRIPTION
      "Number of available entries in the Filter Table for this association"
      -- should be at least 6
    ::= { ntpSrvAssociations 12 }

xxx I don't know what this means. The reachability register and time 
since last update are more revealing. The peer dispersion statistic 
reflects indirectly the number of filter samples and the rank in the 
mitigation algorithms.

ntpSrvStatusAssocFilterTable OBJECT-TYPE
    SYNTAX        SEQUENCE OF ntpAssoFilterEntry
    MAX-ACCESS  read-only
    DESCRIPTION
      "Table of the filter values of currently mobilized associations"
    ::= { ntpSrvAssociations 13 }

ntpSrvAssocFilterEntry   SEQUENCE {
    ntpSrvAssocId                 Integer32,
    ntpSrvFilterIndex            INTEGER,
    ntpSrvAssocFilterOffset        REAL,
    ntpSrvAssocFilterDisp        REAL,
    ntpSrvAssocFilterDelay        REAL
}

xxx You need jitter here, too.

ntpSrvAssocFilterIndex OBJECT-TYPE
    SYNTAX      INTEGER
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Index for the Filter Table"
    -- the table row representing the filter values for the latest poll
    -- will have FilterIndex = 0, the oldest row has
    -- FilterIndex = FilterEntries
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 14 }

xxx This is misleading. The table returned should be in the order of 
arrival. The filter order is not normally useful, just the sample 
actually selected, and the statistics for that one are shown in the 
peer variables.

ntpSrvAssocFilterOffset OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Filter offset"
    -- "Offset" column of the filter table
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 15 }

ntpSrvAssocFilterDisp OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Filter dispersion"
    -- "Dispersion" column of the filter table
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 16 }

ntpSrvAssocFilterDelay OBJECT-TYPE
    SYNTAX      REAL
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Filter delay"
    -- "Delay" column of the filter table
    DEFVAL { 0 }
    ::= { ntpSrvAssociation 17 }


--
-- Section 4: Server SNMP trap definitions
--
-- xxxTODOxxx : define Payload

ntpSrvTraps     OBJECT IDENTIFIER  ::= { ntpSnmp 4 }
   
ntpSrvTrapNotSync NOTIFICATION-TYPE
    STATUS      current
    DESCRIPTION
        "trap to be sent when NTP is not synchronised "
    ::= { ntpSrvTraps 1 }

xxx The original intent of the NTP traps issue was as a notification that 
some component changed state or some value became out of tolerance. The 
idea was to provide a handle so that a human or AI program would know 
what MIB queries to send for further information. So at least the trap 
should include the association ID or zero and the name of the variable 
which has changed state or gone out of tolerance. It is likely that the 
human or AI program will have a menu that says if this trap with this 
name is received, then issue one or more queries and format the results.
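
A minimal Python sketch of such a menu follows. It assumes the trap 
carries only an association ID (zero for a system event) and a variable 
name. The object names are borrowed from the draft, the table contents 
are invented for illustration, and appending the association ID as an 
instance suffix is an assumption about how the table would be indexed.

```python
from dataclasses import dataclass

@dataclass
class Trap:
    assoc_id: int    # 0 for a system-level event
    variable: str    # variable that changed state or went out of tolerance

# Hypothetical menu: trap variable -> follow-up objects to query.
MENU = {
    "ntpSrvStatusCurrentState": [
        "ntpSrvStatusStratum",
        "ntpSrvStatusActiveRefclockId",
        "ntpSrvStatusActiveOffset",
    ],
    "ntpSrvAssocReachability": [
        "ntpSrvAssocStratum",
        "ntpSrvAssocOffset",
    ],
}

def queries_for(trap):
    """Return the list of objects to query in response to a trap."""
    names = MENU.get(trap.variable, [trap.variable])
    if trap.assoc_id == 0:
        return list(names)
    return [f"{name}.{trap.assoc_id}" for name in names]
```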

The most crucial traps are when the object first comes up or voluntarily 
exits. At present a voluntary exit happens only in ntpdate mode, which 
does not seem of interest trap-wise, and when the panic threshold is 
exceeded. Other obvious events are when first synchronized, when all 
sources have become unreachable and when all distances have exceeded 
the distance threshold. I don't think you need more than that for 
state-change events.

For out of tolerance traps, consider the step, stepout and panic 
thresholds and distance threshold for each source. These are all one 
trap with the name of the associated MIB variable and association ID. 
You also need the clock state number in the MIB and a trap when it 
changes value. You might need another one for the frequency if it hits 
the limit.
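
The single out-of-tolerance trap could be generated along these lines. 
This Python sketch is an illustration only: the variable names are 
hypothetical, the default threshold values (0.128 s step, 1000 s panic, 
1 s distance) are assumptions, and the stepout threshold is omitted 
because it is a time interval rather than an offset.

```python
def out_of_tolerance(offset, distances, step=0.128, panic=1000.0,
                     max_distance=1.0):
    """Return (variable_name, assoc_id) pairs for every exceeded threshold.

    offset is the system clock offset in seconds; distances maps each
    association ID to its synchronization distance in seconds.
    """
    events = []
    if abs(offset) > panic:
        events.append(("panicThreshold", 0))
    elif abs(offset) > step:
        events.append(("stepThreshold", 0))
    for assoc_id, distance in sorted(distances.items()):
        if distance >= max_distance:
            events.append(("distanceThreshold", assoc_id))
    return events
```

Each pair maps onto one trap carrying the variable name and the 
association ID, with zero standing for the system as a whole.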

