[ntp:hackers] Release fixing all issues

Harlan Stenn stenn at ntp.org
Thu Jan 21 09:51:49 UTC 2016


Miroslav Lichvar writes:
> On Thu, Jan 21, 2016 at 03:22:54AM +0000, Harlan Stenn wrote:
> >  https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2015-5300
> > 
> > and practically INSTANTLY see why that fix is inadequate.
> > 
> > I suspect the key issue is knowing what -g is supposed to do.
> 
> Right.
> 
> > -g should allow the first *correction* to exceed the panic gate.
> 
> That doesn't seem to be how -g is described. Quoting ntpd.html:
> 
>   Normally, ntpd exits with a message to the system log if the offset
>   exceeds the panic threshold, which is 1000 s by default. This option
>   allows the time to be set to any value without restriction; however,
>   this can happen only once.
> 
> See the difference?

Not really.

-g was created to avoid the case where folks had to run ntpdate before
starting ntpd.  This is "wasteful" because we need to give ntpd as much
time as possible to calculate drift.

So -g allows ntpd to be started early in the boot sequence, and in the
case where we're dealing with a warm reboot and iburst, ntpd will hit
the "sync" state in about 11 seconds' time.  We've long documented that
we recommend folks boot using ntpd -g as early as possible, finish the
other process startup that does not require accurate time, and then run
'ntpwait' before starting time-critical services.  From that point on,
-g should *not* be provided when restarting ntpd.

This means that the only thing -g was designed to do was to ignore the
panic gate until the first correction was made.  That first correction
could be a step *or* a slew.

Unless something is *terribly* wrong, that initial phase correction
should be sufficient.

ntpd calculates the initial phase correction and once that has been
applied, large or small, the panic gate should be honored.
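
As a rough illustration of the behaviour described above (not ntpd's actual code), the gate can be modeled as a flag that the first applied correction clears. The names ClockModel, PanicExit, and apply are made up for this sketch; only the 1000 s default comes from the ntpd.html text quoted earlier.

```python
PANIC_THRESHOLD_S = 1000.0  # default panic threshold, per the quoted ntpd.html text


class PanicExit(Exception):
    """Stands in for ntpd logging a message and exiting (hypothetical name)."""


class ClockModel:
    """Toy model of the intended -g behaviour; not ntpd's implementation."""

    def __init__(self, g_flag: bool) -> None:
        # With -g the panic gate is ignored until the first correction.
        self.gate_ignored = g_flag
        self.applied = []  # offsets (in seconds) that were accepted

    def apply(self, offset_s: float) -> None:
        if abs(offset_s) > PANIC_THRESHOLD_S and not self.gate_ignored:
            raise PanicExit(f"offset {offset_s:+.3f} s exceeds the panic threshold")
        self.applied.append(offset_s)
        # Any first correction, step or slew, large or small, re-arms the gate.
        self.gate_ignored = False


if __name__ == "__main__":
    clock = ClockModel(g_flag=True)
    clock.apply(5000.0)          # first correction: accepted despite its size
    try:
        clock.apply(5000.0)      # second large offset: the gate is honored
    except PanicExit as exc:
        print("exit:", exc)
```

In this model a small first correction also uses up the -g allowance, so a large offset arriving second causes an exit.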

> The trouble with enabling the panic threshold right after the first
> clock update, at least in 4.2.6, is that it doesn't work well with the
> LOCAL driver. If the initial offset is larger than 1000 seconds and
> the first update comes from the driver (e.g. it's specified in the
> config before other sources or the network comes up couple seconds
> later after ntpd start), the second update (from a real source) will
> cause ntpd to exit.

There's also a reason we stopped recommending the LOCAL driver in favor
of orphan mode, also quite a while ago.

And yes, if one uses the LOCAL refclock and ntpd then synchronizes to
time that is outside the panic gate, ntpd will abort.  That's a feature.

If somebody has a setup of this level of complexity, they should also
have, or be able to ask for, competent engineering design help to make
sure their needs and expectations are met.

If 4.2.8 fixes more of these problems, I'm not surprised.  We fixed over
1100 bugs between 4.2.6 and 4.2.8.  Enough of these were in the
algorithmic and protocol code.

> Allowing a large step in first two updates instead of one doesn't help
> the attacker much, but breaking a valid use case would be a problem
> for a security update. I know you are trying to deprecate the LOCAL
> driver. There are people who still use it, for good or bad reasons.

There are two things going on in that paragraph that I don't fully
understand.

First, what is this "valid use case" that is apparently being broken by
a security update?

Second, the LOCAL driver.  Folks should be making conscious decisions
about what they are doing, and why.  We aim to offer a wide variety of
"mechanism" choices here, so folks have the ability to implement their
local "policy" choices.

If there are use-cases there that we have not addressed, we haven't
heard of them.

What's the problem here?  Is it that people are not properly
understanding the risks of their choices?  If so, what can we do beyond
offering them resources to make informed choices?

Is it that the risks and scenarios are inadequately or improperly
documented?  In that case let us know and help us (and "you") to improve
the documentation.

Put another way, if people are making these choices for bad reasons,
well, what do they expect?  If they are making these choices for good
reasons, are there better reasons to make better choices?

Sometimes reality bites.  If reality is biting them, they have the
option to be accountable for their choices.

> Anyway, this applies to 4.2.6 and older. In 4.2.8 it doesn't really
> matter which patch is applied. Large step would still be allowed only
> on the first update as there is no case in the loopfilter for the FSET
> state when offset is smaller than 0.128 second and allow_panic is set
> to false in the default case.

I disagree with you here, in part.  With our patch, the panic gate is
reset quickly.  That means a reduced window of opportunity to attack the
client.

With your patch, I *think* a bad guy (worst case, MITM) can feed small
corrections to the client for an arbitrary amount of time, and then send
an *arbitrarily large* step correction and it will be accepted.
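
The concern can be sketched as a comparison of two re-arm policies. This is a model of the worry as stated, not of either patch's code: the function name, the rearm_on switch, and both policy labels are hypothetical, and whether any real patch behaves like "first_step" is exactly the point in dispute. The 0.128 s step threshold is the figure from the quoted text.

```python
PANIC_THRESHOLD_S = 1000.0  # default panic threshold
STEP_THRESHOLD_S = 0.128    # step threshold mentioned in the quoted text


def accepted_offsets(offsets, rearm_on):
    """Return the offsets a client started with -g accepts before exiting.

    rearm_on is a hypothetical policy switch for this sketch:
      "first_correction": any first correction re-arms the panic gate
      "first_step":       only a correction above the step threshold does
    """
    if rearm_on not in ("first_correction", "first_step"):
        raise ValueError(f"unknown policy: {rearm_on}")
    gate_ignored = True  # started with -g
    accepted = []
    for off in offsets:
        if abs(off) > PANIC_THRESHOLD_S and not gate_ignored:
            break  # panic: the daemon would exit here
        accepted.append(off)
        if rearm_on == "first_correction" or abs(off) > STEP_THRESHOLD_S:
            gate_ignored = False
    return accepted


# An attacker feeds small corrections, then one very large one (a full day).
attack = [0.01, -0.02, 0.015, 86400.0]
print(accepted_offsets(attack, "first_correction"))
print(accepted_offsets(attack, "first_step"))
```

Under "first_correction" the large offset is refused, because the first small correction already re-armed the gate. Under "first_step" the gate stays open for as long as the corrections stay small, so the large offset is accepted.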

Also note that if systems always supply -g in the restart case, *against
our long-standing recommendation*, then all a bad guy needs to do is
get ntpd to abort because of a panic gate issue; when ntpd is
automatically restarted with -g, the bad guy can set the clock to an
arbitrary value.
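
A toy supervisor loop makes the restart scenario concrete. Everything here is an assumption for illustration (the function name, the restart_with_g switch, and the idea that the attacker keeps offering the same offset after a restart); it is not how any particular init system or ntpd itself is implemented.

```python
PANIC_THRESHOLD_S = 1000.0  # default panic threshold


def run_supervised(offsets, restart_with_g):
    """Return (total correction applied in seconds, number of panic exits)."""
    gate_ignored = True  # the initial boot uses -g, as recommended
    total = 0.0
    panics = 0
    pending = list(offsets)
    while pending:
        off = pending[0]
        if abs(off) > PANIC_THRESHOLD_S and not gate_ignored:
            panics += 1  # the daemon exits on the panic gate
            if not restart_with_g:
                break  # restarted without -g it would only panic again
            gate_ignored = True  # restarted with -g: the gate is open again
            continue  # the attacker offers the same offset once more
        total += off
        gate_ignored = False  # the first correction re-arms the gate
        pending.pop(0)
    return total, panics


attack = [0.01, 86400.0]  # one honest correction, then a bogus one-day offset
print(run_supervised(attack, restart_with_g=True))
print(run_supervised(attack, restart_with_g=False))
```

With restart_with_g=True a single panic exit is enough for the bogus offset to be applied. With restart_with_g=False the clock keeps only the honest correction and the failure is left for an operator to notice.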

This is why we have long been telling people to monitor their ntpd
instances and to have an adequate number of time sources.

Am I missing something?

H

