[ntp:questions] refclock use causes core dump of ntpd

wa6zvp wa6zvp at gmail.com
Fri Feb 23 15:49:16 UTC 2007


On Feb 23, 3:43 am, Ronan Flood <use... at umbral.org.uk> wrote:
> "wa6zvp" <wa6... at gmail.com> wrote:
> > ** Yes. I remembered correctly. up-type = t_unknown, up->state =
> > s_Base, and event = e_Poll.
> >  Very bizarre.  See below.
>
> That is odd; as I said before the first thing that should happen
> is for it to get e_Init from true_start() and go into s_InqGOES.

* Oh, it does pass through doevent a couple of times before the crash.
 The above are the values on the call that will end up at abort().

> > * Despite the fact that doevent is entered with valid params for type
> > and state,
> > that is, t_unknown and t_Base, it still gets to the abort statement at
> > the end
> > of the function whenever its called with event = e_Poll.
>
> > Nowhere in this function is this event handled or used.  Not even
> > mentioned.
>
> Sure it is, line 629 in "case t_goes:".

* Doh! I missed that.

> It struck me last night (duh) that this unexpected code flow could be
> an effect of optimisation: if gcc recognises abort() and knows that it
> will not return, it could convert all the earlier calls to it into
> a jump to this one.  I'm assuming ntpd is being compiled with gcc -O2
> or similar, so if you want to persist you could try taking that out.

* I remembered yesterday as well that optimisation can mess up
debugging so
 I removed the -O2 from the cflags.  Made no change to the symptoms at
all.

> > So, I backed out my hack from last night, and now have
> > simply commented out the one and only line that calls
> > doevent with event=e_Poll.  Line 540 in function true_receive.
>
> And your clock works with that?

* Yea, it works fine.:)

> The e_Poll presumably needs to be there for t_goes, so the proper
> solution to this would seem to be to have e_Poll ignored where not
> expected.  It only seems to be t_unknown where that doesn't happen
> currently, so a simple fix like this might do the job:
>
> --- refclock_true.c.orig        Wed Feb 25 05:58:17 2004
> +++ refclock_true.c     Fri Feb 23 11:18:07 2007
> @@ -703,6 +703,8 @@
>                 }
>                 break;
>             case t_unknown:
> +               if (event == e_Poll)
> +                   break;
>                 switch (up->state) {
>                     case s_Base:
>                         if (event != e_Init)
>
> (Saves dealing with it in every case s_* within t_unknown)

* Yea, that looks simple enough.  I'll try that today.

> On the other hand, converting every abort() to break as you did
> to start with could be the right answer after all :-/
>
> --
> Ronan Flood <use... at umbral.org.uk>

* Heh, that was effective too.:)

I am also of the opinion that refclocks should not call abort (or
exit).
Simply not getting data for the clock should be indication enough that
something is wrong.

Thanks for your assist in this, Ronan.

Roger




More information about the questions mailing list