[ntp:hackers] [gpsd-dev] SHM refclock improvements

juergen perlinger juergen.perlinger at t-online.de
Fri Aug 31 18:52:19 UTC 2012


On 08/29/2012 09:37 AM, Terje Mathisen wrote:
> Hal Murray wrote:
>>> How will we be able to know if an ntpd is capable of the new SHM mode??
>>
>> That's a key question.  I don't have any great ideas.
>>
>> Plan one is to release a new ntpd that handles the new mode, wait for
>> it to get widely distributed, then release a new gpsd that takes
>> advantage of it.  That's likely to take a long time.
>>
>> The "wait" could be reduced if ntpd and gpsd could get coordinated on
>> handing new releases to the upstream distributions.
>>
>> We could use 2 chunks of shared memory.  gpsd writes to both, ntpd
>> uses a mode bit or tries both.
>
> Either using a new chunk, or by extending the old one, appending the
> new format data.
>>
>> If we do that, the new format should probably include some way of
>> communicating what versions are supported so we don't have to do this
>> dance again.
>
> I did a bunch of tests in April 2011 which convinced me that the
> proper way to do this is actually fairly simple:
>
> Allocate a (small!) array of timestamp records in a shared memory
> segment, positioned so that each record is aligned with one or two
> cache lines (i.e. 64/128-byte alignment). 4 or 8 records is a very
> nice size.
>
> struct timeStamp {
>         // Uncomment the filler to give each record its own 64-byte
>         // cache line:
>         // int64_t filler[4];
>         int64_t clockTimeStampSec;
>         int64_t receiveTimeStampSec;
>         int32_t clockTimeStampNSec;
>         int32_t receiveTimeStampNSec;
>         int32_t leap;
>         int32_t precision;
> };
>
> This is used like this, with a header record followed by an array of
> those timestamps:
>
> struct shmTime {
>         int64_t marker; /* 0x50544E2D44535047 == "GPSD-NTP" in LE */
>         int32_t mode;   /* 4 + TIMESTAMP_BUFFER_SIZE * 256 */
>         int32_t count;
>
>         int32_t nsamples;
>         int32_t valid;
>         int64_t filler[5];      // Make the header part 64 bytes
>
>         struct timeStamp ts[TIMESTAMP_BUFFER_MAX_SIZE];
> };
>
> The key is/was as usual that the (potentially many) readers never
> update/write anything, while the single writer fills in all the data
> first, writing to the _next_ record, then does a (write) barrier,
> before finally incrementing the count:
>
>     int32_t counter = shm->count;
>         // Get the index of the next record to be used:
>     int32_t c = (counter + 1) & timestamp_buffer_mask;
>
>     int64_t qpc;
>     int32_t ns;
>     // Get the current sec and ns value here!
>     ...
>
>     // Fill in the record:
>
>     shm->ts[c].clockTimeStampSec = qpc;
>     shm->ts[c].receiveTimeStampSec = qpc;
>     shm->ts[c].clockTimeStampNSec = ns;
>     shm->ts[c].receiveTimeStampNSec = ns;
>     shm->ts[c].precision = -19;
>     shm->ts[c].leap = 0;
>
>     __mybarrier();
>     shm->count++;
>
> The last two instructions can easily be replaced with a locked
> fetch_and_add(shm->count, 1) operation.
>
> The reader's task is very simple:
>
> Wait until the count value has been updated, then grab the
> corresponding record, before re-reading the count value:
>
> If the count is still the same, then everything is fine and the record
> is valid. If it has been updated (by the writer) in the meantime then
> it is probably still OK:
>
> As long as the count has been updated by less than the array size - 1,
> the writer cannot have wrapped around and started to reuse the record
> we just read, so everything is still valid.
>
> Otherwise just go back up and read again from the top.
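
For illustration, the reader side of that scheme could look roughly like
this -- an untested sketch, reusing the shmTime/timeStamp layout quoted
above, with GCC's __sync_synchronize() (>= 4.1) standing in for
__mybarrier() and the wait for a fresh count left out:

    #include <stdint.h>

    /* Copy the newest completed record out of the shared segment.
     * buffer_mask is the (power-of-two) buffer size minus one. */
    static void read_latest(struct shmTime *shm, struct timeStamp *out,
                            int32_t buffer_mask)
    {
        for (;;) {
            int32_t before = shm->count;       /* counter before the copy */
            int32_t c = before & buffer_mask;  /* newest completed slot   */

            __sync_synchronize();              /* also a compiler barrier */
            *out = shm->ts[c];                 /* grab the candidate      */
            __sync_synchronize();

            int32_t after = shm->count;        /* counter after the copy  */

            /* The writer fills slot (count + 1) & mask before bumping
             * count, so the copy is intact unless count advanced by
             * buffer size - 1 or more while we were reading. */
            if ((uint32_t)(after - before) < (uint32_t)buffer_mask)
                return;
            /* Possible overwrite: go back up and read again. */
        }
    }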
>
There's another strategy that is IMHO slightly more robust:

Have the data record start with a guard value, which is the sequence
number (== counter) of the update.
Have N slots, where the slot index 'i' for an update count 'c' is given
by i == c % N. Choose N as a power of two for speed.

The writer does:
 1) generate the next sequence (counter) value in a local copy.
 2) write the new sequence number to the associated slot, followed by a
(write) barrier. This can be done with a locked exchange operation.
 3) write the payload of the SHM slot in any order you like.
 4) issue a write barrier and update the count. (As stated, this can be
done with a locked increment. Locked exchange is also possible.)

The reader does:
 1) read the SHM counter into a local copy. (No barrier needed, since
there is an explicit data dependency to the next step.)
 2) read the payload.
 3) if possible, issue a read barrier.
 4) read the guard value into a local copy.
 5) repeat steps 1-4 until the guard and counter copies match.

(Steps 3+4 can be combined into a locked exchange into the local guard
value copy. A rough sketch of both sides follows.)
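
A possible shape of that scheme in code -- just a sketch with an assumed
slot layout and GCC's __sync builtins for the barriers, not a proposed
format:

    #include <stdint.h>

    #define NSLOTS  8                      /* power of two */
    #define SLOT(c) ((c) & (NSLOTS - 1))

    struct sample {                        /* payload of one update */
            int64_t clockTimeStampSec;
            int64_t receiveTimeStampSec;
            int32_t clockTimeStampNSec;
            int32_t receiveTimeStampNSec;
            int32_t leap;
            int32_t precision;
    };

    struct slot {
            uint32_t      guard;           /* sequence no. of this update */
            struct sample data;
    };

    struct seg {
            uint32_t    count;             /* last completed sequence no. */
            struct slot ts[NSLOTS];
    };

    /* Writer: guard first, then payload, then publish the counter. */
    static void publish(struct seg *shm, const struct sample *s)
    {
            uint32_t c = shm->count + 1;          /* 1) next sequence     */
            struct slot *sl = &shm->ts[SLOT(c)];

            sl->guard = c;                        /* 2) mark slot busy    */
            __sync_synchronize();

            sl->data = *s;                        /* 3) payload           */

            __sync_synchronize();                 /* 4) barrier, publish  */
            shm->count = c;           /* or __sync_fetch_and_add(&..., 1) */
    }

    /* Reader: counter, payload, barrier, guard -- the reverse order. */
    static void fetch(struct seg *shm, struct sample *out)
    {
            for (;;) {
                    uint32_t c = shm->count;      /* 1) newest sequence   */
                    struct slot *sl = &shm->ts[SLOT(c)];

                    *out = sl->data;              /* 2) copy the payload  */
                    __sync_synchronize();         /* 3) read barrier      */

                    if (sl->guard == c)           /* 4)+5) guard matches? */
                            return;
                    /* Slot was being rewritten: try again. */
            }
    }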

The difference is that this scheme does not need to check the counter a
second time -- an overwrite can be detected for the next 2**32 updates,
so it is not necessary to assume damage just because the counter has
changed. This is ensured by the reader accessing the fields in the
reverse order of the write (update) access, in combination with the
barriers.

Declaring the SHM segment volatile also helps to keep the compiler from
shuffling the code around. It does not help with write-order visibility,
though, so we can't get rid of the barriers. They also keep the CPU from
reordering reads and writes, which is AFAIK not an issue on x86/amd64,
but might give a bite or two on PPC and ARM.
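
For completeness, one way __mybarrier() could be mapped -- a possible
definition, not something either project has settled on:

    /* One possible definition of __mybarrier(): a full memory barrier.
     * A pure compiler barrier (asm volatile("" ::: "memory")) keeps the
     * compiler honest but is not enough on weakly ordered CPUs. */
    #if defined(__GNUC__)
    #  define __mybarrier()  __sync_synchronize()   /* GCC >= 4.1 */
    #elif defined(_MSC_VER)
    #  include <windows.h>
    #  define __mybarrier()  MemoryBarrier()
    #else
    #  error "define a full memory barrier for this compiler"
    #endif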


> In my testing I ran 1 writer thread and 3 or more readers (on a quad
> core cpu), with one core locked to the writer and the three others
> available to the readers, in order to be able to get maximal update
> rates.
>
> C:\>\c2\shm-test\Release\shm-test.exe
> Testing with 4 buffer slots, 4 threads for 60 seconds
> Writer affinity = 1, old = f
> Reader 1 affinity = e, old = f
> Reader 2 affinity = e, old = f
> Reader 3 affinity = e, old = f
> Done waiting 60 seconds
>
> All threads done! (i = 9)
> Writer:
> 862159951 iterations, 1.4369e+007 iterations/second
> Thread 1:
> 1660360860 iterations, 442134918 updating 83317414 retries 0 errors
> Thread 2:
> 873592489 iterations, 482193432 updating 72253986 retries 0 errors
> Thread 3:
> 1600536458 iterations, 430963596 updating 79916881 retries 0 errors
> Reader totals:
> 4134489807 iterations, 1355291946 updating 235488281 retries 0 errors
> Reader totals:
> 6.8908e+007 iterations/second, 32.78% updating  5.70% retries 0.000%
> errors
>
> I.e. with 4 buffer slots and 14 million updates/second, the reader
> threads averaged 33% reads that overlapped with a writer update of the
> count variable. For nearly 6% of the total read operations the writer
> had indeed managed to update twice, forcing a retry, but this never
> led to an inconsistent read.
>
> Using 8 buffer slots (i.e. nearly 600 bytes of shared memory) is
> sufficient to never get a retry unless a thread is descheduled for a
> long time. (I just tried to run 7 reader threads which gave 0.01%
> retries.)
>
> The key here is of course that we have a fairly low number of
> updates/second, with 1 being the most likely value, and even with a 5
> Hz GPS the readers will only need to wake up once a second or so in
> order to be able to grab the latest update.
>
> The other important point is the fact that a shared memory segment
> cannot be smaller than a memory page, which means that 4KB is probably
> the smallest relevant size. This makes it totally free to work with a
> buffer array instead of a single record, removing most of the issues
> with simultaneous reads and writes.
>
Yes. You will get at least one memory page, and since you already pay
the price you can indeed reap the benefits of a round-robin buffer. 16
slots give any reader the chance to retrieve a value without collision
during the next 16 update cycles, which makes such a round-robin scheme
with a guard nearly collision-free for a single-producer,
multiple-consumer application...
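
Attaching such a page-sized segment would look no different from what
the current SHM driver already does -- a sketch, assuming the usual
NTPD_BASE key 0x4e545030 plus unit number and the extended shmTime
layout quoted above:

    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define NTPD_BASE 0x4e545030           /* "NTP0", as used today */

    /* Create/attach the segment for one unit; the kernel rounds the
     * size up to a full page anyway, so the ring buffer comes for free. */
    static struct shmTime *attach_unit(int unit)
    {
            int id = shmget((key_t)(NTPD_BASE + unit),
                            sizeof(struct shmTime), IPC_CREAT | 0600);
            if (id == -1)
                    return NULL;
            void *p = shmat(id, NULL, 0);
            return (p == (void *)-1) ? NULL : (struct shmTime *)p;
    }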

Greetings,
    pearly



