[ntp:hackers] nVidia CUDA GPU - wow - NTP never had it this good before.
tglassey at glassey.com
Sun Jan 24 20:37:19 UTC 2010
On 1/23/2010 11:05 PM, Hal Murray wrote:
>> So folks - I have a functional CUDA based NTP server and its
>> performance is incredible.
> That sounds like a neat hack, but could you please say a bit more...
Yes - but you have to pull out your old Amdahl or IBM manuals and look
at something like the 470V4's composite peer switching service, which is
roughly what I am doing with a segmented industrial backplane. Many
people know I wanted to build embedded peers as a time-stamping solution.
The embedded peer we designed was that card. But I also wanted to look
at how to put a highly precise clock service into a module that already
has a multi-million-unit footprint in the marketplace, and CUDA as a
co-processor popped up. The idea is to create a model where the policy
controls can be shown to be electrically separate from the service host
using the time attestation.
So I call this nVidia version of NTP "n2TP", or "nNTP" for those who
insist on doing it full out...
> How are you measuring performance? Packets per second?
Yes, both, but also at the context-switch level and in in-process
latency. The intent is to turn the 4.2.4p8/4.2.5 image into an appliance
and run it in an assigned GPU as a permanent reference. Then it's all
about the external reference clock, which can be plumbed in through the
DVI connector pretty easily too. The real win is in being able to create
a large core-based affinity model to split AutoKEY processing off of the
time-keeping GPUs. On a quad-GPU card my model is two GPUs each for
timekeeping and AutoKEY overhead, and since they all exist electrically
within the same infrastructure the process is doable.
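One possible reading of that per-card role split can be sketched in a few lines. This is a plain-Python model only - the device ids, role names and the `failover` helper are my illustrative assumptions, not any real CUDA device-management API:

```python
# Sketch of a quad-GPU role split: two devices for timekeeping, one for
# AutoKEY overhead, one held as the on-card hot spare mentioned below.
# Device ids and role names are illustrative, not real CUDA constructs.

ROLES = {0: "timekeeper", 1: "timekeeper", 2: "autokey", 3: "hot-spare"}

def failover(roles, failed_gpu):
    """Promote the hot spare to take over a failed device's role."""
    roles = dict(roles)
    spares = [g for g, r in roles.items() if r == "hot-spare"]
    if not spares:
        raise RuntimeError("no spare left to promote")
    roles[spares[0]] = roles[failed_gpu]
    roles[failed_gpu] = "failed"
    return roles
```

The point of the sketch is only that role assignment and fail-over are bookkeeping that can live outside the timekeeping path itself.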
Then you can talk to it across a socket or pipe from a BSD-type
interface model as an embedded service. Cool, eh? The reality is that
with this model the core SW clock system will never get taxed; it's just
never going to happen, because any number of client threads (the
otherwise unused GPU threads) can interrogate the time-keeper threads in
the nNTP core time service as to the time of day. This means it is going
to be functionally very hard to overload this NTP service. So four GPUs
on a system amount to a fully blown timing ensemble with a fail-over hot
spare on the card.
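The "talk to it over a socket or pipe" idea can be sketched with a Unix socketpair, where a thread stands in for the GPU-resident timekeeper. The one-byte query protocol and the wire format here are my own illustrative assumptions, not part of any real nNTP interface:

```python
# Sketch of the socket/pipe query interface: a thread stands in for the
# GPU-resident timekeeper and answers one-byte queries with an NTP-style
# 64-bit timestamp (32-bit seconds, 32-bit fraction). The query byte and
# wire format are illustrative assumptions only.
import socket
import struct
import threading
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900 (NTP era) to 1970 (Unix)

def timekeeper(conn):
    """Answer each 1-byte query with seconds+fraction in the NTP era."""
    while True:
        if conn.recv(1) != b"?":
            break
        now = time.time() + NTP_EPOCH_OFFSET
        secs = int(now)
        frac = int((now - secs) * (1 << 32))
        conn.sendall(struct.pack("!II", secs, frac))

def query_time(conn):
    """Client side: ask the timekeeper for the current NTP timestamp."""
    conn.sendall(b"?")
    secs, frac = struct.unpack("!II", conn.recv(8))
    return secs + frac / (1 << 32)

client, server = socket.socketpair()
threading.Thread(target=timekeeper, args=(server,), daemon=True).start()
```

Any number of such client connections could be opened against the same timekeeper, which is the point of the "never get taxed" claim above.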
The NTP service daemon is split into threads which interrogate an
internal time-keeping loop whose only purpose is to continuously update
the OS time-of-day counter. Additionally, you could obviously throw in
16 or so programmable counters for fun too. So yes, GPU cores offer a
new and possibly holy-trinity approach to implementing an I/O-based
reference time service. The key aspects then are in how to layer the
services inside the GPU itself.
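The split above - one loop whose only job is updating the counter, any number of threads reading it - can be sketched as a minimal Python model (class and method names are mine; this is an architectural sketch, not GPU code):

```python
# Sketch of the timekeeping split: one dedicated thread continuously
# refreshes a shared time-of-day value; query threads only ever read it,
# so clients never block the timekeeper. Names are illustrative.
import threading
import time

class TimeOfDayCounter:
    def __init__(self):
        self._now = time.time()
        self._running = True
        # The timekeeping loop: its only purpose is to keep _now fresh.
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while self._running:
            self._now = time.time()  # single atomic reference swap
            time.sleep(0.0001)       # stand-in for a hardware tick

    def read(self):
        """Query-thread side: a plain read, no lock, no contention."""
        return self._now

    def stop(self):
        self._running = False
```

The design point is that reads are lock-free, so adding query threads costs the timekeeping loop nothing.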
> Cycles per packet?
> (cycles of the main CPU)
Cycles on the main CPU aren't really an issue. Check out the use of the
Quadro 5800 for this. NTP runs fully inside the GPU itself, and the
stability is unbelievable. The NTP service becomes a bus-level NTP
service for somewhere between $2000 and $3000 in retail models with this
SW service. This isn't something we can hide from; it's already here.
What is happening is that people like Portwell (http://www.portwell.com)
are quietly integrating larger segmented PCI Express x16 based boards
which have 4 or 8 PCIe x16 bridges on them, with four or more channels
capable of running at x16 speeds. This can be functionally segmented so
that the PCIe bus between the cells looks like a network, and the cells
themselves look like a mainframe with its controllers spread across the
local bus.
So, enough of the vapor-architecture speak; let's talk real world... Try
sticking four Quadro 5800 cards into a small box with a single bus
segment, or a larger one with four or eight segments each holding a set
of four Quadros, each with their own 4 GB of local memory and a shared
PCIe firmware ROM card as well (x1 or x2 speed is fine for loading the
bus interface BIOS, the NTP service and the run-time interface to the
PCIe bridge's drivers). Drop a NIC card in, or use the hosting system's
(see the Portwell-type PICMG hosts for this final role), and the picture
is complete.
This system will, with what is described above, run any of the popular
OSs including Windows, and will also allow for massive compute
capabilities. The idea is to create a supercomputer which is a 42 TFLOP
cell, which is what four of the Q5800M cards produce in the Tesla-type
products from nVidia directly.
The model I have expands this by allowing for an n-way setup with
N <= 64 (64 x 42 TF is a lot of cycles).
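Taking the quoted per-cell figure at face value, the aggregate is easy to check:

```python
# Back-of-the-envelope check on the figures above, using the quoted
# 42 TFLOP-per-cell number as given (not independently verified).
tflops_per_cell = 42
max_cells = 64
total_tflops = tflops_per_cell * max_cells  # 2688 TF, about 2.7 PFLOPS
```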
> Most GPU cards don't have network hardware.
Yes, the GPU looks like an unbundled ALU, but it also has firmware
capability, in that certain implementations allow more of the CUDA
functionality to be accessed; some only support the graphics-adapter
application and some buffer or minimal runtime RAM, that being the
gigabytes of multi-ported RAM that the GPUs share with the DVI
translators which provide their digital video outputs. My intent was to
run NTP as if it were a permanently running diagnostic in the GPU
firmware itself, to really make this solution rock.
And hey - since NTP is only a couple hundred KB, especially when the
fixed parts of it are split off into ROM-based demand-load segments,
this looks like it might rock as well. Relax, folks; I am also looking
to see if I can bury an Oracle Concurrent Manager into the same
infrastructure, so in one fell swoop I can compartmentalize and
turbocharge Oracle operations.
So yes, this is going to get done. I would actually like to talk about
setting up a sub-group or focus group to pick this up as an official
project, if the group is interested.
Why the GPU? As you note, it's not even a whole computer... but as it
happens it has enough of what we need to run NTP as an embedded service,
if we pull this off 'propah'.
> How are the packets getting
> to/from the internet and your code running on the GPU?
The CUDA loader loads the application into pre-reserved display memory
which is set aside as part of the permanent display stack area. The
timing incremental loop and all other parts of the control process stay
in the fast dual-ported memory and execute the interrupt-acknowledgment
routine there.
By splitting AutoKEY off into a separate thread process (which I am not
totally sure about yet), it should also be possible to unbundle it from
running in the same GPU the time service runs in. I am also thinking
that the client portion of the process, which asks the reference ISR for
the time, needs to be split into threads as well, so we can massively
replicate those query threads checking in with the reference timekeeper.
> ntpd, or at least the reference implementation, is single threaded.
> What are
> you doing with the other 99 cores on the GPU?
Let's talk about that... NTP should be at least three main threads: the
time-keeping loop, the time-query service loop, and the
AutoKEY/entitlement control services, so these can all be unbundled into
the GPU's massive number of available threads. We can also replicate the
service across a multi-GPU card to create a peering model on a single
board, meaning an ensemble can be a single card with this technology. So
what else? Peering inside the GPU... hahahaha... As graphics dual-ported
memory gets faster and faster, the clock-tracking capability of this
firmware/SW-based system will get tighter and tighter too, so this also
kicks booty here....
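The three-way split named above can be sketched with ordinary threads and queues. Note the crypto stand-in: real Autokey uses public-key mechanisms, whereas this sketch uses a shared-secret HMAC purely to show the crypto work moved off the timekeeping path; all names and the single-client queue wiring are my simplifications:

```python
# Sketch of the three-thread split: timekeeping stays elsewhere, the
# query service builds replies, and an AutoKEY-style worker does the
# crypto in its own thread. HMAC over a shared key is a stand-in only;
# real NTP Autokey is public-key based. Single-client sketch.
import hashlib
import hmac
import queue
import threading
import time

SHARED_KEY = b"demo-key"  # illustrative only
sign_requests = queue.Queue()
sign_replies = queue.Queue()

def autokey_worker():
    """Crypto offloaded to its own thread, off the timekeeping path."""
    while True:
        payload = sign_requests.get()
        if payload is None:
            break
        mac = hmac.new(SHARED_KEY, payload, hashlib.sha1).digest()
        sign_replies.put((payload, mac))

def query_service(now_fn):
    """Build a (timestamp, mac) reply without doing crypto inline."""
    payload = repr(now_fn()).encode()
    sign_requests.put(payload)
    return sign_replies.get()

threading.Thread(target=autokey_worker, daemon=True).start()
```

With the crypto thread unbundled like this, the time-query path never pays for authentication work inline, which is the core of the argument for splitting AutoKEY off.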
And as to the number of cores still left over, it's more extreme than
that: the Quadro 5800 cards have 240 cores available... so we split all
of the other fun stuff in NTP out so we can multi-thread or unbundle it.
Hey, CDSA will finally collide with NTP - what a killer idea.
Unbundle AutoKEY, timestamping and content validation; build in DRM and
real evidence capabilities... This is as revolutionary to us as the PC
and the Intel 8080 were in massively collapsing the mainframe into a PC
emulator of that mainframe. Now that mainframe technology is rampant in
everything - let's embrace its capabilities with timing and secure
evidence control.
(Damn, sorry but I couldn't resist the soap box).