[ntp:hackers] nVidia CUDA GPU - wow - NTP never had it this good before.

tglassey tglassey at glassey.com
Sun Jan 24 20:37:19 UTC 2010


On 1/23/2010 11:05 PM, Hal Murray wrote:
>    
>> So folks - I have a functional CUDA based NTP server and its
>> performance is incredible.
>>      
> Thanks.
>
> That sounds like a neat hack, but could you please say a bit more...
>    
Yes - but you have to pull out your old Amdahl or IBM manuals and look 
at something like the 470V4's composite peer switching service, which is 
sorta what I am doing with a segmented industrial backplane. Many people 
know I wanted to build embedded peers as a time-stamping solution.

The embedded peer thing we designed was that card. But I also wanted to 
look at how to put a highly precise clock service into a module that 
already had a multi-million-unit footprint in the marketplace, and CUDA 
as a co-processor popped up. The idea is to create a model where the 
policy controls can be shown to be electrically separate from the 
service host by using the time attestation.

So I call this nVidia version of NTP "n2TP", or "nNTP" for those who 
insist on doing it full out...
> How are you measuring performance?  Packets per second?
Yes, both, but also context-switch overhead and in-process latency.
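
To make that concrete, a measurement harness could look something like 
the sketch below. This is a hedged illustration only: query_time_kernel, 
the 256-client batch size, and the iteration count are all invented 
stand-ins, not the actual code. The point is just timing the GPU-side 
query path with CUDA events.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the time-query path: each thread models one client
// reading the shared time-of-day word kept by the timekeeper.
__global__ void query_time_kernel(const unsigned long long *tod,
                                  unsigned long long *out)
{
    out[threadIdx.x] = *tod;
}

int main()
{
    const int iters = 10000;
    unsigned long long *tod, *out;
    cudaMalloc(&tod, sizeof(*tod));
    cudaMalloc(&out, 256 * sizeof(*out));
    cudaMemset(tod, 0, sizeof(*tod));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        query_time_kernel<<<1, 256>>>(tod, out);  // one 256-client batch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("mean %.3f us per 256-client query batch\n",
           ms * 1000.0f / iters);

    cudaFree(tod);
    cudaFree(out);
    return 0;
}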

The intent is to turn the 4.2.4p8/4.2.5 image into an appliance and run 
it in an assigned GPU as a permanent reference. Then it's all about the 
external reference clock, which can be plumbed in through the DVI 
connectors pretty easily too, so... The real win is in being able to 
create a large core-based affinity model to split AutoKEY processing off 
of the time-keeping GPUs. In a quad-GPU card my model is two GPUs each 
for timekeeping and AutoKEY overhead, and since they all exist 
electrically within the same infrastructure the process is doable.
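
The shape of that split, hedged: the device numbering, kernel names, and 
the 2+2 role assignment below are my assumptions for illustration, not 
the shipped design.

#include <cstdio>
#include <cuda_runtime.h>

// Stubs for the two service roles; the real bodies would carry the
// timekeeping loop and the AutoKEY crypto respectively.
__global__ void timekeeping_service() {}
__global__ void autokey_service() {}

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > 4) n = 4;                     // quad-GPU card model
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);               // bind host context to one GPU
        if (dev < 2)
            timekeeping_service<<<1, 32>>>();  // GPUs 0-1: timekeeping
        else
            autokey_service<<<1, 32>>>();      // GPUs 2-3: AutoKEY offload
    }
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();          // wait for each GPU's stub
    }
    printf("role-split services launched on %d GPU(s)\n", n);
    return 0;
}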

Then you can talk to it across a socket or pipe from a BSD type 
interface model as an embedded service. Cool eh? The reality is that 
with this model the core SW clock system will never get taxed - it's 
just never going to happen, because any number of client threads (the 
unused GPU threads) can interrogate the time-keeper threads in the nNTP 
core time service model as to what time of day it is. This means that 
functionally it is going to be really hard to saturate this NTP service. 
So four GPUs on a system is a full-blown timing ensemble with a 
fail-over hot spare on the card.
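
A minimal sketch of that BSD-style front end, assuming the GPU service 
publishes its time word into host-visible memory: the port number, the 
wire format, and the stand-in counter are all invented for illustration; 
a real deployment would of course speak NTP on port 123.

#include <cstdint>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main()
{
    // In the real design this word would live in mapped (zero-copy)
    // memory that a resident GPU kernel keeps updated; a plain counter
    // stands in here.
    volatile uint64_t gpu_tod = 0;

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12300);         // illustrative port, not NTP's 123

    bind(s, (sockaddr *)&addr, sizeof(addr));
    for (;;) {
        char buf[48];
        sockaddr_in peer{};
        socklen_t plen = sizeof(peer);
        ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                             (sockaddr *)&peer, &plen);
        if (n < 0) break;
        uint64_t now = gpu_tod;           // snapshot the GPU-kept time word
        sendto(s, &now, sizeof(now), 0, (sockaddr *)&peer, plen);
    }
    close(s);
    return 0;
}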

The NTP service daemon is split into threads which interrogate an 
internal time-keeping loop whose only purpose is to continuously update 
the OS Time Of Day counter as well. Additionally, you could obviously 
throw in 16 or so programmable counters for fun too. So yes, GPU cores 
offer a new and possibly holy-grail approach to implementing an I/O-based 
reference time service model. The key aspects then are in how to layer 
the services inside the GPU itself.
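
As a sketch of that inner loop: the counter semantics, the dividers, and 
the use of the raw GPU cycle counter (clock64) are my assumptions; a 
real reference clock would discipline this rather than let it free-run. 
Note a resident loop on a display GPU has to be stopped before the 
driver watchdog kills it.

#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

#define NCTR 16

__global__ void timekeeper(volatile unsigned long long *tod,
                           volatile unsigned long long *ctr,
                           const unsigned long long *divider,
                           volatile int *stop)
{
    while (!*stop) {                      // resident loop; never yields
        unsigned long long now = clock64();  // raw GPU cycle counter
        *tod = now;                       // publish the time-of-day word
        for (int i = 0; i < NCTR; ++i)
            ctr[i] = now / divider[i];    // 16 derived "programmable" counters
    }
}

int main()
{
    // Mapped (zero-copy) host memory so the CPU can watch the counters
    // while the kernel stays resident on the GPU.
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation

    unsigned long long *tod, *ctr, *divider;
    int *stop;
    cudaHostAlloc(&tod, sizeof(*tod), cudaHostAllocMapped);
    cudaHostAlloc(&ctr, NCTR * sizeof(*ctr), cudaHostAllocMapped);
    cudaHostAlloc(&divider, NCTR * sizeof(*divider), cudaHostAllocMapped);
    cudaHostAlloc(&stop, sizeof(*stop), cudaHostAllocMapped);
    for (int i = 0; i < NCTR; ++i) divider[i] = 1000ULL << i;
    *stop = 0;

    timekeeper<<<1, 1>>>(tod, ctr, divider, stop);  // single resident thread
    sleep(1);                             // let it tick for a second
    printf("tod=%llu ctr[0]=%llu\n", *tod, ctr[0]);
    *stop = 1;                            // ask the loop to exit
    cudaDeviceSynchronize();
    return 0;
}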
>   Cycles per packet?
> (cycles of the main CPU)
>    
Cycles on the main CPU aren't really an issue. Check out the use of the 
Quadro FX 5800 for this. NTP runs fully inside the GPU itself. The 
stability is unbelievable. The NTP service becomes a bus-level NTP 
service for somewhere between $2000 and $3000 in retail models with this 
SW service. This isn't something we can hide from; it's already here.


http://www.nvidia.com/object/product_quadro_fx_5800_us.html

http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=4311512&CatId=3599

What is happening is that people like Portwell (http://www.portwell.com) 
are quietly integrating larger segmented PCI Express x16 based boards 
which have 4 or 8 PCIe x16 bridges on them, with four or more channels 
capable of running at full x16 speed. This can be functionally segmented 
so that the PCIe bus between the cells looks like a network and the 
cells themselves look like a mainframe with its controllers spread 
across the local bus therein.

So enough of the vapor-architecture speak; let's talk real world... Try 
sticking four Quadro 5800 cards into a small box with a single bus 
segment, or a larger one with four or eight segments, each with a set of 
four Quadros, each with their own 4 GB of local memory, and a shared 
PCIe firmware ROM card as well (x1 or x2 speed is fine for loading the 
bus interface BIOS, the NTP service, and the run-time interface to the 
PCIe bridge's drivers). Drop a NIC card in, or use the hosting system's 
(see the Portwell-type PICMG hosts for this final role), and the picture 
is complete.
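
The first step on any such chassis, whatever the vendor, is just seeing 
what the bus exposes. A stock enumeration loop does it; nothing 
Portwell-specific is assumed here.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);               // how many GPUs the bus exposes
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("GPU %d: %s, %zu MB, %d multiprocessors, PCI %02x:%02x\n",
               i, p.name, p.totalGlobalMem >> 20, p.multiProcessorCount,
               p.pciBusID, p.pciDeviceID);
    }
    return 0;
}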


This system will, with what is described above, run any of the popular 
OSes, including Windows, and will also allow for massive compute 
capability. The idea is to create a supercomputer built from roughly 
4.2 TFLOP cells, which is what four of the Q5800-class cards produce in 
the Tesla-type products from nVidia directly.

The model I have expands this by allowing for an n-way setup with 
N <= 64 (64 x ~4.2 TFLOPS is roughly 269 TFLOPS - a lot of cycles).
>
> Most GPU cards don't have network hardware.
Yes, the GPU looks like an unbundled ALU, but it also has firmware 
capability, in that certain implementations allow more of the CUDA 
functionality to be accessed, while some only support the 
graphics-adapter application and some buffering or minimal runtime RAM - 
that being the gigabytes of multi-ported RAM that the GPUs share with 
the DVI translators which provide their digital video outputs. My intent 
was to run NTP as if it were a permanently running diagnostic in the GPU 
firmware itself, to really make this solution rock.

And hey - since NTP is only a couple hundred K, especially when the 
fixed parts of it are split off into ROM-based demand-load segments, 
this looks like it might rock as well. Relax, folks - I am also looking 
to see if I can bury an Oracle Concurrent Manager into the same 
infrastructure, so in one fell swoop I can compartmentalize and 
turbocharge Oracle operations too...

So yes, this is going to get done. I would actually like to talk about 
setting up a sub-group or focus group to pick this up as an official 
project, if the group is interested.

Why the GPU? As you note it's not even a whole computer... but as it 
happens it has enough of what we need to run NTP as an embedded service, 
if we pull this off 'propah'.

> How are the packets getting
> to/from the internet and your code running on the GPU?
>    
The CUDA loader loads the application into pre-reserved display memory, 
set aside as part of the permanent display stack area. The 
timing-increment loop and all other parts of the control process stay in 
the fast dual-ported memory and execute the interrupt-acknowledgment 
routine therein.
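
A hedged sketch of that residency idea: stage a control block once into 
reserved device memory and keep the tight loop's working state in fast 
on-chip shared memory. The ControlBlock layout and the arithmetic in the 
loop are invented placeholders, not the real control process.

#include <cuda_runtime.h>

// Invented layout; stands in for the state the loader stages.
struct ControlBlock {
    unsigned long long epoch;
    unsigned long long tick;
};

__global__ void service_loop(const ControlBlock *cb,
                             volatile unsigned long long *tod,
                             volatile int *stop)
{
    __shared__ ControlBlock local;        // working copy in fast on-chip RAM
    if (threadIdx.x == 0)
        local = *cb;                      // staged once from device memory
    __syncthreads();
    while (!*stop)
        *tod = local.epoch + clock64() * local.tick;  // stand-in work
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation

    ControlBlock host_cb = {0, 1};
    ControlBlock *cb;
    unsigned long long *tod;
    int *stop;
    cudaMalloc(&cb, sizeof(*cb));         // the "pre-reserved" device region
    cudaMemcpy(cb, &host_cb, sizeof(*cb), cudaMemcpyHostToDevice);
    cudaHostAlloc(&tod, sizeof(*tod), cudaHostAllocMapped);
    cudaHostAlloc(&stop, sizeof(*stop), cudaHostAllocMapped);
    *tod = 0;
    *stop = 0;

    service_loop<<<1, 1>>>(cb, tod, stop);  // resident loop
    *stop = 1;                            // demo: shut it down immediately
    cudaDeviceSynchronize();
    cudaFree(cb);
    return 0;
}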

By splitting AutoKEY off into a separate thread process (which I am not 
totally sure about yet), it should also be possible to unbundle it from 
the GPU the time service runs in. I am also thinking that the client 
portion of the process, which asks the reference ISR for the time, needs 
to be split into threads as well, so we can massively replicate those 
query threads checking in with the reference timekeeper thread.
> ntpd, or at least the reference implementation, is single threaded.
Not mine.
>   What are
> you doing with the other 99 cores on the GPU?
Let's talk about that... NTP should be at least three main threads - the 
Time Keeping Loop, the Time Query Service Loop, and the 
AutoKEY/Entitlement Control Services - so these can all be unbundled 
into the GPU's massive number of available threads. We can also 
replicate the service across a multi-GPU card to create a peering model 
on a single board, meaning an ensemble can be a single card with this 
technology. So what else? Peering inside the GPU... ahahahahah... As 
graphics dual-ported memory gets faster and faster, the clock-tracking 
capability of this firmware/SW-based system will get tighter and tighter 
too, so this also kicks booty here....
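
In kernel form the three-way split might dispatch by block, as below. 
The scheduling scheme is my assumption; the post names the three 
services but not how they would be scheduled on the hardware.

#include <cuda_runtime.h>

// Block 0: Time Keeping Loop. Block 1: Time Query Service. Block 2:
// stands in for the AutoKEY/Entitlement work. All spin until stopped.
__global__ void ntp_services(volatile unsigned long long *tod,
                             volatile unsigned long long *replies,
                             volatile int *stop)
{
    if (blockIdx.x == 0) {
        while (!*stop)
            *tod = clock64();             // keep the time word fresh
    } else if (blockIdx.x == 1) {
        while (!*stop)
            replies[threadIdx.x] = *tod;  // each thread serves one client
    } else {
        while (!*stop) { /* AutoKEY signing/validation would run here */ }
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    unsigned long long *tod, *replies;
    int *stop;
    cudaHostAlloc(&tod, sizeof(*tod), cudaHostAllocMapped);
    cudaHostAlloc(&replies, 64 * sizeof(*replies), cudaHostAllocMapped);
    cudaHostAlloc(&stop, sizeof(*stop), cudaHostAllocMapped);
    *stop = 0;
    ntp_services<<<3, 64>>>(tod, replies, stop);
    *stop = 1;                            // demo: stop right away
    cudaDeviceSynchronize();
    return 0;
}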

And as to the number of cores still left over, it's worse than that: the 
Quadro 5800 cards have 240 cores available... so we split all of the 
other fun stuff in NTP out so we can multi-thread or unbundle it. Hey, 
CDSA will finally collide with NTP - what a killer idea.

Unbundle AutoKEY, timestamping, and content validation; build in DRM and 
real evidence capabilities... This is as revolutionary to us as the PC 
and the Intel 8080 were in massively collapsing the mainframe into a PC 
emulator of that mainframe. Now that mainframe technology is rampant in 
everything - let's embrace its capabilities with timing and secure 
evidence control therein...

(Damn, sorry but I couldn't resist the soap box).

Todd


