Powering the next-gen of deep learning: Pascal and CUDA 8
A lot of the time in deep learning, the GPU is abstracted away by whatever framework you happen to be using. Sure, you run out of memory and drop the batch size or some other parameter, or you look at nvidia-smi and wonder why the GPU isn't running at 100%, but there simply isn't much focus on the nuts and bolts of CUDA programming or on what the GPU is actually doing.
This year at GTC 2016 in San Jose, I decided to attend two reasonably detailed presentations on the Pascal hardware and CUDA 8 runtime that sits above it, to find out a bit more about each. I took quite a few photos of slides and without any further ado, here they are, accompanied by brief remarks where I think they are appropriate :)
The presenters were Mark Harris (CTO of GPU programming at Nvidia) and Lars Nyland (Senior Architect at Nvidia).
The Streaming Multiprocessor breakdown (SM)
Slide 1. The four pillars of the talk - not quite as sensational as the five miracles that Jen-Hsun Huang referred to in the keynote this morning, but close enough!
Slide 2. Counter-intuitively, a GP100 SM has fewer cores than a Maxwell SM (half, in fact: they split a Maxwell SM in two to get two Pascal SMs). But each core has more resources (registers, memory) around it, resulting in higher thread occupancy. There is also double the bandwidth to memory and less contention, because there are fewer cores to fight over it. All of this comes together to give increased utilisation of the cores. I guess less truly is more in this case.
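To put rough numbers on "more resources per core", here is a small Python sketch. The per-SM figures are not from the slides; they are assumed from NVIDIA's published Maxwell and GP100 specifications (128 versus 64 FP32 cores, with a 256 KB register file per SM in both), so treat them as assumptions rather than something the presenters stated.

```python
# Assumed per-SM figures (not from the slides): FP32 cores and register file size.
SM_SPECS = {
    "Maxwell SM": {"fp32_cores": 128, "register_file_kb": 256},
    "GP100 SM": {"fp32_cores": 64, "register_file_kb": 256},
}


def registers_per_core(spec):
    """Number of 32-bit registers available per FP32 core in one SM."""
    total_registers = spec["register_file_kb"] * 1024 // 4  # 4 bytes per register
    return total_registers // spec["fp32_cores"]


for name, spec in SM_SPECS.items():
    print(f"{name}: {registers_per_core(spec)} registers per core")
```

Under these assumptions, halving the core count while keeping the register file the same size doubles the registers available to each core, which is the effect the slide describes.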
Slide 3. What would a discourse on GPUs be without a picture of a box containing many other little boxes?
Half precision, aka FP16
Slide 4. No matter which float data type you use, they are all IEEE 754 compliant.
Slide 5. Half precision has been in GPUs for ~12 years (textures are stored in it for efficiency). It is now exposed in the same way as float and double.
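To get a feel for what half precision gives up, here is a minimal CPU-side sketch using Python's standard struct module, which can pack IEEE 754 binary16, binary32 and binary64 values. This is only an illustration of the number formats, not GPU code, and the helper name round_trip is made up for this example.

```python
import struct


def round_trip(value, fmt):
    """Pack a Python float into the given IEEE 754 format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


HALF, SINGLE, DOUBLE = "<e", "<f", "<d"  # binary16, binary32, binary64

# Storage size in bytes for each format.
sizes = {fmt: struct.calcsize(fmt) for fmt in (HALF, SINGLE, DOUBLE)}

# 0.1 is not exactly representable; the error grows as precision shrinks.
error_half = abs(round_trip(0.1, HALF) - 0.1)
error_single = abs(round_trip(0.1, SINGLE) - 0.1)

# FP16 has an 11-bit significand, so not every integer above 2048 is exact.
largest_finite_half = round_trip(65504.0, HALF)
rounded = round_trip(2049.0, HALF)

print(sizes, error_half, error_single, largest_finite_half, rounded)
```

The trade-off is half the storage of a float in exchange for a much smaller range (the largest finite value is 65504) and coarser rounding, which is often acceptable for deep learning workloads.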
Slide 6. And although teraflops (TF) might not be everyone's cup of tea when it comes to benchmarking GPU performance, Pascal clearly wins here. It is the 21.2 TF @ FP16 that gives the DGX-1 its headline 170 TF score.
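The arithmetic behind the headline number is easy to check. The sketch below assumes the DGX-1 contains eight Pascal GPUs, which is not stated in the notes above.

```python
FP16_TFLOPS_PER_GPU = 21.2  # figure quoted on the slide
GPUS_IN_DGX1 = 8            # assumption: not stated in the talk notes above

dgx1_fp16_tflops = FP16_TFLOPS_PER_GPU * GPUS_IN_DGX1
print(f"{dgx1_fp16_tflops:.1f} TF")
```

Under that assumption the total comes to 169.6 TF, which rounds to the advertised 170 TF.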
Slide 7. For deep learning, NVLink will be a key enabler of model parallelism, which is very infrequently used at present due to synchronisation issues / costs across GPU boundaries. Data parallelism is much more prevalent. NVLink plus increased memory will enable much larger models to be trained on Pascal. Also, NVLink links can be "ganged": 80 GB/sec in each direction, or 160 GB/sec bidirectional.
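As a back-of-the-envelope illustration of why link bandwidth matters when GPUs have to exchange data, the sketch below estimates the ideal time to ship one set of gradients between two GPUs. The PCIe figure and the model size are assumptions chosen for illustration, and latency and protocol overhead are ignored.

```python
def transfer_seconds(payload_bytes, gb_per_sec):
    """Ideal time to move a payload over a link, ignoring latency and overhead."""
    return payload_bytes / (gb_per_sec * 1e9)


NVLINK_GANGED_GB_S = 80.0  # per direction, from the slide
PCIE3_X16_GB_S = 16.0      # assumption: approximate PCIe 3.0 x16 rate per direction

# Hypothetical model: 100 million FP32 parameters, so 4 bytes of gradient each.
gradient_bytes = 100_000_000 * 4

nvlink_ms = transfer_seconds(gradient_bytes, NVLINK_GANGED_GB_S) * 1e3
pcie_ms = transfer_seconds(gradient_bytes, PCIE3_X16_GB_S) * 1e3
print(f"NVLink: {nvlink_ms:.1f} ms, PCIe: {pcie_ms:.1f} ms")
```

With these assumed numbers the exchange takes about 5 ms over ganged NVLink versus about 25 ms over PCIe. A gap of that size, repeated at every synchronisation point, is what makes splitting a model across GPUs more practical.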
HBM - High Bandwidth Memory (stacked)
Slide 8. Three things the presenters wanted us to take away from this slide:
Think of it, and use it, just as plain old global memory.
Fast (much faster than current memory).
ECC is there for protection without any penalty (free as in beer, not the case in previous architectures).
Slide 9. Performance is improved via data locality: pages migrate to the processor that is requesting them, and coherency is guaranteed while this is happening. Pascal can also page-fault back to system memory, and you can allocate up to the size of system memory. All in all, big improvements here over Maxwell.
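The migrate-on-access idea can be pictured with a deliberately tiny toy model. This is not how the CUDA driver or the hardware implements it; the class and method names are hypothetical, and the only point is to show a page moving to whichever processor touches it while keeping a single owner.

```python
class ToyUnifiedMemory:
    """Toy model: each page lives with exactly one processor at a time."""

    def __init__(self, num_pages, initial_owner="cpu"):
        self.owner = [initial_owner] * num_pages
        self.migrations = 0

    def access(self, processor, page):
        """Touch a page; if it lives elsewhere, 'fault' and migrate it first."""
        if self.owner[page] != processor:
            self.owner[page] = processor  # single owner keeps the view coherent
            self.migrations += 1
            return "fault+migrate"
        return "local hit"


memory = ToyUnifiedMemory(num_pages=4)
first = memory.access("gpu", 0)   # page starts on the CPU, so it migrates
second = memory.access("gpu", 0)  # now resident on the GPU
third = memory.access("cpu", 0)   # CPU touches it again, so it migrates back
print(first, second, third, memory.migrations)
```

In this toy model, repeated accesses from the same processor are cheap once the page has moved, which is the locality benefit the slide refers to.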
CUDA 8 and beyond
Slide 10. I also took quite a few pictures of the CUDA 8 session slides, but I'll keep to just one here, as I think it sums up where both CUDA and GPU programming in general are going. You will see that the code snippet on this slide uses malloc() and free() to allocate and de-allocate GPU memory, not CUDA-specific calls. This makes CUDA programming more elegant and the GPU more transparent - in one respect it is simply a unit of hardware in your server optimised for vector / matrix operations - no more, no less.