The Tesla P100 processor succeeds the Tesla K40 and the Tesla K80 (essentially two K40s working together) and packs 3584 computing cores (vs. 2880 for the K40). Each core is built on the newer Pascal architecture, which yields higher performance and better compute density.
To feed data to this monster, a 4096-bit bus to cutting-edge HBM2 memory stacks can provide up to 720GB/sec of bandwidth, which is like moving the contents of 28 Blu-Ray movies every second. The new stacked HBM2 memory technology is the key factor behind that jump in bandwidth.
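As a quick sanity check on that comparison, assuming a single-layer Blu-Ray disc holds about 25GB (a figure not given in the P100 spec sheet itself), the math works out like this:

```python
# Back-of-the-envelope check: how many Blu-Ray discs' worth of data
# can one second of P100 memory bandwidth move?
BANDWIDTH_GB_PER_SEC = 720   # peak HBM2 bandwidth quoted for the P100
BLU_RAY_GB = 25              # assumed single-layer disc capacity

discs_per_second = BANDWIDTH_GB_PER_SEC / BLU_RAY_GB
print(f"~{discs_per_second:.1f} Blu-Ray discs per second")  # ~28.8, in line with the ~28 figure above
```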
The computing power tops out at 21.2 TFLOPS (teraFLOPS) in half-precision calculations and 5.3 TFLOPS in double precision, roughly 5X the peak of the single-chip Tesla K40. Interestingly, the K40 was rated for 235W, while the P100 chip is rated for 300W, so the power budget has grown far more slowly than the compute throughput.
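For context, here is a rough, back-of-the-envelope comparison; the K40 peaks used below (about 4.29 TFLOPS single precision and 1.43 TFLOPS double precision) are assumed from NVIDIA's published specs rather than quoted in this article:

```python
# Hedged comparison sketch: P100 peaks quoted above vs. assumed K40 peaks.
p100 = {"fp16_tflops": 21.2, "fp64_tflops": 5.3, "watts": 300}
k40  = {"fp32_tflops": 4.29, "fp64_tflops": 1.43, "watts": 235}  # assumed K40 specs

# The ~5X figure lines up with P100 half precision vs. the K40's best (single) precision.
print(f"FP16 (P100) vs FP32 (K40): {p100['fp16_tflops'] / k40['fp32_tflops']:.1f}x")  # ~4.9x
print(f"FP64 vs FP64:              {p100['fp64_tflops'] / k40['fp64_tflops']:.1f}x")  # ~3.7x

# Efficiency improves even though the absolute power draw goes up.
print(f"FP64 GFLOPS per watt: K40 {k40['fp64_tflops'] / k40['watts'] * 1000:.1f}, "
      f"P100 {p100['fp64_tflops'] / p100['watts'] * 1000:.1f}")  # ~6.1 vs ~17.7
```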
The massive internal register file will further accelerate applications that can keep their working data on the chip, without having to go out to memory, which is inherently slower. Previous NVIDIA single-chip solutions like the K40 have 3840KB of registers; the P100 has 14336KB!
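Those totals are simply the per-SM register file multiplied by the number of SMs. The per-SM breakdown below (65,536 32-bit registers, i.e. 256KB per SM, with 15 SMX units on the K40 and 56 SMs on the P100) is assumed from NVIDIA's Kepler and Pascal documentation rather than spelled out in this article, but it reproduces the figures above:

```python
# Where the register-file totals come from (assumed per-SM figures).
REGISTERS_PER_SM_KB = 65536 * 4 // 1024   # 65,536 x 32-bit registers = 256KB per SM

k40_sms, p100_sms = 15, 56                # SMX count (K40) vs. SM count (P100)
print(f"K40  register file: {k40_sms * REGISTERS_PER_SM_KB} KB")   # 3840 KB
print(f"P100 register file: {p100_sms * REGISTERS_PER_SM_KB} KB")  # 14336 KB
```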
With general-purpose computing's demand for more performance showing no signs of letting up, NVIDIA can afford to keep building these amazingly large and powerful chips.