The Maxwell GPU design is manufactured on the same 28nm semiconductor process as the previous-generation GPU, NVIDIA Kepler (as seen in the GeForce GTX 680), so the performance improvements come from architectural changes and optimizations, but also from a slightly larger die area (148 vs. 118 square millimeters).
Looking at the comparison between the GK107 Kepler GPU and the GM107 Maxwell GPU, the latter has 66.7% more CUDA cores, the basic processing cores that churn through all the computations. Its L2 cache size has also jumped from 256KB to 2048KB in a bid to reduce latency as much as possible.
GPU | GK107 (Kepler) | GM107 (Maxwell) |
CUDA Cores | 384 | 640 |
Base Clock (MHz) | 1058 | 1020 |
Boost Clock (MHz) | N/A | 1085 |
GFLOPs | 812.5 | 1305.6 |
Texture Units | 32 | 40 |
Texel Fill-Rate (Gigatexels/sec) | 33.9 | 40.8 |
Memory Clock (MHz) | 5000 | 5400 |
Memory Bandwidth (GB/sec) | 80 | 86.4 |
ROPs | 16 | 16 |
L2 Cache Size | 256KB | 2048KB |
TDP | 64W | 60W |
Transistors (Billions) | 1.3 | 1.87 |
Die Size (square millimeters) | 118 | 148 |
Manufacturing Process (nm) | 28 | 28 |
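The GFLOPs, texel fill-rate and memory bandwidth rows follow from the other rows in the table, and a quick sanity check is easy to sketch. This is only a back-of-the-envelope calculation: it assumes 2 floating-point operations per CUDA core per clock (one fused multiply-add), one texel per texture unit per clock, and a 128-bit memory bus on both chips. None of those figures appear in the table itself, and the names `SPECS`, `BUS_WIDTH_BITS` and `derived` are made up for this illustration.

```python
# Back-of-the-envelope check of the derived rows in the spec table.
# Assumptions (not from the table): 2 FLOPs per core per clock (one FMA),
# 1 texel per texture unit per clock, and a 128-bit memory bus.

SPECS = {
    "GK107": {"cores": 384, "base_mhz": 1058, "tex_units": 32, "mem_mhz": 5000},
    "GM107": {"cores": 640, "base_mhz": 1020, "tex_units": 40, "mem_mhz": 5400},
}
BUS_WIDTH_BITS = 128  # assumed bus width for both chips


def derived(spec):
    """Return the derived throughput figures for one GPU."""
    gflops = spec["cores"] * 2 * spec["base_mhz"] / 1000
    gtexels = spec["tex_units"] * spec["base_mhz"] / 1000
    # Effective memory clock (MHz) * bus width in bytes -> MB/s, then GB/s.
    bandwidth_gbs = spec["mem_mhz"] * BUS_WIDTH_BITS / 8 / 1000
    return {"gflops": gflops, "gtexels": gtexels, "bandwidth_gbs": bandwidth_gbs}


if __name__ == "__main__":
    for name, spec in SPECS.items():
        d = derived(spec)
        print(f"{name}: {d['gflops']:.1f} GFLOPs, "
              f"{d['gtexels']:.1f} Gtexels/s, {d['bandwidth_gbs']:.1f} GB/s")
```

Under those assumptions the numbers line up with the table, which suggests the GFLOPs and fill-rate rows are computed from the base clock rather than the boost clock.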
Since Kepler, GeForce GPU designs have been used in everything from mobile to desktop products, so there’s an inherent drive for power efficiency that can be felt across NVIDIA’s entire line of chips. What the company has learned with GeForce provides a performance boost to Tegra mobile processors like the NVIDIA Tegra K1, while the lessons learned with Tegra will make future laptop and desktop processors that much more power-efficient.
The first products to hit the market are not your typical high-end super-fast graphics processors. On the contrary, the GeForce GTX 750 Ti and GeForce GTX 750 are in the mid-range, and they represent the “meat” of NVIDIA’s market, which is precisely where the company wants to attack first.
The second reason for that choice is that NVIDIA believes that it can easily outperform AMD on a performance-per-watt basis and address a market that was once out of reach for this level of performance: PCs that don’t have an additional 6-pin power connector – and this is pretty awesome. This means that what was once a wimpy PC can now become a decent gaming machine.
Interestingly enough, desktop cards based on the Kepler architecture already seemed pretty efficient, and they themselves represented quite an upgrade from the previous NVIDIA Fermi architecture.
NVIDIA says that its Maxwell GPU delivers 35% higher performance per CUDA core, provided that math computations are what limits the speed (and not other things like memory bandwidth or the workload itself). And since there are about 67% more CUDA cores in the new chip, you can see that the computation potential is much higher.
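To put a rough number on that potential, the two figures can simply be multiplied. The sketch below is a best-case estimate only: it assumes a purely compute-bound workload, takes NVIDIA’s 35% per-core claim at face value, and ignores the small base-clock difference between the two chips. The function name `compute_potential_ratio` is invented for this example.

```python
# Best-case compute scaling estimate, GK107 -> GM107.
# Assumptions: compute-bound workload, NVIDIA's 35% per-core claim holds,
# and both chips run at the same clock (the real base clocks differ slightly).

def compute_potential_ratio(old_cores, new_cores, per_core_gain):
    """Multiply the core-count ratio by the per-core performance gain."""
    core_ratio = new_cores / old_cores
    return core_ratio * (1 + per_core_gain)


if __name__ == "__main__":
    ratio = compute_potential_ratio(384, 640, 0.35)
    print(f"Estimated compute potential: {ratio:.2f}x")
```

That works out to roughly 2.25 times the compute potential of GK107 in the ideal case. Real games will land lower, since memory bandwidth only grows from 80 to 86.4 GB/sec.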
NVIDIA has not yet published a “deep-dive” white paper on Maxwell, but from discussing this with their team, my understanding is that there were big changes in the way the workload is dispatched to CUDA cores. In Kepler, the CUDA cores within each SM (streaming multiprocessor; a GPU can have one or many SM units) shared a register file and scheduler units. In the new design, NVIDIA made things more local, and has basically created some islands where smaller groups of CUDA cores have their own scheduler and register file. This probably leads to better locality, bandwidth and latency, and it is better for the chip layout.
NVIDIA says that the new design does a better job of (1) keeping all cores busy when needed and (2) powering the cores down when they are not needed. I’m just speculating, but it seems like this setup could give them much more granularity over which groups of CUDA cores get shut down, which is probably not as easy with the Kepler design.
We all know that GPUs are really fast at graphics, and they can also be great for general-purpose computations, but the speed at which they can compress video is often overlooked. If you compress video regularly, it is one of the rare things that still takes minutes or hours in one’s daily life, so any improvement is a big deal.
NVIDIA’s Maxwell architecture can compress video at 6X to 8X real-time speed, which means that a one-hour video sequence could be compressed in 7.5 to 10 minutes (actual time depends on video quality etc…), which is quite impressive. Additionally, NVIDIA says that it can perform the video decode with better power efficiency because of architectural improvements made to that unit.
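The encode-time figure is just the video length divided by the real-time multiplier. Here is a tiny sketch of that arithmetic; `encode_minutes` is a made-up helper, and the 6X and 8X multipliers are NVIDIA’s claimed range, not something measured here.

```python
# Encode time from a real-time speed multiplier.
# The 6x and 8x figures are NVIDIA's claimed range, not measurements.

def encode_minutes(video_minutes, realtime_multiplier):
    """Minutes needed to encode a clip at N times real-time speed."""
    return video_minutes / realtime_multiplier


if __name__ == "__main__":
    for multiplier in (6, 8):
        minutes = encode_minutes(60, multiplier)
        print(f"{multiplier}x real-time: {minutes:.1f} minutes for a one-hour video")
```

So the 7.5-minute figure corresponds to the top of the claimed range, and the 6X end works out to 10 minutes.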
Video playback is a light workload, so the hard part here is to shut down as much of the chip as possible, while still being able to play the video and take care of the basic computer window environment. Maxwell has a low-power state called GC5 that helps it consume less power while playing video.
The ability to run a powerful GPU that doesn’t require a 6-pin connector brings amazing performance to this segment of the market. Think of it: this is the performance that you could get four years ago for $500+ with the hottest (literally) and baddest NVIDIA GPU. Now, you can plug that into your small form factor (SFF) PC and it won’t even heat up the room.
Finally, it’s impossible not to think about what will happen when NVIDIA integrates this into its next-gen Tegra mobile processor, codenamed Parker, which I will tentatively call Tegra M1…