Anyone notice something? P100 is more GCN-like: 4 Texture Mapping Units per SM and 64 SIMD cores per SM. NVIDIA has delivered a design that is VERY similar to GCN organization-wise.
The lower CUDA core count per SM means that P100 won't run into the issue where increasing parallelism past 16 concurrent warps maxed out the available local caches and began to spill into the L2 cache.
This makes P100 a more parallel architecture than Maxwell, but it is essentially Maxwell tweaked to look more GCN-like. I'm now interested to see what Vega brings to the table.
ROP-wise, P100 looks to have 128 ROPs. If we look at GM107:
We see a ROP to memory controller ratio of 8:1. Each group of 8 ROPs had access to 1MB of L2 cache and its own 64-bit memory controller.
GM200/204 changed this to a ratio of 16:1. Each group of 16 ROPs had access to 512KB of L2 cache and its own 64-bit memory controller. This setup was short on memory bandwidth, which is why NVIDIA leaned heavily on color compression. Even with nothing but ROP work running, GM200/204 could not hit its theoretical ROP throughput. That makes it highly unlikely that NVIDIA would increase the ROP ratio further with Pascal. Notice a loss of 10 GPixels/s for each of the GM200/204 cards when compared to theoretical performance.
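For anyone who wants to check the Maxwell ratios, here is a rough Python sketch. The chip totals in `chips` (ROP count, L2 size, bus width) are assumptions I've pulled from public spec sheets, not from anything above, and `per_controller` is just a helper name made up for this post.

```python
# Chip totals are assumptions from public spec sheets, not from the post itself.
chips = {
    "GM107": {"rops": 16, "l2_kb": 2048, "bus_bits": 128},
    "GM204": {"rops": 64, "l2_kb": 2048, "bus_bits": 256},
    "GM200": {"rops": 96, "l2_kb": 3072, "bus_bits": 384},
}

CONTROLLER_BITS = 64  # Maxwell uses 64-bit memory controllers


def per_controller(chip):
    """Split a chip's ROPs and L2 evenly across its 64-bit memory controllers."""
    controllers = chip["bus_bits"] // CONTROLLER_BITS
    return {
        "controllers": controllers,
        "rops_per_controller": chip["rops"] // controllers,
        "l2_kb_per_controller": chip["l2_kb"] // controllers,
    }


for name, chip in chips.items():
    print(name, per_controller(chip))
```

If those totals are right, GM107 comes out at 8 ROPs and 1MB of L2 per controller, and GM204/GM200 both come out at 16 ROPs and 512KB per controller.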
There are 8x 512-bit memory controllers on P100, which means you can pair them with 8 groups of ROPs. If we assume a 16:1 ratio, that's 128 ROPs.
There's 4MB of L2 cache on P100. If we assume 512KB per 16 ROPs, that's 8 groups of 16, or 128 ROPs again.
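The two estimates can be written out as a quick Python sketch. The 16:1 ratio and the 512KB-per-group figure are the assumptions carried over from GM200/204, not confirmed P100 specs, and the function names are made up for illustration.

```python
def rops_from_controllers(memory_controllers, rops_per_controller):
    """Estimate ROPs by pairing one ROP group with each memory controller."""
    return memory_controllers * rops_per_controller


def rops_from_l2(l2_kb, l2_kb_per_group, rops_per_group):
    """Estimate ROPs by dividing the L2 cache into per-group slices."""
    groups = l2_kb // l2_kb_per_group
    return groups * rops_per_group


# P100 figures from the post: 8 memory controllers, 4MB of L2.
# Assumed (carried over from GM200/204): 16 ROPs and 512KB of L2 per group.
by_controllers = rops_from_controllers(8, 16)
by_l2 = rops_from_l2(4 * 1024, 512, 16)

print(by_controllers, by_l2)  # 128 128
```

Both routes land on the same number, which only shows the guess is self-consistent, not that it's right.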
My take at least.