Polaris will support ROVs (Rasterizer Ordered Views) and CR (Conservative Rasterization). That's what the "Primitive Discard Acceleration" is all about.
ROV is a performance-reducing feature btw.
What is clear to me is that Maxwell 3.0 (Pascal) needs HBM2 to shine. This is why AMD's GCN3 (Fiji) is able to keep up with a GTX 980 Ti (GM200) at 4K despite having a lower ROP count and clock speed.
NVIDIA went from an 8:1 ratio of ROPs to memory controllers in Kepler (GK110) to a 16:1 ratio with GM20x. In other words, Kepler (GK104/GK110) had 8 ROPs per 64-bit memory controller.
So a 256-bit memory interface would give us 4 x 8, or 32 ROPs, and a 384-bit memory interface would give us 6 x 8, or 48 ROPs.
What NVIDIA did was boost that to 16 ROPs per 64-bit memory controller with GM20x. So a 256-bit memory interface now powers 64 ROPs and a 384-bit memory interface now powers 96 ROPs.
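To make the arithmetic concrete, here's a quick sketch (purely illustrative; the ratios and bus widths are the ones quoted above):

```python
# ROP count as a function of bus width and the ROPs-per-64-bit-controller ratio.

def rop_count(bus_width_bits, rops_per_controller):
    controllers = bus_width_bits // 64          # one memory controller per 64-bit slice
    return controllers * rops_per_controller

# Kepler-style 8:1 ratio
print(rop_count(256, 8))   # 32 ROPs (GK104-class 256-bit bus)
print(rop_count(384, 8))   # 48 ROPs (GK110-class 384-bit bus)

# Maxwell (GM20x) 16:1 ratio
print(rop_count(256, 16))  # 64 ROPs (GTX 980-class)
print(rop_count(384, 16))  # 96 ROPs (GTX 980 Ti-class)
```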
NVIDIA added delta color compression, which helps with coherent (compressible) pixels and texels but not with random ones, in order to make up for the lack of memory bandwidth. It helped out a bit, but still couldn't keep up with GCN2's 64 ROPs under random scenarios, or with GCN3's ROPs under both random and coherent scenarios.
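For anyone curious, here's a toy sketch of why delta-style compression helps coherent data but does nothing for random data. To be clear, this is not NVIDIA's actual scheme, just the general idea: store a base pixel plus small deltas, which only pays off when the deltas stay small.

```python
# Toy illustration: a scanline is "compressible" if every pixel-to-pixel delta
# fits in a small signed field. Smooth gradients qualify; noise does not.

def compressible(scanline, delta_bits=4):
    limit = (1 << (delta_bits - 1)) - 1                 # largest positive delta
    deltas = [b - a for a, b in zip(scanline, scanline[1:])]
    return all(-limit - 1 <= d <= limit for d in deltas)

smooth_gradient = list(range(100, 132))            # slowly varying shade
noise = [7, 201, 54, 255, 3, 128, 90, 240]         # "random" pixels

print(compressible(smooth_gradient))  # True  -> fewer bytes over the bus
print(compressible(noise))            # False -> falls back to raw pixels
```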
What we're looking at, then, is NVIDIA's initial Pascal offerings being somewhat nice but not delivering the performance people seem to think they will. GP100, paired with HBM2, will be able to deliver the bandwidth needed by NVIDIA's bandwidth-starved 96 ROPs (Z testing, pixel blending, anti-aliasing etc. devour immense amounts of bandwidth). Therefore I don't think we're going to see more than 96 ROPs in GP100. What we're instead likely to see are properly fed ROPs.
If the "GTX 1080" comes with 10 Gbps GDDR5x memory on a 256-bit memory interface then we'd be looking at the same 64 ROps that the GTX 980 sports and the same 16:1 ratio (8 ROps/64-bit memory controller) but with 320GB/s memory bandwidth as opposed to 224GB/s on the GTX 980. So the GTX 1080 (320GB/s) should deliver a similar performance/clk as a GTX 980 Ti (336GB/s) at 4K despite sporting 64 ROps to the GTX 980 Ti's 96.
NVIDIA will likely set the GTX 1080's reference clocks higher in order to obtain faster performance than a reference-clocked GTX 980 Ti. So the GTX 1080's performance increase over a GTX 980 Ti, as far as 4K goes, will likely come down to those higher reference clocks.
I also think that the GTX 1080 will sport the same, or around the same, CUDA core count as a GTX 980 Ti (2,816). I could be entirely off, but that's what I think.
As for FP64, NVLink and FP16 support, those are nice for a data centre but mean absolutely nothing for gamers... Sad, I know.
So what we're looking at from NVIDIA, initially, is GTX 980 Ti performance (or slightly higher performance) at a lower price point with GP104. The real fun will start with GP100 by end of 2016/beginning of 2017.
On the RTG/AMD front...
RTG replaced the geometry engines with new geometry processors. One notable new feature is primitive discard acceleration, which is something GCN1/2/3 lacked. This allows future GCN4 parts (Polaris/Vega) to prevent certain primitives from being rasterized. Unseen tessellated meshes are "culled" (removed from the rasterizer's workload). Primitive Discard Acceleration also means that GCN4 will support Conservative Rasterization.
Basically, RTG have removed one of their weaknesses in GCN.
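As a rough illustration of what "discarding primitives before rasterization" means, here's the general concept in a few lines. This is just the idea, not how GCN4's hardware actually does it:

```python
# Drop triangles that can never produce visible pixels (zero area or
# back-facing) before they reach the rasterizer.

def signed_area_2d(tri):
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def discard(tri, eps=1e-6):
    area = signed_area_2d(tri)
    return abs(area) < eps or area < 0   # degenerate, or back-facing (CCW = front)

triangles = [
    [(0, 0), (1, 0), (0, 1)],   # front-facing: kept
    [(0, 0), (1, 0), (2, 0)],   # zero area (collinear): discarded
    [(0, 0), (0, 1), (1, 0)],   # back-facing winding: discarded
]
to_rasterize = [t for t in triangles if not discard(t)]
print(len(to_rasterize))        # 1 triangle survives to rasterization
```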
As for the hardware scheduling, GCN still uses an Ultra Threaded Dispatcher which is fed by the Graphics Command Processor and ACEs.
AMD replaced the Graphics Command Processor and increased the size of the command buffer (the section of the frame buffer/system memory dedicated to holding many to-be-executed commands). The two changes, coupled together, allow for a boost in performance under single-threaded scenarios.
How? My opinion is that if the CPU is busy handling a complex simulation or other CPU-heavy work under DX11, you generally get a stall on the GPU side, where the GPU idles, waiting on the CPU to finish the work it is doing so that it can go back to feeding the GPU.
By increasing the size of the command buffer, more commands can be placed in-waiting so that while the CPU is busy with other work, the Graphics Command Processor still has a lot of buffered commands to execute. This averts a stall.
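Here's a toy model of that argument. The numbers are entirely made up, but it shows why a deeper command buffer means fewer GPU idle ticks when the CPU periodically goes off to do other work:

```python
# Toy simulation: the CPU only submits commands while it is free; the GPU
# retires one buffered command per tick and stalls when the buffer runs dry.

from collections import deque

def gpu_idle_ticks(buffer_size, submits_per_tick=4, cpu_busy_ticks=20, total_ticks=100):
    queue = deque()
    idle = 0
    for tick in range(total_ticks):
        cpu_free = tick % (cpu_busy_ticks * 2) < cpu_busy_ticks
        if cpu_free:
            # CPU has spare time: batch up several commands, buffer space permitting
            for _ in range(submits_per_tick):
                if len(queue) < buffer_size:
                    queue.append("draw")
        if queue:
            queue.popleft()   # GPU retires one buffered command this tick
        else:
            idle += 1         # buffer ran dry -> GPU stalls waiting on the CPU
    return idle

print(gpu_idle_ticks(buffer_size=4))    # shallow buffer: GPU stalls during CPU-busy spans
print(gpu_idle_ticks(buffer_size=64))   # deep buffer: stalls largely averted
```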
So 720p/900p/1080p/1440p performance should be great under DX11 on Polaris/Vega.
Another nifty new feature is instruction prefetching. Instruction prefetch is a technique used in central processing units to speed up the execution of a program by reducing wait states (GPU idle time, in this case).
Prefetching occurs when a processor requests an instruction or data block from main memory before it is actually needed. Once the block comes back from memory, it is placed in a cache (and GCN4 has increased its cache sizes as well). When the instruction/data block is actually needed, it can be accessed much more quickly from the cache than if it had to be requested from memory. Thus, prefetching hides memory access latency.
In the case of a GPU, the prefetch can take advantage of the spatial coherence usually found in the texture mapping process. In this case, the prefetched data are not instructions, but texture elements (texels) that are candidates to be mapped on a polygon.
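A very rough sketch of the latency-hiding idea, in arbitrary time units (purely illustrative, not a model of real hardware):

```python
# With prefetching, the fetch for the next texel block is issued while the
# current block is being worked on, so only the first fetch's latency is
# exposed. Without it, every fetch adds its full latency to the total.

MEMORY_LATENCY = 100   # ticks to fetch a block from memory
WORK_PER_BLOCK = 100   # ticks of shading work per block

def total_time(num_blocks, prefetch):
    if not prefetch:
        # fetch, wait, then shade, for every block: latency fully exposed
        return num_blocks * (MEMORY_LATENCY + WORK_PER_BLOCK)
    # only the first fetch is exposed; later fetches overlap the previous block's work
    per_block = max(WORK_PER_BLOCK, MEMORY_LATENCY)
    return MEMORY_LATENCY + num_blocks * per_block

print(total_time(8, prefetch=False))   # 1600 ticks
print(total_time(8, prefetch=True))    # 900 ticks - memory latency mostly hidden
```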
This could mean that GCN4 (Polaris/Vega) will be boosting texturing performance without needing to rely on more texturing units. This makes sense when you consider that Polaris will be a relatively small die containing far fewer CUs (Compute Units) than Fiji and that Texture Mapping Units are found in the CUs. By reducing the texel fetch wait times, you can get a more efficient use out of the Texture Mapping Units on an individual basis. Kind of like higher TMU IPC.
On top of all this, we have the new L2 cache, improved CUs for better shader efficiency, new memory controllers, etc.
So what we're looking at from AMD is Fury X performance (or slightly more) at a reduced price point for DX12, and higher than Fury X performance for DX11. Just like with NVIDIA, the real fun starts with Vega by end of 2016/beginning of 2017.
In conclusion,
We have Maxwell 3.0 facing off against a refined GCN architecture. Micron just announced that mass production of GDDR5X is set for this summer, so both AMD and NVIDIA are likely to use GDDR5X. It will be quite interesting to see the end result.
So does lacking Asynchronous compute + graphics matter? Absolutely.