This man gets it.
AMD's Command Processor (CP) is inefficient in DX11 because it was designed to work alongside the ACEs, which DX11 cannot access, i.e. graphics queues go through the CP and compute/copy queues go through the ACEs.
It also lacks a deep queue buffer, so if the CPU is hammered, the CP stalls. NV's GigaThread has a deep buffer and receives its queues from the multi-threaded software (driver) scheduler running on the CPU, so it rarely stalls. In DX12, NV loses that advantage, and even suffers slightly from software scheduling overhead.
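To make the buffer-depth point concrete, here's a toy model. It's only a sketch under my own assumptions, not how either vendor's hardware actually works: the CPU submits commands in bursts, the GPU front end consumes one command per tick, and anything that doesn't fit in the buffer can't be queued ahead. The name `count_stalls`, the depths and the burst pattern are all made up for illustration.

```python
def count_stalls(depth, cpu_rate):
    """Count ticks where the GPU front end has nothing to consume.

    depth    -- how many commands the (hypothetical) queue buffer can hold
    cpu_rate -- commands the CPU manages to submit on each tick
    """
    queued = 0
    stalls = 0
    for rate in cpu_rate:
        # The CPU can only run ahead as far as the buffer allows.
        queued = min(depth, queued + rate)
        if queued:
            queued -= 1   # front end consumes one command this tick
        else:
            stalls += 1   # buffer is empty: front end idles
    return stalls


# Hypothetical "hammered CPU" pattern: a burst of 4 commands,
# then 3 ticks where the CPU is busy elsewhere and submits nothing.
pattern = [4, 0, 0, 0] * 5

shallow_stalls = count_stalls(2, pattern)   # shallow buffer
deep_stalls = count_stalls(16, pattern)     # deep buffer
print(shallow_stalls, deep_stalls)
```

With this pattern the shallow buffer runs dry during every gap, while the deep one absorbs the burst and never starves. Same CPU behaviour, very different front-end utilisation.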
When GCN is driven under DX12, the CP can be fed from multiple CPU cores, so it operates at a higher efficiency. Add the ACEs to distribute compute queues and some of the work is offloaded from the CP, as the two were designed to work together.
This is why DX12 performance on GCN tracks its TFLOPS.
Polaris gets a lot of improvements to enhance its performance in DX11. The new CP, pre-fetch, and cache system all address the critical flaws by removing DX11 bottlenecks. I don't know whether this would improve its DX12 performance, but given how well current GCN runs in DX12, I have to wonder if there's room left to improve.
What it may mean is that if it's 5.5 TFLOPS, it's going to be on par with the 390X in DX12, but ahead of it in DX11. The biggest unknown, though, is how effective the Discard Accelerator will be. That may really skew the performance edge to Polaris.
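For reference, the TFLOPS comparison is just shaders x 2 x clock (one fused multiply-add per shader per clock). A quick sketch, assuming the 390X reference specs of 2816 shaders at 1050 MHz and taking the 5.5 figure as the speculative number above; `peak_tflops` is just a helper name I made up:

```python
def peak_tflops(shaders, clock_mhz):
    """Peak single-precision TFLOPS: 2 FLOPs (one FMA) per shader per clock."""
    return shaders * 2 * clock_mhz * 1e6 / 1e12


# Assumed R9 390X reference specs: 2816 shaders at 1050 MHz.
r9_390x = peak_tflops(2816, 1050)

# Speculative Polaris figure from the post, not a confirmed spec.
polaris_guess = 5.5

ratio = polaris_guess / r9_390x
print(f"390X: {r9_390x:.2f} TFLOPS, ratio: {ratio:.2f}")
```

That puts the 390X at roughly 5.9 TFLOPS, so a 5.5 TFLOPS part would have about 93% of its raw throughput. If DX12 performance really does track TFLOPS, "on par" is about right on paper.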