You have answered why, but you have not answered what defines performance per clock for GPUs.
It is the core throughput and the graphics capabilities of each architecture.
How do you explain the fact that Nvidia GPUs have a longer pipeline that can clock higher, but do less work each cycle than AMD GPUs?
Nvidia has a 32-thread warp, executed by 128 cores per 256 KB of register file, in the Maxwell and Pascal architectures.
AMD has a 64-thread wavefront, executed by 64 cores per 256 KB of register file, in every single GCN generation since GCN1. All that has changed in GCN since then is the graphics capabilities. Nvidia changed core throughput by reducing the number of cores per 256 KB of register file, which made them less starved for resources. This is why we have seen zero core-for-core improvement in compute applications on GCN, but we have seen core-for-core improvement in both gaming and compute with the shift from Kepler to Maxwell.
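To put rough numbers on the "starved for resources" point, here is a quick back-of-the-envelope sketch of register file per core. The Maxwell/Pascal and GCN core counts are the ones quoted above; the 192-core Kepler SMX figure is my own addition and is not from the post, and the function name is just made up for illustration.

```python
# Register file available per core, assuming 256 KB of registers
# per SM/CU as quoted in the post.
REGISTER_FILE_KB = 256


def register_kb_per_core(cores: int, register_file_kb: int = REGISTER_FILE_KB) -> float:
    """Return KB of register file per core (illustrative helper, not a vendor metric)."""
    return register_file_kb / cores


configs = {
    "Kepler SMX (192 cores, assumed, not from the post)": 192,
    "Maxwell/Pascal SM (128 cores)": 128,
    "GCN CU (64 cores)": 64,
}

for name, cores in configs.items():
    print(f"{name}: {register_kb_per_core(cores):.2f} KB of registers per core")
```

This is only a ratio of two headline numbers. It says nothing about occupancy, scheduling or cache behaviour, so treat it as an illustration of the argument rather than a performance model.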
The bottleneck for GCN in graphics throughput is registering triangles. Each cycle it can register 4 triangles with 4 geometry engines. GP102 can register 6 triangles each clock with 6 GPCs (each GPC has 1 geometry engine). Vega lifts this bottleneck with its Programmable Geometry Pipeline, which can register up to 11 triangles each clock with 4 geometry engines.
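The triangle figures can be compared the same way. This is a minimal sketch that takes the per-clock numbers above at face value and uses a hypothetical 1000 MHz clock for every chip (the clock value and the function name are assumptions of mine, chosen only so the chips are compared at equal frequency).

```python
# Peak triangle registration rate, using the per-clock figures from the post.
# The clock speed is a hypothetical placeholder, not a real product spec.
HYPOTHETICAL_CLOCK_MHZ = 1000

TRIANGLES_PER_CLOCK = {
    "GCN (4 geometry engines)": 4,
    "GP102 (6 GPCs)": 6,
    "Vega (programmable geometry pipeline, peak)": 11,
}


def peak_triangles_per_second(triangles_per_clock: int, clock_mhz: float) -> float:
    """Upper bound: triangles per clock times clocks per second."""
    return triangles_per_clock * clock_mhz * 1e6


baseline = TRIANGLES_PER_CLOCK["GCN (4 geometry engines)"]
for name, per_clock in TRIANGLES_PER_CLOCK.items():
    rate = peak_triangles_per_second(per_clock, HYPOTHETICAL_CLOCK_MHZ)
    print(f"{name}: {rate / 1e9:.1f} G triangles/s, {per_clock / baseline:.2f}x GCN")
```

At equal clocks that works out to 1.5x GCN for GP102 and up to 2.75x for Vega's peak figure. Real clocks differ between the chips, so the absolute rates will not match any shipping card.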
AMD lifted a lot of bottlenecks in Vega. But it will take time before the software matures.