You can't calculate triangle/primitive throughput like this anymore. Modern games don't spend their whole frame time rasterizing geometry. Lighting, post processing, etc. take a significant chunk of GPU time (up to 50%), and during that time the geometry pipelines are idling. Only a small part of the frame is geometry bound. Shadow map rendering is the most geometry bound step. G-buffer rendering also tends to be partially geometry bound (no matter how fat the pixels are), since lots of the submitted triangles tend to result in zero pixel shader invocations (backfacing or earlyZ/hiZ rejected). For example, drawing a high poly character behind a nearby corner would cause 100k vertex shader invocations, but zero pixel shader invocations. This draw call would cause a bubble in GPU utilization, since the geometry pipeline can't get through these triangles before the existing pixel shader work (from previous draw calls) finishes executing on the CUs. I have found that on GCN2, vertex shader work can only utilize roughly two CUs in the common case (the geometry pipes simply can't feed more vertex waves). The remaining (= most) CUs will idle if there's a big chunk of sequential triangles that generate no pixel shader invocations. This is one of the reasons why you want to cull triangles early: to avoid underutilizing the GPU.
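To make "cull the triangles early" a bit more concrete, here is a minimal CPU-side sketch of one such test: dropping backfacing and zero-area triangles before they would ever reach the geometry pipeline. This is only an illustration, not how any particular engine does it. It assumes positions are already projected to 2D screen space and that counter-clockwise winding means front-facing. The names `signed_area2` and `cull_triangles` are made up for this example. A real implementation would more likely run on the GPU in a compute shader and add further tests (frustum, small primitive, hiZ).

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Triangle = Tuple[int, int, int]  # three indices into the position array


def signed_area2(a: Vec2, b: Vec2, c: Vec2) -> float:
    """Twice the signed area of triangle abc; positive when counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def cull_triangles(positions: Sequence[Vec2],
                   triangles: Sequence[Triangle]) -> List[Triangle]:
    """Keep only front-facing, non-degenerate triangles.

    Assumption: counter-clockwise winding in screen space is front-facing.
    """
    return [t for t in triangles
            if signed_area2(positions[t[0]], positions[t[1]], positions[t[2]]) > 0.0]


positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
triangles = [
    (0, 1, 2),  # counter-clockwise: kept
    (0, 2, 1),  # clockwise (backfacing): culled
    (0, 1, 1),  # zero area (degenerate): culled
]
visible = cull_triangles(positions, triangles)
print(visible)
```

Only the surviving index list would be submitted for rendering, so the fixed-function geometry front end never has to chew through triangles that cannot produce pixels.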
Slide 12 of this presentation is a good example of such a utilization bubble:
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf
Here you can see that the GPU occupancy is very low in a part of the G-buffer rendering step where most of the geometry is occluded. The green part of the occupancy graph is the vertex shader work. As you can see, VS occupancy never goes above a certain (small) portion of the whole GPU. GCN simply needs lots of pixel shader work to saturate the GPU when it is rendering geometry that results in a small number of visible pixels. This is also the main reason why async compute helps GCN so much. CUs can simply execute background compute shader work when there aren't enough pixel waves spawned.
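A rough back-of-the-envelope version of that occupancy picture, as a sketch only. The "roughly two CUs" figure is my observation from above, and GCN wavefronts are 64 lanes wide. The total CU count is a free parameter (44 below is just an assumed example value), and the helper names are invented for this illustration.

```python
import math


def vertex_waves(vertex_invocations: int, wave_size: int = 64) -> int:
    """Number of wavefronts needed for the given vertex shader invocations."""
    return math.ceil(vertex_invocations / wave_size)


def vertex_only_utilization(total_cus: int, vs_cus: float = 2.0) -> float:
    """Fraction of CUs kept busy when only vertex work is in flight.

    Assumption: the geometry front end can feed about `vs_cus` CUs worth
    of vertex waves, and no pixel or compute waves are running.
    """
    return min(vs_cus, total_cus) / total_cus


TOTAL_CUS = 44  # assumed example value, not a measurement

waves = vertex_waves(100_000)
busy = vertex_only_utilization(TOTAL_CUS)
print(f"{waves} vertex waves, about {busy:.1%} of CUs busy while they drain")
```

Under these assumptions a fully occluded 100k-vertex draw leaves well over 90% of the CUs with nothing to do unless something else (such as async compute) fills the gap.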
Async compute also has another advantage. It allows the developer to keep geometry units active for larger portions of the frame, because you can freely overlap compute with graphics. You don't need to dedicate a big chunk of GPU time to non-rasterization work (post processing and lighting). You can overlap this work with geometry heavy work to reduce both the time when geometry units are idling and the time when CUs are idling. This is one advantage that AMD has over the competition. Games could spend the whole frame submitting both geometry work and compute work. This way geometry units can be utilized during the whole frame (instead of <50% of the frame). Unfortunately, the downside is that AMD has been behind Nvidia in geometry performance, so you need to use techniques like this to reach parity, instead of gaining big advantages. Polaris improved things a bit, and Vega should improve things further, but so far the results haven't been as good as I had hoped. I would guess that the primitive shaders and DSBR still need some additional driver work to show their full potential. I just hope (for AMD's sake) that they don't need to write custom profiles for each game to utilize these... AMD simply doesn't have as many resources as Nvidia to optimize individual titles separately.
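The overlap argument can be put into a toy model. This is a deliberately idealized sketch: the `overlap` knob (0 = fully serial, 1 = the shorter workload hides completely behind the longer one) is a made-up parameter, and the model ignores the fact that overlapped work competes for CUs, bandwidth and cache, so real gains would be smaller. The millisecond figures are illustrative, not measurements.

```python
def frame_time_ms(geometry_ms: float, compute_ms: float,
                  overlap: float = 0.0) -> float:
    """Frame time when `overlap` of the shorter workload hides behind the longer one."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    return geometry_ms + compute_ms - overlap * min(geometry_ms, compute_ms)


def geometry_active_fraction(geometry_ms: float, compute_ms: float,
                             overlap: float = 0.0) -> float:
    """Share of the frame during which the geometry units have work."""
    if geometry_ms <= 0.0:
        raise ValueError("geometry_ms must be positive")
    return geometry_ms / frame_time_ms(geometry_ms, compute_ms, overlap)


# Illustrative numbers: 8 ms of geometry-heavy passes, 8 ms of lighting/post.
serial = frame_time_ms(8.0, 8.0)                    # no async compute
overlapped = frame_time_ms(8.0, 8.0, overlap=1.0)   # ideal async overlap
print(serial, geometry_active_fraction(8.0, 8.0))
print(overlapped, geometry_active_fraction(8.0, 8.0, overlap=1.0))
```

In the serial case the geometry units are busy for half the frame. In the ideal overlapped case they are busy for all of it, which is the "whole frame instead of <50%" point above.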