Sorry, for context what I thought you were proposing was adding AVX-512 to the existing console CPU cores. Those Jaguar cores are very similar to Silvermont in size and performance, which is where my comparison to Knights Landing came from.
That said, 8 Broadwell-level cores plus AVX-512 would be an enormous jump in die area, power consumption, and cost for the consoles. Given that Zen is already at roughly Broadwell-level performance, you're basically proposing Zen plus AVX-512. Eight Zen cores already draw 95W; even with a die shrink and reduced clock speeds, adding AVX-512 on top would take up a massive chunk of the console's thermal budget. Again, that thermal budget could be better spent on extra GPU shaders.
8 fully fleshed-out Broadwell cores with AVX-512 would be a substantial jump in die area and power consumption, but I wasn't aiming for that either. A cut-down Haswell core would be my ideal starting point: something in between Sandy Bridge and Haswell, then work our way up from there to extend those cores with AVX-512 and bug-free TSX, with higher clocks, for a 16-core part ...
All of that seems feasible within a budget of around ~5 billion transistors on 5nm GAAFETs, and we really shouldn't hold back on the aggressive target of keeping game logic under 16.6ms, since this may very well be our last console generation as we reach a plateau in transistor technology. I don't want AMD to hold back either; they could realistically score a home run with new consoles in 202X having AVX-512 ...
Remember that with modern console GPUs, multiple compute and graphics tasks can be scheduled simultaneously; you don't need a task wide enough to saturate the entire GPU.
It's just not wise to use the GPU for everything when all a game programmer wants to do is hit performance targets like frametimes, and GPUs just don't quite provide the low latencies that a wider SIMD extension could ...
In fact, you could actually hurt performance if you naively port moderate-sized workloads from the CPU to the GPU: there's a real possibility of increasing your frametimes, which is just bad news when sticking with the CPU would have kept them lower!
The latency of launching and waiting for a GPU task is still far higher, of course. But I feel that the subset of problems which are wide enough for 8-core 32-element CPU SIMD to be valuable, while not being wide enough for a GPU compute task, is really not big enough to justify the added complexity.
Which brings me back to the above! I do not believe AVX-512 is merely a practical subset of what GPUs offer; there are many areas a GPU can't touch when it comes to delivering a speedup in the applications a CPU would normally excel at ...
Let's say the field of applications is shaped like a square. Corners 1 and 2 can be accelerated with a GPU, but it can't touch corners 3 and 4. Corner 3 is currently occupied by the CPU, but no processor has been able to touch corner 4 until now, with an extension like AVX-512 showing a definitive advantage on this new family of CPUs ...
It is my view that AVX-512 is meant to reach spots that neither a plain CPU nor a GPU ever could, bringing new performance heights across a continuum of applications with different bottlenecks (whether high throughput, bandwidth, latency, etc.), as laid out by the four corners in my example ...
As for the programming model... eh. I've played around with vectorizing compilers in the past, and they're great up to a certain point. You can massage an awful lot of things into autovectorized loops. But then another developer comes along with a bug to fix or a feature to add, slaps an innocuous if/else branch into what looks like a regular old "for" loop, and all of a sudden the vectorization becomes dramatically less efficient... if it doesn't fail to vectorize entirely. Getting good performance out of that vectorized code, and maintaining that good performance, means that you basically need to think like a GPU developer anyway. And if I need my whole team to think like a GPU developer, I'd rather use a GPU-focused API and make it more explicit.
(That being said, I'm not a console developer, I'm a desktop CPU + CUDA developer. So take what I say with a pinch of salt.)
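To make that concrete, here's a rough sketch of the failure mode I mean (toy functions, not from any real codebase; whether the second and third loops vectorize well depends heavily on compiler and target):

```cpp
#include <cstddef>

// Clean case: gcc/clang at -O3 vectorize this without complaint.
// (__restrict is a compiler extension promising the arrays don't alias.)
void scale(float* __restrict out, const float* __restrict in,
           float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

// The "innocuous" edit: a conditional store. Without masked stores the
// compiler falls back on blends or read-modify-write tricks; AVX-512
// masking keeps it cheap, but only if the compiler recognizes the pattern.
void scale_positive_only(float* __restrict out, const float* __restrict in,
                         float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (in[i] > 0.0f)
            out[i] = in[i] * k;
}

// Worse still: an early exit adds a loop-carried control dependence,
// and most autovectorizers give up on the whole loop.
float sum_until_negative(const float* in, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] < 0.0f) break;
        s += in[i];
    }
    return s;
}
```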
That's not true at all. The most common thing you have to worry about on both is getting a good data layout; that's practically half the setup work right there. As long as you can ensure the data sets are large enough, you will almost certainly get a speedup with AVX-512 from the compiler side alone, whereas on a GPU you have to sit through many trials of radically GPU-centric algorithms just to get it running efficiently (gotta be careful about that register pressure and those memory access patterns). You can mix and match scalar and vector instructions fairly freely on CPUs, but doing that on a GPU rears its ugly head with one bad surprise after another, and there are far fewer ways to synchronize across threads/waves ...
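To illustrate the data-layout point with a toy sketch (names made up, nothing from a real engine): the same AoS-to-SoA change that makes AVX-512 autovectorization trivial is exactly what you'd want for coalesced GPU accesses anyway, so that half of the work carries straight over.

```cpp
#include <cstddef>

// Array-of-structs: fields are interleaved, so a SIMD load of all the
// "x" values needs strided gathers.
struct ParticleAoS { float x, y, z, mass; };

// Struct-of-arrays: each field is contiguous. Unit-stride loads for
// AVX-512, coalesced accesses on a GPU: the same layout work pays twice.
struct ParticlesSoA {
    float* x;
    float* y;
    float* z;
    float* mass;
};

// With SoA this loop is trivially autovectorizable: unit stride,
// independent iterations, no gathers needed.
void integrate_x(ParticlesSoA p, const float* vx, float dt, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        p.x[i] += vx[i] * dt;
}
```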
I do like the improvements in AVX-512 from a developer point of view. Masked operations? Scatter support? This is exactly what you need to make autovectorization more feasible. But those features are going to add more and more complexity to the vector units. The number of (named) vector registers has doubled, and each vector is twice as wide; you're up to 2KB of named register space per core. And then you have the additional mask registers on top too. We haven't yet seen how this is reflected in the physical register file, but I can only assume that this will also have grown significantly in SKX. They certainly had to significantly boost L2 cache size in order to deal with the increased cache pressure. And the actual execution units will have to increase in complexity in order to deal with the new masked operations: there will have to be wiring for this additional operand, along with more complication in the scheduling and retiring.
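For reference, this is roughly what those mask registers buy you at the intrinsics level; a hand-written AVX-512F version (assuming n is a multiple of 16, to keep it short) of the conditional-store loop from my earlier sketch:

```cpp
#include <immintrin.h>
#include <cstddef>

// Scale only the positive elements of in[], leaving the rest of out[]
// untouched. Compile with -mavx512f.
void scale_positive(float* out, const float* in, float k, std::size_t n) {
    const __m512 vk   = _mm512_set1_ps(k);
    const __m512 zero = _mm512_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        __m512    v = _mm512_loadu_ps(in + i);
        // One compare writes a 16-lane predicate into a mask register.
        __mmask16 m = _mm512_cmp_ps_mask(v, zero, _CMP_GT_OQ);
        // The multiply and the store both take the mask directly; no
        // separate blend instructions as on AVX2 and earlier.
        __m512    r = _mm512_mask_mul_ps(v, m, v, vk);
        _mm512_mask_storeu_ps(out + i, m, r);
    }
}
```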
And that's exactly what game programmers have had on their wishlist for a long time!
I can understand having to increase the L2 cache to reduce bank conflicts, and widening the load and store ports to 512 bits to sustain the throughput, but AVX-512 isn't ballooning out of control in complexity like you seem to imply. Even the quadrupled register file may look hefty at first, but it's meager compared to the amount GPUs have to dedicate, and GPUs make it easier than you think to double the number of ALUs. Plus, once we get to denser transistor sizes, the so-called 'complexity' will become vanishingly small, just as it has today with x87, MMX, and SSE ...
I just don't think it's a sensible trade off to make in a console. It's a game of tradeoffs, working within constraints to provide the optimal solution. AVX-512 would be a nice feature, but would you really choose it over 20% more GPU shaders?
I don't see how AVX-512 is NOT worthwhile in the future, when CPUs will take up smaller and smaller portions of the die, if console hardware designers only target a modest 85% of Haswell's IPC (basically Sandy Bridge) and bake AVX-512 into those cores. All of this sounds very doable within a ~5 billion transistor budget (while getting 16 cores in the process!) when you consider that a Broadwell core is twice as fat (20mm^2, using Xeon E5 as a reference) as a Knights Landing core with its two AVX-512 units (10mm^2), despite the former having a quarter of the SIMD width (both built on the same 14nm process) ...
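Spelling out that back-of-the-envelope arithmetic; the per-core area is the Knights Landing figure I quoted above, while the 14nm-to-5nm density factor is just a round-number guess on my part:

```cpp
#include <cstdio>

int main() {
    // Figures quoted above: KNL core with two AVX-512 units, on 14nm.
    const double knl_core_mm2_14nm = 10.0;
    const int    cores             = 16;
    // Assumed 14nm -> 5nm density gain; a round-number guess, nothing more.
    const double density_scale     = 5.0;

    const double cpu_mm2_5nm = knl_core_mm2_14nm * cores / density_scale;
    std::printf("~%.0f mm^2 for 16 AVX-512 cores at 5nm\n", cpu_mm2_5nm);
}
```

That lands right around the ~30mm^2 ballpark, which is why I call it peanuts.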
If the new consoles were built around 5nm GAAFETs, the CPUs would most likely be worth 30mm^2 at MOST, so console manufacturers would really be paying peanuts for AVX-512 in the end. And if some game programmers don't want to mess around with AVX-512, they can just turn the functionality off in exchange for higher clocks, so it's a win-win scenario all around as time goes on. Plus, console manufacturers wouldn't face another round of codebase fragmentation if they plan to upgrade again (the PS4 Pro is feeling this, since its GPU doesn't have the same ISA, though I wonder if it's backwards compatible). Increasing IPC is rather transparent at the compiler level when you share the same ISA, so if console manufacturers want, they can go in the direction of upgrading the CPU microarchitecture instead ...
I don't see the big deal in paying a little (low risk) to see big gains (high potential reward), and that's exactly how I view AVX-512 going forward. I'm sure every once in a while an engine programmer would like to see their engine hit the 16.6ms threshold just to reach the magical 60FPS, in case they ever want it.
Besides, if we look at the PS4 APU, the Jaguar cores consume roughly 30% of the die, so console manufacturers evidently don't share your vision of an extremely skewed CPU-to-GPU compute ratio being ideal as it is, and lots of game programmers don't find any fun in that either (aside from maybe the graphics programmers).