Speculation: i9-9900K is Intel's last hurrah in gaming


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Regarding Ryzen's IPC deficit, I posted the following summary of the 3D Particle Movement benchmark in another thread, but I think it is worth posting here as well, considering Ryzen 3000's prospects in gaming and the proposition in the poll.

3DPM is a benchmark written by AnandTech writer Ian Cutress (based on algorithms developed for his PhD work), and it is consistently used in their CPU reviews. As a rationale for the benchmark, he claims it provides "a good idea on how instruction streams are interpreted by different microarchitectures".

My summary below includes both the old and new version, and focuses solely on Ryzen 2700X vs i9-9900K (both 8C/16T). It is remarkable how far behind Ryzen is on the old version, and equally remarkable how well it catches up on the new optimised version.



In this particular benchmark, Zen 2 does not need to do much to overtake Skylake/CFL on optimised non-AVX code, but needs more to overtake on AVX code, and a lot more to overtake on old and poorly written code.

http://www.anandtech.com/show/13400/intel-9th-gen-core-i9-9900k-i7-9700k-i5-9600k-review


For those interested in more discussion about Zen 2 IPC, this video presentation does a microarchitecture analysis based on x86 assembler guru Agner Fog's work:

 

SgtSpoon

Member
Dec 25, 2007
69
2
71
I don't think they will match Intel's performance. But that's just me placing my bet.

But they don't necessarily have to -beat- Intel on IPC or clocks. Only a small margin of people want to have the absolute fastest CPU; for all others, the price/performance ratio is of more importance ...
 
Reactions: Tlh97 and Vattila

Omegaboost

Member
Oct 24, 2016
35
6
71
Zen2 won't be able to beat Intel in gaming due to latency incurred by the infinity fabric. 10nm Icelake chips will also feature a 50% bigger L1 cache & double the L2 cache, so IPC will go up another ~10-15% on top of Skylake.
 
Reactions: Vattila

DrMrLordX

Lifer
Apr 27, 2000
21,812
11,165
136
Zen2 won't be able to beat Intel in gaming due to latency incurred by the infinity fabric.

How do you figure?

If the rumors are true, even 8c/16t Matisse will have a massive L3 cache, which will be unaffected by IF speeds/latency. Also, the 8c chiplet design will mean, no more 2xCCX design for an 8c chip. So no IF penalty from thread-hopping between CCXs. It will be like having one 8c CCX, with something like 32 MB of cache. On top of all that, it will be running at higher clocks, and the uarch itself will have improvements that will raise overall IPC (not taking into account the L3 cache). And on top of that, IF speeds will probably be higher, at least from the IMC hopefully supporting higher memory speeds on the desktop (AMD would be insane not to go that route). Hell even if they don't change the available memory speeds/timings much and don't do a whole lot to increase IF speeds, you will see that memory latency on existing Ryzen/Ryzen+ is not that bad - certainly not as bad as inter-CCX latency penalties. It's pretty easy to get sub-65ns latencies with a 2700x and tuned DDR4-3466 or 3600. That is not holding back the 2700x in gaming.

Sure, you will still have IF penalties on the possible 16c/32t chips due to them having two chiplets. If the scheduler can't figure out how to keep a game running on one chiplet, then good grief (or just activate gaming mode like the Threadripper folks).
 

Omegaboost

Member
Oct 24, 2016
35
6
71
How do you figure?

If the rumors are true, even 8c/16t Matisse will have a massive L3 cache, which will be unaffected by IF speeds/latency. Also, the 8c chiplet design will mean, no more 2xCCX design for an 8c chip. So no IF penalty from thread-hopping between CCXs. It will be like having one 8c CCX, with something like 32 MB of cache. On top of all that, it will be running at higher clocks, and the uarch itself will have improvements that will raise overall IPC (not taking into account the L3 cache). And on top of that, IF speeds will probably be higher, at least from the IMC hopefully supporting higher memory speeds on the desktop (AMD would be insane not to go that route). Hell even if they don't change the available memory speeds/timings much and don't do a whole lot to increase IF speeds, you will see that memory latency on existing Ryzen/Ryzen+ is not that bad - certainly not as bad as inter-CCX latency penalties. It's pretty easy to get sub-65ns latencies with a 2700x and tuned DDR4-3466 or 3600. That is not holding back the 2700x in gaming.

Sure, you will still have IF penalties on the possible 16c/32t chips due to them having two chiplets. If the scheduler can't figure out how to keep a game running on one chiplet, then good grief (or just activate gaming mode like the Threadripper folks).

If the Matisse CCX gets upgraded to 8 cores then sure, it could match Intel's ringbus in gaming, but that won't happen anytime soon. The current Matisse rumors are fake.

Every core in a CCX communicates directly with every other core, so scaling the core count per CCX isn't as simple as just adding more cores: they all have to be connected to each other somehow, and that takes up a lot of die space and power.

This is why Ryzen started with 4 cores per CCX: each core only needs 3 connections. Ramp it up to 6 cores and each needs 5 connections; at 8 cores, 7 connections. In an 8-core CCX config, there could be as much die space dedicated to inter-core communication as to the actual cores.
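To put numbers on the connection counts above, here is a minimal sketch. It assumes a fully connected topology where every core has a dedicated link to every other core, which is the premise of this post and not a confirmed description of how AMD wires a CCX. The function names are made up for illustration.

```python
def links_per_core(cores: int) -> int:
    """Direct links each core needs to reach every other core."""
    return cores - 1


def total_links(cores: int) -> int:
    """Distinct core-to-core links in a fully connected group."""
    return cores * (cores - 1) // 2


for n in (4, 6, 8):
    print(f"{n} cores: {links_per_core(n)} links per core, "
          f"{total_links(n)} links in total")
```

Under that assumption the total goes from 6 links at 4 cores to 15 at 6 cores and 28 at 8 cores, so it grows roughly with the square of the core count while the per-core figure only grows linearly.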

Intel's ringbus is far more efficient but AMD can't copy it since it's patented.
 
Reactions: Vattila

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
With Intel's ring bus the latency naturally increases with the distance to the L3$. Per Agner Fog, within a CCX all cores have the same latency of 40 cycles to access L3$, whereas with Skylake it varies between 34 and 85 cycles depending on distance. Increasing the number of cores within a CCX will increase this latency as well. There are other areas where actual improvements can be introduced instead of giving up the 4-core CCX design.
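A toy model of why distance matters on a ring: the sketch below assumes a bidirectional ring with one L3 slice per stop and traffic taking the shorter direction. It only counts hops; it does not model cycles per hop, and it is not meant to reproduce Intel's actual layout or the cycle figures quoted above. The function names are hypothetical.

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Hops between two stops on a bidirectional ring (shorter way round)."""
    d = abs(src - dst) % stops
    return min(d, stops - d)


def hop_profile(src: int, stops: int) -> list[int]:
    """Hop count from one stop to every stop on the ring, itself included."""
    return [ring_hops(src, dst, stops) for dst in range(stops)]


profile = hop_profile(0, 8)
print(profile)       # hops from stop 0 to each of the 8 stops
print(max(profile))  # worst case is half way round the ring
```

In this model the worst-case distance is half the number of stops, so adding stops stretches the spread between the nearest and farthest slice.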
 
Reactions: Tlh97 and Vattila

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Intel's ringbus is far more efficient but AMD can't copy it since it's patented.

AMD and Intel have a comprehensive cross-patent agreement, so I don't think it would be a patent issue. Intel has ring-bus experience and optimisations though, which may be tough to match. Better to optimise the 4-core CCX and IF interconnect, I guess. So, I still think the 8-core chiplet in the rumoured 8+1 chiplet design for "Rome" will consist of two 4-core CCXs. But we will soon see.
 
Reactions: Tlh97 and Gideon

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Zen2 won't be able to beat Intel in gaming due to latency incurred by the infinity fabric. 10nm Icelake chips will also feature a 50% bigger L1 cache & double the L2 cache, so IPC will go up another ~10-15% on top of Skylake.

When do you expect 10nm Ice Lake on the desktop? And does your pessimistic prediction, based on IF latency, include modern well-written code? It looks to me that modern Zen-aware code can avoid much of the inter-CCX penalty, and that on well-written code it would not take much for Zen 2 to overtake Skylake and challenge Ice Lake (the opponent against which it is designed to compete). For an example of how much effect coding has, see my earlier post about the 3D Particle Movement benchmark.
 

Despoiler

Golden Member
Nov 10, 2007
1,966
770
136
If the Matisse CCX gets upgraded to 8 cores then sure, it could match Intel's ringbus in gaming, but that won't happen anytime soon. The current Matisse rumors are fake.

Every core in a CCX communicates directly with every other core, so scaling the core count per CCX isn't as simple as just adding more cores: they all have to be connected to each other somehow, and that takes up a lot of die space and power.

This is why Ryzen started with 4 cores per CCX: each core only needs 3 connections. Ramp it up to 6 cores and each needs 5 connections; at 8 cores, 7 connections. In an 8-core CCX config, there could be as much die space dedicated to inter-core communication as to the actual cores.

Intel's ringbus is far more efficient but AMD can't copy it since it's patented.

AMD already has solutions for bus topology. They aren't going to have a disadvantage given their chiplets strategy. AdoredTV does a good job of explaining where AMD is going. Keep in mind that Zen is the first chip AMD has released of the series. It's the most rudimentary design to get it out the door.

 
Reactions: Tlh97 and Vattila

Omegaboost

Member
Oct 24, 2016
35
6
71
When do you expect 10nm Ice Lake on the desktop? And does your pessimistic prediction, based on IF latency, include modern well-written code? It looks to me that modern Zen-aware code can avoid much of the inter-CCX penalty, and that on well-written code it would not take much for Zen 2 to overtake Skylake and challenge Ice Lake (the opponent against which it is designed to compete). For an example of how much effect coding has, see my earlier post about the 3D Particle Movement benchmark.

Icelake for desktop will be out next year, probably Q3/Q4 like the original Coffeelake.

Modern games are all optimized for consoles first then ported to PC. Current PS4/Xbox1 consoles are both using 8 core Jaguar SoCs with no IF, therefore there will be no special IF optimization code until next gen consoles get upgraded to Zen2 cores.

Intel dominates the PC gaming market (with 83% market share: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam) so there isn't much incentive for game devs to optimize for Ryzen IF (the windows scheduler is already programmed to do so).

Even if there is IF optimized game code which reduces communication between CCXs, the IF is still being used to reach dram. Unless AMD adds more L3 or an L4 cache, there is no way around that extra latency.
 
Reactions: Vattila

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
Modern games are all optimized for consoles first then ported to PC.
One of the reasons they went with x86 cores was to not have to port the games anymore (at least the CPU part), and that is what is happening: we are running Jaguar-constrained IPC code on hundreds of dollars' worth of CPU and wondering why we get far less FPS than what we used to get years ago.
 

exquisitechar

Senior member
Apr 18, 2017
666
904
136
All of the recent Rome diagrams are not leaks, just idle speculation from a bored "engineer". Here is the source: https://twitter.com/chiakokhua/status/1057166516548857856

He's probably invested in AMD stock. Stop believing in leaks unless they're official documents. Fact remains that an 8 core CCX is too complicated & inefficient.
He is invested in AMD stock and has never pretended otherwise; similarly, he's never pretended that his diagrams are leaks.

I am predicting a 4 core CCX, though. Right now, there aren't many rumors about it either way. Regardless, the cross CCX penalty in gaming is overstated, IMO. There are bigger issues that AMD should have fixed.
 
Reactions: Tlh97 and Vattila
May 11, 2008
20,060
1,292
126
Icelake for desktop will be out next year, probably Q3/Q4 like the original Coffeelake.

Modern games are all optimized for consoles first then ported to PC. Current PS4/Xbox1 consoles are both using 8 core Jaguar SoCs with no IF, therefore there will be no special IF optimization code until next gen consoles get upgraded to Zen2 cores.

Intel dominates the PC gaming market (with 83% market share: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam) so there isn't much incentive for game devs to optimize for Ryzen IF (the windows scheduler is already programmed to do so).

Even if there is IF optimized game code which reduces communication between CCXs, the IF is still being used to reach dram. Unless AMD adds more L3 or an L4 cache, there is no way around that extra latency.

The Jaguar SoCs in the consoles also use a 2x4-core module design.
https://en.wikipedia.org/wiki/Jaguar_(microarchitecture)
https://en.wikichip.org/wiki/microsoft/scorpio_engine
 
Reactions: Tlh97 and Vattila

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
One of the reasons they went with x86 cores was to not have to port the games anymore (at least the CPU part), and that is what is happening: we are running Jaguar-constrained IPC code on hundreds of dollars' worth of CPU and wondering why we get far less FPS than what we used to get years ago.
This is news to me. Can you expand? What is Jaguar constrained code?

Are you saying that the code developed for a slower machine will not run much better on a superior CPU?
 

DrMrLordX

Lifer
Apr 27, 2000
21,812
11,165
136
All of the recent Rome diagrams are not leaks, just idle speculation from a bored "engineer".

So your own speculation about what will hobble Matisse is more credible because . . . ?

Fact remains that an 8 core CCX is too complicated & inefficient.

Like an i9-9900k? And don't use the ring bus to obfuscate, you darn well know what the penalties are for expanding core count with a ring bus.

This is news to me. Can you expand? What is Jaguar constrained code?

Are you saying that the code developed for a slower machine will not run much better on a superior CPU?

I can tell you that Dragon Quest XI - a PlayStation 4 title - runs at over 60 fps in 1440p with all the settings turned up on my 1800x @ 4.0 GHz + Vega FE. Good luck doing that on a PS4.
 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
This is news to me. Can you expand? What is Jaguar constrained code?

Are you saying that the code developed for a slower machine will not run much better on a superior CPU?
I'm saying that if a game, for example, is made to use one instruction per cycle because that's the max the CPU it's designed for can produce, then it will run at 1 IPC no matter how much IPC your core is capable of.

This is a game made for desktop only; you can see it uses 2.4 IPC on one core running one thread.
This is Deus Ex in DX12 running on a single core. Made for consoles, it uses ~0.7 IPC even though it has a bunch of threads. Gaming threads can utilize 2.4 IPC, as seen in the previous example, so just imagine how games would run if they used all the available IPC, or at least more than now.
PCM is a tool to measure a lot of things, like cache misses and other events, and in this case IPC.
PCM (Processor Counter Monitor)
Intel site for info
https://software.intel.com/en-us/articles/intel-performance-counter-monitor
Github for downloading.
https://github.com/opcm/pcm
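
For anyone unsure what the IPC figure means, a minimal sketch: IPC is instructions retired divided by core cycles over the same interval. The counter values below are made up to land on the two figures mentioned above, and the function is a hypothetical helper, not part of PCM's interface.

```python
def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per cycle over one measurement interval."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


# Hypothetical counter readings for a one-billion-cycle interval.
desktop_game = ipc(2_400_000_000, 1_000_000_000)
console_port = ipc(700_000_000, 1_000_000_000)

print(f"desktop-only game: {desktop_game:.1f} IPC")
print(f"console port:      {console_port:.1f} IPC")
```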
 

Despoiler

Golden Member
Nov 10, 2007
1,966
770
136
Icelake for desktop will be out next year, probably Q3/Q4 like the original Coffeelake.

Modern games are all optimized for consoles first then ported to PC. Current PS4/Xbox1 consoles are both using 8 core Jaguar SoCs with no IF, therefore there will be no special IF optimization code until next gen consoles get upgraded to Zen2 cores.

Intel dominates the PC gaming market (with 83% market share: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam) so there isn't much incentive for game devs to optimize for Ryzen IF (the windows scheduler is already programmed to do so).

Even if there is IF optimized game code which reduces communication between CCXs, the IF is still being used to reach dram. Unless AMD adds more L3 or an L4 cache, there is no way around that extra latency.

Infinity Fabric is an interconnect. It's not subject to any code external to AMD. It's not software. Zen-specific optimizations, or CPU optimizations in general, usually have to do with how to make efficient use of the cache sizes/hierarchy. For instance, Ashes of the Singularity had to be patched because its existing code often caused the chip to flush all of its cache. You can imagine that significantly decreased performance.
 
May 11, 2008
20,060
1,292
126
I'm saying that if a game, for example, is made to use one instruction per cycle because that's the max the CPU it's designed for can produce, then it will run at 1 IPC no matter how much IPC your core is capable of.

This is a game made for desktop only; you can see it uses 2.4 IPC on one core running one thread.
This is Deus Ex in DX12 running on a single core. Made for consoles, it uses ~0.7 IPC even though it has a bunch of threads. Gaming threads can utilize 2.4 IPC, as seen in the previous example, so just imagine how games would run if they used all the available IPC, or at least more than now.
PCM is a tool to measure a lot of things, like cache misses and other events, and in this case IPC.
PCM (Processor Counter Monitor)
Intel site for info
https://software.intel.com/en-us/articles/intel-performance-counter-monitor
Github for downloading.
https://github.com/opcm/pcm

I strongly doubt that the generated assembly code for a Jaguar core is used for the desktop PC version of a game.
What they do reuse is the high-level source, in languages like C or C++.
And yes, the way game developers code small routines can be optimized for maximum throughput on, for example, Jaguar.
But that is more about making sure the code that needs to run fast runs from cache as much as possible, with as many cache hits as possible.
Also, rearranging instructions through compiler settings, to get maximum throughput by using all the available ports and execution units that can run in parallel independently, is an option.
But that is reflected in the raw assembly code generated by the compiler and linker, not in the high-level language sources.

For a desktop version, the game needs to be recompiled anyway. And then it is a matter of setting the right compiler and linker options.
But on the desktop, generic x86 code also needs to be generated for maximum compatibility with all x86 CPUs, even when using compiler flags that optimize for a specific CPU design from a given manufacturer.
 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
I strongly doubt that the generated assembly code for a Jaguar core is used for the desktop PC version of a game.
What they do reuse is the high-level source, in languages like C or C++.
And yes, the way game developers code small routines can be optimized for maximum throughput on, for example, Jaguar.
But that is more about making sure the code that needs to run fast runs from cache as much as possible, with as many cache hits as possible.
Also, rearranging instructions through compiler settings, to get maximum throughput by using all the available ports and execution units that can run in parallel independently, is an option.
But that is reflected in the raw assembly code generated by the compiler and linker, not in the high-level language sources.

For a desktop version, the game needs to be recompiled anyway. And then it is a matter of setting the right compiler and linker options.
But on the desktop, generic x86 code also needs to be generated for maximum compatibility with all x86 CPUs, even when using compiler flags that optimize for a specific CPU design from a given manufacturer.
No matter why or what, the bottom line is still that games only use a very small fraction of the available IPC of a desktop CPU.
If there is any recompiling, it doesn't change a thing: when a game runs 4 or more threads and only uses a fraction of the IPC that one core is capable of, then there is something very wrong.
How do you explain the crappy game using so much more IPC? It's made with 2D Fighter Maker 2002, a game engine from 2002, so we can be pretty sure that it runs the most generic, most crappy code in existence, yet it's capable of using all the IPC of a core because it was made for PC and PC only.
There is no other explanation I can think of than that modern game engines produce code that is meant to run on very weak (low-IPC) cores, be it Jaguar or arc or whatever.
 
Reactions: guachi
May 11, 2008
20,060
1,292
126
No matter why or what, the bottom line is still that games only use a very small fraction of the available IPC of a desktop CPU.
If there is any recompiling, it doesn't change a thing: when a game runs 4 or more threads and only uses a fraction of the IPC that one core is capable of, then there is something very wrong.
How do you explain the crappy game using so much more IPC? It's made with 2D Fighter Maker 2002, a game engine from 2002, so we can be pretty sure that it runs the most generic, most crappy code in existence, yet it's capable of using all the IPC of a core because it was made for PC and PC only.
There is no other explanation I can think of than that modern game engines produce code that is meant to run on very weak (low-IPC) cores, be it Jaguar or arc or whatever.

Well, why, generally speaking, is some software on the PC bloated these days?
Because there is less optimization, or because it uses a large number of DLL libraries.

And the comparison is flawed. If you run a DOS program from 2000, it will run incredibly fast. It is the same for your Fighter Maker program.
Also, the method of measuring is just as important as the data you want to measure.
I wonder whether the measuring program you use is capable of detecting when a thread stalls.
I mean, if I measure the IPC for a certain amount of time and average the result over that time, thread stalling may make the IPC result lower than it actually is.
There is a lot of switching between threads, and the OS is keeping track of everything.
To examine this properly, the method of measuring must be known first.
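
To illustrate the averaging concern, a minimal sketch with made-up numbers: the same set of samples gives a different IPC depending on whether intervals where the thread was waiting are counted in the denominator. The `Sample` type and its `stalled` flag are hypothetical; whether a given tool such as PCM includes such intervals is exactly the open question here, and this only shows why it matters.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    instructions: int  # instructions retired in this interval
    cycles: int        # cycles elapsed in this interval
    stalled: bool      # hypothetical flag: thread was waiting, not running


def ipc_over_window(samples: list[Sample]) -> float:
    """IPC averaged over the whole window, waiting time included."""
    cycles = sum(s.cycles for s in samples)
    return sum(s.instructions for s in samples) / cycles if cycles else 0.0


def ipc_while_running(samples: list[Sample]) -> float:
    """IPC counting only the intervals where the thread actually ran."""
    return ipc_over_window([s for s in samples if not s.stalled])


# Made-up trace: the thread runs at 2.0 IPC half the time and waits the rest.
trace = [
    Sample(2_000_000, 1_000_000, False),
    Sample(0, 1_000_000, True),
    Sample(2_000_000, 1_000_000, False),
    Sample(0, 1_000_000, True),
]

print(ipc_over_window(trace))    # whole-window average
print(ipc_while_running(trace))  # running intervals only
```

With this trace the whole-window figure is half of the running-only figure, even though the code executes at the same rate whenever it is actually on the core.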
 
Reactions: ryan20fun