There are actual metrics of the node itself, measured the same way across foundries, that you could use to compare Intel 4 and TSMC's nodes.
Ironically enough, despite Intel 4 having ~15%(?) lower density than TSMC 5nm, the 512KB data array in RWC is smaller than the one in Zen 4. Same story with Zen 3 vs ADL for the L2 SRAM. Design stuff, I guess.
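To put some rough numbers on that: here is a back-of-envelope sketch (my own illustration; the ~0.024 µm² Intel 4 and ~0.021 µm² TSMC N5 high-density 6T bit-cell sizes are approximate public figures, not from this thread) counting only the data bits of a 512KB array, with none of the redundancy, ECC, decoders or sense amps where the real design differences live:

```c
/* Back-of-envelope: cell-limited area of a 512KB data array.
 * Bit-cell sizes are approximate public figures; only raw data cells
 * are counted -- no redundancy, ECC, peripheral circuitry or banking. */
#include <stdio.h>

int main(void)
{
    const double intel4_cell_um2 = 0.024;   /* reported Intel 4 HD 6T cell, approx. */
    const double n5_cell_um2     = 0.021;   /* reported TSMC N5 HD 6T cell, approx. */
    const double bits = 512.0 * 1024 * 8;   /* 512KB of data bits, no ECC */

    printf("Intel 4, cells only: %.3f mm^2\n", bits * intel4_cell_um2 / 1e6);
    printf("TSMC N5, cells only: %.3f mm^2\n", bits * n5_cell_um2 / 1e6);
    printf("Cell-area gap: ~%.0f%%\n", 100.0 * (intel4_cell_um2 / n5_cell_um2 - 1.0));
    return 0;
}
```

The cells-only gap comes out around 14%, yet the shipping arrays can still end up the other way around, which is exactly the "design stuff" part.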
Integer performance: Integer performance is very difficult to increase, and improvement in Integer performance indicates the "quality" of the core. It is a complex combination of latency, frequency, balance of execution units, and branch prediction (and pipeline length), across all areas. You cannot make the L1 cache too big, as that increases latency, but it can't be too small either. You cannot just add branch targets either: latency comes into play, and so does the quality of the prediction algorithm. Decoders won't scale without beefing up the rest of the machine.
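To make the cache size/latency trade-off concrete, here is a minimal pointer-chase sketch (my own illustration, assuming Linux with gcc/glibc; the sizes and iteration count are arbitrary choices). Every load depends on the previous one, so the time per load roughly tracks the load-to-use latency of whichever cache level the working set fits in, and the latency cliff past L1 (and then L2) shows up immediately:

```c
/* Pointer-chase latency sketch (Linux/gcc assumed). Walks a randomly
 * permuted cycle of pointers so every load depends on the previous one;
 * average time per load approximates the latency of the cache level
 * that the working set fits in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t bytes, size_t iters)
{
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));
    if (!buf || !idx) return -1.0;

    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)                  /* link cells into one big cycle */
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                            /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile void *sink = p; (void)sink;            /* keep the chase from being optimized out */
    free(buf); free(idx);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void)
{
    /* Sweep working sets from well inside L1 to well past a typical L2. */
    for (size_t kb = 16; kb <= 4096; kb *= 2)
        printf("%6zu KB : %.2f ns/load\n", kb, chase_ns(kb * 1024, 20 * 1000 * 1000));
    return 0;
}
```

Make the L1 bigger on paper and you either eat extra cycles of load-to-use latency on every hit or give up frequency; this little loop is exactly the kind of code that punishes either choice.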
How do you increase the top speed of a car from 200km/h to 300km/h? Just double the engine displacement and horsepower? No. Aerodynamics needs to be greatly improved, since wind resistance is the limit at high speeds. You need the transmission to keep up so it does not fail, and to shift fast and seamlessly. And you need to do all of that while making the car lighter. You also need a capable driver, because otherwise an accident might happen. Not a simple problem at all.
The 787 achieved a ~20% fuel-burn reduction by moving to composite materials rather than just aircraft aluminum for the airframe. Then they had to move the batteries to lighter lithium-ion technology. The engines were replaced with bigger, slower-turning, higher-bypass turbofans. And they had to do aerodynamic work using CFD as well.
That's why Integer performance is the most important aspect of a general-purpose CPU. It is said that to chase a 1% improvement in performance, CPU architects do what can only be called "heroic" work. It's amazing what they are doing now. The work pays off, though. Improving Integer performance, i.e. the uarch, benefits everything: deep learning, floating point, word processing, gaming, emulation, snappiness.
That's why it's absurd when mega-corporations mistreat employees, especially veteran ones. These kinds of decisions require very seasoned, experienced architects with 30+ years in the job. That's an entire lifetime doing nothing but being a CPU architect, and being at the top of the field at that!
Density: The L2 array is not the LLC (Last Level Cache) anymore; it is partway a core cache. It has more stringent voltage and power requirements, so the cells used aren't exactly the bog-standard ones used for L3.
They could have improved the density aspect relative to the competition in Redwood Cove, making it more favorable against chips such as Zen 4. This is me speculating based on what you said, though.
Also, there is a fundamental limit. A company that does a better job of scaling down will reach the limits faster than those that don't. You could argue the limits "democratize" compute and make the latest technologies available to smaller companies and those with fewer resources.
That's why DRAM has been on the 10nm-class node (10-19nm) for 5 years now: 10x, 10y, 10z, 10a, 10b, and they even talk about 10γ! They will reach a decade on it, since Micron just announced 10b (1β, "one-beta") availability as the most advanced node. 10x = 17-18nm depending on the manufacturer.
Very much a possibility the designations are as follows:
10x = 17-18nm
10y = 16-17nm
10z = 15-16nm
10a = 14-15nm
10b = 13-14nm
10γ = 12-13nm
1nm improvement per "generation".
Why? Because DRAM at the 10nm class is far, far denser than DRAM built on a logic process (3x the density of eDRAM), and that in turn is denser than logic cells. TSMC is showing almost no SRAM gains on the N3 node, meaning even SRAM is hitting hard limits. Logic isn't hitting them yet because it's less dense.
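For a rough sense of the gap, here's a back-of-envelope sketch (my own numbers, all approximate: the textbook 6F² DRAM cell with F taken as ~17nm for a 10x-class node, plus reported TSMC N5/N3 high-density SRAM bit cells), ignoring all peripheral and overhead area:

```c
/* Rough bit-density comparison: commodity DRAM cell vs. SRAM on a logic node.
 * All figures approximate/public; no peripheral circuitry or overhead counted. */
#include <stdio.h>

int main(void)
{
    const double F_nm = 17.0;                               /* assumed 10x-class feature size */
    const double dram_cell_um2 = 6.0 * F_nm * F_nm / 1e6;   /* textbook 6F^2 1T1C cell */
    const double sram_n5_cell_um2 = 0.021;                  /* reported TSMC N5 HD 6T cell */
    const double sram_n3_cell_um2 = 0.0199;                 /* reported TSMC N3 HD 6T cell */

    /* one bit per cell; bits per um^2 equals Mbit per mm^2 */
    printf("DRAM (10x-class): ~%.0f Mbit/mm^2\n", 1.0 / dram_cell_um2);
    printf("SRAM on N5:       ~%.0f Mbit/mm^2\n", 1.0 / sram_n5_cell_um2);
    printf("SRAM on N3:       ~%.0f Mbit/mm^2 (barely moved vs N5)\n", 1.0 / sram_n3_cell_um2);
    return 0;
}
```

Even with generous allowances for overhead, commodity DRAM sits an order of magnitude above anything built out of SRAM on a logic node, which is why it ran into the scaling wall first; SRAM is next, and logic still has some headroom.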