64 core EPYC Rome (Zen2) Architecture Overview?

Page 16 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
TL;DR explanation of why this is a big deal?

It's a dramatically different approach. There is a standard way of laying out the system architecture that AMD has used since 2003 and Intel since 2008, and this is a major change from it.

This new way of doing things has both advantages and disadvantages. Some of those we can guess, others will only show themselves in use. It's very interesting to speculate on them.

- For server customers, this means the end of 4 NUMA nodes per socket. This will really help some important loads. Interestingly, since Intel just released Cascade Lake-AP, this means a complete role reversal. Except that AMD still has more cores per socket.

- Since they use Infinity Fabric to connect the CPU to the IO system, they can iterate on them independently. Especially with the rise of lower cost-to-iterate processes that are better for IO (like GF 22nm and 12nm), this might mean a lot more products targeting different workloads, all using the same chiplets for compute, but with a different IO die. The next Ryzen chips almost certainly use the same 7nm chiplets, just with an IO die aimed for desktop (possibly with GPU).

- Since they can iterate independently, they can now bring products to market for different IO standards without having to ship a new CPU die. I think this means AMD will be first out with DDR5 support, since they can just design a new IO die for that, and use the same chiplets for both AM4 and AM5 socket processors. Same goes with next-gen PCI-E.
 

Saylick

Diamond Member
Sep 10, 2012
3,382
7,133
136
The real questions that remain now are the number of chiplets in EPYC 2, the IPC gain, cores/CCX, and whether AMD is going to implement an EMIB-style interface, stick with wiring in the substrate, or even use an interposer to connect the IO die with the chiplets. Exciting times!
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
The next Ryzen chips almost certainly use the same 7nm chiplets, just with an IO die aimed for desktop (possibly with GPU).

I doubt that.

While chiplets look like a good tradeoff for TR/Epyc that already have multi-chip latency issues, and are aimed at markets where those latencies don't really matter, Ryzen is a desktop part, where latency does matter and there are currently no multi-chip latency penalties.

A chiplet design for Ryzen is really not a great choice. AMD should do a separate monolithic die for desktop, and hopefully they feel they have enough market and enough money to do so.

Or maybe the desktop part will only be the APU this time, but with 8 cores.
 

Saylick

Diamond Member
Sep 10, 2012
3,382
7,133
136
I doubt that.

While chiplets look like a good tradeoff for TR/Epyc that already have multi-chip latency issues, and are aimed at markets where those latencies don't really matter, Ryzen is a desktop part, where latency does matter and there are currently no multi-chip latency penalties.

A chiplet design for Ryzen is really not a great choice. AMD should do a separate monolithic die for desktop, and hopefully they feel they have enough market and enough money to do so.

Or maybe the desktop part will only be the APU this time, but with 8 cores.
Agreed. The extra die space to integrate all of the necessary IO for the Ryzen SKUs is worth it instead of making another smaller IO die just to have a chiplet approach with Ryzen.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,684
6,227
136
A lot of the speculation went well



Not sure if this is a 64 core part.

But at least we are on the right track
  • 14nm IO Chiplet
  • 7nm Chiplets
  • MC on the IO die
  • 8 cores per Zen 2 die (not sure if per CCX)
  • 2x FP compared to Zen 1
  • Improved front end (also according to Agner, Zen1 backend is bottlenecked by the front end)
  • 33% better memory bandwidth (due to higher memory speed?)
That IO die looks fairly huge.

I have to say a lot of those improvements they showed today left a patent trail if you have the patience to dig through.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
That's definitely an SP3 socket. Sadly, I need to run now, but is there any chance some kind soul here could work out the dimensions of those chips on there?
 

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
Might be from that fat L4 cache on there . . .

Or maybe a GPU, or maybe there are links for two more chiplets (to be placed at left and right middle) for a hypothetical 80-core version. Or some combination of the above
 

deathBOB

Senior member
Dec 2, 2007
566
228
116
It's a dramatically different approach. There is a standard way of laying out the system architecture that AMD has used since 2003 and Intel since 2008, and this is a major change from it.

This new way of doing things has both advantages and disadvantages. Some of those we can guess, others will only show themselves in use. It's very interesting to speculate on them.

- For server customers, this means the end of 4 NUMA nodes per socket. This will really help some important loads. Interestingly, since Intel just released Cascade Lake-AP, this means a complete role reversal. Except that AMD still has more cores per socket.

- Since they use Infinity Fabric to connect the CPU to the IO system, they can iterate on them independently. Especially with the rise of lower cost-to-iterate processes that are better for IO (like GF 22nm and 12nm), this might mean a lot more products targeting different workloads, all using the same chiplets for compute, but with a different IO die. The next Ryzen chips almost certainly use the same 7nm chiplets, just with an IO die aimed for desktop (possibly with GPU).

- Since they can iterate independently, they can now bring products to market for different IO standards without having to ship a new CPU die. I think this means AMD will be first out with DDR5 support, since they can just design a new IO die for that, and use the same chiplets for both AM4 and AM5 socket processors. Same goes with next-gen PCI-E.

How is it different from an external northbridge or memory controller like we had a long time ago?
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
A lot of the speculation went well



Not sure if this is a 64 core part.

But at least we are on the right track
  • 14nm IO Chiplet
  • 7nm Chiplets
  • MC on the IO die
  • 8 cores per Zen 2 die (not sure if per CCX)
  • 2x FP compared to Zen 1
  • Improved front end (also according to Agner, Zen1 backend is bottlenecked by the front end)
  • 33% better memory bandwidth (due to higher memory speed?)
That IO die looks fairly huge.

I have to say a lot of those improvements they showed today left a patent trail if you have the patience to dig through.

I think that is obviously a 64 core part. It has 8 chiplets, which undoubtedly have 8 cores each, giving the rumored 8+1 config.

But yeah that IO die is monstrous. I also assume it must have a big L4 cache.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
I doubt that.

While chiplets look like a good tradeoff for TR/Epyc that already have multi-chip latency issues, and are aimed at markets where those latencies don't really matter, Ryzen is a desktop part, where latency does matter and there are currently no multi-chip latency penalties.

A chiplet design for Ryzen is really not a great choice.

Could you please explain the mechanics of why you think latency will get markedly worse?


L1 ~ same depending on core design
L2 ~ same depending on core design
L3 = vastly improved due to all 8 cores sharing L3 cache and not having to go through crossbars clocked at DDR4 rate (Zen1).
L4 = doesn't exist on Zen1, might exist on Zen2. Thus can only be an improvement.
DRAM ~ approximately the same. The interface between the IO chip and the CCX chiplet now does not need to run at MEMCLK, I'd expect they are instead trying to tie it to core clock. So you are looking at a service time from CCX to IO chiplet that would more resemble the L2 times of Zen1+ than the L3. Even assuming it isn't, compare the Zen1+ L3 latency to DRAM (note y-axis is logarithmic).

L3 ~ 30 cycles
DRAM ~ 300 cycles

So going through the I/O chip at worst increases DRAM latency from ~300 to ~330 cycles.... and I'd wager that using the I/O chiplet to decouple on socket infinity fabric from MEMCLK has allowed AMD to really pull that L3 latency down - so that 30 might be more like 20. They also talk about improved memory speeds.
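The arithmetic above can be sketched in a few lines; the cycle counts are the rough Zen(+) figures quoted in the post, and the I/O-die hop cost is an assumed worst case, not a measurement:

```python
# Back-of-the-envelope check of the latency argument above.
# Cycle counts are the approximate Zen1+ figures quoted in the post;
# the I/O-die hop cost is an assumed worst case, not a measurement.
L3_CYCLES = 30        # approx. Zen1+ L3 hit latency
DRAM_CYCLES = 300     # approx. Zen1+ DRAM latency
IO_HOP_CYCLES = 30    # assumed worst-case extra hop through the I/O die

dram_with_hop = DRAM_CYCLES + IO_HOP_CYCLES
penalty = dram_with_hop / DRAM_CYCLES - 1
print(f"DRAM: {DRAM_CYCLES} -> {dram_with_hop} cycles ({penalty:+.0%})")
```

Even in the worst case assumed here, the extra hop is a roughly 10% hit on an access that was already ~10x slower than an L3 hit.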

 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Could you please explain the mechanics of why you think latency will get markedly worse?

Not every solution makes sense everywhere. Chiplets are good for TR/EPYC, not so good for desktop.

Integrated MC as opposed to Memory controller on another chip. There is a latency penalty for going off chip.

For a small desktop chip, chiplets should have worse latency and be less cost effective.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,860
3,407
136
WOW, so I was wrong... rofl.

Did anyone else notice it looks like the chiplets are in pairs of two on an interposer?

Edit: also we need confirmation about the load/store, "12:53PM EST - Improved pipeline, DOuble loading point and load store"

I hope this means 4 load and 2 store ports, but I'm guessing it just means 2/1 256-bit load/store ports
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Not every solution makes sense everywhere. Chiplets are good for TR/EPYC, not so good for desktop.

Integrated MC as opposed to Memory controller on another chip. There is a latency penalty for going off chip.

For a small desktop chip, chiplets should have worse latency and be less cost effective.

That is not an explanation of the mechanics of why the latency will be worse beyond a high level assumption that going through an IO Controller will be markedly worse.

I have pointed out (with measured numbers) that your high level assumption does not hold up.

I await your explanation for why it will be so much worse it will matter.



As for why AMD won't design a desktop die, simples, cost. Any marginal (and they are very marginal) gains AMD would make in cutting DDR latency by a few percent for Ryzen7 would be lost in the mask cost and in requiring dedicated wafers for desktop rather than flexibility in wafer starts.

A redesign for Ryzen7 also means a redesign for Ryzen3. Which means 3 masks, not 1. Keeping the same 8 core unit allows the same 7nm mask to be reused and the I/O Controller switched out.


The market does not revolve around desktop, regardless of how much some folks on here get tunnel vision and think it does.
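The cost side of this argument can be illustrated with a toy Poisson yield model (yield = exp(-defect_density x die_area)): a small compute chiplet yields far better than a large monolithic die on a new process. The defect density and die areas below are made-up placeholders, not foundry data:

```python
import math

# Toy Poisson yield model: fraction of defect-free dies.
# Defect density and die areas are illustrative assumptions only.
def poisson_yield(area_cm2, defects_per_cm2=0.2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defects_per_cm2 * area_cm2)

chiplet_area = 0.8      # ~80 mm^2 8-core compute chiplet (assumed)
monolithic_area = 3.2   # ~320 mm^2 hypothetical monolithic desktop die

print(f"small chiplet yield:  {poisson_yield(chiplet_area):.0%}")
print(f"monolithic die yield: {poisson_yield(monolithic_area):.0%}")
```

With these placeholder numbers the small chiplet yields around 85% versus roughly 53% for the big die, on top of saving the extra 7nm mask sets a separate desktop design would need.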
 

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
I don't know if that's actually an interposer you're seeing. It could just be the filler that helps hold the chips in place. On the other hand, the chips are very close to one another.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
WOW, so I was wrong... rofl.

Did anyone else notice it looks like the chiplets are in pairs of two on an interposer?

Edit: also we need confirmation about the load/store, "12:53PM EST - Improved pipeline, DOuble loading point and load store"

I hope this means 4 load and 2 store ports, but I'm guessing it just means 2/1 256-bit load/store ports
Maybe it's just how the thing is packaged? Doubled Load/store could also mean 64B/cycle up from 32B/cycle.
EDIT: Are we both saying the same thing when you mention 4 load and 2 store ports and 64B/cycle?
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,860
3,407
136
Maybe it's just how the thing is packaged? Doubled Load/store could also mean 64B/cycle up from 32B/cycle.
EDIT: Are we both saying the same thing when you mention 4 load and 2 store ports and 64B/cycle?


See, what I want is 4x128-bit load and 2x128-bit store a cycle; what I think it will be is 2x256-bit load and 1x256-bit store a cycle.

What Ian wrote in the live stream could be interpreted both ways (to me anyway).
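Put as bytes per cycle, the two readings being debated actually come out the same; the port counts and widths here are the forum guesses above, not confirmed Zen 2 specs:

```python
# Bytes per cycle for the two readings of "doubled load/store".
# Port counts and widths are forum guesses, not confirmed Zen 2 specs.
def bytes_per_cycle(ports, width_bits):
    return ports * width_bits // 8

# Hoped-for: 4x128-bit loads + 2x128-bit stores per cycle
hoped = (bytes_per_cycle(4, 128), bytes_per_cycle(2, 128))
# Expected: 2x256-bit loads + 1x256-bit store per cycle
expected = (bytes_per_cycle(2, 256), bytes_per_cycle(1, 256))

print("load/store bytes per cycle, hoped:   ", hoped)
print("load/store bytes per cycle, expected:", expected)
```

Both interpretations total 64 B/cycle of load and 32 B/cycle of store bandwidth, which is why "doubled load/store" and "64B/cycle up from 32B/cycle" are easy to conflate; they differ only in how many independent accesses per cycle the ports allow.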
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
See, what I want is 4x128-bit load and 2x128-bit store a cycle; what I think it will be is 2x256-bit load and 1x256-bit store a cycle.

What Ian wrote in the live stream could be interpreted both ways (to me anyway).
Ah, I was thinking of cache b/w. I suppose it might have been increased as well, given the increased FP performance.
 
Mar 11, 2004
23,170
5,635
146
What does this mean? Doubles the BW/channel for the same number of channels?
Increased frequency or just more efficient usage?

Not sure exactly, but PCIe 4 doubles up PCIe 3 (and 5 will double up 4). Not sure how latency changes (with memory, for instance, bandwidth has increased, but I believe latency has as well).
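The per-generation doubling is straightforward to put in numbers; the per-lane figures below are the published effective rates after 128b/130b encoding overhead (used by Gen 3, 4, and 5 alike):

```python
# Usable per-lane throughput by PCIe generation, in GB/s, after the
# 128b/130b encoding overhead (Gen 3/4/5 all use 128b/130b).
per_lane_gbs = {3: 0.985, 4: 1.969, 5: 3.938}

for gen, gbs in per_lane_gbs.items():
    print(f"PCIe {gen}.0 x16: ~{gbs * 16:.0f} GB/s per direction")
```

So an x16 slot goes from ~16 GB/s on Gen 3 to ~32 GB/s on Gen 4 per direction, purely from the higher signaling rate, with the same lane count.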
 