64 core EPYC Rome (Zen2) Architecture Overview?


Excessi0n

Member
Jul 25, 2014
140
36
101
If you are asking whether there are currently >280 unused pins in the AM4 package, as required for the two additional memory channels, the answer is no.

So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

Yes. This is why going from single channel to dual channel doubles bandwidth, but putting two sticks of RAM into the same channel doesn't. The two sticks on the same channel are attached to the same bus, and the memory controller can only talk to one of them at a time.
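
A rough back-of-the-envelope sketch of why that is (my numbers, assuming DDR4-3200 and a 64-bit channel, not anything stated above):

```python
# Peak DRAM bandwidth scales with the number of channels, not with DIMMs per
# channel, because DIMMs on the same channel share one bus.
# Assumed figures: DDR4-3200, 64-bit (8-byte) channel.

TRANSFER_RATE_MT_S = 3200   # mega-transfers per second (assumed DDR4-3200)
BUS_WIDTH_BYTES = 8         # one DDR4 channel is 64 bits = 8 bytes wide

def peak_bandwidth_gb_s(channels: int, dimms_per_channel: int) -> float:
    """Peak theoretical bandwidth in GB/s; extra DIMMs per channel add capacity, not bandwidth."""
    per_channel = TRANSFER_RATE_MT_S * 1e6 * BUS_WIDTH_BYTES / 1e9
    return channels * per_channel   # dimms_per_channel intentionally unused

print(peak_bandwidth_gb_s(1, 1))  # 1 stick, single channel -> 25.6 GB/s
print(peak_bandwidth_gb_s(1, 2))  # 2 sticks, same channel  -> 25.6 GB/s
print(peak_bandwidth_gb_s(2, 1))  # 2 sticks, dual channel  -> 51.2 GB/s
```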
 
Reactions: The Stilt

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
If AMD goes with chiplets, they will certainly have different dies for server and desktop, so everything discussed here (Rome rumours) could have zero impact on the AM4 platform.
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
 
Reactions: fallaha56

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

Yes. This is why going from single channel to dual channel doubles bandwidth, but putting two sticks of RAM into the same channel doesn't. The two sticks on the same channel are attached to the same bus, and the memory controller can only talk to one of them at a time.

And it's why populating both slots of each channel pulls down attainable memory speeds: the memory controller can only service each DIMM so quickly, so the limiter is not the DRAM, it's the MC.
 
Reactions: coercitiv

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

As Tuna-Fish already said, the two slots of the same channel (e.g. A0 & A1 / B0 & B1) share the signals.
This is called a "2 DPC" (2 DIMMs per channel) configuration. Designs which lack the second slot for each channel (i.e. 1 DPC) are possible, and are usually used on low-end and ITX form factor motherboards.

Quad channel capability on AM4 would require 286 additional pins to be available (143 per channel), increasing the total pin count reserved for the memory interface from the current 286 to 572 (i.e. to 43% of the total pin count of the package).
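
A quick sanity check of that arithmetic (assuming AM4's 1331-pin package, which isn't stated in the post):

```python
# Pin-count arithmetic from the post above. The 1331-pin AM4 total is my own
# assumption for the percentage check, not something stated in the post.
PINS_PER_CHANNEL = 143
current_mem_pins = 2 * PINS_PER_CHANNEL   # 286 pins for the existing dual channel
quad_mem_pins = 4 * PINS_PER_CHANNEL      # 572 pins for quad channel
AM4_TOTAL_PINS = 1331

print(quad_mem_pins - current_mem_pins)              # 286 additional pins needed
print(round(100 * quad_mem_pins / AM4_TOTAL_PINS))   # ~43% of the package
```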
 
Reactions: Drazick and CatMerc

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
Quad channel capability on AM4 would require 286 additional pins to be available (143 per channel), increasing the total pin count reserved for the memory interface from the current 286 to 572 (i.e. to 43% of the total pin count of the package).

And effectively more than that, because the high-speed transmission pins need ground pins next to them to maintain signal quality.
 
Reactions: krumme

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
This looks and sounds like some absolute balls-to-the-wall insanity.
Nov. 6 can't come soon enough.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
And effectively more than that, because the high-speed transmission pins need ground pins next to them to maintain signal quality.
I'm no engineer, and it's 20-plus years back that I was working on reducing jitter in audio signals for fun. Filtering functions in PLLs, decoupling, reference clocks, what not. DIP times. It was beneficial for the jitter results to simply view the ground as part of the signal path. And why not.
I used a kind of 3D solution sticking out from the DAC, with small 0805-size C0G and NP0 capacitors soldered between the pins. NP0 soldered on top of the pins of the chips, between power and ground. Lol.
All because the trace length was long, so the signal/ground stiffness was compromised. DIP days... Now it's all done perfectly in an integrated chip for a dollar at most, even for high-end stuff.
 

Glo.

Diamond Member
Apr 25, 2015
5,759
4,666
136
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.
 

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
The saving-a-tape-out part is only true if they won't make an 8-core Raven Ridge successor (that would easily work, even on mobile). If they do, they might opt out of 12-16 core AM4 multi-chip parts. They could of course also do both.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.

How much would AMD save on manufacturing versus how much extra spent on design, validation and lost time to market?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.

#20180102338 Circuit Board with Bridge Chiplets

Actually the interconnect is also a chiplet (55/50) according to the patent, and it is fabricated using a higher density process than the rest of the board:
The circuit structures of the chiplets 50 and 55 may be constructed using one or more design rules for higher density circuit structures while the circuit structures of the remainder of the circuit board...

The interconnect chiplets provide connectivity from one regular chiplet (20/25/30) to another and to the pin arrays (45)
The chiplets 50 and 55 may be used for a variety of purposes. For example, the chiplet 50 may be used to provide large numbers of electrical pathways between the semiconductor chip 20 and 25 as well as electrical pathways to and from the semiconductor chips 20 and 25, through the circuit board 15 and out to the I/O's 45 if desired. The chiplet 55 may be used to provide large numbers of electrical pathways between the semiconductor chip 25 and the semiconductor chip 30 as well as electrical pathways to and from the semiconductor chips 25 and 30 through the circuit board 15 and out to the I/O's 45 if desired.

This might not be the definitive layout for Zen 2, but as it looks, it seems complicated.
The patent is fairly comprehensive in the sense that it covers not only how the chips are connected to one another, but also how these pockets are created on the substrate, how the interconnects are inserted into the substrate, and how the chiplets are placed on top.

#20180096938 Circuit Board with Multiple Density regions

Here is another layout, with the 50/52/53/55 interconnect chiplets being used differently.
The patent is all about how the chips 20/25/30 can be of different density and use the interconnects 50/52/53/55 to connect to one another.

#20180239708 Acceleration of cache-to-cache data transfers for producer-consumer communication


[0020] Referring to FIG. 1, processing system 100 (e.g., a server) includes multiple processing nodes (e.g., node 0 and node 1). Each processing node includes multiple caching agents (e.g., processors 102 and 104 coupled to main memory 110). For example, caching agent 102, is a processor including core 0, core 1, core 2, . . . core 7 and caching agent 104, is a processor including core 0, core 1, core 2, . . . core 7) and a memory system. Each of the nodes accesses its own memory within corresponding coherence domain 122 faster than memory in non-coherence domain 124 (e.g., main memory 110) or memory in another node. As referred to herein, a coherence domain refers to a subset of memory (e.g., cache memory of node 0) for which a cache coherence mechanism maintains a coherent view of copies of shared data. A non-coherence domain refers to memory not included in the coherence domain (e.g., main memory 110 or cache memory in another node). Each of the caching agents in a node includes a last-level cache shared by the cores of the caching agent. Each core includes a private penultimate-level cache. For example, caching agent 102 includes last-level cache 128, which is a level-three cache shared by core 0, core 1, core 2, . . . core 7, and includes a level-two cache within each of core 0, core 1, core 2, . . . core 7. Last-level cache 128 and each level-two cache of caching agent 102 includes storage elements, e.g., storage implemented in fast static Random Access Memory (RAM) or other suitable storage elements. Cache control logic is distributed across last-level cache 128 and each level-two cache of caching agent 102. The caching agents use inter-processor communication via directory controller 121 to maintain coherency of a memory image in main memory 110 when caches of more than one caching agent contain the same cache line (i.e., a copy of contents of the same location of main memory 110) of coherence domain 122. In at least one embodiment of processing system 100, probe filter 112 includes storage for a cache directory used to implement a directory-based cache coherency policy. Probe filter 112 is implemented in fast static RAM associated with directory controller 121 or by other suitable storage technique.
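
As a toy illustration of the probe filter / directory idea in that excerpt (my own sketch, not anything taken from the patent or actual AMD hardware), a directory can point a requester straight at the agents that may hold a line instead of broadcasting probes to every agent:

```python
# Toy sketch of a directory-based probe filter: a per-line directory records
# which caching agents may hold a copy, so a request only probes those agents.
# Names and structure are illustrative only, not AMD's implementation.

class ProbeFilter:
    def __init__(self):
        # cache-line address -> set of agent ids that may hold a copy
        self.directory = {}

    def record_fill(self, line_addr, agent_id):
        """Note that agent_id has fetched line_addr into its cache."""
        self.directory.setdefault(line_addr, set()).add(agent_id)

    def lookup(self, line_addr, requester_id):
        """Return the agents that must be probed for requester_id's request."""
        sharers = self.directory.get(line_addr, set())
        return sharers - {requester_id}   # no need to probe the requester itself

pf = ProbeFilter()
pf.record_fill(0x1000, agent_id=0)        # producer's caching agent holds the line
print(pf.lookup(0x1000, requester_id=1))  # consumer is directed only to agent 0
```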


#20180239702 Locality-aware and sharing-aware cache coherence for collections of processors

Cache coherence across multiple processors on an interconnect network (in one case an interposer 412)
[0019] Processing system 100 includes a distributed, shared memory system. For example, all memory locations of memory system 108 are accessible by each of processors 102, 104, and 106. Memory system 108 includes multiple memory portions, which are distributed across processors 102, 104, and 106. Memory portion 110 is local to processor 102 (e.g., tightly integrated with processor 102) and remote to processors 104 and 106 (e.g., within processing system 100 and accessible to processors 104 and 106, but not local to processors 104 and 106).


Some more applications that are interesting to read regarding AMD's ideas/attempts at memory-access optimization in a CPU and/or with GPUs:
#20180039587 Network of memory modules with logarithmic access
#20180018105 Memory controller with virtual controller mode
#20180019006 Memory controller with flexible address decoding
#20180143905 Network-aware cache coherence protocol enhancement
#20180239722 Allocation of memory buffers in computing system with multiple memory channels

Search in USPTO website here
http://appft.uspto.gov/netahtml/PTO/srchnum.html

TL;DR;
In simple terms,
#20180102338 describes how to create cavities in the substrate, place connecting chiplets with conducting pads on both surfaces into these cavities, and then place the real chips on top of them. Due to the high density of the connecting chiplets, complex routing between the chiplets can be achieved, and connections to the pin arrays are also provided.
#20180096938 describes chiplets of different process nodes integrated in one package
#20180239702 describes cache coherence across multiple processors.
#20180239708 talks of a node with multiple processors/dies, each with multiple cores, all of them using a single memory controller and keeping the LLCs in sync across the dies
#20180239722 talks of a single memory controller in one computing device, but I'm not sure if it's only for the APU use case.
 
Last edited:

Glo.

Diamond Member
Apr 25, 2015
5,759
4,666
136
The saving a tape out part is only true, if they won't make a 8 core raven ridge succesor (that would easily work, even on mobile). If they do, thry might opt out from 12-16 core AM4 multichip parts. They could of course also do both
APU in this design can also be chiplet based.
 

dacostafilipe

Senior member
Oct 10, 2013
772
244
116
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

Yes, 14/16nm is cheaper, but it still costs a lot.

Add the more complex packaging (price, yields, size, ...) to it, and IMO it makes less and less sense to go with chiplets for desktop usage.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Yes, 14/16nm is cheaper, but it still costs a lot.

Add the more complex packaging (price, yields, size, ...) to it, and IMO it makes less and less sense to go with chiplets for desktop usage.
We need someone to put some numbers on the table here. Can someone do that?

Looking at the roadmap, 7nm+ is early 2020, so there's only approx. one year in the market to pay for the upfront cost.

We have to remember that variable cost is far less risk than upfront investment. It's a tradeoff: more risk = less money. Making these decisions is a huge deal.
 

noneis

Junior Member
Mar 4, 2017
21
29
91
We need someone to put some numbers on the table here. Can someone do that?

Design cost is increasing exponentially:



That's why AMD will probably go with something like this:

7nm CPU Core chiplet = ~$300M
7nm GPU Navi chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
12/14/16nm Desktop/Mobile IO chiplet = ~$100M
2-3 12/14/16nm GPU IO chiplets = ~$200-300M

Total cost $1.0-1.1B, they can then do 7nm+ refresh for ~$600M

Traditional approach:
7nm CPU Core Server chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
7nm CPU SoC = ~$300M
7nm APU = ~$300M
GPU1 = ~$300M
GPU2 = ~$300M

Total cost: ~$1.6B, 7nm+ refresh would cost additional ~$1.5B

Investing ~$3.1B instead of $1.6-1.7B just for designs in 2 years could be too much for a company like AMD, because their quarterly research and development budget is below $400M.
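
Just restating those figures to show how the totals add up (the per-design costs are the estimates above, not official numbers):

```python
# Totals implied by the poster's own estimates (all values in $M).
chiplet_approach = {
    "7nm CPU core chiplet": 300,
    "7nm GPU Navi chiplet": 300,
    "12/14/16nm server/TR IO chiplet": 100,
    "12/14/16nm desktop/mobile IO chiplet": 100,
    "12/14/16nm GPU IO chiplets (2-3)": 250,   # midpoint of the $200-300M range
}
traditional_approach = {
    "7nm CPU core server chiplet": 300,
    "12/14/16nm server/TR IO chiplet": 100,
    "7nm CPU SoC": 300,
    "7nm APU": 300,
    "GPU1": 300,
    "GPU2": 300,
}

chiplet_total = sum(chiplet_approach.values())           # ~1050 -> "$1.0-1.1B"
traditional_total = sum(traditional_approach.values())   # 1600  -> "$1.6B"

# Two-year outlook including the 7nm+ refresh estimates from the post:
print(chiplet_total + 600)       # ~1650 ($M) -> "$1.6-1.7B"
print(traditional_total + 1500)  # 3100 ($M)  -> "$3.1B"
```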
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
With exponentially increasing design cost, TTM (time to market) probably goes the same way.

Consequences
1. You wish for longer depreciation to cover design cost. E.g. the 14nm APU gets a platform refresh and will cover WSA obligations as well as the lower-end market.
2. Your new products then need to sell for more. Adaptability to more "niches" and faster time to market help that.

IMO one of the strongest features of IO chiplets is therefore also the ability to respond faster to changing market needs. Nobody owns a crystal ball here and knows your opponent's moves 3 years in advance. Your 7nm+ might be broken. Amazon might get a completely revised software stack for key areas.

With IO chiplets on lesser nodes you get management controllability. It's like getting a rudder on a tanker that didn't have one before. I think it will fundamentally change how this business works.
 

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
APU in this design can also be chiplet based.
Yes they could. I'm still sceptical they will do it with the first generation. Doing APUs with chiplets immediately complicates things considerably:

1. Building an interposer and at least 2-3 different chiplets is expensive. You can't really sell that anywhere near the price of a 2200G or a 2500U, which is still the price level most OEMs are willing to pay for AMD models.

2. APUs are mostly for low power (mobile). So in addition to all the new architectural problems they face, they also need to tackle all the battery-life problems. There is definitely lots of room for improvement even on the relatively simple Raven Ridge (idle power, LPDDR support). Making a multi-module chip sip power at low-intensity tasks (e.g. web browsing) while crossing chiplet boundaries is definitely way harder. My guess would be that they'd do it later, perhaps with Zen 3, once they've already mastered the current chiplet design.

I have no doubt AMD will eventually do it; after all, the benefits of such modularity could be massive. E.g. they could release new CPU and GPU chiplets out of sync: release a processor with a 12nm Vega GPU (as Navi isn't ready), then update it to Navi later without changing any other chiplets, do configurable models with/without HBM, etc...

I'm not yet convinced they'll do it this generation. I wouldn't mind being wrong, though.
 
Last edited:

noneis

Junior Member
Mar 4, 2017
21
29
91
Consequences
1. You wish for longer depreciation to cover design cost. E.g. the 14nm APU gets a platform refresh and will cover WSA obligations as well as the lower-end market.

I expect both low-end CPU and low-end GPU markets to be 12nm products, because it would require huge volume to pay back the ~$300M 7nm design cost. Something like 15+ million units of GPUs in the RX 550 market segment, which is ~10% of global yearly GPU unit sales. Maybe NVIDIA could try 7nm low-end, because they have spare money to gamble.

2. Your new products then need to sell for more. Adaptability to more "niches" and faster time to market help that.

Chiplet design has a higher base manufacturing cost, so I would expect it only in products >$150. With increasing manufacturing and design costs and slower advancement in density of cutting-edge nodes, the market is heading for higher prices with longer upgrade cycles anyway. I expect $499-$599 high-end desktop CPUs to be something normal in the future (instead of $329-$359).

7nm is just a start: 5nm has a >$500M design cost and 3nm could go even above $1.0B (https://semiengineering.com/big-trouble-at-3nm/). While companies can still do multiple designs for various market segments at 7nm, at 3nm that will no longer be possible. Some kind of scalable multi-die solution will be necessary if a company wants to sell in different market segments. Companies can't do four $1.0B 3nm designs every 18 months like in the past.
 

dacostafilipe

Senior member
Oct 10, 2013
772
244
116
That's why AMD will probably go with something like this ...

AMD will not produce GPU chiplets for discrete GPUs (for now), so you would need to add GPU1+GPU2 to your estimate, which would then rise to ~$2.3-2.4B.

My "vision":

7nm CPU Core chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
7nm CPU SoC = ~$400M
GPU1 = ~$300M
GPU2 = ~$300M

Total: ~$1.4B + ~$1.3B for refresh = ~$2.7B

One of the negative points about Ryzen from the OEMs was the lack of an iGPU. That's why I see this as an opportunity to grow larger in that area. Also, Ryzen will need the lower latency that goes well with gaming.

Edit: FYI, I think your estimate is too high, as a lot of IP is reused across multiple parts. For example, GPU2, if using GPU1 as a basis, would certainly cost a lot less.
 
Last edited:
Reactions: Vattila and Gideon

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
I expect both low-end CPU and low-end GPU markets to be 12nm products
For the time being this is certainly the case. After all, we are still waiting for the upcoming Ryzen APU, which will be at 12nm.
 

naukkis

Senior member
Jun 5, 2002
768
634
136
1. Building an interposer and at least 2-3 different chiplets is expensive. You can't really sell that anywhere near the price of a 2200G or a 2500U, which is the price level most OEMs are willing to pay for AMD models still.

Yet there's only one reason to use an interposer and chiplet design: to reduce cost vs. an integrated circuit. The last such CPU was Intel's Clarkdale; the cheap CPUs got the multi-die design while the pricier versions (Nehalem) got a unified silicon design.

For AMD a chiplet design is the cheaper alternative for sure, but as Clarkdale shows, it usually comes with inferior performance vs. an integrated silicon version.
 