64 core EPYC Rome (Zen2) Architecture Overview?


Excessi0n

Member
Jul 25, 2014
140
36
101
If you are asking whether there are currently >280 unused pins in the AM4 package, as required for the two additional memory channels, the answer is no.

So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

Yes. This is why going from single channel to dual channel doubles bandwidth, but putting two sticks of RAM into the same channel doesn't. The two sticks on the same channel are attached to the same bus, and the memory controller can only talk to one of them at a time.
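
A rough back-of-the-envelope sketch of why that is (my numbers, assuming DDR4-3200 and a 64-bit channel, not anything stated above):

```python
# Peak DRAM bandwidth scales with the number of channels, not with DIMMs per
# channel, because DIMMs on the same channel share one bus.
# Assumed figures: DDR4-3200, 64-bit (8-byte) channel.

TRANSFER_RATE_MT_S = 3200   # mega-transfers per second (assumed DDR4-3200)
BUS_WIDTH_BYTES = 8         # one DDR4 channel is 64 bits = 8 bytes wide

def peak_bandwidth_gb_s(channels: int, dimms_per_channel: int) -> float:
    """Peak theoretical bandwidth in GB/s; extra DIMMs per channel add capacity, not bandwidth."""
    per_channel = TRANSFER_RATE_MT_S * 1e6 * BUS_WIDTH_BYTES / 1e9
    return channels * per_channel   # dimms_per_channel intentionally unused

print(peak_bandwidth_gb_s(1, 1))  # 1 stick, single channel -> 25.6 GB/s
print(peak_bandwidth_gb_s(1, 2))  # 2 sticks, same channel  -> 25.6 GB/s
print(peak_bandwidth_gb_s(2, 1))  # 2 sticks, dual channel  -> 51.2 GB/s
```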
 
Reactions: The Stilt

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
If AMD goes with chiplets, they will certainly have different dies for server and desktop, so everything discussed here (Rome rumours) could have zero impact on the AM4 platform.
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
 
Reactions: fallaha56

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

Yes. This is why going from single channel to dual channel doubles bandwidth, but putting two sticks of RAM into the same channel doesn't. The two sticks on the same channel are attached to the same bus, and the memory controller can only talk to one of them at a time.

And it's why populating both slots of each channel pulls down attainable memory speeds: the memory controller can only service each DIMM so quickly, so the limiter is not the DRAM, it's the MC.
 
Reactions: coercitiv

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
So when you have all four slots filled, does that mean that the two sets of dual-channel RAM are communicating with the CPU through the same pins?

As Tuna-Fish already said, the two slots of the same channel (e.g. A0 & A1 / B0 & B1) share the signals.
This is called a "2 DPC" (2 DIMMs per channel) configuration. Designs which lack the second slot for each channel (i.e. 1 DPC) are possible, and are usually used on low-end and ITX form factor motherboards.

Quad channel capability on AM4 would require 286 additional pins to be available (143 per channel), increasing the total pin count reserved for the memory interface from the current 286 to 572 (i.e. to 43% of the total pin count of the package).
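
A quick sanity check of that arithmetic (assuming AM4's 1331-pin package, which isn't stated in the post):

```python
# Pin-count arithmetic from the post above. The 1331-pin AM4 total is my own
# assumption for the percentage check, not something stated in the post.
PINS_PER_CHANNEL = 143
current_mem_pins = 2 * PINS_PER_CHANNEL   # 286 pins for the existing dual channel
quad_mem_pins = 4 * PINS_PER_CHANNEL      # 572 pins for quad channel
AM4_TOTAL_PINS = 1331

print(quad_mem_pins - current_mem_pins)              # 286 additional pins needed
print(round(100 * quad_mem_pins / AM4_TOTAL_PINS))   # ~43% of the package
```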
 
Reactions: Drazick and CatMerc

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
Quad channel capability on AM4 would require 286 additional pins to be available (143 per channel), increasing the total pin count reserved for the memory interface from the current 286 to 572 (i.e. to 43% of the total pin count of the package).

And effectively more than that, because the high-speed transmission pins need ground pins next to them to maintain signal quality.
 
Reactions: krumme

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
This looks and sounds like some absolute balls-to-the-wall insanity.
Nov. 6 can't come soon enough.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
And effectively more than that, because the high-speed transmission pins need ground pins next to them to maintain signal quality.
I'm no engineer, and it's 20-plus years back that I was working on reducing jitter in audio signals for fun. Filtering functions in PLLs, decoupling, reference clocks, what not. DIP times. It was beneficial for the jitter results to simply view the ground as part of the signal path. And why not.
I used a kind of 3D solution sticking out from the DAC, with small 0805-size C0G and NP0 capacitors soldered between the pins. NP0 soldered on top of the pins of the chips, between power and ground. Lol.
All because the trace length was long, so the signal/ground stiffness was compromised. DIP days... Now it's all done perfectly in an integrated chip for a dollar at most, even for high-end stuff.
 

Glo.

Diamond Member
Apr 25, 2015
5,759
4,666
136
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.
 

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

So server and client would share cores, but the controller/IO die would be specifically per segment.
The saving-a-tape-out part is only true if they won't make an 8-core Raven Ridge successor (that would easily work, even on mobile). If they do, they might opt out of 12-16 core AM4 multi-chip parts. They could of course also do both.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.

How much would AMD save on manufacturing versus how much extra spent on design, validation and lost time to market?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
That will make them develop specific IO dies only, on a cheap-as-f*** 14 nm process. Which will in return turn into more profit and save moneyz on manufacturing.

#20180102338 Circuit Board with Bridge Chiplets

Actually the interconnect is also a chiplet (55/50) according to the patent, and it is fabricated using a higher density process than the rest of the board:
The circuit structures of the chiplets 50 and 55 may be constructed using one or more design rules for higher density circuit structures while the circuit structures of the remainder of the circuit board...

The interconnect chiplets provide connectivity from one regular chiplet (20/25/30) to another and to the pin arrays (45)
The chiplets 50 and 55 may be used for a variety of purposes. For example, the chiplet 50 may be used to provide large numbers of electrical pathways between the semiconductor chip 20 and 25 as well as electrical pathways to and from the semiconductor chips 20 and 25, through the circuit board 15 and out to the I/O's 45 if desired. The chiplet 55 may be used to provide large numbers of electrical pathways between the semiconductor chip 25 and the semiconductor chip 30 as well as electrical pathways to and from the semiconductor chips 25 and 30 through the circuit board 15 and out to the I/O's 45 if desired.

This might not be the definitive layout for Zen 2, but as it looks, it seems complicated.
The patent is fairly comprehensive in the sense that it covers not only how the chips are connected to one another, but also how these pockets are created on the substrate, how the interconnects are inserted into the substrate, and how the chiplets are placed on top.

#20180096938 Circuit Board with Multiple Density regions

Here is another layout, with the 50/52/53/55 interconnect chiplets being used differently.
The patent is all about how the chips 20/25/30 can be of different density and use the interconnects 50/52/53/55 to connect to one another.

#20180239708 Acceleration of cache-to-cache data transfers for producer-consumer communication


[0020] Referring to FIG. 1, processing system 100 (e.g., a server) includes multiple processing nodes (e.g., node 0 and node 1). Each processing node includes multiple caching agents (e.g., processors 102 and 104 coupled to main memory 110). For example, caching agent 102, is a processor including core 0, core 1, core 2, . . . core 7 and caching agent 104, is a processor including core 0, core 1, core 2, . . . core 7) and a memory system. Each of the nodes accesses its own memory within corresponding coherence domain 122 faster than memory in non-coherence domain 124 (e.g., main memory 110) or memory in another node. As referred to herein, a coherence domain refers to a subset of memory (e.g., cache memory of node 0) for which a cache coherence mechanism maintains a coherent view of copies of shared data. A non-coherence domain refers to memory not included in the coherence domain (e.g., main memory 110 or cache memory in another node). Each of the caching agents in a node includes a last-level cache shared by the cores of the caching agent. Each core includes a private penultimate-level cache. For example, caching agent 102 includes last-level cache 128, which is a level-three cache shared by core 0, core 1, core 2, . . . core 7, and includes a level-two cache within each of core 0, core 1, core 2, . . . core 7. Last-level cache 128 and each level-two cache of caching agent 102 includes storage elements, e.g., storage implemented in fast static Random Access Memory (RAM) or other suitable storage elements. Cache control logic is distributed across last-level cache 128 and each level-two cache of caching agent 102. The caching agents use inter-processor communication via directory controller 121 to maintain coherency of a memory image in main memory 110 when caches of more than one caching agent contain the same cache line (i.e., a copy of contents of the same location of main memory 110) of coherence domain 122. In at least one embodiment of processing system 100, probe filter 112 includes storage for a cache directory used to implement a directory-based cache coherency policy. Probe filter 112 is implemented in fast static RAM associated with directory controller 121 or by other suitable storage technique.
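
As a toy illustration of the probe filter / directory idea in that excerpt (my own sketch, not anything taken from the patent or actual AMD hardware), a directory can point a requester straight at the agents that may hold a line instead of broadcasting probes to every agent:

```python
# Toy sketch of a directory-based probe filter: a per-line directory records
# which caching agents may hold a copy, so a request only probes those agents.
# Names and structure are illustrative only, not AMD's implementation.

class ProbeFilter:
    def __init__(self):
        # cache-line address -> set of agent ids that may hold a copy
        self.directory = {}

    def record_fill(self, line_addr, agent_id):
        """Note that agent_id has fetched line_addr into its cache."""
        self.directory.setdefault(line_addr, set()).add(agent_id)

    def lookup(self, line_addr, requester_id):
        """Return the agents that must be probed for requester_id's request."""
        sharers = self.directory.get(line_addr, set())
        return sharers - {requester_id}   # no need to probe the requester itself

pf = ProbeFilter()
pf.record_fill(0x1000, agent_id=0)        # producer's caching agent holds the line
print(pf.lookup(0x1000, requester_id=1))  # consumer is directed only to agent 0
```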


#20180239702 Locality-aware and sharing-aware cache coherence for collections of processors

Cache coherence across multiple processors on an interconnect network (in one case an interposer 412)
[0019] Processing system 100 includes a distributed, shared memory system. For example, all memory locations of memory system 108 are accessible by each of processors 102, 104, and 106. Memory system 108 includes multiple memory portions, which are distributed across processors 102, 104, and 106. Memory portion 110 is local to processor 102 (e.g., tightly integrated with processor 102) and remote to processors 104 and 106 (e.g., within processing system 100 and accessible to processors 104 and 106, but not local to processors 104 and 106).


Some more applications that are interesting to read regarding AMD's ideas/attempts at memory-access optimization in a CPU and/or with GPUs:
#20180039587 Network of memory modules with logarithmic access
#20180018105 Memory controller with virtual controller mode
#20180019006 Memory controller with flexible address decoding
#20180143905 Network-aware cache coherence protocol enhancement
#20180239722 Allocation of memory buffers in computing system with multiple memory channels

Search in USPTO website here
http://appft.uspto.gov/netahtml/PTO/srchnum.html

TL;DR;
In simple terms,
#20180102338 describes how to create cavities in the substrate, place connecting chiplets with conducting pads on both surfaces into these cavities, and then place the real chips on top of them. Due to the high density of the connecting chiplets, complex routing between the chiplets can be achieved, and connections to the pin arrays are also provided.
#20180096938 describes chiplets of different process nodes integrated in one package
#20180239702 describes cache coherence across multiple processors.
#20180239708 talks of a node with multiple processors/dies, each with multiple cores, all of them using a single memory controller and keeping the LLCs in sync across the dies
#20180239722 talks of a single memory controller in one computing device, but I'm not sure if it's only for the APU use case.
 
Last edited:

Glo.

Diamond Member
Apr 25, 2015
5,759
4,666
136
The saving a tape out part is only true, if they won't make a 8 core raven ridge succesor (that would easily work, even on mobile). If they do, thry might opt out from 12-16 core AM4 multichip parts. They could of course also do both
APU in this design can also be chiplet based.
 

dacostafilipe

Senior member
Oct 10, 2013
772
244
116
Not certainly. If the controller chip is built to be modular they could make a smaller one specifically for AM4 quite easily. That saves them from another 7nm tape out as the controller is 16nm/14nm.

Yes, 14/16nm is cheaper, but it still costs a lot.

Add the more complex packaging (price, yields, size, ...) to it, and IMO it makes less and less sense to go with chiplets for desktop usage.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Yes, 14/16nm is cheaper, but it still costs a lot.

Add the more complex packaging (price, yields, size, ...) to it, and IMO it makes less and less sense to go with chiplets for desktop usage.
We need someone to put some numbers on the table here. Can someone do that?

Looking at the roadmap, 7nm+ is early 2020, so there's only approx. one year in the market to pay for the upfront cost.

We have to remember that variable cost is far less risk than upfront investment. It's a tradeoff: more risk = less money. Making these decisions is a huge deal.
 

noneis

Junior Member
Mar 4, 2017
21
29
91
We need someone to put some numbers on the table here. Can someone do that?

Design cost is increasing exponentially:



That's why AMD will probably go with something like this:

7nm CPU Core chiplet = ~$300M
7nm GPU Navi chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
12/14/16nm Desktop/Mobile IO chiplet = ~$100M
2-3 12/14/16nm GPU IO chiplets = ~$200-300M

Total cost $1.0-1.1B, they can then do 7nm+ refresh for ~$600M

Traditional approach:
7nm CPU Core Server chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
7nm CPU SoC = ~$300M
7nm APU = ~$300M
GPU1 = ~$300M
GPU2 = ~$300M

Total cost: ~$1.6B, 7nm+ refresh would cost additional ~$1.5B

Investing ~$3.1B instead of $1.6-1.7B just for designs in 2 years could be too much for a company like AMD, because their quarterly research and development budget is below $400M.
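
Just restating those figures to show how the totals add up (the per-design costs are the estimates above, not official numbers):

```python
# Totals implied by the poster's own estimates (all values in $M).
chiplet_approach = {
    "7nm CPU core chiplet": 300,
    "7nm GPU Navi chiplet": 300,
    "12/14/16nm server/TR IO chiplet": 100,
    "12/14/16nm desktop/mobile IO chiplet": 100,
    "12/14/16nm GPU IO chiplets (2-3)": 250,   # midpoint of the $200-300M range
}
traditional_approach = {
    "7nm CPU core server chiplet": 300,
    "12/14/16nm server/TR IO chiplet": 100,
    "7nm CPU SoC": 300,
    "7nm APU": 300,
    "GPU1": 300,
    "GPU2": 300,
}

chiplet_total = sum(chiplet_approach.values())           # ~1050 -> "$1.0-1.1B"
traditional_total = sum(traditional_approach.values())   # 1600  -> "$1.6B"

# Two-year outlook including the 7nm+ refresh estimates from the post:
print(chiplet_total + 600)       # ~1650 ($M) -> "$1.6-1.7B"
print(traditional_total + 1500)  # 3100 ($M)  -> "$3.1B"
```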
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
With exponentially increasing design cost, TTM (time to market) probably goes the same way.

Consequences
1. You wish for longer depreciation to cover design cost. E.g. the 14nm APU gets a platform refresh and will cover WSA obligations as well as the lower-end market.
2. Your new products then need to sell for more. Adaptability to more "niches" and faster time to market help that.

IMO one of the strongest features of IO chiplets is therefore also the ability to respond faster to changing market needs. Nobody owns a crystal ball here and knows your opponent's moves 3 years in advance. Your 7nm+ might be broken. Amazon might get a completely revised software stack for key areas.

With IO chiplets on lesser nodes you get management controllability. It's like getting a rudder on a tanker that didn't have one before. I think it will fundamentally change how this business works.
 

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
APU in this design can also be chiplet based.
Yes they could. I'm still sceptical they will do it with the first generation. Doing APUs with chiplets immediately complicates things considerably:

1. Building an interposer and at least 2-3 different chiplets is expensive. You can't really sell that anywhere near the price of a 2200G or a 2500U, which is still the price level most OEMs are willing to pay for AMD models.

2. APUs are mostly for low power (mobile). So in addition to all the new architectural problems they face, they also need to tackle all the battery-life problems. There is definitely lots of room for improvement even on the relatively simple Raven Ridge (idle power, LPDDR support). Making a multi-module chip sip power at low-intensity tasks (e.g. web browsing) while crossing chiplet boundaries is definitely way harder. My guess would be that they'd do it later, perhaps with Zen 3, once they've already mastered the current chiplet design.

I have no doubt AMD will eventually do it; after all, the benefits of such modularity could be massive. E.g. they could release new CPU and GPU chiplets out of sync: release a processor with a 12nm Vega GPU (as Navi isn't ready), then update it to Navi later without changing any other chiplets, do configurable models with/without HBM, etc...

I'm not yet convinced they'll do it this generation. I wouldn't mind being wrong, though.
 
Last edited:

noneis

Junior Member
Mar 4, 2017
21
29
91
Consequences
1. You wish for longer depreciation to cover design cost. E.g. the 14nm APU gets a platform refresh and will cover WSA obligations as well as the lower-end market.

I expect both low-end CPU and low-end GPU markets to be 12nm products, because it would require huge volume to pay back the ~$300M 7nm design cost. Something like 15+ million units of GPUs in the RX 550 market segment, which is ~10% of global yearly GPU unit sales. Maybe NVIDIA could try 7nm low-end, because they have spare money to gamble.

2. Your new products then need to sell for more. Adaptability to more "niches" and faster time to market help that.

Chiplet design has a higher base manufacturing cost, so I would expect it only in products >$150. With increasing manufacturing and design costs and slower advancement in density of cutting-edge nodes, the market is heading for higher prices with longer upgrade cycles anyway. I expect $499-$599 high-end desktop CPUs to be something normal in the future (instead of $329-$359).

7nm is just a start: 5nm has a >$500M design cost and 3nm could go even above $1.0B (https://semiengineering.com/big-trouble-at-3nm/). While companies can still do multiple designs for various market segments at 7nm, at 3nm that will no longer be possible. Some kind of scalable multi-die solution will be necessary if a company wants to sell in different market segments. Companies can't do four $1.0B 3nm designs every 18 months like in the past.
 

dacostafilipe

Senior member
Oct 10, 2013
772
244
116
That's why AMD will probably go with something like this ...

AMD will not produce GPU chiplets for discrete GPUs (for now), so you would need to add GPU1+GPU2 to your estimate, which would then rise to ~$2.3-2.4B.

My "vision":

7nm CPU Core chiplet = ~$300M
12/14/16nm Server/TR IO chiplet = ~$100M
7nm CPU SoC = ~$400M
GPU1 = ~$300M
GPU2 = ~$300M

Total: ~$1.4B + ~$1.3B for refresh = ~$2.7B

One of the negative points about Ryzen from the OEMs was the lack of an iGPU. That's why I see this as an opportunity to grow larger in that area. Also, Ryzen will need the lower latency that goes well with gaming.

Edit: FYI, I think your estimate is too high, as a lot of IP is reused across multiple parts. For example, GPU2, if using GPU1 as a basis, would certainly cost a lot less.
 
Last edited:
Reactions: Vattila and Gideon

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
I expect both low-end CPU and low-end GPU markets to be 12nm products
For the time being this is certainly the case. After all, we are still waiting for the upcoming Ryzen APU, which will be at 12nm.
 

naukkis

Senior member
Jun 5, 2002
768
634
136
1. Building an interposer and at least 2-3 different chiplets is expensive. You can't really sell that anywhere near the price of a 2200G or a 2500U, which is the price level most OEMs are willing to pay for AMD models still.

Yet there's only one reason to use an interposer and chiplet design: to reduce cost vs. an integrated circuit. The last such CPU was Intel's Clarkdale; the cheap CPUs got the multi-die design while the pricier versions (Nehalem) got a unified silicon design.

For AMD a chiplet design is the cheaper alternative for sure, but as Clarkdale shows, it usually comes with inferior performance vs. an integrated silicon version.
 