Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think is likely to double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

Doug S

Platinum Member
Feb 8, 2020
2,470
4,028
136
It seems like you are possibly massively over-estimating the amount of floating point used in a JavaScript execution thread vs. the amount of pure integer work used in actually displaying the GUI. The work to display the GUI is often huge compared to what is actually running in it. Also, I don't know if anyone is talking about a core with absolutely no FP resources. If you have a separate small core with a scalar FP unit, or even just a narrow 128-bit unit, that could be used to handle any floating-point instructions. Technically they could emulate any vector instructions with a scalar unit; it would just be slow. You could even emulate floating-point units with integer units, but that would be excruciatingly slow.

You only need ONE use of floating point to force a thread to trap off a small core that doesn't support it. How much Javascript code do you think is out there that does not use ANY math at all? If it sets a value, compares a value, anything, then it will trap (JavaScript numbers are doubles, so even trivial math is floating point).

Once it has trapped to the big core you aren't going to let it move back to the little core, because once it has trapped you know it will happen again. Moving back and forth between big and little cores due to lack of instruction support (as opposed to performance need) is going to waste a ton of power as you keep moving to a core with a cold dcache and TLB, an icache that needs refilling, missing BTB history, etc.
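For anyone wondering what that trap-and-migrate flow would even look like, here's a minimal OS-level sketch, assuming a hypothetical little core that raises SIGILL on unimplemented FP/vector instructions and made-up core numbering. This is an illustration of the mechanism being discussed, not how any shipping scheduler works:

```cpp
// Hypothetical sketch: migrate a thread to a big core the first time it
// hits an instruction the little core doesn't implement. Linux-specific;
// the core IDs are invented for illustration.
#include <csignal>
#include <sched.h>

static void on_unsupported_insn(int) {
    cpu_set_t big_cores;
    CPU_ZERO(&big_cores);
    CPU_SET(0, &big_cores);   // assume CPUs 0 and 1 are the big cores
    CPU_SET(1, &big_cores);
    sched_setaffinity(0, sizeof big_cores, &big_cores);
    // Returning from the handler re-executes the faulting instruction,
    // now on a core that has the unit. Note there is no path back to the
    // little core: exactly the one-way migration described above, paid
    // for with a cold dcache/TLB/BTB on arrival.
}

int main() {
    std::signal(SIGILL, on_unsupported_insn);
    // ... run the workload: the first FP/vector instruction traps once,
    // migrates, and everything afterwards stays on the big core.
}
```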

It seems like the goalposts are moving if you now say you're talking about a core with less FP resources. Well OF COURSE the little core will have fewer FP resources, as well as fewer int resources, fewer load/store resources, and less of everything really. That's absolutely not what was being argued at all. The post I originally replied to was talking about cutting instructions / ISA support out of the small core, not having it be narrower or in-order.
 
Last edited:
Reactions: scineram

soresu

Platinum Member
Dec 19, 2014
2,934
2,159
136
So is the small core going to be like a K6 with super cache support? Maybe more like the 5x86, before MMX.
Eh?

AMD's last 'small' core, Jaguar, was better than K6 and supported up to the AVX instruction set, albeit requiring two cycles per 256-bit AVX instruction.

I would not expect any new small core to be less than that, and more likely considerably more.
 
Reactions: Tlh97 and scineram

Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
I'm not sure of that. Apple is going to be buying up a lot of N5 as they continue to transition their product line from Intel to their own ARM SoCs. They aren't the highest volume sales, but we're talking about their entire product line on top of all of their iDevices. We also don't know what Nvidia's plans are, but given AMD is more competitive with their RDNA2 GPUs, that may push Nvidia to try to get back on TSMC as well just to ensure no …

Apple moves very quickly away from producing last-generation phones, and phones are by far their biggest volume product. Apple had a rapid transition from N7 to N5 a year ago.

I would venture to guess that right now, Apple does not have any iPhone SoCs on N7 being processed by TSMC.

Q2 '21 was when the other mobile players started their transitions. By Q1 2022, when the Zen 4 wafers will likely start, there will be just the bottom feeders of the mobile space on that node, plus trailing-edge products.

Probably the only high-profile product left on N7 may be Nvidia with their A100 line.

Also, when AMD starts the transition to Zen 4 on N5, even more N7/N6 capacity will start to open up at the same time, since AMD is likely the biggest customer on N7. So like I said, there will be a glut on that node.

Sony and Microsoft console sales haven't even managed to get down to MSRP yet and are still being heavily scalped online, suggesting continued strong demand. They'll buy up any additional N7/N6 wafers they can get their hands on, especially since it's likely that demand will be just as strong this next holiday season.

Sony and MSFT will be beneficiaries of this node migration as well. IMO, supply will match demand by Christmas.

I don't think there's as much free capacity at TSMC in the near or even mid-term as you might be expecting.

The big crowd will be on N5, including Zen 4 die.

It's a good thing that AMD may have a very competitive Zen 3D V-Cache part that is entirely on N7/N6, while there is likely a shortage of N5 capacity.

The rumors about MCM GPUs have been around for a while, but they are apparently hard to do, at least when it comes to gaming performance, which is why they may not materialize for a bit longer. Obviously if it can be done it provides the same kind of economic advantages that Zen did, possibly even more. However, there is another possibility: they create a graphics chiplet designed for professional/datacenter workloads, which apparently don't suffer from being split up across chiplets nearly as much.

A single chiplet designed for those GPUs could still be used with an APU as long as any additional necessary logic gets added to the IO die. I don't know if they'd need to design another communication link, though, or if they could just recycle the existing link that connects CPU chiplets to the IO die. The only real concern is bandwidth, but really they only need to match the available memory bandwidth. They also already have a high-speed link designed for connecting their GPUs that might be more suitable and could perhaps be repurposed, but I don't know.
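To put rough numbers on the "only need to match memory bandwidth" point, here's a back-of-the-envelope sketch where the DDR5 configuration and fabric clock are my assumptions, not anything AMD has stated:

```cpp
// Back-of-envelope: how many bytes per fabric clock must a GPU chiplet
// link move to keep up with dual-channel DDR5? All numbers are assumed.
#include <cstdio>

int main() {
    const double mem_bw = 2 * 8 * 5.2;  // 2 channels * 8 B * 5.2 GT/s = 83.2 GB/s
    const double fclk   = 1.8;          // assumed fabric clock, GHz
    std::printf("need ~%.0f B per fabric clock to match %.1f GB/s of DRAM\n",
                mem_bw / fclk, mem_bw);
    // ~46 B/clk, the same ballpark as the existing CCD link (commonly
    // described as 32 B read + 16 B write per fclk), so reusing that link
    // for an iGPU chiplet doesn't look bandwidth-limited.
}
```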

I think those standalone GPU units are too massive currently, and their partitioning will likely come in conservative half-steps. Nowhere near achieving a practical 50-100 mm² chiplet that could be used as a standalone graphics chiplet for an APU.

I agree that doesn't make sense. Either they make a monolithic APU or they make a special chiplet for graphics. It seems like we may still be another generation away from seeing chiplet-based GPUs though. There are some rumors about RDNA 3 having multi-die GPUs, but I think these may just be more akin to traditional multi-die setups, where two standalone monolithic dies on a single card/package are connected together.

Agreed. Perhaps not even the upcoming Zen 3 based Rembrandt, maybe the one after that - the Zen 4 based Phoenix (?)
 
Reactions: Tlh97

Doug S

Platinum Member
Feb 8, 2020
2,470
4,028
136
Apple moves very quickly away from producing last generation phone - and phones are by far their biggest volume product. Apple had a rapid transition from N7 to N5 a year ago.

I would venture to guess that right now, Apple does not have any iPhone SOCs on N7 being processed by TSMC.


When Apple introduces a new line of iPhones, they continue selling last year's model at a discount. For the last few years they've been keeping the low end of the two-year-old one on the price list as well. Plus there's the "SE", which is only updated every few years and stays on the same SoC for a while - the "SE2" uses the A13, which is N7P, and will be around for 2 or 3 years. When you see carrier deals for iPhones they are often last year's version, probably because they are cheaper for the carrier and Apple is perhaps willing to offer additional discounts on those that it won't on the latest and greatest, so they probably sell a lot more of those than you'd think.

I've seen estimates that only about half of Apple's iPhone sales are of the latest model, with the other half being the one- and two-generation-old models and the SE. Whether that's true we have no way of verifying, but the fact that analysts are guessing it probably means there's plenty of evidence that those older phones sell in much larger numbers than you seem to think.

Then there's the iPad (non-Pro), the iPad Mini, and the Apple TV (three generations now - they still ship the Apple TV HD, which uses the 20nm A8, alongside the old 4K model using the 10nm A10X). Don't forget the Apple Watch S-series SoCs, which they sell multiple years' versions of, and the HomePod, which I think is also A8, or at least was at first. And I'm not really sure what process the "W" chips in the AirPods are made on, but it is not going to be N5, simply based on when the various models were released.

None of those sell as many units as the iPhone, and some (i.e. Watch and AirPods) have smaller, less complex chips, but when you add all that stuff together that's a crapload of wafers on multiple processes older than N5. And that doesn't even count any ancillary chips that may be included in various products that don't get the "fanfare" of the A*, M*, S* and W* chips.
 
Last edited:

uzzi38

Platinum Member
Oct 16, 2019
2,698
6,393
146
The number of customers who want a 16-core Zen CPU with onboard graphics isn't terribly large. I'm not sure the added cost across an entire product line is worth what niche market segments they might be able to pick up or the small bit of extra convenience that the onboard graphics provides if a GPU goes bad and there isn't a spare to use.

OEMs want iGPs.

Laptops want iGPs.

Those are both way bigger target markets than DIY PCs, and both require Raphael. AMD needs an -S BGA/-H55 competitor for laptops, and they want to compete against Intel across the stack on the desktop as well, not be relegated just to gaming OEM boxes. Regular APUs just won't cut it for proper competition past a certain point in time.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Suppose chiplet 1 needs data that is in L3 of chiplet 2. The bandwidth limit and latency hit would make those accesses only marginally faster than single channel DRAM access.

Just a little clarification (I think we've discussed this before in another forum): The L3 is not shared between CCXs. That would create horrible contention for 64-core EPYC with 8 CCXs, and even more so for 2-socket 128-core systems. The states of the caches are only kept consistent within the rules of the x86 memory model, using a cache-coherency algorithm which is designed to do as little as possible — just enough to make it possible for all cores to agree on the state of memory.

Apart from any synchronisation needed by cache-coherency, an L3 miss goes straight to memory, as I understand it. Correct me if I am wrong.

Interestingly, the thing that kills performance and makes inter-CCX latency a bottleneck is high use of shared memory and locks. This puts the cache-coherency algorithm in overdrive with a lot of synchronisation between cores and CCXs. When cores work on separate memory only, or treat any shared data as read-only, the need for synchronisation goes away.
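You can provoke exactly that overdrive state on any multi-CCX part with false sharing. A minimal C++ sketch (timing left out for brevity; run each case separately and compare): two counters on the same cache line force the line to ping-pong between cores, while padding them onto separate lines makes the coherency traffic vanish.

```cpp
// Toy demonstration: two threads hammering the same cache line force
// constant coherency traffic; padding onto separate lines removes it.
#include <atomic>
#include <thread>

struct Shared   { std::atomic<long> a{0}, b{0}; };          // same 64 B line
struct Separate { alignas(64) std::atomic<long> a{0};
                  alignas(64) std::atomic<long> b{0}; };    // one line each

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (long i = 0; i < 100'000'000; ++i) s.a++; });
    std::thread t2([&] { for (long i = 0; i < 100'000'000; ++i) s.b++; });
    t1.join(); t2.join();
}

int main() {
    Shared s; Separate p;
    hammer(s);  // slow: the line ping-pongs between cores/CCXs
    hammer(p);  // fast: no shared line, so no coherency traffic at all
}
```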

PS. The non-sharing of L3 is also why increasing the L3 available to each CCX has such a big effect on performance, even in chips with multiple CCXs and high total L3, since that total isn't accessible to any single CCX. We saw this with the move from a 4-core to an 8-core CCX with a larger shared L3 cache. V-Cache multiplies this effect.
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
8,000
6,433
136
That is actually wrong. The biggest bottleneck right now is substrate. Intel said they used up a lot of their reserves in Q2, and they don't worry about losing any market share in Q3, because substrate is so depleted that in Q3 their competitor (AMD) is now going to face an identical constraint, unable to increase production.

So this is actually the perfect time to add V-Cache: it sidesteps the real bottleneck, which right now is substrate. Adding V-Cache makes the product more valuable without using any additional substrate.

You misunderstand. It doesn't matter if AMD has loads of spare wafers to make tons of these, because they're still limited by the slowest part of the process for manufacturing chiplets with V-cache. If they can only process 10 wafers' worth of chiplets per day through V-cache bonding, then producing more than 10 wafers per day of the cache just means it piles up waiting for the slowest production stage.

Do you think TSMC just magically has infinite machines to perform the 3D stacking and bonding process?
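The point in arithmetic form: end-to-end output is the minimum over the stage rates, so wafer starts beyond the bonding rate just become work-in-progress. The numbers below are made up for illustration:

```cpp
// Throughput of a pipeline is the minimum across its stages, so extra
// wafer starts beyond the bonding rate just pile up as WIP.
#include <algorithm>
#include <cstdio>

int main() {
    double wafer_starts = 50;  // wafers/day of V-cache chiplets (made-up)
    double bonding_rate = 10;  // wafers/day the stacking step handles (made-up)
    double output = std::min(wafer_starts, bonding_rate);
    std::printf("shippable output: %.0f wafers/day; %.0f/day piles up\n",
                output, wafer_starts - bonding_rate);
}
```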
 
Reactions: scineram

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I mean, trading the ST for MT has traditionally been a niche - look at ..., Bulldozer, etc.
There is no indication or design point showing that Bulldozer sacrificed ST for MT. The performance loss is instead mostly attributable to server/HPC optimizations: a big L2 (higher latency), a big front-end (no room for a µop or trace cache), a big FPU (with longer latencies to support server/HPC execution-unit demands), etc.

In fact, before CMT was selected for Bulldozer, the prior CMT implementation that was supposed to launch with K8 was significantly smaller and faster. It had a single retire unit and a single LSU for two cores, each of which had two 64-bit ALUs (FU0/1), two 64-bit integer MMX units (MM0/1), two 64-bit FPU SSE units (FP0/1), one load AGU (LDA), one store AGU (STA), and, in the shrunk version, a store-data unit (STD). It had significantly fewer OoO resources than Bulldozer, since CMT is meant to be implemented at low power, whereas full CMP and big-core SMT increase power complexity.

Going more in-depth on the confusion:

Bulldozer's architecture was originally referred to as a "Compute Core", with scalable partitioned execution resources as a key feature. What if this implied a capability similar to IBM's CLA/CLB architecture, with one SMT4+4 core or two SMT4 cores? Bulldozer could then be either one combined core or two clustered cores, with the combined core being the original intent for production models, and power draw scaling linearly with the partitioned slices.


Various modes mentioned in the 2007 filing:
-> instruction dispatch module can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like.
-> instruction dispatch module can be configured to dispatch integer instruction operations associated with the thread to both integer execution units 212 and 214 based on a predefined or opportunistic dispatch scheme.
-> The integer execution units 212 and 214 can be used to implement a run ahead scheme whereby the instruction dispatch module dispatches memory-access operations (e.g., load operations and store operations) to one integer execution unit while dispatching non-memory-access operations to the other integer execution unit.
-> Another example of a collaborative use of the integer execution units 202 and 204 is for an eager execution scheme whereby both results of a branch in an instruction sequence can be individually pursued by each integer instruction unit.
-> As yet another example, the integer execution units 212 and 214 can be used collaboratively to implement a reliable execution scheme for a single thread. In this instance, the same integer instruction operation is dispatched to both integer execution units 212 and 214 for execution and the results are compared by, for example, the thread retirement modules 226 of each integer execution unit.

The 32nm Bulldozer only had the first mode: one thread per partition. The 45nm Bulldozer design had the collaborative modes as well, which is why they claimed the "highest performing single-threaded (+ multi-threaded) compute core in history" back in 2007.
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I thought there was supposed to be a Monet in there somewhere on the value spectrum. GF12, quad, RDNA2, Zen3...
The value-proposition node at GlobalFoundries is 12FDX, not 12LP+. Cost per mm²:
22FDX = 1x
12LP+ = ~1.6x [2019 node]
12FDX = ~1.2x [2022 node]

So, I doubt "4c Zen3" plus "a couple RDNA2 WGPs" on 12LP+ is aimed at value.
 

Zucker2k

Golden Member
Feb 15, 2006
1,810
1,159
136
Some info:

Microsoft has published documentation for the Milan-X HBv3 VMs with the following performance projections, VM size details and technical overview:

  • Up to 80% higher performance for CFD workloads
  • Up to 60% higher performance for EDA RTL simulation workloads
  • Up to 50% higher performance for explicit finite element analysis workloads
  • Up to 120 AMD EPYC 7V73X CPU cores (EPYC with 3D V-cache, “Milan-X”)
  • Up to 96 MB L3 cache per core complex (3x larger than standard Milan CPUs, and 6x larger than "Rome" CPUs)
  • 350 GB/s DRAM bandwidth (STREAM TRIAD; see the sketch after this list), up to 1.8x amplification (~630 GB/s effective bandwidth)
  • 448 GB RAM
  • 200 Gbps HDR InfiniBand (SRIOV), Mellanox ConnectX-6 NIC with Adaptive Routing
  • 2 x 900 GB NVMe SSD (3.5 GB/s (reads) and 1.5 GB/s (writes) per SSD, large block IO)
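For anyone unfamiliar with the STREAM TRIAD figure cited above: it measures sustained memory bandwidth with a(i) = b(i) + q*c(i) over arrays far larger than cache. A minimal single-threaded sketch follows; the real benchmark is OpenMP-parallel and carefully tuned, so treat this as illustration only (compile with -O2):

```cpp
// Minimal STREAM-TRIAD-style bandwidth measurement. Arrays are sized to
// blow past any L3 (even V-cache) so DRAM bandwidth dominates.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 25;  // ~32M doubles ≈ 256 MB per array
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double q = 3.0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + q * c[i];
    double dt = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - t0).count();
    // TRIAD moves 3 arrays * 8 bytes per element (2 reads + 1 write).
    std::printf("TRIAD: %.1f GB/s\n", 3.0 * 8.0 * n / dt / 1e9);
}
```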

Oh yeah, Intel is in trouble alright. AMD is going for the jugular here, and it'll be interesting to see how Intel responds.

This is a giant stride in computing. Kudos to AMD for being bullish with the way they keep pushing chip development on x86. Simply stupendous!

Edit: @Markfw what don't you like about my post?
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
A theoretical 32C wouldn't compare as favorably against the 5950X, because the 5800X is pushed into a steeper part of the V/F curve, but it should still have roughly a 50% advantage over the 16C part.

Exactly. Abwx nailed it. And with modern Turbo, when only 16 cores are used it'll clock just as high as the 5950X, meaning it'll be faster in everything.

From the Mike Clark interview, it seems that AMD is going to keep some base amount of L3 on die, so that the chip can be sold without stacking. At least for the next 1-2 generations.

Latency doesn't have to be higher with V-cache. It could be arranged so the base cache gets the same latency and only the V-cache layer is higher latency.

Also, when it comes to costs, the tiny die itself might cost little, but the complexity of stacking is what raises costs.

Similar to how packaging costs dominate in sub-100mm² CPUs but matter less in larger-die CPUs. Things like packaging and stacking are fixed per-unit costs that are easier to absorb in expensive, larger configurations.
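A toy illustration of that fixed-cost point, with invented dollar figures: the same per-part packaging/stacking cost is a large share of a cheap die and a small share of an expensive one.

```cpp
// Fixed per-unit packaging/stacking cost vs. die cost. All dollar
// figures here are invented for illustration.
#include <cstdio>

int main() {
    const double pack_cost   = 15.0;            // fixed packaging/stacking cost per part
    const double die_costs[] = { 20.0, 120.0 }; // small chiplet vs. large die (assumed)
    for (double die : die_costs)
        std::printf("die $%.0f + packaging $%.0f -> packaging is %.0f%% of total\n",
                    die, pack_cost, 100.0 * pack_cost / (die + pack_cost));
    // $15 on a $20 die is ~43% of the cost; on a $120 die it's ~11%.
}
```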
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
You are just showing flip-chip orientation, which has been used for many generations on these types of chips, starting before FinFETs. The rest of your post seems to be numbers plucked out of the air, so I'm not sure what to make of them.

I'm still confused, though, because you say both are the orientation going forward, so I don't know which way you think AMD/Intel are packaging their chips.
It's Nosta just BSing about stuff he doesn't understand, like usual.
 
Reactions: scineram

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
It seems that AMD works on packaging -> core -> packaging -> core. Packaging seems to take more time.
As @DisEnchantment mentioned, parts of the packaging work are in-house. But where this gets interesting is process nodes and other technology not achievable in-house. There AMD can't do anything but wait for its partners to finish their work, case in point being Zen 3's V-Cache, which has been prepared since the Zen 3 launch but only became available later with Milan-X, and later still on the consumer market.

I wonder how AMD plans to handle such external dependencies over the long run. Ideally they wouldn't have a single linear roadmap but be able to launch node, core and packaging improvements as they become ready for HVM.

Evan Burness has gone on record saying Azure captures EPYC shipments before any other customer can get their hands on them.
Basically that is a strategy for them to preempt supplies before other customers can even get their hands on them.
Then customers go to Azure for competitive analysis, and Azure puts out the best offering because nobody else got the chips before them.
He was also bragging that they have not lost a single competitive analysis requested by any customer.
With such early access and day-and-date availability, Azure essentially turned EPYC into their own Graviton-style competitive advantage, at least temporarily, which is a big win for Microsoft in the cloud market.

But I'm not sure this is really a positive for AMD, which over the long run should be more interested in a customer base broader than this. Microsoft's commitment must be significant.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Sound reasoning, but it doesn't answer at all what's up with the extended WSA between GloFo and AMD. An additional $1.4 billion until 2024, when it was widely thought that the WSA would run out and not be renewed in 2021. Did the WSA turn into a poison pill for GloFo instead of AMD, securing 14/12nm FinFET production for AMD until 2024 while GloFo wanted to convert those fabs earlier?
The WSA, via pre-orders, has only secured 14nm/12nm FinFET production up to 2023, which is the year when GlobalFoundries will be increasing 45nm & 12nm FD production at Malta.

AMD's relationship with GloFo keeps it in the know, so any changes to GloFo's plans would be forwarded to AMD ahead of time. In this scenario, with all the information gathered, AMD would transfer EPYC/Ryzen/V&R-series away to TSMC, while reviving Opteron (cost-effective datacenter), Sempron (cost-effective personal computing) and the G-series (cost-effective embedded and other, via SCBU). This also includes GlobalFoundries getting the low-end GPU market again.

2022 = Fin production (1st half + 2nd half); AMD's FDSOI locks in development (2nd half)
2023 = Fin production goes down to zero (2nd half); AMD's FDSOI ramps to production (2nd half)
2024/2025 = Only FDSOI production via AMD at GF going forward.

Began 28nm production at ~$6000 USD (2014) => enters 22nm production at ~$2000 USD (2024).
With 14nm/12nm FinFET stuck at ~$4000 USD (2022+) for AMD, and the price hike to others pushing it to Samsung's/TSMC's near-$5500 USD levels.

Higher capacity and lower costs for AMD, and fewer masks and process steps for GloFo => shorter lead times and higher profit margins for both with FDX.

TSMC handles premium and higher ASP => better deals and more funding for enhanced FinFETs
GloFo handles cost-sensitive and lower ASP => more stable manufacturing capacity for FDSOI

The lower ASP follows from the lower price point versus the introduction of 28nm/14nm at $6000 (2014) / $8000 (2017) USD. The same area on a mature insert means a lower price point for performance similar to 14nm, lower power than 14nm, reduced cost of development (AMD & SCBU), reduced TTM, etc.

TSMC gained the LP-patent set from GlobalFoundries anyway:
The companies have agreed to a broad life-of-patents cross-license to each other’s worldwide existing semiconductor patents as well as those patents that will be filed during the next ten years as both companies continue to invest significantly in semiconductor research and development.

The 12LP/7LP/5LP/3LP plans that AMD was aware of are better implemented by TSMC: money, multiple node technicians, multiple fab modules, known multi-continent 16nm/12nm/5nm offshore fabs, etc.
The above specifically: "Technology research covering 14nm, 10nm, 7nm, 5nm, 3nm CMOS FinFET technologies for mobile SoC/ASIC" <== left GF in 2017.
GlobalFoundries has yet to implement SSRW high-mobility FinFETs, per the 2016 IEEE papers for 7nm/5nm FinFETs and the 2019 IEEE papers for 14nm/12nm FinFETs, which AMD might get if they switch to TSMC rather than stay at GloFo. In that case, AMD is likely to get a semi-custom customer that wants to add networking accelerators to a Zen++ SoC at TSMC, or whatever; Zen++ would be the cheapest option among Zen architectures at TSMC.

A move from GloFo to TSMC is more secure given Zen's market and the largely unsuccessful GlobalFoundries SCBU (if it had been at TSMC, it wouldn't have failed).
GlobalFoundries FinFET = ghost town: one module that was exclusive to FinFETs and is now transitioning to FDSOI.
TSMC FinFET = growing metropolis: several modules, including the Japan and Dresden FinFET fabs (and the Nanjing FinFET fab), with the extension of the Arizona FinFET fab.

There is also a better track record of supporting custom Zen core implementations at TSMC.

Zen & Dhyana at GlobalFoundries: demand for Zen there is pretty much in decay (preference for TSMC killing Zen-GloFo), and Dhyana is dead.
Zen 2, Zen 2-Sony, Zen 3, Zen 4, Zen 4c... why not add more on 16nm/12nm?

On the IOD argument: TSMC is more experienced with I/O FETs, so naturally TSMC's IOD will be superior to GlobalFoundries'.

Basically, everyone is abandoning 14LPP/12LP/12LP+ for TSMC 16FF/12FF or Samsung 14LPP/14LPU/11LPU. However, some customers are simply doing 84CPP-7.5T on 12nm-Fin to move to 84CPP-7.5T on 12nm-FDX; the FDXcelerator fast track. This gives more credence to GlobalFoundries swiftly down-ramping (killing FinFETs) once a replacement node pops up. By the way, 12FDX is definitely doing some form of risk production this year.

Fab8 Device Director, 14LPP/12LP/12LP+/12FDX/45RF - successful technology deliverables: 45RF | 12FDSOI | 12LP FinFET | silicon photonics
Malta 12FDX integration for 1 year 2 months (as of edit: Jan 2021-present)
Malta 12FDX FEOL/MOL process optimization for 2 years 8 months (as of edit: August 2019-present)
Malta/Essex Junction ESD/latchup development lead for 12LP, 12FDX, 22FDX, and 28SLPe (as of edit: June 2021-present for 12FDX)

Cut extremely short: NPI/NTO (anything new) goes to TSMC if it's Zen-related; already apparent with RDNAx/CDNAx being TSMC-exclusive as well.

AMD/GlobalFoundries legacy/obsolete big cores/big CUs, etc. => these get moved to TSMC, with a more aggressive SCBU at TSMC in preparation.
2023+ =>
AMD/GlobalFoundries new/modern small cores/small CUs, etc. Separation of node and fab between little cores (low cost) and big cores (premium cost), which reflects GlobalFoundries' strategy of prioritizing low-cost pervasive semiconductors.
 
Last edited:
Reactions: amd6502

eek2121

Diamond Member
Aug 2, 2005
3,042
4,259
136
It actually does beat the 8C Zen 3 in said compilation tests in Phoronix's testing:

View attachment 57517


The lowly 12400, with 4.4 GHz turbo clocks and 18 MB of L3, has no trouble keeping up with AMD's premier 8C CPU with 32 MB of L3.

People keep conveniently forgetting that the ADL big core is a 5-ALU (each LEA-capable), 512-ROB, mostly 6-wide monster. It is currently let down by a mobile-phone-worthy uncore and memory controller, but expect it to scale really well in the future with faster memory.

The only "complex" benchmark I need is Web Speedometer 2.0 (the thread is on this very forum). Since I own a 10900K, 5950X and 12900K, the numbers with highly tuned 3800-3900 DDR4 are ~210, ~255 and ~330 respectively. That's how much faster and smoother ADL is. I believe the 12900K does around 300 at stock as well. ~25% faster in my eyes.

It ran at 5.6 GHz, not 4.4 GHz? "Processor: Intel Core i5-12400 @ 5.60GHz"

EDIT: Just saw mods post. Not commenting further in this thread.
 

Tarkin77

Member
Mar 10, 2018
75
163
106
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- Each Zen 4c CCD splits its (16) cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core; see the arithmetic sketch after this list).
- Given that there's two CCXs on each Bergamo CCD, it is likely that there is a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD has likely figured out how to connect (12) memory channels to (8) CCDs, given that Milan already handles this kind of channel/CCD mismatch fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread-vs-thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.
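Spelling out the cache arithmetic in that summary (all figures are the rumored ones above, nothing confirmed):

```cpp
// Rumored Bergamo cache arithmetic, taken from the summary above.
#include <cstdio>

int main() {
    const int ccds = 8, cores_per_ccd = 16, ccx_per_ccd = 2;
    const int l3_per_ccd_mb = 32;
    std::printf("%d cores total\n", ccds * cores_per_ccd);               // 128
    std::printf("%d MB L3 per CCX, shared by %d cores\n",
                l3_per_ccd_mb / ccx_per_ccd,
                cores_per_ccd / ccx_per_ccd);                            // 16 MB / 8 cores
    std::printf("%d MB L3 per core (vs 4 MB on a 32 MB, 8-core Zen 3 CCX)\n",
                l3_per_ccd_mb / cores_per_ccd);                          // 2 MB
}
```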

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.

You realise that this information is behind the paywall? That's not fair to Charlie.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,794
4,075
136
I wouldn't so much want slower chips scabbed on. That's really a bit pointless.

I'd want a super-duper single core for gaming, with an 8-core on the second chiplet for other stuff. If you had a whole chiplet for one super-duper core, how badass could you make it?

That sounds an awful lot like a PS3, which was regarded as rather difficult to write code for.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,860
3,405
136
Yes, you are wrong. How do you propose to interconnect your cores and handle cache coherency? If you stack cache, what are you stacking it on top of?

It seems to me people weight the cost of the silicon itself way too highly, and thus come up with crazy ideas. I think even people like Ian get it wrong when considering price vs. cost for Zen 3 vs. Zen 3D. There is a lot more to product cost than just the die itself.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
For those unaware, OMI stands for Open Memory Interface:

| Specification | LRDIMM DDR4 | DDR5 | HBM2E (8-High) | OMI |
|---|---|---|---|---|
| Protocol | Parallel | Parallel | Parallel | Serial |
| Signalling | Single-Ended | Single-Ended | Single-Ended | Differential |
| I/O Type | Duplex | Duplex | Simplex | Simplex |
| Paths/Channel (Read/Write) | 64 | 32 | 512R/512W | 8R/8W |
| Data Transfer Rate | 3,200 MT/s | 6,400 MT/s | 3,200 MT/s | 32,000 MT/s |
| Channel Bandwidth (R+W) | 25.6 GB/s | 25.6 GB/s | 400 GB/s | 64 GB/s |
| Latency | 41.5 ns | 60.4 ns | n/a | 45.5 ns |
| Channels / Processor Die | 8 (EPYC Rome IO) | n/a | 5 (Nvidia Ampere) | 16 (POWER10) |
| Processor Die Size | 416 mm² | n/a | 826 mm² | 602 mm² |
| Driver Area / Channel | 7.8 mm² | 3.9 mm² | 11.4 mm² | 2.2 mm² |
| Bandwidth / mm² | 3.3 GB/s/mm² | 6.6 GB/s/mm² | 35 GB/s/mm² | 29.6 GB/s/mm² |
| Max Capacity / Channel | 64 GB | 256 GB | 16 GB | 256 GB |
| Connection | Multi-Drop | Multi-Drop | Point-to-Point | Point-to-Point |
| Data Resilience | Parity | Parity | Parity | CRC |
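As a sanity check on the table, the Bandwidth/mm² row is just Channel Bandwidth divided by Driver Area per channel:

```cpp
// Deriving the table's Bandwidth/mm2 row from its other two rows.
#include <cstdio>

int main() {
    struct { const char* name; double bw_gbs, area_mm2; } rows[] = {
        { "LRDIMM DDR4", 25.6,  7.8 },
        { "DDR5",        25.6,  3.9 },
        { "HBM2E",      400.0, 11.4 },
        { "OMI",         64.0,  2.2 },
    };
    for (auto& r : rows)
        std::printf("%-12s %.1f GB/s/mm2\n", r.name, r.bw_gbs / r.area_mm2);
    // -> 3.3, 6.6, 35.1, 29.1: matches the table within rounding (the OMI
    // cell reads 29.6, presumably computed from slightly different inputs).
}
```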

 

Det0x

Golden Member
Sep 11, 2014
1,053
3,075
136
The statement is from AT; they used 7-Zip as the metric for INT-based computation. Besides, we are talking about ST perf as a way to extract IPC without saturating memory bandwidth or being too limited by R/W latencies.

Tuned memory is certainly a great factor for improvement, but anything out of spec can't be considered guaranteed error-free by the manufacturer.

I did a test on Zen 3 as well, and with tuned memory @ 4.4 GHz it scores 7950 MIPS (vs. 6800 in the article @ 4.9 GHz, I believe), so it is another confirmation that the 7-Zip compression algorithm scales too well with memory to be a proper measurement of ALU throughput. Who knows where Zen 3 or ADL peak?
9079 MIPS with a 5800X3D @ 4560 MHz and memory at 1900:3800

8937 MIPS with a 5800X3D @ 4560 MHz and memory at 1800:3600

8834 MIPS with a 5800X3D @ 4560 MHz and memory at 1600:3200
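Quick scaling math on those three runs, which shows how little the X3D cares about memory clocks:

```cpp
// Memory clock vs. 7-Zip MIPS scaling, using the three results above.
#include <cstdio>

int main() {
    const double mclk[] = { 3200, 3600, 3800 };
    const double mips[] = { 8834, 8937, 9079 };
    for (int i = 1; i < 3; ++i)
        std::printf("+%.1f%% memory -> +%.1f%% MIPS\n",
                    100 * (mclk[i] / mclk[0] - 1),
                    100 * (mips[i] / mips[0] - 1));
    // +18.8% memory clock buys only ~2.8% MIPS here: with V-cache the
    // working set mostly fits in L3, so memory scaling is far weaker
    // than on the vanilla parts debated above.
}
```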


*edit*
Cleaned up post and screenshots
 

Last edited: