Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think is likely to double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

Doug S

Platinum Member
Feb 8, 2020
2,470
4,028
136
It seems like you are possibly massively over-estimating the amount of floating point used in a JavaScript execution thread vs. the amount of pure integer work used in actually displaying the GUI. The work to display the GUI is often huge compared to what is actually running in it. Also, I don't know if anyone is talking about a core with absolutely no FP resources. If you have a separate small core with a scalar FP unit, or even just a narrow 128-bit unit, that could be used to handle any floating-point instructions. Technically they could emulate any vector instructions with a scalar unit; it would just be slow. You could even emulate floating-point units with integer units, but that would be excruciatingly slow.

You only need ONE use of floating point to force a thread to trap off a small core that doesn't support it. How much Javascript code do you think is out there that does not use ANY math at all? If it sets a value, compares a value, anything, then it will trap (JavaScript numbers are doubles, so even trivial math is floating point).

Once it has trapped to the big core you aren't going to let it move back to the little core, because once it has trapped you know it will happen again. Moving back and forth between big and little cores due to lack of instruction support (as opposed to performance need) is going to waste a ton of power as you keep moving to a core with a cold dcache and TLB, an icache that needs refilling, missing BTB history, etc.
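For anyone wondering what that trap-and-migrate flow would even look like, here's a minimal OS-level sketch, assuming a hypothetical little core that raises SIGILL on unimplemented FP/vector instructions and made-up core numbering. This is an illustration of the mechanism being discussed, not how any shipping scheduler works:

```cpp
// Hypothetical sketch: migrate a thread to a big core the first time it
// hits an instruction the little core doesn't implement. Linux-specific;
// the core IDs are invented for illustration.
#include <csignal>
#include <sched.h>

static void on_unsupported_insn(int) {
    cpu_set_t big_cores;
    CPU_ZERO(&big_cores);
    CPU_SET(0, &big_cores);   // assume CPUs 0 and 1 are the big cores
    CPU_SET(1, &big_cores);
    sched_setaffinity(0, sizeof big_cores, &big_cores);
    // Returning from the handler re-executes the faulting instruction,
    // now on a core that has the unit. Note there is no path back to the
    // little core: exactly the one-way migration described above, paid
    // for with a cold dcache/TLB/BTB on arrival.
}

int main() {
    std::signal(SIGILL, on_unsupported_insn);
    // ... run the workload: the first FP/vector instruction traps once,
    // migrates, and everything afterwards stays on the big core.
}
```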

It seems like the goalposts are moving if you now say you're talking about a core with less FP resources. Well OF COURSE the little core will have fewer FP resources, as well as fewer int resources, fewer load/store resources, and less of everything really. That's absolutely not what was being argued at all. The post I originally replied to was talking about cutting instructions / ISA support out of the small core, not having it be narrower or in-order.
 
Last edited:
Reactions: scineram

soresu

Platinum Member
Dec 19, 2014
2,934
2,159
136
So is the small core going to be like a K6 with super cache support? Maybe more like the 5x86, before MMX.
Eh?

AMD's last 'small' core, Jaguar, was better than K6 and supported up to the AVX instruction set, albeit requiring two cycles per 256-bit AVX instruction.

I would not expect any new small core to be less than that, and more likely considerably more.
 
Reactions: Tlh97 and scineram

Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
I'm not sure of that. Apple is going to be buying up a lot of N5 as they continue to transition their product line from Intel to their own ARM SoCs. They aren't the highest volume sales, but we're talking about their entire product line on top of all of their iDevices. We also don't know what Nvidia's plans are, but given AMD is more competitive with their RDNA2 GPUs, that may push Nvidia to try to get back on TSMC as well just to ensure no …

Apple moves very quickly away from producing last-generation phones, and phones are by far their biggest volume product. Apple had a rapid transition from N7 to N5 a year ago.

I would venture to guess that right now, Apple does not have any iPhone SoCs on N7 being processed by TSMC.

Q2 '21 was when the other mobile players started their transitions. By Q1 2022, when the Zen 4 wafers will likely start, there will be just the bottom feeders of the mobile space on that node, plus trailing-edge products.

Probably the only high-profile product left on N7 may be Nvidia with their A100 line.

Also, when AMD starts the transition to Zen 4 on N5, even more N7/N6 capacity will start to open up at the same time, since AMD is likely the biggest customer on N7. So like I said, there will be a glut on that node.

Sony and Microsoft console sales haven't even managed to get down to MSRP yet and are still being heavily scalped online, suggesting continued strong demand. They'll buy up any additional N7/N6 wafers they can get their hands on, especially since it's likely that demand will be just as strong this next holiday season.

Sony and MSFT will be beneficiaries of this node migration as well. IMO, supply will match demand by Christmas.

I don't think there's as much free capacity at TSMC in the near or even mid-term as you might be expecting.

The big crowd will be on N5, including Zen 4 die.

It's a good thing that AMD may have a very competitive Zen 3D V-Cache part that is entirely on N7/N6, while there is likely a shortage of N5 capacity.

The rumors about MCM GPUs have been around for a while, but they are apparently hard to do, at least when it comes to gaming performance, which is why they may not materialize for a bit longer. Obviously if it can be done it provides the same kind of economic advantages that Zen did, possibly even more. However, there is another possibility: they create a graphics chiplet designed for professional/datacenter workloads, which apparently don't suffer from being split up across chiplets nearly as much.

A single chiplet designed for those GPUs could still be used with an APU as long as any additional necessary logic gets added to the IO die. I don't know if they'd need to design another communication link, though, or if they could just recycle the existing link that connects CPU chiplets to the IO die. The only real concern is bandwidth, but really they only need to match the available memory bandwidth. They also already have a high-speed link designed for connecting their GPUs that might be more suitable and could perhaps be repurposed, but I don't know.
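To put rough numbers on the "only need to match memory bandwidth" point, here's a back-of-the-envelope sketch where the DDR5 configuration and fabric clock are my assumptions, not anything AMD has stated:

```cpp
// Back-of-envelope: how many bytes per fabric clock must a GPU chiplet
// link move to keep up with dual-channel DDR5? All numbers are assumed.
#include <cstdio>

int main() {
    const double mem_bw = 2 * 8 * 5.2;  // 2 channels * 8 B * 5.2 GT/s = 83.2 GB/s
    const double fclk   = 1.8;          // assumed fabric clock, GHz
    std::printf("need ~%.0f B per fabric clock to match %.1f GB/s of DRAM\n",
                mem_bw / fclk, mem_bw);
    // ~46 B/clk, the same ballpark as the existing CCD link (commonly
    // described as 32 B read + 16 B write per fclk), so reusing that link
    // for an iGPU chiplet doesn't look bandwidth-limited.
}
```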

I think those standalone GPU units are too massive currently, and their partitioning will likely come in conservative half-steps. Nowhere near achieving a practical 50-100 mm² chiplet that could be used as a standalone graphics chiplet for an APU.

I agree that doesn't make sense. Either they make a monolithic APU or they make a special chiplet for graphics. It seems like we may still be another generation away from seeing chiplet-based GPUs though. There are some rumors about RDNA 3 having multi-die GPUs, but I think these may just be more akin to traditional multi-die setups, where two standalone monolithic dies on a single card/package are connected together.

Agreed. Perhaps not even the upcoming Zen 3 based Rembrandt, maybe the one after that - the Zen 4 based Phoenix (?)
 
Reactions: Tlh97

Doug S

Platinum Member
Feb 8, 2020
2,470
4,028
136
Apple moves very quickly away from producing last generation phone - and phones are by far their biggest volume product. Apple had a rapid transition from N7 to N5 a year ago.

I would venture to guess that right now, Apple does not have any iPhone SOCs on N7 being processed by TSMC.


When Apple introduces a new line of iPhones, they continue selling last year's model at a discount. For the last few years they've been keeping the low end of the two-year-old one on the price list as well. Plus there's the "SE", which is only updated every few years and stays on the same SoC for a while - the "SE2" uses the A13, which is N7P, and will be around for 2 or 3 years. When you see carrier deals for iPhones they are often last year's version, probably because they are cheaper for the carrier and Apple is perhaps willing to offer additional discounts on those that it won't on the latest and greatest, so they probably sell a lot more of those than you'd think.

I've seen estimates that only about half of Apple's iPhone sales are of the latest model, with the other half being the one- and two-generation-old models and the SE. Whether that's true we have no way of verifying, but the fact that analysts are guessing it probably means there's plenty of evidence that those older phones sell in much larger numbers than you seem to think.

Then there's the iPad (non-Pro), the iPad Mini, and the Apple TV (three generations now - they still ship the Apple TV HD, which uses the 20nm A8, alongside the old 4K model using the 10nm A10X). Don't forget the Apple Watch S-series SoCs, which they sell multiple years' versions of, and the HomePod, which I think is also A8, or at least was at first. And I'm not really sure what process the "W" chips in the AirPods are made on, but it is not going to be N5, simply based on when the various models were released.

None of those sell as many units as the iPhone, and some (i.e. Watch and AirPods) have smaller, less complex chips, but when you add all that stuff together that's a crapload of wafers on multiple processes older than N5. And that doesn't even count any ancillary chips that may be included in various products that don't get the "fanfare" of the A*, M*, S* and W* chips.
 
Last edited:

uzzi38

Platinum Member
Oct 16, 2019
2,698
6,393
146
The number of customers who want a 16-core Zen CPU with onboard graphics isn't terribly large. I'm not sure the added cost across an entire product line is worth what niche market segments they might be able to pick up or the small bit of extra convenience that the onboard graphics provides if a GPU goes bad and there isn't a spare to use.

OEMs want iGPs.

Laptops want iGPs.

Those are both way bigger target markets than DIY PCs, and both require Raphael. AMD needs an -S BGA/-H55 competitor for laptops, and they want to compete against Intel across the stack on the desktop as well, not be relegated just to gaming OEM boxes. Regular APUs just won't cut it for proper competition past a certain point in time.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Suppose chiplet 1 needs data that is in L3 of chiplet 2. The bandwidth limit and latency hit would make those accesses only marginally faster than single channel DRAM access.

Just a little clarification (I think we've discussed this before in another forum): The L3 is not shared between CCXs. That would create horrible contention for 64-core EPYC with 8 CCXs, and even more so for 2-socket 128-core systems. The states of the caches are only kept consistent within the rules of the x86 memory model, using a cache-coherency algorithm which is designed to do as little as possible — just enough to make it possible for all cores to agree on the state of memory.

Apart from any synchronisation needed by cache-coherency, an L3 miss goes straight to memory, as I understand it. Correct me if I am wrong.

Interestingly, the thing that kills performance and makes inter-CCX latency a bottleneck is high use of shared memory and locks. This puts the cache-coherency algorithm in overdrive with a lot of synchronisation between cores and CCXs. When cores work on separate memory only, or treat any shared data as read-only, the need for synchronisation goes away.
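You can provoke exactly that overdrive state on any multi-CCX part with false sharing. A minimal C++ sketch (timing left out for brevity; run each case separately and compare): two counters on the same cache line force the line to ping-pong between cores, while padding them onto separate lines makes the coherency traffic vanish.

```cpp
// Toy demonstration: two threads hammering the same cache line force
// constant coherency traffic; padding onto separate lines removes it.
#include <atomic>
#include <thread>

struct Shared   { std::atomic<long> a{0}, b{0}; };          // same 64 B line
struct Separate { alignas(64) std::atomic<long> a{0};
                  alignas(64) std::atomic<long> b{0}; };    // one line each

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (long i = 0; i < 100'000'000; ++i) s.a++; });
    std::thread t2([&] { for (long i = 0; i < 100'000'000; ++i) s.b++; });
    t1.join(); t2.join();
}

int main() {
    Shared s; Separate p;
    hammer(s);  // slow: the line ping-pongs between cores/CCXs
    hammer(p);  // fast: no shared line, so no coherency traffic at all
}
```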

PS. The non-sharing of L3 is also why increasing the L3 available to each CCX has such a big effect on performance, even in chips with multiple CCXs and high total L3, since that total isn't accessible to any single CCX. We saw this with the move from a 4-core to an 8-core CCX with a larger shared L3 cache. V-Cache multiplies this effect.
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
8,000
6,433
136
That is actually wrong. The biggest bottleneck right now is substrate. Intel said they used up a lot of their reserves in Q2, and they don't worry about losing any market share in Q3, because substrate is so depleted that in Q3 their competitor (AMD) is now going to face an identical constraint, unable to increase production.

So this is actually the perfect time to add V-Cache: it sidesteps the real bottleneck, which right now is substrate. Adding V-Cache makes the product more valuable without using any additional substrate.

You misunderstand. It doesn't matter if AMD has loads of spare wafers to make tons of these, because they're still limited by the slowest part of the process for manufacturing chiplets with V-cache. If they can only process 10 wafers' worth of chiplets per day through V-cache bonding, then producing more than 10 wafers per day of the cache just means it piles up waiting for the slowest production stage.

Do you think TSMC just magically has infinite machines to perform the 3D stacking and bonding process?
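The point in arithmetic form: end-to-end output is the minimum over the stage rates, so wafer starts beyond the bonding rate just become work-in-progress. The numbers below are made up for illustration:

```cpp
// Throughput of a pipeline is the minimum across its stages, so extra
// wafer starts beyond the bonding rate just pile up as WIP.
#include <algorithm>
#include <cstdio>

int main() {
    double wafer_starts = 50;  // wafers/day of V-cache chiplets (made-up)
    double bonding_rate = 10;  // wafers/day the stacking step handles (made-up)
    double output = std::min(wafer_starts, bonding_rate);
    std::printf("shippable output: %.0f wafers/day; %.0f/day piles up\n",
                output, wafer_starts - bonding_rate);
}
```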
 
Reactions: scineram

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I mean, trading the ST for MT has traditionally been a niche - look at ..., Bulldozer, etc.
There is no indication or design point showing that Bulldozer sacrificed ST for MT. The performance loss is instead mostly attributable to server/HPC optimizations: a big L2 (higher latency), a big front-end (no room for a µop or trace cache), a big FPU (with longer latencies to support server/HPC execution-unit demands), etc.

In fact, before CMT was selected for Bulldozer, the prior CMT implementation that was supposed to launch with K8 was significantly smaller and faster. It had a single retire unit and a single LSU for two cores, each of which had two 64-bit ALUs (FU0/1), two 64-bit integer MMX units (MM0/1), two 64-bit FPU SSE units (FP0/1), one load AGU (LDA), one store AGU (STA), and, in the shrunk version, a store-data unit (STD). It had significantly fewer OoO resources than Bulldozer, since CMT is meant to be implemented at low power, whereas full CMP and big-core SMT increase power complexity.

Going more in-depth on the confusion:

Bulldozer's architecture was originally referred to as a "Compute Core", with scalable partitioned execution resources as a key feature. What if this implied a capability similar to IBM's CLA/CLB architecture, with one SMT4+4 core or two SMT4 cores? Bulldozer could then be either one combined core or two clustered cores, with the combined core being the original intent for production models, and power draw scaling linearly with the partitioned slices.


Various modes mentioned in the 2007 filing:
-> instruction dispatch module can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like.
-> instruction dispatch module can be configured to dispatch integer instruction operations associated with the thread to both integer execution units 212 and 214 based on a predefined or opportunistic dispatch scheme.
-> The integer execution units 212 and 214 can be used to implement a run ahead scheme whereby the instruction dispatch module dispatches memory-access operations (e.g., load operations and store operations) to one integer execution unit while dispatching non-memory-access operations to the other integer execution unit.
-> Another example of a collaborative use of the integer execution units 202 and 204 is for an eager execution scheme whereby both results of a branch in an instruction sequence can be individually pursued by each integer instruction unit.
-> As yet another example, the integer execution units 212 and 214 can be used collaboratively to implement a reliable execution scheme for a single thread. In this instance, the same integer instruction operation is dispatched to both integer execution units 212 and 214 for execution and the results are compared by, for example, the thread retirement modules 226 of each integer execution unit.

The 32nm Bulldozer only had the first mode: one thread per partition. The 45nm Bulldozer design had the collaborative modes as well, which is why they claimed the "highest performing single-threaded (+ multi-threaded) compute core in history" back in 2007.
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I thought there was supposed to be a Monet in there somewhere on the value spectrum. GF12, quad, RDNA2, Zen3...
The value-proposition node at GlobalFoundries is 12FDX, not 12LP+. Cost per mm²:
22FDX = 1x
12LP+ = ~1.6x [2019 node]
12FDX = ~1.2x [2022 node]

So, I doubt "4c Zen3" plus "a couple RDNA2 WGPs" on 12LP+ is aimed at value.
 

Zucker2k

Golden Member
Feb 15, 2006
1,810
1,159
136
Some info:

Microsoft has published documentation for the Milan-X HBv3 VMs with the following performance projections, VM size details and technical overview:

  • Up to 80% higher performance for CFD workloads
  • Up to 60% higher performance for EDA RTL simulation workloads
  • Up to 50% higher performance for explicit finite element analysis workloads
  • Up to 120 AMD EPYC 7V73X CPU cores (EPYC with 3D V-cache, “Milan-X”)
  • Up to 96 MB L3 cache per core complex (3x larger than standard Milan CPUs, and 6x larger than "Rome" CPUs)
  • 350 GB/s DRAM bandwidth (STREAM TRIAD; see the sketch after this list), up to 1.8x amplification (~630 GB/s effective bandwidth)
  • 448 GB RAM
  • 200 Gbps HDR InfiniBand (SRIOV), Mellanox ConnectX-6 NIC with Adaptive Routing
  • 2 x 900 GB NVMe SSD (3.5 GB/s (reads) and 1.5 GB/s (writes) per SSD, large block IO)
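For anyone unfamiliar with the STREAM TRIAD figure cited above: it measures sustained memory bandwidth with a(i) = b(i) + q*c(i) over arrays far larger than cache. A minimal single-threaded sketch follows; the real benchmark is OpenMP-parallel and carefully tuned, so treat this as illustration only (compile with -O2):

```cpp
// Minimal STREAM-TRIAD-style bandwidth measurement. Arrays are sized to
// blow past any L3 (even V-cache) so DRAM bandwidth dominates.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 25;  // ~32M doubles ≈ 256 MB per array
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double q = 3.0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + q * c[i];
    double dt = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - t0).count();
    // TRIAD moves 3 arrays * 8 bytes per element (2 reads + 1 write).
    std::printf("TRIAD: %.1f GB/s\n", 3.0 * 8.0 * n / dt / 1e9);
}
```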

Oh yeah, Intel is in trouble alright. AMD is going for the jugular here, and it'll be interesting to see how Intel responds.

This is a giant stride in computing. Kudos to AMD for being bullish with the way they keep pushing chip development on x86. Simply stupendous!

Edit: @Markfw what don't you like about my post?
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
A theoretical 32C wouldn't compare as favorably against the 5950X, because the 5800X is pushed into a steeper part of the V/F curve, but it should still have roughly a 50% advantage over the 16C part.

Exactly. Abwx nailed it. And with modern Turbo, when only 16 cores are used it'll clock just as high as the 5950X, meaning it'll be faster in everything.

From the Mike Clark interview, it seems that AMD is going to keep some base amount of L3 on die, so that the chip can be sold without stacking. At least for the next 1-2 generations.

Latency doesn't have to be higher with V-cache. It could be arranged so the base cache gets the same latency and only the V-cache layer is higher latency.

Also, when it comes to costs, the tiny die itself might cost little, but the complexity of stacking is what raises costs.

Similar to how packaging costs dominate in sub-100mm² CPUs but matter less in larger-die CPUs. Things like packaging and stacking are fixed per-unit costs that are easier to absorb in expensive, larger configurations.
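A toy illustration of that fixed-cost point, with invented dollar figures: the same per-part packaging/stacking cost is a large share of a cheap die and a small share of an expensive one.

```cpp
// Fixed per-unit packaging/stacking cost vs. die cost. All dollar
// figures here are invented for illustration.
#include <cstdio>

int main() {
    const double pack_cost   = 15.0;            // fixed packaging/stacking cost per part
    const double die_costs[] = { 20.0, 120.0 }; // small chiplet vs. large die (assumed)
    for (double die : die_costs)
        std::printf("die $%.0f + packaging $%.0f -> packaging is %.0f%% of total\n",
                    die, pack_cost, 100.0 * pack_cost / (die + pack_cost));
    // $15 on a $20 die is ~43% of the cost; on a $120 die it's ~11%.
}
```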
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
You are just showing flip-chip orientation, which has been used for many generations on these types of chips, starting before FinFETs. The rest of your post seems to be numbers plucked out of the air, so I'm not sure what to make of them.

I'm still confused, though, because you say both are the orientation going forward, so I don't know which way you think AMD/Intel are packaging their chips.
It's Nosta just BSing about stuff he doesn't understand, like usual.
 
Reactions: scineram

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
It seems that AMD works on packaging -> core -> packaging -> core. Packaging seems to take more time.
As @DisEnchantment mentioned, parts of the packaging work are in-house. But where this gets interesting is process nodes and other technology not achievable in-house. There AMD can't do anything but wait for its partners to finish their work, case in point being Zen 3's V-Cache, which has been prepared since the Zen 3 launch but only became available later with Milan-X, and later still on the consumer market.

I wonder how AMD plans to handle such external dependencies over the long run. Ideally they wouldn't have a single linear roadmap but be able to launch node, core and packaging improvements as they become ready for HVM.

Evan Burness has gone on record saying Azure captures EPYC shipments before any other customer can get their hands on them.
Basically that is a strategy for them to preempt supplies before other customers can even get their hands on them.
Then customers go to Azure for competitive analysis, and Azure puts out the best offering because nobody else got the chips before them.
He was also bragging that they have not lost a single competitive analysis requested by any customer.
With such early access and day-and-date availability, Azure essentially turned EPYC into their own Graviton-style competitive advantage, at least temporarily, which is a big win for Microsoft in the cloud market.

But I'm not sure this is really a positive for AMD, which over the long run should be more interested in a customer base broader than this. Microsoft's commitment must be significant.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Sound reasoning, but it doesn't answer at all what's up with the extended WSA between GloFo and AMD. An additional $1.4 billion until 2024, when it was widely thought that the WSA would run out and not be renewed in 2021. Did the WSA turn into a poison pill for GloFo instead of AMD, securing 14/12nm FinFET production for AMD until 2024 while GloFo wanted to convert those fabs earlier?
The WSA, via pre-orders, has only secured 14nm/12nm FinFET production up to 2023, which is the year when GlobalFoundries will be increasing 45nm & 12nm FD production at Malta.

AMD's relationship with GloFo keeps it in the know, so any changes to GloFo's plans would be forwarded to AMD ahead of time. In this scenario, with all the information gathered, AMD would transfer EPYC/Ryzen/V&R-series away to TSMC, while reviving Opteron (cost-effective datacenter), Sempron (cost-effective personal computing) and the G-series (cost-effective embedded and other, via SCBU). This also includes GlobalFoundries getting the low-end GPU market again.

2022 = Fin production (1st half + 2nd half); AMD's FDSOI locks in development (2nd half)
2023 = Fin production goes down to zero (2nd half); AMD's FDSOI ramps to production (2nd half)
2024/2025 = Only FDSOI production via AMD at GF going forward.

Began 28nm production at ~$6000 USD (2014) => enters 22nm production at ~$2000 USD (2024).
With 14nm/12nm FinFET stuck at ~$4000 USD (2022+) for AMD, and the price hike to others pushing it to Samsung's/TSMC's near-$5500 USD levels.

Higher capacity and lower costs for AMD, and fewer masks and process steps for GloFo => shorter lead times and higher profit margins for both with FDX.

TSMC handles premium and higher ASP => better deals and more funding for enhanced FinFETs
GloFo handles cost-sensitive and lower ASP => more stable manufacturing capacity for FDSOI

The lower ASP follows from the lower price point versus the introduction of 28nm/14nm at $6000 (2014) / $8000 (2017) USD. The same area on a mature insert means a lower price point for performance similar to 14nm, lower power than 14nm, reduced cost of development (AMD & SCBU), reduced TTM, etc.

TSMC gained the LP-patent set from GlobalFoundries anyway:
The companies have agreed to a broad life-of-patents cross-license to each other’s worldwide existing semiconductor patents as well as those patents that will be filed during the next ten years as both companies continue to invest significantly in semiconductor research and development.

The 12LP/7LP/5LP/3LP plans that AMD was aware of are better implemented by TSMC: money, multiple node technicians, multiple fab modules, known multi-continent 16nm/12nm/5nm offshore fabs, etc.
The above specifically: "Technology research covering 14nm, 10nm, 7nm, 5nm, 3nm CMOS FinFET technologies for mobile SoC/ASIC" <== left GF in 2017.
GlobalFoundries has yet to implement SSRW high-mobility FinFETs, per the 2016 IEEE papers for 7nm/5nm FinFETs and the 2019 IEEE papers for 14nm/12nm FinFETs, which AMD might get if they switch to TSMC rather than stay at GloFo. In that case, AMD is likely to get a semi-custom customer that wants to add networking accelerators to a Zen++ SoC at TSMC, or whatever; Zen++ would be the cheapest option among Zen architectures at TSMC.

A move from GloFo to TSMC is more secure given Zen's market and the largely unsuccessful GlobalFoundries SCBU (if it had been at TSMC, it wouldn't have failed).
GlobalFoundries FinFET = ghost town: one module that was exclusive to FinFETs and is now transitioning to FDSOI.
TSMC FinFET = growing metropolis: several modules, including the Japan and Dresden FinFET fabs (and the Nanjing FinFET fab), with the extension of the Arizona FinFET fab.

There is also a better track record of supporting custom Zen core implementations at TSMC.

Zen & Dhyana at GlobalFoundries: demand for Zen there is pretty much in decay (preference for TSMC killing Zen-GloFo), and Dhyana is dead.
Zen 2, Zen 2-Sony, Zen 3, Zen 4, Zen 4c... why not add more on 16nm/12nm?

On the IOD argument: TSMC is more experienced with I/O FETs, so naturally TSMC's IOD will be superior to GlobalFoundries'.

Basically, everyone is abandoning 14LPP/12LP/12LP+ for TSMC 16FF/12FF or Samsung 14LPP/14LPU/11LPU. However, some customers are simply doing 84CPP-7.5T on 12nm-Fin to move to 84CPP-7.5T on 12nm-FDX; the FDXcelerator fast track. This gives more credence to GlobalFoundries swiftly down-ramping (killing FinFETs) once a replacement node pops up. By the way, 12FDX is definitely doing some form of risk production this year.

Fab8 Device Director, 14LPP/12LP/12LP+/12FDX/45RF - successful technology deliverables: 45RF | 12FDSOI | 12LP FinFET | silicon photonics
Malta 12FDX integration for 1 year 2 months (as of edit: Jan 2021-present)
Malta 12FDX FEOL/MOL process optimization for 2 years 8 months (as of edit: August 2019-present)
Malta/Essex Junction ESD/latchup development lead for 12LP, 12FDX, 22FDX, and 28SLPe (as of edit: June 2021-present for 12FDX)

Cut extremely short: NPI/NTO (anything new) goes to TSMC if it's Zen-related; already apparent with RDNAx/CDNAx being TSMC-exclusive as well.

AMD/GlobalFoundries legacy/obsolete big cores/big CUs, etc. => these get moved to TSMC, with a more aggressive SCBU at TSMC in preparation.
2023+ =>
AMD/GlobalFoundries new/modern small cores/small CUs, etc. Separation of node and fab between little cores (low cost) and big cores (premium cost), which reflects GlobalFoundries' strategy of prioritizing low-cost pervasive semiconductors.
 
Last edited:
Reactions: amd6502

eek2121

Diamond Member
Aug 2, 2005
3,042
4,259
136
It actually does beat the 8C Zen 3 in said compilation tests in Phoronix's testing:

View attachment 57517


The lowly 12400, with 4.4 GHz turbo clocks and 18 MB of L3, has no trouble keeping up with AMD's premier 8C CPU with 32 MB of L3.

People keep conveniently forgetting that the ADL big core is a 5-ALU (each LEA-capable), 512-ROB, mostly 6-wide monster. It is currently let down by a mobile-phone-worthy uncore and memory controller, but expect it to scale really well in the future with faster memory.

The only "complex" benchmark I need is Web Speedometer 2.0 (the thread is on this very forum). Since I own a 10900K, 5950X and 12900K, the numbers with highly tuned 3800-3900 DDR4 are ~210, ~255 and ~330 respectively. That's how much faster and smoother ADL is. I believe the 12900K does around 300 at stock as well. ~25% faster in my eyes.

It ran at 5.6 GHz, not 4.4 GHz? "Processor: Intel Core i5-12400 @ 5.60GHz"

EDIT: Just saw mods post. Not commenting further in this thread.
 

Tarkin77

Member
Mar 10, 2018
75
163
106
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- Each Zen 4c CCD splits its (16) cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core; see the arithmetic sketch after this list).
- Given that there's two CCXs on each Bergamo CCD, it is likely that there is a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD has likely figured out how to connect (12) memory channels to (8) CCDs, given that Milan already handles this kind of channel/CCD mismatch fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread-vs-thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.
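Spelling out the cache arithmetic in that summary (all figures are the rumored ones above, nothing confirmed):

```cpp
// Rumored Bergamo cache arithmetic, taken from the summary above.
#include <cstdio>

int main() {
    const int ccds = 8, cores_per_ccd = 16, ccx_per_ccd = 2;
    const int l3_per_ccd_mb = 32;
    std::printf("%d cores total\n", ccds * cores_per_ccd);               // 128
    std::printf("%d MB L3 per CCX, shared by %d cores\n",
                l3_per_ccd_mb / ccx_per_ccd,
                cores_per_ccd / ccx_per_ccd);                            // 16 MB / 8 cores
    std::printf("%d MB L3 per core (vs 4 MB on a 32 MB, 8-core Zen 3 CCX)\n",
                l3_per_ccd_mb / cores_per_ccd);                          // 2 MB
}
```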

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.

You realise that this information is behind the paywall? That's not fair to Charlie.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,794
4,075
136
I wouldn't so much want slower chips scabbed on. That's really a bit pointless.

I'd want a super-duper single core for gaming, with an 8-core on the second chiplet for other stuff. If you had a whole chiplet for one super-duper core, how badass could you make it?

That sounds an awful lot like a PS3, which was regarded as rather difficult to write code for.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,860
3,405
136
Yes, you are wrong. How do you propose to interconnect your cores and handle cache coherency? If you stack cache, what are you stacking it on top of?

It seems to me people weight the cost of the silicon itself way too highly, and thus come up with crazy ideas. I think even people like Ian get it wrong when considering price vs. cost for Zen 3 vs. Zen 3D. There is a lot more to product cost than just the die itself.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
For those unaware, OMI stands for Open Memory Interface:

| Specification | LRDIMM DDR4 | DDR5 | HBM2E (8-High) | OMI |
|---|---|---|---|---|
| Protocol | Parallel | Parallel | Parallel | Serial |
| Signalling | Single-Ended | Single-Ended | Single-Ended | Differential |
| I/O Type | Duplex | Duplex | Simplex | Simplex |
| Paths/Channel (Read/Write) | 64 | 32 | 512R/512W | 8R/8W |
| Data Transfer Rate | 3,200 MT/s | 6,400 MT/s | 3,200 MT/s | 32,000 MT/s |
| Channel Bandwidth (R+W) | 25.6 GB/s | 25.6 GB/s | 400 GB/s | 64 GB/s |
| Latency | 41.5 ns | 60.4 ns | n/a | 45.5 ns |
| Channels / Processor Die | 8 (EPYC Rome IO) | n/a | 5 (Nvidia Ampere) | 16 (POWER10) |
| Processor Die Size | 416 mm² | n/a | 826 mm² | 602 mm² |
| Driver Area / Channel | 7.8 mm² | 3.9 mm² | 11.4 mm² | 2.2 mm² |
| Bandwidth / mm² | 3.3 GB/s/mm² | 6.6 GB/s/mm² | 35 GB/s/mm² | 29.6 GB/s/mm² |
| Max Capacity / Channel | 64 GB | 256 GB | 16 GB | 256 GB |
| Connection | Multi-Drop | Multi-Drop | Point-to-Point | Point-to-Point |
| Data Resilience | Parity | Parity | Parity | CRC |
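As a sanity check on the table, the Bandwidth/mm² row is just Channel Bandwidth divided by Driver Area per channel:

```cpp
// Deriving the table's Bandwidth/mm2 row from its other two rows.
#include <cstdio>

int main() {
    struct { const char* name; double bw_gbs, area_mm2; } rows[] = {
        { "LRDIMM DDR4", 25.6,  7.8 },
        { "DDR5",        25.6,  3.9 },
        { "HBM2E",      400.0, 11.4 },
        { "OMI",         64.0,  2.2 },
    };
    for (auto& r : rows)
        std::printf("%-12s %.1f GB/s/mm2\n", r.name, r.bw_gbs / r.area_mm2);
    // -> 3.3, 6.6, 35.1, 29.1: matches the table within rounding (the OMI
    // cell reads 29.6, presumably computed from slightly different inputs).
}
```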

 

Det0x

Golden Member
Sep 11, 2014
1,053
3,075
136
The statement is from AT; they used 7-Zip as the metric for INT-based computation. Besides, we are talking about ST perf as a way to extract IPC without saturating memory bandwidth or being too limited by R/W latencies.

Tuned memory is certainly a great factor for improvement, but anything out of spec can't be considered guaranteed error-free by the manufacturer.

I did a test on Zen 3 as well, and with tuned memory @ 4.4 GHz it scores 7950 MIPS (vs. 6800 in the article @ 4.9 GHz, I believe), so it is another confirmation that the 7-Zip compression algorithm scales too well with memory to be a proper measurement of ALU throughput. Who knows where Zen 3 or ADL peak?
9079 MIPS with a 5800X3D @ 4560 MHz and memory at 1900:3800

8937 MIPS with a 5800X3D @ 4560 MHz and memory at 1800:3600

8834 MIPS with a 5800X3D @ 4560 MHz and memory at 1600:3200
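Quick scaling math on those three runs, which shows how little the X3D cares about memory clocks:

```cpp
// Memory clock vs. 7-Zip MIPS scaling, using the three results above.
#include <cstdio>

int main() {
    const double mclk[] = { 3200, 3600, 3800 };
    const double mips[] = { 8834, 8937, 9079 };
    for (int i = 1; i < 3; ++i)
        std::printf("+%.1f%% memory -> +%.1f%% MIPS\n",
                    100 * (mclk[i] / mclk[0] - 1),
                    100 * (mips[i] / mips[0] - 1));
    // +18.8% memory clock buys only ~2.8% MIPS here: with V-cache the
    // working set mostly fits in L3, so memory scaling is far weaker
    // than on the vanilla parts debated above.
}
```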


*edit*
Cleaned up post and screenshots
 

Last edited: