Speculation: Ryzen 4000 series/Zen 3


NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
There's a ton of bandwidth on your graphics card, and its clocks are likely less than half of your CPU's.
An APU has limited die area with a monolithic design. 8 cores and 8 compute units are still more than enough, so really it is also about lifting the lowest clock rates within the same TDP.

Desktop graphics cards have separate RAM, so they aren't relevant here; their frequency development is independent of the CPU side.

7nm -> 5nm -> 3nm and onwards actually favors an EPI and GHz push. Density goes up, allowing shorter wires, and higher mobility/drive current means higher clocks. However, adding ALUs has a bigger cost than on previous nodes:

6 ALUs = 2x slower
8 ALUs = 4x slower
rather than, on previous (larger than 7nm) nodes:
6 ALUs = 1.5x slower
8 ALUs = 2x slower

This is the wire-resistance effect that FinFETs and stacked sheet/ribbon devices suffer from post-7nm. There are also only limited performance gains left in the transistors themselves going forward, so scaling is ultimately driven by the wires: the fewer wire interconnections in a monolithic structure, the better.

Renoir -> Forward
Should opt to increase clock rates => 1.8 GHz to at most 3 GHz (sustained), 4.2 GHz to at most 5.2 GHz (light load) by 2025. (8-core)
The above applies to EPYC as well => 2 GHz to at most 3.1 GHz (sustained), 3.35 GHz to at most 4.4 GHz (light load) by 2025. (64-core)

Essentially, AMD should aim for an improvement similar to or greater than the one from Llano to Bristol Ridge:
A8-3500M => 1.5 GHz/2.4 GHz/355.2 GPU GFlops at 35W in mid-2011
FX-9800P => 2.7 GHz/3.6 GHz/921.6 GPU GFlops at 15W in mid-2016

The A8-3500M is the Renoir analogue here, with the FX-9800P as the end goal. If they do the same with Family 17h/19h, it will be a worthwhile half-decade.
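Those GPU GFLOPS figures check out as shaders x clock x 2 (an FMA counting as two FLOPs). A quick sanity check, assuming the commonly cited shader counts and GPU clocks for those two APUs (400 shaders at 444 MHz and 512 shaders at 900 MHz; those specs are not stated in the post itself):

```python
# Sanity-check the quoted GPU GFLOPS numbers.
# GFLOPS = shader count * clock (GHz) * 2, since one FMA = 2 FLOPs.
def gpu_gflops(shaders, clock_ghz):
    return shaders * clock_ghz * 2

# A8-3500M iGPU: assumed 400 shaders at 444 MHz
print(gpu_gflops(400, 0.444))  # ~355.2, matching the quoted figure
# FX-9800P iGPU: assumed 512 shaders at 900 MHz
print(gpu_gflops(512, 0.9))    # ~921.6, matching the quoted figure
```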
 
Last edited:
Reactions: amd6502

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
Isn't that the issue? That the software can't tell the difference between logical and physical cores and thus charges the same for them? The licensing is *per core*, regardless of how it shows up?
That was exactly my point: the distinction between per-thread and per-core would be lost on plenty of software, leading to per-logical-core billing in both cases. There's admittedly a movement toward per-physical-core licensing, but that's both recent and not as common as it should be.
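The ambiguity is easy to demonstrate: the portable stdlib call only reports *logical* CPUs, so software that naively counts "cores" this way counts SMT siblings too. A minimal sketch (psutil, mentioned in the comment as the usual way to get physical cores, is a hypothetical extra dependency and not used here):

```python
# Why "per core" licensing is ambiguous: the portable stdlib call
# reports logical CPUs (hardware threads), not physical cores.
import os

logical = os.cpu_count()  # physical cores x SMT threads per core
print(f"logical CPUs seen by software: {logical}")
# With 2-way SMT enabled, a license charged per 'core' read this way
# bills twice the physical core count. Getting the physical count
# portably needs an extra dependency such as psutil.
```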

And yeah, that leads straight to @LightningZ71 use case.
 

jamescox

Senior member
Nov 11, 2009
640
1,104
136
Of course Graviton2 has lower IPC than Zen2; it's based on the weak A76 from 2018. But it has a core area of 1.4mm², which allows twice as many cores as Zen2 (3.6mm²). Wait for the 128-core Graviton3 based on the A78 (30% more IPC with ~5% fewer transistors). And pray that they don't use Cortex X1 cores (60% higher IPC, 40% more than Zen2, at 2.1mm² area). How about that. Still looking strong, x86?

@LightningZ71 made this claim. He is silent because he cannot prove it. Since you work with this stuff, I'd like to see numbers from your company: just give us the number of machines and how many of them run with SMT off.

We could have had an 8xALU, 4xAGU, 4xFPU, SMT4 CPU core in 2003; what a shame. If Zen3 isn't Keller's EV8 resurrection then AMD is in deep, deep trouble. I hope Zen3 is at least 6xALU, 3xAGU, 4xFPU with SMT4.

AMD needs to bring more tech features and move forward. However, we know they can also go backward, as with Bulldozer. So who knows :/

This stuff with the ALU count is ridiculous. If the number of ALU units were a bottleneck, they would have been increased quite a while ago. Integer ALU units are tiny; the scheduling hardware to keep them busy may not be, but the engineers working on these chips have register-transfer-level simulators to explore such design choices and determine what the bottlenecks are. Some enthusiast saying "Well, there's your problem, this one has 6 and yours only has 3" isn't going to change the reality that it isn't the bottleneck.

Also, comparing ALU counts across ISAs is not valid. It may not even be valid across different microarchitectures, since some of them may split or combine instructions in different ways. ARM is more RISC-like, so we would expect it to need to execute more, simpler instructions to accomplish the same compute. RISC and CISC are mostly obsolete terms at this point though. None of the current architectures are very RISC-like anymore; ARM has huge numbers of very specialized instructions. The original idea behind RISC was to use a much-reduced instruction set to enable higher clocks and things like better pipelining and out-of-order execution. This was probably why Alpha processors were at 500 MHz while the Pentium Pro was at 200. About the only thing that has survived is the use of fixed-width instructions and other things that simplify decoding, vs. more CISC-like architectures that still have very complex decode.

The current bottlenecks probably favor some more complex instructions, since things are so memory bound. Complex instructions can act as a kind of instruction compression, so they take less cache space. AMD64 still suffers from higher instruction-decode overhead, which is why they need something like the trace cache to save decoded instructions; that kind of takes the place of the regular instruction cache to some extent. AMD64 has probably also changed a lot over time, since instructions that did not perform well would tend to get deprecated. They may still be available via microcode for compatibility, but modern compilers generally wouldn't use them. IMO, the distinction between RISC and CISC just doesn't really exist anymore.

As for why Apple processors perform so well, I don't think it has anything to do with the ALU count. Current processors are incredibly dominated by cache performance. In fact, given the die area devoted to cache, you are almost buying more of a memory chip than a processing chip. I have profiled applications that were essentially compute bound and they often still only achieved an IPC near 1. The execution core can execute ridiculous numbers of instructions at 3 to 4 GHz with out-of-order, superscalar, speculative execution, so it almost always comes down to getting the data to the core. Apple still has a relatively low core count (was it 2 high-performance cores and 4 low-power cores?) and a very large shared L2 cache, not L3. Some applications do very well with large, low-latency L2 caches. It probably works exceptionally well for the small-memory-footprint applications that normally run on iPhones and iPads. This is also probably why some of the older core2quad processors still perform very well. I remember some of the old core2quad processors being listed as good enough for compute-intensive VR games early on. People were surprised, but some models had 4 to 6 MB L2 caches; those were very expensive extreme edition parts at the time. Having a good cache design (including things like prefetch) is the most important part of modern CPU design. It is also a big source of the improvements in Zen vs. previous architectures.
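That "compute-bound code still shows IPC near 1" observation falls out of simple roofline arithmetic: the core's peak throughput is far above what the memory system can feed it. A back-of-envelope sketch, where every number is an illustrative assumption rather than a measurement:

```python
# Back-of-envelope roofline: why cores starve waiting on memory.
# All figures below are illustrative assumptions, not measurements.
clock_ghz = 4.0
flops_per_cycle = 16      # e.g. 2x 256-bit FMA units on FP64: 2*4*2
peak_gflops = clock_ghz * flops_per_cycle   # per-core compute ceiling

mem_bw_gbs = 25.0         # assumed DRAM bandwidth seen by one core
intensity = 0.125         # assumed FLOPs per byte for a streaming kernel
mem_bound_gflops = mem_bw_gbs * intensity   # memory-side ceiling

achieved = min(peak_gflops, mem_bound_gflops)
print(f"peak {peak_gflops} GFLOP/s, memory-bound {mem_bound_gflops} GFLOP/s")
# The core could do 64 GFLOP/s but memory feeds ~3, so profilers
# report low IPC even on code that looks compute bound.
```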

The larger core count and small L2 + large L3 is probably necessary for good performance across a wide spectrum of applications, from mobile to server. It isn’t going to be the best at both though. It will be interesting to see what Apple does with making laptop and desktop chips. They don’t need to support server applications though, since they don’t make servers. They may end up with something more like Zen with core clusters for something like the Mac Pro though.

As for SMT, it probably isn't going away unless they decide to include a bunch of tiny, stripped-down, low-power cores instead. A lot of server applications have no use for all of the FP units taking up huge amounts of die space. Such applications will often run just as well on a tiny low-power core, since they are generally not very cacheable either; they just spend most of their time waiting on memory. Such applications are throughput oriented, and we have had architectures specifically designed to run them with a lot of low-power cores or a lot of hardware threads. I don't see why AMD would remove SMT, since it can be shut off if you don't want it. Also, there is reason to eventually support even higher thread counts for such throughput applications.

For the FP improvements, I am thinking that they will add at least one more 256-bit FMA unit. I don't know quite how the current FP units are architected. I have seen some diagrams that show 2 FMA and 2 FADD units, with one of the FADD units sharing its input ports with the 2 FMA units. Can it actually do 2 FMA and one FADD per clock? If anyone has a link to more detailed info, it would be appreciated. The FMA units only need 2 operands when doing a multiply, but they need 3 operands for an FMA op. Doubling up the FMA units wouldn't really fit with the 50% number; going up to 3 units would. I am not sure how they would arrange the ports, but they probably would not need to increase them that significantly. I am kind of hoping that they support AVX512 instructions across 2 clocks, as they did with 256-bit instructions on 128-bit units. Some of the AVX512 instructions may be needed to compete with Intel independently of the width of the vector. I don't think there was really that much need to increase the vector width; there shouldn't be much difference between 2x256 vs. 1x512, and three units is more flexible.
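The 2x256 vs. 1x512 equivalence and the +50% from a third unit are just lane arithmetic. A small sketch (the unit layouts here are assumptions for illustration, nothing confirmed about Zen3):

```python
# Per-cycle FP32 FLOP arithmetic for hypothetical FMA-unit layouts.
# An FMA counts as 2 FLOPs per SIMD lane.
def flops_per_cycle(units, vector_bits, elem_bits=32):
    lanes = vector_bits // elem_bits
    return units * lanes * 2

two_x_256 = flops_per_cycle(2, 256)    # Zen2-style: 2 units x 8 lanes
one_x_512 = flops_per_cycle(1, 512)    # one wider unit, same throughput
three_x_256 = flops_per_cycle(3, 256)  # the +50% third-unit option
print(two_x_256, one_x_512, three_x_256)  # 32 32 48
```

This is why widening the vector alone buys nothing per cycle, while a third 256-bit unit lines up with the rumored 50% figure.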
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
As for why Apple processors perform so well, I don't think it has anything to do with the ALU count. Current processors are incredibly dominated by cache performance. In fact, given the die area devoted to cache, you are almost buying more of a memory chip than a processing chip. I have profiled applications that were essentially compute bound and they often still only achieved an IPC near 1. The execution core can execute ridiculous numbers of instructions at 3 to 4 GHz with out-of-order, superscalar, speculative execution, so it almost always comes down to getting the data to the core. Apple still has a relatively low core count (was it 2 high-performance cores and 4 low-power cores?) and a very large shared L2 cache, not L3. Some applications do very well with large, low-latency L2 caches. It probably works exceptionally well for the small-memory-footprint applications that normally run on iPhones and iPads. This is also probably why some of the older core2quad processors still perform very well. I remember some of the old core2quad processors being listed as good enough for compute-intensive VR games early on. People were surprised, but some models had 4 to 6 MB L2 caches; those were very expensive extreme edition parts at the time. Having a good cache design (including things like prefetch) is the most important part of modern CPU design. It is also a big source of the improvements in Zen vs. previous architectures.

The larger core count and small L2 + large L3 is probably necessary for good performance across a wide spectrum of applications, from mobile to server. It isn’t going to be the best at both though. It will be interesting to see what Apple does with making laptop and desktop chips. They don’t need to support server applications though, since they don’t make servers. They may end up with something more like Zen with core clusters for something like the Mac Pro though.

I made similar arguments in this post in another thread that focuses on x86 vs. ARM, though not nearly as detailed or as eloquent.

Anyway, Richie always tries to steer threads toward ARM vs x86, even in this thread, which is supposed to focus on Zen 3.

I say we don't let him sidetrack this thread with any more ARM nonsense. There are several other threads for that if he wants to debate ARM vs x86.
 

Valantar

Golden Member
Aug 26, 2014
1,792
508
136
This is also probably why some of the older core2quad processors still perform very well. (...) People were surprised, but some models had 4 to 6 MB L2 caches
An otherwise great, well-written, clear and cohesive post, but your memory is off here - my Q9450 had 12MB of L2! While that wasn't an early C2Q, it wasn't an extreme edition either. It kept up admirably until I replaced it in 2017, after nine years of service.
 
Reactions: Mopetar

Richie Rich

Senior member
Jul 28, 2019
470
229
76
This stuff with the ALU count is ridiculous. If the number of ALU units were a bottleneck, they would have been increased quite a while ago.
You have to be kidding me, or you lived on some lonely island for the last decade. Intel's engineers have been resurrecting Skylake for 5 years in a row, basically bottling up development. AMD did go backwards with the horrible Bulldozer, lost the entire server market, and almost went bankrupt. Feel free to explain to me why Bulldozer was such garbage with 2 ALUs and why Intel did well with 4 ALUs in Haswell (twice the IPC of BD). Of course it has nothing to do with the number of ALUs, right?

Do you think Zen3 will have 1xALU and still tear apart every uarch in IPC, including 6xALU Apple?

My humble opinion is that computational performance comes from computation units like ALUs, AGUs, and FPUs. If you buy an 8-bit microcontroller, it consists of 1 ALU and no cache, and yet it does the computation. I'm afraid that if you buy a chip with cache memory alone, you won't be able to do any computation. Feel free to prove me wrong.


As for why Apple processors perform so well, I don’t think it has anything to do with the ALU count. Current processors are incredibly dominated by cache performance.
Of course, another Intel garbage messenger. Intel planted into the heads of a whole generation of people the idea that CPU IPC hit a hard wall and can only be increased a little, maybe with some cache tuning. It's so sad to see some people still believe this Intel BS. Intel did that because it was lazy and was earning big money while putting no effort into CPU development.

The whole 82% IPC advantage of the Apple core is just a coincidence with its 6 ALUs. The poor performance of the 2xALU Bulldozer was also a coincidence with its low number of ALUs. I wonder why AMD didn't go back to the 3xALU K10 design and instead went directly to a 4xALU design similar to Haswell. Coincidence again, I guess.


For the FP improvements, I am thinking that they will add at least one more 256-bit FMA unit.
Funny that you suggest IPC isn't dependent on the number of ALUs, but at the same time you ask for more FPUs. Did you realize that FPUs do the same thing as ALUs, just with a different format?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Feel free to explain to me why Bulldozer was such garbage with 2 ALUs and why Intel did well with 4 ALUs in Haswell (twice the IPC of BD). Of course it has nothing to do with the number of ALUs, right?

The poor performance of the 2xALU Bulldozer was also a coincidence with its low number of ALUs. I wonder why AMD didn't go back to the 3xALU K10 design and instead went directly to a 4xALU design similar to Haswell. Coincidence again, I guess.
A Bulldozer core's 2 ALUs and 2 AGUs were more efficient than a Greyhound (K10) core's 3 ALUs or 3 AGUs. The Bulldozer architecture as a whole, per module, has 4 ALUs, 4 AGUs, and 4 FPU pipes. So it isn't really surprising for a smaller core on a more advanced node to have four ALUs and four FPUs: since Bulldozer already had four ALUs and four FPUs per module on 32nm, it isn't a far push for Zen on 14nm to also have four ALUs and four FPUs in less area.

The only case of AMD increasing ALU counts is the total accessible within a CMT module, as the CMT architecture can be reassembled into an SMT architecture with ease.
 
Last edited:

amd6502

Senior member
Apr 21, 2017
971
360
136
A Bulldozer core's 2 ALUs and 2 AGUs were more efficient than a Greyhound (K10) core's 3 ALUs or 3 AGUs. The Bulldozer architecture as a whole, per module, has 4 ALUs, 4 AGUs, and 4 FPU pipes. So it isn't really surprising for a smaller core on a more advanced node to have four ALUs and four FPUs: since Bulldozer already had four ALUs and four FPUs per module on 32nm, it isn't a far push for Zen on 14nm to also have four ALUs and four FPUs in less area.

The only case of AMD increasing ALU counts is the total accessible within a CMT module, as the CMT architecture can be reassembled into an SMT architecture with ease.

2 ALUs per thread seems to be something of an efficiency sweet spot. Zen2's 4 ALU + 3 AGU is really quite optimal for MT efficiency, and great for single-thread IPC.

Despite limiting themselves to 2 ALUs, BD/PD didn't quite get the desired efficiency: power-hungry caches and 32nm silicon could not keep up with 14nm FinFET. Excavator fixed that dramatically (despite the oversized XV/SR front end), and ported to 7 or 5nm (or even 12FDX) with a PD front end it would probably be quite impressive for energy efficiency.

The dozer 2 ALU + 2 AGU proportions seem far from ideal, but at least the later generations were more often able to substitute one of the AGUs for simple ALU operations.

To me it seems Zen3 could still benefit quite well from a 5th ALU. Such a 5+3 configuration would still be in the MT efficiency sweet zone, and would bring nice MT IPC gains plus a slight ST IPC gain.
 
Reactions: Tlh97

soresu

Platinum Member
Dec 19, 2014
2,921
2,141
136
I'm actually starting to get a little concerned about Zen3 - we knew a hell of a lot more about Zen1 and Zen2 more than 6 months before release, yet apart from the L3 cache unification of the CCD we know very little about it.

I get that AMD can be tight-lipped, but they are not normally this tight-lipped when the news is overwhelmingly good.

Navi/RDNA1 was similarly well under wraps until very close to release, and it underwhelmed me somewhat given its process-node advantage over competing Nvidia products, despite otherwise competitive perf/watt.
 
Reactions: Tlh97 and tamz_msc

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
I'm actually starting to get a little concerned about Zen3 - we knew a hell of a lot more about Zen1 and Zen2 more than 6 months before release, yet apart from the L3 cache unification of the CCD we know very little about it.

I get that AMD can be tight-lipped, but they are not normally this tight-lipped when the news is overwhelmingly good.

Navi/RDNA1 was similarly well under wraps until very close to release, and it underwhelmed me somewhat given its process-node advantage over competing Nvidia products, despite otherwise competitive perf/watt.
I have a feeling that "Zen 3 in late 2020" means AMD will give a very high level overview of Zen 3 at some press event and maybe formally announce that Milan is sampling to hyperscalers.

I would love to be proved wrong though.
 

yuri69

Senior member
Jul 16, 2013
427
711
136
The Zen3 architecture from AMD simply has to appear in multiple places before it actually launches. It hasn't yet.

To name a few:
* open-source compiler tooling GCC/LLVM support - this usually happens 6+ months prior to launch; no "znver3" has appeared yet
* alpha/beta BIOSes for current mobos - for Ryzen 3000 this happened about 6 months prior to launch; there has not been a single mention of Family 19h support yet
* ES OPN/spec leaks - for Rome these came about 8 months prior to launch; no OPN, and only a single spec leak
* ES benchmark leaks - for Rome, 8 months; no leak yet

You can't prevent pre-release BIOSes from leaking since they are meant to be tested. Open-source compilers kind of need to have the support early. Late ES chips are needed for third-party validation, so they are naturally circulating, etc.

So it's not a case of "AMD is so tight-lipped they will totally release tomorrow"...
 

soresu

Platinum Member
Dec 19, 2014
2,921
2,141
136
The Zen3 architecture from AMD simply has to appear in multiple places before it actually launches. It hasn't yet.

To name a few:
* open-source compiler tooling GCC/LLVM support - this usually happens 6+ months prior to launch; no "znver3" has appeared yet
* alpha/beta BIOSes for current mobos - for Ryzen 3000 this happened about 6 months prior to launch; there has not been a single mention of Family 19h support yet
* ES OPN/spec leaks - for Rome these came about 8 months prior to launch; no OPN, and only a single spec leak
* ES benchmark leaks - for Rome, 8 months; no leak yet

You can't prevent pre-release BIOSes from leaking since they are meant to be tested. Open-source compilers kind of need to have the support early. Late ES chips are needed for third-party validation, so they are naturally circulating, etc.

So it's not a case of "AMD is so tight-lipped they will totally release tomorrow"...
Indeed, the lack of all this is very strange.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
I'm actually starting to get a little concerned about Zen3 - we knew a hell of a lot more about Zen1 and Zen2 more than 6 months before release, yet apart from the L3 cache unification of the CCD we know very little about it.

I get that AMD can be tight-lipped, but they are not normally this tight-lipped when the news is overwhelmingly good.

Navi/RDNA1 was similarly well under wraps until very close to release, and it underwhelmed me somewhat given its process-node advantage over competing Nvidia products, despite otherwise competitive perf/watt.
L3 cache unification is a minor change, otherwise AMD wouldn't have disclosed it.

More interesting are the leaked bits:
  • new 19h family - suggesting big microarchitectural changes
  • samples with disabled SMT
  • the recurring rumor about SMT4
  • a much bigger microcode blob leaked via Linux
All this suggests that Zen3 is a big step forward. It needs to be, because Zen2 is weak and struggles against Comet Lake, which is 5-year-old Skylake. It needs to be something wider, like 6xALU or 8xALU with SMT4 (Keller's EV8).




LOL, the SMT4 rumour is indeed alive... from the latest Momomo_us leak.

Regarding those old leaked roadmaps with Genoa in the definition phase:
  • old GENOA = Zen3 + new IO die (DDR5, PCIe5)
  • now GENOA = Zen4 + new IO die (DDR5, PCIe5)

=> Zen4 = Zen3 shrunk to 5nm with some minor changes in uarch.
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,853
3,399
136
The Zen3 architecture from AMD simply has to appear in multiple places before it actually launches. It hasn't yet.

To name a few:
* open-source compiler tooling GCC/LLVM support - this usually happens 6+ months prior to launch; no "znver3" has appeared yet
* alpha/beta BIOSes for current mobos - for Ryzen 3000 this happened about 6 months prior to launch; there has not been a single mention of Family 19h support yet
* ES OPN/spec leaks - for Rome these came about 8 months prior to launch; no OPN, and only a single spec leak
* ES benchmark leaks - for Rome, 8 months; no leak yet

You can't prevent pre-release BIOSes from leaking since they are meant to be tested. Open-source compilers kind of need to have the support early. Late ES chips are needed for third-party validation, so they are naturally circulating, etc.

So it's not a case of "AMD is so tight-lipped they will totally release tomorrow"...
Three of your points are basically the same thing.
If Zen3 brings no new instructions then there don't need to be any new compiler optimisations.
There could be alpha/beta BIOSes under much tighter control - we have seen a leak from Igor's Lab about BIOSes with Zen3 support.

 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Three of your points are basically the same thing.
If Zen3 brings no new instructions then there don't need to be any new compiler optimizations.
There could be alpha/beta BIOSes under much tighter control - we have seen a leak from Igor's Lab about BIOSes with Zen3 support.
Exactly. So the whole new 19h Family naming comes from big uarch changes rather than an AVX512 extension, and they can keep those totally silent. If the rumor about a 50% FPU increase is true, then Zen3 has doubled its FPUs to 4x (8x pipes). Cortex X1 doubled its FPUs from 2x to 4x and the projected IPC uplift is 30%. So Zen3 must have completely new FPUs or some other technique to get higher throughput out of them (like SMT4).

EDIT: SMT4 lowers IPC per thread, so it's better to use the term throughput.
 
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,672
6,150
136
Three of your points are basically the same thing.
If Zen3 brings no new instructions then there don't need to be any new compiler optimisations.
There could be alpha/beta BIOSes under much tighter control - we have seen a leak from Igor's Lab about BIOSes with Zen3 support.

Zen3 added quite a bunch of new instructions. The manual has been available for 3 months if you are interested in reading it.

Quote from an earlier post of mine
1. SEV-SNP instructions added - one more step toward complete VM isolation from the host.
Kernel patches are ongoing: SEV is complete, but SNP is still in progress. MS recently implemented Autarky for Azure on Intel hosts, but I think AMD's SEV-SNP is a much more comprehensive solution.
IMO, once the live-migration process for encrypted VMs is streamlined it should be easy to deploy widely. Another issue is pinning of pages; an improvement would be the HW allowing encrypted pages to be paged in and out without too much perf loss.

2. MPK/PKE support added to Programming Manual and kernel patches submitted
Another feature for Memory page protection.

3. PCID support patches submitted
Smaller hits from all those TLB flushes due to security issues.

4. 256-bit CLMUL and AES instructions
Two things servers do ALL THE TIME are bulk encryption and bulk compression, at which, not coincidentally, Zen2 is strong. Content compression is when the server sends your browser compressed data and your browser decodes it on the fly; it is one of the reasons massive web content is not choking the internet. Encryption is, as you know, HTTPS traffic, which needs no introduction. Any infrastructure guy worth his salt is not going to use SPEC to judge system performance.
256-bit CLMUL and AES operations are going to give decent boosts for servers in bulk encryption and content compression.

Going through the list of changes in the manual
- I don't think there will be GCC/LLVM patches any time soon; all Zen2 code will run optimally on Zen3.
- Most of the new instructions are not aimed at user code, i.e. they are system-level instructions. SEV-SNP, PCID et al. and the relevant support are already upstreamed for kernel 5.8+.
- The only user-facing instructions are the new CLMUL/AES ones, which again would rarely appear in user code directly; they will be used by OpenSSL/zlib and the like, which most user code simply links against.
- I am confident AMD would not update the instruction costs for GCC/LLVM before launch. That would be silly, considering it would give away the possible performance uplift, and the cost calculation would not even produce drastically different code.
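As a small illustration of that last point about OpenSSL/zlib: user code never emits AES or CLMUL instructions itself, it just calls a library and inherits whatever hardware acceleration the library uses. A sketch with Python's stdlib zlib binding (which wraps the same zlib C library servers link against):

```python
# User code gets compression (and, via libraries, crypto) speedups
# transparently: the app calls the library, the library uses whatever
# hardware instructions exist underneath.
import zlib

payload = b"the same HTML boilerplate repeated " * 100
compressed = zlib.compress(payload, level=6)   # what a server sends
restored = zlib.decompress(compressed)          # what a browser does

assert restored == payload
print(len(payload), "->", len(compressed), "bytes on the wire")
```

Swap the library's backend for an accelerated one and this code speeds up without a single source change, which is exactly why the new instructions never "land in user code".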

The major change visible to the kernel is the load/store subsystem, for which changes have been upstreamed since January of this year.
It is not just the 8-core CCX; there is a big change in how the coherency probes work compared to Zen2, plus new RAS capabilities and a new perf subsystem.
Kernel 5.8+ is good for Zen3.

Most of the manuals are available; what's missing is the architecture reference for Family 19h. And you can bet that this year the Family 19h arch manual will be even more diluted than the Family 15h and Family 17h manuals.

Additionally, I dumped the latest Aorus X570 Pro BIOS, extracted the filesystem, and could already find hints of Zen3 support, but I am just too lazy to go into speculation mode over it.
 
Last edited:

yuri69

Senior member
Jul 16, 2013
427
711
136
Three of your points are basically the same thing.
If Zen3 brings no new instructions then there don't need to be any new compiler optimisations.
There could be alpha/beta BIOSes under much tighter control - we have seen a leak from Igor's Lab about BIOSes with Zen3 support.

Instruction-set extensions? Sure, Zen3 brings only a few minor ones, although AMD tends to release a copy-pasted machine description which gets tuned later, if ever. Sure, there might be absolutely no relevant MD changes between Zen2 and Zen3, but that sounds kind of strange.
Tight control? Mkay, but there are tons of people in the supply chain. So... yeah.
Igor's Lab is so far the only single point of truth about Family 19h. Not even the usual random Chinese chat-board posts have appeared. Duh.

I will keep trusting my handy Occam's razor...
 

soresu

Platinum Member
Dec 19, 2014
2,921
2,141
136
Cortex X1 has doubled FPUs from 2x -> 4x and projected IPC uplift is 30%.
FP and SIMD are not the same thing, despite the fact that PR still likes to confuse people by conflating them.

Int and FP are number formats - fixed-point and floating-point, to be exact.

You can have both Int and FP SIMD on any given ISA.

The X1 doubles the NEON units from the A78; this should benefit all NEON execution, but by itself it has no effect on scalar FP - which obviously gets its 30% improvement from other changes to the uArch.
So Zen3 must have completely new FPUs or some other technique to gather higher IPC out of it (like SMT4).
You keep referencing things happening now, or things like EV8 that happened decades ago, as if they would affect design decisions made at least 3-4 years ago.

AMD are not soothsayers who can predict the uArch announcements of other companies years down the road during the concept design of new cores.

Nor (at least I hope) do they sit around thinking about uArchs designed for completely different ISAs in an era as different from now as that time was from the late 70s to early 80s - markets have changed since then, and process nodes have drastically changed.

Something to bear in mind about the 50% FP rumour for Zen3: it could easily be referring to some ML/AI-optimised format like BF16 or FP8.

ARM's intention to drastically increase ML performance on ARM cores was in fact announced years ago, and could well have set balls rolling at both Intel and AMD during the Zen3 concept phase.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,672
6,150
136
Something to bear in mind about the 50% FP rumour for Zen3: it could easily be referring to some ML/AI-optimised format like BF16 or FP8.
I hope AMD is not going to add anything of that sort, like Intel's AMX; for the many applications that don't use matrix math it is a terrible waste of die space.
I am a believer in HSA; in the future I hope AMD can stack a special accelerator die to offload these operations, like the original x87 coprocessor.
Not all SKUs need to support matrix or specialized vector ops.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
ARM's intentions to increase ML performance drastically on ARM cores were in fact announced years ago and could well have set balls rolling at both Intel and AMD during the Zen3 concept phase.
1) BFloat16 is part of AVX512:

2) BFloat16 brings a 100% uplift to ML, not just 50%.

So if there is no indication of AVX512, then that 50% must come from more FPU units. More FPUs, each with its own pipe/scheduler, also helps scalar IPC massively (Zen2's doubled FPU width helped only about 20%). Because I'm a big fan of the DEC Alpha EV8, I tend to see SMT4 and EV8's 8xALU (or at least Apple's 6xALU) as the logical next step. But that's just me. Maybe x86's future is to stay with 4xALU for the next 100 years; who knows, I don't care whether x86 dies or not. I guess Apple has proven 6xALU+3xFPU is the way to go, and the Cortex X1 proved even 4xFPU is possible with 2xLSU, 1xLoad, 2xStore ports (the first CPU able to do 3 load operations per cycle). These cores are very wide OoO machines, still getting wider with each new generation. I guess this is the future trend for x86 as well (Zen3 and Golden Cove).
 

soresu

Platinum Member
Dec 19, 2014
2,921
2,141
136
1) BFloat16 is part of AVX512:
I didn't say SIMD did I?

I said ML optimised format support - and I'm disinclined to believe such an old rumour to be accurate about 50% anyways considering the ES parts are often bugged or non optimal.

Not that I'm saying it will have AVX512 - though it may well have different extensions; nothing about BF16 requires 512-bit instructions, and as I said, the same goes for FP8.

AMD are certainly in a much better spot to get their own independent ISA extensions supported than they were with Bulldozer and XOP, thanks to the amazing DIY AM4 sales of Ryzen, let alone the increasing market share in laptop/mobile/SFF from Renoir.
Maybe x86 future is to stay with 4xALU for next 100 years
I doubt that any current ISA will still be directly supported in hardware ASICs in 100 years.

By that time we will have gone through dramatic changes to computer architecture at every level, and the ISAs in common use today will likely be supported through emulation on some sort of super-fast, multi-layered, reprogrammable spintronic/plasmonic hybrid processor using photonics or plasmonics for all data communication.
 

dr1337

Senior member
May 25, 2020
379
635
136
I'm actually starting to get a little concerned about Zen3 - we knew a hell of a lot more about Zen1 and Zen2 more than six months before release, yet apart from the L3 cache unification of the CCD we know very little about it.

I get that AMD can be tight lipped, but they are not normally as tight lipped as this when the news is overwhelmingly good.

Navi/RDNA1 was similarly well under wraps until very close to release, and it underwhelmed me somewhat given its process-node advantage over the competing Nvidia products, competitive perf/watt aside.

Well, from driver leaks we know that it's already been revised once, from A0 to B0. We also know the A0 chips had a base clock of 4 GHz and a boost of 4.6 GHz. So even if they've run into problems/bugs with the design, they're still getting decent yields already and are hard at work revising. And we know that with the 8-core unified CCX and next-gen Infinity architecture, there will be big improvements in latency, and therefore in performance and clock scaling, regardless of any other changes made to the core.


Personally I wouldn't worry about a Navi situation happening with Ryzen. Truth be told, Navi massively outperforms GCN per CU; it's just that AMD priced it higher, in line with Nvidia's offerings, for many reasons - a big one being wafer supply. This won't be a problem with Zen 3, because they're leveraging chiplets and they want to stay competitive with Intel.
 

yuri69

Senior member
Jul 16, 2013
427
711
136
Word is the two new BIOSes Asus dropped in June and July, 2203 and 2407, have support for Zen 3.
Asus's 2203 added Renoir support, which was not present in 1407. 2407 brought new AGESA stuff, but no microcode changes.

Don't you think the news outlets would pick up any Zen 3 BIOS-related story immediately?
 