Of course Graviton2 has lower IPC than Zen2; it's based on the weak A76 from 2018. But its core area is only 1.4 mm2, which lets you fit more than twice as many cores as Zen2 (3.6 mm2 per core). Wait for a 128-core Graviton3 based on the A78 (30% more IPC with 5% fewer transistors). And pray they don't use Cortex-X1 cores (60% higher IPC, which is 40% more than Zen2, at 2.1 mm2). How about that. Does x86 still look strong?
This claim was made by @LightningZ71. He is silent because he cannot prove his crazy claim. If you work with this stuff, I'd like to know the numbers from your company: just give us the number of machines and how many of them run with SMT off.
We could have had an 8xALU, 4xAGU, 4xFPU, SMT4 CPU core back in 2003 (the Alpha EV8); what a shame it was cancelled. If Zen3 isn't Keller's EV8 resurrection then AMD is in deep, deep trouble. I hope Zen3 is at least 6xALU, 3xAGU, 4xFPU, SMT4.
AMD needs to bring more tech features and keep moving forward. But we know they can also go backward, as with Bulldozer. So who knows :/
This stuff with the ALU count is ridiculous. If the number of ALUs were a bottleneck, they would have been increased quite a while ago. Integer ALUs are tiny; the scheduling hardware to keep them busy may not be, but the engineers working on these chips have register-transfer-level simulators to explore such design choices and determine where the bottlenecks actually are. Some enthusiast saying "Well, there's your problem, this one has 6 and yours only has 3" isn't going to change the reality that it isn't the bottleneck.
Also, comparing ALU counts across ISAs is not valid. It may not even be valid across different microarchitectures of the same ISA, since some of them split or combine instructions in different ways. ARM is more RISC-like, so we would expect it to need more, simpler instructions to accomplish the same work. The RISC/CISC distinction is mostly obsolete at this point, though. None of the current architectures are very RISC-like anymore; ARM has huge numbers of very specialized instructions. The original idea behind RISC was to use a much reduced instruction set to enable higher clocks and things like better pipelining and out-of-order execution; this is probably why Alpha processors ran at 500 MHz while the Pentium Pro was at 200. About the only thing that has survived is the use of fixed-width instructions and other things that simplify decoding, versus more CISC-like architectures that still have very complex decode.
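To make the split/combine point concrete, here is a tiny sketch. The function name is my own, and the assembly in the comments is illustrative of typical compiler output, not taken from any specific comparison; real output varies with compiler and flags:

```c
/* One C statement, different instruction counts per ISA (illustrative).
   x86-64 can often encode the whole read-modify-write as ONE instruction:
       add dword ptr [rdi], esi
   AArch64 is a load/store ISA, so it needs roughly THREE:
       ldr w8, [x0]
       add w8, w8, w1
       str w8, [x0]
   Internally, the single x86 instruction is typically cracked into separate
   load/ALU/store micro-ops anyway, so "how many ALUs" measures a different
   unit of work on each side. */
void bump(int *p, int x) {
    *p += x;  /* read-modify-write in one source statement */
}
```

So an ARM core retiring three instructions and an x86 core retiring one can be doing the same work, which is exactly why raw ALU-count comparisons across ISAs mislead.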
The current bottlenecks probably favor some more complex instructions, since everything is so memory bound. Complex instructions can act as a kind of instruction compression, so they take less cache space. AMD64 still suffers from higher instruction decode overhead, which is why it needs something like a trace cache (or micro-op cache) to save decoded instructions; that takes the place of the regular instruction cache to some extent. Also, how AMD64 is used has probably changed a lot, since instructions that did not perform well tend to get deprecated. They may still be available via microcode for compatibility, but modern compilers generally won't emit them. IMO, the distinction between RISC and CISC just doesn't really exist anymore.
As for why Apple processors perform so well, I don't think it has anything to do with the ALU count. Current processors are incredibly dominated by cache performance. In fact, given the die area devoted to cache, you are almost buying more of a memory chip than a processing chip. I have profiled applications that were essentially compute bound, and they often still only achieved an IPC near 1. The execution core can execute ridiculous numbers of instructions at 3 to 4 GHz with out-of-order, superscalar, speculative execution, so it almost always comes down to getting the data to the core. Apple still has a relatively low core count (was it 2 high-performance cores and 4 low-power cores?) and a very large shared L2 cache rather than an L3. Some applications do very well with large, low-latency L2 caches; it probably works exceptionally well for the small-memory-footprint applications that normally run on iPhones and iPads. This is also probably why some of the older Core 2 Quad processors still perform very well. I remember some of the old Core 2 Quad models being listed as good enough for compute-intensive VR games early on. People were surprised, but some models had 4 to 6 MB L2 caches; those were very expensive Extreme Edition parts at the time. Having a good cache design (including things like prefetch) is the most important part of modern CPU design. It is also a big source of the improvements in Zen versus previous AMD architectures.
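A rough sketch of the memory-bound effect described above. The helper names are my own, and to actually see the gap you would time these with a profiler (perf, VTune, etc.) on arrays much larger than the last-level cache:

```c
#include <stddef.h>

/* Same N additions, very different memory behavior.  Sequential access
   streams through cache lines and the hardware prefetcher keeps the ALUs
   fed; chasing a shuffled index array defeats the prefetcher, so most
   loads stall on DRAM and measured IPC collapses even though the
   arithmetic is identical. */
long sum_sequential(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];          /* prefetch-friendly streaming access */
    return s;
}

long sum_shuffled(const int *a, const size_t *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];     /* each load likely misses cache */
    return s;
}
```

Both functions return the same sum when `idx` is a permutation of 0..n-1; the difference shows up purely in cache misses, which is why adding ALUs would not help either one.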
The larger core count and small L2 + large L3 is probably necessary for good performance across a wide spectrum of applications, from mobile to server. It isn’t going to be the best at both though. It will be interesting to see what Apple does with making laptop and desktop chips. They don’t need to support server applications though, since they don’t make servers. They may end up with something more like Zen with core clusters for something like the Mac Pro though.
As for SMT, it probably isn't going away unless they decide to include a bunch of tiny, stripped-down, low-power cores instead. A lot of server applications have no use for all of the FP units taking up huge amounts of die space. Such applications will often run just as well on a tiny low-power core, since they are generally not very cacheable either; they just spend most of their time waiting on memory. Such applications are throughput oriented, and we have had architectures specifically designed to run them with a lot of low-power cores or a lot of hardware threads. I don't see why AMD would remove SMT, since it can be shut off if you don't want it. There is also reason to eventually support even higher thread counts for such throughput applications.
For the FP improvements, I am thinking that they will add at least one more 256-bit FMA unit. I don't know quite how the current FP units are architected. I have seen some diagrams that show 2 FMA and 2 FADD units, with one of the FADD units sharing its input ports with the 2 FMA units. Can it actually do 2 FMAs and one FADD per clock? If anyone has a link to more detailed info, it would be appreciated. The FMA units only need 2 operands when doing a multiply, but they need 3 operands for an FMA op. Doubling up the FMA units wouldn't really fit with the 50% number; going up to 3 units would. I am not sure how they would arrange the ports, but they probably would not need to increase them that significantly. I am kind of hoping that they support AVX-512 instructions across 2 clocks, as they did with 256-bit instructions on 128-bit units in Zen 1. Some of the AVX-512 instructions may be needed to compete with Intel independently of the vector width. I don't think there was really that much need to increase the vector width; there shouldn't be much difference between 2x256 vs. 1x512, and three units is more flexible.
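For reference, the three-operand nature of FMA shows up in the classic dot-product loop. This is a minimal sketch; whether the compiler actually contracts the multiply-add into FMA instructions depends on target and flags (e.g. -mfma, -ffp-contract), not on the source code:

```c
#include <stddef.h>

/* A fused multiply-add computes d = a*b + c in one rounding step, so each
   op reads THREE source operands, while a plain multiply reads two.  That
   extra read port per unit is why the port/operand wiring, not just the
   raw unit count, constrains how many FMA pipes a core can feed per cycle.
   A dot product issues one multiply-add per element, making it the textbook
   FMA consumer. */
double dot(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];   /* candidate for contraction into one FMA */
    return acc;
}
```

Note also that the result is the same whether the hardware runs this as 2x256-bit or 1x512-bit operations per cycle; the peak multiply-add throughput is identical, which is the "not much difference" point above.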