New Zen microarchitecture details

AtenRa · Mar 17, 2016

deasd said:
An assumption.
A die that different than desktop = more than 8 cores(16 cores?)
MCM = 16 cores * 2 = 32 cores (with 64 threads)

Ahh, you meen the Server die has 16C 32T (double of the Desktop die) and its also an MCM at 32C 64T ??

deasd · Mar 17, 2016

AtenRa said:
Ahh, you meen the Server die has 16C 32T (double of the Desktop die) and its also an MCM at 32C 64T ??

yep, that's what I read from that statement, MCM is just a trick that put two same dies together, should not call it as a 'separate die' which different from desktop.
So non-MCM server die is different from desktop. I assumed they have different/more core count even with one die......

prtskg · Mar 17, 2016

The server version of the die has twice the cores, L3 cache and additional I/O controllers per die. I haven´t been able to disassemble one yet, however judging from the package size it is a MCM part. 14nm LPP process.

The relative power consumption is roughly the same as on Intel 14nm parts with similar configuration, but the clocks are quite low :/

The 16 core version is already in testing
Am I interpreting it wrong?

AtenRa · Mar 17, 2016

OK, thats fine.

Im curious to see the desktop die size with 8C 16T. According to AMD, 4x Cores makes a Unit. Each Unit has 8MB of L3 cache. That will make an 8C 16T with 2x Units and 2x 8MB (16MB) L3 Cache.

That will make it double of the 4 Module (8 Cores) Bulldozer.
Well, that die could be close to 120-160mm2.

If that is the case, it could be a Kabylake Socket 1151 competitor easily in Q4 2016/ Q1 2017.

ShintaiDK · Mar 17, 2016

AtenRa said:
Im curious to see the desktop die size with 8C 16T. According to AMD, 4x Cores makes a Unit. Each Unit has 8MB of L3 cache.

Wasn't that part of the fake slides?

AtenRa · Mar 17, 2016

ShintaiDK said:
Wasn't that part of the fake slides?

Ehm, I dont remember.

ShintaiDK · Mar 17, 2016

AtenRa said:
Ehm, I dont remember.

This was fake.

inf64 · Mar 17, 2016

Those slides might be fake(are we 100% sure?) but AMD is indeed using similar configuration for Zen. 4 cores share common L3 which now acts like L3 in intel core processors.

They have to get the clocks up for HEDT, nobody will buy ~2.8Ghz base 8C16T SKU, no matter how popular the price is. What is interesting is that AMD will finally have a high IPC core when DX12 gaming is coming which basically favors more cores xd. I guess the IPC uplift VS Bulldozer generation will make the difference in some of the games and productivity apps that cannot take advantage of multi-core chips.

Hopefully Zen platform and Zen+ followup are good enough for new buyers looking for upgrade or current intel chip users looking for better MT performance(i3/i5 users). Pure ST performance will most likely stay on intel's side since the bar is now set up so high but it is good to see AMD getting very close to it.

Tuna-Fish · Mar 17, 2016

That slide might be fake, but the knowledge of 4 cores per a slice of L3 doesn't come from marketing slides but patches posted to the Linux kernel by Aravind Gopalakrishnan, an AMD employee.

https://lkml.org/lkml/2015/11/3/634

In principle, they could post fake numbers to the open mailing list and just patch the correct ones in just before they release, but as far as I know no-one has ever done this and the gain of masking architecture and cache amounts would probably not be worth it compared to the risk of software not working on the CPUs at release unless it was compiled very recently.

Tuna-Fish · Mar 17, 2016

wait nvm, couldn't verify contents of this post

NTMBK · Mar 17, 2016

Tuna-Fish said:
Based on the patches to Linux, the Zen L3 does not act like the L3 in Intel CPUs. In Intel CPUs, all cores use all L3 slices evenly. In a way, memory addresses belong to specific L3 slices, and those L3 slices are the final arbiter on where the valid contents of that address currently resides.

The downside is that data needs to travel further than if you were just using your own L3, the upside is that L3 slices don't need to manage coherency between each other.

In Zen, cores primarily allocate in the L3 slice that belongs to their core complex. This means that there are multiple different places in the L3 where any data could reside, meaning that there is need for some coherency system there.

Sounds similar to how L2 works in the PS4 and XBox One. Each tile of 4 cores has its own L2, and they have to maintain coherence between the two.

Tuna-Fish · Mar 17, 2016

NTMBK said:
Sounds similar to how L2 works in the PS4 and XBox One. Each tile of 4 cores has its own L2, and they have to maintain coherence between the two.

Fyi, I went traipsing around LKLM after writing that previous post and could not find the confirmation for it. Now I think I might have mixed up LKML sources and (possibly incorrect) slides. I am currently unsure of how L3 in Zen works.

itsmydamnation · Mar 17, 2016

NTMBK said:
Sounds similar to how L2 works in the PS4 and XBox One. Each tile of 4 cores has its own L2, and they have to maintain coherence between the two.

not really, on the PS4 between the L2 caches has to go over some god awful bus thats connected outside of the cache, thats why it has such massive latency. we don't know what the interconnect between the L3's is, is could be a mesh from the local L2's to the local L3 and then a mesh between the L3's or a ring. We don't know yet.

edit:

from dresdenboys blog:

The most interesting part describes the way, how the last level cache (LLC) ID is being calculated for Zen based MPUs:

+ core_complex_id = (apicid & ((1 << c->x86_coreid_bits) - 1)) >> 3;
+ per_cpu(cpu_llc_id, cpu) = (socket_id << 3) | core_complex_id;

"Core complex" should be similar to "compute unit" and has been used in some AMD patents already. The expression marked in red means a shift right by 3, which equals a division by 8. So with two logical cores per physical core due to SMT, a core complex should contain four Zen cores and a shared LLC.
The next line shows the socket ID being shifted left by 3, leaving 3 bits for the core complex ID, which suggests a maximum number of eight core complexes per socket, or 32 physical cores. This number should first be seen as a placeholder, but we've already seen rumours mentioning that many cores.

Dresdenboy · Mar 17, 2016

Tuna-Fish said:
That slide might be fake, but the knowledge of 4 cores per a slice of L3 doesn't come from marketing slides but patches posted to the Linux kernel by Aravind Gopalakrishnan, an AMD employee.

https://lkml.org/lkml/2015/11/3/634

In principle, they could post fake numbers to the open mailing list and just patch the correct ones in just before they release, but as far as I know no-one has ever done this and the gain of masking architecture and cache amounts would probably not be worth it compared to the risk of software not working on the CPUs at release unless it was compiled very recently.

You can find my explaination here:
http://dresdenboy.blogspot.de/2016/02/amd-zeppelin-cpu-codename-confirmed-by.html

ShintaiDK said:
This was fake.

Yep, that's one of the fake slides, which is assumed to be faked based on the facts, that AMD didn't show them and other slides by the same anonymous poster contained image processing artefacts (found by juanrga) and some nonsense.

I assume, that these slides (repeatedly popping up thanks to careless CMS usage or whatever) caused some of the "info" collected by Theo Valich (6-wide).

But there are also these slides which contain information, which is in part confirmed already:

Glo. · Mar 17, 2016

Glo. said:
Lets think about it for a second.
8 core Zen CPU running at 95W TDP at 3.5 GHz core clock.
8 Core Broadwell-E Running at 3 GHz and 140W.
Impossible?

Let it even run at 3 GHz, and you still win with Broadwell in terms of efficiency. However, here come the posts from Thevenin and Fottemberg. Fott said that Efficiency will be on par with Broadwell. So it may very well be 2.5 GHz 8 core CPU with 95W TDP.

Now everything is on place.

I am quoting myself to give a little perspective. Broadwell-EP 8 core 85W CPU has base core clock of 2.1 GHz.

120W 16 core CPU also has 2.1 GHz core clock.

William Gaatjes · Mar 19, 2016

Those GMI links are interesting and answer a question i had in my mind.

With the later versions of the apu (since kaveri) and heterogenous computing, the gpu and the cpu can read and write in each others memory because of the unified memory. If the cpu needs to do some work on a data set, it would not have to physically copy that data set to "cpu" memory , perform the modification on the data and write it back to "gpu" memory. With shared memory, this is kind of silly. And the later apu's solved that issue by using a hardware method (by use of the MMU : memory managing unit) that is comparable to just copying a pointer to that data set and not the actual data set it self. This reliefs the memory from a lot of unnecessary data copying (meaning lot of read and write). The memory is more efficiently used.
This is called zero-copy through pointer passing.

With the arrival of fury, i always wondered how AMD was going to solve that issue since the fury is connected over the pci-e bus and has its own memory. So data copying is needed again. With the latency of the pci-e bus added. The only way to solve this is to optimize the drivers that as little as possible data is moved between gpu and cpu. But one can do only so much.

It seems that AMD is slowly unraveling their plans.
Now i finally understand why they mention a cpu only zen.
Because they are going the mcm road (Not a pcb but an interposer die with multiple dies on it, cpu, gpu and hbm). With the confidence they have gotten from the fury (die with hbm, same height with very strict tolerances for cooling purposes), AMD has now confidence enough to create a interposer based mcm (multi carrier module). The whole point is that at billions of transistors, the energy density is huge. If you want performance, you need to go wide and crank up the clock speed. This creates heat that cannot be removed easily.
Also, a multi 10 to 100 billion transistor (single) die apu would be statistically more prone to defects then separate (lower transistor number) dies. Even adding redundant circuitry and as a last resort the method of binning, can do only so much.
And, either the cpu architecture or the gpu architecture can be kept and the other modified. More or less hbm. mcm allows for more flexibility.

But the issue is to get the same latency advantage as the apu's have since kaveri. The zero copy method. And AMD has also solved that riddle by coming up with the GMI link. One GMI link allows for 25GB/sec. The more GMI links, the higher the bandwidth. But what is the real benefit, is the much lower latency when compared to pci-e. They will start if the slides can be believed with a 100GB/sec link. Cranking it up to a point as much as the cpu and gpu can handle the data. So, data copying is still needed but at a much lower latency. And when the need for ddr4/ddr5 external memory is again eliminated by the use of a hbm2 only, mcm based "SOC", the zero copy through pointer passing method is again available. This works at the grace of having a unified memory architecture once again. And HBM2 allows for a computer configuration with enough memory on it that unified memory makes sense. A complete mcm with 16GB would be more then enough for most people, meaning a pc with 16GB of memory.

The future is looking well for AMD if they execute correctly.

ZEN will be a powerhouse because AMD does not have to worry about the GPU being on the same die anymore. It has benefits but also drawbacks. Especially since there are more and more transistors.

Expect the next generation of consoles to be again a AMD design win.
And this might as well be the end of the pc as we know it for a lot of people.
If Microsoft play their cards right in the near future, they can finally execute an Apple alike strategy. A microsoft home pc that is powerful enough to game (for most people) but can also be used as a normal desktop pc.
With an optimized windows 10 successor that greatly taps into the hardware because of optimized low level drivers with no HAL (Hardware abstraction layer).
Just as Apple does with their hardware and optimized drivers + software.

EDIT:
Of course, it could be that Apple will go that road first. Or perhaps Sony with a linux based pc or even google. Maybe sony will come with an android based home pc in cooperation with google.
But microsoft has the advantage of having the largest software base.
If they are hopefully not so stupid to break compatibility with all old software without giving a proper virtualization solution to keep old win32 software running.

hrga225 · Mar 21, 2016

William Gaatjes said:
Those GMI links are interesting and answer a question i had in my mind.

With the later versions of the apu (since kaveri) and heterogenous computing, the gpu and the cpu can read and write in each others memory because of the unified memory. If the cpu needs to do some work on a data set, it would not have to physically copy that data set to "cpu" memory , perform the modification on the data and write it back to "gpu" memory. With shared memory, this is kind of silly. And the later apu's solved that issue by using a hardware method (by use of the MMU : memory managing unit) that is comparable to just copying a pointer to that data set and not the actual data set it self. This reliefs the memory from a lot of unnecessary data copying (meaning lot of read and write). The memory is more efficiently used.
This is called zero-copy through pointer passing.

With the arrival of fury, i always wondered how AMD was going to solve that issue since the fury is connected over the pci-e bus and has its own memory. So data copying is needed again. With the latency of the pci-e bus added. The only way to solve this is to optimize the drivers that as little as possible data is moved between gpu and cpu. But one can do only so much.

It seems that AMD is slowly unraveling their plans.
Now i finally understand why they mention a cpu only zen.
Because they are going the mcm road (Not a pcb but an interposer die with multiple dies on it, cpu, gpu and hbm). With the confidence they have gotten from the fury (die with hbm, same height with very strict tolerances for cooling purposes), AMD has now confidence enough to create a interposer based mcm (multi carrier module). The whole point is that at billions of transistors, the energy density is huge. If you want performance, you need to go wide and crank up the clock speed. This creates heat that cannot be removed easily.
Also, a multi 10 to 100 billion transistor (single) die apu would be statistically more prone to defects then separate (lower transistor number) dies. Even adding redundant circuitry and as a last resort the method of binning, can do only so much.
And, either the cpu architecture or the gpu architecture can be kept and the other modified. More or less hbm. mcm allows for more flexibility.

But the issue is to get the same latency advantage as the apu's have since kaveri. The zero copy method. And AMD has also solved that riddle by coming up with the GMI link. One GMI link allows for 25GB/sec. The more GMI links, the higher the bandwidth. But what is the real benefit, is the much lower latency when compared to pci-e. They will start if the slides can be believed with a 100GB/sec link. Cranking it up to a point as much as the cpu and gpu can handle the data. So, data copying is still needed but at a much lower latency. And when the need for ddr4/ddr5 external memory is again eliminated by the use of a hbm2 only, mcm based "SOC", the zero copy through pointer passing method is again available. This works at the grace of having a unified memory architecture once again. And HBM2 allows for a computer configuration with enough memory on it that unified memory makes sense. A complete mcm with 16GB would be more then enough for most people, meaning a pc with 16GB of memory.

The future is looking well for AMD if they execute correctly.

ZEN will be a powerhouse because AMD does not have to worry about the GPU being on the same die anymore. It has benefits but also drawbacks. Especially since there are more and more transistors.

Expect the next generation of consoles to be again a AMD design win.
And this might as well be the end of the pc as we know it for a lot of people.
If Microsoft play their cards right in the near future, they can finally execute an Apple alike strategy. A microsoft home pc that is powerful enough to game (for most people) but can also be used as a normal desktop pc.
With an optimized windows 10 successor that greatly taps into the hardware because of optimized low level drivers with no HAL (Hardware abstraction layer).
Just as Apple does with their hardware and optimized drivers + software.

EDIT:
Of course, it could be that Apple will go that road first. Or perhaps Sony with a linux based pc or even google. Maybe sony will come with an android based home pc in cooperation with google.
But microsoft has the advantage of having the largest software base.
If they are hopefully not so stupid to break compatibility with all old software without giving a proper virtualization solution to keep old win32 software running.

I will quote this post in gpu forum.
Great post.:thumbsup:

Blitzvogel · Mar 21, 2016

William Gaatjes said:
ZEN will be a powerhouse because AMD does not have to worry about the GPU being on the same die anymore. It has benefits but also drawbacks. Especially since there are more and more transistors.

Interesting total assessment, but it depends more on studied and projected chip yields to go the MCM route. It's certainly a more feasible route for extremely high performance solutions, like high end desktop and server (8-16 core), but unnecessary for the low end and perhaps "medium-end" which likely won't see HBM being used. Again it comes down to yields since producing chips with so many billions of transistors is going to be very prone to yield issues. The low and medium products will likely still be proper APUs (2 & 4 core).

Such a route also demands the creation of a universal interposer architecture or even just a standard interposer for all products. The flexibility to mount a CPU or APU on an interposer, BGA, or standard pin/LGA I think is paramount too. As interesting as an 8/16 core CPU + big GPU + HBM on interposer system sounds, I think I would still want a full dedicated graphics card that can be upgraded on it's own. I wouldn't want to waste money on limited graphics power only to have it was current and power later on thanks to leakage, unless GPU + CPU HSA actually becomes a notable thing.

Damn, I'd hate to think all the studies and projections that have and still have to be made for all that.

Silverforce11 · Mar 21, 2016

Blitzvogel said:
As interesting as an 8/16 core CPU + big GPU + HBM on interposer system sounds, I think I would still want a full dedicated graphics card that can be upgraded on it's own. I wouldn't want to waste money on limited graphics power only to have it was current and power later on thanks to leakage, unless GPU + CPU HSA actually becomes a notable thing.

The beauty with that is your dGPU will be boosted in performance in DX12/Vulkan multi-adapter by the APU. In the worse case scenario where there's no support, you still have a powerful CPU to drive your dGPU to the max.

Blitzvogel · Mar 21, 2016

Silverforce11 said:
The beauty with that is your dGPU will be boosted in performance in DX12/Vulkan multi-adapter by the APU. In the worse case scenario where there's no support, you still have a powerful CPU to drive your dGPU to the max.

But if you want more dGPU power from the onset, the interposer system seems an unnecessary waste.

Silverforce11 · Mar 21, 2016

Blitzvogel said:
But if you want more dGPU power from the onset, the interposer system seems an unnecessary waste.

If you don't want the APU design, then go with Zen CPU only, I am sure AMD will have these two lines of CPU/APU separate.

Dresdenboy · Mar 22, 2016

Blitzvogel said:
But if you want more dGPU power from the onset, the interposer system seems an unnecessary waste.

Summit Ridge is a pure CPU. And you'll get it earlier than Raven Ridge, the APU.

kapulek · Mar 22, 2016

AM4 Socket will be µOPGA and it will have 1331 pins
http://www.bitsandchips.it/english/...cket-will-be-uopga-and-it-will-have-1331-pins

LTC8K6 · Mar 22, 2016

kapulek said:
AM4 Socket will be µOPGA and it will have 1331 pins
http://www.bitsandchips.it/english/...cket-will-be-uopga-and-it-will-have-1331-pins

I thought sure AMD would get away from those pins on the chip.

Looks like only the server chips will move on to LGA.

AtenRa · Mar 22, 2016

LTC8K6 said:
I thought sure AMD would get away from those pins on the chip.

Looks like only the server chips will move on to LGA.

Those pins are better than LGA for the consumer market. More broken motherboards are coming with warped pins than broken pins on CPUs.

New Zen microarchitecture details

Lifer

Senior member

Senior member

Lifer

Lifer

Lifer

Lifer

Diamond Member

Golden Member

Golden Member

Lifer

Golden Member

Platinum Member

Golden Member

Diamond Member

Lifer

Member

Platinum Member

Lifer

Platinum Member

Lifer

Golden Member

Member

Lifer

Lifer