AMD Raven Ridge 'Zen APU' Thread

sm625 · May 2, 2017

Shivansps said:
I still fail to see the point of pairing a super slow 11CU GPU with HBM2, and adding 1 CCX to the mix to make a super expensive thing that will be outperformed by a $400(or less) cpu+gpu combination on the market. And im not talking about games here.

And adding more CU will just make things worse, (cost).

11CU doesnt make sense. Not at all. Ryzen is roughly the size of polaris, a bit smaller actually. So half of a Ryzen and half a polaris would be 4C/8T/18CU. I guess the rumor is that Vega CUs are larger, but still 11CU would be a very small APU. Smaller than Llano, Richland, or Kaveri.

maddie · May 2, 2017

We finally have a firm basis to estimate an APU with integral HBM2.

Since it is generally believed that an APU with a Zen CCX and a powerful [RX 560 class ] GPU graphics unit will not work properly without significantly more bandwidth than DDR4 offers, we can estimate the additional costs with 1 stack 2-Hi HBM2 on an interposer to be roughly $ 15-20 maximum.

Additionally the HBM2 module can be a cheaper, lower clocked one and higher volume production will lower these costs.

If true, then there is definitely a place for this product. A Zen CCX and 16 CU product. I can easily see this as a product.

Assuming that 2-Hi HBM2 is 60% of 4-HI HBM stack and interposer is 1/3 cost of Fiji interposer.

This is based on the following analysis:
http://electroiq.com/insights-from-leading-edge/category/uncategorized/page/6/

Glo. · May 2, 2017

Volume production for HBM2 is bigger, so the cost will be lower. APU/NPU die will cost 20-30$, at best, and interposer will cost at best 10-15$.

So in worst case scenario we are looking at 80$ manufacturing cost, for something that can be sold at 349-399$ price tag.

maddie · May 2, 2017

Glo. said:
Volume production for HBM2 is bigger, so the cost will be lower. APU/NPU die will cost 20-30$, at best, and interposer will cost at best 10-15$.

So in worst case scenario we are looking at 80$ manufacturing cost, for something that can be sold at 349-399$ price tag.

A single HBM 4-Hi stack was $12 and a 1000 mm^2 interposer was $25.

How do you get as much as $80?

Glo. · May 2, 2017

HBM2 stack - 8$.
The die: 20-30$. If we account for 30$, Die+ HBM2 stacks - 46$.
Interposer - 15$. Total = 51$.
Substrate and packaging cost = 30$ max, including 1$ TSV's.
Total in worst case scenario - 83$.

AMD have bought in Q3 and Q4 2016 a volume of supplies for 80 000 000 $, which was pointed in one of previous conference calls. It is HBM2 supply. You have to bare in mind that HBM1 stack cost was so high because of limited production. HBM2 production was bigger: both AMD and Nvidia was/will be using it.

maddie · May 2, 2017

Glo. said:
HBM2 stack - 8$.
The die: 20-30$. If we account for 30$, Die+ HBM2 stacks - 46$.
Interposer - 15$. Total = 51$.
Substrate and packaging cost = 30$ max, including 1$ TSV's.
Total in worst case scenario - 83$.

AMD have bought in Q3 and Q4 2016 a volume of supplies for 80 000 000 $, which was pointed in one of previous conference calls. It is HBM2 supply. You have to bare in mind that HBM1 stack cost was so high because of limited production. HBM2 production was bigger: both AMD and Nvidia was/will be using it.

RX 460 has 112 Gb/s memory bandwidth.

A single stack 2 Hi-HBM2 2GB module @ 1 (Gb/s) per pin [slowest one] has 128 Gb/s.

You don't need 2 stacks for 16CU graphics. I am assuming only one needed. This drops HBM2 costs and interposer costs [smaller area].

I estimate around $60-65. In any case even assuming you are accurate, this can be a compelling product for many users.

scannall · May 2, 2017

I'm not so sure you'll see HBM2 APU's this year. The cost is still up there. As just another wild guess here, I can see a server APU though. 24 cores/48 threads and some Vega GPU with HBM2 in lieu of AVX. Arguably that would perform better than AVX.

Glo. · May 2, 2017

maddie said:
RX 460 has 112 Gb/s memory bandwidth.

A single stack 2 Hi-HBM2 2GB module @ 1 (Gb/s) per pin [slowest one] has 128 Gb/s.

You don't need 2 stacks for 16CU graphics. I am assuming only one needed. This drops HBM2 costs and interposer costs [smaller area].

I estimate around $60-65. In any case even assuming you are accurate, this can be a compelling product for many users.

Sure RX460 can have/need 112 GB/s of bandwidth. But I do not believe people understand why is Vega going to use such massive bandwidth for.

Let me give you an example. Over 50 GB/s of BWdth you get with 3200 MHz RAM memory. And that will be enough to feed 11 Vega CU's .

The thing is that HBCC, and HBC(!) is here because of specific behavior of the GPU. Lets just say, that 16 CU's with 1.2 GHz can be as fast as WX5100/RX 470D, IMO.
So we are looking at completely different scale of performance.

raghu78 · May 2, 2017

scannall said:
I'm not so sure you'll see HBM2 APU's this year. The cost is still up there. As just another wild guess here, I can see a server APU though. 24 cores/48 threads and some Vega GPU with HBM2 in lieu of AVX. Arguably that would perform better than AVX.

There is a server APU planned with HBM2. I expect that we will see such a product in 2018.

http://www.fudzilla.com/news/processors/37494-amd-x86-16-core-zen-apu-detailed

The rumoured specs were 16C/32T with Vega GPU (formerly called Greenland) .

dnavas · May 2, 2017

raghu78 said:
There is a server APU planned with HBM2. I expect that we will see such a product in 2018.

http://www.fudzilla.com/news/processors/37494-amd-x86-16-core-zen-apu-detailed

The rumoured specs were 16C/32T with Vega GPU (formerly called Greenland) .

The GPU is interesting, but high-throughput hevc/avc *decode* hardware would be even more interesting to me (particularly if it decoded production streams, rather than delivery streams -- so high bit-depths and 4:2:2 support). Something that could handle half a dozen 4k60p streams, and (even more importantly) someone making the rounds in the NLE world to get it used would be awesome. Not sure what AMD has planned there, but Intel's qsv is one of the few remaining advantages of the 4-core Intels, and AMD has the opportunity to leapfrog and create a version targeted for use in a workstation, rather than just on the consumption/mainstream capabilities we've seen thus far. Not that I would expect it, just that it's possible given their mix-and-match approach to CPU construction....

maddie · May 2, 2017

If AMD is going to make an APU with a HBM2 stack, why have a monolithic APU?

One of the tech presentations a few yrs ago by AMD was on disintegrating the SOC and using an interposer to reintegrate the sub-units. Production costs would be lower as 2 x 100mm^2 would be cheaper than a 200mm^2 one. The interposer cost would be pretty much the same and you could offer a wider range of products. X CPUs and Y GPUs would give (XxY) possibilities.

jpiniero · May 2, 2017

maddie said:
If AMD is going to make an APU with a HBM2 stack, why have a monolithic APU?

The interposer adds a cost they frankly can't afford. They need something closer to Intel's EMIB.

crashtech · May 2, 2017

jpiniero said:
The interposer adds a cost they frankly can't afford. They need something closer to Intel's EMIB.

If the interposer will be there anyway, the cost isn't increased, at least for HBM2 skus.

jpiniero · May 2, 2017

crashtech said:
If the interposer will be there anyway, the cost isn't increased, at least for HBM2 skus.

That's the thing, there won't be any HBM2 SKUs. Unless they somehow convince Apple.

The Stilt · May 2, 2017

Glo. said:
Let me give you an example. Over 50 GB/s of BWdth you get with 3200 MHz RAM memory. And that will be enough to feed 11 Vega CU's .

I would love to know how you came up with this conclusion.

For example the iGPU found in Steamroller APUs (8 CUs @ 866MHz) is bandwidth starved until > 80GB/s (~90MB/s per GFlop) and the performance scales well up to 96GB/s (~108MB/s per GFlop).
Raven is expected to have > 1.8TFlops of throughput which at 51.2GB/s (DDR-3200) would mean < 28.5MB/s of bandwidth per GFlop, which naturally is also shared between the CPU and the GPU.
Even RX 480 cards are bandwidth starved and they have < 44MB/s of dedicated bandwidth available per GFlop.

So essentially the DCC in Vega should reduce the bandwidth requirement by < 3.2x compared to scenario where DCC is not used at all (Steamroller, GCN 1.1), so that the 11 CU Vega GPU in Raven wouldn't be bandwidth starved at DDR-3200MHz.
AMD stated -40% reduced bandwidth requirement from DCC for GCN 1.2, IIRC.

maddie · May 3, 2017

jpiniero said:
The interposer adds a cost they frankly can't afford. They need something closer to Intel's EMIB.

You've often mentioned high interposer costs.

What is your idea of interposer costs?

The problem I see with EMIB is that your connection is limited to the adjacent die. Signals on an interposer can be routed anywhere, but on the EMIB equipped units must skip along using intermediaries.Will work well for simple few numbered sub-units. Complex, many numbered sub-unit systems should be better with interposers.

maddie · May 3, 2017

The Stilt said:
I would love to know how you came up with this conclusion.

For example the iGPU found in Steamroller APUs (8 CUs @ 866MHz) is bandwidth starved until > 80GB/s (~90MB/s per GFlop) and the performance scales well up to 96GB/s (~108MB/s per GFlop).
Raven is expected to have > 1.8TFlops of throughput which at 51.2GB/s (DDR-3200) would mean < 28.5MB/s of bandwidth per GFlop, which naturally is also shared between the CPU and the GPU.
Even RX 480 cards are bandwidth starved and they have < 44MB/s of dedicated bandwidth available per GFlop.

So essentially the DCC in Vega should reduce the bandwidth requirement by < 3.2x compared to scenario where DCC is not used at all (Steamroller, GCN 1.1), so that the 11 CU Vega GPU in Raven wouldn't be bandwidth starved at DDR-3200MHz.
AMD stated -40% reduced bandwidth requirement from DCC for GCN 1.2, IIRC.

I agree. It's extremely difficult to imagine any 11 CU APU being fed with 2 channel DDR4.

It appears to lead to the conclusion that any APU with more than 6-8 CU graphics will have HBM2 or 4 channel DDR4.

Glo. · May 3, 2017

The Stilt said:
I would love to know how you came up with this conclusion.

For example the iGPU found in Steamroller APUs (8 CUs @ 866MHz) is bandwidth starved until > 80GB/s (~90MB/s per GFlop) and the performance scales well up to 96GB/s (~108MB/s per GFlop).
Raven is expected to have > 1.8TFlops of throughput which at 51.2GB/s (DDR-3200) would mean < 28.5MB/s of bandwidth per GFlop, which naturally is also shared between the CPU and the GPU.
Even RX 480 cards are bandwidth starved and they have < 44MB/s of dedicated bandwidth available per GFlop.

So essentially the DCC in Vega should reduce the bandwidth requirement by < 3.2x compared to scenario where DCC is not used at all (Steamroller, GCN 1.1), so that the 11 CU Vega GPU in Raven wouldn't be bandwidth starved at DDR-3200MHz.
AMD stated -40% reduced bandwidth requirement from DCC for GCN 1.2, IIRC.

Raven Ridge APUs will use Vega architecture.

Draw Stream Binning Rasterizer. In this technology there is a key to how it will work. Even its name should tell you a lot. Everything in the execution pipeline is being "streamed" to the cores. Because each part of the rasterization process is tiled, there is much less requirement for massive amount of memory bandwidth. Cores are much more capable, and at the same time resource prone. The thing is that every Vega architecture GPU will have High Bandwidth Cache Controller, regardless if they have HBC(!), or not. It allows the framebuffer to locate particular data in the memory that is needed, and deliver it to the cores, when it is needed.

Thirdly, pixel engine is detached from Memory Controller, and it is a client of L2 cache. As a result of all of this the requirements for bandwidth in small GPUs will be... smaller. I am not saying that HBM2 will not help. The difference will be huge.

Lets just say that 11 CU chip, with 704 GCN cores from Vega arch, and without HBM can have performance of RX 550.

P.S. where did you got the 1.8 TFLOPs number for Raven Ridge APU?

The Stilt · May 3, 2017

Glo. said:
Raven Ridge APUs will use Vega architecture.

Draw Stream Binning Rasterizer. In this technology there is a key to how it will work. Even its name should tell you a lot. Everything in the execution pipeline is being "streamed" to the cores. Because each part of the rasterization process is tiled, there is much less requirement for massive amount of memory bandwidth. Cores are much more capable, and at the same time resource prone. The thing is that every Vega architecture GPU will have High Bandwidth Cache Controller, regardless if they have HBC(!), or not. It allows the framebuffer to locate particular data in the memory that is needed, and deliver it to the cores, when it is needed.

Thirdly, pixel engine is detached from Memory Controller, and it is a client of L2 cache. As a result of all of this the requirements for bandwidth in small GPUs will be... smaller. I am not saying that HBM2 will not help. The difference will be huge.

Lets just say that 11 CU chip, with 704 GCN cores from Vega arch, and without HBM can have performance of RX 550.

P.S. where did you got the 1.8 TFLOPs number for Raven Ridge APU?

If Vega can cope with such a low bandwidth, what's the point in using HBM2 in the first place?
Especially when HBM2 is extremely expensive and is expected to have extremely limited availability initially.

GDDR5/X would allow higher and generally much more flexible memory capacity, at lower cost. Also in case the bandwidth requirements are modest, it is possible to use low(er) power GDDR5/X variants too.

I expect the iGPU in Raven to be clocked anything between 1.1 - 1.3GHz, hence > 1.8TFlops.

maddie · May 3, 2017

Glo. said:
Raven Ridge APUs will use Vega architecture.

Draw Stream Binning Rasterizer. In this technology there is a key to how it will work. Even its name should tell you a lot. Everything in the execution pipeline is being "streamed" to the cores. Because each part of the rasterization process is tiled, there is much less requirement for massive amount of memory bandwidth. Cores are much more capable, and at the same time resource prone. The thing is that every Vega architecture GPU will have High Bandwidth Cache Controller, regardless if they have HBC(!), or not. It allows the framebuffer to locate particular data in the memory that is needed, and deliver it to the cores, when it is needed.

Thirdly, pixel engine is detached from Memory Controller, and it is a client of L2 cache. As a result of all of this the requirements for bandwidth in small GPUs will be... smaller. I am not saying that HBM2 will not help. The difference will be huge.

AFAIK, a HBCC GPU will need a higher bandwidth than one without the HBCC. At the most basic, it interleaves between the GPU and external memory. The memory bandwidth requirement is thus in excess of an existing GPU. A HBCC will not be able to work without high bandwidth low latency memory. It needs HBM as the cache, DDR4 need not apply. In return it saves the amount of video memory needed.

Draw Stream Binning Rasterizer and other architectural changes are a separate issue. If a GPU with those improvements needs X bandwidth, then one with those improvements + a HBCC will need X+ bandwidth.

Glo. · May 3, 2017

maddie said:
AFAIK, a HBCC GPU will need a higher bandwidth than one without the HBCC. At the most basic, it interleaves between the GPU and external memory. The memory bandwidth requirement is thus in excess of an existing GPU. A HBCC will not be able to work without high bandwidth low latency memory. It needs HBM as the cache, DDR4 need not apply. In return it saves the amount of video memory needed.

Draw Stream Binning Rasterizer and other architectural changes are a separate issue. If a GPU with those improvements needs X bandwidth, then one with those improvements + a HBCC will need X+ bandwidth.

Its the other way around. HBCC is there to limit the requirements for bandwidth, and control the streaming of data from any type of memory(Volatile, and non-volatile) to the GPU.

But I will go and research into this matter further.

The Stilt said:
If Vega can cope with such a low bandwidth, what's the point in using HBM2 in the first place?
Especially when HBM2 is extremely expensive and is expected to have extremely limited availability initially.

GDDR5/X would allow higher and generally much more flexible memory capacity, at lower cost. Also in case the bandwidth requirements are modest, it is possible to use low(er) power GDDR5/X variants too.

I expect the iGPU in Raven to be clocked anything between 1.1 - 1.3GHz, hence > 1.8TFlops.

Its there to increase the performance of the GPU. HBCC is not only about memory bandwidth, its about also controlling the data streaming from any source to the GPU.

coercitiv · May 3, 2017

Glo. said:
Its the other way around. HBCC is there to limit the requirements for bandwidth, and control the streaming of data from any type of memory(Volatile, and non-volatile) to the GPU.

But the cache itself has high bandwidth requirements. So how does HBCC limit the need for bandwidth in the absence of the fast cache?

richaron · May 3, 2017

maddie said:
AFAIK, a HBCC GPU will need a higher bandwidth than one without the HBCC. At the most basic, it interleaves between the GPU and external memory. The memory bandwidth requirement is thus in excess of an existing GPU. A HBCC will not be able to work without high bandwidth low latency memory. It needs HBM as the cache, DDR4 need not apply. In return it saves the amount of video memory needed.

I think you're on the wrong track here. AFAIK the HBCC reduces the significant software overhead from communications over any bus. This is it's major advantage. For example in the event of a VRAM buffer overflow a GPU with HBCC should have a significant latency advantage accessing system RAM over one which addresses system RAM with drivers.

Glo. · May 3, 2017

richaron said:
I think you're on the wrong track here. AFAIK the HBCC reduces the significant software overhead from communications over any bus. This is it's major advantage. For example in the event of a VRAM buffer overflow a GPU with HBCC should have a significant latency advantage accessing system RAM over one which addresses system RAM with drivers.

Exactly, HBCC just "knows" where the data is, and it can draw them at the correct time, when it is needed.

coercitiv said:
But the cache itself has high bandwidth requirements. So how does HBCC limit the need for bandwidth in the absence of the fast cache?

Cache is just HBM2...

If there is no HBM2 there is no cache. The data are streamed from the memory in the right time. The framebuffer of data is smaller than memory bandwidth.

NTMBK · May 3, 2017

jpiniero said:
The interposer adds a cost they frankly can't afford. They need something closer to Intel's EMIB.

Samsung are working on lower-cost HBM using organic interposer:

https://www.extremetech.com/gaming/...tions-to-blow-the-doors-off-the-memory-market

AMD Raven Ridge 'Zen APU' Thread

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Lifer

Lifer

Lifer

Golden Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Lifer