Vega/Navi Rumors (Updated)


Glo.

Diamond Member
It's like 'infinite fabric", but for GPU - a marketing term designed to hide a fact that due to memory pricing realities AMD is forced to introduce extra hw/driver schemes to manage GPU memory. And just like with CPUs, once you hit a link that has bw/latency limits and is being also used for other things, you will get reduced performance.

We already got this management for GTX970, and given how that card performs once near/above 3.5GB is perfect indicator for how this scheme will work once near 4GB limit for this card. PCIE bandwith/latency is still the same and it is still being used for other things. And no, AMD has not invented time machine and cannot anticipate user actions and does not know what assets will be needed.
You misunderstood the technology completely. It has nothing to do with what you are describing.

Let me put this in simple terms. Nvidia's Pascal GP100 chip has 49-bit addressing through CUDA. It has a similar capability to Vega, but on GP100 it is done through software, and it will have all of the software's limitations. HBCC is a hardware implementation of Unified Memory from the HSA 2.0 feature set. I genuinely suggest reading about the hardware, what it does, and what you are looking at.
 

krumme

Diamond Member
@Joe
Did it cross your mind that the alternative to 4GB HBM is not 12GB HBM, but 4GB HBM used the old way?

It's incredible that it has taken so many years to get to a memory model that is used everywhere else. It's an insane waste of resources.
 

Glo.

Diamond Member
Most of it was already discussed on the Beyond3D forum by game developers. AMD's slides are enough to understand what Vega is.
 

JoeRambo

Golden Member
You misunderstood the technology completely. It has nothing to do with what you are describing.

Let me put this in simple terms. Nvidia's Pascal GP100 chip has 49-bit addressing through CUDA. It has a similar capability to Vega, but on GP100 it is done through software, and it will have all of the software's limitations. HBCC is a hardware implementation of Unified Memory from the HSA 2.0 feature set. I genuinely suggest reading about the hardware, what it does, and what you are looking at.

Does the Vega desktop implementation have unified memory with the CPU? Nope. That pretty much means memory is not shared, coherency is not maintained, and traffic needs to go through PCIe. Period.
And I was referring to claims that "insert buzzword here" would magically expand the frame buffer. The comparison with the GTX 970's memory management scheme and its successes/failures is very much on topic here.
 

antihelten

Golden Member
Does the Vega desktop implementation have unified memory with the CPU? Nope. That pretty much means memory is not shared, coherency is not maintained, and traffic needs to go through PCIe. Period.

And why would going through PCIe be an issue? Exactly how much data do you think actually needs to be loaded for a new frame compared to the previous one (for which all of the necessary assets would of course already be in memory)?

Assuming the game is running at 60 FPS, between two subsequent frames the various animations would have progressed by 16.7ms and the camera would have moved by at most 16.7ms worth of distance. Given such generally tiny movements, do you really think the scene could change so much that the amount of new asset data required would overwhelm the PCIe link?
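For scale, a back-of-envelope sketch (an editor's illustration, not a figure from the thread; the ~15.75 GB/s usable-bandwidth number assumes a PCIe 3.0 x16 link after encoding overhead):

```python
# Rough estimate of how much new asset data a PCIe 3.0 x16 link could
# stream in per frame at 60 FPS. The bandwidth figure is an assumption
# (x16 Gen3 after 128b/130b encoding), not a measurement.
PCIE3_X16_GBPS = 15.75
FPS = 60.0

per_frame_mb = PCIE3_X16_GBPS / FPS * 1024
print(f"~{per_frame_mb:.0f} MB of transfer budget per {1000 / FPS:.1f} ms frame")
# -> ~269 MB/frame, far more than a 16.7 ms camera move would
#    plausibly need to page in
```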

And I was referring to claims that "insert buzzword here" would magically expand the frame buffer. The comparison with the GTX 970's memory management scheme and its successes/failures is very much on topic here.

Zlatan's comparison of 4-8GB to 12-24GB was not the best way of putting it. A different way of looking at it might be to say that 4-8GB of VRAM with HBCC is effectively equal to the full 4-8GB of actually usable VRAM, whereas 4-8GB of VRAM without HBCC is effectively equal to 1-3GB of usable VRAM (since the rest is wasted on unused data).
 

zlatan

Senior member
Zlatan's comparison of 4-8GB to 12-24GB was not the best way of putting it. A different way of looking at it might be to say that 4-8GB of VRAM with HBCC is effectively equal to the full 4-8GB of actually usable VRAM, whereas 4-8GB of VRAM without HBCC is effectively equal to 1-3GB of usable VRAM (since the rest is wasted on unused data).

You're right, this is a better comparison.
 

richaron

Golden Member
Does the Vega desktop implementation have unified memory with the CPU? Nope.

Actually, from what I've learnt, HSA and hUMA do work over PCIe with supporting GPUs and CPUs. That's Excavator+ (and Ryzen) and GCN 1.2+ (?). And Vega with the HBCC and appropriate drivers should behave similarly to hUMA even on systems with non-supporting CPUs (assuming the drivers take up the slack for hardware missing from Intel CPUs).

But I don't think that's the crux of your argument... It's something more about PCIe limitations? Which, as far as I'm concerned, is valid, but everything I've seen gives Vega the advantage in accessing "extra" data over any bus.
 

JoeRambo

Golden Member
And why would going through PCIe be an issue? Exactly how much data do you think actually needs to be loaded for a new frame compared to the previous one (for which all of the necessary assets would of course already be in memory)?

So why are vendors bothering with increasing memory amounts? Must be some anti-AMD conspiracy; 256KB + PCIe must be enough.

Does it occur to you guys that, for example, Sony with its 8GB of GDDR has an advantage over MS, who are using DDR3 + an "HBCC" of their own? Caching schemes are always like that: difficult to manage, prone to failure.

Actually, from what I've learnt, I think HSA and hUMA do work over PCIe with supporting GPUs and CPUs. That's Excavator+ (and Ryzen) and GCN 1.2+ (?). And Vega with the HBCC and appropriate drivers should behave similarly to hUMA (assuming the drivers take up the slack for hardware missing from Intel CPUs).

But I don't think that's the crux of your argument... It's something more about PCIe limitations? Which, as far as I'm concerned, is valid, but everything I've seen gives Vega the advantage in accessing "extra" data over any bus.


I am aware of those. The problem is that on any desktop platform AMD won't have anything "unified" and hardware-coherent, unlike what several dilettantes are happily claiming here. Vega is for sure advanced, and probably has extra DMA resources and so on, but it is a necessity for them: at some point they realized that HBM is pricey and 4GB is what they can have now for lower-tier cards. Which is perfectly fine.

What I am not happy about is the load of BS claims, backed by some "49 bits for NV". (Oh, and BTW, NV has a nice scheme going with IBM, their own interconnect for maintaining coherency, but that is also a necessity when under attack by AMD and Intel offerings.)
 

richaron

Golden Member
I am aware of those. The problem is that on any desktop platform AMD won't have anything "unified" and hardware-coherent, unlike what several dilettantes are happily claiming here.

I don't understand what you are trying to say. I just told you that modern AMD CPUs and GPUs connected over a PCIe bus support HSA and hUMA. This is the top-of-the-line, leading-edge, best-available-in-the-known-universe "unified" for CPUs and GPUs. AMD's advantage is that a lot of the "unified" is in hardware. And where it's not already in hardware, depending on the system, they can fill in the "unified" with software, which is another thing I brought up.

But partial hardware "unified" plus partial software "unified" should still be faster and more efficient than fully software "unified", if that's even a point worth making.
 

JoeRambo

Golden Member
But partial hardware "unified" plus partial software "unified" should still be faster and more efficient than fully software "unified", if that's even a point worth making.

1) This discussion of HSA and hUMA being "more efficient" with Ryzen does nothing for gaming performance. In fact, given AMD's limited software resources and current CPU market share in desktop gaming, it would probably not result in anything good.
2) I am going after claims that one can get away with 4GB of dedicated memory, apply some AMD buzzwords, and come out with an effective 8-16GB of video memory. That is BS.
 

richaron

Golden Member
1) This discussion of HSA and hUMA being "more efficient" with Ryzen does nothing for gaming performance. In fact, given AMD's limited software resources and current CPU market share in desktop gaming, it would probably not result in anything good.

Lol, nice deflection, but I'm not going to ignore what I was talking about. What you said was:
Does the Vega desktop implementation have unified memory with the CPU? Nope.
And I made a reply saying:
I don't understand what you are trying to say. I just told you that modern AMD CPUs and GPUs connected over a PCIe bus support HSA and hUMA.

So that point was completely destroyed. I'm sure there are a lot of people willing to weigh in on whether HBCC is more efficient at accessing data outside of local memory, even if drivers handle part of the process rather than pure hardware.

2) I am going after claims that one can get away with 4GB of dedicated memory, apply some AMD buzzwords, and come out with an effective 8-16GB of video memory. That is BS.

I didn't say that. I even argued against it. Nothing to do with me.
 

Glo.

Diamond Member
Does the Vega desktop implementation have unified memory with the CPU? Nope. That pretty much means memory is not shared, coherency is not maintained, and traffic needs to go through PCIe. Period.
And I was referring to claims that "insert buzzword here" would magically expand the frame buffer. The comparison with the GTX 970's memory management scheme and its successes/failures is very much on topic here.

Non-volatile RAM, network storage, system DRAM. That is what the HBCC connects with.

It goes through PCIe and is shared (this is what the 49-bit addressing does) and coherent, with appropriate bandwidth. There is a reason why the total amount of data addressable by Vega is 512 TB.
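For reference, the 512 TB figure falls straight out of the address width; a quick check (editor's note, using binary terabytes):

```python
# A 49-bit address space spans 2**49 bytes, i.e. exactly 512 TiB.
address_bits = 49
addressable_tib = 2 ** address_bits / 2 ** 40
print(addressable_tib)  # -> 512.0
```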

Vega GPUs are not only dGPUs, but also APUs. But this is not the question here.

Let's say we are talking about a 3072 GCN core chip with 4GB of HBM2 at 512 GB/s. This GPU will be perfectly capable of doing 4K, because its framebuffer is not wasted on unused data that had to be kept resident under previous driver and framebuffer models. Secondly, the culling techniques used in Vega save enormous amounts of video RAM, and the memory compression techniques add to this as well.

Both ways of looking at this problem are correct: a 4GB GPU that has a much more "usable" framebuffer compared to another 4GB GPU with only 1-2GB usable at any particular moment, and a 4GB GPU that has an effective 12GB framebuffer, because the part not held in GPU memory can be considered volatile. AMD has done this to make GPU memory more usable and to avoid refreshing pages as often when data changes, which wasted power and cycles. Now everything is done "when it is needed"; the GPU architecture is more reactive.

The second part, strictly from a development perspective, is that the new architecture will be much simpler to program and optimize. You can ask developers how bad optimizing software for AMD GPUs was, and tell them to compare it to Nvidia architectures; it's night and day for GCN/NCU. The biggest struggle with GCN came from memory management, because the Pixel Engine was a client of the memory controller, not of the L2 cache as in Nvidia architectures. Vega is similar to Kepler/Maxwell/Pascal on this front. That also cost power (page refreshing) and cycles; Vega will do everything automatically. The only thing devs will have to do custom work on with Vega is primitive shaders: they will have to tell the GPU what is visible and what is not, and the GPU will cull accordingly, saving resources, memory, and cycles.

And that is just the tip of the iceberg of how the memory architecture works in Vega.
 

JoeRambo

Golden Member
Lol, nice deflection, but I'm not going to ignore what I was talking about. What you said was:
So that point was completely destroyed. I'm sure there are a lot of people willing to weigh in on whether HBCC is more efficient at accessing data outside of local memory, even if drivers handle part of the process rather than pure hardware.

So "complete destruction" nowadays is "might run at some point in the future, on AMD CPUs only"? I could add the speculation that it will be Linux-only, but please don't quote me, OK?

hUMA and HSA are for SoCs that share memory between the CPU, GPU, and other accelerators. Once memory is no longer shared, you can of course still maintain part of that illusion, with various amounts of overhead and hardware help.
But rest assured, you will still move data between GPU and CPU at the speed of PCIe. Sure, you might not need to do a copy after copying, and your memory "pointer" in the driver may be the same. So what? Guys here claim HBCC completely makes up for the lack of buffer, and not by percents but by multiples.

BTW, talking about deflections, we have moved nicely from discussing ridiculous claims of HBCC advantages to the fine details of hUMA and HSA2 advantages.


Non-volatile RAM, network storage, system DRAM. That is what the HBCC connects with.

It goes through PCIe and is shared (this is what the 49-bit addressing does) and coherent, with appropriate bandwidth. There is a reason why the total amount of data addressable by Vega is 512 TB.

AMD already has a professional graphics card with flash memory on board, I think half a terabyte or so. I am fine with that, and kudos for the innovation. No doubt it works well for the intended market, with its tens of gigabytes.

But some guys here claimed that the same technology can somehow make up for a lack of video memory on a 4GB card. Nope.
 

richaron

Golden Member
So "complete destruction" nowadays is "might run at some point in the future, on AMD CPUs only"? I could add the speculation that it will be Linux-only, but please don't quote me, OK?
Yeah, again I think you are trying to change the subject, or you can't understand what I was saying. Again, what you said was:
I am aware of those. The problem is that on any desktop platform AMD won't have anything "unified" and hardware-coherent, unlike what several dilettantes are happily claiming here.
And I replied with:
I don't understand what you are trying to say. I just told you that modern AMD CPUs and GPUs connected over a PCIe bus support HSA and hUMA... AMD's advantage is that a lot of the "unified" is in hardware. And where it's not already in hardware, depending on the system, they can fill in the "unified" with software...
So I don't know why you keep running off to other stuff, because I will always bring it back to this point. This obviously goes against your "just for SoCs" idea, but then you've been wrong in the past, so that seems in line.

And yes, the HBCC managing addresses rather than data (in hardware) can be much more efficient than current systems. Go and research "pointers" if you have no idea about programming or what I'm talking about.
 

Glo.

Diamond Member
AMD already has a professional graphics card with flash memory on board, I think half a terabyte or so. I am fine with that, and kudos for the innovation. No doubt it works well for the intended market, with its tens of gigabytes.

But some guys here claimed that the same technology can somehow make up for a lack of video memory on a 4GB card. Nope.
4GB is 4GB. That is correct. However, the approach to how data is handled within those 4GB is different, and this is a bigger change than you can imagine right now. Fury X was doing a similar thing, but through software, and as you can see in the latest games, where driver profiles have to be in place before the GPU starts performing accordingly, it was a pain in the ass.

Secondly, do not consider 4GB a small amount on Vega. It is big enough for 4K gaming on this architecture, and the 8GB versions will work perfectly fine at 8K resolution.
 

Snarf Snarf

Senior member
But some guys here claimed that the same technology can somehow make up for a lack of video memory on a 4GB card. Nope.

I think you're missing the point zlatan was making: 4GB isn't a lack of memory. In the current model, 4GB is really 1-2GB of actually used memory, with the other 2-3GB occupied by unneeded frame buffer data. What he is saying is that 4GB doesn't have to be considered a lack of VRAM if it is managed more appropriately, which is what AMD claims to have achieved through the HBCC and driver stack.

How this plays out in actual performance is anybody's guess, but the technology is definitely interesting, and it's a fun talking point when we have nothing else to talk about.
 

JoeRambo

Golden Member
They've already shown it doing that in Deus Ex in their Demo.

They limited the GPU to only 2GB and then ran it off/on

https://youtu.be/bDl6xJJqIAU?t=1171

And at the end of that segment they say, "We have several game developers on board, leveraging this stuff." The tech is definitely working, but it remains to be seen how well it is supported and what the actual gameplay benefits are.

Yeah, again I think you are trying to change the subject, or you can't understand what I was saying. Again, what you said was:

And I replied with:

So I don't know why you keep running off to other stuff, because I will always bring it back to this point.

Yeah, definitely bring it back. Currently there are zero desktop PCs on the market supporting this tech. In the future, Ryzen systems might support hUMA and HSA2 on the desktop, under Linux. Happy now? And since the GPU and CPU each have local memories, gains will be smaller than on SoCs. Happy?

And yes, the HBCC managing addresses rather than data (in hardware) can be much more efficient than current systems. Go and research "pointers" if you have no idea about programming or what I'm talking about.

Ironically, I am a programmer. While I haven't done any HSA programming (though I know a guy on these forums who tried and utterly failed), I am pretty competent with "pointers".
 

itsmydamnation

Platinum Member
@JoeRambo, so how about we try this:

https://forum.beyond3d.com/threads/will-gpus-with-4gb-vram-age-poorly.58233/page-12
Let's say you are playing at 144 fps (high frame rate monitor + high end GPU). You want to access all 16 GB of data every frame (otherwise part of the data will just sit there doing nothing = WASTE). 144 frames/s * 16 GB/frame = 2304 GB/s. That's a lot of bandwidth. Usually over 50% of the bandwidth is used by repeatedly accessing the render targets and other temporary data. BF1 presentation describes that their 4K mem layout (RTs and other temps) is around 500 MB. So if we assume 50% of bandwidth is used to access this small 0.5 GB region, the remaining 15.5GB has only half of the bandwidth left. So in order to access it all on every frame, you need 4.5 TB/s memory bandwidth.

This explains why I am not a big believer of huge VRAM sizes, until we get higher bandwidth memory systems. I am eagerly waiting for Vega's memory paging system.

So the point is that we can't effectively use the large memory sizes because we become bandwidth-limited anyway. The question, then, is how good the HBCC is at prefetching data that isn't in the "small" HBM stacks; if it's anything like a modern CPU, it's very good. The next question is how much of an impact an HBCC miss has when you have to pull from main memory.
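To make sebbbi's arithmetic explicit, here is a quick sketch of the numbers from the quote (an editor's illustration; the inputs are the quote's assumptions, not measurements):

```python
# Reproducing the bandwidth arithmetic from sebbbi's post quoted above.
fps = 144
vram_gb = 16.0       # goal: touch all 16 GB every frame
rt_gb = 0.5          # ~500 MB of render targets / temporaries (BF1 numbers)
rt_bw_share = 0.5    # ~50% of total bandwidth goes to that 0.5 GB region

naive_bw = fps * vram_gb                 # 2304 GB/s just to stream 16 GB/frame
asset_bw = fps * (vram_gb - rt_gb)       # 2232 GB/s for the remaining 15.5 GB
total_bw = asset_bw / (1 - rt_bw_share)  # only half the bus is left for assets
print(naive_bw, total_bw)                # -> 2304.0, 4464.0 (~4.5 TB/s)
```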

Well, I found this document, which is actually very good:
https://www.xilinx.com/support/documentation/white_papers/wp350.pdf
RCB is 64 bytes, so the majority of completions are 64-byte TLPs. [(64 bytes payload + 20 bytes overhead) / 8 bytes/clock] × [4 ns/clock] = 42 ns
Now, on top of that you would have the actual DRAM memory request; let's go with "bad" and call that 90ns, so 132ns in total. So if you target, say, 144Hz (6.9ms a frame), that transfer time is something like 1/52,000th of your 6.9ms frame.
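The same arithmetic as a sketch (the 90ns DRAM figure is the post's guess, carried over here):

```python
# 64-byte completion TLP timing from the Xilinx whitepaper, plus an
# assumed 90 ns DRAM access, compared against a 144 Hz frame budget.
payload_bytes, overhead_bytes = 64, 20
bytes_per_clock, ns_per_clock = 8, 4

tlp_ns = (payload_bytes + overhead_bytes) / bytes_per_clock * ns_per_clock  # 42 ns
miss_ns = tlp_ns + 90                                                       # 132 ns
frame_ns = 1e9 / 144                                                        # ~6.9 ms
print(miss_ns, round(frame_ns / miss_ns))
# -> 132.0 ns; one miss costs roughly 1/52,600th of a frame
```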

This is the point! Hardware can do this very efficiently compared to software managing it. Even if the hardware does dumb stuff and pushes the wrong data out (CPU caches do this all the time), it is so quick at pulling it back in that it doesn't matter.

Now the next point: game data has very good locality. Go and read up on SoA (struct of arrays) or AoSoA, as sketched below; CPUs and GPUs have needed these structures to get good cache usage, and the HBCC will exploit this fact as well. You will find that on an HBM miss, when the HBCC goes and gets the data from memory, it will stream in other "adjacent" data based on temporal or spatial locality.
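A minimal sketch of that layout difference (editor's illustration; the field names are made up):

```python
# Array-of-structs: each entity's fields are interleaved, so a pass that
# only reads positions still drags velocity and health through the cache
# (or, analogously, through HBCC pages).
aos = [{"pos": i, "vel": 1, "health": 100} for i in range(1000)]

# Struct-of-arrays: each field is contiguous, so a positions-only pass
# touches one dense array and nothing else.
soa = {
    "pos": list(range(1000)),
    "vel": [1] * 1000,
    "health": [100] * 1000,
}

xs_aos = [e["pos"] for e in aos]  # walks past every field of every entity
xs_soa = soa["pos"]               # one contiguous, prefetch-friendly array
```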

Now look at sebbbi's other point: around 50% of your memory bandwidth is used on just ~500MB of data (render targets), which means the HBCC has about 3GB of a 4GB card to use as a cache. It will probably segment that space virtually based on access patterns and so on, so that it doesn't push out commonly used assets; going by AMD's own data, those are probably ~2GB in size. That means on a 4GB card you have around 1 to 1.5GB that can be used purely for aggressive, low-latency prefetching, as the budget below shows.
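That budget, spelled out (editor's sketch; the 2GB resident-asset figure is the post's reading of AMD's data):

```python
# Rough VRAM budget on a hypothetical 4 GB Vega card with HBCC.
vram_gb = 4.0
render_targets_gb = 0.5   # ~500 MB of RTs and temporaries (BF1 numbers)
resident_assets_gb = 2.0  # commonly used assets the HBCC keeps resident

prefetch_gb = vram_gb - render_targets_gb - resident_assets_gb
print(prefetch_gb)  # -> 1.5 GB free for aggressive, low-latency prefetching
```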

Now I think it's worth watching something like this Oxide Capsaicin presentation and then thinking about what it means for the above; it's very interesting:
https://www.youtube.com/watch?v=QJOIvACRY6g

Now, my last point: NV already does this, and it works very well for CUDA on P100. They have to, because 16/32GB of HBM is a piddly amount of memory compared to the TBs sitting in each host system and the PBs in the cluster.
 

Glo.

Diamond Member
And the post is from sebbbi. Look nowhere else for information and knowledge of GPU architectures and game development.
 