[WCCF] AMD Radeon R9 390X Pictured

Silverforce11 · May 20, 2015

manimal said:
if it ends up being only effective with drivers could we see a widening gap in gameworks titles?

For GW to abuse VRAM, it would have to get devs to include 4K/8K textures with their games. That's actually good for PC gaming.

Also, the gap is already huge in Project Cars & Witcher 3 (with HairWorks). It's big enough to be unplayable on max, so it doesn't really matter if the gap widens further.

MrTeal · May 20, 2015

maddie said:
Yes, but the opposite is not true. You can have 4GB cards with 256 bit and 4GB cards with 128 bit.

As already stated by several sites, you can have 2-8 stacks of 4-Hi HBM on an interposer, each stack being 1 GB.

With this in mind, why are some saying that you can't have 8GB HBM with 4096 bit?

Notice, I'm not saying Fiji will have 8GB, only that it might and that there is no hard limit preventing this from happening.

There is a 1024-bit direct bus between the GPU and one HBM stack, similar to how there is one 32-bit interface between a GDDR5 chip and the GPU. With GDDR you change the total amount of memory for a given bus width by using a different size of memory chip. For example, the 256-bit bus on a 4870 was connected to eight 64MB chips to give it 512MB of VRAM, while the 128-bit bus on a 260X is connected to four 512MB chips to give 2GB of VRAM.

Right now there's no provision to increase the size of a stack in HBM1, so adding extra memory means adding another stack, and each added stack requires its own 1024-bit physical interface.
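
To make the arithmetic above concrete, here is a rough sketch in Python. The function and constant names are made up for illustration; the figures (32 bits per GDDR5 chip, 1024 bits and 1GB per HBM1 stack) are the ones quoted in this thread, and clamshell-style sharing of an interface is deliberately ignored.

```python
# Hypothetical helper names; the per-device figures are the ones quoted in the thread.
GDDR5_CHIP_BUS_BITS = 32        # one 32-bit interface per GDDR5 chip
HBM1_STACK_BUS_BITS = 1024      # one 1024-bit interface per HBM1 stack
HBM1_STACK_CAPACITY_MB = 1024   # 1GB per 4-Hi HBM1 stack


def gddr5_capacity_mb(bus_width_bits, chip_capacity_mb):
    """Capacity when every 32-bit slice of the bus drives exactly one chip."""
    chips = bus_width_bits // GDDR5_CHIP_BUS_BITS
    return chips * chip_capacity_mb


def hbm1_capacity_mb(bus_width_bits):
    """Capacity when every 1024-bit slice of the bus drives exactly one 1GB stack."""
    stacks = bus_width_bits // HBM1_STACK_BUS_BITS
    return stacks * HBM1_STACK_CAPACITY_MB


print(gddr5_capacity_mb(256, 64))    # 4870: eight 64MB chips -> 512
print(gddr5_capacity_mb(128, 512))   # 260X: four 512MB chips -> 2048
print(hbm1_capacity_mb(4096))        # four HBM1 stacks -> 4096
```

With GDDR5 the second argument is the free variable; with HBM1 as described here, the only knob is the bus width, which is the whole point of the argument.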

gamervivek · May 20, 2015

GDDR5 can work in clamshell mode as well, the dual link interposer sounded something akin to that.

Headfoot · May 20, 2015

RampantAndroid said:
I for one...don't believe it. There is no way that I know of for them to, in the driver, tell what can be cleared from the VRAM....and hitting system RAM constantly isn't a solution either.

Are you an EE or a Software Engineer? If not, then it's pretty irrelevant what you know. I trust AMD's paid software engineers a lot more than some random guy on the internet who "doesn't know" of any way for it to happen.

Do we know if Tonga's color compression is both for transfer and at rest? It's highly possible to compress the color data at rest for smaller sizes in memory as well as in transport. Especially if they've built a fixed hardware piece that can encode/decode that compression algorithm without bottlenecking. IIRC both 360 and PS3 had standardized at rest texture compression that allowed those consoles to make do with such incredibly small RAM pools. I don't recall what the algorithm was though.

MrTeal · May 20, 2015

gamervivek said:
GDDR5 can work in clamshell mode as well, the dual link interposer sounded something akin to that.

True, and that's what they do on the PS4. I don't think it's as common on desktop GPUs, though there are always outliers.

Other than the WCCF rumor, has anything official been shown that HBM is capable of clamshell mode?

gamervivek · May 20, 2015

It was more of a VideoCardz rumor (leaked slides from AMD, which look the most genuine to me besides the synapse and sushiwarrior's post about a big, big die). I'm told on reddit's hardware sub that VideoCardz is a way better site when it comes to this.

http://egmr.net/wp-content/uploads/2015/03/amd_hbm.png

Perhaps it's not coming to 390X's first iteration.

RampantAndroid · May 20, 2015

Headfoot said:
Are you an EE or a Software Engineer?

I am the latter (to hell with analog circuit design!) I've made it clear I don't work on video drivers, but I've been around long enough to have some knowledge. Hacks are always bad. The way drivers currently optimize for individual games....should be avoided. It's terrible design. I've seen the same kind of nonsense done for DVDs, HD DVDs and BluRays to fix stamping screw ups. It's never a reliable fix. It's never that easy to maintain.

Azix · May 20, 2015

RampantAndroid said:
I am the latter (to hell with analog circuit design!) I've made it clear I don't work on video drivers, but I've been around long enough to have some knowledge. Hacks are always bad. The way drivers currently optimize for individual games....should be avoided. It's terrible design. I've seen the same kind of nonsense done for DVDs, HD DVDs and BluRays to fix stamping screw ups. It's never a reliable fix. It's never that easy to maintain.

That's my thinking too, and I am not even a software engineer. If you think about other hardware, you really only need drivers to add features or fix bugs with operating systems. What happens with games is too much, and it shouldn't happen when the tools being used should be standardized. It has to be that people are messing things up and the drivers have to patch it. Other hardware typically doesn't have people interacting with it through new software constantly, so nothing breaks that needs fixing.

You have to ask whose responsibility it is to make things right. AMD and Nvidia I am sure can manage games coded to standard and recommended practices. I guess they get the blame because they make a component that is central to PC gaming and also have the cash and expertise to patch things up.

There was a link I posted here where someone mentioned how complicated the drivers have gotten.

http://www.gamedev.net/topic/666419-what-are-your-opinions-on-dx12vulkanmantle/#entry5215019

Many years ago, I briefly worked at NVIDIA on the DirectX driver team (internship). This is Vista era, when a lot of people were busy with the DX10 transition, the hardware transition, and the OS/driver model transition. My job was to get games that were broken on Vista, dismantle them from the driver level, and figure out why they were broken. While I am not at all an expert on driver matters (and actually sucked at my job, to be honest), I did learn a lot about what games look like from the perspective of a driver and kernel.

The first lesson is: Nearly every game ships broken. We're talking major AAA titles from vendors who are everyday names in the industry. In some cases, we're talking about blatant violations of API rules - one D3D9 game never even called BeginFrame/EndFrame. Some are mistakes or oversights - one shipped bad shaders that heavily impacted performance on NV drivers. These things were day to day occurrences that went into a bug tracker. Then somebody would go in, find out what the game screwed up, and patch the driver to deal with it. There are lots of optional patches already in the driver that are simply toggled on or off as per-game settings, and then hacks that are more specific to games - up to and including total replacement of the shipping shaders with custom versions by the driver team. Ever wondered why nearly every major game release is accompanied by a matching driver release from AMD and/or NVIDIA? There you go.

The second lesson: The driver is gigantic. Think 1-2 million lines of code dealing with the hardware abstraction layers, plus another million per API supported. The backing function for Clear in D3D 9 was close to a thousand lines of just logic dealing with how exactly to respond to the command. It'd then call out to the correct function to actually modify the buffer in question. The level of complexity internally is enormous and winding, and even inside the driver code it can be tricky to work out how exactly you get to the fast-path behaviors. Additionally the APIs don't do a great job of matching the hardware, which means that even in the best cases the driver is covering up for a LOT of things you don't know about. There are many, many shadow operations and shadow copies of things down there.

The third lesson: It's unthreadable. The IHVs sat down starting from maybe circa 2005, and built tons of multithreading into the driver internally. They had some of the best kernel/driver engineers in the world to do it, and literally thousands of full blown real world test cases. They squeezed that system dry, and within the existing drivers and APIs it is impossible to get more than trivial gains out of any application side multithreading. If Futuremark can only get 5% in a trivial test case, the rest of us have no chance.

Stuka87 · May 20, 2015

RampantAndroid said:
I for one...don't believe it. There is no way that I know of for them to, in the driver, tell what can be cleared from the VRAM....and hitting system RAM constantly isn't a solution either.

Are you being serious? So you think memory gets cleared out magically on its own?

ALL memory allocation and deallocation happens via the driver. It is well within their power to deallocate memory that is no longer in use.

Currently most games use more memory than is needed because in most cases people have plenty of it. It's actually very interesting that The Witcher 3 uses so little VRAM compared to GTA V. It shows the CDPR devs paid attention to memory consumption.

RampantAndroid · May 20, 2015

Azix said:
That's my thinking too, and I am not even a software engineer. If you think about other hardware, you really only need drivers to add features or fix bugs with operating systems. What happens with games is too much, and it shouldn't happen when the tools being used should be standardized. It has to be that people are messing things up and the drivers have to patch it. Other hardware typically doesn't have people interacting with it through new software constantly, so nothing breaks that needs fixing.

You have to ask whose responsibility it is to make things right. AMD and Nvidia I am sure can manage games coded to standard and recommended practices. I guess they get the blame because they make a component that is central to PC gaming and also have the cash and expertise to patch things up.

There was a link I posted here where someone mentioned how complicated the drivers have gotten.

http://www.gamedev.net/topic/666419-what-are-your-opinions-on-dx12vulkanmantle/#entry5215019

Great link/post - thanks for cross posting it! My experience is in the media side, and it's astounding what some major studios would stamp discs with. Entire screw ups that, in a player that implemented the spec to the letter, would render the media unplayable. Menus not working and worse. Sometimes this was due to the simulator tools being bad, often it was just a lack of testing. Always once it went to production, the onus was suddenly on the people making the players to recognize the bad discs and then implement some hacks to fix the problem. In some cases, that was as bad as goto statements to skip entire instructions or replace them in memory.

RampantAndroid · May 20, 2015

Stuka87 said:
Are you being serious? So you think memory gets cleared out magically on its own?

ALL memory allocation and deallocation happens via the driver. It is well within their power to deallocate memory that is no longer in use.

Currently most games use more memory than is needed because in most cases people have plenty of it. It's actually very interesting that The Witcher 3 uses so little VRAM compared to GTA V. It shows the CDPR devs paid attention to memory consumption.

Last I knew, they didn't have a garbage collector in the driver. Now, if they somehow know what is being referenced, why aren't they cleaning it up already? My question is really basic here: if it is so easy to tell what resources in VRAM are needed, why isn't it already a driver bug that so much VRAM is needed? Why isn't it already a driver bug to go and release memory that is no longer referenced?

Saying that the Witcher 3 game uses less memory than GTA V doesn't lead you to "the driver can free it" - I would assume the rendering engine itself is responsible for knowing what is in VRAM and managing that memory. Where does the driver get information on what is currently being referenced?

https://msdn.microsoft.com/en-us/library/windows/desktop/ee418784(v=vs.85).aspx

The majority of your resources should be created as managed resources in POOL_MANAGED. All your resources will be created in system memory and then copied as needed into video memory. Lost-device situations will be handled automatically from the system memory copy. Since not all managed resources are required to fit into video memory all at once, you can over-commit memory where a smaller video memory working set of resources is all that is required to render in any given frame...

...The runtime keeps a timestamp for the last time a resource is used, and when a video memory allocation fails for loading a needed managed resource, it will release resources based on this timestamp in a LRU fashion. Usage of SetPriority takes precedence over the timestamp, so more commonly used resources should be set to a higher priority value...

...Proper priorities can help eliminate situations where something gets evicted and then is required again shortly thereafter.

Direct3D drivers are free to implement the driver managed textures capability, indicated by D3DCAPS2_CANMANAGERESOURCE, which allows the driver to handle the resource management instead of the runtime. For the (rare) driver that implements this feature, the exact behavior of the driver's resource manager can vary widely, and you should contact the driver's vendor for details on how this works for their implementation.

This sounds like a pretty common sense model. I don't see from here where the driver can suddenly gain info on what can be removed, given MS is implying that drivers handling memory management are rare. It seems to be more common sense to me to just have an excess of VRAM and shove everything in there to lower the chances of a resource being "evicted" and then required a couple of frames later. Which is the risk AMD is going to run, if they're going to be releasing/evicting a lot of resources.
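
For anyone who wants to see the eviction rule from the MSDN excerpt spelled out, here is a toy model in Python. It is only a sketch of the described policy (evict on allocation failure, lowest priority first, then least recently used), not the Direct3D runtime's actual code; the class, field, and resource names are all made up.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    size_mb: int
    priority: int = 0    # higher value = keep resident longer (stand-in for SetPriority)
    last_used: int = 0   # timestamp of the most recent use


class ManagedPool:
    """Toy model of a managed-resource pool backed by a fixed amount of VRAM."""

    def __init__(self, vram_mb):
        self.vram_mb = vram_mb
        self.resident = {}    # name -> Resource currently in video memory
        self.clock = 0
        self.evictions = []   # names of evicted resources, in eviction order

    def used_mb(self):
        return sum(r.size_mb for r in self.resident.values())

    def use(self, res):
        """Touch a resource for rendering, loading it into VRAM if needed."""
        self.clock += 1
        res.last_used = self.clock
        if res.name in self.resident:
            return
        if res.size_mb > self.vram_mb:
            raise MemoryError(f"{res.name} can never fit in {self.vram_mb}MB")
        # Allocation would fail: evict until the new resource fits.
        # Priority takes precedence over the timestamp, as in the excerpt.
        while self.used_mb() + res.size_mb > self.vram_mb:
            victim = min(self.resident.values(),
                         key=lambda r: (r.priority, r.last_used))
            del self.resident[victim.name]
            self.evictions.append(victim.name)
        self.resident[res.name] = res


pool = ManagedPool(vram_mb=3)
terrain = Resource("terrain", 1, priority=1)
hud, sky, npc = Resource("hud", 1), Resource("sky", 1), Resource("npc", 1)

for res in (terrain, hud, sky, npc, hud):
    pool.use(res)

print(pool.evictions)         # ['hud', 'sky']
print(sorted(pool.resident))  # ['hud', 'npc', 'terrain']
```

Note that "hud" gets evicted and is needed again one step later, which is exactly the evict-then-reload case the excerpt warns about. "terrain" is the least recently used resource but survives because of its higher priority.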

Glo. · May 20, 2015

RampantAndroid said:
Last I knew, they didn't have a garbage collector in the driver.

It's not a matter of garbage collection for memory, but of implementing out-of-order execution of data in memory.

To be honest, if out-of-order data execution from memory is what AMD's new cards do, that is a game changer, and it brings credibility to what Cloudfire posted about the information he got from Korea.

Out-of-order would bring a lot of efficiency and power to every GPU.

Why do I believe it is OoOE? The structure of HBM itself and the way the data is allocated in memory.

http://forums.anandtech.com/showpost.php?p=37413621&postcount=740

And if it's really OoOE, then refining the drivers makes much more sense than anyone can understand at first glance.

RampantAndroid · May 20, 2015

Glo. said:
It's not a matter of garbage collection for memory, but of implementing out-of-order execution of data in memory.

To be honest, if out-of-order data execution from memory is what AMD's new cards do, that is a game changer, and it brings credibility to what Cloudfire posted about the information he got from Korea.

Out-of-order would bring a lot of efficiency and power to every GPU.

Why do I believe it is OoOE? The structure of HBM itself and the way the data is allocated in memory.

http://forums.anandtech.com/showpost.php?p=37413621&postcount=740

And if it's really OoOE, then refining the drivers makes much more sense than anyone can understand at first glance.

I don't see how out of order execution would help with cycling resources out of VRAM. You still need to hit system RAM if the resource you need was evicted or not yet loaded.

Stuka87 · May 20, 2015

RampantAndroid said:
Last I knew, they didn't have a garbage collector in the driver. Now, if they somehow know what is being referenced, why aren't they cleaning it up already? My question is really basic here: if it is so easy to tell what resources in VRAM are needed, why isn't it already a driver bug that so much VRAM is needed? Why isn't it already a driver bug to go and release memory that is no longer referenced?

Saying that the Witcher 3 game uses less memory than GTA V doesn't lead you to "the driver can free it" - I would assume the rendering engine itself is responsible for knowing what is in VRAM and managing that memory. Where does the driver get information on what is currently being referenced?

https://msdn.microsoft.com/en-us/library/windows/desktop/ee418784(v=vs.85).aspx

This sounds like a pretty common sense model. I don't see from here where the driver can suddenly gain info on what can be removed, given MS is implying that drivers handling memory management are rare. It seems to be more common sense to me to just have an excess of VRAM and shove everything in there to lower the chances of a resource being "evicted" and then required a couple of frames later. Which is the risk AMD is going to run, if they're going to be releasing/evicting a lot of resources.

Correct, there is no GC in the driver, as memory is not managed. If it were, BSODs would be nearly a thing of the past, but then things would also be way slower.

My intent with comparing GTA V and TW3 was that the developers of TW3 are actively doing what they can to keep VRAM usage low.

I am not saying it would be simple for the driver to tag and track things. And it may be there is too much overhead for this to work. But it is possible.

The above post about OoOE makes a lot of sense and could be a big game changer.

Stuka87 · May 20, 2015

RampantAndroid said:
I don't see how out of order execution would help with cycling resources out of VRAM. You still need to hit system RAM if the resource you need was evicted or not yet loaded.

Yes it would, but the latency involved in doing such a move would be significantly less with HBM.

96Firebird · May 20, 2015

Stuka87 said:
My intent with comparing GTA V and TW3 was that the developers of TW3 are actively doing what they can to keep VRAM usage low.

TW3 probably uses so little VRAM because it needed to keep the VRAM usage down in the console versions because of their GDDR5 limit. The game uses ~4GB system RAM, so add the 1.5-2GB of VRAM on top of that for 1080p and you're at the limit of the console. From what I've seen, there are no higher-resolution textures for the PC version.

Headfoot · May 20, 2015

AFAIK texture and game asset compression is still 100% dependent on the engine designer implementing it on PC. Compressing assets at rest alone could net a huge reduction in memory usage in the best case and at least some benefit in the worst case. Think SandForce SSDs. I don't know if this is what they've done, but it would be a logical progression after Tonga's compressed-in-transit scheme.

Glo. · May 20, 2015

RampantAndroid said:
I don't see how out of order execution would help with cycling resources out of VRAM. You still need to hit system RAM if the resource you need was evicted or not yet loaded.

Have you read documentation of Mantle and OpenCL 2.0?

HBM looks like it was designed exactly for that.

The second thing is that OoOE only concerns how the data is handled. It's not STORED in VRAM, it's a constant stream of data to the core.

THAT makes a gigantic difference. Let's just say that with this ability a Hawaii chip would be faster than a GTX 980.

pj- · May 20, 2015

Stuka87 said:
Yes it would, but the latency involved in doing such a move would be significantly less with HBM.

I don't recall specific times but I believe it's an order of magnitude or two slower for the gpu to pull data from system memory than to read it from its own memory. The difference between HBM and GDDR5 in that scenario has to be pretty meaningless.

RampantAndroid · May 20, 2015

pj- said:
I don't recall specific times but I believe it's an order of magnitude or two slower for the gpu to pull data from system memory than to read it from its own memory. The difference between HBM and GDDR5 in that scenario has to be pretty meaningless.

This is exactly the point I'm trying to get across. System RAM has always been a bottleneck, and will continue to be so. That is partially why it's beneficial to have an excess of VRAM and just shove everything into VRAM early on so you never have to evict resources.

It's the same idea with on-die CPU caches and system RAM - it's great to have a larger on-die SRAM cache so you have to hit the DRAM less often.

So unless AMD has worked out a way to tell what the engine no longer needs (and is able to say that games rendering with high quality textures at 4K do not really need more than 4GB in the VRAM at one time) then I don't see how a driver solution will work.

To be clear though - I'm not saying that AMD is lying. I'm saying that I'd love to understand how they are doing this and if their solution is at all sound.

Cloudfire777 · May 20, 2015

maddie said:
Yes, but the opposite is not true. You can have 4GB cards with 256 bit and 4GB cards with 128 bit.

As already stated by several sites, you can have 2-8 stacks of 4-Hi HBM on an interposer, each stack being 1 GB.

With this in mind, why are some saying that you can't have 8GB HBM with 4096 bit?

Notice, I'm not saying Fiji will have 8GB, only that it might and that there is no hard limit preventing this from happening.

2-8 stacks for the entire 300 series, yes. That includes dual Fiji (395X2) as well.

The problem with the 8GB theory is that there are only 1GB HBM stacks available. HBM isn't scattered across the PCB like GDDR5 chips are on current cards. The HBM stacks sit on the same silicon as the GPU, inside the package, meaning there isn't room for 8 stacks of HBM.
The HBM introduction we got yesterday is full of pictures like this. It's no coincidence, both in the number of stacks and the height.

The only way to overcome the 1GB per stack/4GB in total limitation for a card is going up in height, increasing the stack of DRAM from 4-Hi to 8-Hi.
That will not be ready until the 400 series and Pascal get HBM2. Not only will we get bigger capacity, but the bandwidth from just one stack will increase from 128GB/s to 256GB/s since speed per pin is doubled.

We will most likely get Fiji Pro/XT cards with 4x1GB stacks and dual Fiji with 4x1GB per Fiji core (total 8 stacks). That's why AMD said 2-8 stacks IMO. Maybe we'll even get a 2048-bit 2GB HBM card as well.
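
The 128GB/s and 256GB/s per-stack figures fall straight out of bus width times per-pin speed. A quick sketch, assuming 1 Gbps per pin for HBM1 and 2 Gbps per pin for HBM2 (the "doubled" speed mentioned above) on the 1024-bit per-stack interface discussed earlier in the thread; the function name is made up.

```python
def stack_bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    """Peak bandwidth of one stack: pins * per-pin rate, converted from bits to bytes."""
    return bus_width_bits * gbps_per_pin / 8


hbm1 = stack_bandwidth_gb_s(1024, 1.0)   # assumed 1 Gbps per pin
hbm2 = stack_bandwidth_gb_s(1024, 2.0)   # assumed 2 Gbps per pin

print(hbm1)       # 128.0 GB/s per HBM1 stack
print(hbm2)       # 256.0 GB/s per HBM2 stack
print(4 * hbm1)   # 512.0 GB/s for a four-stack (4096-bit) card
```

These are peak numbers only; real-world throughput would depend on things this sketch doesn't model.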

Azix · May 20, 2015

Cloudfire777 said:
2-8 stacks for the entire 300 series, yes. That includes dual Fiji (395X2) as well.

The problem with the 8GB theory is that there are only 1GB HBM stacks available. HBM isn't scattered across the PCB like GDDR5 chips are on current cards. The HBM stacks sit on the same silicon as the GPU, inside the package, meaning there isn't room for 8 stacks of HBM.
The HBM introduction we got yesterday is full of pictures like this. It's no coincidence, both in the number of stacks and the height.

The only way to overcome the 1GB per stack/4GB in total limitation for a card is going up in height, increasing the stack from 4-Hi to 8-Hi.
That will not be ready until the 400 series and Pascal get HBM2. Not only will we get bigger capacity, but the bandwidth from just one stack will increase from 128GB/s to 256GB/s since speed per pin is doubled.

We will most likely get Fiji Pro/XT cards with 4x1GB stacks and dual Fiji with 4x1GB per Fiji core (total 8 stacks). That's why AMD said 2-8 stacks IMO.

Those stacks are pretty small. There is room, though there might be technical limitations.

It's 93% smaller than GDDR5 IIRC.

LTC8K6 · May 20, 2015

Azix said:
Those stacks are pretty small. There is room, though there might be technical limitations.

It's 93% smaller than GDDR5 IIRC.

There is the Pascal pic for comparison. Looks like there's only room to stack higher?

http://cdn.wccftech.com/wp-content/uploads/2014/03/NVIDIA-Pascal-GPU-Chip-Module.jpg

Cloudfire777 · May 20, 2015

Azix said:
Those stacks are pretty small. There is room, though there might be technical limitations.

It's 93% smaller than GDDR5 IIRC.

HBM is 5x7mm. Not only are you still on 28nm, where the 4096-core Fiji will take a huge amount of space, but adding 8 HBM stacks will further increase the size. Imagine 8 of these plus the GPU itself on the same silicon...
It will be big and expensive, and a waste if they've figured out that 4GB is more than enough with the memory management they will bring to life.

LTC8K6 said:
There is the Pascal pic for comparison. Looks like there's only room to stack higher?

http://cdn.wccftech.com/wp-content/uploads/2014/03/NVIDIA-Pascal-GPU-Chip-Module.jpg

That's 16nm Pascal as well, probably scaled to a reasonable expectation of real size. Not 28nm.

Going up is pretty much the only way, yes. Or increasing the capacity of each DRAM die, from 256MB to 512MB. I'm sure that will happen sooner or later.

Azix · May 20, 2015

LTC8K6 said:
There is the Pascal pic for comparison. Looks like there's only room to stack higher?

http://cdn.wccftech.com/wp-content/uploads/2014/03/NVIDIA-Pascal-GPU-Chip-Module.jpg

That's probably rubbish. They've been showing that pic since last year, no?

http://www.extremetech.com/gaming/1...w-pascal-gpu-offers-colossal-memory-bandwidth

That look has been out since last year. It's just a representation.

Cloudfire777 said:
HBM is 5x7mm. Not only are you still on 28nm, where the 4096-core Fiji will take a huge amount of room, but adding 8 HBM stacks will further increase the size. Imagine 8 of these plus the GPU itself on the same silicon...
It will be huge and expensive.

That's 16nm Pascal as well. Not 28nm.

Process doesn't matter, just the die size. 550mm2 is the max that seems to be going around. The AMD slides showed 70x70mm as the upper limit. That's an area of 4900mm2, so there's roughly tons of space left for 35mm2 stacks.
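
Back-of-the-envelope version of that area budget, using only the numbers from this thread (70x70mm upper limit, a 550mm2 GPU die, 5x7mm stacks). It is a rough sketch that counts raw area only; routing, spacing between dies, and whether the 70x70 figure refers to the interposer or the whole package are all ignored, and those could be exactly the technical limitations mentioned earlier.

```python
# All figures are the ones quoted in the thread; variable names are made up.
limit_mm2 = 70 * 70      # 4900 mm2 upper limit from the AMD slides
gpu_die_mm2 = 550        # largest die size "going around"
stack_mm2 = 5 * 7        # one HBM stack, 35 mm2
stacks = 8

hbm_total_mm2 = stacks * stack_mm2
remaining_mm2 = limit_mm2 - gpu_die_mm2 - hbm_total_mm2

print(hbm_total_mm2)   # 280
print(remaining_mm2)   # 4070
```

So on raw area alone, eight stacks plus the GPU use about 830mm2 of the 4900mm2 limit.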
