Question: Diablo 4 causing GPUs to die


Ranulf

Platinum Member
Jul 18, 2001
2,407
1,305
136
Just a heads up for anyone trying the Diablo 4 beta this weekend. The game is apparently bricking 3080 Ti cards, Gigabyte ones in particular. There are reports of it hitting other cards too, including AMD.


While Diablo IV's lenient PC spec requirements suggest a well-optimized game, some users are sharing troubling reports of their expensive graphics cards failing during gameplay. There have been multiple reports of NVIDIA RTX 3080 Ti GPUs failing while playing the Diablo IV early access beta, with symptoms like GPU fan speed skyrocketing to 100% followed by an outright hardware shutdown.

Blizz forum post on it:


Jayz2c video:

 

BFG10K

Lifer
Aug 14, 2000
22,709
2,979
126
And it's worth noting that with DX12 and Vulkan, games are much closer to the metal than they were with older APIs.
Can you please demonstrate anything intrinsic in those APIs that's specifically designed to kill cards? Show us a kill_hardware(), fry_card(), invoke_RMA_process(), or similar. Thanks.

The issue with New World was the game itself being poorly coded, which resulted in load characteristics that caused massive spikes in current.
Games do not cause "massive spikes in current" or even "load characteristics". They call an intermediate software stack called an API.

The API is then translated by the GPU driver, which causes the GPU UEFI/hardware to react based on how the driver programs it, including physical changes such as current and transistor load.

We do not yet know what is causing these failures, but it would not be surprising if it is caused by a load profile that induces radical current transients.
Which again means a faulty driver/UEFI/hardware stack on nVidia's cards.
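To make the stack concrete, here's a minimal sketch in C against the Vulkan API (assuming the Vulkan SDK headers and loader are installed; the application name is made up). Nothing in it touches hardware directly: every call is dispatched through the Vulkan loader into the vendor's driver, which is what actually programs the GPU.

#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void) {
    /* The "game" side: describe ourselves to the API... */
    VkApplicationInfo app = {0};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.pApplicationName = "stack-demo";        /* hypothetical name */
    app.apiVersion = VK_API_VERSION_1_0;

    VkInstanceCreateInfo ci = {0};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo = &app;

    /* ...and ask the loader to open the driver stack. */
    VkInstance instance;
    if (vkCreateInstance(&ci, NULL, &instance) != VK_SUCCESS) {
        fprintf(stderr, "no Vulkan driver available\n");
        return 1;
    }

    /* Even "how many GPUs are there?" is answered by the driver. */
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    printf("driver reports %u physical device(s)\n", count);

    vkDestroyInstance(instance, NULL);
    return 0;
}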
 
Reactions: KompuKare

Ranulf

Platinum Member
Jul 18, 2001
2,407
1,305
136
Trying out the beta this weekend. The game is nice enough, but they have issues in the code: stuttering everywhere, in cinematics and gameplay, and worse in gameplay once you get into the open world with other people. The bigger problem is the memory leak. The game was using 20GB of system RAM last night, and 22GB today before it crashed on exit. 8GB of VRAM usage too, at 1080p.

Edit: I also noticed that the game reset graphics settings to high after I lowered them to medium last night.
 
Last edited:

VashHT

Diamond Member
Feb 1, 2007
3,076
882
136
I played both betas quite a bit on my 3080 Ti and didn't really notice any problems, though I do have my FPS capped at around 160.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
BFG10K said: Can you please demonstrate anything intrinsic in those APIs that's specifically designed to kill cards? Show us a kill_hardware(), fry_card(), invoke_RMA_process(), or similar. [...] Which again means a faulty driver/UEFI/hardware stack on nVidia's cards.

WTF are you talking about? At no point did I say there is some API call that will kill GPUs. I specifically talked about load characteristics. The software that is running is what dictates the load characteristics. In the case of a game, there are tens of thousands of API calls made every second, and that can go up to hundreds of thousands for some types of GPU loads. No one call will cause the issue being seen; what calls are being made is what generates the load. Different calls have a different type of impact on the GPU. The idea that the driver is the one that dictates load is ludicrous.
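(For scale: even a modest 1,000 draw calls per frame at 60 FPS is already 60,000 API calls a second, before counting state changes, uploads, and compute dispatches.)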

But clearly I just need to leave this forum for a while, as people are taking what I am saying, completely ignoring it, and twisting it into something else. The days of having a constructive conversation are apparently gone.
 

DooKey

Golden Member
Nov 9, 2005
1,811
458
136
Stuka87 said: WTF are you talking about? At no point did I say there is some API call that will kill GPUs. [...] The days of having a constructive conversation are apparently gone.
LOL. This forum has been a joke for years if you're looking for a constructive discussion.
 
Reactions: DAPUNISHER

blckgrffn

Diamond Member
May 1, 2003
9,197
3,183
136
www.teamjuchems.com
Stuka87 said: WTF are you talking about? At no point did I say there is some API call that will kill GPUs. [...] The days of having a constructive conversation are apparently gone.

FWIW, I think you are talking sense.

GPUs are basically full-on systems inside our PCs, 100% integrated in the sense that the GPU is not socketed. They are like tuned-up cars: a dummy driving one does something so corner-case that it wasn't engineered for, and something dies.

These cards could have “failed as designed” when some part of the card that was a known weakest link died. The manufacturers have to choose where to spend and save money on their boards, after all, and if boards aren't failing by the hundreds, it's probably just bad luck that the manufacturer made a sub-optimal component choice or used a bad batch of something.

For all we know, it could be some sort of onboard circuit breaker dying and the actual “GPU” is just fine. But unlike a PC that blows part of its power delivery, where we just replace the mobo/PSU that failed, the card is dead until someone at the integrator determines which part failed. For all we know, it's something they repair before putting the cards into the RMA pool.

Also, I do think software manufacturers have a responsibility to ensure their pile of compromises, half-baked ideas, and laziest solutions doesn't face-tank hardware into submission. It's just irresponsible to have your menus run at 1,200 FPS: it's hugely wasteful, unpleasant to tune configurations for, and disappointing for a decidedly not-shoestring effort like D4. IMO, clearly.
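A frame cap costs almost nothing to write. A minimal sketch in C (render_menu_frame() is a placeholder for the game's real draw call, and this assumes POSIX clock_nanosleep; a real engine would hang this off its present path):

#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Placeholder for the game's actual menu rendering. */
static void render_menu_frame(void) { /* draw calls would go here */ }

int main(void) {
    const long frame_ns = NSEC_PER_SEC / 60;   /* cap menus at 60 FPS */
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int frame = 0; frame < 600; frame++) {  /* ~10 s for the demo */
        render_menu_frame();

        /* Sleep until the next frame deadline instead of spinning. */
        next.tv_nsec += frame_ns;
        if (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_sec += 1;
            next.tv_nsec -= NSEC_PER_SEC;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    puts("600 frames, ~60 FPS, GPU mostly idle");
    return 0;
}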
 

Ranulf

Platinum Member
Jul 18, 2001
2,407
1,305
136
All I can say at this point, from my play testing, is that D4 does not run well at the high settings it defaulted to on my 4790K/2060S system, and it seems that way on most people's hardware. Tonight, at medium settings, the game was rather stable for an hour, using no more than 7.5GB of system RAM and with no stuttering. A lot less than the 20-22GB at high settings earlier today and last night.
 
Reactions: igor_kavinski

Timorous

Golden Member
Oct 27, 2008
1,727
3,152
136
A menu running at 1,200 FPS should not cause a GPU to fail under any circumstances. Ideally a dev would limit that to reduce GPU load, but failing to do so in a beta/alpha should not kill cards.

If that does kill a card, it is either a hardware fault (possibly NV, possibly the AIB, or possibly just a specific unit/batch) or a driver/firmware fault.
 

coercitiv

Diamond Member
Jan 24, 2014
6,387
12,812
136
FWIW, I think you are talking sense.
He's talking perfect sense now from a technical point of view, whereas his initial reaction was aimed at the user's point of view. The average Joe should never have to understand or worry about transient loads, and should absolutely never have to fear turning off vsync in a game. Running a game without a frame cap is a bad idea, but the only tax the user should pay for it is higher power consumption. Risk of failure due to transient load is not acceptable; it means either the card is badly built or the entire design is faulty.

Ranulf said: All I can say at this point, from my play testing, is that D4 does not run well at the high settings it defaulted to on my 4790K/2060S system, and it seems that way on most people's hardware. [...]
I had lots of stutter when teleporting into towns and/or leaving towns on foot (and I'm not talking about the rubber-banding between areas). This is on a 12700K/6800XT. Graphics settings did not seem to impact it, and memory usage can't be the problem either, as the system has 64GB; the one time I checked RAM usage it was hovering around 15GB, though I assume your 20-22GB were also due to the smaller VRAM pool. My issues may have been E-core related. This morning I logged in one last time, and the town-specific stutter seems mostly gone.

The beta was supposed to have ended by now; I assume they're leaving it up a bit longer since they're adding fixes and would like to test them.
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
6,387
12,812
136
What about a low-end GPU that can't even get to the framecap?
It runs at 100% but falls short... does it have to blow up?
It's not the same scenario. The big dGPUs are much more likely to push board structures and/or components far enough to risk failure. Think of the EVGA cards that died in New World: the problem was a weak solder joint that failed under extremely high currents. Had the board TDP been much lower, the issue might have been impossible to reproduce. The transients introduced by low-end cards are much easier to handle with the typical components and overall build standards that board makers are used to.

As a simple analogy, think of the new 12VHPWR connector issues: would they happen with 150W cards? Even with less-than-perfect pin contact, the current per pin would be much lower and would probably never result in failure.
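For rough numbers, assuming the connector's six 12 V supply pins share current evenly: a 600 W card pulls 600 W / 12 V = 50 A, about 8.3 A per pin, while a 150 W card pulls 12.5 A, about 2.1 A per pin, a quarter of the stress on any single contact.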
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Yes, @coercitiv is right. It's the push to ridiculous power levels that is causing these failures. They are absolutely pushing the limits in all directions.

Remember when 980 Tis could be overclocked to 20-30% higher performance? Now we barely get 5%, and the cards can't even handle stock? What kind of nonsense is that?

The first rule in selling is knowing the market. If you can't absolutely control the product so it doesn't fail under common conditions, then your product sucks, period.

I think after the overhyped conditions in the GPU market during the crypto boom, we're seeing a general crash. The pendulum is swinging the other way, and it'll reach lows never seen before.
 

blckgrffn

Diamond Member
May 1, 2003
9,197
3,183
136
www.teamjuchems.com
So high-end can't be pushed to 100%? Haven't they been engineered... "good"?

That’s not it.

From a systems engineering approach, these cards are very complex systems.

When designing them, the engineers have to try to account for all of the happy paths, the less-than-optimal paths, the lazy paths, and also the downright malicious paths that a user (attacker) might take.

Then each component needs to be selected to withstand the impacts of these varied use scenarios.

At some point, the designers have to accept that under certain conditions the system will put components under stress so brief or so unlikely that those components will not be fully rated for those loads. I am sure that if we talked to the actual engineers of these cards, they could tell us specifically which scenarios they identified as risks but decided not to accommodate because the triggers are so niche/unlikely. Obviously they publish best coding practices, etc., and not “do this to break our product” guides, because that would be incredibly problematic.

These are not human life critical systems in a rocket ship. They are gaming trinkets.
 

Mopetar

Diamond Member
Jan 31, 2011
8,004
6,446
136
I think part of the issue is that the company making the GPU isn't the one building the cards. That creates a higher probability of crucial information not reaching the people building the actual card, and creates situations where failure is a lot more likely on the bleeding edge.
 
Reactions: ZGR and blckgrffn

Rebel_L

Senior member
Nov 9, 2009
451
63
91
blckgrffn said: At some point, the designers have to accept that under certain conditions the system will put components under stress so brief or so unlikely that those components will not be fully rated for those loads. [...]
While it may be a good business decision, it is not good engineering to ignore vulnerabilities that you are aware of.
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,979
126
The idea that the driver is the one that dictates load is ludicrous.
O'rly? Explain how a GPU gets a load without a driver. This isn't x86 running directly on a CPU. Every GPU has a completely different ISA.

The driver absolutely loads the card. A simple example is a driver framerate cap: the game makes exactly the same API calls, but the driver dictates the final load by throttling the software stack as needed.
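Conceptually (a toy simulation in C, not real driver code; driver_present_hook() is a hypothetical name), a driver-level cap is the same deadline-sleep idea as an in-game cap, just inserted below the API so the game's calls never change:

#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

static struct timespec g_deadline;
static long g_cap_ns = NSEC_PER_SEC / 60;   /* user set a 60 FPS driver cap */

/* Hypothetical stand-in for the driver's internal present entry point.
   The application presents as fast as it can; the driver inserts the wait
   and thereby dictates the final load. */
static void driver_present_hook(void) {
    g_deadline.tv_nsec += g_cap_ns;
    if (g_deadline.tv_nsec >= NSEC_PER_SEC) {
        g_deadline.tv_sec += 1;
        g_deadline.tv_nsec -= NSEC_PER_SEC;
    }
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &g_deadline, NULL);
    /* ...the real flip/queue submit would happen here... */
}

int main(void) {
    clock_gettime(CLOCK_MONOTONIC, &g_deadline);
    /* The "game": an uncapped loop hammering present. */
    for (int frame = 0; frame < 120; frame++)
        driver_present_hook();
    puts("120 frames presented at the driver's pace, not the game's");
    return 0;
}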
 
Last edited:
Reactions: KompuKare

blckgrffn

Diamond Member
May 1, 2003
9,197
3,183
136
www.teamjuchems.com
While it may be a good business decision, it is not good engineering to ignore vulnerabilities that you are aware of.

Sounds nice, but it's just naive. How terrible an input power supply are they required to handle? Are they going to insulate them from solar flares too?

At some point they have to be “used as intended” or they will break like any other tool. Usually they just break in a less permanent manner.

And that’s ignoring the fact that with an engineering stack as complicated as these there will be issues that are unknown at the time of engineering that manifest themselves after. Look at the hardware and software stack that’s required for these to function - and the bugs and hand optimizations/fixes that happen at a title by title basis. It’s clear that that some of this interaction between software and hardware is made up as they go along, putting fingers in the holes in the dam.

I guess it’s not hard for me to conclude with a systems engineering background that conditions exist for naughty software to put the hardware at risk, especially considering how non-standard and fully variable a hand built PC is.

What if there are other bad variables? Lmao, like if these Gigabyte GPUs were bundled with their garbage PSUs for some sort of promotion 😂
 

Rebel_L

Senior member
Nov 9, 2009
451
63
91
blckgrffn said: Sounds nice, but it's just naive. How terrible an input power supply are they required to handle? Are they going to insulate them from solar flares too? [...]
If you don't think those things are accounted for, you are the naive one. While I'm sure they don't sit down and individually account for every variable they can think of, and instead cover it with a standard safety margin they slap onto the specs, that doesn't mean it's not accounted for. If GPUs couldn't handle solar flares, everyone would lose their hardware on a daily-to-weekly basis.

No one said it's easy to design these things with all the variables to consider, but it seems pretty obvious that if your card's circuitry allows x power to reach a component, that component had better be capable of handling x power. If your idea of "intended use" for a GPU doesn't cover situations like trying a public beta of a new game that uses your hardware differently than a previous game, you should put large disclaimers on the box warning people about that. Working in an industrial field, I can tell you that design there is expected to cover the possibilities: if you want a cheaper component in place, you had better have interlocks or something else that ensures the part can't be exposed to over-spec conditions. Not designing for that is certainly not considered "good" engineering.
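In code terms, the interlock argument is just this (a toy sketch, not vendor firmware; all names are made up):

#include <stdio.h>

/* Toy model of a power-limit interlock: whatever the workload requests,
   the delivery path clamps it to what the weakest rated component can take. */
typedef struct {
    double rated_watts;   /* what the component is built to survive */
} component;

static double request_power(const component *c, double requested_watts) {
    /* The interlock: an over-spec request never reaches the part,
       no matter what the game, menu, or beta build asked for. */
    return requested_watts > c->rated_watts ? c->rated_watts : requested_watts;
}

int main(void) {
    component vrm = { .rated_watts = 450.0 };
    printf("asked 600 W, delivered %.0f W\n", request_power(&vrm, 600.0));
    printf("asked 300 W, delivered %.0f W\n", request_power(&vrm, 300.0));
    return 0;
}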

It does seem that the engineering on GPUs is generally pretty decent, as these kinds of issues are not that common, but that doesn't mean we shouldn't call them out when they screw up and come up with flimsy excuses to try and deflect blame.
 

Mopetar

Diamond Member
Jan 31, 2011
8,004
6,446
136
Let's just be glad that none of these companies design automobiles. They'd make the Pinto look like a five star product by comparison.
 