Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,754
6,631
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe because the US Government is starting to prepare the software environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940-specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before things get merged to GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
That said, I believe Hopper had the problem of no host CPU capable of PCIe 5 being available in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts; the MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 
Last edited:

Racan

Golden Member
Sep 22, 2012
1,199
2,201
136
I'd like to "thank" this thread for helping me rationalize a 7900 XTX that should arrive around the time CES happens and that I can return until the end of January.

Should it really get weird, I’ll have what’s likely to be the fastest (and hungriest) AMD card available until 2027 to fall back on.

Yay.
And yet the 7900 XTX is the most efficient AMD card; it's not actually far from Nvidia:


 

blckgrffn

Diamond Member
May 1, 2003
9,501
3,816
136
www.teamjuchems.com
About 4% less energy efficient than the 4070:


The Mercury Magnetic is what I ended up with after Walmart repeatedly denied my attempts to purchase the ASRock or Sapphire options available there via Newegg. That takes it down a few notches.

I ended up investing an extra $100 vs. the PowerColor option because it sounds like the Mercury card invested in paste that I won't have to immediately revisit. It's amazing, when you read about these cards, how many of them need repasting essentially out of the box.

I plan to use a lower power limit and Chill to keep things from getting roasty toasty. I already upgraded to a solid 850 W RMx, which I thought would be plenty, but I didn't really have the XTX in mind when I bought it.
 

Bryo4321

Member
Dec 5, 2024
28
59
51
The Mercury Magnetic is what I ended up with after Walmart repeatedly denied my attempts to purchase the ASRock or Sapphire options available there via Newegg. That takes it down a few notches.

I ended up investing an extra $100 vs. the PowerColor option because it sounds like the Mercury card invested in paste that I won't have to immediately revisit. It's amazing, when you read about these cards, how many of them need repasting essentially out of the box.

I plan to use a lower power limit and Chill to keep things from getting roasty toasty. I already upgraded to a solid 850 W RMx, which I thought would be plenty, but I didn't really have the XTX in mind when I bought it.
That's the model I have. The fans can be a little droney, but it's great! My hotspot delta has actually decreased from when I first got it. PTM7950 from the factory is the way to go, a huge selling point for this model. My friend has the Nitro and it's gorgeous, but not having PTM is a bummer: his hotspot temps have gotten worse while mine are basically unchanged.

For reference, the Mag Air seems to be capped at 390 W at default power settings, up from the reference model's 355 W.
 

gaav87

Senior member
Apr 27, 2024
452
794
96
Maybe we saw 9070 non-XT 3DMark scores
Well, we already had RTX 5000 boxes leaked. No 9070 XT boxes yet. That could mean an end-of-January launch or even later.
I would not be surprised if they even changed the name back to 9700 XT/9700 XTX.
Even 9750 XT and 9700 XT would be better.
 

Gideon

Golden Member
Nov 27, 2007
1,921
4,668
136
Whoa, somehow I missed this:


While the HW is good, the AMD software is in a far worse state than I imagined (you can see the frustration in the article). Not being on Nvidia's level is understandable and expected. All of this absolutely is not:

Issues even with bare-bones basic GEMM libraries:
It is important to note that calling GEMM is a simple task, and we shouldn't expect to run into AMD software bugs. Unfortunately, a major bug that we encountered is that the torch.matmul and F.Linear APIs had been delivering different performance on AMD for a couple of months during the summer. One would expect the torch.matmul and F.Linear APIs to have the same performance, but, surprisingly, F.Linear is much slower!


This is a strange bug as torch.matmul and F.Linear are both wrappers around the hardware vendor GEMM libraries, so they should achieve the same level of performance. F.Linear, in particular, is important, as this is the way most end users in PyTorch launch the GEMM kernels.


When we started testing AMD five months ago, the public AMD PyTorch still had this bug. The root cause was that AMD in fact has two different underlying GEMM libraries, rocBLAS and hipBLASLt, with hipBLASLt being more optimized for the MI300X. The bug was that torch.matmul uses the optimized hipBLASLt, but AMD had not changed F.Linear by default, leaving it to use the unoptimized rocBLAS library.


This major bug was ultimately fixed by AMD a few months ago after our bug reports, and we hope it doesn’t reappear due to a lack of proper regression testing. AMD’s usability could improve considerably if it boosted its testing efforts instead of waiting for users to discover these critical issues.
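A minimal sketch (mine, not from the article) of how one could catch this class of bug: time torch.matmul against F.linear on identical shapes. Both are wrappers around the same vendor GEMM, so a large gap points at something like the rocBLAS/hipBLASLt split described above. Shapes, dtype, and iteration counts here are arbitrary.

Code:
import time

import torch
import torch.nn.functional as F

def bench(fn, iters=100):
    # Warm up so one-time kernel/library initialization is excluded.
    for _ in range(10):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # GEMMs launch asynchronously
    return (time.perf_counter() - start) / iters

device = "cuda"  # ROCm builds of PyTorch expose HIP devices as "cuda" too
x = torch.randn(8192, 8192, device=device, dtype=torch.bfloat16)
w = torch.randn(8192, 8192, device=device, dtype=torch.bfloat16)

# Mathematically the same GEMM: F.linear(x, w) computes x @ w.T
t_matmul = bench(lambda: torch.matmul(x, w.t()))
t_linear = bench(lambda: F.linear(x, w))

print(f"torch.matmul: {t_matmul * 1e3:.2f} ms, F.linear: {t_linear * 1e3:.2f} ms")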

AMD doesn't dogfood their own code, and new versions sometimes have terrible performance regressions that are not caught for months unless clients themselves report them:
AMD needs to hook up thousands more MI300X and MI325X GPUs to PyTorch CI/CD for automated testing to ensure there are no AMD performance regressions or functional AMD bugs. Nvidia has given thousands of GPUs for PyTorch CI/CD to ensure an amazing out-of-the-box experience

...

Due to poor internal testing (i.e. "dogfooding") and a lack of automated testing on AMD's part, the MI300 is not usable out of the box and requires considerable amounts of work and tuning. In November 2024 at AMD's "Advancing AI", AMD's SVP of AI stated that there are over 200k tests running every evening internally at AMD. However, this seems to have done little to ameliorate the many AMD software bugs we ran into, and we doubt AMD is doing proper CI/CD tests, including proper performance regression, functional, and convergence/numerics testing. We will outline a few examples here for readers to understand the nature of the AMD software bugs we have encountered and why we feel they have been very obstructive to a good user experience on AMD.

Although AMD’s own documentation recommends using PyTorch native Flash Attention, for a couple months this summer, AMD’s PyTorch native Flash Attention kernel ran at less than 20 TFLOP/s, meaning that a modern CPU would have calculated the attention backwards layer faster than an MI300X GPU. For a time, basically all Transformer/GPT model training using PyTorch on the MI300X ran at a turtle’s pace. Nobody at AMD noticed this until a bug report was filed following deep PyTorch/Perfetto profiling showing the backwards pass (purple/brown kernels) took up far more time than the forward pass (dark green section). Normally, the backwards section should take up just ~2x as much time as the forward pass (slightly more if using activation checkpointing).
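(A sketch of the kind of profiling that surfaces this, mine rather than the article's: wrap one attention forward+backward in torch.profiler and compare kernel times. Shapes and dtype are arbitrary; on a healthy stack the backward kernels should take roughly 2x the forward.)

Code:
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Batch, heads, sequence length, and head dim are arbitrary.
q, k, v = (torch.randn(4, 16, 2048, 128, device="cuda",
                       dtype=torch.bfloat16, requires_grad=True)
           for _ in range(3))

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = F.scaled_dot_product_attention(q, k, v)
    out.sum().backward()

# Compare the forward attention kernels against the backward ones; backward
# dominating far beyond ~2x forward is the red flag described above. The
# exported trace can be inspected in Perfetto / chrome://tracing.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("attention_trace.json")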



Another issue we encountered was that the AMD PyTorch attention layer led to a hard error when used with torch.compile, due to the rank of the logsumexp Tensor being incorrect. What was frustrating is that this had already been fixed in internal builds of AMD PyTorch on May 30th, but did not reach any AMD PyTorch distributions or even any PyTorch nightly builds until October, when it was pointed out to them that there was a bug. This demonstrates a lack of testing and dogfooding of the packages AMD puts out to the public. Another core reason for this problem is that the lead maintainer of PyTorch (Meta) does not currently use the MI300X internally for production LLM training, leading to code paths not used internally at Meta being buggy and not dogfooded properly. We believe AMD should partner with Meta to get their internal LLM training working on MI300X.
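(Again a sketch of mine, not the article's reproducer: the failure described is exactly the kind of thing a minimal torch.compile-plus-attention training step would trip over, which is also why it should be cheap to cover in CI.)

Code:
import torch
import torch.nn.functional as F

class TinyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        # Dispatches to the fused attention kernels under the hood.
        return F.scaled_dot_product_attention(q, k, v)

model = torch.compile(TinyAttention())
q, k, v = (torch.randn(2, 8, 512, 64, device="cuda",
                       dtype=torch.bfloat16, requires_grad=True)
           for _ in range(3))

# On the broken builds described above, a call like this reportedly hit a
# hard error in the backward graph; on a healthy stack it just runs.
loss = model(q, k, v).sum()
loss.backward()
print("compiled attention forward+backward ran without error")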



AMD's developers themselves have no internal clusters and are forced to use external hardware (they are also understaffed):
AMD RCCL is a fork of Nvidia NCCL. AMD's RCCL Team and many other teams at AMD are resource limited and don't have enough compute or headcount to improve the AMD ecosystem. AMD's RCCL Team currently has stable access to fewer than 32 MI300Xs for R&D, which is ironic, as improving collective operations is all about having access to many GPUs. This is frankly silly; AMD should spend more on giving their software teams access to more GPUs.

This contrasts with Nvidia’s NCCL team, which has access to R&D resources on Nvidia’s 11,000 H100 internal EOS cluster. Furthermore, Nvidia has Sylvain Jeaugey, who is the subject matter expert on collective communication. There are a lot of other world class collective experts working at Nvidia as well, and, unfortunately, AMD has largely failed to attract collective library talent due to less attractive compensation and resources – as opposed to engineers at Nvidia, where it is not uncommon to see engineers make greater than a million dollars per year thanks to appreciation in the value of RSUs.


TensorWave has generously sponsored AMD with a medium-sized cluster in order to help the RCCL Team have greater resources to do their jobs. The fact that TensorWave, after buying many GPUs, has to give AMD GPUs for them to fix their software is insane.
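(For anyone unfamiliar with what NCCL/RCCL actually provide: they implement collective operations such as all-reduce across GPUs. A minimal torch.distributed sketch, assuming a PyTorch build where the "nccl" backend maps to NCCL, or RCCL on ROCm, launched with e.g. torchrun --nproc_per_node=8 script.py:)

Code:
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # backed by RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own value; after all_reduce every rank
    # holds the sum over all ranks.
    t = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()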

This is what the users have to go through to get basic stuff running:

The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60-command Dockerfile that builds dependencies from source, hand-crafted by an AMD principal engineer, was provided specifically for us, since the PyTorch nightly and public PyTorch AMD images functioned poorly and had version differences. This Docker image requires ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience that takes but a single line of code. Most users do not build PyTorch and hipBLASLt from source but instead use the stable release.


When using public PyTorch, users have the choice of working with the latest stable images or a nightly PyTorch upload. So although a nightly PyTorch upload may have the latest commits, which could potentially lead to better performance or fix some bugs, users must accept that the upload may not be fully tested and could contain new bugs from Meta/AMD/Nvidia or other PyTorch contributors that have not been discovered yet. Note that most end users are using the stable release of PyTorch.


To anyone who has actually shipped software that other developers need to use (I have), this is indefensible sloppiness (starting at the management level) and should be unacceptable.

This means that AMD's entire GPU software department is either flat-out incompetent or doesn't prioritize AI development at all (outside Meta, who does all the heavy lifting for their specific use case). The latter would be a valid excuse if AMD didn't prioritize datacenter GPUs so heavily (upping the schedule to annual releases and investing heavily in that area).

You can't just skimp on the software like this and expect the clients to dogfood your broken code.

AMD is not a broke small company. If you can do stock buybacks, you can find the resources for the software stack of your "major push" area.

I hope all of this is because they are putting all their efforts into MI355X generation, but I have deep reservations about that.
 
Last edited:

carrotmania

Member
Oct 3, 2020
96
245
106
While the HW is good, the AMD software is in a far worse state than I imagined (you can see the frustration in the article). Not being on Nvidia's level is understandable and expected. All of this absolutely is not:
I read that as quite a biased article. I'm not saying that AMD doesn't have issues, but CUDA has had 20 years of investment, and we know that ROCm only got really serious this year, now that AMD has R&D money. But the article keeps banging on and on about stuff they simply can't know. And then they think themselves high and mighty because THEY have told AMD how to do software development. Sure. A lot like LinusTech claiming he told Intel how to market the B580. Absolutely laughable.

The article actually publishes a telling graph, but rather than being optimistic, they STILL go hard on the negative.



According to the article, AMD "performs worse than H100/H200 on GPT 1.5B", which is true. Then it barely recognises that someone somewhere is actually making progress with AMD's software, giving up to a 2x lift. But the point of the article is that AMD is still slower, not that they are making progress after less than a year of real investment. It also glosses over the fact that AMD hardware is in the biggest machines in the world, so it can't all be bad.

Again, I'm not saying AMD is fantastic in this area, but that article is a hit piece. It's all "AMD is way better than a year ago, BUT THAT DOESN'T MATTER!"
 

ToTTenTranz

Senior member
Feb 4, 2021
278
522
136
Whoa, somehow I missed this:


While the HW is good, the AMD software is in a far worse state than I imagined (you can see the frustration in the article). Not being on Nvidia's level is understandable and expected. All of this absolutely is not:

Issues even with bare-bones basic GEMM libraries:


AMD doesn't dogfood their own code, and new versions sometimes have terrible performance regressions that are not caught for months unless clients themselves report them






AMD's developers themselves have no internal clusters and are forced to use external hardware (they are also understaffed):


This is what the users have to go through to get basic stuff running:




To anyone who has actually shipped software that other developers need to use (I have), this is indefensible sloppiness (starting at the management level) and should be unacceptable.

This means that AMD's entire GPU software department is either flat-out incompetent or doesn't prioritize AI development at all (outside Meta, who does all the heavy lifting for their specific use case). The latter would be a valid excuse if AMD didn't prioritize datacenter GPUs so heavily (upping the schedule to annual releases and investing heavily in that area).

You can't just skimp on the software like this and expect the clients to dogfood your broken code.

AMD is not a broke small company. If you can do stock buybacks, you can find the resources for the software stack of your "major push" area.

I hope all of this is because they are putting all their efforts into MI355X generation, but I have deep reservations about that.



Not trying to defend the sloppiness, but it seems to me that AMD's business model for their compute cards is to gain big contracts and then do custom software integration for those clients.

They're probably not selling the MI300 in single units through distributors very often.

Of course, the problem with this is that they might be reinventing the wheel every time they get a new big client, and there's a bunch of stuff they already implemented that never gets to see the light of day on the public stack.


For large contracts this is probably not that big of an issue, but this way they'll never get wide adoption from universities, RTOs, smaller companies and even hobbyists. Which is a shame.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,790
4,109
106
Half of Nvidia's revenue comes from 4 companies. So it's a very limited market in terms of companies.

Yup. And both Microsoft and Meta have high-value internal applications which use MI300X.

So it is far more important that AMD performs superbly on those apps before dealing with a rando developer with 0% market share.
 

bearmoo

Junior Member
May 8, 2018
12
13
81
The question, then, is why they don't have a team dedicated to bridging the software gap between internal and external. Or could it be that they are simply focusing much more on inference at this point?
 
Reactions: lightmanek

basix

Member
Oct 4, 2024
41
75
51
Or could it be that they are simply focusing much more on inference at this point?
Exactly that. Most of AMD's showcases compare the MI300X on inference workloads. It is their priority. Nvidia has NVLink scale-out networking, which AMD does not have, and such networking is crucial for training. So AMD plays to its biggest strength: 8-GPU server inference with the maximum HBM capacity possible. And some (limited) 8-GPU training stuff.

And AMD is fortunate that OpenAI recently declared another scaling law mainly focused on inference (multi-iteration inference / chain of thought). Because of that, the inference market will get even bigger.
 

Hitman928

Diamond Member
Apr 15, 2012
6,524
11,805
136
Yup. And both Microsoft and Meta have high-value internal applications which use MI300X.

So it is far more important that AMD performs superbly on those apps before dealing with a rando developer with 0% market share.

I don't know, I hear from a guy named George on Twitter, along with his Twitter bros, that AMD should instead focus on supporting a model that prioritizes running on consumer hardware that is completely unused by the industry at large. The Twitter bros say that AMD is missing out on millions and millions of dollars by not supporting him.
 

Jan Olšan

Senior member
Jan 12, 2017
467
874
136
Maybe we saw 9070 non-XT 3DMark scores

I think VideoCardz noted that the scores were suspicious because, with an early leak, 3DMark would likely not recognize the final SKU name of the card and should be displaying something else.
Not a 100% certain clue, I suppose. But those guys are more experienced in these matters than most (I noticed that they ignored the GeForce RTX pricing rumours too, possibly for unstated reasons).
 