Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

Hitman928 · Jul 24, 2024

Saylick said:
Some people on Xitter are saying it might be a packaging issue, whatever that means. Obviously, that's not in reference to the box the CPU comes in, although it would be funny if the issue was to fix a typo on the box.

According to AMD, it is not a design or packaging issue, but that they discovered that not all chips that were sent out went through QA, so they are sending out new chips to make sure they were properly tested before being sold/reviewed. See post #16,557.

Edit: If it was an actual issue with the chips, there's no chance they would be able to get them fixed and new ones out the door within a week or two. It would have to be either the QA testing miss as explained, or something wrong with the microcode/firmware that they could fix and push out quickly.

poke01 · Jul 24, 2024

Hitman928 said:
According to AMD, it is not a design or packaging issue, but that they discovered that not all chips that were sent out went through QA, so they are sending out new chips to make sure they were properly tested before being sold/reviewed. See post #16,557.

Edit: If it was an actual issue with the chips, there's no chance they would be able to get them fixed and new ones out the door within a week or two. It would have to be either the QA testing miss as explained, or something wrong with the microcode/firmware that they could fix and push out quickly.

That’s not what hardwareluxx is reporting. It’s also a hardware issue.

Saylick · Jul 24, 2024

Hitman928 said:
According to AMD, it is not a design or packaging issue, but that they discovered that not all chips that were sent out went through QA, so they are sending out new chips to make sure they were properly tested before being sold/reviewed. See post #16,557.

Edit: If it was an actual issue with the chips, there's no chance they would be able to get them fixed and new ones out the door within a week or two. It would have to be either the QA testing miss as explained, or something wrong with the microcode/firmware that they could fix and push out quickly.

Maybe. Ryan seems to think it's a packaging issue as well:

https://twitter.com/x/status/1816237383895113808

poke01 · Jul 24, 2024

poke01 said:
That’s not what hardwareluxx is reporting. It’s also a hardware issue.

More info, translated:

Quality problems have prompted a complete recall of the samples and also of the processors already delivered to retailers. All processors delivered so far will therefore be replaced by a fresh production batch. AMD does not say exactly which quality problems occurred, but apparently it is a hardware problem that cannot be fixed by software.

SK10H · Jul 24, 2024

biostud said:
Do you remember the launch problems of Zen 4 and the burned sockets? Maybe AMD simply prefers a launch without bugs, as in September no one will remember whether it was a late July or early August launch. They will, however, remember if their CPU is not working.

Not sure if this affects later Zen 4, but the three retail 7900X samples I got in Feb 2023, with IHS dates of Jul-Aug 2022, all failed single-thread CoreCycler AVX2 y-cruncher/Prime95 at stock clocks with no PBO or Curve Optimizer.
I just live with the last one, with a +10 curve on some cores, but run a static clock almost all the time. I recently sidegraded to a 7800X3D at no cost for power-efficient v/f 24/7 operation, as the Zen 4 reg voltage is stupidly unoptimized below the 4.8 GHz range, as I tested last year. The X3D I have is obviously a better-quality die at lower clocks, so it passes this AVX2 test just fine at around a -20 curve.

Looking forward to people testing single-thread AVX2 CoreCycler on Zen 5 at stock clocks with no PBO/curve, and to seeing what the v/f curve looks like, since they sure know how to tweak the X3D die. 😏

Abwx · Jul 24, 2024

Hitman928 said:
Conductor resistance is a big deal on advanced nodes and channel mobility increases with lower temperature, it's not just the conductors.

I mean, we have direct tests of power use vs. temperature and decades of practical overclocking experience to tell us that your theory is not correct. I honestly thought this was just established knowledge at this point, at least in overclocking communities.

They said that the cold bug for the 9950X occurs at -130°C. It means that at this temperature the device is just too slow to work, which says that at extremely low temperatures lowered transconductance has more impact than the lower resistances.

It's just that under LN2 they must make sure that the silicon reaches a minimum temperature to be functional, because even with LN2 it will be way over this temperature once it has booted and is somewhat loaded.

Hitman928 · Jul 24, 2024

Saylick said:
Maybe. Ryan seems to think it's a packaging issue as well:

https://twitter.com/x/status/1816237383895113808

poke01 said:
More info, translated:

Quality problems have prompted a complete recall of the samples and also of the processors already delivered to retailers. All processors delivered so far will therefore be replaced by a fresh production batch. AMD does not say exactly which quality problems occurred, but apparently it is a hardware problem that cannot be fixed by software.

I mean, this is just them speculating based upon the fact that mobile isn't being recalled and it can't be fixed in software. Whereas you have an AMD rep directly stating that it isn't a hardware issue but a testing one. That makes the most sense (if they're not sending out new firmware) because, like I said, if it was something in the chip, there is zero chance they could get replacements out this quickly. It's possible that some bad samples went out because they were damaged during packaging (packaging has defects and yields too) and didn't go through the proper QA testing to catch it before shipping, but that would still be what the AMD rep said, that some chips slipped through QA and so they were sending out new chips they know went through the proper testing.

Abwx · Jul 24, 2024

That could be as trivial as badly aligned SMD caps on the CPU substrate. The chips would still work reliably, but that's something to be corrected because it wouldn't look professional.

Hitman928 · Jul 24, 2024

Abwx said:
They said that the cold bug for the 9950X occurs at -130°C. It means that at this temperature the device is just too slow to work, which says that at extremely low temperatures lowered transconductance has more impact than the lower resistances.

It's just that under LN2 they must make sure that the silicon reaches a minimum temperature to be functional, because even with LN2 it will be way over this temperature once it has booted and is somewhat loaded.

Cold bugs aren't because it's too slow, they happen because of either timing violations or that there are analog parts of the CPU that fail with the increased Vth from cold temperatures. I don't think the analog part is really a concern with modern CPUs, so it's most likely a hold time violation as the timing paths shift too far with the extreme temperatures and the data misses the edge window of the flip flop and fails to propagate to the next stage. It's not running too slow, the timings just weren't designed for that cold of operation.

Markfw · Jul 24, 2024

All I can add, is after the Intel fiasco, AMD wants to be SURE there is nothing at all wrong with what they send out, even if it causes a slight delay. 2 weeks is a slight delay. You can't get pissed about that.

Abwx · Jul 24, 2024

Hitman928 said:
I don't think the analog part is really a concern with modern CPUs, so it's most likely a hold time violation as the timing paths shift too far with the extreme temperatures and the data misses the edge window of the flip flop and fails to propagate to the next stage. It's not running too slow, the timings just weren't designed for that cold of operation.

But for a timing violation to occur, or for interstage propagation to be too slow, something has to limit the speed at which the transistors are switching, since lower resistances are supposed to help...

This means that the parasitic capacitances can't be charged fast enough, that is, that the provided currents are too low, which gets us back to too-low transistor conductance. Actually, low temperature would be an advantage for higher speed if it weren't for the transistors' worse characteristics under this condition.

In2Photos · Jul 24, 2024

Markfw said:
You can't get pissed about that.

Sure we can! This is the Internet!

Hitman928 · Jul 24, 2024

Abwx said:
But for a timing violation, or for interstage propagation to be too slow, something has to limit the speed at which the transistors are switching, since lower resistances are supposed to help...

This means that the parasitic capacitances can't be charged fast enough, that is, that the provided currents are too low, which gets us back to too-low transistor conductance. Actually, low temperature would be an advantage for higher speed if it weren't for the transistors' worse characteristics under this condition.

Timing violation does not mean too slow, it just means off. It can also be too fast. Flip flops need a narrow window for the signal to be present and held in. If the signal is too early, it will also be a timing violation. A hold time violation cannot be fixed by lowering the frequency (i.e., the signal is propagating too quickly), hence a cold bug will still be there even if you down clock as low as possible. Again, your theory is wrong. You can argue all you want, but real world tests have shown that it is not correct.
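
A minimal sketch of the two checks at issue, assuming a single launch-flop to capture-flop path. The `Path` fields and every picosecond value are made-up illustrations, not figures for any real CPU. What it is meant to show is that the clock period appears in the setup check but not in the hold check, so slowing the clock can rescue a setup failure while leaving a hold failure untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One register-to-register path; all values in picoseconds (hypothetical)."""
    clk_to_q: float    # launch flop clock-to-output delay
    logic_max: float   # slowest combinational delay between the flops
    logic_min: float   # fastest combinational delay between the flops
    setup: float       # capture flop setup requirement
    hold: float        # capture flop hold requirement
    skew: float = 0.0  # capture clock arrival minus launch clock arrival


def setup_slack(path: Path, period: float) -> float:
    """Max-delay check: data must arrive before the NEXT capture edge."""
    return period + path.skew - (path.clk_to_q + path.logic_max + path.setup)


def hold_slack(path: Path) -> float:
    """Min-delay check: data must not change too soon after the SAME edge.

    Note that the clock period is not an input here.
    """
    return (path.clk_to_q + path.logic_min) - (path.hold + path.skew)


# Illustrative numbers only: a fast min-delay path plus some clock skew.
example = Path(clk_to_q=20.0, logic_max=150.0, logic_min=5.0,
               setup=15.0, hold=10.0, skew=25.0)

for period in (150.0, 170.0, 1000.0):
    print(f"period={period:6.0f} ps  "
          f"setup slack={setup_slack(example, period):6.0f} ps  "
          f"hold slack={hold_slack(example):6.0f} ps")
```

With these made-up numbers, the setup slack goes from negative at a 150 ps period to positive once the period is lengthened, while the hold slack stays negative at every period.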

poke01 · Jul 24, 2024

Markfw said:
All I can add, is after the Intel fiasco, AMD wants to be SURE there is nothing at all wrong with what they send out, even if it causes a slight delay. 2 weeks is a slight delay. You can't get pissed about that.

Yep, I'm happy AMD is doing this. I've cooled down a bit, and yeah, better to do it now and have a smooth launch.

Josh128 · Jul 24, 2024

In2Photos said:
Sure we can! This is the Internet!

Amen, brotha.

RnR_au · Jul 24, 2024

In2Photos said:
Sure we can! This is the Internet!

If we can't have our daily drama... life becomes rather dull...

/me throws Lisa Su off the hypetrain... (╯°□°)╯︵ ┻━┻

Abwx · Jul 24, 2024

Hitman928 said:
Timing violation does not mean too slow, it just means off. It can also be too fast. Flip flops need a narrow window for the signal to be present and held in. If the signal is too early, it will also be a timing violation.

It doesn't matter if it's too early as long as the clock's rising and falling edges are fast enough. Once triggered, the flip flop will keep its state for at least the duration of a clock cycle.

Hitman928 said:
A hold time violation cannot be fixed by lowering the frequency (i.e., the signal is propagating too quickly), hence a cold bug will still be there even if you down clock as low as possible. Again, your theory is wrong.

Same as above: if the signal propagates swiftly, this allows for better level validation. What is actually a problem is when the clock signal edges are not fast enough, at which point level coherency can no longer be maintained, since the flip flops can't be switched on/off correctly if the clock signals are not well formed, no matter what the data signal levels and shapes are.

Hitman928 said:
You can argue all you want, but real world tests have shown that it is not correct.

I never use such sentences, I mean such arguments, or rather lack thereof. You know, things like "it's well known that", "it's shown in real world tests" and so on.

Josh128 · Jul 24, 2024

Arrow Lake Leak got somebody excited enough to tune up a 9950X on Geekbench. Multiple different runs at a 5950 MHz all-core OC today, probably on dry ice or LN2.

ASUS System Product Name - Geekbench

Benchmark results for an ASUS System Product Name with an AMD Eng Sample: 100-000001277-60_Y processor.

browser.geekbench.com


CouncilorIrissa · Jul 24, 2024

Josh128 said:
Arrow Lake Leak got somebody excited enough to tune up a 9950X on Geekbench.

ASUS System Product Name - Geekbench

Benchmark results for an ASUS System Product Name with an AMD Eng Sample: 100-000001277-60_Y processor.

browser.geekbench.com


5.95 GHz, it's OC'd to hell. Someone having fun with an ES.

Hitman928 · Jul 24, 2024

Abwx said:
It doesn't matter if it's too early as long as the clock's rising and falling edges are fast enough. Once triggered, the flip flop will keep its state for at least the duration of a clock cycle.

Same as above: if the signal propagates swiftly, this allows for better level validation. What is actually a problem is when the clock signal edges are not fast enough, at which point level coherency can no longer be maintained, since the flip flops can't be switched on/off correctly if the clock signals are not well formed, no matter what the data signal levels and shapes are.

I never use such sentences, I mean such arguments, or rather lack thereof. You know, things like "it's well known that", "it's shown in real world tests" and so on.

You may not like that real world tests prove your theory wrong, but that is the ultimate evidence. You can theorize all you want, but if the real life tests show something very different or even the opposite, then your theory is clearly wrong. The proof is in the pudding.

Hold time violations are also called minimum delay violations because the signal is propagating too fast, so saying it doesn't matter if it is too early is, again, wrong. These types of timing violations are frequency independent. A quick Google search will show this is true. I've never met someone who is so confidently wrong over and over again. If you want to prove me, and every digital designer out there, wrong, show some proof that what you say is true in working designs. Outside of that, best of luck to you, I won't be wasting any more time on this.

Fjodor2001 · Jul 24, 2024

So who wants to be guinea pig and buy a CPU from the first batch now, unless AMD discloses what the actual problem was and how well it could be fixed?

Josh128 · Jul 24, 2024

Fjodor2001 said:
So who wants to be guinea pig and buy a CPU from the first batch now, unless AMD discloses what the actual problem was and how well it could be fixed?

Make me a deal.

Abwx · Jul 24, 2024

Hitman928 said:
You may not like that real world tests prove your theory wrong, but that is the ultimate evidence. You can theorize all you want, but if the real life tests show something very different or even the opposite, then your theory is clearly wrong. The proof is in the pudding.

Hold time violations are also called minimum delay violations because the signal is propagating too fast,

The inside of the pudding says that timing violations occur mainly when the data signal is too late.

It can occur if the signal comes too early, but in this case only if the clock is too high and, as a consequence, there's not enough time for the stage to be triggered during the relevant clock cycle so as to hold the desired value.

So, assuming that the frequency is low enough at the start, there will be no timing violation by any means other than the transistors not switching fast enough, that is, too-low transconductance to charge the parasitic capacitances in due time, i.e., the signal being too late as a result.

branch_suggestion · Jul 24, 2024

Gives AMD enough time to launch a new AGESA for reviews.
N3B being a complete mess has really led to lots of chaos, thankfully Zen 6 development is going nicely.
But for this gen rollout, things are rough, same with GPUs for all players.

Hitman928 · Jul 24, 2024

Abwx said:
The inside of the pudding says that timing violations occur mainly when the data signal is too late.

It can occur if the signal comes too early, but in this case only if the clock is too high and, as a consequence, there's not enough time for the stage to be triggered during the relevant clock cycle so as to hold the desired value.

So, assuming that the frequency is low enough at the start, there will be no timing violation by any means other than the transistors not switching fast enough, that is, too-low transconductance to charge the parasitic capacitances in due time, i.e., the signal being too late as a result.

Prove it, otherwise. . .

A hold violation happens when data is too fast compared to the clock speed. To fix a hold violation, delay should be increased in the data path.

*Note:* Hold violations are critical and take priority over setup violations: if they are not fixed before the chip is made, there is nothing that can be done post-fabrication to fix hold problems, unlike a setup violation, where the clock speed can be reduced. The designer simply needs to add more delay to the data path.

10 Ways to fix SETUP and HOLD violation: Static Timing Analysis (STA) Basic (Part-8)

VLSI Basics, Static Timing Analysis , Parasitic Extraction , Physical Design, DFM, Interview Questions, Resume Sample and Other VLSI Information

www.vlsi-expert.com
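
As a rough sketch of the "add more delay to the data path" fix quoted above, the snippet below estimates how many delay buffers would close a given hold violation and what that costs in setup margin. The function name `fix_hold`, the single fixed buffer delay, and all picosecond values are assumptions made for illustration. Real cells have different min and max delays, and real flows pick insertion points per path.

```python
import math


def fix_hold(hold_slack_ps: float, setup_slack_ps: float, buffer_delay_ps: float):
    """Return (buffers_added, new_hold_slack_ps, new_setup_slack_ps).

    Simplifying assumption: each buffer adds the same delay to both the
    fastest and slowest version of the path, so hold slack improves and
    setup slack shrinks by the same amount.
    """
    if buffer_delay_ps <= 0:
        raise ValueError("buffer delay must be positive")
    if hold_slack_ps >= 0:
        return 0, hold_slack_ps, setup_slack_ps  # nothing to fix

    # Smallest whole number of buffers that brings hold slack to >= 0.
    n = math.ceil(-hold_slack_ps / buffer_delay_ps)
    added = n * buffer_delay_ps
    return n, hold_slack_ps + added, setup_slack_ps - added


# Hypothetical path: 10 ps short on hold, 30 ps spare on setup, 4 ps buffers.
buffers, new_hold, new_setup = fix_hold(-10.0, 30.0, 4.0)
print(f"buffers={buffers}  hold slack={new_hold} ps  setup slack={new_setup} ps")
```

Under these assumptions the fix is a design-time change: the added delay has to exist in silicon, which is consistent with the quoted note that nothing can be done about a hold problem after fabrication.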
