igor_kavinski
Lifer
- Jul 27, 2020
-Guess we'll start seeing IC after AMD's APUs get squeezed by Intel.
They have enough of it too. Just need to cancel the current 7900X3D and future 9900X3D SKUs.
-I've always been perplexed by AMD's unwillingness to add IC to their APUs. Figure it would have an outsized effect in the most bandwidth-constrained scenarios.
Guess we'll start seeing IC after AMD's apu's get squeezed by Intel.
-IMO the plan is probably to match the previous generation's peak performance in the U series while using much less area to do so, and the H series gets a wider GPU to keep up the 2x gen-on-gen gains while also creeping up into Nvidia xx50 performance.
This is always the case. Do not assume SIMD32 will save huge space. We knew RDNA3 wouldn't perform well after Angstronomics revealed the compactness of the die. Features take transistors, and it is hardware, so if a die is too compact, that's suspicious. It might save some space, but if it were, say, 40% more compact, then they would have had to remove features.
-AFAIK Intel had a perfect storm on their hands: a piece of hardware that required special software attention... and a software team paralyzed by the recent war.
I think this is a better way of looking at it than just politics. They are still inexperienced.
-Let's hope they take out this stupid Resizable BAR requirement this time. For older systems, the performance uplift between Arc 1 and Arc 2 would be +130%.
Resizable BAR requirements exist because they had the iGPU mentality for so long; they didn't need to care. Now that they have a dGPU, they'll really understand what is needed.
-...have long suspected that Intel is large enough for plenty of internal politics
Especially with the manager of that time, who seemed to mostly play stupid political games instead of delivering.
-Especially with the manager of that time, who seemed to mostly play stupid political games instead of delivering.
CEO Brian Krzanich was responsible for the company's complacency during his tenure, and Intel TMG's Sohail for the 10nm delays.
Intel TMG's Sohail was the biggest borderline-criminal executive; he was forced out a couple of years ago and was responsible for the 10nm delays.
-3D cache APU? Who's here for that?
I was honestly surprised the PS5 Pro didn't add an Infinity Cache. They blew the transistor budget on extra shaders, ray tracing, etc., but didn't give it any more memory bandwidth.
Heck, AFAIK they don't have an LLC/IC in the console APUs either, where you would think it would make so much sense: it's so power efficient, which is a huge boon to getting performance out of a tiny box, and it reduces costs elsewhere in size, cooling capacity, power delivery, etc.
Thinking next gen is when we'll see it.
-I was honestly surprised the PS5 Pro didn't add an Infinity Cache. They blew the transistor budget on extra shaders, ray tracing, etc., but didn't give it any more memory bandwidth.
There's more memory bandwidth in the form of higher-clocked GDDR6: from 448 GB/s to 576 GB/s.
Regardless, the PS5 Pro doesn't need a lot more memory bandwidth, because it will actually be targeting a lower base resolution: it'll render 1080p upscaled to 4K using the AI-based PSSR, whereas the PS5 renders 1440p upscaled to 4K using temporal FSR2 or similar.
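The bandwidth and resolution numbers above can be sanity-checked in a few lines. This is a rough sketch; the base render resolutions are the ones claimed in this thread, not confirmed figures:

```python
# Figures quoted in the thread: PS5 = 448 GB/s, PS5 Pro = 576 GB/s GDDR6.
ps5_bw, pro_bw = 448, 576
bw_uplift = pro_bw / ps5_bw - 1           # raw bandwidth gain

# Claimed base render resolutions before upscaling to 4K:
ps5_pixels = 2560 * 1440                  # PS5: 1440p + temporal FSR2-style upscale
pro_pixels = 1920 * 1080                  # PS5 Pro: 1080p + PSSR
pixel_ratio = pro_pixels / ps5_pixels     # fraction of the PS5's base pixel load

print(f"bandwidth uplift: {bw_uplift:.1%}")          # 28.6%
print(f"Pro pixel load vs PS5: {pixel_ratio:.2%}")   # 56.25%
```

So the Pro gets ~29% more raw bandwidth while shading ~44% fewer base pixels per frame, which is consistent with the argument that it doesn't need a big bandwidth jump.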
-There's more memory bandwidth in the form of higher-clocked GDDR6: from 448 GB/s to 576 GB/s.
Yes, but a cache will speed it up in places where lower latency is needed, such as instruction fetches. A cache is also much better at extracting the theoretical bandwidth, for the same reason.
-Resizable BAR requirements exist because they had the iGPU mentality for so long; they didn't need to care. Now that they have a dGPU, they'll really understand what is needed.
The Draw/ExecuteIndirect speed-up for Battlemage is another one of these cases.
-Hardware team has been bottlenecking the driver team.
-NOT
-Driver team has been bottlenecking the hardware team.
Well, as long as the hardware team was able to blame the driver team!
-Well, as long as the hardware team was able to blame the driver team!
Raja getting sidelined confirmed Intel was not happy with the hardware. The blame game may have been real, but I doubt they were convincing enough.
-The blame game may have been real, but I doubt they were convincing enough.
He blamed Lisa, the name that is a success even when it's a failure!
Alchemist needs hand-tuning by the driver writers to optimize for weak APIs and for engines such as Unreal Engine 5, because Intel said Alchemist emulates a feature that UE5 uses widely.
Hardware team has been bottlenecking the driver team.
NOT
Driver team has been bottlenecking the hardware team.
-The idea I get from Chips and Cheese's microbenchmarks of the A770 is that execution latencies are high and bandwidth at low workgroup counts is low. So the hardware does look highly dependent on hand-tuned driver optimizations to keep many ALUs occupied and thus hide the low effective bandwidth. It looks a bit like the same problems GCN used to have.
Almost immediately, Intel stated that the design suffered from memory bandwidth issues. I'm pretty sure Raja said that out loud in a post-launch interview. Based on that, I assume it was already being addressed in the hardware design of the next-generation parts.
-I wonder which CPU benefits Arc the most, helping to keep it busy.
Ironically, it would be the one that is best at games: the Ryzen X3D series.
-Almost immediately, Intel stated that the design suffered from memory bandwidth issues. I'm pretty sure Raja said that out loud in a post-launch interview. Based on that, I assume it was already being addressed in the hardware design of the next-generation parts.
Saying that is akin to saying Vega suffered from memory bandwidth issues. It's just that both have a difficult time utilizing said bandwidth.
-I wonder which CPU benefits Arc the most, helping to keep it busy.
How about some ARC with your Arc? https://en.m.wikipedia.org/wiki/ARC_(processor)
I found this Mesa code today. It points to ARL-H with Xe2, as gfx20 = Xe2, but I'm not sure.
-I was disappointed with the recent rumors pointing to the same 32 Xe cores as the A770, but the architectural reveal shows that I might not need to worry.
Xe2/Battlemage current specs:
-32 Xe cores
-Higher clocks, 2.8-3 GHz
Lunar Lake shows that it's roughly 50% faster at the same basic shader/TMU/ROP specs, and that's without even counting the low-level details that might make it faster in games.
Now factor in the clocks: combine that gain with 2.8 GHz for, say, a B970 versus 2.4 GHz for the A770, and it comes out about 73% faster, also thanks to faster GDDR memory; at 3 GHz it works out to 87.5% faster than the A770. If it's also better in actual games than Alchemist, that could be enough to push it over the 2x mark and rival the RTX 4070S/4070 Ti.
32 Xe cores at 3 GHz is only 24.5 TFLOPS, so it'll turn out to be the "most powerful" GPU in its TFLOPS class!*
*OK, that's expected, as based on the rumored die size it's said to be AD103-class.
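The uplift arithmetic in the post above can be reproduced as a quick sketch. The 1.5x per-clock architectural gain is this poster's reading of the Lunar Lake reveal, and "B970" is a speculative SKU name, not a confirmed product; the post's 73% figure implies an architectural gain slightly under a clean 1.5x.

```python
# Scaling estimate: (architectural gain) x (clock ratio) relative to the A770.
arch_gain = 1.5                  # assumed Xe2-over-Xe gain at the same clock
a770_clk = 2.4                   # A770 clock in GHz
for bmg_clk in (2.8, 3.0):
    uplift = arch_gain * bmg_clk / a770_clk - 1
    print(f"{bmg_clk} GHz -> {uplift:.1%} faster than A770")
# 2.8 GHz -> 75.0% (the post rounds to 73%), 3.0 GHz -> 87.5%

# FP32 throughput, assuming 128 FP32 lanes per Xe2 core and 2 ops/clock (FMA):
tflops = 32 * 128 * 2 * 3.0e9 / 1e12
print(f"{tflops:.2f} TFLOPS")    # 24.58, matching the ~24.5 figure above
```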
64 cores would not be needed, as 32 Xe2 cores at a 3 GHz clock plus the architectural gains in games is enough to reach the 4070S. 64 would mean 2x on top of that, which I don't believe they'll reach on N5, especially with just a 1.5x perf/watt improvement.
-I found this Mesa code today. It points to ARL-H with Xe2, as gfx20 = Xe2, but I'm not sure.
Also, ONE Xe1 EU is 2x slower than ONE Xe2 EU at half the power, at least in Time Spy. But 32 Xe2 cores = 256 EUs, versus the A770's 512 EUs. At worst a 4060 Ti, at best a 4070 Super.
Also, I still think the 56/64 variant is coming, judging from the Mesa/Linux kernel dev patches I'm reading daily.
LNL Xe2 has already been above B0 stepping since four days ago.
Found kernel patches for G21 today, for example. That's the first mention of G21 in Patchwork.