Discussion Qualcomm Snapdragon Thread

Page 61 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

SpudLobby

Senior member
May 18, 2022
961
655
106
L0 caches are a waste of energy.
This is interesting — I wonder how much it differs depending on the setup though. Having a small L0 to buoy you against a higher latency L1
Don't act so smug. QC is making big mistakes. You don't conquer a market with premium devices; you do it with a broad range of price points from $300 to $999. More than that is for professionals or gamers. Microsoft/QC are dumb to think that people will max out their credit cards to afford AI stuff. The average person doesn't lie so much that, like salespeople/sales engineers/marketing folks/scummy businessmen, they need to keep a database of their lies in their head to avoid getting caught, leaving no room for mundane stuff like remembering appointments or retaining important details from emails they've already read. Only those people have a use case for Microsoft Recall at a $1000+ price point. Or maybe imposters trying to act like executives who know what they're doing.
They have another die coming.
 

Doug S

Platinum Member
Feb 8, 2020
2,479
4,035
136
This is interesting — I wonder how much it differs depending on the setup though. Having a small L0 to buoy you against a higher latency L1

But how much latency does the L1 have to have before an L0 helps? What hit rate does the L0 have to achieve to ensure it isn't making things worse? Someone with access to a simulator (like Gerard) could figure out where that point is - and whether the L0 specs that would be required to improve performance are physically possible in today's state-of-the-art processes.

An L0 is potentially more useful at Intel/AMD clock rates, since their L1's latency in cycles (given equivalent cache size/ways) must be higher due to the shorter cycle time. But L1 design isn't necessarily the same at clock rates 50% higher than Apple/Qualcomm's.

What I don't get is why it should be called an L0. If you can make your CPU run faster with a small direct-mapped cache with extremely low latency in front of your other caches, that's your L1.
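A back-of-the-envelope sketch of that break-even point (all latencies and the serial-lookup model are hypothetical, not any shipping design): if an L0 probe fully precedes the L1 probe, an L0 hit costs its own latency and a miss costs both, so the L0 only helps once its hit rate exceeds the ratio of the two latencies.

```python
def amat_with_l0(h0, l0_lat, l1_lat):
    """Average fetch latency with a serial L0 in front of the L1.
    Hypothetical model: a hit costs l0_lat cycles, a miss costs
    l0_lat + l1_lat, and every L0 miss is assumed to hit in L1."""
    return h0 * l0_lat + (1 - h0) * (l0_lat + l1_lat)

def breakeven_hit_rate(l0_lat, l1_lat):
    """Smallest L0 hit rate at which the L0 stops hurting:
    solve amat_with_l0(h, l0_lat, l1_lat) == l1_lat for h."""
    return l0_lat / l1_lat

# Hypothetical 1-cycle L0 in front of L1s of various latencies:
for l1 in (3, 4, 5):
    print(f"L1 latency {l1}c: 1c L0 breaks even at "
          f"hit rate {breakeven_hit_rate(1, l1):.0%}")
```

Under this toy model a 1-cycle L0 in front of a 4-cycle L1 has to hit at least 25% of the time just to not lose; a parallel-lookup L0 changes the math but burns the extra energy Gerard is talking about on every access.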
 
Jul 27, 2020
17,800
11,599
106
The likes of Krait had tiny L0i's and all.
In other words, L0 is a kind of a micro-op cache?

Google isn't much help and it seems most CPU designs don't denote a micro-op cache as L0.

There's even crap being taught at reputable universities like the following:



I think the main characteristic of an L0, from what I can understand, is that it has a latency of 1 cycle or less and can provide 128 bytes of cached data or decoded instructions.

Why that makes it power-wasteful, as Gerard is saying, is unclear to me.
 

adroc_thurston

Diamond Member
Jul 2, 2023
3,312
4,775
96
In other words, L0 is a kind of a micro-op cache?
Sort of, Krait L0 didn't store decoded uops, just instructions.
It was tiny (4K iirc), direct mapped and just 1c latency.
Google isn't much help and it seems most CPU designs don't denote a micro-op cache as L0.
That's because uOp caches really aren't that, they store decoded uops.
Why that makes it power wasteful like Gerard is saying
Too tiny to make any real impact with modern code/data footprints.
Either opt for a chungus opcache like AMD does or 64/64 (or bigger) Athlon-style classic and there you go.
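A toy simulation of the "too tiny" point (the fetch stream and footprints are hypothetical): a 4K direct-mapped structure works great while the hot code fits, but a loop over a footprint a few times its size evicts every line before it is reused.

```python
def directmapped_hitrate(trace, cache_bytes=4096, line=64):
    """Simulate a direct-mapped cache: one tag per set, indexed by
    (addr // line) % num_sets; hit iff the stored tag matches."""
    sets = cache_bytes // line
    tags = [None] * sets
    hits = 0
    for addr in trace:
        idx = (addr // line) % sets
        tag = addr // (line * sets)
        if tags[idx] == tag:
            hits += 1
        else:
            tags[idx] = tag  # direct-mapped: always evict on mismatch
    return hits / len(trace)

def loop_trace(footprint, reps=20, step=64):
    """Hypothetical fetch-block stream: loop over a code footprint,
    one access per 64-byte line."""
    return [a for _ in range(reps) for a in range(0, footprint, step)]

print(directmapped_hitrate(loop_trace(2048)))   # fits in 4K: 0.95 (cold misses only)
print(directmapped_hitrate(loop_trace(32768)))  # 8x the cache: 0.0 (pure thrash)
```

That cliff from ~95% to 0% once the footprint spills is the argument for either a big uop cache or just a proper 64K+ L1 instead.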
 

Nothingness

Platinum Member
Jul 3, 2013
2,722
1,357
136
In other words, L0 is a kind of a micro-op cache?

Google isn't much help and it seems most CPU designs don't denote a micro-op cache as L0.
Micro-op caches tend to sit after the L1 because the micro-op split obviously operates on fetched instructions. Also, micro-op caches are not like traditional caches. I'm sure you can find correct information on Google (and certainly in filed patents).

Also, the L1 icache now often contains extra information (IIRC on x86 it has extra bits to indicate instruction boundaries).
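A toy sketch of those extra bits (the instruction lengths are made up, and real x86 predecode happens in hardware on cache fill): alongside each icache line, store one bit per byte marking where an instruction starts, so later fetches of that line skip the variable-length decode work.

```python
def boundary_bits(insn_lengths, line_size=16):
    """Toy x86-style predecode: bits[i] == 1 iff an instruction
    begins at byte i of the cached line (lengths are hypothetical)."""
    bits = [0] * line_size
    pos = 0
    for n in insn_lengths:
        if pos >= line_size:
            break          # instruction starts past the end of this line
        bits[pos] = 1
        pos += n
    return bits

# Hypothetical 16-byte line holding instructions of length 1, 3, 2, 5, 4, 1:
print(boundary_bits([1, 3, 2, 5, 4, 1]))
```

Nothing like this is needed for fixed-width 64-bit Arm, which is the point Nothingness makes in the next post.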
 

Nothingness

Platinum Member
Jul 3, 2013
2,722
1,357
136
Predecode windows are on ARM things too.
It's not new.
Yes, that was somewhat useful for 32-bit Arm. But not for 64-bit Arm. (Only talking in the context of the x86 instruction boundaries example I gave. There's a lot of other useful information one can store in Icache; I guess many patents exist that give hints.)
 
Reactions: SpudLobby

SpudLobby

Senior member
May 18, 2022
961
655
106
Hmm.



By that standard, Altra never hit GA, and therefore doesn't exist.



I don't care enough about AMD to hate them. (Or Intel, ARM, Nvidia... I care about IBM stuff a fair bit, which is why I do hate IBM.)

You seem upset though.
He’s not really an honest actor.

Best way to think of him is a 4Chan guy with a boatload of AMD shares that most guys in these places just pander to because they’re either spineless or aren’t sharp enough to see how manipulative he is — which is a low bar.

Or the other AMD gang guys (like Uzzi etc) who are nominally sane and fine just baby him and won’t directly confront him when they disagree or know he’s over the top, it’s all very strange and degrades the forum.
 
Last edited:

SpudLobby

Senior member
May 18, 2022
961
655
106
"HPC doesn't count because reasons"



"When I said everyone I meant everyone but HPC"



Well I have evidence that Neoverse-V3 is 0.08mm2 in an implementation on TSMC N4 capable of achieving 6GHz, does iso-clock performance 35% higher than Zen7, and dissipates 110mW in that configuration! You can only see it if you give me money tho.
lol
 

Gerard Williams

Junior Member
Apr 1, 2014
8
62
91
In other words, L0 is a kind of a micro-op cache?

Google isn't much help and it seems most CPU designs don't denote a micro-op cache as L0.

There's even crap being taught at reputable universities like the following:

View attachment 99793

I think the main characteristic of an L0, from what I can understand, is that it has a latency of 1 cycle or less and can provide 128 bytes of cached data or decoded instructions.

Why that makes it power wasteful like Gerard is saying, is unclear to me.
For it to be useful, it needs a good hit rate, which means its size matters. It will be a guaranteed VIVT structure since it lives before the L1 TLB. So you need to do PA aliasing checks on it, or hold the PA inside the structure so it can be coherently managed. You can get clever and build mechanisms to help it, but you will never approach the hit rate of a 64, 80, 96, or 128KB cache. Hence the comment about wasteful energy and area. And it also needs to forward data to all the same structures that the L1 cache forwards to, so you are widening every forwarding mux. Is it inclusive with what's in L1 or L2? I can ask a million questions about how to handle this or that, but no simple answers come to mind.
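A toy illustration of the VIVT synonym problem he's describing (page size, cache geometry, and addresses are all hypothetical): when the virtual index+offset bits reach above the page offset, two virtual aliases of the same physical page can select different sets, so the same physical line can live in two places unless you do PA checks.

```python
PAGE = 4096   # 4 KiB pages: the low 12 bits are untranslated
LINE = 64
SETS = 256    # 256 sets * 64 B = 16 KiB of index reach -> 14 index+offset bits

def vset(vaddr):
    """Set selection straight from virtual address bits
    (VIVT: no TLB lookup on the hit path)."""
    return (vaddr // LINE) % SETS

# Two hypothetical virtual mappings (synonyms) of one physical page:
va1 = 0x10000
va2 = 0x23000   # same page offset, different bits above bit 12

print(vset(va1 + 0x40), vset(va2 + 0x40))
# different set indices: a store through one alias is invisible
# through the other until coherence logic reconciles them
```

Bits 12-13 of the virtual address feed the index here but differ between the aliases, which is exactly why the structure needs PA tags or aliasing checks to stay coherent.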
 

Henry swagger

Senior member
Feb 9, 2022
437
279
106
For it to be useful, it needs a good hit rate, which means its size matters. It will be a guaranteed VIVT structure since it lives before the L1 TLB. So you need to do PA aliasing checks on it, or hold the PA inside the structure so it can be coherently managed. You can get clever and build mechanisms to help it, but you will never approach the hit rate of a 64, 80, 96, or 128KB cache. Hence the comment about wasteful energy and area. And it also needs to forward data to all the same structures that the L1 cache forwards to, so you are widening every forwarding mux. Is it inclusive with what's in L1 or L2? I can ask a million questions about how to handle this or that, but no simple answers come to mind.
thank you for the brilliant explanation
 
Jul 27, 2020
17,800
11,599
106
Now would be a good time for you guys to suggest some points that Gerard may want to expound upon at his presentation.

I'll start:

What sets Oryon apart from your previous designs?

Did you achieve what you set out to accomplish or were there features/optimizations that you really wanted to put in there but had to put off for a later design to meet time to market deadlines?

Are there special hardware optimizations in there to make x86-64 binary emulation faster?

Which part of the SoC was the most time-consuming to get right and working as intended? CPU, GPU, NPU or I/O?

How do you feel about IBM's virtual shared L3 cache in Telum? Can something like this be done in the ARM SoC space and would it be more power-efficient, considering it requires a lot more inter-core communication?

Do you anticipate any of these core types superseding the others? Could the NPU, for example, become larger than the CPU and GPU in future designs?

Do you see consumer ARM designs hitting 6 GHz, power be damned?

Which do you prefer more and would rather have? Massive cache or super fast low power RAM?

Do you foresee the NPU taking part in improving cache hit rates through intelligent prediction of what the user may do next?

Are you onboard with ARM SME for your next design or would you rather do that on the GPU?

What is the optimal ratio of performance core to efficient core compute power? What would you personally prefer?

Are you using something like the Intel Thread Director in hardware to predict and tune workload balancing between performance/efficient cores or relying solely on the Windows scheduler?

What do you think about a PCIe 5.0 add-in board with an Oryon SoC that can run Windows on ARM on an x86 platform in Hyper-V, so existing PC users can have the best of both worlds?

How about re-launching the current Snapdragon Elite X development kit as a game console compatible with existing Steam libraries?

What are your thoughts on sandwiching a super fast cache between two Oryon SoCs to increase the compute power? Or stacking multiple SoCs on top of each other to create powerful workstations/servers?
 

gdansk

Platinum Member
Feb 8, 2011
2,488
3,375
136
They said AI the requisite number of times. Stock goes up.

(Also their PE was lower than a lot of their competitors so why shouldn't it go up?)
 
Reactions: Doug S