Question Is there an absolute limit for ipc in CPU design?


MasterofZen

Junior Member
Jul 12, 2021
15
16
41
With the end of Dennard scaling, single-thread performance is more and more reliant on achieving higher IPC (instructions per clock). With new techniques we have been able to push IPC higher, but it is getting more and more difficult (Apple's A14 and Arm's Cortex cores have seen diminishing generational improvements). I'm wondering whether there is an absolute limit for IPC. If there is, how close are we to this limit (for CISC and RISC respectively)?
 
Reactions: Tlh97 and Carfax83

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
Where's the evidence that the code generated by the Mac compiler is using resources outside the CPU core? I agree that if it was doing that then the results would not be proper to compare to other hardware, but AFAIK that is not happening.

Source code isn't being compiled into code that runs on the GPU, NPU, or other things on the SoC that aren't part of the CPU core. I don't think generated object code even calls AMX (which IS part of the CPU core, but I don't think Apple's LLVM will generate AMX instructions; instead, Apple wants developers to make AMX library calls, probably because that lets them make changes to AMX without worrying about supporting backwards compatibility).
Every instruction set a CPU has is an ASIC and is part of the cores (things like sse/mmx/avx/fma and so on).
The whole scandal about the Intel compiler was that it would check which ASICs a CPU has by looking at the model of the CPU and checking it against a list of known CPUs and instruction sets, and would default to generic code if the CPU was "unknown". That made Intel CPUs run faster, because they got special code for their special hardware, while AMD had to run generic x86 code only.
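To make the difference concrete, here is a minimal sketch of the two dispatch policies, assuming a made-up CPU description. The model names, the feature flags and the `KNOWN_CPUS` table are all hypothetical; this is not Intel's actual dispatcher, only an illustration of "look the CPU up in a list" versus "ask the CPU which extensions it reports".

```python
# Hypothetical sketch: all model names, flags and tables are invented for illustration.

# Code paths the compiled program ships with, fastest first.
PATHS = ["avx2", "sse2"]

# A list-based dispatcher only knows the CPUs its author put in this table.
KNOWN_CPUS = {
    "vendor-a-model-1": {"sse2", "avx2"},
}


def best_path(features):
    """Return the fastest shipped code path supported by the given feature set."""
    for path in PATHS:
        if path in features:
            return path
    return "generic"


def dispatch_by_model_list(cpu):
    """Pick a path by looking the CPU model up in a list of known CPUs."""
    known = KNOWN_CPUS.get(cpu["model"])
    if known is None:
        # Unknown CPU: fall back to generic code, whatever it actually supports.
        return "generic"
    return best_path(known)


def dispatch_by_feature_flags(cpu):
    """Pick a path from the extensions the CPU itself reports."""
    return best_path(cpu["features"])


listed = {"model": "vendor-a-model-1", "features": {"sse2", "avx2"}}
unlisted = {"model": "vendor-b-model-9", "features": {"sse2", "avx2"}}

for cpu in (listed, unlisted):
    print(cpu["model"], dispatch_by_model_list(cpu), dispatch_by_feature_flags(cpu))
```

With the list-based policy the unlisted CPU ends up on the generic path even though it reports the same extensions as the listed one.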
 
Reactions: podspi

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,167
136
Where's the evidence that the code generated by the Mac compiler is using resources outside the CPU core?

I never suggested that there was. There are plenty of applications available under macOS for the M1 that ARE using resources outside of the CPU core, but I directed you to Gideon's post because he is producing software for his own use that (apparently) DOESN'T use fixed-function hardware. I must admit that it would be a pretty neat trick if Apple had figured out how to involve fixed-function hardware without coders specifically utilizing it through appropriate function calls.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Yes, there is a limit to IPC and that bottleneck has become program design in modern times ...

Bad OOD practices are still used virtually everywhere, and they thrash the caches because we end up fetching unnecessary data. You'll see both lower performance and lower parallelism in general, regardless of how many instructions the CPU is capable of executing in parallel ...

We invented a new programming paradigm known as "Data-Oriented Design" to solve the memory access problems which indirectly benefitted parallelism including multithreading too ...
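A minimal sketch of the layout difference, assuming a toy particle update. The `Particle` fields and function names are hypothetical, and Python hides the real memory layout, so this only illustrates the access pattern: the object-style loop walks whole objects that also carry "cold" fields, while the data-oriented version keeps only the two "hot" fields in contiguous arrays.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Particle:
    # Object-style layout: hot fields (x, vx) sit next to cold ones.
    x: float
    vx: float
    name: str            # cold: never touched by the update loop
    debug_history: list  # cold: never touched by the update loop


def step_aos(particles, dt):
    """Array-of-structures update: every iteration visits a whole object."""
    for p in particles:
        p.x += p.vx * dt


class ParticlesSoA:
    """Structure-of-arrays layout: each hot field is one contiguous array of doubles."""

    def __init__(self, xs, vxs):
        self.x = array("d", xs)
        self.vx = array("d", vxs)


def step_soa(ps, dt):
    """Data-oriented update: only the two arrays that matter are streamed through."""
    x, vx = ps.x, ps.vx
    for i in range(len(x)):
        x[i] += vx[i] * dt


aos = [Particle(float(i), 0.5 * i, f"p{i}", []) for i in range(4)]
soa = ParticlesSoA([p.x for p in aos], [p.vx for p in aos])

step_aos(aos, 0.1)
step_soa(soa, 0.1)
print([p.x for p in aos])
print(list(soa.x))
```

Both versions compute the same positions; in a compiled language the second layout is the one that keeps cache lines full of useful data and is easier to vectorize or split across threads.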
 
Reactions: Gideon

cytg111

Lifer
Mar 17, 2008
23,560
13,120
136
Well… ponder this then
The largest and fastest Turing-complete computer in the universe is the universe. There is nothing more “if then else” than reality. Contemplating this, and the fact that the speed of light really does look like a finite boundary on how fast stuff can move in this box, I must deduce that yes, there is a limit to IPC.
 

Nothingness

Platinum Member
Jul 3, 2013
2,769
1,429
136
Every instruction set a CPU has is an ASIC and is part of the cores (things like sse/mmx/avx/fma and so on).
What you describe are instruction set extensions and are part of the CPU. This has nothing to do with ASIC.

The whole scandal about the Intel compiler was that it would check which ASICs a CPU has by looking at the model of the CPU and checking it against a list of known CPUs and instruction sets, and would default to generic code if the CPU was "unknown". That made Intel CPUs run faster, because they got special code for their special hardware, while AMD had to run generic x86 code only.
And how does that relate to Apple? Their software only works on their CPU. I've been playing with both Apple LLVM and a version of gcc that wasn't provided by Apple and found no significant difference in the speed of generated code on my test programs.

Apple are not cheating. Their CPU is just one of the top-performing CPUs on the market; there is no need to cheat.
 

Mopetar

Diamond Member
Jan 31, 2011
8,019
6,471
136
Well… ponder this then
The largest and fastest Turing-complete computer in the universe is the universe. There is nothing more “if then else” than reality. Contemplating this, and the fact that the speed of light really does look like a finite boundary on how fast stuff can move in this box, I must deduce that yes, there is a limit to IPC.

The coolest thing about this is that there are still problems so hard to solve that this universal supercomputer is incapable of solving them before its own heat death.

I do wonder if we'll ever move computers beyond the constraints of the Turing machine.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

Some of the most incredible data I have ever seen on actual performance bottlenecks of the ZEN3 architecture. It gives great insights into workloads and what could be done to improve performance.

Some interesting insights:

1) As expected, a Cinebench-style rendering workload fits ZEN3's cache subsystem, and with ~0.2% L3 misses there isn't much that more L3 cache would do to improve performance here.
2) A workload like Linpack needs more and bigger registers and more L1 data cache. More backend resources would help as well.
3) As I long expected about Paradox game performance, all those calculations, scripts and so on are very hard to execute and present a tough nut for both backend and frontend: a huge and unpredictable working set that tends to miss L3 and also has horrible characteristics for code caching. It just beats the hell out of the whole chip, stressing everything.
4) Civ6 in fact turned out to be a very tame workload; I expected AI turn calcs to destroy modern CPUs, but that does not seem to be happening.
5) Timespy: synthetic benchmarks could not be further from actual game CPU perf estimation.


So there is plenty that ZEN4 can do to increase IPC.
 

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
That's a nice article.

My takeaway is that enlarging the ROB and the load and store queues is the seemingly obvious route to possibly significant improvements in Zen 4. For comparison, Andrei mentioned the following ROB depths in his M1/A14 piece:
"A +-630 deep ROB is an immensely huge out-of-order window for Apple’s new core, as it vastly outclasses any other design in the industry. Intel’s Sunny Cove and Willow Cove cores are the second-most “deep” OOO designs out there with a 352 ROB structure, while AMD’s newest Zen3 core makes due with 256 entries, and recent Arm designs such as the Cortex-X1 feature a 224 structure."
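A rough way to see why ROB depth matters is a back-of-the-envelope bound: while one long-latency cache miss blocks retirement, the core can hold at most one ROB's worth of instructions in flight, so IPC over that window cannot exceed ROB entries divided by the miss latency (and never the machine width). The sketch below applies that to the ROB depths quoted above. The 300-cycle miss latency and the 8-wide limit are assumptions picked for illustration, not measured values for any of these cores.

```python
def ipc_bound_during_miss(rob_entries, miss_latency_cycles, width):
    """Upper bound on IPC while one long-latency miss blocks retirement.

    At most `rob_entries` instructions can be in flight until the miss
    returns, and the core can never exceed its width.
    """
    return min(width, rob_entries / miss_latency_cycles)


# ROB depths as quoted in the post above.
ROB_ENTRIES = {
    "Apple A14/M1 (about 630)": 630,
    "Sunny Cove / Willow Cove": 352,
    "Zen 3": 256,
    "Cortex-X1": 224,
}

ASSUMED_MISS_LATENCY = 300  # cycles; assumption, roughly a trip to DRAM
ASSUMED_WIDTH = 8           # instructions per cycle; assumption, same for all

for name, rob in ROB_ENTRIES.items():
    bound = ipc_bound_during_miss(rob, ASSUMED_MISS_LATENCY, ASSUMED_WIDTH)
    print(f"{name}: IPC <= {bound:.2f} while a miss is outstanding")
```

Under these assumptions a deeper ROB raises the ceiling roughly in proportion to its size, which is one reason it looks like an obvious lever. Real cores overlap several misses and are limited by the load and store queues as well, so treat the numbers as a trend only.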
 
Jul 27, 2020
17,992
11,727
116
The secret Uganda deal that has brought NSO to the brink of collapse | Ars Technica

For example, when Google reverse-engineered the hack used against American diplomats in Uganda, it found an elegant, tiny piece of code that adapted software from 1990s Xerox machines to fit a so-called Turing machine—essentially a complete computer—into a single GIF file.
“Pretty incredible, and at the same time, pretty terrifying,” said Google’s engineers. “Wow. Just wow,” tweeted Yaniv Erlich, an Israeli professor of computer science at Columbia University.

The evil genius(es) who came up with the idea of a complete computer inside a GIF file could advance computing at a much faster pace if they would only use their minds constructively.

Also,
Examining Soft Machines' Architecture: An Element of VISC to Improving IPC (anandtech.com)
Report: Intel has quietly bought chip startup Soft Machines for $250M - SiliconANGLE

Here is an example of a VISC design with four physical cores. The design can handle four ‘virtual cores’ or threads as well, but what makes the VISC design different is that when the virtual core has a thread of instructions, it can use the resources of any physical core. Thus, if each physical core is a 4-wide out-of-order design, if a thread running on a virtual core can utilize the resources of all four cores essentially making a giant 16-wide design, then under VISC can do so.

I'm so tired of wondering if this will ever see the light of day in my room.
 

Doug S

Platinum Member
Feb 8, 2020
2,508
4,111
136
The NSO hack wasn't "running a Turing machine inside a GIF file": it was not a Turing machine, nor was there an actual GIF file involved. The actual story is even more amazing/scary than that, though. They used bitwise operations like OR and NOT, made possible via an obsolete part of the PDF standard, to emulate NAND gates, then used 70K of those to implement a simple 64-bit CPU that searched RAM, allowing them to bypass ASLR defenses! One can only guess the number of man-hours from VERY smart people that went into developing it. I've linked the Project Zero writeup below, but it kind of elides a few details, so I'll post a simplified account here.

The first stage of the exploit is of the type we're all too familiar with. In this case there was a bug in iMessage where a method that was intended to copy a GIF upon receipt did so by rendering it into a new GIF - and that rendering occurred outside of Apple's "BlastDoor" sandbox. Worse, any file ending in ".gif" was passed to this method, but the image renderer detected the type of image so it worked on any image type. Thus in this exploit, a PDF was provided with a GIF extension, which caused that to be rendered.

The second stage of the exploit leveraged a bug in open source JBIG2 code used inside Apple's PDF decoder which allowed a buffer overflow to access arbitrary memory. Perfect example of "legacy debt" since JBIG2 is functionally obsolete, but some very old PDF files won't render without it.

Since there was no way to run any code at this stage, the worst that could be done with this exploit would be to crash the phone: with ASLR you don't know what is where unless you can search through memory for something specific.

In the third stage they solved the inability to run code by leveraging JBIG2's bitmap operations to emulate AND, OR, XOR and XNOR, which of course allows emulating NAND gates. They used 70,000(!) such operations to emulate a simple 64-bit CPU, which allowed searching through memory to find what they needed and moving on to the fourth and subsequent stages of the exploit, which eventually leveled up enough to take complete control of the phone. Who knows how many separate exploits were ultimately strung together to achieve that, and how much work went into them all (though some may have been purchased from others rather than developed in house).
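To show why those four operations are enough, here is a small sketch that builds NAND out of AND and XNOR, a full adder out of nothing but NAND, and a 64-bit adder out of full adders. This is a toy illustration on single bits in ordinary Python; the function names are made up, and it is not the construction used in the exploit.

```python
# The four primitive operations mentioned above, on single bits (0 or 1).
def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def XOR(a, b):
    return a ^ b


def XNOR(a, b):
    return (a ^ b) ^ 1


def NAND(a, b):
    # XNOR with a constant 0 inverts a bit, so NAND is just inverted AND.
    return XNOR(AND(a, b), 0)


def full_adder(a, b, carry_in):
    """One-bit full adder built from nine NAND gates."""
    t1 = NAND(a, b)
    half_sum = NAND(NAND(a, t1), NAND(b, t1))        # a XOR b
    t4 = NAND(half_sum, carry_in)
    total = NAND(NAND(half_sum, t4), NAND(carry_in, t4))  # half_sum XOR carry_in
    carry_out = NAND(t1, t4)                         # (a AND b) OR (half_sum AND carry_in)
    return total, carry_out


def add64(x, y):
    """64-bit ripple-carry adder made of full adders; wraps around on overflow."""
    result, carry = 0, 0
    for i in range(64):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result


print(add64(123456789, 987654321))
```

This adder alone uses 64 x 9 = 576 NAND gates, which gives a feel for how a register file, comparator and control logic could add up to tens of thousands of operations.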

Project Zero writeup
 

Doug S

Platinum Member
Feb 8, 2020
2,508
4,111
136
I keep checking every day for the next article where they further break down the NSO hack. There should be more amazing stuff coming, though it will be hard to top designing a custom 70K-gate 64-bit CPU as part of a smartphone hack.

On the one hand it makes you feel a bit relieved that such extreme lengths were required for such a bad hack to work; on the other hand it is terrifying that someone actually went to such extreme lengths. And NSO is (maybe soon "was", as they look to be in financial trouble now with loans they can't repay) a commercial enterprise, with shareholders it is responsible to. The NSA, GCHQ, FSB, MSS and Mossad have far less limited budgets that are totally opaque to their "shareholders".
 

diediealldie

Member
May 9, 2020
77
68
61