Question: Is there an absolute limit for IPC in CPU design?


MasterofZen

Junior Member
Jul 12, 2021
15
16
41
With the end of Dennard scaling, single-thread performance is more and more reliant on achieving higher IPC (instructions per cycle). With new techniques we have been able to push IPC higher, but it is getting more and more difficult (Apple's A14 and Arm's Cortex cores have seen diminishing generational improvements). I'm wondering: is there an absolute limit to IPC? If there is, how close are we to this limit (for CISC and RISC respectively)?
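For clarity, IPC here is just retired instructions divided by core clock cycles over the same interval. A minimal Python sketch, using made-up counter values (real numbers would come from a profiler such as `perf stat`):

```python
# IPC = retired instructions / core clock cycles over the same interval.
def ipc(instructions_retired: int, core_cycles: int) -> float:
    if core_cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions_retired / core_cycles

# Hypothetical counter readings for illustration: a core that retires
# 8 billion instructions in 2 billion cycles sustains an IPC of 4.0.
print(ipc(8_000_000_000, 2_000_000_000))  # 4.0
```

Modern big cores sustain roughly 2-5 IPC on real code, so the question is how much headroom remains above that.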
 
Reactions: Tlh97 and Carfax83

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
but i think Apple has already shown that there is at least 50-100% more IPC to extract with todays technology versus what we have in x86 now.
How did they show that?
The M1 is basically a collection of ASICs for specific things, and everything else runs like a dog.
We know that since 1997-98, when we got MMX and then 3DNow!, that gave us 200-400% "more IPC" compared to doing the same work on the general-purpose pipeline. That's not more IPC, though; that's a special circuit that can do only that one thing that fast.
 
Reactions: Racan

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
How did they show that?
The M1 is basically a collection of ASICs for specific things, and everything else runs like a dog.
We know that since 1997-98, when we got MMX and then 3DNow!, that gave us 200-400% "more IPC" compared to doing the same work on the general-purpose pipeline. That's not more IPC, though; that's a special circuit that can do only that one thing that fast.

Please show just ONE thing that "runs like a dog" on M1.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
Please show just ONE thing that "runs like a dog" on M1.
Please link ONE benchmark that shows anything that isn't specifically made to run on the M1.

Does not look specific at all; in fact, if they can make "specific" changes that benefit the gcc sub-test so greatly, all power to them. Surely those changes would benefit my workloads on x86 CPUs too.
What does "estimated score" mean in that review? And that's an honest question, not trying to be funny here.
Reading the article, though:
"we’ve been able to track down Apple’s compiler setting which increases the 456.hmmer by such a dramatic amount – Apple defaults the “-mllvm -enable-loop-distribute=true” in their newest compiler toolchain whilst it needs to be enabled on third-party LLVM compilers. "
Just my point: you have to design for it, otherwise it won't run well.
They even show it on a separate page: if you are not running M1-specific code you can end up at 60% of the performance, and this is with synthetic benchmarks that are already made to run as efficiently as possible.

 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I quoted the relevant parts, they compiled it especially for the M1.

News at 11, I guess? Within the Apple ecosystem, within a year or so, developers will either have their apps compiled for M1 or they won't be selling on the Apple platform.
Right now most apps where performance matters, like Adobe's suite, sound and video processing, browsers, and IDEs, are already native on M1 and perform incredibly well given the power usage of those machines.
 

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,453
136
I'm not really sure using Rosetta makes a particularly good comparison for the point you're trying to make. How would an x86 CPU (or one for any other architecture really) do if it had to run code compiled for the M1? The notion of using that as some kind of comparison is frankly absurd because no one does that in the real world.

The only reason Apple has it there at all is because they like to be secretive and not tell anyone (even their own developers) what they're doing with their product roadmap, which means that, outside of Apple itself, pretty much no one had time in advance to create and bug-test builds of their applications for this new architecture, and it would suck horribly to release a product that can't run software you need at all. Never mind older software that no one is actively updating and that will never get a native release.

I'm also not sure that compiler optimizations are cheating, or even particularly dishonest. Intel has used their own compiler to take advantage of their hardware, and really the only complaint I've ever heard about it is that it'll switch to a slow path for non-Intel CPUs even if they're capable of executing those instructions.

If Apple spends more time with the compiler they ship to make sure it produces better performance, then it just means their compiler is better than others for building Mac applications. Intentionally or knowingly using a compiler that produces worse results is similarly counterproductive. If I dusted off some ancient copy of the Borland C compiler and used it to build SPEC to make the same claims about an x86 chip that you're making here, I'd be rightly lambasted.

You might have some kind of point if these improvements required writing a lot of hand-tuned code specifically aimed at the M1 itself, but that isn't the case here. It's a compiler flag in an open-source compiler. If the M1 has some better vectorization hardware that makes such a setting more practical than on other CPUs, it's silly to complain about taking advantage of it.
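For what it's worth, loop distribution is a generic transform, not something M1-specific. A toy Python sketch of what the optimization does (the function names are mine; LLVM performs this on its IR, not on source code):

```python
# Loop distribution: split one loop containing independent statements
# into separate loops, one per dependence "stream". Each resulting loop
# touches fewer arrays, which makes it easier to vectorize.

def fused(b, d):
    # Before: a single loop performs two unrelated updates.
    a, c = [0] * len(b), [0] * len(d)
    for i in range(len(b)):
        a[i] = 2 * b[i]
        c[i] = d[i] + 1
    return a, c

def distributed(b, d):
    # After: the loop is split; each half can be vectorized
    # independently. Behavior is identical to fused().
    a = [2 * x for x in b]
    c = [x + 1 for x in d]
    return a, c
```

Semantically the two are equivalent; whether distribution actually pays off depends on the target's vector units and cache behavior, which is presumably why it is not on by default in upstream LLVM.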
 

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
I quoted the relevant parts, they compiled it especially for the M1.

You really know nothing about SPEC and its history, I guess. ALL SPEC submissions have carefully tuned compiler flags, often different for each subtest depending on what gets the best results. It has always been like that. Go look at Intel's SPEC submissions, compiled with their own C compiler that had changes made to it specific to a single SPEC test to make it run faster. Sun did the same.

I can't believe you're whining about a single flag that makes one test run faster, and using it to try to claim that M1 is rigged on everything. You are clueless.
 

scannall

Golden Member
Jan 1, 2012
1,948
1,640
136
I quoted the relevant parts, they compiled it especially for the M1.

So running things 40-50% slower doesn't qualify?!
If you are running software on x86, then you compile for x86. No surprise there. If you are running software on ARM, then you compile for ARM. How is that a problem, or 'cheating'?

That it can run anything compiled for something else at all, let alone that fast is amazing, and an indication of just how strong it is.
 
Reactions: Tlh97 and Mopetar

Thunder 57

Platinum Member
Aug 19, 2007
2,814
4,105
136
Why build an IO die on N6 though when you don't see much size reduction (remember, the physical interfaces always take up the same amount of space)? Given that AMD already can't get enough wafers to satisfy all of the demand they're seeing, it seems bizarre to go down that route....

My guess would be to cut back on power consumption, particularly idle power usage. That is one area where Intel still has an advantage. On the server parts in particular it can be quite high.

EDIT, replied to the wrong thread. Please ignore.
 

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,453
136
That it can run anything compiled for something else at all, let alone that fast is amazing, and an indication of just how strong it is.

I wouldn't put much emphasis on this, since it depends more on the emulation software than anything baked into the chip itself. I'm assuming Apple put some work into it, but it's not likely something they'll continue working on, since the expectation is for developers to release binaries compiled for ARM. Anything particularly old that no one is updating probably runs just as well as, if not better than, it did on the hardware available when it first released. Anyone who absolutely needs performance for it for whatever reason can still use an Intel Mac.
 

eek2121

Diamond Member
Aug 2, 2005
3,051
4,276
136
Please link ONE benchmark that shows anything that isn't specifically made to run on the M1.

What does estimated score mean in that review? And that's a honest question, not trying to be funny here.
Reading the article though.
"we’ve been able to track down Apple’s compiler setting which increases the 456.hmmer by such a dramatic amount – Apple defaults the “-mllvm -enable-loop-distribute=true” in their newest compiler toolchain whilst it needs to be enabled on third-party LLVM compilers. "
Just my point, you have to design for it, otherwise it won't run well.
They even show it on a separate page, if you are not running M1 specific code you can end up at 60% of the performance, and this is with artificial benchmarks that are already made to run as efficiently as possible.


You are looking at x86 code running on the M1. You misunderstand that chart completely. A better example would be the M1 running a generic ARMv8 binary with no M1-specific optimizations.
 

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
You are looking at x86 code running on the m1. You misunderstand that chart completely. A better example would be the m1 running a generic armv8 binary with no m1 specific optimizations.

In what circumstances would anyone ever run a "generic" ARMv8 binary on an M1? The first ARM Macs sold use the M1, so the feature set it includes is the baseline that compilers target for all future Macs.

It might be reasonable if you were talking about some future ARM Mac that gets a lot of benefit from compiler settings to exploit new stuff in an M4 or whatever, but produces a binary that won't run at all on an M1 or runs noticeably slower.
 
Reactions: Charlie22911

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
ARM Linux inside Docker or standalone virtual machine, old iOS applications, android emulators...

OK, so a handful of niche cases that will make up maybe 1% of usage on M1 Macs. Why should that become the bar for benchmarks? Some people run DOS emulators on Windows, so I guess we should do Windows benchmarks using binaries compiled for 8088?
 

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,453
136
ARM Linux inside Docker or standalone virtual machine, old iOS applications, android emulators...

Since it's Linux and the source is available, why wouldn't you compile it for the M1 specifically? Sure, not everyone is quite that geeky, but if performance were really important it would probably be worth it. Someone else has already done it and has a binary available anyhow.
 

Gideon

Golden Member
Nov 27, 2007
1,714
3,935
136
TheELF's comments are beyond stupid. I've used an M1 Mac Mini as a developer machine for 6 months now, and overall it's 1.5-2x faster than my 2018 MacBook Pro with a 6-core/12-thread Coffee Lake (so no slouch).

And I'm not talking about benchmarks, I'm talking about real custom-written code: microservices, backends and frontends written in Node.js, Java, and Golang. I've even tried out some toy Rust programs. It's faster in absolutely everything, including compiling and running unit and integration tests. The Coffee Lake machine spins up its fans pretty quickly; the only time I've gotten the M1's fans to even spin up was brute-force decoding 8K AV1 video (there are no accelerators for it) for ~5+ minutes. They never spin up during my normal work, while in comparison the Coffee Lake Mac is quite loud and still very hot to the touch.

I'm on vacation, so I can only show these screenshots of running the Angular frontend unit tests of a smaller project I was writing when I first got the M1 Mini, but I think these DO get the point across:

Exhibit one: 6/12 Intel Coffee Lake 15" Macbook Pro (2018):
1st run: 57s
2nd run: 28s

Notes: The fans kick in during the test-run already

Exhibit two (last screenshot): 8-core/16-thread Ryzen 3700X at stock, but with optimized DDR4-3600 CL14, custom timings:

1st run: 29s
2nd run: 18s

Notes: Probably some of it is "Windows tax," as I ran the tests on Win10, but it won't be more than a single-digit percentage.
All in all, a healthy 2x speedup over the Intel machine on the first run and about 55% on the second.

Exhibit three (second screenshot): 4 Big + 4 Small core M1 Mac Mini:
1st run: 22s
2nd run: 13s


And this is just the first example I recorded; it happens in nearly everything I run, all the time (it's just tiresome to compare on 3 rigs rather than do work). And while Node.js is a more extreme example (for the JVM it was closer to 1.5x, not 2x), it DOES show that the M1 is simply faster in daily programming work, all while being silent and cool.

But yeah, "I'm sure" it's the magic custom accelerators, and not the 8-wide architecture with a ~630-entry reorder buffer and other structures unmatched in size by the competition.

Even if those magic accelerators were the cause of a 1.5-2x speedup in arbitrary JVM and V8 virtual-machine code and their compile times (as well as for Golang and Rust)... well, congrats to Apple, then they've done something that's orders of magnitude more complex (not to say impossible) than just building the wider core.
 
Last edited:

Cardyak

Member
Sep 12, 2018
73
161
106
Attempting to steer this thread back on topic, here is another paper investigating the extent of IPC limits by simulating different parameters and rule sets: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.6121&rep=rep1&type=pdf

What is most interesting is that the results in this study are very similar to the results in the previous paper I linked, meaning there is fairly conclusive evidence from multiple sources that there are large IPC gains still to be obtained in most integer workloads.

The integer program they use in this example is GCC. Adhering strictly to data dependencies, IPC tops out around 40. Using value prediction to break data dependencies nets you an IPC of around ~240.

If an infinitely wide and unrealistically ambitious hardware model can achieve more than 200 instructions per clock, then it stands to reason that a more realistic hardware implementation with *some* limit on the number of transistors could achieve at least half of that performance.

The crucial details moving forward will be:

- How to expand and enlarge existing structures without crippling diminishing returns. This applies to depth as well as width.
- What's the best implementation for Value Prediction without causing too many pipeline flushes (High confidence, low coverage to begin with? Maybe improve from there)
- Overcoming the memory wall and granting faster access to data (Caches, Memory Controllers with optics/photonics, etc)
- Security (Meltdown and Spectre have proven that speculation can be a nightmare for security, this will only become more of a danger as speculation becomes more and more aggressive in future designs)
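To make the value-prediction idea concrete, here is a toy last-value predictor in Python (entirely my own sketch, far simpler than what the papers model): it guesses that an instruction will produce the same result it produced last time, which is exactly the kind of guess that lets a core start dependent work before the real result is known.

```python
# Toy last-value predictor: a table mapping an instruction's PC to the
# last result it produced. A correct prediction lets dependent
# instructions issue early instead of waiting on the data dependence;
# a wrong one forces a pipeline flush, hence the confidence/coverage
# trade-off mentioned above.
class LastValuePredictor:
    def __init__(self):
        self.table = {}

    def predict(self, pc):
        # Returns the guessed result, or None if this PC is unseen.
        return self.table.get(pc)

    def update(self, pc, actual):
        self.table[pc] = actual

def hit_rate(trace):
    """trace: iterable of (pc, result) pairs from an execution."""
    pred = LastValuePredictor()
    hits = total = 0
    for pc, result in trace:
        if pred.predict(pc) == result:
            hits += 1
        pred.update(pc, result)
        total += 1
    return hits / total

# A loop counter (new value every iteration) predicts terribly, while a
# loop-invariant load (same value every iteration) predicts almost
# perfectly. The PCs 0x40/0x44 are arbitrary labels.
counter = [(0x40, i) for i in range(100)]
invariant = [(0x44, 7)] * 100
print(hit_rate(counter), hit_rate(invariant))  # 0.0 0.99
```

Real proposals use stride predictors, context-based predictors, and confidence counters on top of this, but the core mechanism is the same.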
 
Reactions: Schmide

Viknet

Junior Member
Nov 14, 2020
9
10
51
OK, so a handful of niche cases that will make up maybe 1% of usage on M1 Macs. Why should that become the bar for benchmarks?
I'm not saying it should be used for benchmarks. Just pointing out that there are real use-cases for running "generic" ARMv8 binaries.

Since it's Linux and the source is available why wouldn't you compile it for the M1 specifically? Sure not everyone is quite that geeky, but if performance were really important it would probably be worth it. Someone else has already done it and has a binary available anyhow.
The main reason for using Docker is to have *exactly* the same environment on the server and on the developer machine.
And no one in their right mind would recompile Ubuntu (or any binary distro), all its packages, and external Docker images just to run some apps slightly faster.
 

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
- How to expand and enlarge existing structures without crippling diminishing returns. This applies to depth as well as width.

Diminishing returns are a fact of life; they cannot be avoided in attempts to increase IPC any more than they can be avoided in enlarging caches or improving branch-prediction accuracy, because pretty much everything you do to increase IPC is subject to the same diminishing returns.
 

DrMrLordX

Lifer
Apr 27, 2000
21,805
11,161
136
OK, so a handful of niche cases that will make up maybe 1% of usage on M1 Macs. Why should that become the bar for benchmarks? Some people run DOS emulators on Windows, so I guess we should do Windows benchmarks using binaries compiled for 8088?

Because people constantly compare Apple's CPU design team to those of other companies whose chips may or may not run software from the Apple ecosystem. To properly compare the hardware, it's important to have some benchmarks in common, and that means benchmarks that don't necessarily use ASICs/embedded coprocessors (along with the software ecosystem that makes their use ubiquitous). If AMD tried schlepping fixed-function hardware in their x86 CPUs, it would likely go unutilized, since AMD couldn't force developers to recompile for it. Also see the post by Gideon below.

I've used an M1 Mac Mini as a developer machine for 6 months now and overall it's 1,5-2x faster than my 2018 Macbook Pro with 6 core 12 thread coffee lake (therefore no slouch).

You're getting better results than Phoronix did when they tried recompiling common FOSS benchmarks for M1. Might be time for them to revisit the issue, though I doubt they will due to their focus on Linux.

Even if those magic accelerators were the cause of a 1.5-2x speedup in arbitrary JVM and V8 virtual-machine code and their compile times (as well as for Golang and Rust)... well, congrats to Apple, then they've done something that's orders of magnitude more complex (not to say impossible) than just building the wider core.

The problem with fixed-function hardware is that it's . . . fixed function. Update your software and the hardware might not do anything anymore. Not that I expect the speedups you're seeing are from fixed-function hardware per se; I would think you'd have to make specific function calls to said hardware to utilize it.
 

Doug S

Platinum Member
Feb 8, 2020
2,493
4,060
136
Because people constantly juxtapose Apple's CPU design team to that of other companies that may or may not run software from the Apple ecosystem. To properly compare the hardware, it's important to have some benchmarks in common, and that means benchmarks that don't necessarily use ASICs/embedded coprocessors (along with the software ecosystem that makes their use ubiquitous). If AMD tried schlepping fixed-function hardware in their x86 CPUs, it would likely go unutilized since AMD couldn't force developers to recompile for it. Also see post by Gideon below.


Where's the evidence that the code generated by the Mac compiler is using resources outside the CPU core? I agree that if it were doing that, the results would not be proper to compare to other hardware, but AFAIK that is not happening.

Source code isn't being compiled into code that runs on the GPU, NPU, or other things on the SoC that aren't part of the CPU core. I don't think the generated object code even uses AMX (which IS part of the CPU core): I don't believe Apple's LLVM will emit AMX instructions. Instead, Apple wants developers to make AMX library calls, probably because that lets them change AMX without worrying about backwards compatibility.
 