Question: Another potential big blow to Intel


amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Not really? There's only one ISA in the M1 - ARMv8 - and the OS has binaries that have been translated to ARMv8 instructions. A "true" hybrid x86/ARM machine would require an OS that could handle binaries compiled for different ISAs natively. It might require multiple kernels.
You (I think) are talking about two separate cores (x86 and Arm) that are accessible by the OS depending on the machine code of the application, and I was just thinking that Apple took a step towards that by adding the ability of the M1 to enable x86-like memory ordering dynamically on the hardware side. My thought is that Rosetta asking the SoC to turn on strong memory ordering on the Firestorm core is at least a step toward Rosetta asking the SoC to use the x86 core instead of the Arm one. Never mind all the OS-level machinations that would require.
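
To make that concrete, here's a minimal C sketch of the textbook message-passing pattern (not Apple's actual code; the names are mine) showing what a translator has to worry about: under x86's TSO model two plain stores stay in program order, but under ARM's weaker model the release/acquire pair below is needed - unless the hardware is flipped into a TSO mode, as the M1 reportedly can be.

    /* Message-passing litmus test. On x86 (TSO), plain stores would keep
     * their order; on ARM's weak model, the release/acquire pair (compiled
     * to barriers or stlr/ldar) is required -- or hardware TSO mode. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data;                 /* payload, written before the flag */
    atomic_int flag;

    void *producer(void *arg)
    {
        (void)arg;
        data = 42;
        /* Release store: the data write may not be reordered past it. */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        /* Acquire load: later reads may not be reordered above it. */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        printf("%d\n", data);   /* guaranteed to print 42 */
        pthread_join(t, NULL);
        return 0;
    }

A translator that can't prove which x86 accesses depend on that ordering must either fence conservatively everywhere (slow) or put the whole core into a TSO mode - which is the M1 trick.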
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
My thought is that Rosetta asking the SoC to turn on strong memory ordering on the Firestorm core is at least a step toward Rosetta asking the SoC to use the x86 core instead of the Arm one. Never mind all the OS-level machinations that would require.

An interesting point. But yes, the OS would have to handle things quite differently.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
An interesting point. But yes, the OS would have to handle things quite differently.
I wonder how different it is for an OS to feed a GPU's ISA with instructions as well as a CPU's ISA with instructions, compared to feeding two different CPUs different ISA instructions.

That is, I can't wrap my mind around the fact that I've been told that Linux can't execute foreign code... but it can execute instructions that utilize a GPU that runs on RDNA or PTX or GCN ISAs just fine. I can't wrap my mind around why an OS can send instructions to a different-ISA GPU and an x86-64 CPU but not an x86-64 CPU and an ARMv8 CPU. I'm sure there's a lot of intermediate language / compilation / kernel-level stuff that I'm not understanding.

In any case, right now, 3 hefty winter beers in, it isn't connecting in my head why one could not use something like IOMMU and HSA-enabled x86-64 and AArch64 cores to permit execution of whatever OS-specific binary you want.
 

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
I wonder how different it is for an OS to feed a GPU's ISA with instructions as well as a CPU's ISA with instructions, compared to feeding two different CPUs different ISA instructions.

That is, I can't wrap my mind around the fact that I've been told that Linux can't execute foreign code... but it can execute instructions that utilize a GPU that runs on RDNA or PTX or GCN ISAs just fine. I can't wrap my mind around why an OS can send instructions to a different-ISA GPU and an x86-64 CPU but not an x86-64 CPU and an ARMv8 CPU. I'm sure there's a lot of intermediate language / compilation / kernel-level stuff that I'm not understanding.

In any case, right now, 3 hefty winter beers in, it isn't connecting in my head why one could not use something like IOMMU and HSA-enabled x86-64 and AArch64 cores to permit execution of whatever OS-specific binary you want.

In the case of addressing workloads to a GPU, typically the OS communicates with the driver, which passes the workload to the GPU for processing. People with more GPGPU programming experience could add more, but in the case of GPGPU, you aren't really executing commands directly. It's all handled through a software stack.
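
To illustrate what that stack looks like (a minimal sketch using the stock OpenCL C host API; the kernel and variable names are made up, and error checks are omitted for brevity): the host hands the driver kernel *source*, the driver compiles it at runtime to whatever ISA the GPU actually speaks (GCN, RDNA, PTX, ...), and the CPU itself never executes a single GPU instruction.

    /* Minimal OpenCL host program: the CPU never touches GPU ISA. It hands
     * kernel source to the driver, which lowers it to the GPU's ISA. */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void doubler(__global int *buf) {"
        "    size_t i = get_global_id(0);"
        "    buf[i] *= 2;"
        "}";

    int main(void)
    {
        cl_platform_id plat;  cl_device_id dev;  cl_int err;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

        /* The driver, not the OS, compiles this source for the GPU. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "doubler", &err);

        int host_buf[4] = {1, 2, 3, 4};
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof host_buf, host_buf, &err);
        clSetKernelArg(k, 0, sizeof buf, &buf);

        size_t global = 4;   /* enqueue: the driver schedules it on the GPU */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host_buf, host_buf,
                            0, NULL, NULL);
        printf("%d %d %d %d\n", host_buf[0], host_buf[1], host_buf[2], host_buf[3]);
        return 0;
    }

So the GPU's ISA is entirely hidden behind the driver; the OS just moves buffers and commands around, which is a very different arrangement from scheduling its own processes onto a CPU.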
 
Reactions: moinmoin

beginner99

Diamond Member
Jun 2, 2009
5,223
1,598
136
Not really? There's only one ISA in the M1 - ARMv8 - and the OS has binaries that have been translated to ARMv8 instructions. A "true" hybrid x86/ARM machine would require an OS that could handle binaries compiled for different ISAs natively. It might require multiple kernels.

I don't know enough about the inner workings of SoCs/CPUs and OSes, but internally aren't CPUs basically all RISC, with x86 just having a more complex decoder? Wouldn't it be possible to make a CPU with two decoders, x86 and ARM, where the appropriate decoder is chosen depending on the code (by hardware or the OS)?

I assume the "all CPUs are RISC" is an oversimplification?
 

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
Wouldn't it be possible to make a CPU with two decoders, x86 and ARM, where the appropriate decoder is chosen depending on the code (by hardware or the OS)?

@amrnuke brought up the issue of strong vs. weak memory ordering:


So that might be a roadblock as well, though Apple can force strong memory ordering when executing "fat" binaries translated by Rosetta 2. Otherwise, at least for operating systems that use a HAL, you could probably just rely on the HAL to take care of things.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
In the case of addressing workloads to a GPU, typically the OS communicates with the driver, which passes the workload to the GPU for processing. People with more GPGPU programming experience could add more, but in the case of GPGPU, you aren't really executing commands directly. It's all handled through a software stack.
@amrnuke brought up the issue of strong vs. weak memory ordering:


So that might be a roadblock as well, though Apple can force strong memory ordering when executing "fat" binaries translated by Rosetta 2. Otherwise, at least for operating systems that use a HAL, you could probably just rely on the HAL to take care of things.
So overall my impression is that Apple could put an x86 "accelerator" or core on the M1 if it made logical sense to them to do so and if they could license x86-64 to put such an accelerator on there. Perhaps that was the issue... licensing. Or perhaps they felt that Rosetta 2 was so good, and their universal binary / Xcode so helpful in the transition period, that they didn't really need to even bother with that.

The opposite direction may be easier, because license-holders of Arm already include AMD etc. But x86 doesn't have a software compatibility issue for which putting an Arm accelerator on a chip would make much sense right now, and the power / battery life difference isn't so massive right now that it would seem to make a huge amount of sense for Intel or AMD to open that box.
 

Doug S

Platinum Member
Feb 8, 2020
2,483
4,039
136
The fact remains Apple did not include ANY support for directly executing x86 code. Rosetta translates the x86-64 binaries into ARM64 binaries, and the CPU runs that. In cases where Rosetta can't do that (stuff like x86 browsers translating JavaScript to x86 on the fly), Apple has to fall back to translating x86 to ARM on the fly.

That's why running x86 browsers was significantly slower than other x86 code run under Rosetta: static translation doesn't work, since the code doesn't exist until runtime.

Theoretically Microsoft can do a similar thing, but in reality I don't believe that's possible. Windows has a lot more legacy code and APIs than Apple had in the x86 Mac. Apple even made the task easier by phasing out support for 32-bit x86 and some old APIs like Carbon last year. They also benefit in that almost every Mac application was created with Xcode, so they may have been building x86 binaries that included "hints" to Rosetta for years to help with the translation process.

I don't think Microsoft has any choice but to do JIT-type on-the-fly translation of x86 code and accept the performance hit that implies, unless someone really believes that a dual-ISA CPU is feasible (I don't).
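
For anyone curious what that JIT approach looks like in the abstract, here's a toy C sketch (hypothetical - nothing like Rosetta's or Microsoft's real code) of the translation cache every dynamic binary translator builds: each guest block pays the translation cost once, and hot code afterwards runs from the cache, which is why steady-state performance is far better than the first pass.

    /* Toy dynamic-binary-translation loop (illustrative only). A real
     * translator decodes guest x86 blocks and emits host ARM64 code into
     * executable memory; here the "translation" is a stub so it compiles.
     * Cache collision handling is omitted. */
    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_SLOTS 1024
    typedef void (*host_code_fn)(void);        /* translated host-ISA block */

    static host_code_fn tcache[CACHE_SLOTS];   /* guest PC -> host code */

    static void stub_block(void)               /* stands in for emitted code */
    {
        puts("executing translated block");
    }

    static host_code_fn translate(uint64_t guest_pc)
    {
        /* Slow path: decode the guest instructions at guest_pc and emit
         * host code. This cost is paid once per block -- the "JIT hit". */
        (void)guest_pc;
        return stub_block;
    }

    void run_block(uint64_t guest_pc)
    {
        host_code_fn fn = tcache[guest_pc % CACHE_SLOTS];
        if (fn == NULL) {                      /* first visit: translate */
            fn = translate(guest_pc);
            tcache[guest_pc % CACHE_SLOTS] = fn;
        }
        fn();                                  /* fast path: cached code */
    }

    int main(void)
    {
        run_block(0x1000);   /* translates the block, then runs it */
        run_block(0x1000);   /* runs straight from the cache */
        return 0;
    }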
 

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
So overall my impression is that Apple could put an x86 "accelerator" or core on the M1 if it made logical sense to them to do so and if they could license x86-64 to put such an accelerator on there. Perhaps that was the issue... licensing. Or perhaps they felt that Rosetta 2 was so good, and their universal binary / Xcode so helpful in the transition period, that they didn't really need to even bother with that.

The opposite direction may be easier, because license-holders of Arm already include AMD etc. But x86 doesn't have a software compatibility issue for which putting an Arm accelerator on a chip would make much sense right now, and the power / battery life difference isn't so massive right now that it would seem to make a huge amount of sense for Intel or AMD to open that box.

Having given some thought to what is proposed here, I've concluded that the main barrier would be getting the HAL to figure out which ISA is being targeted by which application. After that it could do what a HAL does, and take calls to an API and translate them to machine code. That might require maintaining multiple, redundant APIs just to help the HAL keep things ordered correctly, which would mean recoding all your x86 AND ARM applications just to accommodate the dual-ISA environment. Anything utilizing an API that bypasses the HAL might have . . . issues.
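
As a thought experiment (purely hypothetical, and assuming Linux-style ELF binaries rather than anything Windows actually does), identifying which ISA a binary targets is the easy part - it's right in the executable header - so a dual-ISA scheduler could read it and steer the process to the matching cores. All the hard parts (syscall ABIs, page tables, APIs) are exactly what's not shown here:

    /* Hypothetical dual-ISA dispatch: read an ELF header's e_machine field
     * and report which core cluster the binary would need. */
    #include <elf.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <binary>\n", argv[0]); return 1; }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        Elf64_Ehdr hdr;
        if (fread(&hdr, sizeof hdr, 1, f) != 1) { fclose(f); return 1; }
        fclose(f);

        if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0) {
            fprintf(stderr, "not an ELF binary\n");
            return 1;
        }

        switch (hdr.e_machine) {
        case EM_X86_64:  puts("x86-64 binary -> schedule on x86 cores");  break;
        case EM_AARCH64: puts("ARM64 binary  -> schedule on ARM cores");  break;
        default:         puts("other ISA     -> emulate or reject");      break;
        }
        return 0;
    }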
 

Spartak

Senior member
Jul 4, 2015
353
266
136
What will Microsoft do to differentiate their custom chip from Qualcomm? I just don't see it making sense.

This is a strange thing to say if you are even a bit familiar with the ARM design landscape. Qualcomm uses the stock ARM designs, and although they are good, Apple is miles ahead and showing the potential of ARM. Just riding on the default ARM option offered by Qualcomm won't get Microsoft any nearer to Apple unless the ARM design team pulls some serious rabbits out of their hats.

If the Surface design team is any indicator of this effort, this will be a serious threat to both Intel and AMD.

I'm expecting Alder Lake to be my last x86 desktop.
 

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
Just riding on the default ARM option offered by Qualcomm won't get Microsoft any nearer to Apple unless the ARM design team pulls some serious rabbits out of their hats.

In terms of perf/watt, A77 isn't that much worse than A13. Actually some have said it's better? The problem here is in actually getting these designs in something other than a phone. At least Apple will give you some of their best ARM cores in a tablet or, now, a laptop.

Regardless, A78 should be a nice step up from 8cx and SQ1.
 
Reactions: Tlh97

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
This is a strange thing to say if you are even a bit familiar with the ARM design landscape. Qualcomm uses the stock ARM designs, and although they are good, Apple is miles ahead and showing the potential of ARM. Just riding on the default ARM option offered by Qualcomm won't get Microsoft any nearer to Apple unless the ARM design team pulls some serious rabbits out of their hats.

If the Surface design team is any indicator of this effort, this will be a serious threat to both Intel and AMD.

I'm expecting Alder Lake to be my last x86 desktop.

Apple have spent a decade developing their CPU design prowess, and they started out by buying a CPU design company (PA Semi). I see no indications that Microsoft has been building the depth of knowledge, talent and IP needed to compete on that level.

Just look at Samsung. Their Mongoose custom architecture tried to go big and wide like Apple, but ended up worse than stock ARM cores - while using more transistors in the process. CPU design is hard.
 
Reactions: Tlh97 and Spartak

Spartak

Senior member
Jul 4, 2015
353
266
136
Apple have spent a decade developing their CPU design prowess, and they started out by buying a CPU design company (PA Semi). I see no indications that Microsoft has been building the depth of knowledge, talent and IP needed to compete on that level.

Just look at Samsung. Their Mongoose custom architecture tried to go big and wide like Apple, but ended up worse than stock ARM cores - while using more transistors in the process. CPU design is hard.

This is all true, but then you phrased your remark a bit unclearly. The question isn't what they can improve (we've seen Apple show what can be done) but how.

Clearly Microsoft thinks they can, and as to who: I've read (was it on these forums?) that the entire SPARC team moved to Microsoft after Oracle terminated their teams in 2017.
 

Spartak

Senior member
Jul 4, 2015
353
266
136
In terms of perf/watt, A77 isn't that much worse than A13. Actually some have said it's better? The problem here is in actually getting these designs in something other than a phone. At least Apple will give you some of their best ARM cores in a tablet or, now, a laptop.

Regardless, A78 should be a nice step up from 8cx and SQ1.

That's not really a problem so much as a choice. A77 isn't much worse in perf/watt because they are/will be fabbed on the same process node. That's not what this is about.
Perf/watt might matter for the smartphone, as there is a certain top TDP that has restrained the Apple designs, but for the tablet, desktop and laptop? Not so much.

First you say perf/watt (= mobile performance, as they all operate around a similar thermal budget) is somewhat equal, but when you raise the thermal budget the Apple A14 isn't artificially limited anymore. You are comparing dual/hexa cores to quad/octa cores. So the problem isn't so much designing something for the desktop as designing a core with massive IPC that you can run at higher frequencies with more cores in a different environment. The hard part isn't adding cores or upping the frequency (at a power cost) but getting to that high IPC performance in the first place. It's just now that Apple unlocked the true power of their 'Apple silicon', but that wasn't a design or engineering enterprise as much as it was a software and ecosystem enterprise.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
That's not really a problem so much as a choice. A77 isn't much worse in perf/watt because they are/will be fabbed on the same process node.

The node isn't the only thing that determines performance per watt. And I'll reiterate, A77 has proven to be better than A13 in perf/watt according to some analysis. And perf/watt definitely matters for laptops!

Also, I don't know why you're going on about thermal budget or dual/quad/etc. core counts. A76, A77, and A78 aren't limited to the core counts seen in current SoCs. All the design tools necessary to produce an SoC with more than 4 cores are available and licensable from ARM.
 
Reactions: Tlh97

Thala

Golden Member
Nov 12, 2014
1,355
653
136
You (I think) are talking of two separate cores (x86 and Arm) that are accessible by the OS depending on the machine code of the application, and I was just thinking that Apple took a step towards that by adding the ability of the M1 to enable x86-like memory ordering dynamically on the hardware side. My though is that Rosetta asking the SoC turn on strong memory ordering on the Firestorm core is at least a step toward Rosetta asking the SoC to use the x86 core instead of the Arm one. Never mind all the OS-level machinations that would require

Still, you would need a second kernel for x86 - and I fail to see how a machine with 2 different OS kernels could work.
I'd say it is conceptually not sound having a machine with 2 different ISAs using a single OS.

It is much easier to just emulate x86 as an interim solution until the application is available as native ARM64 binary.

It also does not make any sense from a performance perspective. Having, say, 2 ARM cores and 2 x64 cores would make it impossible to run an application using all 4 cores. However, since the emulation hit is typically less than 50%, running x64 code on 4 ARM cores via emulation could be faster than running the same code natively on 2 x64 cores.
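
A quick back-of-envelope check of that claim, with illustrative numbers only (assuming equal per-core native speed and a 40% emulation hit):

    /* Illustrative throughput math for the point above: with an emulation
     * hit under 50%, 4 emulating ARM cores out-run 2 native x64 cores. */
    #include <stdio.h>

    int main(void)
    {
        double per_core   = 1.0;                 /* native per-core throughput */
        double emu_hit    = 0.4;                 /* assumed 40% emulation penalty */
        double native_2   = 2 * per_core;                    /* = 2.0 */
        double emulated_4 = 4 * per_core * (1.0 - emu_hit);  /* = 2.4 */
        printf("2 native x64 cores: %.1f, 4 emulating ARM cores: %.1f\n",
               native_2, emulated_4);
        return 0;
    }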
 
Last edited:

Thala

Golden Member
Nov 12, 2014
1,355
653
136
The node isn't the only thing that determines performance per watt. And I'll reiterate, A77 has proven to be better than A13 in perf/watt according to some analysis. And perf/watt definitely matters for laptops!

Also, I don't know why you're going on about thermal budget or dual/quad/etc. core counts. A76, A77, and A78 aren't limited to the core counts seen in current SoCs. All the design tools necessary to produce an SoC with more than 4 cores are available and licensable from ARM.

There was at least a limitation of 4 big cores when using ARM's DynamIQ IP. They apparently changed this recently with the addition of the Cortex-A78C IP.
 

Spartak

Senior member
Jul 4, 2015
353
266
136
The node isn't the only thing that determines performance per watt. And I'll reiterate, A77 has proven to be better than A13 in perf/watt according to some analysis. And perf/watt definitely matters for laptops!

Also, I don't know why you're going on about thermal budget or dual/quad/etc. core counts. A76, A77, and A78 aren't limited to the core counts seen in current SoCs. All the design tools necessary to produce an SoC with more than 4 cores are available and licensable from ARM.

OK, I'll break it down more simply for you then. Multicore performance is just slapping more cores on within the same thermal design envelope (TDP), so a 4+4-core ARM processor would offer similar MC performance to a 2+4-core Apple A14 if perf/watt is similar and both are designed around a similar TDP (which they are).*
For the current gen it actually doesn't come close in MC performance either, but I'll take your word that the next gen will somewhat close that gap.

But single-core performance is very important as well for overall performance, especially on the desktop. Point me to an ARM CPU beyond Apple that has similar (or actually better) desktop-class single-core performance.

I'll wait. That's the whole point of Microsoft's endeavour.

*Never thought I'd have to explain perf/watt * watt = perf to a regular AnandTech member.
 
Last edited:

Heartbreaker

Diamond Member
Apr 3, 2006
4,262
5,259
136
Probably nothing. I would expect MS to ape Amazon at first by designing their own in-house cloud server CPUs for Azure. Anything consumer will be handled by Qualcomm for the foreseeable future. Or I guess AMD but I am skeptical of that.


Agreed. If you read the original Bloomberg story, it's much more hesitant on PC chips:
Microsoft’s efforts are more likely to result in a server chip than one for its Surface devices, though the latter is possible, said one of the people.

It seems like the ARM server chip push is all about savings over the VERY expensive server chips from AMD/Intel.

But ARM SoCs for portable devices are not massive premium parts like server chips. If Microsoft wants more competition for ARM-Windows chips, they just need to open up ARM-Windows licensing, which is currently Qualcomm-exclusive.

If Microsoft said they would license any reasonable quad-core Cortex-X1 design, they would probably get some more takers.
 
Reactions: scannall

Doug S

Platinum Member
Feb 8, 2020
2,483
4,039
136
The node isn't the only thing that determines performance per watt. And I'll reiterate, A77 has proven to be better than A13 in perf/watt according to some analysis. And perf/watt definitely matters for laptops!

Comparing CPUs that differ in performance by 50% isn't a good way to measure perf/watt. If you clock down the A13 so that it performs as slowly as an A77 I'm willing to bet it has superior performance/watt.

Heck, just look at Apple's little cores, which are about as slow compared to the A77 as the A77 is to the A13, but use a tiny fraction of the power.

So if people want to hold up the A77 as an example of "this has better performance per watt than Apple's big cores," then they should be equally willing to hold up Apple's little cores as an example of "this has better performance per watt than A77" (and A55).
 

DrMrLordX

Lifer
Apr 27, 2000
21,797
11,144
136
There was at least a limitation to 4 big cores when using ARMs DynamIQ IP. They apparently changed this recently with the addition of the Cortex A78-C IP.

Interesting. Are you sure that wasn't a limitation of older big.LITTLE implementations? Granted, I only recently started reading about DynamIQ, so ARM might have updated all their literature on the subject to reflect the changes you mentioned.

Ok I'll break it down more simply for you then. Multicore performance is just slapping more cores on within the same thermal design envelope (TDP)

It's also interconnect. As AMD has demonstrated, sometimes interconnect can eat into your power budget, depending on your topology.

so a 4+4 core ARM processor would offer similar MC performance to a 2+4 core Apple A14

Actually, the relevant comparisons were between A77 and A13. If you want to use the almighty GB5 to compare the two (which I am somewhat loath to do, but digging into SPEC and other crap is frankly too time-consuming and threatens to take us even further off-topic) and you look at the rated TDP for the SoCs (since actual power measurements are not at my fingertips), you'll find:

Snapdragon 865: MT 3195 (TDP 5W)
Apple A13: MT 3315 (TDP 6W)

Sources:


At least in this specific benchmark, A13 only comes out with a 3.75% performance lead despite having a 20% higher TDP rating. We can't really compare A14 with A78 since A78 hasn't hit the market yet (not to speak of X1).

For the current gen it actually doesnt come close in MC performance either but I'll take your word the next gen will somewhat close that gap.

See above. No, A77 isn't competitive with A14, but it is quite competitive with its generational rival, A13.

But single core performance is very important as well for overall performance

Until the end-user ceases to notice its importance. Remember we're talking about potential laptop SoCs that may wind up in some of Microsoft's Surface products. Rest assured that MS won't use anything slower than A78 (and more likely than not, X1) in their Surface. And in laptop workloads (which are often "desktop class"), raw ST performance hasn't been a factor in ages. You are going to utilize more than one core almost constantly. While the SoC might not be pushed to its power limits all the time, when you're talking about devices that have a maximum sustained power draw in the 7-10W range, you're going to be pushing those limits (and your cores) more often than not. If you were to build hypothetical "desktop class" laptops out of A77 and A13 - which, again, will never happen - you would not see the ridiculous gap in performance demonstrated by GB5 ST materialize in many workloads. Both SoCs would be hitting their power limits pretty often with their "big" cores fully engaged more often than not.

If MS is really going to push their own "private label" ARM solution (as opposed to something sourced from Qualcomm, and I remain unconvinced that MS will turn their backs on Qualcomm completely), they can either:

a) use bog-standard ARM designs, the next generation of which will be available next year, or
b) develop their own in-house ARM CPU with a time-to-market of maybe 3 years if they're successful, meaning they'll have to guess where the market will be in 2023

In case you hadn't noticed, outside of Apple (and sort of Huawei; see Taishan v110), all the big players in the ARM mobile and server realm have reverted to standard ARM designs. Graviton2 and Ampere are using N1 cores. Samsung killed their Mongoose core series. Snapdragon 865 is A77 (okay, 8cx/SQ1 are mild departures from reference ARM designs, but not by a whole lot), while Kirin 990 is still A76. And on and on and on. Apple is mostly an outlier here. I do not expect Microsoft to be any more ambitious than Qualcomm, Amazon, or Ampere.

*Never thought I'd have to explain perf/watt * watt = perf to a regular Anandtech member.

You didn't have to, and really, you didn't.

Comparing CPUs that differ in performance by 50%

In ST? Maybe. In MT? Nope.
 
Reactions: Tlh97

naukkis

Senior member
Jun 5, 2002
779
636
136
Actually, the relevant comparisons were between A77 and A13. If you want to use the almighty GB5 to compare the two (which I am somewhat loath to do, but digging into SPEC and other crap is frankly too time-consuming and threatens to take us even further off-topic) and you look at the rated TDP for the SoCs (since actual power measurements are not at my fingertips), you'll find:

Snapdragon 865: MT 3195 (TDP 5W)
Apple A13: MT 3315 (TDP 6W)

Sources:


At least in this specific benchmark, A13 only comes out with a 3.75% performance lead despite having a 20% higher TDP rating. We can't really compare A14 with A78 since A78 hasn't hit the market yet (not to speak of X1).

So you are comparing a 4-big-core + 4-small-core Snapdragon 865 to a 2-big-core + 4-small-core A13 in MT performance? And even with half the big cores, the A13 still achieves better performance - and that's proof that the A77 is as good as Apple's cores?
 