Apple A7 is now 64-bit

jhu · Sep 28, 2013

Nec_V20 said:
There is no such thing as an "Apple A7", Apple just gets others to make something for them to their specs. They have no fabs and no expertise whatsoever in designing processors.

lol

Eug · Sep 28, 2013

Tentative A7 analysis:

We publish this with the caveat that these are best guesses – we have not done any real circuit extraction to confirm them. The dual-core CPU and cache make up ~17% of the die area, and the quad-core GPU and shared logic about 22%. The CPU itself is not packed the same way as the A6 (see below), it looks much more like a conventional automated layout; although Linley Gwennap thinks that it’s still Apple designed, not the first ARM A53/57 usage. There’s a great review of the A7’s capability over at AnandTech.

We know from our analysis of the 32 nm A6 chip that the 6 transistor SRAM cell area was ~0.15 µm², so if we shrink that, we can guesstimate the 28 nm 6T SRAM cell to be ~0.12 µm². If we further allow a conservative 40%-50% utilization to allow for the row and column circuitry, then we get densities of ~1 MB for the L2 cache, and ~256 KB for the L1 cache.
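The SRAM arithmetic in the paragraph above can be sanity-checked with a short sketch. It assumes that cell area scales with the square of the node name (a rough rule of thumb, not something the analysis states), and the function names are made up for illustration. It only shows how much area a 1 MB array would need at the quoted cell size and utilization; it does not use any measured die dimensions.

```python
# Back-of-the-envelope check of the quoted SRAM estimate.
# Assumption: cell area shrinks with the square of the node ratio (32 nm -> 28 nm).

A6_CELL_UM2 = 0.15   # 6T SRAM cell area at 32 nm, from the quoted analysis
A7_CELL_UM2 = 0.12   # guesstimated 28 nm cell area, from the quoted analysis


def shrunk_cell_area(cell_um2, old_nm, new_nm):
    """Scale a cell area by the square of the linear shrink."""
    return cell_um2 * (new_nm / old_nm) ** 2


def array_area_mm2(capacity_bytes, cell_um2, utilization):
    """Area of a cache block of the given capacity.

    utilization is the fraction of the block occupied by bit cells;
    the remainder is row and column circuitry.
    """
    bits = capacity_bytes * 8
    raw_um2 = bits * cell_um2
    return raw_um2 / utilization / 1e6  # um^2 -> mm^2


scaled = shrunk_cell_area(A6_CELL_UM2, 32, 28)
print(f"ideal 28 nm cell: {scaled:.3f} um^2 (quoted guess: {A7_CELL_UM2} um^2)")

for util in (0.40, 0.50):
    area = array_area_mm2(1024 * 1024, A7_CELL_UM2, util)
    print(f"1 MB at {util:.0%} utilization: {area:.2f} mm^2")
```

Under these assumptions the ideal shrink lands slightly below the quoted ~0.12 µm², and a 1 MB array needs roughly 2.0 to 2.5 mm².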

One thing we haven’t identified is the memory block used for the data from the fingerprint sensor – Jony Ive, in his video about the sensor, clearly states that the information is stored securely in the A7 and emphasized it with this shot below.

Nothingness · Sep 28, 2013

Eug said:
Tentative A7 analysis:

We publish this with the caveat that these are best guesses – we have not done any real circuit extraction to confirm them. The dual-core CPU and cache make up ~17% of the die area, and the quad-core GPU and shared logic about 22%. The CPU itself is not packed the same way as the A6 (see below), it looks much more like a conventional automated layout;

I don't think L1 are correctly placed.

Exophase · Sep 28, 2013

Nothingness said:
I don't think L1 are correctly placed.

They quite obviously are not. I don't know how someone at Chipworks made this kind of mistake. The blocks just diagonal of the L2 and to the opposite side of core are the L1 caches. My guess is the one closer to the L2 is dcache. Stuff just to the left of those blocks can include tags, TLB entries, and buffers. Area in the center to the right of the L2 SRAM arrays would be L2 tags.

Ajay · Sep 29, 2013

Exophase said:
They quite obviously are not. I don't know how someone at Chipworks made this kind of mistake. The blocks just diagonal of the L2 and to the opposite side of core are the L1 caches. My guess is the one closer to the L2 is dcache. Stuff just to the left of those blocks can include tags, TLB entries, and buffers. Area in the center to the right of the L2 SRAM arrays would be L2 tags.

O.K. that makes more sense, as I was thinking that the areas diagonal and towards the opposite side of the cores looked like split L1$ (I/D) - if not, WTH would they be! Thanks!

ancientarcher · Sep 30, 2013

A basic question on the power of the Apple A7.
On Geekbench 3, the A7 scores around 1400 for a single core (@1.3GHz). The Intel Core i7-3840QM scores around 3,500 on a single core basis (@2.8GHz). The TDP of A7 is 1.5-2W and the TDP of the core i7-3840QM is around 45W (but then people say that is only for the CPU, the integrated GPU is not included).

So, theoretically Apple can double the A7 frequency to 2.6GHz and the performance will increase to 2,800 per core with power quadrupling to 6-8W TDP, so basically getting 80% of core i7 score with 1/6th the power consumption. I realise you can't just double the clock rate and there are other factors, but doesn't that show you how close the A7 architecture is to the best that Intel has got? By the way, the Core i7-3840QM suggested retail price is $568.

If that is the performance that Apple gets from 28nm planar design, then clearly it can take the same design, ramp up the clock rate in 20nm next year and get a processor with single core performance equal to the best of Intel?

Is the above broadly correct?
If it is, then what stops apple from having two different (or 3) processor design teams geared to deliver cores for different TDPs. One for phones (lowest TDP), another for macbook pro and air and the third for the imacs (the highest power consumption levels)?
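The extrapolation in this post can be written out as a small sketch. It simply encodes the post's own simplifying assumptions (score scales linearly with clock, power scales with the square of the clock ratio); neither is a measured property of the A7, and the function name is made up for illustration.

```python
# Naive scaling model taken from the post:
#   score ~ clock, power ~ clock^2 (the "power quadrupling" claim).
# Both are the poster's assumptions, not measurements.

def scale(score, power_w, clock_ratio):
    """Return (scaled score, scaled power) for a given clock multiplier."""
    return score * clock_ratio, power_w * clock_ratio ** 2


A7_SCORE = 1400            # Geekbench 3 single core @1.3GHz, from the post
A7_TDP_RANGE_W = (1.5, 2.0)
I7_SCORE = 3500            # Core i7-3840QM single core, from the post
I7_TDP_W = 45

for tdp in A7_TDP_RANGE_W:
    score, power = scale(A7_SCORE, tdp, 2.0)
    print(
        f"A7 at 2x clock: score {score:.0f} "
        f"({score / I7_SCORE:.0%} of the i7), "
        f"power {power:.0f} W (i7 TDP is {I7_TDP_W / power:.1f}x higher)"
    )
```

With these inputs the model reproduces the figures in the post: 2,800 points, 80% of the i7 score, and 6-8W, which is between roughly 1/7.5 and 1/5.6 of the 45W TDP.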

ShintaiDK · Sep 30, 2013

ancientarcher said:
A basic question on the power of the Apple A7.
On Geekbench 3, the A7 scores around 1400 for a single core (@1.3GHz). The Intel Core i7-3840QM scores around 3,500 on a single core basis (@2.8GHz). The TDP of A7 is 1.5-2W and the TDP of the core i7-3840QM is around 45W (but then people say that is only for the CPU, the integrated GPU is not included).

So, theoretically Apple can double the A7 frequency to 2.6GHz and the performance will increase to 2,800 per core with power quadrupling to 6-8W TDP, so basically getting 80% of core i7 score with 1/6th the power consumption. I realise you can't just double the clock rate and there are other factors, but doesn't that show you how close the A7 architecture is to the best that Intel has got? By the way, the Core i7-3840QM suggested retail price is $568.

If that is the performance that Apple gets from 28nm planar design, then clearly it can take the same design, ramp up the clock rate in 20nm next year and get a processor with single core performance equal to the best of Intel?

Is the above broadly correct?
If it is, then what stops apple from having two different (or 3) processor design teams geared to deliver cores for different TDPs. One for phones (lowest TDP), another for macbook pro and air and the third for the imacs (the highest power consumption levels)?

The problem, as already shown in other threads, is that geekbench is not a good way of showing cross platform performance. In short, the real world performance is lower on the A7 than geekbench shows vs for example x86. The A7 "magic" is already dead.

Ajay · Sep 30, 2013

ShintaiDK said:
The problem, as already shown in other threads, is that geekbench is not a good way of showing cross platform performance. In short, the real world performance is lower on the A7 than geekbench shows vs for example x86. The A7 "magic" is already dead.

No it's not, the only "Magic" was the crypto scores because they are offloaded to hardware - which is still a good thing, just not comparable to CPU based crypto scores (and x86 has hardware accelerated crypto as well).

There are still small to significant gains in the A7 due to the implementation of AArch64. The move to 32 GPRs is one of the biggest reasons for some of the gains, and double wide FP units also add to the performance. Importantly, this is in a SoC that is still below 2 Watts peak load (and typically much less). If you don't believe me, you can go over to RWT and try arguing with some much smarter guys there. Even Hans posted a die shot here and made mention of the much larger performance bump than expected.

Please note, I'm not an Apple 'fanboi', I'm just impressed with what Apple/ARM was able to do with such a low power CPU on a 1/2 node shrink.

liahos1 · Sep 30, 2013

Do you have a link to the TDPs? I haven't seen them anywhere.

ancientarcher · Sep 30, 2013

ShintaiDK said:
The problem, as already shown in other threads, is that geekbench is not a good way of showing cross platform performance. In short, the real world performance is lower on the A7 than geekbench shows vs for example x86. The A7 "magic" is already dead.

Well, everyone says that, but no one has an alternative. I am just using Geekbench as a substitute for performance. But my main point is: can't apple (or any of the ARM design guys, on their 64bit ARM designed processors)
1) Double the clock rate of the A7 (or its soon to be born brothers from qualcomm etc)
2) Double the GPU area (which is more scalable anyways)
3) Move to a new process node (which is going to happen anyways)

and bring out a processor which is comparable in performance to the core i7 series Intel processors but much lower in power consumption?

After all, the best of breed intel core i-7 processors have south of 1.5bn transistors and the A7 has 1bn transistors... ARM architecture can't be so sh*t compared to x86 that they can't have at least comparable performance!

ancientarcher · Sep 30, 2013

liahos1 said:
Do you have a link to the TDPs? I haven't seen them anywhere.

The A7 TDP is hearsay - from Anandtech and other sites.
The Intel TDP is http://ark.intel.com/products/70846

From what I hear, Intel measures TDP of only the CPU cores and not the integrated GPU...

ShintaiDK · Sep 30, 2013

ancientarcher said:
From what I hear, Intel measures TDP of only the CPU cores and not the integrated GPU...

Incorrect.

Nothingness · Sep 30, 2013

ancientarcher said:
Well, everyone says that, but no one has an alternative. I am just using Geekbench as a substitute for performance. But my main point is: can't apple (or any of the ARM design guys, on their 64bit ARM designed processors)
1) Double the clock rate of the A7 (or its soon to be born brothers from qualcomm etc)

Doubling frequency is not an easy task if your design has not been planned for that from early in the design process. We don't know what Apple did, but given the apparent efficiency on Geekbench, they went for high perf/MHz instead of high frequency (or A7 frequency is already higher than the 1.3GHz found).

2) Double the GPU area (which is more scalable anyways)
3) Move to a new process node (which is going to happen anyways)

and bring out a processor which is comparable in performance to the core i7 series Intel processors but much lower in power consumption?

After all, the best of breed intel core i-7 processors have south of 1.5bn transistors and the A7 has 1bn transistors... ARM architecture can't be so sh*t compared to x86 that they can't have at least comparable performance!

No, high performance doesn't imply x86 for sure. But it took Intel many years and iterations to reach the level they are at. I doubt Apple will be able to have a competing chip before 2 or 3 years (assuming one iteration per year), and even then Intel will have moved on.

All of the above is pure speculation

Dresdenboy · Sep 30, 2013

ShintaiDK said:
The problem, as already shown in other threads, is that geekbench is not a good way of showing cross platform performance. In short, the real world performance is lower on the A7 than geekbench shows vs for example x86. The A7 "magic" is already dead.

Which are the real world measurements you're referring to? Please don't list synthetic/scripted browser benchmarks, benchmarks depending on some single vendor optimized execution engine, or compiler-biased ones (e.g. GCC vs. ICC).

ShintaiDK · Sep 30, 2013

Dresdenboy said:
Which are the real world measurements you're referring to? Please don't list synthetic/scripted browser benchmarks, benchmarks depending on some single vendor optimized execution engine, or compiler-biased ones (e.g. GCC vs. ICC).

Compiler biased? So anything compiled with clang is Apple biased I assume.

Nothingness · Sep 30, 2013

ShintaiDK said:
Compiler biased? So anything compiled with clang is Apple biased I assume.

What about getting proof instead of assumptions? Can you show us code that hints at clang cheating at some benchmark?

ancientarcher · Sep 30, 2013

Nothingness said:
Doubling frequency is not an easy task if your design has not been planned for that from early in the design process. We don't know what Apple did, but given the apparent efficiency on Geekbench, they went for high perf/MHz instead of high frequency (or A7 frequency is already higher than the 1.3GHz found).

No, high performance doesn't imply x86 for sure. But it took Intel many years and iterations to reach the level they are at. I doubt Apple will be able to have a competing chip before 2 or 3 years (assuming one iteration per year), and even then Intel will have moved on.

All of the above is pure speculation

Speculation, it might be, but consider the facts below:

1) AMD said that one of the reasons they are signing the architecture licensing deal with ARM is because of the ease of designing with ARM. From what I remember reading, it cuts down the time to design significantly and the number of people by ~1/5th ish

2) Legacy is a burden as well as a benefit. Intel has benefited from the status of x86 in legacy systems today and will continue to benefit. But that also brings the burden of carrying on with bits of inefficient legacy blocks in their designs. Whereas ARM had the advantage of designing a 64bit architecture from the ground up, only having to maintain backward compatibility with their 32bit arch, not for stuff from 20 years ago (which is the case with x86)

So, don't for a moment assume that Intel and the ARM design camp are on a level playing field. They are not!

Nothingness · Sep 30, 2013

ancientarcher said:
1) AMD said that one of the reasons they are signing the architecture licensing deal with ARM is because of the ease of designing with ARM. From what I remember reading, it cuts down the time to design significantly and the number of people by ~1/5th ish

That's probably true because AMD will be using an existing CPU design, instead of designing their own. Designing a CPU from the ground up is probably as expensive for anyone as it is for Intel, when you're targeting performance.

2) Legacy is a burden as well as a benefit. Intel has benefited from the status of x86 in legacy systems today and will continue to benefit. But that also brings the burden of carrying on with bits of inefficient legacy blocks in their designs. Whereas ARM had the advantage of designing a 64bit architecture from the ground up, only having to maintain backward compatibility with their 32bit arch, not for stuff from 20 years ago (which is the case with x86)

ARM 32-bit instruction set has been accumulating stuff for 20 years too, though certainly not anything as bad as x87 :biggrin:

But yeah Aarch64 is definitely way cleaner than x86.

So, don't for a moment assume that Intel and the ARM design camp are on a level playing field. They are not!

From a high performance chip design point of view, I am sorry but they are

EDIT: BTW what I called "speculation" was my answer, not a criticism of your post!

ancientarcher · Sep 30, 2013

Nothingness said:
From a high performance chip design point of view, I am sorry but they are

!

As Niels Bohr said "Prediction is very difficult, especially if it's about the future." Some of the prediction must fall into speculation by default :biggrin:

My point was that the performance is not so different between those two. Just design a core which can take clock rates of 2.6-3GHz, and Apple has a chip which can theoretically (at least as per Geekbench) compare with the best of breed low power core i7, which costs 20x more. Not that ARM chips don't run at those clock rates, just look at the Snapdragon 800.

Designing a chip from the ground up will be difficult. But modifying the A7 so that it can operate at higher frequencies - is that so difficult??? And yes, you have to modify for the 20nm node of course and the 14nm FinFET after that, but that's what chip designers are paid to do...

Nothingness · Sep 30, 2013

ancientarcher said:
Designing a chip from the ground up will be difficult. But modifying the A7 so that it can operate at higher frequencies - is that so difficult???

Yes, that is difficult, very difficult. In fact high frequency and low frequency mean different micro-architectures.

This comes from the fact that at a given frequency your electrons can travel through let's say 30 layers of logic per cycle; if you start increasing frequency, given that electron speed is constant, you'll have fewer layers of logic per cycle, hence you'll be able to do less work in each cycle (that's a simplification, but it's close enough to reality).

So if, for instance, Apple has time to multiply two numbers in one cycle because this requires 30 layers of logic, then if they increase frequency, which implies fewer layers per cycle, they'll have to split the multiplication into two cycles.

Hope I made this clear enough...
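The layers-of-logic argument above can be put into numbers with a toy model. The 25 ps delay per layer is a purely hypothetical value, picked only so that 30 layers fit into one cycle at 1.3GHz as in the example; real designs also lose time to flip-flops, clock skew and wires, which this sketch ignores.

```python
import math

# Hypothetical delay of one layer of logic, chosen so that 30 layers
# fit in a single 1.3GHz cycle. Not a measured figure for any process.
GATE_DELAY_PS = 25.0


def layers_per_cycle(freq_ghz, gate_delay_ps=GATE_DELAY_PS):
    """Number of whole layers of logic that fit in one clock period."""
    period_ps = 1000.0 / freq_ghz
    return int(period_ps // gate_delay_ps)


def cycles_needed(logic_depth, freq_ghz, gate_delay_ps=GATE_DELAY_PS):
    """Cycles an operation of the given logic depth takes at this clock."""
    return math.ceil(logic_depth / layers_per_cycle(freq_ghz, gate_delay_ps))


MULTIPLY_DEPTH = 30  # layers of logic for the multiplier in the example

for freq in (1.3, 2.6):
    print(
        f"{freq}GHz: {layers_per_cycle(freq)} layers per cycle, "
        f"multiply takes {cycles_needed(MULTIPLY_DEPTH, freq)} cycle(s)"
    )
```

In this model, doubling the clock halves the logic that fits in a cycle, so the 30-layer multiply goes from one cycle to two, and the pipeline around it has to be redesigned accordingly.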

Dresdenboy · Sep 30, 2013

ancientarcher said:
1) AMD said that one of the reasons they are signing the architecture licensing deal with ARM is because of the ease of designing with ARM. From what I remember reading, it cuts down the time to design significantly and the number of people by ~1/5th ish

There could be a) simple reuse of existing Cortex macros, b) a modified design (also saves work), c) a redesign from ground up.

A LinkedIn profile suggests that AMD is working on some modifications. The owner mentioned the ARM scheduler.

Exophase · Sep 30, 2013

Dresdenboy said:
There could be a) simple reuse of existing Cortex macros, b) a modified design (also saves work), c) a redesign from ground up.

A LinkedIn profile suggests that AMD is working on some modifications. The owner mentioned the ARM scheduler.

Are you sure the guy isn't just citing previous experience while under employment at ARM?

ARM core licensees have not been in the habit of modifying the core outside of the configurable options that ARM provides. Probably because it now lies on them to redo validation which is a big part of the design cost.

TuxDave · Sep 30, 2013

ancientarcher said:
But my main point is: can't apple (or any of the ARM design guys, on their 64bit ARM designed processors)
1) Double the clock rate of the A7 (or its soon to be born brothers from qualcomm etc)
2) Double the GPU area (which is more scalable anyways)
3) Move to a new process node (which is going to happen anyways)

They absolutely can do all of the above. It just takes time, money and talent.

Exophase · Sep 30, 2013

There's no guarantee you can double the peak clock speed of a uarch by throwing a lot of money at it. Normally you'd end up with a totally different uarch - leaving that much clock speed on the table is a sign of a poorly balanced design.

It's possible Apple can further optimize the timings by spending more time tuning things (including physical layout) without changing the behavioral specification of the uarch, but 2x seems like a lot given the experience of their design teams. Probably something is going to have to start giving in terms of cycle counts.

ancientarcher · Sep 30, 2013

Nothingness said:
Yes, that is difficult, very difficult. In fact high frequency and low frequency mean different micro-architectures.

This comes from the fact that at a given frequency your electrons can travel through let's say 30 layers of logic per cycle; if you start increasing frequency, given that electron speed is constant, you'll have fewer layers of logic per cycle, hence you'll be able to do less work in each cycle (that's a simplification, but it's close enough to reality).

So if, for instance, Apple has time to multiply two numbers in one cycle because this requires 30 layers of logic, then if they increase frequency, which implies fewer layers per cycle, they'll have to split the multiplication into two cycles.

Hope I made this clear enough...

Fair enough. You and Exophase have made it clear that it is not trivial. However, am I correct in assuming that it is still possible to design a core (a little different from the A7 obviously) which is nearly 2x the performance of the A7 Cyclone? Maybe with 20nm (with performance gains of 20-30% over the current node), Apple (and/or the others) can theoretically reach core i7 levels of performance at a TDP of ~8W (or maybe slightly higher power consumption).

I know that we can't extrapolate the performance from A7, but clearly there is a world where 8-15W TDP ARM cores would work. And who is to say that Apple didn't have parallel teams working on cores with these power envelope levels as well... After all the ARMv8 instruction set was out 2 years ago...

I guess we will find out when we find out.
