Apple A10 Fusion is Quad-core big.LITTLE

Eug · Oct 16, 2016

Nothingness said:
- big.LITTLE can work with all CPU's on; A10 can't

Not that Wikipedia is the final authority, but it states that your scenario of big.LITTLE with all cores potentially active is just one possible implementation of it. Other implementations don't allow all CPU cores to be active simultaneously.

The clustered model approach is the first and simplest implementation, arranging the processor into identically-sized clusters of "Big" or "Little" cores. The operating system scheduler can only see one cluster at a time; when the load on the whole processor changes between low and high, the system transitions to the other cluster. All relevant data is then passed through the common L2 cache, the first core cluster is powered off and the other one is activated. A Cache Coherent Interconnect (CCI) is used. This model has been implemented in the Samsung Exynos 5 Octa (5410).
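To make the clustered model in that quote concrete, here is a minimal sketch of the switching decision. It is only an illustration: the enum, the function name, and both threshold values are hypothetical and are not taken from ARM's or Samsung's actual implementation.

```c
/* Only one cluster is visible to the scheduler at any time. */
enum cluster { CLUSTER_LITTLE, CLUSTER_BIG };

/* Hypothetical hysteresis thresholds (percent load); real values are tuning choices. */
#define UP_THRESHOLD_PCT   80u
#define DOWN_THRESHOLD_PCT 30u

/* Decide which cluster should be active for the next interval.
 * A switch implies handing state over through the shared cache path
 * and powering the other cluster down. */
static enum cluster next_cluster(enum cluster active, unsigned load_pct)
{
    if (active == CLUSTER_LITTLE && load_pct >= UP_THRESHOLD_PCT)
        return CLUSTER_BIG;
    if (active == CLUSTER_BIG && load_pct <= DOWN_THRESHOLD_PCT)
        return CLUSTER_LITTLE;
    return active; /* stay put between the thresholds to avoid ping-ponging */
}
```

The gap between the two thresholds is there so that a load hovering around one value does not cause a cluster migration on every interval.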

Andrei. · Oct 16, 2016

Nothingness said:
- big.LITTLE can work with all CPU's on; A10 can't
- big.LITTLE has separate L2 cache for each cluster; A10 seems to share half of L2 cache

IMHO that's all there is that matters. But I'm missing some coffee so I might have skipped something important

People are being anal and smartass about what big.LITTLE actually means - there are those that seem to say that it's ARM's specific implementation of heterogeneous CPU cores consisting of separate independent CPU clusters interconnected through a coherent interconnect fabric.

To me (just my and others' subjective opinion) big.LITTLE means nothing more than the use of heterogeneous CPU cores in a system for the sake of power or performance optimisation. That's what "big.LITTLE" has become in layman's terms.

name99 said:
If you look at that article you will see a number of comments (including mine) that point out that what Andrei measured did not mean what he thought it meant. These technical points have never been responded to.

Because I don't want to get into useless arguments with you because it's a futile exercise, but I will indulge you on the current topic just because of how stupid it is.

name99 said:
As for that S6 review that you like so much, let me quote from:
http://www.anandtech.com/show/9146/the-samsung-galaxy-s6-and-s6-edge-review/2
"The memory controller is new and supports LPDDR4 running at 1555MHz. This means that the Galaxy S6 has almost double the theoretical memory bandwidth when compared to.."

Note that SINGULAR there. Memory controller, not "memory controller(s)"...

Are you really now trying to use wording semantics to prove a technical point? Can we stop this nonsense?

https://github.com/AndreiLux/ref-3.10/blob/perseus7420/drivers/bts/bts-exynos7420.c

Code:

#define MIF_BLK_NUM        4
....
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[0] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[1] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[2] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[3] + QOS_TIMEOUT_0xB);
....
#define EXYNOS7_PA_DREX0    0x10800000
#define EXYNOS7_PA_DREX1    0x10900000
#define EXYNOS7_PA_DREX2    0x10A00000
#define EXYNOS7_PA_DREX3    0x10B00000

The above is taken from the Bus Traffic Shaper driver of the 7420 as you're talking about that SoC in particular.

Those are just a few excerpts of the hundreds of code pieces pointing out that the "memory controller", beyond having 4 channels, also has 4 schedulers (because we're setting different QoS parameters at each "controller's" base address), physical interfaces (die shot), independent power management, and internal clock planes.

To me that looks like 4 "memory controllers", but you're always welcome to provide technical expertise of your own.
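For anyone who doesn't read kernel sources, a minimal sketch of what the excerpt boils down to. This is not the driver's code: `drex_reg_addr` and `drex_stride` are helper names made up for illustration, and the register offset is passed in as a parameter because the actual value of `QOS_TIMEOUT_0xB` isn't shown in the excerpt. Only `MIF_BLK_NUM` and the four base addresses come from the source.

```c
#include <stdint.h>

#define MIF_BLK_NUM 4

/* Physical base addresses copied from the EXYNOS7_PA_DREX* defines above. */
static const uint32_t drex_base_pa[MIF_BLK_NUM] = {
    0x10800000u, 0x10900000u, 0x10A00000u, 0x10B00000u
};

/* Address of a register inside DREX block `blk`; returns 0 if blk is out of range.
 * `offset` stands in for a register offset such as QOS_TIMEOUT_0xB (value assumed). */
static uint32_t drex_reg_addr(unsigned blk, uint32_t offset)
{
    return blk < MIF_BLK_NUM ? drex_base_pa[blk] + offset : 0;
}

/* Distance between consecutive DREX blocks: each one has its own register window. */
static uint32_t drex_stride(void)
{
    return drex_base_pa[1] - drex_base_pa[0];
}
```

The four `__raw_writel` calls in the excerpt write the same table entry to the same offset in each of these four windows, which is why the blocks look like separately programmable units.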

name99 said:
So Andrei, what's YOUR explanation for
(a) Galaxy Note 7 (as an example) matching A10 in GeekBench Stream numbers (which suggests to me a single memory controller
(b) Galaxy Note 7 GB Multi results are only 2.3x the single core results (again, this is the pattern across high end devices, nothing specific to Samsung here).
?

How does the S820 matching the A10's Stream numbers in any way suggest to you a single memory controller? How do you even come to this conclusion? Stream performance is mainly dictated by the CPU's memory subsystem. This is why you see ARM boasting about new microarchitectures improving in that regard, this is why you see Samsung boasting their microarchitecture is very good in memory performance. So I ask you again, how do you come to the conclusion that the memory controller is here to either blame or to credit?

What GB multi results are only 2.3x? Where do you even get this number from?

Let me do the job for you:

It's obvious that you're not talking about the E8890 Note 7 as that one's score ratio is far above "2.3x", so let's look at the S820 numbers. Seems fine to me? Do you forget that this SoC has 2 cores at 2150 MHz while the other 2 run up to only 1559 MHz? That's only a theoretical multicore peak of 3.45x over single-core. I won't go into more details on why Kryo is just an outright disappointing microarchitecture but you can assume that the rest of the delta between the actual 2.74x scaling and the theoretical peak of 3.45x can be found in a bunch of reasons such as software scheduling, overhead, or even bottlenecks in the communication between the two clusters and the central bus.
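The 3.45x figure can be checked with a quick sketch. It assumes throughput scales linearly with clock and that the four cores are otherwise identical, which is a simplification; the function names are made up for illustration, and results are in hundredths to keep the arithmetic in integers.

```c
/* Theoretical multi-core scaling relative to the fastest core, in hundredths.
 * Assumption: per-core throughput is proportional to its maximum clock. */
static unsigned peak_scaling_x100(const unsigned *mhz, unsigned n)
{
    unsigned sum = 0, max = 0;
    for (unsigned i = 0; i < n; i++) {
        sum += mhz[i];
        if (mhz[i] > max)
            max = mhz[i];
    }
    return max ? sum * 100u / max : 0;
}

/* How much of the theoretical peak a measured scaling figure reaches, in percent. */
static unsigned efficiency_pct(unsigned measured_x100, unsigned peak_x100)
{
    return peak_x100 ? measured_x100 * 100u / peak_x100 : 0;
}
```

With the S820 clocks of 2150, 2150, 1559 and 1559 MHz this gives 345, i.e. 3.45x, and the measured 2.74x works out to roughly 79% of that peak.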

But let's go back, you mentioned the S6 earlier so I included those scores as well. I don't see anything wrong there. An average of 4.18x scaling from single to multi-core (don't forget there are small cores, but they're relatively insignificant because of both their performance and GTS scheduling).

The vanilla Exynos 8890 has worse scaling with a figure of 3.65x over single core, but again that's because the SoC boosts up to 2.6GHz for 2 load threads while it remains at 2288MHz for more threads. So I went ahead and included my own S7 with that mechanism turned off so we're always at 2288MHz, and oh look the scaling goes back to 4.23x.
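As a rough sanity check on the boost explanation, here is a sketch that rescales the fixed-clock scaling figure by the clock ratio. It assumes the single-core score rises linearly with frequency, which is only an approximation; the function name is made up for illustration.

```c
/* Expected multi/single scaling (in hundredths) once the single-thread run
 * is allowed to boost, given the scaling measured with all cores at the
 * sustained clock. Assumption: single-core score is proportional to clock. */
static unsigned scaling_with_boost_x100(unsigned fixed_clock_scaling_x100,
                                        unsigned sustained_mhz,
                                        unsigned boost_mhz)
{
    return boost_mhz ? fixed_clock_scaling_x100 * sustained_mhz / boost_mhz : 0;
}
```

Feeding in 4.23x at 2288 MHz with a 2600 MHz boost predicts about 3.72x, which is in the same ballpark as the 3.65x measured on the vanilla device.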

So what about the memory figures? They mean nothing in regards to multi-core performance scaling, as evidenced by the actual benchmarks here. All it means is that each CPU is powerful enough to saturate either a) the cluster L2 subsystem or b) the interconnect bandwidth, which we've seen in the past is often much less than the total theoretical memory controller(s) bandwidth.

So again, where is this "pattern among high-end devices"? I don't see it. The last few flagship SoCs sure didn't have it. Are you looking at S810 devices which can't even do full freq for all cores?

David warned you on RWT about people complaining about your discussion tone and I absolutely agree that it's absolutely offputting to even argue with you about anything. You have such a chip on your shoulder against me that you attacked me in the comments section of the process node pipeline even though I had nothing to do with it and Anton wrote the article. That alone is just enough reason to probably make this the last time I engage with you directly. I see no point in arguing with somebody who thinks he's always correct, got better things to do.

Nothingness · Oct 16, 2016

Eug said:
Not that Wikipedia is the final authority, but it states that your scenario of big.LITTLE with all cores potentially active is just one possible implementation of it. Other implementations don't allow all CPU cores to be active simultaneously.

That's right, but IIRC only the very first b.L Exynos had such a limitation. I might remember wrong...

Andrei. said:
People are being anal and smartass about what big.LITTLE actually means - there are those that seem to say that it's ARM's specific implementation of heterogeneous CPU cores consisting of separate independent CPU clusters interconnected through a coherent interconnect fabric.

To me (just my and others' subjective opinion) big.LITTLE means nothing more than the use of heterogeneous CPU cores in a system for the sake of power or performance optimisation. That's what "big.LITTLE" has become in layman's terms.

Well that's a way to see b.L.

I would add that the heterogeneity is limited in the sense that both clusters should contain CPUs with the very same architecture, and some parameters of the micro-architecture have to be somehow "misreported" for software to work (for instance the cache line size reported must be the smallest of all the CPUs, something Samsung somehow got wrong in Exynos 8890).

name99 = juanrga ?

name99 · Oct 16, 2016

Eug said:
Not that Wikipedia is the final authority, but it states that your scenario of big.LITTLE with all cores potentially active is just one possible implementation of it. Other implementations don't allow all CPU cores to be active simultaneously.

The clustered model approach is the first and simplest implementation, arranging the processor into identically-sized clusters of "Big" or "Little" cores. The operating system scheduler can only see one cluster at a time; when the load on the whole processor changes between low and high, the system transitions to the other cluster. All relevant data is then passed through the common L2 cache, the first core cluster is powered off and the other one is activated. A Cache Coherent Interconnect (CCI) is used. This model has been implemented in the Samsung Exynos 5 Octa (5410).

big.LITTLE is a trademark. It means whatever ARM wants it to mean.
The generic term is "single ISA heterogeneous computing".

As for what Apple does differently, who knows, since Apple is very opaque.

+ The most important difference that is visible today is that Apple's cores are more tightly coupled (at least sharing the same L2).

It is POSSIBLE that they could be even more tightly coupled, eg sharing the same L1s, or even TLB entries and branch prediction entries. Linley claims that they are somewhat separate cores (for example with separate L1s, the smaller core having 32K L1s rather than 64K), but I haven't read their article so don't know how much credence to give it.

+ A second significant possible dimension of difference is whether a large and small core are so tightly coupled that only one can run at a time. This appears to be the case for Apple, but again this is uncertain. It may be inevitable to their hardware, it may be a policy choice (not worth the verification effort, given what I said earlier about lack of RAM support for handling many cores). It may just be that the OS support isn't ready yet.

+ A third significant dimension of difference is how the decision is made to switch between cores. This can be done purely in the OS (as with ARM). But it can also be done purely in hardware. Once again we have no idea what Apple has chosen.

If there is anything we have seen from Apple's prior patterns it is that they are both cautious AND ambitious. They don't do things before they've got them right; but they also don't do things half-assed, in a kind of "well, that checkbox is marked off, let's move on to the next". Basically right now (of course this could change) they are an engineer's dream company, operating with an engineer mentality.
That suggests to me that this first implementation was engineered to be safe and robust. It may, for example, have the mechanisms in place for some of the fancier things I have suggested (both big and small cores running simultaneously, and/or hardware core switching), but those have been disabled, at least for now, in shipping devices. In which case they are under investigation at Apple HQ and, depending on how things turn out, we may see them activated in a future OS release, or they may be irredeemably buggy in this release but the lessons learned are being applied to the A11 or A12.

To say "this is just the same as big.LITTLE" is to deny the possibility of these differences. I think this MAY be acceptable today, but it's unlikely to remain acceptable. (That is to say, Apple's and ARM's implementations of this technology are likely to diverge, with ARM's remaining at the very simple --- controlled by the OS, very weakly coupled, cores --- while Apple's will evolve to ever lower latency and more responsive switching. This just reflects the reality of ARM's constraints vs Apple's.)

Andrei. · Oct 16, 2016

name99 said:
It is POSSIBLE that they could be even more tightly coupled, eg sharing the same L1s, or even TLB entries and branch prediction entries. Linley claims that they are somewhat separate cores (for example with separate L1s, the smaller core having 32K L1s rather than 64K), but I haven't read their article so don't know how much credence to give it.

You've been told endless times that that is not how the cores are set up, and repeating your theories endless times won't make them any more true. If you can't even take Linley's credibility at face value then I don't know what would ever convince you of being wrong.

They analyzed the die shots which have been public for over a month. Even the lowest resolution shots are absolutely clear in the core's setups. Anybody who has a clue already knows this is old news by now.

I had prepared this on my own for the eventual A10 piece but might as well end these topic derailing theories here and now:

A10 on the left vs TSMC A9 on the right.

It's very clear the big cores are more or less identical in most regards - they're just mirrored in their layout and the L2 banks are rearranged. There are new cache structures near what is likely the front-end as the biggest change in the cores. The L1s are clearly distinguishable near the L2 arb and again almost identical between the cores. The small cores are absolutely separate from the larger cores, with an automated layout synthesis (as opposed to a more manual one for the big cores) for everything except the cache blocks - with their L1 visible next to the big one's L1, and being half the size, thus 32kB instead of 64kB. From a clock-plane perspective it's impossible for the small cores to be asynchronous to the big cores because they have no dedicated CMU/PLL/MUX, so it's very likely they share the big cores' clock plane (this does not mean same clocks). Or they share the L2's clock plane, which is possible but quite exotic. In any case that alone makes it impossible for the two to work in tandem unless you throw away any reason in terms of proper DVFS.

name99 · Oct 16, 2016

Andrei. said:
People are being anal and smartass about what big.LITTLE actually means - there are those that seem to say that it's ARM's specific implementation of heterogeneous CPU cores consisting of separate independent CPU clusters interconnected through a coherent interconnect fabric.

To me (just my and others' subjective opinion) big.LITTLE means nothing more than the use of heterogeneous CPU cores in a system for the sake of power or performance optimisation. That's what "big.LITTLE" has become in layman's terms.

Because I don't want to get into useless arguments with you because it's a futile exercise, but I will indulge you on the current topic just because of how stupid it is.

Are you really now trying to use wording semantics to prove a technical point? Can we stop this nonsense?

https://github.com/AndreiLux/ref-3.10/blob/perseus7420/drivers/bts/bts-exynos7420.c

Code:

#define MIF_BLK_NUM        4
....
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[0] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[1] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[2] + QOS_TIMEOUT_0xB);
    __raw_writel(exynos7_bts_param_table[target_idx][BTS_MIF_DISP + rot], base_drex[3] + QOS_TIMEOUT_0xB);
....
#define EXYNOS7_PA_DREX0    0x10800000
#define EXYNOS7_PA_DREX1    0x10900000
#define EXYNOS7_PA_DREX2    0x10A00000
#define EXYNOS7_PA_DREX3    0x10B00000

The above is taken from the Bus Traffic Shaper driver of the 7420 as you're talking about that SoC in particular.

Those are just a few excerpts of the hundreds of code pieces pointing out that the "memory controller", beyond having 4 channels, also has 4 schedulers (because we're setting different QoS parameters at each "controller's" base address), physical interfaces (die shot), independent power management, and internal clock planes.

To me that looks like 4 "memory controllers", but you're always welcome to provide technical expertise of your own.

How does the S820 matching the A10's Stream numbers in any way suggest to you a single memory controller? How do you even come to this conclusion? Stream performance is mainly dictated by the CPU's memory subsystem. This is why you see ARM boasting about new microarchitectures improving in that regard, this is why you see Samsung boasting their microarchitecture is very good in memory performance. So I ask you again, how do you come to the conclusion that the memory controller is here to either blame or to credit?

What GB multi results are only 2.3x? Where do you even get this number from?

Let me do the job for you:

It's obvious that you're not talking about the E8890 Note 7 as that one's score ratio is far above "2.3x", so let's look at the S820 numbers. Seems fine to me? Do you forget that this SoC has 2 cores at 2150 MHz while the other 2 run up to only 1559 MHz? That's only a theoretical multicore peak of 3.45x over single-core. I won't go into more details on why Kryo is just an outright disappointing microarchitecture but you can assume that the rest of the delta between the actual 2.74x scaling and the theoretical peak of 3.45x can be found in a bunch of reasons such as software scheduling, overhead, or even bottlenecks in the communication between the two clusters and the central bus.

But let's go back, you mentioned the S6 earlier so I included those scores as well. I don't see anything wrong there. An average of 4.18x scaling from single to multi-core (don't forget there are small cores, but they're relatively insignificant because of both their performance and GTS scheduling).

The vanilla Exynos 8890 has worse scaling with a figure of 3.65x over single core, but again that's because the SoC boosts up to 2.6GHz for 2 load threads while it remains at 2288MHz for more threads. So I went ahead and included my own S7 with that mechanism turned off so we're always at 2288MHz, and oh look the scaling goes back to 4.23x.

So what about the memory figures? They mean nothing in regards to multi-core performance scaling, as evidenced by the actual benchmarks here. All it means is that each CPU is powerful enough to saturate either a) the cluster L2 subsystem or b) the interconnect bandwidth, which we've seen in the past is often much less than the total theoretical memory controller(s) bandwidth.

So again, where is this "pattern among high-end devices"? I don't see it. The last few flagship SoCs sure didn't have it. Are you looking at S810 devices which can't even do full freq for all cores?

David warned you on RWT about people complaining about your discussion tone and I absolutely agree that it's absolutely offputting to even argue with you about anything. You have such a chip on your shoulder against me that you attacked me in the comments section of the process node pipeline even though I had nothing to do with it and Anton wrote the article. That alone is just enough reason to probably make this the last time I engage with you directly. I see no point in arguing with somebody who thinks he's always correct, got better things to do.

Andrei I am glad to see you engaging with me with real information. Contrary to what you may think, I am not interested in trolling, I am interested in learning. I often have to make aggressive statements, and make them REPEATEDLY because people simply will not provide the information I request. They will assert certain things over and over, but will not answer obvious, simply stated questions, that would clarify the point.

Your post is an example of what I TRY to get from people, because now we have genuine facts we can discuss and try to learn from. You may not believe it, but I AM willing to change my mind (and have done so), but NOT purely on the grounds of argument from authority.

OK, so to the point at hand:

(a) Semantics matters. How do you think I (and other people) are supposed to learn if we cannot believe what we read? It is Ars' job to get details like that right, and you, as a member of the team, have some role in correcting technical details like that if they are pointed out to you.
Of COURSE if I see an Ars article referring to a memory controller, I am going to interpret as saying "The guys who have spoken to the relevant company are making a particular claim about the chip". How else is any normal person supposed to interpret what you publish?

(b) Since I am not a Linux guru, what is the header snippet that you provide supposed to prove? Can you give me a reference to what it does that's more usable than simply ten .C files?

(c) I took my ratios from
http://arstechnica.com/apple/2016/0...at-annual-upgrades-with-one-major-catch/5/#h2
If you take the overall ratio you get 2.3 for the Note 7.
Eyeballing the other two Android devices you get much the same sort of number.

You can argue that that is "unfair" because it includes the memory score, which doesn't improve much from one core to multi; but I think it's reasonable insofar as it captures that for most usage scenarios there will be other users of memory bandwidth, most notably IO and graphics.

Now, obviously, of course, when you look at the individual benchmarks you see many that are barely throttled by RAM, and so scale well, along with some that are clearly affected by something (and my best guess is RAM). It's then a judgement call. Do you consider most of the real world usage cases to reflect the scaling of the low-RAM-contention(?) benchmarks, or do you consider them (especially when I point out the requirements of the GPU and IO) better to match the high-RAM-contention(?) benchmarks?

HOWEVER I realize at this point that I made a huge screwup. (You see, I AM willing to admit my mistakes, when we can talk in a technical manner, and so I can see what the actual point of disagreement is).
I assumed that the 820 was still a 4+4 system (because I can't keep track of which cores are doing what) but I'm wrong about that. It's a 2+2 Kryo system, in which case I think the scaling is completely as expected.

As a corroboration of that:
The Nexus 6P gets much better scaling than the Note 7 and the HTC10, at least by this metric, but it appears to have a much lousier memory controller (certainly much lower bandwidth). However the 810 is classic ARM (4+4).

So, OK, we've learned something. Awesome.
It would be nice if the people discussing lithography had been as willing as you to actually reply to my questions and points in as much detail as you --- perhaps then we could have GOT somewhere...

If you're willing to explain (EXPLAIN, not just sneer and say "well of course you can find it if you look") the significance of your Linux code quoting, we may also make progress on understanding the memory controller situation.

name99 · Oct 16, 2016

Andrei. said:
You've been told endless times that that is not how the cores are set up, and repeating your theories endless times won't make them any more true. If you can't even take Linley's credibility at face value then I don't know what would ever convince you of being wrong.

They analyzed the die shots which have been public for over a month. Even the lowest resolution shots are absolutely clear in the core's setups. Anybody who has a clue already knows this is old news by now.

I had prepared this on my own for the eventual A10 piece but might as well end these topic derailing theories here and now:

A10 on the left vs TSMC A9 on the right.

It's very clear the big cores are more or less identical in most regards - they're just mirrored in their layout and the L2 banks are rearranged. There are new cache structures near what is likely the front-end as the biggest change in the cores. The L1s are clearly distinguishable near the L2 arb and again almost identical between the cores. The small cores are absolutely separate from the larger cores, with an automated layout synthesis (as opposed to a more manual one for the big cores) for everything except the cache blocks - with their L1 visible next to the big one's L1, and being half the size, thus 32kB instead of 64kB. From a clock-plane perspective it's impossible for the small cores to be asynchronous to the big cores because they have no dedicated CMU/PLL/MUX, so it's very likely they share the big cores' clock plane (this does not mean same clocks). This alone makes it impossible for the two to work in tandem unless you throw away any reason in terms of DVFS.

Once again, Andrei, I am HAPPY to be proven wrong when information is provided.
But I have little patience for being told that I'm wrong, end of story, with no explanation as to why, or answer to my questions beyond being told that I'm wrong.

Basically people cannot complain that others are ignorant when they refuse to help educate them.

I have no idea where you got THESE die shots from, but I have never seen them before, so don't pretend that I'm being an obstinate fool. The standard die shots that we have all been working off are the ones like this

at that level of quality.

Andrei. · Oct 16, 2016

name99 said:
If you're willing to explain (EXPLAIN, not just sneer and say "well of course you can find it if you look") the significance of your Linux code quoting, we may also make progress on understanding the memory controller situation.

The short and simple version is that it's referencing hardware register addresses which clearly point to 4 "entities" (call them memory controllers, call them interfaces, I don't care) that are clearly independent and separate in their functioning.

The problem with answering your questions is that it's absolutely tedious to go back and explain every single detail to you. I don't gain anything from even discussing this here and I'm solely doing this as a hobby and passion (and btw I'm no longer an editor so my pieces will be limited). Also I don't go into every absolute minute detail like the ones you demand, because then articles like the 7420 review would go up from 15k words to maybe 30k to 50k words if I were to explain source code for every claim I make.

I don't go around making statements I'm not sure about and this has been true for all my community work for the last several years and the rare times where I was wrong about some technical detail I admitted it. At some point you need to accept the argument from authority because I simply don't have the energy to take everybody step by step through everything.

And the die shot you posted is the exact SAME one I used for the A10 only colour enhanced, zoomed in and edited the center text away. The A9 shot is the same public shot from Chipworks you can Google. You only have to spend time in analysing the structures, finding the similarities, and give it a shot at labelling them, and I'm pretty confident in what I labelled there.

name99 · Oct 16, 2016

Andrei. said:
The short and simple version is that it's referencing hardware register addresses which clearly point to 4 "entities" (call them memory controllers, call them interfaces, I don't care) that are clearly independent and separate in their functioning.

What do you consider a memory controller to be?
The iPhone 7 (which let's take as a fairly standard high end phone), as far as I know behaves as follows:
- The actual RAM consists of 4 physical chips, each 4Gbits in capacity and with a 16bit interface. Each pin of the interface runs at 3.2Gb/s.
- These are all run in sync to provide a single 64-bit wide interface. Going through the numbers, this gives 64 pins, ie 8 bytes, and a total phy throughput of 3.2*8=25.6GB/s
- This tracks, since these high end devices for the most part provide a STREAM bandwidth of 13..16GB/s.
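The arithmetic in that list can be written out as a small sketch. The function names are made up for illustration, and rates are kept in Mb/s and MB/s so everything stays in integers.

```c
/* Peak DRAM interface bandwidth in MB/s:
 * (chips * data pins per chip) * per-pin rate in Mb/s / 8 bits per byte. */
static unsigned peak_bandwidth_mbs(unsigned chips, unsigned bits_per_chip,
                                   unsigned mbit_per_pin)
{
    return chips * bits_per_chip * mbit_per_pin / 8u;
}

/* Measured bandwidth as a percentage of the peak, rounded down. */
static unsigned percent_of_peak(unsigned measured_mbs, unsigned peak_mbs)
{
    return peak_mbs ? measured_mbs * 100u / peak_mbs : 0;
}
```

Four 16-bit chips at 3200 Mb/s per pin give 25600 MB/s, i.e. the 25.6GB/s above, and a STREAM result of 13 to 16GB/s is roughly 50% to 62% of that.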

Given this background, how do you believe these independent memory controllers are operating? They have to run all 4 pieces of RAM silicon in sync so the actual "controlling" part of the memory controller is surely logically one entity?

As for sequencing and prioritizing accesses, my mental model would be that all the various clients (so the L2s, presumably the GPU is separate and has its own cache(?), various IO blocks, etc) all dump their requests on the bus, they arrive at the memory controller, and get stuck in various queues (for example a read queue, a write queue, perhaps a prefetch queue. etc).

I see nowhere in this model where MULTIPLE memory controllers (or even memory-controller abstractions, like multiple different read queues) might sit. The usual circumstance where you have multiple memory controllers (again, as far as I know) is where each controller owns what is both a physically separate collection of chips and corresponding to a different block of address space. Requests are then directed to the appropriate controller that owns the relevant address. But obviously this sort of model is only relevant for systems with vastly more RAM than mobile.

Andrei. · Oct 16, 2016

name99 said:
What do you consider a memory controller to be?

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0331f/I1005644.html

Consider ARM's documentation on their memory controllers; I should have pointed to it earlier, as what more do you want than something straight from the horse's mouth. I don't know the answers to most of your questions and didn't go through the docs, but I imagine the controllers are simply interleaved on a page-size basis and most of the points you bring up are simply sorted out at the higher-level bus.
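To illustrate what interleaving on a page-size basis would mean, here is a minimal sketch. It is a guess at the scheme, not something taken from ARM's or Samsung's documentation: the function name is made up, and both the granule size and the controller count are parameters the caller has to assume.

```c
#include <stdint.h>

/* Which controller would own a physical address if consecutive granules
 * (e.g. pages) are handed out round-robin across the controllers.
 * Returns 0 for degenerate inputs. */
static unsigned controller_for_addr(uint64_t phys_addr, uint64_t granule_bytes,
                                    unsigned controllers)
{
    if (granule_bytes == 0 || controllers == 0)
        return 0;
    return (unsigned)((phys_addr / granule_bytes) % controllers);
}
```

With a hypothetical 4096-byte granule and 4 controllers, addresses 0 to 4095 land on controller 0, the next page on controller 1, and so on, wrapping back to controller 0 at 16384. Under a scheme like this the controllers never need to run in lockstep, since each one only serves its own slices of the address space.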

And regarding STREAM again, I told you that between what the controller can provide and what the CPU can access is different because there are busses in between. ARM cores are limited by the CCI frequency which dictates maximum bandwidth to a cluster which is always less than what the memory can do and the resulting bandwidth in STREAM largely coincides with the cluster bandwidth; I.E. everything is working as designed. I even address this in the 5433 piece and provide the math and interface configuration behind it.

name99 · Oct 16, 2016

Andrei. said:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0331f/I1005644.html

Consider ARM's documentation on their memory controllers; I should have pointed to it earlier, as what more do you want than something straight from the horse's mouth. I don't know the answers to most of your questions and didn't go through the docs, but I imagine the controllers are simply interleaved on a page-size basis and most of the points you bring up are simply sorted out at the higher-level bus.

And regarding STREAM again, I told you that between what the controller can provide and what the CPU can access is different because there are busses in between. ARM cores are limited by the CCI frequency which dictates maximum bandwidth to a cluster which is always less than what the memory can do and the resulting bandwidth in STREAM largely coincides with the cluster bandwidth; I.E. everything is working as designed. I even address this in the 7420 piece and provide the math and interface configuration behind it.

I have no idea what the link you provide has to do with anything. It certainly doesn't contradict anything I said, or explain your claims about 4 memory controllers.

A more interesting and relevant link would appear to be something like the diagram on
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0331f/I1005644.html

This is obviously a much higher end system (4 quad-core big clusters). Even so, it contains "only" two memory controllers (each handling essentially a 64-bit wide connection).
Obviously a mobile system would be substantially stripped down compared to this, normally including two quad-clusters (big and LITTLE), one memory controller, and probably using 64-bit rather than 72-bit RAM.

It seems to me that what you are calling a memory controller (and what the Linux interfaces seem to call an "MIF") is NOT what micro-architects consider a memory controller?
That's perhaps an honest mistake, but it's NOT a reason to insult me.

As for STREAM, I've no idea what your point here is. The RAM can deliver a certain PHY data rate. The CPU can (with all the usual overhead) access about 1/2 to 2/3 of that. This is the case with iPhone 7, this is the case with Note 7.
This is all as expected. I don't understand why you believe there is a "problem" that needs to be explained. There may have been an issue with the S6 because of the issues you discussed, but I'm not talking about the S6 as an exemplar of anything.

I mention these STREAM rates simply to point out that my understanding (at the micro-architectural level) seems to track pretty well with my claim that there's a single memory controller.
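
To put rough numbers on the "1/2 to 2/3 of the PHY rate" rule of thumb, here is a minimal sketch. The 64-bit bus width and the 3200 MT/s transfer rate are assumptions chosen for illustration, not measured figures for any particular phone, and the function names are made up for this example.

```python
def peak_bandwidth_gbs(bus_width_bits, transfers_per_sec):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


def expected_stream_range(peak_gbs):
    """Rule of thumb from the post: the CPU sees roughly 1/2 to 2/3 of peak."""
    return peak_gbs / 2, peak_gbs * 2 / 3


# Assumed example configuration: 64-bit total bus width at 3200 MT/s.
peak = peak_bandwidth_gbs(64, 3200e6)
low, high = expected_stream_range(peak)
print(f"peak {peak:.1f} GB/s, expected STREAM {low:.1f} to {high:.1f} GB/s")
```

A STREAM result inside that range is what the rule of thumb predicts; on its own it does not say how the bus is split into channels.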

imported_ats · Oct 16, 2016

coercitiv said:
Can you give us more details on this please?

A) it is a hardware based solution
B) the hardware is actually integrated, aka it really isn't independent clusters

big.LITTLE is primarily a software solution where all cores are exposed to software, and it relies on software scheduling and data movement. Apple's solution is basically hardware-based core hopping with a tightly coupled large and small core.

imported_ats · Oct 16, 2016

name99 said:
Given this background, how do you believe these independent memory controllers are operating? They have to run all 4 pieces of RAM silicon in sync so the actual "controlling" part of the memory controller is surely logically one entity?

They don't have to run them in sync. Each DRAM device can be run independently. They could have 4 separate sequencers, one for each DRAM chip, and therefore run them separately. They could just as easily have 2 or 3.

Likewise, the presence of 4 control registers doesn't actually mean that there are 4 separate sequencers, merely that the granularity of priority control is per DRAM. Without detailed knowledge of the hardware we cannot know.

Mopetar · Oct 16, 2016

I'm kind of wondering when manufacturers are going to start using Wide I/O memory or something other than LPDDR4. It offers around double the bandwidth and is supposed to use less power as well. Someone like Apple could probably afford to put down the money necessary to get the production online and then buy the entire run.

name99 · Oct 16, 2016

imported_ats said:
They don't have to run them in sync. Each DRAM device can be run independently. They could have 4 separate sequencers, one for each DRAM chip, and therefore run them separately. They could just as easily have 2 or 3.

Likewise, the presence of 4 control registers doesn't actually mean that there are 4 separate sequencers, merely that the granularity of priority control is per DRAM. Without detailed knowledge of the hardware we cannot know.

If they are run independently, the timings on the 4 sets of 16 pins/RAM chip will drift apart. That's what I mean by saying they have to run in sync. At the level of MICRO-ARCHITECTURAL analysis (as opposed to at the level of implementation) they form one memory controller.

imported_ats · Oct 16, 2016

Mopetar said:
I'm kind of wondering when manufacturers are going to start using Wide I/O memory or something other than LPDDR4. It offers around double the bandwidth and is supposed to use less power as well. Someone like Apple could probably afford to put down the money necessary to get the production online and then buy the entire run.

Wide I/O is expensive. It also requires expensive packaging. It pretty much lost out in mind share because of that cost. If cost isn't an issue, you go with HMC/HBM, if it is, you go with LPDDR4.

imported_ats · Oct 16, 2016

name99 said:
If they are run independently, the timings on the 4 sets of 16 pins/RAM chip will drift apart. That's what I mean by saying they have to run in sync. At the level of MICRO-ARCHITECTURAL analysis (as opposed to at the level of implementation) they form one memory controller.

Why do you think they'll drift apart any more than the timings on a 64b DIMM channel will drift apart? And even if they do, does that really matter? The reality is that while there are tradeoffs between 1 64b channel, 2 32b channels, and 4 16b channels, those tradeoffs have nothing to do with timings drifting apart.

name99 · Oct 16, 2016

imported_ats said:
Why do you think they'll drift apart any more than the timings on a 64b DIMM channel will drift apart? And even if they do, does that really matter? The reality is that while there are tradeoffs between 1 64b channel, 2 32b channels, and 4 16b channels, those tradeoffs have nothing to do with timings drifting apart.

What I mean is that you aren't seriously considering programming different timings (or anything else different, like different refresh rates) into the four separate DRAM chips, are you? What possible benefit would that have?
The benefit of a 64-bit channel is the usual benefit of parallelism --- higher bandwidth. You want to pull in/push out data a cache line at a time, so say 64 bytes. Ideal would be to have a bus that wide, but that's too many pins, so you use the largest number of pins you can practically use, and pulse them 8 times to pull in the entire cache line.
Running those 64 pins as two INDEPENDENT sets of 32 pins buys you what advantages? It just means that the timings of the two independent sets will be slightly different and every so often you'll have to waste a cycle syncing between the internal bus and these two external buses. It's called (S)dram for a reason.

Is your point something like rather than sending ONE address at a time, which gets sent to all 4 chips, each of which then delivers 16 bits of the total, you want to, say, be able to send FOUR simultaneous different addresses in parallel, one to each chip?
There's a certain class of system for which this sort of thing is very valuable (and that's what the multiple memory controller systems I described above achieve in some sense, though with a baseline width of 64 or 128 bits wide) but it would make no sense on a phone. Usually you would not have four (or two) independent addresses in play, but you would have to take four times, or twice, as many "pulses" to transfer the cache line to the CPU. This seems to me for any standard phone to be a poor tradeoff. Are you suggesting that this is actually used by some phones, because I'd be very interested in hearing details about that and what compelled them to make this choice.
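
The pulse-count arithmetic above can be checked with a minimal sketch. It assumes the 64-byte cache line used in the post; the function name is made up for this example.

```python
CACHE_LINE_BYTES = 64  # cache line size used in the post


def transfers_per_cache_line(channel_width_bits, line_bytes=CACHE_LINE_BYTES):
    """Number of data-bus transfers ("pulses") needed to move one cache line."""
    line_bits = line_bytes * 8
    if line_bits % channel_width_bits:
        raise ValueError("cache line must be a whole number of transfers")
    return line_bits // channel_width_bits


for width in (64, 32, 16):
    print(f"{width}-bit channel: {transfers_per_cache_line(width)} transfers per line")
```

A single 64-bit channel needs 8 transfers per line, a 32-bit channel twice as many and a 16-bit channel four times as many, which is the tradeoff being described.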

imported_ats · Oct 16, 2016

name99 said:
Running those 64 pins as two INDEPENDENT sets of 32 pins buys you what advantages? It just means that the timings of the two independent sets will be slightly different and every so often you'll have to waste a cycle syncing between the internal bus and these two external buses. It's called (S)dram for a reason.

Running them independently buys you both concurrency and higher efficiency. DRAM channel efficiency is inversely related to channel width. It's one of the big reasons that GPUs, for instance, run their channels independently.
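
One possible reading of "efficiency falls with channel width" is over-fetch: if the DRAM has a fixed minimum burst length, a wider channel moves more bytes per access than a 64-byte request needs. The sketch below assumes a burst length of 16 transfers and a 64-byte request, both chosen for illustration; the function names are made up, and this is not necessarily the effect the post has in mind.

```python
def min_access_bytes(channel_width_bits, burst_length):
    """Smallest amount of data one burst moves on a channel of this width."""
    return channel_width_bits // 8 * burst_length


def useful_fraction(channel_width_bits, burst_length, request_bytes=64):
    """Fraction of transferred bytes the requester actually asked for."""
    burst_bytes = min_access_bytes(channel_width_bits, burst_length)
    bursts = -(-request_bytes // burst_bytes)  # ceiling division
    return request_bytes / (bursts * burst_bytes)


ASSUMED_BURST_LENGTH = 16  # assumption for illustration

for width in (64, 32, 16):
    frac = useful_fraction(width, ASSUMED_BURST_LENGTH)
    print(f"{width}-bit channel: {frac:.0%} of transferred bytes are useful")
```

Under these assumptions a 64-bit channel fetches 128 bytes to satisfy a 64-byte request, while 32-bit and 16-bit channels fetch exactly 64 bytes.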

name99 · Oct 16, 2016

imported_ats said:
Running them independently buys you both concurrency and higher efficiency. DRAM channel efficiency is inversely related to channel width. It's one of the big reasons that GPUs, for instance, run their channels independently.

GPUs are optimizing for bandwidth. CPUs should be optimizing for latency.
If the memory controller has to serve both (like on a SoC) it would make far more sense to optimize for latency than for bandwidth.

name99 · Oct 16, 2016

However, now that I think about it, I could give two arguments for why you might prefer to use the 4 independent 16-bit wide channels (or 2 independent 32-bit wide channels).

The first would be that each request would only have to activate one chip for transferring the data; in other words, it would be lower energy. Of course the chip would stay active 4x as long, but overall it seems like a minor energy win. On the other hand you're slowing down your CPU a little for the sake of that win, so overall is there a net gain?

Secondly using the chips in that way would give you more INDEPENDENT pages that could be kept open simultaneously. If you were running an extremely aggressive open-page policy then you'd want more pages to be kept open. But of course open-page is NOT a clear win under all circumstances, so once again, while one could make this argument, it's not clear that it's actually very helpful.
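
Both arguments can be counted out in a minimal sketch. It assumes four DRAM chips and 8 banks per chip (the bank count is an assumption for illustration), and the class and method names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Organisation:
    name: str
    independent_channels: int  # channels that each receive their own address
    chips_per_channel: int     # chips driven in lockstep by one address

    def rows_activated_per_request(self):
        # One address opens the same row in every chip ganged on that channel.
        return self.chips_per_channel

    def independent_open_rows(self, banks_per_chip):
        # Each bank holds one open row; ganged chips share the same row choice.
        return self.independent_channels * banks_per_chip


BANKS_PER_CHIP = 8  # assumed for illustration

lockstep = Organisation("1 x 64-bit", independent_channels=1, chips_per_channel=4)
split = Organisation("4 x 16-bit", independent_channels=4, chips_per_channel=1)

for org in (lockstep, split):
    print(org.name,
          "activations per request:", org.rows_activated_per_request(),
          "independent open rows:", org.independent_open_rows(BANKS_PER_CHIP))
```

With these assumptions the split organisation activates one chip per request instead of four and can keep four times as many distinct rows open, at the cost of the longer transfer already noted.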

I tried a few searches to see if I could get any details about this (ideally an academic PDF with full details and simulations, but anything is better than nothing) however I came up short. It does seem to be the case that some of the phone SoCs allow for configuring dual-channel operation, but what I can't find is numbers on whether anyone uses this, and if so, what the performance differences are wrt bandwidth and latency.

Andrei. · Oct 17, 2016

name99 said:
I have no idea what the link you provide has to do with anything. It certainly doesn't contradict anything I said, or explain your claims about 4 memory controllers.

A more interesting and relevant link would appear to be something like the diagram on
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0331f/I1005644.html

I just gave the wrong link the same way you just gave me the same link I gave you.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100131_0000_02_en/nde1417020681666.html

That's a typical mobile system with 4 controllers.

name99 said:
As for STREAM, I've no idea what your point here is.

I mention these STREAM rates simply to point out that my understanding (at the micro-architectural level) seems to track pretty well with my claim that there's a single memory controller.

NO. I just demonstrated that the STREAM scores have absolutely no direct correlation with what the memory controller is capable of and you still refuse to listen. You still don't explain where you get that notion of it being a single controller just by looking at the bandwidth results.

Go back to the ARM website because that's why I linked it to you last night - if you can spend the energy to write all these posts then spend the energy to research the topic.

Start at the interconnect again because that's where you're confused. Again this is just ARM's architecture but Apple's or Qualcomm's isn't going to be much different in the basics. The CPU cluster is connected to a single ACE interface. This is a fixed 2x128bit bi-directional interface meaning that's the first bottleneck for CPU bandwidth. Memory requests go through the bus to the memory "doodads".
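
The bottleneck argument can be sketched numerically. The 128-bit width per direction comes from the post; the interconnect clock and the DRAM peak figure below are hypothetical values picked only to show the arithmetic.

```python
def port_bandwidth_gbs(width_bits, clock_hz):
    """Bandwidth of one direction of a bus port in GB/s (1 GB = 1e9 bytes)."""
    return width_bits / 8 * clock_hz / 1e9


ACE_WIDTH_BITS = 128                   # per direction, from the post
ASSUMED_INTERCONNECT_CLOCK_HZ = 800e6  # hypothetical clock
ASSUMED_DRAM_PEAK_GBS = 25.6           # hypothetical DRAM peak

read_bw = port_bandwidth_gbs(ACE_WIDTH_BITS, ASSUMED_INTERCONNECT_CLOCK_HZ)
# The cluster cannot read faster than the slower of the two stages.
cpu_visible_read_bw = min(read_bw, ASSUMED_DRAM_PEAK_GBS)
print(f"port {read_bw:.1f} GB/s, CPU-visible read {cpu_visible_read_bw:.1f} GB/s")
```

With these made-up numbers the port, not the DRAM, sets the ceiling a CPU benchmark would observe.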

Now go to the DMC documentation:

So by ARM's definition this is a "memory controller". Note things such as the QoS engine and the "memory interface", of which there should only be one per controller, and recall my previous post about evidence of the SoC seemingly having 4.

About system interface:

The DMC-500 interfaces to the rest of the SoC through this interface.

For any attempted accesses that the system makes outside of the programmed address range of the DMC-500, the system interface responds with a non-data error response. According to how you program the DMC-500, it converts the system access information to the correct rank, bank, column, and row access of the external SDRAM that connects to it.

The DMC-400 document is more precise in how memory access on a per-controller basis works:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0466f/CJHDGCFD.html

In particular, we're talking about transactions, limited to 4KB page sizes.

Now going back to your base issue: Yes each controller would be connected to a single DRAM die. And it doesn't matter because:

The primary difference between memory and system interfaces is the definition in the memory map. You can configure memory interfaces to be interleaved across the memory region, for example striping, to enable higher utilization of memory.

Transactions are simply multiplexed back through the bus to the CPU cluster. Again, I ask you how you're telling me that you're able to distinguish memory controller configuration from any other bottleneck higher up in the system?
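
The striping described in the documentation excerpt can be sketched as follows. The 4KB stripe size and the count of four interfaces are assumptions taken from the numbers discussed in this thread, not from a specific SoC, and the function name is made up for this example.

```python
from collections import Counter

STRIPE_BYTES = 4096  # assumed interleave granularity
NUM_INTERFACES = 4   # assumed number of memory interfaces


def route(addr):
    """Map a physical address to (interface index, address within that interface)."""
    stripe, offset = divmod(addr, STRIPE_BYTES)
    interface = stripe % NUM_INTERFACES
    local_addr = (stripe // NUM_INTERFACES) * STRIPE_BYTES + offset
    return interface, local_addr


# A sequential 64 KB read in 64-byte cache lines spreads evenly over all interfaces.
hits = Counter(route(addr)[0] for addr in range(0, 64 * 1024, 64))
print(dict(hits))
```

A streaming access pattern therefore touches every interface equally, so aggregate bandwidth as seen from the CPU looks the same whether the interfaces are called one controller or four.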

name99 said:
It does seem to be the case that some of the phone SoCs allow for configuring dual-channel operation, but what I can't find is numbers on whether anyone uses this, and if so, what the performance differences are wrt bandwidth and latency.

It's simply related to the number of DRAM dies in the package. If you have 2 dies with a single "controller" (i.e. 1 DRAM interface to the dies), then you can still only access 1 die at once and have to resort to chip select addressing to get to the other one (again, look at the documentation).

http://www.nxp.com/files/ftf_2010/Americas/WBNR_FTF10_NET_F0401_PDF.pdf

At this point I'll give up on the distinction between "memory controller" and DRAM interface. Because mobile doesn't use chip select, the latter is for all intents and purposes what you would consider a memory controller in the PC space. You can just go look at die shots to count the "memory controllers": all current SoCs have 4 of them, or even 8 in things like the iPad SoCs.

name99 · Oct 17, 2016

Andrei, this is all very interesting and worth knowing, but you seem determined to avoid accepting that there seems to be a deep ambiguity here in the ARM world.

Consider, for example, the Linley paper on X-Gene 3:
https://www.linleygroup.com/uploads/x-gene-3-white-paper-final.pdf

Figure 2 on page 3 shows what looks very much like what I am calling a MICRO-ARCHITECTURAL memory controller as being an object that controls multiple DRAM chips.

Same for the link I was trying to send you. (God that ARM site sucks...)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100000_0000_00_en/dch1349167911002.html

Don't lash out at me for the fact that this terminology seems all over the place.

name99 · Oct 17, 2016

One more addition.
I think the reason you and I so often butt heads is that for the most part I am asking "why" questions, and you (and people like you) are providing "what" answers.

For example, if you look at this latest dialog stream, MY concerns have to do with WHY one would structure a memory controller in a certain way as compared to various alternatives. Your answers, while giving various correct facts, do not answer that question. And this is the constant pattern I see between us.

It is reasonable for you to say that you do not know the answer to my "why" questions; it is NOT reasonable for you to insist that people are answering my questions and I am just being dogmatic in ignoring the answers. They are NOT answering the questions. They are providing various facts (sometimes relevant, sometimes of dubious relevance) but I am usually not asking for specific facts, I am asking, as I said, for a more abstract, conceptual theory of WHY these facts and not others.

Nothingness · Oct 18, 2016

Andrei. said:
I had prepared this on my own for the eventual A10 piece but might as well end these topic derailing theories here and now:

A10 on the left vs TSMC A9 on the right.

The intriguing part I previously mentioned is the "New Caches" block on the bottom right. If this is part of the front-end, I wonder if that could be some form of micro-op (or even trace) cache, but it would be oddly placed between L1I and L2. Or perhaps Apple went crazy and implemented a TAGE/ITTAGE larger than the I-cache.

One of the sets of small boxes above the I-cache is most likely the tags.
