AMD Istanbul 6 core demoed


Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: jones377
btw a tip to you guys interested in following AMD and Istanbul. The one feature it brings that you should remember is the probe filter. It may seem like just another insignificant checklist item to the untrained eye but believe me it won't be, especially for 4 and 8 socket systems

jones is the HT scout effectively reducing the latency for hops to second or third-neighboring caches? Is that where this improvement hits home?
 

Flipped Gazelle

Diamond Member
Sep 5, 2004
6,666
3
81
Originally posted by: Idontcare

jones is the HT scout effectively reducing the latency for hops to second or third-neighboring caches...

Sounds like "jones" is a pretty busy fellow! :laugh:

 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: jones377

Common sense works if you know a bit about the things you talk about. I had never even heard of VMMark before a while back when it was shown with various configs. But it has been known for a long time that Opterons have had much lower virtualisation overhead than contemporary Xeons.

Thing is though, VMMark measures the performance of the various virtual servers running on the machine. So even if Nehalem-EX does nothing to reduce the virtualisation overhead compared with Dunnington (something I find very hard to believe, since Intel has improved it in every processor update before), it will STILL perform better in VMMark because the apps within the virtual environments will see the improvements brought by the Nehalem architecture, including HT. Don't you agree?

Right now Shanghai is only a hair ahead of Dunnington in VMMark. Will Istanbul improve more over Shanghai than Nehalem-EX over Dunnington? I really doubt it. Again, no ultimate proof, just an educated guess...

We're getting into areas that are murky again...and believe me, I'm not just saying that to be obstinate (though I don't blame you for thinking so).

Here are some of the questions I have...
1. It's evident that i7's HT has an issue when 4 threads are running. Are there other times (say at 3, 5, 10, or 12 threads) that those issues crop up?

2. How much does virtualization affect the i7's performance?

3. Is there a number of VMs that is affected by HT more than others (say 4 VMs per core)?

4. Is there a change in HT performance when used in multisocket? (by that I mean does an 8 thread program on dual socket i7 have the same problem that a 4 thread program does on single socket?)

5. Are virtual processors affected more than virtual machines?


We've seen enough exceptions that (at least for me) I can't really invest myself in a common sense explanation...there are too many possible variables that wouldn't show up except in hindsight.
For example, common sense would not have shown the problem at 4 threads...

The reason I focus on VMMark is that I know of no other virtualization benchmark that is acceptable. Virtualization (at least with the majority of my clients) is by far the biggest buzzword going right now...especially as a cost-cutting effort.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: Flipped Gazelle
Originally posted by: Idontcare

jones is the HT scout effectively reducing the latency for hops to second or third-neighboring caches...

Sounds like "jones" is a pretty busy fellow! :laugh:

Oh yeah, he's a burb's playa fosho
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Idontcare
Originally posted by: jones377
btw a tip to you guys interested in following AMD and Istanbul. The one feature it brings that you should remember is the probe filter. It may seem like just another insignificant checklist item to the untrained eye but believe me it won't be, especially for 4 and 8 socket systems

jones is the HT scout effectively reducing the latency for hops to second or third-neighboring caches? Is that where this improvement hits home?

I'll quote what I saw on the subject, but I absolutely believe what Jones is saying here...

"The 16-core Shanghai system produced throughput numbers in the range of 25,000 MB/s. The 24-core Istanbul box, by contrast, hit about 42,000 MB/s...Why the huge performance gain with the addition of more cores, given that Stream is typically considered, at least partially, a bandwidth-bound benchmark? And why the magnitude of the gain, with only 50% more cores"
"Part of the answer, it seems, may be a feature new to Istanbul that AMD calls HT assist (presumably for HyperTransport assist). This feature is what the company calls a probe filter (and may more commonly be called a snoop filter) that functions to reduce traffic on socket-to-socket HyperTransport links by storing an index of all caches and preventing unnecessary coherency synchronization requests. Current Opteron systems use a broadcast-based probe protocol, sending probe requests to all sockets. Istanbul, instead, either knows that no probes are required or is able to do a directed probe to a single socket. (Although it may still use broadcasts in certain, specific situations.) Istanbul's probe filter stores its data in the processor's L3 cache. The amount of cache space dedicated to probe filter storage, AMD says, will be configurable in the BIOS, and the more space dedicated to probe filter storage, the more granular its operation will be"
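Mechanically, the directory behavior described in that quote can be sketched in a few lines. This is a hypothetical toy model, not AMD's actual filter layout or protocol:

```python
# Toy model of a directory-style probe (snoop) filter, as described above:
# an index of which remote socket, if any, caches a given line, consulted
# before any HyperTransport probes go out. Hypothetical sketch, not AMD's design.

probe_filter = {}  # cache-line address -> socket number currently caching it


def sockets_to_probe(addr, local_socket, num_sockets):
    """On an L3 miss, return which sockets must be probed for `addr`."""
    owner = probe_filter.get(addr)
    if owner is None or owner == local_socket:
        return []          # no remote copy: read local memory, zero probes
    return [owner]         # directed probe to exactly one socket


def broadcast_probe(local_socket, num_sockets):
    """The pre-Istanbul baseline: probe every other socket, every time."""
    return [s for s in range(num_sockets) if s != local_socket]
```

In an 8-socket box the broadcast path always sends 7 probes per miss; with the filter it is usually 0 or 1 directed probes, which is the HT-bandwidth saving being discussed.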
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Idontcare
Originally posted by: jones377
btw a tip to you guys interested in following AMD and Istanbul. The one feature it brings that you should remember is the probe filter. It may seem like just another insignificant checklist item to the untrained eye but believe me it won't be, especially for 4 and 8 socket systems

jones is the HT scout effectively reducing the latency for hops to second or third-neighboring caches? Is that where this improvement hits home?

Do they call it HT scout now? I assume you mean the probe filter quoted above. In a current Opteron system, anytime there is an L3 cache miss, and subsequent L1/L2 misses on the other on-die cores due to the exclusive caches, the local CPU sends out a broadcast request for the cacheline it wants on all cc-HT links, and it has to wait for all sockets in the system to reply before it proceeds to read the cacheline from memory, even if it's stored in local memory.

This wastes a lot of bandwidth and adds latency. I think 20ns per hop, but I am not 100% sure about that; it might be twice that for a roundtrip, which includes lookups in the remote caches as well. Getting this info from the L3 probe filter should be a lot faster than that and won't create any unnecessary traffic on the HT links, a big issue on 8-socket systems and to a lesser degree on 4-socket systems.

The probe filter would potentially remove this added latency and bandwidth. The question is how effective the probe filter will be. On average, what percentage of cacheline locations will be found in the filter? Then of course the probe filter will use up space in the L3 cache that would otherwise be used for data, effectively reducing the size of the L3 cache available to applications. There will be tradeoffs, no doubt: better scaling for multiple-socket systems, but starting from a slightly lower base performance. To illustrate performance and scaling:

no probe filter
1 - 1.8 - 3.4 - 6
with probe filter
.95 - 1.9 - 3.7 - 7

(numbers totally made up)

This thing might be a real bitch to validate. I hope AMD takes their time with it and avoids another TLB fiasco... But I remain hopeful; it's the first good news I've heard from AMD in a long time and, if done right, it could differentiate Opteron from Xeon yet again in the Nehalem era. Anyone know if Gainestown or Nehalem-EX will have a similar feature? Current Xeon chipsets from Intel, and IBM, have it, and it would be kinda strange if it were to disappear again with Nehalem. But even if Nehalem-EX doesn't have it, I suspect its raw performance will surpass Istanbul's. It's hard to beat 8 cores or 16 threads with only 6 cores or 6 threads per socket
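A rough way to see why the filter matters more as sockets are added (message counts simplified for illustration, made up just like the scaling numbers above):

```python
# Per-L3-miss HT message counts: broadcast coherency vs. a probe filter.
# Simplified illustration -- real protocols have more message types than this.

def broadcast_msgs_per_miss(sockets):
    """Broadcast protocol: probe every other socket and wait for every reply."""
    return 2 * (sockets - 1)   # one probe out + one response back per remote socket

def filtered_msgs_per_miss(hit_in_filter):
    """Probe filter: local directory lookup, then at most one directed probe."""
    return 2 if hit_in_filter else 0

for n in (2, 4, 8):
    print(f"{n} sockets: broadcast {broadcast_msgs_per_miss(n)} msgs/miss, "
          f"filtered 0-{filtered_msgs_per_miss(True)} msgs/miss")
```

The broadcast cost grows with socket count while the filtered cost stays flat, which is why the win should be biggest on 4- and 8-socket systems.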
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Viditor
Originally posted by: jones377

Common sense works if you know a bit about the things you talk about. I had never even heard of VMMark before a while back when it was shown with various configs. But it has been known for a long time that Opterons have had much lower virtualisation overhead than contemporary Xeons.

Thing is though, VMMark measures the performance of the various virtual servers running on the machine. So even if Nehalem-EX does nothing to reduce the virtualisation overhead compared with Dunnington (something I find very hard to believe, since Intel has improved it in every processor update before), it will STILL perform better in VMMark because the apps within the virtual environments will see the improvements brought by the Nehalem architecture, including HT. Don't you agree?

Right now Shanghai is only a hair ahead of Dunnington in VMMark. Will Istanbul improve more over Shanghai than Nehalem-EX over Dunnington? I really doubt it. Again, no ultimate proof, just an educated guess...

We're getting into areas that are murky again...and believe me, I'm not just saying that to be obstinate (though I don't blame you for thinking so).

Here are some of the questions I have...
1. It's evident that i7's HT has an issue when 4 threads are running. Are there other times (say at 3, 5, 10, or 12 threads) that those issues crop up?

2. How much does virtualization affect the i7's performance?

3. Is there a number of VMs that is affected by HT more than others (say 4 VMs per core)?

4. Is there a change in HT performance when used in multisocket? (by that I mean does an 8 thread program on dual socket i7 have the same problem that a 4 thread program does on single socket?)

5. Are virtual processors affected more than virtual machines?


We've seen enough exceptions that (at least for me) I can't really invest myself in a common sense explanation...there are too many possible variables that wouldn't show up except in hindsight.
For example, common sense would not have shown the problem at 4 threads...

The reason I focus on VMMark is that I know of no other virtualization benchmark that is acceptable. Virtualization (at least with the majority of my clients) is by far the biggest buzzword going right now...especially as a cost-cutting effort.

Yes, those are all big unknowns, because we don't know yet what Intel has done with virtualisation on Nehalem. Or do we, and maybe I missed it? But there's one question you didn't ask that is highly relevant... How big is the virtualisation overhead on a server running multiple VMs in the first place? By that I mean, what percentage of CPU time is spent on VM operations as opposed to running the applications inside the VMs?

If this is only 10%, then making virtualisation infinitely quick would only improve overall performance by 10% for the whole system, to use VMMark as the example. Now consider running several SAP-SD instances inside VMs on Nehalem and Shanghai servers. Standalone, the Nehalem system performs 81% faster than Shanghai, as reported here. The VM overhead on Nehalem compared with Shanghai would have to be pretty darn big for Shanghai to catch up, something I don't think will be possible.
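The arithmetic behind that argument, with the 81% SAP-SD gap from above and a hypothetical 10% overhead figure plugged in:

```python
# Back-of-envelope check on the overhead argument. The 1.81x standalone gap is
# the SAP-SD figure quoted above; the 10% overhead number is hypothetical.

def virtualised_perf(standalone, overhead):
    """Useful work left after a fraction `overhead` of CPU time goes to the hypervisor."""
    return standalone * (1.0 - overhead)

shanghai = virtualised_perf(1.00, 0.10)   # Shanghai baseline with an assumed 10% overhead
nehalem_standalone = 1.81                 # 81% faster standalone

# Overhead Nehalem would need just to fall back to parity with Shanghai:
# 1.81 * (1 - o) == 0.90  ->  o = 1 - 0.90 / 1.81
parity_overhead = 1.0 - shanghai / nehalem_standalone
print(f"{parity_overhead:.1%}")           # ~50% -- an implausibly large overhead
```

Under these assumed numbers, Nehalem's hypervisor overhead would have to eat roughly half its CPU time before Shanghai even reached parity.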

As far as HT is concerned, since you're still beating on that horse... Databases like SQL serving large user bases tend to hammer the server with way more threads than it can execute at one time. With VMs you can partition the hardware, so if you have software that can't use many threads, you don't assign that VM many cores in the first place. Problem solved.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: jones377
It's hard to beat 8 cores or 16 threads with only 6 cores or 6 threads per socket

But it does happen...Barcelona's 16 cores beat Dunnington's 24 cores at least.
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Viditor
Originally posted by: jones377
It's hard to beat 8 cores or 16 threads with only 6 cores or 6 threads per socket

But it does happen...Barcelona's 16 cores beat Dunnington's 24 cores at least.

I think you know what I mean. Anyway I'm off for the day.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: jones377
Do they call it HT scout now? I assume you mean the probe filter quoted above. In a current Opteron system, anytime there is an L3 cache miss, and subsequent L1/L2 misses on the other on-die cores due to the exclusive caches, the local CPU sends out a broadcast request for the cacheline it wants on all cc-HT links, and it has to wait for all sockets in the system to reply before it proceeds to read the cacheline from memory, even if it's stored in local memory.

This wastes a lot of bandwidth and adds latency. I think 20ns per hop, but I am not 100% sure about that; it might be twice that for a roundtrip, which includes lookups in the remote caches as well. Getting this info from the L3 probe filter should be a lot faster than that and won't create any unnecessary traffic on the HT links, a big issue on 8-socket systems and to a lesser degree on 4-socket systems.

Thanks Viditor and jones377. Oh man oh man do I know about this and I agree with you.

This graph on the impact of varying broadcast interprocessor communication protocols is from my doctoral dissertation. At the time, my experiments and observations were based on a 24-node Athlon Beowulf cluster (how fitting, AMD tech then too).

I am not surprised to see AMD's engineers have been aggressively working on improving the broadcast protocol; they are a smart bunch and my stuff wasn't rocket science. But I was just curious to know if this (as you laid it out) really was what AMD had done, or if my impressions of "HT assist" were wrong. (thanks for the correction)

(side note - see how the projection in green shows performance scaling peaking around 16 processing cores and then actually degrading if you add more cores to the system? This might seem counter-intuitive, but Sandia National Labs recently confirmed it to be true)
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Idontcare
Originally posted by: jones377
Do they call it HT scout now? I assume you mean the probe filter quoted above. In a current Opteron system, anytime there is an L3 cache miss, and subsequent L1/L2 misses on the other on-die cores due to the exclusive caches, the local CPU sends out a broadcast request for the cacheline it wants on all cc-HT links, and it has to wait for all sockets in the system to reply before it proceeds to read the cacheline from memory, even if it's stored in local memory.

This wastes a lot of bandwidth and adds latency. I think 20ns per hop, but I am not 100% sure about that; it might be twice that for a roundtrip, which includes lookups in the remote caches as well. Getting this info from the L3 probe filter should be a lot faster than that and won't create any unnecessary traffic on the HT links, a big issue on 8-socket systems and to a lesser degree on 4-socket systems.

Thanks Viditor and jones377. Oh man oh man do I know about this and I agree with you.

This graph on the impact of varying broadcast interprocessor communication protocols is from my doctoral dissertation. At the time, my experiments and observations were based on a 24-node Athlon Beowulf cluster (how fitting, AMD tech then too).

I am not surprised to see AMD's engineers have been aggressively working on improving the broadcast protocol; they are a smart bunch and my stuff wasn't rocket science. But I was just curious to know if this (as you laid it out) really was what AMD had done, or if my impressions of "HT assist" were wrong. (thanks for the correction)

(side note - see how the projection in green shows performance scaling peaking around 16 processing cores and then actually degrading if you add more cores to the system? This might seem counter-intuitive, but Sandia National Labs recently confirmed it to be true)

Insomnia kicking in... As I suspected, you know this stuff better than I do, even though I think I've grasped the gist of it. But you're right, any broadcast scheme should level out in performance and then decline with increasing sockets, because the total coherency traffic grows roughly quadratically with socket count. That's another argument for HT3 (what's taking you so long, AMD?!?)

The Sandia study was an interesting one. I can imagine the HPC guys' frustration with this core race: core counts projected to keep climbing without a corresponding increase in available memory bandwidth.
 

BLaber

Member
Jun 23, 2008
184
0
0
While people think that HT assist + 2 more cores on Istanbul is the main reason behind the good Stream benchmark result compared with Shanghai, I have read somewhere else that the Stream benchmark doesn't rely on INTER PROCESS COMMUNICATION (the Stream benchmark has little to do with the HT bus). Meaning HT assist is a great feature, but it wasn't what produced the boost in Stream scores that Istanbul managed over Shanghai.
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: BLaber
While people think that HT assist + 2 more cores on Istanbul is the main reason behind the good Stream benchmark result compared with Shanghai, I have read somewhere else that the Stream benchmark doesn't rely on INTER PROCESS COMMUNICATION (the Stream benchmark has little to do with the HT bus). Meaning HT assist is a great feature, but it wasn't what produced the boost in Stream scores that Istanbul managed over Shanghai.

It still generates coherency traffic over the HT links, even though it's completely NUMA as far as the OS scheduler is concerned, because of the broadcast nature of the cache coherency scheme (on Shanghai). Beyond that it's hard to gauge from AMD's numbers since the system configurations were not disclosed. They could have used slower memory on the Shanghai for all we know.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Cookie Monster
Do these 6 cores support HT 3.x? Doesn't the current Shanghai use only HT 1.0?

The Istanbul is HT 3.0, and Shanghai is HT 2.0 IIRC...

BLaber, you may have a good point there...the difference in the bandwidth numbers is pretty close to that of the difference between HT 2.0 and 3.0. It might be more of a HT difference...though I still agree with Jones that the probe filter will be a very big deal indeed.
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Viditor
Originally posted by: Cookie Monster
Do these 6 cores support HT 3.x? Doesn't the current Shanghai use only HT 1.0?

The Istanbul is HT 3.0, and Shanghai is HT 2.0 IIRC...

BLaber, you may have a good point there...the difference in the bandwidth numbers is pretty close to that of the difference between HT 2.0 and 3.0. It might be more of a HT difference...though I still agree with Jones that the probe filter will be a very big deal indeed.

But now we have Johan claiming that Shanghai scores 25GB/s with both HT2 and HT3 here. I am not sure what to make of this. It would be nice to have bandwidth results for 1 socket and 2 sockets as well. If this new Istanbul result is more than 4x that of a single socket, then something else must be going on here.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: BLaber
While people think that HT assist + 2 more cores on Istanbul is the main reason behind the good Stream benchmark result compared with Shanghai, I have read somewhere else that the Stream benchmark doesn't rely on INTER PROCESS COMMUNICATION (the Stream benchmark has little to do with the HT bus). Meaning HT assist is a great feature, but it wasn't what produced the boost in Stream scores that Istanbul managed over Shanghai.

Stream does rely on interprocessor communications if it has been compiled to operate as multi-processor (multi-threaded) code.

The STREAM FAQ highlights the fact that the end user decides how their multi-processor compiled code is going to communicate (the IPC choice):

Multiprocessor Runs
If you want to run STREAM on multiple processors, then the situation is not quite so easy.
First, you need to figure out how to run the code in parallel. There are several choices here: OpenMP, pthreads, and MPI.

A second caveat with STREAM is the problem size, and again we don't know what AMD selected here for their demo:

Adjust the Problem Size
STREAM is intended to measure the bandwidth from main memory. It can, of course, be used to measure cache bandwidth as well, but that is not what I have been publishing at the web site. Maybe someday....

The general rule for STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 Million elements -- whichever is larger.

So, for a uniprocessor machine with a 256kB L2 cache (like an older PentiumIII, for example), each array needs to be at least 128k elements. This is smaller than the standard test size of 2,000,000 elements, which is appropriate for systems with 4 MB L2 caches. There should be relatively little difference in the performance of different sizes once the size of each array becomes significantly larger than the cache size, but since there are some differences (typically associated with TLB reach), for comparability I require that results even for small cache machines use 1 million elements whenever possible. This requires only 22 MB, so it should be workable on even a 32 MB machine.

If this size requirement is a problem and you are interested in submitting results on a system that cannot meet this criterion, e-mail me and we can discuss the issues.

For an automatically parallelized run on (for example) 16 cpus, each with 8 MB L2 caches, the problem size must be increased to at least N=64,000,000. This will require a lot of memory! (about 1.5 GB). If you get much bigger than this, you will need to compile with 64-bit addressing, and once N exceeds 2 billion, you will need to be sure to use 64-bit integers. (Yes, I have run lots of cases bigger than this -- not on peecees, of course!)

So we can be quite sure that AMD's Stream implementation does rely on interprocessor communication (that's the part where you have to choose between OpenMP, pthreads, and MPI), but we don't know their compile options, nor do we know the problem size.
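The FAQ's sizing rule is easy to apply; plugging in its own 16-CPU example reproduces the quoted N=64,000,000 (the FAQ appears to use decimal MB here):

```python
# STREAM problem-size rule from the FAQ quoted above: each array must hold at
# least 4x the summed last-level cache, or 1 million elements, whichever is larger.

ELEM_BYTES = 8  # STREAM arrays hold doubles


def min_stream_elements(num_caches, llc_bytes):
    from_cache_rule = 4 * num_caches * llc_bytes // ELEM_BYTES
    return max(from_cache_rule, 1_000_000)


def total_bytes(n):
    return 3 * n * ELEM_BYTES  # STREAM keeps three arrays: a, b, c


# FAQ example: 16 CPUs with 8 MB caches each -> N = 64,000,000, about 1.5 GB total.
n = min_stream_elements(16, 8_000_000)
print(n, total_bytes(n) / 1e9)  # 64000000 1.536
```

So even without knowing AMD's compile options, the cache sizes of the demo boxes pin down roughly how big the arrays had to be to count as a main-memory measurement.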
 

jones377

Senior member
May 2, 2004
451
47
91
According to Johan, AMD has also improved the memory controller in Istanbul. So that massive bandwidth increase is the result of HT assist plus an improved memory controller, with an as-yet-unknown contribution from each.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: jones377
According to Johan, AMD has also improved the memory controller in Istanbul. So that massive bandwidth increase is the result of HT assist plus an improved memory controller, with an as-yet-unknown contribution from each.

I enjoyed reading the article, thanks for the link, but I don't see anywhere in it that the IMC has been improved. Where are you seeing that part?

I see where memory sub-system bandwidth is improved because the effective latency of the L3$ is reduced, but that's not related to the IMC (or is it?).
 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: Idontcare
Originally posted by: jones377
According to Johan AMD has also improved the memory controller in Istanbul. So that massive bandwidth increase is the result of HT assist and an improved memory controller with an unknown so far contribution from both.

I enjoyed reading the article, thanks for the link, but I don't see anywhere in it that the IMC has been improved. Where are you seeing that part?

I see where memory sub-system bandwidth is improved because the effective latency of the L3$ is reduced, but that's not related to the IMC (or is it?).

Yeah it's hidden deep in there.

This means that especially HPC applications, with many threads all working on their own data, will benefit from the higher effective bandwidth. Besides HT assist, AMD has now confirmed to us that the memory controller has been tuned quite a bit. This higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons in many HPC applications.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Originally posted by: jones377
Yeah it's hidden deep in there.

This means that especially HPC applications, with many threads all working on their own data, will benefit from the higher effective bandwidth. Besides HT assist, AMD has now confirmed to us that the memory controller has been tuned quite a bit. This higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons in many HPC applications.

Ah, cool, thanks for taking the time to hold my hand and point it out :thumbsup:

I'm glad to see AMD invested resources into further improving the IMC. I'm curious why there was no tweaking to the cores, though.

If it truly is just an uncore optimization, then is there any reason it would not transfer to Shanghai quads in a future stepping? Or maybe Shanghai quads disappear entirely and are replaced with harvested quad-core Istanbuls? (That would make the most sense to me, as Shanghai Opterons are not exactly a volume segment; you don't need 10k wspm to meet demand there.)
 