- Sep 16, 2007
Hey guys, this is my first post on the forums here, but I've been an avid AnandTech reader (great site) for a long time, around 9 years or so.
Since there are a bunch of knowledgeable people here on the forums, I thought I'd ask a few questions that keep running through my head about the new Quad-Core processors from AMD.
From the previews I've seen so far on the net, it seems that the L3 cache (CPU @2GHz) adds about 21ns of additional latency to the caching system, and this L3 latency gets lower as the CPU speed goes up (I think 19ns @2.5GHz). I'm sure AMD's engineers know that this extra latency is offset by the benefit of the shared 2MB L3 cache, otherwise they wouldn't have included an L3 cache at all. Now to my question:
1) Suppose AMD had doubled the L2 cache for each core, so each core had 1MB L2 instead of the current 512KB, and dropped the L3 entirely. That would remove the extra layer of latency mentioned above, but there would also be no shared cache (the CPU would stay around the same size overall). Do you think Barcelona would be slower or faster?
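To make question 1 a bit more concrete, here is a rough Python sketch of the arithmetic. The first function just converts the latency figures quoted above into core clock cycles. The second is a textbook-style average memory access time (AMAT) estimate. Every hit latency and miss rate passed to it below is a made-up placeholder, not a measured Barcelona number, so the answer only shows how the trade-off works, not which design wins.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to core clock cycles."""
    return latency_ns * clock_ghz


def amat_cycles(levels, memory_cycles):
    """Average memory access time, in cycles, for a cache hierarchy.

    levels: list of (added_latency_cycles, local_miss_rate), L1 first.
    memory_cycles: extra cycles paid when every cache level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that get as far as this level
    for added_latency, miss_rate in levels:
        total += reach * added_latency
        reach *= miss_rate
    return total + reach * memory_cycles


# Figures quoted in the post: 21ns @ 2GHz and 19ns @ 2.5GHz.
l3_at_2ghz = ns_to_cycles(21, 2.0)    # 42.0 cycles
l3_at_2_5ghz = ns_to_cycles(19, 2.5)  # 47.5 cycles

# HYPOTHETICAL latencies and miss rates, for illustration only.
with_l3 = amat_cycles([(3, 0.05), (12, 0.40), (l3_at_2ghz, 0.50)], 200)
bigger_l2_no_l3 = amat_cycles([(3, 0.05), (12, 0.32)], 200)

print(f"512KB L2 + 2MB L3: {with_l3:.2f} cycles per access")
print(f"1MB L2, no L3:     {bigger_l2_no_l3:.2f} cycles per access")
```

Two things stand out. Going by the quoted figures, the L3 latency drops in nanoseconds at the higher clock but actually rises in core cycles (42 vs 47.5), which would fit an L3 that does not speed up in step with the cores. And the comparison flips depending on how much a 1MB L2 really cuts the miss rate versus how often the shared L3 catches what the L2 misses, so the question can only be settled with real miss-rate data.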
In some benchmarks on the net, I've seen Intel Core2 processors with 8MB L2 cache perform about 10-20% faster, clock-for-clock, than the same CPU with only 4MB L2 cache. I know these are rare cases and probably synthetic benchmarks, and most other software/games usually get about a 1-5% boost from larger caches. Still, it is a difference, and it can give Intel CPUs a big edge over Barcelona, not because the AMD quad-core is architecturally deficient, but because these programs like more cache, simple as that. Now to my second question:
2) AMD's 45nm CPU called Shanghai, which is supposed to be released in Q4 next year (I don't believe they can release it in 2008 at all, but we can always hope), is supposed to have 6MB L3 cache but the same 512KB L2 cache per core as Barcelona has now. My thought here is this: L2 cache runs faster than L3 cache, so why wouldn't AMD double the L2 cache size and make the L3 cache 4MB, also doubling it from Barcelona's 2MB? That means that instead of Shanghai having 2MB L2/6MB L3, it would have 4MB L2/4MB L3, and the effective CPU size should stay about the same. I'm not an expert on this, but I would assume that the second configuration (4MB/4MB) would allow the processor to perform faster. Any thoughts on this, guys?
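The "about the same size" part of question 2 is just a capacity tally, sketched below in Python. It assumes four cores and only counts megabytes. Equal capacity does not have to mean equal die area, since L2 and L3 arrays may be built with different cell sizes, so treat this as a sanity check on the numbers and nothing more.

```python
CORES = 4  # quad-core, as discussed in the post


def cache_totals_mb(l2_per_core_kb: int, l3_shared_mb: int, cores: int = CORES):
    """Return (total L2 in MB, shared L3 in MB, combined L2+L3 in MB)."""
    l2_total_mb = l2_per_core_kb * cores / 1024
    return l2_total_mb, l3_shared_mb, l2_total_mb + l3_shared_mb


rumored = cache_totals_mb(512, 6)    # 512KB L2 per core, 6MB L3
proposed = cache_totals_mb(1024, 4)  # 1MB L2 per core, 4MB L3

print("rumored  (L2, L3, total):", rumored)   # (2.0, 6, 8.0)
print("proposed (L2, L3, total):", proposed)  # (4.0, 4, 8.0)
```

Both layouts come to 8MB of L2 plus L3, so the trade is 2MB of shared L3 for 2MB of faster, private L2.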
About the Barcelona memory controller: from what I understand, on normal motherboards with a single power plane it runs 400MHz slower than the CPU clock speed, and on motherboards that support split power planes it runs a bit faster, at 200MHz below the CPU speed. This bump in MC clock does seem to give a small boost to performance. Which brings me to my final question:
3) Why doesn't the MC in Barcelona run at full processor speed instead of 200-400MHz slower? Is this to lower the CPU's power usage, or maybe a stability issue? It might have been mentioned somewhere and I missed it. I'm asking because for servers this might be fine, but on the desktop it seems AMD will need every increase in performance it can get, especially now that Intel is so close to releasing their Penryn CPUs, and Phenom will have a hard time competing against them.
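For reference, the memory controller clocks implied by the description above work out as in this small Python sketch. The 400MHz and 200MHz offsets are taken straight from my understanding stated in the post, not from AMD documentation, and the function name is just something made up for the example.

```python
def mc_clock_mhz(cpu_mhz: int, split_power_planes: bool) -> int:
    """Memory controller clock under the offsets described in the post."""
    offset_mhz = 200 if split_power_planes else 400
    return cpu_mhz - offset_mhz


for cpu_mhz in (2000, 2500, 3000):
    single = mc_clock_mhz(cpu_mhz, split_power_planes=False)
    split = mc_clock_mhz(cpu_mhz, split_power_planes=True)
    print(f"CPU {cpu_mhz}MHz: MC {single}MHz (single plane), {split}MHz (split planes)")
```

If the offset really is fixed, the gap matters less at higher CPU clocks: at 2GHz the MC runs at 80% of core speed on a single plane, while at 3GHz it would be closer to 87%.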
*I would like to give my thoughts about K10 based on all the info I've seen on the net so far. Like many of you guys here, I care about how this processor will perform on the desktop. As it seems now, for the majority of apps, AMD Phenom CPUs will be about 5-15% slower clock-for-clock than the new Intel Penryns that will be released. And yes, that is taking into account Phenom running with 1066MHz DDR2 memory and maybe a faster-running memory controller.
*Frequency scaling and multithreaded scaling on Barcelona/Phenom are a bit more efficient than on Intel's CPUs, so the higher Phenom's frequency, the smaller the advantage Penryn will have. And the more threads a program uses, the smaller the Penryn advantage gets as well. So next year, with faster Phenom speeds (hopefully 3GHz in Q1) and more multithreaded software/games, Phenom will get really close to Penryn performance but will probably still be a bit slower clock-for-clock.
*Architecturally, I do believe Barcelona is more advanced than C2D and even Penryn, but the problems I see with it now are small cache sizes and low clock speeds. Imagine if Barcelona were on a 45nm process right now and running at 3.2+GHz with twice the amount of L2/L3 cache. I'm sure it would easily match or beat Penryn. Intel's greater resources and manufacturing strength are really what give them the advantage in this case.
*As I understand it, K10 can only perform 3 instructions per cycle, same as the old K8 (please correct me if I'm wrong), and C2D can do 4-5 instructions in optimal cases. That seems to me one of the weakest points of the K10 architecture. If AMD can release the 45nm Shanghai next year with high clock speeds (3GHz+) and larger caches, and make one change to the K10 architecture, namely allowing it to run at least 4 instructions per clock cycle (not sure how hard that is to do), I really believe Shanghai would perform better than any CPU Intel will have on the market at the time, even ones running at higher clock speeds, for example a 3.8GHz Penryn vs a 3.0GHz Shanghai. Yep, it could be like the P4 vs Athlon 64 days.
Btw, the 4-instruction K10 Shanghai is just something I added. I didn't read anything about it, and I don't know if it's possible at all for AMD to do that on K10 anyway.
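To put rough numbers on the 3.8GHz vs 3.0GHz example, here is a back-of-the-envelope Python sketch. It multiplies peak instructions per cycle by clock speed, which is only a theoretical ceiling, since real sustained IPC sits well below the peak width and depends heavily on caches, branch prediction, and the workload. The widths are the ones quoted above (3 for K10, 4 for C2D/Penryn and the imagined wider Shanghai), not confirmed specs.

```python
def peak_ginstr_per_s(instr_per_cycle: float, clock_ghz: float) -> float:
    """Theoretical peak throughput in billions of instructions per second."""
    return instr_per_cycle * clock_ghz


def ipc_advantage_needed(slower_clock_ghz: float, faster_clock_ghz: float) -> float:
    """Sustained-IPC ratio the lower-clocked chip needs just to break even."""
    return faster_clock_ghz / slower_clock_ghz


k10_3wide = peak_ginstr_per_s(3, 3.0)       # K10 as described in the post
shanghai_4wide = peak_ginstr_per_s(4, 3.0)  # the imagined 4-wide Shanghai
penryn_4wide = peak_ginstr_per_s(4, 3.8)    # the 3.8GHz Penryn in the example

print(f"peak: K10 {k10_3wide:.1f}, 4-wide Shanghai {shanghai_4wide:.1f}, "
      f"Penryn {penryn_4wide:.1f} (billion instr/s)")
print(f"break-even IPC ratio at 3.0 vs 3.8GHz: {ipc_advantage_needed(3.0, 3.8):.2f}")
```

So on paper a 3.0GHz part needs roughly 27% more work done per clock than a 3.8GHz part just to tie it, and going from 3 to 4 instructions per cycle raises the ceiling without guaranteeing that sustained gain.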
Sorry for the long post. All comments, corrections, and answers are welcome, so please share what you think. Thanks.
Kuzi,