Penryn and accuracy in articles

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I was reading this article and one thing about it really bothered me:

This version of Penryn is dual-core, and the first quad-core Penryn chips will simply be two of these on a single package, although later on we may see a single-die solution. At 410 million transistors, we expect a dual-core Penryn to have a 6MB shared L2 cache (up from 4MB in Conroe). The logic part of the Penryn core will be mostly evolutionary from Conroe, but do expect additional functionality and performance from more than just a larger cache.

If we assume that 288M transistors (6T SRAM) will be used by the 6MB cache, that leaves 122M transistors for L1 cache and the rest of the core. Applying the same calculation to Conroe gives us 99M transistors left over, meaning that there are roughly 23% more core-logic, control and L1 transistors being used in Penryn than in Conroe.
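For reference, the article's arithmetic can be reproduced with a short sketch. It assumes 1 MB = 8,000,000 bits, which is what the round 288M figure implies rather than something the article states, and it takes Conroe's total as the 291M figure quoted later in this thread. The function name is made up for illustration.

```python
# Reproduces the article's storage-cells-only estimate.
# Assumption: 1 MB = 8e6 bits (inferred from the article's round numbers).
BITS_PER_MB = 8_000_000
T_PER_CELL = 6  # transistors in a 6T SRAM storage cell

def naive_core_budget(total_transistors, cache_mb):
    """Transistors left after subtracting only the 6T storage cells."""
    return total_transistors - cache_mb * BITS_PER_MB * T_PER_CELL

penryn = naive_core_budget(410_000_000, 6)  # 122,000,000
conroe = naive_core_budget(291_000_000, 4)  # 99,000,000
growth = penryn / conroe - 1                # roughly 0.23

print(penryn, conroe, f"{growth:.1%}")
```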

The calculation here is awful and leads to a bad conclusion. A bit of cache costs much more than the 6 transistors of the storage cell - there are many sources of overhead: column muxes, hierarchical bitline logic, sense amps (possibly), bitline precharge devices, write drivers, redundant rows/columns, decoders, tags, and ECC data. If you account for the actual transistor count of a cache, you'll get a significantly higher number. I think you'll end up with about the same number of transistors left over for the core as Conroe had.

Tags are huge: with 64-byte cache lines and a 6MB cache, that's 98,304 lines. 16-way associative => 6,144 sets, so the set a block ends up in determines ~13 bits of the address (log2(6144) is about 12.6). A 64-byte line means you don't need to store the low 6 bits either. If you have a 32-bit physical address space, that leaves about 13 bits of tag per line (17 bits if you have a 36-bit physical address space). For multiprocessor support, you need to know what state the cache line is in (I believe Intel's chips use 4 states - modified, exclusive, shared, invalid): 2 more bits. This alone gives you another 98,304 * 15 = 1,474,560 bits, times 6 transistors per bit = 8.8M transistors.
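The tag estimate can be sketched as below. This is only a rough model under stated assumptions: the offset is taken as log2 of the line size (6 bits for 64 bytes), the index is rounded up to whole bits because 6,144 sets is not a power of two, and the 2 state bits come from the post. The exact total shifts a little depending on how those bits are counted. The function name and parameters are hypothetical.

```python
def tag_array_transistors(cache_bytes, line_bytes, ways, phys_addr_bits,
                          state_bits=2, t_per_bit=6):
    """Storage-cell transistors for tags plus coherence state (rough model)."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = (line_bytes - 1).bit_length()  # log2(line size), power of two assumed
    index_bits = (sets - 1).bit_length()         # ceil(log2(sets))
    tag_bits = phys_addr_bits - index_bits - offset_bits
    total_bits = lines * (tag_bits + state_bits)
    return lines, sets, tag_bits, total_bits * t_per_bit

# 6MB, 64-byte lines, 16-way, 32-bit physical addresses
print(tag_array_transistors(6 * 2**20, 64, 16, 32))  # (98304, 6144, 13, 8847360)
# Same cache with 36-bit physical addresses
print(tag_array_transistors(6 * 2**20, 64, 16, 36))  # (98304, 6144, 17, 11206656)
```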

ECC data is worse: if the ECC is done 64 bits at a time, single-error-correct, double-error-detect requires 8 bits of overhead per 64 data bits. So, 6MB = 6*1024*1024*8 = 50,331,648 data bits => 6,291,456 bits of overhead => 37,748,736 transistors at 6T per bit.
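The ECC arithmetic above works out as in this small sketch. It assumes 8 check bits per 64-bit word as described in the post and counts only the 6T storage cells for the check bits; the function name is made up for illustration.

```python
def ecc_transistors(cache_bytes, word_bits=64, check_bits=8, t_per_bit=6):
    """Check-bit count and storage-cell transistors for SEC-DED over the data array."""
    data_bits = cache_bytes * 8
    overhead_bits = data_bits // word_bits * check_bits
    return overhead_bits, overhead_bits * t_per_bit

print(ecc_transistors(6 * 1024 * 1024))  # (6291456, 37748736)
```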

That alone is over 45M transistors that Anand missed. The other overheads are harder to estimate precisely, but they add up.

If you assume 80% efficiency, the 288M in bitcells becomes 360M transistors for Penryn's cache (vs. 192M / 0.8 = 240M for Conroe's). That leaves 410M - 360M = 50M for Penryn's cores and 290M - 240M = 50M for Conroe's. I think 80% is actually way above normal - one of my friends in academia is currently taping out an SRAM with 60% efficiency; 70% is generally considered good.
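A minimal sketch of the efficiency model used here, where "efficiency" means the fraction of cache transistors that are storage cells. Counts are in millions, the bitcell figures (288M and 192M) are the article's, and the function name is hypothetical. Conroe comes out at 51M rather than 50M only because this uses 291M where the post rounds to 290M.

```python
def core_budget(total_m, bitcell_m, efficiency_pct):
    """Millions of transistors left for the cores once cache overhead is included."""
    cache_m = bitcell_m * 100 / efficiency_pct  # bitcells are efficiency_pct % of the cache
    return total_m - cache_m

print(f"Penryn: {core_budget(410, 288, 80):.0f}M")  # Penryn: 50M
print(f"Conroe: {core_budget(291, 192, 80):.0f}M")  # Conroe: 51M
```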
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
From what I understand, Intel probably reports the cache's transistor count assuming an optimal scenario, without any of the redundancy built into the core. So 410 million is most likely lower than what is actually in the core itself, as this figure doesn't include failsafe and redundant logic.

It would also likely be impossible for the core to remain the same size as Intel is adding SSE4 to Penryn which would take up some transistor budget.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: coldpower27
From what I understand, Intel probably reports the cache's transistor count assuming an optimal scenario, without any of the redundancy built into the core. So 410 million is most likely lower than what is actually in the core itself, as this figure doesn't include failsafe and redundant logic.

Doubtful. Montecito was only 29M transistors per core, and it was a pretty fancy machine. Montecito's 24MB L3 cache was 1550M transistors (according to Wikipedia), or 65M transistors per MB. That actually lines up well with my 360M-transistors-for-6MB estimate here.
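As a quick check of that comparison, here is the per-MB arithmetic, using only the figures quoted in the post (1550M for 24MB of L3, and the 360M-for-6MB estimate):

```python
# Millions of transistors per MB of cache, from the figures quoted in the post.
montecito_per_mb = 1550 / 24  # Montecito L3, about 64.6
estimate_per_mb = 360 / 6     # the 80%-efficiency Penryn estimate, 60.0

print(f"{montecito_per_mb:.1f} vs {estimate_per_mb:.1f} M transistors per MB")
```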

It would also likely be impossible for the core to remain the same size as Intel is adding SSE4 to Penryn which would take up some transistor budget.

True. My estimate is obviously not exact. However, in practice, new instructions can often be implemented with microcode or only small changes to the hardware. My point was that it's unlikely that the Penryn core has significantly more transistors than Conroe.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I believe Anand said the logic would gain about 20% more transistors. I worked it out and came up with a 19% increase in logic transistor count. That's pretty good.
 

zsdersw

Lifer
Oct 29, 2003
10,505
2
0
Originally posted by: CTho9305
Doubtful. Montecito was only 29M transistors per core, and it was a pretty fancy machine.

Montecito is also *totally* different from Conroe/Penryn, as it is basically a souped-up VLIW design, so comparisons of transistors per core are best done between similar architectures.

 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
i.r.t. efficiency per area, since the cache size pretty much determines die size which is a big factor in the bottom line, one of the first things done on the lead studies in a new process is how to pack as much data as possible into a small signal array.... i thought 70% was low, but ive really only dealt with these arrays from within industry.

that said, i agree, the article estimate is way off base. 23% extra transistor budget is enough for what... a new FPU? heh.
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
Originally posted by: CTho9305
Originally posted by: coldpower27
From what I understand, Intel probably reports the cache's transistor count assuming an optimal scenario, without any of the redundancy built into the core. So 410 million is most likely lower than what is actually in the core itself, as this figure doesn't include failsafe and redundant logic.

Doubtful. Montecito was only 29M transistors per core, and it was a pretty fancy machine. Montecito's 24MB L3 cache was 1550M transistors (according to Wikipedia), or 65M transistors per MB. That actually lines up well with my 360M-transistors-for-6MB estimate here.

Yeah, but Montecito is not x86 based, and has a really light transistor count on the core itself compared to x86. Montecito is also using LV3 instead of LV2.

Note that the core-level caches take up only 106.5 million transistors compared to the LV3, and they include 2MB of LV2 instruction cache and 512KB of LV2 data cache. That lines up well with x86 LV2 caches at about 6 transistors per storage cell.

Look at Prescott to Prescott 2M, 1MB of LV2 to 2MB of LV2 was 125 Million to 169 Million. An increase of 44 Million for 1MB of LV2, right in line with about 6 transistors per storage cell.

Now look at Northwood to Gallatin-2MB: adding 2MB of LV3 took it from 55 million to 178 million, an increase of 123 million, or close to 8 transistors per storage cell.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: coldpower27
Originally posted by: CTho9305
Originally posted by: coldpower27
From what I understand, Intel probably reports the cache's transistor count assuming an optimal scenario, without any of the redundancy built into the core. So 410 million is most likely lower than what is actually in the core itself, as this figure doesn't include failsafe and redundant logic.

Doubtful. Montecito was only 29M transistors per core, and it was a pretty fancy machine. Montecito's 24MB L3 cache was 1550M transistors (according to Wikipedia), or 65M transistors per MB. That actually lines up well with my 360M-transistors-for-6MB estimate here.

Yeah, but Montecito is not x86 based, and has a really light transistor count on the core itself compared to x86. Montecito is also using LV3 instead of LV2.

Sandpile gives <20M transistors for the Core core. An L3 cache SRAM is not drastically different from an L2.

Look at Prescott to Prescott 2M, 1MB of LV2 to 2MB of LV2 was 125 Million to 169 Million. An increase of 44 Million for 1MB of LV2, right in line with about 6 transistors per storage cell.

I'd guess something is wrong with those numbers then. Picking a few numbers from sandpile:
19 Million (65 nm SC, includes 2x 32 KB L1 Cache)
167 Million (65 nm DC, includes 2x 32 KB L1 and 2.0 MB L2 Cache)
291 Million (65 nm DC, includes 2x 32 KB L1 and 4.0 MB L2 Cache)
These are for Core. Note the 19M number for a core (which fits with my derived 50M for dual-core Core 2). Note also the 124M difference between 2 and 4MB, which again works out to ~60M transistors per MB.

Taking some of sandpile's numbers for Athlon64 / Opteron:
105.9 M (130 nm with 128 KB L1 and 1,024 KB L2 Cache)
68.5 M (130 nm with 128 KB L1 and 512 KB L2 Cache)
~75M transistors/MB

114.0 M (90 nm with 128 KB L1 and 1,024 KB L2 Cache)
77.0 M (90 nm with 128 KB L1 and 512 KB L2 Cache)
~75M transistors/MB

153.8 M (90 nm with 128 KB L1 and 512 KB L2 Cache DC Rev F)
227.4 M (90 nm with 128 KB L1 and 1,024 KB L2 Cache DC Rev F)
~74M transistors for 1 MB (well, 2x512KB)


I'll grant that the P4 numbers on sandpile don't work out right:
125,000,000 (includes 12 K µOP TC + 16 KB L1d + 1024 KB L2)
164,000,000 (includes 12 K µOP TC + 16 KB L1d + 2048 KB L2)
That's 39M transistors for 1024*1024*8 bits, or an increase of less than 6T per bit. That's pretty clear evidence that these numbers are missing something or there were other significant changes. Note that a lot of the 256K counts are identical to the 512K counts.
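The deltas above can be turned into per-MB and per-bit figures with a short sketch. It uses 1 MB = 1024*1024*8 bits as in the P4 line, takes the transistor counts straight from the sandpile numbers quoted in this post, and treats the whole difference between two parts as cache, which is itself an assumption. The labels and the function name are only for illustration.

```python
BITS_PER_MB = 1024 * 1024 * 8

# (label, larger count, smaller count, MB of L2 added); counts in transistors
pairs = [
    ("Core, 2MB -> 4MB", 291e6, 167e6, 2.0),
    ("K8 130nm, 512KB -> 1MB", 105.9e6, 68.5e6, 0.5),
    ("K8 90nm, 512KB -> 1MB", 114.0e6, 77.0e6, 0.5),
    ("K8 Rev F DC, 2x512KB -> 2x1MB", 227.4e6, 153.8e6, 1.0),
    ("P4, 1MB -> 2MB", 164e6, 125e6, 1.0),
]

def per_bit(big, small, added_mb):
    """Extra transistors per extra bit of cache between two parts."""
    return (big - small) / (added_mb * BITS_PER_MB)

for label, big, small, mb in pairs:
    per_mb_m = (big - small) / mb / 1e6
    print(f"{label}: {per_mb_m:.1f}M per MB, {per_bit(big, small, mb):.2f}T per bit")
```

Run as written, only the P4 pair comes out below 6T per bit, which is the anomaly the post points out.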
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
I will give you the difference between Allendale and Conroe.

The Athlon 64s have been at 8T or more, closer to 9T, per bit of LV2 cache, but that's for AMD's processors and not Intel's.

It seems to vary. I don't go by Sandpile very often, but the values are the same for 256KB and 512KB on the Pentium 4s because die functionality was disabled, not removed, to create the lower-end versions. So the transistor counts are still correct.

Likewise, the B2-stepping E6300/E6400 is still 291 million transistors on a 143mm2 die, except that 124 million are deactivated. The same goes for the T5x00 line of Core 2 Duo mobile parts and the Celeron M 5x0 sequence. If these values are indeed correct, then the Celeron M 5x0 has only 62 million + 21.5 million = 83.5 million transistors active out of the total 291 million.

And some numbers don't add up. If Allendale is 167 million and the difference is 124 million for 2MB of LV2, then you're looking at 21.5 million per core for Core 2 Duo, not 19 million as Sandpile suggests. Also, Prescott-2M is 169 million, which brings that discrepancy closer to 6T per cell. So it should be at least 43 million on the cores for Penryn, which is why I think the figure for Penryn should be around 7.5T per cell.

For Netburst derivatives as well as Itanium's LV2, it's close to 6T; for AMD's processors it's close to 9.5T per cell; for Pentium M it's close to 8T; for Core 2 Penryn it's close to 7.5T. That lines up well with your 360 million estimate for Penryn. It probably just needs a little more tweaking to make sure the core logic doesn't actually shrink below the 43 million spread over the 2 cores. I think for Conroe it's about 7.75T, but for Penryn you're looking at 7.4-7.5T; even 7.4T per cell would allow an extra 5 million per core, which I would guess is quite significant.
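The per-cell figures in this post can be sanity-checked with a small sketch. It assumes 1 MB = 8,000,000 bits, since that appears to be the convention that makes 7.5T per cell come out to 360 million for 6MB; the function name is hypothetical and the totals (291M, 410M) are the ones quoted in the thread.

```python
BITS_PER_MB = 8_000_000  # assumption: decimal MB, matching the post's round numbers

def core_left_m(total_m, cache_mb, t_per_bit):
    """Millions of transistors left for the cores at a given transistors-per-bit figure."""
    return total_m - cache_mb * BITS_PER_MB * t_per_bit / 1e6

print(f"Conroe at 7.75T: {core_left_m(291, 4, 7.75):.1f}M")  # Conroe at 7.75T: 43.0M
print(f"Penryn at 7.5T:  {core_left_m(410, 6, 7.5):.1f}M")   # Penryn at 7.5T:  50.0M
print(f"Penryn at 7.4T:  {core_left_m(410, 6, 7.4):.1f}M")   # Penryn at 7.4T:  54.8M
```

Under that convention, dropping from 7.5T to 7.4T frees up about 4.8 million transistors across the whole die.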

But I agree that, in the end, 6T is a bit on the low side, at least for Penryn and Core 2 based processors. It would be close to accurate if you're talking about Netburst or Itanium 2 LV2 cache, though.

The article's figure is already proven false another way: if Allendale has only 167 million transistors, and the 124 million difference from Conroe is cache, then 99 million for core logic on Conroe is impossible.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
For Netburst derivatives as well as Itanium's LV2, it's close to 6T

My point is, you cannot under any circumstance have 6 transistors per bit of cache. You need tags, and if you're not insane, you need ECC protection (according to this Intel document, Core 2 does have ECC protection in the L2). You also need supporting logic like decoders and muxes. Anybody giving you a transistors-per-bit number of 6 is misleading you. If you increase cache size by increasing line size, you can get away without adding tag bits, but you need to ECC the extra data, and unless you're doing very risky design you need more/wider column muxes*. In practice you'll need to add to other supporting logic as well.

*You need to use column muxes because at modern bitcell sizes, a single particle strike can actually flip multiple bits in a region, and if you don't want to lose data, you need to make sure that within the affected region you have no more than 1 bit from each ECC-protected block.
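To illustrate the footnote, here is a toy model of column interleaving. It is not Intel's actual layout: it simply assumes physical column c belongs to ECC word c % interleave, so adjacent columns belong to different words. Both function names are made up for illustration.

```python
def column_to_word_bit(column, interleave):
    """Map a physical column to (ECC word, bit index) under simple interleaving."""
    return column % interleave, column // interleave

def max_bits_hit_per_word(start, width, interleave):
    """Largest number of flipped bits that land in any single ECC word."""
    hits = {}
    for col in range(start, start + width):
        word, _ = column_to_word_bit(col, interleave)
        hits[word] = hits.get(word, 0) + 1
    return max(hits.values())

# A strike across 4 adjacent columns with 4-way interleaving: 1 bit per word,
# which single-error-correct ECC can repair.
print(max_bits_hit_per_word(start=10, width=4, interleave=4))  # 1
# Without interleaving, the same strike puts all 4 bits in one word.
print(max_bits_hit_per_word(start=10, width=4, interleave=1))  # 4
```

In this model a strike is only safe when it is no wider than the interleave factor, which is why larger upsets call for more or wider column muxes.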
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
I can't really agree with this given the numbers. 6T per bit is about the average for NetBurst as well as Itanium's LV2; how they achieve that is a mystery at this point. ~8T, or in AMD's case over 9T, only seems to show up in Pentium- and Core-based LV2 caches, as well as K8.

And Core 2 is at around 7.5T per bit, so it's not 6 in Core's case. Unless certain transistors aren't counted for NetBurst processors or something, that is what those numbers say. It's not necessarily a lie per se; it could be down to different counting methods.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
There are 6 transistors per bit of SRAM, that is just a fact. In order to do anything useful with them you need a good deal more memory bits to handle parity and tags. Also, of course, you need the logic to access the memory cells. So it is impossible to have 6 transistors per bit of memory in practice. One thing I am not sure about is that I learned in computer architecture class that some processors include the parity and tag bits as part of the total amount of memory the processor has, so it might say 4MB, but only have 3.5MB of data. I'm not sure how Intel's processors work, but it is possible that this would account for the total numbers being much closer to 6T than expected.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I can't really agree with this given the numbers. 6T per bit is about the average for NetBurst as well as Itanium's LV2; how they achieve that is a mystery at this point. ~8T, or in AMD's case over 9T, only seems to show up in Pentium- and Core-based LV2 caches, as well as K8.

So you're thinking Intel invented new magical ECC, tagging, and decode systems that require no overhead, and have 100% yield so they don't need redundant rows/columns?

Are you aware of where the 6T per bit number actually comes from? It comes from the circuit that is used to store data: 2 cross-coupled inverters (2 devices each), and 2 access transistors. You can't use fewer transistors nowadays, although back when variability was lower and power didn't matter you could replace 2 of the transistors with resistors as shown here.

Intel routinely shows off electron-microscope shots of their bit cells and if you know what you're looking at, it's clearly 6T per bit of data.

One thing I am not sure about is that I learned in computer architecture class that some processors include the parity and tag bits as part of the total amount of memory the processor has, so it might say 4MB, but only have 3.5MB of data. I'm not sure how Intel's processors work, but it is possible that this would account for the total numbers being much closer to 6T than expected.

That would be extremely deceptive, probably to the point of being illegal marketing, since those bits aren't usable for data storage. As demonstrated by the Linpack results here, AMD only counts usable bits. These results also indicate 4MB of usable L2 cache space for Core 2 Duo.
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
Originally posted by: CTho9305
So you're thinking Intel invented new magical ECC, tagging, and decode systems that require no overhead, and have 100% yield so they don't need redundant rows/columns? Are you aware of where the 6T per bit number actually comes from?

If 8T is indeed necessary overall, then there's something else we're not seeing, as 6T has been the average over NetBurst's lifetime, and it is also what is reported for Itanium 2's LV2. It's likely Intel is just counting NetBurst's and Itanium 2's LV2 differently, as ~6T has been consistent for NetBurst. The question we should ask is why Intel would be doing this specifically for the NetBurst and Itanium 2 LV2s, and why Core 2/Pentium M/P3 based LV2 caches show the more standard 8T per bit.
 

josh609

Member
Aug 8, 2005
194
0
0
CTho9305, I fear you. No, not because your name is CTho9305, but because you actually sound like you know what you're talking about...
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: coldpower27
Originally posted by: CTho9305
So you're thinking Intel invented new magical ECC, tagging, and decode systems that require no overhead, and have 100% yield so they don't need redundant rows/columns? Are you aware of where the 6T per bit number actually comes from?

If 8T is indeed necessary overall, then there's something else we're not seeing, as 6T has been the average over NetBurst's lifetime, and it is also what is reported for Itanium 2's LV2. It's likely Intel is just counting NetBurst's and Itanium 2's LV2 differently, as ~6T has been consistent for NetBurst. The question we should ask is why Intel would be doing this specifically for the NetBurst and Itanium 2 LV2s, and why Core 2/Pentium M/P3 based LV2 caches show the more standard 8T per bit.

Well, the whole point of this thread was to point out that people were using bad math when dealing with transistor counts. I don't know where the Netburst numbers came from - maybe somebody got the small-cache number and calculated what the others would be based on a mistaken 6T assumption.

Originally posted by: josh609
CTho9305, I fear you. No, not because your name is CTho9305, but because you actually sound like you know what you're talking about...

I have some industry experience and majored in electrical & computer engineering in college.
 