P4" Cache vs Latency

damocles

Diamond Member
Oct 9, 1999
5,105
5
81
Ok, I may have this completely wrong, but here's what I was wondering.

Two areas of performance where the P4 seems to need a boost are its small L1 cache and its comparatively high latency when using RDRAM.

As I understand it, the amount of L1 cache onboard was kept low because adding more would have increased latency.

When the P4 goes to PC1066 and PC1200 RDRAM, presumably the higher FSB would cause the latency to decrease.

Would the decrease be enough to compensate and allow additional L1 cache to be added? Would there be any tangible benefit?

Of course this would probably only work with RDRAM solutions (if at all). I am just trying to think of ways they could further tweak the P4.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
The amount of time it takes to access a cache is roughly proportional to the size of the cache. So, for example using fake numbers, one might be able to make a 4-way set-associative 4kB cache on a 0.18um process with a 0.5ns access time, or a 4-way set-associative 16kB cache on a 0.18um process with a 2ns access time. The former would allow a 1-cycle latency for read accesses with a 2GHz clock, while the latter would allow a 4-cycle latency for read accesses with a 2GHz clock. The latency here is not the total latency of the system, but the latency to access the cache itself. So you can make a big, slow cache or a small, fast cache. That's why there are various levels of cache (L1, L2, etc.): as they get larger, they get slower. So you try to balance the likelihood of a cache hit at each level against the latency of accessing that level, to optimize the overall time to get the data.
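To put those fake numbers in code form, here is a quick Python sketch (the cache sizes and access times are the made-up figures above, not real P4 numbers, and the function name is just illustrative) that converts a cache's raw access time into whole CPU cycles at a given clock:

import math

def access_cycles(access_time_ns, clock_ghz):
    # Cycle time in ns is 1 / clock speed in GHz; a load can't finish
    # mid-cycle, so round the access time up to whole cycles.
    cycle_time_ns = 1.0 / clock_ghz
    return math.ceil(access_time_ns / cycle_time_ns)

print(access_cycles(0.5, 2.0))   # hypothetical small 4kB cache   -> 1 cycle at 2GHz
print(access_cycles(2.0, 2.0))   # hypothetical larger 16kB cache -> 4 cycles at 2GHz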
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
An 8KB cache may seem small, but you're missing the beauty of what caches can exploit: spatial and temporal locality. If address X is accessed, chances are it will be accessed again soon (this is temporal locality). Also, caches store data in blocks that are some integer multiple of the word size...a block, for example, might store 16 32-bit words covering addresses X to X+63. That way, caches can exploit spatial locality: if address X is accessed, chances are address X+1 will be accessed soon. So despite the 8KB size, the P4's L1 cache is estimated to have a 96.1% hit-rate.
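To make the block idea concrete, here is a small Python sketch of how an address splits into tag/index/offset. The 8KB size and 4-way associativity are the P4 L1 figures discussed here, the 64-byte block comes from the 16-word example above, and the helper name is just for illustration:

CACHE_SIZE    = 8 * 1024   # bytes
BLOCK_SIZE    = 64         # bytes per block (16 x 32-bit words)
ASSOCIATIVITY = 4
NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)   # 32 sets

def split_address(addr):
    # Low bits select the byte within a block, the next bits select the
    # set, and the remaining high bits form the tag compared on lookup.
    offset = addr % BLOCK_SIZE
    index  = (addr // BLOCK_SIZE) % NUM_SETS
    tag    = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

# X and X+4 (the next 32-bit word) fall in the same block, so once the
# first access fills the line, the second one hits: spatial locality.
print(split_address(0x12340))
print(split_address(0x12344))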

But what's important is not hit-rate alone, or access time alone, but rather the average memory access time. For a CPU with an L1 and an L2 cache, the average access time = L1 access time + (L1 miss-rate * L2 access time) + (L2 global miss-rate * main memory access time). It's important to keep the average memory access time as low as possible...main memory access times decrease very, very slowly....for simulation purposes, it's easiest to assume that main memory access time is always constant at around 80-120ns (for Paul Demone's numbers above, he assumed 120ns). On the other hand, as CPU speeds increase, the number of cycles, from the perspective of the CPU, needed to access main memory is constantly increasing. So given the target speed of a CPU, and the size of the L2 cache (usually determined by the process technology and target market), you have to design the cache system to get a target hit-rate and access times for the L1 and L2 cache to give you the target average memory access time.

The problem is that many of the techniques used to decrease the miss-rate also increase the access time, such as increasing the set-associativity and the cache size. Increasing the block size increases the spatial locality (since there are more words per block), but decreases the temporal locality (since a cache of a given size will store fewer blocks). Thus, for a given cache size and access patterns, there is usually a target block size that will maximize the hit-rate.
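That formula drops straight into a couple of lines of Python (times in CPU cycles, miss-rates as fractions; the function names are just for illustration):

def amat_cycles(l1_time, l1_miss_rate, l2_time, l2_global_miss_rate, mem_time):
    # L1 is always checked; L2 is only paid for on an L1 miss; main memory
    # is only paid for on a global miss (a miss in both caches).
    return l1_time + l1_miss_rate * l2_time + l2_global_miss_rate * mem_time

def amat_ns(cycles, clock_ghz):
    # Convert an access time in cycles to nanoseconds at a given clock.
    return cycles / clock_ghz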

So despite its size, the P4's 8KB 4-way set-associative L1 cache has a 3.9% miss-rate, compared to the Athlon's 64KB 2-way set-associative L1 cache's 1.8% miss-rate (slightly over two times better). The global hit-rates for the L2 are 98.9% and 99.0% for the P4 and Athlon, respectively. The P4's L1 access time is 2 cycles, its L2 access time is 5 cycles, and given a 1.5GHz P4, at 120ns the main memory access time is 180 cycles. The Athlon's L1 access time is 3 cycles, its L2 access time is 11 cycles, and its main memory access time at 1GHz is 120 cycles.

For the P4, the average access time = 2 + .039 * 5 + .011 * 180 = 4.18 cycles. At 1.5GHz, this is 2.78ns.

For the Athlon, the average access time = 3 + .018 * 11 + .01 * 120 = 4.40 cycles. At 1GHz, this is 4.40ns.

So with its cache hierarchy, the P4 achieves a much better average access time in nanoseconds, but given its faster clockrate, the average access times in cycles for the P4 and Athlon are very similar.
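Plugging the numbers above into that formula in Python reproduces both results (120ns main memory assumed, as above):

def amat(l1_time, l1_miss, l2_time, l2_global_miss, mem_time):
    return l1_time + l1_miss * l2_time + l2_global_miss * mem_time

p4     = amat(2, 0.039, 5, 0.011, 180)   # 180 cycles = 120ns at 1.5GHz
athlon = amat(3, 0.018, 11, 0.010, 120)  # 120 cycles = 120ns at 1.0GHz

print(p4, p4 / 1.5)          # ~4.18 cycles, ~2.78 ns
print(athlon, athlon / 1.0)  # ~4.40 cycles, ~4.40 ns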

If it meant increasing the L1 access time to 3 cycles, increasing the P4's L1 size would probably have an adverse effect. IIRC, the general rule is that you have to quadruple the cache size to halve its miss-rate. So if the L1 size were doubled to 16KB, its miss-rate would decrease by a factor of about 1/sqrt(2), roughly 30%, to around 2.76%. Let's say this also decreases the L2 global miss-rate to 1.05%. The average access time would be:

3 + .0276 * 5 + .0105 * 180 = 5.03 cycles

So despite the better hit-rate, the average access time would be ~20% worse. Honestly, I don't think the L1 size would ever be increased on the P4 unless the access time can be kept at 2 cycles.
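The same formula covers the 16KB what-if (the 1/sqrt(2) miss-rate scaling and the 1.05% L2 global miss-rate are the assumptions stated above, not measured figures):

import math

def amat(l1_time, l1_miss, l2_time, l2_global_miss, mem_time):
    return l1_time + l1_miss * l2_time + l2_global_miss * mem_time

current   = amat(2, 0.039, 5, 0.011, 180)                    # ~4.18 cycles
bigger_l1 = amat(3, 0.039 / math.sqrt(2), 5, 0.0105, 180)    # ~5.03 cycles

print(bigger_l1 / current - 1)   # ~0.20 -> roughly 20% worse on average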

edit: Hehe, once again I took too long....pm beat me.
 