CPU caching schemes...

sao123

Lifer
May 27, 2002
12,650
203
106
I was reading a little bit about inclusive & exclusive caching on CPUs, but which is better?
It seems that (if I understood) an inclusive cache may be a little faster for L1 misses that hit in L2 or L3.
But it seems that exclusive gives you more maximum effective cache.
Plus, what are the best replacement algorithms for each?

But I fail to understand the associativity and blocking vs. non-blocking concepts.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
Exclusive is more cache efficient (as you don't have the L1 cache repeated in the L2 cache). However, making the cache exclusive uses up a lot of transistors, and transistors = money. So basically it is a trade-off between the hit you take from the effective reduction of cache due to the overlap and how much it will cost you to make it exclusive.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I don't think there is a clear-cut answer to whether inclusive or exclusive cache is better. If there were, you wouldn't see Athlons one way and Pentiums the other. Both types of caches usually use LRU (or pseudo-LRU, since true LRU for many-way-associative caches is not feasible).

A blocking cache can only service one request at a time. If you ask it for the word at location x, you have to wait for it to give you back the data before you can do anything else. A non-blocking cache allows you to continue executing instructions while you're waiting for x. I don't know if calling the cache blocking or non-blocking makes sense - as far as I can tell, the cache is the same, but the CPU core is different. An out-of-order computer might encounter code like:
lw r2, 0(r3) ; read memory[r3] into r2
add r4, r0, r0 ; r4 = 0
add r4, r5, r6 ; r4 = r5+r6
add r4, r7, r8 ; r4 = r7+r8
sub r4, r4, r2 ; r4 = r7+r8-r2
In a blocking design, once we hit the "lw", we can't do anything else until the data comes back. On a modern CPU with a 3-cycle L1, you're stuck for 3 cycles. In a non-blocking design, the CPU can keep going until it hits something that depends on that load (the sub instruction), so since there are 3 instructions in between, if you had only one pipeline, you would experience no extra delay - the memory load would appear to be instantaneous.

The contrast is even bigger for stores. A load often occurs soon before its data is needed, so on a 3-way superscalar machine you might not have 9 independent instructions before a dependent one to keep you busy for 3 cycles, but you're rarely dependent on a store at all. A blocking cache stops you at every store, while a non-blocking cache effectively makes stores free.

Do you understand direct-mapped caches? Let's say you have an 8-entry direct-mapped cache. When you're given a memory location to read from, you just take the address modulo 8 (look at the last 3 bits of the address), and you have to use that cache location. There is no real "replacement policy" - any address can go in only one spot, and if data is already there, you have to evict it. The advantage of a direct-mapped cache is that it's simple. The disadvantage is that if you have many accesses to addresses that map to the same location, you'll be "thrashing" - repeatedly evicting and loading the same data (consider a program that accesses memory location 3 and location 11 over and over again - both locations fight for cache location 3). To determine a cache hit, you just compare the stored tag to the requested address and you're done.
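
If it helps, here's roughly what that lookup looks like in C - a toy sketch with made-up names, not any real hardware:

#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 8

struct dm_entry {
    bool     valid;
    uint32_t tag;   /* upper address bits stored alongside the data */
    uint32_t data;
};

struct dm_entry dm_cache[NUM_ENTRIES];

/* Returns true on a hit. On a miss the caller fetches from memory and
   overwrites this one slot - there is nothing to choose, so there is
   no replacement policy. */
bool dm_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = addr % NUM_ENTRIES;   /* last 3 bits pick the slot */
    uint32_t tag   = addr / NUM_ENTRIES;   /* the rest becomes the tag */

    if (dm_cache[index].valid && dm_cache[index].tag == tag) {
        *out = dm_cache[index].data;
        return true;                       /* hit: one compare and done */
    }
    return false;                          /* miss: must evict slot 'index' */
}

Addresses 3 and 11 both produce index 3 here, which is exactly the thrashing case above.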

Fully associative caches are the other end of the spectrum. If you have an 8-entry fully associative cache, any memory location can end up in any of the 8 cache locations. To determine if a memory location is already present, you have to look at all 8 tags and do comparisons, and figure out which one (if any) hit. This is going to be pretty slow. Also, LRU replacement is not realistic for 8 ways - you'd need something like 15 bits to store LRU information, and the logic to interpret that information would be huge. In reality, you have to implement pseudo-LRU. The advantage of a fully associative cache is that you can store ANY 8 memory locations simultaneously.
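
Same kind of toy C sketch (made-up names again) - a fully associative lookup just becomes "compare every tag":

#include <stdbool.h>
#include <stdint.h>

#define FA_ENTRIES 8

struct fa_entry {
    bool     valid;
    uint32_t tag;    /* the whole block address serves as the tag */
    uint32_t data;
};

struct fa_entry fa_cache[FA_ENTRIES];

/* Any address may live in any slot, so a lookup has to compare all 8
   tags. In hardware that is 8 parallel comparators plus logic to pick
   the one that matched; in this sketch it is just a loop. */
bool fa_lookup(uint32_t addr, uint32_t *out)
{
    for (int i = 0; i < FA_ENTRIES; i++) {
        if (fa_cache[i].valid && fa_cache[i].tag == addr) {
            *out = fa_cache[i].data;
            return true;
        }
    }
    return false;  /* miss: the (pseudo-)LRU policy picks a victim slot */
}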

Set associativity is the 'happy middle'. An 8-entry 2-way set associative cache would have 2 banks of 4 entries. When you get a memory address, use the last 2 bits to identify which of the 4 rows you need to look at, and then compare the two tags for that row. If one matches, read out data from (or write data to) the bank that hit. Replacement is pretty simple - LRU requires just 1 extra bit per row. If you touch bank 0, clear the bit. If you touch bank 1, set the bit. When it comes time to evict something, if the bit is 0, evict bank 1, and if the bit is 1, evict bank 0. Earlier, for direct mapped caches, we saw there was a problem if a program repeatedly used location 3 and location 11. Now, both 3 and 11 can coexist in the cache. We only have a problem if the program uses location 3, 11, and 19.
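
And the same sketch for the 8-entry 2-way set associative case, including that 1-bit LRU (purely illustrative C, not a real design):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 4   /* 8 entries total: 2 ways x 4 sets */

struct sa_entry { bool valid; uint32_t tag; uint32_t data; };

struct sa_entry way[2][NUM_SETS];
bool lru_bit[NUM_SETS];   /* one bit per row: set when way 1 was touched last */

bool sa_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t set = addr % NUM_SETS;   /* last 2 bits pick the row */
    uint32_t tag = addr / NUM_SETS;

    for (int w = 0; w < 2; w++) {
        if (way[w][set].valid && way[w][set].tag == tag) {
            *out = way[w][set].data;
            lru_bit[set] = (w == 1);  /* touch way 1 -> set bit, way 0 -> clear it */
            return true;
        }
    }
    return false;
}

/* On a miss, evict the way that was NOT touched most recently. */
int sa_victim(uint32_t addr)
{
    uint32_t set = addr % NUM_SETS;
    return lru_bit[set] ? 0 : 1;      /* bit == 1 -> evict way 0, else evict way 1 */
}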
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: CTho9305

A blocking cache can only service one request at a time. If you ask it for the word at location x, you have to wait for it to give you back the data before you can do anything else. A non-blocking cache allows you to continue executing instructions while you're waiting for x. I don't know if calling the cache blocking or non-blocking makes sense - as far as I can tell, the cache is the same, but the CPU core is different.
Caches can be blocking or non-blocking...a non-blocking cache allows multiple requests to be issued, and can service and complete misses out-of-order. The issue at the heart of it is determining, as data comes back out-of-order, which data belongs to which request. They're pretty important to performance for any multiple-issue microprocessor...Itanium 2 has non-blocking/out-of-order L2 and L3 caches even though the execution pipeline is mostly in-order.

The seminal paper on the topic is "Lockup-Free Instruction Fetch/Prefetch Cache Organization" by David Kroft (I think it's on the ACM Portal)...the paper is 20 years old, but the concept is pretty much the same.
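
As a very rough sketch of the bookkeeping involved (plain C, purely illustrative, not taken from the paper): each outstanding miss gets an entry recording which address it's for and where the result should go, so when data returns out of order it can be matched back to the right request - this is essentially the MSHR (miss status holding register) idea from that paper.

#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8   /* how many misses may be in flight at once */

struct mshr {
    bool     busy;
    uint32_t addr;      /* the block we're waiting on */
    int      dest_reg;  /* which register the load should fill */
};

struct mshr mshrs[MAX_OUTSTANDING];

/* Issue a miss: grab a free entry and tag the request with its index.
   Returns -1 if every entry is busy - at that point even a non-blocking
   cache has to stall. */
int issue_miss(uint32_t addr, int dest_reg)
{
    for (int id = 0; id < MAX_OUTSTANDING; id++) {
        if (!mshrs[id].busy) {
            mshrs[id].busy = true;
            mshrs[id].addr = addr;
            mshrs[id].dest_reg = dest_reg;
            return id;   /* this id travels with the request to memory */
        }
    }
    return -1;
}

/* Data returns (possibly out of order) carrying the same id, so we know
   exactly which load it satisfies and which register to write. */
void data_returned(int id, uint32_t data, uint32_t *regfile)
{
    regfile[mshrs[id].dest_reg] = data;
    mshrs[id].busy = false;
}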
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Inclusive and exclusive caching schemes each have some advantages. While exclusive caching obviously has the advantage that it gives you more effective cache space, it also complicates the search procedure. More complex search procedures mean longer access latencies. It also complicates write-back and cache coherency quite a bit. Keep in mind that in exclusive schemes (at least, most of them), you are not guaranteed exclusiveness. That is to say, there might still be duplicate data; unlike an inclusive scheme, you just aren't guaranteed that the data is duplicated.

Let's take an example. Say the data you're looking for is at address 5. Address 5 is currently in the L2 cache but not the L1 cache. You load the data at address 5 and also, at the same time, put it in L1 because it's likely that it may be used again. However, if you were to take the extra time to remove it from L2 (so as to guarantee exclusiveness), then you would be wasting valuable cache time. Instead, you leave the L2 the way it is.

The next instruction writes back to address 5. Since most modern caches are write-back and not write-through, this new data is only written to the L1 cache. The older data is still in L2. You would set the dirty bit (the bit indicating that the data in the cache is different from the data in memory) to 1 in L1; however, you've not invalidated the data in L2 (that is, the L2 cache still considers the data it has at address 5 to be valid), since the caches are exclusive and different data is allowed in L2 and L1. The next time you try to replace that block in L1 (the one containing address 5), complications come up: you'd have to write the L1 block back to memory, but the L2 still considers itself to have valid data at address 5, and that data has not been invalidated yet. In other words, every time you write back, you would have to check each level of cache individually and invalidate any stale data that's there. This costs time.

In an inclusive-cache scheme, this situation is avoided. All changes in L1 are guaranteed to be reflected in L2 as well: L1 cache blocks are associated with L2 blocks, and writes that occur in L1 are written through to L2. There is no additional search time necessary, making L1 writes very fast.
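
To make the two write paths concrete, here's a rough C sketch (toy structures and made-up names; real coherence logic is much more involved):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct line { bool valid; bool dirty; uint32_t addr; uint32_t data; };

/* Toy fully-associative L1 and L2, just enough to show the two paths. */
struct line l1[64], l2[512];

static struct line *find(struct line *c, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (c[i].valid && c[i].addr == addr)
            return &c[i];
    return NULL;
}

/* Exclusive-style write-back: a store only updates L1, so a stale copy may
   still sit in L2 claiming to be valid. When the dirty L1 line is later
   evicted, we have to probe L2 and invalidate any old copy - the extra
   search described above. */
void exclusive_evict_l1(struct line *victim)
{
    if (victim->dirty) {
        struct line *stale = find(l2, 512, victim->addr);  /* extra L2 probe */
        if (stale)
            stale->valid = false;      /* invalidate the outdated L2 data */
        /* ...then write victim->data back to memory (or into L2)... */
    }
    victim->valid = false;
}

/* Inclusive write-through from L1 to L2: inclusion guarantees a matching
   L2 line exists, so a store just updates both copies and nothing has to
   be hunted down at eviction time. */
void inclusive_write(uint32_t addr, uint32_t new_data)
{
    struct line *a = find(l1, 64, addr);
    struct line *b = find(l2, 512, addr);  /* present by inclusion */
    if (a) a->data = new_data;
    if (b) b->data = new_data;
}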

The trade-off between cache speed and cache size appears again. If you have a large L1 cache, you do not want to waste the space in L2 and exclusive makes sense. If you have a small L1 cache, the wasted space is negligible and having faster cache is better. I believe the cut-off point is if your L2 cache is 8x greater than your L1 cache or something along those lines.
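
To put rough numbers on that ratio argument (these are just illustrative figures, not the actual cut-off): with a 64KB L1 and a 256KB L2 (4x), full duplication would eat a quarter of the L2, which hurts; with a 2MB L2 (32x), the same duplicated 64KB is only about 3% of the L2, so inclusion costs almost nothing.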
 

sao123

Lifer
May 27, 2002
12,650
203
106
I thought that the victim buffer solved most of the writeback latency problems.
This way, the evicted L1 block would always be written over the promoted L2 block, thereby guaranteeing exclusivity. And this writeback could occur during any L1 search hits, while the L2 is not in use. The victim buffer can usually hold 5-8 blocks to allow for a small delay in writeback.
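
Something like this, as a rough sketch of what I mean (toy C, made-up names):

#include <stdbool.h>
#include <stdint.h>

#define VB_SLOTS 8   /* a handful of evicted L1 blocks awaiting writeback */

struct vb_entry { bool full; uint32_t addr; uint32_t data; };

struct vb_entry victim_buf[VB_SLOTS];

/* On an L1 eviction, park the block here instead of stalling on L2. */
bool vb_push(uint32_t addr, uint32_t data)
{
    for (int i = 0; i < VB_SLOTS; i++) {
        if (!victim_buf[i].full) {
            victim_buf[i].full = true;
            victim_buf[i].addr = addr;
            victim_buf[i].data = data;
            return true;
        }
    }
    return false;   /* buffer full: now the eviction really does have to wait */
}

/* Called on cycles where the L2 is idle (e.g. while the current access is
   hitting in L1): drain one entry into L2, overwriting the block that was
   just promoted into L1 so the two levels stay exclusive. */
void vb_drain_one(void (*l2_write)(uint32_t addr, uint32_t data))
{
    for (int i = 0; i < VB_SLOTS; i++) {
        if (victim_buf[i].full) {
            l2_write(victim_buf[i].addr, victim_buf[i].data);
            victim_buf[i].full = false;
            return;
        }
    }
}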
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
If every time the L1 cache was written back it were written through to the L2, then it would not guarantee exclusivity, but rather work towards inclusivity. That is to say, data is duplicated in L2 and L1. Having a victim buffer is OK, but it does not solve the entire problem, especially when your L1 cache is large and you can have far more than 5-8 blocks that are in L1 and also in L2. Consider that hits in the L2 cache run ~95%, and most of what hits in L2 also gets brought into L1 (with a 64KB L1 cache using 64-byte blocks, that's 1024 blocks that can be in both L1 and L2). In the end, you end up wasting space anyway.
 