I don't think there is a clear-cut answer to whether an inclusive or exclusive cache is better. If there were, you wouldn't see Athlons one way and Pentiums the other. Both types of cache usually use LRU (or pseudo-LRU, since true LRU is not feasible for caches with many ways).
A blocking cache can only service one request at a time. If you ask it for the word at location x, you have to wait for it to give you back the data before you can do anything else. A non-blocking cache allows you to continue executing instructions while you're waiting for x. I don't know if calling the cache blocking or non-blocking makes sense - as far as I can tell, the cache is the same, but the CPU core is different. An out-of-order computer might encounter code like:
lw r2, 0(r3) ; load memory[r3] into r2
add r4, r5, r6 ; r4 = r5+r6
add r4, r4, r7 ; r4 = r5+r6+r7
add r4, r4, r8 ; r4 = r5+r6+r7+r8
sub r4, r4, r2 ; r4 = r5+r6+r7+r8-r2
In a blocking design, once we hit the "lw", we can't do anything else until the data comes back. On a modern CPU with a 3-cycle L1, you're stuck for 3 cycles. In a non-blocking design, the CPU can keep going until it hits something that depends on that load (the sub instruction). Since there are 3 instructions in between, if you had only one pipeline, you would experience no extra delay - the memory load would appear to be instantaneous.
It's even more pronounced for stores. A load often occurs soon before its data is needed - on a 3-way superscalar machine, you might not have 9 independent instructions before the first dependent one to keep you busy for 3 cycles. With a store, though, later instructions rarely depend on it at all. A blocking cache stops you at every store, while a non-blocking cache effectively makes stores free.
Do you understand direct-mapped caches? Let's say you have an 8-entry direct-mapped cache. When you're given a memory location to read from, you take the address modulo 8 (look at the last 3 bits of the address), and you have to use that cache location. There is no real "replacement policy" - any address can go in only one spot, and if other data is already there, you have to evict it. To determine a cache hit, you just compare the stored tag to the address requested and you're done. The advantage of a direct-mapped cache is that it's simple. The disadvantage is that if you have many accesses to addresses that map to the same location, you'll be "thrashing" - repeatedly evicting and reloading the same data. (Consider a program that accesses memory location 3 and location 11 over and over again - both fight for cache location 3.)
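To make this concrete, here's a minimal sketch of a direct-mapped lookup in C. It's a software model under assumed simplifications - word addressing, 8 entries, no data array - and the names (Line, access) are mine, not from any real implementation:

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 8   /* 8-entry direct-mapped cache, word-addressed */

typedef struct {
    bool     valid;
    uint32_t tag;    /* the address bits above the index */
} Line;

static Line cache[ENTRIES];

/* Returns true on a hit. On a miss, the new line simply replaces
   whatever occupied the slot - there is no replacement decision to make. */
bool access(uint32_t addr)
{
    uint32_t index = addr % ENTRIES;  /* last 3 bits choose the slot */
    uint32_t tag   = addr / ENTRIES;  /* everything above the index */

    if (cache[index].valid && cache[index].tag == tag)
        return true;                  /* one tag compare and we're done */

    cache[index].valid = true;        /* evict-and-fill */
    cache[index].tag   = tag;
    return false;
}

Calling access(3), access(11), access(3), access(11), ... misses every single time, because both addresses index slot 3 and keep evicting each other - exactly the thrashing described above.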
Fully associative caches are the other end of the spectrum. If you have an 8-entry fully associative cache, any memory location can end up in any of the 8 cache locations. To determine whether a memory location is already present, you have to compare all 8 tags and figure out which one (if any) hit. This is going to be pretty slow. Also, true LRU replacement is not realistic for 8 ways - tracking the full recency ordering of 8 entries takes something like 15 bits (there are 8! = 40,320 possible orderings, and log2(40320) is about 15.3), and the logic to interpret that information would be huge. In reality, you have to implement pseudo-LRU. The advantage of a fully associative cache is that you can store ANY 8 memory locations simultaneously.
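The same style of sketch for a fully associative cache shows where the cost goes: every lookup has to compare against all 8 tags. Again the structure is my own simplified model, and I've used round-robin replacement as a crude stand-in for pseudo-LRU just to keep the sketch short:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 8   /* 8-entry fully associative cache, word-addressed */

typedef struct {
    bool     valid;
    uint32_t tag;   /* with no index bits, the whole address is the tag */
} Line;

static Line cache[WAYS];
static int  next_victim;  /* round-robin victim pointer, standing in for pseudo-LRU */

bool access(uint32_t addr)
{
    /* Hardware does these 8 comparisons in parallel; in software we loop. */
    for (int way = 0; way < WAYS; way++)
        if (cache[way].valid && cache[way].tag == addr)
            return true;

    /* Miss: any way can hold any address, so we must pick a victim. */
    cache[next_victim].valid = true;
    cache[next_victim].tag   = addr;
    next_victim = (next_victim + 1) % WAYS;
    return false;
}

The upside is visible too: any 8 addresses can be resident at once, so the 3-versus-11 conflict from the direct-mapped example can't happen here.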
Set associativity is the 'happy middle'. An 8-entry 2-way set associative cache would have 2 banks of 4 entries. When you get a memory address, use the last 2 bits to identify which of the 4 rows you need to look at, then compare the two tags for that row. If one matches, read data from (or write data to) the bank that hit. Replacement is pretty simple - LRU requires just 1 extra bit per row. If you touch bank 0, clear the bit; if you touch bank 1, set the bit. When it comes time to evict something, if the bit is 0, evict bank 1, and if the bit is 1, evict bank 0. Earlier, for direct-mapped caches, we saw there was a problem if a program repeatedly used locations 3 and 11. Now both 3 and 11 can coexist in the cache. We only have a problem if the program cycles through locations 3, 11, and 19 - all three map to row 3, but only two can live there at once. A sketch of this, including the one-bit LRU, follows below.
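Here's the 2-way version, with the one-bit-per-row LRU described above. Same caveats as before: word-addressed, no data array, and all the names are mine:

#include <stdbool.h>
#include <stdint.h>

#define ROWS 4    /* 8 entries arranged as 2 banks of 4 rows, word-addressed */
#define BANKS 2

typedef struct {
    bool     valid;
    uint32_t tag;
} Line;

static Line cache[BANKS][ROWS];
static int  lru_bit[ROWS];   /* the 1 LRU bit per row: which bank was touched last */

bool access(uint32_t addr)
{
    uint32_t row = addr % ROWS;   /* last 2 bits pick the row */
    uint32_t tag = addr / ROWS;

    /* Compare both banks' tags for this row (in parallel in hardware). */
    for (int bank = 0; bank < BANKS; bank++) {
        if (cache[bank][row].valid && cache[bank][row].tag == tag) {
            lru_bit[row] = bank;   /* touch bank 0 -> clear the bit, bank 1 -> set it */
            return true;
        }
    }

    /* Miss: bit 0 means evict bank 1, bit 1 means evict bank 0. */
    int victim = 1 - lru_bit[row];
    cache[victim][row].valid = true;
    cache[victim][row].tag   = tag;
    lru_bit[row] = victim;         /* the fill counts as a touch */
    return false;
}

With this layout, access(3) and access(11) settle into different banks of row 3 and both hit on repeat, but adding access(19) to the loop brings the thrashing right back, since row 3 only has two ways for three competing addresses.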