<<
<< I am curious about how the cache works >>
Caches are pretty complex...IMHO the cache and its cache controller are logically one of the more complex parts of the CPU. Caches work on the principles of temporal and spatial locality: if a certain memory address is accessed, chances are it will be accessed again soon, and chances are an address nearby will be accessed soon as well. Based on this, caches have two properties: data is stored in the cache in blocks, which contain more than one word (the P4's L2 cache has a 1024-bit block size), and the most recently used blocks are kept in the cache.
But because the cache stores many fewer blocks than are available in system memory, there has to be a way to store and replace blocks in the cache. As an example, let's say a CPU has 16-bit data words and addresses (giving 64K words of addressable memory, 16 bits each), and its cache has 1024 blocks, each holding two 16-bit words.
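(To put numbers on that: 16-bit addresses give 2^16 = 64K addressable words, or 128 KB of memory, while the cache holds 1024 blocks x 2 words x 2 bytes = 4 KB, so only a small fraction of memory can be cached at any one time.)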
The simplest method of block placement is direct-mapped, where each block is mapped to one specific line. Thus, line 0 in the cache stores blocks 0, 1024, 2048, etc.; line 1 stores blocks 1, 1025, 2049, etc. Since there are two words per block, the lowest bit of the address selects the word within the block, and since there are 1024 lines, the next 10 bits select the line within the cache. That leaves 5 address bits, and these 5 bits get stored as a tag next to each line of the cache. So on a cache read, the 10 line bits select the line, and the 5 tag bits of the read address are compared to the tag stored at that line. If they are the same, then the address exists at that line, and the proper word is read. If the tags are not the same, then the existing block in the cache may have to be written back to memory (if its dirty flag is set, signalling that the data has been changed since it was last read from memory), and the proper block is fetched from memory and stored in the cache. Then the read is executed again, and the correct word is returned from the cache. There may be a lot of steps involved in a cache miss, but the advantage is that the most-used addresses are kept in the cache, so overall the CPU is able to stay out of main memory as much as possible. The advantage of direct-mapped caching is that you know exactly where a block might be located in the cache; the disadvantage is that, for example, you can't store block 0 and block 1024 in the cache at the same time, so the hit-rate is lowered.
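If it helps, here's a rough C sketch of that direct-mapped lookup, simulating main memory as an array. The struct and names are just made up for illustration, not any real CPU's implementation:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_LINES        1024
#define WORDS_PER_BLOCK  2

typedef struct {
    bool     valid;
    bool     dirty;
    uint16_t tag;                    /* top 5 address bits */
    uint16_t data[WORDS_PER_BLOCK];  /* the cached block   */
} cache_line_t;

static cache_line_t cache[NUM_LINES];
static uint16_t memory[1 << 15][WORDS_PER_BLOCK];  /* backing store: 32K blocks */

uint16_t cache_read(uint16_t addr)
{
    uint16_t offset = addr & 0x1;          /* 1 bit:   word within the block */
    uint16_t index  = (addr >> 1) & 0x3FF; /* 10 bits: line number           */
    uint16_t tag    = addr >> 11;          /* 5 bits:  tag                   */
    cache_line_t *line = &cache[index];

    if (!line->valid || line->tag != tag) {            /* miss */
        if (line->valid && line->dirty)                /* write back the old dirty block */
            memcpy(memory[(line->tag << 10) | index], line->data, sizeof line->data);
        memcpy(line->data, memory[addr >> 1], sizeof line->data);  /* fetch the new block */
        line->tag   = tag;
        line->valid = true;
        line->dirty = false;
    }
    return line->data[offset];                         /* hit (possibly after the fill) */
}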
At the opposite end is fully-associative mapping, in which any block can be stored in any line of the cache. On a read, the tag of the address is compared to the tag of every line of the cache...this is done in parallel, but the extra comparator hardware and large multiplexor involved slows down the access time. On a cache miss, the least-recently used block is replaced. The advantage is that the hit-rate is at a maximum.
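In software terms, that fully-associative lookup with LRU replacement would be something like the loop below (in real hardware all the tag compares happen at once; the field and function names here are just made up for the example):

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 1024

typedef struct {
    bool     valid;
    uint16_t tag;        /* the full block address, since a block can sit anywhere */
    uint32_t last_used;  /* timestamp for the LRU policy */
} fa_line_t;

static fa_line_t lines[NUM_LINES];
static uint32_t  now;

/* Returns the line holding block_addr on a hit, otherwise the
   least-recently-used line, which the caller would refill. */
int fa_lookup(uint16_t block_addr, bool *hit)
{
    int victim = 0;
    for (int i = 0; i < NUM_LINES; i++) {
        if (lines[i].valid && lines[i].tag == block_addr) {
            lines[i].last_used = ++now;   /* hit: mark as most recently used */
            *hit = true;
            return i;
        }
        if (lines[i].last_used < lines[victim].last_used)
            victim = i;                   /* remember the least recently used line */
    }
    *hit = false;
    return victim;
}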
In between is set-associative mapping, in which the cache is divided into sets, each containing a few lines (called ways). Each block gets mapped to a specific set, but within the set, the block can be placed in any way. For example, make the 1024-line cache 4-way set-associative: that gives 256 sets of 4 lines each, so blocks 0, 256, 512, etc. map to set 0, blocks 1, 257, 513, etc. to set 1, and so forth. On a read, 8 address bits select the set, so only the four tags in that set need to be compared to the tag of the address. Thus, the more ways you have (the higher the associativity), the better the hit-rate, but the slower the access time becomes.
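Here's the same kind of sketch for the 4-way set-associative version of the example cache (256 sets of 4 lines; again, the names are invented for illustration):

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 256
#define NUM_WAYS 4

typedef struct {
    bool     valid;
    uint16_t tag;   /* 7 bits left over: 16 - 8 (set index) - 1 (word offset) */
} sa_line_t;

static sa_line_t sets[NUM_SETS][NUM_WAYS];

/* Returns true on a hit and stores the matching way in *way_out. */
bool sa_lookup(uint16_t addr, int *way_out)
{
    uint16_t index = (addr >> 1) & 0xFF;   /* 8 bits pick the set          */
    uint16_t tag   = addr >> 9;            /* remaining 7 bits are the tag */

    /* only the 4 tags in this set need comparing (done in parallel in hardware) */
    for (int way = 0; way < NUM_WAYS; way++) {
        if (sets[index][way].valid && sets[index][way].tag == tag) {
            *way_out = way;
            return true;    /* hit */
        }
    }
    return false;           /* miss: a victim way (e.g. the LRU one) gets replaced */
}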
So when designing a cache, there are a lot of factors involved in the hit-rate and access time: block size, cache size, mapping method, associativity, etc. CPU caches typically use set-associative mapping, while TLBs (translation lookaside buffers, used in virtual addressing) use fully-associative mapping to minimize misses, since a TLB miss means an expensive page-table walk. >>
Wanna know what's funny? You wrote all that and never answered his question LOL. I'm just kidding with you, that was still informative.