Inner workings of cache

MadRat

Lifer
Oct 14, 1999
11,922
259
126
I was reading some material about how L1 and L2 caches work. It dawned on me that all of the INPUT and OUTPUT goes through the cache, never directly to memory.

Apparently the system memory doesn't operate fast enough to receive the flood of output. The fact that programmers try to fit their entire application into the cache makes a lot of sense. It doesn't leave a lot of room for new data from memory if the cache is full of the program.

I am curious about how the cache works. If the cache is for both INPUT/OUTPUT, does that mean the L1 and L2 caches are half for INPUT and half for OUTPUT? Or can the entirety of each cache do both?
 

Dulanic

Diamond Member
Oct 27, 2000
9,949
569
136
They can do both input and output. Think of it like this... cache is nothing but RAM that is smaller, onboard, and faster, because that's exactly what it is. If half were for input and half for output you would have to store the data twice, which would be highly inefficient. The reason you want both L1 and L2 cache is this: if you had one large L1 cache it would take a while to search it, adding more latency. That is why L1 is kept somewhat small, though now that cache runs so fast, L1 has become slightly larger on most CPUs compared to back in the day. So important data that is needed instantly should be kept in L1 cache. If the CPU can't find it in L1 it's going to go into L2; a larger L2 cache will also add latency, but that is covered up more because the larger it is, the more data is in there and the less often it has to go out to memory.
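
To make that lookup order concrete, here is a minimal Python sketch of a two-level hierarchy. The sizes, latencies, and dictionary-based "caches" are made up for illustration only; this is not how real hardware stores or times anything, and no eviction is modeled.

# Toy model of an L1 -> L2 -> memory lookup, purely illustrative.
# Latency numbers are invented round figures, not real CPU timings.
L1 = {}            # small, fast
L2 = {}            # bigger, slower
MEMORY = {addr: addr * 2 for addr in range(4096)}   # pretend RAM contents

L1_LATENCY, L2_LATENCY, MEM_LATENCY = 2, 8, 100     # cycles (assumed)

def load(addr):
    """Return (value, cycles_spent) for a read of addr."""
    if addr in L1:                       # L1 hit: fastest case
        return L1[addr], L1_LATENCY
    if addr in L2:                       # L1 miss, L2 hit: promote into L1
        L1[addr] = L2[addr]
        return L2[addr], L1_LATENCY + L2_LATENCY
    value = MEMORY[addr]                 # miss everywhere: go to RAM
    L2[addr] = value                     # fill both levels on the way back
    L1[addr] = value
    return value, L1_LATENCY + L2_LATENCY + MEM_LATENCY

print(load(42))   # first access: slow (goes all the way to memory)
print(load(42))   # second access: fast (L1 hit) -- temporal locality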
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0


<< I am curious about how the cache works >>

Caches are pretty complex...IMHO cache and the subsequent cache controller design is logically one of the more complex parts of the CPU. Caches work on the principles of temporal and spatial locality. That is, if a certain memory address is accessed, chances are it will be accessed again soon; also if a certain memory address is accessed, chances are an address nearby will be accessed soon. Based on this, caches have two properties: data is stored in the cache in blocks, which contain more than one word (the P4's L2 cache has a 1024-bit block size). Also, the most recently-used blocks are kept in the cache.

But because the cache stores many fewer blocks than are available in system memory, there has to be a way to store and replace blocks in the cache. As an example, let's say a CPU has 16-bit data words and addresses (giving 64K words in memory, 16 bits each), and its cache has 1024 blocks, each with two 16-bit words.

The simplest method of block placement is direct-mapped, where each block is mapped to one specific location. Thus, line 0 in the cache stores blocks 0, 1024, 2048, etc.; line 1 stores blocks 1, 1025, 2049, etc. Since there are 1024 lines and two words per block, the lowest bit of the address selects the word within the block, and the next 10 bits select the line within the cache. Since there are 5 address bits left, those 5 bits get stored as a tag next to each line of the cache. So on a cache read, the 10 line bits select the line, and the 5 tag bits of the read address are compared to the tag bits stored at that line. If they are the same, then the address exists at that line, and the proper word is read. If the tags are not the same, then the existing block in the cache may have to be written back to memory (if its dirty bit is set, signalling that the data has been changed since it was last read from memory), and the proper block is fetched from memory and stored in the cache. Then the read is executed again, and the correct word is returned from the cache. There may be a lot of steps involved in a cache miss, but the advantage is that the most used addresses are kept in the cache, so overall the CPU is able to stay out of memory as much as possible. The advantage of direct-mapped caching is that you know exactly where a block might be located in the cache; the disadvantage is that, for example, you can't store block 0 and block 1024 in the cache at the same time, so the hit rate is lowered.
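
If it helps, here is a minimal Python sketch of that exact direct-mapped example. The field widths come from the numbers above; the list of (valid, tag, block) tuples is just an illustration, not real hardware.

# Direct-mapped lookup for the example above: 16-bit word addresses,
# 2 words per block, 1024 lines -> 1 offset bit, 10 index bits, 5 tag bits.
NUM_LINES = 1024

# Each line holds (valid, tag, [word0, word1]); all invalid to start.
cache = [(False, 0, [0, 0]) for _ in range(NUM_LINES)]

def split_address(addr):
    offset = addr & 0x1            # bit 0: word within the block
    index  = (addr >> 1) & 0x3FF   # bits 1-10: line within the cache
    tag    = addr >> 11            # bits 11-15: tag stored with the line
    return tag, index, offset

def read(addr, memory):
    """Return (word, hit) for a 16-bit word address."""
    tag, index, offset = split_address(addr)
    valid, stored_tag, block = cache[index]
    if valid and stored_tag == tag:
        return block[offset], True           # hit: word comes from the cache
    # Miss: fetch the whole 2-word block from memory and replace the line
    # (a write-back of the old block would happen here if it were dirty).
    base = addr & ~0x1
    block = [memory[base], memory[base + 1]]
    cache[index] = (True, tag, block)
    return block[offset], False

memory = {a: a ^ 0xBEEF for a in range(1 << 16)}   # fake memory contents
print(read(0x1234, memory))   # (value, False) -- cold miss
print(read(0x1235, memory))   # (value, True)  -- same block, spatial locality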

At the opposite end is fully-associative mapping, in which any block can be stored in any line of the cache. On a read, the tag of the address is compared to the tag of every line of the cache...this is done in parallel, but the extra comparator hardware and large multiplexor involved slows down the access time. On a cache miss, the least-recently used block is replaced. The advantage is that the hit-rate is at a maximum.

In between is set-associative, in which the cache is divided into sets. Each block gets mapped to a specific set, but within the set the block can be placed anywhere. For example, say the 1024-line cache is 4-way set-associative. Then there are 256 sets of four lines each... block 0, 256, 512, etc. map to the first set, block 1, 257, 513, etc. to the second, and so forth. On a read, 8 address bits select the set, so only the four tags in that set need to be compared to the tag of the address. Thus, the more ways per set you have, the better the hit rate, but the slower the access time becomes.
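
And the same example reworked as 4-way set-associative, to show how the index and tag widths change. Again a toy sketch; the least-recently-used bookkeeping is a deliberately crude stand-in.

# 4-way set-associative version of the 1024-line example:
# 1 offset bit, 8 set-index bits (256 sets), 7 tag bits.
NUM_SETS, WAYS = 256, 4

# sets[index] is a list of up to WAYS entries: (tag, block, last_used)
sets = [[] for _ in range(NUM_SETS)]
clock = 0   # crude counter for least-recently-used replacement

def lookup(addr):
    """Return True on a hit, False on a miss (filling the set as a side effect)."""
    global clock
    clock += 1
    offset = addr & 0x1
    index  = (addr >> 1) & 0xFF    # 8 bits -> which set
    tag    = addr >> 9             # remaining 7 bits -> tag
    ways = sets[index]
    for i, (t, block, _) in enumerate(ways):
        if t == tag:               # compare all (up to 4) tags in the set
            ways[i] = (t, block, clock)
            return True
    block = [0, 0]                 # pretend fill from memory
    if len(ways) == WAYS:          # set full: evict the least-recently-used way
        ways.remove(min(ways, key=lambda w: w[2]))
    ways.append((tag, block, clock))
    return False

print(lookup(0x0000))  # False -- block 0, cold miss
print(lookup(0x0800))  # False -- block 1024 lands in the same set as block 0
print(lookup(0x0000))  # True  -- both blocks coexist; direct-mapped could not do this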

So when designing a cache, there are a lot of factors involved in the hit rate and access time: block size, cache size, mapping method, associativity, etc. CPU caches typically use set-associative mapping, while TLBs (translation lookaside buffers, used in virtual addressing) are usually fully-associative to keep their miss rate as low as possible, since a TLB miss is so costly.
 

Dulanic

Diamond Member
Oct 27, 2000
9,949
569
136


<<

<< I am curious about how the cache works >>

Caches are pretty complex...IMHO cache and the subsequent cache controller design is logically one of the more complex parts of the CPU. [snip]
>>



Wanna know what's funny? You wrote all that and never answered his question. LOL, I'm just kidding with you, that was still informative.
 

Dulanic

Diamond Member
Oct 27, 2000
9,949
569
136


<< LOL...um, no Madrat, caches handle both input and output. >>



:Q
 

br0wn

Senior member
Jun 22, 2000
572
0
0
Just trying to complete Sohcan's explanation of caches.

There are also two write policies for caches:
1. Write-through cache.
Every write goes to both the cache and memory.
2. Write-back cache.
Every write goes only to the cache. When a modified (dirty) block is later
replaced, the evicted block is then written back to memory.

Write-back caching can improve performance when memory can't keep up with
the rate at which the processor writes data. However, it is much more
complex to implement than write-through.
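
A tiny Python sketch of the two policies. One dictionary plays the cache and another plays memory; the write_through/write_back names, the 4-entry capacity, and the eviction choice are all made up for illustration.

# Toy write-through vs write-back, ignoring blocks/tags entirely:
# the "cache" here is a plain dict with a fixed capacity of 4 entries.
CAPACITY = 4

def write_through(cache, memory, addr, value):
    # Every store updates memory immediately, so evictions never write back.
    cache[addr] = value
    memory[addr] = value

def write_back(cache, dirty, memory, addr, value):
    # Stores touch only the cache and mark the entry dirty.
    if addr not in cache and len(cache) >= CAPACITY:
        victim, _ = cache.popitem()              # evict some entry (policy not modeled)
        if victim in dirty:
            memory[victim] = dirty.pop(victim)   # write the dirty data back now
    cache[addr] = value
    dirty[addr] = value

memory = {}
cache, dirty = {}, {}
for a in range(6):                  # six stores into a 4-entry cache
    write_back(cache, dirty, memory, a, a * 10)
print(memory)   # only the evicted dirty entries have reached memory so far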

Then, there is a cache for the program/code (instruction cache) and there
is a cache for data (data cache).

To answer your question whether half of the cache handles input and the other half handles output: this doesn't make sense. Consider this: input = read and output = write. In order for the cache to handle input (a read hit in the cache), it has to have that data in the cache (the data has to be written into the cache first). Thus, there is no DIVISION where half of the cache handles input and the other half handles output. All of the cache handles both input and output.

The idea of having distinct L1 and L2 caches plus memory (i.e. a memory hierarchy) is to help CPU performance. A very fast memory has to be small (a hardware limitation). This very fast memory is usually the L1 cache. The second-fastest memory is the L2 cache; it is slower than L1 but bigger. Then we have system memory (like your SDRAM).

To understand this memory hierarchy, consider this hypothetical example.
L1 cache: 16KB, accessible in 1-2 clock cycles.
L2 cache: 256KB, accessible in 3-6 clock cycles.
System memory: anywhere from 1MB-2GB, accessible in 10-100 clock cycles.
We can also include the hard drive as a memory type: 10MB-100GB, accessible in 10,000+ clock cycles.

"L1 cache can be accessed in 1-2 clock cycles" means that a read or write (load or store) to the cache takes 1-2 CPU cycles (taking 2 clock cycles means it takes twice as long as a simple 1-cycle instruction). Thus, if all data can be fitted into the L1 cache, all reads and writes will take 1-2 clock cycles, compared to the 10-100 clock cycles an instruction would spend waiting on memory.
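
To put rough numbers on why the hierarchy pays off, here is a quick average-memory-access-time (AMAT) calculation in Python. The hit rates are invented, and the latencies only echo the hypothetical figures above.

# Average memory access time for a two-level hierarchy:
# AMAT = L1_hit_time + L1_miss_rate * (L2_hit_time + L2_miss_rate * memory_time)
# All numbers below are assumptions for illustration only.
l1_time, l2_time, mem_time = 2, 6, 100   # cycles
l1_miss, l2_miss = 0.05, 0.20            # miss rates (95% and 80% hit)

amat = l1_time + l1_miss * (l2_time + l2_miss * mem_time)
print(amat)          # 2 + 0.05 * (6 + 20) = 3.3 cycles on average

# Without any cache, every access would cost the full memory latency:
print(mem_time)      # 100 cycles -- roughly 30x slower in this made-up case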

 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81


<< I was reading some material about how L1 and L2 caches work. It dawned on me that all of the INPUT and OUTPUT goes through the cache, never directly to memory. >>



It's not really true to say "all of the input and output goes through the cache". In the case of a load or store miss, the access does go directly out to memory. And in most caches there are strange situations where you can have a load or store hit and still end up having to go to memory.



<< To answer your question whether half of the cache handles input and the other half handles output: this doesn't make sense. Consider this: input = read and output = write. In order for the cache to handle input (a read hit in the cache), it has to have that data in the cache (the data has to be written into the cache first). Thus, there is no DIVISION where half of the cache handles input and the other half handles output. All of the cache handles both input and output. >>



Actually, if you break up the actual cache design, one section does deal specifically with input and one deals specifically with output. The fill and store paths handle input (to the cache) and the load path deals with output (from the cache), and these are definitely separate parts of the cache that need to be considered in the design. Since a CPU is designed hierarchically, there will be definite, unique, individual blocks of a cache that are dedicated to either reading from or writing to the cache.

Patrick Mahoney
McKinley L1 Cache Circuit Design
Intel Corp
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
pm: how do DMA requests play into caching and virtual addressing? Do DMA requests go to the cache and TLB, or do they go directly to the main memory and page table?

(btw, great to have you back )
 

br0wn

Senior member
Jun 22, 2000
572
0
0

<<
<< To answer your question whether half of the cache handles input and the other half handles output... >>

Actually, if you break up the actual cache design, one section does deal specifically with input and one deals specifically with output. The fill and store paths handle input (to the cache) and the load path deals with output (from the cache), and these are definitely separate parts of the cache that need to be considered in the design. Since a CPU is designed hierarchically, there will be definite, unique, individual blocks of a cache that are dedicated to either reading from or writing to the cache.
>>


Ah, you are talking about the CONTROL path for the cache which splits into two, one handling input and the other handling output.
I don't think a 128KB L1 cache is divided into two halves, one 64KB handling input only and one 64KB handling output only (as I mentioned, they have to work together, meaning the full 128KB is used for both input and output as STORAGE/DATA; of course the control path can be divided into two parts).
However, if you have some inside Intel information about this implementation, please correct me
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81


<< Ah, you are talking about the CONTROL path for the cache which splits into two, one handling input and the other handling output. >>

Ok, well, true, I was splitting hairs a bit, but Madrat did have a point about the fact that different portions of a cache are dedicated to different functions. But you are correct, I'm not referring to the division of the memory array of a cache, but the division of the cache unit itself.

I seem to think of a cache differently than others on here - I think of it as a device or a design or a unit - not as a memory array. To me it is a thing that does everything that a cache does - TLB control, bypassing, buffering, etc. as well as a memory array. On a very small cache, the rest of that stuff can be (and actually has been in the past) larger than the actual memory array. So when I talk about parts of a cache being dedicated to various functions, I'm not talking about the data memory element, but the actual cache unit itself - the unit on the chip that is the cache. And the read and write paths of the cache can take up distinctly different portions of a cache unit, and they can be very large - not half and half, but possibly 1/3 and 1/3 - especially if you have individually dedicated TLBs or tags for the read and write ports.

But I do agree that the memory array is not split up.



<< pm: how do DMA requests play into caching and virtual addressing? Do DMA requests go to the cache and TLB, or do they go directly to the main memory and page table? >>

DMAs are, I believe, handled completely by the chipset, not the CPU or its caches. But to be honest, I don't really know. BTW, thanks. I can honestly say that it's certainly nice to be back.
 

MadRat

Lifer
Oct 14, 1999
11,922
259
126
I'm wondering about this hierarchy design. In a hierarchy I assume that specific paths are chosen in order to simplify organization. I'm assuming one task is done at a time in order to keep everything orderly. With modern CPU designs I'd think that different parts (instruction units) of the processor need access at different intervals for different reasons.

Do you mean by hierarchical that each instruction unit then has some type of flag to control its access to the caches? I'd think that there would be some type of control bit that determines who (meaning the pipeline?) is accessing the cache on any particular cycle.

Or by hierarchical do you simply mean that a series of logical steps is used to resolve the actual control of the caches?
 

br0wn

Senior member
Jun 22, 2000
572
0
0


<<
how do DMA requests play into caching and virtual addressing? Do DMA requests go to the cache and TLB, or do they go directly to the main memory and page table?
>>



DMA (Direct Memory Access) requests don't go through the cache or the TLB.

DMA is a specialized controller that transfers data between an I/O device and memory independently of the CPU. The DMA controller becomes the bus master and directs the reads or writes between itself and memory.

There are three steps in a DMA transfer (a sketch follows the list):
1. The processor sets up the DMA by supplying all the necessary information (identity of the device, operation to perform, source and destination memory addresses, ...).
2. The DMA controller starts the operation on the device and arbitrates for the bus. When data is available, it transfers the data. The DMA device supplies the memory address for the read or write. If the request requires more than one transfer of data on the bus, the DMA unit generates the next memory address and initiates the next transfer. This way, it can transfer thousands of bytes without bothering the processor.
3. Once the DMA transfer is complete, it interrupts the processor.
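
Here is a toy Python simulation of those three steps. The controller class, its method names, and the "registers" are invented purely for illustration; real DMA engines are programmed through device-specific hardware registers.

# Toy simulation of the three DMA steps above, nothing more.
class ToyDMAController:
    def __init__(self, memory):
        self.memory = memory

    def setup(self, device_buffer, dest_addr, length, on_complete):
        # Step 1: the processor supplies source, destination, length,
        # and what to do when the transfer finishes (the "interrupt").
        self.device_buffer = device_buffer
        self.dest_addr = dest_addr
        self.length = length
        self.on_complete = on_complete

    def run(self):
        # Step 2: the controller generates each memory address itself and
        # moves the data without involving the processor.
        for i in range(self.length):
            self.memory[self.dest_addr + i] = self.device_buffer[i]
        # Step 3: signal completion back to the processor.
        self.on_complete()

memory = {}
incoming = [0xDE, 0xAD, 0xBE, 0xEF]                  # pretend device data
dma = ToyDMAController(memory)
dma.setup(incoming, dest_addr=0x1000, length=len(incoming),
          on_complete=lambda: print("interrupt: DMA transfer done"))
dma.run()
print(memory)   # device bytes landed in memory without the CPU copying them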



<<
With modern CPU designs I'd think that different parts (instruction units) of the processor need access at different intervals for different reasons.

Do you mean by hierarchical that each instruction unit then has some type of flag to control its access to the caches? I'd think that there would be some type of control bit that determines who (meaning the pipeline?) is accessing the cache on any particular cycle.

Or by hierarchical do you simply mean that a series of logical steps is used to resolve the actual control of the caches?
>>



It is not true that all different parts of the processor need to access the memory.
Consider a simple pipeline processor which has five stages:
IF ID EX MEM WB
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory operation
WB = Write back

So in this simple pipelined processor, only the MEM stage deals with data memory operations (the IF stage also reads from memory, but only to fetch instructions, via the instruction cache).

The MEM stage doesn't control the cache directly. Instead, it sends requests through read/write ports. If the data is in the cache (a hit), the data is fetched from the cache instead of going to memory.
So the processor doesn't know in advance (except in some special cases) whether the data is in the cache or not.
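
A very rough Python sketch of that idea: each instruction walks through the five stages, and only the MEM stage ever touches the (toy) data cache. The stage behavior is heavily simplified and purely illustrative.

# Heavily simplified 5-stage walk-through: IF, ID, EX, MEM, WB.
# Only MEM talks to the data cache; everything else is a stub.
data_cache = {0x40: 7}           # pretend one word is already cached
memory     = {0x40: 7, 0x44: 9}

def run_load(addr):
    trace = []
    trace.append("IF: fetch the load instruction (from the instruction cache)")
    trace.append("ID: decode it as a load, read the base register")
    trace.append("EX: compute the effective address 0x%x" % addr)
    if addr in data_cache:                       # MEM: the only data-memory stage
        value = data_cache[addr]
        trace.append("MEM: hit in the data cache")
    else:
        value = memory[addr]
        data_cache[addr] = value                 # fill the cache on a miss
        trace.append("MEM: miss -- stall while the data comes from memory")
    trace.append("WB: write %d back into the destination register" % value)
    return value, trace

for line in run_load(0x44)[1]:
    print(line)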


 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
I guess the DMA question leads me to another one: how does virtual addressing play into the cache? The CPU only generates virtual addresses, right? So does that mean all requests to the L1 and L2 cache have to go to the TLB first (and in the case of a TLB miss, the page table) to get the physical address? Can data from multiple user or kernel processes be stored in the cache at the same time?
 

br0wn

Senior member
Jun 22, 2000
572
0
0


<<
I guess the DMA question leads me to another one: how does virtual addressing play into the cache? The CPU only generates virtual addresses, right? So does that mean all requests to the L1 and L2 cache have to go to the TLB first (and in the case of a TLB miss, the page table) to get the physical address? Can data from multiple user or kernel processes be stored in the cache at the same time?
>>



Actually, there are two design schemes for the CPU to index the cache (a sketch of both follows):
1. Physically addressed cache. The TLB translates the address first, and the resulting physical address is then used to access the cache.
2. Virtually addressed cache. A completely virtual address is used to index the cache, and the tags are virtual addresses too, hence such a cache is virtually indexed and virtually tagged. In this cache, the TLB is unused during normal cache access. However, when a cache miss occurs, the processor needs to translate the address to a physical address so that it can fetch the cache block from main memory.
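
A small Python sketch of the difference in lookup order between the two schemes. The TLB, page table, and caches are plain dictionaries with made-up contents; the only point is where translation happens.

# Two lookup orders, matching the two schemes above. Everything here is a
# toy: dictionaries keyed by page/address, no permissions, no eviction.
PAGE = 4096
tlb        = {0x10: 0x80}                 # virtual page -> physical page
page_table = {0x10: 0x80, 0x11: 0x91}

phys_cache = {}                           # keyed by physical address
virt_cache = {}                           # keyed by virtual address

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE)
    ppage = tlb.get(vpage)
    if ppage is None:                     # TLB miss: walk the page table
        ppage = page_table[vpage]
        tlb[vpage] = ppage
    return ppage * PAGE + offset

def physically_addressed_read(vaddr, memory):
    paddr = translate(vaddr)              # scheme 1: TLB before the cache
    if paddr not in phys_cache:
        phys_cache[paddr] = memory[paddr]
    return phys_cache[paddr]

def virtually_addressed_read(vaddr, memory):
    if vaddr in virt_cache:               # scheme 2: no translation on a hit
        return virt_cache[vaddr]
    paddr = translate(vaddr)              # translate only on a miss
    virt_cache[vaddr] = memory[paddr]
    return virt_cache[vaddr]

memory = {0x80 * PAGE + 4: 1234, 0x91 * PAGE + 8: 5678}
print(physically_addressed_read(0x10 * PAGE + 4, memory))  # 1234
print(virtually_addressed_read(0x11 * PAGE + 8, memory))   # 5678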

To answer the second part of your question:
"Can data from multiple user or kernel processes be stored in the cache at the same time?"

Yes, this is the wonder of the virtual memory scheme.
Since memory can hold data from multiple user or kernel processes at the same time, I don't see why the cache can't hold that data at the same time (however, this is just my opinion, please correct me if I am wrong).
Of course, to be able to handle this, a protection scheme has to be implemented (we don't want one process accessing data from another process), which is entirely another issue.

 

MadRat

Lifer
Oct 14, 1999
11,922
259
126
So you are saying the TLB basically holds virtual addresses of main memory. It would stand to reason that all memory requests must be handled through the cache controller, then, to prevent duplicate replies from both the cache and main memory. There would be a several-cycle delay if the request went first to the cache and then to main memory; either they are mirrored requests to both, or there is a penalty. This means that data can be retrieved through the cache independently of the duplicate information held in main memory. In effect, if it's found in the cache it cancels any request to main memory. Some type of message would have to be sent to the main memory controller in order to cancel the request.

Am I understanding you right?
 

br0wn

Senior member
Jun 22, 2000
572
0
0


<<
So you are saying the TLB basically holds virtual addresses of main memory. It would stand to reason that all memory requests must be handled through the cache controller, then, to prevent duplicate replies from both the cache and main memory. There would be a several-cycle delay if the request went first to the cache and then to main memory; either they are mirrored requests to both, or there is a penalty. This means that data can be retrieved through the cache independently of the duplicate information held in main memory. In effect, if it's found in the cache it cancels any request to main memory. Some type of message would have to be sent to the main memory controller in order to cancel the request.
>>



Not quite. I guess I wasn't very clear in my previous post.

First, each TLB entry holds four items (a minimal sketch follows the list):
1. virtual page number
2. physical page number
3. reference bit
4. dirty bit
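
For concreteness, here is a minimal Python representation of such an entry. The field names are chosen for illustration; real TLB entries also carry protection bits and other flags.

from dataclasses import dataclass

@dataclass
class TLBEntry:
    # The four items listed above, one entry per translated page.
    virtual_page:  int
    physical_page: int
    referenced:    bool = False   # set when the entry is used (helps replacement)
    dirty:         bool = False   # set when the page has been written to

# Example: virtual page 0x10 maps to physical page 0x80.
entry = TLBEntry(virtual_page=0x10, physical_page=0x80)
entry.referenced = True           # a load just used this translation
print(entry)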

Second, a request for data doesn't go to both the cache and memory.
It only goes to the cache. However, when there is a cache miss, the request is then passed on to memory.

 