AtenRa
That would be contrary to everything AMD has said about Steamroller.
I have never said that this is SteamRoller
It could be 4x128; they look very much like the existing ones (higher/lower order bits are split), there are just double the number of them. There also looks to be double the amount of register/queue space for the FPU, so maybe some kind of coarse-grained separation between them can take pressure off needing so many read/write ports on one register/queue with a 4x128-bit design.
If AMD really wanted to they could have expanded decode and still kept it shared.
The die pic shows 4 ALUs + 4 AGUs per integer core. I don't believe that they will use 8 pipes per thread. The utilization of all 8 pipes from a single thread will be very low.
This implementation is surely a 4-thread design.
The ratio of performance gain to die area used will be even lower.
Maybe this diagram is off or my interpretation but it looks like fetch and L1I are separate here.
Where are the L1i tags? Shouldn't they be on the order of 1/8th of the L1i arrays for a 64B line 48-bit VIPT cache? The only structures close enough that are roughly the correct size is the part you've marked ROM. Putting the cache tags in the middle between two halves of one unified cache is a very common design, so looking at that I'd expect the L1i cache to be unified (and a lot bigger than it used to be). Also, being of different color makes sense because tags are combined logic+memory, while ucode rom is typically just memory.
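A rough way to sanity-check the 1/8th estimate above: the sketch below compares tag bits per line against the 512 data bits of a 64B line. Everything in it is an assumption for illustration: the function name `tag_overhead`, the cache geometries tried, treating the tag as 48 address bits minus the index and offset bits, and the `state_bits` knob standing in for whatever valid/parity/predecode bits are stored alongside each tag. None of it comes from AMD documentation.

```python
def tag_overhead(cache_bytes, ways, line_bytes=64, addr_bits=48, state_bits=0):
    """Return (tag bits per line, tag-to-data size ratio).

    Hypothetical helper: assumes a set-associative cache whose way size
    is a power of two, and a tag made of the address bits not covered
    by the set index and the line offset.
    """
    way_bytes = cache_bytes // ways
    # Index bits + offset bits together address one way of the cache.
    index_plus_offset = way_bytes.bit_length() - 1
    tag_bits = addr_bits - index_plus_offset + state_bits
    data_bits = line_bytes * 8
    return tag_bits, tag_bits / data_bits


# Assumed example geometries, not measured from the die shot.
for size_kb, ways in [(64, 2), (128, 2), (128, 4)]:
    bits, ratio = tag_overhead(size_kb * 1024, ways)
    print(f"{size_kb} KB {ways}-way: {bits} tag bits/line, ratio {ratio:.3f}")
```

Under these assumptions a 64 KB 2-way cache comes out at 33 tag bits per line, roughly 1/16 of the data array; any extra per-line state passed as `state_bits` moves the figure toward 1/8.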
I have never said that this is SteamRoller
I think it's more likely that they are moving back to the K7 idea of having an ALU + AGU per pipe, and generally only being able to use one of them per clock unless you can avoid loads from the RF. 4x(alu+agu) would potentially be a nice IPC boost, without being a total mess to feed/forward like 8 full exec units.
So you are trolling then? Because we are discussing Steamroller and how it is moving away from CMT as it exists in BD.
Are you going to stop trolling and contribute to the discussion?
How about 4 threads per module, with 2 threads per INT core? I.e. two hyperthreaded INT cores and one really big FPU in each module?
A 4-thread design with both threads getting a 2x2 cluster would imply the threads having separate register files. This does not appear to be the case; the register file is the same once-duplicated one in BD.
Yes, but as you say yourself, BD already has a duplicated register file, i.e. 2 independent sets of registers. A guy at the SA forum quoted a text from AMD that said the register file was only duplicated to shorten the wire length, for latency reasons.
The latency from the top-left AGU to the bottom ALU would be too long because of the vertical distance.
Thus having 2 AGUs and 2 ALUs in the same plane/line around one set of registers seems OK to me for running 2 threads in one INT cluster: each thread gets one register set and 2 AGUs/ALUs, one at the top and one at the bottom.
The only problem I see in the overall design is indeed the missing ROMs. I wonder what happened to them. Maybe there is a new technique to get rid of them, like storing the microcode in the L1I and loading it dynamically from the µCode in the mainboard BIOS files?
We were discussing the die pic in the OP and whether that implementation deviates from the CMT design.
But it seems that even talking technically can make you a troll these days.
He misunderstood what he was reading. The INC units were duplicated for wire latency. The RF is not duplicated for wire latency; it's duplicated to lower port count.
There is simply no way, no how AMD will be able to run 8 reads + 4 writes from a single, non-duplicated RF at anything near the clocks they do. Simply not physically possible. The duplication is necessary to make the high count of read accesses possible.
Actually, I now think the ROM is probably the purple block. I found the tags; they were on the wrong side of the fetch (/facepalm).
LOL, Excavator is still on 28nm but they went ahead and doubled FPU resources?
What power envelope are they targeting? And they'd better be able to charge high prices for these, since the dies will be huge.
Based on Tuna-Fish's response, if this die is indeed Steamroller, it seems in no way guaranteed that AMD is moving to a 1 module/2core/4 thread design. If anything, it seems the opposite (staying 2 core 2 thread but giving each core more execution resources)
Tuna-Fish, what is your rough estimate of how credible the die shot is? In percent terms...
Ah ok, then let's say port count, sounds credible, too.
What would happen if the limit were 4 reads / 2 writes?
Hmm, where do you mean the tags are now? Under the L1I cache? Sorry, I cannot follow.
Excavator has an ATI high density library. AMD considers it equal to a full node shrink.
The die is not that much bigger; you can estimate the size from the L2 cache segments that are visible at the right side of the photo.
Ok, but if you have 2 threads running, each on one of the 2 RFs, then you don't have to sync the RFs, thus you can double the write ports, can't you?
The RF in BD is 4 reads/4 writes. Duplicating the RF, with both copies sharing the same register set, can be used to increase read port count, but not write port count. (Both register files must do all writes to keep their contents in sync.)
There is an AMD patent for that:
With a 4/4 register file, you can only issue instructions if enough ports are available. Since normal integer instructions are 2r/1w, that means typically two, removing most gains from duplicating the execution units. You can avoid some register reads through forwarding (Intel's pre-SNB CPUs are really good at this; the RF is significantly underprovisioned), but that means that RF ports turn into resources that you need to track and manage, and IIRC AMD has never done this, other than in the K7 style where the read ports are shared between the AGU and ALU that are part of the same pipe.
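The port arithmetic in the posts above can be put into a toy model. This is only a sketch under stated assumptions: every op is a plain 2-read/1-write integer op, forwarding is ignored entirely, and RF copies hold the same registers, so extra copies add read ports but never write ports. The function name and its parameters are made up for illustration.

```python
def max_issue_per_cycle(read_ports, write_ports, rf_copies=1,
                        reads_per_op=2, writes_per_op=1, pipes=4):
    """Upper bound on ops issued per cycle, limited by RF ports.

    Hypothetical model: copies of the register file multiply the
    available read ports, but every copy must perform every write to
    stay in sync, so the write port count does not scale with copies.
    """
    total_reads = read_ports * rf_copies
    limit_by_reads = total_reads // reads_per_op
    limit_by_writes = write_ports // writes_per_op
    return min(pipes, limit_by_reads, limit_by_writes)


# Single 4r/4w register file feeding 4 pipes.
print(max_issue_per_cycle(4, 4))
# Same file duplicated: read ports double, write ports do not.
print(max_issue_per_cycle(4, 4, rf_copies=2))
# The 4 reads / 2 writes case asked about earlier, duplicated.
print(max_issue_per_cycle(4, 2, rf_copies=2))
```

In this model a single 4r/4w file caps issue at two ops per cycle, duplication lifts that to four, and a 4r/2w file stays at two even when duplicated because the writes become the bottleneck.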
I agree with this. If AMD wanted to keep AM2 around to appeal to that market, they would be in a far worse mess.
AM3+ isn't that old, stop exaggerating as if we were talking about AM2.
There have been three major architecture families compatible with AM3+ already (Deneb/Thuban, Zambezi, Vishera). It just feels old and ready for a refresh because all three performed more or less the same. :awe:
That die doesn't look real to me in some areas. It's possible the structures were optimized for...something...compared to more traditional structures. I'm not calling fake, but it looks odd to me.
The interface section sticks out to me more than anything else (probably because I'm an I/O designer). I/Os are highly regular structures that share a ton of wires across the lanes of the port(s), which means colinear alignment makes the most sense. The I/Os in this image look like they are grouped in small grids instead of lines, which makes routing congestion much more difficult to overcome.