First Steamroller processor core exposure


Tuna-Fish

Golden Member
Mar 4, 2011
1,420
1,749
136
It could be 4x128; they look very much like the existing ones (higher/lower order bits are split), there is just double the amount of them. There also looks to be double the amount of register/queue for the FPU, so maybe some kind of coarse-grained separation between them can take pressure off needing so many read/write ports on one register file/queue with a 4x128-bit design.

256-bit AVX is designed to be split into two lanes, so that the simplest way to implement it is to just take your old design and double it.
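To illustrate the "take your old design and double it" point, here is a minimal Python sketch, not a model of any real FPU. `add_128` is a hypothetical stand-in for an existing 128-bit unit (4 x 32-bit elements), and `add_256` simply runs two of them side by side on the low and high halves, with no data crossing between the halves.

```python
def add_128(a, b):
    """Hypothetical existing 128-bit unit: element-wise add on 4 x 32-bit values."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]


def add_256(a, b):
    """256-bit add built from two independent 128-bit halves (lanes)."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])    # lower lane
    high = add_128(a[4:], b[4:])   # upper lane, no cross-lane traffic
    return low + high


result = add_256(list(range(8)), [10] * 8)
print(result)
```

Because neither half needs anything from the other, the second 128-bit unit is a copy of the first rather than a new, wider design.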

If AMD really wanted to they could have expanded decode and still kept it shared.

Expanding x86 decode past 4 instructions is really hairy. Each extra instruction to decode takes more hardware. Basically, to decode more than 4 per clock from a single thread, it probably makes more sense to move to some kind of uop caching, and be able to take 4+ from the cache.
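A rough way to see why this gets hairy: x86 instructions vary in length, so the start of instruction N is only known once the lengths of all earlier instructions are known. The Python sketch below is a toy model under stated assumptions, not a real decoder. `toy_length` is a made-up length rule standing in for real x86 length decoding. The sketch guesses a length at every byte offset (the part hardware can do in parallel) and then walks the chain to pick the real starts (the serial part that gets longer with every extra instruction per clock).

```python
def toy_length(window, offset):
    """Hypothetical length rule: low 2 bits of the first byte, plus 1 (1..4 bytes)."""
    return (window[offset] & 0x3) + 1


def find_starts(window, width):
    """Return start offsets of up to `width` instructions in a fetch window."""
    # Parallel part: compute a candidate length at every byte offset.
    lengths = [toy_length(window, i) for i in range(len(window))]
    # Serial part: each start depends on all the lengths before it.
    starts, pos = [], 0
    while len(starts) < width and pos < len(window):
        starts.append(pos)
        pos += lengths[pos]
    return starts


window = bytes([0x03, 0, 0, 0, 0x01, 0, 0x00, 0x02, 0, 0, 0x00, 0x00])
print(find_starts(window, 4))
print(find_starts(window, 6))
```

A uop cache sidesteps the serial walk because the boundaries were already found when the instructions were first decoded.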

The die pic shows 4 ALUs + 4 AGUs per Integer Core. I don’t believe that they will use 8 pipes per Thread. The utilization of all 8 pipes from a single thread will be very low

I think it's more likely that they are moving back to the K7 idea of having an ALU + AGU per pipe, and generally only being able to use one of them per clock unless you can avoid loads from the RF. 4x(alu+agu) would potentially be a nice IPC boost, without being a total mess to feed/forward like 8 full exec units.

This implementation is surely a 4 Threads design.

4-thread design with both threads getting a 2x2 cluster would imply the threads having separate register files. This does not appear to be the case -- the register file is the same once-duplicated one in BD.

the performance gains to die area ratio used will be even lower.

For scalar resources, that's mostly irrelevant. They are driven much more by latency and the need to keep complexity down (for latency...). The total added die space per module is what? 2mm^2? If either of the CPU makers could add 10mm^2 to a core and make it 25% faster at scalar, they'd jump at the chance. It's just that you generally can't spend transistors there, because signal speed limits you more.

So usually, the 4x(1+1) design would not increase the maximum throughput of instructions executed, just the flexibility of the units, being able to execute 4x ALU or 4x AGU per clock. Then occasionally you get lucky and a unit has the opportunity to complete an inc and a single-op store in the same cycle.



Maybe this diagram is off or my interpretation but it looks like fetch and L1I are separate here.

Where are the L1i tags? Shouldn't they be on the order of 1/8th of the L1i arrays for a 64B line 48-bit VIPT cache? The only structure close enough that is roughly the correct size is the part you've marked ROM. Putting the cache tags in the middle between two halves of one unified cache is a very common design, so looking at that I'd expect the L1i cache to be unified (and a lot bigger than it used to be). Also, being of different color makes sense because tags are combined logic+memory, while ucode rom is typically just memory.
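A back-of-the-envelope check of the tag size estimate, as a Python sketch. The assumptions are mine, not from the die shot: 48-bit physical addresses, 4 KiB pages, a tag that stores every physical address bit above the page offset, and 4 state bits per line (valid, parity and so on, a hypothetical figure).

```python
def tag_overhead(phys_bits=48, page_bits=12, line_bytes=64, state_bits=4):
    """Ratio of tag bits to data bits per cache line (all parameters are assumptions)."""
    tag_bits = phys_bits - page_bits + state_bits  # physical tag plus per-line state
    data_bits = line_bytes * 8                     # 64B line = 512 data bits
    return tag_bits / data_bits


ratio = tag_overhead()
print(f"tag/data bit ratio: {ratio:.3f}")
```

Counting bits alone gives a ratio somewhat below 1/8. Tag arrays also carry comparators and tend to be less dense than the data arrays, so an area ratio on the order of 1/8th looks plausible under these assumptions.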

This does, however, leave ROM unaccounted for, so I'm not quite comfortable with my understanding of the frontend.

All in all, I doubt this is SR. I'm not convinced it's fake, but it would more likely be SR+1.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,420
1,749
136
Where are the L1i tags? Shouldn't they be on the order of 1/8th of the L1i arrays for a 64B line 48-bit VIPT cache? The only structure close enough that is roughly the correct size is the part you've marked ROM. Putting the cache tags in the middle between two halves of one unified cache is a very common design, so looking at that I'd expect the L1i cache to be unified (and a lot bigger than it used to be). Also, being of different color makes sense because tags are combined logic+memory, while ucode rom is typically just memory.

Actually, now that I look at it again, it seems the L1i tags are in the lower left corner of fetch-0 and lower right corner of fetch-1, which would definitely mean that the L1i is separated.
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
I have never said that this is SteamRoller

So you are trolling then? Because we are discussing Steamroller and how it is moving away from CMT as it exists in BD.

Are you going to stop trolling and contribute to the discussion?
 

del42sa

Member
May 28, 2013
65
65
91
I think it's more likely that they are moving back to the K7 idea of having an ALU + AGU per pipe, and generally only being able to use one of them per clock unless you can avoid loads from the RF. 4x(alu+agu) would potentially be a nice IPC boost, without being a total mess to feed/forward like 8 full exec units.

I find this approach better than the separate ALU/AGU in BD. I still wonder why they made them separate. K10 had very strong integer performance, so they could have used the same scheme (3 ALU/3 AGU) in BD with a shared FPU.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
So you are trolling then? Because we are discussing Steamroller and how it is moving away from CMT as it exists in BD.

Are you going to stop trolling and contribute to the discussion?

We don't actually know if this is a die shot of Steamroller or not. The stuff presented at Hot Chips left more of the original CMT decisions in place.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
How about 4 threads per module, with 2 threads per each INT? ie two hyperthreaded INT cores and one really big FPU in each module?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,420
1,749
136
How about 4 threads per module, with 2 threads per each INT? ie two hyperthreaded INT cores and one really big FPU in each module?

As I stated, this would require the doubling of the register file to let each 2x2 cluster be as fast as the one in BD currently. This doesn't seem to have been done -- the RF looks very similar to the old duplicated PRF.

So either they target a *lot* slower clocks, or the RF can only serve ~the same amount of writes and reads as the BD one.
 

SocketF

Senior member
Jun 2, 2006
236
0
71
4-thread design with both threads getting a 2x2 cluster would imply the threads having separate register files. This does not appear to be the case -- the register file is the same once-duplicated one in BD.
Yes, but as you say yourself, BD already has a duplicated register file, i.e. 2 independent sets of registers. A guy at the SA forum quoted a text from AMD that said the register file was only duplicated to shorten the wire length for latency reasons. The latency from the top-left AGU to the bottom ALU would be too long because of the vertical distance.
Thus having 2 AGUs and 2 ALUs in the same plane/line around one set of registers seems OK to me for running 2 threads in one INT cluster, each thread getting 1 register set and 2 AGUs/ALUs. One at the top, one at the bottom.

The only problem I see in the overall design are indeed the missing ROMs. I wonder what happened to them. Maybe there could be a new technique to get rid of them like storing them in the L1I and loading it dynamically from the µCode in the mainboard bios files?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
So you are trolling then? Because we are discussing Steamroller and how it is moving away from CMT as it exists in BD.

Are you going to stop trolling and contribute to the discussion?


We were discussing the die pic in the OP and whether that implementation deviates from the CMT design.

But it seems that even talking technically can make you a troll these days
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,420
1,749
136
Yes, but as you say yourself, BD already has a duplicated register file, i.e. 2 independent sets of registers. A guy at the SA forum quoted a text from AMD that said the register file was only duplicated to shorten the wire length for latency reasons.

He misunderstood what he was reading. The INC units were duplicated for wire latency. RF is not duplicated for wire latency, it's duplicated to lower port count.

The latency from the top-left AGU to the bottom ALU would be too long because of the vertical distance.
Thus having 2 AGUs and 2 ALUs in the same plane/line around one set of registers seems OK to me for running 2 threads in one INT cluster, each thread getting 1 register set and 2 AGUs/ALUs. One at the top, one at the bottom.

There is simply no way, no how AMD will be able to run 8 reads + 4 writes from a single, non-duplicated RF at anything near the clocks they do. Simply not physically possible. The duplication is necessary to make the high count of read accesses possible.

The only problem I see in the overall design are indeed the missing ROMs. I wonder what happened to them. Maybe there could be a new technique to get rid of them like storing them in the L1I and loading it dynamically from the µCode in the mainboard bios files?

Actually, I now think the ROM is probably the purple block. I found the tags, they were on the wrong side of the fetch (/facepalm). The positioning of the ROM makes sense if it's shared between two separate decoders (so it's in the middle to be near both).
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
We were discussing the die pic in the OP and whether that implementation deviates from the CMT design.

But it seems that even talking technically can make you a troll these days

Based on Tuna-Fish's response, if this die is indeed Steamroller, it seems in no way guaranteed that AMD is moving to a 1 module/2core/4 thread design. If anything, it seems the opposite (staying 2 core 2 thread but giving each core more execution resources)
 

inf64

Diamond Member
Mar 11, 2011
3,763
4,221
136
Tuna-Fish, what is your rough estimate of how credible the die shot is? In percent terms.
Thanks for your input.
 

SocketF

Senior member
Jun 2, 2006
236
0
71
He misunderstood what he was reading. The INC units were duplicated for wire latency. RF is not duplicated for wire latency, it's duplicated to lower port count.



There is simply no way, no how AMD will be able to run 8 reads + 4 writes from a single, non-duplicated RF at anything near the clocks they do. Simply not physically possible. The duplication is necessary to make the high count of read accesses possible.
Ah OK, then let's say port count, that sounds credible too.
What would happen if the limit were 4 reads / 2 writes?

Actually, I now think the ROM is probably the purple block. I found the tags, they were on the wrong side of the fetch (/facepalm).
Hmm, where do you now think the tags are? Under the L1I cache? Sorry, I cannot follow.
 

Haserath

Senior member
Sep 12, 2010
793
1
81
LOL, Excavator is still on 28nm but they went ahead and doubled FPU resources?

What power envelope are they targeting? And they'd better be able to charge high prices for these, since the dies will be huge.

Excavator has an ATI high density library. AMD considers it equal to a full node shrink.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,420
1,749
136
Based on Tuna-Fish's response, if this die is indeed Steamroller, it seems in no way guaranteed that AMD is moving to a 1 module/2core/4 thread design. If anything, it seems the opposite (staying 2 core 2 thread but giving each core more execution resources)

I'm not opposed to the idea of this having SMT. Specifically, the retire buffers seem to have grown a lot, and that is something necessary and mostly (but not only) useful for SMT. However, if it does indeed have SMT, it's Intel-style SMT on shared integer resources. I don't have enough information from the picture to call SMT one way or the other.

Tuna-Fish, what is your rough estimate of how credible the die shot is? In percent terms.

Honestly no clue. If it's a fake, it was made by someone with intimate knowledge of CPU design. The changes in general are credible and sensible. Then again, there are a lot of bored people with the required skills and ability to use photoshop who'd like to play armchair general on what companies should do. My take is that it looks real, but this is the age of photoshop, and there's no way to verify it.

Ah OK, then let's say port count, that sounds credible too.
What would happen if the limit were 4 reads / 2 writes?

The RF in BD is 4 reads/4 writes. Duplicating RF sharing the same register set can be used to increase read port count, but not write port count. (Both register files must do all writes to keep contents in sync.)

With a 4/4 register file, you can only issue instructions if enough ports are available. Since normal integer instructions are 2r/1w, that means typically two, removing most gains from duplicating the execution units. You can avoid some register reads through forwarding (Intel's pre-SNB CPUs are really good at this, the RF is significantly underprovisioned), but that means that RF ports turn into resources that you need to track and manage, and IIRC AMD has never done this, other than in the K7 style where the read ports are shared between the AGU and ALU that are part of the same pipe.
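The port arithmetic above can be sketched in a few lines of Python. This is a simplified model with my own assumptions: every instruction is a plain 2-read/1-write integer op, there is no forwarding, and the function names are made up for illustration.

```python
def max_issue(read_ports, write_ports, reads_per_op=2, writes_per_op=1):
    """Instructions per clock the register file ports can sustain."""
    return min(read_ports // reads_per_op, write_ports // writes_per_op)


def duplicated_rf(read_ports, write_ports, copies):
    """Port totals for `copies` register files holding the same register set."""
    # Each copy serves its own readers, so read ports add up.
    # Every write must go to all copies to keep them in sync, so write ports do not.
    return read_ports * copies, write_ports


single = max_issue(4, 4)
dup = max_issue(*duplicated_rf(4, 4, 2))
print(single, dup)
```

In this model a lone 4/4 file is limited by its read ports, while duplicating it lifts the read limit and leaves the write ports as the next ceiling.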

My take is that if they did ship a system with only a 4/4 rf for each integer cluster, IPC would go way down, maybe by 30%. Would they really do that?


Hmm, where do you now think the tags are? Under the L1I cache? Sorry, I cannot follow.

I'm retracting that. I don't know what I was thinking, that's maybe the TLB. In Vesku's annotated picture, directly below h of the "fetch-1" and directly left of n of the "interface".
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Excavator has an ATI high density library. AMD considers it equal to a full node shrink.

Well if AMD says so then it must be true.

They have a stellar track record of promising the moon and delivering it to date, combined with the fact there is zero conflict of interest on behalf of the AMD employees who convinced management to utter such a statement as you have paraphrased in your post.

Yep, I see nothing wrong with taking this on faith and treating it like the gospel truth. Nothing wrong with that at all. :hmm:
 

SocketF

Senior member
Jun 2, 2006
236
0
71
LOL, Excavator is still on 28nm but they went ahead and doubled FPU resources?

What power envelope are they targeting? And they'd better be able to charge high prices for these, since the dies will be huge.
The die is not that much bigger; you can estimate the size from the L2 cache segments that are visible at the right side of the photo.
Also, if you look at the Int-Exec-Units, it seems that the units are just placed there, "overwriting" the space used in the current BD design.
So either it is a fake, or it is AMD's HD-cell-libraries at work.

The RF in BD is 4 reads/4 writes. Duplicating RF sharing the same register set can be used to increase read port count, but not write port count. (Both register files must do all writes to keep contents in sync.)
OK, but if you have 2 threads running, each on one of the 2 RFs, then you don't have to sync the RFs, thus you can double the write ports, can't you?
With a 4/4 register file, you can only issue instructions if enough ports are available. Since normal integer instructions are 2r/1w, that means typically two, removing most gains from duplicating the execution units. You can avoid some register reads through forwarding (Intel's pre-SNB CPUs are really good at this, the RF is significantly underprovisioned), but that means that RF ports turn into resources that you need to track and manage, and IIRC AMD has never done this, other than in the K7 style where the read ports are shared between the AGU and ALU that are part of the same pipe.
There is an AMD patent for that:

Pat. No. 7315935 (Apparatus and method for port arbitration in a register file on the basis of functional unit issue slots)

So with something like that, maybe it would be possible?
 

SlowSpyder

Lifer
Jan 12, 2005
17,305
1,001
126
AM3+ isn't that old, stop exaggerating as if we were talking about AM2.


There have been three major architecture families compatible with AM3+ already (Deneb/Thuban, Zambezi, Vishera). It just feels old and ready for a refresh because all three performed more or less the same. :awe:
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
More like Deneb/Thuban, Zambezi/Vishera. But really Deneb/Thuban is only in the picture for AM3+ because Zambezi was disappointing. Running a 1090T in my AM3+ board. It would be nice if there is a steamroller chip that completely surpasses my 1090T that I can swap in and it would certainly be a welcome gesture on AMD's part to supply such a chip. But there were certainly signs that AM3+ was not guaranteed a long life, being not that different from the later AM3 chipsets.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
AM3+ isn't that old, stop exaggerating as if we were talking about AM2.

AM3+ is a minor revamp of AM3, which was a minor revamp of AM2+, which was a minor revamp of AM2. They need to ditch the legacy of two-part chipsets and massive socket areas- two chip chipsets because they reduce performance, and socket areas because they take up valuable board space.
 

MrDudeMan

Lifer
Jan 15, 2001
15,069
92
91
That die doesn't look real to me in some areas. It's possible the structures were optimized for...something...compared to more traditional structures. I'm not calling fake, but it looks odd to me.

The interface section sticks out to me more than anything else (probably because I'm an I/O designer). I/Os are highly regular structures that share a ton of wires across the lanes of the port(s), which means colinear alignment makes the most sense. The I/Os in this image look like they are grouped in small grids instead of lines, which makes routing congestion much more difficult to overcome.
 

thilanliyan

Lifer
Jun 21, 2005
11,910
2,127
126
There have been three major architecture families compatible with AM3+ already (Deneb/Thuban, Zambezi, Vishera). It just feels old and ready for a refresh because all three performed more or less the same. :awe:

NOOOOO!!!! I hope SR is AM3+ compatible...mostly because I'm too lazy to disassemble my whole system lol
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
That die doesn't look real to me in some areas. It's possible the structures were optimized for...something...compared to more traditional structures. I'm not calling fake, but it looks odd to me.

The interface section sticks out to me more than anything else (probably because I'm an I/O designer). I/Os are highly regular structures that share a ton of wires across the lanes of the port(s), which means colinear alignment makes the most sense. The I/Os in this image look like they are grouped in small grids instead of lines, which makes routing congestion much more difficult to overcome.

Could be the "dense library" automation which AMD has been talking about recently?
 