Future to Bulldozer architecture?


Zstream

Diamond Member
Oct 24, 2005
3,396
277
136
Jaguar is one of the best low powered architectures I've seen in a while. I've had plenty of success with Jaguar (5350/5370). I'd like to see an 8-core, 3 GHz version released to consumers.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,692
136
Any love with the K10 architecture today? An excellent lower-cost alternative to Bulldozer that's not clustered into threads at all.

That niche is being filled by Jaguar-derivatives today. You can also view Zen as a spiritual successor to K8/K10.

I have a couple K10 Athlon X4's and Llano systems running today. They're starting to show their age.

No. The K10 derived Llano was clearly slower than the Piledriver based Trinity.

Mostly due to lower frequency. Stars was a pretty decent design, except they couldn't really go past 3GHz.
 
Reactions: amd6502

waltchan

Senior member
Feb 27, 2015
846
8
81
I have a couple K10 Athlon X4's and Llano systems running today. They're starting to show their age
The K10 suffers from low single-thread performance today if not overclocked, but if you can find a Black Edition one that runs over 4 GHz, then I think it's a better choice than Bulldozer. I successfully found one Phenom II X2 570 that can run up to 4.2 GHz max on stock voltage. With one core disabled, it can go up to 4.4 GHz max. Sweet... The average is 3.9 GHz based on multiple Phenom 570 CPUs I've bought. They're now only $19 each shipped from China. AMD processors maintain their resale value better than Intel, so the 570s won't mysteriously crash down to $5 anytime soon.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Successor:
22FDX Node
1 Excavator-LPH22-Module // Two cores - No XOP/FMA4/CVT16 -- Full GPR/FP Renaming -- 5 GHz peak when no other cores are in peak use. -- In between 15h & 16h overhaul.
1 Catamount-ULP22-CU // Four cores - AVX2/FMA3 -- "Zen-Lite"
2 Vega-LP22 CUs // Equiv to ~256 or ~512 ALUs from 28nm generation.
6-8 MB L3 (Shared through the data fabric and buffers the single 64-bit DDR4(+ECC) 3.6 GHz(3.2 GHz))
Data Fabric includes an upgrade for AMD's HWP(Intel SpeedShift competition) support.

"Octo-core" APU => 2 BD Cores + 4 JG cores + 2 VG cores.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
Successor:
22FDX Node
1 Excavator-LPH22-Module // Two cores - No XOP/FMA4/CVT16 -- Full GPR/FP Renaming -- 5 GHz peak when no other cores are in peak use. -- In between 15h & 16h overhaul.
1 Catamount-ULP22-CU // Four cores - AVX2/FMA3 -- "Zen-Lite"
2 Vega-LP22 CUs // Equiv to ~256 or ~512 ALUs from 28nm generation.
6-8 MB L3 (Shared through the data fabric and buffers the single 64-bit DDR4(+ECC) 3.6 GHz(3.2 GHz))
Data Fabric includes an upgrade for AMD's HWP(Intel SpeedShift competition) support.

"Octo-core" APU => 2 BD Cores + 4 JG cores + 2 VG cores.

Could you please label speculation as such? People often mistake it for actual facts.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Could you please label speculation as such? People often mistake it for actual facts.
The previous post has the Seronx Certificate as "almost factual."

The FDSOI designs will be a prelude to the SSRW FinFET designs. So anything new and improved in this non-disclosed product, which will be taking up the BR/SR Refresh, will also appear in the 7nm LP refreshes of 14nm products.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
No. The K10 derived Llano was clearly slower than the Piledriver based Trinity.
It's lacking SMT, which is why it was discontinued and hasn't been updated with modern instructions. Without SMT (or CMT) it's too hard to get good energy efficiency (at least at 3+ GHz). I can only imagine any non-Zen architecture being continued if development is limited to a shoestring budget.

The cat cores may be the closest thing to the old K8/K10's. But they are way more area efficient and clock much lower than K10's.

I really like Seronx's idea or prediction of a fused cat-excavator cluster. Excavator for decent sparse thread and puma for the high area and energy efficiency and number of threads. The OS can easily taskset lower priority threads to lower clocking cores.
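For what that pinning could look like on Linux today, here is a minimal sketch. It assumes the slower cores can be told apart by their cpufreq maximum in sysfs, which is a guess about how such a mixed APU would present itself; the helper names (`split_by_max_freq`, `read_max_freqs`, `pin_background`) are made up for illustration.

```python
import os

def split_by_max_freq(max_khz):
    """Given {cpu_id: max frequency in kHz}, return (fast_cpus, slow_cpus)."""
    top = max(max_khz.values())
    fast = {cpu for cpu, khz in max_khz.items() if khz == top}
    slow = set(max_khz) - fast
    return fast, slow

def read_max_freqs():
    """Read per-CPU cpufreq maximums from Linux sysfs; {} if unavailable."""
    freqs = {}
    for cpu in range(os.cpu_count() or 1):
        path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/cpuinfo_max_freq"
        try:
            with open(path) as fh:
                freqs[cpu] = int(fh.read())
        except (OSError, ValueError):
            pass  # no cpufreq driver, or CPU offline
    return freqs

def pin_background(pid=0):
    """Restrict a process (0 = the caller) to the slower cores, if any exist."""
    freqs = read_max_freqs()
    if not freqs:
        return set()
    _, slow = split_by_max_freq(freqs)
    # Assumption: hypothetical mixed-core part; on uniform CPUs slow is empty.
    if slow and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, slow)
    return slow
```

From a shell, `taskset -c` with the slow core IDs does the same thing for an existing command.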

An excavator pair plus puma quad (2+4) budget APU for lower power profiles would be a cheap test project of what can be done on FDSOI and body voltage biasing for both the CPU side and GPU side.

On 28nm, ballpark 9 mm² per XV core (w/o L2) and 9 mm² for the XV module's shared L2. This would be ~27 mm² for the core part of the Stoney module. A Puma quad would be under 4 x 3.1 mm² ≈ 13 mm², and these cores would be ultra efficient on the upgraded 22nm FDSOI, especially under 1.5 GHz. It would be worth adding a little area and new instructions to the Puma cores so they don't lack instructions that Excavator has. Also, a shared L2 cache may add ~10 mm². This may grow the area, but it still would be not much more than 50 mm² for a total of 6 cores. (On 22nm this would shrink down to under 40? Wild guess: a 120 mm² APU if you add 40 for uncore and ~45 for GPU.) Derivative products could include a 2+2 APU and 2+0 and 0+4 APUs.
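The area budget can be tallied as a quick sanity check. This sketch only adds up the ballpark figures from this post; none of them are measured die areas, and the 22nm numbers are the same wild guesses as in the text.

```python
# Ballpark 28nm figures from the post, in mm^2 (estimates, not measurements).
XV_CORE = 9.0      # one Excavator core without L2
XV_L2 = 9.0        # the Excavator module's shared L2
PUMA_CORE = 3.1    # one Puma core
SHARED_L2 = 10.0   # guessed extra shared L2 for the cluster

xv_module = 2 * XV_CORE + XV_L2                  # two cores plus module L2
puma_quad = 4 * PUMA_CORE
cpu_28nm = xv_module + puma_quad + SHARED_L2     # all six cores on 28nm

# 22nm guesses from the post: CPU part "under 40", plus uncore and GPU.
CPU_22NM = 40.0
UNCORE = 40.0
GPU = 45.0
apu_22nm = CPU_22NM + UNCORE + GPU

print(f"XV module: {xv_module:.1f} mm^2, Puma quad: {puma_quad:.1f} mm^2")
print(f"6-core CPU part on 28nm: {cpu_28nm:.1f} mm^2")
print(f"Whole APU on 22nm (upper guess): {apu_22nm:.1f} mm^2")
```

Taking 40 as the upper bound for the shrunk CPU part puts the total slightly above the ~120 quoted, which fits the "under 40" hedge.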

If such a test is successful follow on projects could be a similar Zen-puma cluster, or mobile 14nm FDSOI super low power GPUs.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Jaguar is one of the best low powered architectures I've seen in a while. I've had plenty of success with Jaguar (5350/5370).
If they updated instructions to Excavator's set and also added some crude multithreading, then Puma would not just hands down beat Excavator in efficiency (at ~1.5ghz), but (using 22nm FDSOI) also Zen.

There are some very simplistic non-SMT multithreading methods; eg. see Blocked Multithread, pg 2 http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture09-multithreading.pdf
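To make the idea concrete, here is a toy model of blocked (coarse-grained) multithreading: one core runs a thread until it hits a long stall, then switches to another ready thread. The burst lengths and the one-cycle switch penalty are invented numbers, and `run_blocked` is a hypothetical helper, not anything taken from the lecture notes.

```python
from collections import deque

def run_blocked(threads, switch_penalty=1):
    """Simulate one core doing blocked multithreading.

    threads: list of threads, each a list of (compute_cycles, stall_cycles)
    bursts. Returns (total_cycles, busy_cycles).
    """
    pending = [deque(bursts) for bursts in threads]
    ready_at = [0] * len(threads)   # cycle at which each thread's stall ends
    now = busy = 0
    current = None
    while any(pending):
        ready = [i for i, q in enumerate(pending) if q and ready_at[i] <= now]
        if not ready:
            # Every remaining thread is stalled: the core idles until one wakes.
            now = min(ready_at[i] for i, q in enumerate(pending) if q)
            continue
        # Stay on the current thread if it can run, otherwise switch.
        nxt = current if current in ready else ready[0]
        if current is not None and nxt != current:
            now += switch_penalty
        current = nxt
        compute, stall = pending[current].popleft()
        now += compute
        busy += compute
        ready_at[current] = now + stall
    return max([now] + ready_at), busy

# Made-up workload: 10 cycles of work, then a 20-cycle miss, three times.
work = [(10, 20)] * 3
alone, _ = run_blocked([work])
shared, busy = run_blocked([work, work])
print(f"one thread: {alone} cycles; two threads interleaved: {shared} cycles")
```

With stalls this long, the second thread fits almost entirely into the first thread's idle time, which is the whole appeal for thread-heavy server loads.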

So on a modern process Puma could have a place in servers.

A low-cost heterogeneous 2+4 core APU like the one above would do great for a variety of devices, from 2-in-1s to low-end laptops and all-in-1s. Add a 10-thread blocked multithread version and you also cover the range from home servers to low-power, low-end servers.
 

dark zero

Platinum Member
Jun 2, 2015
2,655
138
106
The previous post has the Seronx Certificate as "almost factual."

The FDSOI designs will be a prelude to the SSRW FinFET designs. So anything new and improved in this non-disclosed product, which will be taking up the BR/SR Refresh, will also appear in the 7nm LP refreshes of 14nm products.
So... BD will still be alive, but only in lower powered parts? Sounds weird... And the BD design... could fit better on something like ARM.
 

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,168
136
The K10 suffers from low single-thread performance today if not overclocked, but if you can find a Black Edition one that runs over 4 GHz, then I think it's a better choice than Bulldozer.

Meh. I've had K10.5 chips before in the 4 GHz range and my 4.7 GHz Steamroller was generally preferable. K10/10.5 suffers from poor SIMD support, among other things.

Thuban was cool for awhile, but not anymore. It isn't 2010.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,692
136
The K10 suffers from low single-thread performance today if not overclocked, but if you can find a Black Edition one that runs over 4 GHz, then I think it's a better choice than Bulldozer. I successfully found one Phenom II X2 570 that can run up to 4.2 GHz max on stock voltage. With one core disabled, it can go up to 4.4 GHz max. Sweet... The average is 3.9 GHz based on multiple Phenom 570 CPUs I've bought. They're now only $19 each shipped from China. AMD processors maintain their resale value better than Intel, so the 570s won't mysteriously crash down to $5 anytime soon.

Since these systems are running in the F&F segment, and are 6-8 years old, I'm not inclined to spend anything on them, nor OC since they have weak PSUs (wouldn't do it anyway, stability is paramount there). I give them a year or two, then they'll be wholesale replaced.

Only thing I might be inclined to do is a 95W Phenom X6 if I could get one for $5. But that is not happening...

Meh. I've had K10.5 chips before in the 4 GHz range and my 4.7 GHz Steamroller was generally preferable. K10/10.5 suffers from poor SIMD support, among other things.

Agreed.

Thuban was cool for awhile, but not anymore. It isn't 2010.

Thuban had a good run, but yes. With Ryzen's launch you can get a fairly inexpensive 4C/8T CPU (potentially with a whopping 16MB L3 too... ) that'll blow K10(.5)/BD/PD/SR/EX out of the water, with the added benefit of a fully modern platform. I see no reason to be nostalgic...
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
It's lacking SMT, which is why it was discontinued and hasn't been updated with modern instructions. Without SMT (or CMT) it's too hard to get good energy efficiency (at least at 3+ GHz). I can only imagine any non-Zen architecture being continued if development is limited to a shoestring budget.

The cat cores may be the closest thing to the old K8/K10's. But they are way more area efficient and clock much lower than K10's.

I really like Seronx's idea or prediction of a fused cat-excavator cluster. Excavator for decent sparse thread and puma for the high area and energy efficiency and number of threads. The OS can easily taskset lower priority threads to lower clocking cores.

An excavator pair plus puma quad (2+4) budget APU for lower power profiles would be a cheap test project of what can be done on FDSOI and body voltage biasing for both the CPU side and GPU side.

On 28nm, ballpark 9 mm² per XV core (w/o L2) and 9 mm² for the XV module's shared L2. This would be ~27 mm² for the core part of the Stoney module. A Puma quad would be under 4 x 3.1 mm² ≈ 13 mm², and these cores would be ultra efficient on the upgraded 22nm FDSOI, especially under 1.5 GHz. It would be worth adding a little area and new instructions to the Puma cores so they don't lack instructions that Excavator has. Also, a shared L2 cache may add ~10 mm². This may grow the area, but it still would be not much more than 50 mm² for a total of 6 cores. (On 22nm this would shrink down to under 40? Wild guess: a 120 mm² APU if you add 40 for uncore and ~45 for GPU.) Derivative products could include a 2+2 APU and 2+0 and 0+4 APUs.

If such a test is successful follow on projects could be a similar Zen-puma cluster, or mobile 14nm FDSOI super low power GPUs.

What benefit would this theoretical device give AMD over their current planned lineup? Right now they already have Zen, Polaris and Vega IP on 14nm FinFET, and could easily use these to offer a 2 core APU to address those same market niches. Whereas your proposal requires a hell of a lot of CPU design work to integrate XV and Puma into a single cluster in an efficient fashion, along with porting AMD's entire IP stack to an entirely new process with more expensive wafers.
 
Reactions: scannall
May 11, 2008
20,068
1,293
126
Initially, I thought that Bulldozer would be viable as a Jaguar replacement.

But I think what will happen is an updated Jaguar architecture (with knowledge learned from Piledriver onward), without SMT but with better SIMD support (wider paths and execution units, and less use of microcoded instructions), alongside the Zen architecture.
Two architectures: Jaguar and Zen.

No SMT means there are times when execution stalls, and that is good for power dissipation: no execution means lower consumption. Less dark silicon is required. I think that Jaguar is small and cheap enough to just add more cores, with no need to redesign it with SMT; also, the fabric to connect all the 4-core modules already exists. But I do wonder how much effort it takes to move the existing Jaguar to a smaller process and how much clock speed it would gain. Jaguar is there when it needs to be as inexpensive as possible. When performance is required, Zen is there.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Those are some really good points.


What benefit would this theoretical device give AMD over their current planned lineup? Right now they already have Zen, Polaris and Vega IP on 14nm FinFET, and could easily use these to offer a 2 core APU to address those same market niches. Whereas your proposal requires a hell of a lot of CPU design work to integrate XV and Puma into a single cluster in an efficient fashion, along with porting AMD's entire IP stack to an entirely new process with more expensive wafers.

Yes, if it's not a cheap budget then it wouldn't make much sense. It depends much on whether Puma can have a significant efficiency advantage over Zen at the low power frequencies, whether it's worth it for AMD to pursue the market that Intel's Atoms cover, and the prospects of and the number of products they could spin off such a project.

This is assuming the Atom line continues. Does an atom quad beat a 2c/4t i3 in the low power 5w-7.5w power range? Zen vs puma would be in the same situation if 22FDX is as efficient as 14LPP.

Also a side benefit would be just research purposes, an FDX test run. They might test out this process in cheaper and maybe more sensible ways, too. Maybe the Stoney Refresh in the 2016-2018 roadmap is an FDX port? Such a port would tell you how the process performs for the CPU and iGPU.

Initially, I thought that Bulldozer would be viable as a Jaguar replacement.

But I think what will happen is an updated Jaguar architecture (with knowledge learned from Piledriver onward), without SMT but with better SIMD support (wider paths and execution units, and less use of microcoded instructions), alongside the Zen architecture.
Two architectures: Jaguar and Zen.

No SMT means there are times when execution stalls, and that is good for power dissipation: no execution means lower consumption. Less dark silicon is required. I think that Jaguar is small and cheap enough to just add more cores, with no need to redesign it with SMT; also, the fabric to connect all the 4-core modules already exists. But I do wonder how much effort it takes to move the existing Jaguar to a smaller process and how much clock speed it would gain. Jaguar is there when it needs to be as inexpensive as possible. When performance is required, Zen is there.

I think they should keep the microcoded instructions to conserve area because the small area of the cat cores would be the main reason that the cat cores may not go extinct yet and outlive dozers.

A very simple method of multithreading (eg Blocked Multithreading) would also leave some idle execution units and lower average power. It'd mainly be useful for some niche servers (it improves efficiency and performance when there are more threads than cores). Probably still not worth it; the theme for cat cores, seems to be small, simple, cheap, and efficiency through lower area and clock speeds.

Jaguar did get ported to 16nm for Xbox, and they achieved significant power savings and raised the clocks a little bit. I assume for Scorpio the clocks got raised further (no idea how high, I'm guessing a bit under 3 GHz).
 

waltchan

Senior member
Feb 27, 2015
846
8
81
Power consumption is another big concern with K10. To get a quad-core stable at 4.5 GHz, it requires 1.7 V. More cores also pull the achievable frequency down fast. I need to drop 300 MHz going from single to dual-core on stock voltage, and another 500 MHz going from dual-core to quad-core, so I'm 900 MHz slower on quad-core than single-core.
 
Reactions: amd6502

amd6502

Senior member
Apr 21, 2017
971
360
136
Power consumption is another big concern with K10. To get a quad-core stable at 4.5 GHz, it requires 1.7 V.

I'd be interested in seeing wattage vs frequency plotted over 4ghz. ~160W?? or almost as much as FX piledriver octacores consume at 4.5ghz full load?

Low frequency low power questions of efficiency in dozer vs puma vs zen are even more interesting.
 

waltchan

Senior member
Feb 27, 2015
846
8
81
I'd be interested in seeing wattage vs frequency plotted over 4ghz. ~160W?? or almost as much as FX piledriver octacores consume at 4.5ghz full load?
It's between 1.6 V and 1.7 V on average with a Phenom X4 at 4.5 GHz, but the K10s really do require more power to match Bulldozer's equivalent single-thread performance score. My Phenom X2 (2 cores) runs at 1.4 V at only 3.9 GHz max.
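To turn those voltages into a rough power comparison, the usual first-order rule is that dynamic power scales with C x V² x f. The sketch below applies that rule to the two operating points mentioned here. It assumes the same switched capacitance per core and ignores leakage (which also rises with voltage), so treat the result as a lower bound on the ratio, not a wattage; `relative_dynamic_power` is just an illustrative helper.

```python
def relative_dynamic_power(v_new, f_new, v_ref, f_ref):
    """Per-core dynamic power ratio under P ~ C * V^2 * f (same C assumed)."""
    return (v_new / v_ref) ** 2 * (f_new / f_ref)

# 1.7 V @ 4.5 GHz compared with 1.4 V @ 3.9 GHz, both figures from this thread.
ratio = relative_dynamic_power(1.7, 4.5, 1.4, 3.9)
print(f"per-core dynamic power vs. 1.4 V @ 3.9 GHz: {ratio:.2f}x")
```

Going from 3.9 to 4.5 GHz is about 15% more clock, but under this rule it costs roughly 1.7x the dynamic power per core, before counting the extra cores.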
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Two threads in 2018 seems low.

How much typical multithreaded performance could an Excavator module gain if it could accommodate a low-IPC background thread (a third thread) that was only allowed to execute on an Int (or float) core that had stalled and would otherwise be idle? The third thread would run simply and have low IPC, without speculative execution and with very limited out-of-order execution (limited mostly to FPU instructions); other than a tiny bit of latency in resuming a stalled normal thread, it wouldn't have much negative impact on the performance of the main two cores.
 
May 11, 2008
20,068
1,293
126
Those are some really good points.



I think they should keep the microcoded instructions to conserve area because the small area of the cat cores would be the main reason that the cat cores may not go extinct yet and outlive dozers.

Well, AVX might come in handy, and wider paths and instructions that need fewer cycles to complete would give Jaguar an enormous boost. AVX seems to be used more and more. A low cycle count for completion might help a lot. I do not see much benefit in adding SMT.
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,692
136
Well, AVX might come in handy, and wider paths and instructions that need fewer cycles to complete would give Jaguar an enormous boost. AVX seems to be used more and more. A low cycle count for completion might help a lot. I do not see much benefit in adding SMT.

IMHO at the point of adding AVX2 to Jaguar, you may as well use the Zen core, perhaps minus the L3 cache. Zen is already very efficient; it should be more than able to scale down to 25W for 4C/8T with lower clocks.
 
Reactions: coffeemonster

coffeemonster

Senior member
Apr 18, 2015
241
86
101
IMHO at the point of adding AVX2 to Jaguar, you may as well use the Zen core, perhaps minus the L3 cache. Zen is already very efficient; it should be more than able to scale down to 25W for 4C/8T with lower clocks.
Agreed, and Zen's boost frequencies would have a much greater range than Jaguar/Puma even at that TDP.
 