Not so much the future of 15h/Bulldozer. More the future of Bulldozer as the module.
- Take Excavator module.
- Take Jaguar cluster.
- Remove L2 unit from Jaguar cluster. Place power controller of Jaguar cores and L1<->L2 interface into a new modularized unit.
- Remove cores, LD/ST, and FPU from XV module. Keep and modify XV front-end and cache unit/L2. In particular: split the 2x4-wide decode into 4x2-wide decode, plus I$/BPU/L2_CU considerations, etc.
- Place full Jaguar cores into the void space. Call the new interface unit the mid-end of the module; L1<->L2 becomes L0<->L1. Make a new 256KB L1d cache shared between all Jag cores. Existing L1i/L1d to be shrunk from 32KB to 16KB and be called L0i/L0d. Rework fetch/decode/rename to better optimize for macro-ops->micro-ops. Rework LD/ST to interface with the mid-end better.
- Optimize for clock speed (RVT/LVT stuff & FO4), since that is like what, <23 stages?
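To keep the numbers in the steps above straight, here is a rough sketch that just tallies them. The class and field names are made up for illustration, and the core count of four is an assumption inferred from the 4x2-wide decode split; nothing here describes real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleSketch:
    """Hypothetical tally of the proposed module; all names are illustrative."""
    cores: int                  # assumed 4, inferred from the 4x2-wide decode split
    decode_width_per_core: int  # 2-wide per core after the split
    l0i_kb: int                 # former L1i, shrunk from 32KB to 16KB
    l0d_kb: int                 # former L1d, shrunk from 32KB to 16KB
    shared_l1d_kb: int          # new L1d shared between all cores

    def total_decode_width(self) -> int:
        # Aggregate decode width across the whole module.
        return self.cores * self.decode_width_per_core

    def total_l0_kb(self) -> int:
        # Combined L0i + L0d capacity summed over all cores.
        return self.cores * (self.l0i_kb + self.l0d_kb)


proposed = ModuleSketch(cores=4, decode_width_per_core=2,
                        l0i_kb=16, l0d_kb=16, shared_l1d_kb=256)

# The original 2x4-wide arrangement, for comparison.
xv_decode_width = 2 * 4

print(proposed.total_decode_width(), xv_decode_width)  # aggregate width is unchanged
print(proposed.total_l0_kb(), proposed.shared_l1d_kb)
```

So the split keeps the aggregate decode width the same and only changes how it is divided between cores.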
* To enhance even further, push L0i into a front-mid-end with a global renamer and boom: reverse multithreading.
* While it would lose 15h execution compatibility, it would not lose 16h execution compatibility.
* Give the mid-end a bypass between cores, so a core can directly write to another core via ring or crossbar. Mid-end should also have instruction & data coherency tables (look-ups/RAM/etc). Something like that should allow for a flat space across cores. It also saves the energy of writing to the L1 and then to another L0.
* Power controller in the mid-end can reduce power controller complexity from module level to the core level. (2-level power controller)
* The global renamer above can house its own FP decode/rename component. That allows the FPUs to be FlexFPU-like while still being housed in independent cores. With the mid-end PMU, a core would only need to have its FPU/LSU components active to be controlled.
* Ideally, it would be best to arrange Pipe 0 and Pipe 1 of the FPUs as: P1 = 128-bit floating-point FMAC and P0 = 128-bit integer FMAC.
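The mid-end bypass and coherency table idea can be sketched as a toy model. Everything here (the `MidEnd` class, its methods, the dict-based "caches") is hypothetical and only shows the bookkeeping: a direct core-to-core write skips the L1 write, and an ownership table lets any core find the data, which is the "flat space" part. It says nothing about real energy or timing.

```python
class MidEnd:
    """Toy mid-end: per-core L0 stores, a shared L1, and an ownership table."""

    def __init__(self, n_cores):
        self.l0 = [dict() for _ in range(n_cores)]  # one private L0 per core
        self.l1 = {}                                # shared L1
        self.owner = {}      # coherency table: address -> core with the fresh copy
        self.l1_writes = 0   # counts trips through the shared L1

    def write_via_l1(self, src, dst, addr, value):
        # Path without a bypass: src L0 -> shared L1 -> dst L0.
        self.l0[src][addr] = value
        self.l1[addr] = value
        self.l1_writes += 1
        self.l0[dst][addr] = value
        self.owner[addr] = dst

    def write_bypass(self, src, dst, addr, value):
        # Path with the bypass: src writes straight into dst's L0.
        self.l0[src].pop(addr, None)  # drop any stale copy held by src
        self.l0[dst][addr] = value
        self.owner[addr] = dst

    def read(self, core, addr):
        # Own L0 first, then the owning core's L0 via the table, then L1.
        if addr in self.l0[core]:
            return self.l0[core][addr]
        holder = self.owner.get(addr)
        if holder is not None and addr in self.l0[holder]:
            return self.l0[holder][addr]
        return self.l1.get(addr)


mid = MidEnd(n_cores=4)
mid.write_bypass(src=0, dst=2, addr=0x40, value=7)
print(mid.read(1, 0x40), mid.l1_writes)  # core 1 sees core 2's data, no L1 write
```

A ring versus a crossbar would only change how `write_bypass` routes the data; the table lookup stays the same.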
Call it Caracal or something. Also, AMD, put it on 22FDX. I want those sub-10-picosecond RO/FO3/FO4 delays asap!! (Not even FinFETs or nanowires can do those!)
For those who noticed that the module was and is an exoskeleton, raise your virtual hand.