Didn't AMD already have a very strong FPU compared to Core2 and newer Intel?
Compared to the Core2 and newer, the PhII's biggest weakness is integer performance. Have they mentioned anything about the int changes? :\
Phenom II did score comparatively better in floating-point workloads than in other workloads, for example Cinebench and POV-Ray. Sandy Bridge muddles the results because it can use AVX (I don't know if these programs do, however).
Yes, the optimization guide (section 2.1) gives some general characteristics:
• Four-way AMD64 instruction decoding (This is a theoretical limit.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
Additionally, in section 2.10, AMD mentions:
2.10.1 Integer Scheduler
The scheduler can receive and schedule up to four micro-ops (μops) in a dispatch group per cycle.
The scheduler tracks operand availability and dependency information as part of its task of issuing
μops to be executed. It also assures that older μops which have been waiting for operands are
executed in a timely manner. The scheduler also manages register mapping and renaming.
2.10.2 Integer Execution Unit
There are four integer execution units per core. Two units which handle all arithmetic, logical and
shift operations (EX). And two which handle address generation and simple ALU operations
(AGLU). Figure 2 shows a block diagram for one integer cluster. There are two such integer clusters
per compute unit
From this we can generally grok that integer performance has been "redesigned". Each core now has 2 AGLUs and 2 ALUs (the EX units) rather than 3 ALUs and 2 AGUs (well, 3, but the third AGU is not used IIRC). That makes 4 integer pipelines per core, although the guide clearly lists four-way decoding as a theoretical limit. The AGLU sounds particularly interesting: the guide only says it handles address generation and simple ALU operations, and I cannot find exactly which operations it can or does perform.
Additionally, regarding FPU performance, the guide mentions (section 2.7):
The AMD Family 15h processor floating point unit (FPU) was designed to provide four times the raw
FADD and FMUL bandwidth as the original AMD Opteron
Given that it is designed for 4 times the raw FADD and FMUL bandwidth but shared* between two cores, it sounds like each core gets about twice the bandwidth of the original Opteron when both cores are using the FPU at once. To me, Bulldozer sounds like a float-crunching beast.
* I'm not sure "shared" is the right word, as one core can use the entire 256-bit pipeline or both cores can each use half the FPU. So it doesn't have to be shared (in either case, really).
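To make the back-of-the-envelope math explicit, here is a rough sketch in Python. The 4x figure and the two-cores-per-FPU layout come from the guide quotes above; the even split between busy cores and the helper name `per_core_fpu_factor` are my own assumptions for illustration. It ignores clock speed, scheduling contention and memory bandwidth, so treat it as arithmetic, not a performance prediction.

```python
def per_core_fpu_factor(module_factor: float, cores_using_fpu: int) -> float:
    """Split a shared FPU's bandwidth factor evenly across the cores using it.

    Hypothetical helper: the even split is an assumption, not something
    the optimization guide states.
    """
    if cores_using_fpu < 1:
        raise ValueError("at least one core must be using the FPU")
    return module_factor / cores_using_fpu


# From the guide: four times the raw FADD/FMUL bandwidth of the original Opteron.
MODULE_FACTOR = 4.0

# Both cores in the compute unit issuing FP work at the same time.
both_busy = per_core_fpu_factor(MODULE_FACTOR, 2)

# Only one core issuing FP work, so it has the whole FPU to itself.
one_busy = per_core_fpu_factor(MODULE_FACTOR, 1)

print(f"both cores busy: {both_busy:.1f}x per core")
print(f"one core busy:   {one_busy:.1f}x per core")
```

So under these assumptions the "twice per core" figure is the floor when both cores are hammering the FPU, and a single core running alone could in principle see the full 4x.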