SickBeast, we're waaaay off the original topic. It might be worth starting a follow-up thread so people who don't want to read wild business/financial speculation by tech geeks can see the actual tech stuff.
Originally posted by: SickBeast
Originally posted by: VirtualLarry
Originally posted by: SickBeast
They have Fusion and maybe 45nm to look forward to, but Intel will surely have an answer to both. Not only that, but I would rather have a 'C2D' derivative in my future CPU instead of a 'Phenom' derivative.
Why would you say that? There's nothing inherently inferior, that I can see, about the Phenom design. The primary problem is the yields and the GHz that they can clock to. If they were competitive in that department, their design is definitely also competitive.
Well, from what I understand, the C2D can execute 4 instructions per clock compared to 3 on the Phenom.
It's not as simple as that. The K7-like architecture of Phenom is actually quite different from the PPro-like architecture of Core2. I had a drawn out debate with dmens about issue widths - read this post... if you don't want to read the whole thing, at least read the paragraph starting with "Older Intel CPUs" and the one after it. The mix of execution units is also very different - for a particularly striking example of the potential for highly-optimized code that takes full advantage of K8's integer execution units, see this thread.
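To make the issue-width point concrete, here's a toy C sketch (mine, not from either linked thread): extra issue width only pays off when the code actually has independent operations to run side by side.

```c
#include <stdint.h>

/* Four independent accumulators: a 4-wide core can, in principle,
 * issue all four adds in one cycle, while a 3-wide core needs a
 * second cycle for the fourth. */
uint64_t sum_independent(const uint64_t *a, int n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

/* One long dependency chain: each add needs the previous result, so
 * it runs at ~1 add per cycle on a 3-wide *and* a 4-wide machine. */
uint64_t sum_dependent(const uint64_t *a, int n) {
    uint64_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

In other words, "4 vs 3 per clock" is a peak figure; whether either machine sustains it depends on the dependency structure of the code, plus all the decode and execution-unit differences above.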
Originally posted by: SickBeast
Originally posted by: CTho9305
Unfortunately I know almost nothing about the ATI GPU architecture (nvidia has some really good presentations on their GPUs).
Well...look at it this way:
The AMD 2900XT has 320 'stream processors', compared to the 8800GTS, which has only 112 (the original had 96). The 2900XT runs at a clock speed near 700 MHz, whereas the GTS runs below 600 MHz. The XT has a full 512-bit memory bus, compared to only a 320-bit bus on the GTS. The XT also outputs considerably more heat and uses significantly more power than the GTS (especially at idle).
The GTS *generally* outperforms the XT. From what I understand, the XT uses a weaker, less efficient type of 'stream processor', which makes it suited only for a relatively specific type of workload. Really, when you think about it, the P4 was like this--it was great at video encoding and not much else.
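A rough back-of-the-envelope C model of why the raw 'stream processor' count misleads. It assumes the XT's 320 units are really 64 five-wide VLIW groups (that's how R600 is usually described), that the GTS's 112 scalar units run at a ~1.2 GHz shader clock (the 2x internal clock mentioned below), and a made-up 3-of-5 slot utilization figure, purely for illustration.

```c
#include <stdio.h>

int main(void) {
    /* R600-style XT: 64 groups x 5 VLIW slots, ~0.7 GHz core clock. */
    double xt_groups = 64, xt_width = 5, xt_clk = 0.7;   /* GHz */
    /* G80-style GTS: 112 scalar ALUs at ~1.2 GHz shader clock
     * (assumed figure). */
    double gts_alus = 112, gts_clk = 1.2;                /* GHz */

    /* Invented utilization: shader code only keeps 3 of the 5 VLIW
     * slots busy on average, so the XT runs well below its peak. */
    double slot_util = 3.0 / 5.0;

    double xt_peak  = xt_groups * xt_width * xt_clk;     /* ops/ns */
    double xt_real  = xt_peak * slot_util;
    double gts_real = gts_alus * gts_clk;  /* scalar: ~full utilization */

    printf("2900XT peak: %.0f ops/ns, at 3/5 slot use: %.0f ops/ns\n",
           xt_peak, xt_real);
    printf("8800GTS scalar: %.0f ops/ns\n", gts_real);
    return 0;
}
```

With those (invented) numbers the two cards land in the same ballpark despite the 320-vs-112 spec-sheet gap, which is roughly the story the benchmarks tell.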
I don't know enough about GPUs to really comment.
These presentations are excellent. Lecture 8 on the hardware has a good set of slides. It says that the nvidia GPU internally operates at 2x frequency (1.35 GHz). If there's an equivalent ATI presentation/lecture someone knows about, I'd love to read it.
Your comments re: the L1 cache and such on the P4 are interesting. Didn't the P4 have much more L2 cache compared to the X64 though? Wasn't it also manufactured on a more advanced process?
L2 is great for marketing, like RAM on GPUs has historically been ("WOW! This GeForce 4MX has 512MB RAM - it must demolish that 256MB GeForce 6600 and even costs less!" or something). In the real world, it tends not to make as much difference as people might expect - every doubling of the cache size buys you only a few percent (mid single digits) of performance. I didn't search long, but take a look at this review and compare the highlighted chip to the P4 550 - if you subtract 2% from the performance difference to account for the clock speed change, you can see that the extra cache doesn't generally make a huge difference (especially when it costs as much extra die area as another core or two).

The other thing to remember is that this review was on P4, which likely had an abysmal L1 hit rate; a more normal architecture with a 32KB or 64KB L1 cache is going to go to the L2 less often. You also have to be careful with benchmarks to make sure they haven't been optimized for a particular L2 size (a crafty evil person might set up a scene's geometry to not fit in a 2MB cache, but to fit in a 4MB cache). I guess in the real world, L1 hit rates are decent, and programs that don't fit in X MB generally don't fit in 2X MB either.
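For anyone who wants to see the working-set cliff themselves, here's a quick-and-dirty C microbenchmark sketch (my own, nothing from the review): it pointer-chases arrays of growing size and prints time per access; the jumps line up with the cache sizes on whatever machine you run it on.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (size_t kb = 64; kb <= 16 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *a = malloc(n * sizeof(size_t));
        if (!a) return 1;
        /* Build a strided cycle through the array so each access
         * depends on the previous one (a pointer chase) and isn't
         * trivially streamed by the prefetcher. 4099 is odd and n is
         * a power of 2, so the walk visits every element. */
        for (size_t i = 0; i < n; i++)
            a[i] = (i + 4099) % n;

        const long iters = 10 * 1000 * 1000;
        size_t idx = 0;
        clock_t t0 = clock();
        for (long r = 0; r < iters; r++)
            idx = a[idx];            /* each load depends on the last */
        clock_t t1 = clock();

        printf("%6zu KB: %.2f ns/access (idx=%zu)\n", kb,
               (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / iters,
               idx /* printed so the loop isn't optimized away */);
        free(a);
    }
    return 0;
}
```

Once the array outgrows a cache level the time per access jumps and then stays roughly flat, which is the "doubling the cache only helps if the working set lands in between" effect in miniature.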
I'm sure someone could put together a timeline of CPUs and manufacturing processes. I don't know off the top of my head which processes were used when on which chips at what time. Intel generally moved to new processes sooner though.
Perhaps the 'narrower pipeline' is the critical element. Doesn't the Phenom have a 'narrower pipeline' than the C2D?
See above (the first part of my reply). More thoughts: it's probably harder to sustain peak IPC on C2 because the Intel architecture has decode slot restrictions - a slightly-fancy instruction has to be handled as the first instruction decoded during a cycle, whereas the AMD architectures can generally handle slightly-fancy instructions in any decode slot (both architectures drop to decoding a single instruction at a time if it's super-fancy, i.e. "microcoded").

AMD also provides a larger set of execution units - including 3 integer ALUs and 3 AGEN units; IIRC, Intel's architectures provide fewer of both (unfortunately I can't find a block diagram right now and don't want to go hunting through papers for one). That probably helps Intel reduce power/area/complexity. Routines that are highly optimized for either architecture are likely to perform relatively poorly on the other (again, see above).

It would be interesting to see whether the performance changes much for real-world code when compiled with the PGI compiler, and whether 64-bit software behaves differently than 32-bit. When I think about it, things like C2's "stack engine" probably help it out more in 32-bit code than in 64-bit code (due to fewer register spills in 64-bit mode)...
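On the spill point, a contrived C example (mine): keep ten values live across a loop. Built as 32-bit x86 there aren't enough GPRs (8, minus stack/frame pointers), so several values get spilled to the stack every iteration; built as 64-bit (16 GPRs) they can usually all stay in registers, leaving much less stack traffic for a stack engine to accelerate.

```c
/* Ten values live at once across the loop body. On x86-32 this forces
 * stack spills; on x86-64 the compiler can keep everything in
 * registers. Compare -m32 vs -m64 assembly output to see it. */
unsigned mix(const unsigned *p) {
    unsigned a = p[0], b = p[1], c = p[2], d = p[3], e = p[4];
    unsigned f = p[5], g = p[6], h = p[7], i = p[8], j = p[9];
    for (int k = 0; k < 100; k++) {
        a += b ^ c; b += c ^ d; c += d ^ e; d += e ^ f; e += f ^ g;
        f += g ^ h; g += h ^ i; h += i ^ j; i += j ^ a; j += a ^ b;
    }
    return a + b + c + d + e + f + g + h + i + j;
}
```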
From here:

"The average performance improvement we have seen from Athlon 64 FX-62 equaled 16%, while Core 2 Extreme X6800 demonstrated only 10% average performance boost."
While that's not enough to have K8 beat C2, it takes a pretty big chunk out of a 15-20% lead.
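Rough math on those numbers (mine, reading them as the 32-bit to 64-bit gains): a 15% lead becomes 1.15 × 1.10/1.16 ≈ 1.09, and a 20% lead becomes 1.20 × 1.10/1.16 ≈ 1.14, so in 64-bit mode the lead is more like 9-14%.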