64 bit CPU. A true technical review please


Artanis

Member
Nov 10, 2004
124
0
0
Originally posted by: Vee
You are practically guaranteed at least 30%, from what I've seen, even from a crude port. And that is a good boost indeed. But actually, more 'mature' optimizing will probably bring that up to 40-55%. And in some extreme cases, where 64-bit integer ops, twice the number and more useful registers, and mapping tricks, will converge, you will see 400%-500% increase.

You certainly cannot prove that, at least not yet...
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: MartinCracauer
Only programs with 64-bit integer arithmetic turned out to be substantially faster.
Ok. Funny. Well, let's get back to this.
Let's see.

Is it also true that 64-bit floating point math will not be performed any faster, because modern CPUs already have specialized 128-bit hardware to handle that?

No, FP will also be faster, because we have more visible registers.

The FPU has the same number of registers and the registers have the same size. It is exactly the same as in ia32.

Nope. You're thinking of '387. I'm suggesting using SSE2 also for scalar math. P4 style. And we do have twice as many SSE2 registers in x86-64.
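
For what it's worth, a minimal C sketch of what "SSE2 for scalar math" means in practice (the function name is mine, purely illustrative): a compiler targeting x86-64 will normally turn this into scalar SSE2 instructions (mulsd/addsd) on xmm registers rather than '387 code, and in 64-bit mode it has xmm0-xmm15 to allocate from instead of just xmm0-xmm7.

    /* Illustrative scalar FP kernel.  On x86-64, gcc -O2 typically keeps
       a, x[i], y[i] and the running sum in xmm registers and uses
       mulsd/addsd; a classic ia32 build would default to '387 code. */
    double dot_scale(const double *x, const double *y, int n, double a)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a * x[i] * y[i];
        return sum;
    }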
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: Artanis
Originally posted by: Vee
You are practically guaranteed at least 30%, from what I've seen, even from a crude port. And that is a good boost indeed. But actually, more 'mature' optimizing will probably bring that up to 40-55%. And in some extreme cases, where 64-bit integer ops, twice the number and more useful registers, and mapping tricks, will converge, you will see 400%-500% increase.

You certainly cannot prove that, at least not yet...

I'm not terribly motivated. Time will show, anyway. MartinCracauer might be right that I'm underestimating the impact of 64-bit integers on performance (my 30% figure) and giving too much credit to the number of registers. Well, we'll see...
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: MartinCracauer
Also note that Pentium-4s are generally faster than AMDs for floating point, both when you use the FPU and when you use MMX or SSE. AMD didn't make that a priority and it didn't change between Athlon XP and Athlon 64 (except that the 64 has SSE2 at all, but not as fast as Intel's).

No. This "generally" is simply not true. Maximum possible achivable FP performance, is higher for P4, if and only if, you're using SSE2, and if the code is very suitable for the P4. "Generally", AMD FP performance is better.

Edit: Correction: Maximum possible achievable scalar FP performance is higher for A64 than P4, any way you look at it.
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
Originally posted by: Vee
The FPU has the same number of registers and the registers have the same size. It is exactly the same as in ia32.

Nope. You're thinking of '387. I'm suggesting using SSE2 also for scalar math. P4 style. And we do have twice as many SSE2 registers in x86-64.

I don't think anybody got encouraging results out of this. Using SSE for scalar (not SIMD) math doesn't offer many benefits and leads to further restrictions, e.g. no move immediate.

The problem here is that SSE is a SIMD unit with considerable overhead. It is designed to do many things in one sweep, but if you just use it for single FP instructions there is more overhead.

Here is a thread on this issue:
http://groups.google.com/group...p.compilers%26rnum%3D1

You can try to use the automatic vectorizer in Intel's icc. I didn't try it yet, but I am not very optimistic. It's also questionable whether the resulting code will run on x86_64; if I'm not mistaken, icc is only available for ia32 and ia64 (aka Itanium).

Furthermore, once you use SSE registers you have to take care of them through function calls, and the OS has to save/restore them on context switches. That can be mighty expensive.

Last but not least, the i387 unit and SSE2 do math with slightly different results which can be undesirable when your boss doesn't understand floating point.
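
To illustrate that last point, a small hedged example (behavior depends on compiler, optimization level and flags such as gcc's -mfpmath=387 vs -mfpmath=sse): the intermediate product below overflows a 64-bit double but still fits in the x87's 80-bit extended range.

    #include <stdio.h>

    /* Inputs chosen so the intermediate a*b overflows a 64-bit double
       but fits in the x87's 80-bit extended exponent range. */
    static double scale(double a, double b, double c)
    {
        return a * b * c;
    }

    int main(void)
    {
        volatile double a = 1e308, b = 1e308, c = 1e-308;
        /* With '387 math, if the intermediate stays in an 80-bit register,
           this can print roughly 1e308; with SSE2 doubles it prints inf. */
        printf("%g\n", scale(a, b, c));
        return 0;
    }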
 

Artanis

Member
Nov 10, 2004
124
0
0
Originally posted by: Vee
Maximum possible achievable FP performance is higher for the P4 if, and only if, you're using SSE2 and the code is very suitable for the P4. "Generally", AMD FP performance is better.

Edit: Correction: Maximum possible achievable scalar FP performance is higher for A64 than P4, any way you look at it.

I don't understand why SSE2 is more efficient on P4 than A64. It's also true that in SuperPi, an old FPU-intensive test, for example, Athlons are faster than P4s.

Originally posted by: Vee
I'm not terribly motivated. Time will show, anyway. MartinCracauer might be right that I'm underestimating the impact of 64-bit integers on performance (my 30% figure) and giving too much credit to the number of registers. Well, we'll see...
You may not be motivated, but AMD should be, because the A64 has been on the market for more than a year, and its 64-bit capabilities are still useless... If it could just prove the minimum 30% increase (not talking about the 400-500% you mentioned) in some reliable benchmarks, maybe things would turn in its favor...

Anyway, to stay on topic: 64-bit OSes and applications represent the future, while 32-bit is slowly becoming the past...
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: MartinCracauer

Using SSE for scalar (not SIMD) math doesn't offer many benefits and leads to further restrictions, e.g. no move immediate.

The problem here is that SSE is a SIMD unit with considerable overhead. It is designed to do many things in one sweep, but if you just use it for single FP instructions there is more overhead.

You can try to use the automatic vectorizer in Intel's icc. I didn't try it yet, but I am not very optimistic. It's also questionable whether the resulting code will run on x86_64; if I'm not mistaken, icc is only available for ia32 and ia64 (aka Itanium).

Furthermore, once you use SSE registers you have to take care of them through function calls, and the OS has to save/restore them on context switches. That can be mighty expensive.

Last but not least, the i387 unit and SSE2 do math with slightly different results which can be undesirable when your boss doesn't understand floating point.

Well, you're the optimization expert. And I'm sure all your points are valid, some of the time.
But I'm reasonably confident vectorizing is the right thing to do, most of the time. If you haven't tried Intel's compiler for the P4 (?), I think you should.
I believe autovectorizing will be available for x86-64, once we're on Windows64 and Intel has their x86-64 out.

The different results would be a problem for consistency across different x86 platforms? Wouldn't dropping the 80-bit precision from '387 solve this?
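
(If it helps: roughly what "dropping the 80-bit precision" looks like in code, assuming glibc's <fpu_control.h>; the function name is mine. It narrows the significand to 53 bits so '387 results match SSE2 more closely, though the x87's wider exponent range can still produce differences.)

    #include <fpu_control.h>   /* glibc-specific */

    /* Sketch: lower x87 precision control from 80-bit extended to
       53-bit double. */
    static void set_x87_double_precision(void)
    {
        fpu_control_t cw;
        _FPU_GETCW(cw);
        cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
        _FPU_SETCW(cw);
    }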
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: Artanis
I don't understand why SSE2 is more efficient on P4 than A64.

I don't understand it completely either. My guess is that it's because the K8's three FP execution units are specialized. One type of operation can only be handled by one unit. Unless the ops are mixed up, the A64 loses parallelism, and its lower clockrate works against it.
(When the P4 isn't faster than A64, it's mostly due to the P4 not digesting the code well. Underflow, overflow, division. So it's because the P4 slows down. Which it does a lot. But not during video encoding.)

(Another guess is that the K9 | EDIT: Correction: K10 | will get three general purpose FP units instead.)

Originally posted by: Vee
I'm not terribly motivated. Time will show, anyway. MartinCracauer might be right that I'm underestimating the impact of 64-bit integers on performance (my 30% figure) and giving too much credit to the number of registers. Well, we'll see...
You may not be motivated, but AMD should, ..

Well, there have been some AMD benchmarks. Whether they are "reliable" is a different issue. Anyway, it seems AMD have gotten out of the benchmark publishing business. We're still on a 32-bit platform until Intel gets their 6x0 series P4s out and MS can finally release Windows XP 64-bit Edition.
64-bit performance needs support from libraries, compilers and OS ABIs. That's the reason I keep saying "we'll see" too. It's a matter of time, as you've pointed out. But I have a hard time believing that we will not see performance improvement from 64-bit apps.

I respect MartinCracauer's views. But SSE2 will be used heavily for FP. And the additional visible registers' ability to keep data close to execution should make a difference some of the time.

Meanwhile, check around in AnandTech's archive of articles. There should be something. Mixed results, no doubt. But remember, if the 64-bit code is similar to 32-bit code, that is, doesn't utilize 64-bit integers or the additional registers, there is no reason to expect improvement, and there will be none.

(I see anandtech has a new one on database servers. It seems to include some 64-bit results.)

 

grant2

Golden Member
May 23, 2001
1,165
23
81
Vee, could you answer my questions without getting into extra registers & any other specific enhancements that happen to be in an A64 chip?
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: grant2
Vee, could you answer my questions without getting into extra registers & any other specific enhancements that happen to be in an A64 chip?

(sorry for the delay, I've been busy)
Yes, I can. But the extra registers are specific to the x86-64 architecture as such. Not just to the A64.

Going 64-bit isn't just about increasing the width of some registers and adding some instructions. It's about adding essentially an entirely new CPU architecture to the old, just as the '386 once did. Keeping it *similar* helps with staying backwards compatible, while saving transistors. But the CPU will execute new 64-bit code in a new 64-bit mode. The old 32-bit architecture ('386, '486, Pentiums, K6, Athlon/XP) has three different user modes (or four, depending upon how you count), representing three different CPU personalities (8086, '286, 32-bit) in one CPU:

Real mode = original 8086. [8086]
Protected mode = supporting both the older 16-bit protected mode ['286] and 32-bit computing. [32-bit]
Virtual real mode (virtual mode) = submode of 'protected 32-bit mode', emulates an original 8086 inside the protected mode. [8086]

This is known as 'IA32', but is basically the '386. A consequence of the long '86 PC legacy.
The various extensions since (FPU, MMX, SSE, SSE2) may add registers and instructions, but they aren't as fundamental a change as going 64-bit (or going from 16-bit to 32-bit).

The x86-64 CPUs now have five (or seven, depending on how you count) modes, representing four (8086, '286, 32-bit, 64-bit) main CPU personalities in one CPU:

Legacy mode/real mode = original 8086. [8086].
Legacy mode/protected mode = protected 16/32 bit code. ['286] & [32-bit].
Legacy mode/virtual mode = emulating 8086 inside protected addressing. [8086].
Long mode/compatibility mode = emulating 16/32 bit protected modes inside a 64-bit space. ['286] & [32-bit].
Long mode/64-bit = Our brave new world! 64-bit computing [64-bit]. A 64-bit virtual address space - this is the main thing! This last mode also includes double the number of registers, and 64-bit integer GP registers. These things are inherent in x86-64. Not just some enhancement that happens to be in the A64.

This is 'x86-64' (aka AMD'86-64, aka AMD64, aka CT (Intel), aka IA32e (Intel), aka EM64T (Intel))
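
(As an aside, a minimal sketch of how software can tell whether a CPU offers long mode at all, assuming gcc-style inline assembly; the function name is mine. CPUID leaf 0x80000001 reports the LM bit in EDX bit 29 on both AMD64 and EM64T parts.)

    #include <stdint.h>

    /* Returns 1 if the CPU supports x86-64 long mode, 0 otherwise. */
    static int has_long_mode(void)
    {
        uint32_t eax, ebx, ecx, edx;

        /* Check that extended leaf 0x80000001 exists. */
        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0x80000000u));
        if (eax < 0x80000001u)
            return 0;

        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0x80000001u));
        return (edx >> 29) & 1;   /* LM: long mode available */
    }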

As for the rest of my earlier elaborations on execution, I tried to show that what actually goes on, hardware-wise, is maybe slightly different from an intuitive understanding of the instructions and registers. (And those are things that "happen to be in" the A64. Other CPUs might do things slightly differently.)

So is it true that 64-bit integer math will be performed much faster on a 64-bit CPU (because it doesn't have to be broken down into multiple 32-bit operations)?

- Yes. Also, operations wider still than 64 bits, like in security encryption, will be faster because they can be broken into fewer, larger parts.

That was the short simple answer. Here's the elaboration:

Also, some stuff can be faster because persistent data can be kept close to the execution, in the extra registers (if you/the compiler use them). This holds for both 32-bit and 64-bit integers, in 64-bit code of course.
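
A tiny C illustration of that point (the function name is mine): on a plain ia32 target the compiler has to synthesize these uint64_t operations from several 32-bit instructions (add/adc, multiple mul/imul plus carry handling), while in 64-bit mode each is a single instruction on a 64-bit GP register.

    #include <stdint.h>

    uint64_t mix64(uint64_t a, uint64_t b)
    {
        uint64_t sum  = a + b;   /* one ADD on x86-64; ADD + ADC on ia32 */
        uint64_t prod = a * b;   /* one IMUL on x86-64; several MULs on ia32 */
        return sum ^ prod;
    }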

Is it also true that 64-bit floating point math will not be performed any faster, because modern CPUs already have specialized 128-bit hardware to handle that?

- No. That is not quite right.

It is true that 64-bit GP registers and 64-bit flat addressing, the 64-bit issue as such, will not change things for 64-bit (double precision) FP math.

It is true that we already have, since the '387 FPU coprocessor and the '486DX (integrating the FPU), 64-bit FP registers and 64-bit FP operations (with 80-bit internal precision).

It is true that we already have, since the Pentium4, 128-bit vector registers, holding 2X64 bit or 4X32 bit FP data. And that we have SIMD vector instructions, that will perform a single 32-bit or 64-bit FP operation, on 4 or 2 FP values, with a single instruction.

That may have been the 'simple' answer you're looking for?
So what about the "- No. That is not quite right."? Again, here's the elaboration.


But:
We do not have hardware performing 128-bit wide operations. Not in x86-64, and not before. 64 bits is the longest data width (discounting the 80-bit precision inside '87 math) to be operated on by hardware.
128-bit vector instructions are useful because they state explicitly parallel operations, which makes good use of recent CPUs' hardware parallelism, as well as OoO (out-of-order) scheduling.

There is FP and there is FP... '87 FP and vector (aka SIMD, aka 'packed') FP (3DNow!, SSE, SSE2, SSE3...).
For sporadic FP operations, '87 is probably still best to use. In this case there is no change in 64-bit mode.

More FP-intensive (and time-consuming) computing work, like media encoding, 3D rendering, game 3D engines, and matrix/tensor math (basically all advanced computer math for physics and engineering), is normally better handled with vector instructions.

In this case, we have twice the number of registers in 64-bit mode. And I'm suggesting this means some FP math may indeed perform better in 64-bit code. I also think there are some media encoding benches that support this, and that various game developers have made 64-bit performance claims that could possibly be partially because of this.
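
To make the vector-FP part concrete, a minimal sketch using SSE2 intrinsics from <emmintrin.h> (function name mine): one addpd instruction adds two doubles at a time, and in 64-bit mode the compiler can spread such code over xmm0-xmm15 instead of only xmm0-xmm7.

    #include <emmintrin.h>

    /* Adds two pairs of doubles with a single packed addpd. */
    void add2(double *dst, const double *a, const double *b)
    {
        __m128d va = _mm_loadu_pd(a);            /* load two doubles */
        __m128d vb = _mm_loadu_pd(b);
        _mm_storeu_pd(dst, _mm_add_pd(va, vb));  /* packed 2x64-bit add */
    }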

Final words: There will be no GENERAL performance increase from going 64-bit. Some things will be faster, by exploiting new features. But understanding modern PC 16/32/64 *bitness* in the game-console paradigm (64-bit is twice as good as 32-bit) is wrong. The essential issue is the virtual address space that code and data must live in. Is it flat or segmented, is it big enough, does it have 'elbow space'? This is a much, much bigger and more important thing than some silly data width.

As for the *width* thing, we already have 128-bit wide buses (dual 64-bit channels). And the total collective sum of the A64's parallel execution widths, at one of the final stages, is a whopping 384 bits. Similar is true of the PIII, P4 and Athlon/AthlonXP.

We don't go 64-bit just because wider is faster. Wider is faster! If you specifically need to operate on long, 32+ bit fields, the 64-bit GPRs will make a lot of difference. But we have already been pursuing that 'wider' path for a good while. Not just with the FPU, bus widths and vector extensions, but also with multiple execution units. Even if our CPUs are just "32-bit". I'm also sure CPUs will continue to get gradually *wider*, while remaining "64" bit.

The primary purpose for 64-bit GP registers and integer instructions, is for handling 64-bit pointers.
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
Good post, Vee.

Just one note: unfortunately, many of the encryption algorithms that have been designed for performance (for example Rijndael (AES)) are specifically designed to work on 32-bit values. They are not sped up by a 64-bit CPU. One of AnandTech's comparisons earlier this year confirms this.

Other encryption routines, especially two-key algorithms like RSA that are not designed to encrypt masses of data fast (because they only generate a key for a one-key algorithm like DES or AES), will be sped up by 64 bits, but that doesn't do much for overall system performance.
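
For illustration, this is the kind of inner loop in RSA-style bignum code that does benefit (a sketch only; it assumes gcc's unsigned __int128 extension, available on 64-bit targets, and the function name is mine): with 64-bit limbs each step is one 64x64->128 multiply, where a 32-bit CPU needs four 32x32 multiplies plus carry handling for the same amount of work.

    #include <stddef.h>
    #include <stdint.h>

    /* acc[0..n] += a[0..n-1] * b, with 64-bit limbs.
       Assumes acc has room for n+1 limbs. */
    void mul_add_limb(uint64_t *acc, const uint64_t *a, size_t n, uint64_t b)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned __int128 p = (unsigned __int128)a[i] * b + acc[i] + carry;
            acc[i] = (uint64_t)p;
            carry  = (uint64_t)(p >> 64);
        }
        acc[n] += carry;
    }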
 