8 Physical cores vs 4 Physical 8 Threads?


Cogman

Lifer
Sep 19, 2000
10,278
126
106
You just described my worst nightmares, thanks. Honestly, you really want animated interfaces with 3d graphics in your standard software? The whole point about a good UI is that it's intuitive and not distracting.

And considering that HT does lots of things the programmer can't even influence, I don't see how it makes programmers lazy or not. Especially considering that usually the executables will have to run on a vast range of x86 processors.

I've never understood this attitude (not yours, the one you are addressing). It is like people think "If programmers were just better, then HT would be useless." which is not the case at all. HT wasn't invented because programmers do threading terribly. It was invented because the CPU had a lot of wasted resources regardless of what program it was running.

The x86 architecture doesn't expose which execution units are in use, so it is pretty much impossible for a programmer to say "I'll run this instruction, then this one, and that way the CPU will be fully busy!" Even if they could do it by perfectly striping their code (e.g. mov, then add, then some x87 instruction, etc.), they have no guarantee that the CPU wouldn't reorder their code anyway. Furthermore, it is difficult to come up with a function that uses all of the FPU and all of the ALU at the same time.

Even the converse (which I've seen here) is crazy. Some believe that the performance gains from HT are completely dependent on how good a programmer is at programming for HT. Again, this is crazy because a programmer has no control over when a thread is executed on the CPU. In other words, they can't control which thread gets executed alongside which; that is an OS-level choice.
 

PlasmaBomb

Lifer
Nov 19, 2004
11,815
2
81
Best case scenario with i7 HT at the minute is <50% speed up Dufus... (i.e. 4 cores/4 threads -> 4 cores/8 threads).



Interestingly some games don't like HT (Left 4 Dead, Far Cry 2)

HT was also supposed to make programmers think more about threading, or at least that was Intel's idea at the time of P4, when they were thinking of multiple cores in the future.

The same is true today. 8 cores (16 threads) at the consumer level isn't too far off, so programmers shouldn't be limiting themselves to 2-4 threads. If they start programming now for 8 threads then Intel have a better market when 8 core CPUs come out...
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Hey guys,
Just curious to know what would perform better 8 Physical Cores no HT, or 4 Physical cores with HT i.e. 8 threads both. (assuming same clock speed and other considerations)

Assuming equal cores, then obviously 8 physical cores will beat 4 physical cores in performance. HT is more of a 'supercharger' on a core, which you can use to easily squeeze some 20-30% extra performance out of a single core, with little extra hardware.
8 physical cores will take 100% extra transistors compared to 4 physical cores.
8 threads on a 4 core CPU would theoretically only add about 5% extra transistors. In practice this is 0%, since in all Intel's CPU architectures with HT support, the HT logic is always present, it is simply disabled in the lower-end series.
As a result, you rarely find 'equal cores'. 8 core CPUs tend to be clocked lower than 4 core CPUs with HT. In AMD's case, the cores are also considerably less efficient per cycle, so HT actually wins out over physical cores in various scenarios.
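The arithmetic in this post can be sketched in a few lines of Python. This is only a rough model under the post's own numbers (20-30% HT gain, about 5% extra transistors for HT, identical cores and clocks); the function names are made up for illustration.

```python
def relative_throughput(cores, ht_gain=0.0):
    """Throughput in units of one plain core, assuming every core gains ht_gain from HT."""
    return cores * (1.0 + ht_gain)

def relative_transistors(cores, ht_overhead=0.0):
    """Transistor budget in units of one plain core, assuming ht_overhead extra per core."""
    return cores * (1.0 + ht_overhead)

eight_plain = relative_throughput(8)
for gain in (0.20, 0.30):
    four_ht = relative_throughput(4, gain)
    per_transistor = four_ht / relative_transistors(4, 0.05)
    print(f"HT gain {gain:.0%}: 4C/8T = {four_ht:.1f} core-equivalents "
          f"vs 8C/8T = {eight_plain:.1f}; "
          f"4C/8T throughput per transistor unit = {per_transistor:.2f} vs 1.00")
```

Under these assumptions the 8-core part always wins on raw throughput, while the 4-core HT part does more work per transistor, which is the trade-off described above.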
 
Sep 29, 2004
18,665
67
91
That's a nice way to put it and I agree, but the trend in usability is toward graphics; just imagine animated *live* interfaces running with antialiased text, 3D graphics and DX11 post-process filters in Windows 7 apps. It's just amazing but requires a change in mindset, from a hardcore start>all programs luddite to gamer designer.

More cores allow for more powerful (verbose) programming languages, which result in less efficient applications. But those applications can do a lot more. After doing Java for about 7 years now, I could never imagine doing a GUI in C/C++ again.
 

grimpr

Golden Member
Aug 21, 2007
1,095
7
81
You just described my worst nightmares, thanks. Honestly, you really want animated interfaces with 3d graphics in your standard software? The whole point about a good UI is that it's intuitive and not distracting.

And considering that HT does lots of things the programmer can't even influence, I don't see how it makes programmers lazy or not. Especially considering that usually the executables will have to run on a vast range of x86 processors.

Yes, and I don't see the problem with that. I personally want fluid experiences in GUIs and user interaction, expertly crafted with the right amount and type of animation where it counts, fully intuitive and not distracting. As you said, distraction comes from bad design and timing choices. It can be done very intelligently, and with some effort embedded deep in the OS, but unfortunately we are stuck with ancient practices and the mentality of the office herd. As for HT, I didn't argue about its usefulness at all; I only provoked to stir things up a bit.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
Thanks for the real-world examples PlasmaBomb but I was hoping for some synthetic comparisons such as soccerballtux's cache missing code to hopefully learn something about cache misses and branch mis-prediction with HTT. Didn't know FC2 degraded with HT, might have a look at that.

Here's some rough trivial code, "trivial" being the operative word. Both programs will hopefully run up to as many logical threads as available but not more than 32. Code is the same for all threads but different for each program. One works well with HT and the other does not. Affinity can be set via taskmanager and the score used as a reference. The following examples were run on a quad with HT and all cores set to run at the same speed. (No speedstep / no Turbo boost).

LikeHT with 4 threads run on separate physical cores.


LikeHT with 8 threads run on separate logical cores. (~90% improvement in score).



DisLikeHT with 4 threads run on separate physical cores.


DisLikeHT with 8 threads run on separate logical cores. (same score, no improvement).



With LikeHT a ~90% increase in throughput can be seen, whereas DislikeHT shows no improvement. Feel free to run/debug/decompile or whatever the attached code if at all interested. Although the timings might appear the same for both LikeHT and DislikeHT, it's just a fluke. They use different code so should not be compared against each other but only against themselves, i.e. compare one configuration of LikeHT only with another configuration of LikeHT.

http://www.sendspace.com/file/2vcl6p
 
Dec 30, 2004
12,554
2
76
That's a lot more real world than what I did. I took a look at my assignments again and it appears I misspoke. The code I was talking about was written in assembly for a take-home exam (i.e. not compilable). The large chunk of code I wrote, which I thought applied to this, was just a simulation comparing the accuracy (predicting taken/not taken on branches) of 4 different branch prediction algorithms on an instruction trace of GCC being compiled for installation. It is similar, but not the same: one could clearly see the wasted resources as a percentage of computation time, since every mispredicted branch in the real world would require a pipeline flush and time to refill it... unless you had HT.
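For anyone curious what such a simulation looks like, here is a minimal Python sketch. The four algorithms from the assignment are not named in the post, so the two predictors below (static always-taken and a table of 2-bit saturating counters) are stand-ins, and the trace is a made-up loop rather than a real GCC trace.

```python
def always_taken(trace):
    """Static predictor: guess 'taken' for every branch and return the hit rate."""
    return sum(1 for _, taken in trace if taken) / len(trace)

def two_bit_counter(trace, table_size=16):
    """2-bit saturating counters, indexed by branch address modulo table_size."""
    counters = [2] * table_size          # every entry starts as 'weakly taken'
    hits = 0
    for addr, taken in trace:
        idx = addr % table_size
        predicted = counters[idx] >= 2   # states 2 and 3 predict taken
        hits += predicted == taken
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return hits / len(trace)

# Hypothetical trace: a 10-iteration loop run 3 times. Each iteration has a
# never-taken branch at 0x44 and the loop branch at 0x40 (taken 9 times, then exits).
trace = []
for _ in range(3):
    for i in range(10):
        trace.append((0x44, False))
        trace.append((0x40, i < 9))

print(f"always taken : {always_taken(trace):.3f}")
print(f"2-bit counter: {two_bit_counter(trace):.3f}")
```

Every miss counted here would cost a pipeline flush on real hardware, which is the wasted time the post refers to.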
 
Dec 30, 2004
12,554
2
76
Thanks for the real-world examples PlasmaBomb but I was hoping for some synthetic comparisons such as soccerballtux's cache missing code to hopefully learn something about cache misses and branch mis-prediction with HTT. Didn't know FC2 degraded with HT, might have a look at that.

Here's some rough trivial code, "trivial" being the operative word. Both programs will hopefully run up to as many logical threads as available but not more than 32. Code is the same for all threads but different for each program. One works well with HT and the other does not. Affinity can be set via taskmanager and the score used as a reference. The following examples were run on a quad with HT and all cores set to run at the same speed. (No speedstep / no Turbo boost).

LikeHT with 4 threads run on separate physical cores.


LikeHT with 8 threads run on separate logical cores. (~90% improvement in score).



DisLikeHT with 4 threads run on separate physical cores.


DisLikeHT with 8 threads run on separate logical cores. (same score, no improvement).



With LikeHT a ~90% increase in throughput can be seen, whereas DislikeHT shows no improvement. Feel free to run/debug/decompile or whatever the attached code if at all interested. Although the timings might appear the same for both LikeHT and DislikeHT, it's just a fluke. They use different code so should not be compared against each other but only against themselves, i.e. compare one configuration of LikeHT only with another configuration of LikeHT.

http://www.sendspace.com/file/zj9luu

Can you tell us where you got that? It's just the exes. I'd like to see the code behind it. The 4th picture shows 2x the time of the 3rd picture and the exact same score. I have trouble believing this.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
I just made it up as I went along. There's not much code and should be easy to see and check with a debugger or decompiler. Seems a couple of bugs were there anyway so here's the Dislike HT v2 which I'll update later, but not today.

Code:
                pushfd
                pop     eax
                mov     edx,eax
                xor     eax,1 shl 21            ;change ID flag
                push    eax
                popfd
                pushfd
                pop     ebx
                push    edx
                popfd                           ;Restore Flags as were
                cmp     eax,ebx                 ;check for CPUID
                jnz     NotSupported

                xor     eax,eax
                cpuid
                cmp     eax,1
                jb      NotSupported

                mov     eax,1
                cpuid
                bt      ecx,19                  ;Check for SSE4.1
                jnc     NotSupported            ;Normally wouldn't bother with the above for just myself

                invoke  GetProcessAffinityMask,-1,ProcAff,SysAff
                invoke  SetProcessAffinityMask,-1,[SysAff]
                xor     edx,edx
                mov     eax,[ProcAff]
                mov     ecx,32
.NextBit:
                shl     eax,1
                jnc     @f
                inc     edx
@@:
                dec     ecx
                jnz     .NextBit

                mov     [Threads],edx            ;windows thread count

                cinvoke wsprintf,Buff,wsformat2,[Threads]
                invoke  MessageBox,0,Buff,Title,MB_OKCANCEL
                cmp     eax,IDCANCEL
                je      exit
again:
                invoke GetProcessAffinityMask,-1,ProcAff,SysAff
                mov     eax,[ProcAff]

                xor     ebx,ebx
                xor     ecx,ecx
.NextBit:
                shr     eax,1
                jnc     @f
                mov     [Affinity+ebx*4],1
                shl     [Affinity+ebx*4],cl
                inc     ebx
@@:
                inc     ecx
                cmp     ecx,32
                jb      .NextBit

                mov     [Threads],ebx           ;thread count

                xor     ebx,ebx
@@:             mov     [Param],ebx
                lea     esi,[ThreadID+ebx*4]
                invoke  CreateThread,0,0,CpuThread,Param,CREATE_SUSPENDED,esi
                mov     [hThread+ebx*4],eax
                mov     edx,[Affinity+ebx*4]
                invoke  SetThreadAffinityMask,[hThread+ebx*4],edx

                inc     ebx
                cmp     ebx,[Threads]
                jb      @b


                invoke  GetTickCount
                mov     [STime],eax

                xor     ebx,ebx
@@:
                invoke  ResumeThread,[hThread+ebx*4]
                inc     ebx
                cmp     ebx,[Threads]
                jb      @b


@@:             invoke WaitForMultipleObjects,ebx,hThread,1,1000
                cmp    eax,102h                 ;should add count for timeout
                je     @b

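                invoke  GetTickCount
                sub     eax,[STime]             ;eax = elapsed ms
                mov     ecx,eax
                xor     edx,edx
                cmp     eax,0
                jz      @f                      ;avoid divide by zero
                mov     eax,10000000
                mov     esi,[Threads]
                mul     esi                     ;edx:eax = Threads * 10^7
                div     ecx                     ;eax = score = Threads * 10^7 / ms
                mov     ebx,1000
                xchg    eax,ecx                 ;eax = ms, ecx = score
                xor     edx,edx
                div     ebx                     ;eax = seconds, edx = remaining ms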
@@:
                cinvoke wsprintf,Buff,wsformat,[Threads],eax,edx,ecx
                invoke  MessageBox,0,Buff,Title,MB_YESNO
                cmp     eax,IDYES
                je      again

exit:
                invoke  ExitProcess,0

NotSupported:
                invoke  MessageBox,0,NotSuppTxt,Title,MB_OK
                jmp     exit
;----------------------------------------
proc            CpuThread,nnnn
                push    edi
                mov     edi,400000000
@@:
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0
                blendvps xmm0,xmm1,xmm0

                dec     edi
                jnz     @b

                pop     edi
                ret
endp
;----------------------------------------

  SysAff             dd ?
  Title              db 'Dislike HT v2',0
  wsformat           db 'Threads Completed = %u',10,10
                     db 'Time =  %u.%03u Seconds',10,10
                     db 'Score = %u',10,10
                     db 'Run Again?',0
  wsformat2          db '%u Threads.',10,10
                     db 'Set Affinity in Taskmanager.',10,0
  NotSuppTxt         db 'CPU Not Supported.',0

align 4
  Zero               dd ?
  STime              dd ?
  Param              dd ?
  ProcAff            dd ?
  Threads            dd ?
  Affinity           rd 32
  ThreadID           rd 32
  hThread            rd 32
  Buff               rb 100
LikeHT is similar, except that it has no SSE check and uses a loop of CPUID EAX=0 instructions. The times posted are what I got, although GetTickCount readings can typically be off by plus or minus 15.6 ms. If you get something much different then please post.
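For readers who don't read assembly, the affinity handling in the listing boils down to the following. This is a rough Python paraphrase, not a translation of the whole program; the function names are invented, and the 32-bit width mirrors the 32-thread limit mentioned earlier.

```python
def logical_cpus(mask, width=32):
    """Bit positions set in an affinity mask, lowest first (one per allowed logical CPU)."""
    return [bit for bit in range(width) if (mask >> bit) & 1]

def per_thread_masks(mask, width=32):
    """One single-bit mask per allowed CPU, like the Affinity array in the listing."""
    return [1 << bit for bit in logical_cpus(mask, width)]

# Example: Task Manager affinity limited to CPUs 0, 2, 4 and 6.
process_mask = 0b01010101
print(logical_cpus(process_mask))       # which CPUs the worker threads get pinned to
print(per_thread_masks(process_mask))   # the mask passed to each thread
```

The number of worker threads is simply the number of set bits, so changing the affinity in Task Manager before pressing "Run Again" changes both the thread count and where the threads run.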
 

Scali

Banned
Dec 3, 2004
2,495
0
0
I'm not sure if your '4 physical cores' is correct.
Judging by the Windows scheduler, every pair of cores is a single physical core, so instead of running on cores 0-3 you should run on cores 0, 2, 4 and 6.
Could you try that and see if it makes a difference?
 

Dufus

Senior member
Sep 20, 2010
675
119
101
Depends on the APICID, for that it was 0,2,4,6,1,3,5,7 and it seems sometimes it can be different.

Here you go: in Windows CPU numbering these are logical cores 2 and 5, which are on different physical cores.


And here are logical cores 2 and 6, which are on the same physical core.
 

iCyborg

Golden Member
Aug 8, 2008
1,327
52
91
Can you tell us where you got that? It's just the exes. I'd like to see the code behind it. The 4th picture shows 2x the time of the 3rd picture and the exact same score. I have trouble believing this.
The score is num_threads/time x 10^4.
So if 2x threads take 2x time, you get the same score.
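To make that concrete, here is the same calculation in Python, mirroring the integer arithmetic in the posted listing (threads x 10^7 divided by elapsed milliseconds). The timings in the example are made-up round numbers, not the values from the screenshots.

```python
def score(threads, elapsed_ms):
    """Integer score as the listing computes it: threads * 10^7 // milliseconds."""
    if elapsed_ms == 0:
        return 0                      # the listing skips the division in this case
    return threads * 10_000_000 // elapsed_ms

# Hypothetical timings: doubling both threads and time leaves the score unchanged.
print(score(4, 10_000))   # 4 threads, 10 s
print(score(8, 20_000))   # 8 threads, 20 s
# Doubling threads in the same time doubles the score.
print(score(8, 10_000))
```

So an unchanged score with twice the threads means no extra throughput, and a score near double means HT is nearly doing the work of a second core.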
 

Dufus

Senior member
Sep 20, 2010
675
119
101
Thanks for the nice explanation iCyborg.

I thought at first soccerballtux was commenting on the timing being exactly twice (i.e. too perfect) rather than the score. The score was to try and make the relationship easier to see rather than just threads and timing. Seems it made things more confusing instead. Another reason why I'll never make it as a professional programmer lol.

Scali, regarding APIC ID, while I'm using the same hardware, Windows VHP32 will map to 02461357 as did the older W7 RC32 but W7HP64 maps to 01234567. Those above SS's were from VHP32.
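A small Python sketch of how those two orderings translate into core pairs. It reads the string as "Windows CPU i has APIC ID order[i]" and assumes the lowest APIC ID bit selects the HT thread while the remaining bits select the physical core, which holds for a single-socket quad with HT but is an assumption here; the function name is invented.

```python
from collections import defaultdict

def cores_from_order(order):
    """Group Windows CPU numbers by physical core.

    order[i] is taken to be the APIC ID of Windows CPU i.
    Assumption: APIC ID bit 0 is the HT thread, the higher bits are the core.
    """
    groups = defaultdict(list)
    for cpu, apic_id in enumerate(int(ch) for ch in order):
        groups[apic_id >> 1].append(cpu)
    return dict(groups)

print(cores_from_order("02461357"))   # VHP32 / W7 RC32 ordering
print(cores_from_order("01234567"))   # W7HP64 ordering
```

With the first ordering, CPUs 2 and 6 land on the same core and CPUs 2 and 5 on different ones, matching the screenshots described earlier; with the second, each adjacent pair shares a core.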
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Scali, regarding APIC ID, while I'm using the same hardware, Windows VHP32 will map to 02461357 as did the older W7 RC32 but W7HP64 maps to 01234567. Those above SS's were from VHP32.

Ah okay... I've only used Win7 x64 on Core i7 machines, and for CPU-intensive tasks, such as compiling with VS2010, it would always just load cores 0, 2, 4 and 6.
I wasn't aware that different versions of Windows may have different mappings.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,761
14,786
136
Are 3 and 7 the same physical core? E5520 with HT turned on. I get errors reported by those cores in OCCT. And are they physical or logical cores?
 

ModestGamer

Banned
Jun 30, 2010
1,140
0
0
It's very workload dependent and to some degree kernel dependent. In a very well coded OS optimized for a specific core count, with very aggressive threading techniques and high levels of parallelization, the 8 core should beat the 4 core with HT all day. It also depends on the task at hand.

Sometimes one might be better than the other. More than likely the 8 core will typically be better though, especially in areas where big heavy math is involved.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
are 3 and 7 the same physical core ? E5520 W/HT turned on.

Mark, maybe you could try downloading and running APIC.zip ~2KB



The above shows
Windows CPU 0 & 1 mapped to physical Core 0.
Windows CPU 2 & 3 mapped to physical Core 1.
Windows CPU 4 & 5 mapped to physical Core 2.
Windows CPU 6 & 7 mapped to physical Core 3.
 

Lorne

Senior member
Feb 5, 2001
874
1
76
Are OSes generally supposed to use the physical cores first and then thread onward to the HT/logical cores, or is it just random and dependent on the programming/programmer?
 

iCyborg

Golden Member
Aug 8, 2008
1,327
52
91
It's up to the OS's thread scheduling implementation. HT-aware OSes will prefer an idle physical CPU to a logical one. Non-aware ones will not distinguish them.
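As a toy illustration of that difference (not how Windows or any real scheduler is implemented), here is a Python sketch. The `siblings` table and both function names are hypothetical; the table assumes the pairing where CPUs 0/1, 2/3, 4/5 and 6/7 share a core.

```python
def pick_cpu_ht_aware(busy, siblings):
    """Prefer an idle logical CPU whose whole physical core is idle; else any idle one."""
    idle = [cpu for cpu in sorted(siblings) if cpu not in busy]
    for cpu in idle:
        if all(s not in busy for s in siblings[cpu]):
            return cpu
    return idle[0] if idle else None

def pick_cpu_unaware(busy, siblings):
    """Treat all logical CPUs alike: take the lowest-numbered idle one."""
    idle = [cpu for cpu in sorted(siblings) if cpu not in busy]
    return idle[0] if idle else None

# Hypothetical topology: each logical CPU lists the other thread on its core.
siblings = {0: [1], 1: [0], 2: [3], 3: [2], 4: [5], 5: [4], 6: [7], 7: [6]}

busy = {0}                                   # one thread already running on CPU 0
print(pick_cpu_ht_aware(busy, siblings))     # skips CPU 1, which shares a core with CPU 0
print(pick_cpu_unaware(busy, siblings))      # happily doubles up on the busy core
```

The aware version only starts doubling up on cores once every physical core already has one thread running.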
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,882
3,230
126
Hey guys,
Just curious to know what would perform better 8 Physical Cores no HT, or 4 Physical cores with HT i.e. 8 threads both. (assuming same clock speed and other considerations)

If you had a workload that took 8 threads, then it's a no-brainer.
The 8c system would be better than a 4c/8t.

HT threads are not as fast as physical threads, so it's impossible for them to be as efficient in everything.

However, scenario 2:
If the program wasn't so heavily multithreaded, and used the second core as a starting core, then they would be about the same.

The purpose of an HT core is to transition a physical core from work -> work, without an idle.

The HT will start a task, and prep it for the physical core to finish its previous work, and then hand off the new assignment.

Scenario 3:
You're playing an old 3D game... now this won't matter, as all the old games are not even multithreaded to begin with, so whatever processor you had, even a dual core, would be equal.

Mark, maybe you could try downloading and running APIC.zip ~2KB

Well this only shows you affinity.
Problem now is trying to find which one is the virtual and which one is the physical.


Uncleweb made me use a different program back when I had him make RealTemp GT.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,761
14,786
136
For me it says core 3 is core 3-0 and core 7 is core 3-1. Sounds like I have a bad core ? The other identical chip was fried on a defective open box motherboard. Maybe it fried one of my cores ? Both chips were on the motherboard when I tried to fire it up. I have a 950 coming today to replace the E5520.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,882
3,230
126
For me it says core 3 is core 3-0 and core 7 is core 3-1. Sounds like I have a bad core ? The other identical chip was fried on a defective open box motherboard. Maybe it fried one of my cores ? Both chips were on the motherboard when I tried to fire it up. I have a 950 coming today to replace the E5520.

No, that CPU is a dual-QPI CPU if I'm not mistaken.

It could be CPUID reporting wrong values because of the secondary QPI.
 

George Powell

Golden Member
Dec 3, 1999
1,265
0
76
As has been said you should not try to compare the clockspeeds of the Xeons you have with an i7.
As it is, given your particular use, the Xeons are going to be a little bit faster than a quad core i7. However, in the long run they will be a lot more expensive, due to the older, less efficient architecture consuming more power.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
Well this only shows you affinity.
Problem now is trying to find which one is the virtual and which one is the physical.

Let's take your SS as an example.



You have 6 physical cores with APIC IDs 0, 1, 2, 8, 9, 10, and each physical core has 2 threads with HT enabled (0 and 1), which Windows maps as 12 logical cores, CPU 0 to CPU 11.

Looking at it from the physical side would be.



Anyways, it was just something rough and ready that I had to try and help Mark check the affinity core mapping of his CPU.

For me it says core 3 is core 3-0 and core 7 is core 3-1. Sounds like I have a bad core ?
Possibly. If it's giving errors at stock then yes. If it's giving errors at high OC then maybe it's just because core 3 is weaker than the rest.
 

Emulex

Diamond Member
Jan 28, 2001
9,759
1
71
Remember HT threads can dispatch I/O, so if you have, say, 10 LUNs mapped for SQL Server and 10 files to spread the load (very common), it is possible that 12 logical cores can dispatch more I/O with the Enterprise edition of SQL Server (parallel indexes) and speed things up a bit.

I'd disable HT for older OSes though (XP, 2003); they don't seem as adept at scheduling. Regardless, in a server app I've seen the improvements, and the cooling is not an issue.

I got a free seed unit with two Nehalem quad cores and 24 GB of RAM. Now if I can get some Fermi in the box maybe we can do some benchmarking. Not sure if the dual 650 watt power supplies will cut it. Maybe I'll stick to 4 SSDs only. I'd like to do 16 SSDs but the power drain might be too much with a Fermi.
 