Does hyperthreading impact...?

intangir · Jul 28, 2010

Scali said:
I was talking about this paper:
http://www.usenix.org/event/usenix05/tech/general/full_papers/short_papers/bulpin/bulpin.pdf

That paper has nothing to say about the way any Windows scheduler works. All their evaluations and data were using various versions of the Linux 2.4.19 kernel.

Incorrect.
If they functioned the same, Intel wouldn't have to recommend people to turn HT off for Windows 2000.

I didn't say they functioned the same. I said they function the same on systems with exactly two logical processors. The burden of proof then lies on you to demonstrate what an OS scheduler could possibly do differently on such a system.

Scali · Jul 28, 2010

intangir said:
That paper has nothing to say about the way any Windows scheduler works. All their evaluations and data were using various versions of the Linux 2.4.19 kernel.

You asked:
"So, what impact do these scheduler improvements have when running a single-core system with two logical processors? How would the Windows 7 scheduler function differently than the original Windows 2k scheduler on such a system?"

This paper gives some explanation on certain of the situations that are to be avoided with HT, and proposes ways to avoid them in the scheduler.
So it at least answers how scheduler improvements can have impact on a single-core HT system.
If you want the exact mechanics of Windows 7 you'll have to google. I know there are some tech docs on it somewhere, but don't expect me to cough up links, because I forgot where and when I found them.

intangir said:
I didn't say they functioned the same. I said they function the same on systems with exactly two logical processors. The burden of proof then lies on you to demonstrate what an OS scheduler could possibly do differently on such a system.

No it doesn't actually.
Firstly the paper I linked to already answers what you ask.
Secondly, the fact that Intel recommends to turn HT off for Windows 2000 (yes, for single-core Pentium 4 HT processors, specifically) also indicates that apparently something is not entirely the same between 2000 and XP/Vista.
I think the burden of proof is on you now. As I already said... run up a P4HT system and do some benchmarks with 2k and XP. You'll see a difference. If you don't, you have what it takes to prove me wrong. But you won't.
Not sure why you even want to argue. I'm sorry you missed out on those P4HT/Win2k days.

intangir · Jul 28, 2010

Scali said:
You asked:
"So, what impact do these scheduler improvements have when running a single-core system with two logical processors? How would the Windows 7 scheduler function differently than the original Windows 2k scheduler on such a system?"

This paper gives some explanation on certain of the situations that are to be avoided with HT, and proposes ways to avoid them in the scheduler.

So, basically, it doesn't answer the question of what is ACTUALLY implemented in the Windows scheduler?

No it doesn't actually.
Firstly the paper I linked to already answers what you ask.

As I've said before, none of the differences detailed actually perform differently when there are only two logical processors. So, point 1, untrue.

Secondly, the fact that Intel recommends to turn HT off for Windows 2000 (yes, for single-core Pentium 4 HT processors, specifically) also indicates that apparently something is not entirely the same between 2000 and XP/Vista.

Intel makes recommendations for turning off Hyper-Threading in Windows Vista as well. How is this different?

Scali · Jul 28, 2010

intangir said:
So, basically, it doesn't answer the question of what is ACTUALLY implemented in the Windows scheduler?

Nope (unless XP happens to work exactly the same as theirs).

intangir said:
As I've said before, none of the differences detailed actually perform differently when there are only two logical processors. So, point 1, untrue.

I think you just misunderstood the paper then.
Probably useless for me to try and explain it in a forum post then, but anyway, here's the short version:
HT disabled:
1 thread runs on the physical core at a time. The thread gets full use of all cache, execution resources etc.

HT enabled:
2 threads run on the physical core at a time. Certain buffers are partitioned, where each thread gets half the buffers. The instructions of both threads will be competing for the execution resources... the cache will be shared between both threads etc.
There are cases where this effectively reduces performance of both threads (eg competing for cache makes the effective memory latency go up exponentially and both threads end up slower).
Windows 2000 is not aware that such cases exist, since its scheduler was aimed purely at physical CPUs, where no resource sharing took place.

With a HT-aware scheduler, the OS can choose to remove one of the competing threads, and either run the idle thread while the other thread gets full access to the resources (effectively the same as HT disabled, apart from the physical aspect of the partitioned buffers in the CPU, which has virtually no effect on performance, since Pentium 4 is quite overdimensioned in this area), or it can find a thread that doesn't compete for resources as much.

The result:
In Windows 2000 there are various real-world scenarios where performance drops 20-30% with HT enabled.
In Windows XP, these exact same scenarios will see virtually no performance difference between HT enabled or disabled.

So Windows 2000 gives you the pros and the cons of HT and the sharing of resources.
Windows XP gives you the pros, but pretty much none of the cons.

intangir said:
Intel makes recommendations for turning off Hyper-Threading in Windows Vista as well. How is this different?

Erm, no. Re-read the link. XP and Vista are in the "HT-optimized" list.

intangir · Jul 28, 2010

I think you just misunderstood the paper then.
Probably useless for me to try and explain it in a forum post then, but anyway, here's the short version:
HT disabled:
1 thread runs on the physical core at a time. The thread gets full use of all cache, execution resources etc.

HT enabled:
2 threads run on the physical core at a time. Certain buffers are partitioned, where each thread gets half the buffers. The instructions of both threads will be competing for the execution resources... the cache will be shared between both threads etc.
There are cases where this effectively reduces performance of both threads (eg competing for cache makes the effective memory latency go up exponentially and both threads end up slower).
Windows 2000 is not aware that such cases exist, since its scheduler was aimed purely at physical CPUs, where no resource sharing took place.

With a HT-aware scheduler, the OS can choose to remove one of the competing threads, and either run the idle thread while the other thread gets full access to the resources (effectively the same as HT disabled, apart from the physical aspect of the partitioned buffers in the CPU, which has virtually no effect on performance, since Pentium 4 is quite overdimensioned in this area), or it can find a thread that doesn't compete for resources as much.

Perhaps this thread is not the best place for this discussion, yes. But I'm an engineer; I like having concrete algorithms and references, instead of speaking in generalities. When discourse decays to the point of simple contradiction "The paper explains this" "No it doesn't!", we have to delve into further detail if we want to continue conversing without replying "Yes, it does!" "No it doesn't!". I can Google as well as you can. What I need you to do is demonstrate that you can comprehend the documents you're linking.

Anyhow, thanks for writing that up. It shows you're reading too much into the paper. The OS doesn't track CPU resources beyond knowing which logical processor corresponds to which physical processor, and has no way of knowing which threads make more or less demands of the hardware.

All it can do is schedule processes in the ready state onto logical processors, and prefer using the ones on free physical processors. That's all the paper says it does. Nowhere does the paper say that the OS will run an idle thread on a free logical processor if there's a real available thread. It can't "choose to remove" one of the competing threads. The one degree of latitude it has is choice of logical processors. And when there is only one free logical processor, it doesn't have much latitude, does it?

Erm, no. Re-read the link. XP and Vista are in the "HT-optimized" list.

It's application specific, as Intel should tell you. Performance will vary depending on the specific hardware and software you use, and may have a negative effect. See your system manufacturer for details on specific system configurations and performance, blah blah, yada yada yada.

Scali · Jul 28, 2010

intangir said:
What I need you to do is demonstrate that you can comprehend the documents you're linking.

I don't think I'm the one who needs to prove such a thing.

intangir said:
The OS doesn't track CPU resources beyond knowing which logical processor corresponds to which physical processor, and has no way of knowing which threads make more or less demands of the hardware.

This is where the burden of proof is on you.
And the paper actually states that you can get this information from the CPU's performance counters, not something I read into it:

In order to provide throughput-aware scheduling the OS
needs to be able to quantify the current per-thread and
system-wide throughput. It is not sufficient to measure
throughput as instructions per cycle (IPC) because processes
with natively low IPC would be misrepresented.
We choose instead to express the throughput of a process
as a performance ratio specified as its rate of execution
under Hyper-Threading versus its rate of execution when given exclusive use of the processor
...
It is desirable to be able to estimate the performance ratio
of a process while it is running. We want to be able
to do this online using data from the processor hardware
performance counters. A possible method is to look for a
correlation between performance counter values and calculated
performance;

So yes, the OS can indeed know what is going on. You now have to prove that it doesn't use this information.
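
For reference, the performance ratio defined in the quoted passage can be written down in a few lines. This is only a sketch of the definition: the function name and the timing numbers are made up for illustration, and it assumes both runs do the same amount of work, so the ratio of execution rates equals the inverse ratio of run times.

```python
def performance_ratio(exclusive_seconds, ht_seconds):
    """Rate of execution under Hyper-Threading divided by the rate with
    exclusive use of the processor, for the same amount of work.

    Rate is work / time, so for equal work the ratio reduces to
    exclusive_seconds / ht_seconds.
    """
    if exclusive_seconds <= 0 or ht_seconds <= 0:
        raise ValueError("run times must be positive")
    return exclusive_seconds / ht_seconds


# Hypothetical numbers: 10 s alone, 16 s when sharing the core with a sibling.
print(performance_ratio(10.0, 16.0))
```

A value below 1.0 means the process ran slower while sharing the core. Note that this form needs the exclusive-mode run time, which is not available for a process that is still running.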

intangir said:
Nowhere does the paper say that the OS will run an idle thread on a free logical processor if there's a real available thread. It can't "choose to remove" one of the competing threads.

I don't think it has to state that explicitly.
Why does the idle thread exist in the first place? Because you can't stop execution of a CPU.
So if you want to 'remove a thread' from a logical core, the only way to do that is to replace it with the idle thread. The core's instruction pointer must point somewhere, and it will get updated every time it has executed an instruction. That's just how it works.

intangir said:
It's application specific, as Intel should tell you. Performance will vary depending on the specific hardware and software you use, and may have a negative effect. See your system manufacturer for details on specific system configurations and performance, blah blah, yada yada yada.

Yes, I already covered that in an earlier post:
"Win2k could still get the benefits of HT, there was just a bit more 'luck' involved."

aigomorla · Jul 28, 2010

Johnson184 said:
Hey guys,
I've got a few questions about what hyperthreading affects for a dual screen setup.

1. Does it make a difference playing a video game on 1st screen while watching bluray on the 2nd screen?

2. Does it make a difference playing video games and music (iTunes) at the same time?

3. Does it make a difference processing video? (Converting big videos from one format to another?)

I'm just trying to figure out what hyperthreading helps. Seems like everyone is suggesting that they don't need it. Is it a gimmick that benefits a very small minority (1%)?

In case you and others don't have the time to hash through the fiery debate going on above:

1. No... not unless your Direct3D is set to full screen and you're watching the movie on the second screen while the first screen is full-screened.

2. Sometimes... it will affect games. Although Windows itself has an HTT manager, sometimes the virtual core will be used, and the virtual core is about 33% as fast as a physical core.

3. Oh HTT is the BEST for Encoding.
What happens is while the physical core is doing work, the virtual core will start a new work task.
When the physical core is freed up, the virtual core will start on a new task, and give its task to the faster physical core.
This gives you a lower changeover time between tasks, and also increases performance. This is why an i7 is untouchable at encoding.

HTT isn't for everyone.
For example, when I run LinX with 12 threads, I get lower FLOPS than running it with 6 threads without HT.
That means in raw work, a virtual core is slower.
But when you encode anything, having HTT on will give you a nice boost in scores vs having it off.

Personally I keep HT on.
I haven't seen any issues with it.
And as I said, an HT core is meant as an intermediary for a physical core.
So in all, I don't think it will hurt you with it on, unless you want clock speeds greater than 4.6 GHz.

intangir · Jul 28, 2010

Scali said:
I don't think I'm the one who needs to prove such a thing

Then I don't consider you qualified to make educated statements about what OS schedulers can or can't do.

Scali said:
This is where the burden of proof is on you.
And the paper actually states that you can get this information from the CPU's performance counters, not something I read into it:

That quote is from the Linux kernel paper. The paper is saying that to make performance measurements (and, by the way, produce pretty graphs for the paper) they have to query the processor. It's telling you what they would like to be able to do, and nowhere does it even imply that the OS is able to do this, much less Windows.

You also left out nice quotes such as the following:

The performance ratio and system speedup metrics
both require knowledge of a process exclusive mode execution
time and are based on the complete execution of
the process. In a running system the former is not known
and the latter can only be known once the process has
terminated by which time the knowledge is of little use.

which tell you that genuine performance measurements are not possible in real time (and so cannot be used by the OS scheduler). And also:

A possible method is to look for a
correlation between performance counter values and calculated
performance; work on this estimation technique
is ongoing, however, we present here a method used to
derive a model for online performance ratio estimation
using an analysis of a training workload set.

which points out that what the authors ended up using for their performance data is an experimental indicator that relies on post-analysis of the workloads in a specific testbench. So the likelihood of this being implemented in the Windows scheduler: zilch.

So yes, the OS can indeed know what is going on. You now have to prove that it doesn't use this information.

Sure!
Mark Russinovich, David A. Solomon (2009). Windows® Internals: Including Windows Server 2008 and Windows Vista, Fifth Edition, p.443:

When a thread becomes ready to run, Windows first tries to schedule the thread to run on an idle processor. If there is a choice of idle processors, preference is given first to the thread's ideal processor, then to the thread's previous processor, and then to the currently executing processor (that is, the CPU on which the scheduling code is running).

To select the best idle processor, Windows starts with the set of idle processors that the thread's affinity mask permits it to run on. If the system is NUMA and there are idle CPUs in the node containing the thread's ideal processor, the list of idle processors is reduced to that set. If this eliminates all idle processors, the reduction is not done. Next, if the system is running hyperthreaded processors and there is a physical processor with all logical processors idle, the list of idle processors is reduced to that set. If that results in an empty set of processors, the reduction is not done.

If the current processor (the processor trying to determine what to do with the thread that wants to run) is in the remaining idle processor set, the thread is scheduled on it. If the current processor is not in the remaining set of idle processors, it is a hyperthreaded system, and there is an idle logical processor on the physical processor containing the ideal processor for the thread, the idle processors are reduced to that set. If not, the system checks whether there are any idle logical processors on the physical processor containing the thread's previous processor. If that set is nonzero, the idle processors are reduced to that list. Finally, the lowest numbered CPU in the remaining set is selected as the processor to run the thread on.
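
The selection steps in the last two quoted paragraphs can be sketched roughly as follows. This is an illustrative model of the quoted text only, not actual Windows code: the function name, the parameters, and the `core_of` / `numa_node_of` mappings (logical CPU number to physical core or NUMA node) are all hypothetical.

```python
def pick_idle_processor(idle, affinity, ideal, previous, current,
                        core_of, numa_node_of=None):
    """Return the logical CPU chosen for a newly ready thread, or None."""
    idle_all = set(idle)
    candidates = idle_all & set(affinity)
    if not candidates:
        return None  # no idle processor; the preemption path is not modelled

    def narrow(current_set, subset):
        # Apply a reduction only if it leaves at least one candidate.
        subset = current_set & subset
        return subset if subset else current_set

    # NUMA: prefer idle CPUs in the node containing the ideal processor.
    if numa_node_of is not None:
        node = numa_node_of[ideal]
        candidates = narrow(
            candidates, {c for c in candidates if numa_node_of[c] == node})

    # Hyperthreading: prefer physical cores whose logical CPUs are all idle.
    fully_idle_cores = {
        core for core in set(core_of.values())
        if all(cpu in idle_all for cpu, c in core_of.items() if c == core)
    }
    candidates = narrow(
        candidates, {c for c in candidates if core_of[c] in fully_idle_cores})

    # The CPU running the scheduling code wins if it is still a candidate.
    if current in candidates:
        return current

    # Otherwise prefer a sibling of the ideal CPU, then of the previous CPU.
    near_ideal = {c for c in candidates if core_of[c] == core_of[ideal]}
    if near_ideal:
        candidates = near_ideal
    else:
        near_previous = {c for c in candidates
                         if core_of[c] == core_of[previous]}
        if near_previous:
            candidates = near_previous

    return min(candidates)  # lowest-numbered CPU in the remaining set


# Single core, two logical CPUs, CPU 0 busy: only CPU 1 can be chosen.
print(pick_idle_processor({1}, {0, 1}, 0, 0, 0, {0: 0, 1: 0}))
# Two cores, CPU 0 busy: the fully idle second core (CPUs 2 and 3) is preferred.
print(pick_idle_processor({1, 2, 3}, {0, 1, 2, 3}, 0, 0, 0,
                          {0: 0, 1: 0, 2: 1, 3: 1}))
```

In this model, the first call has exactly one candidate, so the hyperthreading reduction cannot change the outcome. The second call shows the reduction making a difference once there is more than one physical core.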

Scali said:
I don't think it has to state that explicitly.
Why does the idle thread exist in the first place? Because you can't stop execution of a CPU.
So if you want to 'remove a thread' from a logical core, the only way to do that is to replace it with the idle thread. The core's instruction pointer must point somewhere, and it will get updated every time it has executed an instruction. That's just how it works.

It really does need to state such things explicitly, especially when it is contrary to known practice in scheduling algorithms. An idle thread is only scheduled if there is no available thread in the ready state. If the OS has the option of running a real thread, it will. It can't remove a thread on the basis of "performance", whatever the OS's idea of that can possibly be; if there're fewer ready threads than logical processors, they all will run.

Mark Russinovich, David A. Solomon (2009). Windows® Internals: Including Windows Server 2008 and Windows Vista, Fifth Edition, p.418:

In reality, however, the idle threads don't have a priority level because they run only when there are no real threads to run--they are not scheduled and never part of any ready queues.

Scali · Jul 29, 2010

intangir said:
Then I don't consider you qualified to make educated statements about what OS schedulers can or can't do.

Should I care? I'm right anyway, and you're wrong. Tons of people can confirm that. Sadly none of them have been reading this thread.

intangir said:
which tell you that genuine performance measurements are not possible in real time (and so cannot be used by the OS scheduler).

I think I clearly stated 'heuristics', and unless you think that 'heuristics' are 'genuine performance measurements'...

intangir said:
which points out that what the authors ended up using for their performance data is an experimental indicator that relies on post-analysis of the workloads in a specific testbench. So the likelihood of this being implemented in the Windows scheduler: zilch.

intangir said:
It really does need to state such things explicitly, especially when it is contrary to known practice in scheduling algorithms. An idle thread is only scheduled if there is no available thread in the ready state. If the OS has the option of running a real thread, it will. It can't remove a thread on the basis of "performance", whatever the OS's idea of that can possibly be; if there're fewer ready threads than logical processors, they all will run.

We were discussing the statement "removing a thread". Removing is not the same as replacing. So if you REMOVE a thread, as in, no actual thread is going to be scheduled, you need to put the idle thread on there.
I also said it could pick ANOTHER thread OR use the idle thread.
And this was discussing the usenix paper, not the actual Windows scheduler. So we don't need to have any Mark Russinovich quotes.

Bottom line is still: Windows 2000 does it differently than Windows XP/Vista/7.
Because Windows 2000 is not HT-aware, it does not group the processors by their logical pairs. It considers each logical processor to be a regular physical processor, and that means that the idle set will look different.

Again, we can argue about this all day and all night, but as long as we cannot find the actual Microsoft info, the best way is to just run up a system and benchmark. I've already done this years ago, so I no longer need to see proof. For me it ends here. It's not worth discussing any further.
Perhaps you can mail Mark.

Edit: upon re-reading the linked Windows HyperThreading doc, it actually does discuss various technical things that XP does differently. Like for example that they have modified the idle thread to more aggressively HALT the logical processors. And how they have modified their spinlocks (thread synchronization/scheduling objects) to be more HT-friendly in terms of resource contention, by using YIELD instructions to temporarily pause the logical processors.
So there you have at least some technical info of why XP has less resource contention on a single-core HT processor than 2k.

Scali · Jul 29, 2010

I found some interesting data from IBM, regarding a single Xeon HT processor, tested with and without HT, with two different kernel versions:
http://www.ibm.com/developerworks/linux/library/l-htl/

You can clearly see that the 2.5.32 kernel has more gains from HT than the 2.4.19 one (even though 2.4 already has some HT-awareness and avoids contention issues in things like spinlocks).
So clearly single-core HT systems can benefit from HT-aware scheduling very well. It's not something that is only for multi-core/multi-CPU systems.

Conclusion

Intel Xeon Hyper-Threading is definitely having a positive impact on Linux kernel and multithreaded applications. The speed-up from Hyper-Threading could be as high as 30% in stock kernel 2.4.19, to 51% in kernel 2.5.32 due to drastic changes in the scheduler run queue's support and Hyper-Threading awareness.

IntelUser2000 · Jul 29, 2010

If you have Windows 7, and a Hyperthreading enabled system, the physical threads ALWAYS engage before the logical threads.

The performance impact is now extremely minimal in most cases.

From the benchmark tool on the AnandTech site, I derived that the Core i7 Bloomfield had a ~35% advantage at the same clock (Turbo normalized too) over the AMD Phenom II Deneb. Now if you look carefully, Intel only has a ~15% advantage over AMD in single thread.

The rest? Multi-core optimizations and Hyperthreading. Mostly Hyperthreading.
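
As a rough back-of-the-envelope check on "the rest", here is a small sketch. It assumes the two advantages combine multiplicatively, which is a simplification and not something the benchmark data itself states; the function name is made up for illustration.

```python
def remaining_gain(total_advantage, single_thread_advantage):
    """Gain left over once the single-thread advantage is factored out,
    assuming the two effects multiply (an assumption, not a measurement)."""
    return (1 + total_advantage) / (1 + single_thread_advantage) - 1


# ~35% overall and ~15% single-thread, the figures from the post above.
print(f"{remaining_gain(0.35, 0.15):.1%}")
```

Under that assumption, roughly 17% would be left for multi-core optimizations and Hyperthreading combined.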

mv2devnull · Jul 29, 2010

IntelUser2000 said:
If you have Windows 7, and a Hyperthreading enabled system, the physical threads ALWAYS engage before the logical threads.

What are "physical threads" and "logical threads"?

jvroig · Jul 29, 2010

mv2devnull said:
What are "physical threads" and "logical threads"?

Since there are only logical cores, he simply means that when the number of threads is <= half the number of logical cores available, an HT-aware scheduler assigns each thread to a logical core that does not share resources with the others. Each thread then has full use of all the physical resources its logical core has access to, with no contention.

aigomorla · Jul 29, 2010

mv2devnull said:
What are "physical threads" and "logical threads"?

Physical, i.e. real, i.e. something you can touch or see.

Logical - virtual - simulated... not real, but it is what the physical core pretends to do.

Usually logical cores are split from physical cores.

And physical cores > logical cores.

Scali · Jul 29, 2010

As soon as you enable HT, you no longer have physical cores (at least, not to the OS/software), they're all logical. That was his point.

jvroig · Jul 29, 2010

And physical cores > logical cores.

It really depends, doesn't it? You have 8 logical cores in an i7 920 for example. If you run something that needs 3 threads, you will end up using 3 logical cores. It will still run better than on a Deneb, even though that Deneb would use 3 physical cores. And if you run the same thing on an i7 920 with HT off, it will also use 3 physical cores, but it will perform just the same as the i7 920 with HT on that used 3 logical cores. With today's modern schedulers and HT implementation, it's a very different ballgame than when HT debuted in the Pentium series.

Usually logical cores are split from physical cores.

No, all the time you only have either of the two, physical or logical. With a Deneb, you only have physical cores. With an i7 920 with HT off, you only have physical cores. With an i7 920 with HT on, you only have logical cores, eight of them.

Scali · Jul 29, 2010

Yes, and to add to jvroig's post...
The 'trick' that modern schedulers perform is this:
They know which sets of logical cores belong to the same physical core.
So they start by using one logical core per physical core.

So say your logical cores are numbered 0 through 7, where 0-1 are the first physical core, 2-3 are the second physical core etc...
If you then run something that needs 3 threads, the OS will first pick the first logical core of each physical core, so you get the threads on core 0, 2 and 4. Which would be equivalent to running them on three physical cores, in the case of no HT.
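
That placement order can be sketched like this. It is a toy model assuming the numbering from the example above (logical cores 0-1 on the first physical core, 2-3 on the second, and so on); the function name is hypothetical and real schedulers weigh many more factors.

```python
def place_threads(n_threads, n_cores, siblings_per_core=2):
    """Return the logical core numbers used for n_threads threads,
    filling one logical core per physical core before any second sibling."""
    order = [core * siblings_per_core + sibling
             for sibling in range(siblings_per_core)
             for core in range(n_cores)]
    return order[:n_threads]


print(place_threads(3, 4))  # three threads on a 4-core / 8-logical-core CPU
print(place_threads(5, 4))  # the fifth thread has to share a physical core
```

With 3 threads on 4 physical cores this gives logical cores 0, 2 and 4, as in the example. Only the fifth thread lands on a second sibling.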

So although technically the threads aren't running on 'physical cores', it seems that most people understand it as such. I've given up trying to explain it long ago (it came up at least 3 times in this thread, and I ignored it)... but here we are again

aigomorla · Jul 29, 2010

Isn't what you guys are talking about more related to virtualization, where a physical core is split to act like a logical core?

mv2devnull · Jul 29, 2010

On the Linux side one enumerates (physical) cores and within each core there can be one or more "siblings". Threads are executed in the siblings.

Calling "logical core" a "sibling" avoids the need of the "physical" attribute on "core", and the confusing "logical core" term.
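
On a Linux box the sibling groups can be read from sysfs. The sketch below assumes the usual `thread_siblings_list` files under `/sys/devices/system/cpu/`, which hold CPU lists such as `0,4` or `0-1`; the helper names are made up for illustration, and the second function simply returns an empty list where those files do not exist.

```python
import glob


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '2-3,6' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(pattern="/sys/devices/system/cpu/cpu[0-9]*/topology/"
                           "thread_siblings_list"):
    """Return one tuple of sibling CPU numbers per physical core."""
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as handle:
            groups.add(tuple(parse_cpu_list(handle.read())))
    return sorted(groups)


print(parse_cpu_list("0,4"))
print(sibling_groups())
```

Each tuple lists the siblings that share one core, so a tuple with a single entry means that core has no second hardware thread enabled.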

Naive virtualization is merely one process/thread on the host, scheduled like the rest. From the guest side it naturally looks like a computer. And it is recommended to use a different scheduler in the guest. The hardware-based virtualization obviously provides some shortcuts, which are not available to regular John Doe threads.

aigomorla · Jul 29, 2010

mv2devnull said:
On the Linux side one enumerates (physical) cores and within each core there can be one or more "siblings". Threads are executed in the siblings.

Calling "logical core" a "sibling" avoids the need of the "physical" attribute on "core", and the confusing "logical core" term.

mmm i think this sounds the most accurate to me.

narsnail · Jul 30, 2010

Turned off HT last night, load temps dropped approx. 10°C, no noticeable performance gains or losses, on Vista 64-bit by the way.
